Smartproxy

March 04, 2022
10 minutes read

Python Tutorial: How To Scrape Images From Websites

So, you’ve found yourself in need of some images, but looking for them individually doesn’t seem all that exciting? Especially if you are doing it for a machine learning project. Fret not; web scraping comes in to save the day as it allows you to collect massive amounts of data in a fraction of the time it would take you to do it manually. 

There are quite a few tutorials out there, but in this one, we’ll show you how to get the images you need from a static website in a simple way. We’ll use Python, some additional Py libraries, and proxies – so stay tuned.

Know your websites

First things first – it’s very important to know what kind of website you want to scrape images from. And by what kind, we mean dynamic or static. As it’s quite an extensive topic, we’ll only go over the basics in this tutorial. But if you’re genuinely interested in learning more about it, we highly recommend checking out our other tutorial on scraping dynamic content.

Dynamic website 

A dynamic website has elements that change each time a different user (or sometimes, even the same user) visits a website. It stores certain information (if it’s provided to the website) about you, like your age, gender, location, payment information, etc. Sometimes, even the weather and season in your location. 

It may sound a little unnerving at first, but all of this is done to ensure that users have the best-tailored experience. The more you visit the website, the more personalized and convenient your experience will be.

Understandably, building a dynamic website includes advanced programming and databases. Those sites don't have HTML files for each page; their servers create them "on-the-fly." In response to a user request, the server gathers data from one or more databases and creates a unique HTML file for the customer. The HTML file is sent back to the user's browser when the page is ready.

Static website

As the name suggests, these websites are static – meaning they don’t change, unlike dynamic websites. These types of websites are kind of “take it or leave it.” The displayed content isn’t affected by the viewer in any way whatsoever. So unless the content is changed manually, everyone will see the exact same thing. A static website is usually written entirely in HTML. 

Web scraping: dynamic website vs. static website

You’re probably wondering what this means for scraping images. Well, as fun as dynamic websites are, web scraping them is no easy feat. Since the content changes to suit each user according to their preferences and the other criteria discussed above, you can imagine how difficult it can be to scrape all data (or images) from such websites.

The process is rather tedious and requires not just knowledge of web scraping but experience as well. It also calls for more Py libraries and additional tools to tackle this quest. This is precisely why, for this tutorial, we opted to web scrape images from a static website.

Determining whether a website is static or dynamic

If a website greets you with something like “Hey there, so-and-so, it’s been a while. Remember that item you viewed before? It’s on sale now!”, it’s very enthusiastically calling itself a dynamic website.

But all jokes aside, there are several ways to tell whether you’re facing a static or a dynamic website:

  1. Check the web server software. Response headers sometimes reveal the server or framework behind a site, but treat this as a weak hint only: Apache, Nginx, and IIS can all serve both static files and dynamically generated pages.
  2. Examine web content. Static websites are often packed with non-changing material such as text and photos. Dynamic websites may have a mix of static and dynamic material, such as submission forms, user logins for customized content, online surveys, and dynamic components that change based on search terms entered into a search box.
  3. Look at the web address. A static page usually keeps the same address, often ending in a file name such as .html, while a dynamic page’s address frequently carries query parameters that change with the content being requested.

Besides, remember that dynamic websites are the ones where information changes quite frequently, like weather or news sites and stock exchange pages. Such changing news can be loaded by an application using resources in a database, while the information on static websites has to be updated manually. 
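One more practical check, sketched below under a few assumptions: look at the raw HTML a plain request returns. If the img tags you see in your browser are already in that HTML, the simple approach in this tutorial will probably work. If the HTML is mostly script tags, the images are likely injected by JavaScript. The sketch uses Python's built-in html.parser instead of BS4 so it runs without extra installs, and the names ImgCounter and summarize_markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts img tags that carry a src attribute, plus script tags."""

    def __init__(self):
        super().__init__()
        self.with_src = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img" and dict(attrs).get("src"):
            self.with_src += 1
        elif tag == "script":
            self.scripts += 1


def summarize_markup(html):
    """Return rough counts that hint at how a page delivers its images."""
    counter = ImgCounter()
    counter.feed(html)
    return {"images_with_src": counter.with_src, "scripts": counter.scripts}
```

Feed it the text of a response (for example, response.text from Requests). Zero images with a src alongside many scripts is only a hint, not proof, that the page is dynamic.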

Getting started: what you’ll need

Just like in a recipe, it’s best to first look over what we’ll need before diving hands deep into work. Otherwise, it can get confusing later on if you have to figure out whether everything is in place or not. So, for this tutorial, you’ll need:

Python – we used version 3.8.9. In case you don’t have it yet, though, here’s the link: https://www.python.org/downloads/.

BeautifulSoup 4 – BS4 is a Py package that parses HTML and XML documents. In this case, BS4 will turn the website’s HTML into an object we can search and then pull out all of the ‘img’ elements within it.

Requests – this Py library is needed to send requests to a website and store the reply in a response object.

Proxies – whether it’s your first or zillionth time attempting to scrape the web, proxies are an important part of it. Proxies help shield you in the eyes of the internet and make it far less likely that an IP ban, block, or CAPTCHA interrupts your work.

Let’s get those images – scraping tutorial 

Now that we’ve covered all the basics, let’s get this show on the road. Compared to other tutorials on the subject, this one is simpler, but it still requires coding. No worries, we’re going to explain each code line step by step to ensure nothing slips through the cracks.

Step 1 – Setting up proxies

We suggest using our residential proxies. Paired with Python and BeautifulSoup 4, they’re more than enough to handle this task. Your starting point:

  • Head over to https://dashboard.smartproxy.com/ 
  • Register and confirm your registration. 
  • Navigate to the left side of the screen and click on the “Residential” tab and click on “Pricing” to subscribe to the plan best suiting your needs.
  • Create a user and choose an authentication method – whitelisting your IP or user:pass. Press the “Authentication method” in the “Residential” section to do this.

Here’s how you can set up proxies if you picked the user:pass authentication option:

import requests

url = 'https://ip.smartproxy.com'
username = 'username'
password = 'password'

Now that you’ve set up your proxies, you can choose whichever endpoint you want from more than 195 countries, including any city, thanks to our latest updated backconnect node.

proxy = f'http://{username}:{password}@gate.smartproxy.com:7000'
response = requests.get(url, proxies={'http': proxy, 'https': proxy})
print(response.text)
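If your password contains characters such as "@" or ":", pasting it straight into the proxy URL can break the address. Here's a minimal sketch of one way around that: percent-encode the credentials before building the URL. The helper name build_proxies is hypothetical, and the host and port defaults are simply the ones used above.

```python
from urllib.parse import quote


def build_proxies(username, password, host="gate.smartproxy.com", port=7000):
    """Build a proxies dict for Requests with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including @ and :
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy = f"http://{user}:{pwd}@{host}:{port}"
    return {"http": proxy, "https": proxy}


# Placeholder credentials: swap in your own
proxies = build_proxies("username", "password")
```

You could then pass the result straight to Requests, as in requests.get(url, proxies=proxies).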

Oh, and if you run into any hiccups, check our documentation, or if you’d prefer some human connection, hit up our customer support – they’re around 24/7. 

Step 2 – Adding our libraries

Before we jump into the code, we should add BS4 and requests.

from bs4 import BeautifulSoup
import requests

Step 3 – Selecting our target website

Let’s go ahead and select a target website to scrape images from. For the purposes of this tutorial, we’re gonna use our help docs page: https://help.smartproxy.com/docs/how-do-i-use-proxies.

A friendly reminder: always check the terms of service of any website you’d like to scrape. Just because a website can be accessed freely doesn’t mean that the information provided there – in this case, images – can be taken just like that as well.

Now that we’ve got that out of the way, let’s add the following code line, which stores our target’s URL in a variable.

html_page = "https://help.smartproxy.com/docs/how-do-i-use-proxies"

Step 4 – Sending a request

Now let’s add another code line that will request the page from the website with a GET request.

response = requests.get(html_page, proxies={'http': proxy, 'https': proxy})

Step 5 – Scraping images

With this next code line, we’ll turn response.text into a BeautifulSoup object by using BS4.

soup = BeautifulSoup(response.text, 'html.parser')

It’s time to identify all img objects within the HTML by using the for loop. 

for img in soup.find_all('img'):

Moving forward, let’s identify whether or not an image has an src in the img object. Src simply means source of the image.

if img.get('src') is not None:

Now, this code line will be used to get the image links in our response after running the code. 

print(img.get('src'))

At the end of this step, your code should look like this: 

from bs4 import BeautifulSoup
import requests

html_page = "https://help.smartproxy.com/docs/how-do-i-use-proxies"
# Replace username and password with your own proxy credentials
proxy = 'http://username:password@gate.smartproxy.com:7000'
response = requests.get(html_page, proxies={'http': proxy, 'https': proxy})
soup = BeautifulSoup(response.text, 'html.parser')

# Print the src of every img tag that has one
for img in soup.find_all('img'):
    if img.get('src') is not None:
        print(img.get('src'))

Step 6 – Getting URLs of the images

If you need to scrape only the images’ URLs, all that’s left to do is hit ‘Enter’ on your keyboard and get those sweet results. The response should look something like this:

https://files.readme.io/c78c9d4-small-smartproxy-residential-rotating-proxies.png
https://files.readme.io/c78c9d4-small-smartproxy-residential-rotating-proxies.png
https://files.readme.io/d5fb07a-2ndshot.jpg
https://files.readme.io/d5fb07a-2ndshot.jpg
https://files.readme.io/119ef97-3rdstep.jpg
https://files.readme.io/119ef97-3rdstep.jpg
https://files.readme.io/4021757-4thstep.png
https://files.readme.io/4021757-4thstep.png
https://files.readme.io/6018b04-smartproxy-nl-rotating-residential-proxy-example.png
https://files.readme.io/6018b04-smartproxy-nl-rotating-residential-proxy-example.png
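Notice that every URL shows up twice in the output above. Also, on other websites a src may be a relative path rather than a full URL. Here's a small sketch, using only the standard library, that resolves each src against the page address and drops repeats while keeping the original order. The function name unique_absolute_urls and the "/img/logo.png" path are made up for illustration.

```python
from urllib.parse import urljoin


def unique_absolute_urls(srcs, base_url):
    """Resolve src values against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for src in srcs:
        if not src:
            # Skip img tags without a usable src
            continue
        absolute = urljoin(base_url, src)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


page = "https://help.smartproxy.com/docs/how-do-i-use-proxies"
srcs = [
    "https://files.readme.io/d5fb07a-2ndshot.jpg",
    "https://files.readme.io/d5fb07a-2ndshot.jpg",
    "/img/logo.png",  # hypothetical relative path
    None,
]
urls = unique_absolute_urls(srcs, page)
```

In the tutorial's code, the list of src values would come from the img tags BS4 finds on the page.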

But if you came here to gather the actual images, there are a few more steps to follow.

Step 7 – Downloading scraped images

First, save the received URLs to a new variable.

img_url = img.get('src')

Then get the image’s name. It’ll be the text after the last slash in the URL (in this case “c78c9d4-small-smartproxy-residential-rotating-proxies.png”, if we’re talking about the first one).

name = img_url.split('/')[-1]
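Splitting on the last slash works nicely for the URLs in this tutorial. On other sites, though, an image URL may end with a query string (something like "?width=200"), which would end up in the file name. A hedged alternative is to parse the URL first and take the name from its path only. The helper name filename_from_url and the fallback name "image" are assumptions for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(img_url, default="image"):
    """Take the file name from the URL path, ignoring any query string."""
    path = urlparse(img_url).path
    # URL paths always use forward slashes, so posixpath is the right tool
    name = posixpath.basename(unquote(path))
    return name or default


name = filename_from_url("https://files.readme.io/d5fb07a-2ndshot.jpg?width=200")
```

Here name ends up as the plain file name, without the query string.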

Now form a new request for getting an image. We’ll do this for each image URL we got from the initial request.

img_response = requests.get(img_url) 

Next, open a file and label it with the name variable we used before. Yup, that “c78c9d4-small-smartproxy-residential-rotating-proxies.png”

file = open(name, "wb")

And write the image response content to the file. 

file.write(img_response.content) 

Finally, let’s close the file. The code will now move on to the next image URL and stop once all image URLs have been processed.

file.close()

Hooray! You’re done! The final code should look like this:

from bs4 import BeautifulSoup
import requests

html_page = "https://help.smartproxy.com/docs/how-do-i-use-proxies"
# Replace username and password with your own proxy credentials
proxy = 'http://username:password@gate.smartproxy.com:7000'
response = requests.get(html_page, proxies={'http': proxy, 'https': proxy})
soup = BeautifulSoup(response.text, 'html.parser')

for img in soup.find_all('img'):
    if img.get('src') is not None:
        print(img.get('src'))
        img_url = img.get('src')
        # The file name is the text after the last slash in the URL
        name = img_url.split('/')[-1]
        # Download the image itself
        img_response = requests.get(img_url)
        # Write the image bytes to a file in the current directory
        file = open(name, "wb")
        file.write(img_response.content)
        file.close()

After downloading, the images will be stored in the directory you ran the script from.
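If you'd rather keep the downloads away from your code, one option is to write them into a separate folder. Below is a minimal sketch of that idea. The helper name save_image and the default folder name "images" are assumptions, not part of the tutorial's code. It also uses a with block, so the file is closed automatically even if writing fails.

```python
from pathlib import Path


def save_image(content, name, directory="images"):
    """Write image bytes to directory/name and return the resulting path."""
    folder = Path(directory)
    # Create the folder (and any missing parents) if it doesn't exist yet
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    with open(target, "wb") as file:
        file.write(content)
    return target
```

Inside the loop, save_image(img_response.content, name) could then stand in for the three open, write, and close lines.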

On a final note

Web scraping is a process that you can use to optimize your work and improve your overall performance. Besides, it’s not just something used in the tech world – more and more people are using web scraping to achieve their goals (such as doing market or even academic research, job and apartment hunting, or SEO). 

However, let’s not forget that not everything can be scraped, including images. Each website has its own terms of service as well as conditions. Some photos may have strict copyright rules we must adhere to. But if we respect one another online and throw in some fancy netiquette in the mix, we’ll all enjoy a smoother and more fruitful experience on the world wide web.

Ella Moore

Ella’s here to help you untangle the anonymous world of residential proxies to make your virtual life make sense. She believes there’s nothing better than taking some time to share knowledge in this crazy fast-paced world.

Frequently asked questions

Are there any no-code solutions to web scraping images?

You bet – our very own No-Code Scraper. It's a fantastic no-code tool that lets you scrape content and images and download files in JSON or CSV formats with just a couple of clicks. You can scrape any website, including Google. Dope, innit?

With No-Code Scraper, you can pick from pre-made scraping templates, choose a favorable data delivery option, and schedule the recurring data gathering process. By the way, you can grab a free version of this tool – No-Code Scraper extension – on the Chrome store. Just keep in mind that it doesn’t support task scheduling, scraped data storage, or pre-made scraping templates.

Is image scraping legal?

Generally speaking, scraping is legal; however, some websites have clear-cut rules that don’t allow scraping their content. In that case, you must hold back. Luckily, you can check which parts of a site its owners ask crawlers to stay away from by adding "/robots.txt" to the end of the target website’s root URL.
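You don't have to read robots.txt by eye, either. Python's standard library ships a parser for it. Here's a small sketch, assuming you've already fetched the robots.txt text. The helper name is_allowed, the sample rules, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the rules in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, made up for this example
rules = """User-agent: *
Disallow: /private/
"""
allowed = is_allowed(rules, "https://example.com/docs/picture.png")
blocked = is_allowed(rules, "https://example.com/private/picture.png")
```

Keep in mind that robots.txt only tells you what the site asks of crawlers. It doesn't replace reading the terms of service.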

Just remember that if you’re using scraped images, your actions cannot infringe copyright law!

What are the challenges of web scraping?

CAPTCHAs are some of the most frequent challenges web scrapers face. Use residential proxies to have a smoother scraping experience and lower the odds of being flagged as a bot. These proxies come from a residential network or, in other words, are real device IPs. In turn, any traffic coming from residential proxies to a website looks like a request from an ordinary person.