Scraping the Web with Selenium and Python: A Step-By-Step Tutorial
Since the late 2000s, web scraping has become essential for extracting public data, giving a competitive edge to those who use it. A common challenge is scraping pages with delayed data loading due to dynamic content, which traditional tools often struggle with. Fortunately, Selenium Python web scraping can effectively handle this issue. In this blog post, you'll learn how to scrape dynamic web data with delayed JavaScript rendering using Python and the Selenium library, with a complete code example and a video tutorial available at the end.
Preparing Selenium Python
First things first, let’s prepare our Selenium Python web scraping environment by using a virtual environment. Python 3.3 and above ships with the built-in venv module, or you can install the third-party virtualenv package with pip install virtualenv.
- Download the full project from our GitHub.
- Open the Terminal or command-line interface based on your operating system.
- Navigate to the directory where you downloaded the project to create the virtual environment. You can use the command cd path/to/directory to get there quickly.
- Create the virtual environment with python -m venv myenv (or virtualenv myenv), then activate it. On macOS and Linux: source myenv/bin/activate. On Windows (in Command Prompt or PowerShell): .\myenv\Scripts\activate. See the terminal sketch after this list.
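As a quick reference, here’s what those steps look like in the terminal on macOS or Linux, assuming the environment is named myenv (the name is arbitrary):
cd path/to/directory
python -m venv myenv        # or: virtualenv myenv
source myenv/bin/activate   # Windows: .\myenv\Scripts\activate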
Now, you’ll be working within the virtual environment, and any Python packages you install will be local to that environment. So, let’s talk about the packages we’ll need for this Selenium Python web scraping method:
- Webdriver-manager is a utility tool that streamlines the process of setting up and managing different web drivers for browser automation.
- Selenium is a powerful tool for controlling a web browser through code, facilitating automated testing and web scraping.
- Bs4, also known as BeautifulSoup, is a parsing library that makes it easy to parse the scraped information from web pages, allowing for efficient HTML and XML data extraction.
You can download the packages using these commands via your terminal:
pip install webdriver-manager
pip install selenium
pip install beautifulsoup4
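If you prefer, all three packages can be installed in a single command:
pip install webdriver-manager selenium beautifulsoup4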
Once we have the packages installed, the first thing to do is to import everything into the script file:
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from extension import proxies
from bs4 import BeautifulSoup
import json
Setting up residential proxies
The next step involves integrating proxies into our code. Without proxies, target websites might detect your Selenium Python web scraping efforts and halt any program attempting data collection. To efficiently gather public data, your scraping project must seem like a regular internet user.
A residential proxy is an intermediary that provides the user with an IP address allocated by an Internet Service Provider (ISP). Maintaining a low profile when web scraping is essential, so residential proxies are the perfect choice. They provide a high level of anonymity and are unlikely to be blocked by websites.
We at Smartproxy offer industry-leading residential proxies with a vast 55M+ IP pool across 195+ locations, the fastest response time in the market (<0.5s), a 99.68% success rate, and an excellent entry-point via the Pay As You Go payment option.
Once you get yourself a proxy plan and set up your user, insert the proxy credentials into the code:
username = 'your_username'
password = 'your_password'
endpoint = 'proxy_endpoint'
port = 'proxy_port'
Replace your_username, your_password, proxy_endpoint, and proxy_port with your actual proxy username, password, endpoint, and port, respectively.
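If you’d rather not hardcode credentials in the script, one option (not part of the original project) is to read them from environment variables instead. A minimal sketch, assuming variable names of your choosing:
import os

# Read proxy credentials from environment variables (the names below are arbitrary examples)
username = os.environ['PROXY_USERNAME']
password = os.environ['PROXY_PASSWORD']
endpoint = os.environ['PROXY_ENDPOINT']
port = os.environ['PROXY_PORT']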
WebDriver page properties
Time to truly unleash the power of Selenium Python web scraping. The first line creates a ChromeOptions object that defines how the browser should behave. The first thing we add to the options is the proxy: we build an extension (using the proxies function from the extension.py file) with our credentials and attach it, ensuring your scraping activity remains anonymous and uninterrupted. Note that you don’t have to enter your proxy information here; it’s already been defined in the previous step.
Then, we add one more Chrome option to run the browser in headless mode instead of opening a visible window. The last line spawns the Chrome web driver, installing the matching driver binary via webdriver-manager and applying the options we’ve just configured, including the proxy extension.
chrome_options = webdriver.ChromeOptions()
proxies_extension = proxies(username, password, endpoint, port)
chrome_options.add_extension(proxies_extension)
chrome_options.add_argument("--headless=new")
chrome = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
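The extension module itself ships with the GitHub project and isn’t reproduced in this post. If you’re curious how such a helper commonly works, here’s a hedged sketch of the typical pattern: it packs a minimal Chrome extension (a manifest plus a background script) into a .zip file that sets the proxy and answers the authentication prompt. The details below are illustrative, not the exact contents of the project’s extension.py:
import zipfile

def proxies(username, password, endpoint, port):
    # Illustrative sketch: build a Chrome extension zip that configures an
    # authenticated HTTP proxy. Not the exact code from the project's extension.py.
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Proxy Auth Extension",
        "permissions": ["proxy", "webRequest", "webRequestBlocking", "<all_urls>"],
        "background": {"scripts": ["background.js"]}
    }
    """
    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {scheme: "http", host: "%s", port: parseInt(%s)},
            bypassList: ["localhost"]
        }
    };
    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
    chrome.webRequest.onAuthRequired.addListener(
        function(details) {
            return {authCredentials: {username: "%s", password: "%s"}};
        },
        {urls: ["<all_urls>"]},
        ["blocking"]
    );
    """ % (endpoint, port, username, password)
    extension_file = "proxies_extension.zip"
    with zipfile.ZipFile(extension_file, "w") as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)
    return extension_file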
Targeting and delaying
When performing Selenium Python web scraping, precision is of the utmost importance. A good rule of thumb is to define what you’re targeting on the page that you intend to scrape. Dynamic content sometimes means that you’ll have to adapt creatively.
In our example, we’ve selected the URL of a website dedicated to showcasing quotes from famous people. It’s a purposefully slow-loading page, so the script would fail if we didn’t give the web driver enough time before scraping. Therefore, we set an explicit wait of up to 30 seconds and target only the quote elements by class name.
url = "https://quotes.toscrape.com/js-delayed/"chrome.get(url)wait = WebDriverWait(chrome, 30)quote_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
Extracting the HTML element
Once we get all the elements from the page, we can create a simple loop where we iterate through all the elements and extract the necessary data from the HTML code. For this, we’re using the BeautifulSoup library.
We extract the quote by targeting the span element with the class text and reading its text. We do the same for the author and tags elements. Then, we assemble the values into a dictionary formatted to our preference.
quote_data = []
for quote_element in quote_elements:
    soup = BeautifulSoup(quote_element.get_attribute("outerHTML"), 'html.parser')
    quote_text = soup.find('span', class_='text').text
    author = soup.find('small', class_='author').text
    tags = [tag.text for tag in soup.find_all('a', class_='tag')]
    quote_info = {
        "Quote": quote_text,
        "Author": author,
        "Tags": tags
    }
    quote_data.append(quote_info)

with open('quote_info.json', 'w') as json_file:
    json.dump(quote_data, json_file, indent=4)

chrome.quit()
Run the code by executing the following command in your terminal:
python quotes.py
The result will be saved in a quote_info.json file in your project directory. The benefit of storing the data as JSON is that it keeps everything well-organized and easy to interpret.
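The file will contain a list of objects shaped roughly like this (the values below are illustrative placeholders; the actual quotes, authors, and tags come from the scraped page):
[
    {
        "Quote": "An example quote goes here.",
        "Author": "Example Author",
        "Tags": ["tag-one", "tag-two"]
    }
]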
Prefer browser mode?
For those who like a visual representation of the Selenium Python web scraping process, you can switch the headless mode off. In that case, you’ll witness a Chrome instance being launched, offering a real-time view of the scraping. It’s a matter of personal preference, but it’s always good to have an option for checking if it works or at which point the errors strike.
If you go to the WebDriver page properties step, simply put a # symbol before the line that mentions headless to comment it out and make it inactive:
# chrome_options.add_argument("--headless=new")
The full Selenium Python web scraping code and video
Let's recap. The project is downloadable from our GitHub. And the code is as follows:
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from extension import proxies
from bs4 import BeautifulSoup
import json

# Credentials and Proxy Details
username = 'your_username'
password = 'your_password'
endpoint = 'proxy_endpoint'
port = 'proxy_port'

# Set up Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
proxies_extension = proxies(username, password, endpoint, port)
chrome_options.add_extension(proxies_extension)

# Comment the next line to disable headless mode
chrome_options.add_argument("--headless=new")

chrome = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

# Open the desired webpage
url = "https://quotes.toscrape.com/js-delayed/"
chrome.get(url)

# Wait for the "quotes" divs to load
wait = WebDriverWait(chrome, 30)
quote_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))

# Extract the HTML of all "quote" elements, parse them with BS4 and save to JSON
quote_data = []
for quote_element in quote_elements:
    soup = BeautifulSoup(quote_element.get_attribute("outerHTML"), 'html.parser')
    quote_text = soup.find('span', class_='text').text
    author = soup.find('small', class_='author').text
    tags = [tag.text for tag in soup.find_all('a', class_='tag')]
    quote_info = {
        "Quote": quote_text,
        "Author": author,
        "Tags": tags
    }
    quote_data.append(quote_info)

# Save data to JSON file
with open('quote_info.json', 'w') as json_file:
    json.dump(quote_data, json_file, indent=4)

# Close the WebDriver
chrome.quit()
To wrap up
We hope our walkthrough has taught you how to be mindful of your target and how to successfully extract the desired data from dynamically rendering pages. In our vast digital universe of data, the Selenium Python web scraping technique is like a Swiss army knife in the dense, unpredictable jungles of the Amazon.
Remember that our residential proxies’ added power will ensure a smooth, uninterrupted scraping journey. Whether you’re a beginner or someone with a bit more experience, combining these tools guarantees efficiency for any web scraping project.
About the author
Dominykas Niaura
Copywriter
As a fan of digital innovation and data intelligence, Dominykas delights in explaining our products’ benefits, demonstrating their use cases, and demystifying complex tech topics for everyday readers.