Back to blog

Take Your Web Scraping To The Next Level – Scraping Dynamic Content With Python

The internet has changed quite a bit, hasn't it? Today, almost every popular website you go to is tailored to your specific needs. The goal is to make the user experience as good as possible. It sounds amazing for the end-user, but for someone who’s trying to data scrape dynamic content, it can prove to be quite the challenge. That doesn’t mean it’s not doable! 


In this blog post, we’ll go through a step-by-step guide on how to web scrape dynamic content with Python and Selenium. While we’ll do our best to make this guide as clear as possible, this guide isn’t exactly a walk in the park. So if you’re new to web scraping, you might want to gain more scraping experience before tackling this quest.

James Keenan

Jan 17, 2022

12 min read

Hero presenting how to web scrape dynamic content.

What’s dynamic content?

More and more websites are investing time and money into researching their client base. This information is used to make a unique experience on a website for each user. It allows websites to make sure that their visitors get exactly what they need. 

That’s essentially what dynamic content is. The content changes according to several factors such as demographics, language settings, time of day or season, location, etc. By employing this method, online business owners have an opportunity to engage with their clients, build interest and, in the best-case scenario, convert visitors into customers. 

I mean, look at Google. By knowing the location, time, language, demographics, and even the behavior (debatable) of its users, Google provides the most relevant information. Granted, that can at first feel a little uneasy – we often don’t realize just how much information about us there is online. But in the end, we come to Google with a very specific need, and we want to get it ASAP.

Dynamic website vs. static website.

Dynamic content vs. static content 

There are two types of online content – dynamic and static. Let’s look over them and compare them briefly to get a clearer image.

  • Dynamic content is unique to every person who visits a website. Not only that, but the more they return to the same dynamic website, the more it will change. Dynamic websites store information about you and adapt their content. This is also why dynamic websites are faster than static websites – because of the stored information. 
  • Static content can be observed in websites where, no matter who is the visitor, the content stays the same for everyone. So let’s say you encounter two websites on the same topic, but one is dynamic and the other – static. A dynamic website will try to tailor its’ content to what best matches a certain user; on the other hand,  static content will remain the same for every user.

Benefits of dynamic websites

Improved professional image. Making sure that your website caters to each individual visitor helps boost trust and interest. 

Shorter buyer’s journey. More often than not potential buyers leave a website because of the check-out process. It can either be too long or laggy. But with a dynamic website, the process becomes much easier since your information is stored. 

Personalized user experience. Let’s be honest, it’s always nice to be appreciated and noticed. That’s exactly what you’ll get with a dynamic website.

Better website ranking. By researching your audience and catering to their needs you’ll definitely see a boost in your SEO metrics. 

Faster loading time. Since dynamic websites store information about your visits, the loading time becomes much shorter as there’s no need to retrieve the same information. 

Examples of dynamic content

Okay, so dynamic content is essentially displaying specific information to a user according to several factors we went over in the previous section. But where exactly can we find dynamic content? Let’s look at some examples.

  • Landing pages. 
  • Emails and newsletters.
  • Product pages.
  • Ads on websites.
  • Blog posts and articles. 
  • Website navigation. 
  • YouTube.

In all of the above-mentioned examples, the information that is displayed is unique to the user. Sure, the ultimate purpose of dynamic content is conversion, but it’s also important to generate interest and build a long-lasting relationship.

Challenges of web scraping dynamic content

As you’ve no doubt already gathered – otherwise, you wouldn’t have landed on this blog post – scraping dynamic websites can be vexing. However, with the right tech know-how, it can be done relatively easily. Here are some of the main challenges you might encounter:

  • Finding the elements you need for the data you want to scrape. Dynamic content isn’t stored in HTML but rather server-based. It changes with every new visitor and visit, so getting everything regardless of that can be difficult. 
  • Making your scraper “catch” selected elements on every page you’ve selected. Dynamic content is fickle, so you can imagine how troublesome it can be to have a scraper catch all of the needed elements.
  • Not very beginner-friendly. Web scraping dynamic websites requires both knowledge and experience. 
Hero getting ready to present the guide.

Scraping time – what you’ll need to get started

Python is one of the most popular programming languages out there, and for good reason. Its versatility makes it easy to work with, and the extensive library of additional tools is definitely another plus for Python. That’s why we’ll also use it to web scrape dynamic content. 

This guide was made using Windows, but the process is very much the same with Mac OS and Linux. Before we begin with this guide, let’s start with what we’ll need for this guide.

Smartproxy’s help documents

Our documentation is very detailed and almost literally covers it all. You’ll find loads of valuable information with great examples and explanations. Whether you find this guide helpful or not, having our help docs somewhere nearby will definitely not hurt. 

Python

You probably already have Python, but if you want to get down to specifics, we used version 3.9.7. If by any chance you don’t have it, simply download it from https://www.python.org/downloads/.

Smartproxy’s residential proxies

To start using our residential proxies, first sign up to our dashboard. Once you’ve done that and verified your account, simply select a residential subscription plan that suits your needs best. Then you’ll need to choose an authentication method

  • Username:Password
  • Whitelisted IP

Choose the second option and whitelist your IP. We’ll need to set up Selenium for headless scraping, which works only with IP whitelisting. In other cases, when you’re running a Selenium in headful mode, you can also choose the username:password authentication method (you can check how to do it on our GitHub.)

If there are any hiccups along the way, drop a line to our awesome round-the-clock customer support team. They’ll help you with any issues that may arise with your subscription. 

Selenium 

Selenium is a great automation tool that will help us in this guide. There are a few steps to follow in order to get it up and running, so here’s a link to our GitHub where everything is explained thoroughly and with screenshots included as well. 

Don’t forget to choose the web driver of the browser you’ll be using. It will be important in the upcoming steps. 

And it doesn’t stop there, we’ll also need other libraries – Pprint and By method. But I will show you how to add them later in the guide to make the process of scraping dynamic content as understandable as we can. 

Target website

It can literally be almost anything you want, but for this guide, we’ll be using http://quotes.toscrape.com/. From this dynamic website, we will scrape quotes, tags, and authors.

Scraping dynamic content with Python and Selenium: a step-by-step guide

Step 1: Viewing the page source

Let’s get this show on the road! Go to our target website and inspect the HTML.

Viewing page in HTML.

Step 2: Choosing elements

Now that you have the HTML opened up in front of you, you’ll need to find the correct elements. Since the website we chose contains quotes from famous authors, we are going to scrape the elements with the following classes:

  • quote’ (the whole quote entry)
  • Text’ 
  • author
  • tag

And then for each of the quotes, we’ll select these classes:

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="change, deep-thoughts, thinking, world" / >
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

Step 3: Adding the By method and pprint

We already have Selenium set up, but we also need to add the By method from the Selenium Python library to simplify selection. Basically, this method lets you know whether or not certain elements are present.

from selenium import webdriver
from selenium_python import get_driver_settings, smartproxy
# using By to simplify selection
from selenium.webdriver.common.by import By

We’re also gonna need pprint – it’ll format the results in a neat way. You can add it by typing in the code line as in the example down below

#pprint to make console output look nice
import pprint

Step 4: Choosing a browser and specifying the target 

You can totally leave the driver setup like it is in the example, just make sure to add choose the browser you’ll be scraping. As for the target (http://quotes.toscrape.com), it should be typed in exactly as in the example.

def webdriver_example():
driver get_driver_settings()
if driver['DRIVER'] == 'FIREFOX':
browser =
webdriver.Firefox(executable_path=r'{driver_path}'.format(driver_path=driver('DRIVER_PATH']), proxy=smartproxy())
elif driver ['DRIVER'] == 'CHROME':
browser =
webdriver.Chrome(executable_path=r'{driver_path}'.format(driver_path=driver('DRIVER_PATH']), desired_capabilities=smartproxy())
browser.get('http://quotes.toscrape.com/')

Now let’s also set up a variable of how many pages we will want to scrape. Let’s say we’ll want 5. The scraper output is going to be a list of JSON objects quotes_list containing information about the quotes we’ll scrape.

browser.get('http://quotes.toscrape.com/')
pages = 5
quotes_list = []

Step 5: Choosing a selector

You can choose from a variety of selectors, but for this guide we used Class. Let’s go back to the Selenium Python library for a bit. We’re going to locate our elements by using the Class selector in this case ( By.CLASS_NAME ). 

Now that we have our selector up and running, the first thing that we have to do is open the target website in Selenium. The command line will do just that. It’s time to get the information that we need.

browser.get('http://quotes.toscrape.com/')

Step 6: Scraping

It’s finally time to start doing some scraping. Don’t worry, yes, there will be quite a bit of code to tackle, but we’ll give examples and explain each step of the process. This way we’ll break it down into smaller, easier-to-understand bits and you’ll be cracking dynamic content in no time.

for i in range (pages) :
quotes = browser.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
tags = quote.find_elements(By.CLASS_NAME, 'tag'),
tag_list = []
for tag_in tags:
tag_list.append(tag.text)
quotes_list.append({
"author' : quote.find_element(By.CLASS_NAME, 'author').text,
'text' : quote.find_element(By.CLASS_NAME, 'text').text,
'tags' : tag_list,
})
next_page_btn = browser.find_element(By.PARTIAL_LINK_TEXT, 'Next').click()
for i in range(pages):

The loop is going to run as many times as we have selected in the pages variable. We went for 5 pages.

quotes = browser.find_elements(By.CLASS_NAME, ‘quote’)

With this line, we will get all of the ‘quote’ class elements from the page. As in the example below:

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="change, deep-thoughts, thinking, world" / >
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
for quote in quotes:

The above code line goes through all of the quote elements repeatedly.

tags = quote.find_elements(By.CLASS_NAME, 'tag')
tag_list = []
for tag in tags:
tag_list.append(tag.text)

By selecting the tag class we will get all of the tags for a given quote. Then we will iterate through them to put them in a list – tag_list.

quotes_list.append({

Adds the output to the quotes_list list that was mentioned earlier. The list will contain JSON elements for each quote.

'author' : quote.find_element(By.CLASS_NAME, 'author').text,
'text' : quote.find_element(By.CLASS_NAME, 'text').text,

Here we’re selecting the quote text itself as well as the author. There can only be one of each so there’s no need to iterate. The values are assigned to the ‘author’ and ‘text’ keys respectively.

'tags' : tag_list,

Assigning the tag_list to the ‘tags’ key.

next_page_btn = browser.find_element(By.PARTIAL_LINK_TEXT, 'Next ').click()

We’re finding the next page button (using the PARTIAL_LINK_TEXT selector) and clicking on it to go to the next page. A new loop iteration starts after that and the page is scraped for quotes again.

<nav>
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
</ul>
</nav>

Step 7: Formatting the results

We’re nearing the finish line. All that’s left now is to add the last line of code that will print out the results of your scraping request to the console window after the pages finish scraping.

pprint.pprint (quotes_list)

…and we’re done! After this, you should see something similar to the example below. Now, mind you, the results are quite a bit more extensive than shown here, but, well, you get the point.

[{'author': 'Albert Einstein',
'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
'text': '"The world as we have created it is a process of our thinking. It'
'cannot be changed without changing our thinking."'},
{'author': 'J.K. Rowling',
'tags': ['abilities', 'choices'],
'text': '"It is our choices, Harry, that show what we truly are, far more'
'than our abilities."'},
'{'author': 'Albert Einstein',
'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
'text': "There are only two ways to live your life. One is as though'
'nothing is a miracle. The other is as though everything is a'
'miracle."'},

Conclusion

Web scraping is becoming more and more popular these days. There’s just so much knowledge you can gain with this data collection method. Not to mention the numerous ways you can use this information to your advantage. We chose a rather simple website to scrape, so it can require more work to scrape a more sophisticated website. But at least now you have a good example of how to work with dynamic websites!

About the author

James Keenan

Senior content writer

The automation and anonymity evangelist at Smartproxy. He believes in data freedom and everyone’s right to become a self-starter. James is here to share knowledge and help you succeed with residential proxies.

All information on Smartproxy Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Smartproxy Blog or any third-party websites that may be linked therein.

Frequently asked questions

How’s a dynamic website different from dynamic content?

A dynamic website doesn’t guarantee that the content displayed on it will be dynamic. It has more to do with how the website is built regardless of its content. Dynamic content on the other hand will always be dynamic. 

Let’s take e-shops as an example – they’re dynamic websites. If you’re looking for the best sink, the website will make sure to show you the best picks based on general popularity and your filter criteria (if you’re looking for some specific sink). 

The difference between a dynamic website and dynamic content here is that all visitors will get the same results for the same filters. Dynamic content in this context would be some personalized deal or suggestion. 

Can I use a different element selector?

Is there a way to web scrape non-dynamic content without tech knowledge?

© 2018-2024 smartproxy.com, All Rights Reserved