Back to blog

Google Sheets Web Scraping: An Ultimate Guide for 2024

Google Sheets is a powerful tool that hosts various data management and analysis features. While it usually deals with information already gathered elsewhere, few know that Sheets have built-in functions that can also gather website data on their own! This article will explore the many benefits of using Google Sheets for web scraping and how to build a powerful in-house web scraping machine without ever leaving your browser window.

Zilvinas Tamulis

Jan 26, 2024

6 min read

Google Sheets Web Scraping An Ultimate Guide for 2024

Why use Google Sheets for web scraping?

Google Sheets is a non-standard tool for web scraping, and you may be wondering why you should even give it a shot. Let’s explore a few reasons that make it such a hidden gem.

No coding required

If you’ve just found out that Python isn’t just a name for a snake and want to scrape information from the web, you’ve probably run into the issue of having to learn a programming language to build scripts. It can be daunting to dive into a completely new subject or simply something you don’t have time for. We understand – in today’s fast-paced world, efficiency is key and Google Sheets offers just that.

To scrape data with Google Sheets, you don’t need any knowledge of snakes or why the Java island has a script named after it. Through simple functions such as IMPORTXML, you’ll only need to give it a few parameters to get the data you want from a website. You can do any further processing of that data as you would normally in a Google Sheet through functions or filters.

If you’re looking for something more streamlined, we urge you to check out Smartproxy's Web Scraping API. It’s a powerful scraping tool with a network of 65M+ proxies, task scheduling, and other advanced features to ensure your data-collecting activities remain efficient and anonymous.

Integration with Google services

Google offers an extensive workspace for your team that makes it easy to collaborate and work together all in one place. Services such as Gmail, Docs, Forms, Calendar, Meet, Drive, and many more can be accessed with one Google account. The best part? They all work together simultaneously and offer many cross-functional features, meaning that information and data in one service can easily be shared with another one. 

Google Sheets is a part of the gang that’s easy to work with. For example, you can scrape the web, process the information, and then save all of that information in your Google Drive for safekeeping. The possibilities are endless, and the things you can do are as sweet as caramel pastries.

User-friendly interface

Google Sheets has a user-friendly interface designed to be intuitive and accessible. It features a clean and simple layout with consistent icons, a toolbar at the top, and a spreadsheet grid at the bottom. This makes it easy to navigate and always find the necessary tools. Features such as conditional formatting and autocomplete are also integral in everyday use that simplify complex tasks and can save you many headaches regarding data management and web scraping.

Automation with Google Apps Script

Take your Google Sheets game to a whole new level by integrating it with Apps Script. This tool allows you to write your own JavaScript code that easily integrates with every other Google service. You can use it to automate tasks, import and export data, create custom functions, automatically create and update visual charts, get email notifications for changes in data, and so much more. If you ever felt Google Sheets were missing a feature or two that you can’t live without, grab your building hammer and start creating something new! 

Cloud-based storage

Did your colleague go on a holiday on another side of the world and forget to share the spreadsheet file with the rest of the team? You wouldn’t have to worry about this if you were using Google services. All files created with Google Sheets are automatically saved in your Google Drive, so they won’t go missing or accidentally get lost in case of hardware failure. Sharing these files is also a piece of cake, as it simply requires you to send someone a link, and they’ll have access to it. You can even customize what the person can do with it, whether you want them to be a viewer or a collaborator who can work on the Sheet with you.

How do you scrape data from a website into Google Sheets?

Using ImportXML in Google Sheets

If scraping data from the web with Google Sheets was magic, the IMPORTXML function would be the wand you couldn’t cast the spell without. It’s the key element when it comes to web scraping and one of the few methods that makes it possible to do so. Let’s explore what it is, how it can be used, and try to scrape web pages within Google Sheets.

What is IMPORTXML?

IMPORTXML is a function in Google Sheets that allows you to import structured data from XML, HTML, CSV, TSV, RSS documents, and other data types on the web. It retrieves information from a specified URL and XPath query and places it into a cell in the spreadsheet.

How does IMPORTXML work?

IMPORTXML works like most other Google Sheets functions – it has to be declared in a cell with up to 3 parameters. Here’s the syntax for writing an IMPORTXML function:

IMPORTXML(url, xpath_query, locale)

The parameters used are as follows:

  • url – the target website URL you want to scrape data from.
  • xpath_query – the XPath query that points to the data you want to extract. You’ll learn what this means in just a bit.
  • locale – the language and region to use when parsing data. This is useful when a website offers different content based on region.

Remember that each parameter must be written in quotation marks (“”) for it to work! Here’s an example:

IMPORTXML("https://example.com/", "/html/body/div[1]/p[1]/text()", "en-us")

What is XPath?

XPath stands for XML Path Language. It’s a core component in the IMPORTXML function that helps find exactly what you need by using a path-like syntax to find nodes (elements, attributes, text, and more) within an XML, HTML, or other supported document type.

To fully grasp how to write an XPath expression, you need to understand the different kinds of relationships nodes can have with each other. Here are the five main ones:

  • Children – all nodes that are below another one.
  • Parents – nodes that have one or several children under them.
  • Siblings – nodes that share the same parent node.
  • Descendants – all nodes that are below the current node, including children and everything under them.
  • Ancestors –  all nodes that are above the current node, including parents and everything above them.

The table below lists a few key syntax options when building an XPath:

Syntax

Description

Example

Explanation

/node



A double forward slash selects a descendant node under the parent node regardless of where they are



body//div



Selects a <div> element located anywhere under the <body> tag



//node



A double forward slash selects a descendant node under the parent node regardless of where they are



body//div



Selects a <div> element located anywhere under the <body> tag



[x]



The two square brackets are called predicates. They link to a specific node based on the parameters inside it



body/div[2]



Selects the 2nd <div> element directly under the <body> tag



@



The at sign selects attributes



//@class



Selects all nodes that have the class attribute



Take this HTML code as an example:

<!DOCTYPE html>
<html>
<head>
<title>Smartproxy</title>
</head>
<body>
<h2 class="heading">Types of proxies</h2>
<div>
<ul class="list">
<li class="list-item">Residential</li>
<li class="list-item">Datacenter</li>
<li class="list-item">Mobile</li>
<li class="list-item">ISP</li>
</ul>
<a href="https://help.smartproxy.com/docs/">Documentation</a>
<a href="https://smartproxy.com/blog">Blog</a>
</div>
</body>
</html>

If you want to write an XPath to the list item that contains the word Datacenter, you’ll have to write the following:

/html/body/div/ul/li[2]/text()

This XPath goes from the very top of the document, which is the <html> tag, and descends downwards. Since the <body> tag is the only child of it, simply use a single forward slash to refer to it. Same thing with <div> and <ul>; they’re both unique children of the previous node. For the <li> tag, add a predicate with an index of 2 to select the 2nd node since there’s more than one <li> under <ul>.

Alternatively, shorten it to //li[2]/text(). It will select any descendent node from the top <html>. As there are no other <li> items in the entire HTML document, it’s perfectly fine to skip the unnecessary information, skip the hassle, and make your XPath more readable.

Another example task is selecting all <li> nodes with the list-item attribute:

//li[@class='list-item']

Here, instead of picking just one <li> with an index, select all of the items. Inside the predicate, specify that it should select all nodes with a class attribute with the list-item value.

XPath is a very rich expression language featuring 200+ functions for all sorts of purposes ranging from simple to complex. For a more in-depth look into all the possible methods and ways XPath can be used, check out the W3Schools tutorial or refer to the devhints.io cheatsheet.

How to build a Google Sheets web scraper

Now that you’ve learned how XPath works, you should know how to use it with the IMPORTXML function. Let’s try to build our first Google Sheets web scraper.

Step 1: Start with a fresh Google Sheet

To get started, you’ll need a Google account to use Sheets and other services. Create an account or log in, then open a new Google Sheet.

Step 2: Find the content you need to scrape

For this example, we’ll use a sample website, quotes.toscrape.com, which is perfect for testing web scraping stuff.

Alternatively, you can use a real website of your choice, but remember that IMPORTXML counts as an automated request. This means that a few websites may block or limit your attempts to get website data, which will yield no results.

Step 3: Find The XPath

To find the XPath of the content you want to scrape, you’ll have to look at the HTML structure of the page through tools such as Google Chrome’s DevTools (right-click anywhere on a page and click Inspect). Find where your desired element is located in the HTML and use the knowledge from earlier to build the path to it. If you feel like cheating, simply right-click the content and then Copy → Copy full XPath. Here, we’ll simply take one of the quotes from the sample website.

web scraping google sheets
Step 4: Extract the data into Google Sheets

Return to your Google Sheets document. Select a cell you want to extract the data to and write the following function:

=IMPORTXML("https://quotes.toscrape.com/", "/html/body/div[1]/div[2]/div[1]/div[2]/span[1]/text()")

Here, you use only two parameters – the website URL and the XPath to the content you want to get. Click Enter, and voila, you’ve got the desired quote from the website!

web scraping google sheets
web scraping google sheets

Simple improvements

IMPORTXML works as any other spreadsheet function; therefore, if you want, you can improve the sheet for more efficient web scraping with a few simple tweaks. For example, you can write the website URL in a separate cell and then re-use it by simply linking to that cell in a function. Let’s do that with another simple task – import all quotes from the website’s front page and write the name of the author next to each of them.

Begin by writing the website URL in an empty cell (make it pretty if you feel fancy):

web scraping google sheets

Then, let’s find the XPath for all of the quotes. This time, you can’t just right-click and copy the path, as there are multiple quotes to target. Instead, write an XPath that gets all of the <span> tags on the page with a class name text:

//span[@class='text']

Similarly, you can see that the author name is located under <small> tags with the class name author:

//small[@class='author']

Put both of these XPaths in their own cells so they can be referenced in a function.

web scraping google sheets

Finally, write two IMPORTXML functions and point them to the cells that contain the URL and XPath values (in this case, cells C3, C4, and C5).

Import the quotes:
=IMPORTXML(C3,C4)
Import the authors:
=IMPORTXML(C3,C5)
web scraping google sheets

Each value will be vertically printed in its own cell. That’s all! Since the data is already in a Google Sheets document, you can easily analyze, filter, share, or perform any other actions.

Other useful functions

IMPORTXML is only one of a few import functions available on Google Sheets. A few more are very useful to know and might come in handy for specific tasks:

  • IMPORTHTMLIMPORTHTML – used to import data from tables or lists within an HTML document. Compared to IMPORTXML, it’s more limited in what it can scrape; however, it doesn’t require the user to write an XPath. You simply need to write the table number or list index.
Syntax:
=IMPORTHTML(url, query, index)
Example:
=IMPORTHTML("https://example.com", "table", 2)
  • IMPORTFEED – used to retrieve and parse data from RSS or Atom feeds, making it easy to incorporate dynamic content from feeds into your spreadsheet.
Syntax:
=IMPORTFEED(url, [query], [headers], [num_items])
Example:
=IMPORTFEED("https://example.com/feed", "title", TRUE, 5)
  • IMPORTDATA – used to import data from CSV files from a given URL.
Syntax:
=IMPORTDATA(url)
Example:
=IMPORTDATA("https://example.com/data.csv")

As you can see, each of the import functions serves a specific purpose based on what you need to get from the web. While most information is commonly found in HTML documents, the value of information found in RSS, Atom, and CSV feeds and file types shouldn’t be underestimated.

How do you import a table from a website into Google Sheets?

Tables are everyone’s favorite way of listing information on the internet. It’s clean, concise, and easy for the user to read. Unfortunately, it’s not as simple to scrape with functions such as IMPORTXML, as you’d have to write an XPath for every single element in that table. What if there was a better way?

IMPORTHTML is the best method for importing large lists or table data from a web page. Its simple syntax only requires three simple parameters – the website URL, the indication of whether you want to scrape a table or list, and the index number. 

Let’s say you want to import the information from a table from the following Wikipedia page. All you need to do is write this function in your Google Sheets spreadsheet:

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_cat_breeds", "table", 2)

Here, the parameters simply indicate that the function should go to the URL and find the table with an index of 2. How do you know the index number, though? Simply count the number of <table> elements from the top of the HTML document. The first one will have an index of 1, the second will be 2, and so on.

The result is a clean and formatted table with all the information from the web page. This is a much easier way to import a large set of data than using XPaths and can be done in just a matter of seconds.

web scraping google sheets

How do you import data from XML feeds into Google Sheets?

An XML (eXtensible Markup Language) feed, often called an RSS (Rich Site Summary) feed or Atom feed, is a structured data format used to publish frequently updated content such as news articles, blog posts, or other types of information. XML feeds are crucial in providing a standardized way for websites to share dynamic content.

Why not use Google Sheets to get the latest scoop right in your spreadsheet? IMPORTFEED is a function created exactly for this purpose, so let’s see how it can be used to get information from the New York Times website’s Technology section.

Let’s remember the structure:

=IMPORTFEED(url, [query], [headers], [num_items])

And set real values:

=IMPORTFEED("https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml", "items", FALSE, 5)

Here, we gave it a URL of the RSS feed and told it to retrieve all items (titles, descriptions, authors, dates, etc.). The next option for headers is set to FALSE, so it doesn’t return any column headers as an extra row in the result. Finally, retrieve only the 5 most recent items, but this can be changed to include more or less, depending on preference.

That’s it! Once again, you can draw a comparison between this function and IMPORTXML and see that even though IMPORTXML is the most well-rounded function that can fit every need, it still doesn’t beat IMPORTHTML or IMPORTFEED in terms of convenience when it comes to tables or XML feeds.

What are the pros and cons of using import functions?

Here are some of the most notable benefits of using import functions:

  • Ease of use – Google Sheets functions are relatively easy to understand and only require a few parameters. It’s a great way to get started with web scraping.
  • Spreadsheet flexibility – if you’ve worked with spreadsheets before, you’ll know exactly what you can and can’t do. Google Sheet formulas, user-friendly interface, and various features make it a breeze to work with data, whether sorting, filtering, or analysis.
  • No code required – many web scraping tools will require you to have coding knowledge to write your scripts. Google Sheets doesn’t need coding experience to create something powerful besides knowing how to use XPath. 
  • Integration with Google Services – since Google Sheets is part of a massive Google Service ecosystem, the results can be re-used across many applications such as Docs, Calendar, Forms, and others.
  • Integration with Google Apps Script – if you’re familiar with coding with JavaScript, you can expand any limitations of Google Sheets by writing your code. Apps Script can perform many automation tasks or help you build the necessary tools and features.
  • Limited to its environment – import functions can only be used within Google Sheets, and it might be difficult to expand them further. While there are many features available that can get the job done, they are complex and require a lot of extra effort that could be otherwise used to build your custom scripts.
  • No request customization – many web scraping tools offer features on how requests are made to a web server. These include a user agent, custom headers, cookies, location data, and other fingerprinting options, something that import requests can’t do. These are important when trying to circumvent any geo-restrictions or access content based on the type of platform, device, or cookies.
  • Can’t handle dynamic content – many modern web pages nowadays use JavaScript to render dynamic content, meaning that the first request might yield no results, as the content you’re looking for hasn’t been rendered yet. You’ll need to use headless browsers built with tools like Selenium to get what you want.
  • XPath usage – if this is your first time dealing with XPath, you might find it challenging to understand. While it’s a powerful and flexible expression language, it can be challenging to pick up and learn all the possible ways to write paths. Some longer paths might feel unintuitive or hard to read, even for an experienced user.
  • Can’t prevent rate-limiting or blocking – making too many automated requests will often lead to a rate-limitation or IP block as web pages employ policies that prevent excessive web scraping activities. A standard solution is to use proxy services; however, neither import functions nor Google Sheets can set up proxies for your requests. 

However, here are a few reasons why import functions might fall short in some areas:

In short, while Google Sheets offers many unique features that make your life easier, there are some limitations to it. If you’re looking for the best experience while gathering web data, we urge you to try the Smartproxy Web Scraping API. It’s a tool without limits, featuring an easy-to-use interface, handling dynamic content, and, most importantly, proxy integration to make sure your web scraping activities remain private and avoid the risk of IP bans or rate limitations.

Wrapping up

While Google Sheets isn’t the first choice for many when it comes to web scraping, it’s a tool that shouldn’t be underestimated. It’s an excellent solution for beginners, small-scale projects, or building a basic web scraper. The rich infrastructure of Google is the main benefit of using it, but it does fall short in customizability or freedom of what you can do with your web requests. All in all, it’s a hidden gem in the web scraping world that can definitely shine in its own unique way.

About the author

Zilvinas Tamulis

Technical Copywriter

Zilvinas is an experienced technical copywriter specializing in web development and network technologies. With extensive proxy and web scraping knowledge, he’s eager to share valuable insights and practical tips for confidently navigating the digital world.

LinkedIn

All information on Smartproxy Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Smartproxy Blog or any third-party websites that may be linked therein.

Gathering Amazon Data | Best Tools and Practices

Gathering Amazon Data | Best Tools and Practices

At Smartproxy, we’re always cookin’ up new ways to make scraping a breeze. Starting from eCommerce Scraping API to our most recent creation — Web scraping API. And don’t let anyone tell you that proxies and scraping are as complicated as rocket science. It could actually be a rather simple (and sometimes even fun) process!

But let’s be real, even the top guns extracting data from eCommerce giants like Amazon might get those pesky CAPTCHAs or worse — IP ban. Luckily, we’re offering a complete toolbox that will help you hop around the geo-restrictions, avoid errors & bans and get that sweet 100% success rate with ready-to-digest data.

Mariam Nakani

Feb 02, 2023

1 min read

Frequently asked questions

Can you web scrape with Google Sheets?

Yes, it’s possible to scrape the web with Google Sheets. It uses various import functions, such as IMPORTXML, IMPORTHTML, and IMPORTFEED, to get data from the internet, retrieve it, and present it in a spreadsheet, just like a basic web scraper tool.

Can I use a web query in Google Sheets?

How do I get data from Google Sheets to HTML?

Does Google Sheets have an API?

Is this Google Sheets scraping method scalable?

© 2018-2024 smartproxy.com, All Rights Reserved