Selenium Scraping with Node.js






This is an article about web scraping with Selenium and Node.js, aimed at readers who want to collect public data from high-value websites, whether for sales leads or pricing analysis.


Using Selenium for scraping



Selenium was originally developed as a driver to test web applications, but it has since become a great tool for getting data from websites. Because it automates a real browser, Selenium lets you avoid some of the honeypot traps that many scraping scripts run into on high-value websites.

A great example of why people use Selenium for scraping is its wait functionality, which is perfect for loading delayed data, especially when a site uses lazy loading, Ajax, or endless scrolling. After you access the data with Selenium, you will need something to parse it with. In this article we use Node.js, but there are many other HTML parsing tools out there you can use.




First things first: read the ToS


First things first – scraping a target site might be illegal. Even if you cannot access the data you want through an API and see web scraping as the only way to collect it, you still have to consider your target site. Many scrapers ignore the request limits in the target site's robots.txt file, but those limits are there for a reason. You can often crash a site by sending too many requests, especially if you use a scraping proxy network like ours, which allows you to send unlimited requests at the same time through unique IP addresses.

Crashing your data source is not only bad for your data collection, it can also get you into legal trouble: your scrape might be seen as a DDoS attack, especially if you run it through a datacenter proxy network. If this happens to a US company, you might face federal charges under the CFAA.




Setting up Selenium with Node.js for scraping


Since you are looking to web scrape, you probably don't need instructions on installing the Selenium WebDriver or the Node.js bindings for your device. To check whether you are ready to scrape after installing Selenium and Node.js, launch PowerShell, Terminal, or any other command-line prompt and run:


npm -v


Also, you will need to download a browser driver, such as ChromeDriver, for Selenium to use. Spreading a scrape across several browsers makes it less detectable. Also, consider keeping a large list of random User-Agent strings to keep the scrape under wraps, especially if you are ignoring my first tip to follow the target's ToS.
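One lightweight way to rotate User Agents, sketched below in plain Node.js, is to keep a pool of strings and pick one per session; the resulting flag can then be handed to Chrome through selenium-webdriver's chrome.Options().addArguments(). The UA strings here are only illustrative – in practice you would maintain a much larger, regularly refreshed list:

```javascript
// Illustrative User-Agent pool; keep a much larger list in practice.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
];

// Build the Chrome command-line argument that overrides the User-Agent.
function randomUserAgentArg() {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  return '--user-agent=' + ua;
}

// With selenium-webdriver, the argument would be applied like this:
//   const chrome = require('selenium-webdriver/chrome');
//   const options = new chrome.Options().addArguments(randomUserAgentArg());
//   new Builder().forBrowser('chrome').setChromeOptions(options).build();
```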

Since we will use Selenium to access a web page and Node.js to parse the HTML, we have to know what Selenium is capable of. It has many functions that let it navigate any website. For example, you can use


action.click()
action.doubleClick()
action.contextClick()


for different click events. Most of the time you will use only a few commands to navigate a page:


driver.get()
driver.navigate().back()
driver.navigate().forward()


Even though these examples are very simple and bare-bones, they will be enough for most scraping targets. To find out more about what the Selenium driver can do, read the Selenium documentation.

Now, to start Selenium and access a website, you can use code like this:


var webdriver = require('selenium-webdriver'),
  By = webdriver.By;
var driver = new webdriver.Builder()
  .forBrowser('chrome')
  .build();

driver.get("https://www.smartproxy.com/");


This will launch Selenium, which will use ChromeDriver to load the website you specify. You can also command Selenium to navigate the site, enter text into fields, click buttons, or perform any other actions needed to reach the page with your data. Once you can automatically navigate to the page you need, it's time to parse that data!



Parsing data with Node.js






Node.js is what lets you parse the HTML document and extract the data you need. As most scraped data is textual, scrapers use the getText() command to get the text of any element on the page. Below, we will show a couple of ways you could scrape a web page element, but you will need to combine these methods for your particular site, as each one is different and has its own structure.

To find elements in an HTML file you can use the findElement() or findElements() commands. You can locate an element or a set of elements by ID, class, name, tag name, or absolute/relative XPath with Node.js.



Using getText() with Node.js to scrape text data


Since you are looking to scrape a page, you must know how to check its structure. Use any browser's developer tools to inspect the element you want to scrape, then pick a method (XPath or another locator) to let Node.js access it and get the information you need. We'll wrap up this article with a couple of examples of how to scrape a simple web element with Node.js.



EXAMPLE 1 – scraping web page elements by their ID name.


This example works for scraping data from sites that use ID names, for example:


<h2 id="employee-name">Jane Doe</h2>

As this website uses the 'employee-name' ID for employee names, we can scrape it with the command:


driver.findElement(By.id('employee-name')).then(function(element){
    element.getText().then(function(text){
        console.log(text);
    });
});

This command will print the text of the first element with the ID 'employee-name' to the command prompt. Now, if there are multiple items with the same ID and you want to scrape them all, you'll need to use this command:


driver.findElements(By.id('employee-name')).then(function(elements){
  for (var i = 0; i < elements.length; i++){
    elements[i].getText().then(function(text){
      console.log(text);
    });
  }
});


EXAMPLE 2 – scraping web page elements by XPath.


This example works for scraping data from sites that mark data up with class names, for example:


<figure class="box">
    <div class="name">
        Jane Doe
    </div>
</figure>


To get the employee name with a relative XPath, we can scrape it with the command:


driver.findElement(By.xpath('//figure[@class="box"]/div[@class="name"]')).then(function(element){
    element.getText().then(function(text){
        console.log(text);
    });
});

This command will output the text of the first div with class 'name' inside a figure with class 'box'. If there are multiple elements matching this XPath and you want to scrape them all, you'll need to use this command:


driver.findElements(By.xpath('//figure[@class="box"]/div[@class="name"]')).then(function(elements){
  for (var i = 0; i < elements.length; i++){
    elements[i].getText().then(function(text){
      console.log(text);
    });
  }
});

Setting up a Selenium proxy for scraping

Selenium is very good for scraping because it can route its traffic through a proxy. You can set a proxy up for Selenium with our Selenium proxy middleware on GitHub.

There's so much more to learn about scraping than we can ever fit into a single article. To dig deeper into these topics, check out our other blog posts:

  • What is web scraping
  • Selenium proxies

  • Or contact us about using proxies for scraping.