smartproxy

Table of content

Dec 21, 2021
11 minutes read

How To Choose The Right Selector For Web Scraping: XPath vs CSS

If you're fresh-new to web scraping, you may not be familiar with selectors yet. Let us introduce ya – selectors are objects that find and return web items on a page. These pieces are an essential part of a scraper, as they affect your tests' outcome, efficiency, and speed.

Yep, understanding the idea of a selector isn't that complicated. Finding the right selector itself might be. To be honest, even the two languages that define them, XPath and CSS, have their own pros and cons. So it can quickly become a headache to choose one of them. But here's some good news – we're here to help! Let's explore it together.

Choosing between XPath and CSS
  • Smartproxy >
  • Blog >
  • Parsing >
  • How To Choose The Right Selector For Web Scraping: XPath vs CSS

What is an XPath selector?

XPath means XML Path. In essence, it’s a query language that uses non-XML syntax to make it easy for you to identify elements in XML documents.

You may have already tried to guess if there’s a meaning behind XML Path’s name. The answer is… yes! The selector uses expressions so that they can be traced from the start of the document to the intended element. BINGO – that seems like forming a path.

XPath might be your best buddy in plenty of contexts – it’s applicable to provide links to nodes, search repositories, and many other applications. There are two ways to create an XPath:

Absolute

An absolute XPath starts with the / symbol. It’s usually quite long and hard to maintain.

It has a complete path, beginning from the root to the element that you wanna identify. It’s worth mentioning, however, that if there’s any change in attributes starting from the root, the absolute XPath will become invalid. You need to watch out.

Relative

A relative XPath starts with the // symbols. It’s shorter than the absolute XPath.

It begins by referring to the element you wanna identify, not from the root node. If you remove or add the element, you won’t affect the relative XPath. That is why it works well with automation.

Let’s take a look at a basic format of XPath (this one is the relative one, tho!) and try to understand its syntax:

//tagname[@attribute=’value’]

This is what it means:

  1. //: the current node;
  2. tagname: the type of HTML element;
  3. @: attribute selector;
  4. attribute: the attribute of the node;
  5. value: the value of the attribute.

XPath: pros and cons

Pros

XPath has some dope features to offer. It allows navigating up the DOM when lookin’ for elements to scrape. If you don’t know the name of an element, you don’t have to sweat – you can use contains to search for possible matches. Oh, and by the way – XPath slaps even when scraping with old browsers, including ancient versions of Internet Explorer.

Cons

Nothing is perfect. XPath isn’t an exception. The biggest con of it is being easily-breakable. XPath is also slow, and its complexity can make it hard to read.

XPath selector vs CSS selector

What is a CSS selector?

In a nutshell, CSS stands for Cascading Style Sheets – a text-based language. It allows you to control various style elements, functionalities, and the formatting and display of HTML docs. So, thanks to CSS, you can choose the color or the font of the text, the spacing between paragraphs, the column sizes, and much more.

When identifying the various elements on the page based on styles, you select the class it falls into. CSS lets you pick out the type of content to test, edit, and copy. In other words, selectors help you target the HTML elements you wanna style.

CSS selectors come in several types and have actual names and categories:

Simple

These selectors pick up elements based on their class or ID. They include only one element in the structure.

Attribute

Attribute selectors search for elements based on the values assigned to them. These have symbols or spaces to separate elements.

Pseudo

You choose pseudo selectors when the states of elements are declared with CSS. They use keywords that you add to select a specific part or state of the element.

Let’s see a simple example of CSS:

tagname[attribute=value]

Here,

  1. tagname: the type of HTML element;
  2. attribute: the attribute of the node;
  3. value: the value of the attribute.

CSS: pros and cons

Pros

CSS may be your go-to selector for various reasons. First of all, you can use CSS on the development side. It’s also compatible with most browsers. Finally, with CSS, you have a high chance to find the elements you want. 

Cons

As there are so many levels of CSS, it can create confusion among web browser developers and beginners.

Nodes and relationships: XPath and CSS

Here’s the thing: you can really start web scraping without having a basic knowledge about it. So, fasten your seatbelt, and let’s learn the beginner terminology for both – XPath and CSS.

Let’s start with an example:

<html>
<head>
<title>
How To Choose The Right Selector For Web Scraping: XPath vs CSS
</title>
</head>
<body>
<h2 id="name">Smartproxy.com</h2>
<div id="navbar">
<a href="https://smartproxy.com/" id="smart" class="nav">
Visit Smartproxy
</a>
<a href="https://smartproxy.com/blog" class="nav">
Check out our Blog
</a>
<a href="https://smartproxy.com/web-scraping" class="nav">
Learn more about web scraping
</a>
</div>
</body>
</html>

Nodes

Nodes represent entities of a domain. Every XML or HTML has three types of nodes:

  • Element Node. They’re simply known as elements, or tags. From the code above:
	<title>
How To Choose The Right Selector For Web Scraping: XPath vs CSS
</title>
  • Attribute Node. They’re the attributes of an element node. For instance,
id="smart"
  • Atomic Value. They’re the final values. These are represented by a text or a value of an attribute. For example, it can be:

“Learn more about scraping”.

Relationships

There are more types of nodes, such as entity nodes, CDATA section nodes, document nodes, and others. All of these nodes can have different kinds of relationships with each other. Every relationship must have exactly one relationship type.

  • Parent. This is a root element that is one level up in the code. Each element has exactly one parent: the parent of <a> element is <div>.
  • Children. This is a sub-element that is one level down in the code: the elements <h2> and <div> are children of the <body>.
  • Siblings. These are the elements that share the same parent: <h2> and <div> are siblings, as they share the parent <body>
  • Descendants. These are all the elements that are any level down: the element <title> is the descendant of <head>.
  • Ancestor. This is the element that is any level up: ancestors of <a> are <div>, <body>, <html>.

Comparison: XPath vs CSS

Yeah, we got familiar with both of the selectors, but wait! Before jumping to conclusions on which one to choose, let’s compare them.

XPath vs CSS: a comparison table
  • Complexity. XPath tends to be more complex and harder to read than CSS. It makes CSS easier to learn and implement.
  • Performance and speed. XPath is much slower when we compare it to the CSS selector.
  • Consistency. XPath engines are different in every browser, which makes a selector inconsistent. On the other hand, a variety of sites have applied CSS consistently. 
  • Text recognition. XPath handles text recognition better than CSS.
  • Flow. With XPath, you can traverse both ways from parent to child and the other way around. CSS allows one-directional flow – from parent to child only.

Choose your character

It’s finally the time to choose your character! While web scraping, we’d suggest not focusing on one or a couple of benefits too much. Your choice should depend on your situation and factors, including finding the required feature-set or compatibility. So, try to see the complete picture and explore both options yourself. 

While using web scraping tools, such as Beautiful Soup, consider the find and find_all methods. They’re optimized for Beautiful Soup – it means there’s no need to worry about choosing between two selectors.

Conclusion

Choosing the right selector may be difficult but surely important when web scraping. For publicly accessible data gathering, you also need suitable tools for that. For Google’s SERP results, pick our full-stack SERP Scraping API. We also have eCommerce and Web Scraping APIs if your targets are not limited only by search engines. All our APIs deliver results at a 100% success rate and are armed by our elite proxies for your ultimate data gathering experience.

In case you already have your scraping proxies, scale your project by increasing speed and IP stability with datacenter proxies. And go for residential proxies for unblocking data in any location and scraping the websites that are extra sensitive to automated activity.

smartproxy

James Keenan

Senior content writer

The automation and anonymity evangelist at Smartproxy. He believes in data freedom and everyone’s right to become a self-starter. James is here to share knowledge and help you succeed with residential proxies.

Frequently asked questions

What is web scraping for?

Web scraping allows you to gather publicly accessible data.Read our blog post about some most common cases, or watch a YouTube video about web scraping projects.

What web scraping tool should I choose?

It depends on the nature of the target and how complex it is. The most common ways are to build a data scraping tool yourself, use pre-made web scraping tools like Scraping API, or try no-code solutions like the No-Code Scraper

Scraping APIs usually differ by their targets and unique technical aspects. If you're developing your eCommerce gig and want to be all eyes and ears on your competitors' behavior in real-time, your top choice could be eCommerce Scraping API. If you're doing large-scale research and rely on search engine data, the most logical way would be to go for SERP Scraping API. And if you're gathering data from websites having a solid anti-bot system but don’t have any desire to deal with CAPTCHAs, better rely on Web Scraping API. AND if you have no coding experience but still are eager to scrape that data, grab our No-Code Scraper.

Not sure which tool could be your one and only? Try one of these three days for free, test how it works yourself, and then decide if it's what you want.

Related Articles

unstructured data issues

Digesting parsing: what is parsing? [VIDEO]

Just like we translate natural languages to have efficient international communication, we need to perform similar processes with programming languages to ensure success in computer sciences. That’s why parsing has emerged. Data parsing is turning raw, unstructured data into well-structured and understandable information. So if you get data in raw HTML after scraping a page with proxies, you still must run that data through a parser. It will take the HTML and convert it into a readable format that you can easily understand.

smartproxy

James Keenan

Aug 25, 2021

12 min read

Hero presenting how to web scrape dynamic content.

Take Your Web Scraping To The Next Level – Scraping Dynamic Content With Python

The internet has changed quite a bit, hasn't it? Today, almost every popular website you go to is tailored to your specific needs. The goal is to make the user experience as good as possible. It sounds amazing for the end-user, but for someone who’s trying to web scrape dynamic content, it can prove to be quite the challenge. That doesn’t mean it’s not doable!  In this blog post, we’ll go through a step-by-step guide on how to web scrape dynamic content with Python and Selenium. While we’ll do our best to make this guide as clear as possible, this guide isn’t exactly a walk in the park. So if you’re new to web scraping, you might want to gain more scraping experience before tackling this quest.

smartproxy

James Keenan

Jan 17, 2022

12 min read

Get in touch

Follow us

Company

© 2018-2024 smartproxy.com, All Rights Reserved