
Playwright Web Scraping: A Practical Tutorial

Ever feel like extracting data from the web is like trying to direct a play without a script? Enter Playwright – your all-in-one stage manager for seamless web scraping. It handles the browser, the elements, and even the unpredictable plot twists of modern web pages. Follow this tutorial to learn how to use this powerful tool to extract data from any web page.

Zilvinas Tamulis

Jan 13, 2025

8 min read

What is Playwright?

Playwright is a modern web scraping and browser automation framework that simplifies data extraction from web pages. It supports multiple headless browsers, including Chromium, Firefox, and WebKit, making it a convenient tool that covers many popular developer requirements. It also offers a great and simple API that allows developers to interact with dynamic user interfaces, locate elements using CSS selectors, and easily extract structured data.

While Playwright is a relatively new actor on the scene, it stands out above many older tools for its extensive list of features. It excels at handling modern, JavaScript-heavy websites and supports multiple programming languages, including JavaScript, Python, Java, and C#, allowing developers to write scripts in their preferred language. Playwright can also create isolated browser contexts that enable scraping across multiple pages simultaneously without sharing state, making it both efficient and secure. You can tell it was created by people familiar with the struggles of web scraping, who packed all the best features into this fantastic framework.

If you feel like the websites you're trying to get data from are as complex as the intricate schemes of William Shakespeare's Much Ado About Nothing – worry not, as Playwright is built to tackle any web scraping or web automation challenges easily.


Methods for web scraping using Playwright

Playwright provides several powerful methods for web scraping, with the same API available across different programming languages, including JavaScript (Node.js), Python, Java, and C#. Here's a list of a few of them:

  1. Page navigation. With Playwright, you can navigate to a web page using functions such as page.goto(). This allows you to navigate the website's pages, which is especially useful when content isn't limited to a single page. It's a commonly used method for scraping eCommerce websites that list products across several pages.
  2. Element selection. Playwright allows you to select elements on the page using CSS selectors or XPath. Regardless of your preference, the framework lets you easily select HTML elements with methods such as page.locator() or page.$$(). Once elements are selected, you can extract various types of data, including text, links, images, and attributes.
  3. Handling dynamic content. Playwright can interact with JavaScript-heavy websites by waiting for elements to load with page.waitForSelector() or page.waitForTimeout(), ensuring the content is fully loaded before scraping.
  4. Interacting with elements. Playwright allows you to simulate actions like clicking buttons, filling out forms, and scrolling through pages to load more content. Methods such as page.click() are helpful for scraping content behind interactive elements.
  5. Handling browser contexts. Playwright's support for multiple browser contexts allows you to scrape data from various pages or simulate user sessions without conflicts. Paired with reliable proxies, this feature is a great way to stay anonymous and undetected while browsing. This is useful for multi-tab scraping, multiple account management, or automating several actions simultaneously.
  6. Network interception. You can intercept network requests and responses using page.route() to gather dynamically loaded data via API calls, providing an advanced method of scraping data directly from the network traffic.
  7. Browser automation. Playwright enables automating complex workflows, such as logging into websites, submitting forms, and navigating through various pages, making it suitable for scraping data from applications with login mechanisms or multi-step interactions.

Web scraping with Playwright: a step-by-step guide

Now that you know the whole repertoire of Playwright, let's get started with setting it up for web scraping. For this tutorial, we're going to use Node.js, but you can also install the framework using Python. Follow these steps to set up and get started right away:

  1. Install Playwright. You can get Playwright using npm, yarn, or pnpm by entering the command below into your terminal. You'll have a few prompts to answer, such as picking between TypeScript and JavaScript, naming your tests folder, and choosing which browsers to install:

npm

npm init playwright@latest

yarn

yarn create playwright

pnpm

pnpm create playwright

2. Include Playwright in your script. Create a new JavaScript (.js) file and include the line below at the beginning. You can swap the chromium option for webkit or firefox if you have them installed.

const { chromium } = require('playwright');

3. Navigate to a web page. For this example, we're going to extract data from a website called ScrapeMe, which is ideal for various web scraping tests. With the following code, you'll be able to launch a new browser window and navigate to the web page:

const { chromium } = require('playwright');

(async () => {
  // Launch a new browser instance
  const browser = await chromium.launch({ headless: false });
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the ScrapeMe website
  await page.goto('https://scrapeme.live/shop/');
  // After the above actions are performed, close the browser
  await browser.close();
})();

4. Select and extract a specific element. The website has a list of items similar to those of a regular online shop. While Playwright offers a wide range of features to interact with web pages, for this example, we'll simply select the 3rd product from the list by its class name. Let's expand the previous code:

const { chromium } = require('playwright');

(async () => {
  // Launch a new browser instance
  const browser = await chromium.launch({ headless: false });
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the ScrapeMe website
  await page.goto('https://scrapeme.live/shop/');
  // Select all elements matching the class
  const productElements = await page.$$('.woocommerce-loop-product__title');
  // Access the 3rd element (index 2) and get its text content
  const thirdProductTitle = await productElements[2].textContent();
  console.log(`3rd Product Title: ${thirdProductTitle}`);
  // After the above actions are performed, close the browser
  await browser.close();
})();

The script opens a browser window, navigates to the target website, selects all elements with a defined class, and then prints the text content of the 3rd element from a list of items with that class. If you're unsure how to inspect a website's HTML and find the class name of the titles, check out our comprehensive guide on inspecting elements.

In these examples, we used the headless: false option, which makes the browser visible when performing script actions. You can set it to true to save computer resources and only get the result in your terminal.

Proxy implementation

While Playwright allows automated scripts to work for you, remember that the requests still come from your IP address. For pure anonymity and risk-free web scraping, it's highly recommended that you use high-quality proxies. Smartproxy offers a great range of cheap and effective proxy solutions with locations from 195+ countries, <0.3s average speed, and 99.99% uptime, ensuring that your web scraping activities with Playwright go undetected.

To use proxies with Playwright, you can pass proxy settings through the browser's launch or launchPersistentContext options. Playwright supports proxy integration via the proxy object, which accepts the proxy server URL along with an optional username and password for authentication.

Here's how you can modify your script to include the proxy with authentication:


const { chromium } = require('playwright');

(async () => {
  // Launch a new browser instance with proxy settings.
  // Proxy credentials belong in the proxy object itself - httpCredentials
  // is for HTTP Basic auth on the target site, not for proxy authentication.
  const browser = await chromium.launch({
    headless: false,
    proxy: {
      server: 'http://gate.smartproxy.com:10001',
      username: 'user',
      password: 'pass',
    },
  });
  // Open a new browser context and a single page
  const context = await browser.newContext();
  const page = await context.newPage();
  // Check the IP by navigating to the IP check URL
  await page.goto('https://ip.smartproxy.com/ip');
  const content = await page.evaluate(() => document.body.innerText);
  console.log(`Your IP: ${content}`);
  // Navigate to the ScrapeMe website
  await page.goto('https://scrapeme.live/shop/');
  // Select all elements matching the class
  const productElements = await page.$$('.woocommerce-loop-product__title');
  // Access the 3rd element (index 2) and get its text content
  const thirdProductTitle = await productElements[2].textContent();
  console.log(`3rd Product Title: ${thirdProductTitle}`);
  // Close the browser
  await browser.close();
})();

This script does several things. First, it connects to a proxy server so that all subsequent requests go through a different IP address. Then, it makes a request to the Smartproxy IP-checker website and prints your IP address so you can verify that the connection comes from an address different from your own. Finally, it makes the same request to the ScrapeMe website as before and prints the title of the 3rd product on the page.

Playwright vs. other frameworks

Playwright isn't the only name mentioned in the end credits roll of the most popular web scraping tools. Two more famous names pop up when searching for the most efficient frameworks – Puppeteer and Selenium. How do these tools differ from Playwright, and when should you choose them? Below is a brief comparison table:

| | Playwright | Puppeteer | Selenium |
|---|---|---|---|
| Speed | Fast (supports modern browsers) | Fast (only for Chromium-based browsers) | Slower (supports older browsers) |
| Features | Advanced automation, cross-browser support | Focus on Chromium; fewer features for other browsers | Extensive but less modernized features |
| Efficiency | High (headless browser by default, runs several instances at once) | High (limited to Chromium, suitable for modern setups) | Medium (larger footprint due to legacy support) |
| Ease of use | Easy (developer-friendly API, easy setup) | Easy (simple APIs) | Moderate (steeper learning curve) |
| Community | Small (backed by Microsoft) | Medium (supported by Google) | Large (long-standing veterans of the industry) |
| Documentation | Excellent (detailed and regularly updated) | Good (focused on Chromium use cases) | Extensive (covers legacy and modern use cases) |
| Browser support | Chromium, Firefox, WebKit | Chromium-based browsers only | Chromium, Firefox, Safari, Internet Explorer, Edge |
| Programming language support | Multiple (JavaScript, Python, Java, C#, etc.) | Limited (primarily JavaScript) | Extensive (JavaScript, Python, Java, Ruby, etc.) |

Playwright vs. Puppeteer for scraping

Playwright and Puppeteer offer fast and efficient scraping capabilities but cater to different audiences. Playwright supports multiple browsers, making it ideal for cross-browser scraping tasks, whereas Puppeteer focuses exclusively on Chromium-based browsers. Playwright also provides advanced features like headless mode by default and concurrent sessions, giving it an edge in efficiency for complex workflows. However, Puppeteer’s simplicity and close integration with Chromium make it an excellent choice for more straightforward scraping projects.

Playwright vs. Selenium for scraping

Playwright and Selenium are another pair of excellent frameworks that shine on different stages. Playwright offers modern APIs, headless browser mode by default, and superior efficiency, making it ideal for complex workflows. In contrast, Selenium has extensive support for legacy browsers like Internet Explorer and a wider range of programming languages, making it a better choice for projects needing legacy compatibility. While Selenium boasts a larger community and a more mature ecosystem, Playwright is faster and more efficient for modern browser automation tasks.

Curtain call

Playwright comes on stage to take the final bow – did you enjoy the performance? With its extensive browser support, modern features, and easy setup and usability, Playwright has undoubtedly earned a standing ovation for its role as one of the best frameworks for web scraping. Whether you're tackling a complex web automation project or running a simple script to extract data, this tool ensures the show goes on without a hitch.


About the author

Zilvinas Tamulis

Technical Copywriter

Zilvinas is an experienced technical copywriter specializing in web development and network technologies. With extensive proxy and web scraping knowledge, he’s eager to share valuable insights and practical tips for confidently navigating the digital world.


All information on Smartproxy Blog is provided on an as-is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Smartproxy Blog or any third-party websites that may be linked therein.

Frequently asked questions

Can Playwright be used for scraping?

Yes, Playwright is great for web scraping. It uses headless browsers to extract data from web pages efficiently. Its support for multiple programming languages and ability to handle dynamic content makes it a robust web scraping tool.

How to use Playwright for web scraping?

Install Playwright with npm, yarn, or pnpm, include it in your script, launch a browser instance, navigate to the target page with page.goto(), select elements using CSS selectors or XPath, and extract the data you need. The step-by-step guide above walks through a complete example.

Is Selenium better than Playwright for web scraping?

It depends on your project. Selenium supports legacy browsers like Internet Explorer and a wider range of programming languages, while Playwright is faster, more efficient, and better suited for modern, JavaScript-heavy websites.

What is the difference between Puppeteer and Playwright for scraping?

Puppeteer focuses exclusively on Chromium-based browsers and primarily JavaScript, making it a good fit for simpler projects. Playwright supports Chromium, Firefox, and WebKit across multiple languages and offers advanced features like isolated browser contexts for concurrent scraping.

© 2018-2025 smartproxy.com, All Rights Reserved