How to Double Your Revenue from Web Scraping with Python

112.13% boost, a surge in traffic, and a shortcut to price aggregation

Are you ready for your business to take off too?


Consider the case of an online e-commerce business: a small startup of 15 people with a starting budget of around $14,000. Despite that, they managed to double their revenue and compete with retail giants.

Since we cannot disclose the name, let’s call this company “Shopmania”. They compare prices and help their clients find the cheapest option. But how do they find it themselves? Through scraping, of course. Let’s see exactly how they find trending items and anything else that’s reduced on Amazon!

Attempts with ParseHub and an Open Source Project

“Shopmania’s” employee first tried a well-known scraping tool – ParseHub. However, it was a minefield: ParseHub was expensive, couldn’t handle complex logic, and required a lot of manual editing. There had to be a better way.

What about reworking an open source project? It made sense – it was fast and free. The downside was that it required coding and scraping experience, and that not all open source solutions were properly maintained. 

Let’s see how he did it.

The go-to method: Python and proxies

Step one. The specialist needed the trifecta: Python, Selenium, and proxies. Selenium is a browser automation tool that supports Python and is widely used by programmers. When it comes to proxies, it’s up to you – Amazon’s blocking is not too aggressive, so you can try datacenter proxies. Residential proxies, however, are more reliable, and that’s why they were “Shopmania’s” preferred choice. Here’s the magic formula: a regular residential proxy plan + Python + Selenium.
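Selenium routes traffic through a proxy by passing Chrome a `--proxy-server` flag. As a minimal sketch (the hostname and port below are placeholders, not a real gateway), building that flag is just string formatting:

```python
def build_proxy_arg(host: str, port: int, scheme: str = "http") -> str:
    """Format a proxy endpoint as a Chrome --proxy-server argument."""
    return f"--proxy-server={scheme}://{host}:{port}"

# Placeholder gateway -- substitute your provider's actual endpoint
print(build_proxy_arg("gate.example-proxy.net", 7000))
# --proxy-server=http://gate.example-proxy.net:7000
```

You would then pass the result to `chrome_options.add_argument(...)` before starting the driver; authenticated gateways usually take credentials via a browser extension or an authenticated endpoint instead.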


Step two. Here is the script that “Shopmania” used to scrape Amazon (in its final form):

import re

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_python import smartproxy
import json

def scraper():
    chrome_options = Options()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    desired_capabilities = smartproxy()

    driver = webdriver.Chrome('selenium-scraper/chromedriver',
                              options=chrome_options,
                              desired_capabilities=desired_capabilities)
    items_xpath = "//div[starts-with(@id, '100_dealView')]"
    driver.get('https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn&nocache=1569852387822')

    elems = driver.find_elements_by_xpath(items_xpath)

    items = {}
    single_digit_deal = re.compile(r'\d% off')
    double_digit_deal = re.compile(r'\d{2}% off')

    for i, item in enumerate(elems):
        # re.findall() returns a list, so truthiness is the right check
        off_match = single_digit_deal.findall(item.text)
        if not off_match:
            off_match = double_digit_deal.findall(item.text)
        if off_match:
            entry = strip_data(item.text)
            if entry:
                items[i] = entry

    # write everything once, after the loop, so earlier items are not overwritten
    with open('products.json', 'w') as f:
        f.write(json.dumps(items, sort_keys=True, indent=4))

    driver.quit()


def strip_data(data):
    """Extract the title and prices from a deal's text, or return None."""

    price_regex = re.compile(r'\$\d{1,10}\.\d{1,2}')  # note the escaped dot
    price = price_regex.findall(data)

    # assuming those with a price are legit
    if not price:
        return None

    # account for price ranges
    if len(price) == 1:
        # ignore random one-price items
        return None
    elif len(price) == 2:
        new_price = price[0]
        list_price = price[1]
    elif len(price) >= 4:
        # price range gotcha: both new and list price are ranges
        new_price = f"{price[0]} - {price[1]}"
        list_price = f"{price[2]} - {price[3]}"
    else:
        return None

    # skip time-limited deals
    if "Ends in" in data:
        return None

    # check for misplaced items with a basic length check
    title_slice = data.split('\n')[2]
    if len(title_slice) < 10:
        return None

    return {
        'product_title': title_slice,
        'new_price': new_price,
        'list_price': list_price,
    }



if __name__ == '__main__':
    scraper()

Of course, this script is not the final version of what you should use to make a fortune, but it’s a step in the right direction. Simply edit it, adjust it to your needs, and send us a thank-you note.
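The heart of the script is two regular expressions: one that spots a discount badge and one that pulls out dollar prices. Here is a quick standalone demonstration with an invented deal text (the product and prices are made up for illustration):

```python
import re

# Invented sample of the text a deal element might contain
deal_text = "Deal of the Day\n25% off\nExample 4K Monitor, 27-inch\n$149.99\n$199.99"

# A discount badge like "25% off"
discount = re.findall(r'\d+% off', deal_text)

# Dollar prices: dollar sign, digits, an escaped dot, then cents
prices = re.findall(r'\$\d{1,10}\.\d{1,2}', deal_text)

print(discount)  # ['25% off']
print(prices)    # ['$149.99', '$199.99']
```

With two prices found, the script treats the first as the deal price and the second as the list price; more than two means a price range.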

Market data is valuable: 112.13% revenue boost

This code is one of the last stages of the quest. Expect a lot of intel – and a lot of material for your hustle.
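To make the output concrete, here is an invented example of what the products.json the script writes could look like (the title and prices are made up):

```python
import json

# Hypothetical example of a scraped deal, keyed by its index on the page
sample = {
    "3": {
        "product_title": "Example 4K Monitor, 27-inch",
        "new_price": "$149.99",
        "list_price": "$199.99",
    }
}

print(json.dumps(sample, sort_keys=True, indent=4))
```

Each run produces one such entry per qualifying deal, ready to feed into a price-comparison database.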

As you can imagine, “Shopmania” really took off – and all the business needed was a custom script and some proxies. Now they work as affiliates with the retail giants they originally wanted to compete with.

The company ended up boosting its revenue by 112.13%, reducing staffing costs, and providing more value to new users.

Impressive, isn’t it?

Are you ready for your takeoff?