How to Double Your Revenue from Web Scraping with Python
112.13% boost, a surge in traffic, and a shortcut to price aggregation
Are you ready for your business to take off too?
Consider the case of an e-commerce website: a small startup of 15 people with a starting budget of around $14,000. Even so, they managed to double their revenue and compete with retail giants.
Since we cannot disclose the name, let’s call this company “Shopmania”. They compare prices and help their clients find the cheapest option. But how do they find it themselves? Through scraping, of course. Let’s see exactly how they find trending items and anything else that’s reduced on Amazon!
Attempts with Parsehub and an Open Source Project
A “Shopmania” employee first tried a well-known scraping tool, Parsehub. However, it was a minefield. Parsehub was expensive, could not handle complex logic, and required a lot of editing. There had to be a better way.
What about reworking an open-source project? It made sense: it was fast and free. The downsides were that it required coding and scraping experience, and that not all open-source solutions were properly maintained.
Let’s see how he did it.
The go-to method: Python and proxies
Step one. The specialist needed the trifecta: Python, Selenium, and proxies. Selenium is an automation tool that supports Python and is widely used by programmers. When it comes to proxies, it’s up to you – Amazon is not too careful, so you can try datacenter proxies. On the other hand, residential proxies are more reliable, and that’s why they were “Shopmania’s” preferred choice. Here’s the magic formula: Regular residential proxy plan + Python + Selenium.
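To make the proxy part of the formula more concrete, here is a minimal sketch of one common way to point Chrome at a proxy: the browser's --proxy-server command-line flag. It only builds the argument list, so it runs without a browser. The function name build_chrome_args and the endpoint proxy.example.com:10000 are made up for illustration, and the sketch assumes the proxy authorizes you by whitelisted IP, since this flag does not carry a username or password.

```python
def build_chrome_args(proxy_host, proxy_port):
    """Return Chrome command-line arguments that route traffic through a proxy.

    proxy_host and proxy_port are placeholders: use the endpoint from your
    own proxy plan.
    """
    return [
        '--disable-gpu',
        '--no-sandbox',
        '--disable-dev-shm-usage',
        # Chrome sends its HTTP and HTTPS traffic through this proxy
        f'--proxy-server=http://{proxy_host}:{proxy_port}',
    ]


# Hypothetical endpoint, for illustration only
chrome_args = build_chrome_args('proxy.example.com', 10000)
for arg in chrome_args:
    print(arg)
```

Each string in the list would then be passed to chrome_options.add_argument(...) before the driver is created.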
Step two. Here is the script that “Shopmania” used to scrape Amazon (in its final form):
import json
import re

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium_python import smartproxy

DEALS_URL = 'https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn&nocache=1569852387822'
ITEMS_XPATH = "//div[starts-with(@id, '100_dealView')]"
PRICE_REGEX = re.compile(r'\$\d{1,10}\.\d{1,2}')


def scraper():
    chrome_options = Options()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # smartproxy() supplies the proxy settings as browser capabilities
    desired_capabilities = smartproxy()
    # Selenium 3 style constructor; newer Selenium releases configure the
    # driver path and capabilities differently
    driver = webdriver.Chrome('selenium-scraper/chromedriver', options=chrome_options, desired_capabilities=desired_capabilities)
    items = {}
    try:
        driver.get(DEALS_URL)
        elems = driver.find_elements(By.XPATH, ITEMS_XPATH)
        for i, item in enumerate(elems):
            product = strip_data(item.text)
            if product:
                items[i] = product
    finally:
        driver.quit()
    # Write every product once, after the loop, so nothing is overwritten
    with open('products.json', 'w') as f:
        f.write(json.dumps(items, sort_keys=True, indent=4))


def strip_data(data):
    """Return a product dict parsed from one deal tile's text, or None."""
    price = PRICE_REGEX.findall(data)
    # assuming those with price are legit
    if len(price) == 2:
        new_price, list_price = price
    elif len(price) >= 4:
        # price range gotcha: both the new and the list price are ranges
        new_price = f"{price[0]} - {price[1]}"
        list_price = f"{price[2]} - {price[3]}"
    else:
        # ignore tiles with no price, a single price, or an odd count
        return None
    # skip countdown tiles
    if "Ends in" in data:
        return None
    lines = data.split('\n')
    # the title is expected on the third line; skip misplaced or short ones
    if len(lines) < 3 or len(lines[2]) < 10:
        return None
    return {
        'product_title': lines[2],
        'new_price': new_price,
        'list_price': list_price,
    }


if __name__ == '__main__':
    scraper()
Of course, this script is not the final version of what you should be using to make a fortune. However, it’s a step in the right direction. Simply edit it, adjust it to your needs, and send us a thank you note.
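Before adjusting the script, it helps to test the price parsing offline, without a browser or proxies. The sketch below is a simplified stand-in for the script's price handling, not a copy of it. It uses a variant of the script's price pattern with the dot escaped. The helper name parse_prices and the sample tile texts are made up, and real Amazon tiles may be laid out differently.

```python
import re

PRICE_REGEX = re.compile(r'\$\d{1,10}\.\d{1,2}')


def parse_prices(tile_text):
    """Return (new_price, list_price) from a deal tile's text, or None."""
    prices = PRICE_REGEX.findall(tile_text)
    if len(prices) == 2:
        # one deal price followed by one list price
        return prices[0], prices[1]
    if len(prices) >= 4:
        # both the deal price and the list price are ranges
        return f"{prices[0]} - {prices[1]}", f"{prices[2]} - {prices[3]}"
    # no price, a single price, or an odd count: skip the tile
    return None


# Made-up tile texts, for illustration only
sample_single = "Deal of the Day\n$19.99\nList: $39.99 (50% off)\nWireless Mouse"
sample_range = "$10.00 - $15.00\nList: $20.00 - $30.00\nAssorted Phone Cases"

print(parse_prices(sample_single))  # ('$19.99', '$39.99')
print(parse_prices(sample_range))   # ('$10.00 - $15.00', '$20.00 - $30.00')
print(parse_prices("Sold out"))     # None
```

Checking the parsing against saved tile text like this makes it easier to spot layout changes before they silently empty your output file.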
Market data is valuable: 112.13% revenue boost
This code is one of the last stages of the quest. Expect a lot of intel, and a lot of material for your hustle.
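One way to turn the scraped output into usable intel is to rank products by how deep the discount is. The sketch below assumes records shaped like the entries in products.json (product_title, new_price, list_price). The function names and sample products are hypothetical. Price ranges are skipped here because they need their own handling.

```python
def discount_percent(new_price, list_price):
    """Return the discount as a percentage of the list price, to 1 decimal."""
    new_value = float(new_price.lstrip('$'))
    list_value = float(list_price.lstrip('$'))
    if list_value <= 0:
        return 0.0
    return round((list_value - new_value) / list_value * 100, 1)


def rank_by_discount(products):
    """Return single-price products sorted from biggest to smallest discount."""
    single = [
        p for p in products.values()
        if ' - ' not in p['new_price'] and ' - ' not in p['list_price']
    ]
    return sorted(
        single,
        key=lambda p: discount_percent(p['new_price'], p['list_price']),
        reverse=True,
    )


# Made-up records in the same shape as products.json
products = {
    "0": {"product_title": "USB-C Cable 3-Pack", "new_price": "$9.00", "list_price": "$12.00"},
    "1": {"product_title": "Wireless Mouse", "new_price": "$19.99", "list_price": "$39.99"},
    "2": {"product_title": "Assorted Phone Cases", "new_price": "$10.00 - $15.00", "list_price": "$20.00 - $30.00"},
}

ranked = rank_by_discount(products)
for product in ranked:
    print(product['product_title'], discount_percent(product['new_price'], product['list_price']))
```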
As you can imagine, “Shopmania” really took off – and all the business needed was a custom script and some proxies. Now they work as affiliates with the retail giants that they wanted to compete with originally.
The company ended up boosting its revenue by 112.13%, cutting down on staffing needs, and providing more value to new users.
Impressive, isn’t it?
Are you ready for your takeoff?