Web Scraping for Academic Research

Mariam Nakani

Say hello to Mariam! She is very tech-savvy and wants you to be too. She has a lot of intel on residential proxy providers and uses it to give you a clear view of what is really worth your attention.

OK, let’s be real. Academic research is not all sunshine and rainbows. Whether you’re doing quantitative or qualitative research, you have to gather massive amounts of data. Are there any shortcuts? Hell yes! Cue web scraping.


How Academics Learned to Stop Worrying and Love Web Scraping

Web scraping can do wonders for your academic research. Indeed, more and more academics rely on this method as it allows them to perform their research more efficiently. For example, you can scrape the web to gather data from web forums and social media or monitor web page changes over time. Besides, you can scrape academic papers to find the ones relevant to your research!

But how do you scrape for academic research? In short, carefully. If you’re trying to reach data that sits behind a login, or you’re collecting information from a private forum, you might be wading into murky waters.

So let’s dissect some ethical issues that come with web scraping for academic research.

Abide by the Rules

There’s a golden rule in the web scraping world: if a regular user cannot access the data on the website, you shouldn’t try to reach it either. It is probably sensitive information that you shouldn’t get your hands on anyway.

Besides, before starting any web scraping project, be sure to reach out to both your university’s IT department and its institutional review board (IRB) to form a data management plan. Also, always read the website’s Terms & Conditions to avoid legal trouble, and check whether the site offers its own API.

Respect the Websites You’re Scraping

Well, respect will never go out of style. So when you’re scraping, be considerate of the site’s bandwidth. For instance, if you’re not coding yourself, pick a web scraping application designed to gather only the files you’re after. This way, you consume far less bandwidth, make your scraping more efficient, and minimize the impact on the website’s servers.

Moreover, be sure to wait at least a few minutes between your requests and scrape during off-peak hours if possible. In the meantime, grab that lovely cup of coffee!
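
If you do code yourself, the waiting advice above can be sketched in a few lines of Python. This is only a minimal illustration, not a finished scraper: the function names, the three-minute default delay, and the 1 a.m. to 6 a.m. "off-peak" window are all assumptions made for the example, and the real off-peak hours depend on the site and its audience's time zone.

```python
import time
from datetime import datetime
from typing import Callable, Iterable, List


def is_off_peak(now: datetime, start_hour: int = 1, end_hour: int = 6) -> bool:
    """Return True if `now` falls inside the assumed off-peak window.

    The default window (01:00 to 06:00) is a placeholder, not a rule.
    """
    return start_hour <= now.hour < end_hour


def polite_crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing `delay_seconds` between requests.

    `fetch` is whatever function downloads one page; it is passed in so
    the pacing logic stays separate from the download code.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

You could call `polite_crawl(my_urls, fetch=my_download_function)` only when `is_off_peak(datetime.now())` returns True, where `my_urls` and `my_download_function` stand in for your own list and downloader.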


How to Scrape Social Media for Academic Purposes

Social media is a cornucopia of examples of political and social behavior for many researchers. It lets you conduct observational research on topics like the dynamics of political engagement or the spread of fake news.

But it’s not a come-and-get-it situation here. You have to be really conscious of how you collect this data for your academic needs.

See, social media holds personal data, and many legal regulations protect such data. Besides, the scientific community’s own ethical standards dictate that you have to preserve users’ privacy. This means you have to avoid any harm that could result from someone linking a person mentioned in your research to an actual individual.

Additionally, you cannot observe any of your subjects in their private environment. This may include, for instance, their Facebook wall, private messages, or closed groups you don’t have access to. I mean, you don’t want to be Big Brother, do you?

Of course, it’s quite unlikely that individuals will be personally harmed by a data leak if you’re doing quantitative research. You have to be more careful when conducting qualitative research, though, as you may disclose individuals’ data by quoting users’ posts as evidence. The best approach is pseudonymization: replacing real identifiers with stand-in ones. This allows you to analyze data and track your subjects’ activities without harming them.
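
One common way to pseudonymize usernames is a keyed hash, sketched below in Python. Treat it as an illustration under stated assumptions rather than the one right method: the function name, the `user_` prefix, and the 12-character length are made up for this example, and the secret key is something you would generate yourself and store apart from the dataset.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes, length: int = 12) -> str:
    """Map a username to a stable pseudonym using HMAC-SHA256.

    The same username and key always give the same pseudonym, so a
    subject's activity can still be tracked across posts.
    """
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:length]
```

A keyed hash is used here instead of a plain hash so that someone holding a list of real usernames cannot simply recompute the pseudonyms without the key.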

Ethical Web Scraping With Proxies

Proxies play a huge role here too. For example, if you need to gather a lot of data from the same website, you will most likely need ethically sourced residential or datacenter proxies. They will help you avoid IP bans and gather relevant information more efficiently.
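
For a rough idea of what routing requests through a proxy looks like in code, here is a minimal sketch using only Python's standard library. The function name and the proxy address in the usage note are placeholders; your provider will give you the actual host, port, and credentials.

```python
from urllib.request import OpenerDirector, ProxyHandler, build_opener


def build_proxy_opener(proxy_url: str) -> OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests via `proxy_url`."""
    proxies = {"http": proxy_url, "https": proxy_url}
    return build_opener(ProxyHandler(proxies))
```

With an opener built from a hypothetical address such as `http://proxy.example.com:8080`, calling `opener.open(url)` sends that request through the proxy instead of directly from your own IP.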

Moreover, as in the case of data journalism, you should decide whether to identify yourself or not. In most cases, you’ll want to. So just add a note to your HTTP request headers that contains your name and the fact that you’re a researcher. You can also leave your phone number in case a webmaster wants to contact you!
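
Here is one way such a note could look, sketched with Python's standard library. The bot name, the wording of the note, and the function name are all invented for this example. The sketch puts the note in the User-Agent header and your email in the From header, which HTTP defines for exactly that purpose; the original advice does not prescribe a specific header, so treat this choice as an assumption.

```python
from typing import Optional
from urllib.request import Request


def build_identified_request(
    url: str,
    name: str,
    institution: str,
    email: str,
    phone: Optional[str] = None,
) -> Request:
    """Build a request whose headers say who is scraping and how to reach them."""
    contact = f"{name}, {institution}; academic research; contact: {email}"
    if phone:
        contact += f", {phone}"
    headers = {
        # "academic-research-bot/0.1" is a hypothetical name for your scraper
        "User-Agent": f"academic-research-bot/0.1 ({contact})",
        "From": email,
    }
    return Request(url, headers=headers)
```

Passing the returned object to `urllib.request.urlopen` sends the request with these headers attached, so a webmaster reading the server logs can see who you are.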

On the other hand, in some instances you might need to use proxies without identifying yourself. As mentioned before, don’t forget to drop your IT and IRB teams a line about that!