How to Scrape the Web for Data Journalism
Data journalism draws on the vast amount of digital data available online to produce complex and engaging stories. It also helps journalists deliver the accurate, up-to-date information that readers now expect. But do you really have to know how to scrape the web for this? Yup, you do.
What Is Data Journalism?
If you said that data journalism is a bit like data science, you wouldn’t be far from the truth. To produce accurate insights with journalistic value, data journalists have to gather data from the web in bulk and then analyze it systematically. This is why web scraping is the backbone of data journalism.
The Ethics of Web Scraping for Data Journalists
Ok, let’s address the elephant in the room. Yes, scraping the web for journalism is generally legal and ethical as long as you follow the applicable laws and the website’s own rules. If someone publishes data openly on their website, anyone can view it, so collecting it is usually fair game. Of course, journalists should not collect any data that is hidden from regular visitors. If visitors cannot access it, journalists shouldn’t try to get it either.
Besides, reporters should always read the Terms & Conditions of a particular website to avoid any legal trouble.
To Identify Yourself, or Not to Identify?
Well, that’s the real question, Shakespeare. Naturally, there are two ways to address this issue.
Some data journalists believe that it is necessary to identify themselves when they’re scraping a specific website. After all, it’s common practice to introduce yourself when you’re about to interview someone.
So you might want to add a note to your HTTP request headers (the User-Agent header is a common place for it) that contains your name and the fact that you’re a journalist. You can also leave your phone number, and make a webmaster smile by wishing them a great day!
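Here’s a minimal sketch of what such a note could look like, using Python’s standard urllib library. The name, outlet, email address, and URL are made-up placeholders, and putting your details in the User-Agent and From headers is just one common convention, not a rule.

```python
from urllib.request import Request


def build_identified_request(url, name, outlet, contact_email):
    """Build a request whose headers say who is scraping and how to reach them."""
    headers = {
        # Free-text description of the client; here it names the journalist.
        "User-Agent": f"{name} ({outlet}; data journalism research)",
        # Standard HTTP header meant to carry a contact email address.
        "From": contact_email,
    }
    return Request(url, headers=headers)


# All values below are hypothetical placeholders.
request = build_identified_request(
    "https://example.com/data",
    "Jane Doe",
    "Example Daily",
    "jane.doe@example.com",
)
```

Nothing is sent until you pass the request to urllib.request.urlopen, so you can inspect the headers first and adjust the wording to your own taste.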
But some journalists disagree with this. They claim that it’s actually a good thing to remain unidentified when they’re collecting data for stories. Of course, these reporters still respect the rules of the websites, and they use proxies to stay undetected. This practice can be compared to an undercover interview, in which the subject is unaware that they are being interviewed.
Why Is Data Journalism Worth Pursuing?
Data on the internet is almost endless, and so are your possibilities when you know how to collect it. Let’s be honest, if data is easy to access, you can be almost sure that another journalist has already beaten you to it. It wouldn’t hurt to be a tad more original and dig deeper!
Not only can web scraping generate new topics, but it can also become an excellent monitoring tool and save a lot of your precious time. For instance, you can scrape different sources every other hour for specific information. It’s an easy way to keep your data up to date. Just add a keyword to your scraper, and the job is done!
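To give you an idea, here’s a rough sketch of the keyword part in Python, using only the standard library. The sample HTML and the function name find_keyword_mentions are invented for illustration. A real scraper would first download the page and would need more careful parsing (for example, this version doesn’t skip text inside script or style tags).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text fragments found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def find_keyword_mentions(html, keyword):
    """Return the text fragments that contain the keyword, ignoring case."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    needle = keyword.lower()
    return [chunk for chunk in parser.chunks if needle in chunk.lower()]


# Hypothetical page content standing in for a downloaded source.
sample_html = (
    "<html><body>"
    "<h1>City budget approved</h1>"
    "<p>The council met on Monday.</p>"
    "<p>Budget cuts affect schools.</p>"
    "</body></html>"
)
mentions = find_keyword_mentions(sample_html, "budget")
```

To turn this into a monitoring tool, you could run the same check on each of your sources from a scheduler such as cron at whatever interval suits your story.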
Of course, web scraping might look intimidating at first. But I can assure you, it’s not that hard, mostly because you can use a ready-made web scraping tool for your projects. Even if you decide to build your own scraper, there are plenty of online courses you can learn from.
One last thing: proxies are a great help too. They let you reach content from around the world, even websites that are unavailable in your country. Yup, our proxies can really make the data on the internet limitless!
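If you’re curious what routing a scraper through a proxy looks like in practice, here’s a small sketch with Python’s standard urllib library. The proxy address is a made-up placeholder, so swap in the details from your own proxy provider, and treat this as an illustration rather than a finished setup.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy address; replace it with your provider's details.
PROXY_URL = "http://proxy.example.com:8080"


def build_proxy_opener(proxy_url):
    """Create an opener that sends HTTP and HTTPS requests via the proxy."""
    proxy_handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(proxy_handler)


opener = build_proxy_opener(PROXY_URL)
# opener.open("https://example.com/") would now go through the proxy.
```

Building the opener doesn’t make any network calls. The proxy is only contacted once you actually open a URL with it.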