“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analysed for it to have value.” - Clive Humby
A frequent criticism of Kaggle and MOOCs (Massive Open Online Courses) is that they reduce data science to predictive modelling and rely on pre-cleaned or non-real-world datasets. Data science is more than that. Data retrieval and cleansing are essential parts of the field too. You and I know that data can make a difference: if you can get better data, you perform better. Do you think data retrieval is a task for a data engineer? In a small startup, or as an entrepreneur following your passions, you don’t have access to one. You might not have any data in the beginning. What is the solution to this problem?
Scraping the data from websites is a solution. Web scraping has a bad reputation, as people think that you are stealing the data. In my eyes it is unethical if you use the data to build a competing product. But what about taking labelled photos to train a neural network? I think this is OK. You might even give the company something valuable back or boost their business with your data product.
Over the last few years I got my hands dirty learning data science and building my own data products. During this time I scraped data from many websites and learned a lot. I want to share the most important lessons with you:
Before you scrape a website, ALWAYS check its Terms of Service and its robots.txt to find out what you are allowed to fetch and what you are not. This saves you a lot of trouble and the risk of a court case. Please remember that scraping is a grey zone. Do you want to get sued by Google or Amazon?
You also need to check which headers your browser sends to the website. The developer tools of the browsers, e.g. the Mozilla Browser Developer Tools, help you with this.
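The robots.txt part of this check can be automated. Here is a minimal sketch using Python’s standard library; the robots.txt content, the user agent name `my-scraper` and the example.com URLs are made up for illustration. It does not replace reading the Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. For a real site you would instead call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="my-scraper"):
    """Return True if robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/products/1"))
print(is_allowed(parser, "https://example.com/private/data"))
```

Calling `is_allowed` before every request keeps the rule check in one place, so it is hard to forget when you add new URL patterns.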
Build a Docker container for your scraping application. Scraping setups can be a bit complicated, especially if you fetch the data with Selenium, which drives a browser like Chromium or Firefox. In this situation you need operating-system-dependent drivers. You can use my Docker image with Firefox/Geckodriver and Chromium/Chromedriver on Docker Hub as a base image for your scraping application. The source can be downloaded from GitHub. Use Docker Compose: it is a tool for defining and running multi-container Docker applications. This way you can set up a database alongside your scraping container with very little effort. If you are not familiar with Docker, try it out. It’s a secret weapon for reliable runtime environments for all kinds of software projects.
Use proxies already during development and rotate them randomly. This saves you a lot of trouble. My rule of thumb is to use 10 proxies per scraping process/thread. Try to avoid free proxy lists. The proxies on these lists are often slow or already blocked, so trying them out is a waste of time. Google for proxy providers. You need to invest some money here. You can even buy non-shared proxies, but they cost more.
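Random rotation needs very little code. The sketch below uses only the standard library; the proxy addresses are placeholders from a documentation IP range and have to be replaced with the ones from your provider.

```python
import random
import urllib.request

# Placeholder addresses (documentation IP range), not working proxies.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def pick_proxy(proxies, rng=random):
    """Choose one proxy at random and return a scheme-to-proxy mapping."""
    proxy = rng.choice(proxies)
    return {"http": proxy, "https": proxy}


def opener_with_random_proxy(proxies):
    """Build a urllib opener that routes its requests through a random proxy."""
    handler = urllib.request.ProxyHandler(pick_proxy(proxies))
    return urllib.request.build_opener(handler)


# Build a fresh opener per request so that the proxy changes each time:
# opener = opener_with_random_proxy(PROXIES)
# body = opener.open("https://example.com/", timeout=30).read()
```

The mapping returned by `pick_proxy` has the same shape that the `proxies=` argument of the requests library expects, so it can be reused there if you prefer requests over urllib.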
First write a scraper that just fetches a resource and saves it. Ensure that this data is the same as the data downloaded by your browser. If this is the case, you can increase the number of processes/threads and load more data. Once you have a bunch of raw data files (not too many!), you start to develop the parser. Extract the data you need and check whether it is OK. Do an exploratory data analysis and check for data anomalies like missing values or outliers to ensure high-quality parsing logic. Log all errors while you parse. This way you can find out whether the logic works the same way for all resources.
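The parsing step with error logging could look roughly like the sketch below. It assumes that the raw files were saved as `*.html` in one directory; `parse_title` is a deliberately tiny, hypothetical parser that stands in for your real extraction logic.

```python
import logging
import re
from pathlib import Path

logger = logging.getLogger("parser")


def parse_title(html):
    """Hypothetical example parser: extract the page title or raise."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    if match is None:
        raise ValueError("no <title> element found")
    return {"title": match.group(1).strip()}


def parse_all(raw_dir, parse):
    """Run parse on every saved raw file and log failures instead of stopping."""
    records, failed = [], []
    for path in sorted(Path(raw_dir).glob("*.html")):
        try:
            records.append(parse(path.read_text(encoding="utf-8")))
        except Exception:
            # The traceback goes to the log; the loop carries on with the next file.
            logger.exception("Parsing failed for %s", path.name)
            failed.append(path.name)
    return records, failed
```

The list of failed file names tells you which resources to open by hand when you look for the cases your parsing logic does not cover yet.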
Keep the scraping logic separated from the parsing logic.
I try to scrape each URL only once. Or very, very rarely!
You don’t want to download the same resource over and over again, especially during development. We all have a kind of trial-and-error programming style: we write a piece of code, test whether it works, and do some bug fixing. This can be fatal during scraper development, as you risk getting blocked already in this phase. The block sometimes lasts only a few hours, but it also blocks your programming. Separating the parsing logic from the scraping logic reduces the lines of code in which you can make errors. Keep this in mind:
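One way to make sure each URL is downloaded only once is a small cache in front of the fetch function. The following is a sketch under the assumption that raw responses are kept as files in a local directory; the directory name `raw` and the helper names are made up for illustration.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def cache_path(url, cache_dir):
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.raw"


def download(url):
    """Fetch the raw bytes of a resource over the network."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_once(url, cache_dir="raw", fetch=download):
    """Return the raw bytes for url, hitting the network only the first time."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return body
```

While you iterate on the parser, every rerun of your script reads from the cache directory, so the website sees each URL only on the first run.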
It’s essential to keep the raw data once the resource is scraped. This has several advantages:
Set proper headers when fetching a resource from a website. Check what your browser is sending and mimic a real browser in your code. In my experience, one of the most important headers is the User-Agent header. I am sharing a very simple Python module for user agent rotation with you. It randomly picks one user agent out of thousands.
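To illustrate the idea (this is not the module mentioned above), here is a minimal sketch of user agent rotation with the standard library. The user agent strings and the other header values are examples only; copy the real ones from your browser’s developer tools.

```python
import random
from urllib.request import Request

# Example user agent strings; a real list would be much longer.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0",
]


def build_request(url, rng=random):
    """Create a request with browser-like headers and a random user agent."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }
    return Request(url, headers=headers)


# Usage: pass the request to urllib.request.urlopen(request, timeout=30).
request = build_request("https://example.com/")
```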
Save the raw data in a database. I love MongoDB’s GridFS for this task. The API is very simple. This gives you several advantages:
Use a database for the parsed data. Query languages give you much better access to the data.
MongoDB is an excellent choice here, as you don’t need a schema; you can just save the data as it is. This is especially handy if you are able to scrape from an internal API. APIs often return JSON, and you can toss the JSON as it is into MongoDB.
Once you have a bunch of raw files and your parser is written, do an Exploratory Data Analysis (EDA) to check the data quality. Check for missing values. I wrote a blog post about Missing values visualization with tidyverse in R which might help you. Do a univariate data analysis to check each feature for anomalies. Do a multivariate analysis to check whether values in linked features are consistent. Visualizations are your best friends here.
This step is essential, yet many people skip it. It reveals the weak points in your parsing logic. You might go through several iterations to improve the parsing logic until you are happy.
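A first check for missing values does not need much tooling. The sketch below assumes the parsed records are plain dictionaries and treats `None`, empty strings and absent keys as missing; the field names and values are invented for illustration.

```python
from collections import Counter


def missing_share(records, fields):
    """Return, per field, the share of records where the value is missing."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    missing = Counter()
    for record in records:
        for field in fields:
            # An absent key, None and "" all count as missing.
            if record.get(field) in (None, ""):
                missing[field] += 1
    return {field: missing[field] / total for field in fields}


# Invented sample of parsed records.
records = [
    {"title": "A", "price": 10},
    {"title": "B", "price": None},
    {"title": ""},
    {"title": "D", "price": 5},
]
print(missing_share(records, ["title", "price"]))
```

A field with a high share of missing values usually points to pages where the parser’s selector does not match, which is exactly the kind of weak point this step is meant to surface.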
Feel free to send me a message in case you need support in your scraping project.
Written on February 19th, 2019 by Jens Laufer