Scraping is a technique for automatically extracting data from the web.
Scraping usually involves a script or program that simulates human behaviour on the Internet.
- The first technique is search engine scraping, which automates what a human would otherwise do by hand. We start by selecting keywords that, when entered into a search engine, return results relevant to what we are looking for. We then use the APIs (application programming interfaces) of search engines such as Google, Bing or Yahoo! to run queries with those keywords. We retrieve all the links returned by the APIs and scrape each one. This lets us collect the information we are looking for and save it in our database. The solution could be improved with NLP (natural language processing: teaching computers to understand human language).
- The second technique is targeted scraping. We select a list of websites that contain the information we need. Instead of scraping each site in full, we collect only the information we are interested in, at specific locations identified by HTML tags. We then save this information in our database. The advantage of this technique is that only relevant, high-quality data is collected. However, it yields fewer companies, and it requires more development work because each site has its own HTML structure and therefore needs its own extraction code.
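Targeted scraping can be illustrated with a minimal sketch. It assumes a hypothetical page fragment (`SAMPLE_HTML`) and hypothetical class names (`company-name`, `company-city`); a real site would have its own markup. To stay dependency-free it uses the standard library's `html.parser` rather than a dedicated scraping library, and it only handles flat elements with no nested tags inside the targeted ones.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML.
SAMPLE_HTML = """
<html><body>
  <div class="company">
    <h2 class="company-name">Acme Corp</h2>
    <span class="company-city">Paris</span>
  </div>
  <div class="company">
    <h2 class="company-name">Globex</h2>
    <span class="company-city">Lyon</span>
  </div>
  <p class="footer">Unrelated text we do not want</p>
</body></html>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying one of the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.current = None   # wanted class currently being captured, if any
        self.results = []     # (class_name, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.current = name

    def handle_data(self, data):
        # Keep text only while we are inside a targeted element.
        if self.current and data.strip():
            self.results.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture.
        self.current = None


def extract_companies(html):
    """Return one dict per company found at the targeted locations."""
    parser = ClassTextExtractor(["company-name", "company-city"])
    parser.feed(html)
    parser.close()
    names = [text for cls, text in parser.results if cls == "company-name"]
    cities = [text for cls, text in parser.results if cls == "company-city"]
    return [{"name": n, "city": c} for n, c in zip(names, cities)]


companies = extract_companies(SAMPLE_HTML)
print(companies)
```

Only the text inside the targeted elements is kept, and the footer paragraph is ignored. The selectors are tied to one site's markup, which is why each new site needs its own extraction code.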
To develop this expertise, we wrote Python scripts that rely on a few libraries:
- Selenium and BeautifulSoup are the scraping libraries: Selenium drives a real browser, and BeautifulSoup parses the resulting HTML. Together they let us collect information from a web page by HTML element (tag, class or id).
- Pandas lets us manipulate dataframes and concatenate the data from the different Excel files.
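As a rough sketch of the concatenation step, the example below merges two hypothetical result sets with pandas. The column names and rows are invented for illustration, and the frames are built in memory. In practice each one might be loaded with `pd.read_excel`, which needs an Excel engine such as openpyxl installed.

```python
import pandas as pd

# Hypothetical results from two scraping runs (illustrative data only).
run_1 = pd.DataFrame({"name": ["Acme Corp", "Globex"], "city": ["Paris", "Lyon"]})
run_2 = pd.DataFrame({"name": ["Globex", "Initech"], "city": ["Lyon", "Lille"]})

# Stack the runs into one dataframe with a fresh 0..n-1 index.
combined = pd.concat([run_1, run_2], ignore_index=True)

# The same company can appear in several runs, so drop exact repeats.
deduplicated = combined.drop_duplicates(subset=["name", "city"]).reset_index(drop=True)

print(deduplicated)
```

Dropping duplicates after concatenation is an assumption about the workflow. It is useful when several sources list the same company.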
With scraping you can collect a lot of data from the web, but there are limits. Since the Cambridge Analytica case, data on the Internet has become more tightly protected. The legal status of scraping is also still unclear. Websites defend themselves, and it is not uncommon to be blocked from them (for example by CAPTCHAs or IP bans).
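One way a scraper can lower the risk of being blocked is to honour a site's `robots.txt` rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser`. The rules text, the user agent name and the URLs are all hypothetical, and the rules are parsed from a string so that the example runs offline. Following `robots.txt` does not settle the legal questions, and it does not guarantee that a site will not block the scraper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it
# from the root of the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-scraper"  # hypothetical user agent name


def build_rules(robots_text):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)

# Check whether each URL may be fetched by our user agent.
allowed = rules.can_fetch(USER_AGENT, "https://example.com/companies/acme")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, if stated.
delay = rules.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

A scraper could skip any URL for which `can_fetch` returns `False` and sleep for `delay` seconds between requests.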