Isabel Dunkley

An introduction to Data and Web Scraping



What is Data and Web Scraping?

Data and web scraping is a process used to extract publicly available data from websites. It can be done either manually or with specialist software. Unlike screen scraping, which copies only what is displayed on screen, this method extracts the underlying HTML code and data stored in a database. Once the information is gathered, it is converted into a format that is more useful and effective for users.
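As a rough illustration of that last step, the sketch below takes a small piece of HTML and turns it into CSV, a format that opens directly in a spreadsheet. It is a minimal example rather than a complete scraper: the sample page, the `TableParser` class and the `html_table_to_csv` function are all made up for this article, and it uses only Python's standard library.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page source; a real scraper would download this from a website.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Kettle</td><td>19.99</td></tr>
  <tr><td>Toaster</td><td>24.50</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # the row currently being read, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def html_table_to_csv(html):
    """Convert the table cells found in an HTML string into CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

The same idea applies to other output formats, such as JSON or a database table; CSV is used here only because it is simple to inspect.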

How it’s done

To scrape information from a specific website, the page's source code must first be accessed. Although the HTML of a public page is sent to every visitor, many sites restrict automated access through their terms of use, so permission from the site owner may be needed. High-level programming languages such as Python are typically used to extract and analyse the HTML pages. A crawler can then be used to browse websites, follow links, and download and extract data as it goes. Another option is a scraping spider, a program that uses a crawler to navigate websites and extract data. Once collected, the data can be cleaned and formatted for business or individual use, such as market research.
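The crawling step described above can be sketched in a few lines of Python. This is only an illustration under some loud assumptions: the three-page "website" is a hypothetical dictionary held in memory, and the `fetch` function reads from that dictionary instead of the network, so the example runs without contacting any real site. The names `FAKE_SITE`, `PageParser` and `crawl` are invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for a website: maps each URL to its HTML source.
FAKE_SITE = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title><a href="/b">B</a>',
    "https://example.com/b": '<title>Page B</title><a href="/">Home</a>',
}


def fetch(url):
    """Return the HTML for a URL (assumption: reads FAKE_SITE, not the network)."""
    return FAKE_SITE.get(url, "")


class PageParser(HTMLParser):
    """Extracts the page title and every link target from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, max_pages=10):
    """Visit pages breadth-first, following links and recording each title."""
    seen = {start_url}            # URLs already queued, to avoid loops
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(fetch(url))
        titles[url] = parser.title
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links like "/a"
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


results = crawl("https://example.com/")
for url, title in results.items():
    print(url, "->", title)
```

The `seen` set is what stops the crawler from going round in circles when pages link back to each other, and `max_pages` keeps the run bounded. A crawler pointed at a real site would also need to download pages over the network and respect the site's rules on automated access.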

Advantages of dark web scraping

An advantage of dark web scraping is that you can reach data that is generally unavailable on the internet. This can be useful when more in-depth research is needed to complete a task. Another benefit of this method is that it can help automate the process of gathering data from the internet. This saves working time, so that more time can be spent using the data and creating strategies to meet business or individual needs.


An additional benefit is that it is much more cost effective than employing a data entry specialist. This is because the process can be automated to run without the need for constant human intervention and input.

High security risks

Although precautions can be put in place to keep data safe, security risks are still high because some malicious users abuse the process of collecting publicly available data. This was recently seen when Facebook and LinkedIn user accounts were compromised because of data leaks.

There are also risks to privacy, as malicious users can design web scrapers that collect more sensitive, or even personally identifiable, information. Again, this is often a major concern for social media platforms, which are especially vulnerable to criminal data scraping because of the extremely high volume of personally identifiable information that users frequently input and share.
