Lecture 10 (Web Scraping)
Resources
You can find examples and motivation in the resources.
Summary
In this lecture, we will continue to study techniques for retrieving data from the internet. Specifically, we will look at web scraping. We have already looked at how REST APIs work and how we can retrieve resources from them. However, in many cases data will not be available through an API, and we therefore need to turn to web scraping, a method for retrieving data that is visible on a website.
Web Scraping
Web scraping is the process of extracting data from websites. It involves accessing and collecting information present in the HTML code of a webpage. This technique provides a powerful means to gather data for analysis and research.
By scraping websites, we can retrieve information that is available on websites that do not have APIs open to the public. Furthermore, we can stay updated with real-time information. This is particularly useful for tracking dynamic data that changes frequently, for instance weather data, stock market data, and so on.
How to Web Scrape
To perform web scraping efficiently, an understanding of basic HTML is important.
HTML
HTML, or HyperText Markup Language, serves as the backbone of the internet by providing a standardized structure for creating and presenting content on webpages. At its core, HTML is a markup language that uses tags to define the structure and elements within a document. Tags encapsulate different content elements, such as headings, paragraphs, links, images, and more. Understanding HTML is important for effective web scraping because it allows practitioners to navigate the structure of a webpage and precisely target the data they seek. By understanding the hierarchy of HTML elements and their attributes, web scrapers can programmatically access and extract relevant information.
Read more about HTML here
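To make this concrete, here is a minimal sketch in Python (using the BeautifulSoup library, which we will also use later) that parses a small, made-up HTML document and navigates its tags and attributes:

from bs4 import BeautifulSoup

# A tiny, made-up HTML document for illustration
doc = """
<html>
  <body>
    <h1>Countries</h1>
    <p class="intro">A small example page.</p>
    <ul id="countries">
      <li>Sweden</li>
      <li>Norway</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(doc, "html.parser")

print(soup.h1.text)                          # Countries
print(soup.find("p", class_="intro").text)   # A small example page.

# Navigate the hierarchy: find the list by its id, then its items
for li in soup.find("ul", id="countries").find_all("li"):
    print(li.text)                           # Sweden, Norway

Note how elements are located either by tag name, by attributes such as class and id, or by walking the hierarchy; this is exactly what we will do when scraping a real page.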
Static vs Dynamic
Websites can be categorized as either static or dynamic. A static website consists of fixed content that remains unchanged until manually updated, making data extraction relatively straightforward using techniques like HTML parsing. On the other hand, dynamic websites use technologies like JavaScript to load content dynamically, altering the page after initial load. While static sites may be scraped using simple HTML parsing, dynamic sites often require tools like headless browsers (e.g., Selenium) to simulate user interactions and access dynamically rendered content.
In terms of REST, for static websites we retrieve the complete website when making a request to the server. In contrast, a dynamic website returns only the skeleton of the page, which is then filled in with data by, e.g., JavaScript after the fact. Since we only retrieve the skeleton when making a request, there is initially nothing to scrape, hence the need for tools like headless browsers that render the page before scraping.
In general, it is a bit more difficult to scrape dynamically rendered websites, but not by much. The point is that we should be aware of the type of website being scraped and use the relevant tools for the job.
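As a sketch of the dynamic case, the following assumes a recent Selenium (pip install selenium) and an installed Chrome browser; the URL is a placeholder. The browser executes the JavaScript, and we then hand the rendered HTML to BeautifulSoup as usual:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

# Assumes Chrome is installed; recent Selenium versions locate a matching driver automatically
driver = webdriver.Chrome()
driver.get("https://example.com/some-dynamic-page")  # placeholder URL

# Crude wait so the JavaScript has time to fill in the data;
# real code would use Selenium's explicit waits instead
time.sleep(5)

# page_source contains the HTML *after* JavaScript has run,
# so the dynamically rendered content can now be scraped as usual
html = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()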
Ethical Considerations
While web scraping opens doors to a wealth of information, it's crucial to approach it ethically. Here are some considerations:
Always adhere to the terms of service of the website you are scraping. Some websites explicitly prohibit scraping in their terms.
To prevent server overload, implement rate limiting in your scraping scripts. This ensures you're not making too many requests in a short period, respecting the server's capacity.
Robots.txt files provide guidelines for web crawlers. Check for this file on a website to understand any restrictions or rules set by the website owner. A sketch covering both rate limiting and robots.txt follows below.
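As a minimal sketch of the last two points (the site and paths below are made up): Python's standard library can read robots.txt, and a simple pause between requests acts as crude rate limiting:

import time
import requests
from urllib.robotparser import RobotFileParser

BASE = "https://example.com"  # placeholder site

# Check robots.txt before crawling
rp = RobotFileParser(BASE + "/robots.txt")
rp.read()

pages = ["/page1", "/page2", "/page3"]  # made-up paths
for page in pages:
    url = BASE + page
    if not rp.can_fetch("*", url):
        continue  # the site owner disallows crawling this path
    r = requests.get(url)
    # ... scrape r.content here ...
    time.sleep(1)  # crude rate limiting: at most one request per second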
Example
To make web scraping more concrete, let's create a simple web scraper.
On the following website:
https://www.worldometers.info/world-population/population-by-country/
we can retrieve some metadata about countries around the world. The goal is to create a dataframe of the data on the website. We should begin by analysing the website: right-click on the page and select "Inspect" (this may differ between web browsers). Now we can interactively analyse the content (HTML) of the website. For this particular website, we see that the data is in the form of a table, so let's look for a <table>...</table> tag. We can see that the data is under the table with id="example2". From <thead>...</thead>, we get the variable names in the <th>...</th> tags, and from <tbody>...</tbody> we get the data in the <tr>...</tr> tags.
Using this, let's write some code.
The following code scrapes the website and creates a dataframe of the data, first in Python and then in R:
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.worldometers.info/world-population/population-by-country/"

# Download the page and parse the HTML
r = requests.get(URL)
html = BeautifulSoup(r.content, "html.parser")  # "html5lib" also works if installed (pip install html5lib)

# The population data sits in the table with id="example2"
table = html.find("table", id="example2")
thead = table.find("thead")
tbody = table.find("tbody")

# Column names come from the <th> tags in <thead>
columns = []
for th in thead.find_all("th"):
    columns.append(th.text)

# Each <tr> in <tbody> is one row; its <td> tags hold the cell values
data = []
for tr in tbody.find_all("tr"):
    data.append(
        [td.text for td in tr.find_all("td")]
    )

df = pd.DataFrame(data, columns=columns)

# Or even more simply
# df = pd.read_html(str(table))[0]
The equivalent code in R:

library(tidyverse)
library(rvest)

URL <- "https://www.worldometers.info/world-population/population-by-country/"

# Download and parse the page
html <- read_html(URL)

# The population data sits in the table with id="example2"
table <- html %>% html_element("#example2")
thead <- table %>% html_element("thead")
tbody <- table %>% html_element("tbody")

# Column names come from the <th> tags in <thead>
columns <- list()
for (th in thead %>% html_elements("th")) {
  columns <- append(columns, th %>% html_text2())
}

# Each <tr> in <tbody> is one row; its <td> tags hold the cell values
df <- data.frame()
for (tr in tbody %>% html_elements("tr")) {
  td_ <- list()
  for (td in tr %>% html_elements("td")) {
    val <- td %>% html_text2()
    td_ <- append(td_, val)
  }
  df <- rbind(df, td_)
}
colnames(df) <- columns

# Or even more simply
# df <- table %>% html_table()
The code produces the following dataframe (first five rows shown):
|   | # | Country (or dependency) | Population (2023) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 1,428,627,663 | 0.81 % | 11,454,490 | 481 | 2,973,190 | -486,136 | 2 | 28 | 36 % | 17.76 % |
| 1 | 2 | China | 1,425,671,352 | -0.02 % | -215,985 | 152 | 9,388,211 | -310,220 | 1.2 | 39 | 65 % | 17.72 % |
| 2 | 3 | United States | 339,996,563 | 0.50 % | 1,706,706 | 37 | 9,147,420 | 999,700 | 1.7 | 38 | 83 % | 4.23 % |
| 3 | 4 | Indonesia | 277,534,122 | 0.74 % | 2,032,783 | 153 | 1,811,570 | -49,997 | 2.1 | 30 | 59 % | 3.45 % |
| 4 | 5 | Pakistan | 240,485,658 | 1.98 % | 4,660,796 | 312 | 770,880 | -165,988 | 3.3 | 21 | 35 % | 2.99 % |
Some processing is still needed, for instance the numeric columns are stored as strings, but we already know how to do that.
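For instance, a sketch of that processing in pandas might look as follows; the column names are taken from the table above, so check them against your own dataframe:

# Strip thousands separators and convert to integer types
df["Population (2023)"] = df["Population (2023)"].str.replace(",", "").astype(int)
df["Net Change"] = df["Net Change"].str.replace(",", "").astype(int)

# Turn strings like "0.81 %" into the float 0.81
df["Yearly Change"] = df["Yearly Change"].str.rstrip(" %").astype(float)

# ... and similarly for the remaining columns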
Try scraping websites. The best way to learn is by practice!