Mastering Web Scraping: A Practical Guide to Beautiful Soup and Requests
Hi guys, my name is Harshit and this is my first Medium blog. It is about web scraping using Beautiful Soup and Requests.
When I got into the field of Data Science, like everybody else I started with the same old datasets that have been used in every other Machine Learning project. But as a student of Data Science I knew there are many ways to collect or create data, and that's when I decided to learn web scraping.
There are many ways to create and collect your own dataset, but the most organic and old-school way is to scrape the data you see on the web. Yes, there are some limitations to it, and we will talk about them in the last part of the article.
In this article, we will cover the basics of web scraping and then implement it in practice to create a dataset.
Pre-Requisites for Web Scraping
Basic Knowledge of Python and HTML.
A computer with an internet connection.
Most important is the eagerness to learn new things.
If you don't meet all of these conditions, no problem; we will pick up the necessary Python and HTML along the way. That's it, we are ready to move ahead.
Importing Python Libraries
There are two basic libraries we are going to install: Beautiful Soup and Requests.
Open Jupyter Notebook and run this command in a cell:
%pip install -U bs4 requests lxml html5lib
bs4 is the package that contains the Beautiful Soup module. The requests module lets us send HTTP requests to the server of a given link and fetch that web page's HTML code. Last but not least, you may be wondering what lxml and html5lib are: these two are parsers that help us parse the content returned by the requests we make.
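If you want to see what a parser actually does, here is a small standalone sketch (the HTML snippet is just an illustration): both lxml and html5lib take the same messy markup and repair it into a proper tree, each in its own way.
from bs4 import BeautifulSoup

messy_html = "<ul><li>First<li>Second</ul>"  # sloppy HTML with missing closing tags
# Each parser builds a slightly different tree from the same input
print(BeautifulSoup(messy_html, "lxml").prettify())
print(BeautifulSoup(messy_html, "html5lib").prettify())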
Checking Response of Webpage
The first step of scraping is to send a request to the webpage. We are going to scrape Game Details to create a recommendation dataset from this website:
Video Game Reviews, Articles, Trailers, and more
Checking the response code:
from bs4 import BeautifulSoup as bs
import requests as req
# Checking the Sites response
url = r"https://www.metacritic.com/game/"
response = req.get(url)
print(response)
Output:
<Response [403]>
The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.
So our request was well formed, but the server refused to authorize it. To get approved, we have to add headers to the requests we are making.
Headers are key-value pairs of information sent between clients and servers using the HTTP protocol. They contain data about the request and response, like the encoding, content-language, and request status.
Checking the response with headers:
from bs4 import BeautifulSoup as bs
import requests as req
# Initializing the headers
header = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36'}
# Checking the Sites response
url = r"https://www.metacritic.com/game/"
response = req.get(url,headers=header) # Added Header to get authorized
print(response)
Output:
<Response [200]>
Get your user agent by typing “My User Agent” on Google.
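If you are curious about what was actually sent and received, the response object lets you inspect both sides; a quick check, reusing the url and header from above:
# The headers our request carried, including the User-Agent we set
print(response.request.headers)
# One of the headers the server sent back
print(response.headers.get('Content-Type'))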
Parsing Response
Parsing simply means breaking down a blob of text into smaller and meaningful parts. This breaking down depends on certain rules and factors that a particular parser defines.
The response contains the HTML code, but it's of no use unless we parse it.
HTML Parsing :
Parsing the response using lxml and printing the first “h1” heading present on the web page.
# Parsing the response using BeautifulSoup
soup = bs(response.content,'lxml')
print([heading.text for heading in soup.find_all('h1')])
Output :
['Games']
This shows that there is only one "h1" heading present on the webpage. If you want to try it on your own, replace "h1" with "h3" (or any other tag) in find_all.
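For example, listing every "h3" heading instead is a one-line change:
print([heading.text for heading in soup.find_all('h3')])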
Now we will go a step further and fetch some data from the webpage using "class", "id", and other attributes used in HTML code.
Example Code 1 :
soup = bs(response.content,'lxml')
len([i.text for i in soup.find_all('div',attrs={'class':"c-globalCarousel g-grid-container"})])
Output :
5
Example Code 2 :
soup = bs(response.content,'lxml')
len([i.text for i in soup.find_all('div',attrs={'class':"c-globalCarousel g-grid-container",
'data-cy':'new-game-release-carousel'})])
Output :
1
Explanation :
Explanation: in the first example the length came out to be 5 because, as HTML allows, one class can be assigned to many tags. In Example 2 there is only 1 match, because the second attribute is unique to a single div.
You can look at the Beautiful Soup documentation, which gives a very detailed explanation of the parsers and the methods/functions used to fetch data:
Beautiful Soup Documentation - Beautiful Soup 4.12.0 documentation
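Beautiful Soup also understands CSS selectors through select(), which is often a more compact way to express the same attribute filters. A rough equivalent of Example Code 2 (the class names come from the Metacritic page above, so they may change over time):
len(soup.select('div.c-globalCarousel.g-grid-container[data-cy="new-game-release-carousel"]'))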
Practical Implementation of Scraping:
We are going to scrape Game Details to make a recommendation Dataset.
Steps to Scrape Details :
Prepare the page URL of each game so that a request can be sent to each page and the data fetched from there.
Create a script to scrape details like Game Name, Game Platform, Game Release Date, Game Rating (Metacritic and User), Summary, Game Developer, Game Publisher, and Game Genre.
Store the scraped data in the required format (.csv or .parquet).
Preparing Game URL :
We will adjust the filters on the Browse All Games page and will store all the Page URLs in a list.
import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

# Ready the urls for scraping using page numbers.
url = r'https://www.metacritic.com/browse/game/?platform=ps5&platform=xbox-series-x&platform=nintendo-switch&platform=pc&platform=ps4&platform=xbox-one&platform=xbox-360&platform=xbox&platform=nintendo-64&releaseYearMin=1910&releaseYearMax=2023'
# Scraping the number of pages from the pagination bar.
total_pages = bs(requests.get(url, headers=header).content, 'lxml').find_all(
    'span',
    {"class": 'c-navigationPagination_itemButtonContent u-flexbox u-flexbox-alignCenter u-flexbox-justifyCenter'})
no_of_pages = int(total_pages[-2].text.strip())
# Storing the page URLs in a list (page 1 is the base URL itself, the rest get &page=<n>).
page_url = [url if i == 1 else url + '&page=' + str(i)
            for i in tqdm(range(1, no_of_pages + 1))]
In this snippet, we have stored the page URLs in a list by appending the page number (which is scraped using a class of the pagination bar) to the end of the base URL. You have to find patterns in the URLs: how they change when switching to the next page or to any specific page.
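A quick sanity check is to print a couple of the generated URLs and confirm the pattern:
print(page_url[0])   # first page: the base filter URL
print(page_url[1])   # second page: the same URL with &page=2 appended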
import pandas as pd

# Scraping the urls of all the games
game_urls = []
for idx, link in tqdm(enumerate(page_url), total=len(page_url)):
    soup = bs(requests.get(link, headers=header).content, 'lxml')
    game_urls.extend(['https://www.metacritic.com' + a['href']
                      for a in soup.find('div',
                                         {"class": "c-productListings",
                                          "section": f"detailed|{idx+1}"}).find_all('a')])

# Save the urls in a csv file
df = pd.DataFrame({'Game Url': game_urls})
df.to_csv("URL.csv", index=False)
In this snippet, we used the pattern that the section number changes in step with the page number, scraped all the game URLs, and stored them in a pandas DataFrame.
Script for Scraping Game Details :
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape(urls):
    game_name = []
    game_platform = []
    game_release_date = []
    game_rating_metacritic = []
    game_rating_user = []
    summary = []
    game_developer = []
    game_publisher = []
    game_genre = []
    for url in urls:
        response = requests.get(url, headers=header)
        soup = BeautifulSoup(response.content, "lxml")
        # Game Name
        try:
            game_name.append(
                soup.find(
                    "div",
                    {"class": "c-productHero_title g-inner-spacing-bottom-medium g-outer-spacing-top-medium"},
                ).text.strip()
            )
        except Exception:
            game_name.append(np.nan)
        # Game Platform
        try:
            game_platform.append(
                soup.find(
                    "div",
                    {"class": "c-ProductHeroGamePlatformInfo u-flexbox g-outer-spacing-bottom-medium"},
                ).text.strip()
            )
        except Exception:
            game_platform.append(np.nan)
        # Game Release Date
        try:
            game_release_date.append(
                soup.find(
                    "div",
                    {"class": "c-gameDetails_ReleaseDate u-flexbox u-flexbox-row"},
                )
                .find_all("span")[-1]
                .text.strip()
            )
        except Exception:
            game_release_date.append(np.nan)
        # Rating from Metacritic
        try:
            game_rating_metacritic.append(
                soup.find(
                    "div",
                    {"class": "c-siteReviewScore u-flexbox-column u-flexbox-alignCenter u-flexbox-justifyCenter g-text-bold c-siteReviewScore_green g-color-gray90 c-siteReviewScore_large"},
                ).text.strip()
            )
        except Exception:
            game_rating_metacritic.append(np.nan)
        # Ratings from users
        try:
            game_rating_user.append(
                soup.find(
                    "div",
                    {"class": "c-siteReviewScore u-flexbox-column u-flexbox-alignCenter u-flexbox-justifyCenter g-text-bold c-siteReviewScore_green c-siteReviewScore_user g-color-gray90 c-siteReviewScore_large"},
                ).text.strip()
            )
        except Exception:
            game_rating_user.append(np.nan)
        # Summary of the Game
        try:
            summary.append(
                soup.find(
                    "span",
                    {"class": "c-productionDetailsGame_description g-text-xsmall"},
                ).text.strip()
            )
        except Exception:
            summary.append(np.nan)
        # Game Developers
        try:
            game_developers = (
                soup.find("div", {"class": "c-gameDetails_Developer u-flexbox u-flexbox-row"})
                .find("ul", {"class": "g-outer-spacing-left-medium-fluid"})
                .find_all("li")
            )
            game_developer.append([dev.text.strip() for dev in game_developers])
        except Exception:
            game_developer.append(np.nan)
        # Game Publisher
        try:
            game_publisher.append(
                soup.find(
                    "div",
                    {"class": "c-gameDetails_Distributor u-flexbox u-flexbox-row"},
                )
                .find(
                    "span",
                    {"class": "g-outer-spacing-left-medium-fluid g-color-gray70 u-block"},
                )
                .text.strip()
            )
        except Exception:
            game_publisher.append(np.nan)
        # Game Genre
        try:
            game_genres = soup.find(
                "ul",
                {"class": "c-genreList u-flexbox u-block g-outer-spacing-left-medium-fluid"},
            ).find_all("li")
            game_genre.append([genre.text.strip() for genre in game_genres])
        except Exception:
            game_genre.append(np.nan)
    # Creating a dataframe from the data saved into the lists.
    df = pd.DataFrame(
        {
            "Game Name": game_name,
            "Platform": game_platform,
            "Release Date": game_release_date,
            "Metacritic Rating": game_rating_metacritic,
            "User Rating": game_rating_user,
            "Summary": summary,
            "Developer": game_developer,
            "Publisher": game_publisher,
            "Genre": game_genre,
        }
    )
    # Returning the dataframe from the function
    return df


# Calling the scraping function (df here is the URL DataFrame created in the previous step)
results = scrape(df['Game Url'].tolist())
Output :
I have used try/except blocks (Python's exception handling) so that if any piece of content is not found while parsing, a null value is appended in its place.
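If you would rather not repeat that try/except block nine times, the same idea can be factored into a small helper. This is only a sketch of an alternative structure (the helper name is mine, not part of the script above):
import numpy as np

def safe_text(soup, tag, attrs):
    """Return the stripped text of the first matching tag, or np.nan if it is missing."""
    try:
        return soup.find(tag, attrs).text.strip()
    except AttributeError:
        return np.nan

# Example use inside the loop:
# summary.append(safe_text(soup, "span", {"class": "c-productionDetailsGame_description g-text-xsmall"}))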
You can also find the dataset which I have hosted on my Kaggle profile.
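And to finish step 3 from the plan above, the DataFrame returned by scrape() can be written out in either format. The file names here are just examples; .parquet needs pyarrow or fastparquet installed, and the list-valued Developer/Genre columns may need converting to strings first:
results.to_csv("metacritic_games.csv", index=False)
# Or, if pyarrow/fastparquet is installed:
# results.to_parquet("metacritic_games.parquet", index=False)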
BONUS
Pandas is one of the popular libraries used for data manipulation, and it also provides a method for scraping tables from a website and storing them as DataFrames.
Largest airlines in the world - Wikipedia
Code:
import pandas as pd
df = pd.read_html(r'https://en.wikipedia.org/wiki/Largest_airlines_in_the_world')
df[0]
The output will be a list of all the tables present on the page. You can access each table by index, and you have to figure out on your own which table you want to keep.
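If you already know which table you are after, read_html can also filter by text with its match argument, so you do not have to scan the whole list (the match string here is only an illustration and must occur in the target table):
tables = pd.read_html(
    r'https://en.wikipedia.org/wiki/Largest_airlines_in_the_world',
    match='Airline'
)
print(len(tables))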
Limitations of Web Scraping
Web scraping has many advantages like collecting large amounts of data quickly and easily. However, there are also several limitations of web scraping:
Websites change frequently: Websites often change their HTML structure and layout for various reasons. This can break existing scrapers built on the previous structure, so scrapers have to be constantly updated to keep up with website changes.
Complex websites are difficult to scrape: While simple websites are easy to scrape, more complex websites with dynamic content, AJAX calls, infinite scrolling, etc. pose a challenge for scrapers.
IP blocking: Websites can detect scraping activity and block the scraper’s IP address to prevent excessive load. Scrapers have to use techniques like rotating IPs and proxies to avoid getting blocked; slowing requests down also helps (see the short sketch after this list).
CAPTCHAs: Websites use CAPTCHAs to distinguish humans from bots. While there are CAPTCHA solvers, they can slow down the scraping process.
Honeypot traps: Some websites include hidden elements that act as traps to detect scrapers. Scrapers have to be able to identify and avoid these traps.
Login requirements: Scraping data behind a login often requires storing and sending cookies with each request like a browser.
Real-time data scraping is difficult: Scraping real-time changing data is challenging due to delays in requests and data processing. Scheduled scraping at short intervals can help achieve near-real-time data.
Legal issues: Scraping public data is generally legal, but scraping private data without permission could pose legal risks.
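One simple way to reduce the chance of being blocked in the first place is to slow your requests down and retry politely; here is a minimal sketch (the retry count and delays are arbitrary choices):
import time
import requests

def polite_get(url, headers, retries=3, delay=2):
    """Fetch a URL, pausing and backing off between attempts so we do not hammer the server."""
    response = None
    for attempt in range(retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response
        time.sleep(delay * (attempt + 1))  # wait a little longer after each failed attempt
    return response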
In summary, while web scraping is a powerful technique, there are several limitations to keep in mind. Using a robust web scraping tool can help overcome some of these challenges and make scraping large amounts of data more efficient and reliable.
We implemented web scraping using Python libraries like Beautiful Soup and Requests, covered the basics of scraping, and looked at the role of the User-Agent header when scraping sites that require authorization from the server.
That’s it for this article stay tuned for more cool articles like this in the future. Till next time, au revoir.
P.S. It was more than an introduction.