Web Scraping Using Python & BeautifulSoup 4
By: Maroof Tahir and Faisal Rafiq
Introduction
In this post, we'll walk through a project that revolves around scraping and visualizing cricket data from ESPN Cricinfo. This project was an exciting venture into web scraping, data cleaning, and visualization, offering insights into the ICC rankings, top players, and teams' performances across various formats.
Project Phases:
- Data Collection: We scraped the data from ESPN Cricinfo using the Requests and BeautifulSoup libraries.
- Data Cleaning: This phase involved dealing with missing values and organizing the data into a clean format.
- Data Storage: The cleaned data was stored in CSV files for further analysis.
- Data Visualization: Various charts and graphs were created to display metrics like player rankings and team performance.
Breaking Down the Process
1. Data Collection
The first step was to gather data from ESPN Cricinfo using web scraping techniques. We used libraries like BeautifulSoup and Requests to extract player and team data from the website.
The main goal of this phase was to fetch the latest ICC rankings for teams and players in formats such as ODIs, T20Is, and Tests, along with women's cricket data.
2. Data Cleaning
The extracted data wasn't always in a usable form. We encountered:
- Missing values
- Inconsistent formatting
- Duplicate entries
Using Pandas, we cleaned the data to ensure consistency. This involved filtering out unnecessary information and converting the data into a structured format.
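As a rough illustration of this kind of cleanup, the sketch below tidies a small made-up table. The column names (Player, Team, Rating), the placeholder rows, and the tidy() helper are assumptions for the example, not the project's actual data or code.

```python
import pandas as pd

# Hypothetical raw rows that mimic the problems listed above
raw = pd.DataFrame({
    "Player": ["  player a", "Player A", "Player B ", None],
    "Team": ["ind", "IND", "AUS", "PAK"],
    "Rating": ["824", "824", "765", "n/a"],
})

def tidy(df):
    """Return a copy with normalized text, numeric ratings, and no duplicates."""
    out = df.copy()
    # Inconsistent formatting: trim whitespace and unify letter case
    out["Player"] = out["Player"].str.strip().str.title()
    out["Team"] = out["Team"].str.strip().str.upper()
    # Non-numeric ratings such as "n/a" become NaN
    out["Rating"] = pd.to_numeric(out["Rating"], errors="coerce")
    # Missing values: drop rows that lack a player name or a rating
    out = out.dropna(subset=["Player", "Rating"])
    # Duplicate entries: identical rows collapse into one
    return out.drop_duplicates().reset_index(drop=True)

clean = tidy(raw)
print(clean)
```

Normalizing the text before calling drop_duplicates() matters here: the first two rows only become identical once whitespace and letter case agree.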
3. Data Storage
Once the data was clean, we saved it into CSV files for further analysis and visualization. These files contained:
- Player names
- Team names
- Rankings
- Other stats like match wins and individual player achievements
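The storage step can be sketched as a simple CSV round trip. The column names and rows below are placeholders, and an in-memory buffer stands in for a file path so the example runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical cleaned rows; the real files also hold wins and other stats
players = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Team": ["IND", "AUS"],
    "Rank": [1, 2],
})

# A StringIO buffer stands in for a path such as "players.csv"
buffer = io.StringIO()
# index=False keeps the DataFrame's row index out of the file
players.to_csv(buffer, index=False)

# Reading the text back gives the same columns and values
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing a file name instead of the buffer to to_csv() and read_csv() writes to and reads from disk in the same way.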
4. Data Visualization
For visualization, we used Matplotlib's pyplot module to create charts that represent various metrics:
- The number of players per team
- Top players based on rankings
- Most team wins in a season
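To give an idea of the first metric, here is a minimal sketch that counts players per team and draws a bar chart. The rows and the Team column name are assumptions, and the chart is written to a hypothetical file name rather than shown in a window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ranking rows; in the project these would come from the CSV files
rankings = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Team": ["IND", "AUS", "IND", "PAK", "IND"],
})

# Count how many ranked players each team has, largest count first
players_per_team = rankings["Team"].value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(players_per_team.index.tolist(), players_per_team.tolist())
ax.set_title("Ranked Players per Team (sample data)")
ax.set_xlabel("Team")
ax.set_ylabel("Number of players")
fig.savefig("players_per_team.png")
```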
Key Technologies:
- Python: For scripting and processing data.
- Matplotlib: For visualizing the data.
- Pandas: For data cleaning and handling CSV files.
- BeautifulSoup & Requests: For web scraping.
The Heart of the Project
1. Scraping the Data
Let's dive into the actual code that handles scraping the data from ESPN Cricinfo. Below is a snippet from the ICC_Project.py file, which extracts player rankings:
import requests
from bs4 import BeautifulSoup
import pandas as pd
How It Works:
- The requests.get() function fetches the HTML content from the ESPN Cricinfo website.
- BeautifulSoup parses the HTML, and we extract the player names, team names, and their ratings from the rankings table.
- The data is then stored in a Pandas DataFrame and saved as a CSV file for further use.
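The steps above can be sketched roughly as follows. This is not the code from ICC_Project.py: the parse_rankings() helper, the SAMPLE_HTML markup, and the column headers are all hypothetical, and the real page structure has to be inspected in the browser before writing selectors for it. An inline HTML string replaces the network call so the example runs offline.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the text of a requests.get() response
SAMPLE_HTML = """
<table>
  <thead><tr><th>Pos</th><th>Player</th><th>Team</th><th>Rating</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Player A</td><td>IND</td><td>824</td></tr>
    <tr><td>2</td><td>Player B</td><td>AUS</td><td>765</td></tr>
  </tbody>
</table>
"""

def parse_rankings(html):
    """Extract one dict per data row from a rankings-style HTML table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):  # skips the header row, which has no <td>
            rows.append(dict(zip(headers, cells)))
    return rows

rows = parse_rankings(SAMPLE_HTML)
df = pd.DataFrame(rows)
print(df)
```

With a live page, the string passed to parse_rankings() would be the .text of the response returned by requests.get(), and df.to_csv() would write the result to a CSV file.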
2. Data Cleaning and Processing
In the Players.py file, we processed the top 10 players from each team. Here's how the data cleaning and processing were handled:
import pandas as pd

# Function to clean and process player data
def clean_player_data(file_path):
    df = pd.read_csv(file_path)
    # Dropping duplicates and handling missing values
    df = df.drop_duplicates()
    df = df.fillna('Unknown')
    # 'Unknown' in the Rating column would break numeric sorting,
    # so coerce it to NaN (sort_values places NaN last)
    df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
    # Sorting players by rating
    df_sorted = df.sort_values(by='Rating', ascending=False)
    return df_sorted.head(10)

# Example usage
top_players_df = clean_player_data('ICC_ODI_Rankings.csv')
print(top_players_df)
Explanation:
- We read the CSV file containing player rankings.
- Duplicates are removed, and missing values are replaced with "Unknown."
- Finally, players are sorted by their ratings, and the top 10 players are extracted.
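The function above returns the ten highest-rated players overall. For the top players within each team, one possible approach is to sort and then take the head of each group. The sketch below assumes Team and Rating columns and uses placeholder rows; top_n_per_team() is a hypothetical helper, not part of Players.py.

```python
import pandas as pd

# Hypothetical cleaned data; 'Team' and 'Rating' are assumed column names
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Team": ["IND", "IND", "IND", "AUS", "AUS", "PAK"],
    "Rating": [700, 820, 760, 790, 640, 710],
})

def top_n_per_team(data, n=10):
    """Return each team's n highest-rated players, best first within a team."""
    ordered = data.sort_values(["Team", "Rating"], ascending=[True, False])
    # After sorting, head(n) inside each group keeps that team's top n rows
    return ordered.groupby("Team").head(n).reset_index(drop=True)

top2 = top_n_per_team(df, n=2)
print(top2)
```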
3. Visualizing the Data
Now that we have successfully cleaned the data, it’s time to transform those numbers and stats into meaningful and visually engaging representations. Visualization is crucial for conveying insights at a glance, and here we’ll use Python’s Matplotlib library to create charts and graphs.
In this example, we’ll visualize the team rankings. Keep in mind that the values used here are only sample data; you can replace them with the rankings from the scraped datasets for more accurate and relevant charts.
import matplotlib.pyplot as plt
# Example data (team rankings for ODI, T20, and Test)
teams = ['India', 'Australia', 'Aus Women', 'Pakistan']
odi_rankings = [1, 3, 2, 6]
t20_rankings = [2, 5, 1, 3]
test_rankings = [2, 1, 3, 4]
# Plotting the rankings for each format
plt.figure(figsize=(10, 6))
plt.plot(teams, odi_rankings, marker='o', label='ODI Rankings', color='blue')
plt.plot(teams, t20_rankings, marker='o', label='T20 Rankings', color='green')
plt.plot(teams, test_rankings, marker='o', label='Test Rankings', color='red')
plt.title('Team Rankings Across Formats (ODI, T20, Test)')
plt.xlabel('Teams')
plt.ylabel('Ranking')
plt.gca().invert_yaxis() # Invert y-axis so 1 is at the top
plt.legend()
# Save the plot as an image
plt.savefig('team_rankings_visualization.png') # Save as PNG file
plt.show()
Visualization Output:
This visual representation allows us to quickly identify which teams have the highest or lowest ranks in different formats. In the sample data, for example, India ranks highest in ODIs, while Aus Women holds the top spot in T20Is. The sample values can be swapped for the latest rankings from the scraped datasets, as we did in the project.
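One way to feed scraped rankings into the plot is to reshape them into one list per format. The sketch below assumes a long-format table with hypothetical Team, Format, and Rank columns, and reuses a few of the sample values from the chart code.

```python
import pandas as pd

# Hypothetical long-format table: one row per team and format
scraped = pd.DataFrame({
    "Team": ["India", "India", "Australia", "Australia"],
    "Format": ["ODI", "Test", "ODI", "Test"],
    "Rank": [1, 2, 3, 1],
})

# Reshape to one row per team and one column per format
wide = scraped.pivot(index="Team", columns="Format", values="Rank")

# These lists can replace the hard-coded ones in the plotting code
teams = wide.index.tolist()
odi_rankings = wide["ODI"].tolist()
test_rankings = wide["Test"].tolist()
print(teams, odi_rankings, test_rankings)
```

Note that pivot() sorts the teams alphabetically, so the three lists stay aligned with each other even though the order differs from the input.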
Top 5 ODI All-rounders, Batters and Bowlers
In ODI cricket, the top 5 all-rounders include Mohammad Nabi, known for his consistent performances with both bat and ball. Shakib Al Hasan, Sikandar Raza, Rashid Khan, and Assad Vala also feature prominently for their match-winning abilities.
Batters:
Among the top batters, Babar Azam continues to reign supreme, followed by Rohit Sharma, Shubman Gill, Virat Kohli, and Harry Tector. These players have been crucial in accumulating runs for their teams.
Bowlers:
Similar Visualizations for T20, Test, and Women's Cricket
Just like we’ve visualized the top rankings for ODIs, the same process will be applied to T20, Test, and Women's Cricket. Data for these formats will be scraped, cleaned, and visualized to show top players, rankings, and team performances, giving us deeper insights into each format of the game. Stay tuned for more detailed charts and rankings for all formats!
Conclusion
In this blog, we covered how to scrape, clean, process, and visualize cricket data from ESPN Cricinfo. This project offered valuable insights into handling real-world data, overcoming challenges like missing values, and presenting the final output through visualization.
If you're a cricket enthusiast or a data science lover, this project provides a solid foundation for working with sports data. Whether you want to build an analytical tool or simply explore cricket stats, the techniques here can be easily adapted to other datasets.


