Spark:
Though the IPL (Indian Premier League) is a bit late this year, the craze is still intact. Just as everything else has taken a turn this year, the IPL had to go through several changes too. I’m an ardent fan of a franchise called Chennai Super Kings, one of the most consistent teams in the history of the tournament, but they are having a really bad season this year. Cricket experts have pointed to CSK’s team selection and the age of its players as the primary reasons for the bad season. This theory has wide approval, from experts to fans, but I wanted to check whether it actually holds up and can be supported by empirical data. This began my quest for Project4 of #100MLProjects. (Click here to know about my 100MLProjects Challenge)
Problem:
I’m into machine learning and spend a considerable amount of my time with data, so obviously I went searching for datasets on Kaggle, the most popular datasets platform. Kaggle did have multiple datasets on the IPL, but none of them contained data on the ongoing season. I spent a few more minutes on Google and found the primary source for all these datasets: Cricsheet. I looked at the source and realized that it is constantly updated with data from the latest matches, while the datasets on Kaggle are outdated.
Why?
Then it struck me: why not use this situation to learn and build a dataset myself? In the future, when I join the workforce as a Machine Learning Engineer, I won’t be handed ready-made datasets most of the time. I will have to search for the data, create my own datasets if required, clean and wrangle the data, scrape different sources, and collect meaningful related data. Creating this dataset could serve as practice for that future work.
How?
In my observation, Cricsheet is updated once a day or once every couple of days. Cricsheet has ball-by-ball data on different formats of the game and on different cricket leagues as well. IPL data is available as a zip file for public download. For every IPL match, the archive contains a YAML file with ball-by-ball information about that match.
For each ball of the match, several data points are available, such as the names of the batsman, non-striker, and bowler, the over, the ball in the over, extra runs, runs scored by the batsman, and many more details.
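As a rough sketch of the first step, the snippet below walks a zip archive and yields the text of every YAML file in it, using only the standard library. The function name iter_match_files and the tiny in-memory archive are made up for illustration; with the real download you would pass the path of the zip file instead and hand each text to a YAML parser (for example PyYAML's yaml.safe_load).

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_match_files(archive) -> Iterator[Tuple[str, str]]:
    """Yield (file name, file text) for every YAML file in a zip archive.

    `archive` can be a path or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            # Skip anything that is not a match file (e.g. a README).
            if name.lower().endswith((".yaml", ".yml")):
                yield name, zf.read(name).decode("utf-8")


# Tiny in-memory archive standing in for the real download (placeholder content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("1.yaml", "info:\n  city: Placeholder\n")
    zf.writestr("README.txt", "not a match file")
buffer.seek(0)

matches = list(iter_match_files(buffer))
print([name for name, _ in matches])
```

Yielding one file at a time keeps memory use low, which helps when the archive holds one file per match across many seasons.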
Here is sample data for a particular delivery in a match.
- 1st innings:
    team: Kolkata Knight Riders
    deliveries:
      - 0.1:
          batsman: SC Ganguly
          bowler: P Kumar
          extras:
            legbyes: 1
          non_striker: BB McCullum
          runs:
            batsman: 0
            extras: 1
            total: 1
From this data we can see that the batting team is Kolkata Knight Riders, the batsman facing delivery 0.1 (the first ball of the innings) is “Dada” Ganguly, and the bowler is P Kumar. One run is scored off the first delivery, and it comes as a leg bye. The non-striker at the other end of the pitch is BB McCullum.
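To make that reading concrete, here is a minimal sketch of how one parsed delivery could be flattened into a single row suitable for a table. It assumes the YAML has already been loaded into plain Python dicts; the function name flatten_delivery and the output column names are my own choices for illustration, not necessarily the ones used in the published dataset.

```python
def flatten_delivery(innings_name, team, delivery):
    """Turn one {ball_id: {...}} delivery mapping into a flat row dict."""
    ball_id, details = next(iter(delivery.items()))
    # The key may arrive as a float (0.1) or a string ("0.1") depending on
    # the parser, so normalise it to text before splitting.
    over, ball = str(ball_id).split(".")
    runs = details.get("runs", {})
    return {
        "innings": innings_name,
        "batting_team": team,
        "over": int(over),
        "ball": int(ball),
        "batsman": details.get("batsman"),
        "non_striker": details.get("non_striker"),
        "bowler": details.get("bowler"),
        "batsman_runs": runs.get("batsman", 0),
        "extra_runs": runs.get("extras", 0),
        "total_runs": runs.get("total", 0),
    }


# The sample delivery shown above, as it would look after parsing.
sample = {
    0.1: {
        "batsman": "SC Ganguly",
        "bowler": "P Kumar",
        "extras": {"legbyes": 1},
        "non_striker": "BB McCullum",
        "runs": {"batsman": 0, "extras": 1, "total": 1},
    }
}

row = flatten_delivery("1st innings", "Kolkata Knight Riders", sample)
print(row)
```

A list of such rows can be passed straight to a DataFrame constructor, one row per ball.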
This is just one sample from the whole collection, and it doesn't represent the complete dataset, as there may be many differences between deliveries. To create a meaningful dataset, I needed to understand the raw data more deeply. After exploring it, I found some characteristics of this raw data collection.
Apart from the attributes mentioned here, there are other kinds of attributes as well, such as penalty, super over, wide, batsman out, and so on. These attributes do not occur frequently, so they are listed only where they actually occur. For example, player_out is an attribute that is available only on the delivery where a particular player gets out. In such a scenario, additional information is also appended to that delivery, detailing how the player got out, which fielders were involved in the dismissal, and so on. So it’s necessary to understand the raw data before writing code to extract data from it.
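One way to cope with attributes that appear only on some deliveries is to always emit the same columns and fill them with None when the attribute is absent. The sketch below does this for dismissals. It assumes the dismissal details sit under a "wicket" key with "player_out", "kind", and "fielders" entries; treat those key names, the function name wicket_columns, and the placeholder player names as assumptions to check against the raw files.

```python
def wicket_columns(details):
    """Return dismissal columns for a delivery, using None when no wicket fell."""
    wicket = details.get("wicket") or {}      # key is absent on most deliveries
    fielders = wicket.get("fielders") or []   # absent for e.g. bowled or lbw
    return {
        "player_out": wicket.get("player_out"),
        "dismissal_kind": wicket.get("kind"),
        "fielders": ", ".join(fielders) if fielders else None,
    }


# A delivery without a wicket and one with a wicket (placeholder names).
plain_ball = {"batsman": "Player A", "bowler": "Player B"}
wicket_ball = {
    "batsman": "Player A",
    "bowler": "Player B",
    "wicket": {"player_out": "Player A", "kind": "caught", "fielders": ["Player C"]},
}

no_wicket = wicket_columns(plain_ball)
with_wicket = wicket_columns(wicket_ball)
print(no_wicket)
print(with_wicket)
```

Because every row gets the same keys, the resulting table has a consistent set of columns no matter how rare the attribute is.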
Even after spending a good chunk of time understanding the data structure, I ran into several roadblocks as I moved towards extracting the data and creating a dataset. I used a Pandas DataFrame to store the extracted data. Every once in a while after extracting data, I used the describe, isna, isnull, sample, and head methods from pandas to check for anomalies or unintended behavior. This was extremely handy, as I came across multiple anomalies that I had to take care of. For example, ‘penalty’ was an attribute that appeared only twice in the entire history of the tournament. I completely missed that attribute at first, which caused inconsistency in my data frame. In another case, there were matches that were abandoned abruptly. Data visualization tools such as matplotlib and seaborn were also extremely useful for understanding the data.
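A complementary check, separate from the pandas methods mentioned above, is to count how often each kind of extra appears before fixing the column list, so that rare attributes such as ‘penalty’ are not missed. This is a standard-library sketch using collections.Counter; the function name count_extras_types and the four sample deliveries are invented for illustration and are not real match data.

```python
from collections import Counter


def count_extras_types(deliveries):
    """Count how many deliveries carry each type of extra."""
    counts = Counter()
    for details in deliveries:
        # "extras" is only present on deliveries where extras were conceded.
        counts.update(details.get("extras", {}).keys())
    return counts


# Invented deliveries, already parsed into dicts.
parsed_deliveries = [
    {"runs": {"total": 1}, "extras": {"legbyes": 1}},
    {"runs": {"total": 4}},
    {"runs": {"total": 1}, "extras": {"wides": 1}},
    {"runs": {"total": 6}, "extras": {"legbyes": 1, "penalty": 5}},
]

extras_counts = count_extras_types(parsed_deliveries)
for kind, count in sorted(extras_counts.items()):
    print(kind, count)
```

Any key that shows up here but has no matching column in the data frame is a sign that the extraction code needs another look.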
Another idea that came to mind while I was working on this dataset was to include player profiles too. Since we have ball-by-ball data on the matches, we already know who the batsman and bowler are. But we have just their names, and not any other information that could help us arrive at interesting insights. So I added a player profile, which includes data like batting style, bowling style, country of origin, primary responsibility in the team, and age of the player. Player profile data may not seem significant at first, but analyzing a bit deeper, we could come up with very useful insights.
- We could predict the weakness of a batsman to a particular bowler
- We could use the primary responsibility of a player to help form a well-balanced team or look for a replacement
- Check how the older players are performing
- See which countries’ players perform better at a specific ground
and lots and lots more.
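To illustrate how profile data could feed questions like these, the sketch below joins ball-by-ball rows to a profile lookup and totals runs and balls faced per value of a chosen profile field. Everything in it is hypothetical: the function name runs_by_profile_field, the field names, and the placeholder players and numbers are invented and carry no real statistics.

```python
from collections import defaultdict


def runs_by_profile_field(rows, profiles, field):
    """Total batsman runs and balls faced, grouped by one profile field."""
    totals = defaultdict(lambda: {"runs": 0, "balls": 0})
    for row in rows:
        profile = profiles.get(row["batsman"])
        if profile is None:
            continue  # no profile scraped for this player
        bucket = totals[profile[field]]
        bucket["runs"] += row["batsman_runs"]
        bucket["balls"] += 1
    return {key: dict(value) for key, value in totals.items()}


# Invented profiles and deliveries, for illustration only.
profiles = {
    "Player A": {"batting_style": "Right-hand bat", "age": 36},
    "Player B": {"batting_style": "Left-hand bat", "age": 24},
}
rows = [
    {"batsman": "Player A", "batsman_runs": 4},
    {"batsman": "Player A", "batsman_runs": 0},
    {"batsman": "Player B", "batsman_runs": 6},
    {"batsman": "Player X", "batsman_runs": 1},  # no profile available
]

summary = runs_by_profile_field(rows, profiles, "batting_style")
print(summary)
```

Swapping the field for an age bracket or a country column would give the kind of grouping needed for the questions in the list above.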
For this data, I used BeautifulSoup to scrape the ESPNcricinfo site. I hardcoded the ESPNcricinfo URLs of the IPL franchise pages, each of which lists all the players associated with that team. From there, it's easy to follow each player's hyperlink to collect and create the player profile data.
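The link-collecting step can be sketched roughly as follows. Note that this is not the BeautifulSoup code from the project: to stay dependency-free it uses Python's built-in html.parser instead, and the HTML snippet and the "/player/" path pattern are invented placeholders rather than the real page structure of the site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL contains a given fragment."""

    def __init__(self, must_contain):
        super().__init__()
        self.must_contain = must_contain
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep each matching link once, in page order.
        if href and self.must_contain in href and href not in self.links:
            self.links.append(href)


# Invented stand-in for a downloaded team page.
team_page_html = """
<ul>
  <li><a href="/player/player-a-1">Player A</a></li>
  <li><a href="/team/about">About the team</a></li>
  <li><a href="/player/player-b-2">Player B</a></li>
</ul>
"""

collector = LinkCollector("/player/")
collector.feed(team_page_html)
player_links = collector.links
print(player_links)
```

Each collected link would then be fetched in turn and parsed for the profile fields, ideally with a polite delay between requests.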
I’ve hosted this dataset on Kaggle for public use, along with the code that I used to create the dataset.
Kaggle IPL DataSet URL: IPL DataSet (2020 Included)
GitHub Repo URL: Project4 — IPL
Conclusion:
It was a really interesting experience working on this dataset. I had to search a lot and learn more about data cleaning, working with YAML files, extracting data, scraping web pages, using pandas’ built-in functions to understand the data, and thinking of novel ways to extract useful information.
Path Ahead:
I will be returning to this project to build predictive models for some or all of these tasks: batsman vulnerability detection, the best team combination against a particular team or at a particular ground, a fantasy cricket player combination recommender, and so on.
If you are interested, please feel free to use the dataset and do awesome projects.
If you want to reach out to me, you can connect with me through LinkedIn.
Check out my other projects in 100MLProjects Series or other projects in general at my GitHub profile.
Have an amazing day!