EDA (Exploratory Data Analysis) on the English Premier League (football).
Image Source: https://www.exasol.com
1. Introduction
The data was originally acquired from the Premier League website and covers seasons 1993–94 to 2017–18. It contains match statistics such as the half-time and full-time results and the goals scored by the home and away teams.
Source: https://www.kaggle.com/zaeemnalla/premier-league
In this article I explain the data analysis operations performed on this data set and how they help identify which factors contribute to the result of a football game. We will use the data of only one season here (2003–2004).
Before going into any details, let us first look at the variables we will be dealing with.
1. Div: The division in which the game is played.
2. Date: Date of the game in DD/MM/YYYY format.
3. HomeTeam: Name of the team playing at home.
4. AwayTeam: Name of the team playing away.
5. FTHG: Goals scored by the home team at full time.
6. FTAG: Goals scored by the away team at full time.
7. HTHG: Goals scored by the home team at half time.
8. HTAG: Goals scored by the away team at half time.
9. Season: Season in which the match took place.
10. HTR: Half-time result, i.e. which team is leading at half time (H for Home, A for Away, D for Draw).
11. FTR: Full-time result, i.e. which team has won at full time (H for Home, A for Away, D for Draw).
2. Importing the libraries and loading the CSV.
# Import the libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
# Move to the folder that contains the dataset (adjust this path to your machine)
os.chdir("C:/Users/Abhishek/Desktop/Data Sets/premier-league")
# Load the dataset
masterdata = pd.read_csv("results93-18.csv")
3. As stated earlier, we will use the data of only one season here (2003–2004).
# Make a subset of the data containing only the specified season
data = masterdata[masterdata["Season"] == "2003-04"]
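To make the filtering step concrete, here is a minimal sketch on a tiny made-up table (the `toy_master` frame and its rows are hypothetical, not taken from the real file). It also adds `.copy()`, which is my own suggestion rather than part of the original code, so that the subset is an independent frame and later column assignments do not trigger pandas' chained-assignment warning.

```python
import pandas as pd

# Hypothetical miniature of the master table; the real file has many more rows and columns.
toy_master = pd.DataFrame({
    "Season": ["2002-03", "2003-04", "2003-04", "2004-05"],
    "HomeTeam": ["Arsenal", "Arsenal", "Chelsea", "Everton"],
    "FTR": ["H", "D", "A", "H"],
})

# The boolean mask keeps only rows of the chosen season; .copy() returns an independent frame.
toy_season = toy_master[toy_master["Season"] == "2003-04"].copy()

# Sanity check: only one season label should remain after filtering.
print(toy_season["Season"].unique())
print(toy_season.shape)
```

The same check (`data["Season"].unique()`) can be run on the real subset to confirm that only the 2003-04 rows were kept.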
4. Exploring the File Imported.
# Check Shape of the data
data.shape
Output: (380, 11)
This is as expected: we have 11 columns and 380 games in a single season (20 teams, each hosting the other 19 once).
# Check Datatype of the Columns
data.info()
Output :
# Check for missing values
data.isna().sum()
Output: every column shows 0, so there are no missing values.
# Check Details for all numeric variables
data.describe()
Output :
Observations:
- Count : Shows the number of values present in respective columns.
- Mean: Mean of all the values present in the respective columns.
- Std: Standard Deviation of the values present in the respective columns.
- Min: The minimum value in the column.
- 25%: The 25th percentile (first quartile) of the column.
- 50%: The 50th percentile, i.e. the median of the column.
- 75%: The 75th percentile (third quartile) of the column.
- Max: The maximum value in the column.
5. Check if the Data set is Balanced.
# Check if the Dataset is Balanced
data["FTR"].value_counts()
Output:
1. Home: 167
2. Draw: 108
3. Away: 105
We can see the data set is not perfectly balanced, as it leans a bit more towards “Home”.
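To put a number on the imbalance, the counts can be turned into percentages. The sketch below simply re-enters the three counts reported above by hand (the `ftr_counts` name is my own); on the real data, `data["FTR"].value_counts(normalize=True)` should give the same shares as fractions.

```python
import pandas as pd

# Counts as reported above for the 2003-04 season (H = Home, D = Draw, A = Away).
ftr_counts = pd.Series({"H": 167, "D": 108, "A": 105})

# Share of each outcome as a percentage of all 380 games.
ftr_share = (ftr_counts / ftr_counts.sum() * 100).round(1)
print(ftr_share)
```

Home wins make up roughly 44% of the games, while draws and away wins account for about 28% each.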
6. Univariate Analysis
The main purpose of univariate analysis is to describe, summarize and find patterns in a single feature, i.e. it is the analysis of one variable at a time.
Example question: Which of the features is most useful for identifying FTR?
We can answer this question using univariate analysis (with a PDF, a box plot, etc.).
6.1 Counts for FTR (Full Time Result)
sns.countplot(data=data,x="FTR",palette="winter",)
plt.xlabel("FTR",size=15,color="Black")
plt.ylabel("Count",size=15,color="Black")
Observations:
1. Just by looking at the counts, we can see that the home team has noticeably more wins.
2. Another way of looking at it is that the away team more often ends up with a draw or a loss.
3. From this, playing at home looks like a big advantage.
6.2 Probability Density Function (PDF)
# Note: recent seaborn releases use height= instead of size= and deprecate distplot;
# histplot with kde=True and stat="density" is used here as the closest replacement.
sns.set_style("whitegrid")
sns.FacetGrid(data, height=4) \
.map(sns.histplot, "FTHG", kde=True, stat="density") \
.add_legend()
plt.xlim(0, 6)
plt.xlabel("FTHG", size=15, color="Black")
plt.title("Goals Scored at FT by Home team")
plt.show()
sns.FacetGrid(data, height=4) \
.map(sns.histplot, "FTAG", kde=True, stat="density") \
.add_legend()
plt.xlabel("FTAG", size=15, color="Black")
plt.title("Goals Scored at FT by Away team")
plt.xlim(0, 6)
plt.show()
Observations:
1. Most of the time, both the home and the away team score 1 goal. The most frequent scores are 1, 0 and 2 goals, in that order, with the away team slightly ahead at these low scores.
2. However, when it comes to more than 2 goals, home teams are ahead.
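Since goals are whole numbers, the same comparison can also be read from a plain frequency table instead of a density curve. Here is a small sketch on made-up scores (the `toy` frame is hypothetical and only illustrates the technique); swapping in the real `data` frame would give the actual counts behind the plots.

```python
import pandas as pd

# Hypothetical full-time scores for eight games, for illustration only.
toy = pd.DataFrame({
    "FTHG": [1, 2, 0, 3, 1, 1, 4, 0],
    "FTAG": [1, 0, 0, 1, 2, 1, 0, 1],
})

# Count how often each number of goals occurs for the home and the away side.
# Goal totals that never occur for one side are filled with 0.
goal_freq = pd.DataFrame({
    "home": toy["FTHG"].value_counts(),
    "away": toy["FTAG"].value_counts(),
}).fillna(0).astype(int).sort_index()
print(goal_freq)
```

Each row is a number of goals and each column shows in how many games that side scored exactly that many.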
6.3 Box Plots
sns.boxplot(data=data, x="FTR", y="FTHG", palette="inferno_r",
            meanline=True, showmeans=True,
            meanprops={"marker": "^", "markerfacecolor": "white", "markeredgecolor": "blue", "color": "White"})
plt.title("Result and Goals Scored by Home Team")
plt.xlabel("FTR", size=15, color="Black")
plt.ylabel("FTHG", size=15, color="Black")
plt.show()
sns.boxplot(data=data, x="FTR", y="FTAG", palette="inferno",
            meanline=True, showmeans=True,
            meanprops={"marker": "^", "markerfacecolor": "white", "markeredgecolor": "blue", "color": "White"})
plt.title("Result and Goals Scored by Away Team")
plt.xlabel("FTR", size=15, color="Black")
plt.ylabel("FTAG", size=15, color="Black")
plt.show()
Observations:
Home team:
1. When winning, scores a mean of 2.5 goals.
2. When drawing, scores a mean of 0.9 goals.
3. When losing, scores a mean of 0.5 goals.
Away team:
1. When winning, scores a mean of 2.1 goals.
2. When drawing, scores a mean of 0.9 goals.
3. When losing, scores a mean of 0.6 goals.
Overall, the home team scores more goals, which will be a huge factor in winning the game.
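The means marked on the box plots can also be computed directly with a group-by. The sketch below uses a made-up six-game table (the `toy` frame is hypothetical, so its numbers are not the ones quoted above); running the same `groupby` on the real `data` frame should reproduce the plotted means.

```python
import pandas as pd

# Hypothetical results for six games, for illustration only.
toy = pd.DataFrame({
    "FTR":  ["H", "H", "D", "D", "A", "A"],
    "FTHG": [2, 3, 1, 0, 0, 1],
    "FTAG": [0, 1, 1, 0, 2, 3],
})

# Mean goals scored by the home and away side for each full-time result.
mean_goals = toy.groupby("FTR")[["FTHG", "FTAG"]].mean()
print(mean_goals)
```

In the resulting table, the row H with column FTHG is the home team's mean when it wins, and the row A with column FTAG is the away team's mean when it wins.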
7. Bi-Variate analysis :
1. Bivariate analysis is one of the simplest forms of quantitative analysis.
2. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them.
3. Bivariate analysis can be helpful in testing simple hypotheses of association.
7.1 Pair Plots
sns.pairplot(data,hue="FTR")
Observations:
1. FTHG and FTAG are the values that clearly indicate who wins, so studying these 2 variables is the most direct way to determine FTR.
2. Whichever of the 2 has the higher value, that team wins: the team that scores more goals at full time wins the match, which is simply how football works.
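Because the full-time result follows directly from the two goal counts, the relationship seen in the pair plot is a definition rather than a pattern to be learned. A minimal sketch of that rule on made-up scores (the `toy` frame and the `derived` name are my own) looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with their recorded results, for illustration only.
toy = pd.DataFrame({
    "FTHG": [2, 1, 0, 3],
    "FTAG": [0, 1, 2, 3],
    "FTR":  ["H", "D", "A", "D"],
})

# Positive goal difference means a home win, negative an away win, zero a draw.
diff = toy["FTHG"] - toy["FTAG"]
derived = pd.Series(
    np.where(diff > 0, "H", np.where(diff < 0, "A", "D")),
    index=toy.index,
)

# True when the derived result agrees with the recorded FTR in every row.
matches = bool((derived == toy["FTR"]).all())
print(matches)
```

Running the same comparison on the real `data` frame is a quick consistency check on the FTR column.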
8. Check whether a lead at half time changes by full time.
# Check whether the team winning at half time still wins at full time
sns.countplot(data=data, x="HTR", hue="FTR")
plt.legend(edgecolor="White", facecolor="White")
plt.xlabel("HTR", size=12, color="Black")
plt.ylabel("Count", size=12, color="Black")
H: The home team was leading at half time.
A: The away team was leading at half time.
D: The game was level at half time.
1. The blue bar shows the team went on to win.
2. The orange bar shows the team went on to draw the game.
3. The green bar shows the team lost the game at full time.
Observations:
1. The team leading at half time almost always goes on to win the game at full time.
2. If the game is level at half time, a home win is more likely than an away win, although the single most likely outcome is still a draw.
3. So HTR is a very important variable for determining who wins at full time.
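The count plot can be complemented with a table of row-wise proportions, which shows for each half-time state how often each full-time result followed. The sketch below uses a made-up ten-game table (the `toy` frame is hypothetical); on the real data the call would be `pd.crosstab(data["HTR"], data["FTR"], normalize="index")`.

```python
import pandas as pd

# Hypothetical half-time and full-time results for ten games, for illustration only.
toy = pd.DataFrame({
    "HTR": ["H", "H", "H", "D", "D", "D", "D", "A", "A", "A"],
    "FTR": ["H", "H", "D", "H", "D", "D", "A", "A", "A", "H"],
})

# Rows are half-time results, columns are full-time results.
# normalize="index" makes each row sum to 1, so cells read as shares of that half-time state.
transition = pd.crosstab(toy["HTR"], toy["FTR"], normalize="index")
print(transition.round(2))
```

Reading along a row answers the question asked in this section: given the state at half time, how the game tended to end.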
Conclusion:
1. There is a higher percentage of home wins, so the team playing at home clearly has an advantage.
2. Goals scored at full time (FTHG, FTAG) determine FTR: the team that scores more goals at full time wins the match.
3. The home team usually scores more goals. For example, a winning home team scores a mean of 2.5 goals, compared to 2.1 goals for a winning away team.
4. HTR is a very important variable for determining who wins at full time. As we saw, the team winning at half time does not usually end up losing at full time, so this variable can effectively predict who is likely to win at full time.