EDA (Exploratory Data Analysis) on the English Premier League (football)

Image source: https://www.exasol.com

1. Introduction

The data was originally acquired from the Premier League website and covers seasons 1993–94 to 2017–18. It contains various match statistics, such as the full-time and half-time results and the goals scored by the home and away teams.

Source : https://www.kaggle.com/zaeemnalla/premier-league

I would like to explain the various data analysis operations that have been performed on this data set and how to conclude which factors contribute to the result of a football game. We will be using data from only one season here (the 2003–2004 season).

Before going into any details, let us first take a look at which variables we will be dealing with.

1. Div: The division in which the game is played.

2. Date: Date of the game in DD/MM/YYYY format.

3. HomeTeam: Name of the team playing at home.

4. AwayTeam: Name of the team playing away.

5. FTHG: Goals scored at full time by the home team.

6. FTAG: Goals scored at full time by the away team.

7. HTHG: Goals scored at half time by the home team.

8. HTAG: Goals scored at half time by the away team.

9. Season: Season in which the match took place.

10. HTR: Shows which team is leading at half time (H for Home, A for Away, D for Draw).

11. FTR: Shows which team has won at full time (H for Home, A for Away, D for Draw).

2. Importing the libraries and loading the csv.

3. As stated earlier, we will be using data from only one season here (the 2003–2004 season).
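The original code for steps 2 and 3 is not shown here, so the following is only a minimal sketch of what loading and filtering could look like with pandas. The inline CSV text is a made-up stand-in for the real file, and the season label "2003-04" is an assumption about how the Season column is written.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: same column names as listed above,
# but only four made-up rows so the snippet runs without the Kaggle file.
csv_text = """Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Season
E0,17/08/2002,Team A,Team B,2,1,H,1,0,H,2002-03
E0,16/08/2003,Team A,Team C,1,1,D,0,1,A,2003-04
E0,16/08/2003,Team B,Team D,0,2,A,0,0,D,2003-04
E0,14/08/2004,Team C,Team D,3,0,H,2,0,H,2004-05
"""

# With the real data you would pass the path of the downloaded file instead.
df = pd.read_csv(io.StringIO(csv_text))

# Keep only one season. The label "2003-04" is an assumption;
# df["Season"].unique() shows how the real file spells it.
season = df[df["Season"] == "2003-04"].reset_index(drop=True)
print(season.shape)
```

Resetting the index after filtering keeps the row labels running from 0, which makes later positional lookups less surprising.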

4. Exploring the File Imported.

Output : (380, 11).

This is correct, as we have 11 columns and 380 games in a single season (20 teams, each playing 19 home games).

Output :

Output : We can say there are no Missing values.
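The exact calls behind these outputs are not shown, so here is a small sketch of how the shape and the missing-value check could be done with pandas. The toy frame below uses a few of the column names from this article with made-up values.

```python
import pandas as pd

# Toy frame with some of the article's column names (values are made up).
season = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "AwayTeam": ["Team B", "Team C", "Team A"],
    "FTHG": [2, 0, 1],
    "FTAG": [1, 0, 3],
    "FTR": ["H", "D", "A"],
})

rows, cols = season.shape        # (number of matches, number of columns)
missing = season.isnull().sum()  # count of missing values per column

print(rows, cols)
print(missing)
```

If every entry of `missing` is 0, the data set has no missing values and no imputation is needed.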

Output :

Observations:

  1. Count : Shows the number of values present in respective columns.
  2. Mean: Mean of all the values present in the respective columns.
  3. Std: Standard Deviation of the values present in the respective columns.
  4. Min: The minimum value in the column.
  5. 25%: Gives the 25th percentile value.
  6. 50%: Gives the 50th percentile value.
  7. 75%: Gives the 75th percentile value.
  8. Max: The maximum value in the column.
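A summary with exactly these rows is what pandas' `describe()` returns for numeric columns, so that is presumably what produced the table. A minimal sketch on made-up goal counts:

```python
import pandas as pd

# Made-up goal counts, just to show the layout of the summary table.
goals = pd.DataFrame({"FTHG": [2, 0, 1, 3], "FTAG": [1, 0, 3, 1]})

summary = goals.describe()
print(summary)

# The 50% row is the median of each column.
print(summary.loc["50%", "FTHG"] == goals["FTHG"].median())
```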

5. Check if the Data set is Balanced.

Output :
1. Home: 167
2. Draw: 108
3. Away: 105

We can see the data set is not perfectly balanced, as it leans a bit more towards “Home” (167 of 380 games, about 44%).
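To put the imbalance into numbers, the counts above can be turned into shares. This sketch starts from the three counts directly; on the full data frame, something like `value_counts(normalize=True)` on the FTR column would give the same shares.

```python
import pandas as pd

# Counts reported above for the 2003-2004 season.
counts = pd.Series({"H": 167, "D": 108, "A": 105}, name="FTR")

# Share of each result among all matches.
shares = (counts / counts.sum()).round(3)
print(shares)
```

A perfectly balanced data set would have each class at roughly one third, so the home-win class is clearly over-represented here.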

6. Univariate Analysis

The major purpose of univariate analysis is to describe, summarize and find patterns in a single feature.

Analysis of 1 variable.

Ex. Question: Which of the features is most useful to identify FTR?

We can answer this question using univariate analysis (using a PDF, a box plot, etc.).

6.1 Counts for FTR (Full Time Result)

Observations:

  1. Just by looking at the counts, we can see that the home team has significantly more wins.
  2. Another way of looking at it is that the away team is more likely to get a draw or a loss.
  3. From this, it looks like playing at home is a big advantage.

6.2 Probability Density Function(PDF)

Observations:

  1. Most of the time, both the home and the away team score 1 goal. The most frequent scores are 1, 0 and 2, in that order. The away team is slightly ahead here.
  2. However, when it comes to more than 2 goals, home teams are ahead.
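The plot itself is not reproduced here. As a rough numerical counterpart, the empirical distribution of goals can be tabulated for both columns side by side. The values below are made up, so the numbers do not match the real season.

```python
import pandas as pd

# Made-up full-time goals for six matches.
season = pd.DataFrame({
    "FTHG": [1, 0, 2, 1, 3, 1],
    "FTAG": [1, 1, 0, 2, 0, 1],
})

# Share of matches for each goal count, home and away side by side.
# A goal count that never occurs in a column gets a share of 0.
dist = pd.DataFrame({
    col: season[col].value_counts(normalize=True) for col in ["FTHG", "FTAG"]
}).fillna(0).sort_index()
print(dist)
```

Each column sums to 1, so the rows can be compared directly: a larger value in the FTHG column for a given goal count means home teams hit that score more often.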

6.3 Box Plots

Observations:

Home Team :

1. When winning, the mean score is 2.5 goals.

2. When drawing, the mean is 0.9 goals.

3. When losing, the mean is 0.5 goals.

Away Team :

1. When winning, the mean score is 2.1 goals.

2. When drawing, the mean is 0.9 goals.

3. When losing, the mean is 0.6 goals.

-> Overall, the home team scores more goals, which will be a huge factor in winning the game.
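Means like these can also be computed directly rather than read off a plot. A minimal sketch, assuming the means are taken per full-time result; the toy values are made up and will not reproduce the numbers above.

```python
import pandas as pd

# Made-up matches: two home wins, two draws, two away wins.
season = pd.DataFrame({
    "FTHG": [3, 2, 1, 1, 0, 1],
    "FTAG": [0, 1, 1, 1, 2, 3],
    "FTR":  ["H", "H", "D", "D", "A", "A"],
})

# Mean home and away goals for each full-time result.
means = season.groupby("FTR")[["FTHG", "FTAG"]].mean()
print(means)
```

In this table the home team's winning mean is the FTHG value in row H, and the away team's winning mean is the FTAG value in row A. In the D row both columns are always equal, which matches the identical 0.9 figures above.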

7. Bi-Variate analysis :

  1. Bivariate analysis is one of the simplest forms of quantitative analysis.
  2. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them.
  3. Bivariate analysis can be helpful in testing simple hypotheses of association.

7.1 Pair Plots

Observations:

  • FTHG and FTAG are the values that clearly indicate who wins, so studying these 2 variables is the best way to determine FTR.
  • Whichever of the 2 has the higher value, that team wins. In other words, the team that scores more goals at full time wins the match, which is simply how football works.
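Since FTR follows directly from the two goal columns, this relationship can be checked in a few lines. The sketch below recomputes the result from FTHG and FTAG on made-up rows and compares it with the stored FTR.

```python
import pandas as pd

# Made-up rows: one home win, one draw, one away win.
season = pd.DataFrame({
    "FTHG": [2, 1, 0],
    "FTAG": [1, 1, 3],
    "FTR":  ["H", "D", "A"],
})

# Positive goal difference means a home win, negative an away win, zero a draw.
diff = season["FTHG"] - season["FTAG"]
derived = diff.apply(lambda d: "H" if d > 0 else ("A" if d < 0 else "D"))

# True if the recomputed result agrees with the FTR column on every row.
matches = bool((derived == season["FTR"]).all())
print(matches)
```

Because FTR is fully determined by these two columns, they describe the result rather than predict it. Any model meant to forecast a game before full time would have to leave them out.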

8. Check whether the team winning at half time is still winning at full time.

H : Shows the Home team leading at Half Time.

A : Shows the Away team leading at Half Time.

D : Shows the game was level at Half Time.

1. The blue bar shows the team went on to win.

2. The orange bar shows the team went on to draw the game.

3. The green bar shows the team lost the game at full time.

Observations:

1. The team leading at half time almost always goes on to win the game at full time.

2. If the game is level at half time, the home team is more likely to win than the away team, although the most likely outcome is still a draw.

3. So HTR is a very important variable for determining who wins at full time.
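The same comparison can be made in table form. This is a sketch using a cross-tabulation of HTR against FTR on made-up rows, so the shares below are illustrative only.

```python
import pandas as pd

# Made-up half-time and full-time results for eight matches.
season = pd.DataFrame({
    "HTR": ["H", "H", "H", "D", "D", "A", "A", "A"],
    "FTR": ["H", "H", "D", "D", "H", "A", "A", "H"],
})

# Each row is a half-time state; each cell is the share of those
# matches that ended with the given full-time result.
table = pd.crosstab(season["HTR"], season["FTR"], normalize="index")
print(table.round(2))
```

Normalizing by row makes the three half-time states comparable even when they occur with different frequencies.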

Conclusion :

1. There is a higher percentage of home wins, so the team playing at home clearly has an advantage.

2. Goals scored at full time (FTHG, FTAG) determine FTR, i.e. which team wins the game: the team that scores more goals at full time wins the match.

3. The home team usually scores more goals. For example, when winning, the home team scores a mean of 2.5 goals, compared to 2.1 goals for the away team when winning.

4. HTR is a very important variable for determining who wins at full time. As we saw, the team winning at half time does not usually end up losing at full time, so this variable can effectively predict who is likely to win at full time.

AI and Machine Learning Enthusiast.
