
Expected Goals (xG) - An Analysis Of The Popular Metric In The EPL 2022/23 Season

GitHub repository for this project
Author LinkedIn

As part of the final assessment for the BEE2041 module, this post aims to gain a better understanding of the xG metric in football through the use of Premier League data from the 2022/23 season. Throughout this post, we will explore how accurate xG is as a metric and its correlation to goals scored both at an individual match level and over the entire season. We will conclude with insight on the question about whether xG is a good predictor for match outcomes and whether you would be able to guess the match results solely from looking at the xG metric.

Introduction & Background

What Is xG?

Expected Goals (commonly known and denoted as xG) is a metric used in football to measure the probability that a shot will result in a goal. For example, an attempt on goal with an xG of 0.2 would be expected to be scored once in every five attempts. In essence, xG is a proxy for the quality of chances created in a game.
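To make the definition concrete, here is a minimal Python sketch. It assumes that a team's xG for a match is the sum of the xG values of its individual shots, which is the usual way the metric is aggregated. The shot values are invented for illustration.

```python
# Hypothetical xG values for one team's shots in a single match (illustrative only).
shot_xg = [0.2, 0.05, 0.45, 0.1]

def attempts_per_goal(xg: float) -> float:
    """Average number of such shots needed per goal, e.g. 0.2 xG means 1 in 5."""
    return 1 / xg

def match_xg(shots: list[float]) -> float:
    """Team xG for a match: the sum of the individual shot probabilities."""
    return sum(shots)

print(attempts_per_goal(0.2))       # 5.0
print(round(match_xg(shot_xg), 2))  # 0.8
```

Four shots worth 0.8 xG in total mean the team would be expected to score slightly less than one goal from those chances on average.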

xG has slowly risen over the past decade to become one of the leading metrics in football analysis and commentary. After being proposed by Sam Green in April 2012, xG now regularly features in punditry, managers' post-match interviews and fans' post-match analysis. It has come to complement traditional metrics such as 'Shots' and 'Shots on Target'.

xG is calculated using historical information from thousands of shots with similar characteristics to estimate how likely a goal is to be scored, on a scale between 0 and 1. Many variables are used to compare a shot against the historical data. These variables include:

  • Distance to the goal
  • Angle to the goal
  • Goalkeeper position
  • Position of other players relative to the goal
  • Pressure exerted by opposition players
  • Pattern of play, for example open play, free kick, corner or fast break
  • Information about the previous action, for example the type of assist

As this is a statistical model, the predictions vary depending on the calculation and specific model used. Consequently, it is common to be able to find different measures of xG on different platforms during and after games.
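As a rough illustration of why two providers can report different xG for the same shot, the sketch below uses a simple logistic model with only two of the variables listed above (distance and angle). Logistic regression is one common choice for this kind of model, but it is an assumption here and not the model Understat or any other provider actually uses. Both coefficient sets are hypothetical.

```python
import math

def shot_xg(distance_m: float, angle_deg: float,
            coeffs: tuple[float, float, float]) -> float:
    """Logistic model: goal probability from shot distance and visible goal angle."""
    intercept, b_distance, b_angle = coeffs
    z = intercept + b_distance * distance_m + b_angle * angle_deg
    return 1 / (1 + math.exp(-z))

# Two hypothetical coefficient sets standing in for two different xG models.
MODEL_A = (-1.0, -0.10, 0.03)
MODEL_B = (-0.8, -0.12, 0.025)

shot = {"distance_m": 12.0, "angle_deg": 30.0}
xg_a = shot_xg(**shot, coeffs=MODEL_A)
xg_b = shot_xg(**shot, coeffs=MODEL_B)
print(f"Model A: {xg_a:.3f}  Model B: {xg_b:.3f}")
```

The same shot receives a different probability from each model. A real model would be fitted to historical shot data and would include many more of the variables listed above.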

Data

In this analysis, information from understat.com has been scraped using Python and Selenium to provide the preliminary dataframe. The head of this raw dataframe can be seen below:

[Image not available: head of the raw dataframe]

This project uses match data from the 2022/23 English Premier League season, comprising 380 matches from 20 different teams.

Highlights of the data are as follows:

  • Home teams scored an average of 1.63 goals per match, compared with 1.22 for away teams
  • The average xG of home teams was 1.67 compared to 1.30 by away teams
  • The highest number of home goals scored was 9 compared to 6 by away teams
  • The highest xG created by a home team was 4.92 compared to 4.67 by an away team
  • The lowest xG created by a home team was 0.02 compared to 0.06 by an away team
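Highlights like those above could be produced along the following lines. This is a sketch using only the Python standard library. The field names and the three rows are hypothetical and do not reflect the project's actual column names or data.

```python
from statistics import mean

# Hypothetical rows; field names are assumptions, not the project's real columns.
matches = [
    {"home_goals": 2, "away_goals": 0, "home_xg": 1.85, "away_xg": 0.64},
    {"home_goals": 1, "away_goals": 1, "home_xg": 0.92, "away_xg": 1.41},
    {"home_goals": 0, "away_goals": 3, "home_xg": 1.10, "away_xg": 2.37},
]

def summarise(rows, field):
    """Return the mean (2 d.p.), maximum and minimum of one field across matches."""
    values = [row[field] for row in rows]
    return {"mean": round(mean(values), 2), "max": max(values), "min": min(values)}

for field in ("home_goals", "away_goals", "home_xg", "away_xg"):
    print(field, summarise(matches, field))
```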

Analysis

Is xG a good estimator of total goals scored over the entire season?

The first stage for analysing xG as a metric is to consider whether it is accurate when aggregated over the entire season.

| | xG | Actual Goals Scored | Difference | Percentage Difference |
|---|---|---|---|---|
| Home | 634.29 | 621 | -13.29 | -2.1% |
| Away | 492.25 | 463 | -29.25 | -5.9% |
| Total | 1126.54 | 1084 | -42.54 | -3.8% |

It is clear from the table that over the whole season, Premier League teams scored fewer goals than would have been expected given the number and quality of their chances. In total, teams scored 42.54 fewer goals than the xG measure predicted. There is also a difference between home and away performance. Not only do teams create more chances and score more goals in home games than in away games, but they also convert their chances at a rate closer to what xG predicts (a 2.1% shortfall at home versus 5.9% away).

But what do these findings mean for xG? With total goals falling only 3.8% short of total xG (the 96.2% accuracy referred to from here on), xG is a relatively accurate predictor of how many goals will be scored over an entire season. It is logical to assume that in each game there will be an underestimation or overestimation of the goals scored, as xG is unlikely to equal a whole number. Therefore, we will explore whether the season-long accuracy of xG carries over to individual teams in specific games.
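The figures in the table can be reproduced with a few lines of Python. The season totals are taken from the table. Treating "accuracy" as 100% minus the size of the percentage difference is an assumption about how the 96.2% figure was obtained, although the numbers are consistent with it.

```python
# Season totals from the table: (total xG, actual goals scored).
totals = {"Home": (634.29, 621), "Away": (492.25, 463)}
total_xg = sum(xg for xg, _ in totals.values())
total_goals = sum(goals for _, goals in totals.values())
totals["Total"] = (total_xg, total_goals)

def percentage_difference(xg: float, goals: int) -> float:
    """Goals scored relative to xG, as a percentage of xG."""
    return 100 * (goals - xg) / xg

for label, (xg, goals) in totals.items():
    pct = percentage_difference(xg, goals)
    print(f"{label}: difference {goals - xg:+.2f}, {pct:+.1f}%")
# Home: difference -13.29, -2.1%
# Away: difference -29.25, -5.9%
# Total: difference -42.54, -3.8%

accuracy = 100 - abs(percentage_difference(*totals["Total"]))
print(f"Season-level accuracy: {accuracy:.1f}%")  # Season-level accuracy: 96.2%
```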


Is xG a good estimator of goals scored by individual teams in a match?

After finding that xG was 96.2% accurate in predicting total goals scored over the course of the whole season, the next step is to find the accuracy on a game by game basis.

In this section, we look into how each team compares to their respective xG metric in every game in the season. Therefore, in this data set, there are 760 data points (380 games x 2 teams). The scatter graph shows the goals scored and the associated xG for every team in every match this season.

[Image not available: scatter graph of goals scored against xG for every team in every match]

The largest under-performance was 2.9 goals fewer than xG predicted. Conversely, the largest over-performance was 4.1 goals more than xG predicted.

The points are distributed fairly evenly around the "xG = Goals Scored" line and are concentrated among the lower-chance, lower-scoring games. This is supported by both the mean differential of -0.056 and the median differential of -0.18 (goals scored minus xG). This helps to explain the accuracy of xG previously found for the season as a whole. However, it is clear that xG regularly under- or over-estimates the number of goals a team scores within a given game.

To what extent does xG differ from goals scored in each individual game?

[Image not available: histogram with density curve of the difference between xG and goals scored]

The graph above is a histogram with a density curve representing the difference between xG and goals scored by individual teams in individual games. There is a clear bell-shaped curve centred just below zero (the level of perfect accuracy, which would be achieved when xG = goals scored). Because the curve is roughly symmetric, xG under-predicts and over-predicts almost equally often. With a standard deviation of 1.019 goals, there is a substantial discrepancy between the xG metric and the real outcomes at the individual level. One goal can easily change the outcome of a game, so a typical error of around one goal is important.
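The summary statistics quoted for this distribution can be computed as sketched below. The six (goals, xG) pairs are hypothetical; the real data set has 760 of them. The sketch uses the sample standard deviation (`statistics.stdev`), and whether the figure of 1.019 is a sample or population standard deviation is not stated, so that choice is an assumption.

```python
from statistics import mean, median, stdev

# Hypothetical (goals, xG) pairs, one per team per match (illustrative only).
performances = [(2, 1.85), (0, 0.64), (1, 0.92), (1, 1.41), (0, 1.10), (3, 2.37)]

def differentials(rows):
    """Goals minus xG: negative means the team scored fewer than expected."""
    return [goals - xg for goals, xg in rows]

diffs = differentials(performances)
print(f"mean={mean(diffs):.3f} median={median(diffs):.3f} sd={stdev(diffs):.3f}")
```

Swapping `stdev` for `statistics.pstdev` gives the population version. With 760 observations the two would be nearly identical.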

Does this change for home vs away teams?


No. On the whole, the accuracy of xG in individual matches is not significantly different for home and away teams.

Therefore, at the individual level, xG cannot be seen as a good estimator of the absolute number of goals scored within a match. At the match level, xG tells you more about the quality of chances and the momentum of a game than it does directly about the likelihood of scoring. What could this mean for football? Firstly, it indicates that the number and quality of chances are not the only factors that affect the outcome of a match. Furthermore, it shows that teams can win a game of football in more than one way, whether through intensive chance creation or a low block.

As the old adage goes "The ball is round, the game lasts 90 minutes, and everything else is just theory. The only thing that counts in football is winning, and the only thing that matters is whether or not you get the ball in the back of the net." - Jimmy Greaves

How good is xG as a predictor for EPL match outcomes in the 2022/23 season?

So far, we have seen that xG is a good metric for predicting goals scored in a season but it is a poor metric for indicating the precise numbers of goals scored by each team in a game. At this stage, xG appears to be a better metric for indicating the performance and nature of the game. Nonetheless, is it possible to accurately predict the outcomes of games using xG without knowing the actual number of goals scored?

In the 2022/23 Premier League season, xG alone could show spectators the correct outcome 61.3% of the time (in a total of 233 out of 380 games). This low number supports the previous conclusion around xG being a better indicator of performance and nature of the match.

When we break down the 147 games that xG would have incorrectly predicted, 61 games (41.5%) were in matches where one team beat the other and 86 (58.5%) were in matches that were drawn. There were 87 draws in the 2022/23 season so xG would only be able to predict 1 of these. In the Everton vs Liverpool game, both teams had an xG of 1.39 whilst failing to score, which was the only time an actual draw had matching xG between the teams. Logically, draws would be incorrectly predicted most often as goals can only be scored in integers whereas xG is measured to two decimal places in our data set; it seems much less likely for xG metrics to be identical than the scores in the matches.

In games which were either won or lost, xG would show a spectator the correct outcome 79% of the time. The inaccuracy previously found in predicting the number of goals scored has therefore carried over to xG's ability to predict outcomes. As xG captures the total quality and quantity of shots in a match, this result also shows that the team that shoots more often from better positions does not always win. In fact, in 38.7% of matches (including draws), the team with the higher xG does not win. For any regular football spectator this will not be a surprise, but the finding is still interesting.
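The comparison described in this section can be sketched as follows: the outcome implied by xG (home win, away win or draw) is compared with the outcome implied by the actual score. The four matches and the tuple layout are hypothetical, and the project's own implementation may differ.

```python
def outcome(home: float, away: float) -> str:
    """Return 'H', 'A' or 'D' depending on which side has the larger value."""
    if home > away:
        return "H"
    if away > home:
        return "A"
    return "D"

# Hypothetical matches: (home_goals, away_goals, home_xg, away_xg).
matches = [
    (2, 0, 1.85, 0.64),  # xG and result agree: home win
    (1, 1, 0.92, 1.41),  # actual draw, but xG favours the away team
    (0, 3, 1.10, 2.37),  # xG and result agree: away win
    (1, 0, 0.55, 1.90),  # home win despite the away team's higher xG
]

def xg_accuracy(rows) -> float:
    """Share of matches where the xG outcome equals the actual outcome."""
    correct = sum(outcome(hg, ag) == outcome(hxg, axg) for hg, ag, hxg, axg in rows)
    return correct / len(rows)

print(xg_accuracy(matches))  # 0.5
```

Because unrounded xG values are almost never exactly equal, this comparison will almost never call a draw, which is why draws make up most of the incorrect predictions.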

[Image not available: pie chart of match outcomes predicted correctly and incorrectly by xG]

How good is xG as a predictor for EPL match outcomes in the 2022/23 season if xG is rounded to the nearest goal or half-goal?

After the previous question was answered, I wondered whether xG's ability to predict outcomes of matches would be affected by rounding the measure to the nearest integer (nearest goal) or nearest 0.5 (half-goal).

Comparing the pie-charts above and below, we can see that there is not a significant change in the proportion of results that could be correctly predicted using xG; the significant difference is the proportion of wins/losses and draws that were a part of these inaccuracies.

In both the rounded xG metrics, there has been an increase in wins/losses that have been incorrectly predicted. Additionally, there has been an overall reduction in the number of draws that have been incorrectly predicted. I would hypothesize that this is because the adjusted xG will more often be the same, meaning some wins/losses will now be presented as draws while draws are more likely to be correctly predicted.

Rounding xG is by no means perfect. A large proportion of these results can be attributed to two similar xG values falling on different sides of a rounding boundary (x.5 for whole goals; x.25 and x.75 for half-goals). Whilst rounding has increased the accuracy of xG for draw predictions, it has significantly worsened the metric for wins and losses.
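The effect of the rounding boundaries can be seen in a small sketch. The helper below rounds to the nearest multiple of a step (1 for whole goals, 0.5 for half-goals) and rounds exact halves upwards. That tie rule is an assumption, chosen because Python's built-in `round` rounds halves to the nearest even number instead. The xG pairs are invented.

```python
import math

def round_to_step(value: float, step: float) -> float:
    """Round to the nearest multiple of step, with exact halves rounding up."""
    return math.floor(value / step + 0.5) * step

def outcome(home: float, away: float) -> str:
    """Return 'H', 'A' or 'D' depending on which side has the larger value."""
    return "H" if home > away else "A" if away > home else "D"

# Hypothetical (home_xg, away_xg) pairs.
for home_xg, away_xg in [(1.39, 1.12), (1.49, 1.51)]:
    raw = outcome(home_xg, away_xg)
    goal = outcome(round_to_step(home_xg, 1), round_to_step(away_xg, 1))
    half = outcome(round_to_step(home_xg, 0.5), round_to_step(away_xg, 0.5))
    print(home_xg, away_xg, "raw:", raw, "nearest goal:", goal, "half-goal:", half)
# 1.39 1.12 raw: H nearest goal: D half-goal: H
# 1.49 1.51 raw: A nearest goal: A half-goal: D
```

In the second pair the two values differ by only 0.02 xG, yet rounding to the nearest goal separates them by a full goal because they sit either side of the 1.5 boundary.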

We have therefore seen that rounding xG in these ways leaves the overall proportion of correct predictions mostly unchanged, but changes which individual games are predicted correctly.

[Image not available: pie charts of match outcomes predicted using rounded xG]

Conclusion / interpretation

To conclude, we have found that xG is relatively accurate when aggregated over the whole season. However, it is a better tool for understanding the flow and nature of individual football matches than for precisely predicting goals scored or match outcomes. xG continues to hold value in helping to understand momentum, game-style and performance between the two teams in a game. It helps to explain how chance creation was split between the teams and, most importantly, the value of the chances created. xG proves more valuable at an aggregate level, with accuracy increasing as more games are included.

Therefore, it is important to remember that xG is a composite measure of quantity and quality of chances. In the end, all that matters to the result of a match is how many times you get the ball in the back of the net, not the quality of the chances to get it there.

Avenues For Further Research

To extend research found in this post, it would be interesting to:

  • Expand the data set into more seasons and leagues
  • Analyse whether there are differences between different leagues and seasons
  • Analyse whether rounding xG to a different number would make a difference to the prediction quality (e.g. rounding to the nearest 0.75 where you round 0.25-0.75 up to 0.75 and 0.75-1.25 down to 0.75)

Beyond this post, it would be interesting to:

  • Investigate more advanced and different types of xG including:
    • xGOT - a post-shot model (whereas xG is a pre-shot model) that includes the goalmouth location/destination of the actual shot
    • Non-Shot xG (NSxG) - Describes the likelihood that a possession will eventually turn into a goal.
  • Look into xG scored, created and prevented by individual players and teams
  • Consider the differences between different xG models (because of different algorithms and training data) as can be seen in this link: xG Model Comparison Post (Different Author)
  • Explore whether the difference between xG and goals scored can be explained by differences in team attributes and play styles between the historical training data and the current season