Let’s assume you have a player’s stats, including the usual ones: assists, total passes, key passes, chances created, minutes played, shots taken and so on. Now, if you were told to predict how many goals this unnamed player has scored, which of those statistics would you trust the most? Some might argue for minutes played – if a player hasn’t played many minutes, there’s little chance of him/her scoring many goals. Others might plump for shots taken – after all, if you don’t buy the ticket, you don’t win the raffle. With this piece, I aim to work out which of these statistics is truly the most important in predicting the number of goals scored by a player.

I took the data of all the players from Europe’s top 5 leagues (Premier League, Bundesliga, La Liga, Serie A and Ligue 1) from Understat for the 2018-19 domestic season. With this data, I tried to build a Linear Regression model to predict the number of goals a player has scored. Roughly, this means that I am trying to predict one thing (in this case, goals scored by a player) using a variety of other things (such as shots taken, minutes played, assists etc.) by finding how each of those factors relates to it. I will be using different combinations of these features to observe which of them possess the greatest predictive power.

Initially, I randomly divided my dataset into 2 parts – the training set and the testing set. With the training set, I hope to make my model learn what sorts of stats go together, to find how they are correlated with each other. For example, one might think that minutes played and total passes move together in the same direction. After that, I will use my testing set to test my model’s predictive power – I will input the features of my testing set (i.e. key passes, minutes played, shots taken etc) and try and predict the number of goals scored by those players. After that, I will compare my results with the actual number of goals scored.
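As a rough illustration of the random split described above, here is a minimal Python sketch. The helper name `train_test_split`, the 25% test share and the fixed seed are all assumptions made for the example (the split ratio actually used is not stated here), and the rows are made-up placeholders rather than Understat data.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists.

    test_fraction and seed are illustrative assumptions, not values
    taken from the analysis in this article.
    """
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up placeholder rows, not real player data.
players = [
    {"player_id": i, "shots": 10 + i, "minutes": 900 + 30 * i}
    for i in range(20)
]

train_rows, test_rows = train_test_split(players)
print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

The model only ever sees `train_rows` while it is being fitted; `test_rows` is held back so that the predictions are judged on players the model has not seen.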

My first model will include only 2 features – minutes played and shots taken. After training the model, my predictions look like this:

Our R-squared value is 0.7434 – this means that the proportion of variance explained by our features (Shots and Minutes Played) is around 74%. Thus, using shots and minutes played, we are able to explain around 74% of the variation in goals scored by our players. Usually, the higher the value of our R-squared, the better our model is at explaining the data. Moving on to our error, the mean squared error of our model is 3.0218. In this model, almost all of our predictive power comes from shots. This is seen in the following graph (an empty bar means that the correlation is either 0 or very close to 0):
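For readers who want to see how these two metrics are computed, here is a small Python sketch using their standard definitions. The function names and the four-player toy numbers are made up for the example and are not taken from the dataset.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus (unexplained variation / total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy example: four hypothetical players.
actual_goals = [0, 2, 4, 6]
predicted_goals = [1, 2, 3, 6]

mse = mean_squared_error(actual_goals, predicted_goals)  # (1 + 0 + 1 + 0) / 4 = 0.5
r2 = r_squared(actual_goals, predicted_goals)            # 1 - 2 / 20 = 0.9
print(mse, r2)
```

A mean squared error of about 3 therefore means that the squared gap between predicted and actual goals averages about 3, which corresponds to a typical miss of roughly 1.7 goals per player.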

Now we can move on to a more advanced model with a greater number of features. Our next model will include Shots, Minutes Played, Assists and Key Passes. Our predictions look like this:

This doesn’t look like much of an improvement on our previous model. This is borne out by our R-squared value of 0.7449, which is only marginally higher than that of our previous model. Our mean squared error is also about the same (3.004). However, there is one noticeable change. Shots don’t seem to be our most valuable predictor anymore.

Assists now have a much stronger positive correlation with goals than shots do. One possible reason for this might be that assists are usually higher for attacking players (forwards, wingers, attacking midfielders). Goals are also higher for attacking players. Thus, this high correlation between assists and goals may simply be due to the omitted variable of position – attacking players tend to score more goals and create more goals too. This may bias our estimate of how much assists themselves matter.
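The omitted-variable effect can be illustrated with a small simulation. The sketch below uses NumPy and entirely synthetic data in which assists have, by construction, no direct effect on goals: both are driven by a hypothetical `attacker` flag. All names and numbers are assumptions chosen for the illustration, not estimates from the Understat data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic players: attacker is 1 for attacking players, 0 otherwise.
attacker = rng.integers(0, 2, size=n).astype(float)
# By construction, position drives both assists and goals,
# and assists have NO direct effect on goals.
assists = 3.0 * attacker + rng.normal(0.0, 1.0, size=n)
goals = 5.0 * attacker + rng.normal(0.0, 1.0, size=n)

def ols_coefficients(features, target):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

coef_without_position = ols_coefficients([assists], goals)
coef_with_position = ols_coefficients([assists, attacker], goals)

print("assists coefficient, position omitted: ", coef_without_position[1])
print("assists coefficient, position included:", coef_with_position[1])
```

With position left out, the regression gives assists a clearly positive coefficient; once the `attacker` flag is included, that coefficient falls to roughly zero. The same mechanism could be inflating the importance of assists in the model above, although checking that would need position data, which is not among the features used here.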

To try and come up with more robust predictions, I will now include a new variable called Expected Goals (or xG). According to Opta Sports, “Expected Goals (xG) measures the quality of a shot based on several variables such as assist type, shot angle and distance from goal, whether it was a headed shot and whether it was defined as a big chance.” Thus, we would expect xG to be our best predictor for goals scored. We will now test this new metric in another model. This time, we have included all four of the prior features as well as xG. Our results seem to be much better this time.

It seems that our results are much more linear in shape now, showing how our new model fits the data better. Indeed, quite a lot of the players seem to fall along one line, except the one outlier on the top right (who else – Lionel Messi). Our R-squared and error values show this improvement too, going to 0.8778 and 1.4390 respectively. As expected, xG blows all of the other features out of the water in terms of importance as a predictor.
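To see why a shot-quality measure can crowd out raw shot counts, here is a hedged sketch on synthetic data, using NumPy. The variable names, the chance-quality range and the 1500/500 split are all assumptions made for the illustration; the output will not reproduce the R-squared values quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic players: shot volume, average chance quality, and the resulting xG.
shots = rng.poisson(40, size=n).astype(float)
chance_quality = rng.uniform(0.05, 0.20, size=n)   # assumed average xG per shot
xg = shots * chance_quality
goals = rng.poisson(xg).astype(float)              # goals scatter randomly around xG

train, test = slice(0, 1500), slice(1500, n)       # rows are i.i.d., so no shuffle needed

def fit(features, target):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

def r_squared(features, target, coef):
    """R-squared of the fitted coefficients on the given data."""
    X = np.column_stack([np.ones(len(target))] + features)
    residuals = target - X @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)

coef_shots_only = fit([shots[train]], goals[train])
coef_both = fit([shots[train], xg[train]], goals[train])

r2_shots_only = r_squared([shots[test]], goals[test], coef_shots_only)
r2_both = r_squared([shots[test], xg[test]], goals[test], coef_both)

print("test R-squared, shots only:", r2_shots_only)
print("test R-squared, shots + xG:", r2_both)
print("shots coefficient before / after adding xG:", coef_shots_only[1], coef_both[1])
```

In this toy setup, the test-set R-squared rises once xG is added, and the coefficient on shots shrinks towards zero, because everything shots could tell the model is already contained in xG.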

The metrics of our different models are shown in the following graph:

As we can see, the addition of xG greatly improves our model. Thus, it seems that xG is by far the best predictor of goals scored by a player. Assists also seem to be important (perhaps because they act as a proxy for position). Shots seem to lose all importance, as their usefulness has already been incorporated into xG. Minutes Played and Key Passes are poor predictors regardless.

Analysis conducted by Prithvi Pahwa.
