

What is the best method of predicting goals? Putting xG to the test

I have been writing about expected goals here for a little while. Today I want to do some early testing on how well they predict future outcomes. Warning: Nerdery to follow.


I run projections. Every week I plug in numbers and I write up what I think is going to happen based on those numbers. This raises the obvious question: what's so great about your numbers, huh? So this article attempts to show what is special about these stats. I plan on renoobulating my expected goals and game projection algorithms in a fully open way over the next couple of months. Today, I begin by explaining why I plan to use a form of expected goals methodology.

Can we improve upon our projections of football matches with "Expected Goals"? I think we can.

Richard Whittall has given us an excellent summary of the current discussion regarding expected goals. They're everywhere now. Paul Riley of the Different Game blog developed one of the earliest publicly available xG models, and this season Colin Trainor and Constantinos Chappas at Stats Bomb have been doing football analysis using expected goals. Martin Eastwood of Penalty Blog likewise published an expected goals formula recently, and Daniel Altman at Bloomberg Sports has written up some of the statistical theory behind xG. I can keep going. And I will! The Shots on Target blog has a cool searchable table of expected goals by match. And 11tegen11 did something pretty similar to this article, putting his own xG to the test using a pan-European sample. My own Premier League advanced statistics page contains my expected goals numbers, as well as the component stats which underlie them.

Whew. So what are expected goals?

The concept is pretty simple. For every shot, you assign an "expected goals" value based on characteristics like the location on the pitch, whether the shot is taken with the foot or the head, whether the shot is assisted by a cross or through-ball, and so on. This is in no way a comprehensive list of the characteristics of each shot, but it provides a reasonable estimate when dealing with larger samples. A club's expected goals, then, is the sum of all their expected goals values for all their shots.
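The summing logic can be sketched in a few lines of code. To be clear, this is a toy illustration, not my actual model: the shot categories, probabilities, and multipliers below are hypothetical round numbers chosen only to show the shape of the calculation.

```python
# Toy expected-goals sketch: assign each shot a goal probability based on a
# few characteristics, then sum over a club's shots. All values hypothetical.

BASE_XG = {
    "six_yard_box": 0.30,
    "penalty_area": 0.12,
    "outside_box": 0.03,
}

def shot_xg(zone, headed=False, from_cross=False):
    """Estimate one shot's goal probability from simple characteristics."""
    xg = BASE_XG[zone]
    if headed:
        xg *= 0.7   # headers convert less often than shots with the foot
    if from_cross:
        xg *= 0.8   # crossed chances are harder to convert
    return xg

def team_xg(shots):
    """A club's expected goals is the sum over all its shots."""
    return sum(shot_xg(**s) for s in shots)

shots = [
    {"zone": "penalty_area"},
    {"zone": "six_yard_box", "headed": True, "from_cross": True},
    {"zone": "outside_box"},
]
print(round(team_xg(shots), 3))  # → 0.318
```

A real model fits those values from thousands of shots rather than asserting them, but the structure, per-shot probabilities summed to a club total, is the same.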

Expected Goals vs. Shot Ratio

The move to expected goals is in many ways a development following on from the concept of Total Shots Ratio. TSR is a very simple stat, a club's shots taken as a share of all shots in its matches (shots for divided by the sum of shots for and shots against), which turns out to have impressive utility in explaining football outcomes. While no one thinks every shot is equal, it turns out that just logging all the shots tells you quite a bit. As James Grayson has shown, shots ratio is a good predictor of future goals scored and goals allowed, and it is useful even over relatively small samples.
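For concreteness, here is the TSR calculation, with made-up shot counts for illustration:

```python
def tsr(shots_for, shots_against):
    """Total Shots Ratio: a club's share of all shots in its matches."""
    return shots_for / (shots_for + shots_against)

# A hypothetical shot-dominant side: 310 shots taken, 250 conceded.
print(round(tsr(310, 250), 3))  # → 0.554
```

A TSR of 0.5 means a club takes exactly as many shots as it concedes; good sides sit above that, bad sides below.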

While TSR has its roots in hockey analysis, in many ways it goes back to some of the earliest English statistical analysis of football. Charles Reep, when he wasn't abusing the data to support his own ideological vision of proper English football, argued in the middle of the 20th century that the key to winning a football match is taking more shots. He labelled sides which had scored an above average percentage of their shots as "overdrawn" and those below average as "in credit." The statistical logic, as with most of Reep's work, is imprecise at best and full-on quackery at worst. The language of "in credit" suggests that one is owed goals, when statistically what we should see, if all shots are equal, is regression to the mean.

Still, the idea is broadly the same. The quality of a football club can be measured reasonably well by total shots attempted and conceded. Can we improve upon TSR with a more fine-grained expected goals analysis? I think we can.

Statistical Testing

Using my Shot Matrix database, I can log every shot with a particular xG value. To test whether this is a useful thing to do, I've taken the last four and a half English Premier League seasons and divided them up into nine chunks: the first and second halves of each season, with the second half of the 2013-2014 season obviously being left to the side as it has not yet been completed.

Expected goals produces better predictions of goals scored and allowed with smaller average errors.

As a test, I compare how well different metrics for a club in one chunk predict their goals, goals allowed, and goals ratio in the next chunk. So, for example, in the second half of the 2012-2013 season, Swansea scored a goal per match down the stretch. However, their underlying stats were somewhat better. My expected goals method rated the chances Swansea produced as worth about 1.22 goals scored per game. Based on just league average conversion, which is the theoretical basis of TSR, Swansea's shot total in the second half of the 2012-2013 season would be worth about 1.3 goals per match. So that's the goals, expected goals, and TSR goals from the first chunk.

I then compare those numbers from that chunk to the actual goals scored by Swansea in the first half of the 2013-2014 season (the second chunk). The Swans scored 24 goals in 19 matches, a rate of 1.26 goals per match. So goals was off by .26 goals per match, xG by about .04, and TSR by .04. I do this for goals scored, goals allowed, and goals ratio for every club, for every half-season chunk. That creates a set of 148 comparisons.
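The Swansea comparison can be reproduced directly. The inputs below are the rounded figures quoted above, so the computed errors may differ in the last digit from values derived from unrounded data.

```python
# Compare each first-chunk predictor for Swansea to the actual goals per
# match in the next chunk. Inputs are the rounded figures from the text.
predictors = {"goals": 1.00, "xG": 1.22, "TSR goals": 1.30}
next_chunk_actual = 24 / 19   # 24 goals in 19 matches, about 1.26 per match

errors = {name: abs(pred - next_chunk_actual) for name, pred in predictors.items()}
for name, err in errors.items():
    print(f"{name}: off by {err:.2f} goals per match")
```

Repeating this for every club and every adjacent pair of half-season chunks yields the 148 comparisons.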

Results: Expected Goals Wins

I found that Expected Goals produces better predictions of future goals scored and conceded. First, here are the results for goals ratio, the rate of goals scored to goals conceded. These plots show every one of the 148 pairs, and each includes a line with the "R-Squared" correlation. This is a number between 0 and 1 measuring how tightly the two are correlated. Higher (closer to 1) is better.
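For readers who want the mechanics, R-squared here is the squared Pearson correlation between a first-chunk metric and the next chunk's actual result. A minimal sketch, with five invented data points purely to show the calculation:

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Made-up example: first-chunk xG ratio vs. next-chunk actual goals ratio.
first_chunk_xg_ratio = [0.45, 0.52, 0.61, 0.38, 0.55]
next_chunk_goals_ratio = [0.43, 0.55, 0.58, 0.41, 0.50]
print(round(r_squared(first_chunk_xg_ratio, next_chunk_goals_ratio), 3))
```

A value of 1 would mean the first-chunk metric perfectly orders the next-chunk results along a line; 0 would mean it tells you nothing.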


On this crucial measure, xG solidly outperforms both actual goals scored and conceded, as well as Total Shot Ratio, in predicting future goals scored and conceded.

One thing you can see here, as will be obvious in the later graphs as well, is that xG and TSR both compress the range of expected outcomes. Clubs basically never have a TSR under 0.3 or above 0.7, and xGR only occasionally breaks those marks, while you do see actual goals ratios both higher and lower. The clubs with very high or low goals ratios probably should be expected to regress to the mean, but we would like a goals predictor with a range as close to that of actual goals as possible. I think we are losing some value by not predicting those tails. xG is better than TSR on this count, but I think it can still be improved upon.

[Scatter plots: goals allowed comparison and goals scored comparison]

One thing I like about those xG graphs is the way the line looks like it's about to pass through the origin. If you extend the line, you want it to run straight to (0, 0). This means that a club that projects to score 1.5 expected goals per match actually did score about 1.5 goals per match in the next chunk. That's a happier correlation.

Here are the correlation numbers in tabular form. You can see that the correlations for the expected goals method are higher than those for raw goals and TSR across the board. xG beats the competition by a wider margin on goals allowed than on goals scored. I'm not sure what to make of that, and I think it may be more of a fluke in the data than a real effect. But as I look at xG in other leagues, I'll be watching to see whether this effect is replicated.

Method   Goals Ratio   Goals   Goals Allowed
G           0.517      0.371       0.302
TSR         0.532      0.421       0.340
xG          0.589      0.422       0.433

Another Statistical Test: Errors

While R-Squared tests can be displayed on those kinds of pretty scatter plots, they aren't the best method for evaluating a projection. R-Squared tests the fit of a formula; it doesn't tell you about the errors. What matters for a projection is how often it misses and by how much.

So, I prefer to use error tests to measure predictive tools: root-mean-square error and mean absolute error. Both of these tests basically rate how much I missed by. RMSE is set up to "punish" a method for making big errors, while MAE just takes the average amount by which a prediction missed. In the error tests, there's no particular meaning to an error number in the way that 0.0 and 1.0 have fixed meanings for R-Squared. Lower is better. In the table below, the error for goals ratio is scaled to goals ratio, so that's the amount of error on a scale between 0 and 1. Goals and goals allowed are scaled to goals over a half season, so an error of 5 means the prediction missed by 5 goals over 19 matches.
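Both error measures are one-liners. A short sketch, with hypothetical half-season projections and actuals invented for the example:

```python
import math

def rmse(predictions, actuals):
    """Root-mean-square error: big misses are punished disproportionately."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals))

def mae(predictions, actuals):
    """Mean absolute error: the average size of a miss, sign ignored."""
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

# Hypothetical half-season goal projections for five clubs vs. actual goals.
predicted = [28, 31, 22, 35, 26]
actual = [24, 33, 25, 30, 27]
print(round(rmse(predicted, actual), 2), round(mae(predicted, actual), 2))  # → 3.32 3.0
```

RMSE is always at least as large as MAE; a big gap between the two means a method's misses are dominated by a few large blowups rather than many small ones.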

By both MAE and RMSE, Expected Goals produces better predictions.

             Goals Ratio            Goals           Goals Allowed
Method      G    TSR    xG      G    TSR    xG      G    TSR    xG
RMSE     0.093  0.089  0.082   7.69  6.83  6.72   6.81  5.96  5.52
MAE      0.075  0.071  0.066   5.91  5.09  5.09   5.44  4.79  4.36

So that's why I'm going to use expected goals when I renoobulate my projection engine. Expected goals produces better predictions with a higher correlation to future results and smaller errors in projections.

What's Next?

As I said, I don't think this is the end of the project. Expected goals are more interesting when they start a conversation. They certainly don't end it. For one thing, this has been a correlation to goals rather than to points. In future weeks I'm going to use my projections to test whether the team ratings here project points totals as well as they project goals totals.

There are any number of other pieces of information that can be brought to bear, from the outcome of the shot (did it hit the target or the post, did it miss by a lot or a little, was it blocked?) to the identity of the player shooting, to the score and the time of the game when the shot was taken. Eventually, of course, we'd love to have information on the location of the defenders, but we will have to wait until that data becomes even vaguely publicly available. I do have numbers on "big chances" for some seasons, and I think it's likely that this statistic will add value, too. I'm going to be working to improve my expected goals projections by bringing in as much information as possible. For now, though, I'm starting with expected goals.