What Is the Best Method for Predicting Football Matches?

Once again, expected goals produce better projection results than actual goals, actual points, or total shots ratio. Again, nerdery to follow.

Mike Hewitt

Last week, I wrote up my study showing that an expected goals methodology predicted future goals scored and goals allowed better than other common methods. I got a few responses from people asking: aren't you trying to predict game outcomes? Aren't points the point? Predicting goal totals is all fine and good, but what we want is a method that tells us which clubs are going to win games.

So that's what I did. I decided to use the same simulation method I use for my projections to project past results. I'll create team ratings based on clubs' goals, expected goals, and total shots ratios and I'll simulate seasons to see what these methods predict. Then, because I'm simulating past matches, I can compare the results to the actual results that occurred to see what projections worked best.

If you want to know how well a team is likely to play in the future, you're better off looking at their underlying stats than at their place in the standings.

I'm using my projection engine for two reasons. First, I'm doing these tests to determine what inputs to give that engine in the future, so obviously I should use it to see what works best. Second, the relationship between goals and points is humongously complex. As Howard Hamilton showed in his work on the "soccer pythagorean", having three unequal possible match results creates a weird, non-linear relationship. So instead of dealing with the math, I'm just simulating the games and comparing projected points to real points. There we should expect a simple linear relationship if the projections are good.
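For anyone who wants to see the "simulate the games, count the points" idea in miniature, here is a toy sketch. It assumes each side's goals are independent Poisson draws, which is a common simplification and only an assumption here, since the engine's internals aren't spelled out in this post. The function names, goal rates, simulation count, and seed are all hypothetical.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson-distributed goal count using Knuth's method."""
    limit = math.exp(-mean)
    goals = 0
    product = rng.random()
    while product > limit:
        goals += 1
        product *= rng.random()
    return goals


def simulated_points_per_match(goal_rate_for, goal_rate_against,
                               n_sims=20000, seed=0):
    """Average points per match for a team with the given goal rates.

    Assumption: goals for and against are independent Poisson draws.
    """
    rng = random.Random(seed)
    points = 0
    for _ in range(n_sims):
        scored = poisson_draw(rng, goal_rate_for)
        conceded = poisson_draw(rng, goal_rate_against)
        if scored > conceded:
            points += 3      # win
        elif scored == conceded:
            points += 1      # draw
    return points / n_sims


# Hypothetical rates: a side expected to score 1.8 and concede 1.0 per match.
strong_side = simulated_points_per_match(1.8, 1.0)
even_side = simulated_points_per_match(1.4, 1.4)
```

The simulation sidesteps the awkward three-outcome math: wins, draws, and losses fall out of the simulated scorelines, and the only thing left to compare is projected points against real points.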

Results: Expected Goals Wins

Again. As I did last week, I split my Premier League data into nine half-season chunks and simulated each half-season based on the one that preceded it. So if Liverpool created chances worth about 30% more expected goals than league average in the first half of the 2011-2012 season, I gave them a projected goals rating 30% better than average for the matches in the second half of that season. I do this for all clubs, attack and defense, for each half-season chunk, and then run the projections.
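The rating arithmetic can be sketched in a few lines. The "30% better than average" rating comes straight from the example above; everything else is an assumption for illustration: the league-average figure, the opponent's defensive rating, and the choice to combine attack and defense ratings by simple multiplication are all hypothetical.

```python
def relative_rating(team_per_match, league_avg_per_match):
    """Rating relative to league average (1.30 means 30% above average)."""
    return team_per_match / league_avg_per_match


def projected_goal_rate(attack_rating, opp_defense_rating, league_avg_per_match):
    """Projected goals per match for one side of a fixture.

    Assumption: attack and opposing defense ratings combine multiplicatively.
    """
    return attack_rating * opp_defense_rating * league_avg_per_match


# Hypothetical numbers: league average of 1.35 xG per match,
# and a club that created 1.755 xG per match in the first half-season.
league_avg = 1.35
attack = relative_rating(1.755, league_avg)          # roughly 1.30
rate = projected_goal_rate(attack, 0.90, league_avg)  # vs. a defense allowing 10% less than average
```

A rate like this is what would feed the match simulation for each fixture in the following half-season.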

In the first half of the 2011-2012 season, Liverpool took about 1.22 points per match. Their goal difference was unimpressive, and a projection based on GD expected them to take about 1.28 points per match in the second half of the season. Their underlying numbers were better, however, with xG projecting Liverpool for 1.44 points per match and TSR projecting 1.48. Liverpool actually improved massively in the second half of the season and took about 1.8 points per match. So everyone missed, but xG and TSR missed by less. Do this for every club, every half-season, and you get the results.

Which, as I said, are that expected goals won. The best test for a projection system is the size of the errors. I use two statistical methods to calculate the size of the errors these projections created. RMSE or "root mean square error" punishes the projection system for making really big mistakes, while MAE or "mean absolute error" just averages all the misses together. The errors are expressed in terms of points.

RMSE 6.729 6.513 6.469 6.115
MAE 5.356 5.166 5.248 4.919
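The two error measures in the table can be computed as below. This is a minimal sketch with made-up points totals, chosen only to show how RMSE punishes one big miss more than several small ones; none of these numbers come from the study.

```python
import math


def mean_absolute_error(projected, actual):
    """Average size of the misses, in points."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)


def root_mean_square_error(projected, actual):
    """Like MAE, but squaring the misses first weights big errors more heavily."""
    mean_sq = sum((p - a) ** 2 for p, a in zip(projected, actual)) / len(actual)
    return math.sqrt(mean_sq)


# Hypothetical half-season points for four clubs.
actual = [30, 25, 20, 15]
steady_system = [31, 24, 21, 14]   # off by 1 point on every club
streaky_system = [30, 25, 20, 19]  # perfect on three clubs, off by 4 on one

mae_steady = mean_absolute_error(steady_system, actual)      # 1.0
mae_streaky = mean_absolute_error(streaky_system, actual)    # 1.0
rmse_steady = root_mean_square_error(steady_system, actual)  # 1.0
rmse_streaky = root_mean_square_error(streaky_system, actual)  # 2.0
```

Both hypothetical systems miss by one point on average, but the one that blows a single projection badly has double the RMSE.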

So the projections produced by xG missed by about a half-point less than the projections of the other systems. Over the course of a full season, then, xG should produce projections about a point better per team. That's not bad. Using actual points from the first half of the season is clearly the worst method of projecting future points here. The underlying stats, even simple goal and shot difference, notably outperform previous points in projecting future results. If you want to know how well a team is likely to play, you're better off with the underlying stats than with their actual results.


I've also got the numbers here in graph form. The horizontal axis is projected points, the vertical axis is actual points. We're looking for a nice clean line running from zero to three. A club projected to take no points at all should take no points, and a club projected to win every match should do that.


I have also listed the R-Squared correlation here. As I said, errors are a bigger deal than correlation in this type of study. And as you can see, while TSR actually has the lowest correlation (a little under 0.5), its trendline has a good shape, running nearly into the bottom left of the graph. While goal difference and previous points have slightly higher RSq, their lines don't follow the right path. Expected goals once again wins, with both the highest R-Squared correlation and the line which best follows the correct path from the bottom left to top right corner.
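To see why a high R-Squared alone doesn't settle things, here is a small sketch that fits a least-squares trendline and reports its slope, intercept, and R-Squared. The data is invented for illustration: a hypothetical system whose projections line up perfectly with results but along the wrong line.

```python
def fit_trendline(projected, actual):
    """Ordinary least-squares line of actual on projected.

    Returns (slope, intercept, r_squared).
    """
    n = len(projected)
    mean_x = sum(projected) / n
    mean_y = sum(actual) / n
    sxx = sum((x - mean_x) ** 2 for x in projected)
    syy = sum((y - mean_y) ** 2 for y in actual)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(projected, actual))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical points-per-match data: actual = 0.5 + 0.5 * projected.
projected_ppm = [0.5, 1.0, 1.5, 2.0, 2.5]
actual_ppm = [0.75, 1.0, 1.25, 1.5, 1.75]

slope, intercept, r_squared = fit_trendline(projected_ppm, actual_ppm)
```

In this made-up case the R-Squared is a perfect 1, yet the slope is 0.5 and the intercept is 0.5, so the system overrates strong clubs and underrates weak ones. A trendline with a slope near 1 and an intercept near 0 is the one running from the bottom left corner to the top right.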

As I have said, this is meant to be the beginning of the study, not its end. There are any number of further ways to try to improve xG, tons of data out there to play with. I will be continuing to renoobulate my projections. These studies were meant to demonstrate where I should start, and the answer is clear. Expected goals.