Once more into the breach. We're eight weeks into the English Premier League season, and things are (still) moderately weird, with Chelsea nosediving and Liverpool and Tottenham trailing Leicester City and Crystal Palace. Can Chelsea recover? Are any of the newly risen top four actual contenders for those positions? Is Tottenham in trouble?
To answer these questions, I am running projections. The method will be detailed in full below; nerds are welcome to skip down to the methodology section. Here's a quick version of what expected goals is, from my earlier discussion of the statistic:
Expected goals is a method for estimating the quality of chances that a football team creates or concedes in a match. This is the thing I like a lot about xG. It may take a lot of data crunching to create specific xG values, but the underlying idea makes football sense. How many good chances did a team create?
How many half-chances? Just how "good" were they? How many good chances did they concede, and so on? These are intuitive football questions. When you're following a match, you're watching for the creation of chances, getting excited when it appears a scoring chance might be conjured for your team or getting worried when the other team is building one. We all watch for something like "expected goals." Managers and players devise tactics aimed at creating good chances and at denying them to their opponents.
The current method includes factors such as distance to goal, angle to goal, the type of pass that assisted the shot (cross / throughball / etc), the type of shot (header / foot / other body part), the type of play that leads to the shot (set play / counterattack / established possession) as well as a variety of other factors.
The Expected Goals Table, or, Things Look Good For Spurs
Despite sitting seventh in the table, Tottenham's numbers look strong. This is in direct contrast to last season, when Spurs scraped a fifth-place finish despite consistently bad defensive indicators. With multiple defensive upgrades acquired through transfers or position changes, Tottenham have suddenly placed themselves among the Premier League's best defensive sides, without any weakening of the team's chance creation.
The tables below show the teams with the best defensive and attacking expected goals statistics.
Spurs, Arsenal and City are the only teams in the top five in both tables, with Spurs top three in both. Certainly Arsenal and City have huge advantages in expected goals difference, and the gaps on the xGA table are not huge. But it's still cool.
What these tables show is a few teams separating themselves out at the top and bottom. Manchester City and Arsenal appear to be the class of the Premier League, and it isn't particularly close. The teams from the Northeast bring up the bottom. Newcastle's stats look utterly terrible, but the Magpies have played the league's toughest schedule, while Sunderland have faced relatively few challenging opponents. In context, they're both doing really badly.
Title and Top Four Race
At the beginning of the season, I had City and Chelsea effectively tied at the top. Things have changed. To get to 85 points, Chelsea would need to take 74 from their next 29 matches, which would be a 97-point pace over a full season. That does not seem likely. Even just to get to 72 points would take an 80-point pace (61 points from 29 matches). Right now, Chelsea are legitimately in trouble. The other issue here is that Chelsea's current performance has just not been good. They're barely over 0.500 in expected goals ratio, the worst start by a defending champion in the top four leagues in the Opta era. The closest comparand is David Moyes' United. So... yikes.
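The arithmetic behind those pace figures is simple enough to check. A minimal sketch (the function name is mine, not from the article):

```python
def full_season_pace(points_needed: float, matches_left: int,
                     season_length: int = 38) -> float:
    """Convert 'points needed from remaining matches' into the
    equivalent points pace over a full 38-match season."""
    return points_needed / matches_left * season_length

# Chelsea's situation as described: 74 points from 29 remaining matches
print(round(full_season_pace(74, 29)))  # 97-point pace
print(round(full_season_pace(61, 29)))  # 80-point pace
```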
But Manchester City aren't running away with it. Arsenal have been excellent this season and appear to be legitimate title contenders. I have the title race at about 60 percent Manchester City, 35 percent Arsenal and 5 percent Manchester United. (Those are rounded up, and there is a maybe 2 percent chance of someone from the field, including Chelsea, ending up at the top.)
In the top four race, there is likewise a big mess below the top two. Chelsea's deserving title-winning season last year and high payroll keep them afloat, but as I said it's no sure thing anymore. Manchester United, despite a series of big-money signings, have yet to achieve a consistently high level of play or to show themselves to be of the same quality as the top two. While Liverpool could improve with a new manager, so far the numbers are merely good. With Spurs making a big run there is not much room for Southampton or West Ham to join the top four race, but they're not out of it yet.
At the bottom, I don't even know if it's worth making a graphic. I have eight clubs with a chance of relegation of 10 percent or higher. This is basically the projections throwing up their hands and saying "pick 'em." Sunderland are in really bad shape but beyond that, your guess is as good as mine.
EPL Projections Table
| Team | W | D | L | Pts | GD | Title | Top 4 | Top 6 | Relegation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| West Ham United | 15.0 | 9.8 | 13.2 | 54.7 | +5 | 0% | 9% | 20% | 1% |
| West Bromwich Albion | 10.3 | 10.0 | 17.7 | 40.9 | -18 | 0% | 0% | 1% | 25% |
So those are the projections. It's time for method. Let's go.
I published a method last year which is quite similar, but which I believe I've also improved significantly. The main way I try to improve the method is by watching lots of football and then talking to people about it. The more time I spend watching chances being created and prevented, and the better I get at listening to smart people talk about this stuff, the more effectively I can model it, I think.
So, my touchstone throughout the (extended) process of developing this method was "does this make football sense?" I am by training something of a skeptic of regression methods. It is very easy to find "significant" effects when you run regressions willy-nilly, but that doesn't mean that what you've found is actually a real finding about football. While obviously I had to do lots of regression to create this system, I tried to make sure I was only running regressions when I understood why and how the factors involved related to the creation of better and worse chances in a football match. This has probably led to some infelicities in the math, but I hope it also means that the logic of the system can be communicated reasonably clearly.
My goal was to create as few formulas as possible, but to divide chances up and create different formulas if it seemed that, for football reasons, these were distinctly different types of chances. I ended up with six shot types and six equations. They are these:
- Shots from direct free kicks
- Shots following a dribble of the keeper
- Headed shots assisted by crosses
- Headed shots not assisted by crosses
- Non-headed shots assisted by crosses
- Non-headed shots not assisted by crosses (or, you know, "regular" shots.)
(I should note here that while I'm saying "shots" this also includes what Opta considers "chances missed", where what should have been a shot attempt was blown by the attacking player and no shot was recorded. So I have those as well.)
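Routing a shot record to one of the six buckets could be sketched like this (the field names are my own assumptions, not Opta's):

```python
def shot_category(is_direct_free_kick: bool, dribbled_keeper: bool,
                  is_header: bool, assisted_by_cross: bool) -> str:
    """Assign a shot (or Opta 'chance missed') to one of the six
    model buckets described above."""
    if is_direct_free_kick:
        return "direct_free_kick"
    if dribbled_keeper:
        return "keeper_dribble"
    if is_header:
        return "header_from_cross" if assisted_by_cross else "header_other"
    return "foot_from_cross" if assisted_by_cross else "regular"

print(shot_category(False, False, True, True))    # header_from_cross
print(shot_category(False, False, False, False))  # regular
```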
The logic in breaking them up is that each one has a different "curve," a different relationship between the likelihood of scoring a goal and the distance and angle from the goal. For shots assisted by crosses, for instance, the angle off goal is not as important as it is for other shots. It's just hard to strike a crossed ball cleanly, whether you're directly in front of goal or well off at an angle, and when you do manage to make good contact you have a reasonably good chance of beating the keeper regardless. Angle still matters, but it doesn't matter as much. Similarly, headed shots have distinctly different curves for distance based on whether they're assisted by a cross. It's much easier to knock home a rebound with your head from four yards out than to score off a cross. But get to 12 yards out and finding the power in just your neck muscles to beat the keeper is not so easy, whereas if it's a crossed ball you can use the natural momentum of the ball.
Then within these six categories I use a variety of factors.
Angle and Distance
This is of course the beating heart of every expected goals system. It's easier to score when you're closer to goal, and it's easier to score from straight on to goal than from wide. But modeling these effects is kind of maddening.
Let me show you the bane of my existence. Well, a bane. Let's say, a thing that has, perhaps through my own fault, cost me a lot of time over the last two or three years. This is a chart showing the likelihood of a (non-headed, not assisted by a cross) shot being scored based on location.
The rate at which shots are converted is clearly related to two factors. First the distance from goal, and second the angle from goal. Being ten yards from goal at a 45 degree angle is much, much worse than being ten yards out and straight on. But there's a bunch of funkiness here. The dark red zone extends out from goal only in the center. The wide areas of the six-yard box become very low value very quickly. But there are reasonably good quality shots to be had from the wide areas of the danger zone, in the six-yard box extended, that aren't directly on to goal.
I have tried many, many ways to create a single formula to capture this shape. I have decided it can't be done in any simple manner. The problem is that shot selection plays a big role. Players in the wide areas of the 18-yard-box will typically only decide to shoot if better options aren't available. But if you get on the end of a chance in the 6-yard-box, even if it's not the greatest, usually you will shoot. Further, if you're at a very difficult angle, you're unlikely to attempt a shot unless you're undefended or the keeper has gotten off their line. So player decision making is a major third factor, along with distance and angle. The equation becomes then exceptionally messy.
So instead of creating a single term for distance, my formula includes up to five. They are the distance to goal, the inverse of distance to goal, the relative angle to goal and its inverse, and the inverse of the product of the angle and the distance.
To calculate relative angle to goal, I compare the angle of the shot location to straight on. If a player is in a central position, between the goalposts, the relative angle is 1. If a player is wide of the posts, I take the angle from their location to the nearest post. For instance, if a player is at a 45 degree angle to the nearest post, the relative angle is 0.5.
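My reading of that description, sketched in code. The coordinate convention and the exact trigonometry are my own assumptions about the implementation:

```python
import math

GOAL_HALF_WIDTH = 4.0  # yards; the goal mouth is 8 yards wide

def relative_angle(x: float, y: float) -> float:
    """x = yards upfield from the goal line, y = lateral yards from the
    centre of the goal. Returns 1.0 between the posts, scaling down
    toward 0.0 as the angle to the nearest post approaches 90 degrees."""
    if abs(y) <= GOAL_HALF_WIDTH:
        return 1.0
    theta = math.degrees(math.atan2(abs(y) - GOAL_HALF_WIDTH, x))
    return 1.0 - theta / 90.0

print(relative_angle(12, 0))  # dead central: 1.0
print(relative_angle(6, 10))  # 45 degrees to the near post: ~0.5
```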
I found that different combinations of distance and angle were needed for each of the six different categories. That's why I broke them up.
Then we add more factors. It gets complicated, of course, but as I added factors I focused on the underlying football reality being modeled. The biggest problem that all expected goals systems based on Opta data share is the lack of defender positioning information. It is obviously the case that not being defended is better than being defended. But we don't know, from the data, whether a shooter was being defended. So most of the following factors are attempts to model defender location from secondary markers.
How The Shot Was Assisted
In the case of shot-assists, the big thing we can model is "defender elimination." That is, can we assume or at least reasonably propose that the pass has removed opposition defenders from the play and made the following scoring chance easier?
I've discussed elsewhere two of the best kinds of assists: throughballs and danger zone passes. I've added a third feature here: shots following throughball second assists. If a player is put through with a defense-splitting pass and then lays the ball off to a teammate, that's usually a huge chance. Just as a throughball puts a player behind the defense and eliminates nearly all the opposition defenders, a throughball second assist then almost always isolates a player against the keeper. Finish your runs, boys and girls.
To model "danger zone passing" I've used a continuous system. This means that the better an attacking position you're in, the more value you add to the xG of the next shot by passing. Think of this as a "tiki-taka" factor. If a player is in a good attacking position, the defense should be on him. If he can complete a pass from there, it will eliminate at least those defenders who were focused on him. Making the next pass is not necessarily a good idea (losing the ball from a position where you could have made a quality attempt on goal is bad), but when that extra pass can be completed, it typically leads to better chances. (Barcelona unsurprisingly dominate the best expected goals per shot table, with all five of the best seasons in the last five years.)
The following two graphics show the odds of a shot being scored based on the location of the assist pass. There is a large effect for crosses as well as for regular passes.
Then there are two more kinds of passes included. The first is the cutback. This is a negative factor. That might surprise you. This does not mean that cutbacks are bad. Far from it. Shots assisted by cutbacks are all open-play shots taken with the feet from the danger zone, the area in the center of the 18-yard-box from which most goals are scored. That's a great thing! You start at a very high level with cutback assists.
But in terms of defender elimination, they're not ideal. The concept of a cutback is to get an attacker free on the baseline, draw the defense deep, and then play the ball back to a teammate between the lines, in the danger zone. So even in the best-case scenario, we expect there to be multiple defenders, as well as the keeper, between the shooter and the goal. And in many cases, the player who receives the cutback won't be free between the lines but will be defended as well.
The cutback factor is not massively negative. But it is real, and I think this makes sense. You would rather get a ball by the goal-mouth with a defender-eliminating throughball or interior pass, but that's incredibly hard to do. Against a set defense, a cutback is far more efficient than a cross, even if it does not usually create the very biggest chances.
The final kind of pass is yet another Barcelona specialty. This is the pass "across the face" of goal, where a player in a wide area plays a square ball through the middle of the box. These are very difficult passes to play, since they are quite easy to defend—you expect a central defender to be protecting precisely this zone when the ball is in a wider area—which means they can usually only be completed when the defense is not properly organized. So a completion across the face of goal indicates that the defense has been broken down, and as such that defenders have been eliminated at some point.
(As I noted above, shots from cross assists are so different, and so much worse than shots assisted by regular passes, that I separated them to a different bucket entirely. With crosses the issue is not defender elimination but rather just the difficulty of striking a crossed ball.)
Type of Attacking Play
I distinguish between several types of play, extrapolated from the underlying data. These are (1) play from a corner kick, (2) play from a free kick, (3) counterattacks and fast breaks, (4) established attacking zone possession and (5) regular open play.
Corner kicks are the most heavily defended actions in football. Even the most aggressive teams station seven or eight players deep in their own box. As such, there is a notable negative effect on xG from corner kicks. Free-kick set plays are also well defended, but not as heavily as corners.
The best kind of attacking play, for xG, is the counterattack or fast break. When you can get into the open field at speed, you can attack a defense that is out of position and scrambling. I use two markers for this type of play. The first is the "fast break" as marked by Opta coders. They know when a team is breaking down the field at speed and they code play as such. That's very useful! But Opta is very strict in its definition of a "fast break"—Heung-Min Son's goal against Crystal Palace was not coded as a fast break despite it clearly being a counterattacking action—so I created a secondary marker of a counterattack. These are actions that begin with an open play turnover of possession, in which the attacking team moves steadily forward to goal without recirculating the ball. I found that once I took counterattacks and fast breaks into consideration, the raw "speed of attack" became no longer a significant factor.
This is the logic, I think. We don't know how fast a team needs to move to stay ahead of the defense. If the defense is reasonably well-positioned to get back into its shape, you may need to move the ball vertically very quickly, say 6-8 yards per second. But if the defense is far out of position, an attack of 4-5 yards per second will be just as effective. The better marker is whether the defensive team is able to prevent steady forward movement to goal, not the exact speed of the attacking move.
Finally, I found that established attacking zone possession is a small positive factor. I defined this as an attack that involves at least five completed passes in the attacking half without the ball being forced back into the defensive zone. A move that dominates space like this tends to lead to better chances than other open play attacks.
Other Indicators of Defensive Pressure
There are two really big indicators that Opta provides. The first is the "big chance," defined as a scoring chance the player would be expected to score. This is an overbid (about 40 percent of open play big chances are scored), but these are chances where the defense has been beaten. That a chance is marked as "big" tells us precisely that defensive pressure is close to nil and a player has a great opportunity to score.
The other big indicator is the defensive error. If a shot is attempted shortly following a "defensive error" it is again highly likely to be scored. When defenders screw up and gift the ball to an attacker in a good area (Opta is very strict in what it calls a "defensive error" and this needs to be a pretty huge blunder), it is unusual that they can snap back into place and properly challenge the shot.
In both of these cases, I'm a little bit skeptical of the size of the effect. I think Opta coders do amazing work, and I know that their decisions are always double-checked as well. But it's hard for me to believe that there would be zero outcome bias, that a shot being scored wouldn't make the preceding chance look bigger, or the preceding defensive mistake look worse. So I have chosen to knock down the regression value given to both of these factors slightly.
The next largest factor here is dribbling. A completed dribble, by definition, removes from play the defender who was on the ball. A player who can create space with a dribble has a notably better chance of scoring. I found further that the distance toward goal that a player runs with the ball (the distance between where the player picked up the ball and the location of the shot) also correlates with goal-scoring. If you can carry the ball toward goal, it is likely that either the defense isn't properly set or that you are beating some defenders.
I also include a rebound effect. When a shot is saved but not controlled, or when a shot hits the crossbar, the attacking player who jumps on the ball typically has a better chance of scoring. That's pretty straightforward. (To avoid double-counting chances from the same move, I only give a team credit for the highest-xG chance in a single move. So if there are a bunch of rebound attempts in a row, the team battering the goal won't get credit for more than one expected goal.)
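That dedup rule can be sketched like this (the `(move_id, xg)` representation is my simplification of the underlying data):

```python
def team_move_xg(shots):
    """shots: iterable of (move_id, xg) pairs for one team.
    Credit only the highest-xG chance within each move, so a flurry
    of rebounds can't add up to more than one expected goal."""
    best = {}
    for move_id, xg in shots:
        best[move_id] = max(best.get(move_id, 0.0), xg)
    return sum(best.values())

# Two shots in move 1 (a shot and its rebound), one shot in move 2:
print(round(team_move_xg([(1, 0.30), (1, 0.55), (2, 0.10)]), 2))  # 0.65
```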
Finally, there is game state. I found game state to be significant only for regular shots; accounting for other factors cleared out game state effects among headers, free kicks and shots assisted by crosses. With regular shots there is still a very small effect, which I believe reflects slight, still unaccounted-for differences in the defensive pressure applied by teams trailing or leading a match.
Player Finishing Skill
Here's a big change. Football analysts have been arguing for a while that conversion rates are massively variable. Anyone who has watched Alexis Sanchez and Sergio Aguero wend through slumps and back into form this season knows that you can't predict how well a player will take on his next shot based on what he did with the previous ten or so. (This is also why I expect Harry Kane to get off the schneid sooner rather than later.)
At the same time, the idea that Manchester City fans shouldn't care whether Aguero or Jesus Navas gets on the end of a chance is equally ludicrous. The problem, statistically, is not that "there is no such thing as finishing skill" but rather one of sample size. Last season in the NHL, Alexander Ovechkin attempted 795 shots. In the NBA, Stephen Curry tossed up 1341. The comparison to football is instructive. Over the last six years combined, the only player who has attempted over 1000 non-penalty shots in league play (in the big five leagues) is Cristiano Ronaldo. Lionel Messi has attempted 992, and next are Antonio Di Natale and Zlatan Ibrahimovic at around 750 shots. So we just have far, far fewer shots on which to evaluate player finishing skill.
But with a six-year sample, I think we can begin to estimate player skill. For each player with at least 100 shot attempts, I split their shots into two equally sized random samples a bunch of times. Then I checked the correlation between the conversion rate in the first random sample and that in the second. I found there was basically no correlation in finishing rate on split samples of 50-100 shots, but once you get to the few players with well over 250 shots, there is a real effect.
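A toy version of that split-sample exercise, run on simulated rather than real shot data. The player counts, skill range and conversion rates here are my own illustrative assumptions, chosen only to show how the split-half correlation grows with shot volume:

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

def split_half_correlation(n_players, shots_per_player, rng):
    """Simulate players with true conversion skill between 5% and 15%,
    split each player's shots into random halves, and correlate the
    two halves' conversion rates across players."""
    first_half, second_half = [], []
    for _ in range(n_players):
        skill = rng.uniform(0.05, 0.15)
        shots = [1 if rng.random() < skill else 0
                 for _ in range(shots_per_player)]
        rng.shuffle(shots)
        half = shots_per_player // 2
        first_half.append(sum(shots[:half]) / half)
        second_half.append(sum(shots[half:]) / half)
    return pearson(first_half, second_half)

rng = random.Random(42)
print(split_half_correlation(200, 60, rng))   # small samples: weak correlation
print(split_half_correlation(200, 600, rng))  # large samples: clearly real signal
```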
We are of course not yet at the level we want to be. I'm looking for a correlation of about 0.7. To make a player adjustment, then, I need to assume that the correlation will continue to improve as the number of shots in the sample increases. I chose to go with this assumption for two reasons.
First, player finishing skill is real. This is both something that's just kind of obviously true, and it's something that I've shown elsewhere by expanding the sample by bucketing players together. I want to account for real football things if I can.
Second, player finishing skill is the primary driver of the "super team effect" in expected goals. Barcelona, Real Madrid and Bayern Munich all outperform typical versions of expected goals by large margins. But if I include a player finishing adjustment, most of that outperformance disappears. I have shown that players who finish chances brilliantly for these superclubs also finish their chances at high rates on other teams. So these top sides tend to finish their chances for the simple reason that they have collected players who nearly all, individually, shoot well. This is, I think, good prima facie evidence that the shooting skill adjustment captures something real.
I regress player finishing skill by adding a "weight" of 75 goals and 75 expected goals to a player's tally. If the player still shows up as a top finisher with all that regression weight added on, they're typically really good. (Lionel Messi and Yaya Toure show up as the world's best, with Jesus Navas among the very worst, so I'm pretty happy with how that turned out. Watch for the full article on this coming later this week.) Each chance is then adjusted based on the finishing skill of the player attempting it.
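A sketch of that regression weight as a multiplier (the function name and the example numbers are mine, not the article's):

```python
def finishing_multiplier(goals: float, xg: float, weight: float = 75.0) -> float:
    """Regressed finishing skill: add 75 'league average' goals and 75
    expected goals to the player's six-year tally before taking the
    ratio. Each chance's xG is then multiplied by this factor."""
    return (goals + weight) / (xg + weight)

# A hypothetical elite finisher: 180 goals from 140 xG
print(round(finishing_multiplier(180, 140), 3))  # modest boost despite the gap
# A perfectly average finisher moves the needle not at all
print(finishing_multiplier(50, 50))  # 1.0
```

The heavy 75/75 prior means even a large over-performance only nudges each chance's value, which matches the small-sample caution above.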
League Adjustments
These adjustments sort of make me sad, but I haven't figured out a way around them. Shots in the Bundesliga are simply scored at a higher rate. Headed shots in the English Premier League become goals much more rarely than they do in La Liga. There appear to be real differences in play and in shot selection between the leagues that can't be fully accounted for in expected goals without such adjustments.
The Formulas
Here they are. In all cases I used a logistic regression, which produces a formula of the type 1 / (1 + e^-(formula)). This produces a graph that looks like this:
(Thanks to UCLA for the image.) This graph has asymptotes at 0 and 1, just as the odds of a shot being scored are never exactly 0 and never exactly 1. In between, it increases at a roughly exponential rate at the low end, which tends to be a good model for the relationship of distance and angle to expected goals.
For each of the following I list the "formula" section of the logistic equation. The full logistic equation is then multiplied by the player adjustment, with regression at the ends to prevent shots from registering above 1 or below 0.
Regular Shots: (-3.19 - 0.095 * distance + 3.18 * inverse_distance + 1.88 * relative_angle + 0.24 * inverse_angle - 2.09 * inverse_dist*angle + 0.45 * throughball_assist + 0.64 * throughball_2nd_assist + 0.31 * assist_across_face - 0.15 * cutback_assist + 2.18 * inverse_assist_distance + 0.12 * assist_angle + 0.23 * fast_break + 0.18 * counterattack + 0.09 * established_possession - 0.18 * following_corner + 1.2 * big_chance + 1.1 * following_error + 0.39 * following_dribble + 0.14 * dribble_distance + 0.37 * rebound + 0.03 * game_state + 0.07 * Bundesliga - 0.1 * EPL - 0.09 * LaLiga - 0.07 * SerieA)
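Plugging the regular-shot coefficients into the logistic function looks roughly like this. Only the coefficients come from the formula above; the example feature values, and the truncation of the coefficient list, are mine:

```python
import math

# A subset of the "regular shot" coefficients from the formula above.
REGULAR = {
    "intercept": -3.19, "distance": -0.095, "inverse_distance": 3.18,
    "relative_angle": 1.88, "inverse_angle": 0.24, "inverse_dist_angle": -2.09,
    "big_chance": 1.2, "following_error": 1.1, "following_dribble": 0.39,
    # ... remaining coefficients omitted for brevity
}

def xg(features: dict) -> float:
    """Logistic model: xG = 1 / (1 + e^-(linear formula))."""
    z = REGULAR["intercept"] + sum(
        REGULAR[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# A shot from 12 yards, dead central (relative angle = 1), nothing special:
shot = {"distance": 12, "inverse_distance": 1 / 12, "relative_angle": 1,
        "inverse_angle": 1, "inverse_dist_angle": 1 / 12}
print(round(xg(shot), 3))  # ~0.107
```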
One further note on cutbacks. There is an interaction effect between "assist distance" and cutbacks. Since cutbacks are typically played from areas close to goal, most of the negative cutback effect is typically canceled out by the positive effect of the pass location. I define "dribble_distance" as the percentage of the distance to goal that the player covered while running with the ball before attempting a shot. "Assist distance" and "assist angle" are the distance and relative angle to goal from the location where the assist pass was played, calculated just as for shot attempt locations.
I excluded headed passes from the "assist distance" / "assist angle" calculation because, well, headed passes don't really eliminate defenders. They just sort of go over some of them.
The positive value of "inverse angle" helps to account for the higher-than-expected conversion rate of shots attempted from very tight angles. Since these are usually attempted only when the defense is not prepared, they are converted much more often than a simple distance * angle method can account for. That's why I've got so many distance and angle factors here.
Headed Shots Assisted by Crosses: (-2.88 - 0.21 * distance + 2.13 * relative_angle + 4.31 * inverse_assist_distance + 0.46 * assist_angle + 0.2 * fastbreak + 0.11 * counterattack + 0.12 * set_play - 0.24 * corner - 0.18 * otherbodypart + 1.2 * big_chance + 1.1 * following_error + 0.18 * EPL + 0.15 * LaLiga)
Occasionally a player will try to chest a cross into goal. This isn't easy to do, and is basically impossible to do with any power. So there's a small negative effect there. It's small because you basically only would ever try it when you're in a pretty good position to pull it off.
You can see how the much larger "distance" factor here means that shots assisted by crosses are only quality attempts from very close to goal. I don't honestly know what's going on with the La Liga crosses effect. I debated including it at all. It kind of goes against my stated theories. But it's so persistent I think it must be real. It could be an artefact of the coding of chances? I don't know.
Non-Headed Shots Assisted by Crosses: (-2.8 - 0.11 * distance + 3.52 * inverse_distance + 1.14 * angle + 0.14 * assist_across_face + 6.94 * inverse_assist_distance + 0.59 * assist_angle - 0.12 * corner + 0.24 * fastbreak + 0.11 * counterattack + 1.25 * big_chance + 1.1 * following_error - 0.2 * EPL)
The much smaller "angle" adjustment here reflects how the angle to goal matters much less with footed shots assisted by crosses than with other shots. If you can get a good strike on the ball, you can beat the keeper from angle almost as easily as from straight on.
Headed Shots Not Assisted by Crosses: (-3.85 - 0.1 * distance + 2.56 * inverse_distance + 1.94 * relative_angle + 0.51 * throughball_assist + 0.44 * fastbreak + 0.26 * counterattack + 0.7 * rebound + 0.44 * established_possession + 1.14 * otherbodypart + 1.3 * big_chance + 1.1 * following_error - 0.29 * EPL - 0.24 * LaLiga - 0.26 * SerieA)
There are huge type-of-play effects on headed shots not assisted by crosses. It's typically only when the defense is out of position that these are reasonably good chances. So counterattacking or playing a rebound or getting free for a weird little interior pass after a long possession are big effects.
Shots from Direct Free Kicks: (-3.84 - 0.1 * distance + 98.7 * inverse_distance + 3.54 * inverse_angle - 91.1 * inverse_distance*angle)
Free kicks are weird. The angle of the attempt matters, but it's not necessarily bad to be at an angle. This method gives positive and negative angle effects based on location on the pitch, which is the best I can do right now for modeling these.
Shots Following a Dribble of the Keeper: (-0.61 - 0.09 * distance + 7.4 * inverse_distance + 1.04 * angle - 3.2 * inverse_distance*angle + 1.1 * big_chance + 0.67 * following_error)
I separate out shots following a keeper dribble because the defensive situation is known. There isn't a keeper. There may be some outfielders trying to block the shot, but it's just a totally different situation and thus a different curve than any other kind of shot.
Those are the formulas. Enjoy.
I run a Monte Carlo simulation, "playing" every remaining match in the season a million times and collating the results. I project matches using a random sample from a bivariate Poisson distribution for goals, as suggested by Dimitris Karlis and Ioannis Ntzoufras. (Karlis and Ntzoufras, "Analysis of sports data by using bivariate Poisson models," The Statistician 52 (2003), 381-393.)
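A minimal sketch of one bivariate-Poisson match simulation. The lambda values below are illustrative, not fitted; the construction X = X1 + X3, Y = X2 + X3 with a shared component X3 follows the Karlis and Ntzoufras setup:

```python
import math
import random

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication algorithm for a Poisson variate
    (fine for the small lambdas of football scorelines)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_match(lam_home, lam_away, lam_shared, rng):
    """Bivariate Poisson: the shared component makes the two scores
    positively correlated rather than independent."""
    shared = poisson_sample(lam_shared, rng)
    return (poisson_sample(lam_home, rng) + shared,
            poisson_sample(lam_away, rng) + shared)

rng = random.Random(1)
results = [simulate_match(1.4, 0.9, 0.15, rng) for _ in range(20000)]
home_wins = sum(h > a for h, a in results) / len(results)
away_wins = sum(h < a for h, a in results) / len(results)
print(home_wins, away_wins)  # the stronger side wins clearly more often
```

Repeating this over every remaining fixture and summing points gives one simulated season; a million repeats give the probabilities in the table.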
For team quality, I use a combination of this season's expected goals, last season's expected goals and team payroll. Payroll functions as my "regression to the mean" factor. We don't expect every EPL team to regress to the same mean, but we expect richer clubs to do better because, in fact, that's precisely what they do.
I combine the factors this way. I found that year-to-year it's easier to predict goals scored than goals conceded, no matter what method I use. Tactical adjustments and player relationships can have huge effects on goals conceded, and these can change a lot year to year. (Any Spurs fan can see how the knock-on effects of just one good CM, and perhaps benching one or two bad CMs, have changed this club's defense entirely.) So I weight the current season's xGA more heavily than the current season's xG to account for this difficulty.
With xG I start with an 0.8 weight on last year's xG and an 0.2 weight on payroll, and I increase the weight on this season's xG by 0.03 per week until the weight on last year's xG has been entirely replaced. With xGA I start with a weight of 0.7 on last year's xGA and 0.3 on payroll, but I increase the weight on this season's xGA by 0.045 per week instead of 0.03, and I continue to increase until I get to a ratio of 0.8 for this season's expected goals and 0.2 for payroll.
So right now after nine weeks, for the attacking rating, we have 0.27 for this season, 0.53 for last season and 0.2 for payroll. For the defensive rating it's 0.4 for this season, 0.35 for last season and 0.25 for payroll.
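One weighting scheme consistent with those quoted numbers, as I read them: the current-season weight grows linearly, and is drawn down proportionally from last season's weight and from the portion of the payroll weight above its 0.2 floor. This is my reconstruction, not a formula given in the article:

```python
def season_weights(week, weekly_rate, start_last_season, start_payroll,
                   payroll_floor=0.2):
    """Return (this_season, last_season, payroll) blend weights
    after `week` weeks of the season."""
    max_this_season = 1.0 - payroll_floor          # 0.8 in both cases
    this_season = min(weekly_rate * week, max_this_season)
    frac = this_season / max_this_season           # fraction of the blend completed
    last_season = start_last_season * (1 - frac)
    payroll = start_payroll - (start_payroll - payroll_floor) * frac
    return this_season, last_season, payroll

# Week 9, attacking (xG): starts at 0.8 last season / 0.2 payroll
print(season_weights(9, 0.03, 0.8, 0.2))   # roughly (0.27, 0.53, 0.20)
# Week 9, defensive (xGA): starts at 0.7 last season / 0.3 payroll
print(season_weights(9, 0.045, 0.7, 0.3))  # roughly the quoted 0.4 / 0.35 / 0.25
```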
Testing Expected Goals
So does this work? I ran a bunch of tests and I can proudly say "pretty much."
I produced the expected goals system using data from the big five leagues (England, Spain, Germany, Italy, France) from 2010-2011 through 2013-2014. Then I tested it on 2014-2015. I got a root mean squared error of 0.258 on the in-sample data and also 0.258 on the out-of-sample data. Not bad! A study by Michael Bertin found an RMSE of 0.294 using a single-value xG. Martin Eastwood built a (very cool) xG model using support vector machines and got an RMSE of 0.265. Obviously, what level of complication is worth the improvement in error rate is a judgment call. But I'm happy with this.
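For reference, I take the RMSE here to be computed shot by shot against the binary scored/not-scored outcome; a sketch under that assumption:

```python
import math

def shot_rmse(xg_values, outcomes):
    """Root mean squared error of per-shot xG predictions against
    0/1 outcomes (1 = goal, 0 = no goal)."""
    n = len(xg_values)
    return math.sqrt(sum((p - s) ** 2
                         for p, s in zip(xg_values, outcomes)) / n)

# Tiny illustration: two coin-flip chances, one scored and one missed
print(shot_rmse([0.5, 0.5], [1, 0]))  # 0.5
```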
Then I wanted to test its predictivity. Studies of this sort have already been run variously, by myself (here and here), by Sander of 11tegen11 (here and here) and by Colin Trainor (here). Most of these studies have worked within seasons, and I wanted to test also how well expected goals predicts future performance between seasons. After all, with nine weeks of the season complete, we don't assume that Chelsea will remain mid-table; we assume they'll improve, because we know these same players were very, very good together last year. But I did not want to just start in 2010-2011 and move forward, because that would privilege those few games that happened to open that season.
So I thought, what if I start everywhere? I made for each club a list of all their matches in the last five seasons, ordered chronologically. Then I tested, starting at any match, how well performance in that match predicted performance in the next 20 matches; then how well that match and the next one combined predicted the following 20; then that match and the next two against the following 20, and so on. The first thing I found is that expected goals does the best at predicting future goal difference and future points totals.
Expected goals ratio starts better early and remains the best predictive tool as the sample increases. One notable point here is how the pink "points" line and the yellow "goals ratio" line start slowly, but after about 12-15 matches goals ratio does about as well as any of the other statistics, bar expected goals, at projecting future performance. Total shots ratio is reasonably predictive early on, but drops off even below plain points as the season continues.
But this differs from league to league. Total shots ratio in particular "works" early on in the EPL, but not in the other big three leagues.
(You can click on the image to see the very large full version.)
As you can see, the strongest effects are seen in La Liga and the EPL. In the Bundesliga, expected goals is best through about two-thirds of a season, but goal difference then catches up. In Serie A shots on target are highly useful, though again expected goals does better overall.
Overall, if you want to apply one statistic league to league for predictive use, expected goals is definitely the one to use. I'm pretty happy with this.
This is the End
We're at nearly 6,000 words, so I'd hope so.
Here's the problem. I've covered the top four leagues. But I also have Ligue 1. And in the French league, the graph is an ugly mess.
Not only do we see expected goals just in a big messy scrum with all the others, but more importantly we never get an R-squared much over 0.4. Basically, none of the shot statistics really "work" in Ligue 1. While the sample is smaller, and I haven't done anything like conclusive work on this, I'm finding similarly worrisome graphs for the Eredivisie and MLS. It's my hypothesis right now that at lower levels of football, expected goals and the shot statistics are not the way forward.
So this gives me the direction for future research. I won't stop working with shots, and I won't stop writing about shots. I still think these are highly useful statistics, at least for the big four leagues and for top teams. There's no question that at the individual and team level, shots and expected goals have a lot of value. And I certainly will keep doing expected goals maps. I love expected goals maps. They're fun.
But I think we can see that just as "Total Shots Ratio" seems to reach the limit of its utility when it leaves the EPL, so likewise we can see that Expected Goals works mostly at the highest level. This suggests to me that other statistics are needed not only in the lower leagues but also in the EPL and elsewhere. We want our statistics to "work" at as many levels as possible, and probably the ones that work more universally will also be more effective individually.
I am working on team statistics involving more events than just shots, as well as player statistics using those components. I'm not there yet, since this took kind of forever, but that's the direction I am now headed.
All data provided by Opta.