In my last piece, I explained why I want to build club quality statistics off of the underlying stats, things like shots on target, shots in the box, and big chances. During the last season, while I was running power rankings and season projections, I was using a goals scored / goals conceded estimator built from just these underlying stats. I primarily used shots on target and big chances, with adjustments for schedule difficulty.
I want to use the minute-by-minute database to renoobulate my expected goals formula. If possible, I'd like to get away from using "big chances." I used them last season because big chances correlate very well with goals scored, and if you adjust a team's G/SoT based on their big chances, you get a better correlation between expected goals and actual goals. We use the tools we have, if they work.
But I don't like big chances. Opta Sports defines a "big chance" as follows.
A situation where a player should reasonably be expected to score usually in a one-on-one scenario or from very close range.
This is a mostly subjective definition. For an in-game illustration of my problems with it, take the home match against Everton toward the end of the season, when we were missing Gareth Bale. Both of the goals for Tottenham Hotspur, by Gylfi Sigurdsson and Emmanuel Adebayor, were classified as converted big chances. Sigurdsson's open-net finish of a rebound is about as classic a "big chance" as you'll see. Adebayor was at full stretch, but he did find himself clear on the end of a beautiful cross in the box. Those are ok, and give you a sense of what a "big chance" is. The other big chance in the game troubles me. Phil Jagielka's equalizing header also gets classified as a "big chance." Yes, it was a free header from reasonably close range, but that was a tight angle and free headers go every which way all the time in this sport. Would Jagielka have been credited with a "big chance" if he'd knocked it wide of the post or softly into the chest of Hugo Lloris? I don't know, but I am concerned. Further, after Sigurdsson's equalizer, Everton had a major opportunity on the counter which Victor Anichebe just failed to convert.
This example isn't meant to be definitive, but I hope it gets at my concerns. Do "big chances" correlate with goals scored merely because they're a really good statistic on their own, or do they also correlate with goals scored because Opta stringers are more likely to classify chances as "big" if they are converted into goals?
But at the time, I didn't have anything better to use. Now, thanks to the minute-by-minute database, I think I do.
When I argued during the season that I thought Tottenham had been unlucky in their goal-scoring, one of the common responses had to do with shots in the box. Tottenham took tons of shots from outside the box, more than anyone else in the EPL. They weren't converting because they were shooting from so far away. I said in response that I had found a mostly poor correlation between shots in the box and goals scored, certainly much weaker than the correlation with shots on target. In most publicly available slices of the Opta data, you can get information about team shots in the box or about team shots on target, but not about the combination of the two. So since I didn't have data on shots in the box on target, I didn't use that.
I have the data now, and I think it's really important. Shots in the box on target are converted at a way, way higher rate than shots out of the box on target. For SiBoT, it's 35%. SoBoT are about one-third as likely to be scored, at a 12% conversion rate. Spurs not only led the Premier League in shots outside the box on target with 109, we were lapping the field. Queens Park Rangers were second with 76, Liverpool third with 70. You expect a club with that many of their SoT coming from outside the box to convert at a lower rate than the average club, and that's exactly what Tottenham did.
There's a second way in which SiBoT are really useful. They correlate extremely well with big chances. The correlation for attack is .90, for defense .78. Shots in the Box on Target are a good proxy for Big Chances, but without all that problematic subjectivity.
It makes sense, when you think about it. Imagine a big chance in your head. It's going to be an opportunity inside the 18-yard box. And then, with SiBoT, you get to group together just those big chances that someone was skilled enough to direct to the mouth of goal. Along with the big chances, you'll get mostly opportunities that perhaps aren't as "big," but they probably had to be pretty good. Only about 35% of shots in the box end up on target; most either miss the target or get blocked on the way in. Finding an open shooting lane from a reasonably short distance, and then finding the time and space to direct the ball down that shooting lane, usually requires at least a "half chance."
Renoobulating the Expected Goals Formula
For the past few months, I have been running spreadsheets using an expected goals formula based on shots on target, big chances, and shots in the box. I want to test a different expected goals formula, one based only on shots on target in the box and shots on target outside the box. As a control, I will also use a dummy formula based only on shots on target.
This is the part where we get into the nerdery, and it's going to take a few paragraphs. You can skip on down to the "preliminary expected goals" table if you just want to see the results.
Ok, for the six of you still reading, these are the three formulae for expected goals. SoT are shots on target, SiBoT and SoBoT are shots in or out of the box on target. Lg G/SoT is the league average for goals per shot on target. Lg G/SiBoT you can extrapolate. BC+ and SiB+ are the rates at which teams produce big chances and shots in the box, compared to league average. So if a club has 60 big chances, compared to a league average of 50, their BC+ will be 1.20.
1) SoT * ((.6 * Lg G/SoT) + (.3 * Lg G/SoT * BC+) + (.1 * Lg G/SoT * SiB+))
This was my old formula. I adjust the rate of goals scored per shot on target according to the club's rate of big chances and shots in the box. (I "derived" the 60/30/10 weights by trial and error. I wanted to use round numbers to avoid over-fitting.) As you can see, Big Chances make a big difference here.
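For anyone who wants to poke at formula (1), here is a minimal Python sketch of it. The function names, the 150 shots on target, and the 0.28 league G/SoT are placeholders I made up for illustration; only the 60/30/10 weights and the 60-versus-50 big chances example come from the text above.

```python
def plus_index(team_total: float, league_average: float) -> float:
    """Rate relative to league average, e.g. BC+ or SiB+ (1.0 = average)."""
    return team_total / league_average


def xg_bc_formula(sot: float, lg_g_per_sot: float,
                  bc_plus: float, sib_plus: float) -> float:
    """Formula (1): SoT times a G/SoT rate adjusted by BC+ and SiB+.

    Lg G/SoT is factored out of the three terms, which is algebraically
    the same as the formula as written.
    """
    adjusted_rate = lg_g_per_sot * (0.6 + 0.3 * bc_plus + 0.1 * sib_plus)
    return sot * adjusted_rate


# 60 big chances against a league average of 50, as in the text.
bc_plus = plus_index(60, 50)

# Hypothetical inputs: 150 shots on target, league G/SoT of 0.28,
# and a league-average rate of shots in the box (SiB+ = 1.0).
chance_heavy_club = xg_bc_formula(150, 0.28, bc_plus, 1.0)
average_club = xg_bc_formula(150, 0.28, 1.0, 1.0)

print(f"BC+ = {bc_plus:.2f}")
print(f"xG with extra big chances: {chance_heavy_club:.1f}")
print(f"xG for a league-average club: {average_club:.1f}")
```

A club that is exactly league average on both indices collapses back to plain SoT times Lg G/SoT, which is a handy sanity check on the weights summing to one.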
2) SiBoT * Lg G/SiBoT + SoBoT * Lg G/SoBoT
This is my new proposed formula. It's so simple. Just divide shots on target between shots in the box and shots outside the box, multiply by league average conversion rates, you're done. Hope it works!
3) SoT * Lg G/SoT
This is basically a control. It's the simplest expected goals formula there is. My new numbers have to be better than this.
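Formulas (2) and (3) are short enough to sketch side by side. This is only an illustration: the 35% and 12% conversion rates are the ones quoted earlier, but the league totals and the two clubs below are invented so the arithmetic is easy to follow.

```python
# League-average conversion rates quoted earlier in the piece.
LG_G_PER_SIBOT = 0.35  # goals per shot on target inside the box
LG_G_PER_SOBOT = 0.12  # goals per shot on target outside the box


def xg_sibot_formula(sibot: float, sobot: float) -> float:
    """Formula (2): split shots on target by location, apply league rates."""
    return sibot * LG_G_PER_SIBOT + sobot * LG_G_PER_SOBOT


def xg_sot_formula(sot: float, lg_g_per_sot: float) -> float:
    """Formula (3), the control: every shot on target is worth the same."""
    return sot * lg_g_per_sot


# Hypothetical league totals, used only to get a league-wide G/SoT
# that is consistent with the two location-specific rates.
league_sibot, league_sobot = 2000, 1000
league_goals = xg_sibot_formula(league_sibot, league_sobot)
lg_g_per_sot = league_goals / (league_sibot + league_sobot)

# Two hypothetical clubs with the same 180 shots on target.
box_club = xg_sibot_formula(sibot=140, sobot=40)
range_club = xg_sibot_formula(sibot=80, sobot=100)
control = xg_sot_formula(180, lg_g_per_sot)

print(f"box-heavy club, formula (2):  {box_club:.1f}")
print(f"long-range club, formula (2): {range_club:.1f}")
print(f"either club, formula (3):     {control:.1f}")
```

The control gives both clubs the same number because it cannot see where the shots came from. Formula (2) separates them, which is the whole point of the change.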
To test the models, I'm comparing the results of each expected goals formula to the actual goals the actual clubs scored or allowed, using two different methods. The first is simple correlation. This will mostly tell me if I put the teams in the right order, if I have a good ranking of the clubs based on the underlying stats. The second is something called the "root mean square error" method. This measures how much my estimates missed the mark with every team, and it particularly punishes big misses. If I'm way off in estimating goals scored or goals conceded for just one or two clubs, the RMSE will have my shirt. This is good, because if I'm going to be using this model for estimating team quality, I don't want to be the idiot out there in November talking about how great Reading are. When in fact Reading are terrible.
So, these are the correlation coefficients and root mean square errors for each of my different estimators, both for attack and defense. A good correlation coefficient is as close to 1.0 as possible. A good RMSE is as close to 0 as possible.
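If the two yardsticks are unfamiliar, here is a small Python sketch of both. The numbers are made up and are not from my database; they are chosen so that two sets of estimates miss by the same 12 goals in total, once spread evenly and once piled onto a single club.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rmse(expected, actual):
    """Root mean square error: squaring makes big misses count for more."""
    squared = [(e - a) ** 2 for e, a in zip(expected, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical expected goals for four clubs.
expected = [40, 50, 60, 70]

# Case A: every club beats the estimate by 3 goals (12 goals of error in total).
actual_even = [43, 53, 63, 73]
# Case B: three clubs are spot on, one beats the estimate by 12.
actual_one_big_miss = [40, 50, 60, 82]

print("corr, even misses:   ", round(pearson(expected, actual_even), 3))
print("corr, one big miss:  ", round(pearson(expected, actual_one_big_miss), 3))
print("RMSE, even misses:   ", round(rmse(expected, actual_even), 2))
print("RMSE, one big miss:  ", round(rmse(expected, actual_one_big_miss), 2))
```

In both cases the clubs come out in the right order, but the single 12-goal miss doubles the RMSE compared with four 3-goal misses.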
| Formula | Att Corr | Att RMSE | Def Corr | Def RMSE |
|---|---|---|---|---|
| BC Formula (1) | .90 | 6.2 | .80 | 7.3 |
| SiBoT Formula (2) | .91 | 5.5 | .77 | 6.9 |
| SoT Formula (3) | .87 | 6.3 | .72 | 7.3 |
Shots in the Box on Target are generally a better predictor of goals scored than Big Chances. They produce expected goals numbers that correlate roughly as well with actual goals as the BC-based expected goals do: slightly better in attack (.91 to .90), slightly worse in defense (.77 to .80). But when they miss, they miss by less, with a lower RMSE on both sides of the ball. I want to run some of these tests using a larger set of seasons, but for now, I'm feeling pretty good about SiBoT.
The final thing I want to offer here is a nice table of preliminary expected goals scored and expected goals conceded for the 2012-2013 Premier League. There are a couple little things I should explain before getting to the table. First, since I found some meaningful correlations week-to-week for defensive G/SoT, I am using actual G/SoT as a small part of my goals conceded formula. I'm not regressing all the way to league average. Second, for penalties, which are obviously converted at a totally separate rate, I'm regressing the number of penalties given halfway to league average and the rate of penalty conversion 100% to league average.
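The penalty adjustment can be sketched in a few lines of Python. This is one plausible reading of "regressing halfway," not my actual spreadsheet: the helper names are invented, and the 8 penalties, league average of 4, and 75% conversion rate are hypothetical numbers.

```python
def regress_to_league(observed: float, league_average: float,
                      weight_on_league: float) -> float:
    """Pull an observed value toward league average.

    weight_on_league = 0.0 keeps the observed value,
    weight_on_league = 1.0 replaces it with the league average.
    """
    return (1 - weight_on_league) * observed + weight_on_league * league_average


def expected_penalty_goals(pens_awarded: float, lg_pens_awarded: float,
                           lg_pen_conversion: float) -> float:
    """Penalty count regressed halfway; conversion rate fully regressed."""
    expected_pens = regress_to_league(pens_awarded, lg_pens_awarded, 0.5)
    # Regressing the conversion rate 100% just means using the league rate.
    return expected_pens * lg_pen_conversion


# Hypothetical club: 8 penalties awarded, league average 4, 75% conversion.
pen_xg = expected_penalty_goals(8, 4, 0.75)
print(f"expected penalty goals: {pen_xg:.1f}")
```

The same helper would cover the defensive G/SoT adjustment with a weight close to, but short of, 1.0. I haven't stated that weight here, so the sketch leaves it as a parameter.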
Ok, on to the numbers.
Preliminary Expected Goals Table
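| Club | Expected Goals Scored | Expected Goals Conceded | Actual Goals Scored | Actual Goals Conceded |
|---|---|---|---|---|
| Queens Park Rangers | 38 | 61 | 30 | 60 |
| West Bromwich Albion | 52 | 57 | 53 | 57 |
| West Ham United | 48 | 61 | 45 | 53 |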
- Yeah, I don't know what the deal with Manchester United is either. No matter what slices of the stats you control for, they converted more of every kind of opportunity. I said the RMSE punished me for large errors, but basically every measure misses on Manchester United by a dozen or more goals so the punishments evened out.
- This method brings Tottenham to roughly equal footing with their London rivals, but it does so by taking the air out of Arsenal and Chelsea's numbers more so than by inflating Spurs'.
- I think it's interesting to compare the three relegated clubs. Reading and QPR were terrible, but Wigan despite their goal difference had quite respectable underlying stats. Not surprising for a club that won a cup title, I guess. Obviously all of these clubs will lose important players, but I think these numbers suggest that Wigan are not a bad bet for a quick re-promotion.
- I say "preliminary" because I'm not taking into account, say, the game state analysis stuff I've been working on. And because it's still just June, there's more time to incorporate more data and renoobulate again. I appreciate any feedback you can offer, and hopefully I can take it into account for my further revisions of these methods.