I talk a lot about the "underlying stats" and their utility for football analysis. I realized I never actually made a worked-out argument for my methods in this space. So that's what I'm going to do. This is why I focus on the underlying stats: things like shots on target, big chances and shots in the box. The reason is pretty simple. I think that there is a ton of random variation in the percentage of shots on target which are converted into goals. I think that even over a full season of play, you shouldn't evaluate clubs based on their conversion rate of shots on target. If you do, you'll be evaluating them based on a statistic which is heavily influenced by random variation.
The example I've been giving, over the course of the season, is Gylfi Sigurdsson. He put a number of beautiful curling shots from inside the box on target during the first twenty-five weeks of the season, and every one of them was pushed aside by the keeper or came back off the bar.
It would be nearly impossible to argue while watching Sigurdsson play that he was doing something wrong when he took those shots, but nonetheless he scored none of his nine shots on target. Then in the final thirteen weeks of the season, Sigurdsson started scoring, putting home three of his seven SoT. I don't think he got better at shooting. I think random variation just came around for him.
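To put a rough number on that intuition, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that Sigurdsson's true conversion rate was constant all season and equal to his season total of 3 goals from 16 shots on target, and that each shot is independent of the others. The helper name `prob_at_most` is made up for this example.

```python
from math import comb


def prob_at_most(goals: int, shots: int, rate: float) -> float:
    """Binomial probability of scoring `goals` or fewer from `shots` shots on target."""
    return sum(
        comb(shots, k) * rate**k * (1 - rate) ** (shots - k)
        for k in range(goals + 1)
    )


# ASSUMPTION: his "true" rate is his season total, 3 goals from 9 + 7 = 16 SoT.
season_rate = 3 / 16

# Chance of the cold start: zero goals from the first nine shots on target.
p_cold_start = prob_at_most(0, 9, season_rate)
print(f"P(0 goals from 9 SoT at a {season_rate:.3f} rate) = {p_cold_start:.3f}")
```

Under those assumptions the goalless run of nine comes out at about 15%. That is unlucky, but it is nothing that needs a change in shooting ability to explain.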
The same applies at the team level. Now, I want to be clear about this. I am not arguing that there is no such thing as shooting skill. That would be silly. I'm arguing two things:
1) A lot of shooting skill is covered in the ability to put shots on target. Directing where you want a ball to go, using only your foot or your head, is super hard. Just forcing a save out of the goalkeeper requires skill; doing so consistently requires a lot of skill.
2) I actually strongly believe that there is real variation – at both the player and club level – in skill at placing shots in areas of the goal which are more difficult to save. Surely some players and some teams are better than some other players and teams at converting shots on target. However, the data demonstrates that the 38-game sample which the English Premier League season comprises is not large enough for these differences in skill to make themselves known. They are swamped by random variation. So even though I think there are real differences in shooting skill between different clubs, I do not believe that the G/SoT statistic usefully captures these differences. The better way to judge team quality is to regress G/SoT toward the league average.
The same applies, mutatis mutandis, to goal prevention and G/SoT against. (There is a little more signal captured in G/SoT against, which is a notable difference. But it's still mostly swamped by random variation.)
Here's one demonstration. I have a week-by-week record of the basic statistics: shots, shots on target and goals scored, as well as some more complex statistics like big chances and shots in the box on target. (I guess "shots in the box on target" aren't complicated, but since they aren't freely available and I had to extract them from my minute-by-minute database, I'm going to go ahead and call them complicated. I'll give a definition of each statistic below.)
So, I have taken this database, and I have split it randomly into two sections for each club. What this means is, in one half of the database, for each of the 20 EPL clubs, I'll have 19 randomly chosen gameweeks for each club, and in the other half, I'll have the other 19 gameweeks for each club. Then, I can compare a team's performance in one half of the database to its performance in the other half.
I have then repeated this 100 times, creating 100 different random season splits, and I have run correlations between each random half and its counterpart.
With a sample of 19 matches on either side, and this repeated randomly a hundred times, we should expect good correlations between good statistical measures. If you take two randomly chosen sets of 19 matches from Manchester United's season, they should have reasonably similar totals of goals and shots and shots on target. These numbers should all be very good, because Manchester United are very good. Likewise, if you do the same to Reading, you should expect similar sets of terrible numbers, because Reading are terrible.
And indeed, you do see high correlations with most statistics. But not with goals per shot on target. G/SoT correlates not at all for attack, and only very slightly for defense. The following two tables contain the correlation numbers for each statistic for attack and defense. Here you may either skip over an explanation of a statistical term or nerd out with me for a moment.
Nerdery: "Correlation" is a statistical method for evaluating the relationship between two sets of numbers, and it ranges between -1 and 1. Identical sets have a correlation of 1, while inverse sets have a correlation of -1. So, (1,2) and (1,2) have a correlation of exactly 1.0, while (1,2) and (2,1) have a correlation of exactly -1.0. A correlation of 0 suggests no relationship between two sets of numbers. In the correlations I'm running, I'm comparing two sets of 20 numbers, the statistics of each EPL club for each of two randomly selected 19-week segments, but the math is the same. We expect good correlations, but not exact ones, for statistics which reflect a club's underlying ability.
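For anyone who wants to see the arithmetic, here is a minimal sketch, assuming the usual Pearson coefficient is the one meant by "correlation". The function name `pearson` is just for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: co-movement of xs and ys, scaled to lie in [-1, 1]."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (numerator of the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each list.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


print(pearson([1, 2], [1, 2]))  # 1.0
print(pearson([1, 2], [2, 1]))  # -1.0
```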
| Attack Stat | Correlation |
|---|---|
| Att SoT | 0.74 |
| Att SiBoT | 0.72 |
| Att BC | 0.70 |
| Att G/SoT | 0.09 |

| Defense Stat | Correlation |
|---|---|
| Def SoT | 0.71 |
| Def SiBoT | 0.68 |
| Def BC | 0.62 |
| Def G/SoT | 0.27 |
Shot on Target (SoT): A goal attempt that either goes into the goal or would have gone into the goal if it had not been saved by the keeper or cleared by the last-man defender.
Shot in the Box on Target (SiBoT): A shot on target from inside the 18-yard box.
Big Chance (BC): An opportunity when a player is realistically expected to score, such as a one-on-one opportunity. Judged by Opta stringers.
Goals per Shot on Target (G/SoT): Percentage of shots on target which are goals.
In the attacking numbers, there is basically no correlation between random halves of the season in goals per shot on target. To the degree that there is a real team-level ability to convert shots on target, it is swamped by random variation. The other statistics correlate relatively well. On defense, there is somewhat weaker correlation of the underlying statistics, and a non-zero correlation of goals per shot on target. Still, there is much more consistency in the SoT, SiBoT and BC numbers from one half to the other, for defense as well as attack.
This data leads me to conclude that estimates of team quality should not be built off the rate at which clubs score goals from shots on target. Of course, an estimate of team quality based on goals scored will indeed be based on that goals / shots on target number that varies so widely week-to-week. So for the attacking statistics, I will want to build an estimator of club attacking quality based only on the underlying stats. For defense, I will include G/SoT as a small part of the calculation.
Tomorrow: New formulas unveiled, plus a critique of Opta's "big chances" statistic.