Repurposing Elo for Streaks, Seasons, and Garbine Muguruza

Elo is a fantastic tool for its explicit purpose: estimating the skill level of players based on available information. For instance, my WTA ratings currently rank Ashleigh Barty second. That seems plausible enough–it may be correct to give her the edge in a head-to-head matchup with everyone on tour except for Naomi Osaka. But with women pursuing such different schedules this season, a rating is only so useful.

For all of Barty’s or Osaka’s skill, is it right to say either one of them has had a better 2021 season than Garbine Muguruza? Osaka won the Australian Open, so she has a valid claim. Barty’s argument is a lot more tenuous, based on only eight victories. The Spaniard’s case writes itself–only a handful of players are up to double digits in wins this year, and Muguruza already has 18. How could we decide? If Elo is the smart version of the official rankings, what’s the smart version of the official race?

Starting fresh

The Elo algorithm itself offers a solution. A big part of the reason Muguruza is rated 4th on my current Elo list–and not higher–is her career before 2021. We had hundreds of matches worth of data on Garbine before January 1st, and it would be silly to throw all that away. Her 18-4 start is fantastic, but it doesn’t supersede everything that came before. It just gives us reason to update our rating.

Here’s where the ranking/race analogy is useful. The official rankings use a time span of 52 weeks (or more). The race restarts on January 1st. We could do the exact same thing with Elo, throwing away all results from the previous year and starting over, but that would be wasteful–it wouldn’t allow us to take into account whether players had faced particularly easy or tough draws, for instance.

The solution is to set Elo ratings back to zero (or 1500, in Elo parlance) one player at a time.

Take Muguruza. Instead of starting the year with a rating of 1981 and a history of several hundred matches, we pretend to know nothing about her. We give her a newbie’s rating of 1500 and a history of zero matches. Then we run the Elo algorithm to update her rating over the course of her 22 matches. First she beats Kristina Mladenovic (whose actual rating at the time was 1817), and improves to 1605. Then she beats Aliaksandra Sasnovich (rated 1805), and improves to 1692. Repeat for each of her 2021 results, and the end result is a rating of 2160–almost 100 points higher than her current “real Elo” rating and within shouting distance of Osaka’s 2189.

To compare players, work through the same steps for everybody else, calculating their current-season rating as if they played their first career match in January.
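If you want to see the mechanics, here is a minimal sketch in Python. It uses the standard logistic Elo expectation, but the new-player k-factor schedule is an illustrative stand-in for my actual parameters, so its numbers will only roughly match the ones above.

    # Minimal yElo sketch: start a player at 1500 with zero matches, then run
    # a basic Elo update over her season. The k-factor schedule below is a
    # stand-in, not my production formula, so exact numbers will differ.

    def expected_score(rating, opp_rating):
        # standard Elo win expectancy
        return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

    def k_factor(matches_played):
        # illustrative schedule: big for newcomers, shrinking with experience
        return 250.0 / ((matches_played + 5) ** 0.4)

    def yelo(season_results):
        # season_results: chronological list of (opponent_elo, won) tuples
        rating, played = 1500.0, 0
        for opp_elo, won in season_results:
            rating += k_factor(played) * ((1.0 if won else 0.0) - expected_score(rating, opp_elo))
            played += 1
        return rating

    # Muguruza's first two 2021 results from the example above:
    print(round(yelo([(1817, True), (1805, True)])))  # roughly 1700; the full formula gives 1692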

It’s worth taking a moment to think about exactly what we’re measuring. That outstanding 2160 rating is what you get if a complete unknown shows up with zero match experience, then goes on the 22-match run that has been Muguruza’s season so far. The difference between real-Garbine and fake-newbie-Garbine is that the real one has an extensive track record that tells us she’s always been good–but that she probably isn’t quite this good.

I call it … yElo

This approach is “Elo for seasons” or “year Elo”–yElo*. It doesn’t have to be limited to calendar years, as the same approach would be useful for comparing, say, 20-match segments. It allows us to take advantage of the Elo algorithm–and the well-informed ratings of other players–to measure partial careers.

* you can pronounce it like the color “yellow,” but I prefer to say it like Phil Dunphy from Modern Family answering the phone.

Muguruza’s 2160 rating sure looks good, so how does it stack up against the rest of the tour? Here’s the 2021 top 20, considering players with at least five match wins through the Dubai and Guadalajara finals last weekend:

Rank  Player                W-L  yElo  
1     Garbine Muguruza     18-4  2160  
2     Naomi Osaka          10-0  2094  
3     Jessica Pegula       15-5  2002  
4     Serena Williams       8-1  1997  
5     Elise Mertens        11-2  1971  
6     Karolina Muchova      7-1  1953  
7     Aryna Sabalenka      11-4  1943  
8     Iga Swiatek          10-3  1941  
9     Daria Kasatkina      10-4  1910  
10    Barbora Krejcikova   10-5  1905  
11    Shelby Rogers         9-4  1902  
12    Jil Teichmann         9-5  1899  
13    Anett Kontaveit       9-4  1897  
14    Jennifer Brady        9-4  1892  
15    Cori Gauff           11-5  1885  
16    Danielle Collins      9-4  1883  
17    Ashleigh Barty        8-2  1878  
18    Sara Sorribes Tormo   9-2  1867  
19    Ann Li                5-1  1864  
20    Simona Halep          6-2  1854 

Like any Race list in March, this isn’t really reflective of skill. But when we consider the small amount of data it has to work with for each player, it’s … pretty good?

Again, you can quibble over whether Osaka or Muguruza has had the better season, but this approach weighs the better winning percentage and stronger average opponent against the much higher absolute win count and gives us a credible answer. Muguruza’s additional evidence of good tennis playing puts her ahead of Osaka’s evidence of short-term unbeatability.

While yElo is basically just a toy–it certainly doesn’t have the same predictive value as regular Elo–this initial look makes me like it. The possibilities are endless, from more sophisticated race tracking, to ranking the greatest seasons of all time, to comparing a player’s current hot streak to what she’s done in the past. Stay tuned, as I’m sure I’ll have more yElo results to report in the future.

So, About Those Stale Rankings

Both the ATP and WTA have adjusted their official rankings algorithms because of the pandemic. Because many events were cancelled last year (and at least a few more are getting canned this year), and because the tours don’t want to overly penalize players for limiting their travel, they have adopted what is essentially a two-year ranking system. For today’s purposes, the details don’t really matter–the point is that the rankings are based on a longer time frame than usual.

The adjustment is good for people like Roger Federer, who missed 14 months and is still ranked #6. Same for Ashleigh Barty, who didn’t play for 11 months yet returned to action in Australia as the top seed at a major. It’s bad for young players and others who have won a lot of matches lately. Their victories still result in rankings improvements, but they’re stuck behind a lot of players who haven’t done much lately.

The tweaked algorithms reflect the dual purposes of the ranking system. On the one hand, they aim to list the best players, in order. On the other hand, they try to maintain other kinds of “fairness” and serve the purposes of the tours and certain events. The ATP and WTA computers are pretty good at properly ranking players, even if other algorithms are better. Because the pandemic has forced a bunch of adjustments, it stands to reason that the formulas aren’t as good as they usually are at that fundamental task.

Hypothesis

We can test this!

Imagine that we have a definitive list, handed down from God (or Martina Navratilova), that ranks the top 100 players according to their ability right now. No “fairness,” no catering to what tournament owners want, and no debates–this list is the final word.

The more closely a ranking table matches this definitive list, the better, right? There are statistics for this kind of thing, and I’ll be using one called the Kendall rank correlation coefficient, or Kendall’s tau. (That’s the Greek letter τ, as in Τσιτσιπάς.) It compares lists of rankings, and if two lists are identical, tau = 1. If there is no correlation whatsoever, tau = 0. Higher tau, stronger relationship between the lists.

My hypothesis is that the official rankings have gotten worse, in the sense that the pandemic-related algorithm adjustments result in a list that is less closely related to that authoritative, handed-down-from-Martina list. In other words, tau has decreased.

We don’t have a definitive list, but we do have Elo. Elo ratings are designed for only one purpose, and my version of the algorithm does that job pretty well. For the most part, my Elo formula has not changed due to the pandemic*, so it serves as a constant reference point against which we can compare the official rankings.

* This isn’t quite true, because my algorithm usually has an injury/absence penalty that kicks in after a player is out of action for about two months. Because the pandemic caused all sorts of absences for all sorts of reasons, I’ve suspended that penalty until things are a bit more normal.

Tau meets the rankings

Here is the current ATP top ten, including Elo rankings:

Player       ATP  Elo  
Djokovic       1    1  
Nadal          2    2  
Medvedev       3    3  
Thiem          4    5  
Tsitsipas      5    6  
Federer        6    -  
Zverev         7    7  
Rublev         8    4  
Schwartzman    9   10  
Berrettini    10    8

I’m treating Federer as if he doesn’t have an Elo rating right now, because he hasn’t played for more than a year. If we take the ordering of the other nine players and plug them into the formula for Kendall’s tau, we get 0.778. The exact value doesn’t really tell you anything without context, but it gives you an idea of where we’re starting. While the two lists are fairly similar, with many players ranked identically, there are a couple of differences, like Elo’s higher estimate of Andrey Rublev and its swapping of Diego Schwartzman and Matteo Berrettini.
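If you want to check the arithmetic, scipy’s kendalltau will do it. The sketch below just plugs in the nine-player comparison from the table above, with Federer dropped from both lists.

    # Reproducing the nine-player tau from the table (Federer excluded).
    from scipy.stats import kendalltau

    atp_ranks = [1, 2, 3, 4, 5, 7, 8, 9, 10]   # official top ten minus Federer
    elo_ranks = [1, 2, 3, 5, 6, 7, 4, 10, 8]   # the same players' Elo ranks

    tau, p_value = kendalltau(atp_ranks, elo_ranks)
    print(round(tau, 3))  # 0.778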

Let’s do the same exercise with a bigger group of players. I’ll take the top 100 players in the ATP rankings who met the modest playing time minimum to also have a current Elo rating. Plug in those lists to the formula, and we get 0.705.

This is where my hypothesis falls apart. I ran the same numbers on year-end ATP rankings and year-end Elo ratings all the way back to 1990. The average tau over those 30-plus years is about 0.68. In other words, if we accept that Elo ratings are doing their job (and they are indeed about as predictive as usual), it looks like the pandemic-adjusted official rankings are better than usual, not worse.

Here are the year-by-year tau values, with a tau value based on current rankings as the right-most data point:

And the same for the WTA, to confirm that the result isn’t just a quirk of the makeup of the men’s tour:

The 30-year average for women’s rankings is 0.723, and the current tau value is 0.764.

What about…

You might wonder if the pandemic is wreaking some hidden havoc with the data set. Remember, I said that I’m only considering players who meet the playing time minimum to have an Elo rating. For this purpose, that’s 20 matches over 52 weeks, which excludes about one-third of top-100 ranked men and closer to half of top-100 women. The above calculations still consider 100 players for year-end 2020 and today, but I had to go deeper in the rankings to find them. Thus, the definition of “top 100” shifts a bit from year-end 2019 to year-end 2020 to the present.

We can’t entirely address this problem, because the pandemic has messed with things in many dimensions. It isn’t anything close to a true natural experiment. But we can look only at “true” top-100 players, even if the length of the list is smaller than usual for current rankings. So instead of taking the top 100 qualifying players (those who meet a playing time minimum and thus have an Elo ranking), we take a smaller number of players, all of whom have top-100 rankings on the official list.

The results are the same. For men, the tau based on today’s rankings and today’s Elo ratings is 0.694 versus the historical average of 0.678. For women, it’s 0.721 versus 0.719.

Still, the rankings feel awfully stale. The key issue is one that Elo can’t help us solve. So far, we’ve been looking at players who are keeping active. But the really out-of-date names on the official lists are the ones who have stayed home. Should Federer still be #6? Heck if I know! In the past, if an elite player missed 14 months, Elo would knock him down a couple hundred points, and if that adjustment were applied to Fed now, it would push down tau. But there’s no straightforward answer for how the inactive (or mostly inactive) players should be rated.

What we’ve learned today

This is the part of the post where I’m supposed to explain why this finding makes sense and why we should have suspected it all along. I don’t think I can manage that.

A good way to think about this might be that there is a sort of tour-within-a-tour that is continuing to play regularly. Federer, Barty, and many others haven’t usually been part of it, while several dozen players are competing as often as they can. The relative rankings of that second group are pretty good.

It doesn’t seem quite fair that Clara Tauson is stuck just inside the top 100 while her Elo is already top-50, or that Rublev remains behind Federer despite an eye-popping six months of results while Roger sat at home. And for some historical considerations–say, weeks inside the top 50 for Tauson or the top 5 for Rublev–maybe it isn’t fair that they’re stuck behind peers who are choosing not to play, or who are resting on the laurels of 18-month-old wins.

But in other important ways, the absolute rankings often don’t matter. Rublev has been a top-five seed at every event he’s played since late September except for Roland Garros, the Tour Finals, and the Australian Open, despite never being ranked above #8. When the tour-within-a-tour plays, he is a top-five guy. The likes of Rublev and Tauson will continue to have the deck slightly stacked against them at the majors, but even that disadvantage will steadily erode if they continue to play at their current levels.

Believing in science as I do, I will take these findings to heart. That means I’ll continue to complain about the problems with the official rankings–but no more than I did before the pandemic.

How Much Does Naomi Osaka Raise Her Game?

You’ve probably heard the stat by now. When Naomi Osaka reaches the quarter-final of a major, she’s 12-0. That’s unprecedented, and it’s especially unexpected from a player who doesn’t exactly pile up hardware outside of the hard court grand slams.

It sure looks like Osaka finds another level as she approaches the business end of a major. Translated to analytics-speak, “she raises her game” can be interpreted as “she plays better than her rating implies.” That is certainly true for Osaka. She has won 16 of her 18 matches in the fourth round or later of a slam, often in matchups that didn’t appear to favor her. In her first title run, at the 2018 US Open, my Elo ratings gave her 36%, 53%, 46%, and 43% chances of winning her fourth-round, quarter-final, semi-final, and final-round matches, respectively.

Had Osaka performed at her expected level for each of her 18 second-week matches, we’d expect her to have won 10.7 of them. Instead, she won 16. The probability that she would have won 16 or more of the 18 matches is approximately 1 in 200. Either the model is selling her short, or she’s playing in a way that breaks the model.

Estimating lift

Osaka’s results in the second week of slams are vastly better than the other 93% or so of her tour-level career. It’s possible that it’s entirely down to luck–after all, things with a 0.5% chance of happening have a habit of occurring about 0.5% of the time, not never. When those rare events do take place, onlookers are very resourceful when it comes to explaining them. You might believe Osaka’s claims about caring more on the big stage, but we should keep in mind that whenever the unlikely happens, a plausible justification often follows.

Recognizing the slim possibility that Osaka has taken advantage of some epic good luck but setting it aside, let’s quantify how good she’d have to be for such a performance to not look lucky at all.

That’s a mouthful, so let me explain. Going into her 18 second-week slam matches, Osaka’s average surface-blended Elo rating has been 2,022. That’s good but not great–it’s a tick below Aryna Sabalenka’s hard-court Elo rating right now. Those modest ratings are how we come up with the estimate that Osaka should’ve won 10.7 of her 18 matches, and that she had a 1-in-200 shot of winning 16 or more.

2,022 doesn’t explain Osaka’s success, so the question is: What number does? We could retroactively boost her Elo rating before each of those matches by some amount so that her chance of winning 16-plus out of 18 would be a more believable 50%. What’s that boost? I used a similar methodology a couple of years ago to quantify Rafael Nadal’s feats at his best clay court events, another string of match wins that Elo can’t quite explain.
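In code, the search looks something like the sketch below. I haven’t reproduced the 18 individual pre-match ratings here, and the five-point step size is arbitrary, so treat it as an outline of the method rather than the exact calculation. (With a bonus of zero, the same prob_at_least function is what produces the 1-in-200 figure quoted above.)

    # Find the Elo bonus that makes the observed win count unsurprising.
    # matches: list of (player_elo, opponent_elo) pairs before each match.

    def win_prob(elo, opp_elo, bonus=0.0):
        return 1.0 / (1.0 + 10 ** ((opp_elo - (elo + bonus)) / 400.0))

    def prob_at_least(probs, k):
        # exact P(at least k wins) for independent matches with different odds
        dist = [1.0]                      # dist[j] = P(exactly j wins so far)
        for p in probs:
            dist = [(dist[j] * (1 - p) if j < len(dist) else 0.0) +
                    (dist[j - 1] * p if j > 0 else 0.0)
                    for j in range(len(dist) + 1)]
        return sum(dist[k:])

    def required_bonus(matches, wins_needed, target=0.5):
        bonus = 0.0
        while prob_at_least([win_prob(e, o, bonus) for e, o in matches], wins_needed) < target:
            bonus += 5.0                  # arbitrary search step
        return bonus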

The answer is 280 Elo rating points. If we retroactively gave Osaka an extra 280 points before each of these 18 matches, the resulting match forecasts would mean that she’d have had a fifty-fifty chance at winning 16 or more of them. Instead of a pre-match average of 2,022, we’re looking at about 2,300, considerably better than anyone on tour right now. (And, ho hum, among the best of all time.) A difference of 280 Elo points is enormous–it’s the difference between #1 and #22 in the current hard-court Elo ratings.

Osaka versus the greats

I said before that Osaka’s 12-0 is unprecedented. Her 16-2 in slam second weeks may not have quite the same ring to it, but compared to expectations based on Osaka’s overall tour-level performance, it is every bit as unusual.

Take Serena Williams, another woman who cranks it up a notch when it really matters. Her second-week record, excluding retirements, is 149-39, while the individual forecasts before each match would’ve predicted about 124-64. The chances of a player outperforming expectations to that extent are basically zero. I ran 10,000 simulations, and that’s how many times a player with Serena’s pre-match odds won at least 149 of the 188 matches. Zero.

For Serena to have had a 50% chance of winning 149 of the 188 second-week contests, her pre-match Elo ratings would’ve had to have been 140 points higher. That’s a big difference, especially on top of the already stellar ratings that she has maintained throughout her career, but it’s only half of the jump we needed to account for Osaka’s exploits. Setting aside the possibility of luck, Osaka raises her level twice as much as Serena does.

One more example. Monica Seles won 70 of her 95 second-week matches at slams, a marked outperformance of the 60 matches that Elo would’ve predicted for her. Like Osaka, her chances of having won 70 instead of 60 based purely on luck are about 1 in 100. But you can account for her actual results by giving her a pre-match Elo bonus of “only” 100 points.

The full context

I ran similar calculations for the 52 women who won a slam, made their first second-week appearance in 1958 or later, and played at least 10 second-week matches. They divide fairly neatly into three groups. 18 of them have career second-week performances that can easily be explained without recourse to good luck or level-raising. In some cases we can even say that they were unlucky or that they performed worse than expected. Ashleigh Barty is one of them: Of her 14 second-week matches, she was expected to win 9.9 but has tallied only 8.

Another 16 have been a bit lucky or slightly raised their level. To use the terms I introduced above, their performances can be accounted for by upping their pre-match Elo ratings by between 10 and 60 points. One example is Venus Williams, who has gone 84-43 in slam second weeks, about six wins better than her pre-match forecasts would’ve predicted.

That leaves 18 players whose second-week performances range from “better than expected” to “holy crap.” I’ve listed each of them below, with their actual wins (“W”), forecasted wins (“eW”), probability of winning their actual total given pre-match forecasts (“p(W)”), and the approximate number of Elo points (“Elo+”) which, when added to their pre-match forecasts, would explain their results by shifting p(W) up to at least 50%.

Player               M    W     eW   p(W)  Elo+  
Naomi Osaka         18   16   10.7   0.5%   280  
Billie Jean King   123   94   76.2   0.0%   160  
Sofia Kenin         10    7    4.7  10.6%   150  
Serena Williams    188  149  124.4   0.0%   140  
Evonne Goolagong    92   69   58.7   0.4%   130  
Jennifer Capriati   70   42   33.2   1.2%   110  
Monica Seles        95   70   60.2   1.2%   100  
Hana Mandlikova     75   49   41.7   3.1%   100  
Kim Clijsters       67   47   40.6   4.6%    90  
Justine Henin       74   55   48.9   6.3%    80  
Mary Pierce         55   28   22.4   6.9%    80  
Li Na               36   22   18.0  10.6%    80  
Steffi Graf        157  131  123.6   6.1%    70  
Maria Bueno         93   70   63.4   6.3%    70  
Garbine Muguruza    31   18   14.9  15.8%    70  
Mima Jausovec       32   18   15.0  15.9%    70  
Marion Bartoli      20   11    8.8  20.6%    70  
Sloane Stephens     24   12    9.7  20.8%    70

There are plenty of names here that we’d comfortably put alongside Williams and Seles as luminaries known for their clutch performances. Still, the difference between Osaka’s levels is on another planet.

Obligatory caveats

Again, of course, Osaka’s results could just be lucky. It doesn’t look that way when she plays, and the qualitative explanations add up, but … it’s possible.

Skeptics might also focus on the breakdown of the 52-player sample. In terms of second-week performance relative to forecasts, only one-third of the players were below average. That doesn’t seem quite right. The “average” woman outperformed expectations by about 30 Elo points.

There are two reasons for that. The first is that my sample is, by definition, made up of slam winners. Those players won at least four second-week matches, no matter how they fared in the rest of their careers. In other words, it’s a non-random sample. But that doesn’t have any relevance to Osaka’s case.

The second, more applicable, reason that more than half of the players look like outperformers is that any pre-match player rating is a measure of the past. Elo isn’t as much of a lagging indicator as, say, official tour rankings, but by its nature, it can only consider past results.

Any player who ascends to the top of the game will, at some point, need to exceed expectations. (If you don’t exceed expectations, you end up with a tennis “career” like mine.) To go from mid-pack to slam winner, you’ll have at least one major where you defy the forecasts, as Osaka did in New York in 2018. Osaka was an extreme case, because she hadn’t done much outside of the slams. If, for instance, Sabalenka were to win the US Open this year, she has done so well elsewhere that it wouldn’t be the same kind of shock, but it would still be a bit of a surprise.

In other words, almost every player to win a slam had at least one or two majors where they executed better than their previous results offered any reason to expect. That’s one reason why we find Sofia Kenin only two spots below Osaka on the list.

For Serena or Seles, the “rising star” effect doesn’t make much of a difference–those early tournaments are just a drop in the bucket of a long career. Yeah, it might mean they really only up their game by 110 Elo points instead of 130, but it doesn’t call their entire career’s worth of results into question. For Osaka or Kenin, the early results make up a big part of the sample, so this is something to consider.

It will be tougher for Osaka to outperform expectations as the expectations continue to rise. Much depends on whether she continues to struggle away from the big stages. If she continues to manage only one non-major title per year, she’ll keep her rating down and suppress those pre-match forecasts. (The predictions of major media pundits will be harder to keep under control.) Beating the forecasts isn’t necessarily something to aspire to–even though Serena does it, her usual level is so high that we barely notice. But if Osaka is going to alternate levels between world-class and merely very good, she could hardly do better than to bring out her best stuff when it matters most.

The Post-Covid Tennis World is Unpredictable. The Match Results Are Not.

Both the ATP and WTA patched together seasons in the second half of 2020, providing playing opportunities to competitors who had endured vastly different lockdowns–some who couldn’t practice for a while, some who came down with Covid-19, and others who got knee surgery.

When the tours came back, we didn’t know quite what to expect. I’m sure some of the players didn’t know, either. Yet when we take the 2020 season (plus a couple weeks of 2021) as a whole, what happened on court was pretty much what happened before. The Australian Open, with its dozens of players in hard quarantine for two weeks, may change that. But for about five months, players faced all kinds of other unfamiliar challenges, and they responded by posting results that wouldn’t have looked out of place in January 2020.

The Brier end

My usual metric for “predictability” is Brier Score, which measures both accuracy (did our pre-match favorite win?) and confidence (if we think four players are all 75% favorites, did three of them win?). Pre-match odds are determined by my Elo ratings, which are far from the final word, but are more than sufficient for these purposes. My tour-wide Brier Scores are usually in the neighborhood of 0.21, several steps better than the 0.25 Brier that results from pure coin-flipping. A lower score indicates more accurate forecasts and/or better calibrated confidence levels.
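For reference, the calculation itself is tiny. A minimal version, with each forecast expressed as the probability that one named player wins and the outcome as 1 or 0 for that player:

    # Brier Score: mean squared error of probability forecasts.
    def brier(forecasts, outcomes):
        # forecasts: P(player A wins); outcomes: 1 if player A won, else 0
        return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

    # Four 75% favorites, three of whom won:
    print(brier([0.75] * 4, [1, 1, 1, 0]))  # 0.1875
    # Always saying 50/50 gives the coin-flip benchmark:
    print(brier([0.5, 0.5], [1, 0]))        # 0.25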

Here are the tour-wide Brier Scores for the ATP and WTA since the late-summer restart:

  • ATP: 0.213 (2017 – early 2020: 0.212)
  • WTA: 0.192 (2017 – early 2020: 0.212)

The ATP’s level of predictability is so steady that it’s almost suspicious, while the WTA has somehow been more predictable since the restart.

But we aren’t quite comparing apples to apples. The post-restart WTA was sparser than the pre-Covid women’s tour, and the post-restart ATP was closer to its pre-pandemic normal.

Let’s look at a few things that do line up. Most of the top players showed up for the main events of the restarted tour, such as the US Open, Roland Garros, Rome, “Cincinnati” (played in New York), and the men’s Masters event in Paris. Here are the 2019 and 2020 Brier Scores for each of those events:

Event          Men '19  Men '20  Women '19  Women '20  
Cincinnati       0.244    0.210      0.244      0.252  
US Open          0.210    0.167      0.178      0.186  
Roland Garros    0.163    0.199      0.191      0.226  
Rome             0.209    0.274      0.205      0.232  
Paris            0.226    0.199          -          -  
---
Total            0.204    0.202      0.198      0.218

(If you want even more numbers, I did similar calculations in August after Palermo, Lexington, and Prague.)

Three takeaways from this exercise:

  • Brier Scores are noisy. Any single tournament number can be heavily affected by a few major upsets.
  • Man, those ATP dudes were steady.
  • The WTA situation is more complicated than I thought.

Whether we look at the entire post-restart tour or solely the big events, the story on the ATP side is clear. Long layoffs, tournament bubbles, missing towelkids, Hawkeye Live … none of it had much effect on the status quo.

The predictability of the women’s tour is another thing entirely. The 12 top-level events between Palermo in July and Abu Dhabi in January were easier to forecast than a random sampling of a dozen tournaments from, say, 2018. But the four biggest events deviated from the script considerably more than they had in 2019 (or 2017 or 2018, for that matter).

From this, I offer a few tentative conclusions:

  • Big events, with their disproportionate number of star-versus-star matches, are a bit more predictable than other tournaments.
  • Accordingly, the post-restart WTA wasn’t as predictable as it first appeared. It was just lopsided in favor of tournaments that drew (most of) the top stars. Had the women’s tour featured a wider variety of events–which probably would’ve included a larger group of players, including some fringier ones–its post-restart Brier Score would’ve been higher. Perhaps even higher than the corresponding pre-Covid number.
  • Most tentative of all: The predictability of ATP and WTA match results might have itself been affected by the availability of tournaments. Top men were able to get into something like their usual groove, despite the weirdness of virus testing and empty stadiums. Most women never got a chance to play more than two or three weeks in a row.

Even six months after Palermo, the data is still limited. And by the time we have enough match results to do proper comparisons, some things will have gotten back to normal (hopefully!), complicating the analysis even further. That said, these findings are much clearer than my initial forays into post-restart Brier Scores in August. As for the Australian Open, quarantine and all, I’m forecasting a predictable tournament. At least for the men.

Not All Twenties Are Created Equal

The top of the all-time men’s grand slam ranking just got even more crowded. With his 13th Roland Garros title, Rafael Nadal has matched Roger Federer at the top of the list by securing his 20th major title. Novak Djokovic, Nadal’s final obstacle en route to the historic mark, remains within shouting distance with 17 slams.

The Roger-Rafa tie has spurred another (interminable, unresolvable) round of the (interminable, unresolvable) GOAT debate. Of course there’s much more to determining the best ever than the slam count. But the slam count is a big part of the conversation. If we’re going to keep doing this, we ought to at least recognize that not all major titles are created equal. And by extension, not all collections of twenty major titles are equivalent.

We all have intuitions about the difficulty of how a particular draw shakes out, with its typical mix of good and bad fortune. Nadal was lucky that he missed a few dangerous opponents in the early rounds, luckier still that he didn’t have to face Dominic Thiem in the semi-final, and unfortunate that he had to face down the next-best player in the draw, Djokovic, in the final. As it turned out, it didn’t really matter, but I think most of us would agree that Nadal’s achievement–staggering as it is–would look even better had he faced more than two players ranked in the top 70.

Stop dithering and start calculating

I’ve written about this before, and I’ve established a metric to quantify those intuitions. Take the surface-weighted Elo rating of each of a player’s opponents, and determine the probability that an average slam champion would beat those players. After a couple of steps to normalize the results, we end up with a single number for the path to each slam title. The larger the result, the more difficult the path, and an average slam works out to 1.0.
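To make the idea concrete, here is a rough sketch. The 2300 reference rating for an "average slam champion" and the normalization against a historical average are illustrative assumptions; the real metric involves a couple of extra steps.

    # Rough sketch of the path-difficulty idea; the champion's reference Elo
    # and the normalization below are illustrative, not my actual formula.
    CHAMP_ELO = 2300.0   # stand-in for an "average slam champion"

    def champ_beats(opp_elo):
        # chance the reference champion beats one opponent
        # (surface-weighted opponent Elos assumed as inputs)
        return 1.0 / (1.0 + 10 ** ((opp_elo - CHAMP_ELO) / 400.0))

    def run_the_table(opponent_elos):
        # chance the reference champion beats all seven opponents
        prob = 1.0
        for opp in opponent_elos:
            prob *= champ_beats(opp)
        return prob

    def path_difficulty(opponent_elos, avg_run_prob):
        # harder draws mean a lower run-the-table chance; dividing the
        # historical average by it scales a typical path to 1.0
        return avg_run_prob / run_the_table(opponent_elos)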

Nadal’s path was easier than the historical average. Aside from Djokovic, none of his opponents would have had more than an 8% chance of knocking out an average slam champion on clay. The exact result is 0.64, which is easier than almost nine-tenths of majors in the Open Era. Rafa has had three easier paths to his major titles, including the 2017 US Open, which scored only 0.33. That’s the easiest US Open, Wimbledon, or Roland Garros in a half-century.

Of course, he’s had his share of difficult paths, such as 2012 Roland Garros (1.36), when he faced several clay specialists and a peak-level Djokovic. Federer and Djokovic have gotten their own shares of lucky and unlucky draws over the years–that’s why we need a metric. You might have a better memory for this kind of thing than I do, but I don’t think any of us can weigh 57 majors with 7 opponents each and work out any meaningful results in our heads.

The tally

Sum up the difficulty of the title paths for these 57 slams, and here are the results:

Player    Slams  Avg Score  Total  
Nadal        20       0.95   19.0  
Djokovic     17       1.06   18.1  
Federer      20       0.89   17.9  
                                   
Player     Easy     Medium   Hard  
Nadal         7          8      5  
Djokovic      5          5      7  
Federer       9         10      1

The first table shows each player’s average score for the paths to his major titles, and the total number of “adjusted slams” that gives them. Nadal is in the lead with 19, and Djokovic and Federer follow in a near-tie, just above and below 18.

You might be surprised to see the implication that this is a slightly weak era, with average scores a bit below 1.0. That wasn’t the case a few years ago, but there has only been one above-average title path since 2016. The Big Three-or-Four has generally stayed out of each other’s way since then, and even when they do clash, as they did yesterday, the leading contenders for quarter-final or semi-final challenges failed to make it that far. The average score of the last 15 slam title paths is a mere 0.73, while the 16 before that (spanning 2013-16) averaged 1.20.

The second table paints with a broader brush, classifying all Open Era slam titles into thirds: “easy,” “medium” and “hard” paths to the championship. Anything below 0.89 rates as “easy,” anything above 1.14 is marked as “hard,” with the remainder left as “medium.”

Djokovic is the leader in hard slams, with 7 of his 17 meriting that classification. Federer has racked up 10 medium slams, including several that score above 1.0, but only one that cleared the bar for the “hard” category. Nadal’s mix is more balanced.

Go yell at someone else

Hopefully these numbers have given you some new ammunition for your next twitter fight. Some of you will froth at the mouth while insisting that players can’t control who they play. You’re right, but it doesn’t really matter. We can’t start giving out GOAT points for things that players didn’t do, like beat Thiem in the 2020 French Open semi-finals. All three of these guys were or are good enough at various points to have beaten some of the opponents they didn’t have to face. There are other approaches we could take to the GOAT debate that incorporate peak Elo ratings and longevity at various levels, but that’s not what we’re talking about when we count slams.

If we are going to focus so much on the slam count, we might as well acknowledge that Nadal’s 20 is better than Federer’s 20, and Djokovic’s 17 is awfully close to both of them.

The Post-Covid WTA is Drifting Back to Normal

In the two latest WTA events, we saw a mix of the expected and the unusual. Simona Halep, the heavy favorite in Prague, wound up with the title despite a couple of demanding three-setters in her first two rounds. The week’s other tournament, in Lexington, failed to follow the script. Serena Williams and Aryna Sabalenka, the big hitters at the top and bottom of the bracket, combined for three wins, with four unseeded players making up the semi-final field.

Last week I pointed out that Palermo–the tour’s initial comeback event–was so unpredictable that you would’ve been better off to treat each match as a coin flip than to use pre-layoff player strength ratings (such as Elo) to forecast outcomes. Such an upset-ridden event isn’t unheard of, even in pandemic-free times, but it is suggestive that the WTA rank-and-file haven’t quite returned to their usual form.

Prague and Lexington give us three times as much data to work with. Plus, we might theorize that Prague would be a little more predictable because so many players in that field also took part in the Palermo event, meaning that they have a little more recent match experience. While our sample of 93 main draw matches is still flimsy, it brings us a little closer to understanding how well traditional forecasts will handle this unusual time.

A thorny Brier patch

The metric I’m using to quantify predictability–or to put it another way, the validity of pre-layoff player ratings–is Brier Score, which takes into account both raw accuracy (did the forecast pick the right player to win?) and confidence level (was the forecast too strong, too weak, or just right?). Tour-level Brier Scores are usually in the range of 0.21, while a score of 0.25 means the predictions were no better than coin flips. A lower score represents more accurate predictions.

Here are the Brier Scores for Palermo, Lexington, and Prague, along with the average of the three, and the average of all WTA International events (on all surfaces) since 2017. (The scores are based on forecasts generated from my Elo ratings.) We might expect the first round to be different, since players are particularly rusty at that stage, so I’ve also broken out first round (“R32 Brier”) matches for each of the tournaments and averages in the table.

Tournament    Brier  R32 Brier  
Palermo       0.268      0.295  
Lexington     0.226      0.170  
Prague        0.212      0.247  
Comeback Avg  0.235      0.237  
Intl Avg      0.217      0.213

As we saw last week, the Palermo results truly defied expectations. More than half of the matches were upsets (according to my Elo ratings), with a particularly unpredictable first round.

That didn’t last. The Prague first round rated 0.247–just barely better than coin flips–but the messiness didn’t last beyond the first couple of days. The event’s overall Brier Score was 0.212, slightly better than the average WTA International. In other words, this group of 32 women, only recently returned from a months-long break, delivered results that were roughly as predictable as we would expect in the middle of a normal season.

The Lexington numbers are a bit more difficult to make sense of, but like Prague’s, they point to a post-coronavirus world that isn’t all that weird. The opening round closely followed the script, with a Brier Score of 0.170. Of the last 115 WTA International events, only 22 were more predictable. The forecast accuracy didn’t last, in large part because of Serena’s loss at the hands of Shelby Rogers. The rating for the entire tournament was 0.226, less predictable than usual, but much better than random guessing and closer to tour average than to the assumption-questioning Palermo numbers.

Revised estimates

We’re still early in the process of evaluating what to expect from players after the COVID-19 layoff. As more tournaments take place, we can identify whether players become more predictable with more matches under their belts. (Perhaps the Prague participants who skipped Palermo were more difficult to forecast, although Halep is an obvious counterexample.)

At this point, anything is possible. It could be that we will steadily drift back to business as usual. On the other hand, the new social-distancing-oriented rules–with few or no fans on site, nightlife limited to Netflix, players fetching their own towels, and new variations of on-court coaching–might work to the advantage of some women and the disadvantage of others. If that’s the case, Elo ratings will go through a novel period of adjustment as they shift to reflect which players thrive on the post-corona tour.

It’s too early to do much more than speculate about something as significant as that. But in the last week, we’ve seen forecasts go from wildly wrong (in Palermo) to not half bad (in Lexington and Prague). We’ve gained some confidence that for all the things that have obviously changed since March, our approach to player ratings may be one thing that largely remains the same.

Did Palermo Show the Signs of a Five-Month Pandemic Layoff?

Are tennis players tougher to predict when they haven’t played an official match for almost half a year? Last week’s WTA return-to-(sort-of)-normal in Palermo gave us a glimpse into that question. In a post last week I speculated that results would be tougher than usual to forecast for a while, necessitating some tweaks to my Elo algorithm. The 31 main draw matches from Sicily allow us to run some preliminary tests.

At first glance, the results look a bit surprising. Only two of the eight seeds reached the semifinals, and the ultimate champion was the unseeded Fiona Ferro. Two wild cards reached the quarters. Is that notably weird for a WTA International-level event? It doesn’t seem that strange, so let’s establish a baseline.

Palermo the unpredictable

My go-to metric for “predictability” is Brier Score, which measures the accuracy of percentage forecasts. It’s nice to pick the winner, but it’s more important to assign the right level of probability. If you say that 100 matches are all 60/40 propositions, your favorites should win 60 of the 100 matches. If they win 90, you weren’t nearly confident enough; if they win 50, you would’ve been better off flipping a coin. Brier Score encapsulates those notions into a single number, the lower the better. Roughly speaking, my Elo forecasts for ATP and WTA matches hover a bit above 0.2.

From 2017 through March 2020, the 975 completed matches at clay-court WTA International events had a collective Brier Score of 0.223. First round matches were a tiny bit more predictable, with R32’s scoring 0.219.

Palermo was a roller-coaster by comparison. The 31 main-draw matches combined for a Brier Score of 0.268. Of the 32 other events I considered, only last year’s Prague tourney was higher, generating a 0.277 mark.

The first round was more unpredictable still, at 0.295. On the other hand, the combination of a smaller per-event sample and the wide variety of first-round fields means that several tournaments were wilder for the first few days. 9 of the 32 others had a first-round Brier Score above 0.250, with four of them scoring higher–that is, worse–than Palermo did.

The Brier Score of shame

I mentioned the 0.250 mark because it is a sort of Brier Score of shame. Let’s say you’re predicting the outcome of a series of coin flips. The smart pick is 50/50 every time. It’s boring, but forecasting something more extreme just means you’re even more wrong half the time. If you set your forecast at 50% for a series of random events with a 50/50 chance of occurring, your Brier Score will be … 0.250.

Another way to put it is this: If your Brier Score is higher than 0.250, you would’ve been better off predicting that every match was 50/50. All the fancy forecasting went to waste.

In Palermo, 17 of the 31 matches went the way of the underdog, at least according to my Elo formula. The Brier Scores were on the shameful side of the line. My earlier post–which advocated moderating all forecasts, at least a bit–didn’t go far enough. At least so far, the best course would’ve been to scrap the algorithm entirely and start flipping that coin.

Moderating the moderation

All that said, I’m not quite ready to throw away my Elo ratings. (At the moment, they pick Simona Halep and Aryna Sabalenka, my two favorite players, to win in Prague and Lexington. So there’s that.) 31 matches is a small sample, far from adequate to judge the accuracy of a system designed to predict the outcome of thousands of matches each year. As I mentioned above, Elo failed even worse at Prague last year, but because that tournament didn’t follow several months of global shutdowns, it wouldn’t have even occurred to me to treat it as more than a blip.

This time, a week full of forecast-busting surprises could well be more than a blip. Treating players as if they have exactly the abilities they had in March is probably the wrong way to do things, and it could be a very wrong way of doing things. We’ll triple the size of our sample in the next week, and expand it even more over the next month. It won’t help us pick winners right now, but soon we’ll have a better idea of just how unpredictable the post-COVID-19 tennis world really is.

Elo, Meet COVID-19

Tennis is back, and no one knows quite what to expect. Unpredictability is the new normal at both the macro level–will the US Open be a virus-ridden disaster?–and the micro level–which players will come back stronger or weaker? While I plead ignorance on the macro issues, estimating player abilities is more in my line.

Thanks to global shutdowns, every professional player has spent almost five months away from ATP, WTA, and ITF events–“official” tournaments. Some pros, such as those who didn’t play in the few weeks before the shutdowns began, or who are opting not to compete at the first possible opportunity, will have sat out seven or eight months by the time they return to court. Exhibition matches have filled some of the gap, but not for every player.

Half a year is a long time without any official matches. Or, from the analyst’s perspective: It’s tough to predict a player’s performance without any data from the last six months.

Increased uncertainty

Let’s start with the obvious. All this time off means that we know less about each player’s current ability level than we did before the shutdown, back when most pros were competing every week or two. Back in March, my Elo ratings put Dominic Thiem in 5th place, with a rating of ~2050, and David Goffin in 15th, with a rating of ~1900. Those numbers gave Thiem a 70% chance of winning a head-to-head.

What about now? Both men have played in exhibitions, but can we be confident that their levels are the same as they were in March? Or that they’ve risen or fallen roughly the same amount? To me, it’s obvious that we can’t be as sure. Whenever our confidence drops, our predictions should move toward the “naive” prediction of a 50/50 coin flip. A six-month coronavirus layoff isn’t that severe, so it doesn’t mean that Thiem is no longer the favorite against Goffin, but it does mean our prediction should be closer to 50% than it was before.

So, 60%? Maybe 65%? Or 69%? I can’t answer that–yet, anyway.
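Mechanically, shading a forecast toward a coin flip is the easy part; the open question is how much weight to give the shrink. A quick sketch, where the shrink parameter is exactly the number I can’t yet justify:

    # Elo win expectancy plus a simple shrink toward a coin flip.
    # shrink=0 trusts the pre-layoff ratings fully; shrink=1 is pure 50/50.
    def elo_win_prob(rating, opp_rating):
        return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

    def shrunk_forecast(rating, opp_rating, shrink):
        return (1 - shrink) * elo_win_prob(rating, opp_rating) + shrink * 0.5

    print(round(elo_win_prob(2050, 1900), 2))           # ~0.70, March's Thiem-Goffin gap
    print(round(shrunk_forecast(2050, 1900, 0.25), 2))  # ~0.65 with a quarter-weight shrink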

The (injury) layoff penalty

My Elo ratings already incorporate a layoff penalty, which I introduced here. The idea is that if a player misses a substantial amount of time (usually due to injury, but possibly because of suspension, pregnancy, or other reasons), they usually play worse when they come back. But it’s tough to predict how much worse, and players regain their form at different rates.

Thus, the tweak to the rating formula has two components:

  • A one-time penalty based on the amount of time missed (more time off = bigger penalty)
  • A temporarily increased k-factor (the part of the formula that determines how much each match increases or decreases a player’s rating) to account for the initial uncertainty. After an injury, the k-factor increases by a bit more than 50%, and steadily declines back to the typical k-factor over the next 20 matches (sketched below).
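Here is roughly what that multiplier looks like. The 50% boost and the 20-match window match the description above, but the linear decay shape is an assumption rather than my production curve.

    # Illustrative post-layoff k-factor multiplier: boosted on return, then
    # declining back to 1.0 over the next 20 matches. The linear shape is an
    # assumption; the real curve just needs to decay steadily to normal.
    def k_multiplier(matches_since_return, boost=1.5, decay_matches=20):
        if matches_since_return >= decay_matches:
            return 1.0
        remaining = 1.0 - matches_since_return / decay_matches
        return 1.0 + (boost - 1.0) * remaining

    # first match back: 1.5x the usual k; tenth match: 1.25x; 20th and beyond: 1.0x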

Not an injury

A six-month coronavirus layoff is not an injury. (At least, not for players who haven’t lost practice time due to contracting COVID-19 or picking up other maladies.) So the injury-penalty algorithm can’t be applied as-is. But we can take away two ideas from the injury penalty:

  • If we generate those closer-to-50% forecasts by shifting certain players’ ratings downward, the penalty should be less than the injury penalty. (The minimum injury penalty is 100 Elo points for a non-offseason layoff of eight or nine weeks.)
  • The temporarily increased k-factor is a useful tool to handle the type of uncertainty that surrounds a player’s ability level after a layoff.

The injury-penalty framework is useful because it has been validated by data. We can look at hundreds of injury (and other) layoffs in modern tennis history and see how players fared upon return. And the numbers I use in the Elo formula are based on exactly that. We don’t have the same luxury with the last six months, because it is so unprecedented.

Not an offseason, but…

The closest thing we have to a half-year shutdown in existing tennis data is the offseason. The sport’s winter break is much shorter, and it isn’t the same for every player. Yet some of the dynamics are the same: Many players fill their time with exhibitions, others sit on the beach, some let injuries recover, others work particularly hard to improve their games, and so on.

Here’s a theory, then: The first few weeks of each season should be less predictable than average.

Fact check: False! For the years 2010-19, I labeled each match according to how many previous matches the two players had contested that year. If it was both players’ first match, the label was 1. If it was one player’s 15th match and the other’s 21st, the label was the average, 18. Then, I calculated the Brier Score–a measure of prediction accuracy–of the Elo-generated predictions for the matches with each label.
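The bookkeeping for that exercise is simple enough. A sketch, assuming each match record carries both players’ season match numbers (first match of the year = 1), an Elo forecast, and an outcome:

    # Group matches by "how far into the season" and score the forecasts.
    # Each record is assumed to be (match_num_p1, match_num_p2, forecast, outcome),
    # where forecast is the Elo probability for player 1 and outcome is 1 if p1 won.
    from collections import defaultdict

    def brier_by_label(records):
        buckets = defaultdict(list)
        for m1, m2, forecast, outcome in records:
            label = round((m1 + m2) / 2)          # 15th vs 21st match -> label 18
            buckets[label].append((forecast - outcome) ** 2)
        return {label: sum(sq) / len(sq) for label, sq in sorted(buckets.items())}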

The lower the Brier Score, the better. If my theory were right, we would see the highest Brier Scores for the first few matches of the season, followed by a decrease. Not exactly!

The jagged blue line shows the Brier Scores for each individual label (match 1, match 2, match 23, etc), while the orange line is a 5-match moving average that aims to represent the overall trend.

There’s not a huge difference throughout the season (which is reassuring), but the early-season trend is the opposite of what I predicted. Maybe the women, with their slightly longer offseason, will make me feel better?

No such luck. Again, the match-to-match variation in prediction accuracy is very small, and there’s no sign of early-season uncertainty.

I will not be denied

Despite disproving my own theory, I still expect to see an unpredictable couple of post-pandemic months. The regular offseason is something that players are accustomed to, and there is conventional wisdom in the game surrounding how to best use that time. And it’s two months, not five to seven. In addition, there are many other things that will make tour life more challenging–or different, at the very least–in 2020, such as limited crowds, social distancing protocols, and scheduling uncertainty. Some players will better handle those challenges than others, but it won’t necessarily be the strongest players who respond the best.

So my Elo ratings will, for the time being, incorporate a small penalty and a temporarily increased k-factor. (Something more like 69% for Thiem-Goffin, not 60%.) I haven’t finished the code yet, in large part because handling the two different types of layoffs–coronavirus and the usual injuries, etc–makes things very complicated. If you’re watching closely, you’ll see some minor tweaks to the numbers before the “Cincinnati” tournament in a few weeks.

There is a right answer

It’s clear from what I’ve written so far that any attempt to adjust Elo ratings for the COVID-19 layoff is a bit of a guessing game. But it won’t always be that way!

By the end of the year, we’ll know the right answer: just how unpredictable results turned out to be in the early going. Just as I’ve calculated penalties and k-factor adjustments for injury layoffs based on historical data, we will be able to do the same with match results from the second half of 2020. To be more precise, we’ll be able to work out a class of right answers, because one adjustment to the Elo formula will give us the best Brier Score, while another will best represent the gap between Novak Djokovic and Rafael Nadal, while others could target different goals.

The ultimate after-the-fact COVID-19 Elo-formula adjustment won’t help you win more money betting on tennis, but it will give us more insight into how the coronavirus layoff affected players after so much time off, and how quickly they returned to pre-layoff form. We’ll understand a little bit more about the game, even if we desperately hope never to have reason to apply the newly-won knowledge.

Who’s the GOAT? Balancing Career and Peak Greatness With Elo Ratings

On this week’s podcast, Carl, Jeff and I briefly discussed where Caroline Wozniacki ranks among Open-era greats. She’s among the top ten measured by weeks at the top of the rankings, but she has won only a single major. By Jeff’s Championship Shares metric, she’s barely in the top 30.

I posed the same question on Twitter, and the hive mind cautiously placed her outside the top 20:

https://twitter.com/tennisabstract/status/1214491560026484737

It’s difficult to compare different sorts of accomplishments–such as weeks at number one, majors won, and other titles–even without trying to adjust for different eras. It’s also challenging to measure different types of careers against each other. For more than a decade, Wozniacki has been a consistent threat near the top of the game, while other players who won more slams did so in a much shorter burst of elite-level play.

Elo to the rescue

How good must a player be before she is considered “great?” I don’t expect everyone to agree on this question, and as we’ll see, a precise consensus isn’t necessary. If we take a look at the current Elo ratings, a very convenient round number presents itself. Seven players rate higher than 2000: Ashleigh Barty, Naomi Osaka, Bianca Andreescu, Simona Halep, Karolina Pliskova, Elina Svitolina, and Petra Kvitova. Aryna Sabalenka just misses.

Another 25 active players have reached an Elo rating of at least 2000 at their peak, from all-time greats such as Serena Williams and Venus Williams down to others who had brief, great-ish spells, such as Alize Cornet and Anastasia Pavlyuchenkova. Since 1977, 88 women finished at least one season with an Elo rating of 2000 or higher, and 60 of them did so at least twice.

(I’m using 1977 because of limitations in the data. I don’t have complete match results–or anything close!–for the early and mid 1970s. Unfortunately, that means we’ll underrate some players who began their careers before 1977, such as Chris Evert, and we’ll severely undervalue the greats of the prior decade, such as Billie Jean King and Margaret Court.)

The resulting list of 60 includes anyone you might consider an elite player from the last 45 years, along with the usual dose of surprises. (Remember Irina Spirlea?) I’ll trot out the full list in a bit.

Measuring magnitude

A year-end Elo rating of 2000 is an impressive achievement. But among greats, that number is a mere qualifying standard. Serena has had years above 2400, and Steffi Graf once cleared the 2500 mark. For each season, we’ll convert the year-end Elo into a “greatness quotient” that is simply the difference between the year-end Elo and our threshold of 2000. Barty finished her 2019 season with a rating of 2123, so her greatness quotient (GQ) is 123.

(Yes, I know it isn’t a quotient. “Greatness difference” doesn’t quite have the same ring.)

To measure a player’s greatness over the course of her career, we simply find the greatness quotient for each season in which she finished above 2000, and add them together. For Serena, that means a whopping 20 single-season quotients. Wozniacki had nine such seasons, and so far, Barty has two. I’ll have more to say shortly about why I like this approach and what the numbers are telling us.
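In code, the career calculation is about as simple as it gets; the threshold is the only real decision.

    # Greatness quotient: sum the points above 2000 for every qualifying
    # year-end Elo rating in a player's career.
    THRESHOLD = 2000

    def greatness_quotient(year_end_elos):
        return sum(elo - THRESHOLD for elo in year_end_elos if elo >= THRESHOLD)

    # Barty's 2019 year-end rating of 2123 contributes 123 points; a season
    # finished below 2000 (the 1950 here is hypothetical) contributes nothing.
    print(greatness_quotient([2123, 1950]))  # 123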

First, let’s look at the rankings. I’ve shown every player with at least two qualifying seasons. “Seasons” is the number of years with year-end Elos of 2000 or better, and “Peak” is the highest year-end Elo the player achieved:

Rank  Player                     Seasons  Peak    GQ  
1     Steffi Graf                     14  2505  4784  
2     Serena Williams                 20  2448  4569  
3     Martina Navratilova             17  2442  4285  
4     Venus Williams                  14  2394  2888  
5     Chris Evert                     14  2293  2878  
6     Lindsay Davenport               12  2353  2744  
7     Monica Seles                    11  2462  2396  
8     Maria Sharapova                 13  2287  2280  
9     Justine Henin                    9  2411  2237  
10    Martina Hingis                   8  2366  1932  
11    Kim Clijsters                    9  2366  1754  
12    Gabriela Sabatini                9  2271  1560  
13    Arantxa Sanchez Vicario         12  2314  1556  
14    Amelie Mauresmo                  6  2279  1113  
15    Victoria Azarenka                9  2261  1082  
16    Jennifer Capriati                8  2214   929  
17    Jana Novotna                     9  2189   848  
18    Conchita Martinez               11  2191   836  
19    Caroline Wozniacki               9  2189   674  
20    Tracy Austin                     5  2214   647  
                                                      
Rank  Player                     Seasons  Peak    GQ  
21    Mary Pierce                      8  2161   637  
22    Elena Dementieva                 9  2140   629  
23    Simona Halep                     7  2108   562  
24    Svetlana Kuznetsova              6  2136   543  
25    Hana Mandlikova                  6  2160   516  
26    Jelena Jankovic                  4  2178   450  
27    Pam Shriver                      5  2160   431  
28    Vera Zvonareva                   5  2117   414  
29    Agnieszka Radwanska              8  2106   399  
30    Ana Ivanovic                     5  2133   393  
31    Petra Kvitova                    6  2132   346  
32    Na Li                            4  2095   310  
33    Anastasia Myskina                4  2164   290  
34    Anke Huber                       6  2072   277  
35    Mary Joe Fernandez               4  2110   274  
36    Nadia Petrova                    6  2094   265  
37    Dinara Safina                    3  2132   240  
38    Andrea Jaeger                    4  2087   237  
39    Angelique Kerber                 4  2109   224  
40    Nicole Vaidisova                 3  2121   222  
                                                      
Rank  Player                     Seasons  Peak    GQ  
41    Manuela Maleeva Fragniere        6  2059   194  
42    Anna Chakvetadze                 2  2107   174  
43    Ashleigh Barty                   2  2123   162  
44    Helena Sukova                    3  2078   150  
45    Jelena Dokic                     2  2110   142  
46    Iva Majoli                       2  2067   119  
47    Elina Svitolina                  3  2052   108  
48    Garbine Muguruza                 2  2061    98  
49    Zina Garrison                    2  2065    96  
50    Samantha Stosur                  3  2061    92  
51    Daniela Hantuchova               2  2050    80  
52    Irina Spirlea                    2  2064    76  
53    Nathalie Tauziat                 3  2041    73  
54    Patty Schnyder                   2  2057    70  
55    Chanda Rubin                     3  2034    68  
56    Marion Bartoli                   2  2033    66  
57    Sandrine Testud                  2  2041    62  
58    Magdalena Maleeva                2  2024    41  
59    Karolina Pliskova                2  2028    37  
60    Dominika Cibulkova               2  2007     7

You’ll probably find fault with some of the ordering here. While it isn’t the exact list I’d construct, either, my first reaction is that this is an extremely solid result for such a simple algorithm. In general, the players with long peaks are near the top–but only because they were so good for much of that time. A long peak, like that of Conchita Martinez, isn’t an automatic ticket into the top ten.

From the opposite perspective, this method gives plenty of respect to women who were extremely good for shorter periods of time. Both Amelie Mauresmo and Tracy Austin crack the top 20 with six or fewer qualifying seasons, while others with just as many years above the 2000 mark, such as Manuela Maleeva Fragniere, find themselves much lower on the list.

Steffi, Serena, and the threshold

It’s worth thinking about what exactly the Elo rating threshold of 2000 means. At the simplest level, we’re drawing a line, below which we don’t consider a player at all. (Sorry, Aryna, your time will come!) Less obviously, we’re defining how great seasons compare to one another.

For instance, we’ve seen that Barty’s 2019 GQ was 123. Graf’s 1989 season, with a year-end Elo rating of 2505, gave her a GQ of 505. Our threshold choice of 2000 implies that Graf’s peak season has approximately four times the value of Barty’s. That’s not a natural law. If we changed the threshold to 1900, Barty’s GQ would be 223, compared to Graf’s best of 605. As a result, Steffi’s season is only worth about three times as much.

The lower the threshold, the more value we give to longevity and the less value we give to truly outstanding seasons. If we lower the threshold to 1950, Steffi and Serena swap places at the top of the list. (Either way, it’s close.) Even though Williams had one of the highest peaks in tennis history, it’s her longevity that truly sets her apart.

I don’t want to get hung up on whether Serena or Steffi should be at the top of this list–it’s not a precise measurement, so as far as I’m concerned, it’s basically a tie. (And that’s without even raising the issue of era differences.) I also don’t want to tweak the parameters just to get a result or two to look different.

Ranking Woz

I began this post with a question about Caroline Wozniacki. As we’ve seen, greatness quotient places her 19th among players since 1977–almost exactly halfway between her position on the weeks-at-number-one list and her standing on the title-oriented Championship Shares table.

If we had better data for the first decade of the Open era, Wozniacki and many others would see their rankings fall by at least a few spots. King, Court, and Evonne Goolagong Cawley would knock her into the 20s. Virginia Wade might claim a slot in the top 20 as well. We can quibble about the exact result, but we’ve nailed down a plausible range for the 2018 Australian Open champion.

One-number solutions like this aren’t perfect, in part because they depend on assumptions like the Elo threshold discussed above. Just because they give us authoritative-looking lists doesn’t mean they are the final word.

On the other hand, they offer an enormous benefit, allowing us to get around the unresolvable minor debates about the level of competition when she reached number one, the luck of the draw at grand slams she won and lost, the impact of her scheduling on ranking, and so on. By building a rating based on every opponent and match result, Elo incorporates all this data. When ranking all-time greats, many fans already rely too much on one single number: the career slam count. Greatness quotient is a whole lot better than that.

An Introduction to Tennis Elo

Elo is a superior rating system to the ranking formulas used by the ATP and WTA. If you’ve spent much time reading this blog or listening to the podcast, you’ve probably heard me say that many times. But unless you’ve been exposed to Elo before, or done some research on your own, you might think of it as a sort of “magic” system. It’s worth digging in to understand better how it works.

The basic algorithm

The principle behind any Elo system is that each player’s rating is an estimate of their strength, and each match (or tournament) allows us to update that estimate. If a player wins, her rating goes up; if she loses, it goes down.

Where Elo excels is in determining the amount by which a rating should increase or decrease. There are two main variables that are taken into account: How many matches are already in the system (that is, how much confidence we have in the pre-match rating), and the quality of the opponent.

If you think about it for a moment, you’ll see that these two variables are a good approximation of how we already think about player strength. The more we already know about a player, the less we will change our opinion based on one match. Novak Djokovic’s round-robin loss to Dominic Thiem in London was a surprise, but only the most apocalyptic Djokovic fans saw the result as a disaster that should substantially change our estimate of his playing ability. Similarly, we adjust our opinion based on opponent quality. A loss to Thiem is disappointing, but a loss to, say, Marco Cecchinato is more concerning. The Elo system incorporates those natural intuitions.
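Here is a bare-bones sketch of a single Elo update in Python. The expected-score formula is the standard one (it reappears in the predictions section below); the k-factor function, which shrinks as a player accumulates matches, is purely illustrative, not the tuned version I actually use:

def expected_score(rating, opp_rating):
    # Win probability implied by the rating gap
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

def k_factor(matches_played):
    # Illustrative only: weight each result less as we learn more about a player
    return 250 / ((matches_played + 5) ** 0.4)

def updated_rating(rating, opp_rating, won, matches_played):
    # Move the rating toward the result, scaled by surprise and by confidence
    actual = 1 if won else 0
    return rating + k_factor(matches_played) * (actual - expected_score(rating, opp_rating))

# A 1500-rated newcomer upsets an established 1800-rated opponent
print(round(updated_rating(1500, 1800, won=True, matches_played=0)))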

Elo rating ranges

Traditionally, a player is given an Elo rating of 1500 when he enters the system–before any results come in. That number is completely arbitrary. All that matters is the difference between player ratings, so if we started each competitor with 0, 100, or 888, the end result of those differences would remain the same.

When I began calculating Elo ratings, I kept with tradition and started every player at 1500. Since then, I’ve expanded my view to Challengers (and the women’s ITF equivalent) and tour-level qualifying. If I started each new player at those levels with 1500 points, it would re-scale the entire system, which would be confusing. Instead, I replaced 1500 with a number in the low 1200s (it depends a bit on tournament level and gender) so that the ratings would remain on approximately the same scale as before.

At the moment, the ATP and WTA top-ranked players are Rafael Nadal and Ashleigh Barty, at 2203 and 2123, respectively. The best players are often in this range, and the very best often approach 2500. According to the most recent version of my algorithm, Djokovic’s peak was 2470, and Serena Williams’s best was 2473.

The 2000-point mark is a good rule of thumb to separate the elites from the rest. At the moment, six men and seven women have ratings that high. A rating of at least 1900 currently belongs to 16 men and 18 women, and a rating of 1800 is roughly equivalent to a place in the top 50.

Era comparisons and Elo inflation

Once we attach a single peak rating to every player, it’s only natural to start comparing across eras. While it’s always fun to do so, I’m not sure any rating system allows for useful cross-era comparisons in tennis. Elo doesn’t, either.

What you can do with Elo is compare how each player fared against her competition. In 1990, Helena Sukova achieved a rating of 2123–exactly the same as Barty’s today. That doesn’t mean that Sukova then was as good as Barty is now. But it does mean that their performance relative to their peers was similar. The second tier of players was considerably weaker thirty years ago, so in a sense it was easier to achieve such a rating. At the time, Sukova’s rating was only good for 11th place, far behind Steffi Graf’s 2600.

Thus, Elo doesn’t allow you to rank players across eras unless you are confident that the level of competition was similar–or unless you have some other way of dealing with that issue, a minefield that many researchers have tried to cross, with little success.

A related issue is Elo inflation or deflation, which can also complicate cross-era comparisons. Every time a match is played, the winner and loser effectively “trade” some of their points, so the total number of Elo rating points in the system doesn’t change. However, every time a new player enters the system, the total number of points increases. And whenever a player retires, the total number of points decreases.

It would be nice if additions and subtractions canceled each other out, but for many competitions that use Elo, they don’t. Additions tend to outweigh subtractions, so Elo ratings increase over time. That doesn’t appear to be the case with my tennis ratings, at least in part because of the penalty I’ve introduced for injury absences, but it does serve as a reminder that the number of points in the system changes over time, for reasons unrelated to the strength of the top players. (I’ll have more to say about the absence penalty below.)

Elo predictions

Elo gives us a rating for every player, and we’re getting a sense of what we can and can’t do with them.

One of the main purposes of any rating system is to predict the outcome of matches–something that Elo does better than most others, including the ATP and WTA rankings. The only input necessary to make a prediction is the difference between two players’ ratings, which you can then plug into the following formula:

1 - (1 / (1 + 10^(difference / 400)))

If we wanted to forecast a rematch of the last match of the Davis Cup Finals, we would take the Elo ratings of Nadal and Denis Shapovalov (2203 and 1947), find the difference (256), and plug it into the formula, for a result of 81.4%, Nadal’s chance of winning. If we used the negative difference (-256), we’d get 18.6%, Shapovalov’s odds of scoring the upset.
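Translated directly into Python, the formula looks like this:

def win_probability(rating, opp_rating):
    # Elo expected score: the higher-rated player gets the larger share
    return 1 - 1 / (1 + 10 ** ((rating - opp_rating) / 400))

print(round(win_probability(2203, 1947), 3))  # 0.814 -- Nadal
print(round(win_probability(1947, 2203), 3))  # 0.186 -- Shapovalov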

My version of tennis Elo is based on the most common match format, best-of-three matches. In a best-of-five match, the favorite has a better chance of winning. The math for converting best-of-three to best-of-five is a bit complicated, but for those interested, I’ve posted some code. The point is that an adjustment must be made. If the Nadal-Shapovalov rematch happens at the best-of-five Australian Open, Rafa’s 81.4% edge will increase to 86.7%.
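For the curious, here is a sketch of one common way to make that conversion, assuming each set is an independent coin flip with the same probability: back out the implied set-win probability from the best-of-three forecast, then recombine it into a best-of-five forecast. This is an illustration of the idea, not necessarily the exact method in my posted code:

def bo3_from_set(p):
    # P(win a best-of-3 match) given per-set win probability p
    return p ** 2 * (3 - 2 * p)

def bo5_from_set(p):
    # P(win a best-of-5 match) given per-set win probability p
    q = 1 - p
    return p ** 3 * (1 + 3 * q + 6 * q ** 2)

def bo3_to_bo5(match_prob):
    # Invert the best-of-3 formula by bisection, then re-aggregate to best-of-5
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if bo3_from_set(mid) < match_prob:
            lo = mid
        else:
            hi = mid
    return bo5_from_set((lo + hi) / 2)

print(round(bo3_to_bo5(0.814), 3))  # roughly 0.867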

Adjusting Elo for surface

For most sports, we could stop here. A match is a match, with only minor variations. In tennis, though, ratings and predictions should vary quite a bit based on surface.

My solution is a bit complicated. For each player, I maintain four separate Elo ratings: overall, hard court only, clay court only, and grass court only. I don’t differentiate between outdoor and indoor hard. For instance, Thiem’s ratings are 2066 overall, 1942 on hard, 2031 on clay, and 1602 on grass. (Surface ratings tend to be lower: Thiem’s clay rating is third-best on tour, miles ahead of everyone except for Nadal and Djokovic.)

These single-surface ratings tell us how we would rank players if we simply threw away results on every other surface. That’s not realistic, though. Single-surface ratings aren’t great at predicting match results. A better solution is to take a 50/50 blend of single-surface and overall ratings. If we wanted to predict Thiem’s chances in a clay-court match, we’d use a half-and-half mix of his 2066 overall rating and his 2031 clay-court rating. My weekly Elo reports show the single-surface ratings as “HardRaw” (and so on), and the blended ratings as “hElo,” “cElo,” and “gElo.”

There is no natural law that dictates a 50/50 blend. Every adjustment I’ve made to the basic Elo algorithm is determined solely by what works. (More on that below.) Initially, I suspected that a blend between single-surface and overall ratings would be appropriate, because a player’s success on one surface has some correlation with his success on others. I expected the blend to be different for each surface–perhaps using a higher percentage of the overall rating for grass, because there are fewer matches on the surface. In the end, my testing showed that 50/50 worked for each surface.
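The blend itself is nothing fancier than an average, which then feeds into the usual prediction formula. Using Thiem’s numbers from above:

def blended_rating(overall, surface):
    # 50/50 mix of overall and single-surface ratings
    return (overall + surface) / 2

print(blended_rating(2066, 2031))  # 2048.5 -- Thiem's clay-court blend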

Non-adjustments

Ask some tennis fans which tournaments and matches matter more–for rankings, for GOAT debates, whatever–and you can find yourself with a long, detailed list of what factors determine greatness. Maybe slams are more important than masters and premiers, though those are less important than tour finals and the Olympics, and of course finals are key, plus head-to-heads against certain players… you get the idea.

Elo provides for such adjustments. A coefficient usually referred to as the “k factor” allows us to give greater weight to certain matches. It’s a common feature of Elo ratings in other sports, which might, for example, use a higher k factor for postseason games than for regular season games. However, I’ve tested all sorts of different k factors for the likely types of “important” matches, and I’ve yet to find a tweak to the system that consistently improves its ability to predict match outcomes.

The absence penalty

There’s one exception. When players miss substantial amounts of time, I reduce their rating, and then increase the k factor for several matches after their return. I’ve explained more of the details in a previous post.

These steps are a logical extension of the Elo framework, especially when you consider our usual mental adjustments when a player misses time. If a player is injured for a few months, we never know quite what to expect when she returns. Maybe she’s as strong as ever; maybe she’s still a step slow. Perhaps she’ll return to normal quickly; she might never fully return to form. An extended absence raises a lot of questions. An injured player rarely returns in better form than when she left, and many players are worse upon return, so the average post-injury performance level is lower than it was before the absence.

Therefore, when a player first returns, our estimate must be that she is worse. However, a few strong early results should be weighted more heavily–hence the higher k factor, which reflects the fact that, immediately after an absence, we aren’t as confident as usual in our estimate.

The algorithm gets complicated, but the logic is simple. It’s basically just an attempt to work out a rigorous version of statements like, “I don’t know how well he’ll play when he comes back, but I’ll be watching closely.”
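For the sake of illustration, here is what that logic could look like in code. The penalty size, the length of absence that triggers it, and the post-return k multiplier below are placeholders, not the tuned values from my system (those are in the earlier post):

# Placeholder parameters, for illustration only
ABSENCE_DAYS = 240       # how long a layoff must be to trigger the penalty
RATING_PENALTY = 100     # points deducted when the player returns
BOOSTED_MATCHES = 10     # matches that get a higher k factor after the return
K_MULTIPLIER = 1.5       # how much to inflate k during that stretch

def rating_on_return(rating, days_out):
    # Knock the rating down after a long absence
    return rating - RATING_PENALTY if days_out >= ABSENCE_DAYS else rating

def k_after_return(base_k, matches_since_return):
    # Weight the first few post-comeback results more heavily
    return base_k * K_MULTIPLIER if matches_since_return < BOOSTED_MATCHES else base_k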

One side benefit of the absence penalty is that it counteracts Elo’s natural tendency toward ratings inflation. While more players enter the system than leave it, adding to the total number of available points, the penalty removes some points without re-allocating them to other players.

Validating Elo and adjustments

I’ve mentioned “testing” a few times, and I started this article with a claim that Elo is superior to the official ranking systems. What does that mean, and how do we know?

The simplest way to compare rating systems is a metric called “accuracy,” which counts correct predictions. There were 50 singles matches at the Davis Cup finals, and Elo picked the winner in 36 of them, for an accuracy rating of 72%. The ATP rankings picked the winner (in the sense that the higher-ranked player won the match) in 30 of them, for an accuracy rating of 60%. In this tiny experiment, Elo trounced the official rankings. Elo is also considerably better over the course of the entire season.

A better metric for this purpose is Brier score, which takes into account the confidence of each forecast. We saw earlier that Elo gives Nadal an 81.4% chance of beating Shapovalov. If Nadal ends up winning, 81.4% is a “better” forecast than, say, 65%, but it’s a “worse” forecast than 90%. Brier score takes the squared distance between the forecast (81.4%) and the result (0% or 100%, depending on the winner), and averages those numbers for all forecasted matches. It rewards aggressive forecasts that prove correct, but because it uses squared distance, it severely punishes predictions that are aggressive but wrong.

A more intuitive way to think about what Brier score is getting at is to imagine that Nadal and Shapovalov play 100 matches in a row. (Or, more accurately but less intuitively, imagine that 100 identical Nadals play simultaneous matches against 100 identical Shapovalovs.) A forecast of 81.4% means that we would expect Nadal to win 81 or 82 of those matches. If Nadal ends up winning 90, the forecast wasn’t Rafa-friendly enough. We’ll never get 100 simultaneous matches like this, but we do have thousands of individual matches, many of which share the same predictions, like a 60% chance of the favorite winning. Brier score aggregates all of those prediction-and-result pairs and spits out a number to tell us how we’re doing.
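In code, the whole metric fits in a couple of lines. The forecast/result pairs below are made up for the example; a result of 1 means the forecasted favorite won:

def brier_score(forecasts, results):
    # Mean squared distance between each forecast and the actual outcome
    return sum((f - r) ** 2 for f, r in zip(forecasts, results)) / len(forecasts)

forecasts = [0.814, 0.60, 0.55, 0.90]
results = [1, 0, 1, 1]
print(round(brier_score(forecasts, results), 3))  # 0.152 -- lower is better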

It’s tough to forecast the result of individual tennis matches. Any system, no matter how sophisticated, is going to be wrong an awful lot of the time. In many cases, the “correct” forecast is barely better than no forecast at all, if the evidence suggests that the competitors are equally matched. Thus, “accuracy” is of limited use–it’s more important to have the right amount of confidence than to simply pick winners.

All of this is to say: My Elo ratings have a much lower (better) Brier score than predictions derived from ATP and WTA rankings. Elo forecasts aren’t quite as good as betting odds, or else I’d be spending more time wagering and less time writing about rating systems.

Brier score is also the measure that tells us whether a certain adjustment–such as surface blends, injury absences, or tournament type–constitutes an improvement to the system. Assessing an injury penalty lowers the Brier score of the overall set of Elo forecasts, so we keep it. Decreasing the k factor for first-round matches has no effect, so we skip it.

Additional resources

My current Elo ratings: ATP | WTA

Extending Elo to doubles

… and mixed doubles

Code for tennis Elo (in R, not written by me)

A good introduction to Brier score
