Elo, Meet COVID-19

Tennis is back, and no one knows quite what to expect. Unpredictability is the new normal at both the macro level–will the US Open be a virus-ridden disaster?–and the micro level–which players will come back stronger or weaker? While I plead ignorance on the macro issues, estimating player abilities is more in my line.

Thanks to global shutdowns, every professional player has spent almost five months away from ATP, WTA, and ITF events–“official” tournaments. Some pros, such as those who didn’t play in the few weeks before the shutdowns began, or who are opting not to compete at the first possible opportunity, will have sat out seven or eight months by the time they return to court. Exhibition matches have filled some of the gap, but not for every player.

Half a year is a long time without any official matches. Or, from the analyst’s perspective: It’s tough to predict a player’s performance without any data from the last six months.

Increased uncertainty

Let’s start with the obvious. All this time off means that we know less about each player’s current ability level than we did before the shutdown, back when most pros were competing every week or two. Back in March, my Elo ratings put Dominic Thiem in 5th place, with a rating of ~2050, and David Goffin in 15th, with a rating of ~1900. Those numbers gave Thiem a 70% chance of winning a head-to-head.

What about now? Both men have played in exhibitions, but can we be confident that their levels are the same as they were in March? Or that they’ve risen or fallen roughly the same amount? To me, it’s obvious that we can’t be as sure. Whenever our confidence drops, our predictions should move toward the “naive” prediction of a 50/50 coin flip. A six-month coronavirus layoff isn’t that severe, so it doesn’t mean that Thiem is no longer the favorite against Goffin, but it does mean our prediction should be closer to 50% than it was before.

So, 60%? Maybe 65%? Or 69%? I can’t answer that–yet, anyway.

The (injury) layoff penalty

My Elo ratings already incorporate a layoff penalty, which I introduced in an earlier post. The idea is that if a player misses a substantial amount of time (usually due to injury, but possibly because of suspension, pregnancy, or other reasons), they usually play worse when they come back. But it’s tough to predict how much worse, and players regain their form at different rates.

Thus, the tweak to the rating formula has two components:

  • A one-time penalty based on the amount of time missed (more time off = bigger penalty)
  • A temporarily increased k-factor (the part of the formula that determines how much each match increases or decreases a player’s rating) to account for the initial uncertainty. After an injury, the k-factor increases by a bit more than 50%, and steadily declines back to the typical k-factor over the next 20 matches.
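
The two components above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual rating code: the baseline k of 32, the exact 50% boost (the text says only "a bit more than 50%"), and the linear shape of the decline are all placeholders, and the size of the one-time penalty is left as an input because its formula isn't given here.

```python
BASE_K = 32.0          # hypothetical baseline k-factor (not from the article)
BOOST = 0.5            # assumed boost; the article says "a bit more than 50%"
RECOVERY_MATCHES = 20  # matches over which the boost fades, per the article


def k_after_layoff(matches_since_return: int) -> float:
    """Return the k-factor for a player's next match after a layoff.

    The boost is largest for the first match back and shrinks linearly
    (an assumption) to zero once RECOVERY_MATCHES matches have been played.
    """
    remaining = max(0, RECOVERY_MATCHES - matches_since_return)
    return BASE_K * (1.0 + BOOST * remaining / RECOVERY_MATCHES)


def apply_layoff_penalty(rating: float, penalty: float) -> float:
    """Apply the one-time penalty; how it scales with time missed is not modeled."""
    return rating - penalty
```

With these placeholder values, the first match back uses a k-factor 50% above baseline, the boost is half gone after ten matches, and the 21st match back uses the ordinary k-factor.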

Not an injury

A six-month coronavirus layoff is not an injury. (At least, not for players who haven’t lost practice time due to contracting COVID-19 or picking up other maladies.) So the injury-penalty algorithm can’t be applied as-is. But we can take away two ideas from the injury penalty:

  • If we generate those closer-to-50% forecasts by shifting certain players’ ratings downward, the penalty should be less than the injury penalty. (The minimum injury penalty is 100 Elo points for a non-offseason layoff of eight or nine weeks.)
  • The temporarily increased k-factor is a useful tool to handle the type of uncertainty that surrounds a player’s ability level after a layoff.

The injury-penalty framework is useful because it has been validated by data. We can look at hundreds of injury (and other) layoffs in modern tennis history and see how players fared upon return. And the numbers I use in the Elo formula are based on exactly that. We don’t have the same luxury with the last six months, because a tour-wide shutdown of this length is unprecedented.

Not an offseason, but…

The closest thing we have to a half-year shutdown in existing tennis data is the offseason. The sport’s winter break is much shorter, and it isn’t the same for every player. Yet some of the dynamics are the same: Many players fill their time with exhibitions, others sit on the beach, some let injuries recover, others work particularly hard to improve their games, and so on.

Here’s a theory, then: The first few weeks of each season should be less predictable than average.

Fact check: False! For the years 2010-19, I labeled each match according to how many previous matches the two players had contested that year. If it was both players’ first match, the label was 1. If it was one player’s 15th match and the other’s 21st, the label was the average, 18. Then, I calculated the Brier Score–a measure of prediction accuracy–of the Elo-generated predictions for the matches with each label.
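
For readers who want to try the same check, here is a minimal sketch of the labeling and grouping step. The row layout and the sample numbers are invented for illustration; only the labeling rule (average of the two players' match counts) and the use of Brier Score come from the description above.

```python
from collections import defaultdict


def match_label(match_no_a: int, match_no_b: int) -> float:
    """Average of the two players' season match numbers (1 = season opener)."""
    return (match_no_a + match_no_b) / 2


def brier_by_label(rows):
    """Group squared forecast errors by label and average them.

    rows: iterable of (match_no_a, match_no_b, forecast_a, a_won), where
    forecast_a is the pre-match probability that player A wins.
    """
    squared_errors = defaultdict(list)
    for no_a, no_b, forecast_a, a_won in rows:
        outcome = 1.0 if a_won else 0.0
        label = match_label(no_a, no_b)
        squared_errors[label].append((forecast_a - outcome) ** 2)
    return {label: sum(errs) / len(errs)
            for label, errs in sorted(squared_errors.items())}


# Made-up rows, purely to show the shape of the input.
sample = [
    (1, 1, 0.70, True),
    (1, 1, 0.60, False),
    (15, 21, 0.80, True),
]
scores = brier_by_label(sample)
```

A moving average over the resulting per-label scores would then give a smoothed trend line.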

The lower the Brier Score, the better. If my theory were right, we would see the highest Brier Scores for the first few matches of the season, followed by a decrease. Not exactly!

The jagged blue line shows the Brier Scores for each individual label (match 1, match 2, match 23, etc), while the orange line is a 5-match moving average that aims to represent the overall trend.

There’s not a huge difference throughout the season (which is reassuring), but the early-season trend is the opposite of what I predicted. Maybe the women, with their slightly longer offseason, will make me feel better?

No such luck. Again, the match-to-match variation in prediction accuracy is very small, and there’s no sign of early-season uncertainty.

I will not be denied

Despite disproving my own theory, I still expect to see an unpredictable couple of post-pandemic months. The regular offseason is something that players are accustomed to, and there is conventional wisdom in the game surrounding how to best use that time. And it’s two months, not five to seven. In addition, there are many other things that will make tour life more challenging–or different, at the very least–in 2020, such as limited crowds, social distancing protocols, and scheduling uncertainty. Some players will better handle those challenges than others, but it won’t necessarily be the strongest players who respond the best.

So my Elo ratings will, for the time being, incorporate a small penalty and a temporarily increased k-factor. (Something more like 69% for Thiem-Goffin, not 60%.) I haven’t finished the code yet, in large part because handling the two different types of layoffs–coronavirus and the usual injuries, etc–makes things very complicated. If you’re watching closely, you’ll see some minor tweaks to the numbers before the “Cincinnati” tournament in a few weeks.

There is a right answer

It’s clear from what I’ve written so far that any attempt to adjust Elo ratings for the COVID-19 layoff is a bit of a guessing game. But it won’t always be that way!

By the end of the year, we’ll know the right answer: just how unpredictable results turned out to be in the early going. Just as I’ve calculated penalties and k-factor adjustments for injury layoffs based on historical data, we will be able to do the same with match results from the second half of 2020. To be more precise, we’ll be able to work out a class of right answers, because one adjustment to the Elo formula will give us the best Brier Score, while another will best represent the gap between Novak Djokovic and Rafael Nadal, while others could target different goals.

The ultimate after-the-fact COVID-19 Elo-formula adjustment won’t help you win more money betting on tennis, but it will give us more insight into how the coronavirus layoff affected players after so much time off, and how quickly they returned to pre-layoff form. We’ll understand a little bit more about the game, even if we desperately hope never to have reason to apply the newly-won knowledge.

Who’s the GOAT? Balancing Career and Peak Greatness With Elo Ratings

On this week’s podcast, Carl, Jeff and I briefly discussed where Caroline Wozniacki ranks among Open-era greats. She’s among the top ten measured by weeks at the top of the rankings, but she has won only a single major. By Jeff’s Championship Shares metric, she’s barely in the top 30.

I posed the same question on Twitter, and the hive mind cautiously placed her outside the top 20:

https://twitter.com/tennisabstract/status/1214491560026484737

It’s difficult to compare different sorts of accomplishments–such as weeks at number one, majors won, and other titles–even without trying to adjust for different eras. It’s also challenging to measure different types of careers against each other. For more than a decade, Wozniacki has been a consistent threat near the top of the game, while other players who won more slams did so in a much shorter burst of elite-level play.

Elo to the rescue

How good must a player be before she is considered “great?” I don’t expect everyone to agree on this question, and as we’ll see, a precise consensus isn’t necessary. If we take a look at the current Elo ratings, a very convenient round number presents itself. Seven players rate higher than 2000: Ashleigh Barty, Naomi Osaka, Bianca Andreescu, Simona Halep, Karolina Pliskova, Elina Svitolina, and Petra Kvitova. Aryna Sabalenka just misses.

Another 25 active players have reached an Elo rating of at least 2000 at their peak, from all-time greats such as Serena Williams and Venus Williams down to others who had brief, great-ish spells, such as Alize Cornet and Anastasia Pavlyuchenkova. Since 1977, 88 women finished at least one season with an Elo rating of 2000 or higher, and 60 of them did so at least twice.

(I’m using 1977 because of limitations in the data. I don’t have complete match results–or anything close!–for the early and mid 1970s. Unfortunately, that means we’ll underrate some players who began their careers before 1977, such as Chris Evert, and we’ll severely undervalue the greats of the prior decade, such as Billie Jean King and Margaret Court.)

The resulting list of 60 includes anyone you might consider an elite player from the last 45 years, along with the usual dose of surprises. (Remember Irina Spirlea?) I’ll trot out the full list in a bit.

Measuring magnitude

A year-end Elo rating of 2000 is an impressive achievement. But among greats, that number is a mere qualifying standard. Serena has had years above 2400, and Steffi Graf once cleared the 2500 mark. For each season, we’ll convert the year-end Elo into a “greatness quotient” that is simply the difference between the year-end Elo and our threshold of 2000. Barty finished her 2019 season with a rating of 2123, so her greatness quotient (GQ) is 123.

(Yes, I know it isn’t a quotient. “Greatness difference” doesn’t quite have the same ring.)

To measure a player’s greatness over the course of her career, we simply find the greatness quotient for each season in which she finished above 2000, and add them together. For Serena, that means a whopping 20 single-season quotients. Wozniacki had nine such seasons, and so far, Barty has two. I’ll have more to say shortly about why I like this approach and what the numbers are telling us.
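
The calculation is simple enough to sketch in a few lines. The function name is made up for this example, and the sample list mixes one real figure (Barty's 2019 year-end rating of 2123) with invented sub-threshold seasons to show that they contribute nothing.

```python
THRESHOLD = 2000


def greatness_quotient(year_end_elos, threshold=THRESHOLD):
    """Sum of (year-end Elo - threshold) over seasons that finished above it."""
    return sum(elo - threshold for elo in year_end_elos if elo > threshold)


# 2123 is from the text; 1850 and 1990 are invented filler seasons.
print(greatness_quotient([1850, 1990, 2123]))  # 123
```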

First, let’s look at the rankings. I’ve shown every player with at least two qualifying seasons. “Seasons” is the number of years with year-end Elos of 2000 or better, and “Peak” is the highest year-end Elo the player achieved:

Rank  Player                     Seasons  Peak    GQ  
1     Steffi Graf                     14  2505  4784  
2     Serena Williams                 20  2448  4569  
3     Martina Navratilova             17  2442  4285  
4     Venus Williams                  14  2394  2888  
5     Chris Evert                     14  2293  2878  
6     Lindsay Davenport               12  2353  2744  
7     Monica Seles                    11  2462  2396  
8     Maria Sharapova                 13  2287  2280  
9     Justine Henin                    9  2411  2237  
10    Martina Hingis                   8  2366  1932  
11    Kim Clijsters                    9  2366  1754  
12    Gabriela Sabatini                9  2271  1560  
13    Arantxa Sanchez Vicario         12  2314  1556  
14    Amelie Mauresmo                  6  2279  1113  
15    Victoria Azarenka                9  2261  1082  
16    Jennifer Capriati                8  2214   929  
17    Jana Novotna                     9  2189   848  
18    Conchita Martinez               11  2191   836  
19    Caroline Wozniacki               9  2189   674  
20    Tracy Austin                     5  2214   647  
                                                      
Rank  Player                     Seasons  Peak    GQ  
21    Mary Pierce                      8  2161   637  
22    Elena Dementieva                 9  2140   629  
23    Simona Halep                     7  2108   562  
24    Svetlana Kuznetsova              6  2136   543  
25    Hana Mandlikova                  6  2160   516  
26    Jelena Jankovic                  4  2178   450  
27    Pam Shriver                      5  2160   431  
28    Vera Zvonareva                   5  2117   414  
29    Agnieszka Radwanska              8  2106   399  
30    Ana Ivanovic                     5  2133   393  
31    Petra Kvitova                    6  2132   346  
32    Na Li                            4  2095   310  
33    Anastasia Myskina                4  2164   290  
34    Anke Huber                       6  2072   277  
35    Mary Joe Fernandez               4  2110   274  
36    Nadia Petrova                    6  2094   265  
37    Dinara Safina                    3  2132   240  
38    Andrea Jaeger                    4  2087   237  
39    Angelique Kerber                 4  2109   224  
40    Nicole Vaidisova                 3  2121   222  
                                                      
Rank  Player                     Seasons  Peak    GQ  
41    Manuela Maleeva Fragniere        6  2059   194  
42    Anna Chakvetadze                 2  2107   174  
43    Ashleigh Barty                   2  2123   162  
44    Helena Sukova                    3  2078   150  
45    Jelena Dokic                     2  2110   142  
46    Iva Majoli                       2  2067   119  
47    Elina Svitolina                  3  2052   108  
48    Garbine Muguruza                 2  2061    98  
49    Zina Garrison                    2  2065    96  
50    Samantha Stosur                  3  2061    92  
51    Daniela Hantuchova               2  2050    80  
52    Irina Spirlea                    2  2064    76  
53    Nathalie Tauziat                 3  2041    73  
54    Patty Schnyder                   2  2057    70  
55    Chanda Rubin                     3  2034    68  
56    Marion Bartoli                   2  2033    66  
57    Sandrine Testud                  2  2041    62  
58    Magdalena Maleeva                2  2024    41  
59    Karolina Pliskova                2  2028    37  
60    Dominika Cibulkova               2  2007     7

You’ll probably find fault with some of the ordering here. While it isn’t the exact list I’d construct, either, my first reaction is that this is an extremely solid result for such a simple algorithm. In general, the players with long peaks are near the top–but only because they were so good for much of that time. A long peak, like that of Conchita Martinez, isn’t an automatic ticket into the top ten.

From the opposite perspective, this method gives plenty of respect to women who were extremely good for shorter periods of time. Both Amelie Mauresmo and Tracy Austin crack the top 20 with six or fewer qualifying seasons, while others with as many years with an Elo of 2000 or higher, such as Manuela Maleeva Fragniere, find themselves much lower on the list.

Steffi, Serena, and the threshold

It’s worth thinking about what exactly the Elo rating threshold of 2000 means. At the simplest level, we’re drawing a line, below which we don’t consider a player at all. (Sorry, Aryna, your time will come!) Less obviously, we’re defining how great seasons compare to one another.

For instance, we’ve seen that Barty’s 2019 GQ was 123. Graf’s 1989 season, with a year-end Elo rating of 2505, gave her a GQ of 505. Our threshold choice of 2000 implies that Graf’s peak season has approximately four times the value of Barty’s. That’s not a natural law. If we changed the threshold to 1900, Barty’s GQ would be 223, compared to Graf’s best of 605. As a result, Steffi’s season is only worth about three times as much.

The lower the threshold, the more value we give to longevity and the less value we give to truly outstanding seasons. If we lower the threshold to 1950, Steffi and Serena swap places at the top of the list. (Either way, it’s close.) Even though Williams had one of the highest peaks in tennis history, it’s her longevity that truly sets her apart.
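
The arithmetic in this section can be checked with a short sketch. The ratings and career totals come from the text and the table above. The career comparison assumes that lowering the threshold adds no newly qualifying seasons (years finishing between 1950 and 2000), which the table doesn't let us verify, so treat those totals as approximations.

```python
def season_ratio(elo_a, elo_b, threshold):
    """How many times more valuable season A is than season B at a threshold."""
    return (elo_a - threshold) / (elo_b - threshold)


def shifted_gq(gq_at_2000, seasons, new_threshold):
    """Career GQ after moving the threshold, counting only seasons already above 2000."""
    return gq_at_2000 + seasons * (2000 - new_threshold)


graf_vs_barty_2000 = season_ratio(2505, 2123, 2000)  # about 4.1
graf_vs_barty_1900 = season_ratio(2505, 2123, 1900)  # about 2.7

graf_1950 = shifted_gq(4784, 14, 1950)    # 5484
serena_1950 = shifted_gq(4569, 20, 1950)  # 5569
```

Each qualifying season gains the same 50 points when the threshold drops to 1950, so the player with more qualifying seasons gains more in total. That is why Serena's 20 seasons overtake Graf's 14.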

I don’t want to get hung up on whether Serena or Steffi should be at the top of this list–it’s not a precise measurement, so as far as I’m concerned, it’s basically a tie. (And that’s without even raising the issue of era differences.) I also don’t want to tweak the parameters just to get a result or two to look different.

Ranking Woz

I began this post with a question about Caroline Wozniacki. As we’ve seen, greatness quotient places her 19th among players since 1977–almost exactly halfway between her position on the weeks-at-number-one list and her standing on the title-oriented Championship Shares table.

If we had better data for the first decade of the Open era, Wozniacki and many others would see their rankings fall by at least a few spots. King, Court, and Evonne Goolagong Cawley would knock her into the 20s. Virginia Wade might claim a slot in the top 20 as well. We can quibble about the exact result, but we’ve nailed down a plausible range for the 2018 Australian Open champion.

One-number solutions like this aren’t perfect, in part because they depend on assumptions like the Elo threshold discussed above. Just because they give us authoritative-looking lists doesn’t mean they are the final word.

On the other hand, they offer an enormous benefit, allowing us to get around the unresolvable minor debates about the level of competition when she reached number one, the luck of the draw at grand slams she won and lost, the impact of her scheduling on ranking, and so on. By building a rating based on every opponent and match result, Elo incorporates all this data. When ranking all-time greats, many fans already rely too much on one single number: the career slam count. Greatness quotient is a whole lot better than that.

An Introduction to Tennis Elo

Elo is a superior rating system to the ranking formulas used by the ATP and WTA. If you’ve spent much time reading this blog or listening to the podcast, you’ve probably heard me say that many times. But unless you’ve been exposed to Elo before, or done some research on your own, you might think of it as a sort of “magic” system. It’s worth digging in to understand better how it works.

The basic algorithm

The principle behind any Elo system is that each player’s rating is an estimate of their strength, and each match (or tournament) allows us to update that estimate. If a player wins, her rating goes up; if she loses, it goes down.

Where Elo excels is in determining the amount by which a rating should increase or decrease. There are two main variables that are taken into account: How many matches are already in the system (that is, how much confidence we have in the pre-match rating), and the quality of the opponent.

If you think about it for a moment, you’ll see that these two variables are a good approximation of how we already think about player strength. The more we already know about a player, the less we will change our opinion based on one match. Novak Djokovic’s round-robin loss to Dominic Thiem in London was a surprise, but only the most apocalyptic Djokovic fans saw the result as a disaster that should substantially change our estimate of his playing ability. Similarly, we adjust our opinion based on opponent quality. A loss to Thiem is disappointing, but a loss to, say, Marco Cecchinato is more concerning. The Elo system incorporates those natural intuitions.
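
A minimal sketch of the standard Elo update shows both intuitions at work. The ratings and k values below are hypothetical, and how k is derived from the number of matches already in the system isn't specified here, so it is simply passed in.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo win expectancy for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def update(rating: float, opponent: float, won: bool, k: float) -> float:
    """Return the new rating after one match."""
    result = 1.0 if won else 0.0
    return rating + k * (result - expected_score(rating, opponent))


# Hypothetical numbers: a 2200-rated favorite loses to a strong opponent
# and to a much weaker one.
loss_to_strong = update(2200, 2050, won=False, k=20)
loss_to_weak = update(2200, 1750, won=False, k=20)

# A larger k (less confidence in the pre-match rating) makes the same loss cost more.
loss_low_confidence = update(2200, 2050, won=False, k=40)
```

The loss to the weaker opponent costs more points than the loss to the stronger one, and doubling k doubles the size of either adjustment.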

Elo rating ranges

Traditionally, a player is given an Elo rating of 1500 when he enters the system–before any results come in. That number is completely arbitrary. All that matters is the difference between player ratings, so if we started each competitor with 0, 100, or 888, the end result of those differences would remain the same.

When I began calculating Elo ratings, I kept with tradition and started every player with 1500. Since then, I’ve expanded my view to Challengers (and the women’s ITF equivalent) and tour-level qualifying. Starting each new player at those levels with 1500 points would have re-scaled the entire system, which would have been confusing. Instead, I replaced 1500 with a number in the low 1200s (it depends a bit on tournament level and gender) so that the ratings would remain approximately the same.

At the moment, the ATP and WTA top-ranked players are Rafael Nadal and Ashleigh Barty, at 2203 and 2123, respectively. The best players are often in this range, and the very best often approach 2500. According to the most recent version of my algorithm, Djokovic’s peak was 2470, and Serena Williams’s best was 2473.

The 2000-point mark is a good rule of thumb to separate the elites from the rest. At the moment, six men and seven women have ratings that high. 16 men and 18 women have Elo ratings of at least 1900, and a rating of 1800 is roughly equivalent to a place in the top 50.

Era comparisons and Elo inflation

Once we attach a single peak rating to every player, it’s only natural to start comparing across eras. While it’s always fun to do so, I’m not sure any rating system allows for useful cross-era comparisons in tennis. Elo doesn’t, either.

What you can do with Elo is compare how each player fared against her competition. In 1990, Helena Sukova achieved a rating of 2123–exactly the same as Barty’s today. That doesn’t mean that Sukova then was as good as Barty is now. But it does mean that their performance relative to their peers was similar. The second tier of players was considerably weaker thirty years ago, so in a sense it was easier to achieve such a rating. At the time, Sukova’s rating was only good for 11th place, far behind Steffi Graf’s 2600.

Thus, Elo doesn’t allow you to rank players across eras unless you are confident that the level of competition was similar–or unless you have some other way of dealing with that issue, a minefield that many researchers have tried to cross, with little success.

A related issue is Elo inflation or deflation, which can also complicate cross-era comparisons. Every time a match is played, the winner and loser effectively “trade” some of their points, so the total number of Elo rating points in the system doesn’t change. However, every time a new player enters the system, the total number of points increases. And whenever a player retires, the total number of points decreases.
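
A toy example makes the bookkeeping concrete. It assumes both players share a single k-factor (20 here, a hypothetical value); if k differs by player, the trade is only approximately even. The ratings and the 1500 starting value are placeholders as well.

```python
def expected_score(rating, opponent):
    """Standard Elo win expectancy for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def play(ratings, winner, loser, k=20.0):
    """Update both players in place; k=20 is a hypothetical shared k-factor."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


ratings = {"A": 2100.0, "B": 1900.0}
total_before = sum(ratings.values())

play(ratings, winner="B", loser="A")
total_after_match = sum(ratings.values())  # unchanged, up to float rounding

ratings["C"] = 1500.0  # a newcomer brings new points into the system
total_after_entry = sum(ratings.values())

del ratings["A"]  # a retirement takes points out
total_after_retirement = sum(ratings.values())
```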

It would be nice if additions and subtractions canceled each other out, but for many competitions that use Elo, they don’t. Additions tend to outweigh subtractions, so Elo ratings increase over time. That doesn’t appear to be the case with my tennis ratings, at least in part because of the penalty I’ve introduced for injury absences, but it does serve as a reminder that the number of points in the system changes over time, for reasons unrelated to the strength of the top players. (I’ll have more to say about the absence penalty below.)

Elo predictions

Elo gives us a rating for every player, and we’re getting a sense of what we can and can’t do with them.

One of the main purposes of any rating system is to predict the outcome of matches–something that Elo does better than most others, including the ATP and WTA rankings. The only input necessary to make a prediction is the difference between two players’ ratings, which you can then plug into the following formula:

1 – (1 / (1 + (10^((difference) / 400))))

If we wanted to forecast a rematch of the last match of the Davis Cup Finals, we would take the Elo ratings of Nadal and Denis Shapovalov (2203 and 1947), find the difference (256), and plug it into the formula, for a result of 81.4%, Nadal’s chance of winning. If we used the negative difference (-256), we’d get 18.6%, Shapovalov’s odds of scoring the upset.
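
In code, the formula and the Nadal-Shapovalov example look like this (the function name is just for illustration):

```python
def win_probability(rating_diff: float) -> float:
    """Chance of winning for the player whose rating edge is `rating_diff`."""
    return 1 - (1 / (1 + 10 ** (rating_diff / 400)))


nadal, shapovalov = 2203, 1947
p_nadal = win_probability(nadal - shapovalov)       # about 0.814
p_shapovalov = win_probability(shapovalov - nadal)  # about 0.186
```

The two probabilities always sum to one, so only one of them needs to be computed.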

My version of tennis Elo is based on the most common match format, best-of-three matches. In a best-of-five match, the favorite has a better chance of winning. The math for converting best-of-three to best-of-five is a bit complicated, but for those interested, I’ve posted some code. The point is that an adjustment must be made. If the Nadal-Shapovalov rematch happens at the best-of-five Australian Open, Rafa’s 81.4% edge will increase to 86.7%.
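
One way to do the conversion, sketched here under a simplifying assumption, is to treat sets as independent with a constant per-set win probability: find the set probability that produces the best-of-three forecast, then reuse it for best-of-five. This is not necessarily how the posted code works, and the function names are made up for this example.

```python
def best_of_three(p_set: float) -> float:
    """Match win probability in best-of-three, given a per-set win probability."""
    return p_set ** 2 * (3 - 2 * p_set)


def best_of_five(p_set: float) -> float:
    """Match win probability in best-of-five, given a per-set win probability."""
    q = 1 - p_set
    return p_set ** 3 * (1 + 3 * q + 6 * q ** 2)


def bo3_to_bo5(p_match: float, tol: float = 1e-9) -> float:
    """Convert a best-of-three forecast to best-of-five via bisection on p_set."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # best_of_three is increasing in p_set, so bisection converges.
        if best_of_three(mid) < p_match:
            lo = mid
        else:
            hi = mid
    return best_of_five((lo + hi) / 2)


p5 = bo3_to_bo5(0.814)  # roughly 0.87 under these assumptions
```

Under the independence assumption, the result lands close to the 86.7% quoted above, and a 50% forecast stays at 50% in either format.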

Adjusting Elo for surface

For most sports, we could stop here. A match is a match, with only minor variations. In tennis, though, ratings and predictions should vary quite a bit based on surface.

My solution is a bit complicated. For each player, I maintain four separate Elo ratings: overall, hard court only, clay court only, and grass court only. I don’t differentiate between outdoor and indoor hard. For instance, Thiem’s ratings are 2066 overall, 1942 on hard, 2031 on clay, and 1602 on grass. (Surface ratings tend to be lower: Thiem’s clay rating is third-best on tour, miles ahead of everyone except for Nadal and Djokovic.)

These single-surface ratings tell us how we would rank players if we simply threw away results on every other surface. That’s not realistic, though. Single-surface ratings aren’t great at predicting match results. A better solution is to take a 50/50 blend of single-surface and overall ratings. If we wanted to predict Thiem’s chances in a clay-court match, we’d use a half-and-half mix of his 2066 overall rating and his 2031 clay-court rating. My weekly Elo reports show the single-surface ratings as “HardRaw” (and so on), and the blended ratings as “hElo,” “cElo,” and “gElo.”
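
As a quick illustration of the blend, using Thiem's numbers from above (the function and its `surface_weight` parameter are hypothetical names, with 0.5 matching the 50/50 mix):

```python
def blended_rating(overall: float, surface: float, surface_weight: float = 0.5) -> float:
    """Mix a single-surface rating with the overall rating."""
    return surface_weight * surface + (1 - surface_weight) * overall


thiem_clay_blend = blended_rating(overall=2066, surface=2031)  # 2048.5
```

Exposing the weight as a parameter makes it easy to test other mixes, such as leaning more on the overall rating for grass.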

There is no natural law that dictates a 50/50 blend. Every adjustment I’ve made to the basic Elo algorithm is determined solely by what works. (More on that below.) Initially, I suspected that a blend between single-surface and overall ratings would be appropriate, because a player’s success on one surface has some correlation with his success on others. I expected the blend to be different for each surface–perhaps using a higher percentage of the overall rating for grass, because there are fewer matches on the surface. In the end, my testing showed that 50/50 worked for each surface.

Non-adjustments

Ask some tennis fans which tournaments’ matches matter more–for rankings, for GOAT debates, whatever–and you can find yourself with a long, detailed list of what factors determine greatness. Maybe slams are more important than masters and premiers, though those are less important than tour finals and the Olympics, and of course finals are key, plus head-to-heads against certain players… you get the idea.

Elo provides for such adjustments. A coefficient usually referred to as the “k factor” allows us to give greater weight to certain matches. This is common in Elo ratings for other sports, which might use a higher k factor for postseason games than for regular-season games. However, I’ve tested all sorts of different k factors for the likely types of “important” matches, and I’ve yet to find a tweak to the system that consistently improves its ability to predict match outcomes.

The absence penalty

There’s one exception. When players miss substantial amounts of time, I reduce their rating, and then increase the k factor for several matches after their return. I’ve explained more of the details in a previous post.

These steps are a logical extension of the Elo framework, especially when you consider our usual mental adjustments when a player misses time. If a player is injured for a few months, we never know quite what to expect when she returns. Maybe she’s as strong as ever; maybe she’s still a step slow. Perhaps she’ll return to normal quickly; she might never fully return to form. An extended absence raises a lot of questions. An injured player rarely returns in better form than when she left, while many players are worse upon return, giving us an average post-injury performance level that is worse than before the absence.

Therefore, when a player first returns, our estimate must be that she is worse. However, a few strong early results should be weighted more heavily–hence the higher k factor. The k factor reflects the fact that, immediately after an absence, we aren’t as confident as usual in our estimate.

The algorithm gets complicated, but the logic is simple. It’s basically just an attempt to work out a rigorous version of statements like, “I don’t know how well he’ll play when he comes back, but I’ll be watching closely.”

One side benefit of the absence penalty is that it counteracts Elo’s natural tendency toward ratings inflation. While more players enter the system than leave it, adding to the total number of available points, the penalty removes some points without re-allocating them to other players.

Validating Elo and adjustments

I’ve mentioned “testing” a few times, and I started this article with a claim that Elo is superior to the official ranking systems. What does that mean, and how do we know?

The simplest way to compare rating systems is a metric called “accuracy,” which counts correct predictions. There were 50 singles matches at the Davis Cup finals, and Elo picked the winner in 36 of them, for an accuracy rating of 72%. The ATP rankings picked the winner (in the sense that the higher-ranked player won the match) in 30 of them, for an accuracy rating of 60%. In this tiny experiment, Elo trounced the official rankings. Elo is also considerably better over the course of the entire season.

A better metric for this purpose is Brier score, which takes into account the confidence of each forecast. We saw earlier that Elo gives Nadal an 81.4% chance of beating Shapovalov. If Nadal ends up winning, 81.4% is a “better” forecast than, say, 65%, but it’s a “worse” forecast than 90%. Brier score takes the squared distance between the forecast (81.4%) and the result (0% or 100%, depending on the winner), and averages those numbers for all forecasted matches. It rewards aggressive forecasts that prove correct, but because it uses squared distance, it severely punishes predictions that are aggressive but wrong.
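
Here is a small sketch of the calculation, using the three candidate forecasts mentioned above. The function name is illustrative; forecasts are expressed as fractions (0.814 rather than 81.4%) and results as 1 or 0.

```python
def brier_score(forecasts, outcomes):
    """Mean squared distance between forecasts (0 to 1) and results (0 or 1)."""
    pairs = list(zip(forecasts, outcomes))
    return sum((f - o) ** 2 for f, o in pairs) / len(pairs)


# One match, three candidate forecasts for Nadal, and Nadal wins (outcome = 1).
for forecast in (0.65, 0.814, 0.90):
    print(forecast, round(brier_score([forecast], [1]), 4))

# The same forecasts if Shapovalov pulls the upset (outcome = 0).
for forecast in (0.65, 0.814, 0.90):
    print(forecast, round(brier_score([forecast], [0]), 4))
```

When Nadal wins, the 90% forecast scores best. When he loses, it scores worst by a wide margin, which is the squared-distance penalty for being aggressive and wrong.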

A more intuitive way to think about what Brier score is getting at is to imagine that Nadal and Shapovalov play 100 matches in a row. (Or, more accurately but less intuitively, imagine that 100 identical Nadals play simultaneous matches against 100 identical Shapovalovs.) A forecast of 81.4% means that we would expect Nadal to win 81 or 82 of those matches. If Nadal ends up winning 90, the forecast wasn’t Rafa-friendly enough. We’ll never get 100 simultaneous matches like this, but we do have thousands of individual matches, many of which share the same predictions, like a 60% chance of the favorite winning. Brier score aggregates all of those prediction-and-result pairs and spits out a number to tell us how we’re doing.

It’s tough to forecast the result of individual tennis matches. Any system, no matter how sophisticated, is going to be wrong an awful lot of the time. In many cases, the “correct” forecast is barely better than no forecast at all, if the evidence suggests that the competitors are equally matched. Thus, “accuracy” is of limited use–it’s more important to have the right amount of confidence than to simply pick winners.

All of this is to say: My Elo ratings have a much lower (better) Brier score than predictions derived from ATP and WTA rankings. Elo forecasts aren’t quite as good as betting odds, or else I’d be spending more time wagering and less time writing about rating systems.

Brier score is also the measure that tells us whether a certain adjustment–such as surface blends, injury absences, or tournament type–constitutes an improvement to the system. Assessing an injury penalty lowers the Brier score of the overall set of Elo forecasts, so we keep it. Decreasing the k factor for first-round matches has no effect, so we skip it.

Additional resources

My current Elo ratings: ATP | WTA

Extending Elo to doubles

… and mixed doubles

Code for tennis Elo (in R, not written by me)

A good introduction to Brier score

* * *


Andreescu, Medvedev, and the Future According to Elo

With the US Open title added to her 2019 trophy haul, Bianca Andreescu is finally a member of the WTA top 10, debuting at fifth on the ranking table. Daniil Medvedev, the breakout star of the summer on the men’s side, only cracked the ATP top 10 after Wimbledon. He’s now up to fourth. The official ranking algorithms employed by the tours take some time to adjust to the presence of new stars.

Elo, on the other hand, reacts quickly. While the ATP and WTA computers assign points based on a year’s worth of results (rounds reached, not opponent quality), Elo gives the most weight to recent accomplishments, with even greater emphasis placed on surprising outcomes, like upsets of top players. If your goal in using a ranking system is to predict the future, Elo is better: Elo-based forecasts significantly outperform predictions based on ATP and WTA ranking points.

Andreescu’s first Premier-level title came at Indian Wells in March, when she beat two top-ten players, Elina Svitolina and Angelique Kerber, in the semi-final and final. The WTA computer reacted by moving her up from 60th to 24th on the official list. Elo already saw Andreescu as a more formidable force after her run to the final in Auckland, so after Indian Wells, the algorithm moved her up to seventh. Three more wins in Miami, and the Canadian teen cracked the Elo top five.

Tennis fans are accustomed to the slow adjustments of the ranking system, so seeing a “(22)” or a “(15)” next to Andreescu’s name at Roland Garros and the US Open wasn’t particularly jarring. And there’s something to be said for withholding judgment, since tennis has had its share of teenage flashes in the pan. But Elo is usually right. The betting market heavily favored Serena Williams in the US Open final, but Elo saw the Canadian as the superior player, giving her a slight edge. After the latest seven match wins in New York, the algorithm rates Andreescu as the best player on tour, very narrowly edging out Ashleigh Barty. Would you dare disagree?

The launching (Ar)pad

When Medvedev first reached the top ten on the Elo list last October, I ran some numbers to compare the two ranking systems. Most players who earn a spot in the Elo top ten eventually make their way into the ATP top ten as well, but Elo is almost always first. On average, the algorithm picks top-tenners more than a half-year sooner than the tour’s computer. The 23-year-old Russian is a good example: He reached eighth place on the Elo list last October, but didn’t match that mark in the ATP rankings for another 10 months, after reaching the Montreal final.

Andreescu closed the gap faster than Medvedev did, needing a more typical six months to progress from Elo top-tenner to a single-digit WTA ranking. It may not take much longer before her Elo and WTA rankings converge at the top of both lists.

We no longer need Elo to tell us that Andreescu and Medvedev are likely to keep winning matches at the highest level. But having acknowledged the accuracy with which Elo glimpses the future, it’s worth looking at which players are likely to follow in their footsteps.

After the US Open, Elo’s boldest claim regards Matteo Berrettini, ranked sixth. The ATP computer places him at 13th, and he only made one brief stop this summer inside the top 20. The Flushing semi-finalist has been inside the Elo top 10 since mid-June, and the algorithm currently puts him ahead of such better-established young players as Alexander Zverev and Stefanos Tsitsipas.

The women’s Elo list doesn’t feature any similar surprises in the top 10, but that hardly means it agrees with the WTA computer. Karolina Muchova, currently at a career-high WTA ranking of 43rd, is 23rd on the Elo table. Two veteran threats, Victoria Azarenka and Venus Williams, are also marooned outside the official top 40, but Elo sees them as 18th and 28th best on tour, respectively. In terms of predictiveness, quality is more important than quantity, so a limited schedule isn’t necessarily seen as a drawback. Elo is also optimistic about Sofia Kenin, rating her 13th, compared to her official WTA standing of 20th.

Half a year from now, I’d bet Berrettini’s official ranking is closer to 6 than to 13, and that Muchova’s position is closer to 23 than 43. It’s impossible to tell the future, but if we’re interested in looking ahead, Elo gives us a six-month head start on the official rankings. We’ll have to wait and see whether the rest of the women’s tour can keep Andreescu away from the top spot for that long.

Slow Conditions Might Just Flip the Outcome of Federer-Nadal XL

Italian translation at settesei.it

Roger Federer likes his courts fast. Rafael Nadal likes them slow. With eight Wimbledon titles to his name, Federer is the superior grass court player, but the conditions at the All England Club have been unusually slow this year, closer to those of a medium-speed hard court.

On Friday, Federer and Nadal will face off for the 40th time, their first encounter at Wimbledon since the Spaniard triumphed in their historic 2008 title-match battle. Rafa leads the head-to-head 24-15, including a straight-set victory at his favorite slam, Roland Garros, several weeks ago. But before that, Roger had won five in a row–all on hard courts–the last three without dropping a set.

Because of the contrast in styles and surface preferences, the speed of the conditions–a catch-all term for surface, balls, weather, and so on–is particularly important. Nadal is 14-2 against his rival on clay, with Federer holding a 13-10 edge on hard and grass. Another way of splitting up the results is by my surface speed metric, Simple Speed Rating (SSR). Twenty-two of the matches have been on a court that is slower than tour average, with the other 17 at or above tour average speed:

Matches     Avg SSR  RN - RF  Unret%  <= 3 shots  Avg Rally  
SSR < 0.92     0.74     17-5   21.2%       49.5%        4.7  
SSR >= 1.0     1.14     7-10   27.0%       56.9%        4.3

At faster events–all of which are on hard or grass–fewer serves come back, more points end by the third shot, and the overall rally length is shorter. Fed has the edge, with 10 wins in 17 tries, while on slower surfaces–all of the clay matches, plus a handful of more stately hard courts–Rafa cleans up.

Rafa broke Elo

According to my surface-weighted Elo ratings, Federer is the big semi-final favorite. He leads Nadal by 300 points in the grass-only Elo ratings, which gives him a 75% chance of advancing to the final. The betting market strongly disagrees, believing that Rafa is the favorite, with a 57% chance of winning.

The collective wisdom of the punters is onto something. Elo has systematically underwhelmed when it comes to forecasting the 39 previous Fedal matches. Federer has more often been the higher-rated player, and if Roger and Rafa behaved like the algorithm expected them to, the Swiss would be narrowly leading the head-to-head, 21-18. We might reasonably conclude that, going into Friday’s semi-final, Elo is once again underestimating the King of Clay.

How big of a Fedal-specific adjustment is necessary? I fit a logit model to the previous 39 matches, using only the surface-weighted Elo forecast. The model makes a rough adjustment to account for Elo’s limitations, and reduces Roger’s chances of winning the semi-final from 74.8% all the way down to 48.5%.

Now, about those conditions

The updated 48.5% forecast takes the surface into account–that’s part of my Elo algorithm. But it doesn’t distinguish between slow grass and fast grass.

To fix that, I added SSR, my surface speed metric, to the logit model. The model’s prediction accuracy improved from 64% to 72%, its Brier score dropped slightly (a lower Brier score indicates better forecasts), and the revised model gives us a way of making surface-speed-specific forecasts for this matchup. Here are the forecasts for Federer at several surface speed ratings, from tour average (1.0) to the fastest ratings seen on the circuit:

SSR  p(Fed Wins)  
1.0        49.3%  
1.1        51.4%  
1.2        53.4%  
1.3        55.5%  
1.4        57.5%  
1.5        59.5% 
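
The fitted coefficients aren't published here, but the table itself is enough for a rough back-of-the-envelope reading. The sketch below assumes the forecast is a logistic curve that is linear in SSR (with the Elo input held fixed at its value for this match) and recovers an approximate slope and intercept by least squares on the log-odds of the six table rows. The helper names are made up for illustration, and the result is a readout of the table, not the model's actual parameters.

```python
import math

# (SSR, forecast probability of a Federer win), copied from the table above.
table = [(1.0, 0.493), (1.1, 0.514), (1.2, 0.534),
         (1.3, 0.555), (1.4, 0.575), (1.5, 0.595)]


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line([(ssr, logit(p)) for ssr, p in table])


def p_fed_wins(ssr):
    """Interpolated forecast for a Federer win at a given surface speed."""
    return inv_logit(intercept + slope * ssr)
```

Under these assumptions, each additional 0.1 of SSR adds a little over 0.08 to Federer's log-odds, and `p_fed_wins(1.24)` (the 2006 Wimbledon speed) comes out at about 54%.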

In the fifteen years since Rafa and Roger began their rivalry, the Wimbledon surface has averaged around 1.20, 20% quicker than tour average. In 2006, when they first met at SW19, it was 1.24, and in 2008, it was 1.15. Three times in the last decade it has topped 1.30, 30% faster than the average ATP surface. This year, it has dropped almost all the way to average, at 1.00, when both men’s and women’s results are taken into account.

As the table shows, such a dramatic difference in conditions has the potential to influence the outcome. On a faster surface, which we’ve seen as recently as 2014, Federer has the edge. At this year’s apparent level, the model narrowly favors Nadal. Rafa has said that the surface itself is unchanged, but that the balls have been heavier due to humidity. He should hope for another muggy day on Friday–the end result could depend on it.

Introducing Elo Ratings for Mixed Doubles

Scroll down for Wimbledon updates, including a forecast for the title match.

With Andy Murray and Serena Williams pairing up in this year’s Wimbledon mixed doubles event, more eyes than ever are on the tournament’s only mixed-gender draw. Mixed doubles is played just four times a year (plus the Olympics, the occasional exhibition, and the late Hopman Cup), so most partnerships are temporary, and it’s tough to get a sense of who is particularly good in the dual-gender format.

That’s where math comes into play. Over the last few years, I’ve deployed a variation of the Elo rating algorithm for men’s doubles. It treats each team as the average of the two members, and after every match, it adjusts each player’s rating based on the result and the quality of the opponent. Doubles Elo–D-Lo–is even better suited for mixed than for single-gender formats, because players rarely stick with the same partner. The main drawback of D-Lo for men’s or women’s doubles is that it doesn’t help us tease out the individual contributions of long-time teams such as Bob and Mike Bryan. By contrast, mixed doubles draws often look like a game of musical chairs from one major to the next.
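As a rough illustration of the mechanics just described, the sketch below rates a team as the average of its two members and moves each player's rating by the same amount after a match. It assumes the conventional 400-point Elo curve and a flat update size `k=20`; both are placeholders for illustration, since the actual constants aren't given here.

```python
def expected_score(rating_a, rating_b):
    """Conventional Elo win probability for side A (400-point logistic)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def team_rating(player_1, player_2):
    """A team is rated as the average of its two members."""
    return (player_1 + player_2) / 2.0


def update_doubles(winners, losers, k=20.0):
    """Return updated (winners, losers) rating pairs after one match.

    k is a hypothetical update size, not a value taken from this post.
    """
    p_win = expected_score(team_rating(*winners), team_rating(*losers))
    delta = k * (1.0 - p_win)  # upsets (small p_win) move ratings more
    return (tuple(r + delta for r in winners),
            tuple(r - delta for r in losers))


# Individual ratings quoted further down: Serena Williams 1847, Andy Murray 1648.
serandy = team_rating(1847, 1648)  # 1747.5, shown as 1748 in the team table

# Hypothetical opponents: two newcomers at the 1,500-point entry rating.
new_winners, new_losers = update_doubles((1847, 1648), (1500, 1500))
```

Because both partners receive the same adjustment, a long-standing pair's ratings move in lockstep, which is why this approach can't separate the individual contributions of a team that never splits up.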

The rating game

Let’s jump right in. The Wimbledon mixed doubles draw consists of 56 teams. Here are the 10 highest-rated of those 112 players, as of the start of the fortnight:

Rank  Player                 XD-Lo  
1     Venus Williams          1855  
2     Serena Williams         1847  
3     Bethanie Mattek-Sands   1834  
4     Jamie Murray            1809  
5     Ivan Dodig              1793  
6     Latisha Chan            1785  
7     Bruno Soares            1776  
8     Leander Paes            1771  
9     Heather Watson          1770  
10    Gabriela Dabrowski      1760

Serena and Venus Williams require a bit of an asterisk, since both are playing mixed for the first time after a long break. Venus last played at the 2016 Olympics, and Serena last competed in mixed at the 2012 French Open. Maybe they’re rusty. My XD-Lo algorithm doesn’t include any kind of adjustment for injuries or other layoffs, so it’s possible that we should expect them to perform at a lower level. On the other hand, they are among the greatest doubles players of all time, and players tend to age gracefully in doubles. Venus lost her opening match, but perhaps we should blame that on Frances Tiafoe (XD-Lo: 1,494). The sisters will probably trade places at the top of the list once Wimbledon results are incorporated.

Andy Murray’s rating is a decent but more pedestrian 1,648, so Murray/Williams is not the best team in the field. But they’re close. The strongest pair is Bethanie Mattek-Sands and Jamie Murray–3rd and 4th on the list above–followed by Ivan Dodig and Latisha Chan, 5th and 6th on the individual list. Due to the vagaries of ATP and WTA doubles rankings and the resulting seedings, Dodig/Chan entered the event as the narrow favorites, because they got a first-round bye and Murray/Mattek-Sands did not.

Here are the top ten teams in the draw:

Rank  Team                                XD-Lo  
1     Bethanie Mattek-Sands/Jamie Murray   1822  
2     Ivan Dodig/Latisha Chan              1789  
3     Bruno Soares/Nicole Melichar         1762  
4     Serena Williams/Andy Murray          1748  
5     Gabriela Dabrowski/Mate Pavic        1734  
6     Leander Paes/Samantha Stosur         1731  
7     Heather Watson/Henri Kontinen        1708  
8     Venus Williams/Frances Tiafoe        1674  
9     Abigail Spears/Marcelo Demoliner     1653  
10    Neal Skupski/Chan Hao-ching          1634

The top five have survived (though Murray/Mattek-Sands and Pavic/Dabrowski will complete their second-round match this afternoon, leaving only four), and of the last 18 teams standing, only one other–John Peers and Shuai Zhang–is rated above 1,600.

Forecasting SerAndy

Using my ratings, Murray/Williams entered the tournament with a 9.8% chance of winning. That made them fourth favorite, behind Dodig/Chan (17.1%), Murray/Mattek-Sands (16.3%), and the big-serving duo of Bruno Soares and Nicole Melichar (14.5%). I’ll update the forecast this evening, when the second round is finally complete.

Murray/Williams’s second-round match is against Fabrice Martin and Raquel Atawo. They are both excellent doubles players, though neither has excelled in mixed. Atawo, especially, has struggled. Her XD-Lo is 1,304, the third-lowest of anyone who has entered a mixed draw since 2012. (Shuai Peng is rated 1,268, and Marc Lopez owns last place with 1,252.) A player with no results at all enters the system with 1,500 points, so falling to 1,300 requires a lot of losing. The combined ratings translate into an 89% chance of a Murray/Williams victory.

The challenge comes in the third round. Soares/Melichar are the top seed, and they have already advanced to the round of 16, awaiting the winner of Murray/Williams and Martin/Atawo. Thus two of the top four teams will likely play for a place in the quarter-finals, with Soares/Melichar holding a narrow, 52% edge.

Historical peaks

Generating these current ratings required amassing a lot of data, so it would be a waste to ignore the history of the mixed doubles format. Here are the top 25 female mixed doubles players, ranked by their peak XD-Lo ratings:

Rank  Player                   Peak  
1     Billie Jean King         2043  
2     Greer Stevens            2035  
3     Margaret Court           2015  
4     Rosie Casals             2000  
5     Martina Navratilova      1998  
6     Helena Sukova            1991  
7     Anne Smith               1989  
8     Betty Stove              1985  
9     Jana Novotna             1977  
10    Martina Hingis           1964  
11    Wendy Turnbull           1956  
12    Kathy Jordan             1948  
13    Elizabeth Smylie         1947  
14    Arantxa Sanchez Vicario  1946  
15    Serena Williams          1942  
16    Venus Williams           1937  
17    Francoise Durr           1934  
18    Jo Durie                 1929  
19    Kristina Mladenovic      1922  
20    Zina Garrison            1901  
21    Samantha Stosur          1898  
22    Larisa Neiland           1891  
23    Lindsay Davenport        1888  
24    Victoria Azarenka        1887  
25    Natasha Zvereva          1886 

Venus really can’t catch a break. She’s one of the best players of all time, and Serena is always just a little bit better.

And the top 25 men:

Rank  Player               Peak XD-Lo  
1     Owen Davidson              2043  
2     Bob Hewitt                 2042  
3     Marty Riessen              2016  
4     Todd Woodbridge            2000  
5     Frew McMillan              1999  
6     Kevin Curren               1997  
7     Jim Pugh                   1995  
8     Ilie Nastase               1975  
9     Tony Roche                 1962  
10    Bob Bryan                  1949  
11    Rick Leach                 1938  
12    Mahesh Bhupathi            1933  
13    Mark Woodforde             1929  
14    Justin Gimelstob           1929  
15    Max Mirnyi                 1926  
16    John Lloyd                 1922  
17    Emilio Sanchez             1918  
18    Ken Flach                  1909  
19    Jeremy Bates               1908  
20    John Fitzgerald            1906  
21    Cyril Suk                  1902  
22    Wayne Black                1889  
23    Dick Stockton              1881  
24    Jean-Claude Barclay        1879  
25    Mike Bryan                 1875

Owen Davidson won eight mixed slams with Billie Jean King, plus three more with other partners. Bob Hewitt won six, spanning 18 years from 1961 to 1979. (We can’t erase his accomplishments from the history books, but any mention of Hewitt comes with the caveat that he is a convicted rapist who has since been expelled from the International Tennis Hall of Fame.)

It is interesting to see two famous pairs represented on the men’s list. Bob Bryan ranks 10th to Mike’s 25th, and Todd Woodbridge comes in 4th to Mark Woodforde’s 13th. We probably can’t conclude from mixed doubles results that one member of the team was a superior men’s doubles player, but it is one of the few data points that allows us to compare these partners.

The ignominious Spaniards

Finally, I can’t spend this much time with mixed doubles ratings without revisiting the case of David Marrero. You may recall the 2016 Australian Open, when Marrero’s first-round match with Lara Arruabarrena triggered “suspicious betting patterns.” As I wrote at the time, the most suspicious thing about it was that Marrero–who was terrible at mixed doubles and admitted that he played differently with a woman across the net–could still find a partner.

He entered that match with an XD-Lo rating of 1,349–the worst of any man in the draw, though Anastasia Pavlyuchenkova was a few points lower–and left it at 1,342. He played his last mixed doubles match at Wimbledon that year, and–surprise!–he lost. One hopes he’ll stick to men’s doubles for the remainder of his career, sticking with an XD-Lo rating of 1,326.

Marrero’s only saving grace is that he’s better than his compatriot Marc Lopez. Lopez has been active in mixed doubles more recently, entering the US Open last year with Arruabarrena. After that loss, he fell to his current rating of 1,252, the lowest mark recorded in the Open Era.

Fortunately for us, this year’s Wimbledon draw includes both Williams sisters, both Murray brothers, a healthy Mattek-Sands … and very few players as helpless in the mixed doubles format as Marrero or Lopez.

Update: Murray/Williams won their second-rounder, setting up the final 16. Mixed doubles isn’t the top scheduling priority, so it didn’t exactly work that way–by the time Muzzerena advanced, two other teams had already secured places in the quarter-finals. Ignoring those for the moment, here is the last-16 forecast:

Team                      QF     SF      F      W  
Soares/Melichar        52.5%  44.5%  33.2%  18.8%  
Murray/Williams        47.5%  39.7%  29.0%  15.8%  
Middelkoop/Yang        55.5%   9.5%   3.6%   0.8%  
Daniell/Brady          44.5%   6.3%   2.1%   0.4%  
Peers/Zhang            61.6%  36.9%  13.8%   5.2%  
Lindstedt/Ostapenko    38.4%  18.7%   5.2%   1.5%  
Skugor/Olaru           56.2%  26.3%   8.3%   2.6%  
Mektic/Rosolska        43.8%  18.0%   4.8%   1.3%  
                                                   
Team                      QF     SF      F      W  
Koolhof/Peschke        42.6%  10.1%   2.4%   0.6%  
Qureshi/Kichenok       57.4%  16.7%   4.9%   1.5%  
Sitak/Siegemund        27.4%  16.0%   5.3%   1.8%  
Pavic/Dabrowski        72.6%  57.2%  30.8%  17.5%  
Dodig/Chan             75.9%  64.6%  44.1%  28.1%  
Roger-Vasselin/Klepac  24.1%  15.5%   6.6%   2.5%  
Hoyt/Silva             54.1%  11.3%   3.5%   1.0%  
Vliegen/Zheng          45.9%   8.6%   2.5%   0.6% 

The two teams already in the quarters are Skugor/Olaru and Hoyt/Silva. Since both of their matches were close to 50/50, you can roughly double their odds, and the odds of the other teams are only a tiny bit less. The remaining six third-round matches are scheduled for Wednesday, and I’ll try to update again when those are in the books.

Update 2: Murray/Williams are out, so the number of people interested in mixed doubles has fallen from double digits back to the typical level of single digits. The departure of the singles stars also leaves one clear favorite in each half. Here is the updated forecast:

Team                    SF      F      W  
Soares/Melichar      83.4%  64.3%  36.4%  
Middelkoop/Yang      16.6%   6.7%   1.5%  
Lindstedt/Ostapenko  46.0%  12.6%   3.7%  
Skugor/Olaru         54.0%  16.4%   5.2%  
Koolhof/Peschke      37.5%   7.3%   1.8%  
Sitak/Siegemund      62.5%  17.2%   6.0%  
Dodig/Chan           84.4%  68.3%  43.3%  
Hoyt/Silva           15.6%   7.2%   2.0%

All four quarter-finals are scheduled for Thursday, so I’ll post another update tomorrow evening.

Update 3: We’re down to four teams. Of the Elo favorites in the quarter-finals, only Dodig/Chan survived, leaving them as the overwhelming pick to take the title. Here’s the full forecast:

Team                     F      W  
Middelkoop/Yang      42.3%   8.2%  
Lindstedt/Ostapenko  57.7%  14.1%  
Koolhof/Peschke      14.1%   6.3%  
Dodig/Chan           85.9%  71.4% 

Update 4: Both favorites won in Friday’s semi-finals, so we’ve got a final between Lindstedt/Ostapenko and Dodig/Chan. The first team didn’t get an opening-round bye, so they won one more match to get here. They also have a better story, since Ostapenko keeps hitting her partner in the head. Dodig/Chan entered as the 8th seeds, despite being the second-best team according to XD-Lo.

Consequently, Dodig/Chan get the edge here, with an 81% chance of winning the 2019 Wimbledon Mixed Doubles title.

How Good is Cori Gauff Right Now?

Italian translation at settesei.it

15-year-old sensation Cori Gauff holds a WTA ranking of No. 313. She has played only a limited number of events that are considered by the WTA’s system, so even before her impressive run began, we could’ve predicted that her ranking was an understatement. But by how much?

Gauff doesn’t show up yet on my Elo ratings list because, before Wimbledon qualies, she hadn’t played at least 20 matches at the ITF $50K level or higher in the last year. However, she still had a rating: 1,488, good for 194th place among those who had met the playing time minimum. A rating in that range translates to about a 3% chance of upsetting current top-ranked player Ashleigh Barty, and a 10% chance of beating someone around 20th, such as Donna Vekic. Given how little data we had to work with at that point, that seemed like a reasonable assessment.

Since she arrived in London, she has won six matches: Three in qualifying and three in the main draw, with wins over Venus Williams, Magdalena Rybarikova, and Polona Hercog. Not bad for a teenager who had previously won only one slam qualifying match and one tour-level main draw match in her young career!

194th place doesn’t seem like such a fair judgment anymore. Any player who comes through qualifying and reaches the fourth round at a major deserves some reassessment, and that’s even more applicable to a player about whom we knew so little two weeks ago. The tricky part is figuring out how much to adjust. Is Gauff now a top-100 player? Top 50? Top 20?

Revising with Elo

The Elo algorithm does a good job of approximating how humans make those reassessments: The more data we already have about a player, the less we will adjust her rating after a win or loss. The previous player to defeat Hercog was Simona Halep, at Eastbourne. We already have years’ worth of match results for Halep, and she was heavily favored to win the match. Thus, the fact that she recorded the victory alters our opinion of her by only a small amount. In Elo terms, it was an increase from 2,100 points to 2,102–basically nothing.

Gauff is a different story. Entering her third-round clash with Hercog, not only did we know very little about her skill level, it wasn’t even clear if she was the favorite. The result caused Elo to make a considerably larger adjustment, increasing her rating from 1,713 to 1,755, a rise 21 times greater than what Halep received after beating the same opponent. The 42-point jump caused her to leapfrog 16 players in the rankings.
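
One common way to get this behavior is to let the update size (the K factor) shrink as a player's match count grows. The sketch below is a generic illustration of that idea, not the formula behind these ratings: the `k_factor` constants, the match counts, and the ratings are all made-up inputs chosen only to show the effect.

```python
def expected_score(rating, opponent):
    """Conventional Elo win probability (400-point logistic)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def k_factor(matches_played, scale=250.0, offset=5.0, shape=0.4):
    """Hypothetical decaying K: large for newcomers, small for veterans."""
    return scale / (matches_played + offset) ** shape


def rating_after_win(rating, opponent, matches_played):
    """New rating after beating an opponent; bigger surprises earn more."""
    surprise = 1.0 - expected_score(rating, opponent)
    return rating + k_factor(matches_played) * surprise


# Same opponent, very different histories (all inputs are made up).
OPPONENT = 1700
veteran_gain = rating_after_win(2100, OPPONENT, matches_played=600) - 2100
newcomer_gain = rating_after_win(1700, OPPONENT, matches_played=25) - 1700
```

Two things drive the gap between the gains: the veteran was expected to win, so there is little surprise to reward, and her long match history shrinks K. The newcomer gets a large K applied to a near coin-flip result.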

Here is Gauff’s Elo progression, from the moment she arrived at Wimbledon to middle Sunday. After each match, I show her overall Elo (the numbers I’ve been discussing so far), her grass-specific Elo, and her grass-weighted Elo, a 50/50 blend of overall and grass-specific that is used for forecasting. For each of the three ratings, I also show her ranking among WTA players with at least 20 matches in the last 52 weeks.

Match          Overall   Rk  Grass   Rk  Weighted   Rk  
Pre-Wimbledon     1488  194   1350  163      1419  187  
d. Bolsova        1540  171   1405  132      1473  155  
d. Ivakhnenko     1566  157   1447  107      1507  131  
d. Minnen         1614  132   1514   57      1564   95  
d. Venus          1670  108   1578   40      1624   73  
d. Rybarikova     1713   83   1650   21      1682   41  
d. Hercog         1755   67   1686   17      1721   31

Over the course of only six matches, Gauff has jumped from 194th in the overall Elo rankings to 67th. For forecasting purposes, her grass-weighted rating has soared from 187th to 31st. Her current weighted rating of 1,721 is better than that of three other women in the round of 16: Karolina Muchova, Carla Suarez Navarro, and Shuai Zhang. She trails another surviving player, Elise Mertens, by only 20 points.

So you’re telling me there’s a chance

Unfortunately, none of those relatively weak grass-court players are Gauff’s next opponent. Instead, the 15-year-old will face Halep, the third-best remaining player (behind Barty and Karolina Pliskova), and a three-time quarter-finalist at the All England Club. Halep’s weighted Elo rating is 229 points higher than Gauff’s, implying that the veteran has a 79% chance of winning on Monday. The betting market concurs, suggesting that the probability of a Halep victory is about 80%.
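
The 79% figure can be checked with a few lines of Python, assuming the conventional 400-point logistic Elo curve (the formula isn't spelled out in this post, but the quoted number is consistent with it). The function names are illustrative; the blend follows the 50/50 grass-weighted rating described above.

```python
def blended_rating(overall, surface, surface_weight=0.5):
    """Surface-weighted rating: a weighted mix of overall and surface Elo."""
    return surface_weight * surface + (1.0 - surface_weight) * overall


def win_probability(rating_diff):
    """Chance that the higher-rated player wins, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))


# Gauff after beating Hercog: overall 1755, grass 1686.
gauff = blended_rating(overall=1755, surface=1686)  # 1720.5, shown as 1721

# Halep's weighted rating is 229 points higher.
halep_edge = win_probability(229)
```

Here `halep_edge` comes out just under 0.79, in line with the forecast quoted above.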

It doesn’t usually have much of an effect on forecasts to update Elo ratings throughout a tournament. While anyone reaching the 4th round has a higher rating than they did before the event, the differences are typically small. And since forecasts are based on the difference between the ratings of two players, the forecast isn’t affected if both players’ ratings have increased by roughly the same amount.

As a teenager with such limited match experience, Gauff breaks the mold. Her pre-Wimbledon 1,488 Elo rating is only two weeks old, and it is already completely unrepresentative of what we know of her skill level. She’ll have ample time to prove us right or wrong in the upcoming years, but for now, we have good reason to estimate that she belongs–even more than some of the older players who have reached the second week at Wimbledon.

Forecasting Andy Murray, Doubles Specialist

We are three weeks into the mostly-triumphant doubles comeback of Andy Murray. In his first week back, he raced to the Queen’s Club title with Feliciano Lopez. A week later, he paired with Marcelo Melo and lost in the first round. At Wimbledon, he is partnering Pierre-Hugues Herbert, with whom he has already defeated the only-at-a-slam duo of Marius Copil and Ugo Humbert.

Today in the second round, Herbert/Murray face a sterner test: sixth-seeded team Nikola Mektic and Franco Skugor. The betting markets heavily favored Herbert/Murray going into the contest, but we have to assume that punters (including an unusually high number of casual ones) are probably overrating the familiar name on his home turf.

D-Lo to the rescue

Let’s see what D-Lo (Elo for doubles!) says about today’s match. D-Lo treats each team as a 50/50 mix of the two players, and adjusts each player’s rating after every match, depending on the quality of the opponent. It also very slightly regresses both partners to the team average after each match, because it’s impossible to know how much each player contributed to the result.

Herbert is D-Lo’s top doubles player in the world on hard and clay courts, though he falls to 6th in the 50/50 blend of overall and grass-specific ratings used for forecasting. Murray, thanks to his run at Queen’s, is up to 54th in the blend, though that’s really more like 40th among players in the draw, since several injured and recently-retired players are clinging to high D-Lo ratings.

Mektic and Skugor are credible specialists, as indicated by their ATP ranking. They are 24th and 26th in the D-Lo, respectively. Combined, the two teams’ ratings are quite close: 1773 for Herbert/Murray to 1763 for Mektic/Skugor. In a best-of-three match, a difference of 10 points translates to a 51.4% edge for the favorites. In best-of-five, the better team is always more likely to come out on top, though with such a small margin it barely matters. Here, the best-of-five number is 51.6%.

Versus the pack

How does a team rating of 1773 compare to the rest of the remaining field? Entering Saturday’s play, 22 men’s doubles pairs were still in the draw. As I write this, Lopez and Pablo Carreno Busta are the only additional team to have been eliminated, reducing the field to 21.

Here are the combined D-Lo ratings of these teams. The rank shown for each player is based on the 50/50 blend of overall and grass rating used for forecasting.

Team D-Lo  Rank  Player             Rank  Player             
1873       2     Mike Bryan         3     Bob Bryan          
1858       4     Lukasz Kubot       7     Marcelo Melo       
1836       9     Raven Klaasen      10    Michael Venus      
1817       8     John Peers         17    Henri Kontinen     
1802       12    Nicolas Mahut      22    E Roger-Vasselin   
1788       18    J S Cabal          19    Robert Farah       
1773       6     P H Herbert        54    Andy Murray        
1764       15    Oliver Marach      36    Jurgen Melzer      
1763       24    Nikola Mektic      26    Franco Skugor      
1757       20    Rajeev Ram         33    Joe Salisbury      
1747       23    Horia Tecau        41    Jean Julien Rojer  
1709       42    Maximo Gonzalez    46    Horacio Zeballos   
1695       29    Ivan Dodig         88    Filip Polasek      
1681       58    Marcus Daniell     62    Wesley Koolhof     
1677       50    Frederik Nielsen   77    Robin Haase        
1644       81    Marcelo Demoliner  90    Divij Sharan       
1637       84    A Ul Haq Qureshi   99    Santiago Gonzalez  
1596       106   Philipp Oswald     123   Roman Jebavy       
1575       101   Mischa Zverev      184   Nicholas Monroe    
1533             Jaume Munar        216   Cameron Norrie     
1517       177   Marcelo Arevalo    214   M Reyes Varela

Herbert/Murray rank 7th among the surviving pairs. The combined rating of 1773 makes them competitive against anyone. The 100-point difference separating them and the Bryans gives them a 33% chance of pulling off a best-of-five upset, while the 29-point gap between them and Nicolas Mahut/Edouard Roger-Vasselin translates to a 45/55 proposition.

Fortunately for the French-British pair, they won’t have to play a higher-rated team for some time. If they win today, they’ll face the winner of Dodig/Polasek vs Zverev/Monroe. The first of those teams is rated 80 points lower than Herbert/Murray (64% odds for the favorites), and the second is 200 points lower (81% for the faves). The three teams that could advance to become the quarter-final opponent for Herbert/Murray are all rated lower than Dodig/Polasek.
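
One way to turn a best-of-three probability into a best-of-five one is sketched below. It rests on two assumptions that are not stated in this post: the rating gap feeds the conventional 400-point Elo curve to give a best-of-three win probability, and sets are independent with a constant win probability. Under those assumptions, the code backs out the implied per-set probability and replays it over five sets.

```python
def elo_probability(rating_diff):
    """Best-of-three win probability for the higher-rated team (assumed)."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))


def best_of_three(set_p):
    """P(win 2 sets before losing 2), given a per-set win probability."""
    return set_p ** 2 * (3 - 2 * set_p)


def best_of_five(set_p):
    """P(win 3 sets before losing 3), given a per-set win probability."""
    return set_p ** 3 * (10 - 15 * set_p + 6 * set_p ** 2)


def set_probability(match_p, tol=1e-12):
    """Invert best_of_three by bisection (it is increasing on [0, 1])."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if best_of_three(mid) < match_p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def best_of_five_from_gap(rating_diff):
    """Best-of-five win probability implied by a rating gap."""
    return best_of_five(set_probability(elo_probability(rating_diff)))
```

With these assumptions, gaps of 80, 100, and 200 points give the favorites roughly 64%, 67%, and 81% in best-of-five, close to the figures quoted above. That agreement suggests a similar conversion is at work, though it doesn't confirm the exact method.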

The draw certainly favored Sir Andrew. Yes, the 1859-rated Pavic/Soares duo crashed out in their section, but even before that, three of the best teams–Bryan/Bryan, Kubot/Melo, and Mahut/Roger-Vasselin–were stuck together in another quarter. While no men’s doubles match is a sure thing, the path is clear for Herbert/Murray to reach the final four.

Beyond Wimbledon

Does Murray have what it takes to become a full-time doubles specialist? Taking his Queen’s Club title into account, his overall D-Lo is already up to 36th best on tour, just ahead of Skugor, and several places better than Roland Garros co-champ Kevin Krawietz. Jurgen Melzer, another excellent singles player making a go of it on the doubles circuit, is ranked 20 places lower, with a D-Lo 40 points less than Murray’s.

The short answer, then, is yes. It must be noted, though, that he isn’t the best choice among the big four to have a successful post-singles career as part of a team. That honor goes overwhelmingly to Rafael Nadal. Nadal’s career peak D-Lo is 100 points higher than Murray’s, and even his grass-court rating–based, admittedly, on some old results–is 70 points higher. Aside from the injured doubles wizard Jack Sock, Nadal is the best active player absent from the Wimbledon draw.

So, Murray/Nadal, Wimbledon 2021 champions? Sounds good to me–as long as Herbert relinquishes his new partner and finally commits to focusing on singles.

Forecasting Future Felix With ATP Aging Patterns

Italian translation at settesei.it

It’s been an exceptional six weeks for Felix Auger-Aliassime. He broke into the top 100 with a runner-up performance on clay in Rio de Janeiro, won two matches each at Sao Paulo and Indian Wells (including an upset of Stefanos Tsitsipas), and raced to a semi-final at the Miami Masters, the youngest player ever to make the final four of that event. Four months away from his 19th birthday, his ranking is up to 33rd in the world, and he has few points to defend until June.

Felix is the youngest man in the top 100, and he’s reaching milestones early enough to draw comparisons with some of the best young players in the sport’s history. Will he follow in the footsteps of past wunderkinds such as Rafael Nadal and Lleyton Hewitt? To answer that question, let’s take a look at typical ATP aging patterns, what they say about when players hit their peaks, and what they can show us about the fate of the best 18 year olds.

The standard curve

Last week, I looked at WTA aging curves and found that women tend to peak around age 23 or 24, an age that has not changed even as the sport has gotten older. I also discovered that there is a surprisingly modest gap–about 70 Elo points–between 18-year-old performance and a woman’s peak level. The men’s results are different.

To calculate the average ATP aging curve, I found over 700 players who were born between 1960 and 1989 and played at least 20 tour-level, tour qualifying, or challenger-level matches in each of five seasons. Overall, peak age was 25, though the difference from age 24 to 27 is only a few Elo points, so small as to be negligible.

As the tour has gotten older, the men’s peak age has also increased. Of the nearly 300 players born between 1980 and 1989, peak age is 26-27, with ages 28 and 29 also within 10 Elo points of the age 26-27 peak. Plenty of players are peaking at older ages, and many of those who aren’t are remaining close to their best levels into their late twenties. The peak age could be even higher still–a few of the players in the 1980-89 cohort turn 30 this year, and could conceivably still improve on their career bests.

The following graph shows the trajectory of the average player (with peak year-end Elo set to 1,850) born in the 1960s and the pattern of the average player born in the 1980s:

It’s a long ascent from the performance level at age 18 to the typical peak, especially for more recent players. There’s even a hefty bit of selection bias that should inflate the level of 18 year olds, since only about 10% of the players in the overall sample qualified for a year-end Elo rating when they were 18. The ones who did were, in general, the best of the bunch.
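For readers who want to build a similar curve, here is one plausible construction. It is not necessarily the method used for the graph above. It assumes the sample has already been filtered to qualifying players, measures each year-end rating as a gap to that player's own peak, averages the gaps by age, and then shifts the curve so its high point sits at 1,850. The two careers in `toy_careers` are made-up numbers for illustration only.

```python
from collections import defaultdict


def average_aging_curve(careers, peak_level=1850):
    """careers maps player -> {age: year_end_elo}; returns {age: curve value}."""
    gaps_by_age = defaultdict(list)
    for ratings in careers.values():
        peak = max(ratings.values())
        for age, elo in ratings.items():
            # Each rating is expressed relative to the player's own best.
            gaps_by_age[age].append(elo - peak)

    mean_gap = {age: sum(g) / len(g) for age, g in gaps_by_age.items()}
    # Shift so that the top of the average curve equals peak_level.
    shift = peak_level - max(mean_gap.values())
    return {age: mean_gap[age] + shift for age in sorted(mean_gap)}


# Hypothetical careers, not real players.
toy_careers = {
    "player_a": {18: 1600, 22: 1800, 25: 1900, 30: 1820},
    "player_b": {21: 1700, 25: 1780, 27: 1800, 31: 1750},
}

curve = average_aging_curve(toy_careers)
for age, value in curve.items():
    print(age, round(value))
```

Note how only one of the two toy players has a rating at age 18. That is the selection effect described above: the age-18 point on any such curve is set by the few players good enough to have a rating that early.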

Felix forward

Through the Miami semi-final, Auger-Aliassime’s Elo rating is 1,848. The average player in the entire dataset who played at least 20 matches in their age-18 season went on to add another 281 Elo points to their rating between the end of their age-18 season and their peak. In the narrower, more recent cohort of 1980-89 births, the players with year-end ratings as 18 year olds improved their Elos by a whopping 369 points before reaching their peaks.

Adding either of those numbers to Felix’s current rating gives us quite the rosy forecast:

Cohort   Current  Increase  Proj. Peak  
1960-89     1848       281        2129  
1980-89     1848       369        2217

There’s a bit of sleight of hand in how I’m doing this, since my study uses players’ year-end ratings, and I’m using Felix’s rating in April. However, there’s no natural law that says one artificial 12-month span is better than another, and Felix’s current age of 18.6 is roughly in the middle of the ages of the year-end 18-year-olds with whom I’m comparing him.

An Elo rating of 2,129 would be good enough for fourth place on the current list, behind only the big three. The rating of 2,217 is better than any of the big three can boast at the moment, and would be the fourth-best peak year-end rating among active players, again trailing only the big three. (And Andy Murray, if you consider him active.) Only 15 Open era players have managed year-end Elo peaks above 2,217.

No comparisons

It’s tough to say whether this method, of finding the typical difference between 18-year-old and peak Elo ratings, is adequate to handle the extremes. Some players peak earlier than average, and it stands to reason that the best young talents are more likely to do so. Boris Becker posted a whopping 2,212 Elo rating at the end of his age-18 season, which didn’t leave much room for improvement. He gained another 90 points before the end of his age-19 season, which was his career best.

Becker’s career path is not particularly helpful to our effort to forecast Felix’s, in part because the German was so unique, and also because his experience reflects such a different era. But even among less unique players, there are few useful comparables. No one born since 1987 managed a better age-18 Elo rating than Felix’s 1,848, and only a handful of active or recently-retired players even reached 1,750 by that age.

Lacking the data for a more precise approach, let’s repeat what I did for Bianca Andreescu last week, and see how the nearest 18-year-old comparisons fared. Of the players whose age-18 year-end Elos were closest to Felix’s 1,848, here are the 10 above him and the 10 below him on the list:

Player               BirthYr  18yo Elo  Incr  Peak Elo  
Stefan Edberg           1966      1916   350      2266  
John McEnroe            1959      1912   496      2408  
Guillermo Coria         1982      1909   145      2055  
Pat Cash                1965      1907   151      2058  
G. Perez Roldan         1969      1884    41      1925  
Andy Murray             1987      1878   465      2343  
Roger Federer           1981      1871   487      2359  
Thomas Enqvist          1974      1865   216      2081  
Rafael Nadal            1986      1862   452      2314  
Jim Courier             1970      1849   283      2132  
…                                                       
Jimmy Brown             1965      1834     0      1834  
Andy Roddick            1982      1815   291      2106  
Aaron Krickstein        1967      1812   246      2058  
Yannick Noah            1960      1812   299      2112  
Fabrice Santoro         1972      1805    85      1890  
Andreas Vinciguerra     1981      1803    16      1819  
Novak Djokovic          1987      1792   645      2436  
Sergi Bruguera          1971      1790   265      2055  
Thomas Muster           1967      1788   329      2117  
Dominik Hrbaty          1978      1779   133      1913

The average increase among this group is 270 Elo points, close to the overall average for players who qualified for a year-end Elo rating at age 18. The youngest members of this list are encouraging: the big four, Andy Roddick, and Andreas Vinciguerra. Most promising youngsters would happily take a two-in-three shot at having a career at the level of the big four.
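The 270-point figure is easy to check from the "Incr" column of the table. The sketch below copies those twenty values, then adds the mean to Felix's current 1,848 in the same way as the cohort projections earlier. The median is an extra summary of my own, included only because the spread is so wide.

```python
from statistics import mean, median

# "Incr" column from the table above: Elo gained between age 18 and peak.
increases = {
    "Stefan Edberg": 350, "John McEnroe": 496, "Guillermo Coria": 145,
    "Pat Cash": 151, "G. Perez Roldan": 41, "Andy Murray": 465,
    "Roger Federer": 487, "Thomas Enqvist": 216, "Rafael Nadal": 452,
    "Jim Courier": 283, "Jimmy Brown": 0, "Andy Roddick": 291,
    "Aaron Krickstein": 246, "Yannick Noah": 299, "Fabrice Santoro": 85,
    "Andreas Vinciguerra": 16, "Novak Djokovic": 645, "Sergi Bruguera": 265,
    "Thomas Muster": 329, "Dominik Hrbaty": 133,
}

felix_current = 1848

avg_increase = mean(increases.values())
mid_increase = median(increases.values())
projected_peak = felix_current + avg_increase

print(f"mean increase:   {avg_increase:.0f}")
print(f"median increase: {mid_increase:.0f}")
print(f"smallest / largest: {min(increases.values())} / {max(increases.values())}")
print(f"projected peak from the mean: {projected_peak:.0f}")
```

The gap between the smallest and largest increase is a reminder that the average hides a very wide range of outcomes.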

Perhaps the best comparison for Felix is a player who didn’t quite make that list, Alexander Zverev. The 21-year-old German posted a year-end Elo of 1,768 as an 18 year old, and had already boosted that number by more than 300 points by the end of his 2018 campaign. Zverev is only an approximate comparison, he’s just a single data point, and we don’t know where he’ll end up, but his experience is a decade more recent than those of Novak Djokovic, Murray, and Nadal.

Forecasting the career performance of young tennis players is an inexact science, at best. Potential outcomes for Auger-Aliassime range from teenage flameout to double-digit major winner. Based on the limited information he’s given us so far, the latter seems within reach. What we know for sure is that he’s playing better tennis than any 18 year old we’ve seen in a decade. If that’s not reason for optimism, I don’t know what is.

Nick Kyrgios is More Predictable Than We Think

Italian translation at settesei.it

There is a persistent belief among tennis fans and commentators that some players are particularly inconsistent. For today’s purposes, I’m talking about match-to-match results, the players who have a knack for upsetting higher-ranked opponents but are also particularly susceptible to losses against weaker players. We have a range of words for this, like unpredictable, dangerous, tricky, and the preferred term for Nick Kyrgios: mercurial.

So far in 2019, Kyrgios has provided a perfect example of the inconsistent type. After early losses to Jeremy Chardy and Radu Albot, he bounced back to win last week’s ATP 500 in Acapulco, knocking out Rafael Nadal, Stan Wawrinka, John Isner, and Alexander Zverev. There’s no question that the Australian possesses more talent than his ranking would suggest. This is a guy who has yet to crack the top ten, but holds a .500 record in completed matches against the Big 3, a feat managed by no other active player (minimum 5 matches, excepting Nadal and Novak Djokovic themselves).

He sounds inconsistent. His results look unpredictable. But compared to the uncertainty that comes with every tennis match between highly-ranked professionals, how does he stack up? As my headline suggests, it’s not as clear-cut as it seems.

Measuring predictability

Consider the opposite type, a player who reliably beats lower-ranked opponents and usually loses against his betters. Roberto Bautista Agut has this type of reputation. As we’ll see, the numbers bear it out, notwithstanding his Doha upset of Djokovic a couple of months ago. If someone really is so predictable, that should show up in a comparison of his pre-match forecasts to his results. For a Bautista Agut type, the forecasts would be particularly accurate, while for a Kyrgios type, the forecasts would be much less reliable.

We already have a metric for this. Brier Score measures the accuracy of forecasts, considering not just how often predictions proved correct, but how close they came. For instance, after Kyrgios beat Zverev in Saturday’s Acapulco final, those prognosticators who gave the Aussie a 90% chance of winning were “more” correct than those who gave him a 60% shot. On the other hand, too much confidence runs the risk of a worse Brier Score–if you’re always giving tennis favorites a 90% chance of winning, you’ll often be wrong. Brier Score is the average of the squared differences between the pre-match forecast (e.g. 90%, or 0.9) and the result (1 if the pick was correct, 0 if not).

Brier Scores for ATP forecasting hover around the 0.2 mark. A lower Brier Score is better, representing less difference between prediction and results, so if you can come in much lower than 0.2, you should be making money betting on matches. If you’re much higher than 0.2, you might as well be flipping a coin. If we use random, 50/50 pre-match predictions, the resulting Brier Score is 0.25.
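The definition above fits in a few lines. This is a minimal sketch with illustrative names. Forecasts are probabilities between 0 and 1 for a given player, and outcomes are 1 if that player won and 0 if he lost.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 results."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need exactly one outcome per forecast")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# The Acapulco final from Kyrgios's side: he won, so the outcome is 1.
print(brier_score([0.9], [1]))   # the forecaster who gave him 90%
print(brier_score([0.6], [1]))   # the forecaster who gave him 60%

# Coin-flip forecasts score 0.25 whatever the results turn out to be.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))
```

Every 50/50 forecast misses by exactly 0.5 whichever way the match goes, and 0.5 squared is 0.25. That is why 0.25 marks the line between a forecast that carries information and one that does not.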

Brier-gios

If a player is truly unpredictable, the Brier Score for his matches should approach the 0.25 mark, and it should definitely exceed the tour-typical 0.2. To measure the reliability of pre-match forecasts for Kyrgios and other players, I used my surface-weighted Elo ratings for every completed tour-level main draw match since 2000 and generated percentage forecasts for each one. By this method, Zverev had a 67.4% probability of winning the Acapulco final.
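To see what a single upset does to the score, take the Acapulco final as a worked example. The sketch assumes the standard 400-point logistic Elo formula. The surface weighting in my ratings is not reproduced here, so the rating gap it prints is only the gap implied by a 67.4% forecast under that formula.

```python
import math


def elo_win_prob(rating_diff):
    """Standard logistic Elo expectation for the higher-rated side."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))


def implied_rating_gap(win_prob):
    """Inverse of elo_win_prob: the rating gap that yields win_prob."""
    return 400.0 * math.log10(win_prob / (1.0 - win_prob))


zverev_forecast = 0.674   # pre-match probability quoted in the text
zverev_result = 0         # he lost the final

gap = implied_rating_gap(zverev_forecast)
squared_error = (zverev_forecast - zverev_result) ** 2

print(f"implied rating gap: {gap:.0f} points")
print(f"squared error for this match: {squared_error:.3f}")
```

One missed forecast of this size contributes about 0.45 to the average, nearly double the coin-flip benchmark. A sample of ten matches needs only a few such results to drift well above 0.25.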

So far in 2019, Kyrgios does look truly unpredictable. The Brier Score of his ten match results is 0.318, meaning that we’d have done better by simply flipping a coin to forecast the result of each of his matches. Even if we retroactively increase his chances of winning each match to account for the fact that he’s playing better than his Elo rating predicted, the Brier Score is 0.277, still worse than coin flips.

On the other hand, it’s just ten matches. Several other players have 2019 Brier Scores well over the 0.25 threshold, including Frances Tiafoe, Joao Sousa, Juan Ignacio Londero, and Felix Auger-Aliassime. In a handful of tournaments, you’ll always get a few oddball results, either because of marked improvements (as is likely with Auger-Aliassime) or extreme good or bad luck. Unless we’re willing to say that Sousa and Londero are remarkably unpredictable players, we shouldn’t draw the same conclusion based on Kyrgios’s last ten matches.

What you predict is what you get

The Brier Score for Elo-based forecasts of Kyrgios’s career matches at tour level is 0.219. That’s higher–and thus less predictable–than average, but not by that much. Of the 280 players with at least 100 tour-level matches this century, Kyrgios ranks 84th counting from the most unpredictable end, which makes him more reliable than 30% of his peers. In 2017, his results were quite unpredictable, with a Brier Score of 0.244, but in 2015 and 2016 they generated a more pedestrian 0.210, and last year they looked downright predictable, at 0.177.

The Australian may be quite unpredictable in tactics, point-to-point performance, or on-court behavior, but his results just aren’t that unusual. The following table shows the 15 most unpredictable active players, as measured by Brier Score, along with Kyrgios, followed by the 15 most predictable active players:

Player                 Matches  Brier  
Lucas Pouille              189  0.247  
Andrey Rublev              106  0.245  
Benoit Paire               377  0.239  
Ivo Karlovic               650  0.239  
Stefanos Tsitsipas         100  0.232  
Karen Khachanov            154  0.231  
Peter Gojowczyk            102  0.231  
Federico Delbonis          225  0.227  
Marius Copil               108  0.227  
Damir Dzumhur              173  0.227  
Ernests Gulbis             420  0.226  
Pablo Cuevas               338  0.226  
Mischa Zverev              297  0.226  
Joao Sousa                 323  0.226  
Borna Coric                210  0.226  
...                                       
Nick Kyrgios               191  0.219  
...                                       
Matthew Ebden              171  0.188  
David Goffin               344  0.188  
Marin Cilic                684  0.186  
Richard Gasquet            770  0.183  
Tomas Berdych              911  0.182  
Milos Raonic               448  0.178  
David Ferrer              1048  0.177  
Jo Wilfried Tsonga         600  0.175  
Roberto Bautista Agut      384  0.172  
Kei Nishikori              517  0.167  
Juan Martin Del Potro      560  0.160  
Andy Murray                802  0.146  
Roger Federer             1350  0.121  
Novak Djokovic             951  0.117  
Rafael Nadal              1060  0.114 

Lucas Pouille’s results have been almost impossible to forecast. The Brier Score generated by his 2018 results was nearly 0.3, suggesting it would have been smarter to calculate a forecast and then bet against it! Ivo Karlovic also shows up among the less reliable players, though it’s not clear whether that’s due to his unusual game style. Isner, the only decent parallel we have, is as reliable as the tour in general, with a career Brier Score of 0.201. Reilly Opelka, the other towering ace machine in the ATP top 100, has defied the odds so far in 2019, but he hasn’t yet amassed enough data to draw any conclusions.

At the other end of the spectrum, the most reliable players are many of the best. That adds up: A dominant player not only wins most of the matches he should, but his performance also allows us to make more aggressive forecasts. Nadal often enters matches with a 90% or better probability of winning, and confident predictions like that–as long as the player converts them into wins–are what generate the lowest Brier Scores.
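A little arithmetic shows why. Suppose a forecast of p is perfectly calibrated, so the favorite really does win a fraction p of the time. That is an idealization, not something the data guarantees. The squared error is then (1 - p)^2 with probability p and p^2 with probability 1 - p, which simplifies to p(1 - p). The sketch below evaluates that for a few forecast levels.

```python
def expected_brier(p):
    """Expected squared error of a calibrated forecast p for the favorite."""
    win_term = p * (1 - p) ** 2      # favorite wins: error is (1 - p)
    loss_term = (1 - p) * p ** 2     # favorite loses: error is p
    return win_term + loss_term


for p in (0.5, 0.674, 0.8, 0.9):
    print(f"forecast {p:.3f}: expected Brier {expected_brier(p):.3f}")
```

A 90% favorite who delivers nine times in ten generates an expected score of 0.09, while a true toss-up can never do better than 0.25. So a player's career Brier Score reflects how lopsided his typical matchups are as well as how reliably he performs.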

Consistent consistency results

We all tend to read too much into unusual results. Kyrgios has given us plenty of those, and we’ve repaid the favor by making him out to be even more of a wild card than he is. A couple of weeks ago, I took on a similar question and found that ATPers don’t really “play their way in” to tournaments, earning better or worse results in different rounds. This isn’t quite the same issue, but it all comes back to similar truths: Existing forecasts are pretty good, there’s always going to be a lot of randomness in the results, and the stories we invent to account for the randomness don’t really explain much at all.

Kyrgios is an immensely interesting player–I joked in yesterday’s podcast that readers should prepare themselves for a ten-part series–and digging into his point-by-point stats could reveal characteristics that are unique among tour players. That is still true. But at the match level, the likelihood that his contests will end in upsets isn’t unique at all–even if he is the proud new owner of a sombrero that says otherwise.