Handling Injuries and Absences With Tennis Elo

Italian translation at settesei.it

For the last year or so, every mention of my ATP and WTA Elo ratings has required some sort of caveat. Ratings don’t change while players are absent from the tour, so Serena Williams, Novak Djokovic, Andy Murray, Maria Sharapova, and Victoria Azarenka were all stuck at the top of their tour’s Elo rankings. When their layoffs started, they were among the best, and even a smattering of poor results (or a near season’s worth, in the case of Sharapova) isn’t enough to knock them too far down the list.

This is contrary to common sense, and it’s very different from how the official ATP and WTA rankings treat these players. Common sense says that returning players probably aren’t as good as they were before a long break. The official rankings are harsher, removing players entirely after a full year away from the tour. Serena probably isn’t the best player on tour right now (as Elo insisted during her time off), but she’s also much more of a threat than her WTA ranking of No. 454 implies. We must be able to do better.

Before we fix the Elo algorithm, let’s take a moment to consider what “better” means. Fans tend to get worked up about rankings and seedings, as if a number confers value on the player. The official rankings are, by design, backward-looking: They measure players based on their performance over the last 52 weeks, weighted by how the tour prioritizes events. (They are used in a forward-looking way, for tournament seedings, but the system is not designed to be predictive of future results.) In this way, the official rankings say, “this is how well she has played for the last year.” Whatever her ability or potential, Serena (along with Vika, Murray, and Djokovic) hasn’t posted many positive results this year, and her ranking reflects that.

Elo, on the other hand, is designed to be predictive. Out of necessity, it can only use past results, but it uses those results to estimate as accurately as possible how well a player is competing right now–our best proxy for how someone will play tomorrow, or next week. Elo ratings–even the naive ones that said Serena and Novak are your current No. 1s–are considerably better at predicting match outcomes than are the official rankings. For my purposes, that’s the definition of “better”–ratings that offer more accurate forecasts and, by extension, the best approximation of each player’s level right now.

The time-off penalty

When players leave the tour for very long, they return–at least on average, and at least temporarily–at a lower level. I identified every layoff of eight weeks or longer in ATP history, taken by a player with an Elo rating of 1900 or above*. In their first matches back on tour, their pre-break Elo overestimated their chances of winning by about 25%. It varies a bit by the amount of time off: eight- to ten-week breaks resulted in an overestimation around 17%, while 30- to 52-week breaks meant Elo overestimated a player’s chances by nearly 50% upon return. There are exceptions to every rule, like Roger Federer at the 2017 Australian Open, and Rafael Nadal, who won 14 matches in a row after his two-month break this season, but in general, players are worse when they come back.

* I used the cutoff of 1900 because, below that level, some players are alternating between the ATP and Challenger tours. My Elo algorithm doesn’t include Challenger results, so for lower-rated players, it’s not clear which timespans are breaks, and which are series of Challenger events. Also, the eight-week threshold doesn’t count the offseason, so an eight-week layoff might really mean ~16 weeks between events, with the break including the offseason.

Translated into Elo terms, an eight-week break results in a drop of 100 Elo points, and a not-quite-one-year break, like Andy Murray’s current injury layoff, means a drop of 150 points. Making that adjustment results in an immediate improvement in Elo’s predictiveness for the first match after a layoff, and a small improvement in predictiveness for the first 20 matches after a break.
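
As a rough sketch of that adjustment, the function below maps layoff length to a rating penalty. Only the two anchor points (eight weeks for 100 points, roughly a year for 150 points) come from the text above. The linear interpolation between them, the cap, the use of 52 weeks as a stand-in for "not quite one year," and the function name are all assumptions made for illustration.

```python
def layoff_penalty(weeks_off):
    """Hypothetical Elo penalty (in points) for a layoff of `weeks_off` weeks."""
    MIN_WEEKS, MAX_WEEKS = 8, 52            # 52 is an assumed stand-in for "not quite one year"
    MIN_PENALTY, MAX_PENALTY = 100.0, 150.0

    if weeks_off < MIN_WEEKS:
        return 0.0                          # shorter gaps are not treated as layoffs
    if weeks_off >= MAX_WEEKS:
        return MAX_PENALTY                  # assumption: the penalty is capped here
    # Assumption: straight-line interpolation between the two anchor points.
    fraction = (weeks_off - MIN_WEEKS) / (MAX_WEEKS - MIN_WEEKS)
    return MIN_PENALTY + fraction * (MAX_PENALTY - MIN_PENALTY)
```

Under these assumptions a 30-week break costs 125 points, halfway between the two anchors.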

Incorporating uncertainty

Elo is designed to always provide a “best estimate”–when a player is new on tour, we give him a provisional rating of 1500, and then adjust the rating after each match, depending on the result, the quality of the opponent, and how many matches our player has contested. That provisional 1500 is a completely ignorant guess, so the first adjustment is a big one. Over time, the size of a player’s Elo adjustments goes down, because we learn more about him. If a player loses his first-ever match to Joao Sousa, the only information we have is that he’s probably not as good as Sousa, so we subtract a lot of points. If Alexander Zverev loses to Sousa after more than 150 career matches, including dozens of wins over superior players, we’ll still dock Zverev a few points, but not as many, because we know so much more about him.
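
To make the shrinking adjustments concrete, here is a minimal sketch of a single Elo update. The win-expectancy formula is the standard Elo one. The k-factor schedule is an assumed, illustrative choice (the exact schedule behind these ratings isn't given here), and all function names are hypothetical.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo win expectancy for `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def k_factor(matches_played):
    """Assumed schedule: large for newcomers, shrinking with experience."""
    return 250.0 / (matches_played + 5) ** 0.4


def update(rating, opponent_rating, won, matches_played):
    """Return the player's new rating after one match."""
    surprise = (1.0 if won else 0.0) - expected_score(rating, opponent_rating)
    return rating + k_factor(matches_played) * surprise


# A debutant and a 150-match veteran, both rated 1500 for the sake of
# comparison, each lose to an 1800-rated opponent.
rookie_after = update(1500, 1800, won=False, matches_played=0)
veteran_after = update(1500, 1800, won=False, matches_played=150)
```

With this assumed schedule the debutant drops roughly 20 points and the veteran about 5, even though the result and the opponent are identical.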

But after a layoff, we are a bit less certain that what we knew about a player is still relevant. Djokovic is a great example right now. If he lost six out of nine matches (as he did between the Australian Open fourth round and Madrid) without missing any time beforehand, we’d know it was a slump, but most of us would expect him to snap out of it. Elo would reduce his rating, but he’d remain near the top. Since he missed the second half of last season, however, we’re more skeptical–perhaps he’ll never return to his former level. Other cases are even more clear-cut, as when a player returns from injury without being fully healed.

Thus, after a layoff, it makes sense to alter how much we adjust a player’s Elo ratings. This isn’t a new idea–it’s the core concept behind Glicko, another chess rating system that expands on Elo. Over the years, I’ve tinkered with Glicko quite a bit, looking for improvements that apply to tennis, without much success. Changing the multiplier that determines rating adjustments (known as the k factor) doesn’t improve the predictiveness of tennis Elo on its own, but combined with the post-layoff penalties I described above, it helps a bit.

The nitty-gritty: After a layoff, I increase the multiplier by a factor of 1.5, and then gradually reduce it back to 1x over the next 20 matches. The flexible multiplier slightly improves the accuracy of Elo ratings for those 20 matches, though the difference is minor compared to the effect of the initial penalty.
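
One way to picture the flexible multiplier is the sketch below. The 1.5x starting point and the 20-match window come from the description above. The linear fade is an assumption, since the text says only that the reduction is gradual, and the function name is hypothetical.

```python
def layoff_multiplier(matches_since_return, boost=1.5, decay_matches=20):
    """Multiplier applied to the usual k factor after a layoff.

    Assumption: the boost fades linearly and reaches 1.0 at `decay_matches`.
    """
    if matches_since_return >= decay_matches:
        return 1.0
    remaining = 1.0 - matches_since_return / decay_matches
    return 1.0 + (boost - 1.0) * remaining
```

The first match back is weighted at 1.5x, the eleventh (ten matches played since the return) at 1.25x, and everything from the twenty-first match onward at the normal 1x.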

No more caveats*

* I thought it would be funny to put an asterisk after “no more caveats.”

Post-layoff penalties and flexible multipliers end up bringing down the current Elo ratings of the players who are in the middle of long breaks or have recently come back from them, giving us ranking tables that come closer to what we expect–and should do a better job of predicting the outcome of upcoming matches. These changes to the algorithm also have minor effects on the ratings of other players, because everyone’s rating depends on the rating of all of his or her opponents. So Taro Daniel’s Elo bounce from defeating Djokovic in Indian Wells doesn’t look quite as good as it did before I implemented the penalty.

On the ATP side, the new algorithm knocks Djokovic down to 3rd in overall Elo, Murray to 6th, Jo-Wilfried Tsonga to 21st, and Stan Wawrinka to 24th. That’s still quite high for Novak considering what we’ve seen this year, but remember that the Elo algorithm only knows about his on-court performances: A six-month break followed by a half-dozen disappointing losses. The overall effect is about a 200-point drop from his pre-layoff level; the “problem” is that his Elo a year ago reflected how jaw-droppingly good he had recently been.

The WTA results match my intuition even better than I hoped. Serena falls to 7th, Sharapova to 18th, and Azarenka to 23rd. Because of the flexible multiplier, a few early wins for Williams will send her quickly back up the rankings. Like Djokovic, she rates so high in part because of her stratospheric Elo rating before her time off. For her part, Sharapova still rates higher by Elo than she does in the official rankings. Despite the penalty for her one-year drug suspension, the algorithm still treats her prior success as relevant, even if that relevance fades a bit more every week.

Elo is always an approximation, and given the wide range of causes that will sideline a player, not to mention the spectrum of strategies for returning to the tour, any rating/forecasting system is going to have a harder time with players in that situation. That said, these improvements give us Elo ratings that do a better job of representing the current level of players who have missed time, and they will allow us to make superior predictions about matches and tournaments involving those players.

Under the hood

If you’re interested in some technical details, keep reading.

Before making these adjustments, the Brier score for Elo-based predictions of all ATP matches since 1972 was about 0.20. For all matches that involved at least one player with an Elo of 1900 or better, it was 0.17. (Not only are 1900+ players better, their ratings tend to be based on more data, which at least partly explains why the predictions are better. The lower the Brier score, the better.)

For the population of about 500 “first matches” after layoffs for qualifying players, the Brier score before these changes was 0.192. After implementing the penalty, it improved to 0.173.

For the 2nd through 20th post-comeback matches, the Brier score for the original algorithm was 0.195. After adding the penalty, it was 0.191, and after making the multiplier flexible, it fell a bit more to 0.190. (Additional increases to the post-layoff multiplier had negative results, pushing the Brier score back to about 0.195 when the 2nd-match multiplier was 2x.) I realize that’s a tiny change, and it very possibly won’t hold up in the future. But in looking at various notable players over the course of their comebacks, that’s the option that generated results that looked the most intuitively accurate. Since my intuition matched the best Brier score (however minuscule the difference), it seems like the best option.

Finally, a note on players with multiple layoffs. If someone misses six months, plays a few matches, then misses another two months, it doesn’t seem right to apply the penalty twice. There aren’t a lot of instances to use for testing, but the limited sample confirms this. My solution: If the second layoff is within two years of the previous comeback, combine the length of the two layoffs (here: eight months), find the penalty for a break of that length, and then apply the difference between that penalty and the previous one. Usually, that results in second-layoff penalties of between 10 and 50 points.
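
The combined-layoff rule can be sketched as follows. The logic (penalty for the combined break minus the penalty already applied, within a two-year window) follows the paragraph above. The penalty curve passed in is a hypothetical placeholder, as are the function names and the choice to measure everything in weeks.

```python
def second_layoff_penalty(previous_weeks, new_weeks, weeks_since_comeback, penalty_for):
    """Extra penalty for a second layoff, given the length of the first.

    `penalty_for` is any function mapping layoff length in weeks to Elo points.
    """
    if weeks_since_comeback > 104:          # more than two years: treat as a fresh layoff
        return penalty_for(new_weeks)
    combined = penalty_for(previous_weeks + new_weeks)
    already_applied = penalty_for(previous_weeks)
    return combined - already_applied


def example_penalty(weeks_off):
    """Hypothetical curve: 100 points at 8 weeks, rising linearly to 150 at 52."""
    if weeks_off < 8:
        return 0.0
    return min(150.0, 100.0 + (weeks_off - 8) * 50.0 / 44.0)


# Six months off (26 weeks), a few matches, then two more months (9 weeks).
extra = second_layoff_penalty(26, 9, weeks_since_comeback=5, penalty_for=example_penalty)
```

With this placeholder curve the second penalty comes to about 10 points, far less than the 100 or so that a standalone two-month break would cost.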

Forecasting the Laver Cup

Italian translation at settesei.it

This weekend brings us the first edition of the Laver Cup, a star-studded three-day affair that pits Europe against the rest of the world. The European team features Roger Federer and Rafael Nadal, and even though several other elites from the continent are missing due to injury, the European team is still much stronger on paper.

Here are the current rosters, along with each competitor’s weighted hard court Elo rating and rank among active players:

EUROPE                  Elo Rating  Elo Rank  
Roger Federer                 2350         2  
Rafael Nadal                  2225         4  
Alexander Zverev              2127         7  
Tomas Berdych                 2038        14  
Marin Cilic                   2029        15  
Dominic Thiem                 1995        17  
                                              
WORLD                   Elo Rating  Elo Rank  
Nick Kyrgios                  2122         8  
John Isner                    1968        22  
Jack Sock                     1951        23  
Sam Querrey                   1939        25  
Denis Shapovalov              1875        36  
Frances Tiafoe                1574       153  
Juan Martin del Potro*        2154         5

*del Potro has withdrawn. I’ve included his singles Elo rating and rank to emphasize how damaging his absence is to the World squad.

“Weighted” surface Elo is the average of overall (all-surface) Elo and surface-specific Elo. The 50/50 split is a much better predictor of match outcomes than either number on its own.
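
In code, the blend and the resulting head-to-head forecast look something like the sketch below. The 50/50 average is as described above and the win-probability formula is the standard Elo one. The component ratings in the first example are made up, and the function names are hypothetical.

```python
def weighted_surface_elo(overall_elo, surface_elo, surface_weight=0.5):
    """Blend all-surface and surface-specific ratings (50/50 by default)."""
    return surface_weight * surface_elo + (1.0 - surface_weight) * overall_elo


def win_probability(rating_a, rating_b):
    """Standard Elo probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Made-up components: an overall Elo of 2300 and a hard court Elo of 2400.
blended = weighted_surface_elo(2300, 2400)

# Weighted hard court ratings from the table: Federer (2350) vs Kyrgios (2122).
p_federer = win_probability(2350, 2122)
```

Even the best-rated World player is a clear underdog against Federer by this measure, with a raw win probability a little over 20%.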

Nick Kyrgios can hang with anybody on a hard court. But despite some surface-specific skills represented by the American contingent, every other member of the World team rates lower than every member of team Europe. This isn’t a good start for the rest of the world.

What about doubles? Here are the D-Lo (Elo for doubles) ratings and rankings for all twelve participants, plus Delpo:

EUROPE                  D-Lo rating  D-Lo rank  
Rafael Nadal                   1895          4  
Tomas Berdych                  1760         28  
Marin Cilic                    1676         76  
Roger Federer**                1650         90  
Alexander Zverev               1642         99  
Dominic Thiem                  1521        185  
                                                
WORLD                   D-Lo rating  D-Lo rank  
Jack Sock                      1866          8  
John Isner                     1755         29  
Nick Kyrgios                   1723         45  
Sam Querrey                    1715         49  
Denis Shapovalov**             1600        130  
Frances Tiafoe                 1546        166  
Juan Martin del Potro*         1711         55

** Federer hasn’t played tour-level doubles since 2015, and Shapovalov hasn’t done so at all. These numbers are my best guesses, nothing more.

Here, the World team has something of an edge. While both sides feature an elite doubles player–Rafa and Jack Sock–the non-European side is a bit deeper, especially if they keep Denis Shapovalov and last-minute Delpo replacement Frances Tiafoe on the sidelines. Only one-quarter of Laver Cup matches are doubles (plus a tie-breaking 13th match, if necessary), so it still looks like team Europe is the heavy favorite.

The format

The Laver Cup will take place in Prague over three days (starting Friday, September 22nd), and consist of four matches each day: three singles and one doubles. Every match is best-of-three sets with ad scoring and a 10-point super-tiebreak in place of the third set.

On the first day, the winner of each match gets one point; on the second day, two points, and on the third day, three points. That’s a total of 24 points up for grabs, and if the twelve matches end in a 12-12 deadlock, the Cup will be decided with a single doubles set.

All twelve participants must play at least one singles match, and no one can play more than two. At least four members of each squad must play doubles, and no doubles pairing can be repeated, except in the case of a tie-breaking doubles set.

Got it? Good.

Optimal strategy

The rules require that three players on each side will contest only one singles match while the other three will enter two each. A smart captain would, health permitting, use his three best players twice. Since matches on days two and three count for more than matches on day one, it also makes sense that captains would use their best players on the final two days.

(There are some game-theoretic considerations I won’t delve into here. Team World could use better players on day one in hopes of racking up easy points against the lesser members of team Europe, or could drop hints that they will do so, hoping that the European squad would move its better players to day one. As far as I can tell, neither team can change their lineup in response to the other side’s selections, so the opportunities for this sort of strategizing are limited.)

In doubles, the ideal roster deployment strategy would be to use the team’s best player in all three matches. He would be paired with the next-best player on day three, the third-best on day two, and the fourth-best on day one. Again, this is health permitting, and since all of these guys are playing singles, fatigue is a factor as well. My algorithm thus far would use Nadal five times–twice in singles and three times in doubles–and I strongly suspect that isn’t going to happen.
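
The deployment rule described in this section can be sketched in a few lines. The ratings are the Europe figures from the tables above. The function name and the output layout are hypothetical, and the sketch deliberately ignores health and fatigue.

```python
def optimal_lineup(singles_elo, doubles_elo):
    """Return {day: {"singles": [...], "doubles": (a, b)}} for a six-man team."""
    by_singles = sorted(singles_elo, key=singles_elo.get, reverse=True)
    by_doubles = sorted(doubles_elo, key=doubles_elo.get, reverse=True)
    top_three, bottom_three = by_singles[:3], by_singles[3:6]
    anchor = by_doubles[0]                  # best doubles player plays every day
    return {
        1: {"singles": bottom_three, "doubles": (anchor, by_doubles[3])},
        2: {"singles": top_three, "doubles": (anchor, by_doubles[2])},
        3: {"singles": top_three, "doubles": (anchor, by_doubles[1])},
    }


europe_singles = {"Federer": 2350, "Nadal": 2225, "Zverev": 2127,
                  "Berdych": 2038, "Cilic": 2029, "Thiem": 1995}
europe_doubles = {"Nadal": 1895, "Berdych": 1760, "Cilic": 1676,
                  "Federer": 1650, "Zverev": 1642, "Thiem": 1521}
lineup = optimal_lineup(europe_singles, europe_doubles)
```

For Europe this puts Berdych, Cilic, and Thiem in the day-one singles, saves Federer, Nadal, and Zverev for days two and three, and pairs Nadal with Federer, Cilic, and Berdych in turn, which is where the five appearances for Nadal come from.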

The forecast

Let’s start by predicting the outcome of the Cup if both captains use their roster optimally, even if that’s a longshot. I set up the simulation so that each day’s singles competitors would come out in random order–if, say, Querrey, Shapovalov, and Tiafoe play for team World on day one, we don’t know which of them will play first, or which European opponent each will face. So each run of the simulation is a little different.

As usual, I used Elo (and D-Lo) to predict the outcome of specific matchups. Because of the third-set super-tiebreak, and because it’s an exhibition, I added a bit of extra randomness to every forecast, so if the algorithm says a player has a 60% chance of winning, we knock it down to around 57.5%. When I dug into IPTL results last winter, I discovered that exhibition results play surprisingly true to expectations, and I suspect players will take Laver Cup a bit more seriously than they do IPTL.
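
The extra randomness can be pictured as pulling each forecast part of the way toward a coin flip. The sketch below assumes a simple linear shrink. The 0.75 factor is inferred from the single 60%-to-57.5% example above, not a stated parameter, and the function name is hypothetical.

```python
def exhibition_adjust(p, shrink=0.75):
    """Pull a win probability toward 50%.

    Assumption: linear shrinkage, with shrink=0.75 chosen so that
    a 60% forecast becomes 57.5%.
    """
    return 0.5 + shrink * (p - 0.5)
```

With that factor, a 60% favorite becomes a 57.5% favorite and a 90% favorite drops to 80%, while an even match stays at 50%.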

Our forecast–again, assuming optimal player usage–says that Europe has an 84.3% chance of winning, and the median point score is 16-8. There’s an approximately 6.5% chance that we’ll see a 12-12 tie, and when we do, Europe has a slender 52.4% edge.
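
The mechanics of a simulation like this can be sketched as below. It is a simplified stand-in, not the code behind the numbers above: it takes twelve per-match probabilities as given rather than randomizing matchups, and the function name, parameters, and example probabilities are all hypothetical. The scoring (one, two, and three points per match by day, with a doubles set deciding a 12-12 tie) follows the format described earlier.

```python
import random


def simulate_cup(match_probs, tiebreak_prob, runs=10000, seed=0):
    """Estimate Europe's chance of winning the Cup.

    match_probs: 12 Europe win probabilities, four per day in day order.
    tiebreak_prob: Europe's chance of winning the deciding doubles set.
    Returns (europe_win_rate, tie_rate).
    """
    rng = random.Random(seed)
    europe_wins = ties = 0
    for _ in range(runs):
        europe_points = 0
        for i, p in enumerate(match_probs):
            points = i // 4 + 1             # day one = 1 point, day two = 2, day three = 3
            if rng.random() < p:
                europe_points += points
        if europe_points == 12:             # 12-12: decided by a single doubles set
            ties += 1
            if rng.random() < tiebreak_prob:
                europe_wins += 1
        elif europe_points > 12:
            europe_wins += 1
    return europe_wins / runs, ties / runs


# Placeholder probabilities: Europe favored in singles, roughly even in doubles.
example_probs = [0.65, 0.65, 0.65, 0.5] * 3
europe_rate, tie_rate = simulate_cup(example_probs, tiebreak_prob=0.5)
```

Because day-three matches are worth three points each, a side that loses all eight matches on the first two days can still force a tie by sweeping the final day.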

If Delpo were participating, he would increase the World team’s chances by quite a bit, reducing Europe’s likelihood of victory to 75.5% and narrowing the most probable point score to 15-9.

What if we relax the “optimal usage” restriction? I have no idea how to predict what captains John McEnroe and Bjorn Borg will do, but we can randomize which players suit up for which matches to get a sense of how much influence they have. If we randomize everything–literally, just pick a competitor out of a hat for each match–Europe comes out on top 79.7% of the time, usually winning 15-9. There’s a 7.6% chance of a tie-breaking 13th match, and because the World team’s doubles options are a bit deeper, they win a slim majority of those final sets. (When we randomize everything, there’s a slight risk that we violate the rules, perhaps using the same doubles pairing twice or leaving a player on the bench for all nine singles matches. Those chances are very low, however, so I didn’t tackle the extra work required to avoid them entirely.)

We can also tweak roster usage by team, in case it turns out that one captain is much savvier than the other. (Or if a star like Nadal is unable to play as much as his team would like.) The best-case scenario for our World team underdogs is that McEnroe chooses the best players for each match and Borg does not. Assuming that only European players are chosen from a hat, the probability that the favorites win falls all the way to 63.1%, and the typical gap between point totals narrows all the way to 13-11. The chance of a tie rises to 10%.

On the other hand, it’s possible that Borg is better at utilizing his squad. After all, it doesn’t take an 11-time grand slam winner to realize that Federer and Nadal ought to be on court when the stakes are the highest. This final forecast, with random roster usage from team World and ideal choices from Borg, gives Europe a whopping 92.3% chance of victory, and median point totals of 17 to 7. The World team would have only a 4% shot at reaching a deadlock, and even then, the Europeans win two-thirds of the tiebreakers.

There we have it. The numbers bear out our expectation that Europe is the heavy favorite, and they give us a sense of the likely margin of victory. Tiafoe and Shapovalov might someday be part of a winning Laver Cup side, but it looks like they’ll have to wait a few years before that happens.

Update: One more thing… What about doubles specialists? Both captains have two discretionary picks to use on players regardless of ranking. Most great doubles players are much worse at singles, but as we’ve seen, a player can be relegated to a lone one-point singles match on day one, and as a doubles player, he can have an effect on three different matches, totaling six points.

Sure enough, swapping out Dominic Thiem (a very weak doubles player for whom indoor hard courts are less than ideal) for Nicolas Mahut would have increased Europe’s chances of winning from 84.3% to 88.5%. On the slight chance that the Cup stayed tight through the final doubles match and into a tiebreaker, the doubles team of Mahut-Nadal (however unorthodox that sounds) would be among the best that any captain could put on the court.

There’s even more room for improvement on the World side, especially with del Potro out. At the moment, the third-highest rated hard court player by D-Lo is Marcelo Melo, who would be a major step down in singles but a huge improvement on most of the potential partners for Sock in doubles. If we give him a singles Elo of 1450 and put him on the roster in place of Tiafoe and pit the resulting squad against the original Europe team (with Thiem, not Mahut), it almost makes up for the loss of Delpo–World’s chances of winning increase from 15.7% to 19.3%.

Unfortunately, Borg and McEnroe may have missed their chance to eke out extra value from their six-man rosters–this is a trick that will only work once. If both teams made this trade, Mahut-for-Thiem and Melo-for-Tiafoe, each side’s win probability would go back to near where it started: 85.8% for Europe. That’s a boost over where we started (84.3%), just because Mahut is better suited for the competition than Melo is, as an elite doubles specialist who is also credible on the singles court. No one available to the World team (except for Sock, who is already on the roster) fits the same profile on a hard court. Vasek Pospisil comes to mind, though he has taken a step back from his peaks in both singles and doubles. And on clay, Pablo Cuevas would do nicely, but on a faster surface, he would represent only a marginal improvement over the doubles players already playing for team World.

Maybe next year.

 

Measuring the Impact of Wimbledon’s Seeding Formula

Italian translation at settesei.it

Unlike every other tournament on the tennis calendar, Wimbledon uses its own formula to determine seedings. The grass court Grand Slam grants seeds to the top 32 players in each tour’s rankings, and then re-orders them based on its own algorithm, which rewards players for their performance on grass over the last two seasons.

This year, the Wimbledon seeding formula has more impact on the men’s draw than usual. Seven-time champion Roger Federer is one of the best grass court players of all time, and though he dominated hard courts in the first half of 2017, he still sits outside the top four in the ATP rankings after missing the second half of 2016. Thanks to Wimbledon’s re-ordering of the seeds, Federer will switch places with ATP No. 3 Stan Wawrinka and take his place in the draw as the third seed.

Even with Wawrinka’s futility on grass and the shakiness of Andy Murray and Novak Djokovic, getting inside the top four has its benefits. If everyone lives up to their seed in the first four rounds (they won’t, but bear with me), the No. 5 seed will face a path to the title that requires beating three top-four players. Whichever top-four guy has No. 5 in his quarter would confront the same challenge, but the other three would have an easier time of it. Before players are placed in the draw, top-four seeds have a 75% chance of that easier path.
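
The 75% figure is just counting, as the short sketch below shows. It assumes the No. 5 seed is equally likely to be drawn into any of the four quarters. The quarter labels are arbitrary.

```python
from fractions import Fraction

quarters = ["Q1", "Q2", "Q3", "Q4"]         # one top-four seed anchors each quarter
my_quarter = "Q1"                           # the quarter of the top-four seed we care about

# The No. 5 seed lands in one quarter; the easier path is any quarter but ours.
easier = [q for q in quarters if q != my_quarter]
p_easier_path = Fraction(len(easier), len(quarters))
```

Three of the four equally likely placements keep No. 5 out of a given top seed's quarter, hence 3/4.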

Let’s attach some numbers to these speculations. I’m interested in the draw implications of three different seeding methods: ATP rankings (as every other tournament uses), the Wimbledon method, and weighted grass-court Elo. As I described last week, weighted surface-specific Elo–averaging surface-specific Elo with overall Elo–is more predictive than ATP rankings, pure surface Elo, or overall Elo. What’s more, weighted grass-court Elo–let’s call it gElo–is about as predictive as its peers for hard and clay courts, even though we have less grass-court data to go on. In a tennis world populated only by analysts, seedings would be determined by something a lot more like gElo and a lot less like the ATP computer.

Since gElo ratings provide the best forecasts, we’ll use them to determine the effects of the different seeding formulas. Here is the current gElo top sixteen, through Halle and Queen’s Club:

1   Novak Djokovic         2296.5  
2   Andy Murray            2247.6  
3   Roger Federer          2246.8  
4   Rafael Nadal           2101.4  
5   Juan Martin Del Potro  2037.5  
6   Kei Nishikori          2035.9  
7   Milos Raonic           2029.4  
8   Jo Wilfried Tsonga     2020.2  
9   Alexander Zverev       2010.2  
10  Marin Cilic            1997.7  
11  Nick Kyrgios           1967.7  
12  Tomas Berdych          1967.0  
13  Gilles Muller          1958.2  
14  Richard Gasquet        1953.4  
15  Stanislas Wawrinka     1952.8  
16  Feliciano Lopez        1945.3

We might quibble with some of these positions–the algorithm knows nothing about whatever is plaguing Djokovic, for one thing–but in general, gElo does a better job of reflecting surface-specific ability level than other systems.

The forecasts

Next, we build a hypothetical 128-player draw and run a whole bunch of simulations. I’ve used the top 128 in the ATP rankings, except for known withdrawals such as David Goffin and Pablo Carreno Busta, which doesn’t differ much from the list of guys who will ultimately make up the field. Then, for each seeding method, we randomly generate a hundred thousand draws, simulate those brackets, and tally up the winners.
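
A stripped-down version of that simulation loop might look like the sketch below. It is not the code used for the tables that follow: the seeding logic is left to a caller-supplied `make_draw` function, the example draw is completely unseeded, and all names are hypothetical. The match forecast is the standard Elo formula, and the field size is assumed to be a power of two.

```python
import random
from collections import Counter


def win_probability(rating_a, rating_b):
    """Standard Elo probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def simulate_bracket(draw, ratings, rng):
    """Play out one single-elimination draw; `draw` lists players in bracket order."""
    field = list(draw)
    while len(field) > 1:
        next_round = []
        for i in range(0, len(field), 2):
            a, b = field[i], field[i + 1]
            a_wins = rng.random() < win_probability(ratings[a], ratings[b])
            next_round.append(a if a_wins else b)
        field = next_round
    return field[0]


def title_odds(ratings, make_draw, runs=1000, seed=0):
    """Share of simulated tournaments won by each player."""
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(make_draw(rng), ratings, rng) for _ in range(runs))
    return {player: wins[player] / runs for player in ratings}


def random_draw(players):
    """Example draw generator with no seeding at all (an illustrative baseline)."""
    def make_draw(rng):
        order = list(players)
        rng.shuffle(order)
        return order
    return make_draw
```

Comparing seeding methods then amounts to swapping in a different `make_draw` for each one while keeping the same ratings for the match forecasts.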

Here are the ATP top ten, along with their chances of winning Wimbledon using the three different seeding methods:

Player              ATP     W%  Wimb     W%  gElo     W%  
Andy Murray           1  23.6%     1  24.3%     2  24.1%  
Rafael Nadal          2   6.1%     4   5.7%     4   5.5%  
Stanislas Wawrinka    3   0.8%     5   0.5%    15   0.4%  
Novak Djokovic        4  34.1%     2  35.4%     1  34.8%  
Roger Federer         5  21.1%     3  22.4%     3  22.4%  
Marin Cilic           6   1.3%     7   1.0%    10   1.0%  
Milos Raonic          7   2.0%     6   1.6%     7   1.7%  
Dominic Thiem         8   0.4%     8   0.3%    17   0.2%  
Kei Nishikori         9   1.9%     9   1.7%     6   1.9%  
Jo Wilfried Tsonga   10   1.6%    12   1.4%     8   1.5%

Again, gElo is probably too optimistic on Djokovic–at least the betting market thinks so–but the point here is the differences between systems. Federer gets a slight bump for entering the top four, and Wawrinka–who gElo really doesn’t like–loses a big chunk of his modest title hopes by falling out of the top four.

The seeding effect is a lot more dramatic if we look at semifinal odds instead of championship odds:

Player              ATP    SF%  Wimb    SF%  gElo    SF%  
Andy Murray           1  58.6%     1  64.1%     2  63.0%  
Rafael Nadal          2  34.4%     4  39.2%     4  38.1%  
Stanislas Wawrinka    3  13.2%     5   7.7%    15   6.1%  
Novak Djokovic        4  66.1%     2  71.1%     1  70.0%  
Roger Federer         5  49.6%     3  64.0%     3  63.2%  
Marin Cilic           6  13.6%     7  11.1%    10  10.3%  
Milos Raonic          7  17.3%     6  14.0%     7  15.2%  
Dominic Thiem         8   7.1%     8   5.4%    17   3.8%  
Kei Nishikori         9  15.5%     9  14.5%     6  15.7%  
Jo Wilfried Tsonga   10  14.0%    12  13.1%     8  14.0%

There’s a lot more movement here for the top players among the different seeding methods. Not only do Federer’s semifinal chances leap from 50% to 64% when he moves inside the top four, even Djokovic and Murray see a benefit because Federer is no longer a possible quarterfinal opponent. Once again, we see the biggest negative effect to Wawrinka: A top-four seed would’ve protected a player who just isn’t likely to get that far on grass.

Surprisingly, the traditional big four are almost the only players out of all 32 seeds to benefit from the Wimbledon algorithm. By removing the chance that Federer would be in, say, Murray’s quarter, the Wimbledon seedings make it a lot less likely that there will be a surprise semifinalist. Tomas Berdych’s semifinal chances improve modestly, from 8.0% to 8.4%, with his Wimbledon seed of No. 11 instead of his ATP ranking of No. 13, but the other 27 seeds have lower chances of reaching the semis than they would have if Wimbledon stopped meddling and used the official rankings.

That’s the unexpected side effect of getting rankings and seedings right: It reduces the chances of deep runs from unexpected sources. It’s similar to the impact of Grand Slams using 32 seeds instead of 16: By protecting the best (and next best, in the case of seeds 17 through 32) from each other, tournaments require that unseeded players work that much harder. Wimbledon’s algorithm took away some serious upset potential when it removed Wawrinka from the top four, but it made it more likely that we’ll see some blockbuster semifinals between the world’s best grass court players.

Unpredictable Bounces, Predictable Results

Italian translation at settesei.it

These days, the grass court season is the awkward stepchild of the tennis calendar. It takes place almost entirely within a single country’s borders, lasts barely a month, and often suffers from the absence of top players who prefer to rest after the French Open.

The small number of grass court events makes the surface problematic for analysts, as well. The surface behaves differently than hard or clay courts and rewards certain playing styles, so it’s reasonable to assume that many players will be particularly good or bad on grass. But with 90% of tour-level matches contested on other surfaces, many players don’t have much of a track record with which we can assess their grass-court prowess.

I was surprised, then, to find that grass court results are rather predictable. Elo-based forecasts of ATP grass court matches are almost as accurate as hard court predictions and considerably more effective than clay court forecasts. Even when we use “pure” surface forecasts–that is, predicting matches using ratings which draw only on results from that surface–grass court forecasts are a bit better than clay court predictions.

I started with a dataset of the roughly 50,000 ATP matches from 2000 through last week, excluding retirements and withdrawals. As a benchmark, I used official ATP rankings to make predictions for each of those matches. 66.6% of them were right, and the Brier score for ATP rankings over that span is .210. (Brier score measures the accuracy of a set of forecasts by averaging the squared error of each individual forecast, so a lower number is better. To put tennis-specific Brier scores in context, in 2016, ATP rankings had a .208 Brier score, and aggregate betting odds had a .189 Brier score.)
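
For readers who want to compute these two measures themselves, here is a minimal sketch. The Brier score follows the definition just given. The forecasts and outcomes are made up, and the function names are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def favorite_win_rate(forecasts, outcomes):
    """Share of matches in which the side given more than 50% actually won."""
    hits = sum((p > 0.5) == (o == 1) for p, o in zip(forecasts, outcomes))
    return hits / len(forecasts)


# Made-up forecasts for player A in four matches, and whether A won (1) or lost (0).
forecasts = [0.8, 0.6, 0.3, 0.55]
outcomes = [1, 0, 0, 1]

score = brier_score(forecasts, outcomes)
favorites = favorite_win_rate(forecasts, outcomes)
```

For context, a forecaster who says 50% for every match earns a Brier score of exactly 0.25, so the figures around 0.20 quoted here represent real but modest skill.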

Let’s break that down by surface and compare the performance of ATP rankings, Elo, and surface-specific Elo (sElo). “F%” is the percentage of matches won by the favorite, as determined by that system, and “Br” is Brier score:

Surface  ATP F%  ATP Br  Elo F%  Elo Br  sElo F%  sElo Br  
Hard      67.3%   0.207   68.0%   0.205    68.5%    0.202  
Clay      66.1%   0.211   67.1%   0.211    67.0%    0.213  
Grass     66.0%   0.215   67.6%   0.207    68.5%    0.207

All three rating systems do best on hard courts, and for good reason: official rankings and overall Elo are more heavily weighted toward hard court results than toward clay or grass. Surface-specific Elo does best on hard courts for a similar reason: more data.

Already, though, we can see the unexpected divergence of clay and grass courts, especially with surface-specific Elo. It’s possible to explain overall Elo’s better performance on grass courts by the presumed similarity between hard and grass–if a player excels on one, he’s probably good on the other, even if he’s horrible on clay. But that doesn’t explain sElo doing better on grass than on clay. There are 3.3 times as many tour-level matches on clay as on grass, so even allowing for the fact that players choose schedules to suit their surface preferences, almost everyone is going to have more results on dirt than on turf. More data should give us better results, but not here.

We can improve our forecasts even more by blending surface-specific ratings with overall ratings. After testing a wide range of possible mixes, it turns out that equally weighting Elo and sElo provides close to the best results. (The differences between, say, 60/40 and 50/50 are extremely small on all surfaces, so even where 60/40 is a bit better, I prefer to keep it simple with a half-and-half mix.) Here are the results for weighted surface Elos for all three surfaces:

Surface  Blend F%  Blend Br
Hard        68.6%     0.202
Clay        68.0%     0.207
Grass       69.8%     0.196
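
The half-and-half mix can be sketched in a few lines. This assumes the standard Elo logistic formula with a 400-point scale and assumes the blend is applied to the ratings before they are converted to a probability; the post does not say whether ratings or probabilities are averaged, and the function names and sample ratings are hypothetical.

```python
def elo_win_prob(rating_a, rating_b):
    """Standard Elo expectation that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def blended_rating(overall_elo, surface_elo, surface_weight=0.5):
    """Weighted mix of overall Elo and surface-specific Elo (0.5 = half-and-half)."""
    return surface_weight * surface_elo + (1.0 - surface_weight) * overall_elo


def blended_win_prob(a_overall, a_surface, b_overall, b_surface, surface_weight=0.5):
    """Forecast for A over B using blended ratings for both players."""
    a = blended_rating(a_overall, a_surface, surface_weight)
    b = blended_rating(b_overall, b_surface, surface_weight)
    return elo_win_prob(a, b)


# Hypothetical ratings: A is stronger overall, B is the grass specialist.
print(round(blended_win_prob(2000, 1850, 1900, 1950), 3))  # 0.5
print(round(blended_win_prob(2000, 1850, 1900, 1950, surface_weight=0.6), 3))
```

Changing `surface_weight` to 0.6 reproduces the kind of 60/40 mix mentioned above, which makes it easy to test a range of weights against the same set of matches.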

Now grass courts are the most predictable of the major surfaces! Even when we use a weighted average of Elo and sElo, grass court forecasts rely on less data than those of the other surfaces–the surface-specific half of the grass court forecasts uses less than one-third the match results of clay court predictions and less than one-fifth the results of hard court forecasts. In fact, we can do at least as well–and perhaps a tiny bit better–with even less data: A 50/50 weighting of grass-specific Elo and hard-specific Elo is just as accurate as the half-and-half mix of grass-specific and overall Elo.

Regardless of the exact formula, it remains striking that we can predict ATP grass court results so accurately from such limited data. Even if one-third of ATP events were played on grass, I still wouldn’t have been surprised if grass court results turned out to be the least predictable. The more a surface favors the server–and it’s hardest to break on grass–the tighter the scoreline will tend to be, introducing more randomness into the end result. Despite that structural tendency, we’re able to pick winners as successfully on grass as on the more common surfaces.

Here’s my theory: Even though there aren’t many grass court events, the conditions at those few tournaments are quite consistent. Altitude is roughly sea level, groundskeepers follow the lead of the staff at Wimbledon, and rain clouds are almost always in sight. Compare that homogeneity to the variety of hard courts or clay courts. The high-altitude hard courts in Bogota are nothing like the slow ones in Indian Wells. The “clay” in Houston is only nominally equal to the crushed brick of Roland Garros. While grass courts are almost identical to each other, clay courts are nearly as different from each other as they are from other surfaces.

It makes sense that ratings based on a uniform surface would be more accurate than ratings based on a wide range of surfaces, and it’s reassuring to find that the limited available data doesn’t cancel out the advantage. This research also suggests a further path to better forecasts: grouping hard and clay matches by a more precise measure of surface speed. If 10% of tour matches are sufficient to make accurate grass court predictions, the same may be true of the slowest one-third of clay courts. More data is almost always better, but sometimes, precisely targeted data is best of all.

Is Jelena Ostapenko More Than the Next Iva Majoli?

Italian translation at settesei.it

Winning a Grand Slam as a teenager–or, in the case of this year’s French Open champion Jelena Ostapenko, a just-barely 20-year-old–is an impressive feat. But it isn’t always a guarantee of future greatness. Plenty of all-time greats launched their careers with Slam titles at age 20 or later, but three of the players who won their debut major at ages closest to Ostapenko’s serve as cautionary tales in the opposite direction: Iva Majoli, Mary Pierce, and Gabriela Sabatini. Each of these women was within three months of her 20th birthday when she won her first title, and of the three, only Pierce ever won another.

However, simply comparing her age to that of previous champions understates the Latvian’s achievement. Women’s tennis has gotten older over the last two decades: The average age of a women’s singles entrant in Paris this year was 25.6, a few days short of the record established at Roland Garros and Wimbledon last year. That’s two years older than the average player 15 years ago, and four years older than the typical entrant three decades ago. Headed into the French Open this year, there were only five teenagers ranked in the top 100; at the end of 2004, the year of Maria Sharapova’s and Svetlana Kuznetsova’s first major victories, there were nearly three times as many.

Thus, it doesn’t seem quite right to group Ostapenko with previous 19- and 20-year-old first-time winners. Instead, we might consider the Latvian’s “relative age”—the difference between her and the average player in the draw—of 5.68 years younger than the field. When I introduced the concept of relative age last week, it was in the context of Slam semifinalists, and in every era, there have been some very young players reaching the final four who burned out just as quickly. The same isn’t true of women who went on to win major titles.

In the last thirty years, only two players have won a major with a greater relative age than Ostapenko: Sharapova, who was 6.66 years younger than the 2004 US Open field, and Martina Hingis, who won three-quarters of the Grand Slam in 1997 at age 16, between 6.3 and 6.6 years younger than each tournament’s average entrant. The rest of the top five emphasizes Ostapenko’s elite company, including Monica Seles (5.29, at the 1990 French Open) and Serena Williams (5.26, at the 1999 US Open).

Each of those four women went on to reach the No. 1 ranking and win at least five majors–an outrageously optimistic forecast for Ostapenko, who, even after winning Roland Garros, is ranked outside the top ten. By relative age, Majoli, Pierce, and Sabatini are poor comparisons for Saturday’s champion–Majoli and Pierce were only three years younger than the fields they overcame, and Sabatini was only two years younger than the average entrant. By comparison, Garbine Muguruza, who won last year’s French Open at age 22, was two and a half years younger than the field.

Which is it, then? Unfortunately I don’t have the answer to that, and we probably won’t have a better idea for several more years. For most of the Open Era, until about ten years ago, the average age on the women’s tour fluctuated between 21 and 23. Thus, for the overall population of first-time major champions, actual age and relative age are very highly correlated. It’s only with the last decade’s worth of debut winners that the numbers meaningfully diverge. For Ostapenko and Muguruza–and perhaps Victoria Azarenka and Petra Kvitova–we have yet to see what their entire career trajectory will look like. To build a bigger sample to test the hypothesis, we’ll need a few more young first-time Slam winners, something we may finally see with Sharapova and Williams out of the way.

For more post-French Open analysis, here’s my Economist piece on Ostapenko and projecting major winners in the long term. Also at the Game Theory blog, I wrote about Rafael Nadal and his absurd dominance on clay in a fast-court-friendly era.

Finally, check out Carl Bialik’s and my extra-long podcast, recorded Monday, with tons of thoughts on the winners and the fields in general.

Dominic Thiem and Reversible Blowouts

Italian translation at settesei.it

A few weeks ago in Rome, Dominic Thiem got destroyed by Novak Djokovic, 6-1 6-0. It was a letdown after Thiem’s previous-round upset of Rafael Nadal, and it seemed to provide a reminder of the old adage that tennis is about matchups. Even someone good enough to beat the King of Clay might struggle against a different sort of opponent.

Those struggles didn’t last. On Wednesday, Thiem faced Djokovic again, this time in the French Open quarterfinals, and won in straight sets. In less than three weeks, the Austrian bounced back from a brutal loss to defeat one of the greatest players of all time.

I’ve written before about the limited value of head-to-head records: When the head-to-head suggests that one player will win but the rankings disagree, the rankings prove to be the better forecaster. More sophisticated rating systems such as Elo would presumably do better still, though I haven’t done that exact test. There are certainly individual cases in which something specific about a matchup casts doubt on the predictiveness of the rankings, but if you have to pick one or the other, head-to-heads are the loser.

What about blowouts? Going into Wednesday’s quarterfinal, my surface-specific Elo ratings suggested that Thiem had a 26% chance of scoring the upset. The recent 6-1 6-0 loss was factored into those numbers, but only as a loss–there’s no consideration of severity. Should we have been even more skeptical of Thiem’s chances, given the most recent head-to-head result?

As it turns out, Thiem is far from the first player to turn things around after such a nasty scoreline. The most famous example is Robin Soderling, who lost 6-1 6-0 to Nadal in Rome in 2009, then bounced back to register one of the biggest upsets in tennis history, knocking out Rafa at Roland Garros. Few recoveries are so dramatic, but there are hundreds more.

Most players who lose by lopsided scorelines–for today’s purposes, I’m considering any match in which the loser won two games or fewer–never get a chance to redeem themselves. I found roughly 2250 such matches in the ATP’s modern era, and the same two players met again less than half of those times. The fact that the head-to-head continues is a signal itself: Mediocre players–the ones you’d expect to lose badly–don’t get another chance. Even some top-20 players rarely meet each other on court, so the sort of player who earns the chance for redemption might have already proven that his lopsided loss was just an off day.

Of the 951 occasions that a player loses badly and faces the same opponent again, he gets revenge and wins the next match 277 times–about 29%. Crazy as it sounds, if the only thing we knew about Djokovic and Thiem entering Wednesday’s match was that Djokovic had won the last match 6-1 6-0, our base forecast would’ve been pretty close to the 26% that the much more sophisticated Elo algorithm offered us.

29% is much higher than I expected, but it is lower than the typical rate for players in this situation. I found all head-to-heads of at least two meetings, and for every match after the first, counted whether it maintained or reversed the previous result. In addition to isolating lopsided scores, I also considered matches in which the loser won a set, on the assumption that those might be tighter matchups. Finally, for each of those categories, I tracked whether the follow-up matches were on the same surface as the previous one. Here are the results, with all win percentages shown from the perspective of the player who, like Thiem, lost the first encounter:

Score     Next Surface  Matches   Wins  Win %  
Any loss  All             68128  26586  39.0%  
Any loss  Same            31084  11855  38.1%  
Any loss  Diff            37044  14731  39.8%  
Bad loss  All               951    277  29.1%  
Bad loss  Same              457    128  28.0%  
Bad loss  Diff              494    149  30.2%  
Won set   All             26075  11286  43.3%  
Won set   Same            11766   4974  42.3%  
Won set   Diff            14309   6312  44.1%
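
The counting procedure behind this table can be sketched as follows. The record layout (the dictionary keys) and the three sample matches are hypothetical; the sketch assumes matches arrive in chronological order and that each follow-up is compared only with the immediately preceding meeting of the same two players.

```python
from collections import defaultdict


def reversal_tallies(matches, bad_loss_games=2):
    """Return {(score type, next surface): [follow-up matches, wins by previous loser]}."""
    last_meeting = {}                       # pair of players -> their previous match
    tallies = defaultdict(lambda: [0, 0])
    for m in matches:
        pair = frozenset((m["winner"], m["loser"]))
        prev = last_meeting.get(pair)
        if prev is not None:
            score_types = ["Any loss"]
            if prev["loser_games"] <= bad_loss_games:
                score_types.append("Bad loss")
            if prev["loser_sets"] >= 1:
                score_types.append("Won set")
            surface = "Same" if m["surface"] == prev["surface"] else "Diff"
            reversed_result = m["winner"] == prev["loser"]
            for score in score_types:
                for surf in ("All", surface):
                    tallies[(score, surf)][0] += 1
                    tallies[(score, surf)][1] += int(reversed_result)
        last_meeting[pair] = m
    return dict(tallies)


# Three made-up meetings between players "A" and "B".
sample = [
    {"winner": "A", "loser": "B", "loser_games": 1, "loser_sets": 0, "surface": "Clay"},
    {"winner": "B", "loser": "A", "loser_games": 8, "loser_sets": 1, "surface": "Clay"},
    {"winner": "B", "loser": "A", "loser_games": 5, "loser_sets": 0, "surface": "Hard"},
]
tallies = reversal_tallies(sample)
for key, (n, wins) in sorted(tallies.items()):
    print(key, n, wins, f"{100 * wins / n:.1f}%")
```

Dividing the second number by the first in each tally gives the "Win %" column, for example 277 / 951 for bad losses on all surfaces.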

The chances of recovering from a bad loss are better than I thought, but they are considerably worse than the odds that a player reverses the result after a less conspicuous scoreline–39%. The table also shows that the player seeking revenge is more likely to get it if the opportunity arises on a different surface, though not by a wide margin.

It’s clear that players are less likely to recover from a bad loss than from a more typical one, but how much of that is selection bias? After all, most of the players who lose 6-1 6-0 aren’t of the caliber of Thiem or Soderling, even if they are good enough to stick around in main draws and ultimately face the same opponent again.

To answer that question, I looked again at those 951 post-blowout matches, this time with pre-match Elo ratings. After eliminating everything before 1980 and a few other matchups with very little data, we were left with just under 600 data points. In this subset, Elo predicted that the players who lost badly had a 33.6% chance of winning the follow-up match. As we’ve seen, the actual success rate was 29%. Players who won lopsided matches outperformed their Elo forecast in the next meeting.

It’s not a huge difference, but enough to suggest that the matchup tells us a little bit about how the next contest will go. One match can make a difference in the forecast–as long as it isn’t against Dominic Thiem.

Digging into the cases when a player lost badly and then recovered, I found a couple of entertaining examples:

  • Former No. 7 Harold Solomon beat Ivan Lendl in their first meeting, 6-1 6-1. Later that year, they met again at the US Open, and Lendl won, 6-1 6-0 6-0. Lendl also won their six matches after that.
  • Over the course of four years, Phil Dent and Mark Cox played three lopsided matches against each other. Cox won the first, Dent got revenge in the second, and Cox reversed things again in the third.

The Negative Impact of Time on Court

Italian translation at settesei.it

With 96 men’s matches in the books so far at Roland Garros this year, we’ve seen only one go to the absolute limit, past 6-6 in the fifth set. Still, we’ve had our share of lengthy, brutal five-set fights, including three matches in the first round that exceeded the four-hour mark. The three winners of those battles–Victor Estrella, David Ferrer, and Rogerio Dutra Silva–all fell to their second-round opponent.

A few years ago, I identified a “hangover effect” after Grand Slam marathons, defined as those matches that reach 6-6 in the fifth. Players who emerge victorious from such lengthy struggles would often already be considered underdogs in their next matches–after all, elite players rarely need to work so hard to advance–but marathon winners underperform even when we take their underdog status into account. (Earlier this week, I showed that women suffer little or no hangover effect after marathon third sets.)

A number of readers suggested I take a broader look at the effect of match length. After all, there are plenty of slugfests that fall just short of the marathon threshold, and some of those, like Ferrer’s loss yesterday to Feliciano Lopez, 6-4 in the final set, are more physically testing than some of those that reach 6-6. Match time still isn’t a perfect metric for potential fatigue–a four-hour match against Ferrer is qualitatively different from four hours on court with Ivo Karlovic–but it’s the best proxy we have for a very large sample of matches.

What happens next?

I took over 7,200 completed men’s singles matches from Grand Slams back to 2001 and separated them into groups by match time: one hour to 1:29, 1:30 to 1:59, and so on, up to a final category of 4:30 and above. Then I looked at how the winners of all those matches fared against their next opponents:

Prev Length   Matches  Wins  Win %  
1:00 to 1:29      448   275  61.4%  
1:30 to 1:59     1918  1107  57.7%  
2:00 to 2:29     1734   875  50.5%  
2:30 to 2:59     1384   632  45.7%  
3:00 to 3:29      976   430  44.1%  
3:30 to 3:59      539   232  43.0%  
4:00 to 4:29      188    64  34.0%  
4:30 and up        72    23  31.9%

The trend couldn’t be any clearer. If the only thing you know about a Slam matchup is how long the players spent on court in their previous match, you’d bet on the guy who recorded his last win in the shortest amount of time.

Of course, we know a lot more about the players than that. Andy Murray spent 3:34 on court yesterday, but even with his clay-court struggles this year, we would favor him in the third round against most of the men in the draw. As I’ve done in previous studies, let’s account for overall player skill by estimating the probability of each player winning each of these 7,200+ matches. Here are the same match-length categories, with “expected wins” (based on surface-specific Elo, or sElo) shown as well:

Prev Length   Wins  Exp Wins  Exp Win %  Ratio  
1:00 to 1:29   275       258      57.5%   1.07  
1:30 to 1:59  1107      1058      55.2%   1.05  
2:00 to 2:29   875       881      50.8%   0.99  
2:30 to 2:59   632       657      47.5%   0.96  
3:00 to 3:29   430       445      45.6%   0.97  
3:30 to 3:59   232       244      45.3%   0.95  
4:00 to 4:29    64        77      41.2%   0.83  
4:30 and up     23        30      42.1%   0.76

Again, there’s not much ambiguity in the trend here. Better players spend less time on court, so if you know someone beat their previous opponent in 1:14, you can infer that he’s a very good player. Often that assumption is wrong, but in the aggregate, it holds up.

The “Ratio” column shows the relationship between actual winning percentage (from the first table) and expected winning percentage. If previous match time had no effect, we’d expect to see ratios randomly hovering around 1. Instead, we see a steady decline from 1.07 at the top–meaning that players coming off of short matches win 7% more often than their skill level would otherwise lead us to forecast–to 0.76 at the bottom, indicating that competitors tend to underperform following a battle of 4:30 or longer.
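
As a check on the arithmetic, the ratio column can be rebuilt from the two tables. This sketch assumes the ratio is actual win percentage divided by expected win percentage, as described above; the variable and function names are illustrative.

```python
# (previous match length, matches, wins, expected win % from sElo),
# copied from the two tables above.
ROWS = [
    ("1:00 to 1:29", 448, 275, 57.5),
    ("1:30 to 1:59", 1918, 1107, 55.2),
    ("2:00 to 2:29", 1734, 875, 50.8),
    ("2:30 to 2:59", 1384, 632, 47.5),
    ("3:00 to 3:29", 976, 430, 45.6),
    ("3:30 to 3:59", 539, 232, 45.3),
    ("4:00 to 4:29", 188, 64, 41.2),
    ("4:30 and up", 72, 23, 42.1),
]


def performance_ratio(wins, matches, expected_win_pct):
    """Actual win % divided by the win % the ratings predicted."""
    actual_win_pct = 100.0 * wins / matches
    return actual_win_pct / expected_win_pct


ratios = [round(performance_ratio(w, m, e), 2) for _, m, w, e in ROWS]
for (label, *_), ratio in zip(ROWS, ratios):
    print(f"{label:<13} {ratio:.2f}")
```

A ratio above 1 means the group won more often than its ratings implied, and a ratio below 1 means it won less often.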

It’s difficult to know whether we’re seeing a direct effect of time on court or a proxy for form. As good as surface-specific Elo ratings are, they don’t capture everything that could possibly predict the outcome of a match, especially micro-level considerations like a player’s comfort on a specific type of surface or at a certain tournament. sElo also needs a little time to catch up with players making fast improvements, particularly when they are very young. All this is to say that our correction for overall skill level will never be perfect.

Thus, a 75-minute win may improve a player’s chances by keeping him fresh for the next round … or it might tell us that–for whatever reason–he’s a stronger competitor right now than our model gives him credit for. One point in favor of the latter is that, at the most extreme, less time on court doesn’t help: Players don’t appear to benefit from advancing via walkover. That isn’t a slam-dunk argument–some commentators believe that walkovers could be detrimental due to the long resulting layoff at a Slam–but it does show us that less time on court isn’t always a positive.

Whatever the underlying cause, we can tweak our projections accordingly. Murray could be a little weaker than usual tomorrow after his lengthy battle yesterday with Martin Klizan. Albert Ramos, the only man to complete a second-rounder in less than 90 minutes, might be playing a bit better than his rating suggests. It’s certainly evident that match time has something to tell us even when players aren’t stretched to the breaking point of a marathon fifth set.

Men’s Doubles On the Dirt

Angelique Kerber wasn’t the only top seed to crash out early at this year’s French Open. In the men’s doubles draw, the top section opened up when Henri Kontinen and John Peers, the world’s top-ranked team, lost to the Spanish pair of David Marrero and Tommy Robredo. It’s plausible to attribute the upset to the clay, as Kontinen-Peers have tallied a pedestrian five wins against four losses on the dirt this season and one could guess that the Spaniards are at their strongest on clay.

Fortunately we don’t have to guess. Using a doubles variant of sElo–surface-specific Elo, which I began writing about a few days ago in the context of women’s singles–we can make rough estimates of how Kontinen/Peers would fare against Marrero/Robredo on each surface. The top seeds are solid on all surfaces–less than a year ago, they won a clay title in Hamburg–but stronger on hard courts. sElo ranks them 4th and 8th on hard, but 10th and 13th on clay among tour regulars. Marrero is the surface specialist of the bunch, ranking 37th on clay and 78th on hard. Robredo throws a wrench into the exercise, as he has played very little doubles recently, only eight events since the beginning of 2016.

Using these numbers–including those derived from Robredo’s limited sample–we find that sElo would have given Kontinen/Peers a 73.6% chance of winning yesterday, compared to a 78.3% chance on a hard court. Even if we adjust Robredo’s clay-court sElo to something closer to his all-surface rating, the top seeds still look like 69% favorites.

A more striking example comes from yesterday’s other big upset, in which Julio Peralta and Horacio Zeballos took out Feliciano Lopez and Marc Lopez. On any surface, the Lopezes are the superior team, but Peralta and Zeballos have a much larger surface differential:

Player    Hard sElo  Clay sElo  
M Lopez        1720       1804  
F Lopez        1713       1772  
Zeballos       1651       1756  
Peralta        1517       1770

On a hard court, sElo gives the Lopezes a 68.1% chance of winning this matchup. But on clay, the gap narrows all the way to 53.6%. It’s still a bit of an upset for the South Americans, but not one that should come as much of a surprise.
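
One way to get from the individual ratings in the table to a team forecast is sketched below. It assumes a team’s rating is the simple average of its two players’ sElo and that the standard Elo logistic formula applies; the post does not spell out the exact method, so treat both as assumptions. With the rounded ratings shown above, the results land close to the quoted 68.1% and 53.6%.

```python
def elo_win_prob(rating_a, rating_b):
    """Standard Elo expectation that side A beats side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def team_rating(player_1, player_2):
    """Assumed team strength: the mean of the two individual ratings."""
    return (player_1 + player_2) / 2.0


# Ratings copied from the table above: (hard sElo, clay sElo).
SELO = {
    "M Lopez": (1720, 1804),
    "F Lopez": (1713, 1772),
    "Zeballos": (1651, 1756),
    "Peralta": (1517, 1770),
}

forecasts = {}
for idx, surface in enumerate(("hard", "clay")):
    lopezes = team_rating(SELO["M Lopez"][idx], SELO["F Lopez"][idx])
    challengers = team_rating(SELO["Zeballos"][idx], SELO["Peralta"][idx])
    forecasts[surface] = elo_win_prob(lopezes, challengers)
    print(surface, f"{100 * forecasts[surface]:.1f}%")
```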

Mismatches

I’ve speculated in the past that surface preferences aren’t as pronounced in doubles as they are in singles. Regardless of surface, points are shorter, and many teams position one player at the net even on the dirt. While some hard-courters are probably uncomfortable on clay (and vice versa), I wouldn’t expect the effects to be as substantial as they are in singles.

The numbers tell a different story. Here are the top ten, ranked by hard court sElo:

Rank  Player          Hard sElo  
1     Jack Sock            1947  
2     Nicolas Mahut        1893  
3     Marcelo Melo         1883  
4     Henri Kontinen       1879  
5     P-H Herbert          1862  
6     Bob Bryan            1851  
7     Mike Bryan           1846  
8     John Peers           1842  
9     Bruno Soares         1829  
10    Jamie Murray         1828

By clay court sElo:

Rank  Player                Clay sElo  
1     Mike Bryan                 1950  
2     Bob Bryan                  1950  
3     P-H Herbert                1894  
4     Nicolas Mahut              1889  
5     Jack Sock                  1887  
6     Robert Farah               1850  
7     Juan Sebastian Cabal       1849  
8     Pablo Cuevas               1824  
9     Rohan Bopanna              1812  
10    John Peers                 1810

Jamie Murray and Bruno Soares, who appear in the hard court top ten, sit outside the top 25 in clay court sElo. Robert Farah and Juan Sebastian Cabal are 41st and 42nd in hard court sElo, despite ranking in the clay court top seven. Pablo Cuevas, another clay court top-tenner, is 87th on the hard court list.

To go beyond these anecdotes–noteworthy as they are–we need to compare the level of surface preference in men’s doubles to other tours. To do that, I calculated the correlation coefficient between hard court and clay court sElo for the top 50 players (ranked by overall Elo) in men’s doubles, men’s singles, and women’s singles. (I don’t yet have an adequate database to generate ratings for women’s doubles.)

In other words, we’re testing how much a player’s results on one surface predict his or her results on the other major surface. The higher the correlation coefficient, the more likely it is that a player will have similar results on hard and clay. Here’s how the tours compare:

Tour             Correl  
Men's Singles     0.708  
Women's Singles   0.417  
Men's Doubles     0.323

In contrast to my hypothesis above, surface preferences in men’s doubles appear to be much stronger than in either men’s or women’s singles. (And there’s a huge difference between men’s and women’s singles, but that’s a subject for another day.)
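
For reference, here is a minimal sketch of the correlation calculation, assuming the coefficient in question is the ordinary Pearson correlation. The sample uses only the six players who appear in both top-ten lists above, so its result is purely illustrative and is not the top-50 figure in the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Players listed in both top tens above: Sock, Mahut, Herbert, B Bryan, M Bryan, Peers.
hard_selo = [1947, 1893, 1862, 1851, 1846, 1842]
clay_selo = [1887, 1889, 1894, 1950, 1950, 1810]
r = pearson_r(hard_selo, clay_selo)
print(round(r, 3))
```

Running the same function over each tour’s top 50 hard and clay ratings would produce the three numbers in the table.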

Randomness

I suspect that the low correlation of surface-specific Elos in men’s doubles is partly due to the more random nature of doubles results. Because the event is more serve-dominated, there are more close sets ending in tiebreaks, and because of the no-ad, super-tiebreak format used outside of Slams, tight matches are decided by a smaller number of points. Thus, every doubles player’s results–and their various Elo ratings–reflect the influence of chance more than singles results do.

Another consideration–one that I haven’t yet made sense of–is that surface-specific ratings don’t improve doubles forecasts the way that they do men’s and women’s singles predictions. As I wrote on Sunday, sElo represents a big improvement over surface-neutral Elo for women’s forecasts, and in an upcoming post, I’ll be able to make some similar observations for the men’s game. Using Brier score, a measure of the calibration of predictions, we can see the effect of using surface-specific Elo ratings in 2016 tour-level matches:

Tour             Elo Brier  sElo Brier  
Men's Singles        0.202       0.169  
Women's Singles      0.220       0.179  
Men's Doubles        0.171       0.181

The lower the Brier score, the more accurate the forecasts. This isn’t a fluke of 2016: The differences in men’s doubles Brier scores are around 0.01 for each of the last 15 seasons. By this measure, Elo does a very good job predicting the outcome of men’s doubles matches, but the surface-specific sElo represents a small step back. It could be that the smaller sample–using only one surface’s worth of results–is more damaging to forecasts in doubles than it is in singles.

Doubles analytics is particularly uncharted territory, and there’s plenty of work remaining for researchers even in this narrow subtopic. There’s lots of work to do for the world’s top doubles players as well, now that we can point to a noticeably weaker surface for so many of them.

The Steadily Less Predictable WTA

Italian translation at settesei.it

Update: The numbers in this post summarizing the effectiveness of sElo are much too high–a bug in my code led to calculating effectiveness with post-match ratings instead of pre-match ratings. The parts of the post that don’t have to do with sElo are unaffected and–I hope–remain of interest.

One of the talking points throughout the 2017 WTA season has been the unpredictability of the field. With the absence of Serena Williams, Victoria Azarenka, and until recently, Petra Kvitova and Maria Sharapova, there is a dearth of consistently dominant players. Many of the top remaining players have been unsteady as well, due to some combination of injury (Simona Halep), extreme surface preferences (Johanna Konta), and good old-fashioned regression to the mean (Angelique Kerber).

No top seed has yet won a title at the Premier level or above so far this year. Last week, Stephanie Kovalchik went into more detail, quantifying how seeds have failed to meet expectations and suggesting that the official WTA ranking system–the algorithm that determines which players get those seeds–has failed.

There are plenty of problems with the WTA ranking system, especially if you expect it to have predictive value–that is, if you want it to properly reflect the performance level of players right now. Kovalchik is correct that the rankings have done a particularly poor job this year identifying the best players. However, there’s something else going on: According to much more accurate algorithms, the WTA is more chaotic than it has been for decades.

Picking winners

Let’s start with a really basic measurement: picking winners. Through Rome, there had been more than 1100 completed WTA matches. The higher-ranked player won 62.4% of those. Since 1990, the ranking system has picked the winner of 67.9% of matches, and topped 70% during several years in the 1990s. It never fell below 66% until 2014, and this year’s 62.4% is the worst in the 28-year time frame under consideration.

Elo does a little better. It rates players by the quality of their opponents, meaning that draw luck is taken out of the equation, and does a better job of estimating the ability level of players like Serena and Sharapova, who for various reasons have missed long stretches of time. Since 1990, Elo has picked the winner of 68.6% of matches, falling to an all-time low of 63.1% so far in 2017.

For a big improvement, we need surface-specific Elo (sElo). An effective surface-based system isn’t as complicated as I expected it to be. By generating separate rankings for each surface (using only matches on that surface), sElo has correctly predicted the winner of 76.2% of matches since 1990, almost cracking 80% back in 1992. Even sElo is baffled by 2017, falling to its lowest point of 71.0%.

(sElo for all three major surfaces is now shown on the Tennis Abstract Elo ratings report.)

This graph shows how effectively the three algorithms picked winners. It’s clear that sElo is far better, and the graph also shows that some external factor is driving the predictability of results, affecting the accuracy of all three systems to a similar degree:

Brier scores

We see a similar effect if we use a more sophisticated method to rate the WTA ranking system against Elo and sElo. The Brier score of a collection of predictions measures not only how accurate they are, but also how well calibrated they are–that is, a player forecast to win a matchup 90% of the time really does win nine out of ten, not six out of ten, and vice versa. Brier scores average the square of the difference between each prediction and its corresponding result. Because it uses the square, very bad predictions (for instance, that a player has a 95% chance of winning a match she ended up losing) far outweigh more pedestrian ones (like a player with a 95% chance going on to win).
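
Two small illustrations of the points in this paragraph, offered as a sketch rather than the code behind the post: the first shows how unevenly the squared error treats a confident miss and a confident hit, and the second groups forecasts into bins to check calibration. The function names, the ten-bin choice, and the sample data are all assumptions.

```python
def squared_error(forecast, outcome):
    """Contribution of one forecast to the Brier score (outcome is 1 or 0)."""
    return (forecast - outcome) ** 2


def calibration_table(forecasts, outcomes, n_bins=10):
    """Return (matches, mean forecast, observed win rate) for each non-empty bin."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, forecast total, wins
    for p, o in zip(forecasts, outcomes):
        i = min(int(p * n_bins), n_bins - 1)     # a forecast of 1.0 goes in the top bin
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += o
    return [(n, total / n, wins / n) for n, total, wins in bins if n > 0]


print(round(squared_error(0.95, 0), 4))  # 0.9025: a 95% favorite who lost
print(round(squared_error(0.95, 1), 4))  # 0.0025: a 95% favorite who won

# Ten hypothetical 90% favorites, nine of whom won: well calibrated.
table = calibration_table([0.9] * 10, [1] * 9 + [0])
print(table)
```

In a well-calibrated system, the mean forecast and the observed win rate in each row stay close to each other.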

In 2017 so far, the official WTA ranking system has a Brier score of .237, compared to Elo of .226 and sElo of .187. Lower is better, since we want a system that minimizes the difference between predictions and actual outcomes. All three numbers are the highest of any season since 1990. The corresponding averages over that time span are .207 (WTA), .202 (Elo), and .164 (sElo).

As with the simpler method of counting correct predictions, we see that Elo is a bit better than the official ranking, and both of the surface-agnostic methods are crushed by sElo, even though the surface-specific method uses considerably less data. (For instance, the clay-specific Elo ignores hard and grass court results entirely.) And just like the results of picking winners, we see that the differences in Brier scores of the three methods are fairly consistent, meaning that some other factor is causing the year-to-year differences:

The takeaway

The WTA ranking system has plenty of issues, but its unusually bad performance this year isn’t due to any quirk in the algorithm. Elo and sElo are structured completely differently–the only thing they have in common with the official system is that they use WTA match results–and they show the same trends in both of the above metrics.

One factor affecting the last two years of forecasting accuracy is the absence of players like Serena, Sharapova, and Azarenka. If those three played full schedules and won at their usual clip, there would be quite a few more correct predictions for all three systems, and perhaps there would be fewer big upsets from the players who have tried to replace them at the top of the game.

But that isn’t the whole story. A bunch of no-brainer predictions don’t affect Brier score very much, and the presence of heavily-favored players also makes it more likely that massively surprising results occur, such as Serena’s loss to Madison Brengle, or Sharapova’s ouster at the hands of Eugenie Bouchard. Many unexpected results are completely independent of the top ten, like Marketa Vondrousova’s recent title in Biel.

While some of the year-to-year differences in the graphs above are simply noise, the last several years look much more like a meaningful trend. It could be that we are seeing a large-scale changing of the guard, with young players (and their low rankings) regularly upsetting established stars, while the biggest names in the sport are spending more time on the sidelines. Upsets may also be somewhat contagious: When one 19-year-old aspirant sees a peer beating top-tenners, she may be more confident that she can do the same.

Whatever influences have given us the WTA’s current state of unpredictability, we can see that it’s not just a mirage created by a flawed ranking system. Upsets are more common now than at any other point in recent memory, whichever algorithm you use to pick your favorites.

Del Potro’s Draws and the Possible Persistence of Bad Luck

Italian translation at settesei.it

Tennis’s draw gods have not been kind to Juan Martin del Potro this year.

In Acapulco and Indian Wells, he drew Novak Djokovic as his second-match opponent. In Miami, Delpo got a third-rounder with Roger Federer. In each of the March Masters events, with 1,000 ranking points at stake, del Potro was handed the most difficult opponents for his first round against a fellow seed. Thanks in part to the resulting early exits, one of the most dangerous players on tour is still languishing outside of the top 30 in the ATP rankings.

When I wrote about the Indian Wells quarter of death–the section of the draw containing del Potro, Djokovic, Federer, Rafael Nadal, and Nick Kyrgios–I attempted to quantify the effect of the draw on each player’s expected ranking points. Before each player’s name was placed in the bracket, my model predicted that Delpo would earn about 150 ranking points–the weighted average of his likelihood of reaching the third round, the fourth round, and so on–and after the draw was conducted, his higher probability of a clash with Djokovic knocked that number down to just over 100. That negative effect was one of the worst of any player in the tournament.
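The weighted average described above can be sketched in a few lines. Everything in this example is an assumption for illustration: the probabilities are invented, the points table is a placeholder rather than the official ATP breakdown, and the function name is hypothetical.

```python
def expected_points(reach_prob, points):
    """Weighted average of ranking points.

    reach_prob[i] is the chance of reaching result level i or better
    (level 0 is simply playing the first match, so it is 1.0; the last
    level is winning the title). points[i] is what a player earns when
    level i is his final result.
    """
    total = 0.0
    for i, pts in enumerate(points):
        next_prob = reach_prob[i + 1] if i + 1 < len(reach_prob) else 0.0
        total += (reach_prob[i] - next_prob) * pts  # chance of finishing exactly here
    return total

# Hypothetical forecast: lose first match, lose second, ..., win the title.
reach = [1.0, 0.7, 0.3, 0.15, 0.08, 0.04, 0.02]
pts = [10, 45, 90, 180, 360, 600, 1000]  # placeholder points table
delpo_like = expected_points(reach, pts)
```

A tougher draw lowers the probabilities of reaching the later rounds, and because those rounds carry most of the points, the weighted average falls quickly.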

The story in Miami is similar, if less extreme. Pre-draw, Delpo’s expected points were 183. Post draw: 155. In the four tournaments he has entered this year, he has been uniformly unlucky:

Tournament    Pre-Draw  Post-Draw  Effect  
Delray Beach      89.3       74.0  -17.1%  
Acapulco         121.5       97.1  -20.1%  
Indian Wells*    154.6      102.5  -33.7%  
Miami            182.9      155.4  -15.0%  
TOTAL            548.2      429.0  -21.7%

*The numbers above for Indian Wells are slightly different than what I published in the Indian Wells article, since the simulations I ran for this post consider the entire 96-player field, not just the 64-player second round.
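The Effect column is just the percentage change from pre-draw to post-draw expected points. As a quick check, this sketch recomputes it from the figures in the table; the variable and function names are mine.

```python
# (pre-draw, post-draw) expected points, copied from the table above
draws = {
    "Delray Beach": (89.3, 74.0),
    "Acapulco": (121.5, 97.1),
    "Indian Wells": (154.6, 102.5),
    "Miami": (182.9, 155.4),
    "TOTAL": (548.2, 429.0),
}

def draw_effect(pre, post):
    """Change in expected points caused by the draw, as a fraction of pre-draw."""
    return (post - pre) / pre

effects = {name: round(100 * draw_effect(pre, post), 1)
           for name, (pre, post) in draws.items()}
points_lost = draws["TOTAL"][0] - draws["TOTAL"][1]  # roughly 119 points
```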

The good news, as we’ll see, is that it’s virtually impossible for this degree of misfortune to continue. The bad news is that those 119 points are gone forever, and at Delpo’s current position in the ranking table, that disadvantage will affect his tournament seeds, which in turn will result in worse draws (earlier meetings with higher-ranked players, independent of luck) for at least another few weeks.

Before we go any further, let me review the methodology I’m using here. (If you’re not interested, skip this paragraph.) For “post-draw” expected points, I’m taking jrank-based forecasts–like the ones on the front page of Tennis Abstract–and using each player’s probability of reaching each round to calculate a weighted average of expected points. “Pre-draw” forecasts are much more computationally demanding. In Miami, for instance, Delpo could’ve faced any of the 64 unseeded players in the second round and been slated to meet any of the top eight seeds in the third round. For each tournament, I ran a Monte Carlo simulation with the tournament seeds, generating a new draw and simulating the tournament 100,000 times, then averaging the outcomes. So in the pre-draw forecast, Delpo had a one-eighth chance of getting Fed in the third round, a one-eighth chance of getting Kei Nishikori there, and so on.
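To make the pre-draw versus post-draw distinction concrete, here is a toy sketch, not the jrank-based code behind the numbers in this post. It assumes an eight-player field, made-up ratings fed through the standard Elo win-probability formula, a simplified seeding rule (two seeds at opposite ends of the bracket), and an invented points table. All names are hypothetical.

```python
import random

def win_prob(r_a, r_b):
    """Standard Elo expectation that player A beats player B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def simulate_bracket(bracket, ratings, rng):
    """Play out one single-elimination bracket; return match wins per player."""
    wins = {p: 0 for p in bracket}
    alive = list(bracket)
    while len(alive) > 1:
        survivors = []
        for a, b in zip(alive[::2], alive[1::2]):
            winner = a if rng.random() < win_prob(ratings[a], ratings[b]) else b
            wins[winner] += 1
            survivors.append(winner)
        alive = survivors
    return wins

def random_draw(seeds, others, rng):
    """Toy seeding rule: two seeds at opposite ends, everyone else shuffled."""
    pool = list(others)
    rng.shuffle(pool)
    return [seeds[0]] + pool + [seeds[1]]

def post_draw_points(player, bracket, ratings, points, rng, n_sims):
    """Average points over many simulations of one fixed bracket."""
    total = sum(points[simulate_bracket(bracket, ratings, rng)[player]]
                for _ in range(n_sims))
    return total / n_sims

def pre_draw_points(player, seeds, others, ratings, points, rng, n_sims):
    """Average points when a fresh draw is generated for every simulation."""
    total = 0
    for _ in range(n_sims):
        bracket = random_draw(seeds, others, rng)
        total += points[simulate_bracket(bracket, ratings, rng)[player]]
    return total / n_sims

# Made-up field: two seeds, one dangerous floater, five weaker players.
ratings = {"Seed1": 2200, "Seed2": 2100, "Target": 2000}
fillers = ["P4", "P5", "P6", "P7", "P8"]
ratings.update({name: 1800 for name in fillers})
points = [0, 10, 25, 50]  # invented points for 0, 1, 2, or 3 match wins

rng = random.Random(0)
pre = pre_draw_points("Target", ["Seed1", "Seed2"], ["Target"] + fillers,
                      ratings, points, rng, 20000)
unlucky = ["Seed1", "Target"] + fillers + ["Seed2"]  # top seed in round one
post = post_draw_points("Target", unlucky, ratings, points, rng, 20000)
effect = (post - pre) / pre  # negative: this draw cost the floater points
```

The pre-draw figure needs a new bracket on every iteration, which is why it is the more expensive of the two calculations.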

It seems clear that a 22%, 119-point rankings hit over the course of four tournaments is some seriously bad luck. Last year, there were about 750 instances of a player being seeded at an ATP tournament, and in fewer than 60 of those, the draw resulted in an effect of -22% or worse on the player’s expected ranking points. And that’s just one tournament! The odds that Delpo would get such a rough deal in all four of his 2017 tournaments are 1 in more than 20,000.
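One way to arrive at odds of that size is to treat each tournament as an independent event with the same chance of a draw that bad. That independence, and the use of 60 out of 750 as the per-tournament rate, are assumptions of this sketch rather than a description of the exact calculation.

```python
bad_draws = 60        # upper bound: "fewer than 60" instances of -22% or worse
seeded_entries = 750  # approximate count of seeded entries last year

p_single = bad_draws / seeded_entries  # chance of such a draw at one event
p_four = p_single ** 4                 # four events in a row, if independent
one_in = 1 / p_four                    # express as "1 in N"
```

Because the true count is below 60, the real per-event rate is a little lower and the four-event odds are correspondingly longer.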

Over the course of a full season, draw luck mostly evens out. It’s rare to see an effect of more than 10% in either direction. Last year, Thiemo de Bakker saw a painful difference of 18% between his pre-draw and post-draw expected points in 12 ATP events, but everyone else with at least that many tournaments fell between -11% and +11%, with three-quarters of players between -5% and +5%. Even when draw luck doesn’t balance itself out, the effect isn’t as bad as what Delpo has seen in 2017.

Del Potro’s own experience in 2016 is a case in point. His most memorable event of the season was the Olympics, where he drew Djokovic in the first round, so it’s easy to recall his year as being equally riddled with bad luck. But in his 12 other ATP events, the draw aided him in six–including a +34% boost at the US Open–and hurt him at the other six. Altogether, his 2016 ATP draws gave him a 5.9% advantage over his “pre-draw” expected points–a bonus of 17 ranking points. (I didn’t include the Olympics, since no ranking points were awarded there.)

Taken together, Delpo’s 2016-17 draws have deprived him of about 100 ranking points. Getting those points back would move him only three spots up the ranking table, so even with a short stretch of extreme misfortune, draw luck hasn’t affected him that much. Last year’s most extreme case among elite players, Richard Gasquet, suffered a similar effect: His draws knocked down his expected take by 9%, or 237 points, a difference that would bump him up from #22 to #19 in this week’s ranking list.

There are many reasons to believe that del Potro is a much better player than his current ranking suggests, such as his Elo rating, which stands at No. 7. But his ATP ranking reflects his limited schedule and modest start last year much more than it does the vagaries of each week’s brackets. The chances are near zero that he will continue to draw the toughest player in each tournament’s field in the earliest possible round, so we’ll soon have a better idea of what exactly he is capable of, and where exactly he should stand in the rankings.