Forecasting – Page 2 – Heavy Topspin

Do Players Like Daniil Medvedev Eventually Start Winning on Clay?

Daniil Medvedev is within a whisker of the ATP number two ranking, and he has twice reached a grand slam final. He has a big serve, but he’s more than a serve-bot, and his resourceful, varied baseline game suggests he has the tools to excel on all surfaces.

Yet out of 28 career tour-level matches on clay, he’s won 10. Ten wins is an awfully meager haul for a 25-year-old with his sights set on the sport’s top honors.

I put together a list of about 140 ATP top-tenners–that’s basically all of them, with the exception of those whose careers were well underway at the start of the Open Era. For each one, I tallied up their clay court winning percentage in their first 28 matches (or 29 or 30, if the 28th came in the middle of an event), their hard-court results up to the same point in their career, and their eventual clay court results.

When Medvedev played his most recent match on dirt last September, it dropped his clay winning percentage to 35.7%, compared to a hard court record of 116-51, or 69.5%. Few top ATPers have begun their careers so ineffective on clay or so deadly on hard.

In fact, only 5 of the 140 players were worse in their first 28 clay matches. It’s a motley bunch, ranging from Joachim Johansson (who only played 17 matches on the surface in his career) to Kevin Curren (who only got there at age 34) to Diego Schwartzman, who is best on clay, but was overmatched early in his career. The guys tied with Medvedev are an equally mixed crowd, including those who preferred to skip the clay–Tim Henman, Paradorn Srichaphan–and those who took some time to get their footing at tour level, such as Nicolas Almagro and Robin Soderling.

Unlike Daniil Medvedev

This sampling of names suggests that the question I started with is difficult to define. On paper, Henman was “like” Medvedev. By the time he finished his 28th clay match, he had already played 152 times at tour level on hard courts, winning two thirds of them. But their playing styles are so different that the statistical similarities could be misleading.

Let’s narrow the list of comparable players to those who meet the following criteria:

lost more than half of their first 28 clay matches
had played at least 75 hard-court matches by the time they played their 28th on clay (in other words, they weren’t slow-starting dirtballers like Schwartzman or Almagro)
played at least 40 more clay-court matches in their careers (to exclude the blatant clay-avoiders like Curren and Srichaphan)

The following table shows the remaining 14 players, plus Medvedev. I’ve included the age when they played their 28th clay match, and their winning percentages on clay and hard up to that point. The final three columns show how things proceeded from there–after the tournament in which they played their 28th clay match (“Future”), you can see how many clay matches they played, what percentage they won, and how many titles they took home:

Player        Age  Clay%  Hard%  Future: M   W%  Titles  
T Johansson  24.1    29%    56%         79  38%       0  
Soderling    21.7    34%    53%        109  70%       3  
Henman       24.7    34%    66%         90  59%       0  
Medvedev     24.6    36%    69%                          
Enqvist      22.1    39%    67%         93  51%       1  
Federer      19.8    41%    62%        266  80%      11  
Rafter       24.4    41%    60%         41  59%       0  
Cilic        20.5    43%    64%        174  65%       2  
Anderson     25.9    43%    60%         80  56%       0  
Isner        25.9    45%    60%        100  57%       1  
Kiefer       21.9    45%    62%         94  45%       0  
Blake        23.4    46%    56%         72  46%       0  
Murray       21.9    48%    76%        125  74%       3  
Bjorkman     25.1    48%    60%         71  31%       0  
Rusedski     24.0    48%    54%         50  30%       0

The results don’t exactly leave Rafael Nadal quaking in his Nikes. 8 of the 14 never won a clay title, and Isner’s 2013 win in Houston barely saves him from making it 9. The combined post-28th-match winning percentage of these guys is just shy of 60%, which isn’t bad, until you consider that without Roger Federer, the rate drops to 55%. The four players that offer some hope for Medvedev–Federer, Soderling, Andy Murray, and Marin Cilic–all played their 28th tour-level match on clay before their 22nd birthday, and even given their relative inexperience, all but Soderling did better in their first 28 than the Russian did.
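The combined figures can be roughly reconstructed from the table. Here is a minimal Python sketch, assuming the rounded match counts and winning percentages in the "Future" columns are close enough to recover approximate win totals. The names `future` and `combined_pct` are illustrative, not part of the original analysis.

```python
# (player, future clay matches, future clay win rate), copied from the table above.
# The win rates are rounded in the table, so the reconstructed totals are approximate.
future = [
    ("T Johansson", 79, 0.38), ("Soderling", 109, 0.70), ("Henman", 90, 0.59),
    ("Enqvist", 93, 0.51), ("Federer", 266, 0.80), ("Rafter", 41, 0.59),
    ("Cilic", 174, 0.65), ("Anderson", 80, 0.56), ("Isner", 100, 0.57),
    ("Kiefer", 94, 0.45), ("Blake", 72, 0.46), ("Murray", 125, 0.74),
    ("Bjorkman", 71, 0.31), ("Rusedski", 50, 0.30),
]

def combined_pct(rows):
    """Match-weighted winning percentage across all rows."""
    matches = sum(m for _, m, _ in rows)
    wins = sum(m * pct for _, m, pct in rows)
    return wins / matches

everyone = combined_pct(future)
without_federer = combined_pct([r for r in future if r[0] != "Federer"])
print(f"all 14: {everyone:.1%}, without Federer: {without_federer:.1%}")
```

Weighting by matches played matters here: Federer accounts for 266 of the 1,444 post-28th-match contests, which is why removing him moves the group rate so much.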

When we take age into consideration, Henman looks like an even better comp, alongside characters like Pat Rafter and Greg Rusedski. They were more obviously one-dimensional than Medvedev is, but their early-career results offer decent parallels. Medvedev can only hope the similarities end there.

One thing I learned in putting together this list was probably already obvious to most of you–there aren’t a lot of players, now or in the past, who can easily be described as “like” Daniil Medvedev. That makes forecasting even trickier than usual. His height and recent serving prowess almost classes him with Isner and Kevin Anderson, while his game style puts him in a category with … Murray?

There’s another lesson in trying to locate parallels for Medvedev. He’d better hope that he continues to defy easy classification. It’s a bit late to become the next Federer, so if he’s going to become more than an occasional threat on dirt, he’ll have a whole lot of historical precedent to overcome.

How Much Does Naomi Osaka Raise Her Game?

You’ve probably heard the stat by now. When Naomi Osaka reaches the quarter-final of a major, she’s 12-0. That’s unprecedented, and it’s especially unexpected from a player who doesn’t exactly pile up hardware outside of the hard court grand slams.

It sure looks like Osaka finds another level as she approaches the business end of a major. Translated to analytics-speak, “she raises her game” can be interpreted as “she plays better than her rating implies.” That is certainly true for Osaka. She has won 16 of her 18 matches in the fourth round or later of a slam, often in matchups that didn’t appear to favor her. In her first title run, at the 2018 US Open, my Elo ratings gave her 36%, 53%, 46%, and 43% chances of winning her fourth-round, quarter-final, semi-final, and final-round matches, respectively.

Had Osaka performed at her expected level for each of her 18 second-week matches, we’d expect her to have won 10.7 of them. Instead, she won 16. The probability that she would have won 16 or more of the 18 matches is approximately 1 in 200. Either the model is selling her short, or she’s playing in a way that breaks the model.
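For readers who want to see how a figure like "1 in 200" can be computed, here is a small Python sketch. It assumes matches are independent and computes the exact chance of reaching a given win total from a list of per-match forecasts. Only the four 2018 US Open forecasts come from the text above. The other 14 are a flat placeholder chosen so the expected total is 10.7, so the printed probability is illustrative and will not match the real figure exactly. The function name `prob_at_least` is hypothetical.

```python
def prob_at_least(win_probs, k):
    """Exact P(wins >= k) for independent matches with the given win probabilities."""
    # dist[w] = probability of exactly w wins among the matches processed so far
    dist = [1.0]
    for p in win_probs:
        nxt = [0.0] * (len(dist) + 1)
        for w, mass in enumerate(dist):
            nxt[w] += mass * (1 - p)   # this match is lost
            nxt[w + 1] += mass * p     # this match is won
        dist = nxt
    return sum(dist[k:])

# Real forecasts quoted above for the 2018 US Open run (R16, QF, SF, F).
known = [0.36, 0.53, 0.46, 0.43]
# Placeholder (assumption): the remaining 14 matches share one flat forecast,
# picked so that the expected win total across all 18 matches is 10.7.
filler = (10.7 - sum(known)) / 14
forecasts = known + [filler] * 14

print(round(sum(forecasts), 1), prob_at_least(forecasts, 16))
```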

Estimating lift

Osaka’s results in the second week of slams are vastly better than the other 93% or so of her tour-level career. It’s possible that it’s entirely down to luck–after all, things with a 0.5% chance of happening have a habit of occurring about 0.5% of the time, not never. When those rare events do take place, onlookers are very resourceful when it comes to explaining them. You might believe Osaka’s claims about caring more on the big stage, but we should keep in mind that whenever the unlikely happens, a plausible justification often follows.

Recognizing the slim possibility that Osaka has taken advantage of some epic good luck but setting it aside, let’s quantify how good she’d have to be for such a performance to not look lucky at all.

That’s a mouthful, so let me explain. Going into her 18 second-week slam matches, Osaka’s surface-blended Elo rating has averaged 2,022. That’s good but not great–it’s a tick below Aryna Sabalenka’s hard-court Elo rating right now. Those modest ratings are how we come up with the estimate that Osaka should’ve won 10.7 of her 18 matches, and that she had a 1-in-200 shot of winning 16 or more.

2,022 doesn’t explain Osaka’s success, so the question is: What number does? We could retroactively boost her Elo rating before each of those matches by some amount so that her chance of winning 16-plus out of 18 would be a more believable 50%. What’s that boost? I used a similar methodology a couple of years ago to quantify Rafael Nadal’s feats at his best clay court events, another string of match wins that Elo can’t quite explain.

The answer is 280 Elo rating points. If we retroactively gave Osaka an extra 280 points before each of these 18 matches, the resulting match forecasts would mean that she’d have had a fifty-fifty chance at winning 16 or more of them. Instead of a pre-match average of 2,022, we’re looking at about 2,300, considerably better than anyone on tour right now. (And, ho hum, among the best of all time.) A difference of 280 Elo points is enormous–it’s the difference between #1 and #22 in the current hard-court Elo rating.
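The search for that boost can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculation behind the 280-point figure. It uses the standard Elo expected-score formula on a 400-point scale, treats matches as independent, and replaces the real opponents with a single hypothetical rating of 1,956, chosen only because it makes an unboosted 2,022 player about a 59% favorite (roughly 10.7 wins in 18). With real match-by-match ratings the answer will differ. The names `elo_win_prob`, `prob_at_least`, and `required_boost` are made up for this example.

```python
def elo_win_prob(rating, opp_rating):
    """Standard Elo expected score (logistic curve on a 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def prob_at_least(win_probs, k):
    """Exact P(wins >= k) for independent matches."""
    dist = [1.0]  # dist[w] = probability of exactly w wins so far
    for p in win_probs:
        nxt = [0.0] * (len(dist) + 1)
        for w, mass in enumerate(dist):
            nxt[w] += mass * (1 - p)
            nxt[w + 1] += mass * p
        dist = nxt
    return sum(dist[k:])

def required_boost(rating_pairs, target_wins, step=10, goal=0.5):
    """Smallest boost (a multiple of `step`) for which P(wins >= target_wins) >= goal."""
    boost = 0
    while True:
        probs = [elo_win_prob(r + boost, opp) for r, opp in rating_pairs]
        if prob_at_least(probs, target_wins) >= goal:
            return boost
        boost += step

# Hypothetical inputs: 18 matches at the 2,022 average against a flat 1,956 opponent.
pairs = [(2022, 1956)] * 18
print(required_boost(pairs, 16))
```

Stepping in 10-point increments mirrors the granularity of the Elo+ figures quoted in this post. A finer step or a bisection search would work just as well.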

Osaka versus the greats

I said before that Osaka’s 12-0 is unprecedented. Her 16-2 in slam second weeks may not have quite the same ring to it, but compared to expectations based on Osaka’s overall tour-level performance, it is every bit as unusual.

Take Serena Williams, another woman who cranks it up a notch when it really matters. Her second-week record, excluding retirements, is 149-39, while the individual forecasts before each match would’ve predicted about 124-64. The chances of a player outperforming expectations to that extent are basically zero. I ran 10,000 simulations, and that’s how many times a player with Serena’s pre-match odds won 149 of the 188 matches. Zero.
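A simulation along those lines might look like the following Python sketch. It is a stand-in, not the original code. The real exercise would use Serena's actual pre-match forecast for each of the 188 matches. Here every match gets the same placeholder probability (124.4 expected wins spread evenly over 188 matches), which slightly overstates the spread of outcomes. The function name `simulate_hits` and the fixed seed are assumptions made for reproducibility.

```python
import random

def simulate_hits(win_probs, target_wins, runs=10_000, seed=0):
    """Count simulated careers in which total wins reach `target_wins`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # One simulated career: each match is won with its forecast probability.
        wins = sum(rng.random() < p for p in win_probs)
        if wins >= target_wins:
            hits += 1
    return hits

# Placeholder (assumption): a flat forecast for all 188 second-week matches.
flat = [124.4 / 188] * 188
print(simulate_hits(flat, 149))
```

Even with the flat placeholder, 149 wins sits nearly four standard deviations above the expected total, so a run of 10,000 careers should turn up at most a handful of hits.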

For Serena to have had a 50% chance of winning 149 of the 188 second-week contests, her pre-match Elo ratings would’ve had to have been 140 points higher. That’s a big difference, especially on top of the already stellar ratings that she has maintained throughout her career, but it’s only half of the jump we needed to account for Osaka’s exploits. Setting aside the possibility of luck, Osaka raises her level twice as much as Serena does.

One more example. Monica Seles won 70 of her 95 second-week matches at slams, a marked outperformance of the 60 matches that Elo would’ve predicted for her. Like Osaka, her chances of having won 70 instead of 60 based purely on luck are about 1 in 100. But you can account for her actual results by giving her a pre-match Elo bonus of “only” 100 points.

The full context

I ran similar calculations for the 52 women who won a slam, made their first second-week appearance in 1958 or later, and played at least 10 second-week matches. They divide fairly neatly into three groups. 18 of them have career second-week performances that can easily be explained without recourse to good luck or level-raising. In some cases we can even say that they were unlucky or that they performed worse than expected. Ashleigh Barty is one of them: Of her 14 second-week matches, she was expected to win 9.9 but has tallied only 8.

Another 16 have been a bit lucky or slightly raised their level. To use the terms I introduced above, their performances can be accounted for by upping their pre-match Elo ratings by between 10 and 60 points. One example is Venus Williams, who has gone 84-43 in slam second weeks, about six wins better than her pre-match forecasts would’ve predicted.

That leaves 18 players whose second-week performances range from “better than expected” to “holy crap.” I’ve listed each of them below, with their actual wins (“W”), forecasted wins (“eW”), probability of winning their actual total given pre-match forecasts (“p(W)”), and the approximate number of Elo points (“Elo+”) which, when added to their pre-match forecasts, would explain their results by shifting p(W) up to at least 50%.

Player               M    W     eW   p(W)  Elo+  
Naomi Osaka         18   16   10.7   0.5%   280  
Billie Jean King   123   94   76.2   0.0%   160  
Sofia Kenin         10    7    4.7  10.6%   150  
Serena Williams    188  149  124.4   0.0%   140  
Evonne Goolagong    92   69   58.7   0.4%   130  
Jennifer Capriati   70   42   33.2   1.2%   110  
Monica Seles        95   70   60.2   1.2%   100  
Hana Mandlikova     75   49   41.7   3.1%   100  
Kim Clijsters       67   47   40.6   4.6%    90  
Justine Henin       74   55   48.9   6.3%    80  
Mary Pierce         55   28   22.4   6.9%    80  
Li Na               36   22   18.0  10.6%    80  
Steffi Graf        157  131  123.6   6.1%    70  
Maria Bueno         93   70   63.4   6.3%    70  
Garbine Muguruza    31   18   14.9  15.8%    70  
Mima Jausovec       32   18   15.0  15.9%    70  
Marion Bartoli      20   11    8.8  20.6%    70  
Sloane Stephens     24   12    9.7  20.8%    70

There are plenty of names here that we’d comfortably put alongside Williams and Seles as luminaries known for their clutch performances. Still, the difference between Osaka’s levels is on another planet.

Obligatory caveats

Again, of course, Osaka’s results could just be lucky. It doesn’t look that way when she plays, and the qualitative explanations add up, but … it’s possible.

Skeptics might also focus on the breakdown of the 52-player sample. In terms of second-week performance relative to forecasts, only one-third of the players were below average. That doesn’t seem quite right. The “average” woman outperformed expectations by about 30 Elo points.

There are two reasons for that. The first is that my sample is, by definition, made up of slam winners. Those players won at least four second-week matches, no matter how they fared in the rest of their careers. In other words, it’s a non-random sample. But that doesn’t have any relevance to Osaka’s case.

The second, more applicable, reason that more than half of the players look like outperformers is that any pre-match player rating is a measure of the past. Elo isn’t as much of a lagging indicator as, say, official tour rankings, but by its nature, it can only consider past results.

Any player who ascends to the top of the game will, at some point, need to exceed expectations. (If you don’t exceed expectations, you end up with a tennis “career” like mine.) To go from mid-pack to slam winner, you’ll have at least one major where you defy the forecasts, as Osaka did in New York in 2018. Osaka was an extreme case, because she hadn’t done much outside of the slams. If, for instance, Sabalenka were to win the US Open this year, she has done so well elsewhere that it wouldn’t be the same kind of shock, but it would still be a bit of a surprise.

In other words, almost every player to win a slam had at least one or two majors where they executed better than their previous results offered any reason to expect. That’s one reason why we find Sofia Kenin only two spots below Osaka on the list.

For Serena or Seles, the “rising star” effect doesn’t make much of a difference–those early tournaments are just a drop in the bucket of a long career. Yeah, it might mean they really only up their game by 110 Elo points instead of 130, but it doesn’t call their entire career’s worth of results into question. For Osaka or Kenin, the early results make up a big part of the sample, so this is something to consider.

It will be tougher for Osaka to outperform expectations as the expectations continue to rise. Much depends on whether she continues to struggle away from the big stages. If she continues to manage only one non-major title per year, she’ll keep her rating down and suppress those pre-match forecasts. (The predictions of major media pundits will be harder to keep under control.) Beating the forecasts isn’t necessarily something to aspire to–even though Serena does it, her usual level is so high that we barely notice. But if Osaka is going to alternate levels between world-class and merely very good, she could hardly do better than to bring out her best stuff when she does.

The Post-Covid Tennis World is Unpredictable. The Match Results Are Not.

Both the ATP and WTA patched together seasons in the second half of 2020, providing playing opportunities to competitors who had endured vastly different lockdowns–some who couldn’t practice for a while, some who came down with Covid-19, and others who got knee surgery.

When the tours came back, we didn’t know quite what to expect. I’m sure some of the players didn’t know, either. Yet when we take the 2020 season (plus a couple weeks of 2021) as a whole, what happened on court was pretty much what happened before. The Australian Open, with its dozens of players in hard quarantine for two weeks, may change that. But for about five months, players faced all kinds of other unfamiliar challenges, and they responded by posting results that wouldn’t have looked out of place in January 2020.

The Brier end

My usual metric for “predictability” is Brier Score, which measures both accuracy (did our pre-match favorite win?) and confidence (if we think four players are all 75% favorites, did three of them win?). Pre-match odds are determined by my Elo ratings, which are far from the final word, but are more than sufficient for these purposes. My tour-wide Brier Scores are usually in the neighborhood of 0.21, several steps better than the 0.25 Brier that results from pure coin-flipping. A lower score indicates more accurate forecasts and/or better calibrated confidence levels.
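For concreteness, here is a minimal Python sketch of the calculation, using the textbook definition of the Brier Score for binary outcomes: the mean squared difference between the forecast probability and the 0/1 result. The function name `brier_score` is illustrative. Scoring each match from the favorite's side, as done here, is just one convenient convention.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 results."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Four 75% favorites, three of whom win (1 = favorite won, 0 = upset).
print(brier_score([0.75] * 4, [1, 1, 1, 0]))   # 0.1875

# Coin-flip forecasts score 0.25 no matter who wins.
print(brier_score([0.5] * 4, [1, 0, 1, 1]))    # 0.25
```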

Here are the tour-wide Brier Scores for the ATP and WTA since the late-summer restart:

ATP: 0.213 (2017 – early 2020: 0.212)
WTA: 0.192 (2017 – early 2020: 0.212)

The ATP’s level of predictability is so steady that it’s almost suspicious, while the WTA has somehow been more predictable since the restart.

But we aren’t quite comparing apples to apples. The post-restart WTA was sparser than the pre-Covid women’s tour, and the post-restart ATP was closer to its pre-pandemic normal.

Let’s look at a few things that do line up. Most of the top players showed up for the main events of the restarted tour, such as the US Open, Roland Garros, Rome, “Cincinnati” (played in New York), and the men’s Masters event in Paris. Here are the 2019 and 2020 Brier Scores for each of those events:

Event          Men '19  Men '20  Women '19  Women '20  
Cincinnati       0.244    0.210      0.244      0.252  
US Open          0.210    0.167      0.178      0.186  
Roland Garros    0.163    0.199      0.191      0.226  
Rome             0.209    0.274      0.205      0.232  
Paris            0.226    0.199          -          -  
---
Total            0.204    0.202      0.198      0.218

(If you want even more numbers, I did similar calculations in August after Palermo, Lexington, and Prague.)

Three takeaways from this exercise:

Brier Scores are noisy. Any single tournament number can be heavily affected by a few major upsets.
Man, those ATP dudes were steady.
The WTA situation is more complicated than I thought.

Whether we look at the entire post-restart tour or solely the big events, the story on the ATP side is clear. Long layoffs, tournament bubbles, missing towelkids, Hawkeye Live … none of it had much effect on the status quo.

The predictability of the women’s tour is another thing entirely. The 12 top-level events between Palermo in August and Abu Dhabi in January were easier to forecast than a random sampling of a dozen tournaments from, say, 2018. But the four biggest events deviated from the script considerably more than they had in 2019 (or 2017 or 2018, for that matter).

From this, I offer a few tentative conclusions:

Big events, with their disproportionate number of star-versus-star matches, are a bit more predictable than other tournaments.
Accordingly, the post-restart WTA wasn’t as predictable as it first appeared. It was just lopsided in favor of tournaments that drew (most of) the top stars. Had the women’s tour featured a wider variety of events–which probably would’ve included a larger group of players, including some fringier ones–its post-restart Brier Score would’ve been higher. Perhaps even higher than the corresponding pre-Covid number.
Most tentative of all: The predictability of ATP and WTA match results might have itself been affected by the availability of tournaments. Top men were able to get into something like their usual groove, despite the weirdness of virus testing and empty stadiums. Most women never got a chance to play more than two or three weeks in a row.

Even six months after Palermo, the data is still limited. And by the time we have enough match results to do proper comparisons, some things will have gotten back to normal (hopefully!), complicating the analysis even further. That said, these findings are much clearer than my initial forays into post-restart Brier Scores in August. As for the Australian Open, quarantine and all, I’m forecasting a predictable tournament. At least for the men.

The Next Five Years, According To a (Dumb) Grand Slam Crystal Ball

Last year, I introduced a bare-bones model that predicts men’s grand slam results for the next five years. It takes a minimum of inputs: a player’s age, and his number of major semi-finals, finals, and titles in the last two years. Despite leaving out so much additional data, the model explains a lot of the variation among players, achieving most of what a more complex algorithm would, but with nothing more than basic arithmetic.

A bit further down, I’ll introduce a similar model for women’s grand slam results. First, let’s look at the revised numbers for the men. Keep in mind that these are not career slam forecasts, but only slams in the next five years. That’s good enough for the Big Three, but it probably doesn’t tell the whole story for, say, Stefanos Tsitsipas.

Player              Projected Slams  
Novak Djokovic                  2.5  
Rafael Nadal                    2.1  
Dominic Thiem                   2.0  
Alexander Zverev                0.9  
Stefanos Tsitsipas              0.6  
Daniil Medvedev                 0.6  
Matteo Berrettini               0.3  
Lucas Pouille                   0.1  
Diego Schwartzman               0.1

A few other players (notably Roger Federer) reached a semi-final in the last two years, but because of their age, the model forecasts zero slams. Also keep in mind that Wimbledon was not played this year, so there was a bit less data to work with.* The sum of the forecasts is a mere 9.2 slams, out of a possible 20. In some previous years, the model predicted as many as 15 titles for the players it took into consideration. Because today’s top players are so old, they aren’t expected to dominate much of the 2021-25 calendar, leaving room for new contenders to emerge.

* My original post describes the forecasting algorithm as counting results from “the last four slams” and “the previous four slams.” We could account for the three-slam 2020 season by following those steps literally, giving greater weights to the last four slams (the 2019 US Open plus the three 2020 slams), and giving lesser (but still non-zero) weights to the four slams before that. I rejected that approach because (a) it would give an awful lot of weight to the US Open, and (b) the relative lack of 2020 data reflects higher-than-usual uncertainty, which ought to show up in the forecasts, as well. Thus, only seven slams were taken into account for 2021-25 predictions, instead of the usual eight.

Interestingly, the 2020 season has barely budged the predicted career totals for the Big Three. Numbers I published immediately after last year’s US Open forecast Rafael Nadal for 22.5 career slams: his (then) 19 plus 3.5 more. Now he has 20, and the model pegs him for another 2.1. Novak Djokovic was slated for a career total of 19.5: 3.5 more on top of his then-total of 16. He’s still penciled in for 19.5: 17 plus another 2.5 in the future. Federer didn’t have reason to expect much a year ago, and it’s no better now.

The women’s model

It turns out that a similar back-of-the-envelope approach gives good approximations of future slam totals for WTA stars, as well. The weights are a bit different, the average peak age is one year sooner, and the age adjustment is slightly smaller, but the idea is essentially the same.

Here’s how to calculate the number of expected major titles for your favorite player:

Start with zero points
Add 20 points for each slam semi-final reached in the last 12 months
Add 20 points for each slam final reached in the last 12 months
Add 80 points for each slam title won in the last 12 months
Add 10 points for each slam semi-final reached in the previous 12 months
Add 10 points for each slam final reached in the previous 12 months
Add 40 points for each slam title won in the previous 12 months
If the player is older than 26 (at the time of the next slam), subtract 7 points for each year she is older than 26
If the player is younger than 26, add 7 points for each year she is younger than 26
Divide the sum by 100

To take a simple example, consider Iga Swiatek. For her recent French Open title, she gets 20 points for the semi, 20 points for the final, and 80 points for the title. She will still be 19 when the Australian Open rolls around, so we add another 49 points: 7 years younger than 26, times 7 points per year. Her projected total is (20 + 20 + 80 + 49) / 100 = 1.69.
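The steps above translate directly into a few lines of Python. This is a sketch with two labeled assumptions: age is taken in whole years, as in the Swiatek example, and negative totals are floored at zero, which matches the model "forecasting zero slams" for older players but is not spelled out in the steps. The function and parameter names are illustrative.

```python
def projected_slams(recent, previous, age, peak_age=26, per_year=7):
    """Women's back-of-the-envelope slam forecast for the next five years.

    recent / previous: (semi-finals, finals, titles) reached in the last 12 months
    and in the 12 months before that. A title run counts in all three slots.
    age: whole years at the time of the next slam (assumption).
    """
    sf, f, t = recent
    points = 20 * sf + 20 * f + 80 * t
    sf, f, t = previous
    points += 10 * sf + 10 * f + 40 * t
    # Bonus for each year younger than the peak age, penalty for each year older.
    points += per_year * (peak_age - age)
    # Assumption: a negative total is reported as zero projected slams.
    return max(points, 0) / 100

# Iga Swiatek: one semi-final, final, and title in the last 12 months, age 19.
print(projected_slams(recent=(1, 1, 1), previous=(0, 0, 0), age=19))  # 1.69
```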

Here are the results for all of the women who reached a major semi-final in 2019 or 2020 and are projected to win more than zero slams between 2021 and 2025:

Player               Projected Slams  
Naomi Osaka                      2.0  
Sofia Kenin                      1.9  
Iga Swiatek                      1.7  
Bianca Andreescu                 1.0  
Ashleigh Barty                   0.9  
Amanda Anisimova                 0.6  
Simona Halep                     0.6  
Marketa Vondrousova              0.6  
Nadia Podoroska                  0.4  
Garbine Muguruza                 0.3  
Belinda Bencic                   0.3  
Jennifer Brady                   0.3  
Elina Svitolina                  0.2  
Petra Kvitova                    0.1  
Victoria Azarenka                0.1

These forecasts sum to 11.0 slams, more than the men’s total. That’s largely because so many recent women’s champions are younger, giving the model more reason to be optimistic about them. It still leaves plenty of room for other players to earn some hardware in the next half-decade, which makes sense. The WTA has featured a non-stop succession of breakout young stars for the past few years, and with players like Aryna Sabalenka, Elena Rybakina, and Cori Gauff in the mix, there’s no shortage of talent to keep the carousel turning.

And then there’s Serena Williams. The model projects her for zero slams, despite her three semi-finals and two finals in the last two years. The reason is her age: The algorithm expects players to steadily decline from age 27 onwards, so the age penalty by age 39 is harsh. On one hand, that makes sense: we’re forecasting the results of events that will mostly take place when she’s in her 40s. On the other hand, a player who had so much success at age 37 is probably a good bet to break the mold at 39, as well. Were this a more fully-developed model, we’d probably be smart to tinker with the age adjustment to reflect the reality that Williams is a much better bet to win a major title than Nadia Podoroska.

We could go on all day. For every variable that these forecasts take into account, there are a dozen more that have some plausible claim to relevance. But this simple approach gets us surprisingly far in telling the future–a future in which the men’s all-time grand slam race keeps getting more complicated, and the women’s game continues to feature a wide array of promising young stars.

The Post-Covid WTA is Drifting Back to Normal

In the two latest WTA events, we saw a mix of the expected and the unusual. Simona Halep, the heavy favorite in Prague, wound up with the title despite a couple of demanding three-setters in her first two rounds. The week’s other tournament, in Lexington, failed to follow the script. Serena Williams and Aryna Sabalenka, the big hitters at the top and bottom of the bracket, combined for three wins, with four unseeded players making up the semi-final field.

Last week I pointed out that Palermo–the tour’s initial comeback event–was so unpredictable that you would’ve been better off to treat each match as a coin flip than to use pre-layoff player strength ratings (such as Elo) to forecast outcomes. Such an upset-ridden event isn’t unheard of, even in pandemic-free times, but it is suggestive that the WTA rank-and-file haven’t quite returned to their usual form.

Prague and Lexington give us three times as much data to work with. Plus, we might theorize that Prague would be a little more predictable because so many players in that field also took part in the Palermo event, meaning that they have a little more recent match experience. While our sample of 93 main draw matches is still flimsy, it brings us a little closer to understanding how well traditional forecasts will handle this unusual time.

A thorny Brier patch

The metric I’m using to quantify predictability–or to put it another way, the validity of pre-layoff player ratings–is Brier Score, which takes into account both raw accuracy (did the forecast pick the right player to win?) and confidence level (was the forecast too strong, too weak, or just right?). Tour-level Brier Scores are usually in the range of 0.21, while a score of 0.25 means the predictions were no better than coin flips. A lower score represents more accurate predictions.

Here are the Brier Scores for Palermo, Lexington, and Prague, along with the average of the three, and the average of all WTA International events (on all surfaces) since 2017. (The scores are based on forecasts generated from my Elo ratings.) We might expect the first round to be different, since players are particularly rusty at that stage, so I’ve also broken out first round (“R32 Brier”) matches for each of the tournaments and averages in the table.

Tournament    Brier  R32 Brier  
Palermo       0.268      0.295  
Lexington     0.226      0.170  
Prague        0.212      0.247  
Comeback Avg  0.235      0.237  
Intl Avg      0.217      0.213

As we saw last week, the Palermo results truly defied expectations. More than half of the matches were upsets (according to my Elo ratings), with a particularly unpredictable first round.

That didn’t last. The Prague first round rated 0.247–just barely better than coin flips–but the messiness didn’t last beyond the first couple of days. The event’s overall Brier Score was 0.212, slightly better than the average WTA International. In other words, this group of 32 women, only recently returned from a months-long break, delivered results that were roughly as predictable as we would expect in the middle of a normal season.

The Lexington numbers are a bit more difficult to make sense of, but like Prague’s, they point to a post-coronavirus world that isn’t all that weird. The opening round closely followed the script, with a Brier Score of 0.170. Of the last 115 WTA International events, only 22 were more predictable. The forecast accuracy didn’t last, in large part because of Serena’s loss at the hands of Shelby Rogers. The rating for the entire tournament was 0.226, less predictable than usual, but much better than random guessing and closer to tour average than to the assumption-questioning Palermo numbers.

Revised estimates

We’re still early in the process of evaluating what to expect from players after the COVID-19 layoff. As more tournaments take place, we can identify whether players become more predictable with more matches under their belts. (Perhaps the Prague participants who skipped Palermo were more difficult to forecast, although Halep is an obvious counterexample.)

At this point, anything is possible. It could be that we will steadily drift back to business as usual. On the other hand, the new social-distancing-oriented rules–with few or no fans on site, nightlife limited to Netflix, players fetching their own towels, and new variations of on-court coaching–might work to the advantage of some women and the disadvantage of others. If that’s the case, Elo ratings will go through a novel period of adjustment as they shift to reflect which players thrive on the post-corona tour.

It’s too early to do much more than speculate about something as significant as that. But in the last week, we’ve seen forecasts go from wildly wrong (in Palermo) to not half bad (in Lexington and Prague). We’ve gained some confidence that for all the things that have obviously changed since March, our approach to player ratings may be one thing that largely remains the same.

Did Palermo Show the Signs of a Five-Month Pandemic Layoff?

Are tennis players tougher to predict when they haven’t played an official match for almost half a year? Last week’s WTA return-to-(sort-of)-normal in Palermo gave us a glimpse into that question. In a post last week I speculated that results would be tougher than usual to forecast for a while, necessitating some tweaks to my Elo algorithm. The 31 main draw matches from Sicily allow us to run some preliminary tests.

At first glance, the results look a bit surprising. Only two of the eight seeds reached the semifinals, and the ultimate champion was the unseeded Fiona Ferro. Two wild cards reached the quarters. Is that notably weird for a WTA International-level event? It doesn’t seem that strange, so let’s establish a baseline.

Palermo the unpredictable

My go-to metric for “predictability” is Brier Score, which measures the accuracy of percentage forecasts. It’s nice to pick the winner, but it’s more important to assign the right level of probability. If you say that 100 matches are all 60/40 propositions, your favorites should win 60 of the 100 matches. If they win 90, you weren’t nearly confident enough; if they win 50, you would’ve been better off flipping a coin. Brier Score encapsulates those notions into a single number, the lower the better. Roughly speaking, my Elo forecasts for ATP and WTA matches hover a bit above 0.2.
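To make the 60/40 example concrete, here is a minimal Python sketch of the Brier Score for win/loss outcomes. The function names and the toy data are illustrative assumptions, not the code behind the figures in this post.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and results.

    forecasts: probability given to the favorite in each match (assumed format)
    outcomes:  1 if the favorite won, 0 if the favorite lost
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def score_sixty_forty(favorite_wins, matches=100):
    """Brier Score when every match is forecast at 60% for the favorite."""
    forecasts = [0.6] * matches
    outcomes = [1] * favorite_wins + [0] * (matches - favorite_wins)
    return brier_score(forecasts, outcomes)


if __name__ == "__main__":
    for wins in (60, 50):
        print(wins, round(score_sixty_forty(wins), 3))
```

When the favorites win 60 of the 100 matches, the score is 0.24. When they win only 50, it rises to 0.26, which is worse than the 0.25 that a blanket 50/50 forecast would have earned.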

From 2017 through March 2020, the 975 completed matches at clay-court WTA International events had a collective Brier Score of 0.223. First round matches were a tiny bit more predictable, with R32’s scoring 0.219.

Palermo was a roller-coaster by comparison. The 31 main-draw matches combined for a Brier Score of 0.268. Of the 32 other events I considered, only last year’s Prague tourney was higher, generating a 0.277 mark.

The first round was more unpredictable still, at 0.295. On the other hand, the combination of a smaller per-event sample and the wide variety of first-round fields means that several tournaments were wilder for the first few days. Nine of the 32 others had a first-round Brier Score above 0.250, with four of them scoring higher–that is, worse–than Palermo did.

The Brier Score of shame

I mentioned the 0.250 mark because it is a sort of Brier Score of shame. Let’s say you’re predicting the outcome of a series of coin flips. The smart pick is 50/50 every time. It’s boring, but forecasting something more extreme just means you’re even more wrong half the time. If you set your forecast at 50% for a series of random events with a 50/50 chance of occurring, your Brier Score will be … 0.250.

Another way to put it is this: If your Brier Score is higher than 0.250, you would’ve been better off predicting that every match was 50/50. All the fancy forecasting went to waste.
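The 0.250 threshold can be checked with a few lines of Python. This sketch computes the long-run Brier Score of a fixed forecast for an event whose true probability is known. The function name `expected_brier` is made up for illustration.

```python
def expected_brier(forecast, true_prob):
    """Long-run Brier Score of a fixed forecast for an event with a known probability.

    With probability true_prob the event happens (squared error (1 - forecast)^2);
    otherwise it does not (squared error forecast^2).
    """
    return true_prob * (1 - forecast) ** 2 + (1 - true_prob) * forecast ** 2


if __name__ == "__main__":
    # Forecasting true coin flips with increasing (unjustified) confidence.
    for forecast in (0.5, 0.6, 0.7, 0.9):
        print(forecast, round(expected_brier(forecast, true_prob=0.5), 3))
```

For a true coin flip, the 50% forecast scores exactly 0.25, and every more extreme forecast scores worse: 0.26 at 60%, 0.29 at 70%, 0.41 at 90%.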

In Palermo, 17 of the 31 matches went the way of the underdog, at least according to my Elo formula. The Brier Scores were on the shameful side of the line. My earlier post–which advocated moderating all forecasts, at least a bit–didn’t go far enough. At least so far, the best course would’ve been to scrap the algorithm entirely and start flipping that coin.

Moderating the moderation

All that said, I’m not quite ready to throw away my Elo ratings. (At the moment, they pick Simona Halep and Aryna Sabalenka, my two favorite players, to win in Prague and Lexington. So there’s that.) Thirty-one matches is a small sample, far from adequate to judge the accuracy of a system designed to predict the outcome of thousands of matches each year. As I mentioned above, Elo failed even worse at Prague last year, but because that tournament didn’t follow several months of global shutdowns, it wouldn’t have even occurred to me to treat it as more than a blip.

This time, a week full of forecast-busting surprises could well be more than a blip. Treating players as if they have exactly the abilities they had in March is probably the wrong way to do things, and it could be a very wrong way of doing things. We’ll triple the size of our sample in the next week, and expand it even more over the next month. It won’t help us pick winners right now, but soon we’ll have a better idea of just how unpredictable the post-COVID-19 tennis world really is.

How Much Will the ATP Cup Raise for Australian Bushfire Relief?

Yesterday, the ATP announced that it would make a sizeable donation to the Australian Red Cross:

Each ace served across the @ATPCup at all three venues will deliver $100 to the @RedCrossAU bushfire disaster relief and recovery efforts.

With more than 1500 aces expected to be served, the tournament contribution is expected to exceed $150,000.
— ATPCup (@ATPCup) January 2, 2020

Several players, including Nick Kyrgios, have made additional pledges of their own that extend across the several tournaments of the Australian summer. (Kyrgios’s pledge started the ball rolling, a rare instance of the tour following the lead of its most controversial star.)

How much?

The ATP offered an estimate of 1,500 aces. This is the first edition of the ATP Cup, not to mention the first men’s tour event in Perth, so we can’t simply check how many aces there were last year. Complicating things even further, we don’t know who will play for each nation in each day of the tournament, or which countries will advance to the knockout stages.

In other words, any ace prediction is going to be approximate.

Start with the basics. The ATP Cup will encompass 129 matches. That’s 43 ties, with two singles rubbers and one doubles rubber each. As in the new Davis Cup finals, many doubles rubbers are likely to be “dead,” so all 43 will probably not be played. In Madrid, 21 of the 25 doubles matches were played*, so let’s say that doubles will be skipped at the same rate in Australia, giving us 36 doubles matches.

* one of the four matches I’ve excluded was a 1-0 retirement, which for the purpose of ace counting–not to mention common sense–is effectively unplayed.

The average ace counts in best-of-three matches across the entire tour last year were 12 per singles match and 7 per doubles match. That gives us 1,284 for the 122 total contests we expect to see over the course of the event.

But we can do better. There are more aces on hard courts by a healthy margin. Over the 2019 season, the average best-of-three hard-court singles match returned 15 aces, while doubles matches featured half as many. That works out to a projected total of 1,542, 20% higher than where we started, and quite close to the ATP’s estimate.
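The arithmetic behind both projections fits in a short Python sketch. The names are illustrative, and one input is an assumption: the 1,542 total is consistent with about 7 aces per hard-court doubles match (roughly half of 15), so that is the figure used here.

```python
TIES = 43
SINGLES_PER_TIE = 2
DOUBLES_PLAYED_RATE = 21 / 25  # share of doubles rubbers actually played in Madrid


def projected_aces(singles_avg, doubles_avg):
    """Total aces given per-match averages for singles and doubles."""
    singles_matches = TIES * SINGLES_PER_TIE             # 86
    doubles_matches = round(TIES * DOUBLES_PLAYED_RATE)  # 36.12, rounded to 36
    return singles_matches * singles_avg + doubles_matches * doubles_avg


if __name__ == "__main__":
    tour_wide = projected_aces(12, 7)   # all-surface averages
    hard_court = projected_aces(15, 7)  # assumption: 7 per doubles match
    print(tour_wide, hard_court, hard_court * 100)  # last value is dollars at $100 per ace
```

This reproduces the 1,284 and 1,542 totals, and at $100 per ace the hard-court projection comes to $154,200.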

While we don’t have much data on the surface in Perth, we have years’ worth of results from Brisbane and Sydney. Brisbane was one of the ace-friendliest surfaces on tour, while Sydney was at the other end of the spectrum. The figures have also varied from year to year, even controlling for the changing mix of players. Whether we look at one year or a longer time span, the average ace rates in Brisbane and Sydney combine to something in the neighborhood of the tour-wide rate.

Complicating factors

The record-setting temperatures in Australia are likely to nudge ace rates upwards. But the mix of players makes things considerably more difficult to forecast.

One challenge is the extreme range between the best players in the event (Rafael Nadal and Novak Djokovic) and the weakest, like Moldova’s 818th-ranked Alexander Cozbinov. Not only are underdogs like Cozbinov likely to see their typical ace rates plummet against higher-quality competition, they will probably struggle to keep matches competitive. The shorter the match, the fewer aces. Ironically, Cozbinov fought Steve Darcis for over three hours on the first day of play, but even at that length, only 2 of his 116 service points went for aces. He and Darcis combined for a below-average total of 10.

Another difficulty is one that would arise in predicting the total aces at any tournament. Overall ace counts depend heavily on who advances to the later rounds. The Spanish team of Nadal, Roberto Bautista Agut, and Pablo Carreno Busta is likely to do well despite relatively few first-serve fireworks. But if Canada reprises its Davis Cup Finals success, the top-line combination of Denis Shapovalov and Felix Auger-Aliassime could give us six rounds of stratospheric serving stats. The American duo of John Isner and Taylor Fritz could do the same, though their odds of advancing took a dire turn after a day-one loss to Norway. At least Isner has already done his part, tallying 33 aces in a three-set loss to Casper Ruud.

As I write this, day one is not quite in the books. The first ten completed singles matches worked out to 16 aces each, slightly above the hard-court tour average. Thanks to Isner and Kyrgios, the outliers propped up that number, with 37 and 35 aces in the Isner-Ruud and Kyrgios-Struff matches, respectively. The three completed doubles matches have averaged just over 6 aces each, a bit below tour average.

This is all a long way of saying, surprise! The ATP’s estimate isn’t bad at all. A full simulation of each matchup and the event as a whole would give us more precision, but barring that, 1,500 aces and $150,000 looks like a pretty good bet. Philanthropists should line up behind the big hitting teams from Australia, Canada, and the USA, or at least cheer for an above-average number of free points off the serve of Rafael Nadal.

Rethinking Match Results as Probabilities

You don’t have to watch tennis for long before hearing a commentator explain that matches can be decided by the slimmest of margins. It’s common for a match winner to tally only 51% or 52% of the total points played. Dozens of times each year, players go even further, triumphing despite winning fewer than half of points. Novak Djokovic did just that in the 2019 Wimbledon final, claiming only 204 points to Roger Federer’s 218.

It’s right to look at results like Djokovic-Federer and conclude that many matches are decided by slim margins or that performance on certain points is crucial. Indeed, players occasionally win matches while winning as few as 47% of points.

Still, it’s possible to take the “slim margins” claim too far. 51% sounds like a narrow margin, as does 53%. In many endeavors, sporting and otherwise, 55% represents a near-tie, and even 60% or 65% suggests that there isn’t much to separate the two sides. Not so in tennis, especially in the serve-centered men’s game. However it sounds, 60% represents a one-sided contest, and 65% is a blowout verging on embarrassment. In 2019, only three ATP tour matches saw one player win more than 70% of total points.

Answer a different question

For several reasons, total points won is an imperfect measure of one player’s superiority, even in a single match. One flaw is that it is usually stuck in that range between 35% and 65%, incorrectly implying that all tennis matches are relatively close contests. Another drawback is that not all 55% rates (or 51%s, or 62%s) are created equal. The longer the match, the more information we gain about the players. For a specific format, like best-of-three, a longer match usually requires closely-matched players to go to tiebreaks or a third set. But if we want to compare matches across different formats (like best-of-three and best-of-five), the length of the match doesn’t necessarily tell us anything. Best-of-five matches are longer because of the rules, not because of any characteristics of the players.

The solution is to think in terms of probabilities. Given the length of a match, and the percentage of points won by each player, what is the probability that the winner was the better player?

To answer that question, we use the binomial distribution, and consider the likelihood that one player would win as many points as he did if the players were equally matched. If we flipped a fair coin 100 times, we would expect the number of heads to be around 50, but not that it will always be exactly 50. The binomial distribution tells us how often to expect any particular number of heads: 49, 50, or 51 are common, 53 is a bit less common, 55 even less so, 40 or 60 quite uncommon, and so on. For any number of heads, there’s some probability that it is entirely due to chance, and some probability that it occurs because the coin is biased.

Here’s how that relates to a tennis match. We start the match pretending that we know nothing about the players, assuming that they are equal. The number of points is analogous to the number of coin flips–the more points, the more likely the player who wins the most is really better. The number of points won by the victor corresponds to the number of heads. If the winner claims 60% of points, we can be pretty sure that he really is better, just as a tally of 60% heads in 100 or more flips would indicate that the coin is probably biased.

More than just 59%

The binomial distribution helps us convert those intuitions into probabilities. Let’s look at an example. The 2019 Roland Garros final was a fairly one-sided affair. Rafael Nadal took the title, winning 58.6% of total points played (116 of 198) over Dominic Thiem, despite dropping the second set. If Nadal and Thiem were equally matched, the probability that Nadal would win so many points is barely 1%. Thus, we can say that there is a 99% probability that Nadal was–on the day, in those conditions, and so on–the better player.

No surprises there, and there shouldn’t be. Things get more interesting when we alter the length of the match. The two other 2019 ATP finals in which one player won about 58.6% of points were both claimed by Djokovic. In Paris, he won 58.7% of points (61 of 104) against Denis Shapovalov, and in Tokyo, he accounted for 58.3% (56 of 96) in his defeat of John Millman. Because they were best-of-three instead of best-of-five, those victories took about half as long as Nadal’s, so our confidence that Djokovic was the better player–while still high!–shouldn’t be quite as close to 100%. The binomial distribution says that those likelihoods are 95% and 94%, respectively.
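Here is a minimal Python sketch of one way to run that calculation. It assumes the figure is one minus the chance that an evenly matched player would win at least as many points as the winner did. The function name is made up for illustration, and the exact method behind the numbers above may differ in its details.

```python
from math import comb


def prob_winner_was_better(points_won, total_points):
    """1 - P(X >= points_won) for X ~ Binomial(total_points, 0.5).

    Equivalent to P(X <= points_won - 1): the share of fair-coin outcomes
    in which the player would have won fewer points than he actually did.
    """
    below = sum(comb(total_points, k) for k in range(points_won))
    return below / 2 ** total_points


if __name__ == "__main__":
    finals = {
        "Nadal d. Thiem (Roland Garros)": (116, 198),
        "Djokovic d. Shapovalov (Paris)": (61, 104),
        "Djokovic d. Millman (Tokyo)": (56, 96),
    }
    for label, (won, total) in finals.items():
        print(label, round(prob_winner_was_better(won, total), 3))
```

Under that assumption, the three results should land close to the 99%, 95%, and 94% quoted above. The shorter the match, the further the probability sits from 100%, even at nearly the same share of points won.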

The winner of the average tour-level ATP match in 2019 won 55% of total points–the sort of number that sounds close, even as attentive fans know it really isn’t. When we convert every match result into a probability, the average likelihood that the winner was the better player is 80%. The latter number not only makes more intuitive sense–fewer results are clustered in the mid 50s, with numbers spread out from 15% to 100%–but it considers the length of the match, something that old-fashioned total-points-won ignores.

Why does this matter?

You might reasonably think that anyone who cared about quantifying match results already has these intuitions. You already know that 55% is a tidy win, 60% is an easy one, and that the length of the match means those numbers should be treated differently depending on context. Ranking points and prize money are awarded without consideration of this sort of trivia, so what’s the point of looking for an alternative?

I find this potentially valuable as a way to represent margin of victory. It seems logical that any player rating system–such as my Elo ratings–should incorporate margin of victory, because it’s tougher to execute a blowout than it is a narrow win. Put another way, someone who wins 59% of points against Thiem is probably better than someone who wins 51% of points against Thiem, and it would make sense for ratings to reflect that.

Some ratings already incorporate margin of victory, including the one introduced recently by Martin Ingram, which I discussed with him on a recent podcast. But many systems–again, including my Elo ratings–do not. Over the years, I’ve tested all sorts of potential ways to incorporate margin of victory, and have not found any way to consistently improve the predictiveness of the ratings. Maybe this is the one that will work.

Leverage and lottery matches

I’ve already hinted at one limitation to this approach, one that affects most other margin-of-victory metrics. Djokovic won only 48.3% of points in the 2019 Wimbledon final, a match he managed to win by coming up big in more important moments than Federer did. Recasting margin of victory in terms of probabilities gives us more 80% results than 55% results, but it also gives us more 25% results than 48% results. According to this approach, there is only a 24% chance that Djokovic was the better player that day. While that’s a defensible position–remember the 218 to 204 point gap–it’s also a bit uncomfortable.

Using the binomial distribution as I’ve described above, we completely ignore leverage, the notion that some points are more valuable than others. While most players aren’t consistently good or bad in high-leverage situations, many matches are decided entirely by performance in those key moments.

One solution would be to incorporate my concept of Leverage Ratio, which compares the importance of the points won by each player. I’ve further combined Leverage Ratio with Dominance Ratio, a metric closely related to total points won, into a single number I call DR+, or adjusted Dominance Ratio. It’s possible to win a match with a DR below 1.0, which means winning fewer return points than your opponent did, something that often happens when total points won is below 50%. But when DR is adjusted for leverage, it’s extremely uncommon for a match winner to end up with a DR+ below 1.0. Djokovic’s DR in the Wimbledon final was 0.87, and his DR+ was 0.97, one of the very few instances in which a winner’s adjusted figure stayed below 1.0.

It would be impossible to fix the binomial distribution approach in the same way I’ve “fixed” DR. We can’t simply multiply 65%, or 80%, or whatever, by Leverage Ratio, and expect to get a sensible result. We might not even be interested in such an approach. Calculating Leverage Ratio requires access to a point-by-point log of the match–not to mention a hefty chunk of win-probability code–which makes it extremely time consuming to compute, even when the necessary data is available.

For now, leverage isn’t something we can fix. It is only something that we can be aware of, as we spot confusing margin-of-victory figures like Djokovic’s 24% from the Wimbledon final.

Rethinking, fast and slow

As with many of the metrics I devise, I don’t really expect wide adoption. If the best application of this approach is to create a component that improves Elo ratings, then that’s a useful step forward, even if it goes no further.

The broader goal is to create metrics that incorporate more of our intuitions. Just because we’ve grown accustomed to the quirks of the tennis scoring system, a universe in which 52% is close and 54% is not, doesn’t mean we can’t do better. Thinking in terms of probabilities takes more effort, but it almost always nets more insight.

Podcast Episode 80: Martin Ingram on Predicting Match Outcomes, Bayesian Style

Episode 80 of the Tennis Abstract Podcast features Martin Ingram (@xenophar), author of a recent academic paper, A point-based Bayesian hierarchical model to predict the outcome of tennis matches.

If you’re interested in learning more about what goes into a forecasting system, this one’s for you. We start with a discussion of the advantages as well as the limitations of the common “iid” assumption, that points are independent and identically distributed. Martin’s model, which relies on the iid assumption, incorporates each player’s serve and return skill, in addition to surface preferences and tournament-specific characteristics. In our conversation, he explains how it works, and why this sort of model is able to provide reasonable forecasts even with limited data.

That’s just the beginning. Martin suggests several possible additions to his model, and we close by considering the importance of domain knowledge in this sort of statistical work.

Thanks for listening!

(Note: this week’s episode is about 65 minutes long; in some browsers the audio player may display a different length. Sorry about that!)

Click to listen, subscribe on iTunes, or use our feed to get updates on your favorite podcast software.

Andreescu, Medvedev, and the Future According to Elo

With the US Open title added to her 2019 trophy haul, Bianca Andreescu is finally a member of the WTA top 10, debuting at fifth on the ranking table. Daniil Medvedev, the breakout star of the summer on the men’s side, only cracked the ATP top 10 after Wimbledon. He’s now up to fourth. The official ranking algorithms employed by the tours take some time to adjust to the presence of new stars.

Elo, on the other hand, reacts quickly. While the ATP and WTA computers assign points based on a year’s worth of results (rounds reached, not opponent quality), Elo gives the most weight to recent accomplishments, with even greater emphasis placed on surprising outcomes, like upsets of top players. If your goal in using a ranking system is to predict the future, Elo is better: Elo-based forecasts significantly outperform predictions based on ATP and WTA ranking points.
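A generic Elo update shows why upsets move the ratings so much. This is a textbook sketch, not the exact tennis formula behind the ratings discussed here: the 400-point scale and the K-factor of 32 are conventional chess-style values assumed purely for illustration.

```python
def expected_score(rating, opponent_rating):
    """Win probability implied by the rating gap (standard logistic Elo curve)."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def updated_rating(rating, opponent_rating, won, k=32):
    """New rating after one match; k is an assumed, illustrative K-factor."""
    actual = 1.0 if won else 0.0
    return rating + k * (actual - expected_score(rating, opponent_rating))


if __name__ == "__main__":
    # A hypothetical 1800-rated player upsets a 2000-rated opponent...
    upset_gain = updated_rating(1800, 2000, won=True) - 1800
    # ...versus the 2000-rated player winning the same matchup as expected.
    routine_gain = updated_rating(2000, 1800, won=True) - 2000
    print(round(upset_gain, 1), round(routine_gain, 1))
```

The underdog gains roughly three times as many points for the upset as the favorite would for the expected win, which is how a few big results can vault a newcomer up the Elo list well before a year of ranking points accumulates.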

Andreescu’s first Premier-level title came at Indian Wells in March, when she beat two top-ten players, Elina Svitolina and Angelique Kerber, in the semi-final and final. The WTA computer reacted by moving her up from 60th to 24th on the official list. Elo already saw Andreescu as a more formidable force after her run to the final in Auckland, so after Indian Wells, the algorithm moved her up to seventh. Three more wins in Miami, and the Canadian teen cracked the Elo top five.

Tennis fans are accustomed to the slow adjustments of the ranking system, so seeing a “(22)” or a “(15)” next to Andreescu’s name at Roland Garros and the US Open wasn’t particularly jarring. And there’s something to be said for withholding judgment, since tennis has had its share of teenage flashes in the pan. But Elo is usually right. The betting market heavily favored Serena Williams in the US Open final, but Elo saw the Canadian as the superior player, giving her a slight edge. After the latest seven match wins in New York, the algorithm rates Andreescu as the best player on tour, very narrowly edging out Ashleigh Barty. Would you dare disagree?

The launching (Ar)pad

When Medvedev first reached the top ten on the Elo list last October, I ran some numbers to compare the two ranking systems. Most players who earn a spot in the Elo top ten eventually make their way into the ATP top ten as well, but Elo is almost always first. On average, the algorithm picks top-tenners more than a half-year sooner than the tour’s computer. The 23-year-old Russian is a good example: He reached eighth place on the Elo list last October, but didn’t match that mark in the ATP rankings for another 10 months, after reaching the Montreal final.

Andreescu closed the gap faster than Medvedev did, needing a more typical six months to progress from Elo top-tenner to a single-digit WTA ranking. It may not take much longer before her Elo and WTA rankings converge at the top of both lists.

We no longer need Elo to tell us that Andreescu and Medvedev are likely to keep winning matches at the highest level. But having acknowledged the accuracy with which Elo glimpses the future, it’s worth looking at which players are likely to follow in their footsteps.

After the US Open, Elo’s boldest claim regards Matteo Berrettini, ranked sixth. The ATP computer places him at 13th, and he only made one brief stop this summer inside the top 20. The Flushing semi-finalist has been inside the Elo top 10 since mid-June, and the algorithm currently puts him ahead of such better-established young players as Alexander Zverev and Stefanos Tsitsipas.

The women’s Elo list doesn’t feature any similar surprises in the top 10, but that hardly means it agrees with the WTA computer. Karolina Muchova, currently at a career-high WTA ranking of 43rd, is 23rd on the Elo table. Two veteran threats, Victoria Azarenka and Venus Williams, are also marooned outside the official top 40, but Elo sees them as 18th and 28th best on tour, respectively. In terms of predictiveness, quality is more important than quantity, so a limited schedule isn’t necessarily seen as a drawback. Elo is also optimistic about Sofia Kenin, rating her 13th, compared to her official WTA standing of 20th.

Half a year from now, I’d bet Berrettini’s official ranking is closer to 6 than to 13, and that Muchova’s position is closer to 23 than 43. It’s impossible to tell the future, but if we’re interested in looking ahead, Elo gives us a six-month head start on the official rankings. We’ll have to wait and see whether the rest of the women’s tour can keep Andreescu away from the top spot for that long.