If you’ve found your way here from the Wall Street Journal, welcome! If you don’t know what I’m talking about, go read what Carl Bialik has to say in today’s paper, and in an online follow-up.
I’ve developed a fairly sophisticated algorithm to predict the outcome of tennis matches. It seeks to remedy some of the flaws in the present ranking system and do a better job of forecasting which players will perform better at certain times, on certain surfaces, against certain opponents.
In the past, I’ve written about the predictiveness of ATP ranking points–which are pretty darn good, for all their flaws. By just about any standard, however, my system is better. It’s not perfect–it’s far, far from it–but it does give you a valid second opinion on a player’s abilities at any given time.
The components
My algorithm does several things that traditional ranking points do not. Here are a few of the components:
- Points are awarded based on the quality of opponents, not on the round or tournament. Thus, beating Mikhail Youzhny in the quarterfinals in Moscow is worth the same as beating him in the semifinals of Indian Wells. Losing to a low-ranked player counts against you more than losing to Roger Federer.
- These points, and everything else, are adjusted for surface. Beating Federer counts for more on hard courts than on clay; beating Juan Carlos Ferrero is the opposite.
- The algorithm generates a set of overall rankings, and it also generates two sets of surface-specific rankings, one for clay courts, one for everything else. (There isn’t enough data on indoor hard courts or grass courts to treat them separately from any other type of fast court.) So for Indian Wells, I’m using the hard-court rankings. Of course, this drastically impacts the chances of many players.
- The points awarded for any tournament are also based on how recent the event was. Beating Andy Murray last week is more relevant than beating him last year. Thus, Milos Raonic does better in my rankings (24th overall) than in the ATP rankings (37th). Sure, it would help if Raonic had played more ATP-level events last year, but my algorithm recognizes that February results count for more than wins from last June.
- My system considers matches from the last two years, not just one year, as the ATP rankings do. This and the ‘recency’ adjustment remedy what I consider to be the most ridiculous part of the ATP ranking system. A player can fall dozens of spots in the rankings simply because a tournament result “falls off.”
So, a match from 51 weeks ago tells us a lot about a player’s current skill level, but a match from 53 weeks ago does not? In my system, both are counted; a match from 51 weeks ago counts for about 55-60% of the value of a match from last week, while a match from a few weeks before that counts for only a little less.
- Grand slams count for a bit more, but not a lot more. The main reason for this is that the winner of a five-setter is more likely to be the more skilled player than the winner of a three-setter. A couple of bad bounces in a tiebreak can turn a three-setter against you, but it’s awfully hard to win a five-setter with luck.
- There is a bit of home court advantage in tennis, though with the increasing use of the challenge system (which limits officiating bias), it seems to be decreasing. It still exists, and it’s considered.
- For whatever reason, it appears that qualifiers and wild cards do worse in ATP main draw matches than my system would otherwise expect. So they are penalized a small amount.
- Finally, there is a head-to-head component. It turns out that the head-to-head component can’t improve that much on the rankings-based algorithm, but it does have some value. So I do consider the history of each matchup, giving a slight edge to the player who has won more matches in the past. (Depending, of course, on how long ago it was, what surface the matches were on, and so on.)
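To make the recency adjustment concrete, here is a minimal Python sketch. It is an illustration, not the code behind these rankings: the exponential decay curve, the constant names, and the function `recency_weight` are all assumptions, chosen only to reproduce the two facts stated above (a two-year window, and a 51-week-old match worth roughly 55-60% of a fresh one).

```python
import math

# Facts taken from the post.
WINDOW_WEEKS = 104      # matches from the last two years are considered
TARGET_WEEKS = 51       # a match this old ...
TARGET_WEIGHT = 0.575   # ... is worth "about 55-60%" (midpoint used here)

# ASSUMPTION: exponential decay. The post does not say which curve is used.
# The rate is solved so that the weight at TARGET_WEEKS equals TARGET_WEIGHT.
DECAY_PER_WEEK = -math.log(TARGET_WEIGHT) / TARGET_WEEKS


def recency_weight(weeks_ago: float) -> float:
    """Weight of a match result, normalized so a match from this week is 1.0."""
    if weeks_ago < 0 or weeks_ago > WINDOW_WEEKS:
        return 0.0  # outside the two-year window: ignored entirely
    return math.exp(-DECAY_PER_WEEK * weeks_ago)


if __name__ == "__main__":
    for weeks in (1, 51, 53, 104):
        print(f"{weeks:>3} weeks ago -> weight {recency_weight(weeks):.3f}")
```

Under a smooth curve like this, nothing special happens at the 52-week mark: a 53-week-old result is worth only slightly less than a 51-week-old one, instead of dropping to zero.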
Whew!
Thanks for reading this far.
As I post this, a few matches have already been played. But these numbers were generated this morning, after the full draw was released. The table below shows the probability that each player reaches each round of the tournament. I’ll have a little more to say at the bottom.
Player                R64    R32    R16     QF     SF      F      W
(1)Nadal             100%  94.6%  78.3%  56.3%  40.1%  24.1%  13.0%
(q)De Voest           54%   3.1%   0.8%   0.1%   0.0%   0.0%   0.0%
Riba                  46%   2.3%   0.5%   0.1%   0.0%   0.0%   0.0%
(q)Sweeting           42%   8.4%   0.8%   0.1%   0.0%   0.0%   0.0%
Granollers            58%  17.2%   2.0%   0.5%   0.1%   0.0%   0.0%
(27)Monaco           100%  74.4%  17.7%   7.5%   2.9%   0.8%   0.2%
(19)Baghdatis        100%  86.1%  52.9%  21.3%  11.3%   4.7%   1.6%
(q)Devvarman          43%   5.0%   1.0%   0.1%   0.0%   0.0%   0.0%
Mannarino             57%   8.9%   2.2%   0.2%   0.0%   0.0%   0.0%
(q)Cipolla            28%   4.0%   0.7%   0.1%   0.0%   0.0%   0.0%
Malisse               72%  22.1%   6.6%   1.5%   0.4%   0.1%   0.0%
(15)Tsonga           100%  73.9%  36.7%  12.2%   5.9%   2.0%   0.6%
(11)Almagro          100%  81.5%  51.0%  22.4%   7.8%   2.7%   0.8%
(q)Russell            45%   8.1%   2.0%   0.3%   0.0%   0.0%   0.0%
Anderson              55%  10.4%   3.1%   0.6%   0.1%   0.0%   0.0%
Istomin               41%  13.1%   4.6%   1.0%   0.2%   0.0%   0.0%
Nieminen              59%  24.4%   9.3%   2.8%   0.6%   0.1%   0.0%
(23)Montanes         100%  62.5%  30.2%  10.8%   3.1%   0.8%   0.2%
(28)Simon            100%  73.1%  27.2%  14.5%   4.6%   1.4%   0.4%
Schuettler            40%   8.3%   1.2%   0.3%   0.0%   0.0%   0.0%
Haase                 60%  18.7%   4.0%   1.3%   0.2%   0.0%   0.0%
(q)Matosevic          29%   2.7%   0.6%   0.1%   0.0%   0.0%   0.0%
Karlovic              71%  12.7%   5.0%   1.8%   0.4%   0.1%   0.0%
(6)Ferrer            100%  84.6%  61.9%  44.1%  22.2%  10.8%   4.4%
(4)Soderling         100%  89.0%  71.0%  46.8%  27.3%  15.8%   7.6%
Phau                  37%   3.0%   0.9%   0.2%   0.0%   0.0%   0.0%
Berrer                63%   8.0%   3.4%   0.9%   0.2%   0.0%   0.0%
(q)Smyczek            48%  10.5%   1.1%   0.2%   0.0%   0.0%   0.0%
Marchenko             52%  13.4%   1.5%   0.3%   0.0%   0.0%   0.0%
(32)Kohlsch.         100%  76.1%  22.0%   7.7%   2.3%   0.6%   0.1%
(20)Dolgopolov       100%  68.8%  24.4%   8.9%   2.8%   0.9%   0.3%
Hanescu               39%  10.5%   1.8%   0.3%   0.0%   0.0%   0.0%
Seppi                 61%  20.8%   4.9%   1.1%   0.2%   0.0%   0.0%
Stepanek              30%  12.1%   6.7%   2.3%   0.8%   0.2%   0.1%
(PR)Del Potro         70%  46.4%  35.6%  20.8%  11.1%   6.1%   2.9%
(14)Ljubicic         100%  41.6%  26.5%  10.6%   4.4%   1.7%   0.5%
(9)Verdasco          100%  86.2%  60.7%  23.2%  10.1%   4.2%   1.3%
(WC)Berankis          52%   7.4%   2.2%   0.3%   0.0%   0.0%   0.0%
(q)Bogomolov          48%   6.3%   1.7%   0.2%   0.0%   0.0%   0.0%
Tipsarevic            71%  34.2%  12.2%   3.3%   0.9%   0.2%   0.0%
Kamke                 29%   8.2%   1.7%   0.2%   0.0%   0.0%   0.0%
(21)Querrey          100%  57.6%  21.5%   5.8%   1.5%   0.4%   0.1%
(25)Robredo          100%  70.8%  16.9%   7.6%   2.2%   0.6%   0.1%
Zverev                62%  20.9%   2.9%   0.8%   0.1%   0.0%   0.0%
(q)Ebden              38%   8.3%   0.8%   0.2%   0.0%   0.0%   0.0%
(q)Young              37%   2.2%   0.6%   0.1%   0.0%   0.0%   0.0%
Starace               63%   6.3%   2.6%   0.7%   0.1%   0.0%   0.0%
(5)Murray            100%  91.4%  76.3%  57.7%  35.6%  21.5%  11.1%
(8)Roddick           100%  84.9%  63.0%  43.4%  21.7%   8.7%   3.9%
(WC)Blake             63%  11.3%   4.5%   1.4%   0.3%   0.0%   0.0%
(q)Guccione           37%   3.8%   1.1%   0.2%   0.0%   0.0%   0.0%
Ram-Hidalgo           34%   5.1%   0.5%   0.1%   0.0%   0.0%   0.0%
Mello                 66%  16.4%   2.7%   0.6%   0.1%   0.0%   0.0%
(30)Isner            100%  78.4%  28.1%  12.6%   3.6%   0.8%   0.2%
(18)Gasquet          100%  73.4%  34.8%  14.2%   4.6%   1.2%   0.3%
Cuevas                72%  22.8%   6.7%   1.7%   0.3%   0.0%   0.0%
Andujar               28%   3.9%   0.5%   0.1%   0.0%   0.0%   0.0%
Benneteau             46%  16.1%   7.1%   2.3%   0.6%   0.1%   0.0%
Lopez                 54%  18.9%   9.0%   3.1%   0.8%   0.2%   0.0%
(10)Melzer           100%  65.0%  41.9%  20.4%   8.2%   2.7%   0.9%
(16)Troicki          100%  82.3%  40.1%  10.5%   4.3%   1.1%   0.3%
(q)Bopanna            30%   3.1%   0.3%   0.0%   0.0%   0.0%   0.0%
(WC)Tomic             70%  14.6%   3.1%   0.3%   0.1%   0.0%   0.0%
Giraldo               55%  14.6%   6.0%   1.0%   0.3%   0.0%   0.0%
Gim-Traver            45%  10.9%   3.8%   0.6%   0.1%   0.0%   0.0%
(24)Llodra           100%  74.5%  46.7%  15.8%   7.1%   2.2%   0.7%
(31)Gulbis           100%  56.7%  12.5%   6.0%   2.3%   0.6%   0.1%
Hewitt                75%  37.3%   7.5%   3.7%   1.4%   0.4%   0.1%
Lu                    25%   6.0%   0.6%   0.1%   0.0%   0.0%   0.0%
Mayer                 66%  12.7%   7.2%   3.8%   1.6%   0.4%   0.1%
Golubev               34%   3.7%   1.5%   0.5%   0.1%   0.0%   0.0%
(3)Djokovic          100%  83.6%  70.8%  57.7%  42.5%  24.8%  15.4%
(7)Berdych           100%  84.1%  64.8%  33.2%  12.6%   5.6%   2.3%
Kukushkin             48%   7.6%   2.8%   0.5%   0.1%   0.0%   0.0%
Kubot                 52%   8.3%   3.1%   0.5%   0.1%   0.0%   0.0%
De Bakker             48%  20.6%   5.3%   1.3%   0.2%   0.0%   0.0%
Becker                52%  21.9%   5.9%   1.5%   0.2%   0.1%   0.0%
(26)Bellucci         100%  57.4%  18.1%   4.9%   0.9%   0.2%   0.0%
(17)Cilic            100%  81.7%  37.2%  20.7%   6.6%   2.6%   1.0%
Gabashvili            49%   9.6%   1.5%   0.3%   0.0%   0.0%   0.0%
Serra                 51%   8.7%   1.2%   0.3%   0.0%   0.0%   0.0%
Davydenko             84%  49.6%  32.8%  21.0%   8.7%   4.4%   2.1%
Fognini               16%   3.5%   1.1%   0.3%   0.1%   0.0%   0.0%
(12)Wawrinka         100%  47.0%  26.2%  15.5%   5.2%   2.2%   0.9%
(13)Fish             100%  64.5%  41.9%  13.0%   6.4%   2.7%   1.1%
(WC)Raonic            81%  33.0%  17.9%   4.3%   1.7%   0.6%   0.2%
Ilhan                 19%   2.5%   0.6%   0.0%   0.0%   0.0%   0.0%
(WC)Harrison          26%   5.7%   1.0%   0.1%   0.0%   0.0%   0.0%
Chardy                74%  32.1%  12.0%   2.4%   0.8%   0.2%   0.1%
(22)Garcia-Lopez     100%  62.2%  26.6%   5.9%   2.3%   0.8%   0.2%
(29)Chela            100%  59.2%   7.7%   2.6%   0.7%   0.2%   0.0%
Petzschner            66%  30.5%   3.4%   1.1%   0.3%   0.0%   0.0%
Brown                 34%  10.3%   0.7%   0.1%   0.0%   0.0%   0.0%
Andreev               41%   3.0%   1.4%   0.4%   0.1%   0.0%   0.0%
Nishikori             59%   6.4%   3.7%   1.4%   0.4%   0.1%   0.0%
(2)Federer           100%  90.6%  83.1%  68.7%  52.4%  36.7%  24.5%
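For anyone wondering how round-by-round numbers like these can be derived from match-level predictions, here is a minimal Python sketch of the standard approach: a player's chance of winning a round is his chance of getting there, multiplied by his chance of beating each possible opponent, weighted by how likely that opponent is to show up. This is not the code that produced the table. The function `round_probabilities`, the toy four-player draw, the made-up `ratings`, and the simple ratio formula in `win_prob` are all assumptions for illustration; any function that returns the probability of one player beating another could be plugged in instead.

```python
from typing import Callable, Dict, List, Sequence


def round_probabilities(
    draw: Sequence[str],
    win_prob: Callable[[str, str], float],
) -> Dict[str, List[float]]:
    """For each player, the probability of winning round 1, round 2, ...

    `draw` lists players in bracket order; its length must be a power of two.
    `win_prob(a, b)` is the probability that a beats b.
    """
    n = len(draw)
    if n < 2 or n & (n - 1):
        raise ValueError("draw size must be a power of two")

    alive = {p: 1.0 for p in draw}  # chance each player is still in the event
    results: Dict[str, List[float]] = {p: [] for p in draw}

    block = 1  # size of each half-section that meets in the current round
    while block < n:
        nxt = {}
        for start in range(0, n, 2 * block):
            top = draw[start:start + block]
            bottom = draw[start + block:start + 2 * block]
            for side, others in ((top, bottom), (bottom, top)):
                for a in side:
                    # Sum over every opponent who could emerge opposite a.
                    nxt[a] = alive[a] * sum(
                        alive[b] * win_prob(a, b) for b in others
                    )
        alive = nxt
        for p in draw:
            results[p].append(alive[p])
        block *= 2
    return results


# Hypothetical ratings, invented for this example only.
ratings = {"A": 4.0, "B": 1.0, "C": 2.0, "D": 2.0}


def win_prob(a: str, b: str) -> float:
    # ASSUMPTION: a simple ratio of ratings, not the formula from the post.
    return ratings[a] / (ratings[a] + ratings[b])


probs = round_probabilities(["A", "B", "C", "D"], win_prob)

if __name__ == "__main__":
    for player, by_round in probs.items():
        print(player, " ".join(f"{p:6.1%}" for p in by_round))
```

The same weighting is visible in the table: a seed's second-round number blends his chances against both possible first-round winners. A first-round bye could be handled in a sketch like this by pairing the seed with a placeholder opponent he beats with probability 1.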
You’ll probably notice right off that Federer and Djokovic have the best chances of winning. Indeed, they are the top two players on hard courts, according to my rankings. Yes, Nadal has won the slams lately, but he has also lost to a few players he shouldn’t have (Baghdatis, Melzer, Garcia-Lopez) in the recent past. I personally wouldn’t put money on Federer over Nadal in the final, but my algorithm disagrees.
A few other players my system likes are Juan Martin Del Potro, Nikolay Davydenko, and Marcos Baghdatis. The system rewards players who have scored wins over top-ranked opponents. It likes Del Potro both because of his strong record in the last few weeks and because the algorithm still considers his torrid summer of 2009, leading up to his U.S. Open win.
One more thing, and then I’ll shut up for now. In the first-round matches, there are very few that stray beyond a 70/30 split. Even Tomic-Bopanna is 70/30, and Bopanna barely plays singles. The narrow divides are partly because no top players are involved in the first round, but it also shows you the depth of the men’s game — even someone ranked outside of the top 150, like Flavio Cipolla, has a decent chance of advancing.
Of course, Flavio doesn’t have quite the same odds against Tsonga, and you can tell from Nadal’s second-round odds that neither Pere Riba nor Rik de Voest stands much of a chance against him.
Enjoy the tennis … and the numbers.