US Open Draw Datasets

Earlier today, I published a thorough analysis of the last ten years of US Open draws, showing that while first and second seeds have had extremely easy first-round matchups, there is no other credible statistical evidence that suggests any nonrandom manipulation of the draw.

If you want to take a look at the draws yourself, I’ve made it easier.  The following files not only have the full draws going back to 2001, but they also include each player’s ATP or WTA ranking at the time of the tournament, their ordinal ranking among the players in the draw, the ordinal ranking of their first-round opponent, and the ordinal ranking of their best-possible second round opponent.

Click to download the files:

Here’s a quick rundown of the columns you’ll find in each sheet:

  • Year — each file contains the entire draws for the last ten years.
  • Draw Pos[ition] — numbers 1 to 128, so you can always sort the sheet to show the players in draw order.  (For instance, the #1 seed is 1, that player’s opponent is 2, and so on.)
  • Player
  • Country
  • Seed — the seeding assigned by the US Open
  • Rank [ATP/WTA] — the player’s official ranking the Monday that the tourney began.
  • Ordinal — the player’s rank among the 128 players in the field.  Last year, Shelby Rogers’s WTA ranking was 344, which made her ordinal ranking 124 out of 128.
  • 1stRdOpp — the ordinal ranking of the player’s first-round opponent.
  • Best2nd — the ordinal ranking of the player’s best possible second-round opponent.
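For anyone who would rather rebuild the derived columns than take them on faith, here is a minimal Python sketch. It assumes the sheets follow the standard bracket layout (positions 1 and 2 meet in the first round, and the winner plays the winner of positions 3 and 4). The function names and the four-player sample draw are made up for illustration and are not taken from the files.

```python
def ordinal_ranks(official_ranks):
    """Map each player to a rank within the field (1 = best-ranked)."""
    ordered = sorted(official_ranks, key=official_ranks.get)
    return {player: i + 1 for i, player in enumerate(ordered)}


def first_round_opponent(pos):
    """Draw positions pair off as (1, 2), (3, 4), and so on."""
    return pos + 1 if pos % 2 == 1 else pos - 1


def second_round_candidates(pos):
    """The two draw positions that could supply a second-round opponent."""
    block_start = ((pos - 1) // 4) * 4 + 1
    own_pair = {pos, first_round_opponent(pos)}
    return [p for p in range(block_start, block_start + 4) if p not in own_pair]


# Hypothetical mini-draw: draw position -> (player, official ranking).
draw = {1: ("A", 1), 2: ("B", 344), 3: ("C", 57), 4: ("D", 90)}
ordinals = ordinal_ranks({name: rank for name, rank in draw.values()})

for pos, (name, _) in sorted(draw.items()):
    opp_name = draw[first_round_opponent(pos)][0]
    best_2nd = min(ordinals[draw[p][0]] for p in second_round_candidates(pos))
    # Columns: draw position, player, Ordinal, 1stRdOpp, Best2nd
    print(pos, name, ordinals[name], ordinals[opp_name], best_2nd)
```
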
Let me know if you find anything interesting!

Is the US Open Draw Truly Random?

Italian translation at settesei.it

Last week, an ESPN “Outside the Lines” article called into question the fairness of the U.S. Open main draw.  A researcher discovered that the top two seeds (both men and women) have gotten very easy first-round assignments.

This is one small step away from a direct accusation of draw-rigging by the USTA.  It’s a serious claim, and while the article’s author leans heavily on a single academic who supports the methodology used, it’s not at all clear that anything unacceptable is going on.

What they found

For some reason, the study focused on the top two seeds.  It’s not at all clear why it did so–I have no idea what the USTA’s motive would be for rigging the draw in favor of the top two seeds, regardless of their identity.  Sure, there were a few years when a Federer-Nadal final would have been particularly mouthwatering, or when American viewers craved a Serena-Venus showdown in Flushing, but why would the USTA be tweaking a draw in favor of Gustavo Kuerten?  Marat Safin?  Amelie Mauresmo? Dinara Safina?

For the moment, let’s set that major concern aside.  To quantify the difficulty of each player’s first-round opponents, the ESPN study invented a metric called “difficulty score.”  We’ll come back to “difficulty score” in a bit.

A simple look at the lists they assembled of first-round opponents does suggest that something untoward is going on.  In the last ten years of men’s draws, a top-two seed has faced a top-80 opponent only four times, and not once in the last five years.  Seeded players should face top-80 opponents about half the time.
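To put a rough number on "only four times," here is a back-of-the-envelope sketch in Python. It assumes 20 independent first-round draws (two seeds in each of ten years), each equally likely to land on any ordinal from 33 to 128. Neither assumption holds exactly: the two seeds in a given year cannot draw the same opponent, and the seeds are not always the 32 best-ranked players. The helper name is mine, not anything from the ESPN study.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Unseeded ordinals run from 33 to 128; ordinals 33 to 80 are "top-80" opponents.
p_top80 = (80 - 33 + 1) / (128 - 33 + 1)  # 48 / 96

# Chance of four or fewer top-80 opponents across 20 draws.
print(p_top80, prob_at_most(4, 20, p_top80))
```

Under those simplifying assumptions, four or fewer top-80 opponents in 20 draws works out to a bit more than half a percent.
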

If we are truly interested in the first-rounders assigned to top-two seeds, it’s clear that these players have been given an easier path than what would be statistically expected.  But it’s not yet clear that it’s anything other than good luck.

Breaking down “difficulty score”

Here’s the explanation of the metric that ESPN used:

So if a top two seed faced the 33rd-ranked player in the first round, he/she would get a difficulty score of 0.995 for that round; if he/she faced the 128th-ranked player in the first round, the score for that round would be 0.005. An average opponent (ranked around 80th or 81st), would correspond to a difficulty score near 0.500, which should be the average difficulty score over several years of draws.

I don’t understand why the ESPN study needed to switch from ordinal rankings (1 to 128) to difficulty scores between 0.005 and 0.995.  But I replicated the work using ordinal rankings instead of difficulty scores, and came up with the same results.

The average first round opponent for the top two seeds in each year’s men’s draw has been about the 98th-best player in the draw.  Given that seeds can draw anyone from 33 to 128, the average “should” be around 80.  With difficulty scores, ESPN says that the likelihood of the last ten years of easy draws is 0.3%.  With ordinal rankings, I found approximately the same.  The last thing the sports-analysis world needs is another superfluous metric, but at least this one doesn’t appear to be misleading.
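ESPN's exact formula isn't given in the passage quoted above, but one linear mapping reproduces all three quoted values (0.995 for the 33rd-ranked player, 0.005 for the 128th, and about 0.500 around 80th or 81st). The Python sketch below uses that mapping. Treat it as my reconstruction, not as the study's confirmed method.

```python
FIRST_UNSEEDED, LAST = 33, 128
N_OPPONENTS = LAST - FIRST_UNSEEDED + 1  # 96 possible first-round opponents


def difficulty_score(ordinal):
    """Assumed mapping: evenly spaced scores, centered within each of 96 slots."""
    if not FIRST_UNSEEDED <= ordinal <= LAST:
        raise ValueError("expected an unseeded ordinal between 33 and 128")
    return (LAST + 0.5 - ordinal) / N_OPPONENTS


for r in (33, 80, 81, 128):
    print(r, round(difficulty_score(r), 3))

# Averaged over every possible opponent, the score comes out at one half.
print(sum(difficulty_score(r) for r in range(33, 129)) / N_OPPONENTS)
```

Because the mapping is a straight line, it carries no information beyond the ordinal ranking itself, which is why the two approaches give the same answer.
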

What about better reasons for rigging?

The core problem here is this: Why do we care specifically about the draws for the first two seeds?  Or, why would the USTA care enough to compromise the fairness of the draw?

As ESPN highlighted, some of the first-round victims are American wild cards.  Scoville Jenkins, for instance, was fed to the wolves twice, once each against Federer and Roddick.  If we’re really fishing for an explanation, perhaps the USTA wants to put up-and-coming stars such as Jenkins, Devin Britton, and Coco Vandeweghe on a big stage, either to showcase these players, or to make otherwise pedestrian blowouts more interesting.  I suppose I’d rather watch Nadal play Jack Sock than, say, Diego Junqueira.

But that’s ex post facto reasoning of the most blatant sort.  If the USTA were going to rig the draw, wouldn’t they be more likely to do so in favor of top Americans?  Or in favor of a broader range of seeds, to better ensure marquee matchups for the second week?  Or rig second-round matchups for top players, to ensure that the big names make it to the middle weekend?

If no evidence of draw manipulation appears in any of those other scenarios, it would seem that ESPN discovered something more like the famous correlation between the S&P 500 and butter production in Bangladesh.  If your search for a newsworthy conclusion is sufficiently wide, you’re bound to find something.

The top seeds

As I’ve said, there’s no doubt that the top two seeds in the men’s draw have had an easy go of it in the last ten years, since the draws started seeding 32 players instead of 16.  The same is true of the women.

The top two seeds in both the men’s and women’s draws have faced, on average, an opponent ranked roughly 98th in the 128-player field.  The odds of this happening on either side are tiny–about 0.25%.  The chances that a single tournament would randomly produce draws so easy for the top two men and women for ten years are effectively zero.
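One way to sanity-check a figure like that is a quick simulation. The Python sketch below is not the calculation I ran for this post. It assumes each year's top two seeds draw two different opponents uniformly at random from ordinals 33 through 128, and it counts how often the ten-year average comes out at 98 or worse. The function name and defaults are hypothetical.

```python
import random


def simulate_easy_draws(trials, threshold=98.0, years=10, seed=None):
    """Estimate P(average first-round opponent ordinal >= threshold)."""
    rng = random.Random(seed)
    unseeded = range(33, 129)  # ordinals available to a seeded player
    hits = 0
    for _ in range(trials):
        total = 0
        for _ in range(years):
            # Two seeds per year, who cannot face the same opponent.
            total += sum(rng.sample(unseeded, 2))
        if total / (2 * years) >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(simulate_easy_draws(20_000, seed=2011))
```

The estimate is noisy at this number of trials, but it can be compared against the figure quoted above. Raising `trials` tightens it at the cost of run time.
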

Beyond the top two, however, any suspicions quickly disappear.  The average opponent for the top four seeded men has been ranked about 89 out of 128, meaning that #3 and #4 face opponents around #80–dead average.  The average first-round assignment for the top eight seeded men has been around 87, meaning that seeds 5-8 face average opponents in the mid-80s.  Nothing to cause a raised eyebrow there, and the numbers are almost identical on the women’s side.
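The "meaning that" steps above are just a matter of backing one group's average out of a larger group's average. A small Python sketch of that arithmetic, using the rounded figures from this post (the helper name is mine):

```python
def implied_subgroup_mean(n_all, mean_all, n_known, mean_known):
    """Average for the remaining members, given the overall and known-group means."""
    n_rest = n_all - n_known
    return (n_all * mean_all - n_known * mean_known) / n_rest


# Top four average 89, top two average 98 -> seeds 3 and 4.
print(implied_subgroup_mean(4, 89, 2, 98))  # 80.0

# Top eight average 87, top four average 89 -> seeds 5 through 8.
print(implied_subgroup_mean(8, 87, 4, 89))  # 85.0
```
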

To go one step further, there’s no evidence of manipulation in the second-round draws.  In fact, the top two women’s seeds faced particularly tough second-round opponents–there was only a 20% chance that those twenty women would be given second-round assignments as tough as the ones they received.

Before looking at the draws of U.S. players, a quick summary.  While the top two seeds were given very low-ranked opponents in the first round, the effect did not extend to the second round, or to any seeds beyond the top two.

The American draws

If the USTA were to tweak the draws, you’d expect them to do so in favor of the home players, if for no other reason than television ratings.  But they haven’t.

Let’s start with the American men.  The top two ranked American men each year have faced opponents ranked, on average, 79 of 128.  That’s a bit tougher than average.  If we expand the analysis to the top four ranked Americans, or just seeded Americans, the results stay around average.  If anyone is manipulating the draws in favor of American men, they are either doing it without regard for ATP rankings, or they aren’t doing a very good job.

More surprising is the average opponent of all American men.  The average opponent of an American man in the last ten years has been 61.2 — considerably lower than 80, in part because unseeded men may draw seeded players in the first round.  But the average shouldn’t be that low.  In fact, there is only a 20% chance that American men would be given such a tough assignment.

Results for the women are mostly similar.  The top two American women each year have gotten a slightly easy draw–the average opponent rank is 83 of 128.  Keep in mind, however, that this overlaps with the analysis of the top two seeded women–five of the 20 top-two-seeded women were Americans, and in almost every one of those five cases, those women faced one of the weakest players in the draw.  In other words, there’s more evidence that the draw is skewed in favor of the top two seeds than the top two Americans.

As with the men, American women in general have been given tough assignments.  In fact, there is only a 16% chance that American women would face such tough first round opponents as they have.

What this means

If the USTA (or anyone else) is messing with the US Open draws, they are doing so in a nearly inscrutable way.  The only evidence of manipulation is with each year’s top two seeds, as ESPN highlighted.

The theory I mentioned above–that it might be desirable to pit top players against up-and-coming Americans–is appealing, but also not supported by the evidence.  Only five of the 20 opponents of top-two men’s seeds (and six of 20 women’s opponents) have been American, despite the fact that the U.S. contributes five or six low-ranked wild cards each year, in addition to a disproportionate number of qualifiers.

It’s an odd situation.  The first-round opponents of the top two seeds make for a plausible target of draw manipulation, if not the most obvious one.

Postscript: One more question

I mentioned earlier that I’d rather watch Nadal play Jack Sock than Diego Junqueira.  I like up-and-comers, and it’s always interesting to see whether a new opponent forces a top player to change tactics.  It makes for a more interesting match than Nadal (or any top-tenner) against a 29-year-old who has hovered for years around #100.

My question, then: If you’re Rafa Nadal, and (presumably) you want to go deep at the U.S. Open, who would you rather play?  The American wild card ranked #450, or the veteran ranked #99?  A tougher question: Sock, or a veteran who was nearly seeded, like Fabio Fognini?  I can see different players making different choices, but I don’t think it’s clear cut.

It is the draws of Jenkins, Britton, Glatch–in other words, the Jack Socks of previous years–that give us this evidence of manipulation.  On paper, the 127th-highest-ranked player in the draw looks like the 127th-best, but in practice, it’s not nearly so clear cut.  And if these wild cards really are “wild cards,” what looks like an easy draw may not be much easier than yet another dissection of Sergiy Stakhovsky or Albert Montanes.

It may be true that at some stage, the US Open draws are being manipulated for (and only for) the top two seeds in each field.  But that doesn’t tell us whether those players are gaining anything from it.  It’s far from clear that the lowest-ranked players in each draw are the easiest opponents.

US Open Qualifiers Guide

Tomorrow, 128 men will begin battling for the last 16 spots in the US Open main draw.  (Technically, given withdrawals, it’ll probably be more like 18 or 19 spots, but as of now, it’s 16.)

It’s my favorite time of year, and if you live near New York, it should be yours as well.  But unless you’re an extreme tennis nut, most of the names aren’t very familiar.

Click here for a quick “guide” to the 128 contenders, including their seed, country of origin, birthday, current ATP ranking, and current hard-court “jrank”–that is, their standing in my ranking system.  (The table is too wide to display well on this site.)

If you want to play with the guide, feel free to download it in CSV format.

Hard Court Singles Rankings: 22 August 2011

With the U.S. Open a mere seven days away (and qualifying starting tomorrow!), it’s time to update my hard-court singles rankings.  If you’re interested in some of the methodology underlying these rankings, start here.

Here’s the top 101.  For what might be the first time since I started publishing these, Delpo is knocked out of the top four.  Because my system takes into account the last two years, he could take a hit when the 2009 US Open comes off the books.  It’s not as major a shift as in the ATP rankings, since my system already heavily discounts the 2009 Open for being so long ago, but given how large a factor those wins play in Delpo’s ranking, it will make a difference.

Also interesting to see how my system reflects the mess that is 6 through 15.  Fish, appropriately, heads the group on hard courts, while Ferrer loses several spots compared to the ATP rankings.  (Remember, these numbers are hard-court specific.)  Melzer and Almagro find themselves way out of the running.

Note also what these numbers do with some younger players — Bernard Tomic is on the cusp of cracking the top 20, and Ryan Harrison is inside the top 50.

RANK  PLAYER                  POINTS  
1     Novak Djokovic            7509  
2     Rafael Nadal              4977  
3     Roger Federer             4154  
4     Andy Murray               3911  
5     Juan Martin del Potro     3207  
6     Mardy Fish                2709  
7     Jo-Wilfried Tsonga        2654  
8     Robin Soderling           2360  
9     Tomas Berdych             2034  
10    Stanislas Wawrinka        1907  
11    Gael Monfils              1842  
12    Marin Cilic               1790  
13    David Ferrer              1601  
14    Andy Roddick              1518  
15    Gilles Simon              1507  
16    Nikolay Davydenko         1422  
17    Marcos Baghdatis          1392  
18    Richard Gasquet           1339  
19    Fernando Verdasco         1321  
20    David Nalbandian          1279  

RANK  PLAYER                  POINTS  
21    Bernard Tomic             1279  
22    Milos Raonic              1267  
23    Ernests Gulbis            1256  
24    Janko Tipsarevic          1159  
25    Viktor Troicki            1143  
26    Mikhail Youzhny           1108  
27    Florian Mayer             1093  
28    Alexander Dolgopolov      1068  
29    Philipp Kohlschreiber     1061  
30    Jurgen Melzer             1045  
31    Samuel Querrey            1044  
32    Nicolas Almagro           1023  
33    Ivan Ljubicic             1011  
34    Kei Nishikori             1005  
35    John Isner                 982  
36    Ivan Dodig                 948  
37    Michael Llodra             921  
38    Feliciano Lopez            903  
39    Radek Stepanek             896  
40    Guillermo Garcia-Lopez     854  

RANK  PLAYER                  POINTS  
41    Kevin Anderson             751  
42    Jeremy Chardy              745  
43    Juan Monaco                745  
44    Dmitry Tursunov            740  
45    Philipp Petzschner         736  
46    Ryan Harrison              736  
47    Julien Benneteau           734  
48    Marcel Granollers          720  
49    Tommy Robredo              716  
50    Adrian Mannarino           709  
51    Robin Haase                664  
52    Alex Bogomolov             662  
53    Xavier Malisse             660  
54    Thomaz Bellucci            651  
55    Lleyton Hewitt             621  
56    Sergiy Stakhovsky          613  
57    Ivo Karlovic               607  
58    Grigor Dimitrov            602  
59    Thiemo de Bakker           598  
60    Andrei Goloubev            596  

RANK  PLAYER                  POINTS  
61    Lukasz Kubot               592  
62    Olivier Rochus             586  
63    Donald Young               585  
64    Dudi Sela                  559  
65    Santiago Giraldo           554  
66    Mikhail Kukushkin          543  
67    Andreas Seppi              541  
68    Denis Istomin              541  
69    Igor Andreev               528  
70    Pablo Cuevas               521  
71    Fabio Fognini              512  
72    James Ward                 505  
73    Yen-Hsun Lu                500  
74    James Blake                488  
75    Ricardas Berankis          477  
76    Matthias Bachinger         474  
77    Albert Montanes            468  
78    Lukas Lacko                466  
79    Benjamin Becker            466  
80    Jarkko Nieminen            463  

RANK  PLAYER                  POINTS  
81    Ryan Sweeting              461  
82    Leonardo Mayer             458  
83    Somdev K. Dev Varman       454  
84    Jerzy Janowicz             444  
85    Daniel Brands              444  
86    Matt Ebden                 440  
87    Michael Zverev             437  
88    Tobias Kamke               429  
89    Evgueni Korolev            426  
90    Blaz Kavcic                421  
91    Michael Berrer             419  
92    Daniel Gimeno              416  
93    Vladimir Ignatik           416  
94    Edouard Roger-Vasselin     412  
95    Frank Dancevic             406  
96    Alejandro Falla            401  
97    Ilia Marchenko             399  
98    Gilles Muller              396  
99    Grega Zemlja               396  
100   Simone Bolelli             387  
101   Wayne Odesnik              386

The High-Quality Cincinnati Draw

It’s tough to imagine a Masters series event featuring a higher-quality field than the one assembled in Cincinnati this week.  With the exception of Robin Soderling, virtually every “name” player is present.  Just as importantly, almost all of the players awarded wild cards are legitimate competitors at this level.  The same is true of most of the seven men who qualified.

For tennis fans, it’s an enjoyable outcome: With the possible exception of Robby Ginepri, everyone present “deserves” to be here.  The event gave the other three wild cards to Ryan Harrison, Grigor Dimitrov, and James Blake, three men inside the top 85 who excel on hard courts.  Four of the top seeds in qualifying advanced to the main draw, and each has a current ranking that put him right on the cusp of making the cut in the first place.

All this made me wonder: How does the Cincinnati draw compare to other 56-player Masters fields?  Is Cinci always this strong?

I’ve previously looked at the field quality of ATP 250s, so it was a small step to point the guns at the bigger tourneys.  Here are all 48- and 56-draw Masters events since 2009, along with the average entry rank and median entry rank of players in the field, sorted by the latter:

Year  Event        Field  AvgRank  MedRank  
2011  Madrid          56     37.7     30.0  
2010  Paris           48     38.1     30.5  
2011  CINCINNATI      56     50.1     31.5  
2010  Shanghai        56     56.5     31.5  
2009  Paris           48     57.5     31.5  
2009  Cincinnati      56     38.5     32.0  
2009  Montreal        56     83.6     32.5  
2010  Cincinnati      56     38.5     33.5  
2009  Rome            56     42.0     33.5  
2009  Shanghai        56     54.8     33.5  
2011  Rome            56     42.2     34.5  
2009  Madrid          56     43.6     34.5  
2011  Montreal        56     50.7     35.5  
2009  Monte Carlo     56     45.1     36.5  
2011  Monte Carlo     56     51.9     36.5  
2010  Rome            56     43.1     38.5  
2010  Toronto         56     57.7     40.5  
2010  Madrid          56     59.5     43.0  
2010  Monte Carlo     56     50.6     43.5

There’s not a huge difference in quality–after all, players are required to show up for most of these events–but there is a noticeable differentiation into “haves” and “have-nots.”  Of course Monte Carlo is near the bottom, as it is not mandatory.  Rome is required, but it does get skipped.  Madrid is an interesting case, as this year’s new schedule meant all the best players showed up, while last year, it was near the bottom of the list.

Setting aside Paris, which is near the top of the list because its field has eight fewer players, Cinci appears to consistently offer one of the best Masters fields.  This makes sense, as even if it weren’t a required stop on the tour, it’s a perfectly scheduled warm-up for the U.S. Open.
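A note on why the table is sorted by median and not by average: a handful of very low-ranked entrants (a local wild card, say) can drag the average a long way while barely moving the median. Presumably that is behind gaps like 83.6 versus 32.5 for Montreal in 2009. A small Python sketch with a made-up twelve-player field shows the effect. The ranks are invented, not taken from any real draw.

```python
from statistics import mean, median

# Hypothetical entry ranks for a tiny field, including one far-off wild card.
field = [1, 2, 3, 4, 5, 8, 12, 20, 35, 50, 75, 450]

print(round(mean(field), 1))   # average entry rank, pulled up by the 450
print(median(field))           # median entry rank, barely affected

# Dropping the wild card changes the average far more than the median.
without_wc = field[:-1]
print(round(mean(without_wc), 1), median(without_wc))
```
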

How Long Does the Server’s Advantage Last?

In professional tennis, it’s a given that the server has an advantage.  The size of that advantage depends on the abilities of the two players and the surface, but especially in men’s tennis, it’s a sizable edge.  On average, a server in an ATP match starts a point with a roughly 65% chance of winning.

But how long does it last?  It seems that, at some stage in the rally, the server’s advantage has disappeared.  Four or five strokes in, the server may still be benefiting from an off-balance return.  But by ten strokes, one would assume that the rally is neutral–that the advantage conferred by serving has evaporated.

As usual with tennis analysis, one question begets several more.  Does the server’s advantage last longer on faster surfaces?  Do women settle into “neutral” rallies sooner than men do?  Do dominating players, like this year’s edition of Novak Djokovic, take away the server’s advantage faster than the average player?

Using the rally counts provided by Pointstream at the last three majors, we can start to answer these questions.

Neutralizing the serve

The first step is to take all the matches we have rally-count data for, and average them out.  Then, for each point length, we calculate the odds that the server wins a point of at least that length.  So, for instance, we look at all points of five shots or more, and figure out how many of those the server wins.

Each one of these numbers is biased, because a rally of exactly five strokes is, by definition, won by the server.  The server either hits a winner on his third shot (the fifth overall), or the returner makes an error attempting to hit his own third shot (the sixth overall).  Thus, if we look at all points of at least five strokes, the exactly-five-stroke rallies virtually guarantee that the server will have the advantage.

However, the same reasoning shows us that a six-stroke rally will be biased in favor of the returner.  When we do the math for at-least-five, at-least-six, at-least-seven, and so on, we’ll see a yo-yo effect.  When the biases have equal effect, that means the serve is neutralized.

Here are the results for the approximately 150 grand slam matches with Pointstream data so far this year:

At least…  Win%  Notes                          
0          63%   before point begins            
1          66%   if serve goes in               
2          50%   if serve is returned           
3          60%   if server makes second shot    
4          46%   if returner makes second shot  
5          58%                                  
6          45%                                  
7          57%                                  
8          44%                                  
9          56%                                  
10         44%                                  
11         56%                                  
12         43%                                  
13         56%                                  
14         43%                                  
15         56%

In the table, “Win%” refers to the server’s chance of winning the point.  The biases even out somewhere between the 4th and 8th shot, meaning that in that zone, the server’s advantage is neutralized.
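The at-least-N calculation can be sketched in a few lines of Python. This is an illustration and not the script behind the table. It assumes each point is recorded only as a count of shots that landed in play, so that an odd count means the server won and an even count (including zero, a double fault) means the returner won. The function name and the ten sample points are invented.

```python
from collections import Counter


def server_win_pct_at_least(rally_lengths, max_n):
    """For each n, the server's share of points lasting at least n shots."""
    counts = Counter(rally_lengths)
    results = {}
    for n in range(max_n + 1):
        total = sum(c for length, c in counts.items() if length >= n)
        # Odd rally length: the server struck the last ball in play.
        won = sum(c for length, c in counts.items()
                  if length >= n and length % 2 == 1)
        results[n] = won / total if total else None
    return results


# Ten hypothetical points: one double fault, three unreturned serves, and so on.
sample = [0, 1, 1, 1, 2, 3, 3, 4, 5, 6]
for n, pct in server_win_pct_at_least(sample, 6).items():
    print(n, None if pct is None else round(pct, 2))
```

Even on this toy sample the numbers bounce up on odd thresholds and down on even ones, which is the yo-yo effect described above.
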

While the server retains the advantage at least until the fourth shot, it is interesting to see how quickly it decays.  The server’s win rate drops from 66% upon making a serve to 56% once the advantage is neutralized, and more than half of that ten-point drop (66% to 60%) comes between the first and third shots.  Thus, the returner doesn’t negate the server’s advantage simply by getting the ball back in play, but he does take a large step toward doing so.

Does surface matter?

As usual, it sure does.  The numbers for the Australian and French Opens are similar, and since they make up 2/3 of the data set, they are close to the aggregate numbers shown above.  But Wimbledon, as is so often the case, seems to play by a different set of rules:

At least…    Wimby    Austr    French  
0            66%        62%       62%  
1            68%        64%       67%  
2            52%        50%       48%  
3            62%        59%       58%  
4            48%        46%       45%  
5            61%        57%       57%  
6            47%        44%       44%  
7            61%        55%       56%  
8            47%        44%       44%  
9            59%        55%       54%  
10           47%        43%       43%  
11           60%        54%       55%  
12           46%        43%       43%  
13           59%        55%       55%  
14           43%        45%       42%  
15           56%        54%       56%

The biases don’t balance out until the very bottom–at 14 or more shots!  That’s only about 3% of points.  I’m not sure how to explain this, except perhaps psychologically: on grass (considered the best surface for servers), players may be less successful in return games simply because that’s what they expect to happen.  Regardless of surface, I can’t understand why else the server’s advantage would persist into double-digit shot counts.

What about the ladies?

WTA players (on average) start each service point with a smaller advantage than their male counterparts, and as it turns out, that advantage evaporates more quickly.

We saw a moment ago that, by putting the return in play, an ATP returner gives himself a 50% chance of winning the point–at least until his opponent hits another shot.  Women, however, knock the server’s winning percentage down to 47% by making the return.

The returner clearly neutralizes the point by the fourth stroke overall, and–here’s the good part–takes over a slight advantage herself by her third shot, the sixth stroke overall.  By making that sixth shot, the returner has a 57% chance of winning the point, while the server will never reach 57% again.  The advantage is only a percentage or two, but from the sixth stroke on, the returner has the edge.

Finally, presenting Novak Djokovic

Pointstream has tracked 17 of Djokovic’s slam matches this year, giving us a good set of data to work with.  When a man is having a season like this one–in large part because of his return game–it’s fascinating to see how comprehensively he is outplaying his opponents.

In the same terms as the tables above, here are Djokovic’s serve and return points across those 17 matches.  The return points are shown with the server’s winning percentages:

At least…    ND Sv    ND Ret  
0            70%         57%  
1            72%         60%  
2            59%         42%  
3            68%         50%  
4            56%         39%  
5            68%         49%  
6            57%         38%  
7            68%         47%  
8            58%         38%  
9            68%         45%  
10           55%         34%  
11           65%         44%  
12           55%         33%  
13           66%         38%  
14           52%         29%  
15           65%         40%

After seeing the averages above, you might reasonably conclude that these numbers are out of this world.  Even with the bias of 4-, 6-, and 8-stroke rallies, as discussed above, Djokovic still maintains an edge.  For everyone else, once the fifth or sixth shot is struck, the point is a 50/50 proposition.  For Novak, it’s at least 60/40 in his favor.

The amazing stats are on his return.  When he gets his return back in play, he’s more likely than not to win the point.  That may not surprise anyone who has watched Djokovic play this year, but consider how remarkable that is in the context of modern men’s tennis.  By the 8th stroke or so, he’s back to the 60/40 odds of the service points that turn into longer rallies.

Thanks to Carl Bialik for suggesting this topic.

The Most (and Least) Consistent Players on the ATP Tour

“Consistency” is one of the many terms that commentators frequently use but rarely define.  It’s often misused, too: we say we want a player to be more consistent, when we really just want him to stop playing badly.

To me, consistency for a tennis player is similar to the notion of “playing up to his ranking.”  In other words, if a player is consistent, he usually beats players ranked lower, and he usually loses to players ranked higher.  No player is perfect in this regard, but clearly, some are much more reliable than others.

A recent poster boy for inconsistency is Ernests Gulbis.  At Roland Garros, he lost to Blaz Kavcic, ranked 82nd in the world.  That was on clay, a surface on which Gulbis had posted some excellent results the previous year.  Two months later, in Los Angeles, he beat Juan Martin del Potro, someone he shouldn’t have even challenged on a hard court.

Quantifying consistency

With my player rankings and match prediction system, I’m able to assign a win probability to each player for every match.  For instance, when Ivan Dodig beat Nadal last week, I had given him a 14.4% chance of doing so.  As you might imagine, that’s a major upset–as I wrote the next day, it was the 10th-biggest upset of the season.

In these terms, an ideally consistent player will never be on either end of an upset.  If he is the favorite, he wins; if he is the underdog, he loses.  In practice, no tour-level player accomplishes this, though over the last two years, Florent Serra and Eduardo Schwank have come very close.

I’ve come up with a metric to measure consistency.  This is how it works:

  • Gather a list of all ATP-level matches for the desired time period.  (Today, I’m using everything from January 2010 through Montreal last week.)
  • Eliminate matches that ended in retirement or walkover, as well as those where we don’t have enough information to make an educated prediction.  (e.g. the first few comeback matches of Tommy Haas, or one with a wildcard playing his first professional match.)
  • For each player, count how many matches he played.
  • For each player, find the matches where he was the favorite and lost, or was the underdog and won.
  • For each of those matches, take the probability that the eventual winner would win (e.g. 18%, always under 50%), multiply by 100 (e.g. 18, not 18%), subtract it from 50 (e.g. 50 – 18 = 32), and square the result (e.g. 32*32 = 1024).
  • Sum all of the squares, then divide by the number of total matches–not just the ones where the favorite lost.

Whew!  In something more like layman’s terms, we’re taking all the upsets a player was involved in, coming up with a number to represent how big (or surprising) the upset was, then averaging the results.

Using this method, we give big upsets considerably more weight than mini-upsets.  If a player had a 45% chance of winning a match and ends up winning, it barely counts as an upset–and this system treats it accordingly.  By dividing by the total number of matches, we give consistency credit to players who win the matches they’re “supposed to” win, and lose those they are supposed to lose.
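The steps listed above can be sketched in Python. This is a minimal illustration and not my production code. It assumes the retirements, walkovers, and unpredictable matches have already been filtered out, and that each remaining match is stored as a pair: the player's pre-match win probability and whether he won. The function name and the four sample matches are hypothetical.

```python
def upset_score(matches):
    """matches: list of (player_win_probability, player_won) pairs."""
    if not matches:
        raise ValueError("need at least one match")
    total = 0.0
    for prob, won in matches:
        # Probability that had been assigned to whoever actually won.
        winner_prob = prob if won else 1.0 - prob
        if winner_prob < 0.5:  # an upset, in either direction
            total += (50 - 100 * winner_prob) ** 2
    # Divide by all matches, not just the upsets.
    return total / len(matches)


sample = [
    (0.18, True),   # big upset win:   (50 - 18)^2 = 1024
    (0.80, True),   # favorite wins:   contributes nothing
    (0.70, False),  # favorite loses:  (50 - 30)^2 = 400
    (0.45, True),   # mini-upset:      (50 - 45)^2 = 25
]
print(round(upset_score(sample), 2))  # (1024 + 400 + 25) / 4 = 362.25
```

The squaring is what makes a single 18% shocker count for about forty times as much as a 45% mini-upset.
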

Most importantly, the numbers this algorithm spits out are completely believable, matching up well with the conventional wisdom of which players are consistent and inconsistent.

The consistency of the top ten

The most consistent player on the tour, since the beginning of 2010, has been … Florent Serra.  Amazingly, Igor Kunitsyn comes in second.  But I doubt many of you care much about the consistency of guys like that.

Let’s start with the current top 10, ranked from most to least consistent:

Player              Upsets  Matches    Up%  UpsetScore  
David Ferrer            25      119  21.0%          55  
Rafael Nadal            16      131  12.2%          68  
Novak Djokovic          18      119  15.1%          69  
Roger Federer           21      123  17.1%          69  
Jo-Wilfried Tsonga      20       80  25.0%          75  
Mardy Fish              19       77  24.7%          82  
Tomas Berdych           32      113  28.3%         106  
Gael Monfils            24       89  27.0%         107  
Robin Soderling         23      115  20.0%         130  
Andy Murray             24       97  24.7%         151

The relevant column is the rightmost, “UpsetScore,” which is the result of the algorithm described above.  Ferrer has been part of more upsets than any of the top three (“Up%”), but his upsets are more minor.  Except for losses to Ivo Karlovic and Jarkko Nieminen early in the year on hard courts, Ferrer has not lost a match he had a 60% or better chance of winning.
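
A rough way to check the "more minor" claim from the table alone: because UpsetScore is the sum of squares divided by all matches, multiplying it by Matches and dividing by Upsets gives the average squared distance from 50 per upset, and the square root of that is a typical upset size in percentage points. The helper below is a back-of-the-envelope sketch with a made-up name, and the table values are rounded, so the results are approximate.

```python
import math


def rms_upset_margin(upset_score, matches, upsets):
    """Root-mean-square distance from 50 (in percentage points) across a
    player's upsets, recovered from the published columns."""
    return math.sqrt(upset_score * matches / upsets)


# (UpsetScore, Matches, Upsets) copied from the table above.
top_ten = {
    "David Ferrer": (55, 119, 25),
    "Rafael Nadal": (68, 131, 16),
    "Novak Djokovic": (69, 119, 18),
    "Roger Federer": (69, 123, 21),
}

for name, (score, matches, upsets) in top_ten.items():
    print(f"{name:16s} {rms_upset_margin(score, matches, upsets):5.1f}")
```

By this measure Ferrer's upsets sit around 16 points from 50, against roughly 20 to 24 for the other three.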

The two ends of this list certainly line up with what I would have expected: Ferrer and Nadal are rock-solid (last week’s loss to Dodig notwithstanding), while Soderling and Murray both can be picked off by anybody, and frequently threaten higher-ranked players.

Right now, you may be tempted to put Djokovic higher on the list–after all, he’s ranked #1 and he’s beating everybody.  However, in the slightly longer term of 20 months, his movement around the top three has included some unexpected results, like losing to Ljubicic at Indian Wells last year, and victories over Federer and Nadal before his ranking suggested he would do so.

Tour wild cards

Outside of the top 10, there are a handful of players who are almost impossible to predict.  Some names that come to mind are Marcos Baghdatis, Ernests Gulbis, and Nikolay Davydenko, men who can take out one of the top three on a good day (well, maybe not Gulbis), but can lose to a qualifier on the next.

Player                 Upsets  Matches  Up%  UpsetScore  
Nikolay Davydenko          32       73  44%         273  
Marin Cilic                27       91  30%         180  
Marcos Baghdatis           38       89  43%         177  
Olivier Rochus             20       52  38%         164  
Milos Raonic               16       41  39%         164  
Juan Martin del Potro      11       48  23%         154  
Andy Murray                24       97  25%         151  
Jurgen Melzer              28       96  29%         150  
Fernando Verdasco          40      104  38%         150  
Ivan Ljubicic              27       69  39%         149  
Florian Mayer              30       72  42%         147  
Samuel Querrey             26       66  39%         146  
Andrei Goloubev            24       55  44%         143  
Ernests Gulbis             22       69  32%         140  
Jeremy Chardy              31       65  48%         133  
Juan Monaco                26       73  36%         131  
Robin Soderling            23      115  20%         130  
Michael Llodra             26       67  39%         130  
Rainer Schuettler          15       42  36%         119  
Mikhail Youzhny            25       78  32%         116

The “upset score” number tells the story for Davydenko.  The man who beat Nadal at the beginning of the year and threatened Djokovic last week recently suffered defeat at the hands of Cedrik-Marcel Stebe (twice!) and Antonio Veic.

While no one is in Davydenko’s league, names like Cilic, Baghdatis, Murray, and Verdasco seem appropriate.  Verdasco, Melzer, and Milos Raonic suggest a flaw in this approach: the algorithm reads very fast improvement or decline as inconsistency, which isn’t quite right.  Yes, Raonic has shocked the tennis world repeatedly this season, but he hasn’t mixed in too many disastrous losses alongside the surprise upsets.  I tinkered with ways to account for that in the model, but nothing worked very well.
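
The distinction between surprise wins and disastrous losses can at least be made visible by splitting a score into the part that comes from each. The sketch below is only a diagnostic under assumed inputs, not a fix for the model and not taken from the system behind these tables; the function name, the input format, and the sample numbers are all hypothetical.

```python
def split_upset_score(matches):
    """matches: list of (player_win_prob, player_won) pairs for one player.

    Returns (score_from_upset_wins, score_from_upset_losses). Both parts
    share the same denominator (all matches), so they add up to the full
    upset score.
    """
    if not matches:
        raise ValueError("need at least one match")
    from_wins = 0.0
    from_losses = 0.0
    for prob, won in matches:
        if won and prob < 0.5:
            # Underdog won: the winner's probability is the player's own.
            from_wins += (50 - 100 * prob) ** 2
        elif not won and prob > 0.5:
            # Favorite lost: the winner's probability is 1 - prob.
            from_losses += (50 - 100 * (1 - prob)) ** 2
    n = len(matches)
    return from_wins / n, from_losses / n


# Made-up numbers for a fast-improving player: surprise wins, no bad losses.
improving = [(0.15, True), (0.20, True), (0.30, False), (0.65, True)]
wins_part, losses_part = split_upset_score(improving)
print(round(wins_part, 2), round(losses_part, 2))
```

A player whose score comes almost entirely from the first number looks more like a riser than a genuinely erratic one.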

A couple more interesting notes from the “most inconsistent” players are found in the upset percentage column.  Guys like Davydenko, Baghdatis, Mayer, Goloubev, and Chardy are involved in upsets nearly half the time.  Chardy is highest in that category.  In fact, if I expanded the study to challenger events, he might rocket to the top of this list, as he plays quite a few, and often manages to lose against players outside the top 100.

The consistent ones

The flip side is considerably less star-studded.  Among the 20 most consistent players of the last 19-20 months, Ferrer is the only top-10 guy present, though #11 Nicolas Almagro is there as well.

Here’s my seat-of-the-pants theory.  In this sense, “consistent” isn’t good.  Yes, “consistent” sounds good, especially when “inconsistent” means Davydenko losing to Antonio Veic or Mayer falling to Federico del Bonis.  But inconsistent means Davy beating Federer and Mayer beating Soderling.  So, the players who show up as “most consistent” are in fact consistent, but they are also mediocre.  Their consistency (perhaps a mental advantage) has helped them move up from the top 200 to the top 50 or 100, but that’s all they can do.

Ferrer and Almagro are good examples of this, actually.  Neither has the weaponry that makes commentators say, “This guy could be number one!”  But they’ve earned their rankings by regularly reaching the quarters and semis of tournaments, not suffering the boneheaded losses that afflict the likes of Cilic and Baghdatis.

All that said, here’s the list:

Player            Upsets  Matches  Up%  UpsetScore  
Florent Serra         11       56  20%          23  
Igor Kunitsyn         14       40  35%          33  
Ilia Marchenko        14       46  30%          40  
Potito Starace        28       81  35%          46  
Victor Hanescu        26       77  34%          50  
Tobias Kamke          12       41  29%          52  
Andreas Seppi         24       81  30%          53  
Julien Benneteau      23       59  39%          53  
Viktor Troicki        25      101  25%          54  
David Ferrer          25      119  21%          55  
Fabio Fognini         18       71  25%          55  
Pere Riba             13       41  32%          56  
Lukas Lacko           14       44  32%          57  
Igor Andreev          17       62  27%          58  
Lukasz Kubot          26       63  41%          59  
Nicolas Almagro       22      112  20%          59  
Frederico Gil         15       40  38%          60  
Denis Istomin         25       76  33%          65  
Jarkko Nieminen       25       74  34%          66  
John Isner            21       82  26%          67

These lists hardly represent the final word on who is or is not consistent–for one thing, I haven’t said anything about consistency within matches, which may be a completely separate issue.  But this approach does, I think, provide some insight into who is more likely to be part of an upset, and suggests that consistency might not be such a good thing after all.

ATP Cincinnati Predictions

If last week’s tournament in Montreal taught us anything, it’s that predicting the outcome of ATP matches is a fool’s errand.  With that in mind, let’s see what the draw has in store for us in Cinci!

The draw this week is what the Masters series is all about.  With the exception of a couple of late withdrawals (Tsonga?) that may yet come down the pike, nearly every top player in the men’s game is in Cincinnati.  Andy Roddick is trying to return from injury, David Ferrer makes his summer hard-court debut, and we’re already set for a Federer/del Potro showdown in the second round.

Del Potro’s mere presence makes every tournament a little more interesting.  He’s laid a couple of eggs recently, losing to Gulbis and Cilic, but he tore up the spring hard court circuit and lost only to the best of the best on clay.  My ranking system still gives him a lot of respect, keeping him within the top five, which makes Federer’s route to the semifinals (heck, the third round!) look particularly challenging.

Djokovic (who, once again, is in Fed’s half) could face a slew of Americans on their home turf.  His probable second-round opponent is Ryan Harrison, who I favor heavily over Juan Ignacio Chela.  After that, it’s easy to see John Isner in the third round, and possibly Andy Roddick in the quarters.  It’s theoretically possible, but a little less likely, that another American, James Blake, will make it to the semis to be Novak’s opponent in that round.

Here is my full projection.  For purity’s sake, it doesn’t reflect the results of today’s two matches, in which Delpo and Blake both advanced.

Player                        R32    R16     QF         W  
(1)Novak Djokovic          100.0%  90.1%  74.4%    29.61%  
(WC)Ryan Harrison           71.5%   8.6%   3.2%     0.07%  
Juan Ignacio Chela          28.5%   1.3%   0.3%     0.00%  
(q)Radek Stepanek           43.6%  18.2%   3.0%     0.09%  
John Isner                  56.4%  24.7%   5.0%     0.24%  
Andrey Golubev              23.2%   8.4%   1.0%     0.02%  
(16)Stanislas Wawrinka      76.8%  48.7%  13.0%     1.41%  

Player                        R32    R16     QF         W  
(11)Andy Roddick            63.7%  43.3%  23.0%     1.34%  
Philipp Kohlschreiber       36.3%  20.8%   9.1%     0.24%  
Juan Carlos Ferrero         23.9%   4.3%   0.9%     0.00%  
Feliciano Lopez             76.1%  31.6%  13.3%     0.29%  
Ivan Dodig                  39.4%  10.9%   3.9%     0.05%  
(q)Ernests Gulbis           60.6%  24.3%  11.6%     0.32%  
(6)Gael Monfils            100.0%  64.9%  38.2%     2.22%  

Player                        R32    R16     QF         W  
(3)Roger Federer           100.0%  58.9%  44.5%    10.38%  
Juan Martin del Potro       82.9%  38.8%  27.9%     5.00%  
Andreas Seppi               17.1%   2.4%   0.8%     0.01%  
(WC)James Blake             23.3%   8.1%   1.1%     0.01%  
Marcos Baghdatis            76.7%  46.2%  15.2%     1.02%  
Fabio Fognini               26.1%   7.6%   0.9%     0.01%  
(14)Viktor Troicki          73.9%  38.1%   9.6%     0.40%  

Player                        R32    R16     QF         W  
(9)Nicolas Almagro          62.0%  32.1%  14.0%     0.25%  
Albert Montanes             38.0%  13.8%   4.0%     0.03%  
Ivo Karlovic                32.6%  14.0%   4.6%     0.04%  
Florian Mayer               67.4%  40.2%  18.2%     0.56%  
Tommy Haas                  13.2%   0.8%   0.1%     0.00%  
Juan Monaco                 86.8%  28.8%  13.9%     0.21%  
(8)Tomas Berdych           100.0%  70.4%  45.3%     2.43%  

Player                        R32    R16     QF         W  
(5)David Ferrer            100.0%  75.5%  44.7%     2.52%  
(q)Marsel Ilhan             38.3%   7.5%   1.9%     0.00%  
(WC)Grigor Dimitrov         61.7%  17.0%   6.0%     0.05%  
Janko Tipsarevic            70.3%  31.7%  15.3%     0.43%  
(q)Edouard Roger-Vasselin   29.7%   8.2%   2.2%     0.01%  
Jurgen Melzer               44.2%  25.0%  12.0%     0.35%  
(10)Gilles Simon            55.8%  35.2%  17.9%     0.72%  

Player                        R32    R16     QF         W  
(15)Jo-Wilfried Tsonga      57.2%  46.4%  19.8%     2.28%  
Marin Cilic                 42.8%  32.7%  12.2%     0.85%  
(q)Alex Bogomolov Jr        57.7%  13.7%   3.0%     0.04%  
(WC)Robby Ginepri           42.3%   7.2%   1.1%     0.01%  
(q)Kei Nishikori            46.5%  10.6%   4.5%     0.14%  
David Nalbandian            53.5%  13.1%   5.9%     0.27%  
(4)Andy Murray             100.0%  76.3%  53.6%    10.62%  

Player                        R32    R16     QF         W  
(7)Mardy Fish              100.0%  63.6%  42.6%     3.59%  
Nikolay Davydenko           68.5%  28.8%  16.7%     0.99%  
Sergiy Stakhovsky           31.5%   7.6%   3.0%     0.04%  
Xavier Malisse              55.4%  21.5%   6.7%     0.11%  
Kevin Anderson              44.6%  15.4%   4.5%     0.04%  
Alexandr Dolgopolov         47.8%  29.6%  12.4%     0.39%  
(12)Richard Gasquet         52.2%  33.4%  14.0%     0.52%  

Player                        R32    R16     QF         W  
(13)Mikhail Youzhny         54.6%  28.3%   7.7%     0.41%  
Michael Llodra              45.4%  20.5%   4.9%     0.17%  
Thomaz Bellucci             36.9%  15.1%   3.1%     0.05%  
Fernando Verdasco           63.1%  36.0%  10.4%     0.61%  
Guillermo Garcia-Lopez      60.5%  13.3%   6.3%     0.19%  
(q)Julien Benneteau         39.5%   4.9%   1.8%     0.03%  
(2)Rafael Nadal            100.0%  81.8%  65.8%    18.36%
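
The projection system itself isn't described in this post, so what follows is only a generic sketch of how round-by-round numbers like these can be built from head-to-head win probabilities in a single-elimination draw. The function name, the four-player draw, the ratings, and the ratio rule used for `win_prob` are all placeholders, and byes (the 100.0% entries above) are not handled.

```python
def advance_probabilities(n_players, win_prob):
    """Chance that each draw slot wins each round of a knockout bracket.

    n_players must be a power of two (byes are not handled).
    win_prob(i, j) is the probability that slot i beats slot j.
    Returns one list per round; entry i is the chance slot i wins that round.
    """
    if n_players < 2 or n_players & (n_players - 1):
        raise ValueError("n_players must be a power of two")
    reach = [1.0] * n_players  # everyone is certain to start round one
    rounds = []
    block = 1  # size of the group a slot has already emerged from
    while block < n_players:
        nxt = [0.0] * n_players
        for i in range(n_players):
            start = (i // (2 * block)) * (2 * block)
            mid = start + block
            # Possible opponents: the other half of this slot's section.
            opponents = range(mid, mid + block) if i < mid else range(start, mid)
            nxt[i] = reach[i] * sum(reach[j] * win_prob(i, j) for j in opponents)
        rounds.append(nxt)
        reach = nxt
        block *= 2
    return rounds


# Hypothetical ratings for a four-player draw; the ratio rule is a placeholder.
ratings = [4.0, 1.0, 2.0, 2.0]
rounds = advance_probabilities(4, lambda i, j: ratings[i] / (ratings[i] + ratings[j]))
for r in rounds:
    print([round(p, 3) for p in r])
```

Each round's figure for a player is the chance he got that far multiplied by the chance he beats whoever comes out of the neighboring section, weighted by how likely each of those opponents is to be there. The last list sums to 1, as the "W" column should across a full draw.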

Welcome To Your 30s, Roger Federer

This week, Roger Federer turned 30.  In some sports, that age can represent peak performance; in tennis, it is often a signal that the end is near.

I’m sure Roger wouldn’t appreciate being treated as an age-grouper, but viewing him that way gives us more evidence of his greatness.  Regardless of whether he returns to the top of the ATP rankings, it would seem that he’ll remain the #1 thirty-something for as long as he wants to keep playing.

Here is the current list of best 30-somethings, based on this Monday’s ATP rankings. The only achievement that exceeds Fed’s domination of the 30-and-over set is Ivan Ljubicic’s standing among 32-year-olds. [Edit: That is, if you ignore Radek Stepanek, who is older and higher-ranked.  Never mind…]

3    Roger Federer       SUI    8/8/81
18   Jurgen Melzer       AUT   5/22/81
22   Juan Ignacio Chela  ARG   8/30/79
27   Radek Stepanek      CZE  11/27/78
30   Nikolay Davydenko   RUS    6/2/81
31   Ivan Ljubicic       CRO   3/19/79
33   Michael Llodra      FRA   5/18/80
47   Albert Montanes     ESP  11/26/80
49   Xavier Malisse      BEL   7/19/80
50   Jarkko Nieminen     FIN   7/23/81
63   Potito Starace      ITA   7/14/81
64   Victor Hanescu      ROU   7/21/81
79   Olivier Rochus      BEL   1/18/81
85   Michael Berrer      GER    7/1/80
86   James Blake         USA  12/28/79
88   Eric Prodon         FRA   6/27/81
91   Ricardo Mello       BRA  12/21/80
98   Diego Junqueira     ARG  12/28/80
100  Michael Russell     USA    5/1/78
103  Marc Gicquel        FRA   3/30/77
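
The ages behind this list are easy to check from the birthdates. The sketch below assumes a reference date of August 8, 2011 (Federer's 30th birthday, per the table) and a hypothetical helper named `age_on`; the three rows are copied from the list above.

```python
from datetime import date, datetime


def age_on(birth, ref):
    """Whole years of age on the reference date."""
    return ref.year - birth.year - ((ref.month, ref.day) < (birth.month, birth.day))


# Assumed reference date: Federer's 30th birthday, taken from the table.
REF = date(2011, 8, 8)

# (ranking, player, birthdate) rows copied from the list above.
rows = [
    (3, "Roger Federer", "8/8/81"),
    (27, "Radek Stepanek", "11/27/78"),
    (31, "Ivan Ljubicic", "3/19/79"),
]

for rank, player, born in rows:
    birth = datetime.strptime(born, "%m/%d/%y").date()
    print(rank, player, age_on(birth, REF))
```

On that date Stepanek and Ljubicic both come out at 32, with Stepanek the higher-ranked of the two, which is the point of the bracketed edit above.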

Andy Murray and The Worst Upsets of the Year

On Tuesday in Montreal, Andy Murray played an ugly, listless match against world #35 Kevin Anderson, losing 6-3 6-1.  While Murray has played some solid matches this year and is in no immediate danger of losing his top-four ranking, the Anderson loss is hardly the first disaster of his season.  Back in Indian Wells and Miami, he managed to lose to Donald Young and Alex Bogomolov in successive matches.  Ouch.

Using my rankings and match projection system, I’ve generated win probabilities for every ATP match of the season.  Combined with match outcomes, that allows us to find the upsets that were least expected on that surface, at that time.

Pre-match, my system gave Anderson a 16.3% chance of beating Murray–only a smidge better than Dodig’s 14.4% against Nadal.  (My system has never given the South African much credit; his hard-court ranking right now is #58.)  In fact, Anderson was the 4th-biggest underdog going into the 2nd round, ahead only of Dodig, Michael Russell, and Vasek Pospisil.

As it turns out, Anderson’s victory was the 14th-biggest upset win of the ATP season.  (I took out retirements and “comeback players,” like Fernando Gonzalez and Tommy Haas, whose rankings aren’t very predictive.)  That’s 14 out of nearly 1,700.

But, as you might guess, 14th-best of the season isn’t enough to be 1st with Murray on the losing end.  The Murray loss to Young in March is the biggest upset of the year–Donald entered the match with an 8.6% chance of winning.  The Bogomolov match comes in 4th overall; the American had a 10.1% chance before play began.
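
Ranking upsets this way amounts to a filter and a sort. A minimal sketch, assuming each match record carries the eventual winner's pre-match probability; the `Match` class, the `retired` flag, and the function name are hypothetical, while the four probabilities are the ones quoted in this post.

```python
from dataclasses import dataclass


@dataclass
class Match:
    winner: str
    loser: str
    winner_prob: float      # pre-match chance that the eventual winner would win
    retired: bool = False   # hypothetical flag for retirements and walkovers


def biggest_upsets(matches, threshold=0.20):
    """Completed matches whose winner had less than `threshold` chance,
    biggest upset first."""
    upsets = [m for m in matches if not m.retired and m.winner_prob < threshold]
    return sorted(upsets, key=lambda m: m.winner_prob)


# Probabilities quoted in this post.
season = [
    Match("Kevin Anderson", "Andy Murray", 0.163),
    Match("Donald Young", "Andy Murray", 0.086),
    Match("Ivan Dodig", "Rafael Nadal", 0.144),
    Match("Alex Bogomolov", "Andy Murray", 0.101),
]

for m in biggest_upsets(season):
    print(f"{m.winner_prob:5.1%}  {m.winner} d. {m.loser}")
```

Excluding "comeback players" would need an extra rule of some kind, which this sketch leaves out.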

Edit: This is what I get for writing a draft the night before!  Dodig’s upset victory comes in tied for #10 on the season, pushing Murray/Anderson down one more spot on the list.  Nadal will go home having suffered the biggest upset of the Rogers Cup, though he played far and away better than Murray did to achieve the same outcome.

The biggest upsets of the year

I couldn’t possibly give you those numbers without following through with a complete table.  Here are the 36 matches where the winner entered the match with less than a 20% chance of winning.  This list is through last week’s matches, so it doesn’t yet show Murray’s latest meltdown and Dodig’s shocker.

(This site doesn’t show wide tables very well; click here for a clearer version.)

P(UPSET)  WINNER                 LOSER               TOURNEY          SCORE
 8.6%     Donald Young           Andy Murray         Indian Wells     7-6(4) 6-3
 9.4%     Bernard Tomic          Robin Soderling     Wimbledon        6-1 6-4 7-5
 9.7%     Jimmy Wang             Igor Kunitsyn       Newport          4-6 7-5 6-2
10.1%     Alex Bogomolov         Andy Murray         Miami            6-1 7-5
11.6%     Stephane Robert        Tomas Berdych       French Open      3-6 3-6 6-2 6-2 9-7
13.1%     Milos Raonic           Mikhail Youzhny     Australian Open  6-4 7-5 4-6 6-4
13.8%     Denis Kudla            Grigor Dimitrov     Newport          6-1 6-4
14.0%     James Ward             Stanislas Wawrinka  Queen's Club     7-6(3) 6-3
14.1%     Federico del Bonis     Florian Mayer       Stuttgart        6-2 6-3
14.4%     Jo-Wilfried Tsonga     Rafael Nadal        Queen's Club     6-7(3) 6-4 6-1               

P(UPSET)  WINNER                 LOSER               TOURNEY          SCORE
15.0%     Jan Hernych            Thomaz Bellucci     Australian Open  6-2 6-7(11) 6-4 6-7(3) 8-6
15.2%     Nikolay Davydenko      Rafael Nadal        Doha             6-3 6-2
15.6%     Andrey Kuznetsov       Marcos Baghdatis    Casablanca       6-4 4-6 6-4
16.4%     Flavio Cipolla         Andy Roddick        Madrid Masters   6-4 6-7(7) 6-3
16.6%     Lukas Rosol            Jurgen Melzer       French Open      6-7(4) 6-4 4-6 7-6(3) 6-4
16.8%     Antonio Veic           Nikolay Davydenko   French Open      3-6 6-2 7-5 3-6 6-1
17.0%     Sergei Bubka Jr.       Daniel Gimeno       Doha             6-0 6-3
17.1%     Leonardo Mayer         Marcos Baghdatis    French Open      7-5 6-4 7-6(6)
17.6%     Ivan Dodig             Robin Soderling     Barcelona        6-2 6-4
17.6%     Michael Yani           Dudi Sela           Newport          7-6(5) 6-3                   

P(UPSET)  WINNER                 LOSER               TOURNEY          SCORE
17.7%     Frank Dancevic         Feliciano Lopez     Johannesburg     6-7(5) 6-2 7-6(8)
17.8%     Alexander Dolgopolov   Robin Soderling     Australian Open  1-6 6-3 6-1 4-6 6-2
17.9%     James Ward             Samuel Querrey      Queen's Club     3-6 6-3 6-4
17.9%     Jan Hernych            Sergey Stakhovsky   Halle            6-3 6-7(5) 7-6(8)
18.2%     Jan Hernych            Denis Istomin       Australian Open  6-3 6-4 3-6 6-2
18.3%     Jo-Wilfried Tsonga     Roger Federer       Wimbledon        3-6 6-7(3) 6-4 6-4 6-4
19.0%     Milos Raonic           Fernando Verdasco   San Jose         7-6(6) 7-6(5)
19.0%     Lukasz Kubot           Gael Monfils        Wimbledon        6-3 3-6 6-3 6-3
19.2%     Federico del Bonis     Sergey Stakhovsky   Stuttgart        6-4 6-3
19.3%     Lukasz Kubot           Nicolas Almagro     French Open      3-6 2-6 7-6(3) 7-6(5) 6-4    

P(UPSET)  WINNER                 LOSER               TOURNEY          SCORE
19.3%     Bernard Tomic          Feliciano Lopez     Australian Open  7-6(4) 7-6(3) 6-3
19.7%     Thomaz Bellucci        Andy Murray         Madrid Masters   6-4 6-2
19.7%     Pavol Cervenak         Victor Hanescu      Stuttgart        6-3 7-6(6)
19.7%     Richard Gasquet        Roger Federer       Rome Masters     4-6 7-6(2) 7-6(4)
19.9%     Rajeev Ram             Grigor Dimitrov     Atlanta          6-4 6-4
20.0%     Philipp Kohlschreiber  Robin Soderling     Indian Wells     7-6(8) 6-4