Unmixing the Gender Gap in Mixed Doubles

Doubles has long been a sort of final frontier in tennis analytics. Doubles is interesting, at least in part, for the same reason that all team sports are compelling–contributions can come from either player, or a combination of the two. From an analytics perspective, that poses a challenge: Can we isolate what each player brings to the court? I’ve tried to do so with my doubles Elo ratings, but that method relies on players changing partners. It’s not possible to identify how much each half contributed simply by looking at match results.
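To see why partner changes matter, here is a minimal sketch of a doubles Elo update. It assumes that a team's rating is the plain average of its two players' ratings and that both partners receive the same adjustment; that is an assumption made for illustration, not a description of how the actual doubles Elo ratings are computed. The function names, the player labels, and the K-factor of 20 are all hypothetical.

```python
def expected_score(team_a, team_b):
    """Standard Elo win expectation for team_a against team_b."""
    return 1.0 / (1.0 + 10 ** ((team_b - team_a) / 400.0))


def update_doubles(ratings, winners, losers, k=20.0):
    """Apply one result; ASSUMES team rating = mean of its two players."""
    team_w = sum(ratings[p] for p in winners) / 2
    team_l = sum(ratings[p] for p in losers) / 2
    delta = k * (1.0 - expected_score(team_w, team_l))
    # Both partners move by the same amount, so results alone
    # cannot tell them apart.
    for p in winners:
        ratings[p] += delta
    for p in losers:
        ratings[p] -= delta


# Hypothetical players: A and B never split up; C plays with both D and E.
ratings = {name: 1500.0 for name in "ABCDE"}
update_doubles(ratings, ("A", "B"), ("C", "D"))
update_doubles(ratings, ("A", "B"), ("C", "E"))
update_doubles(ratings, ("C", "D"), ("A", "B"))

print(ratings["A"] == ratings["B"])  # permanent partners stay identical
print(ratings["D"] == ratings["E"])  # C's two partners have drifted apart
```

Under these assumptions, A and B can never be separated no matter how many matches they play, while D and E diverge as soon as they appear with the same partner in different matches.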

The problem, as usual, is limited data availability. To know how much value to assign to each player, we need to know what he or she did, even at the basic level of aces, double faults, winners, and errors. The tours report match stats for many doubles contests, but do not separate the players. Knowing that the Bryan brothers hit 12 aces doesn’t tell us anything about Bob or Mike. The grand slam websites have been better, often providing sequential point-by-point data for some matches, but the same problem persists: They don’t differentiate between players.

That is, until now! The Australian Open website specified the server for each point of every doubles match. (It doesn’t identify the returner on each point, but … baby steps.) That opens up whole new vistas for analytics to separate the contributions of each player.

There’s no I in mixed

A natural place to start is mixed doubles, an event that, due to lack of data, has been almost entirely ignored by analysts. Yet mixed doubles is one of those things that everyone seems to have at least a moderate interest in, either because it’s a popular amateur pastime, or because gender differences in sport are inherently fascinating. Due to the variety of skillsets on court at all times, mixed doubles presents tactical puzzles that are different from those posed by same-gender matches.

Let’s start with the basics. There are only 32 teams in a grand slam mixed doubles event, so it’s possible to extend the dataset even further by manually recording which players returned from which sides. (Thanks to Jeff M for a big assist with this.) Thus, for over 3,000 points, we have the gender of the server and the returner. The following table shows several aggregates: overall mixed doubles averages, typical performance for male and female servers, and rates for male and female returners. The columns are hold percentage, serve points won (SPW), first-serve-in rate, and average first serve speed. In the two returner rows, every figure describes the servers facing that returner, so a lower SPW means a more effective returner:

Subset           Hold%    SPW  First In  Avg 1st  
Average          76.0%  63.3%     66.2%    103.1  
Men serving      78.6%  65.1%     65.0%    110.2  
Women serving    72.4%  61.3%     67.6%     94.9  
Men returning        -  60.4%     64.6%    103.5  
Women returning      -  65.9%     67.6%    102.8

I was a bit surprised by how narrow the gap is between men and women serving. In men’s doubles at the Australian Open, servers won 67.8% of points, and in women’s doubles, servers won 58.5%. The pool of players is very similar, but in the mixed event, men won fewer serve points and women won more.

Perhaps there is more insight to be gained by looking at more specific matchups:

Server  Returner    SPW  First In  Avg 1st  
Male      Male    61.7%     63.5%    111.0  
Male      Female  68.1%     66.3%    109.5  
Female    Male    58.9%     66.0%     94.6  
Female    Female  63.3%     69.0%     95.1 
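A breakdown like the one above could be produced from point-level records along these lines. This is a minimal sketch: the tuple layout and the handful of sample points are hypothetical stand-ins, not the actual Australian Open data.

```python
from collections import defaultdict

# Hypothetical point records: (server_gender, returner_gender, server_won).
points = [
    ("M", "F", True),
    ("M", "F", True),
    ("M", "M", False),
    ("F", "M", True),
    ("F", "M", False),
    ("F", "F", True),
]


def spw_by_matchup(points):
    """Return {(server, returner): share of points won by the server}."""
    won = defaultdict(int)
    total = defaultdict(int)
    for server, returner, server_won in points:
        total[(server, returner)] += 1
        won[(server, returner)] += int(server_won)
    return {key: won[key] / total[key] for key in total}


for matchup, spw in sorted(spw_by_matchup(points).items()):
    print(matchup, f"{spw:.1%}")
```

Grouping on the server alone, or the returner alone, gives the one-sided rows of the earlier table.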

Tactics appear to change a bit depending on the gender of the returner. Both men and women land more first serves when facing a female returner. However, first serve speed doesn’t vary much. This suggests that David Marrero–who got himself in hot water by possibly fixing a 2016 Australian Open mixed match and then making some questionable comments about inter-gender competition afterward–is unusual in his reluctance to hit hard against female opponents.

Interestingly, the averages from same-gender doubles matches pop up in this table. When men serve to women in mixed doubles, they win 68.1% of points, almost exactly the same rate of serve points won in men’s doubles (67.8%). When women serve to men, they take 58.9% of points, just a bit higher than the usual rate in women’s doubles (58.5%). This suggests that while the server-returner matchup is important, the gender of the net player is a key factor as well.

Beware of Melichar

Individual player results against each gender will tell us more, but a single tournament’s worth of no-ad, third-set super-tiebreak matches doesn’t give us a lot of data on many players. Many members of first-round losing teams served only 20-25 points each. Of the finalists, John Patrick Smith had the biggest gender gap, winning 54.9% of service points against men and 74.4% against women, and his opponent Barbora Krejcikova was similar, winning 59.6% against men and 73.0% against women. Their partners, Astra Sharma and Rajeev Ram, both had narrower gaps of just a few percentage points.

Over the course of the entire event, Sharma was the best server of the four, winning 69.7% of total service points compared to Ram’s 69.0%. But neither came close to semi-finalist Nicole Melichar, who won a whopping 78.4%, narrowly besting her partner, Bruno Soares, who won 77.7%. The Melichar/Soares duo appears to be particularly effective as a unit: Melichar won only 72.6% of service points in her three women’s doubles matches, and Soares won only 70.2% in his men’s doubles quarter-final run alongside Jamie Murray.

The first step toward analyzing any sporting event is simply understanding what’s going on. In the case of mixed doubles, a big part of that is getting a sense of the gender gap on both serve and return. There’s still a painful dearth of data–we now have a mere 31 matches with servers and returners identified for each point–but the next time you watch a mixed doubles match, you’ll be that much smarter about what to expect and what sorts of performances are worthy of further study.

Gender Differences in Point Penalties


Italian translation at settesei.it

The officiating in Saturday night’s US Open women’s final has become a hot-button issue, to put it mildly. Many of the complaints about Serena Williams’s treatment at the hands of chair umpire Carlos Ramos come down to a belief that Ramos’s actions were sexist. Most of us have seen players–both men and women–act in ways that seem more objectionable than anything Serena did, and anybody paying attention has seen innumerable coaching violations go unpenalized.

There are a few things we can all agree on. First: Not all umpires are the same. Ramos is more strict than, say, Mohamed Lahyani. Second: Officials have a lot of latitude, so something that triggers a penalty in one match may not have the same result in another match. And third: Umpires usually do everything they can not to call game penalties. A lot of matches have at least one warning, whether for coaching, ball abuse, or a variety of other things, but only a small percentage of them escalate to the loss of a point or game. Of course, players typically proceed with caution as well. Once a warning has been called, you don’t see nearly as many rackets smashed or balls sent sailing out of the stadium.

The differences between umpires, and the latitude granted to them within the rules, make it easy to point to any given call and accuse the umpire of sexism, racism, favoritism, homerism, Fed-hating, Rafa-hating, or good old-fashioned stupidity. The rarity of point and game penalties makes Saturday night’s decisions all the more glaring, since within each umpire’s range of options, they rarely go nuclear and dock an entire game.

Some numbers

Point penalties–let alone game penalties–are so rare that it’s impossible to draw concrete conclusions. Still, let’s take a look at what we have. As far as I know, none of the ATP, WTA, ITF, or USTA have released any data on penalties, the players who receive them, or the umpires who levy them. (This would be a great time to do so, but I’m not holding my breath.) As an alternative, we can turn to the increasingly sizable dataset of the Match Charting Project (MCP), which now spans over 3,500 matches from the 2010s alone.

MCP data is not random, since matches are chosen by charters in part because of their personal interests. But in a way, that’s good for today’s purposes: MCP matches skew in the direction of notability, with a disproportionate number of finals and substantial data for top players, including over 100 matches for Serena. With those caveats in mind, let’s take a look at penalties in matches from 2010 to the present, not including Saturday’s final. The final column, “P%”, is the percent of matches in which a penalty was levied.

Category        Matches  Penalties     P%  
Women (all)        1895         13  0.69%  
Women (slams)       490          6  1.22%  
Women (finals)      228          2  0.88%
  
Men (all)          1689         16  0.95%  
Men (slams)         234          6  2.56%  
Men (finals)        371          5  1.35%

Men receive point penalties at a higher rate than women in three separate comparisons: All MCP matches, matches at grand slams*, and finals. The grand slam numbers are particularly pertinent because the slams are the only category in which the umpires are drawn from the same pool. At other events, the ATP and WTA use separate groups of officials.
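The P% column is simply penalties divided by matches, and the comparisons can be recomputed from the counts in the table. A short sketch, using only the table's numbers (the function and variable names are my own):

```python
# (matches, penalties) copied from the table above.
counts = {
    "Women (all)": (1895, 13),
    "Women (slams)": (490, 6),
    "Women (finals)": (228, 2),
    "Men (all)": (1689, 16),
    "Men (slams)": (234, 6),
    "Men (finals)": (371, 5),
}


def penalty_rate(matches, penalties):
    """Penalties per match, the quantity shown as P%."""
    return penalties / matches


for category, (matches, penalties) in counts.items():
    print(f"{category:<15} {penalty_rate(matches, penalties):.2%}")

# How many times more often men are penalized in each comparison.
for scope in ("all", "slams", "finals"):
    men = penalty_rate(*counts[f"Men ({scope})"])
    women = penalty_rate(*counts[f"Women ({scope})"])
    print(f"{scope:<7} men/women ratio: {men / women:.2f}")
```

With so few penalties in each row, these ratios would swing sharply if even one or two more penalties were added, which is why they are descriptive rather than conclusive.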

(I’m ignoring full-game penalties because there’s almost no data. In these 3,500-plus matches, there was only one instance where things escalated beyond the point penalty stage: Grigor Dimitrov’s meltdown at the 2016 Istanbul final.)

* Update: A number of people have pointed out that the grand slam comparison isn’t exactly apples-to-apples, because men play best-of-five. True. I’m not sure, however, if we should expect proportionally more penalties in longer matches. Coaching, for instance, would continue throughout a match until identified as a code violation, and then (one would hope) stop. That said, it is certainly true that on a per-point or per-set basis, the gender gap at majors is smaller than these numbers suggest, though it still leaves us with more point penalties against men.
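The per-set adjustment can be sketched as follows. The average set counts here are assumed placeholders chosen purely for illustration, not values measured from the MCP data, so the output shows only the direction of the adjustment.

```python
def penalties_per_set(matches, penalties, avg_sets_per_match):
    """Penalties divided by an estimated number of sets played."""
    return penalties / (matches * avg_sets_per_match)


# ASSUMED placeholders for average sets per slam match; not measured values.
ASSUMED_SETS_MEN = 3.7    # best-of-five
ASSUMED_SETS_WOMEN = 2.3  # best-of-three

# Slam rows from the table: (matches, penalties).
men_matches, men_penalties = 234, 6
women_matches, women_penalties = 490, 6

per_match_ratio = (men_penalties / men_matches) / (women_penalties / women_matches)
per_set_ratio = (
    penalties_per_set(men_matches, men_penalties, ASSUMED_SETS_MEN)
    / penalties_per_set(women_matches, women_penalties, ASSUMED_SETS_WOMEN)
)

print(f"men/women per match: {per_match_ratio:.2f}")
print(f"men/women per set:   {per_set_ratio:.2f}")
```

Whatever the true averages are, the per-set gap in this sample would vanish only if men's slam matches ran about 2.1 times as many sets as women's (490/234 is roughly 2.09), which seems unlikely when the format maximum is five sets against three.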

These numbers aren’t proof of gender fairness, nor do they establish sexism against either women or men. Aside from the limited number of penalties, we know nothing about the actions that led to them, or about similar instances that didn’t trigger penalties. Perhaps men are generally more abusive to officials, so they should receive half again as many penalties as women, or even more. I don’t know, and it’s likely that nobody else commenting on the Serena-Ramos incident knows either. Anecdotes are a key ingredient in this sort of vitriol. To firmly settle the issue, we’d need to set up a controlled study, perhaps by instructing a set of male and female players to berate umpires in identical ways and then comparing the results. As entertaining as that would be, it’s not going to happen.

None of this is to say that accusations of sexism require statistical support to be valid. They don’t. But in cases where the data is available, especially when it is possessed by some of the very organizations making accusations, it’s a shame that the numbers get ignored. The limited information available to us via the MCP indicates that men are more frequently penalized by chair umpires than women are. The USTA, ITF, and WTA could go a long way to clear up the issue–whether officials are consistently equitable or there is a pattern of harsher treatment of female players–by releasing details of all matches, including the number and causes of warnings and penalties, as well as the identity of the umpires. Alas, the more likely outcome is a few more weeks of unsubstantiated grandstanding.

Men, Women, and Unforced Errors


If you’ve ever suffered through a debate about the relative merits of men’s and women’s tennis, you’ve probably heard the assertion that women’s tennis is sloppier–“riddled with unforced errors,” perhaps.  Maybe you’ve even made that claim yourself, which is understandable, given how often some version of it crops up, unchallenged, in tennis commentary.

But is it really true?  Do WTA matches feature so many more unforced errors than ATP matches? Unforced errors were counted at most slam matches last year, so we can find out.

Let’s start with the most recent results.  In men’s matches at the 2013 US Open, 33.2% of points ended in an unforced error.  Play may have tightened up just a bit in the final week: In the round of 16 and later, 32.9% of points ended in UFEs.

Women’s matches did, in fact, feature a higher rate of unforced errors. Considering the entire tournament, 39.7% of points ended that way, while in the fourth round and later, the rate dropped to 36.7%.

So yes, there are more unforced errors in the women’s game.  There are similar gaps between ATP and WTA error rates at Wimbledon and the Australian Open, and while the difference on the French Open clay is smaller, it is still present.

Eyeballing errors

However, these aren’t massive differences.  Using the US Open numbers, we can calculate that WTA points ended in UFEs about 20% more often than ATP points.  In the last four rounds of the tournament, when more people are watching closely and drawing conclusions, that difference drops to 11.7%.

Without a scorebook in hand, that gap may well be too small to spot.  In a typical set of, say, 60 points, the average ATP pairing combined for 20 UFEs, against a typical WTA matchup’s 24.  That’s one extra unforced error every other game–if that.  Looking at the four final rounds, the difference drops to 20 UFEs in a men’s set against 22 in a women’s set.  Two extra errors a set.
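The per-set figures are just the US Open rates quoted above multiplied by the 60-point set. A quick sketch of that arithmetic (the names are my own):

```python
SET_POINTS = 60  # the "typical set" length used in the text

# US Open unforced error rates quoted earlier in this post.
us_open_ufe = {
    "full draw": {"ATP": 0.332, "WTA": 0.397},
    "R16 and later": {"ATP": 0.329, "WTA": 0.367},
}


def ufes_per_set(rate, points=SET_POINTS):
    """Expected unforced errors (both players combined) in one set."""
    return rate * points


for stage, rates in us_open_ufe.items():
    atp = ufes_per_set(rates["ATP"])
    wta = ufes_per_set(rates["WTA"])
    print(f"{stage}: ATP {atp:.1f}, WTA {wta:.1f}, gap {wta - atp:.1f}")
```

A longer or shorter set scales all of these numbers proportionally, so the gap stays at roughly four errors per 60 points for the full draw and about two for the late rounds.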

The divide is real, but it hardly seems substantial enough to represent a major difference in the quality of play or in the viewing experience.

Here are the numbers for the entire field at all four 2013 slams, followed by the rates in the final 16:

Slam             ATP UFE%  WTA UFE%  WTA/ATP  
Australian Open  36.2%        44.4%     1.22  
French Open      33.6%        37.0%     1.10  
Wimbledon        19.1%        24.6%     1.29  
US Open          33.2%        39.7%     1.20

R16 and later:                                           
Slam             ATP UFE%  WTA UFE%  WTA/ATP  
Australian Open  36.4%        41.1%     1.13  
French Open      33.9%        34.9%     1.03  
Wimbledon        20.5%        24.4%     1.19  
US Open          32.9%        36.8%     1.12
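The WTA/ATP column is the women's rate divided by the men's rate, so a value of 1.20 means points ended in unforced errors about 20% more often. A minimal sketch using the full-draw rows (the names are my own):

```python
# Full-draw (ATP UFE%, WTA UFE%) pairs copied from the first table.
full_draw = {
    "Australian Open": (36.2, 44.4),
    "French Open": (33.6, 37.0),
    "Wimbledon": (19.1, 24.6),
    "US Open": (33.2, 39.7),
}


def wta_atp_ratio(atp_pct, wta_pct):
    """How many times as often WTA points end in a UFE, relative to ATP."""
    return wta_pct / atp_pct


for slam, (atp, wta) in full_draw.items():
    ratio = wta_atp_ratio(atp, wta)
    print(f"{slam:<16} {ratio:.2f}  ({ratio - 1:.0%} more often)")
```

Because the inputs here are already rounded to one decimal place, a recomputed ratio can differ from the published column in the last digit.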

Don’t read too much into the contrasts between one slam and another–what’s important here is how the same set of scorers, in the same conditions, are judging men’s and women’s matches.  Wimbledon, especially, is known for its, shall we say, unique approach to counting unforced errors.

Instead, a power gap

The French Open rates are by far the closest of those at the four slams.  This shouldn’t come as a surprise.  On a slower surface, ATPers earn fewer free points than usual on serve, finding themselves more frequently in rallies.  Take away those one- or two-shot rallies that the men’s game is known for, and the UFE disparity starts to shrink.

While we can’t account for all service winners and forced error returns, we can take aces out of the equation.  So far, we’ve only seen unforced errors as a percentage of all points.  Take UFEs as a percentage of all non-ace points, and the difference between men’s and women’s error rates decreases.

In other words, now we’re starting to look at what happens when the serve is returnable:

Slam             ATP UFE%  WTA UFE%  WTA/ATP  
Australian Open  39.6%        46.2%     1.17  
French Open      35.6%        38.3%     1.08  
Wimbledon        21.2%        25.9%     1.22  
US Open          36.1%        41.3%     1.14  

R16 and later:                                
Slam             ATP UFE%  WTA UFE%  WTA/ATP  
Australian Open  39.6%        42.8%     1.08  
French Open      35.3%        36.0%     1.02  
Wimbledon        22.7%        25.6%     1.13  
US Open          34.9%        38.3%     1.10
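The adjustment behind these tables is to divide unforced errors by non-ace points rather than by all points. The sketch below shows that calculation on a made-up match, and also backs out the ace share implied by the two published US Open full-draw rates. The match counts are hypothetical, and the implied ace shares are approximate because they come from rounded percentages.

```python
def ufe_rate_non_ace(ufes, points, aces):
    """UFEs as a share of the points that were not aces."""
    return ufes / (points - aces)


def implied_ace_share(ufe_all, ufe_non_ace):
    """Ace share of all points implied by the two UFE rates.

    If U/P = ufe_all and U/(P - A) = ufe_non_ace, then A/P = 1 - ufe_all/ufe_non_ace.
    """
    return 1 - ufe_all / ufe_non_ace


# Hypothetical match: 200 points, 16 aces, 66 unforced errors.
print(f"{ufe_rate_non_ace(66, 200, 16):.1%}")

# US Open full-draw rates from the tables (rounded, so results are approximate).
print(f"ATP implied ace share: {implied_ace_share(0.332, 0.361):.1%}")
print(f"WTA implied ace share: {implied_ace_share(0.397, 0.413):.1%}")
```

The larger implied ace share on the men's side is what shrinks the gap once aces are set aside.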

In most of these cases, we’re down to a couple of points per set.  If we were able to sort out service winners and perhaps forced error returns, we would almost surely see even more minor differences.

There’s no doubt that men hit harder serves and are, on average, more likely to win a point without having to hit a second ball.  But if we’re comparing the characteristics of men’s and women’s tennis, it doesn’t seem right to give the men credit for not hitting as many unforced errors when some of the already modest difference is due to the dominance of the serve.

Quibbles

This entire analysis depends on the unforced error stat, which I don’t much care for.  It is hugely dependent on the scorer, and there’s no widespread agreement in the sport on what exactly it means.

However, if we want to challenge a widely-held belief about unforced errors, there’s not really any way around using unforced errors, is there?

The best we can do to eliminate scorer’s biases is to compare only within single events.  The same person isn’t counting unforced errors at every US Open match, but each scorer probably works both men’s and women’s matches.  At a given venue, every scorer might even go through the same training program.

Even with that consideration, there is the strong possibility that scorers make adjustments–consciously or unconsciously–depending on the gender of players on court.  If unforced errors are shots that a player should have made but didn’t, a lot hinges on your interpretation of the word “should.”  It may be that some shots would be called unforced errors in a men’s match, but forced errors in a women’s match.  To the extent that’s the case, it’s awfully difficult to compare the genders using a stat that itself differs depending on gender.

On the other hand, scorers are presumably tennis fans, and they’ve heard the same conventional wisdom everyone else has.  If you believe that women hit more unforced errors than men do, perhaps you call borderline women’s shots unforced and borderline men’s shots forced.  In that case, scorers might be unwittingly amplifying the gender difference, not reducing it.

Given the difficulties of collecting data from hundreds of matches on different continents spread across many months, I doubt any non-automated method of counting unforced errors would address all of these issues.  For now, we have to take the official unforced error counts as the best available representation of reality and draw conclusions accordingly.

Whatever the limitations of the data, and whatever the other differences between the genders on a tennis court, unforced error counts are not nearly the distinguishing factor that they’ve been made out to be.