## Google's Gender Gaps - Worse Than Advertised
*Where we can measure it, female participation in Google's engineering is much lower than expected.*
Google reports that 21.4% of their “tech” employees are now female, up from 20.2% last year[^1]. You would be mistaken, though, if you assumed that this was equivalent to the female representation in their software engineering roles, which a lawsuit revealed to be significantly lower, at 16%, in 2017[^2].
Are there other reasons to be skeptical? Well, glancing over at Microsoft's diversity report, we notice that they shifted their data collection date from September to June this year[^3]. Allegedly the shift was to align with their financial calendar, but it is interesting that this date coincides with the rapidly expanding "Explore Microsoft" internship program, which has a [markedly different](https://twitter.com/donasarkar/status/999809666170290177?s=20) demographic makeup compared to the rest of the company. Facebook is also reporting 2018 data from June[^4], shifted from May[^5] in past years. Google has a diverse summer internship program as well, and 2018 saw *"49% of Google’s global interns identifying as Black, Latinx, and/or women"[^1]*. Google doesn't provide a reporting date with their data, but I wouldn't bet against it being in June.
This possibly explains some of the discrepancy between these companies' PR announcements and their Equal Employment Opportunity (EEO-1) filings, which show much lower numbers[^6][^7][^8]. Facebook warns us that:
> ...due to the way the U.S. government tracks EEO-1 data, the numbers reflected in the below form are representative of a point in time in December 2017, and not our current 2018 data. The EEO-1 data also reflects job groupings and categories that do not align with the way Facebook groups our roles and employees internally. We believe that the information present on this website is a far more accurate reflection of the progress we've made and the work that remains to be done.[^4]
If this seems difficult to disentangle, that's probably intentional. We may want to understand where representation is at its lowest, but tech companies are trying to make themselves look as good as they can. They have to compromise between appearing to be honest and forthcoming with their metrics and making sure those metrics stack up against the competition.
Ideally we would measure employee demographics and work patterns directly, without having them passed through the PR department. An enterprising Google employee could do this quite easily - they could compile user activity from their engineering systems, tag users with identity information from the internal mailing lists, and run the numbers[^10].
We don't have access to this data, but we can find a smaller approximation of it. The past few years have seen the rapid growth of [Google’s GitHub Organization](https://github.com/google). Here thousands of verified Googlers do their work in public on open source projects. Almost all their accounts can be linked to their real identities, making a relatively accurate determination of their gender possible, albeit time-consuming, and GitHub provides an API for linking them to [contributions](https://blog.github.com/2013-01-07-introducing-contributions/).
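As a rough illustration of what such a scrape might look like, here is a minimal Python sketch. It assumes GitHub's REST endpoint `GET /repos/{owner}/{repo}/stats/contributors`, which reports a commit total per author; the post does not say which endpoint was actually used, so treat that choice, and the helper names `fetch_json`, `contributor_stats_url` and `total_contributions`, as assumptions made for illustration.

```python
import json
import urllib.request

API = "https://api.github.com"


def fetch_json(url, token=None):
    """GET a GitHub REST API URL and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def contributor_stats_url(owner, repo):
    """Build the URL of the per-author commit statistics for one repository."""
    return f"{API}/repos/{owner}/{repo}/stats/contributors"


def total_contributions(stats_by_repo, members):
    """Sum each member's totals across repositories.

    stats_by_repo maps a repository name to the decoded JSON list returned
    by the stats endpoint. Members without any entry keep a total of 0, and
    authors who are not org members are ignored.
    """
    totals = {login: 0 for login in members}
    for stats in stats_by_repo.values():
        for entry in stats:
            author = entry.get("author") or {}  # author can be null
            login = author.get("login")
            if login in totals:
                totals[login] += entry["total"]
    return totals
```

In use, one would call `fetch_json(contributor_stats_url("google", name))` for each repository, collect the results into a dictionary, and pass that dictionary to `total_contributions` along with the member logins.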
As of December 2017, a scrape of the Google GitHub organization yielded:
- 1493 members
- 1048 repositories
- 60,969 contributions[^12]
Aggregated by user and supplemented with gender and the user's personal GitHub stats[^11], you can find the dataset [here](https://pastebin.com/raw/gjCbeNbi). Since this scrape, the member list has grown to 2100 with significant turnover, so there will be a lot of value in updating it with new data. But for now, let's see what we have.
First up, member statistics by gender:
| | Count | Avg #repos | Avg #followers | Avg #gists |
| :------------ | --------: | ---------: | -------------: | ---------: |
| Male | 1378 | 7.26 | 147.9 | 8.18 |
| Female | 69 | 5.29 | 84.7 | 2.64 |
| Unknown | 46 | 3.29 | 20.8 | 0.76 |
The 46 users marked as “unknown” are mostly pseudonymous accounts that could not be linked to any social media presence. These accounts tended to have less activity, which explains the low numbers for this group. A significant number of unknowns also came from real names that did not have a strong gender association and could not be reliably linked to a social media presence.
Looking at the male and female groups, two things stand out:
1. Only 4.8% of the gendered members are female
2. Females have significantly less activity than males, with:
   - 73% as many repositories
   - 57% as many followers
   - 32% as many gists
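These percentages follow directly from the member table. A short Python check, using only the figures from the table (the variable names are illustrative):

```python
# Figures copied from the member statistics table.
members = {
    "Male": {"count": 1378, "repos": 7.26, "followers": 147.9, "gists": 8.18},
    "Female": {"count": 69, "repos": 5.29, "followers": 84.7, "gists": 2.64},
}
male, female = members["Male"], members["Female"]

# Share of gendered members (male + female, unknowns excluded) who are female.
female_share = female["count"] / (male["count"] + female["count"])
print(f"female share of gendered members: {female_share:.1%}")  # 4.8%

# Female average as a fraction of the male average, per metric.
ratios = {
    metric: female[metric] / male[metric]
    for metric in ("repos", "followers", "gists")
}
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.0%}")  # repos: 73%, followers: 57%, gists: 32%
```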
Now let's examine contributions to repositories owned by the Google organization:
| | #contribs > 0 | Avg #contribs | Avg excluding 0 | % of all |
| :------------ | ------------: | ------------: | --------------: | -------: |
| Male | 34% | 44 | 127 | 98.8% |
| Female | 25% | 8 | 32 | 0.9% |
| Unknown | 34% | 5 | 15 | 0.3% |
As shown in the first column, the majority of Googlers who join the GitHub org don't make any contributions (or contribute only to capsicum-linux[^12]). What would cause someone to join the org but not make any contributions is a potentially important unknown, so we calculate the average number of contributions both with and without these members.
Points of interest:
1. Females were 74% as likely to have any contributions as males
2. Females with contributions had 25% as many contributions as males
3. 98.8% of all contributions were from males
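The first two figures can be reproduced from the contributions table. The sketch below also derives an estimate that the table does not state directly: the approximate female share of gendered contributors, which can be set against the 8% prediction in footnote 13. Because the table's percentages are rounded, that estimate is only approximate, and the variable names are illustrative.

```python
# Figures copied from the member and contribution tables.
groups = {
    "Male": {"count": 1378, "active": 0.34, "avg_active": 127},
    "Female": {"count": 69, "active": 0.25, "avg_active": 32},
}
male, female = groups["Male"], groups["Female"]

# Point 1: relative likelihood of having any contributions at all.
likelihood_ratio = female["active"] / male["active"]
print(f"{likelihood_ratio:.0%}")  # 74%

# Point 2: average contributions among members with at least one.
volume_ratio = female["avg_active"] / male["avg_active"]
print(f"{volume_ratio:.0%}")  # 25%

# Derived estimate (from rounded inputs): head counts of contributors.
female_contributors = female["count"] * female["active"]
male_contributors = male["count"] * male["active"]
female_contributor_share = female_contributors / (
    female_contributors + male_contributors
)
print(round(female_contributors))  # 17
print(f"{female_contributor_share:.1%}")  # 3.6%
```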
The contribution disparity between males and females was much larger than expected[^13]. With only 1.2% of the original member list being contributing females, we're down to just 17 members in this category, but we can still get a better idea of how the groups compare with a density plot:
Note that the contributions are shown here on a log2 scale, so members on the right edge of this graph are contributing 1000x as much as members on the left. What we observe is that a minority of the male contributors are contributing far more than most other members, and thus also driving the disparity in average contributions between male and female contributors. It's not so much the case that female members have unusually low contributions; rather, it's that all the members with unusually high contributions are male.
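To make the scale concrete: each unit on a log2 axis is one doubling, so a 1000x gap corresponds to roughly ten units. A quick check in Python (the function name `log2_span` is purely illustrative):

```python
import math


def log2_span(ratio):
    """Number of log2 axis units separating two values with the given ratio."""
    return math.log2(ratio)


span = log2_span(1000)
print(f"{span:.2f}")  # 9.97
print(2 ** 10)        # 1024, i.e. ten doublings is roughly 1000x
```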
Given the surprising nature of these findings, we should revisit the assumption under which this data was collected. We started out aiming to understand the work patterns and demographics of software engineering at Google, but not being able to access this data directly, we settled for what was hosted on the Google GitHub organization. This excluded all closed source projects, and also excluded open source projects hosted elsewhere (such as Android and Chromium). This approach includes selection effects that could have skewed our sample, such as:
- Open source vs non-open source
- Small projects vs large projects
- Small teams vs large teams
Disparities don't come from nowhere, and if our data is significantly skewed from Google's internal workflows then that raises interesting questions in itself. For example, are small teams worse at accommodating women? Is Google's open source culture particularly male-dominated?
It should also be noted that the disparities we are looking at are not the same thing as employee performance. In our data "contribution" is a term from GitHub's API, and may be a poor proxy for an employee's actual contribution to the company (although again, a gender skew between GitHub contributions and overall performance would be interesting in itself).
Overall, three important conclusions appear to be warranted by this investigation. First, that female participation in Google's GitHub org is, for whatever reason, much lower than expected. Secondly, that for understanding gender disparities at Google, the diversity reports and press materials fall somewhere between "insufficient" and "misleading". While they probably don't give false information, anybody who takes them at face value will be very surprised by findings such as the gender disparity in their GitHub users. Finally, this investigation demonstrates that interesting gender disparities can be found by extracting data from engineering systems, and that open source provides an opportunity for interested third parties to conduct research.
[^1]: [Google Diversity Annual Report 2018](https://static.googleusercontent.com/media/diversity.google/en//static/pdf/Google_Diversity_annual_report_2018.pdf), page 17
[^2]: [JAMES DAMORE vs. GOOGLE, LLC Case #18CV321529](https://www.dhillonlaw.com/wp-content/uploads/2018/04/20180418-Damore-et-al.-v.-Google-FAC_Endorsed.pdf), page 62
[^3]: [Microsoft Workforce Demographic Report 2018](https://www.microsoft.com/en-us/diversity/inside-microsoft/default.aspx)
[^4]: [Facebook Diversity Update 2018](https://www.facebook.com/careers/diversity-report)
[^5]: [Driving Diversity at Facebook](https://newsroom.fb.com/news/2015/06/driving-diversity-at-facebook/)
[^6]: [Facebook's EEO report](https://www.facebook.com/careers/pdf/diversity-report)
[^7]: [Google's EEO report](https://diversity.google/static/pdf/Alphabet-Consolidated-EEO-1-Report-2017.pdf)
[^8]: [Microsoft's EEO report](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE2Hh94)
[^9]: Note: it’s important to do (2) before (3) and (4), to minimize the possibility of researcher bias when determining gender.
[^10]: It would also be fascinating to compare other groups, e.g. do furries create more bugs than basketball players?
[^11]: Method for collecting the data:
    1. Scrape the members list from Google’s GitHub org
    2. For each member account, attempt to determine gender[^9]:
        - Try determination from account name and photo
        - Try determination from contact details, personal website, email address, username reuse
        - Attempt to match all possibly ambiguous names to LinkedIn and social media accounts
        - For data deficient names, record “Unknown”
        - Go with the stated gender if one exists; if the member identifies as a gender other than male or female, record “Unknown”
    3. Scrape GitHub’s contribution statistics for every repository owned by the organization
    4. For each member, additionally scrape their personal account statistics
[^12]: Excluding contributions for the repository [capsicum-linux](https://github.com/google/capsicum-linux), as these were too large for the API
[^13]: Predictions recorded before running this experiment:
    - Only 8% of contributors will be female
    - Females will on average have 75% as many contributions as males