# Fairlearn: Candidate screening
[TOC]
:::warning
**[Sociotechnical context, Candidate screening](https://hackmd.io/x9Q3o-EVTbC-gO1EkPOC2w#Candidate-screening)** was split out into a separate document because this one got too large for hackmd. It covers:
a. Narratives and lived experiences
b. Law
c. Psychology
d. Machine learning
e. Business
f. Economics
g. Australia, new graduate employment
h. Other perspectives
:::
---
# Recap

## Seed scenarios
We brainstormed a bunch of [seed scenarios](https://hackmd.io/gD8dwdPSRsqH3BxBt9terg?both) and discussed them a bit as a group. Then we voted on which ones to try to turn into an example notebook, following the [notebook guidelines](https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html). The group picked four top scenarios. Kevin worked through tax fraud and predictive healthcare a bit as well, but picked candidate screening to start with. Other folks are welcome to start their own too! :)
:::spoiler more background...
We're following [this process](https://hackmd.io/gD8dwdPSRsqH3BxBt9terg?both#Next-steps).
#### What were the top four scenarios the group voted on?
```
1. Identifying potential tax fraud
2. Debit card fraud investigation
4. Candidate screening
22. Predicting who needs special attention in healthcare
```
#### Why candidate screening, or pymetrics, in particular?
1) open source package with implementation guidelines
2) case studies on several companies they've worked with
3) responsive on open source
4) clear path to impact if the fairlearn library is able to find places where it actually provides value
5) high bar since open source is tied up in their branding
> kevin: For me, this is really tied up with the [theory of change](https://hackmd.io/EBh01XPtRZGHEg76oD1rGw?both) for the work, and about trying to help us as a group validate some of the latent assumptions about how Fairlearn can impact the world. Since pymetrics has publicly shared a bunch of information, and they have an open source library for auditing, they seemed like they might provide a high bar for us in the way that would be a helpful anchor for nudging us towards productive work. I'm not really sure if that will work out, but that's my hope! :)
#### Were those even the "right" scenarios anyway?
> kevin: my vote is that it's important to go deeper first, rather than for coverage. i think a core challenge the team is working through is what it even means to engage with fairness as sociotechnical work (a la Selbst et al. 2019). i have an idea of what that means and looks like, but for other folks on the team, some of the concepts are still pretty new. even doing work like user-centered design, any kind of UX or UI work or requirements or user stories or roadmaps, these are outside the experience and comfort level for some folks on the team. when we first started the thursday calls, MS folks also got blocked on communicating with customers in ways that they couldn't talk about externally. if anyone is able to move forward that work in a different way, and connect with real users, that seems great! but my intuition is that without access to real users, folks can contribute a lot more by picking one scenario they find interesting or compelling (feel free to make your own!) and going deep enough into it that you can turn it into sociotechnical "talking points" or an example notebook following these guidelines (https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html). i've collaborated with other folks on getting started in that work, and happy to do that with anyone else too!
#### Wait, why are we doing this again?
> kevin: the goal here is really to build a shared understanding and context for people to collaborate at all. in earlier discussions, people were making wildly different assumptions and there was a lot of talking past each other :) so the seed scenarios, and getting into the specifics of pymetrics and ANZ is an attempt to ground us in something concrete where we can actually collaborate and work through what's involved.
> one goal for the work is that it translates into an example notebook that demonstrates the value of the current fairlearn python code in assessing or reducing real harms to real people (https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html). this is particularly important to some MS stakeholders. but i also hope in the process we build up the shared language and ways of collaborating about fairness too. so that means building shared understanding of what the items in (https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html) mean and look like, getting better at communicating across different kinds of perspectives, expertise, and experiences (eg, https://hackmd.io/EBh01XPtRZGHEg76oD1rGw#Data-scientists%E2%80%99-opportunities-to-influence). and trying to align the project towards practitioner needs and real problems (eg, https://github.com/fairlearn/fairlearn/pull/500#discussion_r443040820). more tangibly, i hope that as we start reworking the website, all of those things can find a home to be published on fairlearn.org.
If you want to read more background and history, [start here](https://hackmd.io/gD8dwdPSRsqH3BxBt9terg?both#Historical-notes-and-links).
:::
## Collaboration tools
This chart is for helping talk about different ways people might engage with fairness. Translating between different perspectives and levels is critical when approaching fairness as sociotechnical work.
[chart (Google Drive)](https://drive.google.com/file/d/1leJHwmi4LLYl4CQVSGN4ob18IJvnBrAv/view)
- [hackpad: Theory of change](https://hackmd.io/EBh01XPtRZGHEg76oD1rGw#Data-scientists%E2%80%99-opportunities-to-influence)
## Other potential outputs of this work
One way to think of this is as a "design sprint for sociotechnical fairness." The sociotechnical work, not the code, is the main bottleneck in shipping an example notebook. So here are some other potential outputs of this work:
- hackpad: [Sociotechnical “talking points"](https://hackmd.io/nDiDafJ6TMKi2cYDHnujtA)
- blog-length posts on fairlearn.org
- new functions, methods, code and features!
:::spoiler Flowchart showing where this work could go...
The short version, focused on fairlearn example notebooks:

And how this fits in the bigger picture of aiming to reframe the work in the project:

:::
## Dogfooding!
You can also think of the work to make an example notebook as "[dogfooding](https://en.wikipedia.org/wiki/Eating_your_own_dog_food)" our own product :)

---
# Candidate screening
We started from this seed initially, but have worked through it to make it more specific and concrete below.
> A potential client asks you if ML can help with predicting job candidates' suitability for jobs based on a combination of personality tests and body language [Raghavan et al. (2019)](https://arxiv.org/pdf/1906.09208.pdf)
## 1. Introductions!
- Meet [Priyanka](https://www.ted.com/talks/priyanka_jain_how_to_make_applying_for_jobs_less_painful/up-next), [Lewis](https://github.com/ljbaker) and [Kelly](https://www.linkedin.com/in/kelly-trindel/). They work for pymetrics, a company that does candidate screening. Priyanka is Head of Product, Lewis is Director of Data Science, and Kelly is Head of Policy and I/O Science (she worked at the EEOC for 7 years before that).
<div>
<img height="130" src="https://pe.tedcdn.com/images/ted/a7615dba4a0a0ae90f54488adbc4b71fb6ca72d6_254x191.jpg" />
<img height="130" src="https://i.postimg.cc/FKkLVB2L/0.jpg" />
<img height="130" src="https://i.imgur.com/FTxzaCK.png" />
</div>
- Meet [Cholena](https://www.linkedin.com/in/cholenaorr/), she works at an Australian financial company called ANZ. Cholena's collaborated with Priyanka and Lewis at pymetrics to improve how her company recruits and interviews recent graduates.
<img height="150" src="https://i.postimg.cc/DyY0pcQd/cholena.jpg" />
- Imagine that during our meeting today, we're having a discussion about how we might collaborate with them, and that afterward I'll synthesize what we discuss and send them an email.
:::spoiler more on how people talk about pymetrics tests...
https://www.quora.com/Has-someone-tried-Pymetrics-How-true-is-it-for-you?share=1
:::
:::spoiler more on Kelly and IO science...
From [testimony to EEOC in 2016]( https://www.eeoc.gov/meetings/meeting-october-13-2016-big-data-workplace-examining-implications-equal-employment/trindel%2C%20phd), her former employer:
Understands concerns, this is not new to her:
> The primary concern is that employers may not be thinking about big data algorithms in the same way that they've thought about more traditional selection devices and employment decision strategies in the past. Many well-meaning employers wish to minimize the effect of individual decision-maker bias, and as such might feel better served by an algorithm that seems to maintain no such human imperfections. Employers must bear in mind that these algorithms are built on previous worker characteristics and outcomes. These statistical models are nothing without the training data that is fed to them, and within that, the definition of 'success' input by the programmer. It is the experience of previous employees and decision-makers that is the source of that training data, so in effect the algorithm is a high-tech way of replicating past behavior at the firm or firms used to create the dataset. If past decisions were discriminatory or otherwise biased, or even just limited to particular types of workers, then the algorithm will recommend replicating that discriminatory or biased behavior.
Specifically about tech industry doing this, and representation problems:
> As an example of the type of EEO problems that could arise with the use of these algorithms, imagine that a Silicon Valley tech company wished to utilize an algorithm to assist in hiring new employees who 'fit the culture' of the firm. The culture of the organization is likely to be defined based on the behavior of the employees that already work there, and the reactions and responses of their supervisors and managers. If the organization is staffed primarily by young, single, White or Asian-American male employees, then a particular type of profile, friendly to that demographic, will emerge as 'successful.'
And awareness of disabilities:
> The use of big data algorithms could also potentially disadvantage people with disabilities.
And that this is all correlational:
> Finally, it merits mention that the relationships among variables that are uncovered by advanced algorithms seem, at this point, exclusively correlational in nature. No one argues that the distance an employee lives from work, or her affinity for curly french fries, the websites she visits, or her likelihood to shop at a particular store, makes her a better or worse employee...
>
> It would seem to behoove the employer or vendor uncovering this relationship to do some additional, theory-driven research to understand its true nature rather than to stop there and take distance from work into account when making future employment decisions. This is true not only because making selections based on an algorithm that includes distance from work, or some other proxy representing geography, is likely to affect people differently based on their race but also because it is simply an uninformed decision.
And that a causal relationship with work quality would be more meaningful:
> It is an uninformed decision that has real impact on real people. Rather, perhaps selecting on some variable that is causally related to work quality, in conjunction with offering flexible work arrangement options, might represent both better business and equal opportunity for workers. Thank you.
Here's how she [frames the work at pymetrics](http://gbaworkshop.tntlab.org/wp-content/uploads/2019/08/Conference-Agenda.pdf) in a workshop:
> Historically, psychological assessment for employment selection is challenging and has left significant room for improvement. Test-takers can intuit the intent behind survey items, and can cheat on objective questionnaires. Many assessments are directional and general, meaning that the definition of “success” is the same for all people, for all roles. In addition, many assessments suffer from fairness issues, disadvantaging candidates based on their demographic background. Here, we present findings from the first five years of pymetrics’ implementations assessing over 1 million people. We illustrate the benefits of game-based assessments that wed decades of empirical research with modern machine learning techniques to create a custom assessment approach optimized for fairness and validity. We showcase methods for testing criterion-related validity and fairness estimation and remediation before an assessment even goes live. Finally, we share case studies on the results of game-based ML technology as relevant to real-world job candidates.
:::
:::spoiler Notes...
- (Lewis is very responsive [on GitHub](https://github.com/pymetrics/audit-ai/issues/30), and has worked to open source their [audit-ai](https://github.com/pymetrics/audit-ai) library, with [examples](https://github.com/pymetrics/audit-ai/tree/master/examples), [implementation suggestions](https://github.com/pymetrics/audit-ai/blob/master/examples/implementation_suggestions.md), and a whitepaper, [Removing Bias from Hiring with Artificial Intelligence](http://go2.pymetrics.ai/l/863702/2020-06-08/gwtk3/863702/18240/Removing_Bias.pdf).)
- *hanna: i met Lewis at a park, and emailed with him a bit, so i know him. in chat: we might want to set up a meeting with the pymetrics folks*
- *solon in chat: i know the pymetrics folks pretty well and they tend to be quite open.*
- *hanna, in chat: does kelly trindell still work there? (solon: yes)*
:::
## 2. How does it work?
In the first phase of the interview process, candidates play 12 online games. Their responses are scored using an algorithmic process involving other datasets and machine learning (see more below). These scores are used to screen out candidates.

This is an example of one of the games, based on the Balloon Analogue Risk Task (BART). The game came out of cognitive science lab studies (see more below).
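To make "scored using an algorithmic process" a little more concrete, here's a toy sketch of turning raw BART-style game events into candidate-level features. It's purely illustrative: the event format, feature names, and aggregation are our assumptions, not pymetrics' actual scoring.
```python
# Toy sketch: aggregating BART-style game events into candidate features.
# The event format and feature names are invented for illustration; this is
# not pymetrics' actual scoring pipeline.
import pandas as pd

# One row per balloon per candidate: how many pumps, and whether it popped.
events = pd.DataFrame({
    "candidate_id": [1, 1, 1, 2, 2, 2],
    "pumps":        [12, 30, 8, 3, 5, 4],
    "popped":       [False, True, False, False, False, False],
})

def bart_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-balloon events into per-candidate features."""
    grouped = df.groupby("candidate_id")
    return pd.DataFrame({
        # 'Adjusted' mean pumps on non-popped balloons is the standard BART risk measure.
        "adjusted_mean_pumps": df[~df["popped"]].groupby("candidate_id")["pumps"].mean(),
        "pop_rate": grouped["popped"].mean(),
        "n_balloons": grouped.size(),
    })

print(bart_features(events))
```
Features like these would then feed the prediction model described in the notebook drafts below.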
## 3. Sociotechnical context
:::warning
See **[Sociotechnical context, Candidate screening](https://hackmd.io/x9Q3o-EVTbC-gO1EkPOC2w#Candidate-screening)**, which was split out because the document got too large for hackmd.
a. Narratives and lived experiences
b. Law
c. Psychology
d. Machine learning
e. Business
f. Economics
g. Australia, new graduate employment
h. Other perspectives
:::
## 4. ANZ college recruiting
[Cholena](https://www.linkedin.com/in/cholenaorr/) works at an Australian financial company called ANZ. She's working on the new graduate hiring program, which has **4x more roles to fill** than in previous years.
<img height="150" src="https://i.postimg.cc/DyY0pcQd/cholena.jpg" />
:::spoiler background on ANZ
- [WGEA gender equity report](https://www.anz.com.au/content/dam/anzcomau/documents/pdf/aboutus/wgea-2020-public-report.pdf)
- [Cultural diversity, reconciliation plan](https://www.anz.com.au/about-us/sustainability/workplace-participation-diversity/cultural-diversity/)
- [ESG, Workplace Participation & Diversity](https://www.anz.com.au/about-us/sustainability/workplace-participation-diversity/)
- In 2016, the new CEO cited "culture" as a challenge, after multiple media reports of sexism and racism, and lawsuits against ANZ in the NYC offices ([wikipedia](https://en.m.wikipedia.org/wiki/Australia_and_New_Zealand_Banking_Group)).
:::
:::spoiler key references for scenario
- Start with: [Case study](https://www.pymetrics.ai/case-studies/anz-case-study)
- article: [AI technology to drive a better graduate recruitment experience](https://atchub.net/candidate-experience/how-anz-embraced-ethical-ai-technology-to-drive-a-better-graduate-recruitment-experience/)
- more details, often conflicting, re: [skills required](https://www.anz.co.nz/careers/), new graduate [process](https://www.anz.co.nz/careers/apply/), [new graduate program overview](https://www.anz.com.au/about-us/careers/programs/graduates/)
- See also [**detailed walkthrough**](https://www.graduatesfirst.com/anz-assessment-tests) from graduatesfirst
- Let's take this [PDF on new graduate programs](https://www.anz.co.nz/content/dam/anzconz/documents/careers/NZ-graduate-programmes.pdf) as the most authoritative truth when other references conflict.
:::
### a. Example job posting
This is for a role on the **Digital & Transformation** team (from [PDF](https://www.anz.co.nz/content/dam/anzconz/documents/careers/NZ-graduate-programmes.pdf)).
- **Mentoring**: The company provides a significant amount of mentoring, training and support for these roles ("developing future leaders").
- **New graduates only**: For an 18-month term.
<img height="400" src="https://i.imgur.com/hCWPR1T.png" />
<details><summary>more quotes...</summary>
What type of things might you do as a Digital & Transformation Graduate?
> We are set up around customer journeys; home owners, business owners and every day bankers and we have a strong focus on delivering to their needs. For example, you’ll work on delivering new features for our digital channels, the main ones being internet banking, goMoney and our staff digital tool, Banker Workbench.
Quotes from Cholena ([article](https://news.efinancialcareers.com/uk-en/3000372/anz-has-stopped-using-resumes-in-its-graduate-recruitment-process-here-s-what-anz-is-doing-instead)):
> “It comes down to five things: acting with integrity; collaborating across our business; taking full accountability for your work and the results of your work; being respectful to customers, colleagues and even our competitors; and striving for excellence,” Orr says.
> “We refer to our values as your ticket to play. If a graduate doesn’t demonstrate those values through the process that is a select out.”
</details>
### b. Screening process (after adding the games step)
Candidates are screened out at each phase.
1. **Games**: Play 12 games online; no resumes or cover letters accepted. *This is the new step added in partnership with pymetrics*.
2. **Personality questionnaire**: Answer online questions about how you prefer to work.
3. **Video answers to prompts**: Behavioral interview questions.
4. **In-person interview**: Behavioral interview questions and role-plays.
Cholena worked with pymetrics to add the new Games step up front.
:::spoiler show an example game again...

https://www.youtube.com/watch?v=Aimj2wNHNA8
:::
:::spoiler clap at the red circle (impulsivity, attention)...
[video](https://youtu.be/9fF1FDLyEmM?t=758)
:::
:::spoiler more on traits...
Cognitive:
- Attention duration
- Processing consistency
Social:
- Fairness
Emotional:
- ???
Julie: "the games themselves could have some cultural bias but we try our best to remove it after we get the data and before we start modeling... there's a game that shows a bunch of facial expressions of Caucasian people... we're fully aware of the differences in culture in how they perceive the stimuli and try to account for that."
Others:
- Delayed Gratification
- Attention
- Learning
- Complex Money Exchange
- Distraction
- Flexibility
- Risk
- Processing
- Planning
- Simple Money Exchange
- Effort
- Emotion
- Flexibility Change Speed
- Memory
List:
- Memory Span
- Emotion Identification from Faces
- Attention Duration
- Trust
:::
### c. Ways Cholena talks about success
- **4x increase in openings**. This was the impetus for changing the hiring process.
- **First step cut to about a third of the time**. For applicants, the time to complete the first phase dropped from about 120 minutes to about 40 minutes.
- **48% increase in applicants**. While the overall time to complete the whole interview process increased by adding this step, reducing the commitment at the first step increased the applicant yield significantly.
- **4,000 applicants**. Supported wider outreach, since the system could process more candidates.
- **67% screened out in first step**. This cut the number of recruiter screenings by 50 percent, equivalent to 275 hours or seven weeks of a single recruiter’s time saved.
- **2 weeks shorter time-to-hire**. By screening out more candidates in the first step, recruiting and HR could complete the overall process faster for candidates they selected.
- **95% satisfaction with the process.** Unstated how this was asked or when.
- **Diversity talking points**. "achieved gender diversity" and "three percent [of hires] reported experiencing some kind of disability." And an "11 percent increase in the number of offers to candidates from lower socio-economic universities," as well as "levelling the playing field for culturally diverse candidates; particularly Aboriginal and Torres Strait graduates."
:::success
Reducing hiring costs is key.
:::
:::spoiler quotes...
Note that some of this information may seem conflicting or contradictory, since we don't have domain experience or know much about the situation. But we can see what kinds of outcomes matter, what kinds of things are pointed to as success (even if these are different than the internal pressures driving the change).
7/20: Expanding sales in ANZ region (https://www.pymetrics.ai/pygest/welcoming-paul-bridgewater-apac-sales-lead)
> This year ANZ is increasing its graduate recruitment for technology roles, and Orr says it is looking to hire four-times the number of graduates in this area in its next cohort compared to previous years. ([article](https://news.efinancialcareers.com/uk-en/3000372/anz-has-stopped-using-resumes-in-its-graduate-recruitment-process-here-s-what-anz-is-doing-instead))
Before:
> Previously, candidates would spend around two hours on the application process which included uploading their resume, submitting responses and completing a personality profile. This process has now been replaced by the 39-minute application which consists entirely of a series of carefully considered online exercises.
Some metrics they talk about:
> Although the platform drove a 48 percent increase in applications and supported a wider outreach to over 4,000 candidates, it resulted in significant time saving as it reduced the applicant pool by 67 percent. In turn, this reduced the number of recruiter screenings by 50 percent, equivalent to 275 hours or seven weeks of a single recruiter’s time saved.
>
> The offer to acceptance rate in 2020 increased by six percent to 92 percent. Importantly, it also reduced the time to hire by two weeks, which meant ANZ was able to get offers to candidates sooner and minimise the risk of losing a candidate to rival offers due to a lengthy process.
The way Cholena talks about success:
>“We certainly don’t have a cookie cutter cohort; our graduates are all culturally and economically different. We have also achieved gender diversity and three percent reported experiencing some kind of disability,” explained Cholena.
>
> Before ANZ introduced the new AI-based hiring practices, it had also predominantly hired from a certain class of universities, namely those where degrees cost the most. The abandonment of CVs also eliminated the type of bias that certain educational backgrounds or names might elicit, as well as levelling the playing field for culturally diverse candidates; particularly Aboriginal and Torres Strait graduates.
>
>Now, applicants are drawn more widely with an **11 percent increase in the number of offers to candidates from lower socio-economic universities**. This has enabled ANZ to attract a different type of candidate whilst maintaining the integrity of their purpose in recruiting adaptive graduates who can develop the critical thinking and skills needed to ensure the future success of the organisation.
And how the article describes the candidate experience:
> “We are now attracting candidates who may not have previously thought of ANZ as an employer of choice, as they have heard from peers that the recruitment process is both fun and efficient."
> Satisfaction levels with the process stand at 95 percent.
:::
:::spoiler Notes from 8/20...
#### Pre-discussion notes:
- see gitter.im/fairlearn/community for amazing discussion from Lisa and Roman
- Roman: hiring (screening) --> what are the goals? How are they measured (e.g. what is diversity?) and prioritized (e.g. is high-quality more important than diversity?)
- more diversity of talent (with respect to which dimensions?)
- process too lengthy (losing candidates)
- save recruiter hours
- high-quality hires
- avoiding lawsuits
:::
## 5. Notebook drafts
These diagrams may be helpful for visualizing what's in the code in the notebooks. They're primarily focused on the data-generating processes that were chosen, but see [Martin et al. (2020)](https://arxiv.org/pdf/2006.09663.pdf) for more on extended modeling to avoid framing traps.
:::spoiler optional: clip from "framing trap" interactive
This is in a different context (college admissions), but aims to visualize the core framing trap of learning from 'incumbents.' This may be a helpful starting point if you are new to noticing abstraction traps, before you dive into the complexity of what's happening in the scenario with Cholena.
<video width="100%" controls src="https://i.imgur.com/sGzij4q.mp4"></video>
This work is based on open-source work by Adam Pearce.
:::
#### a. When deployed (after the system is built)
After the system is built, this is where it fits into the deployed interview process.
```mermaid
graph LR
subgraph ANZ new graduate interviews
applicants -- apply --> G
G[(1. Games)] --> model((prediction model))
P[2. Personality] --> V
V[3. Video prompts] --> I
bye[rejected]
model
end
I[4. Role plays] --> offer
offer -- accepted --> hired
model --> P
model -- 67% screened out --> bye
P --> bye
V --> bye
I --> bye
classDef pos fill:#ff8358ff,color:white,stroke:#ff8358ff,stroke-width:12px;
classDef neg fill:#0e99d1ff,color:white,stroke:#0e99d1ff,stroke-width:12px;
classDef green fill:#6aa84fff,stroke:white,color:white,stroke-width:1px;
classDef purple fill:#1d61d5ff,color:white,stroke:#1d61d5ff,stroke-width:12px;
classDef red fill:darkred,color:white;
classDef hired fill:#333,color:white;
class hired hired;
class bye red;
class G green;
class G2 green;
class y=0 neg;
class y=1 pos;
class dataset purple;
class model purple;
```
#### b. Constructing datasets (before deployment)
This is the flow of data and labeling, reflecting some of the design choices in how the dataset and model are constructed:
```mermaid
graph LR
subgraph ANZ past interview practices
applicants -- apply --> interview>old interview process]
interview --> rejected
end
interview --> offer
offer -- accepted --> hired
hired --> workers
subgraph ANZ current new graduate workers
workers -- labeled --> TP['top performer']
workers-- labeled --> NTP[not 'top performer']
workers --> resigned
end
TP --> G[(Games)]
G --> y=1[target, y=1] --> dataset[(dataset)]
G2 -- select subset --> y=0[baseline, y=0] --> dataset
subgraph other pymetrics clients
OP[other people] -- apply --> O[other roles]
OP -- apply --> O1[other companies] --> G2
OP -- apply --> O2[other industries] --> G2
O --> G2[(Games)]
end
dataset -- fit, etc. --> model((prediction model))
classDef pos fill:#ff8358ff,color:white,stroke:#ff8358ff,stroke-width:12px;
classDef neg fill:#0e99d1ff,color:white,stroke:#0e99d1ff,stroke-width:12px;
classDef green fill:#6aa84fff,stroke:white,color:white,stroke-width:1px;
classDef purple fill:#1d61d5ff,color:white,stroke:#1d61d5ff,stroke-width:12px;
classDef red fill:darkred,color:white;
classDef hired fill:#333,color:white;
class hired hired;
class rejected red;
class G green;
class G2 green;
class y=0 neg;
class y=1 pos;
class dataset purple;
class model purple;
```
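To ground diagram (b), here's a minimal, fully synthetic sketch of that data-generating process: game data from incumbents labeled 'top performer' becomes `y=1`, a selected subset of game data from other pymetrics clients becomes the `y=0` baseline, and a model is fit to the combined dataset. Feature names, group sizes, and distributions are invented assumptions for illustration only.
```python
# Minimal synthetic sketch of the dataset construction in diagram (b).
# All feature names, group sizes, and distributions are assumptions for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
GAME_FEATURES = ["attention", "risk", "memory", "effort"]  # stand-ins for game-derived traits

def sample_games(n, shift=0.0):
    """Game-trait scores for n people; `shift` nudges the incumbent distribution."""
    return pd.DataFrame(rng.normal(loc=shift, size=(n, len(GAME_FEATURES))),
                        columns=GAME_FEATURES)

# y=1: current ANZ workers labeled 'top performer' who played the games.
incumbents = sample_games(n=60, shift=0.3)
incumbents["y"] = 1

# y=0: a selected subset of applicants to similar roles at other pymetrics clients.
baseline = sample_games(n=600)
baseline["y"] = 0

# Note: the two classes come from different processes and populations; this is not a
# random sample of any applicant pool, and there is no ground-truth label for who
# would actually perform well at ANZ.
dataset = pd.concat([incumbents, baseline], ignore_index=True)

model = LogisticRegression().fit(dataset[GAME_FEATURES], dataset["y"])
print("train accuracy:", model.score(dataset[GAME_FEATURES], dataset["y"]))
```
Even this toy version makes the framing choices visible: the positive class is defined by past labeling decisions inside ANZ, and the negative class comes from a different population entirely.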
#### c. Notebook drafts
[colab: fairlearn-hiring-anz ](https://colab.research.google.com/drive/1jQsKKfIYPYk5TsHHx6bQJUuZ2NuPx-vJ#scrollTo=K9SWu7yckiiH)
##### Data collection
1. First-phase screening is blinded to sensitive attributes, as is the baseline dataset. The only difference is `is_top_performer`.

2. The baseline dataset is chosen from other applicants to similar roles in similar places (eg, applicants to new graduate roles in the country). There's selection bias here, but it's the best estimate.
##### Test development
3. A model is fit to that data set.
4. The "debias set" is a held-out subset of the baseline set with sensitive attributes. To check for bias, the fit model makes predictions against the "debias set."

5. These predictions are run against various statistical tests (eg, a chi-squared test). This is different from just comparing metrics, and is done as part of US regulatory guidance. If these tests fail, the *features for the model* are adjusted until the fit model passes the tests.

6. The model has been de-biased. Note that there is no access to 'ground truth' in the process, and we are merging data collected from different processes; there is no "sampling" in the statistical sense occurring anywhere in this process.
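Here's a hedged sketch of what steps 4-6 could look like in code: score a held-out "debias set" that carries sensitive attributes, run a chi-squared test on pass rates by group plus a 4/5ths-style adverse-impact ratio, and adjust the model's features and refit until the checks pass. This is our reconstruction for the notebook, not pymetrics' audit-ai implementation; column names, group labels, and thresholds are assumptions.
```python
# Sketch of the pre-deployment check in steps 4-6 (our reconstruction, not audit-ai itself).
# Column names, group labels, and thresholds are assumptions for illustration.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def audit_pass_rates(passed: pd.Series, group: pd.Series, alpha=0.05, min_ratio=0.8):
    """Chi-squared test of pass/fail counts by group, plus the 4/5ths adverse-impact ratio."""
    table = pd.crosstab(group, passed)                  # groups x {False, True}
    _, p_value, _, _ = chi2_contingency(table)
    rates = passed.groupby(group).mean()
    impact_ratio = rates.min() / rates.max()            # 4/5ths rule: lowest rate vs. highest
    ok = (p_value > alpha) and (impact_ratio >= min_ratio)
    return ok, p_value, impact_ratio

# Demo on a synthetic 'debias set': model predictions (pass/fail) plus a sensitive attribute.
rng = np.random.default_rng(1)
group = pd.Series(rng.choice(["group_a", "group_b"], size=500))
passed = pd.Series(rng.random(500) < np.where(group == "group_a", 0.40, 0.30))

ok, p, ratio = audit_pass_rates(passed, group)
print(f"passes audit: {ok} (p={p:.3f}, impact ratio={ratio:.2f})")
# In steps 5-6, if this fails, the *features* of the model are adjusted and it is refit,
# repeating until the checks pass -- note that no ground-truth outcome is ever involved.
```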
##### Diagram
<img src="https://i.imgur.com/WeW5J8P.png" height="300" />
##### In simulations, what is meaningful for fairness?
Noticing distributions of scores by gender, and by (age, gender):
<div>
<img width="45%" src="https://i.imgur.com/iZIOHJW.png" />
<img width="45%" src="https://i.imgur.com/ldwzWfL.png" />
</div>
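On simulated data, plots like the ones above only take a few lines. The column names (`score`, `gender`, `age_band`) are assumptions about how the notebook's dataframe is laid out.
```python
# Sketch of the grouped distribution plots above, on simulated data.
# Column names (score, gender, age_band) are assumptions about the notebook's dataframe.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "score": rng.normal(size=1000),
    "gender": rng.choice(["women", "men"], size=1000),
    "age_band": rng.choice(["<25", "25+"], size=1000),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)

# Distribution of scores by gender.
for g, sub in df.groupby("gender"):
    axes[0].hist(sub["score"], bins=30, alpha=0.5, density=True, label=g)
axes[0].set_title("score by gender")
axes[0].legend()

# Distribution of scores by (age_band, gender).
for (a, g), sub in df.groupby(["age_band", "gender"]):
    axes[1].hist(sub["score"], bins=30, alpha=0.4, density=True, label=f"{a}, {g}")
axes[1].set_title("score by (age, gender)")
axes[1].legend()
plt.show()
```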
:::info
If you want to collaborate on notebooks, please review the earlier sections of this document, the *technical references* below, and then come chat in [gitter!](https://gitter.im/fairlearn/community)
:::
:::spoiler real example data...

from https://www.pymetrics.com/docs/integrations/webservice/v3
:::
:::spoiler technical references...
See [implementation suggestions](https://github.com/pymetrics/audit-ai/blob/master/examples/implementation_suggestions.md) and [GitHub discussion](https://github.com/pymetrics/audit-ai/issues/30).
:::
:::spoiler Notes from 8/4...
Walk through what happens:
1. First, the company provides the "incumbent set", assessment data of their "top performers."
2. Then, pymetrics compares the "incumbent set" against their "baseline set," the assessment data from the pool of new applicants.
3. Next, they run audit-ai on a "debias set," likely drawn from some particular subsample of pymetrics' data, chosen based on the particular job role.
4. Pymetrics does [pre-deployment auditing and pre-deployment debiasing](https://github.com/pymetrics/audit-ai/blob/master/examples/implementation_suggestions.md#pre-deployment-auditing), and possible EEOC 4/5ths rule compliance checks.
5. Finally, each applicant gets a score (or they are ranked, unclear), and the company may use that to skip steps in the interview process, or filter people out right away.
6. They also do post-deployment validation across protected groups.
- hanna: what is the debias set? how do we operationalize this? see https://github.com/pymetrics/audit-ai/blob/master/examples/implementation_suggestions.md#pre-deployment-auditing. Miro assumes that the purpose of the "reference data set" is to turn a one-class classification problem into binary classification/regression
- roman: 4/5ths rule, is that for screening or who gets hired? should i think about that more broadly? kevin: depends on regulatory context (Australian in this example); the 4/5ths rule is from the EEOC in the US context.
- solon: hiring can violate 4/5ths rule, but pymetrics by design doesn't have anything to say about that.
- miro: related to linkedin rankings, maybe different features
- hanna in chat: they force the model to satisfy the 4/5ths rule
- solon in chat: but it's not clear *how* they do that
- abbey in chat: their mission is very ethical AI focus, is that why this company is chosen?
:::
## 6. Opportunities to engage
#### a. Real harms to real people
Where should we focus, and how can we express these in human terms, in data science terms, and in business terms? *eg, health, mental health, earnings, debt, quality of life, happiness*
See above background on displacement in evaluating affirmative action policy choices, eg:
> We find that students who gain access to the University of Texas at Austin see increases in college enrollment and graduation with some evidence of positive earnings gains 7-9 years after college. In contrast, students who lose access do not see declines in overall college enrollment, graduation, or earnings.
Notes:
- practice: de-abstracting, humanizing - if this is the work, visuals and UX have to be core
- research: influence of presentation choices, what would shift power to people harmed?
- research: "differential harm" as in police stops, affirmative action policy, etc.
:::info
If you are interested in helping express harms to new graduates in the scenario, or differential harms to different groups of new graduates, come chat in [gitter!](https://gitter.im/fairlearn/community)
:::
:::spoiler notes...
see "sociotechnical" slides deck for images and concrete stories and examples
eg https://www.cell.com/patterns/fulltext/S2666-3899(20)30086-6
:::
#### b. Using fairlearn 0.4.6
Keep in mind the [contributor guidelines](https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html):
:::success

:::
Notes:
1. **ground truth**: Data for "ground truth" is never collected.
2. **sensitive attributes**: These aren't available for every dataset collected.
3. **small numbers**: The number of incumbents in the training set describing themselves as indigenous or having a disability is very small, possibly zero; there may not be a single such example in the training set.
4. **guarantees don't hold**: Given those small numbers, there are increased concerns about overfitting on noise and generalization error (eg, [#460](https://github.com/fairlearn/fairlearn/issues/460)).
5. **non-determinism**: This brings complexity to evaluation and monitoring, in a context where theoretical guarantees don't apply.
6. **legal risk**: Running statistical tests is part of US regulatory guidance for hiring; it's not enough to look at rates alone.
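Points 3-5 can be made tangible with a tiny simulation: when a subgroup has only a handful of members, any group metric computed on it (a selection rate, an error rate, a fairlearn disparity metric) swings wildly from sample to sample. A minimal sketch in plain numpy, with invented numbers:
```python
# Minimal sketch of points 3-5: group metrics computed on very small subgroups are unstable.
# All numbers are invented; the point is the spread, not the specific values.
import numpy as np

rng = np.random.default_rng(3)
true_pass_rate = 0.33          # suppose every group truly passes screening at the same rate
group_sizes = (2000, 8)        # e.g. a majority group vs. applicants reporting a disability

def observed_rates(n, n_trials=1000):
    """Selection rates you would actually observe for a group of size n, across resamples."""
    return rng.binomial(n, true_pass_rate, size=n_trials) / n

for n in group_sizes:
    rates = observed_rates(n)
    print(f"group size {n:4d}: observed selection rate, 5th-95th percentile = "
          f"[{np.percentile(rates, 5):.2f}, {np.percentile(rates, 95):.2f}]")
# With n=8 the observed rate routinely ranges from about 0.0 to above 0.6, so any disparity
# metric or parity constraint built on it is mostly noise -- and with zero members it's undefined.
```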
#### c. Co-design, HCI, and ethnographic work
:::spoiler show me six awesome papers!
- [Selbst et al., "Fairness and Abstraction in Sociotechnical Systems"](https://andrewselbst.files.wordpress.com/2019/10/selbst-et-al-fairness-and-abstraction-in-sociotechnical-systems.pdf)
- [arXiv:1812.05239](https://arxiv.org/pdf/1812.05239.pdf)
- [Madaio et al., AI fairness checklists](http://www.jennwv.com/papers/checklists.pdf)
- [Data & Society, "Owning Ethics"](https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf)
- [Passi and Sengers (2020)](https://journals.sagepub.com/doi/pdf/10.1177/2053951720939605)
- [Google Doc](https://docs.google.com/document/d/1E7XaHi80BNPZRlK4dmwGMtpEA9lzOeIlRZKnIWlBBd0/view#heading=h.ky7hcpdfxr62)
:::
#### d. Engaging at different levels of "closure"
Framing quote from [Passi and Sengers (2020)](https://journals.sagepub.com/doi/pdf/10.1177/2053951720939605):
> Practitioners and researchers struggle with different sets of constraining forces, but it is important to be reflexive and remember that both groups perceive and act on the world through constraints that shape what they believe to be good and possible... our call to work with data science practitioners [means] embarking on a difficult journey to learn more about the situated nature of data science practice and research, making visible the differences and similarities in our normative goals.
This is mostly translational work: how to take some feeling or belief at one level and translate it to another level in a way that speaks the language of data science, product management, or business.
This initial set of brainstorming ideas is framed in terms of the chart below. Many of these seek to describe a range of ways to pull on "value levers" (Madaio et al. 2020) that may be available in this specific context.

:::warning
This is more speculative and more personal, as a way to spark discussion.
:::spoiler images
(these are just nodding at examples, to widen the frame of discussion)
What current metaphors or mental models do common visuals encourage now?
1. "avoid overfitting" This is the Boston housing dataset, so the "predicted house price" is completely wrong to begin with. The visuals foreclose asking that question.

2. "close the gaps" This in on the UCI income dataset, but the visual focuses on the gap, not the fact that the accuracy of predicting "do they make over $50k" doesn't mean anything tied to the real world or a human decision process.

3. "find the gaps" This looks at the false positive and false negative rates for subgroups on the COMPAS dataset. But the visual doesn't help you grapple with the fact that research shows judges don't follow these recommendations, that the upstream data is biased, or that these models don't generalize geographically.

**so, what might other kinds of tools look like?**
(in the spirit of many other awesome existing work, but the twist here is trying to engage with the sociocultural aspect of the work, instead of punting on it)
4. "look at the data" this is the boston housing data set. The (lat, long) for the first data point is nowhere near Nahant - it's not even on the right side of the harbor. There are endless tutorials on this dataset but none mention this, and there are no tools make this problem visible and tangible.

5. "make noise and generalization error visual and tangible" This is a toy fork of FairVis, that adds in continual noise modeling judge's downstream decisions. And it makes visual and tangible the act of deploying in different locations. And shows the impact of just that noise alone on fairness metrics - they become immediately less reliable, consistent and worthy of trust (in a way that all the people in the system with domain knowledge already know :))

5. "make generalization visible and tangible" Where would we use the Ames dataset outside of Ames? The city is super unrepresentative at the state or national level - it's extra white and extra educated because of the university. Basic census data on things like this is easily available, why do no fairness packages even look at this?

6. "tools for staying grounded in real, personal examples" Joy's work on gendershades was so powerful in part because it was so personal. How can practitioners in teams use that method in their own work (not literally, since teams are so unrepresentative of the world and people impacted by ML). Here's an educational toy for citizens to explore facial recognition accuracy by pointing it on image search terms. People can understand and debate this in a way that they can't debate and understand AUC and model cards.

7. "humanize data" It's important to remember the influence of ProPublica's reporting and journalism. Journalists don't lead with datasets, they lead with the stories of people. Here's an educational toy showing "differences in rates" in a way that personalizes and expresses some aspect of emotion or affect. See how Deb Raji writes about COVID deaths.

8. "tools for de-abstracting" It's insane how much scholarship there is on COMPAS that washes away the people being imprisoned. At the same time, we have incredibly powerful technology like thispersondoesnotexist.com. What if we brought it to bear, and rendered CSVs in more human ways that were more accessible for more stakeholders?

8. "tools for bringing in sociocultural" This simulation looks at how removing sensitive attributes doesn't "solve" fairness. But it's totally nonsensical if you know about how college admissions work, and doesn't help show the core problem of generalization error (let alone the role of legacy admissions).

9. "tools that are situated and invite critiquing" joining data sources that are universal is almost always a portability trap. think of how lossy the compression is of the sociocultural complexity in these tools. what if there was a range of design choices for **what** sociocultural context to bring in, and guidance on **how to choose** that was rooted in sociocultural perspectives of existing injustice, but accessible to a wide range of stakeholders?

**...anyway...**
The bar is very low! Progress is very possible :)
1. There are no open communities doing this work
In some prominent open source data science and ML communities, discussions on fairness take up a millionth of the time and energy of even the most minute technical discussion. What if there was an open source community that was different?
2. There's opportunity in translational work

3. And just continuing to add metrics will actually leave practitioners worse off, if there's no guidance on how to connect them to real-world harms or impact.

**...thanks for reading... sorry this is rambling and brainstormy and not clearer! but that's part of making new things and doing design work :)**
:::
:::info
If you are interested in finding new opportunities, or prototyping some of the ideas here and presenting them, come chat in [gitter!](https://gitter.im/fairlearn/community)
:::
:::success
:::spoiler sketches of more ways to engage...
### Model performance
1. performance: sim showing impact of test-retest reliability on accuracy (see the sketch after this list)



2. performance: email with quotes pulled from BART literature (eg, group differences, sample selection problems)
3. distribution drift: sim the impact of this on
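For item 1 above, here's a minimal sketch of how test-retest reliability could be simulated: a true trait plus measurement noise, where reliability sets the share of score variance that is signal. All numbers are invented.
```python
# Sketch for item 1: lower test-retest reliability of a game score directly caps how well
# it can predict anything downstream. All numbers are invented.
import numpy as np

rng = np.random.default_rng(4)
n = 20000
true_trait = rng.normal(size=n)
is_top = true_trait > 1.0                          # pretend 'top performer' depends on the true trait

def noisy_measurement(reliability):
    """Observed game score with a given test-retest reliability (share of variance that is signal)."""
    noise_var = (1 - reliability) / reliability
    return true_trait + rng.normal(scale=np.sqrt(noise_var), size=n)

for reliability in (0.9, 0.7, 0.5):
    score = noisy_measurement(reliability)
    threshold = np.quantile(score, 1 - is_top.mean())   # screen in the same share as the base rate
    agreement = np.mean((score > threshold) == is_top)
    corr = np.corrcoef(score, true_trait)[0, 1]
    print(f"reliability={reliability:.1f}: corr with true trait={corr:.2f}, "
          f"agreement with 'top performer'={agreement:.2f}")
```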
### Fairness metrics
1. parity: sim impact of the 4/5ths rule in each of four stages of screening on the end-to-end process (see the sketch after this list)
2. intersectional subgroups: do this by indigenous tribe, ancestry, and gender
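For item 1 in this list, the arithmetic alone is telling: if each of four stages only just satisfies the 4/5ths rule, the end-to-end process falls well below it. A minimal sketch with invented stage-level selection rates:
```python
# Sketch for item 1: per-stage compliance with the 4/5ths rule does not imply
# end-to-end compliance. All selection rates are invented.
stage_rates_a = [0.30, 0.50, 0.60, 0.70]          # group A selection rate at each of four stages
stage_rates_b = [0.8 * r for r in stage_rates_a]  # group B exactly at the 4/5ths boundary per stage

def end_to_end(rates):
    """Probability of surviving all screening stages."""
    p = 1.0
    for r in rates:
        p *= r
    return p

ratio = end_to_end(stage_rates_b) / end_to_end(stage_rates_a)
print(f"per-stage impact ratio: 0.80, end-to-end impact ratio: {ratio:.2f}")  # 0.8**4 ~= 0.41
```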
### Organizational process & flows
1. upstream data quality: sim showing impact of test-retest reliability on same cohort
2. upstream data quality: sim showing validation correlation coefficients
3. systems dynamics: sim showing how outreach and recruitment influences applicant pool. include estimate for time saved in interviewing can be moved to recruiting events, and whether would help diversity
4. feedback loops: sim showing how if screening selects for scores on a particular game, labeling top performers in the next three years is more narrowly constrained within the distribution
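For item 4, here's a minimal sketch of the truncation that screening introduces: once applicants are selected on a game score, the pool that can later be labeled 'top performer' already has a narrower, shifted score distribution. Numbers are invented.
```python
# Sketch for item 4: screening on a game score truncates the pool from which future
# 'top performer' labels can come. Numbers are invented.
import numpy as np

rng = np.random.default_rng(5)
applicants = rng.normal(size=100_000)                                   # applicants' game scores
screened_in = applicants[applicants > np.quantile(applicants, 0.67)]    # 67% screened out

print(f"applicants:  mean={applicants.mean():.2f}, std={applicants.std():.2f}")
print(f"screened in: mean={screened_in.mean():.2f}, std={screened_in.std():.2f}")
# A model refit in a few years on 'top performers' drawn only from the screened-in group
# sees a narrower, shifted score distribution than the applicant pool it is applied to.
```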
### Abstractions, concepts, methods
1. measurement modeling: email about construct validity coefficient for BART and role play task scores
2. measurement modeling: sim how this compares to 10% rule, using ranking within college
3. normative reasoning: email proposal for moving video prompts up front to ensure that all candidates at least had a chance to show job-relevant strengths before being denied the opportunity to interview.
4. normative reasoning: email to ask corporate diversity and inclusion leaders for their thoughts on cognitive testing, including plaintiff quotes pulled from Griggs v. Duke Power; Williams v Ford; EEOC v Ford.
5. normative reasoning: email four-panel comic showing (1) experience of candidate with many talents and skills (2) talks to recruiter and excited to interview, (3) does the screening and (4) is rejected without ever being asked to express anything meaningful about themselves. connect to "being respectful to customers, colleagues and even our competitors,"
6. normative reasoning: write up one-page proposal for including cognitive testing screening for promotion decisions, and ask four colleagues to read and share feedback. pull out quotes and email that to HR as an interesting thought experiment.
7. contestability: show the 'optimal candidate' in screening;
8. normative reasoning: in predicting facial attractiveness ratings, "the ultimate goal is similar to the system we're trying to build. the labels don't actually have absolute truth, but there is high correlation across people's ratings of facial attractiveness..." ([video clip](https://youtu.be/BJH6eEWEP_0?t=570)), direct comparison of constructing features for facial attractiveness ratings and potential employees.
### Power
1. contestability: show all candidates posted job skills, ask how well they think interview got at them (do this in screening)
2. extraction: reframe as harm metrics
3. extraction: differential harm metrics (eg, zip code estimates?)
4. extraction: compare cost to hire versus college admissions
5. law: make sim showing overfitting on pure noise as 'criterion validity' (see the sketch after this list)
6. law: Victorian guidelines indicate cognitive testing can only be used if you "reasonably require them to determine whether a person will be able to perform the job." i'm concerned that by only testing the "top performers" we haven't shown that this is the case.
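For item 5 in this list, a minimal sketch of how a small incumbent set plus many game features can show impressive in-sample 'criterion validity' even when the features are pure noise. Everything here is synthetic.
```python
# Sketch for item 5: with a small incumbent set and many features, a model fit to pure
# noise can still show a large in-sample 'validity' correlation. Everything is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n_incumbents, n_features = 60, 30
X = rng.normal(size=(n_incumbents, n_features))      # game features: pure noise
y = rng.normal(size=n_incumbents)                    # 'performance' outcome: unrelated to X

model = LinearRegression().fit(X, y)
in_sample = np.corrcoef(model.predict(X), y)[0, 1]

X_new = rng.normal(size=(n_incumbents, n_features))  # fresh draws from the same noise process
y_new = rng.normal(size=n_incumbents)
out_of_sample = np.corrcoef(model.predict(X_new), y_new)[0, 1]

print(f"in-sample 'validity' correlation:     {in_sample:.2f}")     # large, by construction
print(f"out-of-sample 'validity' correlation: {out_of_sample:.2f}") # roughly zero
```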
:::
:::spoiler other notes...
[Passi and Sengers (2020)](https://journals.sagepub.com/doi/pdf/10.1177/2053951720939605):
> We must keep in mind that existing technologies that data science systems replace—even those that seem to have nothing to do with data science—also shape how practitioners envision what data science systems can or cannot, and should or should not, do.
:::
---
## (internal notes below)
:::spoiler show more...
### "notebook"
Approaches to group fairness from the literature are often not practical in applied settings (eg, [Partnership on AI 2020](https://www.partnershiponai.org/demographic-data/)).
> More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates’ suitability for jobs based on personality tests or body language and other characteristics in videos (Raghavan, Barocas, Kleinberg, and Levy, “Mitigating Bias in Algorithmic Employment Screening.”) There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don’t produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.
https://fairmlbook.org/testing.html
> Observational fairness criteria including demographic parity, error rate parity, and calibration have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions; parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities as well as the harms that result from them before deciding on appropriate interventions.
### 8/27: What to work on next?
- real harms in notebook
- rework format here to make connection more explicit with levels of "closure"? not sure if directionally will work - the current format does "background" then "concrete" and using the chart probably assumes way too much background on fairness. may just be useful for me https://docs.google.com/presentation/d/1ZppVDDN-lhdQdr8Azy9WZINOiJ8yOP4VYwWvWK7GZx8/edit?skip_itp2_check=true&pli=1#slide=id.g8e0b09fccd_0_151
- fairlearn metrics and plots, when no truth
- visual for showing
- express group differences in cognitive testing literature as noise in data-generating process, and show downstream impacts on model performance
### 8/20: What to work on next?
- ideas: express candidate opportunity cost in financial terms / lottery and sampling / recourse question / simulations based on normative assumptions
### Discussion with whole group
Here are some seed questions for discussion, but folks should bring up whatever they think is important:
- What do you notice? What sociotechnical context matters?
- What abstraction traps do you see, and what actionable steps could we suggest?
- What real harms are there?
- How could Fairlearn's Python package add value in reducing real harms?