# Fairlearn: Seed ideas for scenarios
*If you're trying to find something tangible to collaborate on or contribute, search for "next step"*
These scenarios are seed ideas for coming up with tangible and concrete deployment contexts where we can work through fairness questions.
These seed scenarios can help us work towards making:
- A. **fairlearn example notebooks**: Jupyter notebooks that illustrate the value of the Fairlearn Python library, possibly using synthetic datasets. This can help us show people why the project is valuable, rather than just telling them.
- B. **sociotechnical "talking points"**: bullet points that illustrate the work of approaching fairness as a sociotechnical challenge, in a way that is approachable to developers and data scientists who are new to thinking about fairness.
This hackpad evolved from https://github.com/fairlearn/fairlearn/pull/491, with Kevin adding initial scenarios, Roman adding additional ones, and group discussion that is recorded here as notes and questions. See the bottom of this hackpad for more historical notes and links.
Our original intention was to use these for **fairlearn example notebooks** and so that's what this hackpad focused on. We've also discovered that some of these scenarios won't make for good examples of the Fairlearn Python library. But they may be helpful seeds for [Sociotechnical "talking points"](https://hackmd.io/nDiDafJ6TMKi2cYDHnujtA).
## Contributing example notebooks
See https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html, which is pasted below for convenience.
> A good example notebook exhibits the following attributes:
>
> 1. **Deployment context**: Describes a real deployment context, not just a dataset.
> 2. **Real harms**: Focuses on real harms to real people. See [Blodgett et al. (2020)](https://arxiv.org/abs/2005.14050).
> 3. **Sociotechnical**: Models the Fairlearn team's value that fairness is a sociotechnical challenge. Avoids abstraction traps. See [Selbst et al. (2020)](https://andrewselbst.files.wordpress.com/2019/10/selbst-et-al-fairness-and-abstraction-in-sociotechnical-systems.pdf).
> 4. **Substantiated**: Discusses trade-offs and compares alternatives. Describes why using particular Fairlearn functionalities makes sense.
> 5. **For developers**: Speaks the language of developers and data scientists. Considers real practitioner needs. Fits within the lifecycle of real practitioner work. See [Holstein et al. (2019)](https://arxiv.org/pdf/1812.05239.pdf), [Madaio et al. (2020)](http://www.jennwv.com/papers/checklists.pdf).
>
> Please keep these in mind when creating, discussing, and critiquing examples.
## Next steps
So where do we go from here for Fairlearn example notebooks? One path is that we:
1. Finish a brief walkthrough and **vote as a group on the top seed scenarios** that are worth working through further to create an example notebook that illustrates Fairlearn's value proposition for reducing real harm.
2. From there, we can **individually work through one or two deployment contexts offline** and see where we get in terms of the contributing guidelines.
3. If we find that no one votes for any of these seed scenarios, then we can **individually brainstorm and generate ten more seed scenarios offline**, and then try again to discuss and vote as a group. Sources include: personal experience, stories from people we know, news articles, research papers (eg, [Barocas and Selbst (2016)](https://www.cs.yale.edu/homes/jf/BarocasSelbst.pdf)), etc.
## Seed scenarios
#### 1a. Identifying potential tax fraud XXX
You're a member of an analytics team in a European country, and are brought in to consult on a project that has already started scaling the deployment of models for predicting which tax returns may require further investigation for fraud. The team has used a model trained in other jurisdictions by a large predictive analytics supplier, and hopes to leverage this at a lower cost than would be required to build the capability in-house. [Veale et al. (2018)](https://arxiv.org/pdf/1802.01029.pdf)
- stakeholders: everyone filing a tax return, data scientist, auditors
- real harms: False Positive = audit on someone who made no mistake (perhaps burden for them? waste of time/money for auditor); False Negative = fraud undetected;
- collaborators: Michael Veale
- questions: Could this lead to feedback loops if people find out what criteria cause audits? What percentage of returns can be audited? Can we find a dataset for this?
- kevin: In the EU context discussed in Veale et al. (2018), use of protected characteristics in model development would be unconstitutional; according to a "lead of analytics at a national tax agency... if someone wanted to use gender, or age, or ethnicity or sexual preference into a model, [they] would not allow that — it’s grounded in constitutional law." Even when legally cleared, analysts do not use these features, because they would also have to explain to citizens that the features were used to trigger an investigation, and there are ethical norms in the agency against this.
*Next step: The seed scenario from Veale et al. (2018) is a good candidate for sociotechnical talking points, particularly focusing on the situation with the portability trap described in the paper.*
#### 1b. Identifying tax fraud, adapted to US context
(exploring adapting scenario #1 into US context)
- kevin: Electronic fraud detection has been used for decades in the US ([source](https://www.treasury.gov/tigta/auditreports/2015reports/201520093fr.html)), alongside decades of large-scale fraud (eg, [Panama papers](https://www.icij.org/tags/us-panama-papers-case)). IRS funding and staffing for "fraud technical analysts", and thus fraud referrals, have declined dramatically (>50%) over the last decade or so, and in 2018 the "audit rate for individual returns was 0.59%." ([source](https://news.bloombergtax.com/daily-tax-report/insight-the-irss-renewed-focus-on-fraud-implications-for-tax-practitioners)). In 2019, the IRS reported ~1,500 fraud investigations, with ~60% recommended for prosecution ([annual report](https://www.irs.gov/pub/irs-utl/2019_irs_criminal_investigation_annual_report.pdf)). Steps of the investigation process are [described here](https://www.irs.gov/compliance/criminal-investigation/how-criminal-investigations-are-initiated).
- kevin: Recent [IRS contracts](https://src.bna.com/C76) granted [to Palantir](https://news.bloombergtax.com/daily-tax-report/palantir-deal-may-make-irs-big-brother-ish-while-chasing-cheats). Since ~2019, anticipated increase in enforcement action ("the IRS is quite vocal about its increasingly specialized ability to analyze data in order to help it direct tax enforcement resources and develop criminal cases, touting its use of data analytics programs that can access and search over 9.5 billion records."). Contracts indicate this includes social network analysis, text and email communication, and other forms of non-financial data.
- kevin: IRS has an office of civil rights, but didn't find any reporting on real harms here related to over-investigation that I could connect to current Fairlearn capabilities. I'm assuming adoption is driven primarily by internal cost-savings, and that there's a clear natural equilibrium since "cost of fraud" is clear to express financially and to trade off with "cost of preventing fraud." It's challenging to make real harms of fraud tangible in human terms since downstream impact is so diffuse (ie, it doesn't directly translate to reductions in specific services).
- kevin: Other references: [IRS Criminal investigations](https://www.irs.gov/compliance/criminal-investigation/program-and-emphasis-areas-for-irs-criminal-investigation), [Artificial Intelligence: Entering the world of tax (Deloitte, 2019)](https://www2.deloitte.com/content/dam/Deloitte/global/Documents/Tax/dttl-tax-artificial-intelligence-in-tax.pdf), [Advanced Analytics for Better Tax Administration: Putting Data to Work (OECD, 2016)](http://www.oecd.org/publications/advanced-analytics-for-better-tax-administration-9789264256453-en.htm)
*Next step: Table it, unless we find out more about how to express real harms in human terms.*
#### 2. Debit card fraud investigation XXX
You're a data scientist at a Dutch financial services company, and your manager asks you to join an existing team. This team has deployed a model trained on historical transaction data, and new debit transaction data is now arriving. For each new transaction, the model predicts whether it is potentially fraudulent; if so, it triggers an alert and inspection by human analysts. The output that matters for the company is the human analyst's final decision: block the transaction, allow it but flag it for further investigation by another team, or mark the transaction as normal. [Weerts et al. (2019)](https://arxiv.org/abs/1907.03334)
- stakeholders: customers, data scientists, analysts
- real harms: False negatives mean clients can't get their money back (eg, in a phishing scheme), while false positives may overwhelm the team of human analysts or disrupt clients making legitimate purchases.
- collaborators: Hilde Weerts
- questions: debit/credit card usage varies by country which changes what costs are associated with FP/FN; Can we find a dataset for this?
- kevin: concerned about a natural equilibrium because of funding incentives within organization. Couldn't uncover real harms, since there are strong recourse and contestability procedures.
*Next step: To move forward, find reporting on real harms to real people (eg, increasing the costs of fraud prevention would create barriers to entry for people in the Netherlands to use debit cards, or real harms from failure of contestability and recourse procedures related to fraud).*
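To make the FP/FN cost asymmetry noted above concrete, here is a small numpy-only sketch (all scores and cost numbers are invented for illustration) of how a country-dependent cost ratio shifts the cost-minimizing alerting threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores from a fraud model and true labels (1 = fraud).
y_true = rng.integers(0, 2, size=5000)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=5000), 0, 1)

# Illustrative costs only: a false negative means a client can't
# recover their money; a false positive burdens analysts and can
# disrupt legitimate purchases. The ratio varies by country.
COST_FP, COST_FN = 1.0, 10.0

def expected_cost(threshold):
    y_pred = (scores >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0, 1, 101)
best = thresholds[np.argmin([expected_cost(t) for t in thresholds])]
print(f"cost-minimizing alert threshold: {best:.2f}")
```

Rerunning this with a different `COST_FP`/`COST_FN` ratio moves `best`, which is one way to show how the debit/credit usage differences between countries change the operating point.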
#### 3. Measuring brand sentiment
You're a member of a team trying to measure brand sentiment from online comments and reviews. The team hopes to use an existing language model, a third-party service for flagging abusive comments, and then train a more targeted sentiment classifier for your brand on top. [Hutchinson et al. (2020)](https://arxiv.org/pdf/2005.00813.pdf)
- questions: we have concerns about sentiment classification in general; Fairlearn may not want to focus on deep learning for text tasks at this point
- How does the system behave if/when the third-party service for abusive comments updates its model?
- How multilingual do we need to be? What about different dialects of the same language?
#### 4. Candidate screening X X X
A potential client asks you if ML can help with predicting job candidates' suitability for jobs based on a combination of personality tests and body language. [Raghavan et al. (2019)](https://arxiv.org/pdf/1906.09208.pdf)
- collaborators: Solon is one of the authors!
- notes: application looks sketchy, need to talk to Solon, perhaps this could be rewritten to be about qualifications rather than, for example, body language.
- lisa: I find some of the job screening/posting scenarios interesting given the significance of some of these algorithms/models in a world where many people are, and will be, searching for new jobs.
*Next steps: a) Work towards an example Fairlearn notebook, see [Fairlearn: Candidate screening example](https://hackmd.io/GMli82s7SxORABkabCgw8Q), b) Work this seed into sociotechnical "talking points".*
#### 5. Advertising jobs to potential candidates X
You work as a data scientist for an online job platform where people search for new jobs and exchange professional content and updates. You are in charge of the system that decides to whom to recommend which positions. [Upturn report](https://www.upturn.org/static/reports/2018/hiring-algorithms/files/Upturn%20--%20Help%20Wanted%20-%20An%20Exploration%20of%20Hiring%20Algorithms,%20Equity%20and%20Bias.pdf) & [The Guardian article](https://www.theguardian.com/technology/2015/jul/08/women-less-likely-ads-high-paid-jobs-google-study)
- stakeholders: users of the job platform, employers advertising on the platform, data scientist working for the job platform
- harms: recommending certain jobs only to certain groups of people increases the likelihood that the employers won't get a diverse set of applicants and job seekers may have no chance of seeing certain kinds of jobs
- notes: somewhat complex setup compared to simple regression/classification scenarios supported by Fairlearn
- lisa: I find some of the job screening/posting scenarios interesting given the significance of some of these algorithms/models in a world where many people are, and will be, searching for new jobs.
- What is 'success' in the training data? That a candidate applied? Was interviewed? Was hired? Does this match the definition of success for the system (especially since applied->interviewed->hired is a very leaky pipe)?
- Do we define fairness relative to the applicants in our user pool or in the broader population?
#### 6. Rankings for image search
You're on a team working on improving an image search system after receiving some complaints from users related to fairness. Users often use this system to find a selection of stock images to use when making multimedia presentations. In this system, requests start with some information about the context and user creating the query, and your team is trying to incorporate ideas about fairness like diversity and inclusion into how search results are ranked. [Mitchell et al. (2020)](https://arxiv.org/pdf/2002.03256.pdf)
- stakeholders: users of the search system, people who may (or may not) be shown in search results, data scientist
- harms: over- or underrepresentation
- notes: Fairlearn doesn't support ranking at the moment, but this may be a good application in the future.
#### 7. Sales leads for car loans X
You work at CarCorp, a company that collects special financing data (information on people who need car financing but have either low credit scores or limited credit histories) and sells this data to auto dealers as sales leads. CarCorp serves dealers across the United States. A new project manager asks about leveraging data science to “improve the quality” of leads so that dealers do not churn. CarCorp has a large amount of historical lead data (2 million leads in 2017 alone), but relatively little data on which leads were approved for special financing (let alone why the loan was approved). [Passi and Barocas (2019)](https://arxiv.org/ftp/arxiv/papers/1901/1901.02547.pdf)
- notes: There are concerns around predatory terms on such loans, perhaps best to avoid this
- collaborators: paper by Solon
#### 8. Predictive policing
You are a contractor working with the police department in a large city. One of the project leaders in the department would like to construct a risk score for people who are known gang members engaging in knife crime. It's important to them that they can understand what the model is doing, and they are wary that any model will pick up on protected characteristics. [Veale et al. (2018)](https://arxiv.org/pdf/1802.01029.pdf)
- stakeholders: police officers, everyone in the community (especially people who may be more affected by predictive policing than others), data scientist
- harms: overpolicing of neighborhoods can lead to disproportionate effect on communities in these neighborhoods (perhaps exacerbated by feedback loop)
- notes: Feedback loops! Prediction on behavior based on circumstances; perhaps useful for aggregate observations about behavior but less so for individual predictions
- Dangerous to assume that 'crime' is a single phenomenon. There are different kinds of crime, so also need to make sure that data match the crime being predicted.
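The feedback-loop concern flagged above can be illustrated with a toy simulation (all numbers invented): patrols are allocated in proportion to previously discovered crime, and crime can only be discovered where a patrol is present, so an arbitrary initial imbalance tends to persist rather than self-correct even when the underlying crime rates are identical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two neighborhoods with *identical* true crime rates, but slightly
# unbalanced starting beliefs about where to patrol.
true_rate = np.array([0.1, 0.1])
patrol_share = np.array([0.6, 0.4])
discovered = np.zeros(2)

for day in range(200):
    patrols = rng.multinomial(10, patrol_share)           # allocate 10 patrols
    discovered += rng.binomial(patrols * 100, true_rate)  # crime found only where patrols go
    patrol_share = discovered / discovered.sum()          # next allocation follows discoveries

print("final patrol share:", patrol_share.round(2))
```

Because the allocation is reinforced by its own discoveries, the simulation never learns that the two neighborhoods are identical, which is the aggregate-versus-individual concern in the notes above.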
#### 9. Scheduling maintenance within a factory
You work within a manufacturing company, and are starting a new project that will create a schedule assigning employees to check and update certain components of the machinery to prevent critical operation failures. The component assignment is based on data that show how often different components have worn out and broken down in the past. [Kyung Lee (2018)](https://journals.sagepub.com/doi/full/10.1177/2053951718756684)
- questions: need to provide details on link between preventative maintenance model and fairness in scheduling, seems to be separate; also need to figure out how/if fairness in shift scheduling is measurable
#### 10. Child protective services hotline
You're collaborating with the child protective services agency as part of a county government in the US. The agency is redesigning the intake flow for reports of potential child abuse or neglect, and wants to discuss if a predictive analytics system could help them improve this system. [Brown et al. (2019)](https://www.andrew.cmu.edu/user/achoulde/files/accountability_final_balanced.pdf) and [Chouldechova et al. (2018)](http://proceedings.mlr.press/v81/chouldechova18a/chouldechova18a.pdf)
- stakeholders: children, parents, agency employees, data scientists
- collaborators: Alexandra Chouldechova
- notes: dataset? very high stakes
- notes: see [Measuring the predictability of life outcomes with a scientific mass collaboration (Salganik et al. 2020)](https://www.pnas.org/content/117/15/8398) for a large scale study of predictive systems for child life outcomes using a longitudinal dataset. The authors specifically focus on child welfare outcomes, and find that "despite using a rich dataset and applying machine-learning methods optimized for prediction, the best predictions were not very accurate...these results suggest practical limits to the predictability of life outcomes."
#### 11. Compliance in customer service calls
You work on a team within financial services that is building a system to reduce the company's compliance risk from customer service phone calls. Compliance risk includes cases where a company employee breaches confidentiality or engages in misrepresentation or fraud. Another team has leveraged third-party services to transcribe call audio into text, and then extract features for each call related to the presence of specific keywords. It's your team's role to take that vector of binary features and build a system that estimates a compliance "risk score" for each call. A team of internal analysts will use these risk scores to triage which calls to investigate further. [vendor blog post (2020)](https://customers.microsoft.com/en-us/story/754840-kpmg-partner-professional-services-azure)
- note: might be singling out people in call center
- Do call-centre demographics roughly match the caller demographics? That gives extra avenues for miscommunication causing risk.
- How fine grained is the risk score? How well are the boundaries between different risk levels defined?
- What is the follow up process for an 'at risk' call (both for the caller and callee)?
- How are callers allocated to the call centre operators?
- Do the questions in the calls systematically vary with shift patterns (e.g. placing stock trades in the morning, cancelling credit cards in the evening)?
#### 12. Facial verification of taxi drivers
Your team in a taxi company is collaborating on a new feature, "selfies for security," which asks drivers to periodically take pictures of themselves in between rides. The intention is to reduce the company's risk of providing taxis driven by someone whom the company has not screened and approved. These photos will be taken inside taxi cars on cell phones, in a wide range of conditions with uncontrolled lighting throughout the day. Another team in your company will generate the signal to "request a selfie," and your team is standing up a new service that processes the photos through a third-party facial verification vendor, which returns a confidence score for how well the driver photo matches the last photo of the driver. Your team's service then decides whether to allow the driver to start picking up riders, or to block the driver's account and flag it for investigation by a small team of analysts. [taxi company blog post (2017)](https://eng.uber.com/real-time-id-check/) and [vendor blog post (2019)](https://customers.microsoft.com/en-us/story/731196-uber)
- richard: concern on facial recognition, image quality might be an issue
- hanna: avoid this one, maybe not want to build it as a notebook but as "talking points" because of landscape around facial recognition
- varoon: fairness work is hard here because the communities being impacted and the people deploying the algorithms have misaligned values and ideals about surveillance, etc. in their communities. There are large sociotechnical barriers for developers, so this might not be the best example. Somewhat tangentially, this may be appropriate for auditors (or "evaluators" or "journalists") given access to a system because trade secrecy was waived.
- solon: would like not to avoid tricky cases, that's where people need most guidance. could include some where we think the right answer is to not build since no reasonable way to mitigate.
- richard: could do "talking points"
- miro: yes, but zero use cases now.
- Exactly what question are we trying to answer? Is it that the photo the driver sends matches the one on file? Or that the photo matches the one sent at the start of the shift? With accuracy <100%, those are not quite the same thing.
- How do we cope with a photo taken under Sodium-D lighting, especially of someone with darker skin?
- What about the drivers' privacy?
#### 13. Financial services product recommendations X X
You work at a Canadian financial services company that makes financial product recommendations to consumers. Other financial product providers describe their offerings and store them with your company. Users come to the app, agree to share their credit history, and then, after their identity is authenticated, your team's model ranks the financial products that are the best fits.
[financial services (2020)](https://customers.microsoft.com/en-us/story/734799-borrowell-financial-services-azure-machine-learning-devops-canada)
- miro: earlier examples might be better. recommendations are tricky - fairlearn currently has binary classification and regression (but ranking could be implemented as scoring)
- hanna: don't want to lose track of this, if we do put fair ranking work in the project down the line
- Based _just_ on their credit history?
- What sort of recommendations? Back in 2008, one of the scandals was that people with 'prime' credit scores were steered towards subprime loans.
#### 14. Customer Service triage, consulting XX
You work at a consulting company. One of the services your company provides is setting up a single mailbox to receive incoming customer emails. Your role is to collaborate with a client company to create a classification system that labels each email with one of six categories. The output of your system is then used to route the email to the correct department head. To do this, you're using a third-party keyword extraction system that the client has already set up, which can extract ~1000 binary features from an email. [consulting blog post (2020)](https://customers.microsoft.com/en-us/story/774221-securex-professional-services-m365)
- hanna: i like this, because i've worked with emails, i like the diversity of this as well, it's not just finance or other things but a newer area
- miro: fairlearn supports binary classification and regression. if we want to support this, we'd have to present it as a score for each of the six categories, rather than six-way classification. people do that a lot in practice so it could be okay.
- What is the fairness issue? Presumably a misclassified message would simply be rerouted by the recipient. How much delay (i.e. harm) does that add to the processing of a message (especially as compared to resolution time once correctly routed)?
- How often can the underlying model be updated with examples of incorrectly routed messages (with the correct labels manually applied)?
#### 15. Job recommendations X
You work for a job recommendation product. Background processes gather job postings and submit them to a third-party search indexing service. When a user comes to the website and uploads their resume, the resume is processed and a set of job skills is extracted. Your team works on the service that takes the set of job skills in a resume and searches the job posting index managed by the third-party vendor. Your team then provides the ranking of job postings that is ultimately shown to the user. [company blog post](https://azure.microsoft.com/en-us/blog/using-azure-search-custom-skills-to-create-personalized-job-recommendations/)
- How do we recognise 'job skills'?
- How do we cope with unusual qualifications (could be as simple as attending a university overseas)?
#### 16. Alerting for first responder police officers
You work at a company providing a service to police officers that speeds up the queries typically run when a police officer is a first responder. Three types of queries are run: driver’s license information, license plate information, and vehicle identification numbers. When an officer presses a button on their radio and speaks a license plate number, within seconds they hear an alert tone that classifies whether the queries returned information that is low priority, sensitive but not urgent, or high priority (eg, a prior arrest record or a stolen vehicle). The system relies on a third-party language system to parse the audio and extract the license plate number, and then runs those queries through police department systems. You work on the team building the classification system that chooses which of the three alert tones to play through the officer's radio. [company blog post](https://customers.microsoft.com/en-us/story/792324-motorola-solutions-manufacturing-azure-bot-service)
- note: hanna says no
#### 17. Choosing new retail sites X
You work at a clothing company, as an analyst working to select the location for three new physical stores that will be opened in the next six months. You're collaborating with a third-party vendor to estimate potential revenues at new site locations. You've gathered data on past store openings, and shared it with the vendor, who has created a model that can estimate the potential revenue for the first two years of operation in new sites. The vendor's model relies on data you've provided about your company's past openings, and other undisclosed data sources about retail sales, real-estate prices, foot traffic, etc. [company blog post](https://customers.microsoft.com/en-us/story/816179-carhartt-retailers-azure)
#### 18. Streaming music recommendations XX
You’re a member of a team working on the music recommendation system of a music streaming platform. Previously, your team has primarily focused on optimizing recommendations for user satisfaction, which is measured implicitly as time spent listening on the platform. The company has received complaints from several artists that their music is not getting enough exposure, many of whom belong to groups that are historically underrepresented in the music industry. Your team decides to work on improving the recommendation system to allow for more diverse recommendations. [Ferraro et al. (2019)](https://arxiv.org/pdf/1911.04827.pdf)
#### 19. Deciding the credit card limit
You work for a bank as a data scientist. You're tasked with building a system that decides the credit limit for new credit cards. Inspired by [Apple Card](https://hbswk.hbs.edu/item/gender-bias-complaints-against-apple-card-signal-a-dark-side-to-fintech)
- stakeholders: credit card holders, bank (employees)
- harms: receiving a lower credit limit may restrict the opportunities of the credit card holder by preventing them from being able to afford things
- questions: Is this how it works in real life? Or is this part of the application itself? Need to consult with subject matter experts.
#### 20. School choice X
You work as a data scientist for a large school district. Your task is to create a system that assigns children to schools based on their (parents') preferences. Inspired by [Edweek](https://www.edweek.org/ew/articles/2013/12/04/13algorithm_ep.h33.html)
- stakeholders: children, parents, data scientists
- note: see section on NYC school assignment in [AI Now 2019 Report](https://ainowinstitute.org/ads-shadowreport-2019.pdf) for more history on this in NYC, with links to critiques of racial and socioeconomic segregation, subsequent legislation, task force around algorithmic transparency, etc. see also [High School Choice in New York City: A Report on the School Choices and Placements of Low-Achieving Students (Nathanson et al. 2013)](https://research.steinhardt.nyu.edu/scmsAdmin/media/users/ggg5/HSChoiceReport-April2013.pdf) for a critique of an older high school assignment algorithm in NYC, which is overlaid over a [longer history of segregation](https://civilrightsproject.ucla.edu/research/k-12-education/integration-and-diversity/ny-norflet-report-placeholder/Kucsera-New-York-Extreme-Segregation-2014.pdf).
#### 21. Hate speech detection
You work for a social network as a data scientist. Your task is to build a system that identifies hate speech so that the network can notify/warn users before they read it, or potentially block it. Inspired by [TheRegister](https://www.theregister.com/2019/10/11/ai_black_people/); perhaps somewhat related is toxicity, inspired by [Medium](https://medium.com/@carolinesinders/toxicity-and-tone-are-not-the-same-thing-analyzing-the-new-google-api-on-toxicity-perspectiveapi-14abe4e728b3)
- stakeholders: users of social network (both content creators and consumers), data scientist
- harms: false positives mean that benign posts are flagged as hate speech; false negatives mean that actual hate speech isn't flagged as such
- notes: There's a lot of overhead for the social network to manually decide what is hate speech. Hate speech detection itself is very much a NLP task, but it's possible that disparities between groups could be mitigated by postprocessing probabilities.
- note: the scenario mentioned in the article uses a third party service, Google's Perspective API, rather than developing a system in-house. Various kinds of fairness audits have been conducted and written about it, and the Perspective API itself has a [public model card](https://medium.com/the-false-positive/increasing-transparency-in-machine-learning-models-311ee08ca58a)
- There's another open source MSR project called [CheckList](https://github.com/marcotcr/checklist) that might be applicable for some of these low-level kinds of fairness checks (eg, particular identity statements like "I am gay" returning negative sentiment).
#### 22. Predicting who needs special attention in healthcare X X X
You work for a hospital network to create a system that should predict to which patients healthcare professionals should pay special attention. Inspired by [TheVerge](https://www.theverge.com/2019/10/24/20929337/care-algorithm-study-race-bias-health)
- stakeholders: patients, healthcare professionals, data scientists
- harms: False positive = somebody who doesn't need special attention gets special attention (perhaps unnecessary effort) and this care is potentially taken away from somebody else who needs it; false negative = somebody who needed special attention doesn't get it (potentially severe health consequences)
- questions: need to find out what percentage of patients actually get special attention, and overall more details on such an application; Can we somehow get a dataset?
- sociotechnical: "special attention" needs to be more concrete; also, does implementing the algorithm provide more budget for additional staffing? Does funding for the algorithmic system come with funding for increased capacity for "special attention," or does this function as a new kind of pressure for how to allocate staffing attention that will have to compete with other existing pressures?
- sociotechnical: see [Yang et al. (2016)](https://www.researchgate.net/publication/292320820_Investigating_the_Heart_Pump_Implant_Decision_Process_Opportunities_for_Decision_Support_Tools_to_Help) for a case study of a heart pump implant decision that found "lack of perceived need for and trust of machine intelligence, as well as many barriers to computer use at the point of clinical decision-making," or [Estimate the hidden deployment cost of predictive models to improve patient care (Morse et al., 2020)](https://www.nature.com/articles/s41591-019-0651-8) for some caution on what it takes to actually deploy these kinds of models in a way that impacts patient outcomes. [Sendak et al. (2020)](https://www.nature.com/articles/s41746-020-0253-3.pdf) describes "Model Facts," a medical variant on model cards.
- sociotechnical: More broadly, digitization of human service data is often [incredibly challenging](https://www.motherjones.com/politics/2015/10/epic-systems-judith-faulkner-hitech-ehr-interoperability/), with [data quality](https://fortune.com/longform/medical-records/) and [overhead of data entry](https://www.selecthub.com/medical-software/emr/electronic-medical-records-future-emr-trends/) as core issues. Common assumptions about scalable cost efficiencies in research papers (eg, [Rajkomar et al. (2018)](https://arxiv.org/ftp/arxiv/papers/1801/1801.07860.pdf)) have not been validated by field research (eg, [Soron and Collins (2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5596299/)).
- note: see [Baker et al. (2020)](https://alexhanna.github.io/algo-identity/) for some interactives on "administrative violence" in the healthcare system (eg, related to gender identity).
- note: example of a [fairness analysis](https://storage.googleapis.com/covid-external/COVID-19ForecastFairnessAnalysis.pdf) [whitepaper](https://storage.googleapis.com/covid-external/COVID-19ForecastWhitePaper.pdf) re: COVID forecasting. Starts by citing existing disparate impact, and focuses on absolute errors by subgroup, binned into quartiles of counties (partially because of data sources). Also alludes in passing to differential costs of over- and under-prediction (described in the papers but hidden from the "prediction" CSVs and UIs).
- See [ml4health](https://ml4health.github.io/)
- specific example to build on: https://ai.googleblog.com/2020/08/using-machine-learning-to-detect.html
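The harms bullet above frames a false negative (a patient who needed special attention not getting it) as the severe outcome, which makes the false negative rate per group a natural metric to disaggregate. A minimal sketch with synthetic labels and made-up group names, in plain Python standing in for what Fairlearn's `MetricFrame` automates:

```python
# Sketch: checking whether a "needs special attention" classifier has
# unequal false negative rates across patient groups. All labels and
# groups below are synthetic and hypothetical.

def false_negative_rate(y_true, y_pred):
    """Fraction of truly-positive cases the model missed."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    if not positives:
        return 0.0
    return sum(1 for t, p in positives if p == 0) / len(positives)

def fnr_by_group(y_true, y_pred, groups):
    """Disaggregate the false negative rate by a sensitive feature."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = false_negative_rate([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx])
    return rates

# Hypothetical labels: 1 = patient actually needed special attention.
y_true = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(fnr_by_group(y_true, y_pred, groups))
# eg, group B's positives are missed far more often than group A's
```

The number alone doesn't settle anything; as the bullets above note, whether a disparity like this is actionable depends on what "special attention" means and whether capacity exists to deliver it.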
*Next step: Start with either a) finding a specific deployment context and writing it into a paragraph, or b) finding where there are significant real harms in healthcare, and explore in that direction.*
----
## Historical notes and links
Here's what's happened so far.
1. **Research paper examples**. The initial research papers used abstracted datasets, and ran some experiments to demonstrate the approach empirically. These included the "UCI credit card" dataset and the "COMPAS" dataset, but these didn't engage with sociotechnical context.
2. **Critiques**. We tried exploring some of the sociotechnical context around those initial examples (eg, https://github.com/fairlearn/fairlearn/issues/413 and then more detailed explorations of credit card applications in https://github.com/fairlearn/fairlearn/issues/418, consumer lending in https://github.com/fairlearn/fairlearn/pull/492, and pre-trial detention in https://github.com/fairlearn/fairlearn/issues/478). The conclusion has been mostly that these deployment contexts may not be the best illustration of the project's core value that fairness is a sociotechnical challenge.
3. **"How to talk and write about fairness"**. We wrote up a document with aspirations for how the team would talk about fairness on the project ([link](https://fairlearn.github.io/contributor_guide/how_to_talk_about_fairness.html)). This was difficult to use in practice, and in practice the document wasn't influencing how we were talking or writing. We also found that the existing example notebooks were not reflecting these kinds of project values.
4. **"Contributing example notebooks"**. We developed a [microrubric for critiques](https://github.com/fairlearn/fairlearn/pull/490) to try to make a more concise and usable checklist. This becamse part of the contributor guide for [Contributing example notebooks](https://fairlearn.github.io/contributor_guide/contributing_example_notebooks.html).
5. **"Seed scenarios"**. To explore other deployment contexts where we might illustrate the value of the Fairlearn Python library, we created and discussed a set of "seed scenarios" as potential candidates (see https://github.com/fairlearn/fairlearn/pull/491). This led to productive discussion about fairness as as sociotechnical challenges, but concerns about whether Fairlearn was an appropriate choice for reducing real harms in any of these deployment contexts. It also raised questions about whether the team would be able to do this kind of work on its own, without other kinds of interdisciplinary collaboration.