# Fairness and Health Equity Workshop (York)
[toc]
## Key Information
Zoom call: https://york-ac-uk.zoom.us/j/94602182792?pwd=Y1ZFa0o4YVBCZ0JvYW9HbW9ZZkJ1QT09
Interactive Lifecycle: https://github.com/alan-turing-institute/turing-commons/blob/resources/resources/additional-documents/lifecycle-cheatsheet-interactive.pdf?raw=true
Bias Cards: https://github.com/alan-turing-institute/turing-commons/blob/resources/resources/activities/bias-and-mitigation-cards.pdf?raw=true
### Agenda
- 10.30-10.35 Hello & Welcome
- 10.35-11.20 Presentations
- Marten Kaas – The AI lifecycle
- Zoe Porter – Principles-based ethics assurance argument
- Philippa Ryan – AI in healthcare: Complex development and clinical deployment
- 11.20-12.05 A Case Study: Deploying a clinical diagnostic support system
- 12.05-12.35 Lunch
- 12.35-13.10 Session One: Design
- 1) Project planning
- 2) Problem formulation
- 3) Data extraction and procurement
- 4) Data analysis
- 13.10-13.45 Session Two: Development
- 1) Preprocessing and feature engineering
- 2) Model selection and training
- 3) Model testing and validation
- 4) Model reporting
- 13.45-13.55 Coffee Break
- 13.55-14.30 Session Three: Deployment
- 1) System implementation
- 2) User training
- 3) System use and monitoring
- 4) Model updating and deprovisioning
### Discussion Prompts
:::info
Note that this discussion should be in the context of a clinical diagnostic support system (CDSS) that is used to aid clinicians in predicting Type-2 diabetes related co-morbidities. The CDSS can also recommend potential treatments. The type of data collected includes most standard medical data, e.g., height, weight, BMI, heart rate, blood pressure, blood oxygen level, white blood cell count, etc.
:::
#### Project Design
1. Project Planning
a. What is the goal of the project (e.g., to create a clinical diagnostic support system)? Is it fair to use AI to achieve this goal? If it is, fair for whom?
b. What anticipatory reflections might we make about fairness at this very early stage?
c. What measures could be used to assess whether project planning was conducted fairly?
d. What kind of goal is “fair project planning”?
2. Problem Formulation
a. For whom is the “problem” a problem?
b. Should affected stakeholders be engaged? What would fair stakeholder engagement amount to?
c. Is it fair to address this problem using AI, i.e., using some target variable to make predictions or recommendations about a property of interest?
3. Data Extraction or Procurement
a. How is the data being collected? Are people being fairly compensated for their data? Has data been fairly collected?
b. Has the data been fairly procured? What evidence would need to be provided to demonstrate that data has been fairly procured or collected?
4. Data Analysis
a. Is the data that has been collected representative of the intended use context? What measures or evidence would you need to demonstrate this? (A simple representativeness check is sketched after these prompts.)
b. How should missing data be dealt with?
c. How can biases in the data be detected and dealt with?
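A minimal sketch of one way to approach prompt 4a above: compare subgroup proportions in the collected data against the intended deployment population. The file name, column name, and reference figures are hypothetical; pandas is assumed.
```python
import pandas as pd

# Hypothetical CDSS dataset; file and column names are illustrative only.
df = pd.read_csv("cdss_patient_data.csv")

# Illustrative reference proportions for the intended use context
# (in practice these would come from census or local health-service statistics).
reference_ethnicity = {"white": 0.75, "asian": 0.12, "black": 0.06, "mixed": 0.04, "other": 0.03}

observed = df["ethnicity"].value_counts(normalize=True)
for group, expected in reference_ethnicity.items():
    actual = observed.get(group, 0.0)
    print(f"{group:>6}: dataset {actual:6.2%} vs. population {expected:6.2%}"
          f" (gap {actual - expected:+.2%})")
```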
#### Model Development
5. Preprocessing and Feature Engineering
a. How should the data be organized or cleaned?
b. What evidence would be needed to say that the data has been processed fairly?
c. What does fair feature engineering, i.e., picking and choosing of certain features of the raw data, amount to?
d. Who should be involved in organizing and cleaning the data?
6. Model Selection and Training
a. How should models, i.e., training algorithms, be selected?
b. What would it mean to choose a model fairly? What evidence would be needed to demonstrate that a model was selected fairly?
7. Model Testing and Validation
a. What would it mean to separate testing from training data fairly? What evidence would be needed to demonstrate that training and testing data were separated fairly? (A stratified split is sketched after these prompts.)
b. Who should be involved in identifying the data that will serve as training data and the data that will serve as test data?
c. What is a fair measure of success? How will this measure be identified? Who will be involved in identifying this measure?
8. Model Documentation
a. How much of the design and development process needs to be documented explicitly?
b. What evidence would be required for a fair assessment of the model by a third party?
c. How should the design and development process be documented, e.g., in writing, diagrams, etc.? How should reports about the model be disseminated, and to whom?
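Relating to prompt 7 above, a minimal sketch of a train/test split stratified on both the outcome label and a protected attribute, so the held-out test set reflects the groups the system is meant to serve. Column names are hypothetical; scikit-learn and pandas are assumed.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CDSS dataset; column names are illustrative only.
df = pd.read_csv("cdss_patient_data.csv")

# Stratify on the outcome combined with a protected attribute, so both appear
# in similar proportions in the training and test sets.
strata = df["comorbidity_label"].astype(str) + "_" + df["ethnicity"].astype(str)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=strata)

# Evidence for the assurance case: report group proportions in each split.
print(train_df["ethnicity"].value_counts(normalize=True))
print(test_df["ethnicity"].value_counts(normalize=True))
```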
#### System Deployment
9. System Implementation
a. How is the system being integrated into the broader social and organizational practices of the organization, e.g., into the diagnostic healthcare pathway?
b. Will the relevant supporting software and hardware infrastructure already be in place?
c. Whose responsibility is it to ensure there exists infrastructure to support the AI-enabled CDSS?
10. User Training
a. Who will carry out user training? How much of the system will expert users, e.g., clinicians, be expected to understand?
b. Can people opt out from using the system?
c. What evidence would be required to demonstrate that expert users have received appropriate user training?
11. System Use and Monitoring
a. Whose responsibility would it be to monitor the system? Is it fair to expect this kind of monitoring from this person or group of people?
b. What measures will be used to ensure that model performance does not shift, degrade, or entrench existing inequalities? (A per-group monitoring check is sketched after these prompts.)
c. Can people opt out from using the system and still expect similar treatment?
12. Model Updating or Deprovisioning
a. When is it appropriate to update or change the model? Whose responsibility is it to initiate updates? Is it fair to expect this person or group to initiate the update process?
b. When is it appropriate to deprovision the system? Whose responsibility is it to initiate this process, and is it fair to expect this person or group to initiate this process?
c. What other concerns or issues might prompt model updating or deprovisioning?
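Relating to prompt 11b above, a minimal sketch of one monitoring measure: recomputing a per-group performance metric over recent predictions and raising an alert when the gap between groups grows. The log format, columns, and threshold are hypothetical; scikit-learn and pandas are assumed.
```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical log of recent predictions with ground-truth outcomes attached later.
log = pd.read_csv("cdss_prediction_log.csv")  # columns: ethnicity, y_true, y_pred

ALERT_GAP = 0.10  # illustrative threshold for the allowed gap between groups

recalls = {}
for group, rows in log.groupby("ethnicity"):
    # Sensitivity (recall) per group: how many true comorbidity cases are caught.
    recalls[group] = recall_score(rows["y_true"], rows["y_pred"])
    print(f"{group}: recall {recalls[group]:.2%} (n={len(rows)})")

if max(recalls.values()) - min(recalls.values()) > ALERT_GAP:
    print("ALERT: recall gap between groups exceeds threshold; trigger a review.")
```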
## Notes
Attendees
- I’m Christopher Burr from the Alan Turing Institute. I’m the PI for this project on Trustworthy and Ethical Assurance. Looking forward to the discussion today.
- Hello! Kalle Westerling here - I am a Research Application Manager at the Alan Turing Institute, working with the Trustworthy and Ethical Assurance Platform 🙂 https://www.turing.ac.uk/people/researchers/kalle-westerling
- Hi everyone. I’m Nayha Sethi, a lecturer at the University of Edinburgh and co-I on a couple of projects on trustworthy AI (Trustworthy Autonomous Systems: Making Systems Answer) and BRAID (Bridging Responsible AI Divides). Looking forward to learning from you all today
- Hi all, I’m Ernest, PhD Candidate at York supervised by Ibrahim Habli and Zoe Porter. My research focus is on building an assurance case for voice-based conversational AI for use in healthcare. I’m also medical director at Ufonia Limited - we’re an Oxford based SME that develops a conversational AI agent used in the NHS to support care, I’m also an NHS ophthalmic surgeon by background. 😃 Excited to meet everyone!
- Hello, I am Valeria Venditti, lecturer in Ethics at the School of Nursing and Midwifery University College Cork.
- Hi everyone. I’m Magdalena Furgalska, a lecturer in law, focusing on healthcare law. Currently leading a stream on fairness in healthcare in the context of psychiatric detention.
Focus of today: Understanding AI in digitally-enabled health.
This is the third workshop of the project
### Presentation 1 (Marten Kaas – The AI lifecycle)
:::info
- Slides: ==Add link==
:::

Intervening in the AI lifecycle. The diagram illustrates how software development proceeds at a rapid pace, with development and operations (deployment) blending together.

AI here = a system that utilises machine learning components.
Stage 1. Data management. What do we do with the data? Preprocessing, setting aside certain portions for training and testing, cleaning the data.
Stage 2. Model learning. Selecting a particular ML algorithm: there are different ways to transform datasets into outputs through models/algorithms.
Stage 3. Model verification. Is it performing as we intended? Doing what we wanted it to do? Does it reach the agreed measure of success?
Stage 4. Deployment. The ML-based system is taken and integrated into a larger sociotechnical system; it sits in a self-driving car, for instance.
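A minimal sketch compressing these four stages into code, assuming a tabular dataset and scikit-learn. The file name, columns, model choice, and the 0.80 success target are all illustrative, not the system discussed in the workshop.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stage 1: data management -- load, clean, and split the data.
df = pd.read_csv("patient_records.csv").dropna(subset=["comorbidity_label"])
X, y = df.drop(columns=["comorbidity_label"]), df["comorbidity_label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Stage 2: model learning -- select and train an algorithm.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Stage 3: verification -- does it reach the agreed measure of success?
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
assert auc >= 0.80, f"Model misses the agreed success measure (AUC={auc:.2f})"

# Stage 4: deployment -- hand the trained model to the wider sociotechnical system.
def predict_risk(patient_features: pd.DataFrame) -> float:
    """Called by the surrounding CDSS, which presents the result to a clinician."""
    return float(model.predict_proba(patient_features)[:, 1][0])
```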

Larger image: https://www.bmj.com/content/bmj/372/bmj.n304/F1.large.jpg
Broader view of the lifecycle.
Starting with the world: patterns of interaction, ways of representing people.
We might collect data about that world; the data might be discriminatory or non-representative.
Using it in a system. We might get mislabeled data that's fed into a system that perpetuates systemic bias. Failing to evaluate the model, affecting stakeholders.
These systems are put into the world. Affects people in the real world. Exacerbating inequalities, etc.
We want to intervene somewhere in this lifecycle.

Interactive Lifecycle: https://github.com/alan-turing-institute/turing-commons/blob/resources/resources/additional-documents/lifecycle-cheatsheet-interactive.pdf?raw=true
What we'll do today: look at the different parts of the model

Session 1: Addressing the top right side of the model.

Session 2: Development.
- What will we do with the data? It is often inappropriate to feed the model all the data that we have; collecting data might lead to *too many datapoints*. You want to select and clean the data. What is most important?
- Fed to which model? Why did you choose that one? Listening to others matters here.
- Are we testing the model and evaluating the results? How do we measure success?
- Documentation is also important: what have you done in the previous phases and steps (design and development, through all the steps)? (A minimal model-report sketch follows.)
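On the documentation point, a minimal sketch of a machine-readable, model-card-style record that could travel with the model. The fields and numbers are invented for illustration; the real content would be agreed with clinicians, regulators, and the assurance team.
```python
import json
from datetime import date

# Illustrative model-reporting record (all values are placeholders).
model_card = {
    "model_name": "t2d-comorbidity-risk",
    "version": "0.3.0",
    "date": date.today().isoformat(),
    "intended_use": "Decision support for clinicians; not for autonomous diagnosis.",
    "training_data": {
        "source": "Routine clinical records (see data statement)",
        "known_gaps": ["under-represented ethnic groups", "non-electronic records missing"],
    },
    "evaluation": {"metric": "AUC", "overall": 0.82, "per_group": {"group_a": 0.84, "group_b": 0.78}},
    "limitations": ["performance gap between groups", "missingness may not be at random"],
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```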

Session 3: Deployment
- System implementation. The model has been developed into an AI-enabled system. If you're in research, maybe this is where you stop; but in many other cases, the model will be integrated/deployed into other, larger systems: making recommendations to clinicians, predicting the likelihood of diseases, and so on. It becomes embedded in already existing technologies and social structures.
- How do you train users? For clinicians, how do we train them appropriately?
- How do we make sure the system works as it should over time? This is especially important if your system updates (e.g., a continuously learning model), but the performance of a system/model should *always* be in focus.
- Deprovisioning of the system: how will that happen? Updating? Who's in charge? Who rings the alarm if something goes wrong?

Natural breakpoints exist where we should stop and ask questions. Our approach here is a narrow one: what designs are put in place to make sure that (for example) problem formulation can happen well, *by design*? Thinking across the lifecycle: what evidence would be needed to say that we've thought through the ethical implications of our systems?
### Presentation 2 (Zoe Porter – Principles-based ethics assurance argument)
:::info
- Slides: ==Add link==
:::
How we've used the assurance-case method to support an ethical approach to AI: moving from principle to practice with the ethical aspects of AI systems.

Safety/assurance cases are structured arguments, supported by evidence. Simple example above. **Claims** are broken down into **smaller claims**, which are supported by **evidence**; together they provide feasible reasons to believe that the top claim is true.
The methodology is used by safety engineers especially. A system will be *acceptably safe* = physical risk is reduced to the lowest acceptable level.
ALARP (as low as reasonably practicable): risk cannot be eliminated, only reduced, and the reduction achieved is weighed against the effort required.
Nuclear and automotive industries are good examples.
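A minimal sketch of that claim / sub-claim / evidence structure as a small data structure, just to make the shape concrete. The claims and evidence names are invented for illustration and are not the project's actual argument.
```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A node in a simple assurance-case tree: a claim supported by
    sub-claims and/or evidence (cf. goal structuring notation)."""
    text: str
    sub_claims: list["Claim"] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)

# Illustrative top-level claim broken down into smaller claims with evidence.
case = Claim(
    "The CDSS is acceptably safe and fair in its intended clinical context",
    sub_claims=[
        Claim("Residual risk is reduced as low as reasonably practicable (ALARP)",
              evidence=["hazard log", "risk assessment report"]),
        Claim("The system does not worsen existing health inequalities",
              evidence=["per-group performance report", "stakeholder consultation notes"]),
    ],
)

def print_case(claim: Claim, depth: int = 0) -> None:
    print("  " * depth + "CLAIM: " + claim.text)
    for e in claim.evidence:
        print("  " * (depth + 1) + "EVIDENCE: " + e)
    for sub in claim.sub_claims:
        print_case(sub, depth + 1)

print_case(case)
```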

Original image: https://wilkins.law.harvard.edu/misc/PrincipledAI_FinalGraphic.jpg
The infographic here is from 2020; it went through all the sets of ethical principles around ethical AI from 2016-2021. Hundreds of documents were distilled into the major insights shown in the infographic.
How can we use this methodology, to ensure all of these?

Based on a social-contract reading of "reasonably safe": if everybody concerned acted in an ethically responsible way, and no stakeholder would reject deployment, then the system is ethically acceptable.

PRAISE is this methodology, based on medical ethics, but not limited to medical domain.
- Beneficence = do good.
- Non-maleficence = do no harm
What are the benefits of the system, and for whom? What are the risks and harms involved, and for whom?

Inspired by philosophical ideas around justice.

One particular fairness claim in the argument. The system should not make inequalities worse. What does that look like across the project lifecycle?
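One way to make that claim inspectable, sketched below: compare error rates across groups (an equalised-odds-style check). The data and group names are invented; this illustrates the kind of evidence that could sit under the claim, not the PRAISE argument itself.
```python
import pandas as pd

# Hypothetical evaluation results: one row per patient in the test set.
results = pd.DataFrame({
    "group":  ["a", "a", "a", "b", "b", "b"],
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 1, 1],
})

for group, rows in results.groupby("group"):
    positives = rows[rows["y_true"] == 1]
    negatives = rows[rows["y_true"] == 0]
    tpr = (positives["y_pred"] == 1).mean() if len(positives) else float("nan")
    fpr = (negatives["y_pred"] == 1).mean() if len(negatives) else float("nan")
    print(f"group {group}: true-positive rate {tpr:.2f}, false-positive rate {fpr:.2f}")
# Large gaps in these rates across groups would be evidence *against*
# the claim that the system does not make inequalities worse.
```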


Not just what everyone needs but also that everyone does their part.

### Presentation 3 (Philippa Ryan – AI in healthcare: Complex development and clinical deployment)
:::info
- Slides: ==Add link==
:::
[AAIP](https://www.york.ac.uk/assuring-autonomy/about/our-team/) research fellow.

An AI-based prediction system that can support clinical decision-making for patients with Type 2 diabetes, with the aim that they do not develop comorbidities.
Independent, additional opinion based on patient data. Trained on real patient data.
We found a lot of issues with the training data that needed resolving; there is a paper about that (see above). The predictor produces false negatives and false positives, and bias and other issues play a role here.

The notation models actors who are responsible for tasks and outcomes, together with the resources they produce, and a number of different relationships between them.

Here's a simple example.
- The AI system is an actor: it predicts a collision and creates a warning, which is used by a safety driver, who intervenes and prevents the collision.
- What can go wrong, from a safety perspective?
- Insufficiency in AI system => late warning
- Knock on effect of late intervention from safety driver
- Who gets blamed after the event? The AI system has no moral agency, but the safety driver is usually held accountable. It could be a poorly engineered system, and yet the driver is blamed. This happened with an AV in America: the company (Uber) was not found liable at all, but the driver was.

Broad ecosystem of the diabetes prediction system.
Left-hand side: dev/training part.
Right-hand side: ops side.

**Training data..**
Actors involved:
- The database: how is it created in the first place? Millions of entries, gathered from clinics over time by lots of clinical staff. There is little control over the clinicians: different people, with different expertise, processes, and amounts of time. This causes all kinds of problems; the "many hands" problem = no insight into what's going on.
- The AI developer then works with this data. Part of their task is some manipulation/assurance of the database content, which is used for training the AI system.
- There are also clinical assessment criteria for how the data are reviewed: what are the significant features of the patient data? These are also used in the development process.
- Also: off-the-shelf software tools, open source, not developed with safety/assurance in mind.
- The health regulator needs to approve the prediction system. Good-practice guidance, common in the safety domain, supports them and helps them do their tasks correctly.
- Good practice = we think system is sufficiently safe

**...to operational side, and its impact on the patient.**
Actors involved:
- The AI system, here predicting a heart attack, generates explainability data, making transparent why a prediction was generated.
:::warning
**Explainability**
Human-centred explanations are important. For example, contrastive explanations are helpful for people (e.g. why one prediction was different from another), and can be localised to important features.
How do people understand the importance of explainability in context of fairness (e.g. enabling redress when something goes wrong)?
:::
- Two things were explored previously: (1) features of interest, presenting which features have the most impact on the output; and (2) prototypical examples, taking information from the training database about similar patients and other similar predictions. Both give the clinician insight into why the system comes to this conclusion.
- We do not know who has maintained data over time = potential problem with data that we're using.
- Non-electronic records...
- The clinician has a consultation with the patient and they come to a decision/treatment. The clinician has a moral obligation because of their duty of care.
Training data problems flow through the whole system
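On the explainability approaches mentioned above (features of interest and prototypical examples): as a generic illustration of the first idea, a sketch of permutation feature importance with scikit-learn. The stand-in data, model, and feature names are not the project's.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data and model; in the real system these would be the trained
# comorbidity predictor and held-out patient records.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does performance drop when a feature is shuffled?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: mean importance {score:.3f}")
```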

Issues with training data:
- millions of patient rows, only interested in T2 diabetes patients. Lots of missing/invalid data... Some patients have been in system for a long time, so there's lots of data about them.
- Missing groups of patients: ethnicities, ages...
- 14k potential columns in db, cut down to 20.

What do we do about this?
- Synthetically creating data for missing bits of information. (Risk: this can inflate the bias!)
- Prototypical examples = we were told this wasn't a ____ because clinicians shouldn't see the data for patients who aren't in their care.
- Cutting down to 20 columns...
- Missing data can be significant itself.
- Creating synthetic data is not always the right thing to do.
- How do we incorporate that into our ML process?
- Scalability was a problem.
- Manual review of this is not doable.
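A minimal sketch of one of the approaches hinted at here: keep the missingness as an explicit signal (an indicator column) alongside a simple imputation, rather than silently filling values in. The column names are hypothetical; pandas is assumed.
```python
import pandas as pd

# Hypothetical extract with missing values.
df = pd.DataFrame({"hba1c": [48.0, None, 52.0, None], "bmi": [31.2, 28.4, None, 35.0]})

for col in ["hba1c", "bmi"]:
    # Missingness can itself be informative (e.g. who gets tested), so record it.
    df[f"{col}_missing"] = df[col].isna().astype(int)
    # Simple median imputation -- note this can inflate bias if data are not
    # missing at random, which is exactly the concern raised in the talk.
    df[col] = df[col].fillna(df[col].median())

print(df)
```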
### Questions
1. Clarificatory question around why privacy could be at risk.
- De-anonymisation can occur for some patients even if personal data (e.g. name) are removed, because patterns remain in the data (e.g. age, gender, postcode). (A k-anonymity-style check is sketched at the end of this section.)
2. Why do imputation techniques for handling missingness not address problems for the case study? Do we have to have fully complete datasets before commencing?
- Imputation methods can exacerbate bias if not handled responsibly. For instance, how do you know why data are missing?
- Transparency of models can also limit interpretability.
- One mitigation strategy also relies on human intervention (e.g. from clinician). This is why having explainability is important. Without it, it takes the clinician out of the loop.
3. How does use of the system in deployment affect early development of Diabetes?
- Tom Lawton has been involved in designing processes for how the model could be implemented in practice (i.e. in safety-critical domain of healthcare).
- What help was provided for feature validation?
- 14,000 features at start, reduced to 20 through clinician consultation. Clinicians reviewed features based on clinical significance.
- This process may encode the bias of clinicians (e.g. expectations of features that support NICE guidelines).
- Are we just shifting the possible bias of algorithms to the bias of clinicians?
- Clinicians don't always agree on features.
4. How does explainability support fairness?
- Feature importance can mislead.
- Were clinicians involved in the process of choosing the interpretability methods?
- Project about simulated care...?
5. What does the system return in terms of outputs?
- High-risk vs. low-risk.
6. Terminological clarification around relationship between features and input variables.
https://photos.google.com/share/AF1QipM5-srkGLx7SWR0dACD7OsJ8hwZ8KVUg9S8iDZgfTbcrsC966N_w8P2dW2AhtCwrw?key=VWZnQ2d1UUJfUmNfM2ZfM0pjaDFjZHBzdllqejdB
An image from a recent ophthalmology conference highlighting how much of an issue ‘clinician agreement’ is…
The figure shows just how poorly ‘experts’ agree on grading the severity of diabetic eye disease on a retinal scan.
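Relating to question 1 above (de-anonymisation through remaining patterns), a minimal sketch of a k-anonymity-style check over quasi-identifiers. The file, column names, and threshold are hypothetical; pandas is assumed.
```python
import pandas as pd

# Hypothetical "anonymised" extract: direct identifiers removed, but
# quasi-identifiers (age band, sex, partial postcode) remain.
df = pd.read_csv("anonymised_extract.csv")

quasi_identifiers = ["age_band", "sex", "postcode_district"]
K = 5  # illustrative threshold: every combination should cover at least K patients

group_sizes = df.groupby(quasi_identifiers).size()
risky = group_sizes[group_sizes < K]
print(f"{len(risky)} quasi-identifier combinations cover fewer than {K} patients")
# Any such combination is a re-identification risk, even with names removed.
```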
## Photos of Activity
### Group 1

### Group 2



### Group 3
- Miro Board: https://miro.com/app/board/uXjVNJyPqpA=/