# Bias Assessment Meeting 1 Summary

###### tags: Pre-SoW
###### links: [Statement of Work](https://hackmd.io/IAuSG5jXR3azUz8nA4vBSg?both)

Summary of topics discussed during the first meeting with the customer. This document also contains project-related information and links to potentially usable artefacts.

[toc]

### Recap

In discussion between the team and the customer, we covered the topics outlined below, all framed by this executive summary:

> **Executive Summary**
> Bias and fairness have garnered a lot of attention in the past few years as deployed models have been connected to unfair practices<sup>1</sup> and dangerous outcomes<sup>2</sup>. A common practice in data science is to leverage pretrained models to save time, cost, and resources. The bias and fairness implications of these off-the-shelf, pretrained models are poorly attended to and understood. This project aims to detect, assess, and mitigate learnt biases in image classification models to inform future related work.

<sup>1. [Biased datasets and unfair policing (RUSI, 2020/2021)](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/831750/RUSI_Report_-_Algorithms_and_Bias_in_Policing.pdf)</sup>
<sup>2. [Fatal self-driving Uber accident from biased dataset (BBC, 2020)](https://www.bbc.com/news/business-50312340)</sup>

1. **Definition**
    - What bias and fairness could mean and the contexts in which they matter (e.g., rare diseases, reviewing job applications for teachers).
    - Bias assessment: understand what biases a model could have.
2. **Metrics**
    - The customer mentioned parity metrics currently being used.
    - We discussed how different metrics are mutually exclusive depending on whether individual or group fairness is targeted.
    - These metrics would drive the development of the bias assessment analysis.
3. **Approaches**
    - The customer is interested in image classification.
    - The customer expressed interest in looking at datasets that are not face recognition.
    - Interested in looking at different computer vision architectures (e.g., transformers, CNNs).
    - The specific stage of focus is post-processing; see post-processing techniques in [General Notes](https://hackmd.io/EcGAYuaOQdmGPKwbvUp-_w#Post-Processing-Models).
4. **Hypotheses**
    - Comparing bias in extant networks.
    - Comparing bias between image classification and object detection.
    - _Important to include a way to make these future-focussed, which can also be a new hypothesis set that looks into comparing bias between base and head models for approaches like few-shot learning._

### Task/Hypothesis

* H<sub>1</sub>: Comparison of the level of bias in each model per dataset (transformers vs. CNN vs. others?)
* H<sub>2</sub>: Object detection vs. image classification

**Notes**: The customer is interested in the bias assessment mechanism and in ensuring that it is future-focussed. For example, the bias assessment implementation here should be applicable to novel approaches like few-shot learning; a sketch of what such a model- and dataset-agnostic assessment loop could look like follows below.
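To make H<sub>1</sub> and H<sub>2</sub> concrete, here is a minimal sketch (PyTorch, assuming classification-style models) of a harness that scores any candidate model on any dataset with a pluggable bias metric. The helper name `bias_score`, the dataloader yielding `(images, group)` pairs, and the `bias_metric` callable are illustrative assumptions, not settled design choices.

```python
import torch


@torch.no_grad()
def bias_score(model, dataloader, bias_metric):
    """Collect predictions and protected-group labels, then score them.

    bias_metric is any callable mapping (predictions, groups) to a scalar,
    e.g. the statistical parity sketch in the Metrics section below.
    """
    model.eval()
    preds, groups = [], []
    for images, group in dataloader:  # group = a protected-attribute label
        preds.append(model(images).argmax(dim=1))
        groups.append(group)
    return bias_metric(torch.cat(preds), torch.cat(groups))
```

Because the model, data, and metric are all arguments, the same loop can in principle cover H<sub>1</sub> (sweeping architectures per dataset) and, with the prediction-extraction line swapped out, H<sub>2</sub>'s object detectors or future few-shot heads.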
### Datasets

Focus on people-related:

* image data
* need to have associated metadata
    * gender, race, age, ...

> &Dopf; &leftarrow; { Race/Ethnicity, Gender, Age, Skin Type/Tone, Lighting }

Name | D<sub>1</sub> | D<sub>2</sub> | D<sub>3</sub> | D<sub>4</sub> | D<sub>5</sub> | D<sub>x</sub> | Description
:-- | :-- | :-- | :-- | :-- | :-- | :-- | :--
[Pilot Parliaments Benchmark][1]<sup>1</sup> (n=1,270) | &check; | | &check; | &check; | | | The Pilot Parliaments Benchmark (PPB) was developed to achieve better intersectional representation on the basis of gender and skin type. PPB consists of 1,270 individuals from three African countries and three European countries. This dataset is labeled by the Fitzpatrick six-point skin type scale, allowing benchmarking of the performance of gender classification algorithms by skin type.
[Casual Conversations][3]<sup>2</sup> (n≈45,000) | &check; | &check; | &check; | &check; | | | Casual Conversations is composed of over 45,000 videos (3,011 participants) and is intended for assessing the performance of already trained computer vision and audio models, measuring their robustness and algorithmic fairness across a diverse set of ages, genders, apparent skin tones, and ambient lighting conditions.
[DiveFace][4] (n=115,729) | &check; | &check; | | | | | DiveFace contains annotations equally distributed among six classes related to gender and ethnicity (male, female, and three ethnic groups). Gender and ethnicity have been annotated following a semi-automatic process. There are 24K identities (4K per class). The average number of images per identity is 5.5, with a minimum of 3, for a total number of images greater than 150K. Users are grouped according to their gender (male or female) and three categories related to ethnic physical characteristics.
[FairFace][5] (n=108,501) | &check; | &check; | &check; | | | | Face attribute dataset balanced across race, gender, and age, for bias measurement and mitigation.
[CelebA][6]<sup>3</sup> (n=202,599) | ~ | &check; | &check; | | &check; | &check; | CelebA is a face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including 10,177 identities, 202,599 facial images, and 5 landmark locations per image.
[CASIA-WebFace][7] (n=903,304) | &check;* | &check;* | | | | | The CASIA-WebFace dataset is used for face verification and face identification tasks. The dataset contains 494,414 face images of 10,575 real identities from IMDb, with additional annotations evaluated through clustering of the gathered images.
[VGGFace2][8] (n≈3,310,000) | &check;* | &check; | &check;* | | &check;* | | VGGFace2 contains 3.31 million images of 9,131 subjects, with an average of 362 images per subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity, and profession (e.g., actors, athletes, politicians).

<sup>1. [A data access request form][1.1] is required.</sup><br/>
<sup>2. More information is available in Meta's [data card][3.1].</sup><br/>
<sup>3. This dataset also includes categories like facial hair and features alluding to race.</sup>
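Whichever datasets are shortlisted, the protected-attribute metadata needs to be accessible programmatically. As a minimal sketch, assuming the TorchVision copy of CelebA (the only dataset above with a built-in TorchVision loader), the 40 binary attributes can be read per image like this; the attribute names are CelebA's own:

```python
from torchvision import datasets, transforms

# target_type="attr" returns the 40-dim binary attribute vector per image.
celeba = datasets.CelebA(
    root="data",
    split="train",
    target_type="attr",
    transform=transforms.ToTensor(),
    download=True,  # the Google Drive download can be flaky; manual placement in `root` also works
)

male_idx = celeba.attr_names.index("Male")
young_idx = celeba.attr_names.index("Young")

image, attrs = celeba[0]
print(attrs[male_idx].item(), attrs[young_idx].item())  # 0/1 group labels
```

Note that CelebA encodes gender and age only as binary attributes ("Male", "Young") and race only indirectly (see footnote 3), which is presumably why it gets a "~" rather than a full check in D<sub>1</sub> above.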
[1]: https://www.ajl.org
[1.1]: https://assets.website-files.com/5e027ca188c99e3515b404b7/5e175cf3a3e13937d905ee00_PPB_Dataset_Access_Agreement_-_March_5%2C_2018.pdf
[2]: https://vision.princeton.edu/projects/2010/SUN/
[3]: https://ai.facebook.com/datasets/casual-conversations-dataset/
[3.1]: https://scontent-frt3-1.xx.fbcdn.net/v/t39.8562-6/271807697_654565988914899_800952401536247731_n.pdf?_nc_cat=102&ccb=1-5&_nc_sid=ae5e01&_nc_ohc=ayPdAztAcJoAX--6jbW&_nc_ht=scontent-frt3-1.xx&oh=00_AT8SM7u8vQkcgV7LbfGjw_HR9QmaKZwofP2NNv8QOezBHg&oe=6219FA58
[4]: https://github.com/BiDAlab/DiveFace
[5]: https://arxiv.org/abs/1902.00334
[6]: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
[7]: https://paperswithcode.com/dataset/casia-webface
[8]: https://arxiv.org/pdf/1710.08092v2.pdf
[9]: https://arxiv.org/pdf/1411.7923v1.pdf

<!-- Links to papers and datasets -->
[d1]: https://arxiv.org/pdf/1910.13268.pdf

### Models

Transformers vs. CNNs

* Need to detail subclasses of these architectures
    * CNN: single-shot detectors, pyramid networks, ...
    * Transformers: encoders, decoders, ...

#### Classification

Architecture | Subclass | Model | Training Data
:-- | :-- | :-- | :--
Transformer | Encoder | [ViT](https://arxiv.org/abs/2010.11929)<sup>1</sup> | ImageNet-1k, [ImageNette](https://github.com/fastai/imagenette), COCO
Hybrid | Encoder + CNN | [DeiT](https://arxiv.org/abs/2012.12877)<sup>1</sup> | ImageNet-1k, Face-Mask18K
Transformer | Encoder | [CLIP](https://arxiv.org/abs/2103.00020) | [Custom](https://github.com/openai/CLIP/blob/main/model-card.md)
CNN | - | [AlexNet](https://arxiv.org/abs/1404.5997)<sup>2</sup> | ImageNet(-1k?)
CNN | - | [VGG](https://arxiv.org/abs/1409.1556)<sup>2</sup> | ImageNet(-1k?)
CNN | Residual | [ResNet](https://arxiv.org/abs/1512.03385)<sup>2</sup> | ImageNet(-1k?)
CNN | Residual | [SqueezeNet](https://arxiv.org/abs/1602.07360)<sup>2</sup> | ImageNet(-1k?)
CNN | Residual | [DenseNet](https://arxiv.org/abs/1608.06993)<sup>2</sup> | ImageNet(-1k?)
CNN | - | [Inception v3](https://arxiv.org/abs/1512.00567)<sup>2</sup> | ImageNet(-1k?)
CNN | - | [GoogLeNet](https://arxiv.org/abs/1409.4842)<sup>2</sup> | ImageNet(-1k?)
CNN | - | [ShuffleNet v2](https://arxiv.org/abs/1807.11164)<sup>2</sup> | ImageNet(-1k?)
CNN | Inv Residual | [MobileNet v2](https://arxiv.org/abs/1801.04381)<sup>2</sup> | ImageNet(-1k?)
CNN | Inv Residual | [MobileNet v3](https://arxiv.org/abs/1905.02244)<sup>2</sup> | ImageNet(-1k?)
CNN | Residual | [ResNeXt](https://arxiv.org/abs/1611.05431)<sup>2</sup> | ImageNet(-1k?)
CNN | Residual | [Wide ResNet](https://pytorch.org/vision/stable/models.html#wide-resnet)<sup>2</sup> | ImageNet(-1k?)
CNN | - | [MNASNet](https://arxiv.org/abs/1807.11626)<sup>2</sup> | ImageNet(-1k?)
CNN | - | [EfficientNet](https://arxiv.org/abs/1905.11946)<sup>2</sup> | ImageNet(-1k?)
CNN | Residual | [RegNet](https://arxiv.org/abs/2003.13678)<sup>2</sup> | ImageNet(-1k?)
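For reference, a hedged sketch of pulling comparable pretrained classifiers from the two sources footnoted under the tables (TorchVision and Hugging Face); the checkpoint names are illustrative picks, not project decisions:

```python
from torchvision import models
from transformers import AutoImageProcessor, AutoModelForImageClassification

# TorchVision route (footnote 2): CNN baseline with ImageNet-1k weights.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.eval()

# Hugging Face route (footnote 1): ViT fine-tuned on ImageNet-1k.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
vit.eval()
```

Wrapping the Hugging Face model so that it returns plain logits (`vit(**inputs).logits`) would keep it interchangeable with the TorchVision models in the H<sub>1</sub> harness sketched above.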
#### Object Detection

Architecture | Subclass | Model | Training Data
:-- | :-- | :-- | :--
Hybrid | CNN + Encoder-Decoder | [DETR](https://arxiv.org/abs/2005.12872)<sup>1</sup> | COCO
Hybrid | CNN + Encoder-Decoder | [D-DETR](https://arxiv.org/abs/2010.04159) | [COCO](https://github.com/fundamentalvision/Deformable-DETR)
CNN | Feature Pyramid | [Faster R-CNN](https://arxiv.org/abs/1506.01497)<sup>2</sup> | COCO
CNN | Feature Pyramid | [Mask R-CNN](https://arxiv.org/abs/1703.06870)<sup>2</sup> | COCO
CNN | Focal Loss | [RetinaNet](https://arxiv.org/abs/1708.02002)<sup>2</sup> | COCO
CNN | Single-Shot | [SSD](https://arxiv.org/abs/1512.02325)<sup>2</sup> | COCO
CNN | Single-Shot | [SSDlite](https://arxiv.org/abs/1801.04381)<sup>2</sup> | COCO
CNN | Single-Shot | [YOLOv5](https://github.com/ultralytics/yolov5)<sup>3</sup> | COCO

<sup>1</sup> Pretrained models available from [Hugging Face](https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads)
<sup>2</sup> Pretrained models available from [TorchVision](https://pytorch.org/vision/stable/models.html)
<sup>3</sup> Pretrained models available from [Torch Hub](https://pytorch.org/hub/ultralytics_yolov5/)

### Metrics

* Statistical Parity: the customer has already looked into this metric and may be most interested in it; see the sketch below.
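Statistical (demographic) parity requires the positive-prediction rate to be independent of group membership: P(Ŷ = 1 | G = g) equal for every group g. A minimal sketch of the difference form, assuming binary predictions and one protected-group label per sample (the helper name is ours, not an established API):

```python
import numpy as np


def statistical_parity_difference(preds, groups):
    """Gap between the highest and lowest positive-prediction rate across groups.

    preds: array-like of 0/1 predictions; groups: protected-group label per sample.
    Returns 0.0 when the positive rate is identical for all groups.
    """
    preds = np.asarray(preds)
    groups = np.asarray(groups)
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)


# e.g. statistical_parity_difference([1, 0, 1, 1], ["a", "a", "b", "b"]) -> 0.5
```

This is a group-fairness criterion; as noted in the Recap, group and individual fairness metrics are mutually exclusive, so committing to it constrains the rest of the metric set.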