# Results

The following sections discuss the experiment outputs that address each of the two hypotheses set forth in previous sections. The first section discusses the presence of bias in model output with respect to fairness metrics and the baseline data set. The second section discusses the presence of, and differences in, bias measurements across models and their architectures.

## Hypothesis 1: Model-Introduced Bias

The first hypothesis concerns the presence of bias in model output, i.e., the introduction of learnt bias into a model's output. Three tests were run on the experiment outputs to support the decision to accept or reject the null hypothesis in favour of the alternative hypothesis. In general, the null hypothesis states that models do not introduce bias during inference. The alternative hypothesis states that models do introduce a statistically significant amount of bias into their outputs. The three tests discussed in this section are:

1. Model statistical parity (&mu;<sub>model</sub>) vs. perfect statistical parity (&mu;=0)
2. Model statistical parity (&mu;<sub>model</sub>) vs. baseline data statistical parity (&mu;<sub>data</sub>)
3. In- vs. out-of-bound model statistical parity quartiles (-&theta; < Q<sub>1</sub> & Q<sub>4</sub> < &theta;)

Each test for model-introduced bias was run on each of the four data subsets described in Section X. Statistically speaking, the results were generally inconclusive. However, there is practical significance to be considered.

### Model vs. Perfect Statistical Parity

The first test compares the mean statistical parity of each model against perfect statistical parity (&mu;=0). This test checks for absolute bias with respect to a perfectly unbiased benchmark (see Formula X).

Formula X:

H<sub>0</sub>: &mu;<sub>model</sub> = 0
H<sub>A</sub>: &mu;<sub>model</sub> ≠ 0

Each of these tests compared each model per data subset derived from the test set. Each model was run over _k_ folds to generate multiple observations. Each model's _k_-fold distribution mean was compared against a statistical parity of zero using a two-tailed, independent t-test (a sketch of this comparison appears at the end of this subsection).

[insert: table with p-values]
[insert: barplot with thresholds]

Table X shows that in no scenario is the difference between a model's predictions and a perfect statistical parity score statistically significant (p<sub>min</sub>=0.48; p<sub>max</sub>=0.99; &alpha;=0.05). For this test, we must accept the null hypothesis that model predictions are not statistically different from a statistical parity score of zero. This result does not indicate that these models never produce biased predictions. Instead, it shows that model predictions can, at times, be unbiased. Practically, however, the effect sizes indicate a medium to large difference between the means, which warrants further investigation. This difference is visible in the consistent tendency for model means to shift left or right in Table X even while remaining within the threshold.

While this test compares model output with a statistical parity score of zero, i.e., no bias, it does not quite tease apart how much bias is introduced by the model itself. The next test examines model-introduced bias relative to the data.

* Describe testing parameters
  * &alpha;, 1 - &beta;, n, effect size
* Describe results
  * Not statistically significant
  * Accept the null hypothesis
  * Effect size still warrants further investigation
  * Distribution spread is variable; a little luck of the draw
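To make the procedure concrete, the sketch below computes a statistical parity difference for each of the _k_ folds of a model's predictions and compares the fold-level distribution against zero. The column names (`kfold`, `prediction`, `protected`), the sign convention (positive scores favour the privileged group, matching the discussion in this section), and the use of a one-sample, two-tailed t-test as the comparison against zero are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import pandas as pd
from scipy import stats

THETA = 0.1  # acceptable statistical parity threshold used throughout


def statistical_parity_difference(df: pd.DataFrame) -> float:
    """P(favourable | privileged) - P(favourable | unprivileged).

    Assumes `prediction` holds the favourable label as 1 and `protected`
    marks the privileged group with 1 and the unprivileged group with 0.
    Positive scores therefore favour the privileged group (an assumed
    convention chosen to match the discussion in this section).
    """
    favourable_priv = df.loc[df["protected"] == 1, "prediction"].mean()
    favourable_unpriv = df.loc[df["protected"] == 0, "prediction"].mean()
    return favourable_priv - favourable_unpriv


def parity_per_fold(predictions: pd.DataFrame) -> pd.Series:
    """Statistical parity difference for each of the k folds."""
    return predictions.groupby("kfold").apply(statistical_parity_difference)


def test_against_perfect_parity(predictions: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Two-tailed test of the k-fold parity mean against mu = 0."""
    scores = parity_per_fold(predictions)
    result = stats.ttest_1samp(scores, popmean=0.0)
    return {
        "mean_parity": float(scores.mean()),
        "p_value": float(result.pvalue),
        "reject_h0": bool(result.pvalue < alpha),
    }
```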
### Model vs. Data Set Statistical Parity

The second test compares the mean statistical parity of each model against the mean statistical parity of each data subset run through that model. This test checks for relative bias with respect to the measure of bias already present in the data (see Formula X).

Formula X:

H<sub>0</sub>: &mu;<sub>model</sub> = &mu;<sub>data</sub>
H<sub>A</sub>: &mu;<sub>model</sub> ≠ &mu;<sub>data</sub>

As in the previous section, each of these tests compared each model per data subset derived from the test set. Each model was run over _k_ folds to generate multiple observations. Additionally, each data subset was measured over the same _k_ folds to create a comparable distribution. Each model's _k_-fold distribution mean was compared against that of its corresponding data subset using a two-tailed, independent t-test (see the sketch at the end of this subsection).

[insert: table with p-values]
[insert: barplot with thresholds]

Table X shows that a majority of the scenarios result in a statistically significant difference between model and data subset means (p<sub>min</sub>≈0; p<sub>max</sub>=0.29; &alpha;=0.05). Only 2 of the 28 tests showed no statistical significance, with p-values of 0.09 and 0.28 respectively. Both of these non-significant tests come from the CLIP transformer model on the data subsets predicting a binary age. For this test, we reject the null hypothesis for the remaining 26 of 28 scenarios. That is, the statistical parity scores of model predictions are statistically different from those of the data subsets passed through them.

This result indicates that these models tend to introduce bias into their predictions regardless of the protected attribute. More specifically, models tend to introduce bias that favours privileged groups over unprivileged groups. This holds regardless of whether the data subset's statistical parity score is under or over the acceptable threshold (&theta;=&pm;0.1). The effect is particularly noticeable in `wiki-age-low` (see Chart X), which has a starting statistical parity fairly close to zero (&mu;<sub>data</sub>&lt;0.05), yet its model predictions cross zero and tend toward the privileged group while remaining within the threshold.

The only notable exception to this trend is the CLIP transformer for `wiki-age-low` and, possibly, `wiki-age-high`. The mean in both scenarios does not tend toward the privileged group on inference; rather, it is equal to or tends toward the unprivileged group. A likely reason for this exception is CLIP's unique architecture, which takes both image and text as input. While it is possible that age-related text is lacking in its pretrained weights, we did train CLIP on images only, as each image in the `wiki` data set had no associated text provided.

This test and the previous test in Section X look at the means of the _k_-fold distributions of statistical parity scores. However, not all distributions fall cleanly within the acceptable threshold for statistical parity (|x|&le;0.1). The next test looks at how neatly these distributions fall within that threshold.

* Describe results
  * Majority statistically significant
  * Majority reject null hypothesis
  * Effect size does lend weight to consideration
  * Though distribution spread is variable, the majority of distribution means shift to the right (i.e., favour the privileged group)
  * Exception is CLIP for the age prediction
    * Could be that age is harder to predict from an image alone; the text modules of CLIP were not trained and CLIP relies on those parts
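The comparison against the data subset can reuse the same fold-level parity helper from the previous sketch: one parity distribution is computed from the model's predictions, another from the ground-truth labels of the data subset, and the two are compared with an independent, two-tailed t-test. As before, the variable names are illustrative and &alpha;=0.05 follows the testing parameters above.

```python
import numpy as np
from scipy import stats


def test_against_data_parity(model_scores, data_scores, alpha: float = 0.05) -> dict:
    """Two-tailed, independent t-test of model parity vs. data-subset parity.

    `model_scores` and `data_scores` are the k-fold statistical parity
    distributions for a model's predictions and for the ground-truth labels
    of the data subset passed through that model.
    """
    result = stats.ttest_ind(model_scores, data_scores)
    return {
        "model_mean": float(np.mean(model_scores)),
        "data_mean": float(np.mean(data_scores)),
        "p_value": float(result.pvalue),
        "reject_h0": bool(result.pvalue < alpha),
    }
```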
### Bounded Model Statistical Parity

The third test compares the distribution of statistical parity scores for each model against the positive and negative threshold. This test checks whether the first and fourth quartiles fall within the threshold that separates biased from unbiased models for a set of predictions (see Formula X).

Formula X:

H<sub>0</sub>: -&theta; < Q<sub>1</sub>, Q<sub>4</sub> < &theta;
H<sub>A</sub>: Q<sub>1</sub> &le; -&theta; or Q<sub>4</sub> &ge; &theta;

This test compared the _k_-fold distribution of statistical parity scores for each model per data subset derived from the test set against the threshold for acceptable statistical parity (&theta;=&pm;0.1). That is, it checks if and how frequently a model is biased toward the privileged or unprivileged group. Each model's _k_-fold distribution was compared against the threshold &theta;. Only the first and fourth quartiles of each distribution were checked; any quartile that crossed the threshold marked the distribution as out-of-bounds, as illustrated in the sketch below. The null hypothesis is accepted only when both quartiles (i.e., Q<sub>1</sub> and Q<sub>4</sub>) fall within the threshold bounds (see Formula X).
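A minimal sketch of the bounds check is shown below. It interprets Q<sub>1</sub> and Q<sub>4</sub> as the 25th and 75th percentile boundaries of the _k_-fold parity distribution; that reading of the notation, along with the fixed &theta;=0.1, is an assumption made for illustration. The lower, upper, and total excess values correspond to the quantities reported in the excess table referenced below.

```python
import numpy as np

THETA = 0.1  # acceptable statistical parity threshold


def quartile_bounds_check(parity_scores, theta: float = THETA) -> dict:
    """Check whether a model's k-fold parity distribution stays within +/- theta.

    Q1 and Q4 are interpreted here as the 25th and 75th percentile
    boundaries of the distribution. The distribution is out-of-bounds
    if either boundary crosses the threshold.
    """
    q1, q4 = np.percentile(parity_scores, [25, 75])
    lower_excess = max(0.0, -theta - q1)  # how far Q1 dips below -theta
    upper_excess = max(0.0, q4 - theta)   # how far Q4 rises above +theta
    return {
        "q1": float(q1),
        "q4": float(q4),
        "lower_excess": lower_excess,
        "upper_excess": upper_excess,
        "total_excess": lower_excess + upper_excess,
        "out_of_bounds": lower_excess > 0 or upper_excess > 0,
    }
```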
[insert: table with upper, lower, and total excess]
[insert: barplot with Q1/4 crossings circled]

Table X shows that models' distributions exceed the statistical parity threshold in a majority of the scenarios (n<sub>exceed</sub>=20; n<sub>total</sub>=28). Models are more likely to exceed the threshold, i.e., produce biased results, in both `wiki-*-low` scenarios. These subsets are balanced by undersampling the overrepresented targets to match the underrepresented targets. Convolutional neural networks exceed the threshold less frequently (freq=50%) than transformers (freq=87.5%) as a whole. All models exceeded the threshold for the `wiki-age-low` subset, possibly owing to a combination of the ordinal-to-binary target conversion and the subsampling process.

For this test, we reject the null hypothesis for 20 of the 28 scenarios and accept it for the remaining 8. Overall, the results are inconclusive. However, they do indicate that models are prone to producing biased output at varying frequencies. In general, models produce biased output in about 1 in 5 predictions (freq=18.9%) on average, apart from the models that were completely within bounds (see Table X, Chart X). In `wiki-*-low` subsets, models produce biased output more frequently, from 1 in 3 predictions (freq=32.4%) on average to as high as 1 in 2 predictions (freq=58.3%). In `wiki-*-high` subsets, models produce biased output less frequently, from about 1 in 20 (freq=4.8%) to about 1 in 14 predictions (freq=7.3%).

These results highlight both the benefit of balancing the data set for fine-tuning a model and the tendency for biased output with incomplete data sets. In `wiki-*-high` subsets, the underrepresented target label is oversampled with replacement to balance the data set; neither target label suffered from sample loss with respect to the entire data set. This type of balancing created fine-tuned models that were less likely to produce biased output. In `wiki-*-low` subsets, the overrepresented target label is subsampled to balance the data set; in effect, data samples were excluded from the model during fine-tuning. This type of balancing resulted in models that were more likely to produce biased output.

This test looked at the quartiles of each model's statistical parity distribution and compared the first and fourth quartiles against the threshold. Overall, models were observed to produce biased output in the majority of experiments. The data show that oversampling the underrepresented class with replacement has a positive effect in reducing the frequency of biased output.

* Describe data grouping
  * Data grouped by `model_tag`, `data_tag`, `data_split`, `kfold`
* Describe testing parameters
  * n
  * Each `model_tag`'s quartiles are checked
  * Outliers are excluded
* Describe results
  * Majority statistically significant
  * Majority reject null hypothesis
  * Effect size lends weight to consideration
  * Though distribution spread is variable, certain models were firmly within the threshold &pm;&theta; for at least half of the scenarios w.r.t. `data_tag` subsets

#### Summary

* Overall, inconclusive
* Majority of tests do indicate models introduce bias
* Effect size is medium to large for each test, indicating practical significance in the results
* Almost all `data_tag` subsets were skewed toward the unprivileged class for each target &isin; {age, gender}
* Data sets were balanced for even targets
* Results found that models tend to favour privileged classes
* Need to test data sets with a baseline favouring privileged classes

---

* Set context
  * Reference points
    * Data set: &mu;<sub>ij</sub> = &Dopf;<sub>j</sub>
    * Ideal metric: -&theta; < &mu;<sub>ij</sub> < &theta;
* Summary
  * Inconclusive overall
  * Inconclusive per data set as 1/28 experiments does not reject null
  * Conditionally conclusive per ideal metric as all experiments fall within normal values for bias
* Reference Point 1: Data Set
* Reference Point 2: Ideal Metric
* Summarise hypothesis 1 results
  * Statistically speaking, inconclusive
  * Practically speaking, balancing

## Hypothesis 2: Difference in Bias

The second hypothesis is about detecting any difference in model-introduced bias between the different models and architectures. There were three tests run on the experiment output to support the decision to accept or reject the null hypothesis in favour of the alternative hypothesis. In general, the null hypothesis states that there is no difference in the amount of bias each model or model class introduces into its output, if any at all. The alternative hypothesis states that there is a statistically significant difference in the amount of bias introduced by different models and model classes. The three tests that were run are:

1. Statistical parity by model (n=7)
2. Statistical parity by model architecture (n=2)
3. Statistical parity spread by model

As a whole, the results overwhelmingly indicated that the differences in statistical parity means between models and between model architectures were not statistically significant. However, a comparison of the distributional spread of each model's experimental runs does show a clear difference between the transformer and convolutional neural network architectures.
* Describe data
  * Refer to data table showing data splits
* Describe testing parameters
  * &alpha;, 1 - &beta;, n, effect size

### Statistical Parity by Model

The first test compares the different models' outputs against each other (see Formula X). This test aims to identify whether the models are similar with respect to biased output or whether one or more are significantly different.

Formula X:

H<sub>0</sub>: &mu;<sub>model i</sub> = &mu;<sub>model j</sub>
H<sub>A</sub>: &mu;<sub>model i</sub> ≠ &mu;<sub>model j</sub>

This test compared the _k_-fold distribution of statistical parity scores for each model per data subset against the other models' distributions. Each pair of models was compared using an independent, two-tailed t-test for each subset, for a total of 196 (7 models &times; 7 models &times; 4 subsets) comparisons. An ANOVA test was also run across the 7 models for each subset and showed results similar to the t-tests, albeit as a summary figure (a sketch of these comparisons appears at the end of this subsection).

[insert: table of hypotheses with accept/reject]
[insert: table of ANOVA]

The p-values in Table X show that the differences between most models' output means are not statistically significant. One model, CLIP, presents statistically significant differences when compared against both transformers (ViT, DeiT) and convolutional neural networks (resnet, densenet). Two models (resnet, SWIN) consistently show diverging distributions across three subsets.

For this test, the results are inconclusive, though the majority of the individual experiments do support the null hypothesis. CLIP is the exception that shows statistical significance, which, similar to the reasons given in Section X (Model vs. Data Set Statistical Parity), likely stems from an unused portion of the model. Apart from statistical significance, the data indicate practical significance in about 1 in 7 experiments, including CLIP. These experiments have effect sizes that are medium to large (p&le;0.3; n=17; &alpha;=0.05; power=0.8). These effect sizes indicate that, despite not every model showing statistical significance, the means are different enough to have an impact on the amount of bias in a model's output overall.

[insert: table of spreads]

| model_tag | architecture | Q4 - Q1 |
| --- | --- | --- |
| vit | transformer | 0.340 |
| swin | transformer | 0.296 |
| deit | transformer | 0.445 |
| clip | transformer | 0.400 |
| resnet | convnet | 0.140 |
| densenet | convnet | 0.217 |
| convnext_tiny | convnet | 0.393 |

Practical significance can be further observed in the difference between each model's distribution spread. The distribution spread indicates the tendency for a model to produce biased output (wider) or unbiased output (tighter). Overall, resnet has the tightest distribution, i.e., the least variability, and DeiT has the widest distribution, i.e., the most variability.

This test looked at a model-versus-model comparison for statistically significant differences in statistical parity distribution means. The null hypothesis is accepted for the overwhelming majority of the models, with CLIP being the notable exception. Though strictly inconclusive considering CLIP, most experiments support the null hypothesis; models generally exhibit a similar amount of biased output.

* Describe data grouping
  * Data grouped by `model_tag`, `data_tag`, `data_split`, `kfold`
* Describe results
  * Majority not statistically significant
  * Majority accept null hypothesis
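A sketch of the model-versus-model comparisons is given below, assuming a long-format table of fold-level parity scores keyed by `model_tag` and `data_tag` with a `score` column (the grouping columns follow the notes above; the layout itself is an assumption). It runs each unordered pair of models through an independent, two-tailed t-test and a one-way ANOVA across all models per subset, whereas the write-up counts the full 7 &times; 7 grid of comparisons.

```python
from itertools import combinations

import pandas as pd
from scipy import stats


def pairwise_model_tests(parity: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Pairwise, two-tailed t-tests between models' k-fold parity scores."""
    rows = []
    for data_tag, subset in parity.groupby("data_tag"):
        by_model = {tag: grp["score"].to_numpy() for tag, grp in subset.groupby("model_tag")}
        for model_a, model_b in combinations(sorted(by_model), 2):
            result = stats.ttest_ind(by_model[model_a], by_model[model_b])
            rows.append({
                "data_tag": data_tag,
                "model_a": model_a,
                "model_b": model_b,
                "p_value": result.pvalue,
                "reject_h0": result.pvalue < alpha,
            })
    return pd.DataFrame(rows)


def anova_by_subset(parity: pd.DataFrame) -> pd.DataFrame:
    """One-way ANOVA across all models for each data subset."""
    rows = []
    for data_tag, subset in parity.groupby("data_tag"):
        groups = [grp["score"].to_numpy() for _, grp in subset.groupby("model_tag")]
        f_stat, p_value = stats.f_oneway(*groups)
        rows.append({"data_tag": data_tag, "f_statistic": f_stat, "p_value": p_value})
    return pd.DataFrame(rows)
```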
### Statistical Parity by Model Architecture

The second test compares the different model architectures (see Table X). This test checks for differences in the means between transformers and convolutional neural networks (see Formula X).

Formula X:

H<sub>0</sub>: &mu;<sub>t</sub> = &mu;<sub>c</sub>
H<sub>A</sub>: &mu;<sub>t</sub> ≠ &mu;<sub>c</sub>

This test compared the _k_-fold distributions of statistical parity scores for transformers and convolutional neural networks (&alpha;=0.05; power=0.8; n=119). The architectures were compared using an independent, two-tailed t-test for each subset.

[insert: table of hypotheses with accept/reject]

| data_tag | p_value | statistic | df |
|:-----------------|----------:|------------:|-----:|
| wiki-age-high | 0.86 | -0.18 | 117 |
| wiki-age-low | 0.54 | -0.61 | 117 |
| wiki-gender-high | 0.89 | -0.14 | 117 |
| wiki-gender-low | 0.21 | -1.26 | 117 |

[insert: chart of architectures boxplots]

The p-values in Table X show that the differences between the architectures are not statistically significant. The null hypothesis is accepted. Similar to the effects of data balance sampling introduced in Sections X and Y, the p-value is lower in both `wiki-*-low` subsets. That is, any difference between architectures becomes more pronounced. Despite the lack of statistical significance in distribution means, the architectures do exhibit visual differences. Chart X shows that the convolutional neural networks have a tighter distribution than the transformers. Similar observations can be made from the individual models' distributions in Section X, Chart Y. More specifically, convolutional neural networks (d<sub>c</sub>=Q<sub>4</sub>-Q<sub>1</sub>=0.062) are about one-third less variable than transformers (d<sub>t</sub>=Q<sub>4</sub>-Q<sub>1</sub>=0.093). Transformers also extend further beyond the threshold than convolutional neural networks (Q<sub>1c</sub>-Q<sub>1t</sub>+Q<sub>4c</sub>-Q<sub>4t</sub>=0.085).

This test compared the two computer vision architectures: transformers and convolutional neural networks. There were no statistically significant differences detected between the two architectures. However, convolutional neural networks generally show less variation in their scores, as seen in their boxplots.

* Describe data grouping
  * Data grouped by `model_tag`, `data_tag`, `data_split`, `kfold`
* Describe results
  * Majority not statistically significant
  * Majority accept null hypothesis

---

* Set context
  * Reference points
    * By model
    * By architecture

## Bias Mitigation

Methods exist to mitigate bias learnt and introduced during any of the three phases of model training and inference (cite X). These phases are preprocessing, training, and postprocessing. We explore a bias mitigation strategy that postprocesses model output during inference; specifically, we utilise reject option classification. The sections below discuss the results of the reject option classifier built and trained for each experiment per Table X. In short, reject option classification does a good job of bringing the distribution means for statistical parity closer to zero (see Section X) and decreasing the variability in each model's predictions (see Section X).

### Reject Option Classifier

The reject option classifier (ROC) learns a new decision boundary as well as a rejection margin around that boundary. The ROC leverages protected attribute data during training to identify the optimal threshold and margin for future inference.

[insert: formula]

For each data subset and model, a trained ROC (see Section X) was run on each model's predictions over _k_ folds (a simplified sketch of the decision rule follows).
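The sketch below illustrates the reject option decision rule in its simplest form: predictions whose scores fall inside a margin around the decision boundary are reassigned so that unprivileged instances receive the favourable label and privileged instances the unfavourable one. The fixed boundary, margin, and encoding of the protected attribute are illustrative; the ROC used in the experiments learns these values from held-out data, and a library implementation (e.g., the post-processing algorithms in AIF360) would typically be used in practice.

```python
import numpy as np


def reject_option_predict(scores, protected, boundary: float = 0.5, margin: float = 0.1):
    """Apply a reject-option rule to predicted favourable-class scores.

    Outside the critical region [boundary - margin, boundary + margin] the
    usual threshold rule applies. Inside it, unprivileged instances
    (protected == 0) receive the favourable label (1) and privileged
    instances (protected == 1) receive the unfavourable label (0).
    """
    scores = np.asarray(scores, dtype=float)
    protected = np.asarray(protected)

    labels = (scores >= boundary).astype(int)        # default decision rule
    critical = np.abs(scores - boundary) <= margin   # rejection region
    labels[critical & (protected == 0)] = 1          # favour unprivileged
    labels[critical & (protected == 1)] = 0          # disfavour privileged
    return labels
```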
The ROC mitigates bias introduced during model inference. This strategy has a minimal impact on bringing a model's mean statistical parity score closer to zero: the ROC shifts distribution means by only thousandths of a statistical parity point. However, the ROC does have a large impact on reducing the variability in a model's statistical parity score distribution. Score distributions are reduced to between 81% and 92% of their original size (see Table X). In effect, the ROC reduces a model's propensity to generate biased output. The reduction in variability can be seen in Chart X.

[insert: table distribution changes]
[insert: chart boxplot]

The ROC performed well for these experiments. It is worth noting that the data used to fine-tune the models and the data used to train the ROC were drawn from the same distribution. This similarity of data distributions needs to be taken into account when employing the ROC on unseen data. Additionally, it is not recommended to use the ROC mitigation strategy on its own. Rather, it should be used in conjunction with other bias mitigation strategies (e.g., data balancing) (cite).

# Discussion

The two research questions this project set out to investigate asked about the presence of bias in model outputs and whether model architecture plays a role. Looking strictly at statistical significance yields many inconclusive results across the various tests for each research question (see Sections X, Y, Z). However, with respect to a data set's baseline level of statistical parity, the results show that models do introduce bias during inference (see Section X). The bias introduced by models tends to favour privileged groups in the data. This trend holds across various contexts and methods of balancing data (see Section X). It is also clear that balancing data has a large impact on learnt biases (see Section X). Practically speaking, three themes emerge from the results:

1. Models introduce bias during inference favouring privileged groups
2. Convolutional neural networks show less propensity for biased output than transformers
3. Oversampling underrepresented target labels has a positive effect in reducing biased predictions

Section X demonstrates that models generally introduce bias, relative to the data set passed for inference, that usually favours the privileged group. The protected attribute information that identifies privileged and unprivileged groups was never used to train the models. This result hints that models are good at implicitly learning attributes in the data.

Convolutional neural networks were in general less prone to exceeding acceptable statistical parity thresholds than transformers. Though we could point to the origins of each architecture, with convnets designed specifically for images and transformers designed for text, it is difficult to say exactly what about the models accounts for this difference.

Lastly, oversampling the underrepresented target label noticeably reduced bias in a model's output. The protected attributes were not used to identify which observations to oversample. Despite this disconnect between protected attributes and target labels, it is possible, though not definitive, that oversampling the underrepresented target label replicated enough samples belonging to the unprivileged group. In turn, this benefited the model's ability to return less biased output (a sketch of this balancing step follows below).
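As a concrete illustration of the third theme, the sketch below balances a binary target by oversampling the underrepresented label with replacement, the approach used for the `wiki-*-high` subsets. The DataFrame layout and `target` column name are assumptions; note that the protected attribute is deliberately not consulted, mirroring the experiments.

```python
import pandas as pd


def oversample_minority(df: pd.DataFrame, target: str = "target", seed: int = 0) -> pd.DataFrame:
    """Balance a binary target by oversampling the minority label with replacement.

    Only the target label is consulted; protected attributes play no part
    in selecting which rows to replicate.
    """
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]

    # Draw with replacement until the minority matches the majority count,
    # then shuffle so replicated rows are not clustered together.
    upsampled = minority.sample(n=len(majority), replace=True, random_state=seed)
    return pd.concat([majority, upsampled]).sample(frac=1, random_state=seed)
```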
# Recommendation

This project demonstrated that there is some level of bias learnt by the models during training or fine-tuning, and that this bias is detectable in their outputs during inference. The project explored the existence of bias, differences between models, and the effect of the reject option bias mitigation classifier. These activities only scratch the surface of the work that can be done to mitigate bias in models. While a great deal of research is devoted to better understanding bias in machine learning and identifying strategies to address it, some of the strategies employed during this project show immediate utility and prompt further investigation.

Two key takeaways that can be immediately put into practice are balancing data sets and model selection. Balancing data sets prior to fine-tuning or model training has been shown to have a large impact on biased outputs (see Section X). This project highlights the positive effect of oversampling underrepresented data and, by extension, the unprivileged populations those data represent. Selecting the appropriate model for the task has also been shown to have an impact on biased outputs (see Section X). Though models generally perform similarly with respect to the level of bias in their outputs, convolutional networks appear to have the upper hand when it comes to fairer outcomes. This could stem from the origins of each model architecture, with convnets designed to mimic vision and transformers co-opted from the natural language processing space.

Further investigation is warranted into data sets whose tasks have more appropriate protected attributes and into model training strategies. This project worked with a data set that was intended for age prediction rather than a more consequential task (e.g., applicant screening, criminal identification). The authors scraped much of the data from the internet, which resulted in a mostly clean set of images. However, it also left many images that were either corrupted or irrelevant (e.g., a person's home instead of the person). Further, some of the protected attribute data were calculated or outdated. For example, calculated ages were sometimes negative and the gender column was binary. Looking into model training strategies more closely would also help increase understanding of which part of a model, if any, contributes most to biased outputs. This would be particularly useful when deciding which model to use for a given task.

Lastly, bias mitigation strategies are not a one-size-fits-all approach. The ROC was the main focus of this project, along with some rudimentary data balancing. It is worth looking into ensembles of various bias mitigation strategies to move toward a less biased model.