# 11.12.2020: Potential Next Experiments
On a high level, explanations are interesting and relevant because social scientists often prefer interpretable methods and favor explanation over prediction. We are also well positioned to conduct this research: explainability in CSS has yet to be studied in depth, and we have access to social scientists.
1. Counterfactuals and Performance
Training on counterfactuals has led to improvements in machine learning methods for sentiment analysis ([Kaushik and Lipton, 2020](https://arxiv.org/pdf/1909.12434.pdf)) and sexism detection (Samory et al., 2021), especially for ‘out-of-domain’ data. We propose a more fine-grained and rigorous evaluation of the effect of training on counterfactuals.
**Data**: Sexism data (ours) and Sentiment (Kaushik and Lipton, 2020)
**Test**: one held-out ‘out-of-domain’ dataset per task {sexism: Reddit manosphere, sentiment: other datasets}
**Method**: ML models (LogReg, CNN, BERT)
**Experiment 0**: Train on counterfactuals and check whether this leads to better performance on the original data.
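A minimal sketch of this comparison, assuming the original training data, its counterfactuals (with revised labels), and a test split are already loaded as lists (`orig_texts`, `cf_texts`, etc. are placeholder names); the TF-IDF + LogReg baseline stands in for the full set of models:

```python
# Sketch of Experiment 0: does adding counterfactuals to the training data
# improve performance on the original test split?
# Assumed placeholders: orig_texts/orig_labels, cf_texts/cf_labels,
# test_texts/test_labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def train_and_eval(train_texts, train_labels, test_texts, test_labels):
    """Fit a TF-IDF + LogReg baseline and return macro-F1 on the test split."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")

f1_orig_only = train_and_eval(orig_texts, orig_labels, test_texts, test_labels)
f1_augmented = train_and_eval(orig_texts + cf_texts, orig_labels + cf_labels,
                              test_texts, test_labels)
print(f"original only: {f1_orig_only:.3f} | with counterfactuals: {f1_augmented:.3f}")
```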
**RQ1:** Does training on counterfactuals lead to more robust methods? [robust = resistant to adversarial attacks]
**Experiment 1:** Train on counterfactuals and check whether the resulting models perform better on adversarial examples, generated via word- and character-level perturbations and synonym replacement.
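One simple way to generate such perturbations is sketched below; a library such as TextAttack would provide more systematic attacks, and WordNet substitution is only a rough stand-in for the intended synonym replacement:

```python
# Sketch of the two perturbation types used to build adversarial examples.
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") once

def char_swap(text, n_swaps=1):
    """Swap n_swaps pairs of adjacent characters at random positions."""
    chars = list(text)
    if len(chars) < 2:
        return text
    for _ in range(n_swaps):
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def synonym_replace(text, n_words=1):
    """Replace up to n_words tokens with a WordNet synonym, when one exists."""
    tokens = text.split()
    for i in random.sample(range(len(tokens)), k=min(n_words, len(tokens))):
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(tokens[i]) for l in s.lemmas()}
        lemmas.discard(tokens[i])
        if lemmas:
            tokens[i] = random.choice(sorted(lemmas))
    return " ".join(tokens)

print(char_swap("Call me sexist or whatever you wish"))
print(synonym_replace("Call me sexist or whatever you wish", n_words=2))
```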
**RQ2:** Does training on counterfactuals lead to more generalizable methods? [generalizable = better performance on out-of-domain data and controversial cases (lower agreement)]
**RQ3:** What are the characteristics of an effective counterfactual? [effective = leads to better performance of the method] -or- Is there a universally effective counterfactual?
**Experiment 3:** Disaggregate counterfactuals based on their multiple characteristics:
- ‘Distance’ from the original
  - Edit distance
  - Semantic distance
- Category of counterfactual (based on qualitative coding, for example, ‘changing modifier’, ‘removing gender words’, etc.)
- Diversity of counterfactuals (are more diverse counterfactuals, i.e., those that span several categories for a single original tweet, better?)
Then fit a regression model to see which characteristic best predicts effectiveness (a minimal sketch of the distance features and the regression follows).
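A minimal sketch, assuming `pairs` holds (original, counterfactual, effect) tuples where `effect` is a yet-to-be-defined per-example effectiveness score; the categories from the qualitative coding would enter the regression as dummy variables:

```python
# Sketch of per-counterfactual characteristics plus the regression.
import difflib
import statsmodels.api as sm
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def edit_distance(a, b):
    """Normalized string dissimilarity: 1 - SequenceMatcher ratio."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def semantic_distance(a, b):
    """1 - cosine similarity between sentence embeddings."""
    emb = encoder.encode([a, b], convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()

X = sm.add_constant([(edit_distance(o, c), semantic_distance(o, c))
                     for o, c, _ in pairs])
y = [effect for _, _, effect in pairs]
print(sm.OLS(y, X).fit().summary())  # which characteristic predicts effectiveness?
```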
Bonus: Can we automatically generate counterfactuals by training a model on existing counterfactuals?
----------------------------------------------------------------------------------------
2. Counterfactuals and Explanations
Explanations are important for understanding whether a machine learning model can be trusted. Counterfactuals are described as the highest order of explanation ([Moraffah et al, 2020](https://www.kdd.org/exploration_files/4._CR._25._Causal_Explainability_Survey-final.pdf), [Pearl, 2018](https://arxiv.org/pdf/1801.04016.pdf)), but the relationship between counterfactuals for text data and their explanatory power has yet to be studied. We have already seen that training on counterfactuals leads to more robust methods (in some respects), but does training on counterfactuals also lead to more explainable methods?
**Data:** Sexism data (ours) and Sentiment (Kaushik and Lipton, 2020)
**ML Methods:** LogReg, CNN, BERT, XGBoost
**Automated Explanation Methods:** LIME, SHAP, in-built, saliency maps
**RQ1:** Does training on counterfactuals lead to more explainable methods? [explainable = ‘better’ and more ‘meaningful’ explanations]
**Experiment:**
- Define and operationalize the quality of an explanation. Some options from past literature: consistency, overlap with human explanations, faithfulness ([Atanasova 2020](https://www.aclweb.org/anthology/2020.emnlp-main.263.pdf))
- Get human rationales for the sentiment and sexism data [example task here]
- Compare the explanations from models trained on counterfactuals with those from models trained only on original data (a minimal sketch of the rationale-overlap comparison follows this list).
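A hedged sketch of the comparison for the overlap-with-human-rationales criterion, assuming two trained pipelines (`model_cf`, trained with counterfactuals; `model_orig`, original data only) that expose `predict_proba` over raw text, plus one test `tweet` and its `human_rationale` token set (all placeholder names):

```python
# Sketch: score explanations by token overlap with a human rationale, then
# compare the counterfactually trained model against the original-only model.
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["non-sexist", "sexist"])

def lime_top_tokens(model, text, k=5):
    """Top-k tokens by absolute LIME weight for this model and text."""
    exp = explainer.explain_instance(text, model.predict_proba, num_features=k)
    return {token for token, _ in exp.as_list()}

def rationale_overlap(model, text, rationale, k=5):
    """Jaccard overlap between LIME's top tokens and the human rationale."""
    top = lime_top_tokens(model, text, k)
    return len(top & rationale) / len(top | rationale)

print("with counterfactuals:", rationale_overlap(model_cf, tweet, human_rationale))
print("original only:       ", rationale_overlap(model_orig, tweet, human_rationale))
```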
**RQ2:** How can we generate explanations from counterfactuals?
We start with the concept of a ‘Counterfactual explanation’ (CE) for NLP, where a CE consists of the original tokens changed in generating the counterfactual data point. For example, for the original tweet, “A lot of females try to think like a man to avoid getting hurt/played instead of just being a woman. Take a chance ,love again.”, and its counterfactual, “A lot of people try to think like a cheater to avoid getting hurt/played instead of just being a partner. Take a chance ,love again.”, the counterfactual explanation (CE) for the original is: [females, man, woman]
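A minimal sketch of extracting a CE as the original-side tokens in a token-level diff between the original and its counterfactual (simple regex tokenization is an assumption):

```python
# Sketch of CE extraction via a token-level diff.
import difflib
import re

def counterfactual_explanation(original, counterfactual):
    """Return the original tokens replaced or deleted in the counterfactual."""
    orig_tokens = re.findall(r"\w+", original.lower())
    cf_tokens = re.findall(r"\w+", counterfactual.lower())
    opcodes = difflib.SequenceMatcher(None, orig_tokens, cf_tokens).get_opcodes()
    return [tok for op, i1, i2, _, _ in opcodes if op in ("replace", "delete")
            for tok in orig_tokens[i1:i2]]

original = ("A lot of females try to think like a man to avoid getting "
            "hurt/played instead of just being a woman. Take a chance ,love again.")
counterfactual = ("A lot of people try to think like a cheater to avoid getting "
                  "hurt/played instead of just being a partner. Take a chance ,love again.")
print(counterfactual_explanation(original, counterfactual))  # ['females', 'man', 'woman']
```

The resulting token sets can then be compared against human rationales or the top-k tokens from LIME/SHAP with a simple overlap measure (e.g., Jaccard, as in the sketch above).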
**Experiment:**
Compare the counterfactual explanations with other explanation methods: Human explanations and automated explanation methods (LIME, SHAP, etc).
**RQ3:** How effective are automated explanation generation methods?
Here, we propose to connect the literature on generating explanatory spans (for example, [Toxic spans](https://sites.google.com/view/toxicspans)) with automated explanation methods like LIME and SHAP.
**Experiment:**
- Train a method on the CEs and automatically generate explanatory spans for held-out data; call these automated counterfactual explanations (ACE).
- Score the ACEs on the held-out data (using Mean Average Precision, as in the Toxic Spans task); a hedged per-example scoring sketch follows this list.
- Compare the performance of automated explanation methods against ACE on the same metric.
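One concrete way to score ACEs per example, sketched below as precision/recall/F1 over predicted vs. gold token indices; MAP or the exact Toxic Spans metric could be substituted. `predictions` and `gold` are assumed lists of index sets, one per held-out example:

```python
# Sketch of per-example span scoring for ACEs.
def span_scores(pred_indices, gold_indices):
    """Precision, recall, F1 for one example, given sets of token indices."""
    pred, gold = set(pred_indices), set(gold_indices)
    if not pred and not gold:          # nothing to highlight, nothing predicted
        return 1.0, 1.0, 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

macro_f1 = sum(span_scores(p, g)[2] for p, g in zip(predictions, gold)) / len(gold)
print(f"ACE span F1 on held-out data: {macro_f1:.3f}")
```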
----------------------------------------------------------------------------------------
3. Do explanations help machine learning model end-users?
Past research has explored whether explanations help end-users detect toxic Wikipedia comments, finding feature-based explanations ineffective ([Carton et al, 2020](https://ojs.aaai.org/index.php/ICWSM/article/view/7282/7136)); there is also research on the (lack of) effectiveness of explanations in deception detection ([Lai and Tan, 2019](https://arxiv.org/pdf/1811.07901.pdf)). The effect of counterfactual explanations and example-based explanations on end-users has yet to be studied, and explanations have so far only been studied for the toxicity and deception detection use cases. We propose to expand both the types of explanation methods and the use cases.
**Two types of users:** model developers and model end-users
**Data:** Sentiment and Sexism
**ML Methods:** LogReg, CNN, BERT, XGBoost
**Explanation Methods:** LIME, SHAP, saliency, in-built, counterfactual explanations, example-based explanations. For the latter, the explanation consists of the training data points most responsible for an ML model’s classification of a test data point (see [Data Influence](https://arxiv.org/pdf/2005.06676.pdf)).
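A rough sketch of example-based explanations using nearest training neighbors in sentence-embedding space; this is only a similarity proxy, since the Data Influence line of work attributes predictions to training points via influence estimates rather than raw similarity. `train_texts` and `train_labels` are assumed placeholders:

```python
# Sketch: retrieve the training examples most similar to a test point.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
train_emb = encoder.encode(train_texts, convert_to_tensor=True)

def example_explanation(test_text, k=3):
    """Return the k training texts (with labels) closest to the test text."""
    query = encoder.encode(test_text, convert_to_tensor=True)
    hits = util.semantic_search(query, train_emb, top_k=k)[0]
    return [(train_texts[h["corpus_id"]], train_labels[h["corpus_id"]]) for h in hits]
```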
**RQ1:** Can explanations improve end user decision making?
- Do explanations reduce time spent labeling?
- Do explanations improve annotator agreement?
- Do explanations improve annotations for harder or more subtle cases?
**RQ2:** Which types of explanations are more helpful?
Compare feature-based explanations (LIME, SHAP, etc.), example-based explanations (showing the training instances most responsible for the classification), and counterfactual explanations in how well they help end-users.
**Experiments for both RQs:** Adopt a between-subjects design and provide different experimental settings to different groups of annotators. The base setup / instructions would look something like this:
> Tweet: “RT @BMKT8 Call me sexist or whatever you wish, but for some reason i really hate when girls talk about football.”
>
> An automated method has labeled it as SEXIST
>
> What do you think is the label of this tweet?
> Sexist
> Non-sexist
> Can’t say
>
**Group 1:** No explanation
**Setup:** Same as above
**Group 2:** Feature-based explanation (highlighted words from LIME, SHAP, in-built)
**Setup:**
> Tweet: “RT @BMKT8 Call me sexist or whatever you wish, but for some reason i really hate when girls talk about football.”
>
> An automated method has labeled it as SEXIST based on the presence of highlighted terms.
>
> What do you think is the label of this tweet?
> Sexist
> Non-sexist
> Can’t say
>
**Group 3:** Counterfactual explanation (highlighted words based on the tokens changed in generating the counterfactual)
**Setup:** Same as above but with the highlighted terms coming from CEs
**Group 4:** Example-based explanation
**Setup:**
> Tweet: “RT @BMKT8 Call me sexist or whatever you wish, but for some reason i really hate when girls talk about football.”
>
> An automated method has labeled it as SEXIST. Here are a few tweets that are related to it and their labels:
>
> “Your approval is so worthless you should pay people to take it @stiles_ben I do not approve of female football presenters” [SEXIST]
>
> “RT @iamomarkhalifa I'm not sexist but I don't give a fuck about women's football #BallonDOr2014” [SEXIST]
>
> What do you think is the label of this tweet?
> Sexist
> Non-sexist
> Can’t say
>
Based on the results of the four groups, compare (a minimal analysis sketch follows this list):
- Which group has higher accuracy in labeling sexism / sentiment
- Which group performs the task faster
- Which group achieves better inter-annotator agreement
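A minimal sketch of this analysis, assuming the study export is a dataframe `df` with one row per annotation and placeholder columns `group`, `annotator_id`, `item_id`, `label`, `gold_label`, and `seconds_spent`; Krippendorff’s alpha would be the fuller agreement measure:

```python
# Sketch of the between-subjects analysis across the four groups.
import pandas as pd
from scipy.stats import chi2_contingency

df["correct"] = df["label"] == df["gold_label"]

# Accuracy and median time per experimental group.
print(df.groupby("group").agg(accuracy=("correct", "mean"),
                              median_seconds=("seconds_spent", "median")))

# Does labeling accuracy differ significantly across the four groups?
table = pd.crosstab(df["group"], df["correct"])
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}")

# Simple per-item agreement proxy: share of items where all annotators in a
# group gave the same label.
unanimity = (df.groupby(["group", "item_id"])["label"]
               .nunique().eq(1).groupby("group").mean())
print(unanimity)
```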
----------------------------------------------------------------------------------------
4. Others / Miscellaneous
- Active learning and counterfactuals
- Having a sexism detection method that works off-the-shelf / out of the box
- Get more annotations in different domains
- Connect sexism detection with the literature on gender bias
  - For example, the linguistic features found in Unsupervised Discovery of Gender Bias, especially for fine-grained sexism
- Targeted sexism detection: measuring sexism towards politicians, which would also entail adapting the codebook
  - Get more annotations for the politicians’ data
- The unit of analysis for sexism: a tweet about a female politician’s appearance is not sexist by itself, but if she receives a disproportionate number of comments about her appearance compared to a male politician, that could be an indication of sexism, a pattern post-level methods would miss (a minimal sketch of this aggregate comparison follows the list)
- The role of context in sexism: the identity of the speaker and the identity of the person being referred to
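A hedged sketch of the aggregate (politician-level) comparison mentioned above, assuming a dataframe `comments` with placeholder columns `politician_gender` (two categories) and a boolean `is_appearance_comment`:

```python
# Sketch: compare the share of appearance-related comments received by female
# vs. male politicians with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

counts = comments.groupby("politician_gender")["is_appearance_comment"].agg(["sum", "count"])
stat, p = proportions_ztest(count=counts["sum"].values, nobs=counts["count"].values)
print(counts["sum"] / counts["count"])   # appearance-comment rate per gender
print(f"z={stat:.2f}, p={p:.4f}")
```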