# NLP-Powered Gitcoin Grant Reviews
*TL;DR: Language model go brrr and gets 57.5% accuracy (not great compared to our baseline model's 64.4%); this post explains why, lists improvements, and looks at the exciting future of AI at Gitcoin.*

[Gitcoin grants](https://gitcoin.co/grants) is an application dedicated to funding public goods. Needless to say, when there's money on the line, you're bound to see freeloaders. They usually come in the form of [sybil accounts](https://en.wikipedia.org/wiki/Sybil_attack) (looking to steal funds from the [matching pool](https://support.newsmatch.org/article/564-what-is-a-matching-pool-or-matching-fund) by falsifying votes), vaporware grants looking to manipulate people into donating to them, and combinations thereof. This document focuses on the latter, detecting manipulative grants.
Currently, we have the [Grants Intelligence Agency (GIA)](https://gitcoin.notion.site/Grants-Intelligence-Agency-13445a7023bd481789ca1e314c145391), a group of humans who review as many grants by hand as they can. Hundreds of new grants usually flow in per day, which is overwhelming, and the problem only gets worse at scale.
This is where [Natural Language Processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) comes in! With it, we can leverage grants that were previously reviewed by humans to learn the patterns these freeloaders use.
## The Data
**Note: No personally identifying information was exposed in this dataset.**

For preliminary experiments, we gathered a small dataset from the Gitcoin database. It contains ~5,000 grants (aka samples) along with their title, description, and whether or not they were approved. 45% of these samples were approved; the other 55% were unapproved.
The data was split into 2 subsets:
1. **Train** (~4,000 samples, ~45% positive). The model learns from this data.
2. **Evaluation** (~1,000 samples, ~45% positive). We used this data to evaluate how well the model might perform in the real world (simulating never-before-seen data).
Here's the code used for splitting the main dataset into the train/test subsets:
```python=
from math import isclose

import pandas as pd


def get_balanced_train_test_split(path: str, split=0.8):
    """Preprocesses the dataset and splits it into train and test sets.

    The train dataframe has the length of the split of the original dataframe.
    The test dataframe has the remainder.
    """
    assert 0 < split < 1, "Split must be between 0 and 1."

    # NOTE: this function simply renames some columns in the original dataframe.
    df = _preprocess_grants_dataframe(path)

    # get the rows corresponding to the two classes
    approved_grants = df[df["label"] == 1]
    unapproved_grants = df[df["label"] == 0]
    num_samples = len(df)

    # some sanity checks (the counts are specific to this snapshot of the dataset)
    assert approved_grants.label.sum() == 2517
    assert unapproved_grants.label.sum() == 0
    assert num_samples == 5557
    assert len(approved_grants) + len(unapproved_grants) == num_samples

    # make sure the train and test data are split into the same proportion of classes
    train_num_approved_samples = int(split * len(approved_grants))
    train_num_unapproved_samples = int(split * len(unapproved_grants))
    train_df = pd.concat([
        approved_grants[:train_num_approved_samples],
        unapproved_grants[:train_num_unapproved_samples],
    ])
    test_df = pd.concat([
        approved_grants[train_num_approved_samples:],
        unapproved_grants[train_num_unapproved_samples:],
    ])

    # more sanity checks
    train_percent_approved = train_df.label.sum() / len(train_df)
    test_percent_approved = test_df.label.sum() / len(test_df)
    assert len(train_df) + len(test_df) == len(df)
    assert isclose(train_percent_approved, test_percent_approved, rel_tol=0.01), "Train and test split is not well balanced."
    assert isclose(train_percent_approved, 0.45, rel_tol=0.05), "Split should be around 45% approved (55% unapproved)."

    return train_df, test_df
```
## The Model
*TL;DR: I initialized a pretrained transformer model called [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) and [fine-tuned/transfer-learned](https://deeplizard.com/learn/video/5T-iXNNiwIs) it on the train subset*
**This section gets a bit technical, feel free to skip it.**
It's commonly understood in NLP that pre-training a large language model on huge amounts of general text BEFORE training on your actual dataset performs significantly better than training from scratch. This pretrain-then-fine-tune recipe is one of the core pillars of the BERT family of models, DistilBERT included.
Architecturally, DistilBERT is just a smaller version of BERT (~40% fewer parameters). However, the biggest difference between them is in the way they were trained:
- BERT is a pretrained transformer model: the researchers trained it on huge unlabeled text corpora with self-supervised objectives, and after fine-tuning it achieved state of the art (SOTA) results across many NLP benchmarks.
- DistilBERT is also a pretrained transformer model; however, rather than being trained only on the raw text, *it was trained to match the outputs of the BERT model*.
- This technique is known as ["Knowledge **Distill**ation"](https://arxiv.org/abs/1503.02531). At its core, it's a model compression algorithm that seeks to pack as much performance into as small a model as possible. It's brilliant! The performance speaks for itself (a rough sketch of the idea follows this list).
- Thanks to knowledge distillation, DistilBERT, despite being significantly smaller, has nearly the same performance as BERT.
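To make that concrete, here's a minimal sketch of a distillation loss (not the actual DistilBERT training code; the temperature and loss weighting below are illustrative): the student is trained to match the teacher's softened output distribution while still learning from the hard labels.
```python=
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # soften both output distributions with a temperature, then match them via KL divergence
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # keep the usual supervised cross-entropy on the hard labels as well
    supervised = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * supervised
```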
Now, let's build the model!
### First, Load the Data and Tokenize
* Tokenization is a technique that takes in raw text and converts it into sequences of integer token IDs, which are much easier for models to work with. Some tokenization techniques don't require any data to train; however, [WordPiece](https://towardsdatascience.com/wordpiece-subword-based-tokenization-algorithm-1fbd14394ed7) (the tokenizer DistilBERT uses) does need to be trained (thankfully it comes pre-trained).
* Padding ensures the input sequences are of uniform length, which makes processing more computationally efficient since every row in a batch matrix must have the same length. (A tiny illustration of both ideas is just below.)
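Here's a small standalone illustration of tokenization and padding (not part of the training pipeline; the example sentences are made up):
```python=
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# WordPiece splits rare words into smaller "##"-prefixed subword pieces
print(tok.tokenize("Gitcoin grants fund public goods"))
# padding brings every sequence in the batch to the same length
batch = tok(["a short grant", "a much longer grant description here"], padding=True)
print([len(ids) for ids in batch["input_ids"]])  # all lengths are equal
```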
The handy-dandy [Huggingface AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertTokenizer) abstracts most of this away, allowing us to pad and use the pretrained tokenizer. This makes for some pretty clean code:
```python=
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# call the function defined in the previous code block, then wrap the dataframes
# in Huggingface `Dataset` objects so we can `.map` over them
train_df, test_df = get_balanced_train_test_split("path/to/dataset.csv")
train_ds, test_ds = Dataset.from_pandas(train_df), Dataset.from_pandas(test_df)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["grant_title_and_description"], truncation=True)

tokenized_train_ds = train_ds.map(preprocess_function, batched=True)
tokenized_test_ds = test_ds.map(preprocess_function, batched=True)

# pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
### Next, Instantiate the Model
```python=
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # binary classification (approve/disapprove)
)
```
### Finally, Train!
```python=
import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments

# define the hyperparameters:
training_args = TrainingArguments(
    output_dir="./nlp_outputs",
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=50,  # best accuracy after training epoch 2
    weight_decay=0.01,
    save_strategy="epoch",
    report_to="wandb",  # weights and biases is elite (generates the visualizations https://wandb.ai/)
    evaluation_strategy="epoch",  # evaluation should be performed each epoch
)

# accuracy is more human readable than loss for evaluation
accuracy_metric = evaluate.load("accuracy")  # accuracy metric from the Huggingface `evaluate` library

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
```
Now, we can begin training :)
```python=
%%wandb
# (jupyter notebooks only) the %%wandb cell magic renders in-line wandb visualizations!
trainer.train()
```
### Training Results
Every epoch, I evaluated how well the model performed on the evaluation dataset (the data it wasn't trained on). This approximates how well the model would do if it were used for real-world grant reviews.
I spent a decent amount of time trying to tune the [hyperparameters (HP)](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac) such that we could get smooth train/eval [cross-entropy loss](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) curves. The band for useful HPs was pretty narrow, which is to be expected with only 5,000 samples. Here is what the smoothest loss looked like:

As you can see, the train loss drops steeply, while the eval loss follows at first but diverges pretty quickly (around epoch 5): it keeps decreasing slowly (epochs 5-20), then plateaus and gradually increases (epochs 20+). This is important because it shows the model overfitting.
This was the most stable training session I was able to muster (pictured above), and interestingly it didn't get the best accuracy. The eval accuracy of that particular run was only 53.7%.
The **best** eval accuracy was 57.5%, and its eval loss curve was monotonically increasing:

The accuracy peaked after only 2 epochs of training, which is consistent with the community expectation that fine-tuning DistilBERT only takes 1-3 epochs.
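One practical tweak this suggests (a sketch only; this isn't the configuration used for the runs above) is to have the `Trainer` keep the best checkpoint and stop early once eval accuracy stops improving:
```python=
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="./nlp_outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # reload the checkpoint with the best eval metric
    metric_for_best_model="accuracy",  # the metric name returned by compute_metrics
    num_train_epochs=50,
)
# then pass callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] to the Trainer
```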
Here is a composite plot showing 9 of the best training runs' evaluation accuracy metrics over time:

It's pretty volatile! The solution? More data!
### Comparison to Baseline
Previously, [Eric Hare](https://github.com/erichare) set out to perform the same classification task using more traditional ML methods: an [ensemble](https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/) of [XGBoost](https://xgboost.readthedocs.io/en/stable/) models (combining many models and averaging their predictions). The performance he achieved was significantly better than mine: 64.4% accuracy compared to my 57.5%.
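For intuition, here's a rough sketch of what a "traditional" text classifier of this kind can look like (to be clear, this is *not* Eric's actual pipeline; the TF-IDF features and hyperparameters below are just placeholders):
```python=
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# bag-of-words style features instead of learned embeddings
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df["grant_title_and_description"])
X_test = vectorizer.transform(test_df["grant_title_and_description"])

clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, train_df["label"])
print(accuracy_score(test_df["label"], clf.predict(X_test)))
```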
The main explanation for this large discrepancy is the amount of data we have. Language models are notorious for needing significantly more data than traditional modeling techniques, after which they become state of the art. We'll need a tad more than ~4,000 training samples.
With more time playing around with the hyperparameters, I could probably close this performance gap; however, we decided it would be better to hold off and see what the community has to say before spending more resources here.
So, moving forward, here are some things to consider:
## The Future!!!

### More Data
A dataset with 5,000 samples (only ~4,000 of which are used for training) is usually not enough for an NLP model to learn new insights. And since sequence classification models are generally fairly complex, their likelihood to [overfit](https://en.wikipedia.org/wiki/Overfitting) is high (which is exactly what happened here).
**Solution**
* Gather more data! The GIA is a faucet of data to train on; they're pumping out reviews like there's no tomorrow.
### Grant Captioning!
This goes hand in hand with the More Data section. Instead of the model only saying "YES" or "NO" on whether a grant should be approved, maybe it could also generate a "caption": text explaining *why* the grant should or should not be approved.
Think about it -- this is exactly what human grant reviewers currently do!
**Solution**
* Train a sequence-to-sequence NLP model that takes in the grant information and is trained against the human reviewer comments generated by the GIA (see the sketch below).
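Here's a rough sketch of the idea (the model choice and the `reviewer_comment` column are hypothetical; the GIA's comments would need to be collected into the dataset first):
```python=
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def to_features(example):
    # input: the grant text; target: the human reviewer's comment (hypothetical column)
    features = tokenizer(example["grant_title_and_description"], truncation=True)
    features["labels"] = tokenizer(example["reviewer_comment"], truncation=True)["input_ids"]
    return features

# e.g. captioning_ds = grants_with_comments_ds.map(to_features), then train with Seq2SeqTrainer
```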
### Knowledge Graphs
Representing the dataset as a knowledge graph can be helpful for finding correlations between potentially dud grants (even without AI).
For example, if a grant description has a link to an external website, add the link as a node in the graph and add an edge between the grant node and the link node; repeat this process for every grant and every link.
If a link has edges to 10 different grants, that means all of those grants reference that link. If all 10 of those grants individually seemed suspicious, but you weren't 100% certain, this correlation could be the nail in the coffin.
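Here's a hypothetical sketch of that construction (the data shapes and the `networkx` choice are assumptions, not existing Gitcoin code):
```python=
import re
import networkx as nx

URL_PATTERN = re.compile(r"https?://\S+")

def build_grant_link_graph(grants):
    """`grants` is an iterable of (grant_id, description) pairs."""
    graph = nx.Graph()
    for grant_id, description in grants:
        graph.add_node(("grant", grant_id))
        for url in URL_PATTERN.findall(description):
            # link nodes are shared across grants, so repeated URLs accumulate edges
            graph.add_node(("link", url))
            graph.add_edge(("grant", grant_id), ("link", url))
    return graph

def links_shared_by_many_grants(graph, min_grants=10):
    # a link node's degree is the number of grants referencing it
    return [n for n in graph.nodes if n[0] == "link" and graph.degree(n) >= min_grants]
```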
Additionally, we can follow the website links, download the HTML, and have that be part of the input to the NLP model. The GIA does this already: if you have a link in your description, it's fair game.
* We can architect a hybrid transformer-[graph neural network (GNN)](https://towardsdatascience.com/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3) model, giving more context to the text being processed.
* *Warning: [We should be careful](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) when introducing complexity to intelligence based applications.*
### Factual Analysis
How can we know if the claims a grant creator makes are actually true rather than taking them at face value?
**Solutions**
* A Human-In-The-Loop (HITL) setup that allows the GIA's human reviewers to participate in the AI's review process (maybe through question answering?).
* Maybe using Gitcoin's [dPopp](https://github.com/gitcoinco/dPopp) to cross-validate claims is possible (albeit a very hard technical challenge).
* Factual analysis compatible with a knowledge graph representation.
### Deploying to Production
Currently, the model I developed should **not** be used in production (to review real grants) lol. However, if we train a genuinely useful model in the future, it could be used to screen out easy approvals/disapprovals when its certainty is sufficiently high. It should be noted that the model should not be an end-all-be-all, but rather serve as an extra tool in the GIA's toolbelt for making their decisions.
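As a sketch of what that screening could look like (the thresholds and triage labels are placeholders, not a production design), the model's softmax confidence could route only the clear-cut cases automatically:
```python=
import torch

def triage(model, tokenizer, grant_text, approve_threshold=0.95, reject_threshold=0.95):
    inputs = tokenizer(grant_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    if probs[1] >= approve_threshold:   # label 1 == approved in our dataset
        return "auto-approve"
    if probs[0] >= reject_threshold:    # label 0 == unapproved
        return "auto-reject"
    return "send to the GIA for human review"
```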
### Thank You!
Thanks for reading, follow me on [twonker](https://twitter.com/home) and [github](https://github.com/nollied).
### Relevant Papers for Further Reading
- [DistilBERT paper](https://arxiv.org/abs/1910.01108)
- [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)