# Spotter tool
**Compute access**
- via ssh to a machine in Swansea
- ssh requires access to toran (querying with Swansea IT)
- brush up on your vim skills or use the VS Code SSH extension https://code.visualstudio.com/docs/remote/ssh
---
**Code base**
- access via https://cs-dragon-s.swansea.ac.uk/git/user/login
- login creds shared with you
- ML repo we will be working on [spotter-developers/model-server](https://cs-dragon-s.swansea.ac.uk/git/spotter-developers/model-server)
- 4 branches, with `refactored` the only one updated in the last year
- Managing dependencies with conda environments (environment.yml files)
- Model params are passed via .yml files, e.g. experiments/config.yml (see the sketch after this list)
- **Question**: how did they arrive at the parameter values, e.g. dropout 0.2? Experimentation?
- They have built an API to get predictions from the model (hence server)
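
A quick sketch of how a config like experiments/config.yml might be read, just to make the question above concrete. The key names (dropout, hidden_size, lr) are my guesses for illustration, not checked against the actual file:

```python
# Hypothetical example of reading experiments/config.yml - the key names
# below are assumptions for illustration, not verified against the repo.
import yaml

with open("experiments/config.yml") as f:
    config = yaml.safe_load(f)

# e.g. hyperparameters that would feed into the LightningModule
dropout = config.get("dropout", 0.2)        # the 0.2 value we're querying
hidden_size = config.get("hidden_size", 128)
learning_rate = config.get("lr", 1e-3)
print(dropout, hidden_size, learning_rate)
```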
---
**Data**
- Training data is stored in NextCloud
- `Data/Data with tactics/`
---
**Model details**
- **spaCy** and **PyTorch Lightning** are the weapons of choice, currently going into an LSTM
---
**Ideas for improvement**
The main issue is that the dataset is small. Recommending more data would take resource from annotators, so it's not really an option at this stage. So first, some ideas that could work with the current PyTorch Lightning LSTM setup (in order of priority):
- Data Augmentation. Cheap to do and can slot in after all the data processing and word embedding has been done (no need to propagate changes all the way through the existing codebase). See the first sketch after this list.
- Semi-supervised / Pseudo-labelling. Train an initial model, make predictions, and treat instances with high confidence as gold standard, then retrain with these new labels. (We may end up biasing away from edge cases though.) See the second sketch after this list.
- Transfer Learning. It looks like they are training the model from scratch. They are using two datasets: one public, from which they learn word embeddings, and a smaller dataset from the police that they fine-tune on. We could use a pre-trained model, freeze the head, and fine-tune with the public dataset they have. We can have a variable learning rate as we go through the network to pay more attention to the fine-tuning layers.
- Active Learning (would require some annotator resource, and they don't have a UI feature for this like in Markup)
- [Flash and BaaL](https://devblog.pytorchlightning.ai/active-learning-made-simple-using-flash-and-baal-2216df6f872c)
- [Lightning Trainer module](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)
- Multi-task Learning - use similar tasks and train multiple models simultaneously that share information and improve performance
- https://arxiv.org/abs/2109.09138
- https://github.com/Lightning-AI/lightning/pull/1959
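
Rough sketch of what the data-augmentation idea could look like if we apply it at the embedding level, so nothing upstream has to change. The tensor shape (batch, seq_len, embed_dim) and the noise/dropout values are assumptions, not the repo's actual setup:

```python
# Sketch only: embedding-level augmentation (Gaussian noise + random token
# dropout). Assumes inputs shaped (batch, seq_len, embed_dim); noise_std and
# drop_prob are illustrative values, not tuned for this project.
import torch

def augment_embeddings(x: torch.Tensor, noise_std: float = 0.01,
                       drop_prob: float = 0.1) -> torch.Tensor:
    """Add small Gaussian noise and randomly zero out whole tokens."""
    noisy = x + noise_std * torch.randn_like(x)
    keep = (torch.rand(x.shape[:2], device=x.device) > drop_prob).unsqueeze(-1)
    return noisy * keep

# Inside training_step we would only augment training batches, e.g.:
# x = augment_embeddings(x) if self.training else x
```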
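And a sketch of the pseudo-labelling loop; `model`, the loader and the 0.95 threshold are placeholders, not the project's actual names or settings:

```python
# Sketch only: keep high-confidence predictions on unlabelled data as
# pseudo-labels, then retrain. Names and the threshold are placeholders.
import torch

@torch.no_grad()
def pseudo_label(model, unlabelled_loader, threshold: float = 0.95):
    model.eval()
    kept_x, kept_y = [], []
    for x in unlabelled_loader:                 # batches of unlabelled inputs
        probs = torch.softmax(model(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        mask = conf >= threshold                # only very confident predictions
        kept_x.append(x[mask])
        kept_y.append(pred[mask])
    return torch.cat(kept_x), torch.cat(kept_y)

# These pairs get appended to the labelled set before retraining - note the
# caveat above about biasing away from edge cases.
```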
**Other Models**
- **Random Forest / XGBoost** - is Deep Learning overkill? We lack an ontology like in the prescriptions problem that might have enabled the use of RF, so this could be more difficult. But maybe with Active Learning and some feature engineering it's worth a go (I've had some success with feature compression via GloVe embeddings + k-means -> Random Forest to detect toxic comments: https://www.kaggle.com/code/arronlacey/random-forest-w-glove-kmeans-compression). The general idea is to create "rectangular" data rather than tensors by learning word embeddings and using k-means to cluster words / n-tokens into 100 to 1,000 clusters that are used as features. Sorry that the notebook is in R! A Python sketch of the idea follows this list.
- **Transformers**:
- [XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet) has been identified by the DRAGON-S project via an external consultant, but they couldn't commit the time.
- [SetFit](https://huggingface.co/blog/setfit) - a few-shot model
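
Since the Kaggle notebook is in R, here's a rough Python version of the feature-compression idea; the GloVe path, cluster count and classifier settings are placeholders:

```python
# Sketch only: GloVe embeddings -> k-means clusters -> bag-of-clusters
# features -> Random Forest. Paths and sizes are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def load_glove(path):
    """Read a GloVe text file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.array(vals, dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.100d.txt")                    # placeholder path
words = list(glove)
kmeans = KMeans(n_clusters=500, n_init=10).fit(np.stack([glove[w] for w in words]))
word2cluster = dict(zip(words, kmeans.labels_))

def featurise(tokens, n_clusters=500):
    """Bag-of-clusters: one count per cluster -> a 'rectangular' row."""
    row = np.zeros(n_clusters)
    for t in tokens:
        if t in word2cluster:
            row[word2cluster[t]] += 1
    return row

# X = np.stack([featurise(tokens) for tokens in tokenised_conversations])
# clf = RandomForestClassifier(n_estimators=300).fit(X, y)
```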
---
**Notes from onboarding meeting**
- model-server/refactored is the clean, core branch, but does not have all functionality
- refactorisation repo is clean, but only basic functionality
- experiment folder contains different configuration files
- experiment/scripts/ contains metrics and outputs
- main.py has functionality for different options
- training and testing
- testing
- data loader
- checkpoints
- data loader (shouldn't need to be changed): try to optimize how much data can be stored in memory
- sorts by convo length and creates batches of similar length (see the first sketch after this list)
- would like to update the word embedding method/loading.
- embedding_creation repo: mix of old and new code
- launguage_model.py
- YAML configs to define model params, etc.
- Still not clear how they arrived at the params. Random values?
- predict_tactic param: will produce a binary classification when false, or predict against 13 tactics otherwise (see the second sketch after this list)
- JSONL data used for both datasets
- Have a file loader that will read in batches to prevent memory overloading
- Uses checkpoints
- PyTorch Lightning used for modelling (+ they only have an RNN)
- Police (for spotter) are only requesting classifications of grooming/non-grooming. Adding tactics enables quicker verification/validation of the NN
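
To make the data-loader notes above concrete, a rough sketch of lazy JSONL reading plus length-bucketed batching; the "messages" field name and batch size are guesses, not the repo's actual schema:

```python
# Sketch only: read the JSONL lazily and batch conversations of similar
# length to minimise padding. The "messages" field and batch size are
# assumptions; the real loader also limits how much is held in memory.
import json

def read_jsonl(path):
    """Yield one conversation dict per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def length_bucketed_batches(conversations, batch_size=32):
    """Sort by conversation length so each batch has similar-length convos."""
    convs = sorted(conversations, key=lambda c: len(c["messages"]))
    for i in range(0, len(convs), batch_size):
        yield convs[i:i + batch_size]

# for batch in length_bucketed_batches(read_jsonl("train.jsonl")):
#     ...
```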
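And a guess at how the predict_tactic switch presumably plays out in the output head; layer names and sizes are assumptions, not the repo's code:

```python
# Sketch only: binary grooming/non-grooming head when predict_tactic is
# False, 13-way tactic head otherwise. Names/sizes are assumptions.
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, hidden_size: int = 256, predict_tactic: bool = False):
        super().__init__()
        n_classes = 13 if predict_tactic else 2   # 13 tactics vs binary
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, encoded):
        return self.out(encoded)                  # logits for cross-entropy
```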
---
**Tactics**
- Good tactics classification helps increase overall accuracy
---
**Embeddings**
- Need to work out what they are doing here - I've not come across the method they are using
- Three embeddings:
- for anonymized (LE data)
- not anonymized (perverted justice dataset)
- partially?
- Which embedding to use can be specified in the config files
- Existing embedding only handled two convo participants, new embedding can handle multiple participants
- All data is used to generate embeddings
---
**Data descriptions**
- Data is imbalanced (mostly negative)
- LE data smaller than PJ data
- Tried multiple configurations, e.g. artificially balanced vs naturally imbalanced; the LE ratio is different to the PJ ratio
- Police data is labelled, but many inconsistencies between annotators
- Each conversation has a conversation_id, and each fold has a list of conversation ids
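
Since folds are lists of conversation ids, something like scikit-learn's GroupKFold reproduces that kind of split; the toy data below is illustrative only:

```python
# Sketch only: split by conversation_id so all rows from one conversation
# stay in the same fold, mirroring the per-fold id lists. Toy data below.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
conversation_ids = np.repeat(np.arange(20), 5)       # 20 conversations, 5 rows each
X = rng.normal(size=(len(conversation_ids), 8))      # dummy features
y = rng.integers(0, 2, size=len(conversation_ids))   # dummy labels

for fold, (_, test_idx) in enumerate(GroupKFold(n_splits=5).split(X, y, conversation_ids)):
    held_out = np.unique(conversation_ids[test_idx])
    print(f"fold {fold}: held-out conversation_ids {held_out}")
```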
---
**Questions**
- Which repo/branch shall we work from (by priority)?
- embedding_creation
- refactorisation
- Do we have access to add datasets to NextCloud?
- Link to public dataset
- https://ceur-ws.org/Vol-1178/CLEF2012wn-PAN-InchesEt2012.pdf
- May need to request access
- Priorities for closing out the project
- embeddings
- mapping new tactics detection to old tactics
- current accuracy? approaches to improve?
- are annotations in n-token format?
- do you need help with documentation?