# Labeling Tool Data Cleaning / Improvement Backend Design

## Life Cycle of Data Examples

- inject unlabeled examples into the `candidates` collection
- inject labeled examples into the `training` collection
- start a battle: move examples from `prediction` to `to_label`
- training:
  + move labeled examples from `to_label` to `training`
  + [new] we don't really need separate `to_label` and `training` collections; maybe just rename `to_label` to `training` and retire the existing `training` collection
  + move unlabeled examples from `candidates` to `prediction`, together with their predictions
  + [new] run prediction on the examples in the `training` collection, and copy labeled examples whose prediction is not consistent with their label to `conflicts`
  + [new] update the predictions for examples in the `failures` collection
- [new] data cleaning:
  + acknowledge an example as a failure
    - copy it to `failures`
    - tag it as `is_acknowledged`
  + correct the labels so that they are consistent with the prediction
    - change the example in the `to_label` collection
    - remove it from the `conflicts` collection
  + delete the example (TBD)

Max's concern:

- prediction: mug; label: computer
  + in data cleaning, we acknowledge it as a failure
  + after another round of training, the prediction becomes laptop, which is more precise and more correct than computer, but the example is already in `failures`. What shall we do?
- do we need a `failures` collection at all?
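The training-time conflict check and the two data-cleaning outcomes from the life cycle above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the collections are modeled as plain in-memory dicts keyed by example ID, each example carries a single `label` string instead of per-category labels, and the function names `find_conflicts`, `acknowledge_failure`, and `correct_label` are hypothetical, not part of the existing backend.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a collection: example_id -> document.
Collection = Dict[str, dict]


def find_conflicts(to_label: Collection, predictions: Dict[str, str],
                   failures: Collection, conflicts: Collection) -> List[str]:
    """Copy labeled examples whose prediction disagrees with the label to conflicts."""
    added = []
    for example_id, doc in to_label.items():
        predicted = predictions.get(example_id)
        if predicted is None or predicted == doc["label"]:
            continue  # no prediction yet, or label and prediction agree
        # Examples already present in failures are tagged as acknowledged.
        conflicts[example_id] = dict(
            doc, prediction=predicted, is_acknowledged=example_id in failures
        )
        added.append(example_id)
    return added


def acknowledge_failure(example_id: str, conflicts: Collection,
                        failures: Collection) -> None:
    """Data cleaning, outcome 1: tag the conflict and copy it to failures."""
    doc = conflicts[example_id]
    doc["is_acknowledged"] = True
    failures[example_id] = dict(doc)


def correct_label(example_id: str, new_label: str, to_label: Collection,
                  conflicts: Collection) -> None:
    """Data cleaning, outcome 2: fix the label and clear the resolved conflict."""
    to_label[example_id]["label"] = new_label
    conflict = conflicts.get(example_id)
    if conflict is not None and conflict["prediction"] == new_label:
        del conflicts[example_id]
```

With Max's example (prediction `mug`, label `computer`), `find_conflicts` would place the example in `conflicts`. Calling `acknowledge_failure` copies it to `failures`, while calling `correct_label` with `mug` updates `to_label` and removes the conflict. The sketch does not resolve the open question of what to do when a later model changes its prediction for an acknowledged failure.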
## Data Schemas

### conflicts

- same schema as the `to_label` collection
- has an `is_acknowledged` tag to indicate whether the conflict has been acknowledged

### failures

- we keep track of the prediction history so we can see the progress

```
{
  "example_id": "This product is very good for dry skin",
  "sentence": "This product is very good for dry skin",
  "created_at": datetime,
  "images": [{  # optional
    "url": _IMAGE_URL_STR_
  }],
  "prediction_history": [
    {
      "predictions": [{
        "category": "GOOD FOR DRY SKIN",
        "is_position": true,
        "score": 0.98
      }],
      "model_id": "model_20201015_181022",
      "predicted_at": datetime
    }
  ],
  "manual_labels": [
    {
      "editor_id": "mohamed@adeptmind.ai",
      "timestamp": "2020-10-20 21:45:33",
      "labels": [{
        "category": "GOOD FOR DRY SKIN",
        "label": 1
      }, {
        "category": "GOOD FOR SENSITIVE SKIN",
        "label": 0
      }]
    }
  ]
}
```

## Tasks

### conflicts

- enhance the training process: run prediction on the `training` collection (checking the `failures` collection to add the `is_acknowledged` tag)
- `/conflicts/summary` endpoint
- `/conflicts/example` endpoint
- update `/submit` for data cleaning

### failures

- training process: run prediction on the `failures` collection
- `/failed/summary` endpoint
- `/start_battle_from_failed_examples` endpoint
- update `/submit` for data labeling
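The task of running prediction on the `failures` collection amounts to appending one entry to each document's `prediction_history` per trained model. A minimal sketch of that update, and of reading the history back to see progress for one category, might look like this. It assumes the documents are plain dicts shaped like the `failures` schema, with the history key spelled `predictions`; the helper names `append_prediction` and `score_history` are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def append_prediction(failure_doc: dict, model_id: str, predictions: List[dict],
                      predicted_at: Optional[datetime] = None) -> dict:
    """Append one model's predictions to the document's prediction_history."""
    entry = {
        "predictions": predictions,
        "model_id": model_id,
        # Default to the current UTC time when no timestamp is supplied.
        "predicted_at": predicted_at or datetime.now(timezone.utc),
    }
    failure_doc.setdefault("prediction_history", []).append(entry)
    return entry


def score_history(failure_doc: dict, category: str) -> List[Tuple[str, float]]:
    """Return (model_id, score) pairs for one category, oldest entry first."""
    history = []
    for entry in failure_doc.get("prediction_history", []):
        for pred in entry["predictions"]:
            if pred["category"] == category:
                history.append((entry["model_id"], pred["score"]))
    return history
```

Because entries are only ever appended, the list order doubles as the training order, so a summary endpoint could compare the first and last scores per category without sorting. Whether `/failed/summary` should report this is a design choice the document leaves open.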