# Labeling Tool Data Cleaning / Improvement Backend Design
## Life Cycle of Data Examples
- inject unlabeled examples into the `candidates` collection
- inject labeled examples into the `training` collection
- start a battle: move examples from `prediction` to `to_label`
- training:
  + move labeled examples from `to_label` to `training`
  + [new] we don't really need separate `to_label` and `training` collections; maybe just rename `to_label` to `training` and stop using the old `training` collection
  + move unlabeled examples from `candidates` to `prediction` with their predictions
  + [new] run prediction for examples in the `training` collection, and copy labeled examples whose prediction is not consistent with their labels to `conflicts`
  + [new] update predictions for examples in the `failures` collection
- [new] data cleaning:
  + acknowledge an example as a failure
    + copy to `failures`
    + tag as `is_acknowledged`
  + correct the labels so that they are consistent with the prediction
    + change the example in the `to_label` collection
    + remove from the `conflicts` collection
  + delete this example (TBD)
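
The two data cleaning actions above can be sketched as follows. This is a minimal illustration, not the actual backend: it assumes each collection is an in-memory dict keyed by `example_id` (the `db` argument is hypothetical), and it assumes a correction is appended as a new `manual_labels` entry rather than overwriting the old one.

```python
import copy


def acknowledge_failure(db, example_id):
    """Acknowledge a conflict as a genuine model failure.

    `db` is a hypothetical stand-in for the database:
    {collection_name: {example_id: example_dict}}.
    """
    example = db["conflicts"][example_id]
    example["is_acknowledged"] = True  # tag the conflict in place
    # Copy (not move) into `failures`, so the conflict entry stays visible.
    db["failures"][example_id] = copy.deepcopy(example)


def correct_labels(db, example_id, editor_id, labels, timestamp):
    """Store corrected labels and resolve the conflict.

    Assumption: corrections are appended to `manual_labels` so the
    latest entry is the one that counts.
    """
    entry = {"editor_id": editor_id, "timestamp": timestamp, "labels": labels}
    db["to_label"][example_id]["manual_labels"].append(entry)
    # The labels now agree with the prediction, so the conflict is gone.
    db["conflicts"].pop(example_id, None)
```

The delete action is left out because it is still marked TBD.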
Max's concern:
- prediction: mug; label: computer;
  + in data cleaning, we acknowledge it as a failure;
  + after another round of training, prediction becomes laptop, which is more precise and correct than computer, but it's already in failures - what shall we do?
- do we need a failure collection?
## Data Schemas
### conflicts
- same as schema in `to_label` collection
- has `is_acknowledged` tag to indicate if this is acknowledged
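
A rough sketch of how the training step might fill this collection. Several things here are assumptions, not part of the design: the `predict` callable, the `acknowledged_ids` set (ids already present in `failures`), storing the model output under a `predictions` key, and the consistency rule itself (a category labeled `1` must be predicted positive, a category labeled `0` must not be). Field names such as `is_position` follow the `failures` schema.

```python
def latest_labels(example):
    """Return {category: label} from the most recent manual_labels entry."""
    if not example.get("manual_labels"):
        return {}
    latest = example["manual_labels"][-1]
    return {item["category"]: item["label"] for item in latest["labels"]}


def is_conflict(example, predictions):
    """True if any labeled category disagrees with the model's prediction."""
    predicted_positive = {p["category"] for p in predictions if p["is_position"]}
    for category, label in latest_labels(example).items():
        if (label == 1) != (category in predicted_positive):
            return True
    return False


def find_conflicts(training_examples, predict, acknowledged_ids):
    """Build conflict records for labeled examples the model disagrees with.

    `predict` is a hypothetical callable: example -> list of predictions.
    `acknowledged_ids` holds example ids already stored in `failures`.
    """
    conflicts = []
    for example in training_examples:
        predictions = predict(example)
        if is_conflict(example, predictions):
            # Copy the example so the training record is left untouched.
            conflicts.append(dict(
                example,
                predictions=predictions,
                is_acknowledged=example["example_id"] in acknowledged_ids,
            ))
    return conflicts
```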
### failures
- we keep track of prediction history so we can see the progress of
```
{
    "example_id": "This product is very good for dry skin",
    "sentence": "This product is very good for dry skin",
    "created_at": datetime,
    "images": [{  # optional
        "url": _IMAGE_URL_STR_
    }],
    "prediction_history": [
        {
            "predictions": [{
                "category": "GOOD FOR DRY SKIN",
                "is_position": true,
                "score": 0.98,
            }],
            "model_id": "model_20201015_181022",
            "predicted_at": datetime,
        }
    ],
    "manual_labels": [
        {
            "editor_id": "mohamed@adeptmind.ai",
            "timestamp": "2020-10-20 21:45:33",
            "labels": [{
                "category": "GOOD FOR DRY SKIN",
                "label": 1
            },
            {
                "category": "GOOD FOR SENSITIVE SKIN",
                "label": 0
            },
            ]
        }
    ]
}
```
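
As an illustration of how `prediction_history` could be maintained after each training round, here is a small sketch. The helper names are hypothetical, the inner key is assumed to be spelled `predictions`, and "changed" is assumed to mean that the set of positively predicted categories differs between the two most recent entries. A check like this might also help with the case where a later model changes its prediction for an already acknowledged failure.

```python
from datetime import datetime, timezone


def record_prediction(failure, predictions, model_id, predicted_at=None):
    """Append one prediction_history entry to a failure record."""
    failure.setdefault("prediction_history", []).append({
        "predictions": predictions,
        "model_id": model_id,
        "predicted_at": predicted_at or datetime.now(timezone.utc),
    })


def positive_categories(entry):
    """Categories the model predicted as positive in one history entry."""
    return {p["category"] for p in entry["predictions"] if p["is_position"]}


def prediction_changed(failure):
    """True if the two most recent history entries disagree."""
    history = failure.get("prediction_history", [])
    if len(history) < 2:
        return False
    return positive_categories(history[-1]) != positive_categories(history[-2])
```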
## Tasks
### conflicts
- enhance training process: run prediction on the `training` collection (checking the `failures` collection to set the `is_acknowledged` tag)
- `/conflicts/summary` endpoint
- `/conflicts/example` endpoint
- update `/submit` for data cleaning
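
The payload of the summary endpoint is not specified yet. One possible shape, purely as an assumption for discussion, is a count of conflicts split by the `is_acknowledged` tag:

```python
from collections import Counter


def conflicts_summary(conflicts):
    """Hypothetical summary payload: conflict counts by acknowledgement."""
    # A missing tag is treated as "not acknowledged".
    counts = Counter(bool(c.get("is_acknowledged")) for c in conflicts)
    return {
        "total": len(conflicts),
        "acknowledged": counts[True],
        "unacknowledged": counts[False],
    }
```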
### failures
- training process: run prediction on failures collection
- `/failed/summary` endpoint
- `/start_battle_from_failed_examples` endpoint
- update `/submit` for data labeling