# AIP Object Detection
This documentation provides the specifications we use in the `aip_detection` repository (repo).
The most important files that you'll be working with are:
1. `preprocess.py`
2. `train.py`
3. `batch_inference.py`
4. `inference.py`
The following sections detail the I/O specs of each of these scripts.
## Preprocessing
Preprocessing is handled by the `preprocess.py` script. It takes AIP-formatted labels (default is `annotation.csv`) and processes them (covered below). Depending on whether we run in `training` mode or `inference` mode, the output files of `preprocess.py` will vary. These output files are meant to be used as input to `train.py`, `batch_inference.py`, or `inference.py`.
### Input CSV Format
First, `preprocess.py` takes detection labels in the AIP format:
| file_name       | detection_label  | bbox_x_min | bbox_y_min | bbox_width | bbox_height |
|-----------------|------------------|------------|------------|------------|-------------|
| 2010_005668.jpg | cow              | 24         | 14         | 455        | 398         |
| 2009_005133.jpg | car              | 445        | 196        | 55         | 39          |
| 2009_005133.jpg | car              | 390        | 191        | 78         | 24          |
| 2011_002585.jpg | bicycle          | 88         | 53         | 105        | 107         |
| 2011_002231.jpg | nothing to label |            |            |            |             |
We allow repeated values in the `file_name` column. An image with multiple bounding boxes has its filename repeated once per box, and each row carries the information for one bounding box (see `2009_005133.jpg` above).
For images with no bounding boxes, the `detection_label` value must be `nothing to label`, and the remaining box information is left blank.
When saving the AIP Detection annotation file, save it as `annotation.csv` with no header row; it should look something like:
```
2010_005668.jpg,cow,24,14,455,398
2009_005133.jpg,car,445,196,55,39
2009_005133.jpg,car,390,191,78,24
2011_002585.jpg,bicycle,88,53,105,107
2011_002231.jpg,nothing to label,,,,
```
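For illustration, here is a minimal sketch of loading such a file with `pandas` (an assumption on our side, not part of the repo; the column names are taken from the table above, and `annotation.csv` itself has no header row):
```python
import pandas as pd

# Column names follow the AIP detection label table above;
# annotation.csv is saved without a header row.
COLUMNS = ["file_name", "detection_label",
           "bbox_x_min", "bbox_y_min", "bbox_width", "bbox_height"]

df = pd.read_csv("annotation.csv", header=None, names=COLUMNS)

# Rows marked "nothing to label" carry no box information
# (their bbox fields are read as NaN).
labeled = df[df["detection_label"] != "nothing to label"]

# file_name may repeat: one row per bounding box, so group to
# count boxes per image.
print(labeled.groupby("file_name").size())
```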
### Different Preprocessing Modes
When running `preprocess.py`, we must provide a `--datadir` argument that indicates the folder we intend to preprocess. We can provide the `--inference` flag (default `False`) to indicate that we are preprocessing the folder for *batch inference* usage. We can also provide a `--series_type` parameter (currently `2d` or `2.5d`, default `2d`) that indicates the type of images to process.
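For reference, these flags map onto a command-line interface along these lines (a minimal `argparse` sketch; the actual parser inside `preprocess.py` may differ):
```python
import argparse

# Hypothetical sketch of the preprocess.py CLI described above.
parser = argparse.ArgumentParser(description="Preprocess an AIP detection dataset.")
parser.add_argument("--datadir", required=True,
                    help="Folder to preprocess, e.g. voc_2012/train.")
parser.add_argument("--inference", action="store_true", default=False,
                    help="Preprocess the folder for batch inference usage.")
parser.add_argument("--series_type", choices=["2d", "2.5d"], default="2d",
                    help="Type of images to process.")
args = parser.parse_args()
```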
The folder structure for 2D preprocessing is (using VOC 2012 dataset as example):
```
└── voc_2012
    ├── train
    │   ├── image
    │   │   ├── xxx.jpg
    │   │   ├── yyy.jpg
    │   │   ├── ...
    │   │   └── zzz.jpg
    │   └── annotation.csv
    │
    ├── valid-label
    │   ├── image
    │   │   ├── aaa.jpg
    │   │   ├── bbb.jpg
    │   │   ├── ...
    │   │   └── ddd.jpg
    │   └── annotation.csv
    │
    └── valid-nolabel
        └── image
            ├── aaa.jpg
            ├── bbb.jpg
            ├── ...
            └── ddd.jpg
```
To process the training set, we would run:
```
preprocess.py --datadir=voc_2012/train
```
To process the validation set with labels, we would run:
```
preprocess.py --datadir=voc_2012/valid-label --inference
```
To process the validation set without labels, we would run:
```
preprocess.py --datadir=voc_2012/valid-nolabel --inference
```
**Notice that regardless of whether we have labels or not, running with `--inference` will automatically handle the case where no annotation file is provided. The difference is in the output.**
In training mode:
`preprocess.py` will output `train_anno.json` and `valid_anno.json`, the annotation files for the training and validation sets needed by `train.py`.
In inference mode:
(1) If `annotation.csv` is given: `preprocess.py` will output `infer_anno.json`, the file needed by `batch_inference.py`.
(2) If `annotation.csv` is not given: `preprocess.py` will output `infer_files.json`, the file needed by `batch_inference.py`.
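These rules can be summarized in a small sketch (the helper `expected_output` is hypothetical and not part of the repo; the file names come from the rules above):
```python
import os

def expected_output(datadir, inference):
    """Return the JSON file(s) preprocess.py should produce for datadir."""
    has_annotation = os.path.exists(os.path.join(datadir, "annotation.csv"))
    if not inference:
        # Training mode: annotations for both training and validation sets.
        return ["train_anno.json", "valid_anno.json"]
    # Inference mode: output depends on whether annotation.csv exists.
    return ["infer_anno.json"] if has_annotation else ["infer_files.json"]

print(expected_output("voc_2012/valid-nolabel", inference=True))
# ['infer_files.json']
```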
## Training
Model training is handled by the `train.py` script. A configuration file is passed to the `--config` parameter, and this file determines the training behavior. See the `README.md` for more details on how to use the configuration file.
**Backend Return Values**
After `train.py` has been called, the backend team reads the following set of parameters (json-encoded) from stdout, so we are required to print the following information as a json structure:
```
{
    "files": ["stepx.pth", "model_info_stepx.json"],
    "time": 177.94996309280396,
    "loss": 0.9384218454360962,
    "score": 7.437419187997448e-05,
    "acc": 7.437419187997448e-05
}
```
- `files`: the output files from the just-completed training session
- `time`: the time, in seconds, taken to run one training session
- `loss`: the average loss accumulated during the training session
- `score`: the mAP score
- `acc`: the mAP score (same as `score`, kept for simplicity)
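As an illustration, the end of `train.py` could emit this structure along the following lines (a minimal sketch; the numeric values are placeholders):
```python
import json
import sys
import time

start = time.time()
# ... the training session would run here ...

summary = {
    "files": ["stepx.pth", "model_info_stepx.json"],  # outputs of this run
    "time": time.time() - start,  # seconds for one training session
    "loss": 0.9384,               # placeholder: average training loss
    "score": 7.44e-05,            # placeholder: mAP
    "acc": 7.44e-05,              # same value as "score" by convention
}

# The backend reads this json structure from stdout.
print(json.dumps(summary))
sys.stdout.flush()
```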
## Batch Inference
Batch inference (handled by `batch_inference.py`) can refer to either "batch evaluation" or "batch prediction". Batch evaluation means that ground truth is provided, and therefore `metrics=True` must be set in the configuration file. Batch prediction refers to purely predicting the bounding boxes for each file in a given preprocessed dataset.
**Backend Return Values**
As `batch_inference.py` evaluates batches of samples in the given dataset, it outputs its progress:
```
PROGRESS {"inferenced": 8, "total_num": 5823, "percentage": 0.13738622703074016}
...
PROGRESS {"inferenced": 5823, "total_num": 5823, "percentage": 1.00}
```
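A minimal sketch of how such progress lines can be emitted (note that `percentage` is expressed in percent, consistent with the numbers above):
```python
import json
import sys

def report_progress(done, total):
    # Matches the PROGRESS lines above; percentage is in percent.
    payload = {"inferenced": done, "total_num": total,
               "percentage": 100.0 * done / total}
    print("PROGRESS", json.dumps(payload))
    sys.stdout.flush()

report_progress(8, 5823)
# PROGRESS {"inferenced": 8, "total_num": 5823, "percentage": 0.13738622703074016}
```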
After it has completed evaluating every sample, `batch_inference.py` outputs a json dictionary organized as follows:
```
RESULT {
    "prediction": {
        "2011_001847.jpg": [],
        "2009_003126.jpg": [],
        "2008_000080.jpg": [],
        "2008_000782.jpg": [],
        "2011_001350.jpg": [],
        "2010_004050.jpg": [],
        "2009_000771.jpg": [],
        "2009_004940.jpg": [
            {
                "category": "person",
                "bbox": [255.25765991210938, 212.8616943359375, 31.549407958984375, 121.72079467773438],
                "confidence": 0.7429825663566589
            }
        ]
    },
    "metrics": {
        "AP": 6.343370520580997e-05,
        "AP50": 0.00032025060178860707,
        "AP75": 4.7575250021061345e-06,
        "APs": 0.0,
        "APm": 3.485762416133066e-07,
        "APl": 0.00014756558471165586,
        "AR@1": 0.0009446751478690392,
        "AR@10": 0.00342661963060001,
        "AR@100": 0.0034765217832418882,
        "ARs@100": 0.0,
        "ARm@100": 0.0001094838850174216,
        "ARl@100": 0.006700979399520094
    },
    "ground_truth": {
        "2011_001847.jpg": [
            {"category": "chair", "bbox": [225, 288, 107, 169]},
            {"category": "chair", "bbox": [2, 307, 151, 193]},
            {"category": "chair", "bbox": [26, 227, 136, 141]},
            {"category": "chair", "bbox": [193, 221, 98, 178]},
            {"category": "diningtable", "bbox": [50, 245, 245, 211]},
            {"category": "bottle", "bbox": [186, 208, 18, 62]}
        ],
        ...
    }
}
```
For batch evaluation, the output is as shown above. For batch prediction, only the `prediction` key of the dictionary is output.
The requirement is that the output starts with the keyword `RESULT ` (with one space following `RESULT`), followed by a dictionary containing the `prediction`, `metrics`, and `ground_truth` dictionaries.
Each item in `prediction` has the file name as its key (as given in the annotation's `file_name` value) and a list of predicted-bounding-box dictionaries as its value. An empty list indicates that nothing was predicted for that image.
The `metrics` dictionary maps the name of each detection metric to its value.
The `ground_truth` dictionary is similar to the `prediction` dictionary: each key is a filename, and each value is a list of dictionaries, one per bounding box in that image.
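On the consuming side, a minimal parsing sketch (assuming the whole `RESULT` payload arrives on a single stdout line):
```python
import json

def parse_result_line(line):
    """Decode one stdout line from batch_inference.py, or return None.

    Sketch based on the 'RESULT ' keyword convention described above.
    """
    prefix = "RESULT "
    if not line.startswith(prefix):
        return None  # e.g. a PROGRESS line
    result = json.loads(line[len(prefix):])
    # Batch prediction carries only "prediction"; batch evaluation also
    # carries "metrics" and "ground_truth".
    return result
```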
## Inference
Inference (or, more specifically, single-image prediction) is handled by the `inference.py` script. Provide an image file via the `image` parameter in the config file (the image does not need to be preprocessed).
**Backend Return Values**
The required output from `inference.py` is similar to that of `batch_inference.py`:
```
RESULT [{"category": "dog", "confidence": 0.95, "bbox":[35, 50, 120, 150]},
{"category": "cat", "confidence": 0.77, "bbox":[240, 260, 90, 100]}]
```
It is the same `RESULT ` keyword (with one trailing space), followed by a list of dictionaries, each describing one predicted bounding box.
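A minimal sketch of emitting this output (the predictions shown are illustrative only):
```python
import json

# Illustrative predictions, one dictionary per bounding box.
predictions = [
    {"category": "dog", "confidence": 0.95, "bbox": [35, 50, 120, 150]},
    {"category": "cat", "confidence": 0.77, "bbox": [240, 260, 90, 100]},
]

# The 'RESULT ' keyword (one trailing space) followed by the json list.
print("RESULT " + json.dumps(predictions))
```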