# AIP Object Detection
This documentation provides the specifications we use in the `aip_detection` repository (repo).
The most important files that you'll be working with are:
1. `preprocess.py`
2. `train.py`
3. `batch_inference.py`
4. `inference.py`
The following sections detail the I/O specs of each of these scripts.
## Preprocessing
Preprocessing is handled by the `preprocess.py` script. It takes AIP-formatted labels (default is `annotation.csv`) and processes them (covered below). Depending on whether we run in `training` mode or `inference` mode, the output files of `preprocess.py` will vary. These output files are meant to be used as input to `train.py`, `batch_inference.py`, or `inference.py`.
### Input CSV Format
First, `preprocess.py` takes detection labels in the AIP format:
| file_name       | detection_label  | bbox_x_min | bbox_y_min | bbox_width | bbox_height |
|-----------------|------------------|------------|------------|------------|-------------|
| 2010_005668.jpg | cow              | 24         | 14         | 455        | 398         |
| 2009_005133.jpg | car              | 445        | 196        | 55         | 39          |
| 2009_005133.jpg | car              | 390        | 191        | 78         | 24          |
| 2011_002585.jpg | bicycle          | 88         | 53         | 105        | 107         |
| 2011_002231.jpg | nothing to label |            |            |            |             |
We allow repeated values in the `file_name` column. An image with multiple bounding boxes has its filename repeated once per box, and each row carries the information for one bounding box (see `2009_005133.jpg` above).
For images with no bounding boxes, the `detection_label` value must be `nothing to label`, and the remaining box information is left blank.
When saving the AIP Detection annotation file, save it as `annotation.csv` with no header row; it should look something like:
```
2010_005668.jpg,cow,24,14,455,398
2009_005133.jpg,car,445,196,55,39
2009_005133.jpg,car,390,191,78,24
2011_002585.jpg,bicycle,88,53,105,107
2011_002231.jpg,nothing to label,,,,
```
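For illustration, here is a minimal sketch of loading such a file with `pandas` (an assumption on our side, not part of the repo; the column names are taken from the table above, and `annotation.csv` itself has no header row):
```python
import pandas as pd

# Column names follow the AIP detection label table above;
# annotation.csv is saved without a header row.
COLUMNS = ["file_name", "detection_label",
           "bbox_x_min", "bbox_y_min", "bbox_width", "bbox_height"]

df = pd.read_csv("annotation.csv", header=None, names=COLUMNS)

# Rows marked "nothing to label" carry no box information
# (their bbox fields are read as NaN).
labeled = df[df["detection_label"] != "nothing to label"]

# file_name may repeat: one row per bounding box, so group to
# count boxes per image.
print(labeled.groupby("file_name").size())
```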
### Different Preprocessing Modes
When running `preprocess.py`, we must provide a `--datadir` argument that indicates the folder we intend to preprocess. We can provide the `--inference` flag (default `False`) to indicate that we are preprocessing the folder for *batch inference* usage. We can also provide a `--series_type` parameter (currently `2d` or `2.5d`, default `2d`) that indicates the type of images to process.
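For reference, these flags map onto a command-line interface along these lines (a minimal `argparse` sketch; the actual parser inside `preprocess.py` may differ):
```python
import argparse

# Hypothetical sketch of the preprocess.py CLI described above.
parser = argparse.ArgumentParser(description="Preprocess an AIP detection dataset.")
parser.add_argument("--datadir", required=True,
                    help="Folder to preprocess, e.g. voc_2012/train.")
parser.add_argument("--inference", action="store_true", default=False,
                    help="Preprocess the folder for batch inference usage.")
parser.add_argument("--series_type", choices=["2d", "2.5d"], default="2d",
                    help="Type of images to process.")
args = parser.parse_args()
```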
The folder structure for 2D preprocessing is (using VOC 2012 dataset as example):
```
└── voc_2012
    ├── train
    │   ├── image
    │   │   ├── xxx.jpg
    │   │   ├── yyy.jpg
    │   │   ├── ...
    │   │   └── zzz.jpg
    │   └── annotation.csv
    │
    ├── valid-label
    │   ├── image
    │   │   ├── aaa.jpg
    │   │   ├── bbb.jpg
    │   │   ├── ...
    │   │   └── ddd.jpg
    │   └── annotation.csv
    │
    └── valid-nolabel
        └── image
            ├── aaa.jpg
            ├── bbb.jpg
            ├── ...
            └── ddd.jpg
```
To process the training set, we would run:
```
preprocess.py --datadir=voc_2012/train
```
To process the validation set with labels, we would run:
```
preprocess.py --datadir=voc_2012/valid-label --inference
```
To process the validation set without labels, we would run:
```
preprocess.py --datadir=voc_2012/valid-nolabel --inference
```
**Notice that regardless of whether we have labels or not, running with `--inference` will automatically handle the case where no annotation file is provided. The difference is in the output.**
In training mode:
`preprocess.py` will output `train_anno.json` and `valid_anno.json`, the annotation files for the training and validation sets needed by `train.py`.
In inference mode:
(1) If `annotation.csv` is given: `preprocess.py` will output `infer_anno.json`, the file needed by `batch_inference.py`.
(2) If `annotation.csv` is not given: `preprocess.py` will output `infer_files.json`, the file needed by `batch_inference.py`.
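These rules can be summarized in a small sketch (the helper `expected_output` is hypothetical and not part of the repo; the file names come from the rules above):
```python
import os

def expected_output(datadir, inference):
    """Return the JSON file(s) preprocess.py should produce for datadir."""
    has_annotation = os.path.exists(os.path.join(datadir, "annotation.csv"))
    if not inference:
        # Training mode: annotations for both training and validation sets.
        return ["train_anno.json", "valid_anno.json"]
    # Inference mode: output depends on whether annotation.csv exists.
    return ["infer_anno.json"] if has_annotation else ["infer_files.json"]

print(expected_output("voc_2012/valid-nolabel", inference=True))
# ['infer_files.json']
```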
## Training
Model training is handled by the `train.py` script. A configuration file is passed to the `--config` parameter, and this file determines the training behavior. See the `README.md` for more details on how to use the configuration file.
**Backend Return Values**
After `train.py` has been called, the backend team reads the following set of parameters (json-encoded) from stdout, so we are required to print the following information as a json structure:
```
{
    "files": ["stepx.pth", "model_info_stepx.json"],
    "time": 177.94996309280396,
    "loss": 0.9384218454360962,
    "score": 7.437419187997448e-05,
    "acc": 7.437419187997448e-05
}
```
- `files`: the output files from the just-completed training session
- `time`: the time, in seconds, taken to run one training session
- `loss`: the average loss accumulated during the training session
- `score`: the mAP score
- `acc`: the mAP score (same as `score`, kept for simplicity)
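As an illustration, the end of `train.py` could emit this structure along the following lines (a minimal sketch; the numeric values are placeholders):
```python
import json
import sys
import time

start = time.time()
# ... the training session would run here ...

summary = {
    "files": ["stepx.pth", "model_info_stepx.json"],  # outputs of this run
    "time": time.time() - start,  # seconds for one training session
    "loss": 0.9384,               # placeholder: average training loss
    "score": 7.44e-05,            # placeholder: mAP
    "acc": 7.44e-05,              # same value as "score" by convention
}

# The backend reads this json structure from stdout.
print(json.dumps(summary))
sys.stdout.flush()
```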
## Batch Inference
Batch inference (handled by `batch_inference.py`) can refer to either "batch evaluation" or "batch prediction". Batch evaluation means that ground truth is provided, and therefore `metrics=True` must be set in the configuration file. Batch prediction refers to purely predicting the bounding boxes for each file in a given preprocessed dataset.
**Backend Return Values**
As `batch_inference.py` evaluates batches of samples in the given dataset, it outputs its progress:
```
PROGRESS {"inferenced": 8, "total_num": 5823, "percentage": 0.13738622703074016}
...
PROGRESS {"inferenced": 5823, "total_num": 5823, "percentage": 1.00}
```
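A minimal sketch of how such progress lines can be emitted (note that `percentage` is expressed in percent, consistent with the numbers above):
```python
import json
import sys

def report_progress(done, total):
    # Matches the PROGRESS lines above; percentage is in percent.
    payload = {"inferenced": done, "total_num": total,
               "percentage": 100.0 * done / total}
    print("PROGRESS", json.dumps(payload))
    sys.stdout.flush()

report_progress(8, 5823)
# PROGRESS {"inferenced": 8, "total_num": 5823, "percentage": 0.13738622703074016}
```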
After it has completed evaluating every sample, `batch_inference.py` outputs a json dictionary organized as follows:
```
RESULT {
    "prediction": {
        "2011_001847.jpg": [],
        "2009_003126.jpg": [],
        "2008_000080.jpg": [],
        "2008_000782.jpg": [],
        "2011_001350.jpg": [],
        "2010_004050.jpg": [],
        "2009_000771.jpg": [],
        "2009_004940.jpg": [
            {
                "category": "person",
                "bbox": [255.25765991210938, 212.8616943359375, 31.549407958984375, 121.72079467773438],
                "confidence": 0.7429825663566589
            }
        ]
    },
    "metrics": {
        "AP": 6.343370520580997e-05,
        "AP50": 0.00032025060178860707,
        "AP75": 4.7575250021061345e-06,
        "APs": 0.0,
        "APm": 3.485762416133066e-07,
        "APl": 0.00014756558471165586,
        "AR@1": 0.0009446751478690392,
        "AR@10": 0.00342661963060001,
        "AR@100": 0.0034765217832418882,
        "ARs@100": 0.0,
        "ARm@100": 0.0001094838850174216,
        "ARl@100": 0.006700979399520094
    },
    "ground_truth": {
        "2011_001847.jpg": [
            {"category": "chair", "bbox": [225, 288, 107, 169]},
            {"category": "chair", "bbox": [2, 307, 151, 193]},
            {"category": "chair", "bbox": [26, 227, 136, 141]},
            {"category": "chair", "bbox": [193, 221, 98, 178]},
            {"category": "diningtable", "bbox": [50, 245, 245, 211]},
            {"category": "bottle", "bbox": [186, 208, 18, 62]}
        ],
        ...
    }
}
```
For batch evaluation, the output is as shown above. For batch prediction, only the `prediction` key of the dictionary is output.
The requirement is that the output starts with the keyword `RESULT ` (with one space following `RESULT`), followed by a dictionary containing the `prediction`, `metrics`, and `ground_truth` dictionaries.
Each item in `prediction` has the file name as its key (as given in the annotation's `file_name` value) and a list of predicted-bounding-box dictionaries as its value. An empty list indicates that nothing was predicted for that image.
The `metrics` dictionary maps the name of each detection metric to its value.
The `ground_truth` dictionary is similar to the `prediction` dictionary: each key is a filename, and each value is a list of dictionaries, one per bounding box in that image.
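On the consuming side, a minimal parsing sketch (assuming the whole `RESULT` payload arrives on a single stdout line):
```python
import json

def parse_result_line(line):
    """Decode one stdout line from batch_inference.py, or return None.

    Sketch based on the 'RESULT ' keyword convention described above.
    """
    prefix = "RESULT "
    if not line.startswith(prefix):
        return None  # e.g. a PROGRESS line
    result = json.loads(line[len(prefix):])
    # Batch prediction carries only "prediction"; batch evaluation also
    # carries "metrics" and "ground_truth".
    return result
```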
## Inference
Inference (or, more specifically, single-image prediction) is handled by the `inference.py` script. Provide an image file via the `image` parameter in the config file (the image does not need to be preprocessed).
**Backend Return Values**
The required output from `inference.py` is similar to that of `batch_inference.py`:
```
RESULT [{"category": "dog", "confidence": 0.95, "bbox":[35, 50, 120, 150]},
{"category": "cat", "confidence": 0.77, "bbox":[240, 260, 90, 100]}]
```
It is the same `RESULT ` keyword (with one trailing space), followed by a list of dictionaries, each describing one predicted bounding box.
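A minimal sketch of emitting this output (the predictions shown are illustrative only):
```python
import json

# Illustrative predictions, one dictionary per bounding box.
predictions = [
    {"category": "dog", "confidence": 0.95, "bbox": [35, 50, 120, 150]},
    {"category": "cat", "confidence": 0.77, "bbox": [240, 260, 90, 100]},
]

# The 'RESULT ' keyword (one trailing space) followed by the json list.
print("RESULT " + json.dumps(predictions))
```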