# Machine Learning Dataset
## Available Datasets
* Open Images dataset (primary)
* COCO dataset
## Collecting Dataset
Working under `~/jetson-inference/python/training/detection/ssd`, first check how much data is available per class.
* `--stats-only` downloads only the annotation data, not the actual images.
```
$ python3 open_images_downloader.py --stats-only --class-names "Person,Oven,Microwave oven,Gas stove,Coffeemaker" --data=data/step
-------------------------------------
'train' set statistics
-------------------------------------
Image count: 221058
Bounding box count: 923985
Bounding box distribution:
Person: 922047/923985 = 1.00
Oven: 633/923985 = 0.00
Gas stove: 509/923985 = 0.00
Microwave oven: 482/923985 = 0.00
Coffeemaker: 314/923985 = 0.00
```
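The distribution column is rounded to two decimals, which hides how small the minority classes are. As a rough sketch, the fractions can be recomputed from the counts printed above. The dictionary is copied by hand from that output, and `box_counts` and `fractions` are illustrative names only.

```python
# Bounding box counts copied from the 'train' statistics above.
box_counts = {
    "Person": 922047,
    "Oven": 633,
    "Gas stove": 509,
    "Microwave oven": 482,
    "Coffeemaker": 314,
}

total = sum(box_counts.values())  # should equal the reported "Bounding box count"

# Fraction of all bounding boxes that belong to each class.
fractions = {name: count / total for name, count in box_counts.items()}

for name, fraction in fractions.items():
    print(f"{name:15s} {fraction:.4%}")
```

`Person` accounts for more than 99% of the boxes, while each of the other classes stays below 0.1%.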
## Limit the Number of Images and Download the Dataset
* `--max-images` limits the total dataset to the specified number of images, while keeping the distribution of images per class roughly the same as in the original dataset. If one class has more images than another, the ratio remains roughly the same.
* `--max-annotations-per-class` limits each class to the specified number of bounding boxes. If a class has fewer than that number available, all of its data is used. This is useful when the distribution of data is unbalanced across classes.
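To make the difference between the two flags concrete, here is a toy model based only on the descriptions above. It works on the bounding box counts from the statistics output and is not the downloader's actual selection logic. `scale_proportionally` and `cap_per_class` are hypothetical helper names.

```python
# Bounding box counts copied from the statistics output.
box_counts = {
    "Person": 922047,
    "Oven": 633,
    "Gas stove": 509,
    "Microwave oven": 482,
    "Coffeemaker": 314,
}


def scale_proportionally(counts, max_total):
    """Toy model of --max-images: shrink every class by the same factor."""
    total = sum(counts.values())
    factor = min(1.0, max_total / total)
    return {name: int(count * factor) for name, count in counts.items()}


def cap_per_class(counts, cap):
    """Toy model of --max-annotations-per-class: clip each class at `cap`."""
    return {name: min(count, cap) for name, count in counts.items()}


print(scale_proportionally(box_counts, 2500))
print(cap_per_class(box_counts, 500))
```

In this toy model, proportional scaling to 2500 leaves almost nothing but `Person`, whereas a per-class cap keeps the classes comparable.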
`Person` has far more bounding boxes than the other classes (an unbalanced ratio), so download only the other 4 classes:
```
$ python3 open_images_downloader.py --max-images=2500 --class-names "Oven,Microwave oven,Gas stove,Coffeemaker" --data=data/step
```
* The downloaded dataset CANNOT be used: the annotations are in a `.csv` file, but we need `.xml` files.
## FiftyOne: Developer Tool for Machine Learning
https://voxel51.com/docs/fiftyone/index.html
### Installation
https://voxel51.com/docs/fiftyone/getting_started/install.html
```
$ pip3 install fiftyone
$ pip3 install fiftyone[desktop]
```
### Download from the COCO Dataset (One Class at a Time)
https://voxel51.com/docs/fiftyone/integrations/coco.html
```
$ cd ~/fiftyone
$ python3
>>> import fiftyone as fo
>>> import fiftyone.zoo as foz
>>> foz.list_zoo_datasets()
[..., 'open-images-v6', 'quickstart', 'quickstart-geo', 'quickstart-video', 'ucf101', 'voc-2007', 'voc-2012']
# "person"x5000, "microwave"x1547, "oven"x2500, "toaster"x217, "sink"x2500, "refrigerator"x2360
>>> dataset = foz.load_zoo_dataset("coco-2017", split="train", label_types=["detections"], classes=["microwave"], max_samples=2500,)
>>> dataset = foz.load_zoo_dataset("coco-2017", split="train", label_types=["detections"], classes=["oven"], max_samples=2500,)
>>> dataset = foz.load_zoo_dataset("coco-2017", split="train", label_types=["detections"], classes=["person"], max_samples=2500,)
>>> print(fo.list_datasets())
>>> dataset1 = fo.load_dataset('coco-2017-train-2500')
>>> session = fo.launch_app(dataset1)
>>> exit()  # the downloaded datasets are merged automatically after exiting python3
```
`session = fo.launch_app(dataset1)` opens the FiftyOne App in a web browser, where the downloaded samples can be inspected.
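The three `load_zoo_dataset` calls above differ only in the class name, so they could be wrapped in a loop. This is a minimal sketch, not something taken from the FiftyOne docs: `download_classes` and its `loader` parameter are hypothetical, and the loader is injectable only so that the function can be tried without triggering a real download.

```python
def download_classes(classes, max_samples=2500, loader=None):
    """Call the zoo loader once per class, with the arguments used above."""
    if loader is None:
        # Imported lazily so the function can be exercised without FiftyOne.
        import fiftyone.zoo as foz
        loader = foz.load_zoo_dataset

    datasets = {}
    for name in classes:
        datasets[name] = loader(
            "coco-2017",
            split="train",
            label_types=["detections"],
            classes=[name],
            max_samples=max_samples,
        )
    return datasets


# Example (starts real downloads):
# download_classes(["microwave", "oven", "person"])
```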

### Load Local Data
```
$ python3
>>> import fiftyone as fo
>>> data_path = "~/fiftyone/coco-2017/train/data"
>>> labels_path = "~/fiftyone/coco-2017/train/labels.json"
# Import the dataset
>>> dataset = fo.Dataset.from_dir(
...     dataset_type=fo.types.COCODetectionDataset,
...     data_path=data_path,
...     labels_path=labels_path,
... )
>>> dataset.name = "coco-2017-train" # rename dataset
>>> print(fo.list_datasets())
>>> session = fo.launch_app(dataset)
```
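To check how many annotations each class actually received, `labels.json` can also be inspected without FiftyOne. The sketch below assumes the file follows the standard COCO layout (`categories` with `id` and `name`, `annotations` with `category_id`). `count_annotations` is a hypothetical helper, and the small dictionary is made-up sample data.

```python
import json
import os
from collections import Counter


def count_annotations(coco):
    """Return {category name: number of annotations} for a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


# Made-up miniature example in COCO layout.
sample = {
    "categories": [{"id": 1, "name": "person"}, {"id": 78, "name": "microwave"}],
    "annotations": [
        {"image_id": 36, "category_id": 1},
        {"image_id": 36, "category_id": 1},
        {"image_id": 49, "category_id": 78},
    ],
}
print(count_annotations(sample))

# For the real file (path as used above):
# with open(os.path.expanduser("~/fiftyone/coco-2017/train/labels.json")) as f:
#     print(count_annotations(json.load(f)))
```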
### Convert COCO to VOC Dataset
```
$ cd ~/fiftyone  # input_dir below is relative to this directory
$ python3
>>> import fiftyone as fo
>>> fo.utils.data.converters.convert_dataset(
...     input_dir='coco-2017/train',
...     input_type=fo.types.COCODetectionDataset,
...     output_dir='~/jetson-inference/python/training/detection/ssd/data/step',
...     output_type=fo.types.VOCDetectionDataset,
... )
# generates files
# ~/jetson-inference/python/training/detection/ssd/data/step/data/*.jpg
# ~/jetson-inference/python/training/detection/ssd/data/step/labels/*.xml
>>> exit()
```
#### Directory Format Exported by FiftyOne
```
- data
- labels
```
#### XML File Format Exported by FiftyOne (under `labels`)
```xml=
<annotation>
    <folder></folder>
    <filename>000000000036.jpg</filename>
    <path>/home/huang/jetson-inference/python/training/detection/ssd/data/step/data/000000000036.jpg</path>
    <source>
        <database></database>
    </source>
    <size>
        <width>481</width>
        <height>640</height>
        <depth></depth>
    </size>
    <segmented></segmented>
    <object>
        <name>umbrella</name>
        <iscrowd>0</iscrowd>
        <supercategory>accessory</supercategory>
        <bndbox>
            <xmin>0</xmin>
            <ymin>50</ymin>
            <xmax>457</xmax>
            <ymax>480</ymax>
        </bndbox>
    </object>
    <object>
        <name>person</name>
        <iscrowd>0</iscrowd>
        <supercategory>person</supercategory>
        <bndbox>
            <xmin>167</xmin>
            <ymin>162</ymin>
            <xmax>478</xmax>
            <ymax>628</ymax>
        </bndbox>
    </object>
</annotation>
```
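Files in this format can be read with the standard library. The following is a minimal sketch using a shortened copy of the annotation above. `read_boxes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the exported annotation shown above.
SAMPLE = """<annotation>
    <filename>000000000036.jpg</filename>
    <size><width>481</width><height>640</height><depth></depth></size>
    <object>
        <name>umbrella</name>
        <bndbox><xmin>0</xmin><ymin>50</ymin><xmax>457</xmax><ymax>480</ymax></bndbox>
    </object>
    <object>
        <name>person</name>
        <bndbox><xmin>167</xmin><ymin>162</ymin><xmax>478</xmax><ymax>628</ymax></bndbox>
    </object>
</annotation>"""


def read_boxes(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return boxes


for label, *coords in read_boxes(SAMPLE):
    print(label, coords)
```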
#### Directory Format Exported by Camera-capture Tool
```
- Annotations
- ImageSets
- JPEGImages
- labels.txt
```
#### XML Format Exported by Camera-capture Tool (under `Annotations`)
```xml=
<annotation>
    <filename>20220120-160838.jpg</filename>
    <folder>cooker</folder>
    <source>
        <database>cooker</database>
        <annotation>custom</annotation>
        <image>custom</image>
    </source>
    <size>
        <width>1280</width>
        <height>720</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>Gas stove</name>
        <pose>unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>266</xmin>
            <ymin>267</ymin>
            <xmax>1053</xmax>
            <ymax>588</ymax>
        </bndbox>
    </object>
</annotation>
```
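The two formats differ mainly in the fields inside `<object>`. One way to see this is sketched below with trimmed copies of the two samples above. `object_fields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed <object> from the FiftyOne export.
FIFTYONE_OBJ = """<annotation><object>
<name>person</name><iscrowd>0</iscrowd><supercategory>person</supercategory>
<bndbox><xmin>167</xmin><ymin>162</ymin><xmax>478</xmax><ymax>628</ymax></bndbox>
</object></annotation>"""

# Trimmed <object> from the camera-capture tool.
CAMERA_OBJ = """<annotation><object>
<name>Gas stove</name><pose>unspecified</pose><truncated>0</truncated><difficult>0</difficult>
<bndbox><xmin>266</xmin><ymin>267</ymin><xmax>1053</xmax><ymax>588</ymax></bndbox>
</object></annotation>"""


def object_fields(xml_text):
    """Collect the tag names used directly under every <object>."""
    root = ET.fromstring(xml_text)
    return {child.tag for obj in root.iter("object") for child in obj}


fiftyone_fields = object_fields(FIFTYONE_OBJ)
camera_fields = object_fields(CAMERA_OBJ)
print("only in FiftyOne export:", sorted(fiftyone_fields - camera_fields))
print("only in camera-capture:", sorted(camera_fields - fiftyone_fields))
```

Both formats share `name` and `bndbox`. The label names also differ in capitalization (`person` versus `Gas stove`).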
### Unify the XML File Format
* Label names
  * coco
    * person -> Person
    * microwave -> Microwave
    * oven -> Oven
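Because the mapping only concerns the `<name>` element of each object, it can be applied with an XML parser instead of plain string replacement, which avoids touching other fields such as `<supercategory>` or `<path>`. This is a sketch under that assumption. `LABEL_MAP` mirrors the list above and `rename_labels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# COCO label -> label name used in the final dataset (from the list above).
LABEL_MAP = {"person": "Person", "microwave": "Microwave", "oven": "Oven"}


def rename_labels(xml_text, mapping=LABEL_MAP):
    """Rename <object><name> values according to `mapping`; return XML text."""
    root = ET.fromstring(xml_text)
    for obj in root.iter("object"):
        name = obj.find("name")
        if name is not None and name.text in mapping:
            name.text = mapping[name.text]
    return ET.tostring(root, encoding="unicode")


sample = (
    "<annotation><object><name>person</name>"
    "<supercategory>person</supercategory></object></annotation>"
)
print(rename_labels(sample))
```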
#### Adjust the XML Files Downloaded from COCO
Save the following as `step_change_xml_1.py` under `~/jetson-inference/python/training/detection/ssd/`. It reads the XML files from `_labels`.
```python=
'''
replace string in _labels/*.xml
<path>~/jetson-inference/python/training/detection/ssd/data/step/data/*.jpg</path>
=> <path>~/jetson-inference/python/training/detection/ssd/data/step/JPEGImages/*.jpg</path>
and capitalize the label names (person -> Person, ...)
'''
from os import listdir
from os.path import expanduser, isfile, join

# expanduser() is needed: Python does not expand '~' by itself
# PATH = expanduser('~/workon_pytorch/pytorch-ssd/data/step')
PATH = expanduser('~/jetson-inference/python/training/detection/ssd/data/step')
LABELS_DIR = join(PATH, '_labels')
# COCO label -> label name used by the camera-capture tool
LABEL_MAP = {'person': 'Person', 'microwave': 'Microwave', 'oven': 'Oven'}

for filename in listdir(LABELS_DIR):
    xml_path = join(LABELS_DIR, filename)
    # skip anything that is not an .xml file
    if not isfile(xml_path) or not filename.lower().endswith('.xml'):
        continue
    print(filename)
    with open(xml_path, 'rt') as f:
        text = f.read()
    # change path
    text = text.replace('step/data', 'step/JPEGImages')
    # change label names, touching only the <name> tags
    for old, new in LABEL_MAP.items():
        text = text.replace('<name>' + old + '</name>', '<name>' + new + '</name>')
    with open(xml_path, 'wt') as f:
        f.write(text)
```
#### Remove Redundant Labels
`step_change_xml_2.py` under `~/jetson-inference/python/training/detection/ssd/` keeps only the objects whose label is listed in `labels.txt`:
```python=
'''
remove redundant labels from *.xml:
keep only the <object> blocks whose <name> is listed in labels.txt
'''
from os import listdir
from os.path import isfile, join

PATH = '/home/huang/jetson-inference/python/training/detection/ssd/data/step'
LABELS_DIR = join(PATH, '_labels')

# one label per line; strip() also copes with a missing final newline
with open(join(PATH, 'labels.txt'), 'r') as f:
    labels = [line.strip() for line in f if line.strip()]
name_tags = ['<name>' + label + '</name>' for label in labels]

filenames = [f for f in listdir(LABELS_DIR)
             if isfile(join(LABELS_DIR, f)) and f.lower().endswith('.xml')]
for filename in filenames:
    print(filename)
    with open(join(LABELS_DIR, filename), 'r') as f:
        lines = f.readlines()
    nlines = []  # lines to keep
    obj = []     # lines of the <object> block currently being read
    for line in lines:
        if '<object>' in line:
            obj.append(line)
        elif '</object>' in line:
            obj.append(line)
            # obj[1] is the <name> line, which directly follows <object>
            if any(tag in obj[1] for tag in name_tags):
                print(obj[1])
                nlines += obj
            obj = []
        elif obj:
            obj.append(line)     # inside an <object> block
        else:
            nlines.append(line)  # outside any <object> block
    with open(join(LABELS_DIR, filename), 'w') as f:
        f.writelines(nlines)
```
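The same filtering can be expressed with `xml.etree.ElementTree`, which does not depend on `<name>` being the line right after `<object>`. This is an alternative sketch, not the script used above, and `keep_labels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def keep_labels(xml_text, labels):
    """Drop every <object> whose <name> is not in `labels`; return XML text."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy because elements are removed inside the loop.
    for obj in list(root.findall("object")):
        if obj.findtext("name") not in labels:
            root.remove(obj)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<annotation>"
    "<object><name>umbrella</name></object>"
    "<object><name>Person</name></object>"
    "</annotation>"
)
print(keep_labels(sample, {"Person", "Microwave", "Oven"}))
```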
#### Create ImageSets
Under `~/jetson-inference/python/training/detection/ssd/data/step/ImageSets/Main`, create 4 `txt` files:
```
test.txt
train.txt
trainval.txt
val.txt
```
```python=
import os
import random
from os import listdir
from os.path import isfile, join, splitext
'''
file structure (shown for a dataset named fireAlarm)
--------------
fireAlarm/labels.txt
fireAlarm/JPEGImages/*.jpg
fireAlarm/Annotations/*.xml
fireAlarm/ImageSets/Main/train.txt test.txt (optional trainval.txt val.txt)
'''
# PATH = '/home/huang/workon_pytorch/pytorch-ssd/data/fireAlarm'
PATH = '/home/huang/jetson-inference/python/training/detection/ssd/data/step'

# create the VOC directory layout if it does not exist yet
for sub in ('JPEGImages', 'Annotations', 'ImageSets/Main'):
    os.makedirs(join(PATH, sub), exist_ok=True)

# image IDs = annotation file names without the .xml extension
ids = [splitext(f)[0] for f in listdir(join(PATH, 'Annotations'))
       if isfile(join(PATH, 'Annotations', f)) and f.lower().endswith('.xml')]
random.shuffle(ids)

# 10% test, 10% val, 80% train; trainval = train + val
splits = {'train': [], 'test': [], 'val': [], 'trainval': []}
for idx, image_id in enumerate(ids):
    print(image_id)
    if idx % 10 == 0:
        splits['test'].append(image_id)
    elif idx % 10 == 1:
        splits['val'].append(image_id)
        splits['trainval'].append(image_id)
    else:
        splits['train'].append(image_id)
        splits['trainval'].append(image_id)

# write one image ID per line into <split>.txt
for name, items in splits.items():
    with open(join(PATH, 'ImageSets/Main', name + '.txt'), 'wt') as f:
        f.write(''.join(item + '\n' for item in items))
```
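After the four files are written, a quick consistency check can catch mistakes such as a test image leaking into the training data. The sketch below only assumes one image ID per line in each file. `summarize_splits` and `check_splits` are hypothetical helpers.

```python
import os


def summarize_splits(main_dir):
    """Read the four split files and return {split name: set of image IDs}."""
    splits = {}
    for name in ("train", "val", "trainval", "test"):
        with open(os.path.join(main_dir, name + ".txt")) as f:
            splits[name] = {line.strip() for line in f if line.strip()}
    return splits


def check_splits(splits):
    """Return a list of problems; the list is empty if the split looks sane."""
    problems = []
    if splits["trainval"] != splits["train"] | splits["val"]:
        problems.append("trainval is not the union of train and val")
    if splits["test"] & splits["trainval"]:
        problems.append("test overlaps with trainval")
    if splits["train"] & splits["val"]:
        problems.append("train overlaps with val")
    return problems


# Example (directory as used in this document):
# main_dir = os.path.expanduser(
#     "~/jetson-inference/python/training/detection/ssd/data/step/ImageSets/Main")
# splits = summarize_splits(main_dir)
# print({name: len(ids) for name, ids in splits.items()}, check_splits(splits))
```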
Run the steps in this order under `~/jetson-inference/python/training/detection/ssd/`:
1. `$ python3 step_change_xml_1.py`
2. `$ python3 step_change_xml_2.py`
3. manually move the xml files from `_labels` to `Annotations`
4. manually move the jpg files from `_data` to `JPEGImages`
5. `$ python3 step_change_xml_4.py`
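Steps 3 and 4 could also be scripted. The following is a minimal sketch with `shutil.move`, assuming the `_labels` and `_data` source directories and the `Annotations` and `JPEGImages` target directories named in the list. `move_by_suffix` is a hypothetical helper.

```python
import os
import shutil


def move_by_suffix(src_dir, dst_dir, suffix):
    """Move every file in src_dir whose name ends with `suffix` into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src_dir)):
        if name.lower().endswith(suffix):
            shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
            moved.append(name)
    return moved


# Example (dataset path as used in this document):
# root = os.path.expanduser("~/jetson-inference/python/training/detection/ssd/data/step")
# move_by_suffix(os.path.join(root, "_labels"), os.path.join(root, "Annotations"), ".xml")
# move_by_suffix(os.path.join(root, "_data"), os.path.join(root, "JPEGImages"), ".jpg")
```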
### Train the Model (Transfer Learning)
```
$ workon pytorch
$ cd ~/jetson-inference/python/training/detection/ssd
$ python3 train_ssd.py --dataset-type=voc --data=data/step --model-dir=models/step
$ python3 onnx_export.py --model-dir=models/step
```
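Before starting a long run of `train_ssd.py`, it may be worth confirming that every ID listed in `ImageSets/Main` has both an image and an annotation. This sketch assumes the VOC layout built in the previous sections. `find_missing` is a hypothetical helper.

```python
import os


def find_missing(root, split="train"):
    """Return IDs from ImageSets/Main/<split>.txt lacking a .jpg or .xml file."""
    list_path = os.path.join(root, "ImageSets", "Main", split + ".txt")
    with open(list_path) as f:
        ids = [line.strip() for line in f if line.strip()]

    missing = []
    for image_id in ids:
        has_image = os.path.isfile(os.path.join(root, "JPEGImages", image_id + ".jpg"))
        has_label = os.path.isfile(os.path.join(root, "Annotations", image_id + ".xml"))
        if not (has_image and has_label):
            missing.append(image_id)
    return missing


# Example:
# print(find_missing(os.path.expanduser(
#     "~/jetson-inference/python/training/detection/ssd/data/step")))
```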