# LAVIS: Adding Datasets
(Work in progress)
Since LAVIS allows adding new datasets, this post focuses on how to configure a custom dataset.
As an example, I use data downloaded from the [Taiwan Tourism Multimedia Open Data portal](https://media.taiwan.net.tw/zh-tw/portal), which I named the Taiwan Caption Datasets.
## Dataset Configuration
Suppose we want to add a new dataset; the first step is to create its configuration file.
Under `lavis/configs/datasets`, add `taiwan_cap/defaults.yaml`:
```yaml
datasets:
  taiwan_caption: # name of the dataset builder
    dataset_card: dataset_card/taiwan_caption.md
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          # url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json
          storage: code/get_photo/image_info.json
        val:
          # url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json
          storage: code/get_photo/image_info.json
        test:
          # url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json
          storage: code/get_photo/image_info.json
      images:
        storage: code/get_photo/image/
```
Here, `taiwan_caption` is the name that the dataset builder reads.
`dataset_card` points to the dataset card, a document describing the dataset's specification, stored at `dataset_card/taiwan_caption.md`.
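In LAVIS, that builder name is what binds the YAML config to a builder class through a registry decorator. The snippet below is a simplified, self-contained sketch of that mechanism (the real API lives in `lavis.common.registry` and `BaseDatasetBuilder`; `TaiwanCapBuilder` is a hypothetical name of my own):

```python
# Minimal stand-in for LAVIS's registry pattern. In real LAVIS code you would
# use `from lavis.common.registry import registry` and subclass
# BaseDatasetBuilder; this sketch only illustrates how the name
# "taiwan_caption" links the YAML config to a builder class.

class Registry:
    mapping = {}

    @classmethod
    def register_builder(cls, name):
        def wrap(builder_cls):
            cls.mapping[name] = builder_cls
            return builder_cls
        return wrap

registry = Registry()

@registry.register_builder("taiwan_caption")  # must match the name in the YAML
class TaiwanCapBuilder:
    # Hypothetical: points the "default" variant at the config file above.
    DATASET_CONFIG_DICT = {"default": "configs/datasets/taiwan_cap/defaults.yaml"}
```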
The content of the dataset card is as follows:
---
# Taiwan Caption Dataset (Captioning)
## Description
A collection of Taiwan sightseeing pictures.
## Task
(from https://paperswithcode.com/task/image-captioning)
**Image captioning** is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
## Metrics
Models are typically evaluated according to a [BLEU](https://aclanthology.org/P02-1040/) or [CIDER](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) metric.
## Leaderboard
....
## Auto-Downloading
Auto-downloading is not supported.
## References
https://media.taiwan.net.tw/zh-tw/portal
---
`data_type` specifies the kind of data; there are three options: images, videos, and features.
`build_info` holds the information needed to build the dataset. Under `annotations` there are three splits: `train`, `val`, and `test`.
Each split has a `url` and a `storage` field. `url` is the remote address of the annotation file, but since I have not uploaded the Taiwan Caption Datasets anywhere, I commented `url` out for now. `storage` is the local path to the annotations, so I filled in where the dataset is stored.
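The annotation file that `storage` points to is assumed here to be a plain JSON list with one record per image; the field names below are my own illustrative choice, not something LAVIS mandates:

```python
import json

# Hypothetical schema for code/get_photo/image_info.json: a JSON list where
# each record pairs an image file (relative to the images storage root)
# with its caption. The "image"/"caption" keys are assumptions for illustration.
records = [
    {"image": "sun_moon_lake_001.jpg",
     "caption": "A lakeside view of Sun Moon Lake at dawn."},
    {"image": "taipei_101_002.jpg",
     "caption": "Taipei 101 towering over the surrounding skyline."},
]

with open("image_info.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```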
Next we need to load the dataset. For this, we write a dataset class that makes reading the data convenient.
Every dataset class must inherit from `BaseDataset`. Taking the Taiwan caption datasets as an example:
```python
import json

from lavis.datasets.datasets.base_dataset import BaseDataset


class TaiwanCaptionDatasets(BaseDataset):
    def __init__(self, vis_processor, text_processor, vis_root, ann_paths):
        """
        vis_root (string): Root directory of images (e.g. coco/images/)
        ann_paths (list of string): paths to the annotation files
        """
        self.vis_root = vis_root

        self.images_info = {}
        self.annotation = []
        # Each annotation file is a JSON list of per-image records.
        for ann_path in ann_paths:
            images = json.load(open(ann_path, "r"))
            for image in images:
                self.annotation.append(image)

        self.vis_processor = vis_processor
        self.text_processor = text_processor

        # Assign a unique instance_id to every annotation record.
        self._add_instance_ids()
```
For now, the `__getitem__` method falls back to the default inherited from the parent class, and so does `collater`.
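If the inherited default ever proves insufficient, a typical `__getitem__` for a caption dataset looks roughly like the sketch below. This is simplified and self-contained: the LAVIS processors are replaced by plain callables, the image decode step is only indicated in a comment, and the returned keys follow the usual LAVIS captioning convention (`image`, `text_input`) as an assumption:

```python
import os

class TaiwanCaptionSketch:
    """Self-contained sketch of the per-sample logic; in the real class this
    method would live on TaiwanCaptionDatasets (a BaseDataset subclass)."""

    def __init__(self, vis_processor, text_processor, vis_root, annotation):
        self.vis_processor = vis_processor
        self.text_processor = text_processor
        self.vis_root = vis_root
        self.annotation = annotation

    def __len__(self):
        return len(self.annotation)

    def __getitem__(self, index):
        ann = self.annotation[index]
        image_path = os.path.join(self.vis_root, ann["image"])
        # In the real dataset: image = Image.open(image_path).convert("RGB")
        # Here we pass the path straight through the (stand-in) processor.
        image = self.vis_processor(image_path)
        caption = self.text_processor(ann["caption"])
        return {"image": image, "text_input": caption, "image_id": ann["image"]}
```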