<style>
.reveal {
font-size: 36px;
}
</style>
# `torchvision`
## Revamp of datasets
Philip Meier [@pmeier](https://github.com/pmeier)
[Quansight](https://www.quansight.com/)
---
## Preliminaries
### "old"
```python
from torchvision import datasets
```
### "new"
```python
from torchvision.prototype import datasets
```
---
## Biggest changes
- Datasets follow the iter-style rather than the map-style used before
  - fully compatible with the rework of the dataloader
  - allows streaming from remote sources
- Datasets return everything as `Tensor`s rather than foreign types (see the sketch below)
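
A minimal sketch of what this means in practice, assuming the `load`-based API shown later in this deck; the `"image"` key is illustrative and may differ between datasets:
```python
from torchvision.prototype import datasets

# iter-style: samples are obtained by iterating, not by indexing
dataset = datasets.load("caltech256")
sample = next(iter(dataset))

# the image is already decoded into a Tensor, no PIL objects involved
print(type(sample["image"]))
```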
----
### iter- to map-style dataset
```python
from torch.utils.data import Dataset

class MapDataset(Dataset):
    def __init__(self, samples, *, decoder=None):
        # materialized samples, e.g. the tuple of raw samples of an iter-style dataset
        self.samples = samples
        self.decoder = decoder

    def __getitem__(self, idx):
        sample = self.samples[idx]
        if self.decoder:
            sample = self.decoder(sample)
        return sample

    def __len__(self):
        return len(self.samples)
```
----
```python
map_dataset = MapDataset(
    tuple(datasets.load("caltech256", decoder=None))
)
len(map_dataset)
map_dataset[3141]
```
---
## API
---
## Loading
### old
- The namespace exposes one class per dataset, which needs to be instantiated
```python
dataset = datasets.ImageNet(...)
```
----
### Issues
- The names are not standardized beyond a default camel-case notation
- [#1398](https://github.com/pytorch/vision/issues/1398) `Imagenet` vs. `ImageNet`
----
### new
- All datasets are loaded by name through a single point of entry
```python
dataset = datasets.load("imagnet", ...)
```
```
ValueError: Unknown dataset 'imagnet'. Did you mean 'imagenet'?
```
---
## Data location and download
### old
- First argument is always `root`
- Most datasets support a `download: bool` flag
```python
dataset = datasets.MNIST(root, ..., download=True)
```
----
### Issues
- For the most common use case of having all the data in one place, `root` is superfluous
- Some datasets put the data directly in `root` while others create a directory in it
- The `download` flag should not be needed, because the data always has to be downloaded upfront anyway
----
### new
- All data is automatically downloaded and managed by `torchvision`
- The data is stored by default in
`~/.cache/torch/datasets/vision`
- The path can be changed through `datasets.home()` or an environment variable
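
A minimal sketch of querying the data root; per the bullet above, `datasets.home()` can also be used to change it, but the exact setter interface and environment variable name are not shown here:
```python
from torchvision.prototype import datasets

# query the directory where downloaded data is cached
# (defaults to ~/.cache/torch/datasets/vision)
print(datasets.home())
```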
---
## Attributes
### old
- Some datasets carry additional meta data as attributes
```python
dataset = datasets.ImageNet(...)
dataset.classes
```
----
### Issues
- The extra attributes are not standardized across datasets
- If a dataset carries such information, it is usually poorly documented or not documented at all
----
### new
- All static information can be queried through a single entrypoint
```python
info = datasets.info("imagenet")
info.categories
```
---
## Return type
### old
- Each dataset returns a sample as a tuple
- The return types are usually foreign, e.g. `PIL.Image.Image` or plain Python containers
```python
dataset = datasets.CocoDetection(...)
sample = dataset[0]
print(type(sample[0]))
print(
    [
        (key, type(value).__name__)
        for key, value in sample[1][0].items()
    ]
)
```
```
<class 'PIL.Image.Image'>
[('segmentation', 'list'), ('area', 'float'), ('iscrowd', 'int'),
('image_id', 'int'), ('bbox', 'list'), ('category_id', 'int'),
('id', 'int')]
```
----
### Issues
- A tuple is a sub-par data structure for returning richer information
- Each foreign type needs to be converted to a `Tensor`
----
### new
- Each dataset returns a dictionary containing all the information
- All features are collated and converted and thus ready to use
```python
dataset = datasets.load("coco", annotations="instances")
sample = next(iter(dataset))
print([(key, type(value).__name__) for key, value in sample.items()])
```
```
[('path', 'str'), ('image', 'Tensor'), ('segmentations', 'Tensor'),
('areas', 'Tensor'), ('crowds', 'Tensor'),
('bounding_boxes', 'Tensor'), ('labels', 'Tensor'),
('categories', 'list'), ('super_categories', 'list'),
('ann_ids', 'list')]
```
---
## Transformations
### old
- Each dataset takes a combination of `transform`, `target_transform`, and `transforms` that will be applied before the sample is returned
```python
from torchvision import transforms
transform = transforms.RandomHorizontalFlip()
dataset = datasets.ImageNet(..., transform=transform)
```
----
### Issues
- Usage of the keyword argument [is not consistent](https://gist.github.com/pmeier/14756fe0501287b2974e03ab8d651c10)
- It is hard to reuse the same transformation for multiple datasets, since their return types are not standardized
----
### new
- Transformations are completely decoupled from datasets and are now applied afterwards
```python
from torchvision.prototype import transforms
transform = transforms.HorizontalFlip()
dataset = datasets.load("imagenet").map(transform)
```
---
## Implementation
```python
from typing import Any, Dict, List

from torchdata.datapipes.iter import IterDataPipe
from torchvision.prototype import datasets

class MyDataset(datasets.utils.Dataset):
    def _make_info(self) -> datasets.utils.DatasetInfo: ...

    def resources(
        self, config: datasets.utils.DatasetConfig
    ) -> List[datasets.utils.OnlineResource]: ...

    def _make_datapipe(
        self,
        resource_dps: List[IterDataPipe],
        *,
        config: datasets.utils.DatasetConfig,
        decoder,
    ) -> IterDataPipe[Dict[str, Any]]: ...
```
----
### `def _make_info(self):`
- static information about the dataset
- for example, available categories, homepage, third party dependencies, ...
- can be accessed without loading the datapipe
- can be used to autogenerate documentation for the dataset (TBD)
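
A hypothetical sketch of such a method; the `DatasetInfo` fields and keyword arguments below are illustrative assumptions, not the real signature:
```python
def _make_info(self) -> datasets.utils.DatasetInfo:
    # name, homepage, and categories are placeholder values
    return datasets.utils.DatasetInfo(
        "mydataset",
        homepage="https://example.com/mydataset",
        categories=["cat", "dog"],
    )
```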
----
### `def resources(self, config):`
- defines all resources that need to be locally available to start loading the data
- will be downloaded automatically
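
A hypothetical sketch; `HttpResource` and its `sha256` argument are assumptions about the resource helpers and may not match the real names:
```python
def resources(
    self, config: datasets.utils.DatasetConfig
) -> List[datasets.utils.OnlineResource]:
    # a single archive that is downloaded into the data home if missing
    return [
        datasets.utils.HttpResource(
            "https://example.com/mydataset/images.tar.gz",
            sha256="...",
        )
    ]
```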
----
### `def _make_datapipe(self, resource_dps, *, config, decoder):`
- heart of the dataset (varies wildly between different datasets)
- gets the already loaded datapipes of all resources
- needs to return an `IterDataPipe[Dict[str, Any]]` that yields complete samples (see the sketch below)
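
A minimal sketch of the overall shape for the class from the Implementation slide, assuming a single resource datapipe yielding `(path, file)` tuples and a hypothetical `_collate_sample` helper that builds the sample dictionary:
```python
def _make_datapipe(
    self,
    resource_dps: List[IterDataPipe],
    *,
    config: datasets.utils.DatasetConfig,
    decoder,
) -> IterDataPipe[Dict[str, Any]]:
    # resource_dps[0] corresponds to the first resource returned by resources()
    dp = resource_dps[0]
    # turn each raw (path, file) pair into the final sample dictionary
    return dp.map(self._collate_sample)
```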
---
## Example 1: [`caltech256`](https://github.com/pytorch/vision/blob/65438e9eba26951206cbfaafeac1d5b1ac805193/torchvision/prototype/datasets/_builtin/caltech.py#L143)
---
## Example 2: [`caltech101`](https://github.com/pytorch/vision/blob/65438e9eba26951206cbfaafeac1d5b1ac805193/torchvision/prototype/datasets/_builtin/caltech.py#L27)
---
## Questions?
{"metaMigratedAt":"2023-06-16T16:00:01.276Z","metaMigratedFrom":"Content","title":"`torchvision`","breaks":true,"contributors":"[{\"id\":\"f173b69f-0663-4d73-aeaa-02b211133e30\",\"add\":15340,\"del\":8832}]"}