In this tutorial we provide information to help you work with large datasets on the Triton cluster. The general information is not restricted to Triton and should apply to any cluster system (the concrete numbers of course refer to Triton as of Oct 2021). We also provide some tools and instructions specifically for the ImageNet dataset.
When using large datasets (>100 GB) on Triton, there are a few things to keep in mind:
Some nodes (e.g. the dgx nodes) have fast local storage, which can offer individual I/O speeds from the local filesystem (i.e. /tmp) of up to 30 GB/s. So, if a dataset needs to be read multiple times (e.g. for multiple retrainings) and doesn't fit into memory, using one of these nodes, transferring the data to the node, and loading it from the local drive is as efficient as it gets.
Sharding a dataset means splitting a huge dataset into large subsets (3-10 GB per shard). This representation allows each shard to be read purely sequentially, and different workers can load different shards in parallel.
webdataset and pytorch
webdataset offers a convenient way to load large, sharded datasets into pytorch: it implements pytorch's IterableDataset interface and can thus be used like a regular pytorch dataset.
webdataset expects the data to be stored as a tarball (.tar) file, with all files belonging to one item stored in individual, sequential files, e.g.:
Dataset.tar
|-> image1.png
|-> image1.cls
|-> image1.meta
|-> image2.png
|-> image2.cls
|-> image2.meta
....
|-> imageXYZ.png
|-> imageXYZ.cls
|-> imageXYZ.meta
for a dataset with image data in image1.png, class information in image1.cls, and metadata in image1.meta. Only the image file and the class file are necessary to train machine learning algorithms.
It is essential that the data is stored in this order: otherwise sequential reading of the data is impossible, which would force extremely inefficient random access. The file suffixes are interpreted as keys by webdataset.
Luckily, the tar program can sort the files in an archive by name (--sort=name). Make sure that there are no items with identical names in your dataset!
The authors of webdataset offer a convenience tool, tarp, that allows you to easily create sharded tar files from a folder containing your dataset (assuming your dataset is stored in tuples as mentioned above). The tarp command is webdataset-specific and can be installed according to the instructions at:
https://github.com/webdataset/tarp
On a dataset folder, you can run:
tar --sort=name -cf - . | tarp cat --shuffle=5000 - -o - | tarp split - -o ../dataset-%06d.tar
which will create a sharded dataset with the default maximum file size (3 GB) and maximum number of files per shard (10,000). This dataset is "shuffled" with a buffer of 5,000 elements: if your original order is 10,000 images of class A followed by 10,000 images of class B and so on, the output will start with at least 5,000 class A images and only then slowly move towards class B, so your final dataset will still be very "unshuffled".
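To see why a buffer shuffle leaves class-sorted data largely sorted, here is a small illustrative implementation of the general buffered-shuffle technique (a sketch of the idea, not tarp's actual code):

```python
import random

def buffer_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream with a fixed-size buffer: fill the buffer, then
    for each new item emit a randomly chosen buffered element.  An item
    can only move a limited distance, so a class-sorted stream stays
    largely class-sorted."""
    rng = random.Random(seed)
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) > buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)                 # drain the remaining buffer
    yield from buf

# 1000 'A' items followed by 1000 'B' items, shuffled with a buffer of 100:
out = list(buffer_shuffle(["A"] * 1000 + ["B"] * 1000, 100))
```

No 'B' can appear in the output before all 1,000 'A' items have entered the buffer, so the first 900 output items are guaranteed to be 'A' regardless of the random seed.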
ImageNet commonly keeps its metadata separate from the image files, so the two need to be joined in a separate step. webdataset offers a map function that can be used for this; see e.g. 4:42 in this video.
For data that is split across a tar of tars (e.g. the train)
One issue that can come up is the naming of the files in ImageNet. If file names start with their WordNet ID (WNID), sorting them orders the files by WNID, which can lead to shards containing only a single class.
To avoid this, a one-time preprocessing step for ImageNet datasets is to move the images into shuffled shards. Since doing this "in place" is impossible (the datasets need too much memory), the files need to be extracted from the ImageNet archive once.
In order to do this on Triton, use a node with a large SSD (srun -p interactive --gres=spindle) and copy the dataset you want to shard to the /tmp folder.
Extract it there and build the image-to-ID mapping.
Then run makeShards, providing the mapping and the image file names.
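The exact interface of makeShards is not documented here; as a hypothetical sketch, building the image-to-class-ID mapping from ImageNet-style file names (where the WNID is the part before the first underscore, e.g. n01440764_10026.JPEG) could look like:

```python
def build_class_mapping(filenames):
    """Map ImageNet-style file names (n<WNID>_<number>.JPEG) to integer
    class IDs, assigning IDs in sorted WNID order."""
    wnids = sorted({name.split("_")[0] for name in filenames})
    wnid_to_id = {wnid: i for i, wnid in enumerate(wnids)}
    return {name: wnid_to_id[name.split("_")[0]] for name in filenames}

mapping = build_class_mapping(
    ["n01440764_10026.JPEG", "n01443537_2.JPEG", "n01440764_3.JPEG"]
)
```

Such a mapping can then be used to emit the .cls file alongside each image when the shards are written, independently of the (shuffled) order in which images land in the shards.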