# Large-scale Training
## Large-scale models
### CLIP
[PDF](https://arxiv.org/pdf/2103.00020.pdf)
- Dataset: WIT (WebImageText)
- 400 Million image-text pairs
- 500,000 text queries
- Up to 20,000 image-text pairs per query
- Model
- Image Encoder: ResNet-50
- Equally increasing the width, depth, and resolution of the model
- ResNet-50 | ResNet-101 | RN50x4 | RN50x16 | RN50x64
- ViT-B/32 | ViT-B/16 | ViT-L/14
- Text Encoder: Transformer
- Base size: 63M parameters, 12 layers, 512-wide, 8 attention heads
- Only scale the width of the model, proportionally to the calculated increase in width of the ResNet; do not scale the depth at all.
- Training
- 32 Epochs
- MiniBatch: 32768
- (Equivalent updates): 390,625
- Mixed Precision
- Time
- Largest ResNet -- RN50x64: 18 days on 592 V100 GPUs
- Largest ViT -- ViT-L/14: 12 days on 256 V100 GPUs
- Model selection:
For the ViT-L/14, we also pre-train at a higher 336-pixel resolution for one additional epoch to boost performance, similar to FixRes; this model is denoted ViT-L/14@336px.
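CLIP's training objective is a symmetric contrastive loss over the in-batch image-text pairs. Below is a minimal PyTorch sketch following the pseudocode in the paper; the 512-dimensional embeddings, the batch of 8, and the fixed temperature are illustrative stand-ins (the real model learns the temperature as a log-parameterized scalar).

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix; matched pairs lie on the diagonal.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random "embeddings" for a batch of 8 pairs.
print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```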
### DALL·E
[PDF](https://arxiv.org/pdf/2102.12092.pdf)
- Dataset
- 250 million image-text pairs
- Model
- 12-billion-parameter autoregressive transformer
- Training
- Mixed-Precision Training
> Getting the model to train in 16-bit precision past one billion parameters, without diverging, was the most challenging part of this project.
> The cause of this instability was underflow in the 16-bit gradients.
> The fix: use a separate "gradient scale" for each resblock in the model.
(A sketch of this per-resblock scaling appears at the end of this section.)
- Distributed Optimization
- The 12-billion-parameter model consumes about 24 GB of memory when stored in 16-bit precision.
- Solution: parameter sharding + PowerSGD
- Batch size: 1024
- 430,000 updates
- Time
- 1,024 V100 GPUs (16 GB each)
- (estimated time): > 15 days
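The per-resblock gradient scaling quoted above can be pictured as follows. This is a hedged toy sketch, not the actual DALL·E implementation: the real recipe also unscales the parameter gradients before the optimizer step, filters out non-finite values, and adapts each block's scale dynamically during training.

```python
import torch
import torch.nn as nn

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by `scale` on the way back."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.scale, None

class ScaledResBlock(nn.Module):
    def __init__(self, dim, scale=2.0 ** 13):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.scale = scale  # per-block scale; the real model adapts this during training

    def forward(self, x):
        # Gradients are multiplied by `scale` as they enter the block in the
        # backward pass (applied at the forward output) and divided by `scale`
        # as they leave it (applied at the forward input), so each block's
        # 16-bit gradients stay out of the underflow range.
        x = GradScale.apply(x, 1.0 / self.scale)
        x = x + self.body(x)
        return GradScale.apply(x, self.scale)

# Toy usage: parameter gradients inside the block come out scaled by `scale`
# and would be unscaled before the optimizer step in a real setup.
block = ScaledResBlock(64)
block(torch.randn(4, 64)).mean().backward()
```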
### DALL·E 2
- Dataset
- Encoder: 650 million image-text pairs (DALL·E + CLIP)
- Decoder/upsamplers/prior: the 250-million-pair DALL·E dataset
- Model
- 3.5 billion parameter GLIDE model
### DALL·E mini
- Dataset: 15 million image-text pairs
- [Conceptual Captions Dataset](https://aclanthology.org/P18-1238/) which contains 3 million image and caption pairs.
- [Conceptual 12M](https://arxiv.org/abs/2102.08981) which contains 12 million image and caption pairs.
- The [OpenAI subset](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) of [YFCC100M](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/), which contains about 15 million images; we further sub-sampled it to 2 million images due to limitations in storage space. We used both title and description as caption and removed HTML tags, new lines, and extra spaces (see the caption-cleanup sketch below).
For fine-tuning our image encoder, we only used a subset of 2 million images.
We used all the images we had (about 15 million) for training our Seq2Seq model.
- Model
- 0.4 billion parameters
- Training
- Single TPU v3-8, three days
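For the YFCC100M captions mentioned above ("used both title and description as caption and removed HTML tags, new lines, and extra spaces"), the cleanup presumably looks something like the sketch below. The regex-based tag stripping is an assumption; the actual DALL·E mini preprocessing may differ.

```python
import re

def clean_caption(title: str, description: str) -> str:
    # Join title and description into a single caption.
    caption = f"{title} {description}"
    caption = re.sub(r"<[^>]+>", " ", caption)  # remove HTML tags (assumed regex approach)
    caption = caption.replace("\n", " ")        # remove new lines
    caption = re.sub(r"\s+", " ", caption)      # collapse extra spaces
    return caption.strip()

print(clean_caption("A dog", "<b>Golden retriever</b>\n playing   fetch"))
# -> "A dog Golden retriever playing fetch"
```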
## Large-scale training requirements
### System
#### 1. Resources
- DALL·E mini level
- One DGX (8× A100), three days for one experiment; 8 DGX nodes should be fine for moving forward.
- DALL·E Level :warning:
- 128 A100 GPUs, one week for one experiment
- Might take > 1 month for a hyperparameter search (rough arithmetic below)
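Rough arithmetic behind the figures above, assuming a sweep of 8 experiments (the sweep size is illustrative, not from any source):

```python
def sweep_days(gpus_per_exp, days_per_exp, n_experiments, total_gpus):
    concurrent = max(1, total_gpus // gpus_per_exp)  # experiments running at once
    rounds = -(-n_experiments // concurrent)         # ceiling division
    return rounds * days_per_exp

# DALL·E mini level: 8-GPU experiments on an 8-DGX (64-GPU) pool
print(sweep_days(gpus_per_exp=8, days_per_exp=3, n_experiments=8, total_gpus=64))     # 3 days
# DALL·E level: 128-GPU experiments run one at a time on 128 GPUs
print(sweep_days(gpus_per_exp=128, days_per_exp=7, n_experiments=8, total_gpus=128))  # 56 days, i.e. > 1 month
```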
#### 2. Dataset storage and retrieval :warning:
- Super-large dataset (x00 million pairs), > 100 TB of storage
- High-speed retrieval for training purposes on a cluster
- Easy to integrate with the training code (streaming sketch below)
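One common way to meet these requirements (an assumption on my part, not something prescribed above) is to store the data as many pre-shuffled shard files and stream them sequentially, so workers never do random access over hundreds of millions of small files. Minimal PyTorch sketch; `load_shard` and the shard paths are hypothetical placeholders.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

def load_shard(path):
    # Placeholder: a real reader would decode a tar/parquet shard of image-text pairs.
    yield torch.zeros(3, 224, 224), "a caption"

class ShardedPairs(IterableDataset):
    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        info = get_worker_info()
        # Split shards across DataLoader workers so each streams a disjoint subset.
        paths = self.shard_paths if info is None else self.shard_paths[info.id::info.num_workers]
        for path in paths:
            yield from load_shard(path)

loader = DataLoader(ShardedPairs([f"shard-{i:05d}.tar" for i in range(4)]),
                    batch_size=2, num_workers=0)
images, captions = next(iter(loader))
print(images.shape, captions)
```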
#### 3. Large-scale training platform :warning:
- When model size (including gradients, optimizer states) < single GPU memory, EASY
- Mixed-precision support is good across all deep learning platforms (PyTorch AMP sketch below)
- When model size > single GPU memory (beyond my capability)
- Mixed precision
- Distributed optimization
- When a super large number of GPUs is required
- System stability when some GPUs fail
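For the easy case (model fits on one GPU), a mixed-precision step in PyTorch is just autocast plus a dynamic GradScaler. The sketch below uses a toy model and random data and assumes a CUDA device is available.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling against fp16 underflow

for step in range(10):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():   # run the forward pass in reduced precision where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss, backprop scaled gradients
    scaler.step(optimizer)            # unscale gradients, skip the step on inf/nan
    scaler.update()                   # adjust the scale for the next step
```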
### Data and Algorithm
- Data :warning:
- The quality of data is important!
- Algorithm