# Large-scale Training

## Large-scale models

### CLIP

[PDF](https://arxiv.org/pdf/2103.00020.pdf)

- Dataset: WIT (WebImageText)
  - 400 million image-text pairs
  - 500,000 text queries
  - Up to 20,000 image-text pairs per query
- Model
  - Image Encoder: ResNet-50
    - Equally increasing the width, depth, and resolution of the model
    - ResNet-50 | ResNet-101 | RN50x4 | RN50x16 | RN50x64
    - ViT-B/32 | ViT-B/16 | ViT-L/14
  - Text Encoder: Transformer
    - Base size: 63M parameters, 12 layers, 512-wide, 8 attention heads
    - Only the width is scaled, proportionally to the calculated increase in width of the ResNet; the depth is not scaled at all.
- Training (contrastive objective; a loss sketch follows this section)
  - 32 epochs
  - Mini-batch size: 32,768
  - (Equivalent updates): 390,625
  - Mixed precision
- Time
  - Largest ResNet (RN50x64): 18 days on 592 V100 GPUs
  - Largest ViT (ViT-L/14): 12 days on 256 V100 GPUs
- Model selection: the ViT-L/14 is also pre-trained at a higher 336-pixel resolution for one additional epoch to boost performance, similar to FixRes; this model is denoted ViT-L/14@336px.
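CLIP is trained with a symmetric contrastive objective over each mini-batch of image-text pairs. Below is a minimal PyTorch sketch of that loss, assuming `image_features` and `text_features` are the encoder outputs for N matching pairs; the paper learns the temperature (initialized to 0.07) rather than fixing it, and the Numpy-style pseudocode in the paper is the authoritative version.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of N matching image-text pairs."""
    image_features = F.normalize(image_features, dim=-1)  # unit-norm embeddings
    text_features = F.normalize(text_features, dim=-1)
    # N x N cosine similarities, scaled by the temperature
    logits = image_features @ text_features.t() / temperature
    # The matching text for image i sits on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

The very large 32,768 mini-batch matters for this objective: every other pair in the batch acts as a negative, so bigger batches provide many more negatives per update.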
### DALL·E

[PDF](https://arxiv.org/pdf/2102.12092.pdf)

- Dataset
  - 250 million image-text pairs
- Model
  - 12-billion-parameter autoregressive transformer
- Training
  - Mixed-precision training
    > Getting the model to train in 16-bit precision past one billion parameters, without diverging, was the most challenging part of this project.
    > The cause of this instability was found to be underflow in the 16-bit gradients.
    > The fix was using a separate "gradient scale" for each resblock in the model.
  - Distributed optimization
    - The 12-billion-parameter model consumes about 24 GB of memory when stored in 16-bit precision (12 × 10⁹ parameters × 2 bytes ≈ 24 GB).
    - Solution: parameter sharding + PowerSGD
  - Batch size: 1,024
  - 430,000 updates
- Time
  - 1,024 16 GB V100 GPUs
  - (Estimated time): > 15 days

### DALL·E 2

- Dataset
  - Encoder: 650 million image-text pairs (DALL·E + CLIP datasets)
  - Decoder/upsamplers/prior: the 250-million-pair DALL·E dataset
- Model
  - 3.5-billion-parameter GLIDE model

### DALL·E mini

- Dataset: 15 million image-text pairs
  - [Conceptual Captions Dataset](https://aclanthology.org/P18-1238/), which contains 3 million image-caption pairs.
  - [Conceptual 12M](https://arxiv.org/abs/2102.08981), which contains 12 million image-caption pairs.
  - The [OpenAI subset](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) of [YFCC100M](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/), which contains about 15 million images and which we further sub-sampled to 2 million images due to limitations in storage space. We used both title and description as the caption and removed HTML tags, newlines, and extra spaces. For fine-tuning our image encoder, we only used a subset of 2 million images. We used all the images we had (about 15 million) for training our Seq2Seq model.
- Model
  - 0.4 billion parameters
- Training
  - Single TPU v3-8, three days

## Large-scale training requirements

### System

#### 1. Resources

- DALL·E mini level
  - One DGX (8× A100), three days for one experiment; 8 DGX nodes should be fine for moving forward.
- DALL·E level :warning:
  - 128 A100 GPUs, one week for one experiment
  - Might take > 1 month for a hyperparameter search

#### 2. Dataset storage and retrieval :warning:

- Super-large dataset (hundreds of millions of samples), > 100 TB of storage
- High-speed retrieval for training on a cluster
- Easy to integrate with the training code

#### 3. Large-scale training platform :warning:

- When model size (including gradients and optimizer states) < single-GPU memory: EASY
  - Mixed-precision support is good across all deep learning platforms (a minimal training-loop sketch is given at the end of this document)
- When model size > single-GPU memory (beyond my capability)
  - Mixed precision
  - Distributed optimization
- When a super large number of GPUs is required
  - System stability when some GPUs die

### Data and Algorithm

- Data :warning:
  - The quality of the data is important!
- Algorithm
  - hah
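For the "model fits on a single GPU" case above, mixed-precision training is essentially turnkey in current frameworks. Below is a minimal PyTorch sketch using `torch.cuda.amp` autocast plus dynamic loss scaling (the same underflow-avoidance idea that DALL·E had to extend with per-resblock scales); the model, data, and hyperparameters here are placeholders, not taken from any of the systems above.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model, optimizer, and data; any standard PyTorch setup works the same way.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
scaler = GradScaler()  # dynamic loss scaling to avoid fp16 gradient underflow

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with autocast():                   # run forward pass and loss in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()      # backprop the scaled loss
    scaler.step(optimizer)             # unscale gradients; skip the step on inf/nan
    scaler.update()                    # adjust the loss scale for the next step
```

When the model (plus gradients and optimizer states) no longer fits on one GPU, the same loop is typically wrapped in a sharding framework (e.g., FSDP/ZeRO-style parameter sharding), which is the harder case flagged above.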