# Books / Tutorials : Multi-GPU Programming & AI Model Training
Reference \:
> * Learn CUDA Programming
> https://github.com/PacktPublishing/Learn-CUDA-Programming
> * 萬顆 GPU 的訓練 - 分散式機器學習 — 系統工程與實戰
> https://www.tenlong.com.tw/products/9786267383278
> * Multi-GPU Programming
> https://medium.com/gpgpu/multi-gpu-programming-6768eeb42e2c
> * NvLink or PCIe, how to specify the interconnect?
> https://stackoverflow.com/questions/53174224/nvlink-or-pcie-how-to-specify-the-interconnect
> * GPU Computing with CUDA (and beyond) Part 3: Programing Multiple GPUs
> https://www.sintef.no/contentassets/3eb4691190f2451fb21349eb24cb9e8e/part-3-multi-gpu-programming.pdf
> * Distributed Machine Learning with Python
> https://www.tenlong.com.tw/products/9781801815697
# Hardware
> Reference for Detailed Spec \:
> https://hackmd.io/@Erebustsai/SJPwQzOvh
**CPU** : *EPYC 7302*
**GPU** : *Nvidia RTX A5000 x 2 with NvLink*
**RAM** : *ECC 2400 512GB*
**MB** : *ROMED8-2T*
**Power** : *ASUS Thor 1600W on a dedicated 220V circuit*
**Output of `nvidia-smi nvlink -status`**
```
GPU 0: NVIDIA RTX A5000 (UUID: GPU-74bb45a8-f503-261d-121b-bba1d462c33c)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
GPU 1: NVIDIA RTX A5000 (UUID: GPU-8d513fbe-2239-85cd-c234-61a58ccdd468)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
```
**Output of `nvidia-smi nvlink --capabilities`**
```
GPU 0: NVIDIA RTX A5000 (UUID: GPU-74bb45a8-f503-261d-121b-bba1d462c33c)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: false
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: false
GPU 1: NVIDIA RTX A5000 (UUID: GPU-8d513fbe-2239-85cd-c234-61a58ccdd468)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: false
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: false
```
**Output of `p2pBandwidthLatencyTest`**
```
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A5000, pciBusID: 46, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A5000, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
   D\D     0      1
     0     1      1
     1     1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  439.52  20.62
     1   21.23 675.82
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0  673.20  52.71
     1   52.90 677.58
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  680.09  27.93
     1   29.42 680.80
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0  680.98 101.57
     1  101.85 680.08
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.55  12.65
     1  12.68   1.58
   CPU     0      1
     0   3.38  10.64
     1  10.49   3.28
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.52   1.49
     1   1.30   1.57
   CPU     0      1
     0   3.34   2.74
     1   2.93   3.40
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```
# Multi-GPU Programming
## Distribute Independent Jobs to Different GPUs
*Please refer to the Github Repository for the bash script. TODO : Update Github Link*
In the Multi-GPU Programming article \[https://medium.com/gpgpu/multi-gpu-programming-6768eeb42e2c\], the author provides a shell script that distributes independent jobs to different GPUs. I provide my own version of the script, and it does essentially the same thing.
## GPUDirect P2P

Reference \:
> https://github.com/PacktPublishing/Learn-CUDA-Programming
### Output of System GPU Topology
The following topology output shows that I have two GPUs in my system and that they are connected with NVLink (`NV4`).
```
        GPU0  GPU1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0     X    NV4   0-127         N/A            N/A
GPU1    NV4    X    0-127         N/A            N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
```
:::info
:bulb: **Example Output without NvLink**

This is a machine with an i9-13900K and two RTX 4090s without NVLink.
:::
### No NVLink in Newer Workstation GPUs
> Reference \:
> * Nvidia推出新世代專業GPU,導入Ada Lovelace架構
> https://www.ithome.com.tw/review/155388
> * RTX A6000 ADA - no more NV Link even on Pro GPUs?
> https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-nv-link-even-on-pro-gpus/230874
Apparently, NVIDIA decided that you need to buy expensive data-center GPUs for the privilege of using NVLink. This definitely has an impact on AI training performance. To sum up, it is a business decision that benefits no one but NVIDIA's sales records.
### Setup GPUDirect
```cpp
void fCUDA::setGPUDirect() {
    int deviceCount = 0;
    cudaError_t error_id;
    int canAccessPeer = 0;

    error_id = cudaGetDeviceCount(&deviceCount);
    if (error_id != cudaSuccess) {
        printf("Error cudaGetDeviceCount() : %d\n%s\n\n", static_cast<int>(error_id), cudaGetErrorString(error_id));
        exit(EXIT_FAILURE);
    }

    // For every ordered pair of devices, enable peer (P2P) access when it is supported.
    for (int i = 0; i < deviceCount; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < deviceCount; ++j) {
            if (i == j)
                continue;
            cudaDeviceCanAccessPeer(&canAccessPeer, i, j);
            if (canAccessPeer) {
                printf("GPU #%d and GPU #%d: Start Setup\n", i, j);
                cudaDeviceEnablePeerAccess(j, 0);
            }
            else {
                printf("GPU #%d and GPU #%d: Not Supported\n", i, j);
            }
        }
    }
}
```
## Example \:
In the reference article, the author provides a naive GPU kernel that simply divides every pixel of an image by an integer. The article then optimizes it by making each thread work on multiple pixels.
To use multiple GPUs for this algorithm, we just divide the input image into equal-sized chunks and assign each chunk to a different GPU (a hedged multi-GPU sketch follows the kernel snippet below).
:::info
:bulb: **Optimized Technique**
***The `char_vec` struct design***
The following code snippet is the full `char_vec` structure.
* Use `reinterpret_cast<>` to cast between `unsigned char*` and `int2*`.
* Use a single `int2` load/store instead of a loop over 8 `char`s, since both are 8 bytes wide.
```cpp
struct char_vec
{
public:
    static constexpr int size = 8;
    unsigned char data[size];

    __device__ char_vec (const unsigned char* __restrict__ input) {
        *reinterpret_cast<int2*>(data) = *reinterpret_cast<const int2*>(input);
    }

    __device__ void store (unsigned char* output) const {
        *reinterpret_cast<int2*>(output) = *reinterpret_cast<const int2*>(data);
    }
};
```
:::
```cpp
auto globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
char_vec vec(input + globalIdx * 8);
unsigned char *data = vec.data;
for (int i = 0; i < vec.size; ++i)
    data[i] /= div;
vec.store(output + globalIdx * 8);
```
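As a rough illustration of the chunking idea (my own sketch, not code from the article, and written with PyTorch instead of a raw CUDA kernel), the snippet below splits an image into one chunk per GPU, runs the per-pixel division on each device, and gathers the results. The image size and divisor are placeholders.
```python
# Hedged sketch: divide an image into equal chunks, one per GPU, and apply the
# per-pixel integer division on each device. Sizes and divisor are illustrative.
import torch

def divide_image_multi_gpu(image, div):
    n_gpus = torch.cuda.device_count()
    chunks = image.chunk(n_gpus, dim=0)                 # equal-sized row chunks
    outputs = []
    for i, chunk in enumerate(chunks):
        d = chunk.to(f"cuda:{i}", non_blocking=True)    # one chunk per GPU
        outputs.append((d // div).to("cpu"))            # per-pixel division
    return torch.cat(outputs, dim=0)

image = torch.randint(0, 256, (2048, 2048), dtype=torch.uint8).pin_memory()
result = divide_image_multi_gpu(image, 3)
```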
# Multi-GPU Data Transfer Consideration
## Blocking / Non-Blocking
* `cudaMemcpy` is a blocking transfer: the calling CPU thread waits for the copy to complete before moving on to the next host instruction.
* `cudaMemcpyAsync` is asynchronous with respect to the CPU thread. The CUDA runtime falls back to a blocking copy if its requirements are not met:
    * the host memory needs to be pinned;
    * with only one CPU thread, the per-call overhead can become a performance issue.
**Summary**
For a data-parallel algorithm, the best approach is to drive each GPU from its own host thread and use non-blocking memory transfers (see the sketch below), e.g. with:
* a C++11 `<thread>` library thread
* an OpenMP thread
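A hedged PyTorch sketch of this pattern (my own illustration, not code from the referenced slides): one Python thread per GPU, pinned host buffers, and non-blocking copies on a per-device stream, the PyTorch analogue of driving each GPU from its own C++ thread with `cudaMemcpyAsync`. The chunk size and the stand-in kernel are arbitrary.
```python
import threading
import torch

def worker(device_id, host_chunk, results):
    device = torch.device(f"cuda:{device_id}")
    stream = torch.cuda.Stream(device=device)
    with torch.cuda.stream(stream):
        d = host_chunk.to(device, non_blocking=True)    # async H2D copy from pinned memory
        d = d * 2.0                                     # stand-in for a real kernel
        results[device_id] = d.to("cpu", non_blocking=True)
    stream.synchronize()                                # wait only for this GPU's work

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    data = torch.randn(n_gpus, 1 << 20).pin_memory()    # pinned host memory
    results = [None] * n_gpus
    threads = [threading.Thread(target=worker, args=(i, data[i], results))
               for i in range(n_gpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```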
:::success
:information_source: **Programing Multiple GPUs**
https://www.sintef.no/contentassets/3eb4691190f2451fb21349eb24cb9e8e/part-3-multi-gpu-programming.pdf
:::
# Multi-GPU Communication
## NCCL
> Reference \:
> * Official Doc
> https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html
> * ZOMI Tutorials
> https://www.youtube.com/@ZOMI666/search?query=NCCL
## Unified Virtual Addressing \(UVA\)
> * Reference \:
> https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf
## Throwing CPU out of GPU-GPU Memory Transfer => P2P
> Reference \: https://medium.com/gpgpu/multi-gpu-programming-6768eeb42e2c
> A runtime complexity of a cudaMalloc call is about O(lg(n)), where N is the number of prior allocations on the current GPU. After cudaDeviceEnablePeerAccess call cudaMalloc must map its allocations to all devices with peer access enabled. That fact changes the complexity of cudaMalloc to O(D * lg(N)), where D is the number of devices with the peer access.
**However, I got a different result:**
```clike
./main
0.240549 s => 0.117959
```
## NVLink
* Bandwidth
* Global Atomic Support
**NVLink & Cache**
> Performance degradation with -dlcm=cg clearly indicates that L1 is used with NVLink memory operations.
> The difference in L2 cache is quite interesting. The data is actually cached in L2. The only difference is that it’s L2 of a different GPU.
:::info
:bulb: **Ptxas Options**
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#ptxas-options
:::
## Example \: Create Stream for Each GPU
*TODO \: Github Link*
In this example, a stream is created and timed for each GPU in the system. Since my system has two identical RTX A5000s, the execution time of each step is expected to be roughly the same for both GPUs.
```clike
CUDA Device Info
Detected 2 CUDA Capable device(s)
Device: 0, NVIDIA RTX A5000
Device: 1, NVIDIA RTX A5000
GPU #0 and GPU #1: Start Setup
GPU #1 and GPU #0: Start Setup
ID: 0 NVIDIA RTX A5000 Copy To : 2.14 ms
ID: 0 NVIDIA RTX A5000 Kernel : 0.17 ms
ID: 0 NVIDIA RTX A5000 Copy Back : 2.15 ms
ID: 0 NVIDIA RTX A5000 Execution Time : 4.46 ms
ID: 0 NVIDIA RTX A5000 Component Time : 4.46 ms
ID: 0 NVIDIA RTX A5000 Array check passed
ID: 1 NVIDIA RTX A5000 Copy To : 2.13 ms
ID: 1 NVIDIA RTX A5000 Kernel : 0.16 ms
ID: 1 NVIDIA RTX A5000 Copy Back : 2.18 ms
ID: 1 NVIDIA RTX A5000 Execution Time : 4.47 ms
ID: 1 NVIDIA RTX A5000 Component Time : 4.47 ms
ID: 1 NVIDIA RTX A5000 Array check passed
```
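Since my repository link is still a TODO, here is a hedged PyTorch sketch of the same idea: one stream per GPU, with events timing the copy-to / kernel / copy-back sequence. The buffer size and the stand-in kernel are placeholders.
```python
import torch

def time_on_device(device_id, n=1 << 24):
    device = torch.device(f"cuda:{device_id}")
    stream = torch.cuda.Stream(device=device)
    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    host = torch.randn(n).pin_memory()
    with torch.cuda.stream(stream):
        start.record(stream)
        d = host.to(device, non_blocking=True)    # copy to
        d = d / 3                                 # stand-in kernel
        back = d.to("cpu", non_blocking=True)     # copy back
        stop.record(stream)
    stop.synchronize()
    print(f"GPU {device_id}: {start.elapsed_time(stop):.2f} ms")

for i in range(torch.cuda.device_count()):
    time_on_device(i)
```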
# Paper \: Evaluating Multi-GPU Sorting with Modern Interconnects
> Reference \: https://github.com/hpides/multi-gpu-sorting
*This section focuses on my summary of the paper and of the implementation the authors provide on GitHub.*
There are two ways to implement multi-GPU sorting: GPU-only sorting, which uses the P2P interconnect between GPUs, and heterogeneous sorting, which uses both the CPU and the GPUs. The two approaches differ only in how the sorted data residing on different GPUs is merged.
The authors point out that the paper builds on two sorting algorithms \: *merge sort and radix sort*.
:::success
:bulb: **Radix Sort**
https://www.researchgate.net/publication/316849058_A_Memory_Bandwidth-Efficient_Hybrid_Radix_Sort_on_GPUs
:::
## P2P-based Multi-GPU Sorting
> After the data is locally sorted on each GPU, the merge phase produces the globally sorted array across all 6 GPU chunks, which are then copied back to the host.
Therefore, the most important part of multi-GPU sorting is merging the partial results returned from all the GPUs.
> Multiple GPUs merge their data by swapping blocks of keys between each other through P2P data transfers, based on a specific **pivot**
**Pivot Selection**
Before we calculate and select the pivot, the following diagram shows how the pivot works when merging two sorted arrays.
> A pivot `p` in B and a mirrored position `p'` in A, where `p' = |A| - p`. The pivot ensures that the first `p'` keys of A and the first `p` keys in B are `<=` to the last `p` keys of A and the last `p'` keys in B.


In the diagram above, if we exchange elements `0 ~ p'` of A with elements `p ~ 12` of B, both arrays keep the same size. This load-balances the multi-GPU sort, so the chunk size stored on each GPU stays constant.

After the memory swap, arrays A and B each contain two sublists that are sorted on their own and still need to be merged. The diagram above shows that producing the sorted AB array requires a local merge of the swapped arrays.
> We implement the pivot selection using an adapted binary-search that operates on two sorted arrays ... our pivot selection guarantees to pick the leftmost pivot. This minimizes the number of keys transferred via the P2P interconnects.
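As a rough illustration (my own Python sketch, not the authors' CUDA/Thrust code), the binary search below picks the leftmost pivot for two equally sized sorted chunks `a` and `b`, following the definition quoted above.
```python
# Hedged sketch: leftmost pivot p for two equally sized sorted chunks a and b,
# with p' = |a| - p as in the paper's notation. Not the paper's implementation.
def leftmost_pivot(a, b):
    m = len(a)
    lo, hi = 0, min(m, len(b))
    while lo < hi:
        p = (lo + hi) // 2
        a_last_kept = a[m - p - 1] if m - p - 1 >= 0 else float("-inf")
        b_first_given = b[p] if p < len(b) else float("inf")
        if a_last_kept <= b_first_given:
            hi = p          # p works; look for a smaller (leftmost) pivot
        else:
            lo = p + 1
    return lo

a = [1, 3, 5, 7, 9, 11]      # sorted chunk on GPU 0
b = [2, 4, 6, 8, 10, 12]     # sorted chunk on GPU 1
p = leftmost_pivot(a, b)     # p = 3, so p' = |a| - p = 3
# The first p' keys of a plus the first p keys of b are the |a| smallest keys overall.
```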
**Merge with four GPUs**

# Multi-GPU Matrix Multiplication
* Multi-GPU Matrix Multiplication \(MPI-CUDA\)
https://github.com/hoyathali/MultiGPUMatMul
* Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe
https://www.hse.ru/data/2024/06/13/2113948822/3_Multi-GPU%20GEMM%20Algorithm%20Performance%20A..Us%20Connected%20by%20NVLink%20and%20PCIe.pdf
TODO \: Implementation with EPYC Server.
# NvLink \& NvSwitch
> Reference \: https://github.com/chenzomi12/AIFoundation/tree/main/01AIChip/04NVIDIA
:::success
:bulb: **Hardware I have**
* NVIDIA RTX and NVLink Overview
https://www.youtube.com/watch?app=desktop&v=Z0IWrcmJvYQ
I have 2 × RTX A5000 in one of my systems, and the corresponding NVLink generation is 3.0. Another, older system has an Nvidia Quadro P40 in it, and the two systems are connected through a network switch. Therefore, communication between the systems can only happen over TCP\/IP.
:::
:::info
:information_source: **RDMA \(Remote Direct Memory Access\)**
https://zh.wikipedia.org/zh-tw/%E8%BF%9C%E7%A8%8B%E7%9B%B4%E6%8E%A5%E5%86%85%E5%AD%98%E8%AE%BF%E9%97%AE
:::
# Distributed Machine Learning with Python
> Reference \:
> * Github Repository
> https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python
## Chapter \#1\: Splitting Input Data
In data parallelism, every GPU that participates in the computation keeps a copy of the model. For both training and inference, we basically just split the input data \(training data for training \& input data for inference\). Apparently, inference will not always have a batch of data ready to be computed \(e.g. object detection\), so this does not apply to all inference cases.
### Using SGD
Just like single-node training, we are not going to use the gradient of the entire training set; instead, we use a mini-batch of the training set. This yields faster model convergence.
### Model Synchronization
> 1. Collects and sums up all the gradients from all the GPUs in use.
> 2. Broadcasts the aggregated gradients to all the GPUs.
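A hedged sketch of these two steps with `torch.distributed` (not the book's code): a single `all_reduce` with a SUM both collects and broadcasts the gradients, and dividing by the world size averages them. It assumes the default process group has already been initialized.
```python
import torch.distributed as dist

def synchronize_gradients(model):
    # Assumes dist.init_process_group(...) has already been called.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # collect + broadcast
            param.grad /= world_size                           # average across GPUs
```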
### Hyperparameter tuning
We use a large global batch size, since multiple GPUs give us enough memory to fit all that data. We usually need to multiply the learning rate of the original single-node training by `N`, the number of GPUs participating in the training.
> Recent research literature suggests that, for large-batch data parallel training, we should have a warmup stage at the very beginning of the training stage. This warmup policy suggests that we should start data parallel training **with a relatively small learning rate**.
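A hedged sketch of linear learning-rate scaling with a warmup phase; the base learning rate, warmup length, and schedule shape are illustrative choices, not values from the book.
```python
def scaled_lr_with_warmup(step, base_lr=0.1, n_gpus=2, warmup_steps=500):
    target_lr = base_lr * n_gpus                         # scale LR by the number of GPUs
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps     # ramp up from a small LR
    return target_lr

# Example usage with any optimizer:
# for group in optimizer.param_groups:
#     group["lr"] = scaled_lr_with_warmup(global_step)
```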
### Model synchronization schemes
```python
torch.distributed.init_process_group(backend='nccl', init_method = '...', world_size = N, timeout = M)
```
* `backend`\: PyTorch mainly supports three different communication backends\: NCCL, Gloo, and MPI.
:::success
:bulb: **NCCL、OpenMPI、Gloo對比**
https://blog.csdn.net/taoqick/article/details/126449935
:::
## Chapter \#2\: Parameter Server and All-Reduce
### Parameter Server Implementation
* Parameter Server\: the master node, which aggregates model updates from all the workers and updates the model parameters held on the parameter server.
* Workers\: the compute nodes.
> Reference \:
> * Github Repository
> https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python

### Issue with the Parameter Server
* Given N nodes, it is unclear what the best ratio is between the parameter server and workers.
* Code Complexity
### All-Reduce Architecture
> In the All-Reduce architecture, we abandon the parameter server role in the parameter server architecture. Now, every node is equivalent and all of them are worker nodes.
Communication libraries such as NCCL and Blink implement the collective communication protocols, so we don't need to explicitly implement `push_gradients` and `pull_weight`.
:::info
:information_source: **Message Passing Protocols with MPI**
* Course / Book / Implementation : MPI & Distributed Systems
https://hackmd.io/@Erebustsai/SkyCU2g4n
:::

### Implementation of All-Reduce `Ring All-Reduce`
> Reference\:
> * \[深度學習\]Ring All-reduce的數學性質
> https://blog.csdn.net/zwqjoy/article/details/130501965
The reference above provides better explanations and diagrams than the book.
## Chapter \#3\: Building a Data Parallel Training and Serving Pipeline

> Code Reference\:
> https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python/blob/main/Chapter03/ddp/main.py
### Input Data Partition
```python
torch.utils.data.DistributedSampler()
# class DistributedSampler(
#     dataset: Dataset,
#     num_replicas: int | None = None,
#     rank: int | None = None,
#     shuffle: bool = True,
#     seed: int = 0,
#     drop_last: bool = False
# )
```
The sampler above basically makes sure the data is scattered across the workers without any duplicated data.
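A hedged usage sketch (dataset and batch size are placeholders, and it assumes `init_process_group` has already been called so the sampler can pick up the rank and world size):
```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 10, (1000,)))
sampler = DistributedSampler(dataset)            # uses this process's rank / world_size
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)                     # different shuffle every epoch
    for data, target in loader:
        pass                                     # forward / backward as usual
```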
### Model Synchronization
```python
optimizer.zero_grad()
loss_fn.backward()
```
When `loss_fn.backward()` is called, the following happens.
> 1. After a layer generates its local gradients, PyTorch initializes a per-layer All-Reduce function to get the globally synchronized gradients of this layer. To reduce the system control's overhead, PyTorch often groups multiple consequent layers and conducts a per-group All-Reduce function.
> 2. Once all the layers have finished their All-Reduce operations, PyTorch will write all the layers' gradients to the `gradient` space in `model_parameters()`. Note that this is a blocking function call, which means that the workers will not start further operations until the whole model's All-Reduce finishes.
:::success
:bulb: **Code Example**
**\[PyTorch\] 使用 torch.distributed 在單機多 GPU 上進行分散式訓練**
https://idataagent.com/2023/01/07/pytorch-distributed-data-parallerl-multi-gpu/
:::
### Code Summary
From the `Code Example` section above, we know that there are two ways to do data parallelism.
* **DP**\: basically just uses threads, which are bound by the GIL and only work for multi-GPU in a single node.
* **DDP**\: uses processes, which can be deployed across multiple nodes.
DDP is better than DP in almost every way; the only downside is code complexity. A hedged DDP sketch follows.
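The sketch below shows a minimal DDP setup for a single machine with N GPUs, one process per GPU; the model, data, and hyperparameters are placeholders of my own, not the book's code.
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(32, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])           # gradients sync automatically
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    data = torch.randn(64, 32, device=f"cuda:{rank}")
    target = torch.randint(0, 10, (64,), device=f"cuda:{rank}")
    loss = torch.nn.functional.cross_entropy(ddp_model(data), target)
    loss.backward()                                      # all-reduce happens here
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```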
## Chapter \#4\: Bottlenecks and Solutions
Before going further, the book lists the assumptions it makes in the following discussions. Notice that these assumptions are valid and most of the time hold in practice. \(At least for my system setup; see the [hardware reference](https://hackmd.io/9NN76OIXSCmPmEkX4IJt_w?view#Hardware).\)
* Accelerators are homogeneous.
    * Most systems use the same GPU model for all accelerators.
    * I'm rocking two Nvidia RTX A5000s.
* On-device memory is way smaller than system memory.
    * RAM\: 512GB vs. VRAM per GPU\: 24GB without ECC enabled.
* Cross-machine network bandwidth is significantly lower than the communication bandwidth among GPUs within a single machine.
    * Usually Ethernet vs. PCIe.
* Training happens on the GPUs, not the CPU.
* GPU P2P communication is faster than GPU-to-GPU communication routed through the host.
    * True for my system with NVLink; however, most systems will not have NVLink if they are not using data-center GPUs. \(Greedy green company. See [reference](https://hackmd.io/9NN76OIXSCmPmEkX4IJt_w?view#No-NVLink-in-Newer-Workstation-GPUs).\)
* GPU memory is the bottleneck compared to GPU compute horsepower.
### Tree All-Reduce
The following is a fully connected set of GPUs. Nvidia GPUs can usually connect to each other with NVLink.

Each GPU receives its assigned chunk of the gradient from the other three GPUs \(in the diagram, the first chunk\). These three parts can be aggregated, since together with the local copy they form that chunk of the full gradient.

Finally, each GPU sends its aggregated chunk to the others, and every GPU ends up with the full gradient.
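A hedged plain-Python illustration of the two phases described above for four workers (it only simulates the aggregation arithmetic, not the actual communication): each worker first aggregates its assigned chunk from everyone, then every worker collects all the aggregated chunks.
```python
n = 4
# gradients[g][c] = worker g's local value for chunk c
gradients = [[(g + 1) * 10 + c for c in range(n)] for g in range(n)]

# Phase 1: each worker i sums chunk i across all workers (reduce-scatter style).
reduced = [sum(gradients[g][c] for g in range(n)) for c in range(n)]

# Phase 2: every worker collects all the reduced chunks (all-gather style).
full_gradient = list(reduced)
print(full_gradient)   # every worker now holds the same fully reduced gradient
```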
:::success
:bulb: **Key Points to Remember**
> Reference\:
> * Distributed Machine Learning with Python
> https://www.tenlong.com.tw/products/9781801815697
>
> The following points explain how Tree All-Reduce is better than Ring All-Reduce:
> \(1\) Given an arbitrary network topology, the tree-based solution can create more concurrent data transfer channels than the ring-based solution. Therefore, the tree-based solution is faster.
> \(2\) The tree-based solution can leverage more network links than the ring-based solution. Therefore, the tree-based solution is more efficient.
:::
### Hybrid Data Transfer over PCIe \& NVLink
Load balancing the transfer between PCIe and NVLink.
> * Calculate the BW ratio: R = BW_NVLink / BW_PCIe.
> * Split the total data into 1+R chunks.
> * Concurrently transfer the data over PCIe and NVLink. The data size of PCIe should be 1/(1+R) of the total data, while the data size on NVLink should be R/(1+R) of the total data.
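A hedged worked example of the split; the bandwidth numbers are assumptions for illustration, not measurements from my system.
```python
bw_nvlink = 48.0    # GB/s, assumed NVLink bandwidth
bw_pcie = 16.0      # GB/s, assumed PCIe bandwidth
total_gb = 8.0      # total data to transfer

r = bw_nvlink / bw_pcie                   # R = 3
pcie_share = total_gb * 1 / (1 + r)       # 2 GB over PCIe   -> ~0.125 s
nvlink_share = total_gb * r / (1 + r)     # 6 GB over NVLink -> ~0.125 s
print(pcie_share, nvlink_share)           # both paths finish at the same time
```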
## Chapter \#5\: Splitting the Model
Model parallelism can be used when a model is too large to fit into a single GPU's memory. The following assumptions are used.
* For each NLP training job, we usually focus on the fine-tuning process and not the pre-training process.
* For the fine-tuning process, we use a much smaller training dataset than the one used in the pre-training process.
* We assume each job runs exclusively on a set of GPUs or other accelerators.
* We assume a model has enough layers to split across multiple GPUs.
* We assume we always have a pre-trained model available for fine-tuning.
### ELMo, BERT, and GPT
In this section, starting with RNNs, we are going to look into NLP models.
An RNN usually needs to maintain the state from previous inputs. It is just like memory for human beings.
#### Basic Concepts
* **One-to-many Problems**

Notice that in the diagram above, it is the same model at every step, receiving the state produced by the previous run.
E.g. image captioning
* **Many-to-One Problems**

E.g. Sentiment classification
* **Many-to-Many Problems**

The non-delayed version basically treats every input as a one-to-one problem. E.g. object detection
The delayed version is a model that requires the previous information before it starts generating output. E.g. translation problems
:::success
:bulb: **This section will be extend to become the following post**
https://hackmd.io/@Erebustsai/SJ6eVa9Flx
:::
### Pre-training and fine-tuning
We definitely **CANNOT** do `pre-training`, so there is no need to go further into that topic.
Fine-tuning can make the model do something related to its original function but more specialized. For example, a model doing next-sentence prediction can be fine-tuned into a question-answering model. Notice that the fine-tuning dataset will be significantly smaller than the pre-training dataset.
### Nvidia Workstations
> Reference\:
> * 2025 COMPUTEX DGX HGX MGX 剖析
> https://vocus.cc/article/683b0d4dfd897800014ba027
> * NVIDIA MGX 提供系統製造商模組化架構以滿足全球資料中心多樣化的加速運算需求
> https://blogs.nvidia.com.tw/blog/nvidia-mgx-server-specification/
> * NVIDIA DGX 與 NVIDIA HGX 解析兩者的差異
> https://block-jam.com/article/NVIDIA-DGX-and-HGX-explain-the-differences-between-them
## Chapter \#6\: Pipeline Input and Layer Split
The diagrams in this chapter only show the inference \(forward\) part. Backpropagation is the same thing, just with the order reversed.
### Vanilla Model Parallelism

Notice that in the diagram above, each GPU holds a layer block of the same size \(we assume all GPU nodes are the same model and the layer block sizes are the same as well\). We can see that each GPU is idle most of the time, and this idle percentage increases as more and more GPUs participate in the training process.
### Pipeline Input \/ Pipeline Paralellism
Basically, pipeline input divides the input batch into multiple micro-batches \(as many as there are GPUs\), and a micro-batch can be processed and passed to the next GPU without waiting for the entire batch to finish.

In pipeline parallelism, GPUs have to communicate more frequently. This can be mitigated by NVLink. A hedged sketch follows.
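The sketch below shows naive pipeline parallelism, assuming two GPUs; the two stages and the micro-batch count are my own placeholders, and no scheduling tricks (e.g. GPipe-style interleaving) are attempted. Because CUDA launches are asynchronous, GPU 0 can roughly start the next micro-batch while GPU 1 is still working on the previous one.
```python
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")   # first half of the model
stage1 = nn.Sequential(nn.Linear(64, 10)).to("cuda:1")              # second half of the model

def pipelined_forward(batch, n_micro=2):
    outputs = []
    for micro in batch.chunk(n_micro):             # split the batch into micro-batches
        h = stage0(micro.to("cuda:0"))             # stage 0 on GPU 0
        outputs.append(stage1(h.to("cuda:1")))     # stage 1 on GPU 1
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 32))
```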
### Layer Split
> Generally speaking, the data structure for holding each layer's neurons can be represented as matrices. One common function during NLP model training and serving is matrix multiplication. Therefore, we can split a layer's matrix in some way to enable in-parallel execution
Basically, this method splits the matrix multiplication across different GPUs. However, in my opinion this will definitely introduce some issues, since not every layer can be split this way. A hedged sketch follows.
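A hedged illustration (my own, not the book's code) of splitting one layer's weight matrix by output columns across two GPUs; the sizes are arbitrary.
```python
import torch

x = torch.randn(8, 32)                       # input activations
w = torch.randn(32, 64)                      # full weight matrix of one layer

w0, w1 = w.chunk(2, dim=1)                   # split the output columns in half
y0 = x.to("cuda:0") @ w0.to("cuda:0")        # GPU 0 computes half of the outputs
y1 = x.to("cuda:1") @ w1.to("cuda:1")        # GPU 1 computes the other half
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)   # gather the column halves

print(torch.allclose(y, x @ w, atol=1e-3))   # same result as the unsplit layer
```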
## Chapter \#7\: Implementing Model Parallel Training and Serving Workflows
> Reference\:
> * Official Document\: PyTorch Distributed Overview
> https://docs.pytorch.org/tutorials/beginner/dist_overview.html#parallelism-apis
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNet(nn.Module):
    def __init__(self):
        super(MyNet, self).__init__()
        self.seq1 = nn.Sequential(
            nn.Conv2d(1, 32, 3, 1),
            nn.Dropout2d(0.5),
            nn.Conv2d(32, 64, 3, 1),
            nn.Dropout2d(0.75)).to('cuda:0')
        self.seq2 = nn.Sequential(
            nn.Linear(9216, 128),
            nn.Linear(128, 20),
            nn.Linear(20, 10)).to('cuda:2')

    def forward(self, x):
        x = self.seq1(x.to('cuda:0'))
        x = F.max_pool2d(x, 2).to('cuda:1')
        x = torch.flatten(x, 1).to('cuda:1')
        x = self.seq2(x.to('cuda:2'))
        output = F.log_softmax(x, dim=1)
        return output
```
As the code above shows, model layers are placed on different GPUs by using `.to()`.
```python
for epoch in range(args.epochs):
    print(f"Epoch {epoch}")
    for idx, (data, target) in enumerate(trainloader):
        data = data.to('cuda:0')
        optimizer.zero_grad()
        output = model(data)
        target = target.to(output.device)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
        print(f"batch {idx} training :: loss {loss.item()}")
print("Training Done!")
```
In the code snippet above, we can see that the data is passed to *GPU0* with `.to('cuda:0')` and the `target` is sent to `output.device`. PyTorch automatically knows that the output is on *GPU2* and moves the target to *GPU2*.
:::success
Apparently, we just need to make sure that the input and output devices are marked correctly. For inference, it is the same thing.
`predict = output.argmax(dim=1, keepdim=True).to(output.device)`
:::
## Chapter \#8\: Achieving Higher Throughput and Lower Latency
### Freezing Layers
Assume that different layers in our training model converge at different stages of the training process. Those converged layers can be frozen to (see the sketch below):
* drop the intermediate results of those layers during forward propagation
* avoid generating gradients for them during backward propagation
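A hedged sketch of freezing a layer; the model and the choice of which layer to freeze are placeholders of my own.
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

for param in model[0].parameters():     # freeze the first Linear layer
    param.requires_grad = False         # no gradients are generated for it

# Only hand the still-trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1)
```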
### To CPU and Back
One way to deal with insufficient GPU memory is to naively push data back to the CPU with `.to('cpu')`.
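A tiny hedged sketch of this offloading idea (the tensor size is arbitrary):
```python
import torch

activation = torch.randn(4096, 4096, device="cuda:0")  # a large intermediate tensor
activation = activation.to("cpu")     # offload; the GPU copy is freed once unreferenced
# ... other GPU work that needs the memory ...
activation = activation.to("cuda:0")  # bring it back when it is needed again
```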
# Appendix
## AI Foundation \: Large Models and Distributed Training \(大模型與分佈式訓練\), Selected Topics
Reference \: https://www.youtube.com/playlist?list=PLuufbYGcg3p4cv3QJdZEw08EM0SP00B_1
*I only pick the topics that interest me from this part of the course.*
## Book : Hands-On Machine Learning with C++ by Kirill Kolodiazhnyi
> Reference \:
> * Hands-On Machine Learning with C++
> https://www.tenlong.com.tw/products/9781789955330
## Chapter \#1 \: Introduction to Machine Learning in C++
### My Other Post
* Eigen Programming Notes
https://hackmd.io/@Erebustsai/r16P2hb2C
I focus on using `Eigen`, so other libraries are only used when introduced by the book. I will even try to rewrite their examples in Eigen.
### API for Linear Algebra
:::success
:bulb: :key: **A look at the performance of expression templates in C++: Eigen vs Blaze vs Fastor vs Armadillo vs XTensor**
https://romanpoya.medium.com/a-look-at-the-performance-of-expression-templates-in-c-eigen-vs-blaze-vs-fastor-vs-armadillo-vs-2474ed38d982
:::
* `Eigen`
* `xtensor` \: The xtensor library is a C++ library for numerical analysis with multidimensional array expressions. Containers of xtensor are inspired by NumPy, the Python array programming library.
* `Blaze` \: Blaze is a general-purpose high-performance C++ library for dense and sparse linear algebra.
* `ArrayFire` \: ArrayFire is a general-purpose high-performance C++ library for parallel computing.
* `Dlib` \: Dlib is a modern C++ toolkit containing ML algorithms and tools for creating computer vision software in C++. Most of the linear algebra tools in Dlib deal with dense matrices. However, there is also limited support for working with sparse matrices and vectors. In particular, the Dlib tools represent sparse vectors using the containers from the C++ standard template library (STL).
* `mlpack`
## Chapter \#2 \: Data Processing
> This chapter discusses how to process popular file formats that we use for storing data. It shows what libraries exist for working with JavaScript Object Notation \(JSON\), Comma-Separated Values \(CSV\), and Hierarchical Data Format v5 \(HDF5\) formats. This chapter also introduces the basic operations required to load and process image data with the OpenCV and Dlib libraries and how to convert the data format used in these libraries to data types used in linear algebra libraries.
### My Other Post
* File Operation for the Parallel World
https://hackmd.io/vGkriUD3S-icJaVd4ubdug#File-Operation-for-the-Parallel-World
### Parsing data formats to C++ data structures
This chapter provides libraries and examples for loading data from files in different formats.
#### CSV
* `Fast-CPP-CSV-Parser` \: it is a small single-file header-only library with the minimal required functionality, which can be easily integrated into a development code base.
* `mlpack` \: load a CSV file into the matrix object.
#### JSON \: name-value pairs
* `nlohmann-json`
> Disadvantages are its slow parsing speed in comparison with binary formats and the fact that it is not very useful for representing numerical matrices.
#### HDF5
* `HighFive`
> HDF5 is a specialized file format for storing scientific data. This file format was developed to store heterogeneous multidimensional data with a complex structure. It provides fast access to single elements because it has optimized data structures for using secondary storage. Furthermore, HDF5 supports data compression. In general, this file format consists of named groups that contain multidimensional arrays of multitype data. Disadvantages are the requirement of specialized tools for editing and viewing by users, the limited support of type conversions among different platforms, and using a single file for the whole dataset.
### Initialize Matrix with Different Libraries
* Eigen
* Blaze
* Dlib
* ArrayFire
* mlpack
### Working with Images
* OpenCV
* Dlib