CNN CUDA 3.0 Development Notes

Introduction

在這次的程式中，我們將來探討有關於sparse matrix對於CNN的影響，進而利用sparse CNN的特性更進一步的效能優化。用來表示sparse的方式有很多種，如COO,CSR等。這次主要針對COO format進行優化，並提出一個我自己設計的架構FAST format 改善COO的缺點與提升整體的效能
在文中會針對上一次的program也一起加入比較
除了針對軟體層面的思維外，也會從硬體方面著手嘗試從硬體的層面將整體的系統效能優化

Current Record

- Configurations

- Performance

我們跟part1的優化版本(DualKernel_200)做比較，thread/block的configuration都一樣，實驗結果如下所示，可以發現效能部分，coo與FAST皆比之前的版本改善許多。FAST在資料傳輸與kernel運算的效能都比COO的效果來的好。Transmission部分是因為減少了matrix的傳輸數量，kernel部分是減少了global memory存取的時間，因此利用FAST format優化後的最終結果是約167倍。

Representaion of sparse matrix

對於sparse matrix有很多種的表示法，如COO,CSR等，我們這次是針對COO的資料格式進行優化並在最後跟我自行設計的FAST format比較與分析

whats is COO

when a matrix is filled with many 0 elements inside, we call it "sparse matrix". Therefore, we could use COO format to represent this sparse matix. For each sparse matrix, we will use 3 other array to represent it's non-zero data and also the row and column information of each non-zero data.

Sparity of CNN

Sparsity in filters

如果我們的filter是一個sparse的狀態，我們其實可以有80%的部分其實可不需要用到

Sparsity in Neurons

If we could get rid of zeros

Reduce the amount of computation dramatically
Reduce the time spend on transfering data between Host and Device
Could get even better performance

Implementation – COO format

對於CNN的平行化設計，我們可以藉由觀察base program 與CNN的架構來決定加速的演算法，如上圖所示。在我們的程式中主要有兩個kernel，分別是convolution kernel 與max pooling kernel。從base program中發現，每個Input Neurons與 filter是不相依的，因此可以在這個地方做平行化。

這次的COO我們主要是用在filter上面，neurons就用原本的資料格式。因為在coo的格式下面，可以直接存取filter裡面非零的值，對於filter的部分，每個thread不需要做512x3x3次，只要做512次即可，InNeu[inNeuIdx]我們另外加了一個判斷式，判斷是否為0。因此主要是加速Convolution 這個kernel的速度，結果如下所示：











for (int i = 0; i < 512; ++i)
{
    CooIdx = threadX*512 + i;

    ifmy = threadY - 3 / 2 + FiltCooRow[CooIdx];		
    ifmx = threadZ - 3 / 2 + FiltCooCol[CooIdx];		
    inNeuIdx = i * fmArea + ifmy * 32 + ifmx;	
    if(InNeu[inNeuIdx] != 0)
        if(ifmy >= 0 && ifmy < 32 && ifmx >= 0 && ifmx < 32)	
            sum += FiltCooData[CooIdx] * InNeu[inNeuIdx];
}

================ Result ===================
CPU time for executing a typical convolutional layer = 17160.6ms
GPU time for executing a typical convolutional layer = 123.628ms
Congratulations! You pass the check.
Speedup: 138.808
=====================================================

由於這次的效能計算必須把資料傳輸時間也一起加入考慮，因此想要提升效能有兩個地方，kernel的效能與資料傳輸的時間。由nvvp所分析的結果如下圖，可以發現在資料傳輸的部分花了不少時間。

原因是coo的特性，他會用4個matrix去表示一個sparse matrix，分別是data、row、col與nnz，因此資料傳輸的時間就會比原本的多出許多。




























void initCooMemoryCopy()
{
	int filtCOOVol = sizeof(short)*FILTNUM*FMDEPTH; 	//512x512x1
	int NeuCOOVol = sizeof(short)*FMDEPTH;

	cudaMalloc(&devfiltCooNNZ, filtCOOVol);	//short input COO to kernel

	//input COO to kernel filter
	cudaMalloc(&devfiltCooNNZ, filtCOOVol);
	cudaMalloc(&devfiltCooData, filtCOOVol);
	cudaMalloc(&devfiltCooRow, filtCOOVol);
	cudaMalloc(&devfiltCooCol, filtCOOVol);
	// input COO to kernel neurons
	cudaMalloc(&devinNeuCooNNZ, NeuCOOVol);
	cudaMalloc(&devinNeuCooData, NeuCOOVol);
	cudaMalloc(&devinNeuCooRow, NeuCOOVol);
	cudaMalloc(&devinNeuCooCol, NeuCOOVol);

	cudaMemcpy(devfiltCooNNZ, filtCooNNZ, filtCOOVol, cudaMemcpyHostToDevice );
	cudaMemcpy(devfiltCooData, filtCooData, filtCOOVol, cudaMemcpyHostToDevice );
	cudaMemcpy(devfiltCooRow, filtCooRow, filtCOOVol, cudaMemcpyHostToDevice );
	cudaMemcpy(devfiltCooCol, filtCooCol, filtCOOVol, cudaMemcpyHostToDevice );

	cudaMemcpy(devinNeuCooNNZ, inNeuCooNNZ, NeuCOOVol, cudaMemcpyHostToDevice );
	cudaMemcpy(devinNeuCooData, inNeuCooData, NeuCOOVol, cudaMemcpyHostToDevice );
	cudaMemcpy(devinNeuCooRow, inNeuCooRow, NeuCOOVol, cudaMemcpyHostToDevice );
	cudaMemcpy(devinNeuCooCol, inNeuCooCol, NeuCOOVol, cudaMemcpyHostToDevice );
}

從kernel的部分來看，他在存取global memory的次數也會增加，如下圖所示，Memory的存取佔最多的時間

- nvprof result

==31878== Profiling application: ./cnnConvLayer
==31878== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.26%  43.087ms         1  43.087ms  43.087ms  43.087ms  convLayerGPU(short*, short*, short*, short*, int*, int*)
  1.20%  527.99us         9  58.665us     832ns  172.57us  [CUDA memcpy HtoD]
  0.35%  153.50us         1  153.50us  153.50us  153.50us  MaxPoolingGPU(int*, int*)
  0.18%  80.158us         1  80.158us  80.158us  80.158us  [CUDA memcpy DtoH]

==31878== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 63.89%  79.152ms        12  6.5960ms  3.4230us  78.732ms  cudaMalloc
 34.91%  43.247ms         1  43.247ms  43.247ms  43.247ms  cudaDeviceSynchronize
  0.89%  1.1068ms        10  110.68us  6.4950us  201.98us  cudaMemcpy
  0.19%  241.51us        83  2.9090us     698ns  81.226us  cuDeviceGetAttribute
  0.03%  37.156us         2  18.578us  8.7300us  28.426us  cudaLaunch
  0.03%  31.778us         1  31.778us  31.778us  31.778us  cuDeviceTotalMem
  0.02%  27.797us         1  27.797us  27.797us  27.797us  cuDeviceGetName
  0.02%  20.603us        12  1.7160us  1.0480us  8.0320us  cudaFree
  0.01%  10.405us         8  1.3000us     698ns  4.8890us  cudaSetupArgument
  0.00%  2.7230us         2  1.3610us     768ns  1.9550us  cuDeviceGetCount
  0.00%  2.4450us         2  1.2220us     908ns  1.5370us  cudaConfigureCall
  0.00%  1.7460us         2     873ns     768ns     978ns  cuDeviceGet

Implementation – FAST format

因此我們主要針對兩個部分進行優化，第一個是coo在資料傳輸(HtoD)的時間與kernel對於memory存取的時間。我們基於coo的格式設計的一個可以有效降低資料傳輸時間的FAST format，如下圖所示。

我們將coo的data、row、col壓縮到同一個matrix裡面，將data放在百位數之後，row放在十位數，col放在個位數，因此我們在host to device的memory malloc時就可以大幅減少傳輸的時間與call API的次數。如此設計有兩個好處，第一個是在HtoD時只需要傳輸一個matrix，另一個好處是可以減少kernel對global memory的存取次數，對以往的coo每個thread需要存取3次以上，但用FAST format只需要存取1次即可，存取完之後，在讓每個thread分別decode出個別的row、col與data。

FAST Format















void initFastFormat()
{
	int tempdata = 0;

	for (int j = 0; j < FILTNUM*FMDEPTH; j++)
		filtFastData[j] = 0;

	for (int i = 0; i < FILTNUM*FMDEPTH; i++)
	{
		tempdata = filtCooData[i];
		filtFastData[i] = tempdata*100; 
		filtFastData[i] += filtCooRow[i]*10; 
		filtFastData[i] += filtCooCol[i];
	}
}

Kernel

















for (int i = 0; i < 512; ++i)
{
    FastIdx = threadX*512 + i;
    data = FiltFastData[FastIdx];

    col = data % 10;
    data = (data - col)/10;
    row = data%10;
    data = (data - row)/10;


    ifmy = threadY - 3 / 2 + row;		
    ifmx = threadZ - 3 / 2 + col;		
    inNeuIdx = i * fmArea + ifmy * 32 + ifmx;	
    if(ifmy >= 0 && ifmy < 32 && ifmx >= 0 && ifmx < 32)	
        sum += data * InNeu[inNeuIdx];
}

由上圖可以看出在資料傳輸方面減少了call API的次數，意即減少傳輸的時間。





void initFASTMemoryCopy()
{
	cudaMalloc(&devfiltFastData, sizeof(int)*FILTNUM*FMDEPTH);
	cudaMemcpy(devfiltFastData, filtFastData, sizeof(int)*FILTNUM*FMDEPTH, cudaMemcpyHostToDevice);
}

另一方面，從kernel的角度來看：

減少了global memory存取的次數，但另一方面卻增加了arithmetic的時間，因為需要花額外的時間去decode FAST Format。

從上圖可以發現，FAST format也可以有效的降低kernel對memory的傳輸頻寬，從coo的140.915 (GB/s)降低到136.759 (GB/s)。因此FAST format的實驗結果如下所示，資料傳輸的時間與kernel的時間都有降低。

================ Result ===================
CPU time for executing a typical convolutional layer = 16438.6ms
GPU time for executing a typical convolutional laygter = 105.128ms
Congratulations! You pass the check.
Speedup: 156.368
=====================================================

但FAST format還是有個缺點，就是由於我們必須要把COO的三個matrix全部壓縮到同一個matrix裡面，難免會有爆掉的情況，因為COO的matrix全部都是short的變數型態，但short的變數型態只有16 bits，也就是說當COO的data只要超過655，那這樣FAST的matrix就會爆掉，因此我們只好犧牲容量，將FAST的變數型態宣告成int來解決會爆掉的情況，但相對的，在HtoD的malloc部分就會比short需要花上更多的時間。

Source code

FAST branch https://github.com/s880367/Sparse_CNN/tree/master
COO branch https://github.com/s880367/Sparse_CNN/tree/coo