> * OpenCL Parallel Programming Development Cookbook
> https://www.researchgate.net/publication/260792312_OpenCL_Parallel_Programming_Development_Cookbook
> * Parallel Programming Patterns in CUDA
> https://www.amazon.com/dp/1788996240
> * Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0
*I am reading three books about GPGPU with CUDA C and OpenCL. In this post, I will present implementations of the following algorithms on at least one platform \(CUDA C or OpenCL\).*
## Hardware Used
```bash
Platform Name NVIDIA CUDA
Device Name Tesla P40
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 3.0 CUDA
Device UUID fe961b5a-b59b-3a4b-b15f-ba94324892e6
Driver UUID fe961b5a-b59b-3a4b-b15f-ba94324892e6
Valid Device LUID Yes
Device LUID ecf2-010000000000
Device Node Mask 0x1
Device Numeric Version 0xc00000 (3.0.0)
Driver Version 537.13
Device OpenCL C Version OpenCL C 1.2
Device OpenCL C all versions OpenCL C 0x400000 (1.0.0)
OpenCL C 0x401000 (1.1.0)
OpenCL C 0x402000 (1.2.0)
OpenCL C 0xc00000 (3.0.0)
Device OpenCL C features __opencl_c_fp64 0xc00000 (3.0.0)
__opencl_c_images 0xc00000 (3.0.0)
__opencl_c_int64 0xc00000 (3.0.0)
__opencl_c_3d_image_writes 0xc00000 (3.0.0)
```
NVIDIA GPUs only support OpenCL C 1.2 \(see `Device OpenCL C Version` above\). Please refer to [OpenCL Programming Notes](https://hackmd.io/fWapRSunQGS_sCvyMp-3Eg?view)
## 1. Histogram
### Sequential Code
```cpp
unique_ptr<uint[]> data = make_unique<uint[]>(DATASIZE); // make_unique<T[]> takes an element count, not a byte size
unique_ptr<uint[]> bin = make_unique<uint[]>(BINSIZE);

for (int i = 0; i < DATASIZE; ++i) {
    data[i] = rand() % BINSIZE;
}
std::fill_n(bin.get(), BINSIZE, 0x0);

for (int i = 0; i < DATASIZE; ++i) {
    bin[data[i]]++;
}

uint sum = 0;
for (int i = 0; i < BINSIZE; ++i) {
    sum += bin[i];
}
```
This code is *O(N)*. The data array is accessed sequentially in a cache-friendly manner *\(fetched from DRAM to cache\)*, and the bin array *\(256 \* 4 bytes\)* fits into the **L1 cache**. Therefore, this **CPU code is memory bound**.
```bash
Model name: Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz
Stepping: 2
CPU MHz: 2294.685
BogoMIPS: 4589.37
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 576 KiB
L1i cache: 576 KiB
L2 cache: 4.5 MiB
L3 cache: 45 MiB
```
:::info
:information_source: **Extra Material : GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell**
https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
:::
### Brief of the Implementation
A naive implementation simply uses **atomic** adds on global memory to compute the output directly. Atomic contention then occurs among threads across the entire grid.
:::info
:information_source: **Latency and throughput of atomic operations**
> Programming Massively Parallel Processors Page 198 Chapter 9.3
*Please refer to the book for details.*
:::
A better solution is to have each Thread Block keep its own copy of a partial histogram and finally merge all the partial histograms together. This reduces the contention scope to within each Thread Block **\(fewer contenders compared to the entire grid\)**.
**Shared Memory *\(CUDA\)* / Local Memory *\(OpenCL\)* should be used.**
In the following code, **the first kernel creates N partial histograms** and **the second kernel sums them up into the final histogram**.
### Code Snippets
*Please refer to the github repository for the entire code.*
```cpp
#define NULL 0

__kernel void dHistogram(global int* X, global int* Y)
{
    int globalIdx = get_global_id(0);
    int globalSize = get_global_size(0);
    int localIdx = get_local_id(0);
    int localSize = get_local_size(0);
    int groupIdx = get_group_id(0);
    int groupSize = get_num_groups(0);

    __local int sharedBins[256];

    // Initialize local memory to 0
    for (int i = localIdx; i < 256; i += localSize) {
        sharedBins[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    atomic_inc(&sharedBins[X[globalIdx]]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write this work-group's partial histogram to global memory
    for (int i = localIdx; i < 256; i += localSize) {
        Y[groupIdx * 256 + i] = sharedBins[i];
    }
}

/*
 * @brief Should set the work size to 256
 */
__kernel void dHistogramFinalAccum256(global int* X, global int* Y, int numPartition)
{
    int globalIdx = get_global_id(0);

    int total = 0;
    for (int i = 0; i < numPartition; ++i) {
        total += X[i * 256 + globalIdx];
    }
    Y[globalIdx] = total;
}
```
```cpp
// main.cpp
...
cl_int error;
cl_event launchEvent;
size_t gws[3]{ DATASIZE, 0, 0 };
size_t lws[3]{ LOCALWORKSIZE, 0, 0 };

unique_ptr<cl_int[]> outputBins = make_unique<cl_int[]>(256 * ((DATASIZE) / (LOCALWORKSIZE)));

cl_mem dX = clCreateBuffer(dContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, DATASIZE * sizeof(cl_int), data.get(), &error);
if (error != CL_SUCCESS) {
    perror("CL buffer create & to-device error");
    exit(EXIT_FAILURE);
}
cl_mem dY = clCreateBuffer(dContext, CL_MEM_WRITE_ONLY, (256 * ((DATASIZE) / (LOCALWORKSIZE))) * sizeof(cl_int), nullptr, &error);
if (error != CL_SUCCESS) {
    perror("CL buffer create error");
    exit(EXIT_FAILURE);
}

clSetKernelArg(kHistogram, 0, sizeof(cl_mem), static_cast<void*>(&dX));
clSetKernelArg(kHistogram, 1, sizeof(cl_mem), static_cast<void*>(&dY));

error = clEnqueueNDRangeKernel(dQueue, kHistogram, 1, nullptr, gws, lws, 0, nullptr, &launchEvent);
if (error != CL_SUCCESS) {
    perror("CL kernel runtime error");
    exit(EXIT_FAILURE);
}
clWaitForEvents(1, &launchEvent);

error = clEnqueueReadBuffer(dQueue, dY, CL_TRUE, 0, (256 * ((DATASIZE) / (LOCALWORKSIZE))) * sizeof(cl_int), outputBins.get(), 0, nullptr, nullptr);
if (error != CL_SUCCESS) {
    perror("CL buffer to-host error");
    exit(EXIT_FAILURE);
}
...
cl_mem dhistogramBin = clCreateBuffer(dContext, CL_MEM_WRITE_ONLY, 256 * sizeof(cl_int), nullptr, &error);
if (error != CL_SUCCESS) {
    perror("CL buffer create error");
    exit(EXIT_FAILURE);
}

cl_int numPartition = ((DATASIZE) / (LOCALWORKSIZE));
clSetKernelArg(kHistogramFinalAccum256, 0, sizeof(cl_mem), static_cast<void*>(&dY));
clSetKernelArg(kHistogramFinalAccum256, 1, sizeof(cl_mem), static_cast<void*>(&dhistogramBin));
clSetKernelArg(kHistogramFinalAccum256, 2, sizeof(cl_int), static_cast<void*>(&numPartition));

gws[0] = 256;
lws[0] = 1;
error = clEnqueueNDRangeKernel(dQueue, kHistogramFinalAccum256, 1, nullptr, gws, lws, 0, nullptr, &launchEvent);
if (error != CL_SUCCESS) {
    perror("CL kernel runtime error");
    exit(EXIT_FAILURE);
}
clWaitForEvents(1, &launchEvent);

error = clEnqueueReadBuffer(dQueue, dhistogramBin, CL_TRUE, 0, 256 * sizeof(cl_int), histogramBin.get(), 0, nullptr, nullptr);
if (error != CL_SUCCESS) {
    perror("CL buffer to-host error");
    exit(EXIT_FAILURE);
}
...
```
In the above code, we can see that `dY` is reused without releasing the memory object in between. Additionally, `launchEvent`, `error`, `gws`, and `lws` are reused as well.
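Eventually the buffers and the event do have to be released; a minimal sketch of the cleanup that would follow the excerpt above (not shown in it):
```cpp
// Release the OpenCL objects once the host copies are complete.
clReleaseEvent(launchEvent);
clReleaseMemObject(dX);
clReleaseMemObject(dY);
clReleaseMemObject(dhistogramBin);
```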
:::success
:bulb: **Programming Tips : Traversing or Initializing a Data Array with an Arbitrary Number of Threads**
> Reference : https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
```cpp
// initialize temporary accumulation array in global memory
unsigned int *gmem = out + g * NUM_BINS; // corrected from the original post
for (int i = t; i < 3 * NUM_BINS; i += nt) gmem[i] = 0;
```
This way, all the data needed is **distributed** evenly \(with big enough data\) **to all threads in the block**, and each element is initialized exactly once.
:::
:::success
:bulb: **Programming Tips : Aggregation Update**
> Programming Massively Parallel Processors Page 206 Chapter 9.6
If there is a **frequently appearing element or a large concentration of identical values in an area** *\(example from the book: a blue-sky picture\)*, we can **batch the updates together** and reduce the number of atomic operations. However, this requires more conditional instructions and more variables, which can lead to branching and reduced occupancy.
:::
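As a concrete illustration, here is a minimal sketch of an aggregated update, assuming a variant of the kernel above in which each thread walks several input elements with a grid-stride loop (the `dataSize` parameter and the accumulator variables are illustrative, not from the repository):
```cpp
// Aggregated shared-memory update: consecutive identical values are batched
// into one atomic_add instead of one atomic operation per element.
int prevBin = -1;     // bin of the previously read element
int accumulator = 0;  // number of consecutive elements that fell into prevBin
for (int i = globalIdx; i < dataSize; i += globalSize) {
    int curBin = X[i];
    if (curBin == prevBin) {
        ++accumulator;                                      // extend the current batch
    } else {
        if (accumulator > 0)
            atomic_add(&sharedBins[prevBin], accumulator);  // flush the finished batch
        prevBin = curBin;
        accumulator = 1;
    }
}
if (accumulator > 0)
    atomic_add(&sharedBins[prevBin], accumulator);          // flush the last batch
```
As the book notes, the extra branches and state can cause divergence and register pressure, so this only pays off when long runs of identical values are common.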
## 2. Matrix Multiplication
### Sequential Implementation

```cpp
/*
 * @brief Sequential Matrix Multiplication with Arbitrary Size
 */
void matrixMulF32(float* C, const float* A, const float* B, size_t H, size_t W, size_t inner)
{
    for (size_t i = 0; i < H; ++i)
        for (size_t j = 0; j < W; ++j) {
            float sum = 0;
            for (size_t k = 0; k < inner; ++k) {
                float a = A[i * inner + k];
                float b = B[k * W + j];
                sum += a * b;
            }
            C[i * W + j] = sum;
        }
}
```
### Naive Implementation
This implementation is a direct mapping of the sequential version of the matrix multiplication.

```cpp=
__kernel void dMatrixMultiplication(int H, int W, int inner, __global float* A, __global float* B, __global float* C)
{
    int i = get_global_id(0); // Height
    int j = get_global_id(1); // Width

    if (i < H && j < W) {
        float sum = 0.0f;
        for (int k = 0; k < inner; ++k) {
            sum += A[i * inner + k] * B[k * W + j];
        }
        C[i * W + j] = sum; // store inside the boundary check to avoid out-of-bounds writes
    }
}
```
:::success
:bulb: **Implementation Details**
A boundary check is provided in the kernel. Instead of relying directly on the global work sizes, the `Height` and `Width` are passed as parameters. This allows choosing a global work size that better matches the local work size, which can provide better performance.
:::
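For example, a minimal host-side sketch of how the global work size could be padded up to a multiple of a chosen local work size while the real `H`/`W` are still passed to the kernel (the `roundUp` helper and the 16x16 local size are illustrative):
```cpp
// Round x up to the next multiple of base.
size_t roundUp(size_t x, size_t base) { return ((x + base - 1) / base) * base; }

size_t lws2D[3]{ 16, 16, 0 };
size_t gws2D[3]{ roundUp(H, lws2D[0]), roundUp(W, lws2D[1]), 0 };
// The extra threads created by the padding fail the boundary check in the kernel
// and simply do nothing.
```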
### Alternative Implementation *Thread Coarsening*
This implementation makes each thread work on more elements. This means a reduced work size, but lower performance if the hardware can handle the number of threads forked in the original kernel. However, if the work size is too big for the hardware, a smaller work size can sometimes improve performance; some limited hardware has fewer compute units.
```cpp
__kernel void dMatrixMultiplication_threadCoarsening(int H, int W, int inner, __global float* A, __global float* B, __global float* C)
{
    int i = get_global_id(0); // Height
    float sum = 0.0f;

    if (i < H) {
        for (int j = 0; j < W; ++j) {
            sum = 0.0f;
            for (int k = 0; k < inner; ++k) {
                sum += A[i * inner + k] * B[k * W + j];
            }
            C[i * W + j] = sum;
        }
    }
}
```
### Improve Performance with Data Prefetching and Shared Memory
In this implementation, registers are used to prefetch a row of matrix A, and shared memory is used to store a column of matrix B.
```cpp
__kernel void dMatrixMultiplication_shared(int H, int W, int inner, __global float* A, __global float* B, __global float* C, __local float* shared)
{
    int globalIdx = get_global_id(0);
    int localIdx = get_local_id(0);
    int localSize = get_local_size(0);

    float sum = 0.0f;
    float regData[1024]; // assumes inner <= 1024; large private arrays may spill out of registers

    if (globalIdx < H) { // assumes H is a multiple of the local size so every thread reaches the barriers
        for (int k = 0; k < inner; ++k) {
            regData[k] = A[globalIdx * inner + k]; // Store a row of A in registers (row stride is inner, not H)
        }
        for (int j = 0; j < W; ++j) {
            for (int k = localIdx; k < inner; k += localSize)
                shared[k] = B[k * W + j];
            barrier(CLK_LOCAL_MEM_FENCE);

            sum = 0.0f;
            for (int k = 0; k < inner; ++k) {
                sum += shared[k] * regData[k];
            }
            C[globalIdx * W + j] = sum;
            barrier(CLK_LOCAL_MEM_FENCE); // wait before the next column overwrites shared
        }
    }
}
```
### Improve Performance with Tiling

In the above image, the orange parts are the elements that participate in the computation of the output matrix element, and the yellow parts represent the participating tiles.
:::info
:bulb: **Tiling Implementation**
A tiling implementation makes all the computation happen while both operands are in shared memory, as sketched below.
:::
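A minimal sketch of such a tiled kernel, assuming square `TILE_DIM` x `TILE_DIM` work-groups (this sketch is not taken from the repository):
```cpp
#define TILE_DIM 16

__kernel void dMatrixMultiplication_tiled(int H, int W, int inner,
                                          __global const float* A,
                                          __global const float* B,
                                          __global float* C)
{
    __local float tileA[TILE_DIM][TILE_DIM];
    __local float tileB[TILE_DIM][TILE_DIM];

    int i = get_global_id(0); // Height (row of C)
    int j = get_global_id(1); // Width  (column of C)
    int li = get_local_id(0);
    int lj = get_local_id(1);

    float sum = 0.0f;
    int numTiles = (inner + TILE_DIM - 1) / TILE_DIM;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE_DIM + lj;
        int bRow = t * TILE_DIM + li;

        // Load one tile of A and one tile of B into local memory (0 outside the matrix).
        tileA[li][lj] = (i < H && aCol < inner) ? A[i * inner + aCol] : 0.0f;
        tileB[li][lj] = (bRow < inner && j < W) ? B[bRow * W + j] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Both operands now live in local memory.
        for (int k = 0; k < TILE_DIM; ++k)
            sum += tileA[li][k] * tileB[k][lj];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (i < H && j < W)
        C[i * W + j] = sum;
}
```
Each iteration loads one tile of A and one tile of B into local memory, synchronizes, and accumulates the partial dot product, so every element fetched from global memory is reused `TILE_DIM` times.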
## 3. Convolution
*Please refer to AI Framework Github Repository for the implementation of Convolution Neural Network Layer.*
### Special Convolution : A Sobel Kernel
The Sobel operation can be considered a special type of convolution. It uses two specific kernels, one for the *x* axis and one for the *y* axis, to detect edges in an image. A general convolution, on the other hand, does not have fixed kernels. Therefore, a Sobel kernel allows more optimization because it is a simpler, fixed algorithm.
In Sobel, the following kernels are used together.

**Implementation \: Indexing of the output**
```bash
i00 i10 i20
i01 i11 i21
i02 i12 i22
```
In the above indexing, the center point of the Sobel kernel is i11. This makes accessing all the participating pixels much easier. For a general convolution operation, this assumption cannot be applied. Additionally, the assumption that border pixels can be skipped is not necessarily true for convolution operations.
```cpp
gradient_x = i00 + convert_float4(2.0f) * i10 + i20 - i02 - convert_float4(2.0f) * i12 - i22;
gradient_y = i00 - i20 + convert_float4(2.0f) * i01 - convert_float4(2.0f) * i21 + i02 - i22;
```
In the above code snippet, instead of using nested loops to apply the kernel, we can compute the result directly since the kernel values are already known.
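The two gradients are then typically combined into a single edge strength per pixel; a minimal sketch (the 8-bit clamping range is an assumption):
```cpp
// Combine the x and y gradients into one edge magnitude per channel.
float4 magnitude = sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
// Clamp to the displayable range before writing the output pixel (assuming an 8-bit image).
magnitude = clamp(magnitude, 0.0f, 255.0f);
```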
### Simple Convolution Implementation
Similar to the naive version in `CUDA C AI Framework`, this implementation is a direct mapping of the sequential implementation.
```cpp
__kernel void dConvolution(int H, int W, int fSize, __global float* input, __global float* output, __global float* filter)
{
    int globalIdx_x = get_global_id(0);
    int globalIdx_y = get_global_id(1);
    int padSize = fSize / 2;

    float sum = 0.0f;
    for (int fH = -padSize; fH <= padSize; ++fH) {
        for (int fW = -padSize; fW <= padSize; ++fW) {
            int iW = globalIdx_x + fW;
            int iH = globalIdx_y + fH;

            float iValue = 0.0f;
            if (iH >= 0 && iH < H && iW >= 0 && iW < W) {
                iValue = input[iH * W + iW];
            }
            float fValue = filter[(fH + padSize) * fSize + fW + padSize];
            sum += iValue * fValue;
        }
    }
    if (globalIdx_x < W && globalIdx_y < H) // guard the store in case the global work size is padded
        output[globalIdx_y * W + globalIdx_x] = sum;
}
```
### Improve Performance with Shared Memory and Tiling
In this implementation, the filter \(kernel\) is stored in constant memory since it will not be modified, and a block of the input is stored in shared memory. Therefore, the computation reads from shared memory and constant memory instead of global memory.
:::info
:bulb: **Shared Memory \/ Constant Memory Size Consideration**
In the `CUDA C AI Framework` implementation, I only put the weights \(filter\) that are reused multiple times in shared memory. The reason is that shared memory is not large enough to hold the entire weights of a convolutional layer, since they include input and output channels.
:::
```cpp
int pSize = fSize / 2;
int tSize = BLOCK_DIM + (2 * pSize);

float sum = 0.0f;
for (int bH = 0; bH < (tSize + (BLOCK_DIM - 1)) / BLOCK_DIM; ++bH) {
    for (int bW = 0; bW < (tSize + (BLOCK_DIM - 1)) / BLOCK_DIM; ++bW) {
        int iH = globalIdx_y + (BLOCK_DIM * bH) - pSize;
        int iW = globalIdx_x + (BLOCK_DIM * bW) - pSize;
        int fH = localIdx_y + (BLOCK_DIM * bH);
        int fW = localIdx_x + (BLOCK_DIM * bW);

        if (fH >= tSize || fW >= tSize)
            continue;

        shared[fH * tSize + fW] = 0.0f;
        if (iH >= 0 && iH < H && iW >= 0 && iW < W)
            shared[fH * tSize + fW] = input[iH * W + iW];
    }
}
barrier(CLK_LOCAL_MEM_FENCE);
In the above code snippet, a block of the input is stored in shared memory. Notice that the padding region must be included so that all participating elements are available. Notice also that the block size is predefined \(`#define BLOCK_DIM (16)`\). This could be implemented with `get_local_size(dim)` instead of a fixed value; however, for performance, it should be configured according to the device specs.
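After the barrier, the accumulation can read everything from `shared` and the filter. A minimal sketch of the compute phase, reusing the variables from the snippet above and assuming `filter` is a `__constant` pointer:
```cpp
// Accumulate the convolution from the tile held in local memory.
for (int fH = 0; fH < fSize; ++fH) {
    for (int fW = 0; fW < fSize; ++fW) {
        sum += shared[(localIdx_y + fH) * tSize + (localIdx_x + fW)]
             * filter[fH * fSize + fW];
    }
}
if (globalIdx_y < H && globalIdx_x < W)
    output[globalIdx_y * W + globalIdx_x] = sum;
```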
## 5. Prefix Scan
:::success
:bulb: **Chapter 39. Parallel Prefix Sum \(Scan\) with CUDA**
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
:::
Compared to a reduction, a prefix scan needs to produce a value for every element of the array.
### Naive Implementation

```cpp
__kernel void dPrefixScan(int Size, __global float* input, __global float* output)
{
    int globalIdx = get_global_id(0);

    float sum = 0.0f;
    for (int offset = 0; offset < Size; ++offset) {
        if (globalIdx - offset >= 0) { // only accumulate valid (non-negative) indices
            sum += input[globalIdx - offset];
        }
    }
    output[globalIdx] = sum; // inclusive scan result for this position
}
```
### Blelloch Scan
The following diagram shows the calculation of the Blelloch Scan. Please refer to my GitHub repository for the code.

To achieve better performance, we use shared memory and compute the entire result in it. Notice that we launch half as many threads as there are data elements, which means each work-group can process twice as much data as its thread count. However, using shared memory keeps the largest data size bounded by the device's `Max work item sizes`.
```bash
# Nvidia Tesla P40 24G
Max work item sizes 1024x1024x64
```
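A minimal single work-group sketch of the Blelloch scan, adapted from the GPU Gems 3 chapter linked above (the repository version may differ); it assumes `Size` is a power of two, the local size is `Size / 2`, and `temp` is a local buffer of `Size` floats:
```cpp
__kernel void dBlellochScan(int Size, __global const float* input,
                            __global float* output, __local float* temp)
{
    int tid = get_local_id(0);

    // Each thread loads two elements into local memory.
    temp[2 * tid]     = input[2 * tid];
    temp[2 * tid + 1] = input[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums in place.
    int offset = 1;
    for (int d = Size >> 1; d > 0; d >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Clear the last element, then down-sweep to produce an exclusive scan.
    if (tid == 0) temp[Size - 1] = 0.0f;
    for (int d = 1; d < Size; d <<= 1) {
        offset >>= 1;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write the scanned values back to global memory.
    output[2 * tid]     = temp[2 * tid];
    output[2 * tid + 1] = temp[2 * tid + 1];
}
```
The up-sweep builds partial sums in place and the down-sweep turns them into an exclusive prefix sum; both phases synchronize with `barrier(CLK_LOCAL_MEM_FENCE)` between steps.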
:::info
:bulb: **Barriers in OpenCL**
https://stackoverflow.com/questions/6890302/barriers-in-opencl
:::
## 6. N-body
> Reference \:
> https://yangwc.com/2019/06/20/NbodySimulation/
N-body is a common simulation of interacting particles. In this example, Newton's law of universal gravitation is used as the interaction force. Every particle is affected by all the others.
:::info
:bulb: **Vector Types**
https://www.informit.com/articles/article.aspx?p=1732873&seqNum=3
:::
### Implementation
Every particle needs to accumulate the interaction force from all the others. Therefore, this calculation can be seen as populating an entire square matrix \(*O(N^2)*\). We could parallelize over the entire square matrix; however, the amount of computation per element is too small compared to the memory accesses. Instead, we only parallelize across particles, and each thread loops over all the particles \(*O(N)* per thread\). Since a thread needs to traverse all the particles, we can use local memory to load and access the particles block by block.
```cpp
for (int tileIdx = 0; tileIdx < get_num_groups(0); ++tileIdx) {
    float4 temp = pos[tileIdx * get_local_size(0) + get_local_id(0)];
    shared_pos[get_local_id(0)].x = temp.x;
    shared_pos[get_local_id(0)].y = temp.y;
    shared_pos[get_local_id(0)].z = temp.z;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < BLOCK_DIM; ++i) {
        float dist_x = shared_pos[i].x - pos[globalIdx].x;
        float dist_y = shared_pos[i].y - pos[globalIdx].y;
        float dist_z = shared_pos[i].z - pos[globalIdx].z;

        float distSqr = pow(dist_x, 2) + pow(dist_y, 2) + pow(dist_z, 2) + EPSILON;
        float invDist = rsqrt(distSqr);
        float invDistCube = pow(invDist, 3);

        f.x += dist_x * invDistCube;
        f.y += dist_y * invDistCube;
        f.z += dist_z * invDistCube;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
```
## 7. Sparse Matrix
:::info
:information_source: **Sparse Matrix Definition**
https://hackmd.io/@Aquamay/H1nxBOLcO/https%3A%2F%2Fhackmd.io%2F%40Aquamay%2FSyd8UdLqu
:::
### Design Consideration of Sparse Matrix Structure
> Reference \: Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0
* Space Efficiency \: The amount of memory required.
* Flexibility \: The cost of modifying the nonzero values in the matrix.
* Accessibility
* Memory Access Efficiency
* Load balancing \: The work distribution of a multi-threaded program.
### SpMV
SpMV stands for sparse matrix-vector multiplication and accumulation.

### COO Format
The COO \(coordinate list\) format removes all the zeros from the data representation. However, it introduces overhead because the row and column index of each nonzero must be stored.
> Reference \: Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0

### COO Format SpMV
The naive approach assigns each nonzero value to a thread. Notice \(refer to the above image\) that the column index of a nonzero value is also used to access the vector to multiply with \(the vertical vector should be flattened\), and the row index maps to the store index.
```cpp
// value is the non-zero value
atomicAdd(&y[row], x[col] * value);
```
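The `atomicAdd` above is CUDA; OpenCL C 1.2 has no native floating-point atomics, so a full OpenCL kernel would need to emulate it. A minimal sketch, with illustrative buffer names and a compare-and-swap based float add (a common workaround, not from the book):
```cpp
// Emulate a float atomic add with atomic_cmpxchg on the bit pattern.
inline void atomicAddFloat(volatile __global float* addr, float val)
{
    float expected, sum;
    do {
        expected = *addr;
        sum = expected + val;
    } while (atomic_cmpxchg((volatile __global uint*)addr,
                            as_uint(expected), as_uint(sum)) != as_uint(expected));
}

// One thread per nonzero element.
__kernel void dSpMV_COO(int nnz, __global const int* rowIdx, __global const int* colIdx,
                        __global const float* value, __global const float* x,
                        __global float* y)
{
    int i = get_global_id(0);
    if (i < nnz)
        atomicAddFloat(&y[rowIdx[i]], value[i] * x[colIdx[i]]);
}
```
Heavy contention on `y[row]` only happens when many nonzeros share a row, which is exactly the case CSR avoids by assigning a whole row to one thread.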
### COO Format Analysis
* **Space Efficiency** \: Overhead of storing the row and column indices
* **Flexibility** \: Maximum flexibility
* **Accessibility** \: Hard to access all the data of a given row or column
* **Memory Access Efficiency** \: Coalesced reads; atomic and random writes
* **Load balancing** \: Balanced, since each thread works on one nonzero element
### COO Disadvantage
> Reference \: These atomic operations can be avoided if the same thread is responsible for all the nonzeros of a row.
### CSR Format
The CSR format is similar to the COO format but groups all the nonzero elements of a row together. It stores the column indices plus pointers to the first element of each row.
> Reference \: Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0

In the above image, an additional pointer to the end is included; it is used for boundary checking so that each thread can figure out how much data it needs to process.
### CSR Format SpMV
```cpp
// Field names (value, colIdx) are assumed for illustration.
float sum = 0.0f;
for (unsigned int i = csrMatrix.rowPtrs[row]; i < csrMatrix.rowPtrs[row + 1]; ++i) {
    sum += csrMatrix.value[i] * x[csrMatrix.colIdx[i]]; // calculate the sum
}
y[row] = sum; // store the sum to the row's location
```
### CSR Format Analysis
* **Space Efficiency** \: `rowPtrs` is smaller than the row index array in the COO format, so CSR is more space efficient.
* **Flexibility** \: Less flexible than the COO format, since the nonzero elements must be kept in a specific order.
* **Accessibility** \: Can access a specific row. However, the number of rows might not be enough to utilize the entire GPU.
* **Memory Access Efficiency** \: Data accessed by consecutive threads is far apart \(about a row apart\).
* **Load balancing** \: Each thread works on a different number of elements, so control \(warp\) divergence will be significant.
### ELL Format
In order to address the noncoalesced memory access, padding and a transpose \(column-major layout\) are applied. Padding with 0 makes the matrix rectangular, so no `if` statement is needed to check row lengths.
> Reference \: Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0

### ELL Format SpMV
> Note that the SpMV/ELL kernel assumes that the input matrix has a vector ellMatrix.nnzPerRow that records the number of nonzeros in each row and allows each thread to iterate only through the nonzeros in its assigned row.
```cpp
// Field names (numRows, value, colIdx) and the column-major indexing are assumed for illustration.
float sum = 0.0f;
for (unsigned int t = 0; t < ellMatrix.nnzPerRow[row]; ++t) {
    unsigned int i = t * ellMatrix.numRows + row;       // column-major (transposed) layout
    sum += ellMatrix.value[i] * x[ellMatrix.colIdx[i]]; // calculate the sum
}
y[row] = sum; // store the sum to the row's location
```
In the above code snippet, the number of nonzeros per row bounds the loop, and the padding value is simply 0.
### ELL Format Analysis
* **Space Efficiency** \: There is apparent overhead from padding the elements.
* **Flexibility** \: Adding an element can cost almost nothing if it is not added to the longest row.
* **Accessibility** \:
> ELL gives us the accessibility of both CSR and COO.
* **Memory Access Efficiency** \: Coalesced
* **Load balancing** \:
> For load balance we observe that SpMV/ELL still exhibits the same load imbalance as SpMV/CSR because each thread still loops over the number of nonzeros in the row for which it is responsible. Therefore ELL does not address the problem of control divergence.
### ELL-COO Format
> Reference \: Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0
> The COO format can be used to curb the length of rows in the ELL format. Before we convert a sparse matrix to ELL, we can take away some of the elements from the rows with exceedingly large numbers of nonzero elements and place the elements into a separate COO storage.

### JDS Format
JDS stands for jagged diagonal storage. Basically, the rows are sorted according to their length.

## 8. Parallel Sorting Algorithms
### Categories of Sorting Algorithm
> Reference \:
> * OpenCL Parallel Programming Development Cookbook
> https://www.researchgate.net/publication/260792312_OpenCL_Parallel_Programming_Development_Cookbook
* Data-driven Sorting \: The next step of the algorithm depends on the value of the key in the current step, for example QuickSort.
* Data-independent Sorting \: The sorting steps are fixed.
When using the GPU for computation, we want as much data independence as possible, since it is easier to parallelize.
### Bitonic Sort \(Sorting Network\)
*TODO \: Add a detailed explanation of bitonic sort*
*Please refer to the following references for details until the TODO is resolved.*
> Reference \:
> https://blog.csdn.net/xbinworld/article/details/76408595
> https://blog.51cto.com/u_15069442/2915938

In the above references, an implementation of the recursive version is provided. A detailed step-by-step output is provided below \(only the first two stages are printed\). Notice that the final stage is an in-order pass; the reverse-order pass does not meet its initial loop condition, so it does nothing.
```
1 7 9 16 5 8 4 3 15 14 13 12 10 11 2 6
Order
stride = 2
biseqSize = 4
Start Index = 0
Before
1 7 9 16 5 8 4 3 15 14 13 12 10 11 2 6
Compare 1 with 7
After
1 7 9 16 5 8 4 3 15 14 13 12 10 11 2 6
Start Index = 4
Before
1 7 9 16 5 8 4 3 15 14 13 12 10 11 2 6
Compare 5 with 8
After
1 7 9 16 5 8 4 3 15 14 13 12 10 11 2 6
Start Index = 8
Before
1 7 9 16 5 8 4 3 15 14 13 12 10 11 2 6
Compare 15 with 14
After
1 7 9 16 5 8 4 3 14 15 13 12 10 11 2 6
Start Index = 12
Before
1 7 9 16 5 8 4 3 14 15 13 12 10 11 2 6
Compare 10 with 11
After
1 7 9 16 5 8 4 3 14 15 13 12 10 11 2 6
Reverse order
stride = 2
biseqSize = 4
Start Index = 2
Compare 9 with 16
1 7 16 9 5 8 4 3 14 15 13 12 10 11 2 6
Start Index = 6
Compare 4 with 3
1 7 16 9 5 8 4 3 14 15 13 12 10 11 2 6
Start Index = 10
Compare 13 with 12
1 7 16 9 5 8 4 3 14 15 13 12 10 11 2 6
Start Index = 14
Compare 2 with 6
1 7 16 9 5 8 4 3 14 15 13 12 10 11 6 2
Order
stride = 4
biseqSize = 8
Start Index = 0
Before
1 7 16 9 5 8 4 3 14 15 13 12 10 11 6 2
Compare 1 with 16
Compare 7 with 9
Compare 1 with 7
Compare 16 with 9
After
1 7 9 16 5 8 4 3 14 15 13 12 10 11 6 2
Start Index = 8
Before
1 7 9 16 5 8 4 3 14 15 13 12 10 11 6 2
Compare 14 with 13
Compare 15 with 12
Compare 13 with 12
Compare 14 with 15
After
1 7 9 16 5 8 4 3 12 13 14 15 10 11 6 2
Reverse order
stride = 4
biseqSize = 8
Start Index = 4
Compare 5 with 4
Compare 8 with 3
Compare 5 with 8
Compare 4 with 3
1 7 9 16 8 5 4 3 12 13 14 15 10 11 6 2
Start Index = 12
Compare 10 with 6
Compare 11 with 2
Compare 10 with 11
Compare 6 with 2
1 7 9 16 8 5 4 3 12 13 14 15 11 10 6 2
......
Reverse order
stride = 16
biseqSize = 32
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
```
**Implementation of the Iterative Version**

> https://www.researchgate.net/publication/260792312_OpenCL_Parallel_Programming_Development_Cookbook
The book provides the algorithm implementation as the following code. I simply copied the implementation and added some checks to make sure it follows the expected compute flow. Notice that this sorting network is different from the recursive one. In this network, we only need to put the smaller or bigger element on top to get an array sorted in descending or ascending order.
```cpp
void hBitonicSortIterative(int* a, int l, int r)
{
    int i, j, k, p;
    int N = r - l + 1;
    for (p = 1; p < N; p += p)
        for (k = p; k > 0; k /= 2)
            for (j = k % p; j + k < N; j += (k + k))
                for (i = 0; i < k; ++i)
                    if (j + i + k < N)
                        if ((j + i) / (p + p) == (j + i + k) / (p + p)) {
                            if (compare(a[l + j + i], a[l + j + i + k], false)) {
                                int temp = a[l + j + i + k];
                                a[l + j + i + k] = a[l + j + i];
                                a[l + j + i] = temp;
                            }
                        }
}
```
The following output applies this function to an array of size 16. Additional info is printed to better understand the indexing.
```
compareXchg([0], [1]), p = 1, k = 1, j = 0, i = 0
compareXchg([2], [3]), p = 1, k = 1, j = 2, i = 0
compareXchg([4], [5]), p = 1, k = 1, j = 4, i = 0
compareXchg([6], [7]), p = 1, k = 1, j = 6, i = 0
compareXchg([8], [9]), p = 1, k = 1, j = 8, i = 0
compareXchg([10], [11]), p = 1, k = 1, j = 10, i = 0
compareXchg([12], [13]), p = 1, k = 1, j = 12, i = 0
compareXchg([14], [15]), p = 1, k = 1, j = 14, i = 0
compareXchg([0], [2]), p = 2, k = 2, j = 0, i = 0
compareXchg([1], [3]), p = 2, k = 2, j = 0, i = 1
compareXchg([4], [6]), p = 2, k = 2, j = 4, i = 0
compareXchg([5], [7]), p = 2, k = 2, j = 4, i = 1
compareXchg([8], [10]), p = 2, k = 2, j = 8, i = 0
compareXchg([9], [11]), p = 2, k = 2, j = 8, i = 1
compareXchg([12], [14]), p = 2, k = 2, j = 12, i = 0
compareXchg([13], [15]), p = 2, k = 2, j = 12, i = 1
compareXchg([1], [2]), p = 2, k = 1, j = 1, i = 0
compareXchg([5], [6]), p = 2, k = 1, j = 5, i = 0
compareXchg([9], [10]), p = 2, k = 1, j = 9, i = 0
compareXchg([13], [14]), p = 2, k = 1, j = 13, i = 0
compareXchg([0], [4]), p = 4, k = 4, j = 0, i = 0
compareXchg([1], [5]), p = 4, k = 4, j = 0, i = 1
compareXchg([2], [6]), p = 4, k = 4, j = 0, i = 2
compareXchg([3], [7]), p = 4, k = 4, j = 0, i = 3
compareXchg([8], [12]), p = 4, k = 4, j = 8, i = 0
compareXchg([9], [13]), p = 4, k = 4, j = 8, i = 1
compareXchg([10], [14]), p = 4, k = 4, j = 8, i = 2
compareXchg([11], [15]), p = 4, k = 4, j = 8, i = 3
compareXchg([2], [4]), p = 4, k = 2, j = 2, i = 0
compareXchg([3], [5]), p = 4, k = 2, j = 2, i = 1
compareXchg([10], [12]), p = 4, k = 2, j = 10, i = 0
compareXchg([11], [13]), p = 4, k = 2, j = 10, i = 1
compareXchg([1], [2]), p = 4, k = 1, j = 1, i = 0
compareXchg([3], [4]), p = 4, k = 1, j = 3, i = 0
compareXchg([5], [6]), p = 4, k = 1, j = 5, i = 0
compareXchg([9], [10]), p = 4, k = 1, j = 9, i = 0
compareXchg([11], [12]), p = 4, k = 1, j = 11, i = 0
compareXchg([13], [14]), p = 4, k = 1, j = 13, i = 0
compareXchg([0], [8]), p = 8, k = 8, j = 0, i = 0
compareXchg([1], [9]), p = 8, k = 8, j = 0, i = 1
compareXchg([2], [10]), p = 8, k = 8, j = 0, i = 2
compareXchg([3], [11]), p = 8, k = 8, j = 0, i = 3
compareXchg([4], [12]), p = 8, k = 8, j = 0, i = 4
compareXchg([5], [13]), p = 8, k = 8, j = 0, i = 5
compareXchg([6], [14]), p = 8, k = 8, j = 0, i = 6
compareXchg([7], [15]), p = 8, k = 8, j = 0, i = 7
compareXchg([4], [8]), p = 8, k = 4, j = 4, i = 0
compareXchg([5], [9]), p = 8, k = 4, j = 4, i = 1
compareXchg([6], [10]), p = 8, k = 4, j = 4, i = 2
compareXchg([7], [11]), p = 8, k = 4, j = 4, i = 3
compareXchg([2], [4]), p = 8, k = 2, j = 2, i = 0
compareXchg([3], [5]), p = 8, k = 2, j = 2, i = 1
compareXchg([6], [8]), p = 8, k = 2, j = 6, i = 0
compareXchg([7], [9]), p = 8, k = 2, j = 6, i = 1
compareXchg([10], [12]), p = 8, k = 2, j = 10, i = 0
compareXchg([11], [13]), p = 8, k = 2, j = 10, i = 1
compareXchg([1], [2]), p = 8, k = 1, j = 1, i = 0
compareXchg([3], [4]), p = 8, k = 1, j = 3, i = 0
compareXchg([5], [6]), p = 8, k = 1, j = 5, i = 0
compareXchg([7], [8]), p = 8, k = 1, j = 7, i = 0
compareXchg([9], [10]), p = 8, k = 1, j = 9, i = 0
compareXchg([11], [12]), p = 8, k = 1, j = 11, i = 0
compareXchg([13], [14]), p = 8, k = 1, j = 13, i = 0
```
### OpenCL Implementation
The OpenCL version is implemented as shown in the following diagram.

In each stage and substage, the host launches a kernel that computes the comparisons in parallel.
```cpp
for (cl_uint stage = 0; stage < stages; ++stage) {
    clSetKernelArg(kBitonicSort, 1, sizeof(cl_uint), (void*)&stage);
    for (cl_uint subStage = 0; subStage < stage + 1; ++subStage) {
        clSetKernelArg(kBitonicSort, 2, sizeof(cl_uint), (void*)&subStage);
        ocl::launchOneKernelAndWait(dQueue, kBitonicSort, 1, gws, lws);
    }
}
```
The following code snippet shows how the kernel defines the necessary parameters.
```cpp
uint globalIdx = get_global_id(0);
uint sortDirection = direction;
uint pairWidth = 1 << (stage - subStage);
uint blockWidth = pairWidth << 1;
uint leftIdx = (globalIdx % pairWidth) + (globalIdx / pairWidth) * blockWidth;
uint rightIdx = leftIdx + pairWidth;
uint sameDirectionBlockWidth = 1 << stage;
```
Compare and swap
```cpp
uint temp;
// leftElement/rightElement are the values loaded from data[leftIdx] and data[rightIdx].
// Swap when the pair is out of order for the requested direction.
if ((leftElement > rightElement && (bool)sortDirection) || (leftElement < rightElement && !((bool)sortDirection))) {
    temp = data[leftIdx];
    data[leftIdx] = data[rightIdx];
    data[rightIdx] = temp;
}
```
### Radix Sort
Radix sort is a stable sorting algorithm and thus requires that the sub-sorting routine be stable as well. In most implementations, counting sort is used.
:::info
:bulb: **Counting Sort**
Counting Sort \(計數排序\)
https://rust-algo.club/sorting/counting_sort/
:::
In this implementation, LSD radix sort is used. Notice that the program also uses a reduction, a prefix scan, and a histogram. A host-side sketch of one pass follows the list below.
* Histogram-Keys \: Calculate the frequency of the keys
* Scan-Buckets \: Prefix scan to generate the element locations
* Rank-And-Permute \: Place the elements into their final locations.
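As a reference for the three phases, here is a minimal host-side C++ sketch of one 8-bit LSD pass (the GPU version performs the same phases with the kernels mentioned above; this sketch is a simplification, not the repository code):
```cpp
#include <cstdint>
#include <vector>

// One LSD pass over an 8-bit digit: histogram the keys, exclusive-scan the
// buckets, then scatter each element to its final position (stable).
void radixPass8(std::vector<uint32_t>& data, int shift)
{
    std::vector<uint32_t> buckets(256, 0), output(data.size());

    for (uint32_t key : data)                       // Histogram-Keys
        ++buckets[(key >> shift) & 0xFF];

    uint32_t running = 0;                           // Scan-Buckets (exclusive scan)
    for (auto& b : buckets) { uint32_t c = b; b = running; running += c; }

    for (uint32_t key : data)                       // Rank-And-Permute
        output[buckets[(key >> shift) & 0xFF]++] = key;

    data.swap(output);
}

// Full LSD radix sort: four 8-bit passes for 32-bit keys.
void radixSort(std::vector<uint32_t>& data)
{
    for (int shift = 0; shift < 32; shift += 8)
        radixPass8(data, shift);
}
```
Because the scatter preserves the relative order of equal digits, each pass is stable, which is what lets the later, more significant digits build on the earlier ones.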
:::success
:bulb: ***GPU Gems 2 \:* Chapter 46. Improved GPU Sorting**
Although GPU Gems is old, it can still be used as a reference for GPU sorting ideas.
https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting
:::
## 9. Graph Traversal \[TODO\]
> There are multiple strategies for parallelizing graph computations, some of which are centered on processing vertices in parallel, while others are centered on processing edges in parallel.
### Adjacency Matrix
> Reference \: Programming Massively Parallel Processors
> https://shop.elsevier.com/books/programming-massively-parallel-processors/hwu/978-0-323-91231-0

If a graph is not fully connected, it is referred to as sparsely connected. Notice that in the above graph and its adjacency matrix, the matrix can be considered a **sparse matrix**.

> The CSR representations give easy access to the outgoing edges of a given vertex. The CSC representation gives easy access to the incoming edges of a given vertex. The COO representation gives easy access to the source and destination vertices of a given edge.
:::info
:information_source: **BFS \& DFS**
**\[Algorithm\] Depth-First Search \(DFS\) and Breadth-First Search \(BFS\)**
https://ithelp.ithome.com.tw/articles/10281404
:::
:::info
:information_source: **Dijkstra Algorithm**
**\[Algorithm\] Study Notes 14: Dijkstra Algorithm \(Shortest Path\)**
https://medium.com/@amber.fragments/%E6%BC%94%E7%AE%97%E6%B3%95-%E5%AD%B8%E7%BF%92%E7%AD%86%E8%A8%98-14-dijkstra-algorithm-%E6%9C%80%E7%9F%AD%E8%B7%AF%E5%BE%91%E6%BC%94%E7%AE%97%E6%B3%95-745983dd4332
:::
## 10. Noise Removal With Median Filter
> Reference \:
> * Noise removal using Median filter in C++
> https://www.geeksforgeeks.org/noise-removal-using-median-filter-in-c/
> * Image Filters in CUDA
> https://github.com/jonaylor89/Median-Filter-CUDA/tree/master
> Median filtering is a nonlinear process useful in reducing impulsive, or salt-and-pepper noise.
To find the median within the filter window, a sort is needed. In the second reference, \"Image Filters in CUDA\", the author uses a bubble sort in each CUDA thread.
### Steps
For each thread invoked, the following steps are processed \(a sketch follows the list\).
* Store the pixel values within the filter window into a temporary array
* A padding is included
* Requires a private array of the filter size for each thread; register spilling might occur
* Sort the temporary array \(Branch \#1\)
* Possible solution \: **Bubble Sort** as in reference \[2\]
* Possible solution \: **Bitonic Sort**
* **Quick Select** to find the median \(Branch \#2\)
* Store the median value to the output array
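A minimal sketch of the per-thread work using the bubble-sort option with a fixed 3x3 filter (variable names such as `x`, `y`, `input`, `output`, `inputH`, `inputW` are illustrative; the repository version uses quick select instead):
```cpp
// Per-thread median for a 3x3 window (bubble-sort variant, illustrative only).
float window[9];
int count = 0;
for (int dy = -1; dy <= 1; ++dy) {
    for (int dx = -1; dx <= 1; ++dx) {
        int hp = y + dy, wp = x + dx;
        // Zero padding outside the image bounds.
        window[count++] = (hp >= 0 && wp >= 0 && hp < inputH && wp < inputW)
                              ? input[hp * inputW + wp] : 0.0f;
    }
}
// Bubble sort the 9 samples, then take the middle one.
for (int i = 0; i < 9 - 1; ++i)
    for (int j = 0; j < 9 - 1 - i; ++j)
        if (window[j] > window[j + 1]) {
            float t = window[j]; window[j] = window[j + 1]; window[j + 1] = t;
        }
output[y * inputW + x] = window[4];
```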
:::info
:information_source: **Quick Select**
https://hackmd.io/@Erebustsai/SyfT89rsC
:::
In my implementation, I use a device-scope function that does the partition; inside it, a bit hack is applied since there is no native swap function in a CUDA kernel.
:::success
:bulb: **Reduce Compare Statement**
Notice that the zero checks on the two indices can be replaced with a cast to unsigned, since a negative number cast to an unsigned variable becomes a very big number.
```cpp
// if (hp >= 0 && wp >= 0 && hp < inputH && wp < inputW) {
// Equivalent to the following statement, since casting a negative number to unsigned yields a very big number
if ((unsigned int)hp < inputH && (unsigned int)wp < inputW) {
```
:::
### Github Repo
https://github.com/Chen-KaiTsai/GPGPU_CUDA/tree/main/ImageFilter
**Implementation Detail**
*Swap two variables*
```cpp
// XOR swap; only valid when a != b, otherwise the value is zeroed out
data[a] = data[a] ^ data[b];
data[b] = data[a] ^ data[b];
data[a] = data[a] ^ data[b];
```