> * OpenCL Parallel Programming Development Cookbook
> https://www.researchgate.net/publication/260792312_OpenCL_Parallel_Programming_Development_Cookbook
*Please refer to the Github repository for source code*
## Installation
OpenCL support is usually built into the drivers of AMD / Intel / Nvidia GPUs, so we only need to provide the OpenCL headers at build time.
In the Algorithms repository, the programs are built against the OpenCL headers provided by *ProjectPhysX* \(https://github.com/ProjectPhysX/OpenCL-Wrapper\). The AI framework sample code in this post is built on an AMD or Intel GPU device that supports OpenCL 2.0 and above.
## OpenCL Basic
:::info
:bulb: **Max Work Group Size, Max Work Item Sizes**
https://stackoverflow.com/questions/3957125/questions-about-global-and-local-work-size
```bash
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple (device) 32
Preferred work group size multiple (kernel) 32
Warp size (NV) 32
```
In the above output from clinfo on an Nvidia Tesla P40, `max work item sizes` defines the **maximum local work size in each dimension**. However, the **total local work size (the product of all dimensions) must not exceed** `max work group size`.
:::
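In OpenCL 1.x NDRanges, each dimension of the global work size must also be a multiple of the corresponding local work size, so a global size is usually rounded up and the extra work items masked off inside the kernel. A minimal sketch of that rounding, plus a check against the limits above (the 1024 / 1024 / 64 per-dimension limits and the 1024 group limit are the Tesla P40 values from the clinfo output):

```cpp
#include <cstddef>

// Round a global work size up to the next multiple of the local work size,
// as required by OpenCL 1.x NDRanges.
size_t roundUpGlobal(size_t globalSize, size_t localSize) {
    return ((globalSize + localSize - 1) / localSize) * localSize;
}

// Check a 3D local work size against the per-dimension limits
// and the total work-group limit reported by clinfo.
bool localSizeOk(size_t x, size_t y, size_t z) {
    const size_t maxItem[3] = {1024, 1024, 64}; // max work item sizes
    const size_t maxGroup = 1024;               // max work group size
    return x <= maxItem[0] && y <= maxItem[1] && z <= maxItem[2]
        && x * y * z <= maxGroup;
}
```

For example, a local size of 16×16×4 (= 1024 work items) is acceptable on this device, while 32×32×2 (= 2048) exceeds the work-group limit even though each dimension is individually legal.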
:::info
:bulb: **About cl_ Prefix Data Types**
https://stackoverflow.com/questions/25659972/whats-the-difference-between-int-and-cl-int-in-opencl
> So, whenever you are writing host code that deals with values or buffers that will be passed to the device, you should always use the cl_ datatypes.
:::
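The point of the `cl_` types is that they have a fixed, documented size on every host platform, matching the device-side types, whereas plain `int` / `long` vary between platforms. A sketch of the guarantee, using `<cstdint>` stand-ins so it compiles without the OpenCL headers installed (`CL/cl_platform.h` defines `cl_int` as a 32-bit signed integer and `cl_float` as a 32-bit float):

```cpp
#include <cstdint>

// Stand-ins for the host-side cl_ types; assumed here so this sketch
// compiles without CL/cl_platform.h.
using cl_int_sketch   = std::int32_t;
using cl_float_sketch = float;

static_assert(sizeof(cl_int_sketch) == 4, "cl_int is always 4 bytes on the host");
static_assert(sizeof(cl_float_sketch) == 4, "cl_float is always 4 bytes on the host");
// By contrast, sizeof(long) is 4 on 64-bit Windows but 8 on 64-bit Linux,
// which is why buffers shared with the device should use the cl_ types.
```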
:::success
:information_source: **OpenCL Kernel Build Option**
https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
:::
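Build options are passed to `clBuildProgram` as a single space-separated string; `-D` defines are a convenient way to bake constants such as tile sizes into a kernel at build time. A sketch of composing such a string (the `TILE` macro name is made up for illustration):

```cpp
#include <string>

// Compose a clBuildProgram options string: math flags plus a -D define.
std::string makeBuildOptions(int tile) {
    return "-cl-fast-relaxed-math -cl-mad-enable -D TILE=" + std::to_string(tile);
}
```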
:::info
:information_source: **Nvidia OpenCL Support**
As answered in the following Stack Overflow post, **Nvidia never implemented the OpenCL 2.0 feature set**; the driver merely reports a renamed, so-called 3.0 version. Therefore, ***Nvidia GPUs only support up to OpenCL 1.2***.
https://stackoverflow.com/questions/75260617/how-to-install-an-older-version-of-opencl
:::
## OpenCL AI Framework Sample Codes
:::info
:bulb: **Original Documentation by Walter Weng**
Please refer to the following powerpoint document
> https://docs.google.com/presentation/d/1oeheah-ypjQTCaBUyMlz-ainb6bScrPR/edit?usp=sharing&ouid=116697045683854327760&rtpof=true&sd=true
The document includes:
1. *Basic OpenCL Programming*
2. *Base Structure of the API*
3. *Sample Codes (Outdated)*
The rest of this document will focus on **API changes / environment installation**.
:::
:::success
:information_source: **Revision Documentation**
Please refer to the following powerpoint document for **framework v1.0**
> https://docs.google.com/presentation/d/1CWSTYlM-Yra_nOEXujNv8sTRTkQMAp87/edit?usp=sharing&ouid=116697045683854327760&rtpof=true&sd=true
Please refer to the following powerpoint document for **framework v1.4**
> https://docs.google.com/presentation/d/1pBfmjLRncPispFCF3AKbB2k-cjU0yqpj/edit?usp=sharing&ouid=116697045683854327760&rtpof=true&sd=true
:::
### Framework Introduction
#### Framework Scope

#### Framework Workflow

### Sample Codes
#### Initial OpenCL
Create the context and queue, and build the kernel programs. In the following example, the platform and device are selected by index; the number and order of platforms, and of the devices within a platform, can be inspected with *clinfo*.
```cpp
void initialOpenCL()
{
#ifdef __CLMESAGE__
    printf("Start OpenCL initialization\n\n");
#endif
    auto platforms = ocl::getPlatformID();
    OCL_PLATFORM dPlatform = platforms[0];
    auto devices = ocl::getDeviceID(dPlatform);
    dDevice = devices[0];
    localMemSize = ocl::getLocalMemorySize(dDevice);
    dContext = ocl::createContext(dPlatform, dDevice);
    dQueue = ocl::createQueue(dContext, dDevice);
    dProgram = ocl::createProgramWithSrc(dContext, { "Program.cl" });
    const char options[] = "-cl-fast-relaxed-math";
    ocl::buildProgram(dProgram, dDevice, options);
#ifdef __CLMESAGE__
    printf("OpenCL initialization finish\n\n");
#endif
}
```
#### Initial Model
The following function builds the model. The model is based on ***MobileNetV2***, heavily reduced.
```cpp
void initialModel()
{
    // 0
    model->add(new ocl::input({ 1, 3, 224, 224 })); //0
    model->add(new ocl::conv2d_opt(model->back(), 32, 3, 2, 1, false, 1, {16, 4, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    // 1
    model->add(new ocl::dwconv2d_opt(model->back(), 3, 1, 1, false, 1, {16, 4, 4})); //4
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::pwconv2d_opt(model->back(), 16, false, 1, {16, 4, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    // 2
    model->add(new ocl::pwconv2d_opt(model->back(), 32, false, 1, {8, 8, 4})); //9
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::dwconv2d_opt(model->back(), 3, 2, 1, false, 1, {8, 8, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::pwconv2d_opt(model->back(), 24, false, 1, {8, 8, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    // 4
    model->add(new ocl::pwconv2d_opt(model->back(), 48, false, 1, {8, 8, 4})); //17
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::dwconv2d_opt(model->back(), 3, 2, 1, false, 1, {4, 4, 16}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::pwconv2d_opt(model->back(), 32, false, 1, {4, 4, 16}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    // 7
    model->add(new ocl::pwconv2d_opt(model->back(), 64, false, 1, {4, 4, 16})); //25
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::dwconv2d_opt(model->back(), 3, 2, 1, false, 1, {14, 1, 16}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::pwconv2d_opt(model->back(), 64, false, 1, {14, 1, 16}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    // 11
    model->add(new ocl::pwconv2d_opt(model->back(), 128, false, 1, {14, 1, 16})); //33
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::dwconv2d_opt(model->back(), 3, 1, 1, false, 1, {7, 7, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::pwconv2d_opt(model->back(), 96, false, 1, {14, 1, 16}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    // 14
    model->add(new ocl::pwconv2d_opt(model->back(), 192, false, 1, {14, 1, 16})); //41
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::dwconv2d_opt(model->back(), 3, 2, 1, false, 1, {7, 7, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::pwconv2d_opt(model->back(), 160, false, 1, {7, 7, 4}));
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    // 18
    model->add(new ocl::pwconv2d_opt(model->back(), 80, false, 1, {7, 7, 5})); //49
    model->add(new ocl::batchNorm(model->back(), 1e-5, true));
    model->add(new ocl::relu6(model->back(), true));
    model->add(new ocl::globalAvg(model->back())); //52
    model->add(new ocl::dense(model->back(), 2, true)); //53
#ifdef __DEBUG__
    model->summary();
#endif
    model->loadWeight("CNRParkingLot_furtherReduced_final.bin");
    printf("total gpu memory size : %llu bytes (not including weight memory size)\n", model->get_mem_size() * 4);
}
```
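As a sanity check on the layer shapes above, the standard convolution output-size formula can be evaluated directly: with the 224×224 input, the stride-2 3×3 stem gives 112×112, and each stride-2 depthwise stage halves the spatial size again (224 → 112 → 56 → 28 → 14 → 7):

```cpp
// Output spatial size of a convolution:
// out = floor((in + 2*pad - kernel) / stride) + 1
int convOut(int in, int kernel, int stride, int pad) {
    return (in + 2 * pad - kernel) / stride + 1;
}
```

The stride-1 depthwise stage in block 11 keeps the 14×14 resolution, which is why both its pointwise neighbors use the same local work size.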
:::info
:information_source: **Reduced MobileNetV2**
:bulb: **Original MobileNetV2**

:bulb: **Reduced MobileNetV2**
This version reduces the network by eliminating some of the repeated bottleneck layers

:bulb: **Further Reduced MobileNetV2**


:::
### Utility Functions
These support functions provide a better interface to other libraries (OpenCV, Python).
#### Load Image
```cpp
void bufferedImageLoader(cv::Mat& image, OCL_TYPE* data)
{
    cv::resize(image, image, cv::Size(224, 224));
    cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
    cv::Mat If32;
    image.convertTo(If32, CV_32FC3);
    // OpenCV HWC -> CHW
#pragma omp parallel for
    for (int c = 0; c < image.channels(); ++c) {
        for (int y = 0; y < image.rows; ++y) {
            for (int x = 0; x < image.cols; ++x) {
                data[c * image.rows * image.cols + y * image.cols + x] = (OCL_TYPE)If32.at<cv::Vec3f>(y, x)[c] / 255.0f;
            }
        }
    }
}
```
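The HWC→CHW repacking above can be checked independently of OpenCV; a minimal sketch on a tiny 2×2 image with 3 channels (the sizes are made up for illustration):

```cpp
#include <vector>

// Repack interleaved HWC data (OpenCV's layout) into planar CHW
// (the layout the framework's input layer expects).
std::vector<float> hwcToChw(const std::vector<float>& hwc, int h, int w, int c) {
    std::vector<float> chw(hwc.size());
    for (int ch = 0; ch < c; ++ch)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                chw[ch * h * w + y * w + x] = hwc[(y * w + x) * c + ch];
    return chw;
}
```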
#### Clean Up Framework
```cpp
void clCleanUp()
{
#ifdef __DEBUG__
    printf("Start Clean up OpenCL\n");
#endif
    delete model; // release the model; an explicit destructor call would leak the allocation
    clReleaseProgram(dProgram);
    clReleaseCommandQueue(dQueue);
    clReleaseContext(dContext);
    delete[] predict;
    delete[] data;
#ifdef __DEBUG__
    printf("Finish Clean up OpenCL\n");
#endif
}
```
#### Export Weight for Framework (Python Code)
```python
if EXPORT_WEIGHT:
    model.eval()
    if not LOAD_MODEL:
        print("----------Warning!----------\n Not Loading ANY MODEL!\nThe output binary file might not be meaningful.")
    print("Exporting weight file " + LOAD_PATH)
    LOAD_PATH = LOAD_PATH.replace('pt', 'bin')
    with open(LOAD_PATH, "wb") as file:
        for param_name in model.state_dict():
            if param_name.find("num_batches_tracked") != -1:
                continue
            layer_weight = model.state_dict()[param_name].flatten().numpy()
            for weight in layer_weight:
                file.write(weight)
    print(LOAD_PATH + " Weight file exported")
    exit()
```
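The exporter writes raw float32 values back-to-back with no header, in `state_dict` order, which is the layout the framework's `loadWeight` consumes. A sketch of the matching reader side (the function name here is made up for illustration; the framework's real loader is internal):

```cpp
#include <cstdio>
#include <vector>

// Read a headerless binary file of raw float32 weights,
// as produced by the Python exporter above.
std::vector<float> readWeightFile(const char* path) {
    std::vector<float> weights;
    FILE* f = std::fopen(path, "rb");
    if (!f) return weights;
    float v;
    while (std::fread(&v, sizeof(float), 1, f) == 1)
        weights.push_back(v);
    std::fclose(f);
    return weights;
}
```

Because there is no header, the reader must know the model architecture in advance; the weight count implied by the layers has to match the file size exactly.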
## OpenCL AI Framework Example: YOLOv1
```cpp
#include <time.h>
#include <iostream>
#include <opencv2/opencv.hpp>
#include "ocl.h"
#include <future>
#include <thread>
#include <vector>
#include <memory>
#include <chrono>
void bufferedImageLoader(const char* imagePath, OCL_TYPE* data) {
    cv::Mat image = cv::imread(imagePath); // use the parameter instead of a hard-coded path
    cv::resize(image, image, cv::Size(448, 448));
    // printf("Input image channel size: %d\n", image.channels());
    cv::cvtColor(image, image, cv::COLOR_BGR2GRAY);
    std::vector<cv::Mat> gray_three_channel{image, image, image};
    cv::merge(gray_three_channel, image);
    // printf("Gray image channels size: %d\n", image.channels());
    cv::Mat If32;
    image.convertTo(If32, CV_32FC3);
    // OpenCV HWC -> CHW
    for (int c = 0; c < image.channels(); ++c) {
        for (int y = 0; y < image.rows; ++y) {
            for (int x = 0; x < image.cols; ++x)
                // data[c * image.rows * image.cols + y * image.cols + x] = (OCL_TYPE)If32.at<cv::Vec3f>(y, x)[c] / 127.5 - 1;
                data[c * image.rows * image.cols + y * image.cols + x] = (OCL_TYPE)1;
        }
    }
}
int main() {
    std::unique_ptr<OCL_TYPE[]> predict = std::make_unique<OCL_TYPE[]>(1470);
    std::unique_ptr<OCL_TYPE[]> data = std::make_unique<OCL_TYPE[]>(3 * 448 * 448);
    std::future<void> backgroundThread = std::async(std::launch::async, bufferedImageLoader, "/home/erebus/opencl_dev/Com_OpenCL_API/OpenCL_ML_API/mug_validation.jpg", data.get());
    auto platforms = ocl::getPlatformID();
    OCL_PLATFORM dPlatform = platforms[0];
    auto devices = ocl::getDeviceID(dPlatform);
    dDevice = devices[0];
    dContext = ocl::createContext(dPlatform, dDevice);
    dQueue = ocl::createQueue(dContext, dDevice);
    dProgram = ocl::createProgramWithSrc(dContext, {"/home/erebus/opencl_dev/OpenCL_ML_API/Program.cl"});
    // const char options[] = "-cl-fast-relaxed-math";
    const char options[] = "";
    ocl::buildProgram(dProgram, dDevice, options);
    ocl::sequential model;
    model.add(new ocl::input({1, 3, 448, 448})); //0
    /* CNNBlock #0 */
    model.add(new ocl::conv2d_opt(model.back(), 64, 7, 2, 3, false)); //1
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* MaxPool #1 */
    model.add(new ocl::maxpool2d_opt(model.back(), 2, 2)); //4
    /* CNNBlock #2 */
    model.add(new ocl::conv2d_opt(model.back(), 192, 3, 1, 1, false)); //5
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* MaxPool #3 */
    model.add(new ocl::maxpool2d_opt(model.back(), 2, 2)); //8
    /* CNNBlock #4 */
    model.add(new ocl::pwconv2d_opt(model.back(), 128, false)); //9
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #5 */
    model.add(new ocl::conv2d_opt(model.back(), 256, 3, 1, 1, false)); //12
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #6 */
    model.add(new ocl::pwconv2d_opt(model.back(), 256, false)); //15
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #7 */
    model.add(new ocl::conv2d_opt(model.back(), 512, 3, 1, 1, false)); //18
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* MaxPool #8 */
    model.add(new ocl::maxpool2d_opt(model.back(), 2, 2)); //21
    /* CNNBlock #9 */
    model.add(new ocl::pwconv2d_opt(model.back(), 256, false)); //22
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #10 */
    model.add(new ocl::conv2d_opt(model.back(), 512, 3, 1, 1, false)); //25
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #11 */
    model.add(new ocl::pwconv2d_opt(model.back(), 256, false)); //28
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #12 */
    model.add(new ocl::conv2d_opt(model.back(), 512, 3, 1, 1, false)); //31
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #13 */
    model.add(new ocl::pwconv2d_opt(model.back(), 256, false)); //34
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #14 */
    model.add(new ocl::conv2d_opt(model.back(), 512, 3, 1, 1, false)); //37
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #15 */
    model.add(new ocl::pwconv2d_opt(model.back(), 256, false)); //40
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #16 */
    model.add(new ocl::conv2d_opt(model.back(), 512, 3, 1, 1, false)); //43
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #17 */
    model.add(new ocl::pwconv2d_opt(model.back(), 512, false)); //46
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #18 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 1, 1, false)); //49
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* MaxPool #19 */
    model.add(new ocl::maxpool2d_opt(model.back(), 2, 2)); //52
    /* CNNBlock #20 */
    model.add(new ocl::pwconv2d_opt(model.back(), 512, false)); //53
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #21 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 1, 1, false)); //56
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #22 */
    model.add(new ocl::pwconv2d_opt(model.back(), 512, false)); //59
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #23 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 1, 1, false)); //62
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #24 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 1, 1, false)); //65
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #25 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 2, 1, false)); //68
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #26 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 1, 1, false)); //71
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* CNNBlock #27 */
    model.add(new ocl::conv2d_opt(model.back(), 1024, 3, 1, 1, false)); //74
    model.add(new ocl::batchNorm_opt(model.back()));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    /* sequential */
    /* Linear #1 */
    model.add(new ocl::dense(model.back(), 512, true));
    model.add(new ocl::leakyrelu_opt(model.back(), 0.1));
    model.add(new ocl::dense(model.back(), 1470, true));
    // 1470 = 7 * 7 * (20 + 2 * 5)
    // model.summary();
    model.loadWeight("/home/erebus/opencl_dev/OpenCL_ML_API/yolov1_100_epoch_100_seed.bin");
    printf("total gpu memory size : %ld bytes (not including weight memory size)\n", model.get_mem_size() * 4);
    backgroundThread.get();
    // model.get_feature_vector(data.get(), predict.get(), 66);
    auto start = std::chrono::steady_clock::now();
    model.inference(data.get(), predict.get());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    printf("Inference Time: %lf s\n", (double)duration.count() / 1000.0);
    for (int i = 0; i < 1470; ++i) {
        if (i % 30 == 0)
            puts("");
        printf("%3.4f\t", predict[i]);
    }
    puts("");
    return 0;
}
```
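The 1470 outputs printed above are the flattened 7×7 grid with 30 values per cell (20 class scores plus 2 boxes × 5 values each), which is why the print loop breaks the line every 30 values. A sketch of the indexing, assuming the row-major cell layout implied by that print loop:

```cpp
// Offset of grid cell (row, col) in the flattened 7x7x30 YOLOv1 output.
// Within a cell: values [0, 20) are class scores, [20, 25) box 1
// (x, y, w, h, confidence), and [25, 30) box 2.
int cellOffset(int row, int col) {
    return (row * 7 + col) * 30;
}
```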