# How to Run CUTLASS and CuTe Code on Google Colab
Basically it works the same way as running plain CUDA code.
## Method 1
Use the cell magic ==%%writefile== to save the contents of the current cell to a file, compile it into an executable with nvcc, and run it. Sometimes you add printf calls for debugging but see no output at all; remember to pass an architecture flag that matches the machine you are running on, such as `-arch=sm_75` (a way to check which value fits your runtime is shown right after the example below), otherwise the kernel may silently produce nothing.
```cuda
%%writefile hello.cu
#include <stdio.h>

__global__ void hello() {
  printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main() {
  hello<<<2, 2>>>();
  cudaDeviceSynchronize();
}
```
`!nvcc -arch=sm_75 -o hello hello.cu`
`!./hello`
> Hello from block: 0, thread: 0
Hello from block: 0, thread: 1
Hello from block: 1, thread: 0
Hello from block: 1, thread: 1
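
If you are not sure which value to pass to `-arch`, first check which GPU the Colab runtime assigned to you. The cell below is just a convenience: plain `nvidia-smi` shows the card name (a Tesla T4, for instance, is compute capability 7.5, hence `sm_75`), and the `compute_cap` query field assumes a reasonably recent driver.
```
# Show the GPU assigned to this runtime, e.g. "Tesla T4" -> sm_75
!nvidia-smi
# On recent drivers the compute capability can also be queried directly
!nvidia-smi --query-gpu=name,compute_cap --format=csv
```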
## Method 2
To run CUDA code directly in Colab cells, we can first install the [nvcc4jupyter][1] package and load it as an extension; this makes the cell magic ==%%cuda== available.
```
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
```
With that in place, let's try running a CUDA example program:
```
# As before, if nothing gets printed, try passing the GPU architecture explicitly
CUDA_COMPILE_ARGS = '-c "--gpu-architecture sm_75"'
```
```cuda
%%cuda $CUDA_COMPILE_ARGS
#include <stdio.h>
__global__ void hello() {
  printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main() {
  hello<<<2, 2>>>();
  cudaDeviceSynchronize();
}
```
> Hello from block: 0, thread: 0
Hello from block: 0, thread: 1
Hello from block: 1, thread: 0
Hello from block: 1, thread: 1

Next, let's write a small piece of CUTLASS-related code and include a few of the CUTLASS header files:
```cuda
%%cuda
#include <iostream>
#include <cutlass/cutlass.h>
#include <cutlass/numeric_types.h>
#include <cutlass/core_io.h>
int main() {
  cutlass::half_t x = 2.25_hf;
  std::cout << x << std::endl;
  return 0;
}
```
Its output:
> /tmp/tmpt9uuwc4s/f0dc1fe9-f3d8-4722-b45b-6e6e34abbd4c/single_file.cu:3:10: fatal error: cutlass/cutlass.h: No such file or directory
3 | #include <cutlass/cutlass.h>
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.

It looks like the compiler cannot find the header files.
At this point, we can install the [nvidia-cutlass][3] package:
`!pip install nvidia-cutlass`
If we then run the same cell magic again, we still get the same result:
> /tmp/tmptz6ym7b8/75b812b3-f025-4e1c-94fb-72c173ac6856/single_file.cu:3:10: fatal error: cutlass/cutlass.h: No such file or directory
3 | #include <cutlass/cutlass.h>
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.

We still cannot get the result we want: the compiler simply cannot find the header files. Digging into the nvcc4jupyter package, though, it turns out that it lets us pass extra compiler arguments (compiler-args) through to nvcc; see the [magic API documentation][4].

Consulting the [NVCC Command Options][5], the option we want is `--include-path`, which tells nvcc where to look for header files.

A bit of searching reveals where the headers installed by nvidia-cutlass actually live:
`!ls ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/cutlass`
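
Instead of hunting for this directory by hand, we can also ask Python where the package landed. This is only a sketch and assumes the nvidia-cutlass wheel installs an importable `cutlass_library` package, which is what the dist-packages path above suggests:
```
import os
import cutlass_library  # installed by the nvidia-cutlass wheel

# The CUTLASS/CuTe headers ship under <package dir>/source/include
cutlass_include = os.path.join(os.path.dirname(cutlass_library.__file__), "source", "include")
print(cutlass_include)
```
The printed path is what we hand to nvcc via `--include-path`.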

So the final cell magic looks like this:
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <iostream>
#include <cutlass/cutlass.h>
#include <cutlass/numeric_types.h>
#include <cutlass/core_io.h>
int main() {
  cutlass::half_t x = 2.25_hf;
  std::cout << x << std::endl;
  return 0;
}
```
And this time we get the output we wanted:
> 2.25
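
As an extra sanity check that the header-only API really works from Colab, here is one more cell of my own (a small sketch, not one of the official CUTLASS examples). It does a bit of `cutlass::half_t` arithmetic on the host; 1.5 * 0.75 + 2.25 = 3.375 is exactly representable in half precision, so the printed value should be exact:
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <iostream>
#include <cutlass/numeric_types.h>

int main() {
  // Half-precision arithmetic on the host, converted back to float for printing
  cutlass::half_t a = 1.5_hf;
  cutlass::half_t b = 0.75_hf;
  cutlass::half_t c = a * b + 2.25_hf;
  std::cout << float(c) << std::endl;  // expect 3.375
  return 0;
}
```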
___
CuTe is a header-only C++ template library that ships with CUTLASS. It provides template abstractions for describing layouts and for mapping between multi-dimensional threads and data; see the official [documentation][6] for the details. Here we just run one example to show that it works on Colab:
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <cute/atom/mma_atom.hpp>
using namespace cute;
int main()
{
  MMA_Atom mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{};
  print_latex(mma);
}
```
Running this cell prints LaTeX source describing the thread and value layout of the MMA atom; for one way to render it, see a previous [article][8].
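
To get a slightly more concrete feel for CuTe's layouts, here is one more small cell of my own (again a sketch, not from the official examples): a static 4x8 column-major layout, printed together with the linear index it assigns to the coordinate (2, 3).
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <cstdio>
#include <cute/layout.hpp>
using namespace cute;

int main()
{
  // Shape (4, 8) with the default compact column-major strides (1, 4)
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}));
  print(layout);                      // prints the shape:stride description
  printf("\n");
  printf("%d\n", int(layout(2, 3)));  // coordinate (2, 3) maps to 2 + 3*4 = 14
  return 0;
}
```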

References:
1. [nvcc4jupyter][1]
2. [An Even Easier Introduction to CUDA][2]
3. [nvidia-cutlass][3]
4. [nvcc4jupyter Magic API][4]
5. [NVCC Command Options][5]
6. [Getting Started With CuTe][6]
7. [CuTe's support for Matrix Multiply-Accumulate instructions][7]
[1]: https://github.com/andreinechaev/nvcc4jupyter
[2]: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
[3]: https://pypi.org/project/nvidia-cutlass/
[4]: https://nvcc4jupyter.readthedocs.io/en/latest/magics.html#cuda-magic
[5]: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#nvcc-command-options
[6]: https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md
[7]: https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0t_mma_atom.md
[8]: https://hackmd.io/@chrischang80/ryQ8l6URR