# How to Run CUTLASS and CuTe Code on Google Colab
Basically it works the same way as running plain CUDA code.
## Method 1
Use the cell magic ==%%writefile== to save the contents of the current cell to a file, compile it into an executable with nvcc, and run it. Sometimes you add printf calls for debugging but see no output at all; remember to pass an architecture flag that matches the machine you are running on, such as `-arch=sm_75` (a way to check which value fits your runtime is shown right after the example below), otherwise the kernel may silently produce nothing.
```cuda
%%writefile hello.cu
#include <stdio.h>

__global__ void hello() {
  printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main() {
  hello<<<2, 2>>>();
  cudaDeviceSynchronize();
}
```
`!nvcc -arch=sm_75 -o hello hello.cu`
`!./hello`
> Hello from block: 0, thread: 0
Hello from block: 0, thread: 1
Hello from block: 1, thread: 0
Hello from block: 1, thread: 1
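
If you are not sure which value to pass to `-arch`, first check which GPU the Colab runtime assigned to you. The cell below is just a convenience: plain `nvidia-smi` shows the card name (a Tesla T4, for instance, is compute capability 7.5, hence `sm_75`), and the `compute_cap` query field assumes a reasonably recent driver.
```
# Show the GPU assigned to this runtime, e.g. "Tesla T4" -> sm_75
!nvidia-smi
# On recent drivers the compute capability can also be queried directly
!nvidia-smi --query-gpu=name,compute_cap --format=csv
```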
## Method 2
To run CUDA code directly in Colab cells, we can first install the [nvcc4jupyter][1] package and load it as an extension; this makes the cell magic ==%%cuda== available.
```
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
```
With that in place, let's try running a CUDA example program:
```
# As before, if nothing gets printed, try passing the GPU architecture explicitly
CUDA_COMPILE_ARGS = '-c "--gpu-architecture sm_75"'
```
```cuda
%%cuda $CUDA_COMPILE_ARGS
#include <stdio.h>
__global__ void hello() {
  printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main() {
  hello<<<2, 2>>>();
  cudaDeviceSynchronize();
}
```
> Hello from block: 0, thread: 0
Hello from block: 0, thread: 1
Hello from block: 1, thread: 0
Hello from block: 1, thread: 1

Next, let's write a small piece of CUTLASS-related code and include a few of the CUTLASS header files:
```cuda
%%cuda
#include <iostream>
#include <cutlass/cutlass.h>
#include <cutlass/numeric_types.h>
#include <cutlass/core_io.h>
int main() {
  cutlass::half_t x = 2.25_hf;
  std::cout << x << std::endl;
  return 0;
}
```
Its output:
> /tmp/tmpt9uuwc4s/f0dc1fe9-f3d8-4722-b45b-6e6e34abbd4c/single_file.cu:3:10: fatal error: cutlass/cutlass.h: No such file or directory
3 | #include <cutlass/cutlass.h>
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.

It looks like the compiler cannot find the header files.
At this point, we can install the [nvidia-cutlass][3] package:
`!pip install nvidia-cutlass`
If we then run the same cell magic again, we still get the same result:
> /tmp/tmptz6ym7b8/75b812b3-f025-4e1c-94fb-72c173ac6856/single_file.cu:3:10: fatal error: cutlass/cutlass.h: No such file or directory
3 | #include <cutlass/cutlass.h>
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.

We still cannot get the result we want: the compiler simply cannot find the header files. Digging into the nvcc4jupyter package, though, it turns out that it lets us pass extra compiler arguments (compiler-args) through to nvcc; see the [magic API documentation][4].

Consulting the [NVCC Command Options][5], the option we want is `--include-path`, which tells nvcc where to look for header files.

A bit of searching reveals where the headers installed by nvidia-cutlass actually live:
`!ls ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/cutlass`
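
Instead of hunting for this directory by hand, we can also ask Python where the package landed. This is only a sketch and assumes the nvidia-cutlass wheel installs an importable `cutlass_library` package, which is what the dist-packages path above suggests:
```
import os
import cutlass_library  # installed by the nvidia-cutlass wheel

# The CUTLASS/CuTe headers ship under <package dir>/source/include
cutlass_include = os.path.join(os.path.dirname(cutlass_library.__file__), "source", "include")
print(cutlass_include)
```
The printed path is what we hand to nvcc via `--include-path`.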

So the final cell magic looks like this:
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <iostream>
#include <cutlass/cutlass.h>
#include <cutlass/numeric_types.h>
#include <cutlass/core_io.h>
int main() {
  cutlass::half_t x = 2.25_hf;
  std::cout << x << std::endl;
  return 0;
}
```
And this time we get the output we wanted:
> 2.25
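
As an extra sanity check that the header-only API really works from Colab, here is one more cell of my own (a small sketch, not one of the official CUTLASS examples). It does a bit of `cutlass::half_t` arithmetic on the host; 1.5 * 0.75 + 2.25 = 3.375 is exactly representable in half precision, so the printed value should be exact:
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <iostream>
#include <cutlass/numeric_types.h>

int main() {
  // Half-precision arithmetic on the host, converted back to float for printing
  cutlass::half_t a = 1.5_hf;
  cutlass::half_t b = 0.75_hf;
  cutlass::half_t c = a * b + 2.25_hf;
  std::cout << float(c) << std::endl;  // expect 3.375
  return 0;
}
```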
___
CuTe is a header-only C++ template library that ships with CUTLASS. It provides template abstractions for describing layouts and for mapping between multi-dimensional threads and data; see the official [documentation][6] for the details. Here we just run one example to show that it works on Colab:
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <cute/atom/mma_atom.hpp>
using namespace cute;
int main()
{
  MMA_Atom mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{};
  print_latex(mma);
}
```
Running this cell prints LaTeX source describing the thread and value layout of the MMA atom; for one way to render it, see a previous [article][8].
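
To get a slightly more concrete feel for CuTe's layouts, here is one more small cell of my own (again a sketch, not from the official examples): a static 4x8 column-major layout, printed together with the linear index it assigns to the coordinate (2, 3).
```cuda
%%cuda -c "--include-path ../usr/local/lib/python3.10/dist-packages/cutlass_library/source/include/"
#include <cstdio>
#include <cute/layout.hpp>
using namespace cute;

int main()
{
  // Shape (4, 8) with the default compact column-major strides (1, 4)
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}));
  print(layout);                      // prints the shape:stride description
  printf("\n");
  printf("%d\n", int(layout(2, 3)));  // coordinate (2, 3) maps to 2 + 3*4 = 14
  return 0;
}
```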

References:
1. [nvcc4jupyter][1]
2. [An Even Easier Introduction to CUDA][2]
3. [nvidia-cutlass][3]
4. [nvcc4jupyter Magic API][4]
5. [NVCC Command Options][5]
6. [Getting Started With CuTe][6]
7. [CuTe's support for Matrix Multiply-Accumulate instructions][7]
[1]: https://github.com/andreinechaev/nvcc4jupyter
[2]: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
[3]: https://pypi.org/project/nvidia-cutlass/
[4]: https://nvcc4jupyter.readthedocs.io/en/latest/magics.html#cuda-magic
[5]: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#nvcc-command-options
[6]: https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md
[7]: https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0t_mma_atom.md
[8]: https://hackmd.io/@chrischang80/ryQ8l6URR