# Books : GPU / NPU / CPU Hardware Design

> Reference \:
> * 算力晶片 — 高性能 CPU / GPU / NPU 微架構分析
> https://www.tenlong.com.tw/products/9787121483790
> * 不只是 CUDA,通用 GPU 程式模型及架構原理
> https://www.tenlong.com.tw/products/9786267273388

*This is a study note for the books above. Unlike what I usually work on, these books are mostly about architecture and hardware design. Please note that I am a programmer and hardware design is not my specialty.*

*Note that I only include what I need to add onto what I already know.*

## My Other Posts

* CUDA Programming
https://hackmd.io/@Erebustsai/SJdjopmEn
* Advanced GPGPU Programming
https://hackmd.io/@Erebustsai/HJ5p3-NFp
* OpenCL Programming
https://hackmd.io/@Erebustsai/Sku_EMMr2
* GPGPU Algorithm Implementation
https://hackmd.io/@Erebustsai/Byul7e-Up

# 不只是 CUDA,通用 GPU 程式模型及架構原理

> Reference \:
> * Jensen Huang on GPUs - Computerphile
> https://www.youtube.com/watch?v=G6R7UOFx1bw

## Chapter \#1 \: GPGPU Basic

> Reference \:
> * 不只是 CUDA,通用 GPU 程式模型及架構原理 Page 1-14
> * Advanced GPGPU Programming
> https://hackmd.io/@Erebustsai/HJ5p3-NFp

In this section, I only list articles about AMD GPGPUs, since there is not much material on the internet. I used to work on the [AMD Ryzen™ Embedded R1000 Series](https://www.amd.com/en/products/embedded/ryzen/ryzen-r1000-series.html), but I am not familiar with the full range of AMD GPU architectures. As far as I know, it was a mess.

> Reference \:
> * Chips and Cheese
> https://chipsandcheese.com/
> * The End of an Era: AMD Discontinues Pre-2016 GCN GPU Support
> https://chipsandcheese.com/p/the-end-of-an-era-amd-discontinues-pre-2016-gcn-gpu-support

### Nvidia GPGPUs

![image](https://hackmd.io/_uploads/BJc9vfmTyx.png)

* Fermi \: First architecture to support HPC
* Kepler \: Better double-precision floating-point performance
* Maxwell \: Improved performance for neural networks
* Pascal \: Further improvements
* Volta \: Tensor cores, per-thread program counter \(cooperative groups\)
* Turing \: RT cores \(no explicit programming interface\)
* Ampere \: Improved tensor cores

### AMD GPGPUs

![image](https://hackmd.io/_uploads/HyyGozXTye.png)

**AMD Compute Unit Architecture**

> Reference \:
> * ROCm Documents \: Compute unit \(CU\)
> https://rocm.docs.amd.com/projects/omniperf/en/amd-staging/conceptual/compute-unit.html

## Chapter \#2 \: GPGPU Programming Model

### Thread Distribution

![image](https://hackmd.io/_uploads/BJh7aETnyx.png)

In the Volta architecture, an SM can support up to 32 thread blocks. When a thread block is distributed to an SM, the threads in the block are grouped into warps. If an SM has only 16 SPs, a 32-thread warp needs two passes to issue one instruction.

### SIMD vs SIMT in Algorithm Design

* SIMD \: Requires data to be contiguous; otherwise gather and scatter operations are needed. Additionally, SIMD needs an active mask register to control whether a lane participates in an instruction.
* SIMT \: None of the above limitations, but the performance will differ.

### CUDA PTX ISA

https://docs.nvidia.com/cuda/parallel-thread-execution/
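Not from the book, just my own minimal sketch of how the PTX layer shows up from CUDA C++ \: the kernel below uses inline PTX \(`asm volatile`\) to read the `%laneid` special register, and `nvcc -ptx file.cu` dumps the PTX that the compiler generates for any kernel. The kernel and variable names here are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads its lane index within the warp directly from the
// PTX special register %laneid via inline PTX assembly.
__global__ void laneIdKernel(unsigned int* out)
{
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));

    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = lane;
}

int main()
{
    const int n = 64;                 // two warps worth of threads
    unsigned int* dOut = nullptr;
    cudaMalloc(&dOut, n * sizeof(unsigned int));

    laneIdKernel<<<1, n>>>(dOut);
    cudaDeviceSynchronize();

    unsigned int hOut[n];
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("thread %2d -> lane %2u\n", i, hOut[i]);

    cudaFree(dOut);
    return 0;
}
```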
## Chapter \#3 \: GPGPU Core Control Structure

### GPGPU Instruction Pipeline

> Reference \:
> * GPGPU-sim
> https://github.com/gpgpu-sim/gpgpu-sim_distribution

Unlike a CPU, the GPU instruction pipeline is kept as simple as possible to leave as much die area as possible for compute hardware.

## Appendix

### CUDA \& OpenCL Terms on Programming Side

This table compares the terms used in CUDA and OpenCL on the programming side; a small kernel illustrating the mapping follows the table.

| | OpenCL | CUDA C |
|:--------------------:|:------------------------------|:----------------------------------------------------|
| Thread Block ID | `get_group_id(uint dimIdx)` | `blockIdx.[xyz]` |
| Thread Local ID | `get_local_id(uint dimIdx)` | `threadIdx.[xyz]` |
| Num Threads in Block | `get_local_size(uint dimIdx)` | `blockDim.[xyz]` |
| Num Blocks in Grid | `get_num_groups(uint dimIdx)` | `gridDim.[xyz]` |
| Thread Global ID | `get_global_id(uint dimIdx)` | `blockIdx.[xyz] * blockDim.[xyz] + threadIdx.[xyz]` |
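Again, not from the book \: a minimal CUDA sketch of a 1D vector-add kernel to make the mapping above concrete. The comments note the OpenCL calls that play the same role; the kernel and buffer names are my own.

```cuda
#include <cuda_runtime.h>

// 1D example of the index mapping in the table above.
//   CUDA  : blockIdx.x * blockDim.x + threadIdx.x
//   OpenCL: get_global_id(0)
__global__ void vectorAdd(const float* a, const float* b, float* c, int n)
{
    int block  = blockIdx.x;               // OpenCL: get_group_id(0)
    int local  = threadIdx.x;              // OpenCL: get_local_id(0)
    int bsize  = blockDim.x;               // OpenCL: get_local_size(0)
    int global = block * bsize + local;    // OpenCL: get_global_id(0)

    if (global < n)
        c[global] = a[global] + b[global];
}

// Host-side launch: gridDim.x blocks of blockDim.x threads each.
// OpenCL would instead pass a global and a local work size to clEnqueueNDRangeKernel.
void launchVectorAdd(const float* a, const float* b, float* c, int n)
{
    int blockSize = 256;
    int gridSize  = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(a, b, c, n);
}
```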
### CUDA \& OpenCL Terms on Hardware Side

> Reference \:
> * 不只是 CUDA,通用 GPU 程式模型及架構原理 Page x

This book provides a very good chart mapping hardware terms between CUDA and OpenCL. The Chinese translation is a little odd, but it is good for reference.

![image](https://hackmd.io/_uploads/HkpHDrnnJe.png)

# 算力晶片 — 高性能 CPU / GPU / NPU 微架構分析

*The first 5 chapters cover general modern CPU hardware design and chapter 6 provides examples. Similarly, chapters 7, 8, and 9 cover general modern GPGPU hardware design and chapter 10 provides examples. Lastly, chapter 11 is about interconnect and storage technologies for GPUs.*

:::info
:information_source: **CPU 是怎麼發明的\?CISC 跟 RISC 的 50 年戰爭**
https://www.youtube.com/watch?v=EGcgI66ReXs
:::

## Chapter \#1 \: TOP500 \& MLPerf

> Reference \:
> * TOP500 Home
> https://top500.org/
> * MLPerf Benchmarks
> https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/

## Chapter \#2 \: CPU ISA \& Instruction Pipelining

TODO \: Only glimpsed \(03\/24\/2025\). Requires reading in detail.

*This chapter is basically a review of the Computer Architecture course that most of us took in college.*

### CISC v.s. RISC

**My Other Posts**

* Book : iThome EN 帶你寫個作業系統
https://hackmd.io/@Erebustsai/B14RvnwA0

> Reference \:
> * 硬科技:豆知識 CISC和RISC的差別究竟在哪裡?
> https://www.cool3c.com/article/178288

Basically, with a [micro-instruction set architecture](https://www.geeksforgeeks.org/microarchitecture-and-instruction-set-architecture/), most modern CISC CPUs can combine both CISC and RISC.

### Instruction Pipeline \& MIPS

**Classic 5 Stage Pipeline**

* Instruction Fetch
* Instruction Decode
* Execute
* Memory Access
* Write Back

Modern CPUs mostly have more than 5 pipeline stages, often up to 15 \~ 25 stages.

**Front End** \: the first 2 classic stages

* **Instruction Prefetch**
* **Branch Prediction**
* **Instruction Cache**
* **Instruction Decode** \: translates an instruction into micro-instructions

**Back End** \: the last 3 classic stages

* **Execution Unit**
* **Load\/Store Unit**
* **Write Back Unit**

### Modern Branch Prediction

**Perceptron Branch Prediction**

* Uses a perceptron \(neuron\) to predict branches
* AMD Zen CPUs use this mechanism

**TAGE \(TAgged GEometric history length predictor\) Branch Prediction**

* [Implementation of TAGE Branch Predictor](https://github.com/2Bor2C/tage?tab=readme-ov-file)

## Chapter \#3 \: DRAM \& SRAM Hardware

### My Other Post

[Memory Speed and CAS (Column Access Strobe) Latency](https://hackmd.io/dku91ON4T5uVag0c3TsUtA#Memory-Speed-and-CASColumn-Access-Strobe-Latency)

> Reference \:
> * DRAM 記憶體 60 年 : 最殘酷的半導體戰爭【DRAM 進化史】
> https://youtu.be/eB8SdLQNAmc?feature=shared
> * RAM Explained - Random Access Memory
> https://youtu.be/PVad0c2cljo?feature=shared
> * How does Computer Memory Work? 💻🛠
> https://youtu.be/7J7X7aZvMXQ?feature=shared

### SRAM \& DRAM

* SRAM
    * Access Speed \: 2 \~ 10 ns
    * Lower power consumption
* DRAM
    * Access Speed \: 10 \~ 70 ns
    * Higher power consumption

For how DRAM works, please refer to the [video](https://youtu.be/7J7X7aZvMXQ?feature=shared) in the third reference above.

## Chapter \#4 \: CPU ALU Design

*This chapter is basically a review of the Computer Architecture course that most of us took in college.*

### My Other Post

* SIMD Programming Notes
https://hackmd.io/6h4V3xR6TIaa5xx6dzML1Q?view
* SIMD instructions and the Vector Class Library from Agner Fog
https://hackmd.io/@Erebustsai/B1ZGe5gsT

### My Repos

* [x86 SIMD Algorithms](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/AlgorithmRepo_SIMDx86)
* [x86 SIMD with VCL](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/GEMM_VCL)
* [ARM NEON Matrix Multiplication](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/AlgorithmRepr_NEON/Matrix%20Multiplication)

### SIMD Instructions \(Single Instruction Multiple Data\)

> Reference \:
> * 硬科技:淺談x86的SIMD指令擴張史
> https://www.cool3c.com/article/152918
> https://www.cool3c.com/article/152919
> https://www.cool3c.com/article/152953

**AVX-512**

> Reference \:
> * AVX-512 by [Creel](https://www.youtube.com/@WhatsACreel)
> https://youtube.com/playlist?list=PLKK11LigqitgMouFszr2U-OcCgoknB2TW&feature=shared

## Chapter \#5 \: Topology

### My Other Post

* MPI Programming
https://hackmd.io/@Erebustsai/SkyCU2g4n
* OpenMP Programming
https://hackmd.io/@Erebustsai/HyNZhtM-C

### Infinity Fabric

> Reference \:
> * Infinity Fabric (IF) - AMD
> https://en.wikichip.org/wiki/amd/infinity_fabric
> * Pushing AMD’s Infinity Fabric to its Limits
> https://chipsandcheese.com/p/pushing-amds-infinity-fabric-to-its

Used in AMD CPUs, this interconnect links CPU core to CPU core inside a chip, CPU chip to CPU chip, and CPU to GPU. Basically, this is how AMD CPUs connect everything.

## Chapter \#6 \: CPU Chips Design Example

> Reference \:
> * Zen - AMD's New CPU Architecture
> https://www.youtube.com/watch?v=vBXk-VSJ7UI
> * Comparative Deep-Dive: Zen vs. Bulldozer Architectures
> https://www.youtube.com/watch?v=rdhskuJt_IY
> * How It’s Built: AMD Ryzen™ 5000 Series Desktop Processors
> https://www.youtube.com/watch?v=5sAcXhad16k
> * 從AMD CPU IO Die演進看高速介面IP發展趨勢
> https://zh-hant.m2ipm2.com/NewsStd_669.html

### Zen

* ***Infinity Fabric*** \: Used to connect cores within a processor, processors in a NUMA system, and external devices with the CPU.
* ***Chiplet*** \: The I\/O die and the CCD dies are separated, which creates a flexible design that can add more dies and improves die yield.

### Zen2

* ***Front End*** \: Improved branch predictor with
    * Level 1 \: Neural network-based \(perceptron\) predictor
    * Level 2 \: TAGE predictor
* ***Back End*** \: Provides more retire bandwidth, growing the [instruction retirement](https://blog.csdn.net/weixin_32820767/article/details/85139080) queue from 192 entries to 224 entries. Improves instruction throughput \(IPC\).
* ***SIMD*** \: In Zen 1, a 256-bit AVX2 instruction is split into two 128-bit operations and requires two clock cycles to complete. Zen 2 can complete a 256-bit AVX2 instruction in one cycle.

### Zen3

> Reference \:
> * How It’s Built: AMD Ryzen™ 5000 Series Desktop Processors
> https://www.youtube.com/watch?v=5sAcXhad16k

## Chapter \#7 \: From Graphic to GPGPU

* [Triangle mesh](https://en.wikipedia.org/wiki/Triangle_mesh) \: All the meshes of a 3D object are built from triangles
* [DirectX](https://zh.wikipedia.org/zh-tw/DirectX)
* PTX \: PTX is the intermediate representation; the real assembly is SASS, the native shader assembly ISA

## Chapter \#8 \: GPGPU Storage Architecture \& Thread Management

### GPU Memory

:::info
:information_source: **CUDA Compute Capability**
https://developer.nvidia.com/cuda-gpus

It can mostly be deduced from the GPU's architecture.
* Maxwell \: 5.x
* Pascal \: 6.x
* Volta \/ Turing \: 7.x
* Ampere \: 8.x
:::

> Reference \: Book Page 248

![image](https://hackmd.io/_uploads/ryWWgDep1e.png)

GPUs usually have a large number of registers, as shown in the following picture provided by the book.

![image](https://hackmd.io/_uploads/HkHoVwxp1g.png =x300)

### GPU On-Chip Memory Evolution

***From chapter 8.1.2, the book provides a detailed description of the evolution of GPU on-chip storage*** \(refer to book page 250\).

* [Distributed Shared Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=Distributed%2520Shared%2520Memory#distributed-shared-memory)
* [Independent Thread Scheduling](https://zhuanlan.zhihu.com/p/186192189)
* [Thread Block Clusters](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=Distributed%2520Shared%2520Memory#thread-block-clusters)

## Chapter \#9 \: Tensor Processor

### TPU

> Reference \:
> * 國外 AI 晶片
> https://github.com/chenzomi12/AIInfra/tree/main/01AIChip/05Abroad

* [Systolic Array](https://zh.wikipedia.org/wiki/%E8%84%88%E5%8B%95%E9%99%A3%E5%88%97) \: Data walks through multiple processing elements before the result is written back to memory.

![Systolic_Array](https://hackmd.io/_uploads/BJRxSVbTJx.png)

The TPU is considered an ASIC.

### Tensor Core

Introduced in the Volta architecture. A minimal `wmma` sketch is shown after the reference below.

* The Volta architecture uses a `wmma` instruction, which computes a 16 x 16 x 16 matrix multiplication \(m x n x k, i.e. an m x k matrix times a k x n matrix\) and produces a 16 x 16 result.
* The Volta architecture separates an SM into 4 sub-cores and each sub-core has 2 tensor cores.
* A tensor core can complete a 4 x 4 GEMM operation `D = A x B + C` in one cycle.

> Reference \:
> * Introduction to Tensor Cores Programming
> https://0mean1sigma.com/tgemm/
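This is my own minimal sketch \(not the book's code\) of the CUDA C++ WMMA API that exposes the tensor cores \: a single warp computes one 16 x 16 x 16 fragment of `D = A x B + C`. It assumes compute capability 7.0 or newer, e.g. compile with `nvcc -arch=sm_70`; all kernel and variable names are made up.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp cooperatively computes a 16x16x16 fragment: D = A * B + C
// (here the accumulator C starts at zero).
__global__ void wmmaGemm16(const half* a, const half* b, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, a, 16);     // leading dimension = 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    wmma::store_matrix_sync(d, accFrag, 16, wmma::mem_row_major);
}

int main()
{
    half *dA, *dB;
    float *dD;
    cudaMalloc(&dA, 16 * 16 * sizeof(half));
    cudaMalloc(&dB, 16 * 16 * sizeof(half));
    cudaMalloc(&dD, 16 * 16 * sizeof(float));
    cudaMemset(dA, 0, 16 * 16 * sizeof(half));
    cudaMemset(dB, 0, 16 * 16 * sizeof(half));

    wmmaGemm16<<<1, 32>>>(dA, dB, dD);        // exactly one warp
    cudaDeviceSynchronize();

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dD);
    return 0;
}
```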
## Chapter \#10 \: Pending

## Chapter \#11 \: Storage and Connection

### GDDR \& HBM

> Reference \:
> * GDDR6 vs HBM - Defining GPU Memory Types
> https://www.exxactcorp.com/blog/hpc/gddr6-vs-hbm-gpu-memory

### PCIe

![image](https://hackmd.io/_uploads/Hkffpd-T1l.png)

### CXL

> Reference \:
> https://www.youtube.com/watch?v=W5X8MEZVqzM

### NvLink

A small peer-to-peer access check related to NvLink is sketched at the end of this note.

#### My Other Post

* Multi-GPU Programming & AI Model Training
https://hackmd.io/@Erebustsai/SyfClHYCa
* Advanced GPGPU Programming
https://hackmd.io/MgVSv30jRnaEUJETjDiZgQ?view
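Not from the book \: a minimal sketch using the CUDA runtime peer-to-peer API to check whether two GPUs can access each other's memory directly. Whether that traffic actually goes over NvLink or PCIe depends on the system topology; the device numbering \(0 and 1\) is just an assumption for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Check and enable peer-to-peer access between GPU 0 and GPU 1.
int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) {
        printf("Need at least two GPUs for peer access.\n");
        return 0;
    }

    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("GPU0 -> GPU1 peer access: %s\n", canAccess01 ? "yes" : "no");
    printf("GPU1 -> GPU0 peer access: %s\n", canAccess10 ? "yes" : "no");

    if (canAccess01 && canAccess10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        // From here on, cudaMemcpyPeer() and direct loads/stores between the
        // two devices' memories are allowed.
    }
    return 0;
}
```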