# Books : GPU / NPU / CPU Hardware Design
> Reference \:
> * 算力晶片 — 高性能 CPU / GPU / NPU 微架構分析
> https://www.tenlong.com.tw/products/9787121483790
> * 不只是 CUDA,通用 GPU 程式模型及架構原理
> https://www.tenlong.com.tw/products/9786267273388
*This is a study note for the books above. Unlike what I usually work on, they are mostly about architecture and hardware design. Please note that I am a programmer and hardware design is not my specialty.*
*Note that I only include what I need to add on top of what I already know.*
## My Other Posts
* CUDA Programming
https://hackmd.io/@Erebustsai/SJdjopmEn
* Advanced GPGPU Programming
https://hackmd.io/@Erebustsai/HJ5p3-NFp
* OpenCL Programming
https://hackmd.io/@Erebustsai/Sku_EMMr2
* GPGPU Algorithm Implementation
https://hackmd.io/@Erebustsai/Byul7e-Up
# 不只是 CUDA,通用 GPU 程式模型及架構原理
> Reference \:
> * Jensen Huang on GPUs - Computerphile
> https://www.youtube.com/watch?v=G6R7UOFx1bw
## Chapter \#1 \: GPGPU Basic
> Reference \:
> * 不只是 CUDA,通用 GPU 程式模型及架構原理 Page 1-14
> * Advanced GPGPU Programming
> https://hackmd.io/@Erebustsai/HJ5p3-NFp
In this section, I only list articles made for AMD GPGPU, since there is not much material on the internet. I used to work on the [AMD Ryzen™ Embedded R1000 Series](https://www.amd.com/en/products/embedded/ryzen/ryzen-r1000-series.html), but I am not familiar with the whole AMD GPU architecture lineup. As far as I know, it was a mess.
> Reference \:
> * Chips and Cheese
> https://chipsandcheese.com/
> * The End of an Era: AMD Discontinues Pre-2016 GCN GPU Support
> https://chipsandcheese.com/p/the-end-of-an-era-amd-discontinues-pre-2016-gcn-gpu-support
### Nvidia GPGPUs

* Fermi \: First architecture to support HPC
* Kepler \: Better double-precision floating point
* Maxwell \: Improved performance for neural networks
* Pascal \: Further overall improvements
* Volta \: Tensor Cores; per-thread program counter \(enables cooperative groups \/ independent thread scheduling\)
* Turing \: RT Cores \(no explicit GPGPU programming interface\)
* Ampere \: Improved Tensor Cores
### AMD GPGPUs

**AMD Compute Unit Architecture**
> Reference \:
> * ROCm Documents \: Compute unit \(CU\)
> https://rocm.docs.amd.com/projects/omniperf/en/amd-staging/conceptual/compute-unit.html
## Chapter \#2 \: GPGPU Programming Model
### Thread Distribution

In the Volta architecture, an SM can hold up to 32 resident thread blocks. When a thread block is distributed to an SM, the threads in the block are partitioned into warps of 32 threads. If a multiprocessor has only 16 SPs, a warp needs two passes to execute an instruction.
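A minimal sketch of how a 1D thread block is carved into warps \(the kernel and buffer names are made up for illustration\)\:
```cpp
// Each thread reports which warp and lane it belongs to.
__global__ void warpInfo(int* d_warp, int* d_lane)
{
    int tid    = threadIdx.x;          // linear index inside the block
    int warpId = tid / warpSize;       // which warp of the block (warpSize == 32)
    int laneId = tid % warpSize;       // position inside that warp

    int gid = blockIdx.x * blockDim.x + tid;
    d_warp[gid] = warpId;
    d_lane[gid] = laneId;
}
```
A block of 128 threads is therefore split into 4 warps, and the hardware issues each warp as one unit.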
### SIMD vs SIMT in Algorithm Design
* SIMD \: Requires data to be contiguous, otherwise gather and scatter operations are needed. Additionally, SIMD needs an active-mask register to control whether each lane participates in an instruction.
* SIMT \: None of the above limitations show up in the source code, but the resulting performance will differ.
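A minimal SIMT sketch \(kernel and buffer names are made up for illustration\)\: the per-thread branch below needs no explicit mask in the source; the hardware predicates the divergent paths within a warp automatically.
```cpp
__global__ void simtBranch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)        // threads of the same warp may take different paths
        out[i] = in[i] * 2.0f;
    else
        out[i] = 0.0f;       // inactive lanes are masked by the hardware, not the programmer
}
```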
### CUDA PTX ISA
https://docs.nvidia.com/cuda/parallel-thread-execution/
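As a side note, the PTX that the compiler emits for a `.cu` file can be inspected with `nvcc -ptx kernel.cu`, and the final machine code \(SASS\) in a compiled binary can be dumped with `cuobjdump -sass`.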
## Chapter \#3 \: GPGPU Core Control Structure
### GPGPU Instruction Pipeline
> Reference \:
> * GPGPU-sim
> https://github.com/gpgpu-sim/gpgpu-sim_distribution
Unlike a CPU, a GPU's instruction pipeline is kept as simple as possible to leave as much die area as possible for compute hardware.
## Appendix
### CUDA \& OpenCL Terms on Programming Side
This table maps the terms used in CUDA and OpenCL on the programming side.
| | OpenCL | CUDA C |
|:-----------------:| ----------------------------- | --------------------------------------------------- |
| Thread Block ID | `get_group_id(uint dimIdx)` | `blockIdx.[xyz]` |
| Thread Local ID | `get_local_id(uint dimIdx)` | `threadIdx.[xyz]` |
| Num Threads in Block | `get_local_size(uint dimIdx)` | `blockDim.[xyz]` |
| Num Blocks in Grid | `get_num_groups(uint dimIdx)` | `gridDim.[xyz]` |
| Thread Global ID | `get_global_id(uint dimIdx)` | `blockIdx.[xyz] * blockDim.[xyz] + threadIdx.[xyz]` |
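A minimal CUDA sketch tying the rows together for dimension 0 \(the kernel and buffer names are made up for illustration\)\:
```cpp
__global__ void copyKernel(const float* src, float* dst, int n)
{
    int groupId   = blockIdx.x;                      // get_group_id(0)
    int localId   = threadIdx.x;                     // get_local_id(0)
    int localSize = blockDim.x;                      // get_local_size(0)
    int numGroups = gridDim.x;                       // get_num_groups(0)
    int globalId  = groupId * localSize + localId;   // get_global_id(0)

    // Grid-stride loop so the mapping covers n elements regardless of launch size.
    for (int i = globalId; i < n; i += localSize * numGroups)
        dst[i] = src[i];
}
```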
### CUDA \& OpenCL Terms on Hardware Side
> Reference \:
> * 不只是 CUDA,通用 GPU 程式模型及架構原理 Page x
This book provides a very good chart mapping hardware terms between CUDA and OpenCL. The Chinese translation is a little odd, but it is good as a reference.

# 算力晶片 — 高性能 CPU / GPU / NPU 微架構分析
*The first 5 chapters cover general modern CPU hardware design and chapter 6 provides examples. Similarly, chapters 7, 8, and 9 cover general modern GPGPU hardware design and chapter 10 provides examples. Lastly, chapter 11 covers interconnect and storage technologies for GPUs.*
:::info
:information_source: **CPU 是怎麼發明的\?CISC 跟 RISC 的 50 年戰爭**
https://www.youtube.com/watch?v=EGcgI66ReXs
:::
## Chapter \#1 \: TOP500 \& MLPerf
> Reference \:
> * TOP500 Home
> https://top500.org/
> * MLPerf Benchmarks
> https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/
## Chapter \#2 \: CPU ISA \& Instruction Pipelining
TODO \: Only glimpsed \(03\/24\/2025\). Requires reading in detail.
*This chapter is basically a review of Computer Architecture that most of us took in College.*
### CISC v.s. RISC
**My Other Posts**
* Book : iThome EN 帶你寫個作業系統
https://hackmd.io/@Erebustsai/B14RvnwA0
> Reference \:
> * 硬科技:豆知識 CISC和RISC的差別究竟在哪裡?
> https://www.cool3c.com/article/178288
Basically, with a [micro-instruction set architecture](https://www.geeksforgeeks.org/microarchitecture-and-instruction-set-architecture/), most modern CISC CPUs combine both CISC and RISC\: complex instructions are decoded into RISC-like micro-operations internally.
### Instruction Pipeline \& MIPS
**Classic 5 Stage Pipeline**
* Instruction Fetch
* Instruction Decode
* Execute
* Memory Access
* Write Back
Modern CPUs mostly have more than 5 pipeline stages and can go up to 15 \~ 25 stages.
**Front End** \: the first 2 stages \(fetch and decode\)
* **Instruction Prefetch**
* **Branch Prediction**
* **Instruction Cache**
* **Instruction Decode** \: decodes an instruction into micro-instructions \(micro-ops\)
**Back End** \: the last 3 stages
* **Execution Unit**
* **Load\/Store Unit**
* **Write Back Unit**
### Modern Branch Prediction
**Perceptron Branch Prediction**
* Uses a perceptron \(neuron\) to predict branch outcomes
* AMD Zen CPUs use this mechanism
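A minimal host-side C\+\+ sketch of the idea \(not AMD's actual implementation\)\: a table of small integer weight vectors is indexed by the branch PC, the dot product with the global branch history gives the prediction, and the weights are nudged when the prediction is wrong or weak. Table size, history length, and the threshold below are made-up illustration values.
```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

constexpr int kHistoryLen     = 16;    // bits of global branch history
constexpr int kNumPerceptrons = 1024;  // number of weight vectors in the table
constexpr int kThreshold      = 33;    // training threshold (illustrative value)

struct PerceptronPredictor {
    // weights[p][0] is the bias; weights[p][1..] pair with the history bits.
    std::array<std::array<int, kHistoryLen + 1>, kNumPerceptrons> weights{};
    std::array<int, kHistoryLen> history{};  // +1 = taken, -1 = not taken

    int dot(uint64_t pc) const {
        const auto& w = weights[pc % kNumPerceptrons];
        int y = w[0];
        for (int i = 0; i < kHistoryLen; ++i) y += w[i + 1] * history[i];
        return y;
    }

    bool predict(uint64_t pc) const { return dot(pc) >= 0; }

    void update(uint64_t pc, bool taken) {
        int y = dot(pc);
        auto& w = weights[pc % kNumPerceptrons];
        // Train on a misprediction or when the output magnitude is still small.
        if ((y >= 0) != taken || std::abs(y) <= kThreshold) {
            int t = taken ? 1 : -1;
            w[0] += t;
            for (int i = 0; i < kHistoryLen; ++i) w[i + 1] += t * history[i];
        }
        // Shift the actual outcome into the global history.
        for (int i = kHistoryLen - 1; i > 0; --i) history[i] = history[i - 1];
        history[0] = taken ? 1 : -1;
    }
};
```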
**TAGE \(TAgged GEometric history length predictor\) Branch Prediction**
* [Implementation of TAGE Branch Predictor](https://github.com/2Bor2C/tage?tab=readme-ov-file)
## Chapter \#3 \: DRAM \& SRAM Hardware
### My Other Post
[Memory Speed and CAS(Column Access Strobe Latency)](https://hackmd.io/dku91ON4T5uVag0c3TsUtA#Memory-Speed-and-CASColumn-Access-Strobe-Latency)
> Reference \:
> * DRAM 記憶體 60 年 : 最殘酷的半導體戰爭【DRAM 進化史】
> https://youtu.be/eB8SdLQNAmc?feature=shared
> * RAM Explained - Random Access Memory
> https://youtu.be/PVad0c2cljo?feature=shared
> * How does Computer Memory Work? 💻🛠
> https://youtu.be/7J7X7aZvMXQ?feature=shared
### SRAM \& DRAM
* SRAM
    * Access Speed \: 2 \~ 10 ns
    * Lower Power Consumption
* DRAM
    * Access Speed \: 10 \~ 70 ns
    * Higher Power Consumption
For how DRAM works, please refer to the [video](https://youtu.be/7J7X7aZvMXQ?feature=shared) in the third reference above.
## Chapter \#4 \: CPU ALU Design
*This chapter is basically a review of Computer Architecture that most of us took in College.*
### My Other Post
* SIMD Programming Notes
https://hackmd.io/6h4V3xR6TIaa5xx6dzML1Q?view
* SIMD Instructions and the Vector Class Library from Agner Fog
https://hackmd.io/@Erebustsai/B1ZGe5gsT
### My Repos
* [x86 SIMD Algorithms](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/AlgorithmRepo_SIMDx86)
* [x86 SIMD with VCL](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/GEMM_VCL)
* [ARM NEON Matrix Multiplication](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/AlgorithmRepr_NEON/Matrix%20Multiplication)
### SIMD Instructions \(Single Instruction Multiple Data\)
> Reference \:
> * 硬科技:淺談x86的SIMD指令擴張史
> https://www.cool3c.com/article/152918
> https://www.cool3c.com/article/152919
> https://www.cool3c.com/article/152953
**AVX-512**
> Reference \:
> * AVX 512 by [Creel](https://www.youtube.com/@WhatsACreel)
> https://youtube.com/playlist?list=PLKK11LigqitgMouFszr2U-OcCgoknB2TW&feature=shared
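As a quick refresher on what these instructions look like from C\+\+, here is a minimal sketch using AVX2 intrinsics \(8 floats per instruction\); the function and array names are made up, and the AVX-512 version would use the `_mm512_*` equivalents.
```cpp
#include <immintrin.h>

// c[i] = a[i] + b[i], compiled with -mavx2 on gcc/clang.
void addAVX2(const float* a, const float* b, float* c, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats (unaligned load)
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                        // scalar tail for the remainder
        c[i] = a[i] + b[i];
}
```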
## Chapter \#5 \: Topology
### My Other Post
* MPI Programming
https://hackmd.io/@Erebustsai/SkyCU2g4n
* OpenMP Programming
https://hackmd.io/@Erebustsai/HyNZhtM-C
### Infinity Fabric
> Reference \:
> * Infinity Fabric (IF) - AMD
> https://en.wikichip.org/wiki/amd/infinity_fabric
> * Pushing AMD’s Infinity Fabric to its Limits
> https://chipsandcheese.com/p/pushing-amds-infinity-fabric-to-its
Used in AMD CPUs, this interconnect links CPU cores inside a chip, CPU chips to other CPU chips, and the CPU to the GPU. Basically, this is how AMD CPUs connect everything.
## Chapter \#6 \: CPU Chips Design Example
> Reference \:
> * Zen - AMD's New CPU Architecture
> https://www.youtube.com/watch?v=vBXk-VSJ7UI
> * Comparative Deep-Dive: Zen vs. Bulldozer Architectures
> https://www.youtube.com/watch?v=rdhskuJt_IY
> * How It’s Built: AMD Ryzen™ 5000 Series Desktop Processors
> https://www.youtube.com/watch?v=5sAcXhad16k
> * 從AMD CPU IO Die演進看高速介面IP發展趨勢
> https://zh-hant.m2ipm2.com/NewsStd_669.html
### Zen
* ***Infinity Fabric*** \: Used to connect cores within a processor, processors in a NUMA system, and external devices to the CPU.
* ***Chiplet*** \: The I\/O die and the CCD dies are separated, which creates a flexible design that can add more dies and improves die yield.
### Zen2
* ***Front End*** \: Improved branch prediction with
    * level 1 \: neural-network-based \(perceptron\) predictor
    * level 2 \: TAGE predictor
* ***Back End*** \: Increased the capacity for [Instructions Retired](https://blog.csdn.net/weixin_32820767/article/details/85139080) from 192 entries to 224 entries, improving instruction throughput \(IPC\).
* ***SIMD*** \: In Zen 1, a 256-bit AVX2 instruction is split into two 128-bit operations and requires two clock cycles to complete. Zen 2 can complete a 256-bit AVX2 instruction in one cycle.
### Zen3
> Reference \:
> * How It’s Built: AMD Ryzen™ 5000 Series Desktop Processors
> https://www.youtube.com/watch?v=5sAcXhad16k
## Chapter \#7 \: From Graphic to GPGPU
* [Triangle mesh](https://en.wikipedia.org/wiki/Triangle_mesh) \: All the meshes for a 3D object are built from triangles
* [DirectX](https://zh.wikipedia.org/zh-tw/DirectX)
* PTX \: The intermediate representation, which is compiled down to the real assembly \(SASS, the native shader ISA\)
## Chapter \#8 \: GPGPU Storage Architecture \& Thread Management
### GPU Memory
:::info
:information_source: **CUDA Compute Capability**
https://developer.nvidia.com/cuda-gpus
The compute capability can mostly be deduced from the GPU's architecture.
* Maxwell \: 5.x
* Pascal \: 6.x
* Volta \/ Turing \: 7.x
* Ampere \: 8.x
:::
> Reference \: Book Page 248

GPUs usually have a large number of registers, as shown in the figure from the book \(page 248\).

### GPU On-Chip Memory Evolution
***Chapter 8.1.2 of the book provides a detailed description of the evolution of GPU on-chip storage*** \(refer to book page 250\).
* [Distributed Shared Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=Distributed%2520Shared%2520Memory#distributed-shared-memory)
* [Independent Thread Scheduling](https://zhuanlan.zhihu.com/p/186192189)
* [Thread Block Clusters](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=Distributed%2520Shared%2520Memory#thread-block-clusters)
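A minimal sketch of how thread block clusters and distributed shared memory fit together \(requires a compute capability 9.0 GPU; the kernel name, block size, and buffer names are made up for illustration\)\:
```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each cluster groups 2 thread blocks; their shared memories form one
// distributed shared memory segment that both blocks can address.
__global__ void __cluster_dims__(2, 1, 1) clusterExchange(float* out)
{
    __shared__ float smem[128];                    // assumes blockDim.x == 128
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();      // this block's rank inside the cluster

    smem[threadIdx.x] = static_cast<float>(rank);
    cluster.sync();                                // make every block's shared memory visible

    // Map the neighbouring block's shared memory and read from it.
    unsigned int peer = (rank + 1) % cluster.num_blocks();
    float* remote = cluster.map_shared_rank(smem, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];

    cluster.sync();   // keep the remote shared memory alive until all blocks finish reading
}
```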
## Chapter \#9 \: Tensor Processor
### TPU
> Reference \:
> * 國外 AI 晶片
> https://github.com/chenzomi12/AIInfra/tree/main/01AIChip/05Abroad
* [Systolic Array](https://zh.wikipedia.org/wiki/%E8%84%88%E5%8B%95%E9%99%A3%E5%88%97) \: Data flows through multiple processing elements \(multiple compute steps\) before being written back to memory.

The TPU is considered an ASIC.
### Tensor Core Started from Volta Architecture
* The Volta architecture uses `wmma` instructions, which compute a 16 x 16 x 16 matrix multiplication \(m x n x k \: an m x k matrix times a k x n matrix\) and produce a 16 x 16 result matrix.
* The Volta architecture separates an SM into 4 sub-cores, and each sub-core has 2 Tensor Cores.
* A Tensor Core can complete a 4 x 4 matrix multiply-accumulate `D = A x B + C` in one cycle.
> Reference \:
> * Introduction to Tensor Cores Programming
> https://0mean1sigma.com/tgemm/
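Following the reference above, a minimal sketch of the CUDA WMMA C\+\+ API that sits on top of these instructions \(one warp computes a single 16 x 16 x 16 tile; the kernel and pointer names are made up, and it assumes sm_70 or newer\)\:
```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively computes D = A x B + C for a single 16x16 tile
// (A, B in FP16; C, D accumulated in FP32), all matrices stored with a
// leading dimension of 16.
__global__ void wmmaTile(const half* A, const half* B, const float* C, float* D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::load_matrix_sync(aFrag, A, 16);
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(cFrag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);                // D = A x B + C

    wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}
```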
## Chapter \# 10 \: Pending
## Chapter \# 11 \: Storage and Connection
### GDDR \& HBM
> Reference \:
> * GDDR6 vs HBM - Defining GPU Memory Types
> https://www.exxactcorp.com/blog/hpc/gddr6-vs-hbm-gpu-memory
### PCIe

### CXL
> Reference \:
> https://www.youtube.com/watch?v=W5X8MEZVqzM
### NvLink
#### My Other Post
* Multi-GPU Programming & AI Model Training
https://hackmd.io/@Erebustsai/SyfClHYCa
* Advanced GPGPU Programming
https://hackmd.io/MgVSv30jRnaEUJETjDiZgQ?view