# Computer Architecture Quiz 4 Study Guide
:::info
Lecture slides:
c4: Chapter 4 20231205r0
Supplementary slides:
l20: lecture20_DLP_Vector.r0
:::
1. What are the basic variations of architectures exploiting SIMD/DLP? State what each variation originated from.
   What are the orders of magnitude of DLP for these variations? (3 * 3 * 5)
<!-- - SIMD architectures can exploit significant data-level parallelism for:
- Matrix-oriented scientific computing
- Media-oriented image and sound processors -->
- Three variations: (l20 p.10)
    - Vector architectures -> Matrix-oriented scientific computing
    - SIMD extensions -> Media-oriented image and sound processing
    - Graphics Processor Units (GPUs) -> Graphics, image, and sound processing
- x86 processors -> MIMD + SIMD (l20 p.10)
    - Expect two additional cores per chip per year
    - SIMD width to double every four years
    - Potential speedup from SIMD to be twice that from MIMD
Full names:
- MMX (MultiMedia eXtensions)
- SSE (Streaming SIMD Extensions)
- AVX (Advanced Vector eXtensions)
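The growth projections quoted above (l20 p.10) can be turned into a small worked example. The year-0 baseline of 4 cores and 8-wide SIMD is my assumption for illustration, not a number from the slides:

```c
/* Sketch of the l20 p.10 projection: two additional cores per chip per
 * year (MIMD speedup grows linearly), SIMD width doubling every four
 * years (SIMD speedup grows geometrically).  The baseline of 4 cores
 * and 8-wide SIMD is an assumed starting point, not from the slides. */
int mimd_speedup(int years) { return 4 + 2 * years; }    /* +2 cores per year   */
int simd_speedup(int years) { return 8 << (years / 4); } /* doubles every 4 yrs */
int combined_speedup(int years) {                        /* MIMD * SIMD product */
    return mimd_speedup(years) * simd_speedup(years);
}
```

After 8 years the assumed machine offers 20x from MIMD but 32x from SIMD, showing why the slides expect SIMD's potential speedup to pull ahead of MIMD's over time.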
2. Compared with the MIMD architecture, what are the advantages of the SIMD architecture? (2*10)
- SIMD is more energy efficient than MIMD (c4 p.2)
- Only needs to fetch one instruction per data operation
- Makes SIMD attractive for personal mobile devices
- Potential speedup from SIMD to be twice that from MIMD (c4 p.3)
3. What is the basic idea of vector architectures? What are the main components of RV64V? (15+4*5)
- Basic idea: (c4 p.4)
- Read sets of data elements into “vector registers”
- Operate on those registers
- Disperse the results back into memory
- RV64V: (c4 p.5)
- Loosely based on Cray-1
- 32 64-bit vector registers
- Register file has 16 read ports and 8 write ports
- Vector functional units
- Fully pipelined
- Data and control hazards are detected
- Vector load-store unit
- Fully pipelined
- One word per clock cycle after initial latency
- Scalar registers
- 31 general-purpose registers
- 32 floating-point registers
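The read/operate/disperse basic idea can be sketched with the classic DAXPY loop (Y = a*X + Y). The loop body is annotated with RV64V-style mnemonics: vld/vst appear in l20 p.15,16,25; the vmul/vadd names are approximate:

```c
#include <stddef.h>

/* The vector-architecture pattern: read sets of elements into vector
 * registers, operate on the registers, disperse results back to memory.
 * One RV64V instruction would cover a whole vector register of
 * iterations at once; the scalar loop is only a stand-in. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {   /* one vector op covers many i        */
        double xi = x[i];              /* vld : load elements of X           */
        double yi = y[i];              /* vld : load elements of Y           */
        yi = a * xi + yi;              /* vmul + vadd on the vector registers */
        y[i] = yi;                     /* vst : store results back           */
    }
}
```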
4. Which addressing modes does RV64V support? What are the corresponding load/store instructions for each mode? Which scenarios does each mode suit? (5*10)
- Three types of addressing: (l20 p.27)
    - Unit stride
        - ==Contiguous block of information in memory==
        - Fastest: always possible to optimize this
    - Non-unit (constant) stride
        - Harder to optimize memory system for all possible strides
        - ==Prime number of data banks makes it easier to support different strides at full bandwidth==
    - Indexed (gather-scatter)
        - Vector equivalent of register indirect
        - Good for ==sparse arrays== of data
        - Increases number of programs that vectorize
- load (l20 p.15,16,25)
    - vld (unit stride)
    - vlds (non-unit, constant stride)
    - vldx (indexed)
- store (l20 p.15,16,25)
    - vst (unit stride)
    - vsts (non-unit, constant stride)
    - vstx (indexed)
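The three access patterns behind vld/vlds/vldx can be sketched as scalar loops; a real vector unit performs each whole loop as a single instruction:

```c
#include <stddef.h>

/* vld: unit stride -- a contiguous block of memory (fastest case) */
void load_unit(double *dst, const double *a, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i];
}

/* vlds: non-unit (constant) stride -- e.g. walking a matrix column */
void load_strided(double *dst, const double *a, size_t n, size_t stride) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i * stride];
}

/* vldx: indexed (gather) -- an index vector supplies each element's
 * position; this is the pattern used for sparse arrays */
void load_indexed(double *dst, const double *a, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[idx[i]];
}
```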
5. How do we handle a conditional IF statement inside a vector loop, e.g., textbook p. 297? What support does RV64V provide for this? (2*10)

- Vector Mask Registers (c4 p.13, l20 p.24)
    - Use a predicate register to "disable" elements
    - Predicate registers pi are used (l20 p.16)
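The mask idea can be sketched on a loop of the p. 297 shape, `if (X[i] != 0) X[i] = X[i] - Y[i]`: a compare fills a predicate bit per element, and the operation then runs only on the enabled lanes instead of branching:

```c
#include <stddef.h>

/* Vector-mask sketch: the first loop is the vector compare that sets
 * the mask register; the second is the masked vector subtract, which
 * touches only elements whose predicate bit is 1.  The 64-entry mask
 * mirrors the textbook's 64-element vectors, so n must be <= 64. */
void masked_sub(size_t n, double *x, const double *y) {
    _Bool mask[64];                          /* stand-in for a predicate register */
    for (size_t i = 0; i < n; i++)
        mask[i] = (x[i] != 0.0);             /* compare sets the mask bits */
    for (size_t i = 0; i < n; i++)
        if (mask[i]) x[i] = x[i] - y[i];     /* operate on enabled lanes only */
}
```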
6. How do SIMD ISA extensions (SIMD Instruction Set Extensions) work? Compared with vector architectures, what limitations do they have? (10+3*10) (c4 p.18)
- Media applications operate on data types narrower than the native word size
    - Example: disconnect carry chains to "partition" the adder
- Limitations, compared to vector instructions:
- Number of data operands encoded into op code
- No sophisticated addressing modes (strided, scatter-gather)
- No mask registers
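The "disconnect carry chains" idea can be shown in software: one 32-bit add treated as four independent 8-bit adds. Masking off each byte's top bit stops carries from crossing byte boundaries, and the top bits are then recombined with XOR (a carry-less one-bit add):

```c
#include <stdint.h>

/* Partitioned add: four 8-bit lanes packed in one 32-bit word, each
 * wrapping modulo 256 with no carry leaking into its neighbor -- the
 * same effect the hardware gets by cutting the adder's carry chain. */
uint32_t add_4x8(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* no cross-byte carry */
    return low ^ ((a ^ b) & 0x80808080u);                 /* restore top bits    */
}
```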
7. Compared with vector architectures and SIMD ISA extensions, what is the corresponding GPU instruction extension? What is unique about it? (2*10)
<!-- - ISA is an abstraction of the hardware instruction set
- “Parallel Thread Execution (PTX)”
- opcode.type d,a,b,c;
- Uses virtual registers
- Translation to machine code is performed in software -->
- Single Instruction, Multiple Thread (SIMT) (c4 p.23)
    - Basic idea:
        - Heterogeneous execution model
            - CPU is the host, GPU is the device
        - Develop a C-like programming language for the GPU
        - Unify all forms of GPU parallelism as the CUDA thread
        - Programming model is "Single Instruction Multiple Thread"
    - ==SIMD architectures have no scatter-gather support==
<!-- - Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor -->
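The SIMT model above can be sketched in plain C: every CUDA thread runs the same kernel body, and its thread id selects the one data element it handles (c4 p.23, p.35). The sequential launch loop stands in for the GPU running the threads in parallel, grouped into warps:

```c
#include <stddef.h>

/* One "CUDA thread" per data element: the kernel body is identical for
 * every thread; only tid differs.  Plain-C stand-in, not real CUDA. */
void add_kernel(size_t tid, const double *a, const double *b, double *c) {
    c[tid] = a[tid] + b[tid];              /* this thread's single element */
}

void launch_grid(size_t n, const double *a, const double *b, double *c) {
    for (size_t tid = 0; tid < n; tid++)   /* hardware runs these in parallel */
        add_kernel(tid, a, b, c);
}
```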
8. What are the similarities and differences in compute support among the NVIDIA Pascal, Volta, and Ampere GPU architectures? (3*10)

- Pascal: no Tensor Cores (Tensor Cores first appear in Volta)
- FP16 Tensor Core: Volta, Ampere
- TF32 Tensor Core: Ampere
- INT8 Tensor Core: Ampere
9. What is the compute hierarchy of a GPU? Explain the relationship among Thread, Warp, and Grid. (4*10)
(from the class whiteboard)
Grid
↑
Block(s) (@SM)
↑
Thread (@Core)

- Thread: a thread is associated with each data element (c4 p.35)
- Warp: composed of threads
- Grid: composed of blocks
(from the class whiteboard)
English terms:
- Hierarchical Clustered Array Processors
- CU: Compute Unit
- SM: Streaming Multiprocessor
- TPC: Texture Processing Cluster
- GPC: Graphics Processing Cluster
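The Grid > Block > Thread hierarchy above determines how a thread locates its data element: its global id combines its block's index with its index inside the block (CUDA's `blockIdx.x * blockDim.x + threadIdx.x`):

```c
#include <stddef.h>

/* Global thread id from the grid hierarchy: block_idx picks the block
 * within the grid, block_dim is the threads per block, thread_idx is
 * the position within the block. */
size_t global_thread_id(size_t block_idx, size_t block_dim, size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

For example, with blocks of 256 threads, thread 5 of block 2 handles element 517 of the grid's data.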
10. Similarities and differences between GPUs and vector processors? (4*10)
- Similarities to vector machines:
    - Works well with data-level parallel problems
    - Scatter-gather transfers
    - Mask registers
    - Large register files
- Differences:
    - No scalar processor
    - Uses multithreading to hide memory latency
    - Has many functional units, as opposed to a few deeply pipelined units like a vector processor