# Computer Architecture Quiz 4 Study Guide
:::info
Lecture slides:
c4: Chapter 4 20231205r0
Supplementary slides:
l20: lecture20_DLP_Vector.r0
:::
1. What are the basic variations of architectures exploiting SIMD/DLP? State what each variation originated from.
   What are the orders of magnitude of DLP for these variations? (3 * 3 * 5)
<!-- - SIMD architectures can exploit significant data-level parallelism for:
- Matrix-oriented scientific computing
- Media-oriented image and sound processors -->
- Three variations: (l20 p.10)
    - Vector architectures -> Matrix-oriented scientific computing
    - SIMD extensions -> Media-oriented image and sound processing
    - Graphics Processor Units (GPUs) -> Graphics, image, and sound processing
- x86 processors -> MIMD + SIMD (l20 p.10)
    - Expect two additional cores per chip per year
    - SIMD width to double every four years
    - Potential speedup from SIMD to be twice that from MIMD
Full names:
- MMX (MultiMedia eXtensions)
- SSE (Streaming SIMD Extensions)
- AVX (Advanced Vector eXtensions)
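The growth projections quoted above (l20 p.10) can be turned into a small worked example. The year-0 baseline of 4 cores and 8-wide SIMD is my assumption for illustration, not a number from the slides:

```c
/* Sketch of the l20 p.10 projection: two additional cores per chip per
 * year (MIMD speedup grows linearly), SIMD width doubling every four
 * years (SIMD speedup grows geometrically).  The baseline of 4 cores
 * and 8-wide SIMD is an assumed starting point, not from the slides. */
int mimd_speedup(int years) { return 4 + 2 * years; }    /* +2 cores per year   */
int simd_speedup(int years) { return 8 << (years / 4); } /* doubles every 4 yrs */
int combined_speedup(int years) {                        /* MIMD * SIMD product */
    return mimd_speedup(years) * simd_speedup(years);
}
```

After 8 years the assumed machine offers 20x from MIMD but 32x from SIMD, showing why the slides expect SIMD's potential speedup to pull ahead of MIMD's over time.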
2. Compared with the MIMD architecture, what are the advantages of the SIMD architecture? (2*10)
- SIMD is more energy efficient than MIMD (c4 p.2)
- Only needs to fetch one instruction per data operation
- Makes SIMD attractive for personal mobile devices
- Potential speedup from SIMD to be twice that from MIMD (c4 p.3)
3. What is the basic idea of vector architectures? What are the main components of RV64V? (15+4*5)
- Basic idea: (c4 p.4)
- Read sets of data elements into “vector registers”
- Operate on those registers
- Disperse the results back into memory
- RV64V: (c4 p.5)
- Loosely based on Cray-1
- 32 64-bit vector registers
- Register file has 16 read ports and 8 write ports
- Vector functional units
- Fully pipelined
- Data and control hazards are detected
- Vector load-store unit
- Fully pipelined
- One word per clock cycle after initial latency
- Scalar registers
- 31 general-purpose registers
- 32 floating-point registers
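The read/operate/disperse basic idea can be sketched with the classic DAXPY loop (Y = a*X + Y). The loop body is annotated with RV64V-style mnemonics: vld/vst appear in l20 p.15,16,25; the vmul/vadd names are approximate:

```c
#include <stddef.h>

/* The vector-architecture pattern: read sets of elements into vector
 * registers, operate on the registers, disperse results back to memory.
 * One RV64V instruction would cover a whole vector register of
 * iterations at once; the scalar loop is only a stand-in. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {   /* one vector op covers many i        */
        double xi = x[i];              /* vld : load elements of X           */
        double yi = y[i];              /* vld : load elements of Y           */
        yi = a * xi + yi;              /* vmul + vadd on the vector registers */
        y[i] = yi;                     /* vst : store results back           */
    }
}
```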
4. Which addressing modes does RV64V support? What are the corresponding load/store instructions for each mode? Which scenarios does each mode suit? (5*10)
- Three types of addressing: (l20 p.27)
    - Unit stride
        - ==Contiguous block of information in memory==
        - Fastest: always possible to optimize this
    - Non-unit (constant) stride
        - Harder to optimize memory system for all possible strides
        - ==Prime number of data banks makes it easier to support different strides at full bandwidth==
    - Indexed (gather-scatter)
        - Vector equivalent of register indirect
        - Good for ==sparse arrays== of data
        - Increases number of programs that vectorize
- load (l20 p.15,16,25)
    - vld (unit stride)
    - vlds (non-unit, constant stride)
    - vldx (indexed)
- store (l20 p.15,16,25)
    - vst (unit stride)
    - vsts (non-unit, constant stride)
    - vstx (indexed)
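The three access patterns behind vld/vlds/vldx can be sketched as scalar loops; a real vector unit performs each whole loop as a single instruction:

```c
#include <stddef.h>

/* vld: unit stride -- a contiguous block of memory (fastest case) */
void load_unit(double *dst, const double *a, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i];
}

/* vlds: non-unit (constant) stride -- e.g. walking a matrix column */
void load_strided(double *dst, const double *a, size_t n, size_t stride) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i * stride];
}

/* vldx: indexed (gather) -- an index vector supplies each element's
 * position; this is the pattern used for sparse arrays */
void load_indexed(double *dst, const double *a, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[idx[i]];
}
```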
5. How do we handle a conditional IF statement inside a vector loop, e.g., textbook p. 297? What support does RV64V provide for this? (2*10)

- Vector Mask Registers (c4 p.13, l20 p.24)
    - Use a predicate register to "disable" elements
    - Predicate registers pi are used (l20 p.16)
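The mask idea can be sketched on a loop of the p. 297 shape, `if (X[i] != 0) X[i] = X[i] - Y[i]`: a compare fills a predicate bit per element, and the operation then runs only on the enabled lanes instead of branching:

```c
#include <stddef.h>

/* Vector-mask sketch: the first loop is the vector compare that sets
 * the mask register; the second is the masked vector subtract, which
 * touches only elements whose predicate bit is 1.  The 64-entry mask
 * mirrors the textbook's 64-element vectors, so n must be <= 64. */
void masked_sub(size_t n, double *x, const double *y) {
    _Bool mask[64];                          /* stand-in for a predicate register */
    for (size_t i = 0; i < n; i++)
        mask[i] = (x[i] != 0.0);             /* compare sets the mask bits */
    for (size_t i = 0; i < n; i++)
        if (mask[i]) x[i] = x[i] - y[i];     /* operate on enabled lanes only */
}
```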
6. How do SIMD ISA extensions (SIMD Instruction Set Extensions) work? Compared with vector architectures, what limitations do they have? (10+3*10) (c4 p.18)
- Media applications operate on data types narrower than the native word size
    - Example: disconnect carry chains to "partition" the adder
- Limitations, compared to vector instructions:
- Number of data operands encoded into op code
- No sophisticated addressing modes (strided, scatter-gather)
- No mask registers
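The "disconnect carry chains" idea can be shown in software: one 32-bit add treated as four independent 8-bit adds. Masking off each byte's top bit stops carries from crossing byte boundaries, and the top bits are then recombined with XOR (a carry-less one-bit add):

```c
#include <stdint.h>

/* Partitioned add: four 8-bit lanes packed in one 32-bit word, each
 * wrapping modulo 256 with no carry leaking into its neighbor -- the
 * same effect the hardware gets by cutting the adder's carry chain. */
uint32_t add_4x8(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* no cross-byte carry */
    return low ^ ((a ^ b) & 0x80808080u);                 /* restore top bits    */
}
```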
7. Compared with vector architectures and SIMD ISA extensions, what is the corresponding GPU instruction extension? What is unique about it? (2*10)
<!-- - ISA is an abstraction of the hardware instruction set
- “Parallel Thread Execution (PTX)”
- opcode.type d,a,b,c;
- Uses virtual registers
- Translation to machine code is performed in software -->
- Single Instruction, Multiple Thread (SIMT) (c4 p.23)
    - Basic idea:
        - Heterogeneous execution model
            - CPU is the host, GPU is the device
        - Develop a C-like programming language for the GPU
        - Unify all forms of GPU parallelism as the CUDA thread
        - Programming model is "Single Instruction Multiple Thread"
    - ==SIMD architectures have no scatter-gather support==
<!-- - Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor -->
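The SIMT model above can be sketched in plain C: every CUDA thread runs the same kernel body, and its thread id selects the one data element it handles (c4 p.23, p.35). The sequential launch loop stands in for the GPU running the threads in parallel, grouped into warps:

```c
#include <stddef.h>

/* One "CUDA thread" per data element: the kernel body is identical for
 * every thread; only tid differs.  Plain-C stand-in, not real CUDA. */
void add_kernel(size_t tid, const double *a, const double *b, double *c) {
    c[tid] = a[tid] + b[tid];              /* this thread's single element */
}

void launch_grid(size_t n, const double *a, const double *b, double *c) {
    for (size_t tid = 0; tid < n; tid++)   /* hardware runs these in parallel */
        add_kernel(tid, a, b, c);
}
```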
8. What are the similarities and differences in compute support among the NVIDIA Pascal, Volta, and Ampere GPU architectures? (3*10)

- Pascal: no Tensor Cores (Tensor Cores first appear in Volta)
- FP16 Tensor Core: Volta, Ampere
- TF32 Tensor Core: Ampere
- INT8 Tensor Core: Ampere
9. What is the compute hierarchy of a GPU? Explain the relationship among Thread, Warp, and Grid. (4*10)
(from the class whiteboard)
Grid
↑
Block(s) (@SM)
↑
Thread (@Core)

- Thread: a thread is associated with each data element (c4 p.35)
- Warp: composed of threads
- Grid: composed of blocks
(from the class whiteboard)
English terms:
- Hierarchical Clustered Array Processors
- CU: Compute Unit
- SM: Streaming Multiprocessor
- TPC: Texture Processing Cluster
- GPC: Graphics Processing Cluster
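The Grid > Block > Thread hierarchy above determines how a thread locates its data element: its global id combines its block's index with its index inside the block (CUDA's `blockIdx.x * blockDim.x + threadIdx.x`):

```c
#include <stddef.h>

/* Global thread id from the grid hierarchy: block_idx picks the block
 * within the grid, block_dim is the threads per block, thread_idx is
 * the position within the block. */
size_t global_thread_id(size_t block_idx, size_t block_dim, size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

For example, with blocks of 256 threads, thread 5 of block 2 handles element 517 of the grid's data.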
10. Similarities and differences between GPUs and vector processors? (4*10)
- Similarities to vector machines:
    - Works well with data-level parallel problems
    - Scatter-gather transfers
    - Mask registers
    - Large register files
- Differences:
    - No scalar processor
    - Uses multithreading to hide memory latency
    - Has many functional units, as opposed to a few deeply pipelined units like a vector processor