# Computer Architecture: Quiz 4 Pre-Exam Key Points

:::info
Lecture handout: c4: Chapter 4 20231205r0
Supplementary handout: l20: lecture20_DLP_Vector.r0
:::

1. What are the basic variations of architectures exploiting SIMD/DLP? Name the forerunner of each variation. What are the orders of magnitude of DLP for these architecture variations? (3 * 3 * 5)
    <!--
    - SIMD architectures can exploit significant data-level parallelism for:
      - Matrix-oriented scientific computing
      - Media-oriented image and sound processors
    -->
    - Three variations: (l20 p.10)
      - Vector architectures -> Matrix-oriented scientific computing
      - SIMD extensions -> Media-oriented multimedia applications (cf. MMX: MultiMedia eXtensions)
      - Graphics Processor Units (GPUs) -> Media-oriented image and sound processors
    - x86 processors -> MIMD + SIMD (l20 p.10)
      - Expect two additional cores per chip per year
      - SIMD width to double every four years
      - Potential speedup from SIMD to be twice that from MIMD
    - English terms:
      - MMX (MultiMedia eXtensions)
      - SSE (Streaming SIMD Extensions)
      - AVX (Advanced Vector eXtensions)

2. Compared with the MIMD architecture, what are the advantages of SIMD? (2*10)
    - SIMD is more energy efficient than MIMD (c4 p.2)
      - Only needs to fetch one instruction per data operation
      - Makes SIMD attractive for personal mobile devices
    - Potential speedup from SIMD to be twice that from MIMD (c4 p.3)

3. What is the basic idea of a vector architecture? What are the main components of RV64V? (15+4*5)
    - Basic idea: (c4 p.4)
      - Read sets of data elements into "vector registers"
      - Operate on those registers
      - Disperse the results back into memory
    - RV64V: (c4 p.5)
      - Loosely based on Cray-1
      - 32 64-bit vector registers
        - Register file has 16 read ports and 8 write ports
      - Vector functional units
        - Fully pipelined
        - Data and control hazards are detected
      - Vector load-store unit
        - Fully pipelined
        - One word per clock cycle after initial latency
      - Scalar registers
        - 31 general-purpose registers
        - 32 floating-point registers

4. Which addressing modes does RV64V support? What are the corresponding load/store instructions for each mode? Which scenarios does each mode suit? (5*10)
    - Three types of addressing: (l20 p.27)
      - Unit stride
        - ==Contiguous block of information in memory==
        - Fastest: always possible to optimize this
      - Non-unit (constant) stride
        - Harder to optimize memory system for all possible strides
        - ==Prime number of data banks makes it easier to support different strides at full bandwidth==
      - Indexed (gather-scatter)
        - Vector equivalent of register indirect
        - Good for ==sparse arrays== of data
        - Increases number of programs that vectorize
    - Load (l20 p.15, 16, 25)
      - vld (unit stride)
      - vlds (non-unit, constant stride)
      - vldx (indexed)
    - Store (l20 p.15, 16, 25)
      - vst (unit stride)
      - vsts (non-unit, constant stride)
      - vstx (indexed)

5. How do we handle a conditional IF statement inside a vector loop, such as the example on page 297 of the textbook? What support does RV64V provide for this? (2*10)
    ![image](https://hackmd.io/_uploads/rJN8pKEPa.png)
    - Vector Mask Registers (c4 p.13, l20 p.24)
      - Use predicate register to "disable" elements
      - Predicate registers pi are used (l20 p.16)

6. How do SIMD ISA extensions (SIMD Instruction Set Extensions) work? What limitations do they have compared with vector architectures? (10+3*10) (c4 p.18)
    - Media applications operate on data types narrower than the native word size
      - Example: disconnect carry chains to "partition" the adder
    - Limitations, compared to vector instructions:
      - Number of data operands encoded into opcode
      - No sophisticated addressing modes (strided, scatter-gather)
      - No mask registers

7. Compared with vector architectures and SIMD ISA extensions, what is the corresponding GPU instruction extension? What makes it unique? (2*10)
    <!--
    - ISA is an abstraction of the hardware instruction set
    - "Parallel Thread Execution (PTX)"
      - opcode.type d, a, b, c;
      - Uses virtual registers
      - Translation to machine code is performed in software
    -->
    - Single Instruction Multiple Thread (SIMT) (c4 p.23)
    - Basic idea:
      - Heterogeneous execution model
        - CPU is the host, GPU is the device
      - Develop a C-like programming language for GPU
      - Unify all forms of GPU parallelism as CUDA thread
      - Programming model is "Single Instruction Multiple Thread"
    - ==SIMD architectures have no scatter-gather support== (unlike GPUs)
    <!--
    - Differences:
      - No scalar processor
      - Uses multithreading to hide memory latency
      - Has many functional units, as opposed to a few deeply pipelined units like a vector processor
    -->
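To make question 7's SIMT model concrete, below is a minimal CUDA sketch of a textbook-style DAXPY loop (the kernel name, the unified-memory allocation, and the launch sizes are illustrative choices, not from the slides). The CPU is the host and the GPU is the device: the host launches a kernel, and every CUDA thread executes the same instruction stream on its own data element.

```cuda
#include <cstdio>

// SIMT: one CUDA thread per data element replaces the scalar loop
// "for (i = 0; i < n; i++) y[i] = a * x[i] + y[i];".
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)  // guard: n need not be a multiple of the block size
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMallocManaged(&x, n * sizeof(double));  // unified memory keeps the host/device sketch short
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    // The host launches the kernel on the device as a grid of thread blocks.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    daxpy<<<blocks, threadsPerBlock>>>(n, 3.0, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0 = 3.0 * 1.0 + 2.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The `if (i < n)` guard is the SIMT counterpart of the mask handling in question 5: threads whose index falls outside the data simply do nothing.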
8. What are the similarities and differences in compute support among the NVIDIA Pascal, Volta, and Ampere GPU architectures? (3*10)
    ![image](https://hackmd.io/_uploads/ry6d1vrvp.png)
    ![image](https://hackmd.io/_uploads/SyaylDHDa.png)
    - INT8 Tensor: Ampere
    - FP16 Tensor: Ampere, Volta
    - TF32 Tensor: Ampere

9. What is the compute hierarchy of a GPU? Explain the relationship among Thread, Warp, and Grid. (4*10)
    - Hierarchy (class whiteboard), bottom-up: Thread (@Core) ↑ Block(s) (@SM) ↑ Grid
    ![image](https://hackmd.io/_uploads/ry-TmDBDT.png)
    - Thread: a thread is associated with each data element (c4 p.35)
    - Warp: composed of Threads (see the indexing sketch at the end of this section)
    - Grid: composed of Blocks (class whiteboard)
    - English supplement: Hierarchical Clustered Array Processors
      - CU: Compute Unit
      - SM: Streaming Multiprocessor
      - TPC: Texture Processing Cluster
      - GPC: Graphics Processing Cluster

10. Similarities and differences between GPU and vector processors? (4*10)
    - Similarities to vector machines:
      - Works well with data-level parallel problems
      - Scatter-gather transfers
      - Mask registers
      - Large register files
    - Differences:
      - No scalar processor
      - Uses multithreading to hide memory latency
      - Has many functional units, as opposed to a few deeply pipelined units like a vector processor
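As a companion to question 9, here is a small CUDA sketch of how a thread locates itself in the Grid → Block → Thread hierarchy (the kernel name and launch shape are made up for illustration). On NVIDIA hardware, the threads of a block are additionally grouped into warps of 32 that an SM schedules in lockstep.

```cuda
#include <cstdio>

__global__ void whereAmI() {
    // Grid  = all blocks of this launch (gridDim.x blocks)
    // Block = blockDim.x threads, resident on one SM
    // Warp  = 32 consecutive threads of a block, executed in lockstep
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
    int warp_id   = threadIdx.x / 32;                       // which warp of the block
    int lane      = threadIdx.x % 32;                       // slot inside that warp
    if (lane == 0)  // print one line per warp to keep the output readable
        printf("block %d, warp %d -> global threads %d..%d\n",
               (int)blockIdx.x, warp_id, global_id, global_id + 31);
}

int main() {
    whereAmI<<<2, 64>>>();  // 2 blocks x 64 threads = 128 threads = 4 warps
    cudaDeviceSynchronize();
    return 0;
}
```

Warp divergence also ties back to question 5: when threads of one warp take different branches, the hardware masks off the inactive lanes, which is the SIMT analogue of a vector mask register.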