# Unlimited Vector Extension with Data Streaming Support

##### origin: ISCA '21
##### paper: [link1(IEEE)](https://ieeexplore.ieee.org/document/9499750), [link2(INESC-ID)](https://www.inesc-id.pt/publications/16585/pdf/)

###### tags: `Vector extensions`

## Introduction

### Background

* Single Instruction Multiple Data (SIMD) instruction set extensions exploit Data-Level Parallelism (DLP) to provide significant performance speedups.
* Conventional SIMD extensions (Intel MMX, SSE, AVX, etc., or ARM NEON)
    * Fixed-size vector registers
    * Recompilation is needed whenever the register length changes
* ARM SVE and RISC-V Vector extension (RVV)
    * Agnostic to the physical vector register size from the SW developer/compiler point of view
    * Predicate and/or vector control instructions are required to disable vector elements outside loop bounds → increases the number of loop instructions

### Problem

* Instruction overhead (memory indexing, loop control, memory access, etc.)
    * None of these directly contribute to maximizing the data processing throughput, yet they often represent the majority of the loop code.

![](https://i.imgur.com/NwG2szN.png)
![](https://i.imgur.com/DNisLe0.png)

* SW/HW prefetch

### Proposed Solution

* Unlimited Vector Extension (**UVE**)
    * Decoupled memory accesses
        * Input data is directly streamed to the register file by the **Streaming Engine**, allowing data loads/stores to occur in parallel with data manipulation.
        * Reduces the load-to-use latency
    * Indexing-free loops and implicit loads/stores
        * Since all streams are described in the loop preamble, not only can one remove the indexing instructions, but also all explicit loads and stores.
    * Simplified vectorization
        * The loop access patterns are exactly described by **descriptor representations** (3 types).

![](https://i.imgur.com/sjFinGm.png)

* Register-size agnostic
    * Similar to SVE and RVV
    * The **Streaming Engine** automatically disables all vector elements that fall out of bounds.
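The out-of-bounds masking above can be sketched in plain Python. This is an illustrative model (the function and its names are mine, not part of any ISA): a loop over `N` elements proceeds in chunks of the hardware vector length `vl`, and a predicate mask disables the lanes past the loop bound, the way SVE/RVV predication or UVE's Streaming Engine does automatically.

```python
def process_vla(data, vl):
    """Sum `data` in chunks of `vl`, disabling out-of-bounds lanes.

    Models a vector-length-agnostic loop: the same code works for any
    `vl`, and the tail iteration is handled by the predicate mask rather
    than by a scalar clean-up loop.
    """
    total = 0
    for base in range(0, len(data), vl):
        # predicate: lane i is active only while base + i < len(data)
        mask = [base + i < len(data) for i in range(vl)]
        # inactive lanes contribute the identity element (0 for a sum)
        chunk = [data[base + i] if m else 0 for i, m in enumerate(mask)]
        total += sum(chunk)
    return total

result = process_vla(list(range(10)), vl=4)  # 10 elements, vector length 4
```

The point of the sketch is that changing `vl` never requires changing (or recompiling) the loop body, which is the property UVE shares with SVE and RVV.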
## Data Streaming

### Memory access modeling

\begin{equation}
y(X) = y_{\text{base}} + \sum\limits_{k = 0}^{\dim_y} x_k \times S_k
\end{equation}
\begin{equation*}
X = \{x_0, \dots, x_{\dim_y}\},\quad x_k \in [O_k, E_k + O_k]
\end{equation*}

:::info
* y(X) : stream address access
* y~base~ : base address of an n-dimensional variable
* x~k~ : indexing variable
* S~k~ : stride multiplication factors
* O~k~ : indexing offset
* E~k~ : # of data elements (size)
:::

### Stream descriptor representation

* Base stream descriptors

![](https://i.imgur.com/f5vT8G2.png)

* Static descriptor modifiers
* Indirect descriptor modifiers

![](https://i.imgur.com/vHyTYha.png)

## Proposed UVE Extension

* Base ISA: RISC-V

### Extension Design

* Architectural State
    * Vector Registers (u0-u31)
        * Minimum length: byte/half-word/word/double-word
        * Maximum length: multiple of the minimum length
        * The element width is independently configured for each vector register
        * Streaming interface: each data stream is implicitly associated with a specific vector register (u0-u31)
    * Predicate Registers (p0-p15)
        * p0-p7: regular memory and arithmetic instructions, with p0 hardwired to 1
        * p8-p15: configuration of the first 8, or to allow for context saving
* Streaming Support
    * Scalability and destructive behavior
        * The consumption/production of a stream automatically enforces its iteration, performed with only a basic set of stream-conditional branches
        * Eliminates the need for additional step-instructions in each loop, promoting code reduction
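The affine access model above can be made concrete with a short sketch. This is a minimal illustration of the formula y(X) = y_base + Σ x_k·S_k (the helper name and the tuple layout are my own, not the paper's encoding), treating each interval as half-open so dimension k yields E_k addresses:

```python
from itertools import product

def stream_addresses(base, dims):
    """Enumerate the addresses of an n-dimensional affine stream.

    `dims` is a list of (offset O_k, size E_k, stride S_k) tuples,
    ordered from outermost to innermost dimension.
    """
    ranges = [range(o, o + e) for (o, e, s) in dims]   # x_k in [O_k, O_k + E_k)
    strides = [s for (o, e, s) in dims]
    for idx in product(*ranges):                        # iterate all X tuples
        yield base + sum(x * s for x, s in zip(idx, strides))

# e.g. a 2x3 row-major tile of 8-byte elements, rows 24 bytes apart,
# starting at address 0x1000:
addrs = list(stream_addresses(0x1000, [(0, 2, 24), (0, 3, 8)]))
```

A single base descriptor thus captures what would otherwise be a nest of index and address-computation instructions inside the loop.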
    * Complexity limitation
        * Supports up to 8 dimensions and 7 modifiers
        * Compiler optimizations (future work)
* Streaming memory model
    * Automatic pre-loading of input data, to potentiate the offered performance → the source memory locations of an input stream cannot be modified (may cause a RAW hazard)
    * The loop code encodes the dependencies between memory operations, ensuring support for in-place computations → WAR & WAW hazards are well handled
    * The processor is responsible for the synchronization between data streams and load/store instructions.

### Instruction Set

26 integer, 15 floating-point and 19 memory (including streaming) major instructions; a total of 450 instructions including all variations

* Stream configuration (with prefix `ss`)
    * `ss.ld`, `ss.st`
    * `ss.{ld|st}.sta`, `ss.app[.mod|.ind]`, `ss.end[.mod|.ind]`
    * suffix `{b|h|w|d}`

![](https://i.imgur.com/XZuT4E8.png)

* Stream control
    * `ss.suspend`, `ss.resume`, `ss.stop`
* Predication
* Loop control
    * Predicate-based
    * End-of-stream
    * End-of-dimension

![](https://i.imgur.com/IMeOHeI.png)

* Vector manipulation
    * `so.v.dup`, `so.a.mul.fp`, `so.a.add.fp`
    * `ss.load`, `ss.store` : conventional (non-streaming) load/store
* Scalar processing
* Advanced control
    * `ss.getvl`, `ss.setvl`
    * `so.cfg.memx`: directs the corresponding stream to operate over the Lx cache
* Concurrent streams

## Microarchitecture Support

![](https://i.imgur.com/TfXA8LJ.png)
![](https://i.imgur.com/CfFfVYb.png)
![](https://i.imgur.com/sWGNhxO.png)

### Streaming Engine

![](https://i.imgur.com/RmlU2gq.png)
![](https://i.imgur.com/P3KZoeA.png)

* Stream Configuration
    * Whenever a new stream configuration reaches the rename stage of the processor pipeline, it is registered (in order) in the **Stream Configuration Reorder Buffer (SCROB)**.
    * Instructions are retrieved in order (one per clock cycle), validated, and used to write the data pattern configuration into the **Stream Table**.
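The end-of-dimension / end-of-stream loop control described above can be modeled in a few lines. This is an illustrative sketch (class and field names are mine, not the ISA's): iterating a 2-D stream raises one condition when the innermost dimension wraps and another when the whole stream is exhausted, so the loop branches on stream state instead of on index arithmetic.

```python
class StreamIter:
    """Toy model of a 2-D stream exposing loop-control conditions."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.r = self.c = 0
        self.end_of_dim = False      # innermost dimension just wrapped
        self.end_of_stream = False   # whole stream exhausted

    def step(self):
        # Stream iteration is implicit in the ISA (triggered by consuming
        # the stream register); here it is an explicit call.
        self.c += 1
        self.end_of_dim = self.c == self.cols
        if self.end_of_dim:
            self.c = 0
            self.r += 1
            self.end_of_stream = self.r == self.rows

# A "loop" with no index arithmetic: branch decisions come from the stream.
s = StreamIter(rows=2, cols=3)
consumed = 0
while not s.end_of_stream:
    consumed += 1   # body of the vector loop: consume one element
    s.step()
```

Note how the loop contains no bound checks against `rows`/`cols`; that bookkeeping has moved out of the loop body, which is exactly the code-reduction argument the instruction counts in the results section quantify.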
* Stream Processing
    * Stream processing is managed by the **Stream Scheduler**, which selects a set of n load/store streams from the Stream Table (prioritizing streams whose FIFO queues are less occupied).
    * The selected streams are then iterated by the **Address Generators** in the **Stream Processing Modules**.
    * Iterating load/store streams generates new load/store requests, which are registered in the **Load/Store FIFOs** and **Memory Request Queues**.
    * The Arbiter picks such requests, performs the virtual-to-physical page translation (through a TLB access) and issues them to memory.
* Load/Store FIFOs
    * Each stream is associated with an independent fixed-length FIFO queue, whose size (depth) was set to 8 in order to constrain the required hardware resources.
    * Designing a single queue shared across all streams is left as future work.

## Experimental Methodology

* Simulator: Gem5

![](https://i.imgur.com/Uz3XdMA.png)

* Benchmarks: see Fig. 8

## Results

### Performance evaluation

![](https://i.imgur.com/8RHXo1x.png)

* (Fig. 8.A) A significant code reduction, with an average of 60.9% (93.2%) fewer committed instructions than ARM SVE (NEON).
* (Fig. 8.B) The proposed extension provides a significant (average) performance advantage of 2.4× over ARM SVE (considering only vectorized benchmarks).
* (Fig. 8.C) A significant decrease (33.4%) in rename blocks per cycle when compared with the SVE-enabled core (considering only the benchmarks vectorized by the ARM compiler).
* (Fig. 8.D) A considerable improvement of the memory bus utilization, resulting in an average increase as high as 41×.
* (Fig. 8.E) The significant speed-ups are attained without relying on specific code optimizations, such as loop unrolling, which would provide even greater performance improvements.
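The Stream Scheduler's selection policy can be sketched as follows. This is a minimal model of the policy stated above (function and parameter names are my own, and the tie-breaking rule is an assumption): each cycle, pick up to n streams whose FIFO queues are the least occupied, skipping streams whose FIFOs are already full.

```python
def select_streams(fifo_occupancy, n, depth=8):
    """Return up to `n` stream ids, least-occupied (non-full) FIFOs first.

    `fifo_occupancy` maps stream id -> entries currently in its FIFO;
    `depth` is the per-stream FIFO depth (8 in the paper's design).
    """
    # A full FIFO cannot accept a new request, so its stream is not ready.
    ready = [s for s, occ in fifo_occupancy.items() if occ < depth]
    # Prioritize the emptiest FIFOs; break ties by stream id (assumed).
    ready.sort(key=lambda s: (fifo_occupancy[s], s))
    return ready[:n]

# Stream 0 is full, stream 3 is empty: pick streams 3 and 1.
picked = select_streams({0: 8, 1: 2, 2: 5, 3: 0}, n=2)
```

Favoring the emptiest FIFOs keeps all streams roughly equally buffered, so no consumer stalls on an empty queue while another queue sits full.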
### Sensitivity to parameter variation

* Number of Vector Registers

![](https://i.imgur.com/rVPPY8a.png)

* Load/Store FIFO depth

![](https://i.imgur.com/jArv9OC.png)

* Streaming cache level

![](https://i.imgur.com/fQYghtf.png)

* Stream Processing Modules
    * The number of Stream Processing Modules in the Streaming Engine was varied between 2 and 8.
    * There is no significant difference in the overall performance, with the results varying by less than 0.1%.

### Hardware overheads

* Footprint close to half of an L1 cache
    * Stream Table and SCROB (17 KB)
        * 32 concurrent streams, each with a maximum of 8 descriptor dimensions and 7 modifiers
    * Memory Request Queue (160 B)
        * Maintains up to 16 outstanding requests, each packed within a 10-byte entry
    * Load/Store FIFO buffers (17 KB)
        * Composed of a 256 × 66-byte structure (for 32 streams, each with 8 entries)
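The listed sizes follow directly from the stated parameters; a quick cross-check of the two arithmetic claims (variable names are mine):

```python
# Load/Store FIFO buffers: 32 streams x 8 entries = 256 entries of 66 B each.
fifo_entries = 32 * 8              # 256 entries in total
fifo_bytes = fifo_entries * 66     # 16,896 B, i.e. ~16.5 KiB (reported ~17 KB)

# Memory Request Queue: 16 outstanding requests x 10-byte entries.
mrq_bytes = 16 * 10                # 160 B, matching the reported figure
```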