# Unlimited Vector Extension with Data Streaming Support
##### origin: ISCA '21
##### paper: [link1(IEEE)](https://ieeexplore.ieee.org/document/9499750), [link2(INESC-ID)](https://www.inesc-id.pt/publications/16585/pdf/)
###### tags: `Vector extensions`
## Introduction
### Background
* Single Instruction Multiple Data (SIMD) instruction set extensions potentiate the exploitation of Data-Level Parallelism (DLP) to provide significant performance speedups.
* Conventional SIMD extensions (Intel MMX, SSE, AVX, etc., or ARM NEON)
* Fixed-size vector registers
* Recompilation is required whenever the register length changes
* ARM SVE and RISC-V Vector extension (RVV)
* Agnostic to physical vector register size from the SW developer/compiler point of view
* Predicate and/or vector control instructions required to disable vector elements outside loop bounds → increasing the number of loop instructions
### Problem
* Instruction overhead (memory indexing, loop control and memory access, etc.)
* None of these directly contributes to maximizing the data-processing throughput, yet they often represent the majority of the loop code.


* SW/HW prefetching only partially hides memory access latency and does not reduce the instruction overhead
### Proposed Solution
* Unlimited Vector Extension (**UVE**)
* Decoupled memory accesses
* Input data is directly streamed to the register file by the **Streaming Engine**, allowing data load/store to occur in parallel with data manipulation.
* Reduce load-to-use latency
* Indexing-free loops and implicit load/store
* Since all streams are described in the loop preamble, not only indexing instructions but also all explicit loads and stores can be removed.
* Simplified vectorization
* The loop access patterns are exactly described by **descriptor representations** (3 types).

* Register-size agnostic
* Similar to SVE and RVV
* The **Streaming Engine** automatically disables all vector elements that fall out of bounds.
## Data Streaming
### Memory access modeling
\begin{equation} y(X) = y_{\text{base}} + \sum_{k=0}^{\dim_y} x_k \times S_k \end{equation}
\begin{equation*} X = \{x_0, \dots, x_{\dim_y}\}, \quad x_k \in [O_k, E_k + O_k] \end{equation*}
:::info
* y(X) : stream address access
* y~base~ : base address of an n-dimensional variable
* x~k~ : indexing variable
* S~k~ : stride multiplication factors
* O~k~ : indexing offset
* E~k~ : # of data elements (size)
:::
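The access model above can be reproduced in a few lines of Python. This is only an illustrative sketch: `stream_addresses` and the `(O_k, E_k, S_k)` tuple layout are made-up names, not part of UVE, and `E_k` is treated here as an element count (half-open index range).

```python
from itertools import product

def stream_addresses(y_base, dims):
    """Enumerate the addresses y(X) = y_base + sum(x_k * S_k) over the
    full n-dimensional iteration space.

    `dims` is a list of (O_k, E_k, S_k) tuples, one per dimension:
    indexing offset, number of elements, and stride multiplication
    factor. x_k spans the half-open range [O_k, O_k + E_k).
    """
    ranges = [range(o, o + e) for (o, e, _) in dims]
    strides = [s for (_, _, s) in dims]
    # The leftmost dimension iterates slowest (outermost loop).
    for idx in product(*ranges):
        yield y_base + sum(x * s for x, s in zip(idx, strides))

# Example: stream a 4x3 row-major matrix of 8-byte elements at 0x1000,
# row by row: the outer dimension strides by the 24-byte row size.
addrs = list(stream_addresses(0x1000, [(0, 4, 24), (0, 3, 8)]))
```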
### Stream descriptor representation
* Base stream descriptors: encode an affine access pattern as per-dimension {offset, size, stride} tuples
* Static descriptor modifiers: update a descriptor field (offset, size, or stride) each time an outer dimension iterates, enabling non-rectangular (e.g., triangular) patterns
* Indirect descriptor modifiers: take their indexing values from another stream, enabling indirect (scatter/gather) accesses
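The three descriptor types can be pictured with a small Python sketch. Function names and parameters here are illustrative, not the paper's actual descriptor encoding; only the {offset, size, stride} fields come from the access model above.

```python
def base_descriptor(offset, size, stride):
    # Base descriptor: one affine dimension -> the indices it touches.
    return [offset + i * stride for i in range(size)]

def static_modifier(desc_fn, field_update, outer_iters):
    # Static modifier: re-derive the inner descriptor on each outer
    # iteration, e.g. growing the size (triangular access patterns).
    out = []
    offset, size, stride = 0, 1, 1
    for _ in range(outer_iters):
        out.append(desc_fn(offset, size, stride))
        offset, size, stride = field_update(offset, size, stride)
    return out

def indirect_modifier(index_stream, base, scale):
    # Indirect modifier: addresses come from another stream's values
    # (scatter/gather-style accesses).
    return [base + i * scale for i in index_stream]

# Triangular pattern: each outer iteration grows the row size by one.
tri = static_modifier(base_descriptor,
                      lambda o, s, st: (o, s + 1, st), 3)

# Gather: indices supplied by another stream's data.
gather = indirect_modifier([7, 2, 5], base=0x100, scale=8)
```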

## Proposed UVE Extension
* Base ISA: RISC-V
### Extension Design
* Architectural State
* Vector Registers (u0-u31)
* Minimum length: byte/half-word/word/double-word
* Maximum length: multiple of the minimum length
* The element width is independently configured for each vector register
* Streaming interface: Each data stream is implicitly associated with a specific vector register (u0-u31)
* Predicate Registers (p0-p15)
* p0-p7: Regular memory and arithmetic instructions, with p0 hardwired to 1
* p8-p15: Used for configuring the first eight predicates and to allow for context saving
* Streaming Support
* Scalability and Destructive behavior
* Consuming or producing a stream automatically advances its iteration; loop control requires only a basic set of stream-conditional branches
* This eliminates the need for additional step-instructions in each loop, promoting code reduction.
* Complexity limitation
* Support up to 8 dimensions and 7 modifiers
* Compiler optimizations (future work)
* Streaming memory model
* Input data is automatically pre-loaded to boost the attainable performance
→ Hence, the source memory locations of an input stream must not be modified while it is active (doing so may cause a RAW hazard)
* The loop code encodes the dependencies between memory operations, ensuring support for in-place computations
→ WAR & WAW hazards are thus well handled
* The processor is responsible for the synchronization between data streams and load/store instructions.
### Instruction Set
26 integer, 15 floating-point, and 19 memory (including streaming) major instructions, totaling 450 instructions when all variations are counted
* Stream configuration (with prefix `ss`)
* `ss.ld`, `ss.st`
* `ss.{ld|st}.sta`,`ss.app[.mod|.ind]`, `ss.end[.mod|.ind]`
* suffix `{b|h|w|d}`

* Stream control
* `ss.suspend`, `ss.resume`, `ss.stop`
* Predication
* Loop control
* Predicate-based
* End-of-stream
* End-of-dimension

* Vector manipulation
* `so.v.dup`, `so.a.mul.fp`, `so.a.add.fp`
* `ss.load`, `ss.store` : Conventional (non-streaming) load/store
* Scalar processing
* Advanced control
* `ss.getvl`, `ss.setvl`
* `so.cfg.memx`: Direct the corresponding stream to operate over the Lx cache
* Concurrent streams
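As an illustration, a SAXPY-style loop under UVE might look as follows. This is a hand-written sketch: the `ss.ld`/`ss.st`/`so.*` mnemonics come from the list above, but the operand syntax and the `so.b.nc` stream-conditional branch are hypothetical placeholders, not the paper's exact encoding.

```asm
# saxpy: y[i] = a * x[i] + y[i], i in [0, n)
# (operand syntax is illustrative; only the mnemonics appear above)
ss.ld.w    u1, x_base, n, 1     # configure load stream over x
ss.ld.w    u2, y_base, n, 1     # configure load stream over y
ss.st.w    u3, y_base, n, 1     # configure store stream over y
so.v.dup.w u4, a                # broadcast scalar a to a vector register
loop:
so.a.mul.fp u5, u1, u4, p0      # reading u1 pops a vector of x; multiply by a
so.a.add.fp u3, u5, u2, p0      # add y; writing u3 pushes to the store stream
so.b.nc    u1, loop             # (hypothetical) branch while stream not complete
```

Note how the loop body contains only compute instructions and a branch: the stream configurations in the preamble replace all indexing, loads, and stores.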
## Microarchitecture Support



### Streaming Engine


* Stream Configuration
* Whenever a new stream configuration reaches the rename stage of the processor pipeline, it is registered (in order) on the **Stream Configuration Reorder Buffer (SCROB)**.
* Instructions are retrieved in order (one per clock cycle), validated, and used to write the data pattern configuration into the **Stream Table**.
* Stream Processing
* The stream processing is managed by the **Stream Scheduler**, which selects a set of n load/store streams from the Stream Table (prioritizes streams whose FIFO queues are less occupied).
* The selected streams are then iterated by the **Address Generators** on the **Stream Processing Modules**.
* Iterating load/store streams generates new load/store requests, which are registered on the **Load/Store FIFO** and **Memory Request Queues**.
* The **Arbiter** picks such requests, performs the virtual-to-physical page translation (through a TLB access), and issues them to memory.
* Load/Store FIFOs
* Each stream is associated with an independent fixed-length FIFO queue, whose size (depth) was set to 8 to constrain the required hardware resources.
* Designing a single queue shared across all streams is left as future work.
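The Stream Scheduler's "least-occupied FIFO first" policy can be sketched as follows. The function name, the occupancy map, and the selection width `n` are illustrative; only the priority rule and the FIFO depth of 8 come from the text.

```python
import heapq

FIFO_DEPTH = 8  # fixed queue depth used in the described design

def select_streams(fifo_occupancy, n):
    """Pick the n streams whose FIFO queues are least occupied,
    mirroring the Stream Scheduler's priority rule.

    fifo_occupancy maps stream id -> entries currently held (0..8).
    """
    # Full FIFOs need no further requests this round; skip them.
    candidates = [(occ, sid) for sid, occ in fifo_occupancy.items()
                  if occ < FIFO_DEPTH]
    return [sid for _, sid in heapq.nsmallest(n, candidates)]

# Stream 3's FIFO is empty and stream 0's is full, so streams 3 and 1
# are selected for iteration by the Stream Processing Modules.
picked = select_streams({0: 8, 1: 2, 2: 5, 3: 0}, n=2)
```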
## Experimental Methodology
* Simulator: Gem5

* Benchmarks: See Fig. 8
## Results
### Performance evaluation

* (Fig. 8.A) A significant code reduction, with an average of 60.9% (93.2%) fewer committed instructions than ARM SVE (NEON).
* (Fig. 8.B) The proposed extension provides a significant (average) performance advantage of 2.4× over ARM SVE (considering only vectorized benchmarks).
* (Fig. 8.C) A significant decrease (33.4%) in rename blocks per cycle, compared with the SVE-enabled core (considering only the benchmarks vectorized by the ARM compiler).
* (Fig. 8.D) A considerable improvement of the memory bus utilization, resulting in an average increase of up to 41×.
* (Fig. 8.E) The significant speed-ups are attained without relying on specific code optimizations, such as loop unrolling; applying such optimizations would provide even greater performance improvements.
### Sensitivity to parameter variation
* Number of Vector Registers

* Load/Store FIFO depth

* Streaming cache level

* Stream Processing modules
* The number of Stream Processing Modules in the Streaming Engine varied between 2 and 8.
* There is no significant difference in the overall performance, with the results varying by less than 0.1%.
### Hardware overheads
* Footprint close to half the size of an L1 cache
* Stream Table and SCROB (17 KB)
* 32 concurrent streams, each with a maximum of 8 descriptor dimensions and 7 modifiers
* Memory Request Queue (160 B)
* Maintaining up to 16 outstanding requests, each packed within a 10-byte entry
* Load/Store FIFO buffers (17 KB)
* Composed of a 256 × 66-byte structure (for 32 streams, each with 8 entries)
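The 17 KB FIFO figure follows directly from the stated geometry, as a quick check shows:

```python
# 32 streams x 8 FIFO entries = 256 entries of 66 bytes each
entries = 32 * 8
entry_bytes = 66
total_bytes = entries * entry_bytes  # 16896 B, i.e. ~16.5 KiB (~17 KB)
```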