# Near-Stream Computing: General and Transparent Near-Cache Acceleration
###### tags: `Accelerators`
Paper link: https://seanzw.github.io/pub/hpca2022-near-stream-computing.pdf
## BACKGROUND AND MOTIVATION
* As systems scale, the overheads of data movement and communication become the primary bottlenecks for high-performance, energy-efficient execution
* A variety of specialized architectures mitigate these overheads by carefully scheduling computation near data and orchestrating data movement in efficient pipelines
* This broad paradigm of near-data processing (NDP) includes near-memory as well as near-cache techniques
* Bringing NDP to general purpose computing is challenging because of three competing goals
1. **Transparency** to the programmer
2. **Synchronization efficiency** of offloaded computations to keep overheads low
3. **Generality** of the computations that can be offloaded.
## APPROACH
* In this work, their goal is to provide effective and general near-cache computing capability for general-purpose cores without programmer help
* Their primary insight is that **streams** are the right abstraction for near-data offloading:
1. **Generality** : Streams capture long-term per-data-structure behavior, so optimizations can be more aggressive than with instruction-level offloading.
2. **Synchronization efficiency** : Streams enable efficient autonomous offloading by eliminating coordination overhead.
3. **Transparency** : Streams reduce the overhead of maintaining sequential memory semantics by enabling detection of memory-ordering violations using per-data-structure access summaries rather than individual accesses.
* Based on these insights, they develop a paradigm they call **near-stream computing**
* Contribution :
1. Exploration of a novel program abstraction and granularity – streams – for performing near-data computing
2. Range-based synchronization and memory disambiguation protocol for maintaining sequential semantics with distributed computation at low overhead
3. Novel compiler techniques that perform aggressive near-data optimizations with simple pragmas.
4. Evaluation of near-stream computing against multiple prior near-data approaches
* Implementation:
* CPU ISA extension (x86)
* A set of LLVM-based compiler transforms and backend
* Microarchitecture
* Results:
* Significant traffic reduction was possible: 76% on average, and up to 98%.
* Performance gains were even higher due to reduced latency, with average speedups of 2.13× and 2.48× over the evaluated baselines.
## OVERVIEW
### 1. Taxonomy and Opportunity
* Taxonomy (see the C++ sketch at the end of this subsection)
* Address patterns
* affine (e.g. A[i,j])
* indirect (e.g. B[A[i,j]+w])
* pointer-chasing (e.g. P=P.next)
* Compute Patterns
* Near-Load-Stream
* Near-Store-Stream
* RMW (read-modify-write)
* Reduction
* Near-Stream Opportunity
* 21% of instructions are associated with load streams (including reductions) and 31% with store and RMW streams.
* Adding private caches reduces data traffic by only 27%, due to large reuse distances; near-LLC computing instead reduces data traffic by 64%.
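
To make the taxonomy concrete, here is a minimal C++ sketch (function and variable names are illustrative, not from the paper) pairing each address pattern with a compute pattern it commonly carries:

```cpp
#include <cstddef>

struct Node { int val; Node* next; };

// Affine load stream feeding a reduction: A[i] follows a linear address pattern.
int reduce_affine(const int* A, std::size_t n) {
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += A[i];              // near-load-stream + reduction
  return sum;
}

// Indirect pattern: the address of B[...] depends on the value loaded from A.
void gather_indirect(const int* A, const int* B, int* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = B[A[i]];         // indirect load feeding a near-store-stream
}

// RMW pattern: a load and a store to the same address form one update.
void rmw_update(int* A, std::size_t n, int c) {
  for (std::size_t i = 0; i < n; ++i)
    A[i] += c;                // near-RMW-stream (atomic near the data)
}

// Pointer-chasing pattern: the next address comes from the current element.
int chase(const Node* p) {
  int sum = 0;
  for (; p != nullptr; p = p->next)  // P = P.next
    sum += p->val;
  return sum;
}
```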
### 2. Optimization Overview
* The basic principle of stream-based near-data computing is that a decoupled stream may be offloaded near an LLC bank, along with some computation (a sketch follows this list):
1. **Reduce, Load, Store** : The stream and its attached computation execute at the LLC bank holding the data, so only the final result (or nothing at all, for stores) returns to the core.
2. **RMW** : RMW streams (e.g. A[i]+=C) are a hybrid case of both load and store computation. Semantically, they guarantee atomicity of the update.
3. **Access Pattern: Multi-op** : Computations composed of multiple operations per stream element can be offloaded together as a single unit.
4. **Access Pattern: Indirection** : Indirect streams (e.g. B[A[i,j]+w]) are offloaded along with the streams that feed them, so indirect requests are generated near the data.
5. **Access Pattern: Pointer-Chasing** : Pointer-chasing streams (e.g. P=P.next) follow each pointer to the LLC bank holding the next element.
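
A small sketch of the traffic argument for the RMW case (names hypothetical, not the paper's API): executed at the core, every touched line of A crosses the NoC twice; offloaded near the LLC bank that owns A[i], only the one-time stream configuration crosses it.

```cpp
#include <cstddef>

// Baseline: each update pulls a cache line from the LLC to the core and
// writes it back, i.e. two NoC crossings per line touched.
void rmw_at_core(int* A, std::size_t n, int c) {
  for (std::size_t i = 0; i < n; ++i)
    A[i] += c;
}

// Near-stream version, as a functional model only: on real hardware this
// loop body executes at the LLC banks, and the core merely sends one stream
// configuration describing {base A, length n, operation +=c}. This function
// is a hypothetical stand-in for the s_cfg sequence described later.
void offload_rmw_stream(int* base, std::size_t len, int operand) {
  for (std::size_t i = 0; i < len; ++i)
    base[i] += operand;  // semantically identical; physically near-LLC
}
```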
## NEAR-STREAM PRELIMINARIES
### 1. Near-Stream Computing ISAs
1. Address-Only Stream ISA Concepts

* The basic ISA abstractions are adapted from decoupled-stream ISAs, which focus on address generation and embed no near-data computation
* **s_step** instruction explicitly advances the stream, enabling conditional stream usage and decoupling the address pattern from the control flow
2. Representing Near-Data Computations
    a. Reduction
    b. Store
    c. Indirect Atomic
    d. Nest (nested inner-loop streams)
* Stream Configuration
    * **s_cfg** is split into a sequence of instructions, starting with **s_cfg_begin**. This may be followed by a sequence of **s_cfg_input** instructions. Finally, **s_cfg_end** completes the configuration and the stream can begin executing
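
A hedged lowering sketch for a reduction using the instruction names above; the operand syntax and the stream name `s0` are assumptions for illustration, not the paper's encoding:

```cpp
#include <cstddef>

// Source loop: an affine load stream A[i] feeding a reduction.
int sum_array(const int* A, std::size_t n) {
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += A[i];
  return sum;
}

// Conceptual lowering (pseudo-assembly in comments):
//   s_cfg_begin
//   s_cfg_input s0, base=A, stride=4, len=n   ; address pattern of A[i]
//   s_cfg_input s0, func=add, init=0          ; attach the reduction
//   s_cfg_end                                 ; s0 may now execute near the LLC
//   ...
//   s_step s0   ; when the core itself consumes stream values, s_step
//               ; explicitly advances the stream, decoupling the address
//               ; pattern from the core's control flow
```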
### 2. Compiler Support
1. Load : For each load stream, the compiler performs a BFS over its user instructions and checks whether the visited instructions form a closure
2. Store : Similarly to loads, the compiler searches for the instructions computing the stored value, recording a value dependence when it encounters a load instruction (https://www.cnblogs.com/ilocker/p/4897325.html)
3. Reduce : Reduction variables are typically represented as phi nodes in the loop entry basic block, and can be recognized by searching backwards for computation instructions
4. RMW : A load and the following store to the same address are merged into a single update stream. Atomics are handled similarly to stores, with a possible return value
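
An illustrative kernel (hypothetical names) annotated with what each of the four passes above would recognize:

```cpp
#include <cstddef>

// Hypothetical kernel; comments mark what each compiler pass recognizes.
void kernel(const int* idx, float* hist, const float* w, std::size_t n,
            float* out_sum) {
  float sum = 0.0f;                  // reduction variable: becomes a phi node
  for (std::size_t i = 0; i < n; ++i) {
    int j = idx[i];                  // affine load stream idx[i]
    hist[j] += w[i];                 // load + store to the same address merge
                                     // into one indirect RMW/update stream
    sum += w[i];                     // BFS from the load stream w[i] finds a
                                     // closed set of users -> reduce stream
  }
  *out_sum = sum;
}
```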
### 3. Core Microarchitecture
* The primary extension is the core’s stream engine (abbreviated SEcore), which is essentially a programmable prefetcher
* The stream computing manager (SCM) manages the execution of near-stream thread contexts, arbitrating between requests from the local streams on its SEcore and remote streams from its SEL3 (the stream engine at the LLC bank)
* The SCM is responsible for scheduling computation instances onto iterations of this loop
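
A toy model of the arbitration described above; the round-robin policy, types, and fields are assumptions for illustration, not the paper's microarchitecture:

```cpp
#include <deque>
#include <optional>
#include <utility>

struct ComputeRequest {
  int stream_id;  // which stream instance this computation belongs to
  // operands, PE target, etc. would live here
};

class SCM {
  std::deque<ComputeRequest> local_;   // requests from this tile's SEcore
  std::deque<ComputeRequest> remote_;  // requests from this tile's SEL3
  bool prefer_local_ = true;           // round-robin fairness bit

 public:
  void enqueue_local(ComputeRequest r)  { local_.push_back(std::move(r)); }
  void enqueue_remote(ComputeRequest r) { remote_.push_back(std::move(r)); }

  // Pick the next computation instance to dispatch to a PE or thread context.
  std::optional<ComputeRequest> next() {
    std::deque<ComputeRequest>* order[2] = {&local_, &remote_};
    if (!prefer_local_) std::swap(order[0], order[1]);
    prefer_local_ = !prefer_local_;
    for (auto* q : order) {
      if (!q->empty()) {
        ComputeRequest r = std::move(q->front());
        q->pop_front();
        return r;
      }
    }
    return std::nullopt;  // nothing ready to schedule
  }
};
```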
## NEAR-STREAM COMPUTING
### 1. Major Challenge and System Overview
* One major challenge is synchronizing after decoupling streams and computations to the cache: this requires maintaining precise state and detecting aliasing between streams and the core
* The key idea of range-based synchronization (range-sync) is to synchronize only every few iterations, checking aliasing against the range of touched addresses instead of individual accesses
### 2. Range-Based Synchronization
1. Alias Check with Ranges : To amortize synchronization overheads, alias checks between the core and offloaded streams are performed on ranges of touched addresses instead of individual accesses
2. Hardware Units : A **stream buffer** is added to SEL3 to hold operands and intermediate state before they are committed
3. Coarse-Grained Protocol : The synchronization protocol operates on ranges, with all control messages designed to be coarse-grained, as detailed below:
* Stream Configure : SEcore makes the offloading decision based on the stream’s configuration and history information
* Stream Forward : Once configured, SEL3 computes the addresses and issues requests to the colocated L3 cache controller
* Compute in SEL3 : The issue unit schedules ready computations to a scalar PE (for simple computations) or the local core’s SCM within the same tile to fully reuse existing hardware resources
* Precise State : Range-sync helps define the architectural state of offloaded streams consistently with the core
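
A minimal sketch of the range-based alias check at the heart of range-sync; the types are assumptions, but the interval-overlap test is the essential idea:

```cpp
#include <cstdint>

// Summary of the addresses a stream (or the core) touched in one window of
// iterations, kept as a half-open interval [lo, hi).
struct AddrRange {
  std::uint64_t lo, hi;
};

// Two windows may alias iff their intervals overlap.
inline bool may_alias(const AddrRange& a, const AddrRange& b) {
  return a.lo < b.hi && b.lo < a.hi;
}
```

Only when `may_alias` fires does the hardware need precise, per-access disambiguation; in the common non-aliasing case, one range comparison per window replaces per-access checks.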

### 3. Synchronization-Free Optimization
* Although range-sync amortizes the control overhead with coarse-grained messages, it still introduces extra traffic and longer dependence chains
* Programmers can add a pragma **s_sync_free** to a loop, indicating that streams in this region never alias
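A hypothetical use of the pragma (the directive name comes from the text above; its exact placement and spelling in real code are assumptions):

```cpp
#include <cstddef>

void update_hist(float* hist, const int* idx, const float* w, std::size_t n) {
  // Programmer asserts that the streams in this loop never alias with
  // concurrent core accesses, so range-sync state and messages are elided.
  #pragma s_sync_free
  for (std::size_t i = 0; i < n; ++i)
    hist[idx[i]] += w[i];  // indirect RMW stream, offloaded sync-free
}
```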
## METHODOLOGY
* Evaluation stack : They use gem5-20 for execution-driven, cycle-accurate simulation, extended with partial AVX-512 support, with Garnet for the NoC and DRAMsim3 for DDR4. They implement an LLVM-based compiler with an x86 backend to recognize streams and associated computations
* Benchmarks : 14 OpenMP workloads from Rodinia, MineBench, and the GAP graph benchmark suite
* Systems and Comparison :

## EVALUATION