# In-Memory Data Parallel Processor
###### tags: `PIM`
###### paper origin: ACM ASPLOS 2018
## 1. Introduction
### Research Problems
* Non-Volatile Memories (NVMs) offer significant performance gains. However, previous works have relied on manual mapping of specialized kernels to the memory arrays, making it infeasible to execute more general workloads.
### Proposed Solution
* A programmable in-memory processor architecture
    * Massive parallelism
    * Reduction in data movement
* A data-parallel programming framework
    * A compact instruction set provides generalized computation capabilities for the memory array.
## 2. Processor Architecture
### Micro-architecture

* The tiles are connected by an H-Tree router network.
    * Suits communication patterns typical in the programming model
    * Provides high-bandwidth communication for external I/O
* Each memory array can be thought of as a vector processing unit with a few SIMD lanes.
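To get a feel for the aggregate parallelism, here is a minimal sketch that models the chip as a hierarchy of tiles, arrays, and SIMD lanes; the sizes below are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sizes chosen only to illustrate the hierarchy.
TILES_PER_CHIP  = 64   # assumption
ARRAYS_PER_TILE = 32   # assumption
LANES_PER_ARRAY = 4    # each array acts as a small SIMD unit

# Every lane is an independent SIMD slot that an Instruction Block can occupy.
aggregate_simd_slots = TILES_PER_CHIP * ARRAYS_PER_TILE * LANES_PER_ARRAY
print(f"aggregate SIMD slots per chip: {aggregate_simd_slots}")  # 8192 with these numbers
```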
### Instruction Set Architecture

* The proposed Instruction Set Architecture (ISA) is simple and compact.
    * It does not support complex (e.g., division) or specialized instructions.

* add
* format: `add <mask><dst>`
    * A sample-and-hold (S+H) circuit receives the bit-line current and feeds it to the ADC unit, which outputs the digital value for the current.
    * An external digital shifter and adder (S+A) combines the partial sums from the bit-lines.
    * Each of the ReRAM crossbar (XB), ADC, and S+A stages takes one cycle, resulting in three cycles in total (modeled in the sketch after this list).
* dot
* format: `dot <mask><reg_mask><dst>`
* mul
* format: `mul <src><src><dst>`
* sub
* format: `sub <mask><mask><dst>`
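As a rough functional model (not circuit-accurate) of the XB → ADC → S+A stages above, the sketch below shows how a dot-style instruction could produce its result when weights are bit-sliced across bit-lines; the 4-bit weight precision is an assumption for illustration.

```python
import numpy as np

WEIGHT_BITS = 4  # assumed weight precision, sliced one bit per bit-line

def crossbar_dot(inputs: np.ndarray, weights: np.ndarray) -> int:
    """inputs: vector driven on word-lines; weights: one stored column."""
    total = 0
    for bit in range(WEIGHT_BITS):
        bit_slice = (weights >> bit) & 1          # cells on one bit-line
        current = int(np.dot(inputs, bit_slice))  # XB: analog current summation
        digital = current                         # ADC: S+H then digitize
        total += digital << bit                   # S+A: weight the partial sum
    return total

x = np.array([1, 2, 3])
w = np.array([5, 6, 7])   # 4-bit weights
assert crossbar_dot(x, w) == int(np.dot(x, w))
```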
## 3. Programming Model
### Tensorflow
* TensorFlow (TF) offers a suitable programming paradigm for data parallel in-memory computing.
* nodes in TF’s DFGs can operate on multi-dimensional tensors.
* irregular memory accesses are restricted by not allowing subscript notation.
* the DFG naturally exposes Instruction Level Parallelism (ILP).
* TensorFlow supports a persistent memory context in nodes of the DFG. This is useful in our merged memory and compute architecture for storing persistent data across kernel invocations.
### TensorFlow primitives

* Input nodes
* Placeholder: non-persistent input
* Const: known at compile time
* Variable: input with persistent memory context
* Control Flow
    * Expressed as an element-wise select: `O[i] = Cond[i] ? A[i] : B[i]`
* Reduction, Scatter, Gather
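A minimal TensorFlow 1.x sketch (the API generation contemporary with the paper) exercising the primitives above; the shapes and values are arbitrary.

```python
import tensorflow as tf  # TF 1.x API

# Input nodes
a = tf.placeholder(tf.float32, shape=[1024])   # Placeholder: non-persistent input
b = tf.constant(2.0, shape=[1024])             # Const: known at compile time
state = tf.Variable(tf.zeros([1024]))          # Variable: persistent memory context

# Control flow as a data-parallel select: O[i] = Cond[i] ? A[i] : B[i]
cond = tf.greater(a, b)
o = tf.where(cond, a, b)

# Reduction and gather
total = tf.reduce_sum(o)
picked = tf.gather(o, tf.constant([0, 2, 4]))
```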
## 4. Execution Model

* DFG
* a DFG generated by TensorFlow can represent one module
* Modules
* The proposed execution model allows restricted communication between instances of modules. Such communication is only allowed using scatter/gather nodes or reduction nodes in the DFG. We find these communication primitives are sufficient to express most TensorFlow kernels.
* IB (Instruction Block)
    * An IB consists of a list of instructions that are executed sequentially.
* The compiler explores several optimizations to increase the number of concurrent IBs in a module and thereby exposes the ILP inside a module.
* ReRAM
    * Each IB is mapped to a single SIMD lane of a memory array.
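A toy model of this hierarchy, under assumed names: a module holds concurrent IBs, each IB is a sequential instruction list, and mapping pins each IB to one SIMD lane. The round-robin assignment is a placeholder, not the paper's scheduler.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class InstructionBlock:
    instructions: List[str]  # executed strictly in order

@dataclass
class Module:
    ibs: List[InstructionBlock] = field(default_factory=list)  # concurrent IBs expose ILP

def map_to_lanes(module: Module, num_lanes: int) -> List[Tuple[int, int]]:
    """Pin each IB (by index) to a single SIMD lane; round-robin is hypothetical."""
    return [(i, i % num_lanes) for i, _ in enumerate(module.ibs)]

m = Module([InstructionBlock(["dot", "add"]), InstructionBlock(["mul", "sub"])])
print(map_to_lanes(m, num_lanes=4))  # [(0, 0), (1, 1)]
```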
## 5. Compiler
### Supporting Complex Operations
General-purpose computation requires supporting a diverse set of operations such as division, exponentiation, and transcendental functions. These complex operations are supported by converting them into a sequence of LUT, addition, and multiplication instructions, based on algorithms used in Intel IA-64 processors.
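As one concrete instance, division can be lowered to a table-lookup seed refined by Newton-Raphson iterations that use only multiply and subtract, in the spirit of the IA-64 reciprocal algorithms; the seed function and iteration count below are illustrative assumptions, not the paper's exact sequence.

```python
def lut_seed(b: float) -> float:
    # Stand-in for a small hardware LUT returning a coarse reciprocal
    # guess; assumes b is not near zero.
    return 1.0 / round(b)

def divide(a: float, b: float, iters: int = 3) -> float:
    x = lut_seed(b)
    for _ in range(iters):
        x = x * (2.0 - b * x)   # Newton-Raphson step: mul + sub only
    return a * x                # a / b computed as a * (1 / b)

assert abs(divide(10.0, 3.7) - 10.0 / 3.7) < 1e-6
```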
### Compiler Optimizations
* Node Merging

* Instruction Block Scheduler
    * A Bottom-Up-Greedy (BUG) algorithm is modified to introduce the notion of in-memory computing.
* Instruction Block (IB) Expansion

* Instruction Blocks that use multi-dimensional vectors as operands can be expanded into several instruction blocks with lower-dimension vectors to further exploit ILP and DLP.
        * e.g., an IB over a 2-D matrix of size [2, 1024] is expanded into two IBs over [1024] vectors (see the sketch after this list)
* Pipelining
    * Using two arrays: one array computes while the previous result is written back to the other array.
* Balancing Inter-Module and Intra-Module Parallelism
* Because of Amdahl’s law, increasing the number of IBs in a module will not result in linear speedup.
    * When the number of IBs across all module instances exceeds the aggregate SIMD slots in the memory chip, the compiler must balance IBs per module against concurrent module instances.
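Below is a sketch of IB expansion on the [2, 1024] example: one IB over a 2-D operand is split into two IBs over [1024] vectors, which separate SIMD lanes could then execute concurrently. The closure-based representation is only an illustration.

```python
import numpy as np

def expand_ib(vector_op, operand: np.ndarray):
    """Split one IB over a 2-D operand into one lower-dimension IB per row."""
    return [lambda row=row: vector_op(row) for row in operand]

m = np.arange(2 * 1024, dtype=np.float32).reshape(2, 1024)
ibs = expand_ib(lambda v: v * 2.0, m)       # two concurrent IBs instead of one
results = np.stack([ib() for ib in ibs])    # lanes would run these in parallel
assert results.shape == (2, 1024)
```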
### Results
