# In-Memory Data Parallel Processor
###### tags: `PIM`
###### paper origin: ACM ASPLOS 2018
###### papers: [link](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf)
###### slide: [link](https://www.usenix.org/sites/default/files/conference/protected-files/nsdi17_slides_crankshaw.pdf)

## 1. INTRODUCTION

### Research Problems
* Non-Volatile Memories offer significant performance gains. However, previous works have relied on manual mapping of specialized kernels to the memory arrays, making it infeasible to execute more general workloads.

### Proposed Solution
* A programmable in-memory processor architecture
    * Massive parallelism
    * Reduction in data movement
* A data-parallel programming framework
    * A compact instruction set provides generalized computation capabilities for the memory array.

## 2. Processor Architecture

### Micro-architecture
![](https://i.imgur.com/Xr3023T.png)
* The tiles are connected by an H-Tree router network, which
    * suits the communication patterns typical in the programming model, and
    * provides high-bandwidth communication for external I/O.
* Each memory array can be thought of as a vector processing unit with a few SIMD lanes.

### Instruction Set Architecture
![](https://i.imgur.com/UCkART8.png)
* The proposed Instruction Set Architecture (ISA) is simple and compact.
    * It does not support complex (e.g., division) or specialized instructions.

![](https://i.imgur.com/ATeLQqK.png)
* `add`
    * format: `add <mask><dst>`
    * A sample-and-hold (S+H) circuit receives the bit-line current and feeds it to the ADC unit, which outputs the digital value of the current.
    * An external digital shifter and adder (S+A) combines the partial sums from the bit-lines.
    * Each of the ReRAM crossbar (XB), ADC, and S+A stages takes 1 cycle, resulting in 3 cycles in total.
* `dot`
    * format: `dot <mask><reg_mask><dst>`
* `mul`
    * format: `mul <src><src><dst>`
* `sub`
    * format: `sub <mask><mask><dst>`

## 3. Programming model

### TensorFlow
* TensorFlow (TF) offers a suitable programming paradigm for data-parallel in-memory computing.
    * Nodes in TF's DFGs can operate on multi-dimensional matrices.
    * Irregular memory accesses are restricted by not allowing subscript notation.
    * The DFG naturally exposes Instruction Level Parallelism (ILP).
    * TensorFlow supports a persistent memory context in nodes of the DFG. This is useful in the merged memory-and-compute architecture for storing persistent data across kernel invocations.

### TensorFlow primitives
![](https://i.imgur.com/Fb3pbTY.png)
* Input nodes
    * Placeholder: non-persistent input
    * Const: known at compile time
    * Variable: input with a persistent memory context
* Control flow
    * `O[i] = Cond[i] ? A[i] : B[i]`
* Reduction, Scatter, Gather (a hedged TF-style sketch of these primitives follows the Execution Model section below)

## Execution Model
![](https://i.imgur.com/SUPpeTQ.png)
* DFG
    * A DFG generated by TensorFlow can represent one module.
* Modules
    * The proposed execution model allows restricted communication between instances of modules. Such communication is only allowed through scatter/gather or reduction nodes in the DFG. The authors find these communication primitives sufficient to express most TensorFlow kernels.
* IB
    * An Instruction Block (IB) consists of a list of instructions that are executed sequentially.
    * The compiler applies several optimizations to increase the number of concurrent IBs in a module and thereby expose the ILP inside a module.
* ReRAM
    * Each IB is mapped to a single SIMD lane.
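To make the programming and execution model above concrete, here is a minimal TF 1.x-style sketch (not taken from the paper) that exercises the Placeholder / Const / Variable input nodes, the element-wise select, and a reduction. All tensor names, shapes, and values are illustrative assumptions.

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Placeholder: non-persistent input, streamed in on every kernel invocation.
a = tf.placeholder(tf.float32, shape=[1024], name="a")
# Const: value known at compile time.
scale = tf.constant(2.0, name="scale")
# Variable: input with a persistent memory context across invocations.
acc = tf.Variable(tf.zeros([1024]), name="acc")

# Element-wise work that maps naturally onto SIMD lanes of the memory arrays.
prod = a * scale
# Control-flow select, i.e. O[i] = Cond[i] ? A[i] : B[i].
clipped = tf.where(prod > 1.0, tf.ones_like(prod), prod)
# Update the persistent state, then reduce it to a scalar.
update = tf.assign_add(acc, clipped)
total = tf.reduce_sum(update)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(total, feed_dict={a: [0.1] * 1024}))
```

On the proposed architecture, the compiler would turn such a DFG into modules and IBs mapped onto the memory arrays; here it simply runs on a conventional TensorFlow backend.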
## Compiler

### Supporting Complex Operations
General-purpose computation requires supporting a diverse set of operations such as division, exponentials, and transcendental functions. These complex operations are supported by converting them into a sequence of LUT, addition, and multiplication instructions, based on algorithms used in Intel IA-64 processors (a hedged sketch of this style of lowering appears at the end of this note).

### Compiler Optimizations
* Node Merging
    ![](https://i.imgur.com/wJSnoUJ.png)
* Instruction Block Scheduler
    * A Bottom-Up-Greedy (BUG) algorithm is modified to introduce the notion of in-memory computing.
* Instruction Block (IB) Expansion
    ![](https://i.imgur.com/BVPCse3.png)
    * Instruction Blocks that use multi-dimensional vectors as operands can be expanded into several instruction blocks with lower-dimensional vectors to further exploit ILP and DLP.
    * e.g., an IB operating on a 2D matrix of size [2, 1024] can be expanded into IBs that each operate on a [1, 1024] slice.
* Pipelining
    * Using two arrays, one array computes while the previous result is written back to the other array.
* Balancing Inter-Module and Intra-Module Parallelism
    * Because of Amdahl's law, increasing the number of IBs in a module does not result in linear speedup.
    * The number of IBs across all module instances can also exceed the aggregate SIMD slots in the memory chip, so intra-module parallelism (IBs per module) must be balanced against inter-module parallelism (number of module instances).

### Result
![](https://i.imgur.com/qIAHaIC.png)
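As referenced in the Compiler section above, the following is a rough NumPy sketch of how a complex operation such as division can be lowered to LUT, multiply, and add steps. It illustrates the general LUT-seeded Newton-Raphson idea rather than the paper's exact IA-64-derived routine; `LUT_BITS`, the seed table, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Hypothetical seed table: coarse reciprocal estimates indexed by the top
# bits of a mantissa in [1, 2).  On the proposed hardware this would be a
# LUT instruction; here it is just a NumPy array.
LUT_BITS = 8
_seed_inputs = np.linspace(1.0, 2.0, 2 ** LUT_BITS, endpoint=False)
RECIP_SEED = 1.0 / _seed_inputs

def recip(d, iterations=3):
    """Approximate 1/d for d in [1, 2) from a LUT seed refined by
    Newton-Raphson steps, which need only multiplies and adds."""
    idx = np.minimum(((d - 1.0) * 2 ** LUT_BITS).astype(int), 2 ** LUT_BITS - 1)
    x = RECIP_SEED[idx]          # LUT: initial approximation
    for _ in range(iterations):
        e = 1.0 - d * x          # mul + add (subtraction as a negated add)
        x = x + x * e            # mul + add: quadratic refinement
    return x

def div(a, d):
    # a / d lowered to a reciprocal followed by a single multiply.
    return a * recip(d)

a = np.array([2.0, 3.0, 4.0])
d = np.array([1.1, 1.5, 1.9])
print(div(a, d))   # close to the reference below
print(a / d)
```

Exponentials and other transcendental functions can be handled in the same spirit: a range reduction, a table lookup, and a short sequence of multiply-add steps.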