
# Reading Note – Compute Caches
###### tags: `paper`

## Introduction
* paper: [here](http://web.eecs.umich.edu/~reetudas/papers/compute_cache.pdf)
* authors: Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
* published: HPCA 2017
* key: bit-line SRAM circuit as computational cache

## background
* SRAM circuit bit-line computing [2],[3]
    > parallelism & reduced data movement
    > trade-off: 8% cache area overhead
* major problems:
    * operand locality: bit-line computing requires that the data operands are stored in rows that share the same set of bit-lines
        * cache geometry and software support
        * near-place compute caches: read operands out of the cache sub-arrays and perform the arithmetic close to the cache controller
    * managing parallelism across cache levels
* evaluation & applications
    * text processing, bitmap indexing, copy-on-write checkpointing in the OS, and bit matrix multiplication (BMM), a critical primitive used in cryptography, bioinformatics, and image processing

## background
* compute cache overview on Intel's Sandy Bridge
    ![](https://i.imgur.com/as9mwkM.png)
* A sub-array in a cache bank is organized into multiple rows of data-storing bit-cells
* SRAM circuit for in-place operation
    ![](https://i.imgur.com/1FTuLDE.png)

## compute cache advantages
* reduces on-chip data movement overhead: both the energy spent on data transfer and the energy spent reading and writing the higher-level caches

## compute cache architecture
* prerequisite: operands are mapped to sub-arrays such that they share the same bit-lines
* a cache geometry that allows a compiler to satisfy operand locality by ensuring that the operands are page-aligned
* Compute Cache (CC) ISA
    ![](https://i.imgur.com/xe5iZxt.png)
    * operand size: 64 words (512 bytes)
* By feeding the result of the sense-amplifiers back to the bit-lines, one word-line can be copied to another without ever latching the source operand.
* operand locality: the operands need to be physically stored in the same sub-array, such that they share the same set of bit-lines
* cache organization
    ![](https://i.imgur.com/1DCMxnR.png)
    * Block Partition (BP): the group of cache blocks in a sub-array that share the same bit-lines
    * all the ways in a set are mapped to the same block partition
* software
    ![](https://i.imgur.com/S30KKNc.png)
    * software can ensure operand locality as long as operands are page-aligned, i.e., have the same page offset

## simulation
* [SniperSim](http://snipersim.org/w/The_Sniper_Multi-Core_Simulator)
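
## sketch: checking operand locality in software

The software condition noted under "compute cache architecture" (operands have the same page offset) can be illustrated with a small check. This is only a sketch under stated assumptions: the 4 KiB page size and the helper names `page_offset` and `has_operand_locality` are hypothetical and do not come from the paper, and the hardware mapping from page offset to block partition is not modeled.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages (not stated in the note)


def page_offset(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Return the offset of addr within its page.

    page_size must be a positive power of two so the offset can be
    extracted with a bit mask.
    """
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a positive power of two")
    return addr & (page_size - 1)


def has_operand_locality(*operand_addrs: int, page_size: int = PAGE_SIZE) -> bool:
    """True if all operand addresses share the same page offset.

    This mirrors only the software-visible condition from the note;
    it does not model sets, ways, or block partitions.
    """
    offsets = {page_offset(a, page_size) for a in operand_addrs}
    return len(offsets) <= 1


if __name__ == "__main__":
    a = 0x7F00_0000_1040  # hypothetical operand addresses
    b = 0x7F00_0003_2040  # different page, same offset as a
    c = 0x7F00_0003_2080  # same page as b, different offset
    print(has_operand_locality(a, b))
    print(has_operand_locality(a, c))
```

Under these assumptions, `a` and `b` pass the check because both sit at offset `0x040` in their pages, while `a` and `c` do not. In practice an allocator would place operands at matching offsets up front rather than test for it afterwards.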