# Reading Note – ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs
###### tags: `paper`
## Introduction
* Paper: [here](https://parallel.princeton.edu/papers/micro19-gao.pdf)
* Authors: Fei Gao, Georgios Tziantzioulis, David Wentzlaff (Princeton University)
* Published at MICRO 2019
* TL;DR:
**the first work to demonstrate in-memory computation with off-the-shelf, unmodified, commercial DRAM**, achieved mainly by violating the timing constraints of DRAM commands.
## Main Idea
### basics of DRAM
* DRAM arch: channels, ranks, banks, and rows/columns
```
one DIMM per channel
two ranks per module
8 chips / rank, operating in lockstep
8 independent banks / chip
rows/columns per bank
a column is the smallest addressable unit of data (one byte per chip)
512 rows / sub-array; each bit-line connects the cells of one bit position across the sub-array's rows to a sense amplifier in the local row buffer
```
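
As a quick reference, here is a toy decode of a physical address into DRAM coordinates for the geometry above; the field widths and bit order are assumptions for illustration, since real memory controllers use vendor-specific (often XOR-scrambled) mappings.

```python
# Illustrative address decode; field widths are assumptions matching
# the note above (2 ranks, 8 banks, 512-row sub-arrays), not a real
# controller's mapping.

def decode(addr: int) -> dict:
    col = addr & 0x3FF            # assume 10 column bits
    addr >>= 10
    bank = addr & 0x7             # 8 banks per chip
    addr >>= 3
    rank = addr & 0x1             # 2 ranks per module
    addr >>= 1
    row = addr & 0x7FFF           # assume 15 row bits (32K rows)
    return dict(rank=rank, bank=bank, row=row,
                subarray=row >> 9,        # 512 rows per sub-array
                local_row=row & 0x1FF,    # row index within sub-array
                col=col)

print(decode(0x12345678))
```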
* a traditional memory controller (MC) issues commands to DRAM with proper timing
> their work exploits exactly this "timing"
* commands of DRAM
    * PRECHARGE: pull all word-lines in the target bank to **zero** (closing any open row) and precharge the **bit-lines to Vdd/2**
    * ACTIVATE: the word-line of the addressed row is raised, connecting that row's cells directly to the bit-lines --> each cell's charge pulls its bit-line away from Vdd/2, then the sense amplifier turns on and drags the voltage to Vdd or GND --> the cell value is amplified (and restored)
    * READ/WRITE: commands apply to four or eight consecutive columns according to the burst mode
* timing constraints: after ACTIVATE the MC must wait tRAS before issuing PRECHARGE, and after PRECHARGE it must wait tRP before the next ACTIVATE; the paper denotes the actual gaps, in command cycles, as T1 (ACT -> PRE) and T2 (PRE -> ACT), as in the sketch below

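A minimal sketch of the nominal command sequence versus the timing-violating one; `activate`/`precharge`/`wait` are hypothetical stand-ins for SoftMC instructions, and the tRAS/tRP cycle counts are assumptions, not datasheet values.

```python
# Hypothetical command helpers standing in for SoftMC instructions.
def activate(row):  print(f"ACT  row={row}")
def precharge():    print("PRE")
def wait(cycles):   print(f"  ... wait {cycles} command cycles")

T_RAS, T_RP = 15, 6   # assumed nominal minimums, in command cycles

def nominal_access(row):
    activate(row); wait(T_RAS)   # row stays open long enough to restore it
    precharge();   wait(T_RP)    # bit-lines settle back to Vdd/2

def compute_sequence(r1, r2, t1, t2):
    # ComputeDRAM idiom: ACT -(T1)- PRE -(T2)- ACT, with T1 < tRAS
    # and/or T2 < tRP deliberately violating the constraints.
    activate(r1); wait(t1)
    precharge();  wait(t2)
    activate(r2)

compute_sequence(1, 2, t1=0, t2=1)   # tiny T1/T2, as used for row copy
```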
### Computation using off-the-shelf DRAM
* goal: by violating timing constraints, cause charge sharing and open multiple rows simultaneously to perform operations on the bit-lines
* row copy: reduce T2 so that the second ACTIVATE interrupts the PRECHARGE, but not so short that another row gets opened (see the behavioral sketch below)

* limitations:
    * source and destination rows must share the same bit-lines --> works only within a sub-array of a bank
    * the intervening PRECHARGE is needed; otherwise the second ACTIVATE is not recognized by the DRAM
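
A behavioral model of why this copies data, assuming the mechanism described above (an idealized sketch, not a circuit simulation):

```python
# Interrupting the PRECHARGE early leaves the sense amplifiers still
# driving the source row's value on the bit-lines, so the second
# ACTIVATE overwrites the destination row's cells with that value.

def row_copy(subarray, src, dst):
    sense_amps = subarray[src][:]   # ACT(src): amps latch src's bits
    # PRE is issued but cut short after T2 << tRP, so the bit-lines
    # never return to Vdd/2 and the amps keep driving src's data.
    subarray[dst] = sense_amps[:]   # ACT(dst): dst cells charge to amp values

sub = [[0] * 8 for _ in range(4)]   # toy sub-array: 4 rows x 8 bits
sub[1] = [1, 0, 1, 1, 0, 0, 1, 0]
row_copy(sub, src=1, dst=3)
print(sub[3])                       # mirrors row 1; only works within a sub-array
```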
* AND/OR (destructive): open three different rows simultaneously
    * PRECHARGE is issued immediately after ACTIVATE(R1) so that R1 does not dominate the other values (discussed below)
    * the row address register updates in steps --> a third row is implicitly opened
    * control over the result: fix the lower two bits of R1's and R2's addresses to 01₂ and 10₂ and require their higher bits to be identical, so the implicitly opened third row R3 is deterministic
    * R1, opened earliest, has the most time to influence the bit-lines --> table of robust operations --> the prerequisite of logical AND/OR is a constant zero/one stored in R3 (see the majority-model sketch below)

    * dotted circle in the figure: R2 & R3
    * solid circle: R1 | R2
* assumption: cells are fully charged/discharged --> a refresh is needed beforehand
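
An idealized model of the resulting logic: with three rows sharing charge on each bit-line, the sense amplifier resolves toward the majority of the three stored bits, which reduces to AND or OR depending on the constant held in R3.

```python
# Majority model of three-row charge sharing, assuming fully
# charged/discharged cells (hence the refresh requirement above).

def maj(a, b, c):                 # what the sense amp resolves to
    return (a & b) | (b & c) | (a & c)

for r1 in (0, 1):
    for r2 in (0, 1):
        assert maj(r1, r2, 0) == r1 & r2   # R3 = constant 0 -> AND
        assert maj(r1, r2, 1) == r1 | r2   # R3 = constant 1 -> OR
print("MAJ(R1,R2,0) == AND and MAJ(R1,R2,1) == OR for all inputs")
```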
### operation reliability
* manufacturing variation: cell capacitance varies --> different cells would need different timings
* row remapping: just as bad rows are remapped at post-manufacturing testing, the "bad" columns above can be remapped in software via an error table (a hypothetical sketch follows)
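
One plausible shape for such a software error table; the module name and bad-column sets below are hypothetical, purely for illustration.

```python
# Profiling records the failing columns per (module, sub-array);
# data layout then simply skips those byte lanes.

BAD_COLS = {("module0", 0): {3, 17}}       # hypothetical profiling result

def usable_columns(module, subarray, cols_per_row=8192):
    bad = BAD_COLS.get((module, subarray), set())
    return [c for c in range(cols_per_row) if c not in bad]

print(len(usable_columns("module0", 0)))   # 8190 usable byte lanes
```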
### in-memory compute framework
* software --> use the three basic ops (row copy, AND, OR) to build inversion and to work around errors
* inversion:
    * every variable is kept in two parts: a regular form and a negated form (the latter takes additional steps to generate) -- see the dual-rail sketch below
    * this pairwise value format doubles the memory footprint and the operation count; the authors consider these overheads acceptable compared to the gain in functionality
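
A sketch of this dual-rail format: storing (x, NOT x) makes negation a mere row swap and lets AND/OR on pairs follow De Morgan's laws, so only the in-DRAM copy/AND/OR primitives are ever needed.

```python
# Dual-rail values: x is stored as the pair (x, NOT x).

def dr_not(a):             # negation = swapping the two rows (a copy)
    return (a[1], a[0])

def dr_and(a, b):          # NOT(x & y) == NOT x | NOT y  (De Morgan)
    return (a[0] & b[0], a[1] | b[1])

def dr_or(a, b):           # NOT(x | y) == NOT x & NOT y
    return (a[0] | b[0], a[1] & b[1])

one, zero = (1, 0), (0, 1)         # dual-rail encodings of 1 and 0
print(dr_and(one, dr_not(zero)))   # 1 AND (NOT 0) -> (1, 0), i.e. 1
```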
* preserve operands:
    * only perform the computation in the first three rows of each sub-array, i.e. the lower nine bits of R1, R2 and R3 are always 000000001, 000000010 and 000000000 (assuming 512 rows per sub-array)
    * the in-DRAM ops destroy their operands, so data is basically row-copied into these compute rows first and the result copied back out, leaving the originals intact
* bit-serial arithmetic: e.g. ADD is computed one bit position at a time across all columns, built from the AND/OR/copy primitives (see the sketch below)

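A sketch of bit-serial addition built from AND/OR/NOT alone (NOT being free under the dual-rail format). Bits are plain ints here for clarity; in DRAM each variable is an entire row, so every iteration processes one bit position of all columns in the sub-array at once.

```python
def NOT(a):    return 1 - a
def AND(a, b): return a & b
def OR(a, b):  return a | b

def XOR(a, b):                     # (a AND NOT b) OR (NOT a AND b)
    return OR(AND(a, NOT(b)), AND(NOT(a), b))

def add_bitserial(a_bits, b_bits):
    """LSB-first bit lists of equal length; returns the sum with carry-out."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(XOR(XOR(a, b), carry))           # sum bit
        # carry-out = MAJ(a, b, carry), again only ANDs and ORs
        carry = OR(OR(AND(a, b), AND(b, carry)), AND(a, carry))
    return out + [carry]

print(add_bitserial([1, 0, 1], [1, 1, 0]))   # 5 + 3 = 8 -> [0, 0, 0, 1]
```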
* copy across sub-arrays: falls back to the MC -- read the whole row into the MC, then write it to the destination
### experiment
* SoftMC is used to directly control a DRAM module attached to an FPGA
* operating frequency constrained to 400MHz and 800MHz
* experiments sweep Vdd and temperature
* PoC: which DRAM modules can achieve the theory above
    * metric: the row-wise success ratio for a given pair of timing intervals, in command cycles
    * (the ratio of columns with successful computation over the total number of columns in a row; restated as code below)
    * results use the bank with the highest row success ratio in each module
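
The metric, restated as code:

```python
def row_success_ratio(col_ok):          # col_ok: one bool per column
    return sum(col_ok) / len(col_ok)    # successful / total columns

print(row_success_ratio([True, True, False, True]))   # 0.75
```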

* blue vertical-line distribution: row copy can be implemented using a small T2 timing interval (Fig. 3)
* blue diagonal distribution: either an internal timing check is involved, or the sum T1 + T2 determines whether the copy is performed
* robustness
    * each line = one module; row copy repeated 1000 times over 32K+ rows across the banks of a module
    * all operands randomly generated
    * sharp rise at the beginning of the CDF: in all modules, at least half of the columns can reliably perform the row copy operation
    * the tail confirms the existence of "bad" columns

* supply voltage and temperature
    * lower Vdd: cells charge/discharge more slowly; the effect is vendor- and module-dependent

### Discussion
* throughput

    * a single ADD operation takes over a thousand cycles, and the DRAM command frequency is lower than a processor core's clock
    * the main overhead of ComputeDRAM is the duplicated computation and storage required by the proposed framework
    * as computation in the CPU and in DRAM is decoupled, the system's overall computational capability increases; moving computation from the CPU into main memory also reduces interconnect utilization, which can lower latency and runtime (see the back-of-envelope sketch below)
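
A back-of-envelope sketch of why the latency can still pay off; apart from the ">1000 cycles per ADD" figure from the note above, every constant is an assumption for illustration.

```python
CMD_FREQ   = 400e6      # assumed command frequency of the SoftMC setup
CYCLES_ADD = 1000       # one bit-serial ADD costs on the order of 1e3 cycles
BIT_LANES  = 64 * 1024  # assumed 8 KB row = 65536 independent bit lanes

adds_per_sec = CMD_FREQ / CYCLES_ADD * BIT_LANES
print(f"~{adds_per_sec:.1e} parallel ADDs/s per bank")   # ~2.6e10
```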
## ref
* Compute Caches (Aga et al., HPCA 2017)
* Bruce Jacob, Spencer Ng, and David Wang. 2007. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
* T. Finkbeiner, G. Hush, T. Larsen, P. Lea, J. Leidel, and T. Manning. 2017. In-memory intelligence. IEEE Micro 37, 4 (2017), 30–38.