# ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs
###### tags: `PIM` `DRAM` `bit-serial` `main memory`
###### paper origin: MICRO-2019
###### papers: [link](https://dl.acm.org/doi/10.1145/3352460.3358260)
## Introduction
### Motivation
1. Moving data between compute resources and main memory utilizes a large portion of the overall system energy and significantly contributes to program execution time.
2. The communication latency between compute resources and off-chip DRAM has not improved as fast as the amount of computing resources has increased.
### Problem
1. Performing computations with memory resources has relied either on emerging memory technologies or on additional circuits added to the RAM array.
2. Previous techniques relied on hardware modifications.
### Solution
1. Perform computations inside the memory itself instead of moving data to the compute resources.
2. Propose a novel method that performs computation with off-the-shelf, unmodified, commercial DRAM.
3. Utilizing a customized memory controller, we are able to change the timing of standard DRAM memory transactions, operating outside of specification, to perform massively parallel logical AND, logical OR, and row copy operations.
## DRAM System Organization

### DRAM commands and timing
**1.** *PRECHARGE*: Applies to a whole bank. It first closes the currently opened row by zeroing all word-lines in the target bank, and subsequently drives all bit-lines to V~dd~/2 as an initial value.
**2.** *ACTIVATE*: The command targets a specific row. The word-line of the addressed row is raised high, which connects the cells of that row directly to the bit-lines. Charge sharing occurs between the storage cell and the bit-line. A sense amplifier is enabled which drags the voltage to V~dd~ or GND.
**3.** *READ*/**4.** *WRITE*: The command applies to four or eight consecutive columns. These commands must be sent after a row is activated in that bank.

* Row Access to Column Access Delay (t~RCD~): the minimum time between an *ACTIVATE* and a *READ*/*WRITE*.
* Row Access Strobe (t~RAS~): the minimum time after *ACTIVATE* that a *PRECHARGE* can be sent. This is time used to open a row, enable the sense amplifier, and wait for the voltage to reach V~dd~ or GND.
* Row Precharge (t~RP~): the minimum time after the *PRECHARGE* in a bank before a new row access. It ensures that the previously activated row is closed and the bit-line voltage has reached V~dd~/2.
* Row Cycle (t~RC~=t~RAS~+t~RP~): the interval required between accesses to different rows in the same bank.
## Compute in DRAM

* Three commands, *ACTIVATE*(R~1~), *PRECHARGE*, and *ACTIVATE*(R~2~), that target two different rows of the same bank are executed.
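* Let *T~1~* be the interval between *ACTIVATE*(R~1~) and *PRECHARGE*, and *T~2~* the interval between *PRECHARGE* and *ACTIVATE*(R~2~). Under nominal operation, *T~1~* is required to be longer than t~RAS~, and *T~2~* longer than t~RP~.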
* By appropriately reducing the timing intervals, outside the specification limits, we can force the chip to reach a non-nominal state that realizes a set of different operations.
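
The timing constraints above can be sketched as a small check, taking *T~1~* as the *ACTIVATE*(R~1~)-to-*PRECHARGE* interval and *T~2~* as the *PRECHARGE*-to-*ACTIVATE*(R~2~) interval. This is only an illustration: the `Timing` class, the `is_nominal` helper, and the cycle counts in the usage lines are hypothetical placeholders, not values from a datasheet or from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """DRAM timing limits expressed in memory-controller cycles (assumed unit)."""
    t_ras: int  # minimum ACTIVATE -> PRECHARGE interval
    t_rp: int   # minimum PRECHARGE -> next ACTIVATE interval

    @property
    def t_rc(self) -> int:
        # Row cycle time: tRC = tRAS + tRP
        return self.t_ras + self.t_rp


def is_nominal(timing: Timing, t1: int, t2: int) -> bool:
    """Return True if ACTIVATE(R1), PRECHARGE, ACTIVATE(R2) respects the spec."""
    return t1 >= timing.t_ras and t2 >= timing.t_rp


# Placeholder numbers chosen only to exercise the check.
spec = Timing(t_ras=10, t_rp=4)
print(spec.t_rc)                       # row cycle time in cycles
print(is_nominal(spec, t1=10, t2=4))   # within specification
print(is_nominal(spec, t1=10, t2=1))   # T2 violates tRP (row-copy style timing)
print(is_nominal(spec, t1=1, t2=1))    # both violated (AND/OR style timing)
```

Any sequence for which `is_nominal` returns `False` is outside the specification; which in-memory operation it triggers, if any, depends on the chip.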
### Basic In-Memory Operations
#### Row Copy

1. Reducing the timing interval *T~2~* to a value significantly shorter than t~RP~. This causes the second *ACTIVATE* command to interrupt the *PRECHARGE* command.
2. Initial state: the bit-line is set to V~dd~/2, the cell in R~1~ is charged (storing one), and the cell in R~2~ is discharged (storing zero).
3. In step **1**, we send the first *ACTIVATE* command, which opens the row R~1~.
4. Step **2** shows the result of the charge sharing, and the sense amplifier starting to drive both the cell in R~1~ and the bit-line to V~dd~.
5. In step **3**, we execute the *PRECHARGE* command, which will attempt to close row R~1~ and drive the bit-line to V~dd~/2.
6. In step **4**, the word-line of R~1~ has been zeroed out to close R~1~, but the bit-line did not have enough time to be fully discharged to V~dd~/2.
7. In step **5**, the second *ACTIVATE* command is executed so as to interrupt the *PRECHARGE* process. Charge starts to flow from bit-line to the cells in R~2~. When the sense amplifier is enabled to drive the bit-line, together with the cell in R~2~, it will reinforce their value to V~dd~.
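
The row copy steps above can be mimicked with a purely logical toy model. It ignores all analog behaviour (charge sharing, voltages, timing margins) and only captures the outcome: when the *PRECHARGE* is interrupted, the bit-lines still carry the contents of R~1~ and overwrite R~2~. The function name `act_pre_act` and the `precharge_completed` flag are assumptions made for this sketch.

```python
def act_pre_act(rows, r1, r2, precharge_completed):
    """Toy model of ACTIVATE(r1), PRECHARGE, ACTIVATE(r2) on one sub-array.

    rows maps a row address to a list of bits (one bit per column).
    Returns the values latched on the bit-lines after the second ACTIVATE.
    """
    # First ACTIVATE: the sense amplifiers latch the contents of r1.
    bitlines = list(rows[r1])

    if precharge_completed:
        # Nominal timing: bit-lines were reset to Vdd/2, so the second
        # ACTIVATE senses r2 and restores it unchanged.
        bitlines = list(rows[r2])
    else:
        # T2 much shorter than tRP: bit-lines still hold r1's values,
        # so the cells of r2 are driven to those values (row copy).
        rows[r2] = list(bitlines)

    return bitlines


rows = {1: [1, 0, 1, 1], 2: [0, 0, 0, 0]}
act_pre_act(rows, r1=1, r2=2, precharge_completed=False)
print(rows[2])  # R2 now holds a copy of R1
```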
#### AND/OR

1. Setting both *T~1~* and *T~2~* to the minimum value, which means we execute the *ACTIVATE*(R~1~), *PRECHARGE*, and *ACTIVATE*(R~2~) in rapid succession with no idle cycles in between.
2. In step **1**, we send the first *ACTIVATE* command, which opens R~1~.
3. We interrupt the first *ACTIVATE* by sending the *PRECHARGE* immediately in step **2**.
4. In step **3**, the second *ACTIVATE* is sent, thus interrupting the *PRECHARGE* command. An adequately small *T~2~* prevents R~1~ from being closed.
5. On its execution, the second *ACTIVATE* command will change the row address from R~1~ to R~2~.
6. In the process of changing the activated row address, an intermediate value R~3~ appears on the row address bus. Because the *PRECHARGE* was interrupted right at the beginning, the bank is still in the "setting the word-line" state of the activation process, which causes R~3~ to be opened as well.
7. The specified destination R~2~ will be opened at the end of step **3**.
8. After the charge sharing, all cells and the bit-line reach the same voltage level in step **4**.
9. The resultant voltage depends on **majority value** in the cells R~1~, R~2~ and R~3~.

* Fixing the value of R~1~ to constant zero reduces the truth table to the dotted circle, meaning a logical AND is performed on the values in R~2~ and R~3~.
* Fixing the value of R~3~ to constant one reduces the truth table to the solid circle, this time performing a logical OR on the values in R~1~ and R~2~.
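
The reduction from majority to AND/OR can be checked with a short bitwise sketch, where each Python integer stands for one row and each bit for one column. The helper names (`majority`, `in_memory_and`, `in_memory_or`) are illustrative only. Because the majority function is symmetric, it does not matter logically which of the three rows holds the constant.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows; each bit position is one column."""
    return (a & b) | (a & c) | (b & c)


def in_memory_and(x: int, y: int) -> int:
    # Constant row of all zeros: majority(x, y, 0) == x AND y
    return majority(x, y, 0)


def in_memory_or(x: int, y: int, width: int) -> int:
    # Constant row of all ones: majority(x, y, 1...1) == x OR y
    all_ones = (1 << width) - 1
    return majority(x, y, all_ones)


x, y = 0b1100, 0b1010
print(format(in_memory_and(x, y), "04b"))           # 1000
print(format(in_memory_or(x, y, width=4), "04b"))   # 1110
```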
#### Arbitrary Computations with Memory
* The missing functionality for arbitrary computations is the NOT operation.
* To avoid modifications to the DRAM design, every variable is stored in two parts: one in regular form and one negated.
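
One way to picture the two-part representation is the sketch below, which assumes a dual-rail encoding: NOT becomes a swap of the two parts, and the negated part of an AND/OR result follows from De Morgan's laws, so only AND and OR are ever computed. The `DualRail` type and the helper names are assumptions for illustration and are not taken from the paper.

```python
from typing import NamedTuple


class DualRail(NamedTuple):
    pos: int  # regular form, one bit per column
    neg: int  # negated form, one bit per column


def encode(value: int, width: int) -> DualRail:
    mask = (1 << width) - 1
    return DualRail(value & mask, ~value & mask)


def dr_not(x: DualRail) -> DualRail:
    # NOT needs no computation: swap the two parts.
    return DualRail(x.neg, x.pos)


def dr_and(x: DualRail, y: DualRail) -> DualRail:
    # De Morgan: NOT(x AND y) == (NOT x) OR (NOT y)
    return DualRail(x.pos & y.pos, x.neg | y.neg)


def dr_or(x: DualRail, y: DualRail) -> DualRail:
    # De Morgan: NOT(x OR y) == (NOT x) AND (NOT y)
    return DualRail(x.pos | y.pos, x.neg & y.neg)


def dr_xor(x: DualRail, y: DualRail) -> DualRail:
    # XOR built only from AND, OR and the free NOT.
    return dr_or(dr_and(x, dr_not(y)), dr_and(dr_not(x), y))


a, b = encode(0b1100, 4), encode(0b1010, 4)
print(format(dr_xor(a, b).pos, "04b"))  # 0110
```

The cost of this choice is that every variable occupies twice the storage and every AND/OR is performed on both parts.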

### Operation Reliability
#### Manufacturing Variations
* Due to imperfections in the manufacturing process, size and capacitance of elements are not uniform across the chip.
* The behavior divergence leads to a partial execution of an operation or erroneous results.
* It is not always possible to find a timing interval that makes all columns work, so the timing interval that covers the most columns is selected.
* The columns that fail to perform the operation are excluded from in-memory computations.
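
The selection rule above could be sketched as follows, assuming the per-timing results are already available as sets of working column indices. The data layout, the function names, and the numbers in the usage lines are hypothetical and serve only to illustrate the idea.

```python
def select_timing(results):
    """Pick the (T1, T2) pair under which the most columns work.

    results maps a (t1, t2) tuple to the set of column indices that
    produced the correct result with that timing.
    """
    best = max(results, key=lambda timing: len(results[timing]))
    return best, results[best]


def excluded_columns(num_columns, working):
    """Columns to keep out of in-memory computation for the chosen timing."""
    return sorted(set(range(num_columns)) - set(working))


# Made-up scan results for a 6-column toy sub-array.
results = {
    (1, 1): {0, 1, 2},
    (1, 2): {0, 1, 2, 3, 5},
    (2, 2): {0},
}
timing, working = select_timing(results)
print(timing)                        # (1, 2)
print(excluded_columns(6, working))  # [4]
```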
#### Row Remapping
* Imperfections that render some DRAM cells unusable can also manifest.
* Post-manufacturing testing remaps the faulty cells to redundant rows.
* In-memory computing operations need the operand rows to be in the same sub-array.
* We use software remapping and an error table to solve these challenges.
## Result
We built a test benchmark that selects a random sub-array from each bank of a DRAM module and performs an exploratory scan to identify pairs of timing intervals that produce successful computations.
### Proof of Concept

* Only DRAMs in groups SKhynix_2G_1333 and SKhynix_4G_1333B are able to perform the operations across all columns of the sub-array.
* For the modules that cannot perform the operations, we hypothesize that they contain hardware that checks the timing of operations and drops commands that are too tightly timed.
### Robustness

* The flat horizontal line in the CDF indicates that most of the remaining columns **always** fail to perform the row copy operation.
* This result validates our strategy of producing an error table of "bad" columns to avoid utilizing the offending columns.
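
A minimal sketch of how such an error table might be built from repeated trials is shown below. It assumes each trial is recorded as one boolean per column, and it treats any column that is not always successful as "bad". The `min_rate` threshold and the function names are assumptions for illustration, not the paper's procedure.

```python
def column_success_rates(trials):
    """Fraction of trials in which each column produced the correct result.

    trials is a non-empty list of equal-length lists of booleans,
    one entry per column.
    """
    num_trials = len(trials)
    num_columns = len(trials[0])
    return [
        sum(trial[col] for trial in trials) / num_trials
        for col in range(num_columns)
    ]


def build_error_table(rates, min_rate=1.0):
    """Indices of columns whose success rate is below min_rate."""
    return [col for col, rate in enumerate(rates) if rate < min_rate]


# Made-up outcomes: 2 trials over 3 columns.
trials = [
    [True, False, True],
    [True, False, False],
]
rates = column_success_rates(trials)
print(rates)                     # [1.0, 0.0, 0.5]
print(build_error_table(rates))  # [1, 2]
```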
### Effect of Supply Voltage and Temperature

* Voltage
    * Reducing the supply voltage increases propagation delay, making the circuit slower, so more cycles are needed to achieve the same effect.

* Temperature
    * At higher temperatures, carrier mobility in the circuit decreases, which increases propagation delay and makes the circuit slower.