# Reading Note – ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs

###### tags: `paper`

## Introduction

* Paper: [here](https://parallel.princeton.edu/papers/micro19-gao.pdf)
* Authors: Fei Gao, Georgios Tziantzioulis, David Wentzlaff; Princeton University
* Published at MICRO 2019
* TL;DR: **the first work to demonstrate in-memory computation with off-the-shelf, unmodified, commercial DRAM**. This is achieved mainly by violating the timing constraints of DRAM.

## Main Idea

### basics of DRAM

* DRAM architecture: channels, ranks, banks, and rows/columns
    ```
    one DIMM per channel
    two ranks per module
    8 chips / rank, on the same module
    8 independent banks / chip
    rows/columns per bank
      a column is the smallest unit of data that can be addressed (one byte per chip)
    512 rows / sub-array; each bit-line connects all the cells of one bit position
      in a column to the sense amplifier in the local row buffer
    ```
* a traditional memory controller (MC) is properly timed when sending commands to DRAM
    > Their work exploits "timing"
* DRAM commands
    * PRECHARGE: **zeroes** all word-lines in the target bank and pulls the **bit-lines to Vdd/2**
    * ACTIVATE: the word-line of the addressed row is raised high and connects the cells of that row directly to the bit-lines --> the charge in each cell makes the bit-line voltage deviate from Vdd/2, then the sense amplifier turns on to drag the voltage to Vdd or GND --> amplifies (and restores) the value in the cell
    * READ/WRITE: commands apply to four or eight consecutive columns according to the burst mode
* timing constraints:
![](https://i.imgur.com/aeYCfMZ.png)
![](https://i.imgur.com/xPS4Rrr.png)

### Computation using off-the-shelf DRAM

* goal: by violating the timing constraints, achieve charge sharing and open multiple rows to perform operations on the bit-lines
* row copy: reduce T2 so that the second ACTIVATE interrupts the PRECHARGE, but T2 should not be too short, otherwise another row may open
![](https://i.imgur.com/G4rYi3B.png)
    * limitations:
        * source and destination must share the same bit-lines --> operates only within a sub-array of a bank
        * a second PRECHARGE is needed to make the second ACTIVATE recognized by the DRAM
* AND/OR (destructive): open three different rows simultaneously (modeled in the first sketch after the framework section below)
    * issue PRECHARGE immediately after ACTIVATE(R1) so that R1 does not dominate the other values (discussed below)
    * the row-address update --> implicitly opens a third row
    * control over the result: fix the lower two bits of the addresses of R1 and R2 to 01₂ and 10₂ and require their higher bits to be identical
    * R1, opened early, has more time to influence the bit-line --> table of robust operations --> the prerequisite for logical AND/OR is a constant-zero/one row
![](https://i.imgur.com/UnktRNo.png)
    * dotted circle: R2 & R3
    * solid circle: R1 | R2
    * assumption: cells are fully charged/discharged --> need a refresh beforehand

### operation reliability

* manufacturing variation: capacitance varies from cell to cell --> different cells require different timings
* row remapping: the bad columns above are remapped at post-manufacturing testing --> handled with a software remapping / error table

### in-memory compute framework

* software --> use the three basic ops to build inversion and to overcome errors
* inversion:
    * all variables in the model are composed of two parts: one in regular form and one negated (additional steps to generate)
    * this pairwise value format uses twice the memory space and twice the number of operations; the authors argue that, compared to the gain in functionality, these overheads are acceptable
* preserving operands:
    * only perform the computation in the first three rows of each sub-array, i.e. the lower nine bits of R1, R2 and R3 are always 000000001, 000000010 and 000000000 (assuming 512 rows in a sub-array)
    * the compute is basically a destructive read, so operands are first copied into these reserved rows and the result is copied back
* bit-serial arithmetic (see the second sketch below):
![](https://i.imgur.com/1TPM0zT.png)
* copy across sub-arrays: via the MC, which reads the whole row, stores it, then writes it to the destination
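To make the triple-row AND/OR concrete, below is a minimal Python model (my own sketch under the paper's idealized assumptions, not code from the paper). Opening three rows that share bit-lines lets their cells charge-share, so each bit-line settles to the majority of the three stored bits, and the sense amplifiers write that majority back to all three rows:

```python
# Idealized model of triple-row activation: each bit-line settles to the
# bitwise MAJORITY of the three activated rows, and the result is written
# back to all three rows (the operation is destructive).

ONES = (1 << 64) - 1  # a constant-1 row, 64 columns wide for illustration

def maj(r1: int, r2: int, r3: int) -> int:
    """Bitwise majority across three simultaneously opened rows."""
    return (r1 & r2) | (r2 & r3) | (r1 & r3)

def dram_and(x: int, y: int) -> int:
    return maj(x, y, 0)      # constant-0 third row: MAJ(x, y, 0) = x & y

def dram_or(x: int, y: int) -> int:
    return maj(x, y, ONES)   # constant-1 third row: MAJ(x, y, 1) = x | y
```

In real silicon the outcome is not a clean majority: R1 is opened earlier and has more time to drive the bit-line, which is why the robust-operation table above pins the constant row to particular address positions.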
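The bit-serial arithmetic can likewise be sketched on top of the three primitives. The model below (again my own illustration with hypothetical names, not the paper's framework API) builds a ripple-carry adder from AND/OR only, using the negated-pair value format so inversion is free (just swap the regular and negated halves); each `int` stands for one DRAM row holding one bit position of many numbers in parallel (SIMD across columns):

```python
WIDTH = 8                    # columns computed in parallel within a row
MASK = (1 << WIDTH) - 1

def AND(x, y): return x & y            # in DRAM: MAJ(x, y, constant-0 row)
def OR(x, y):  return (x | y) & MASK   # in DRAM: MAJ(x, y, constant-1 row)

def xor_pair(a, an, b, bn):
    """XOR and its complement from AND/OR, operands in negated-pair form."""
    s  = OR(AND(a, bn), AND(an, b))    # a ^ b
    sn = OR(AND(a, b), AND(an, bn))    # ~(a ^ b)
    return s, sn

def bit_serial_add(xs, ys, nbits):
    """xs[i], ys[i] are rows holding bit i of WIDTH independent numbers."""
    carry, carry_n = 0, MASK           # carry-in = 0, plus its complement
    out = []
    for i in range(nbits):
        # the negated halves would be stored alongside in DRAM; derived here
        a, an = xs[i], ~xs[i] & MASK
        b, bn = ys[i], ~ys[i] & MASK
        t, tn = xor_pair(a, an, b, bn)
        s, _  = xor_pair(t, tn, carry, carry_n)
        carry = OR(OR(AND(a, b), AND(b, carry)), AND(a, carry))  # MAJ(a,b,c)
        carry_n = ~carry & MASK        # in DRAM: MAJ of the complements
        out.append(s)
    return out

# demo: compute 5 + 3 in every column at once
xs = [MASK if (5 >> i) & 1 else 0 for i in range(4)]
ys = [MASK if (3 >> i) & 1 else 0 for i in range(4)]
print(sum((b & 1) << i for i, b in enumerate(bit_serial_add(xs, ys, 4))))  # 8
```

Each loop iteration maps to several row copies (staging operands into the three reserved compute rows and saving results), which is consistent with the discussion below counting over a thousand cycles for a single ADD.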
### experiment

* SoftMC is used to directly control a DRAM module attached to an FPGA
    * operating frequency constrained to 400 MHz and 800 MHz
    * experiments also sweep Vdd and temperature
* proof of concept: which DRAM modules can achieve the operations above
    * metric: the row-wise success ratio for a given pair of timing intervals (in command cycles)
    * (the ratio of columns with successful computation over the total number of columns in a row)
    * using the results from the bank that provided the highest row success ratio for each module
![](https://i.imgur.com/NGRTjrk.png)
    * blue vertical-line distribution: row copy can be implemented using a small T2 timing interval (fig 3)
    * blue diagonal distribution: [possibly an internal timing check] OR [the sum T1 + T2 determines whether the copy operation is performed]
* robustness
    * each line is a different module; the copy was run 1000 times over 32K rows across the banks of a module
    * all operands are randomly generated
    * sharp rise at the beginning of the CDF: in all modules, at least half of the columns can reliably perform the row-copy operation
    * the flat tail confirms that "bad" columns exist
![](https://i.imgur.com/InCWde6.png)
* supply voltage and temperature
    * lower Vdd: slower; the effect is vendor- and module-dependent
![](https://i.imgur.com/CpB68w8.png)

### Discussion

* throughput
![](https://i.imgur.com/8jQUM7H.png)
    * a single ADD operation accounts for over a thousand cycles, and the DRAM command frequency is lower than that of a processor core
    * the main overhead of ComputeDRAM is the duplicated computation and storage required by the proposed framework
    * as computation in the CPU and DRAM is decoupled, the overall computational capability of the system is increased; furthermore, as computation moves from the CPU to main memory, the interconnect's utilization is reduced, which can lead to reduced latency and runtime

## ref

* Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. 2017. Compute Caches. In HPCA 2017.
* Bruce Jacob, Spencer Ng, and David Wang. 2007. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
* T. Finkbeiner, G. Hush, T. Larsen, P. Lea, J. Leidel, and T. Manning. 2017. In-memory intelligence. IEEE Micro 37, 4 (2017), 30–38.