
###### tags: `Advanced memory`

# Final report

## Motivation

Bulk data operations degrade system performance and energy efficiency. Several system-level operations trigger bulk data copy or initialization. Even though these operations require no computation, current systems transfer a large quantity of data back and forth over the memory channel to perform them. In this work, we focus on optimizing two important classes of bandwidth-intensive memory operations that occur frequently in modern systems:

1. Bulk data copy: copying a large quantity of data from one location in physical memory to another.
2. Bulk data initialization: initializing a large quantity of data to a specific value.

Our goal is to design a mechanism that reduces the latency, bandwidth, and energy consumed by bulk data operations.

## Methodology

#### Background

This section explains the DRAM architecture and the basic operations on a DRAM unit.

![](https://i.imgur.com/Xw6nEid.png)
![](https://i.imgur.com/wyZ8FDv.png)

A DRAM chip consists of multiple DRAM banks, which are further divided into multiple subarrays (Figure 1). Figure 2 shows the internal organization of a subarray. Each subarray consists of a 2-D array of DRAM cells connected to a single row of sense amplifiers, also referred to as the row buffer.

A DRAM cell consists of two components: 1) a capacitor, which represents the logical state in terms of charge, and 2) an access transistor, which determines whether the cell is currently being accessed. A sense amplifier senses the state of a DRAM cell and amplifies it to a stable voltage level. The wire that connects a DRAM cell to its sense amplifier is called a bitline, and the wire that controls the access transistor of a DRAM cell is called a wordline.
Each subarray contains row decoding logic that determines which row of cells (if any) is currently being accessed.

#### Subarray access

![](https://i.imgur.com/E6nvSxA.png)

The figure shows the three steps of a subarray access. In the initial precharged state ➊, the sense amplifiers and the bitlines are maintained at an equilibrium voltage level of 1/2 VDD (half the maximum voltage, VDD). The row decoder is fed with a sentinel input (—), so that all the wordlines within the subarray are lowered, i.e., no row is connected to the sense amplifiers.

To access data from a row A of a subarray, the DRAM controller first issues an ACTIVATE command with the row address. Upon receiving this command, the bank feeds the row decoder of the corresponding subarray with the address A, which raises the wordline of the corresponding row. As a result, all cells of the row are connected to their bitlines and, in turn, to the sense amplifiers. Depending on the initial state of the cells, they lose or gain charge, thereby raising or lowering the voltage level on the corresponding bitlines (state ➋). In this state, the sense amplifiers detect the deviation from 1/2 VDD and begin to amplify it (state ➌). In the process, they also drive the DRAM cells back to their original state. Once amplification is complete, all the sense amplifiers and bitlines are at a stable state (VDD or 0) and the cells are fully restored to their initial state (state ➍). The required data can be accessed from the sense amplifiers as soon as their voltage levels reach a threshold (state ➌).

When the DRAM controller wants to access data from a different row in the bank, it must first take the subarray back to the precharged state. This is done by issuing the PRECHARGE command to the bank. The precharge operation involves two steps. First, it feeds the row decoder with the sentinel input (—), lowering the wordline of the currently activated row.
This essentially disconnects the cells from the corresponding bitlines. Second, it drives the sense amplifiers and the bitlines back to the equilibrium voltage level of 1/2 VDD (state ➏; state ➎ is intentionally ignored for now, as it is part of the Fast Parallel Mode mechanism described below).

#### ROWCLONE: Fast Parallel Mode (FPM)

RowClone has two modes: Fast Parallel Mode and Pipelined Serial Mode.

![](https://i.imgur.com/Qk2WRje.png)

Fast Parallel Mode only applies to copying row data within the same subarray of the same bank.

To copy data from a source row (src) to a destination row (dst) within the same subarray, FPM first activates the source row. At the end of the activation, the sense amplifiers and the bitlines are in a stable state corresponding to the data of the source row, and the cells of the source row are fully restored to their original state. This is depicted by state ➍ in Figure 4. In this state, simply lowering the wordline of src and raising the wordline of dst connects the cells of the destination row to the stable bitlines. Because the sense amplifiers hold the bitlines at a stable voltage, doing so overwrites the data in the cells of the destination row with the data on the bitlines, as depicted by state ➎ in Figure 4.

Pros:
1. FPM copies a 4KB page of data 11.6x faster and with 74.4x less energy than an existing system.

Cons:
1. Requires the source and destination rows to be within the same subarray.
2. Cannot partially copy data from one row to another.

#### ROWCLONE: Pipelined Serial Mode (PSM)

![](https://i.imgur.com/ehSVP2k.png)

The Pipelined Serial Mode efficiently copies data from a source row in one bank to a destination row in a different bank. PSM first activates the corresponding rows in both banks. It then puts the source bank in read mode and the destination bank in write mode, and transfers data one cache line (a column of data, 64 bytes) at a time.
For this purpose, we propose a new DRAM command called TRANSFER. It copies the cache line at the source column index in the activated row of the source bank to the cache line at the destination column index in the activated row of the destination bank. Unlike READ and WRITE, which move data over the memory channel connecting the processor and main memory, TRANSFER does not move data outside the chip, as the figure shows.

Pros:
1. PSM results in a 1.9x reduction in latency and a 3.2x reduction in energy for a 4KB bulk copy operation.
2. Can copy data across banks and, via a temporary row, across subarrays within the same bank.

Cons:
1. Less efficient than Fast Parallel Mode (FPM).

#### Mechanism for Bulk Data Copy

When the data from a source row (src) needs to be copied to a destination row (dst), there are three possible cases depending on the location of src and dst:

1. src and dst are within the same subarray.
2. src and dst are in different banks.
3. src and dst are in different subarrays within the same bank.

For case 1 and case 2, RowClone uses FPM and PSM, respectively, to complete the operation. Case 3 is an intra-bank copy that neither mode handles directly. To keep the design simple, the mechanism handles it by using PSM to first copy the data from src to a temporary row (tmp) in a different bank, and then using PSM again to copy the data from tmp to dst.

#### Mechanism for Bulk Data Initialization

Bulk data initialization sets a large block of memory to a specific value. To perform this operation efficiently, the mechanism first initializes a single DRAM row with the corresponding value. It then uses the appropriate copy mechanism (described above) to copy the data to the other rows to be initialized.

Bulk Zeroing (or BuZ), a special case of bulk initialization, is a frequently occurring operation in today's systems.
To accelerate BuZ, one can reserve one row in each subarray that is always initialized to zero. The mechanism can then use FPM to efficiently zero any row in DRAM by copying data from the reserved zero row of the corresponding subarray into the destination row. The capacity loss of reserving one row out of 512 rows in each subarray is very modest (about 0.2%).

## Evaluation

![](https://i.imgur.com/TxBhW7P.png)
![](https://i.imgur.com/j5QvHOm.png)

## Summary

RowClone is a new technique for exporting bulk data copy and initialization operations to DRAM. It relies on the fact that DRAM can internally transfer multiple kilobytes of data between the DRAM cells and the row buffer, and it requires very few changes to the DRAM architecture.
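
As a recap of the copy and initialization mechanisms in the Methodology section, here is a minimal Python sketch of how a controller might choose between FPM and PSM for the three copy cases, and how bulk zeroing maps onto FPM. It is an illustration under stated assumptions, not RowClone's actual controller logic. The names `RowAddr`, `plan_copy`, `plan_zero`, and `zero_row_capacity_loss` are hypothetical, as are `NUM_BANKS`, `ZERO_ROW`, `TMP_ROW`, and the choice of the neighbouring bank for the temporary row. Only the 512 rows per subarray comes from the text.

```python
from dataclasses import dataclass

# Illustrative geometry. Only ROWS_PER_SUBARRAY (512) comes from the text;
# every other constant is an assumption made for this sketch.
NUM_BANKS = 8
ROWS_PER_SUBARRAY = 512
ZERO_ROW = 0  # hypothetical index of the reserved all-zero row
TMP_ROW = 1   # hypothetical index of the temporary row used in case 3


@dataclass(frozen=True)
class RowAddr:
    """Location of one DRAM row (hypothetical addressing scheme)."""
    bank: int
    subarray: int
    row: int


def plan_copy(src: RowAddr, dst: RowAddr) -> list:
    """Return the (mode, source, destination) steps for copying one row."""
    if src.bank != dst.bank:
        # Case 2: different banks, a single PSM copy.
        return [("PSM", src, dst)]
    if src.subarray == dst.subarray:
        # Case 1: same bank and same subarray, a single FPM copy.
        return [("FPM", src, dst)]
    # Case 3: same bank, different subarrays. Bounce through a temporary
    # row in another bank using two PSM copies. Picking the next bank is
    # an arbitrary choice for this sketch.
    tmp = RowAddr((src.bank + 1) % NUM_BANKS, 0, TMP_ROW)
    return [("PSM", src, tmp), ("PSM", tmp, dst)]


def plan_zero(dst: RowAddr) -> list:
    """Zero a row by copying from the reserved zero row of its subarray."""
    zero = RowAddr(dst.bank, dst.subarray, ZERO_ROW)
    return plan_copy(zero, dst)  # same subarray, so this is always FPM


def zero_row_capacity_loss() -> float:
    """Fraction of capacity lost by reserving one zero row per subarray."""
    return 1 / ROWS_PER_SUBARRAY


if __name__ == "__main__":
    # Intra-bank copy across subarrays: expands to two PSM steps.
    for step in plan_copy(RowAddr(0, 1, 5), RowAddr(0, 2, 9)):
        print(step)
```

Because the reserved zero row always sits in the destination's own subarray, `plan_zero` never needs the slower PSM path, which is the reason for reserving one zero row per subarray rather than one per bank.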
