Processing-in-memory: A workload-driven perspective

# Processing-in-memory: A workload-driven perspective

###### tags: `PIM` `Onur Mutlu` `IBM journal`

`tags`:: [paper](https://users.ece.cmu.edu/~saugatag/papers/19ibmjrd_pim.pdf) || [slides]()

## 0. Abstract

* Data movement caused by the memory wall costs energy and hurts performance
* The CPU performs almost all of the computation
* PIM can avoid both problems, but several challenges must be overcome:
    * Enabling workloads to run on PIM
    * The interface programmers use for PIM differs from the traditional CPU one
* The paper presents three directions for designing PIM architectures:
    1. Introduce the workloads that suit PIM (ML, data analytics, genome analysis)
    2. Address the problems of programming these applications on PIM architectures
    3. Describe the problems that stand in the way of widespread PIM adoption

## 1. Introduction

* Two challenges in using PIM:
    1. When to use it:
        * Area and energy constraints
        * Application characteristics (compute-intensive or memory access-intensive)
        * Data shared between different functions
    2. A simple interface:
        * How can PIM be used efficiently without cache coherence and address translation?
* For challenge 1, the paper proposes a toolflow that helps programmers decide systematically which parts of the work to give to PIM
* For challenge 2, the paper builds a set of interfaces and mechanisms that let programmers use PIM

## 2. Overview of PIM

### 2.1 Initial push for PIM

* Historical background; skipped

### 2.2 New opportunities in modern memory systems

* Two new memory technologies can be used for PIM:
    1. 3D-stacked memory
        * HBM
        * Wide I/O
        * Wide I/O 2
        * HMC
    2. NVM
        * PCM
        * MRAM
        * RRAM
* A 3D-stacked DRAM has a logic layer at the bottom of the stack, where a processor could be placed
* NVM is still an emerging technology and, unlike DRAM, is not yet commercialized, so there is an opportunity to build PIM into NVM

### 2.3 Two approaches: Processing-near-memory vs. processing-using-memory

![](https://i.imgur.com/0CMIXoO.jpg)

## 3. Identifying opportunities for PIM in applications

### 3.1 Design constraints for PIM

* In terms of area, taking an HMC built in a 22 nm process as an example:
    * Adding PIM to the DRAM logic layer as a whole leaves roughly 20-60 mm^2 of area, but accesses that cross vaults are slower
    * Adding PIM to every vault leaves roughly 3.5-4.4 mm^2 per vault (32 vaults), but that size limits the memory space the PIM logic can access
* Cost, power consumption, and heat must also be considered when adding PIM

![](https://i.imgur.com/LByv5MG.png)

[ref](https://arxiv.org/pdf/1706.02725v3.pdf)

### 3.2 Choosing what to execute in memory

* Intuitively, one might hand work-intensive applications straight to PIM. But even if the PIM core is ISA-compatible, area, energy, and thermal constraints leave it without a multi-level cache hierarchy and without out-of-order or superscalar execution, so it cannot exploit instruction-level parallelism and will not perform as well as the CPU
* In short, compute-intensive or cache-friendly code should stay on the CPU
* The paper proposes a toolflow (hardware performance counters plus the authors' energy model) to identify which parts of an application are suited to PIM (PIM targets). A function is a candidate if it meets all of the following conditions:
    1. It consumes the most energy of all functions in the workload
    2. Its data movement accounts for more than 20% of the workload's total energy
    3. It is memory-intensive (last-level cache misses per kilo instruction, or MPKI, above 10)
    4. Data movement is the largest source of energy consumption in the function
* Each candidate function is then checked against two further criteria and dropped if either one applies:
    1. Its performance degrades when it runs on simple PIM logic
    2. It needs more area than is available in the logic layer

### 3.3 Case Study: PIM opportunities in TensorFlow

* Profiling TensorFlow Lite shows a few operations that consume especially large amounts of energy and execution time (Fig. 2 / Fig. 3):
    * Conv2D + MatMul
    * Packing
    * Quantization
    * ![](https://i.imgur.com/HMjv6wR.png)
    * ![](https://i.imgur.com/GFoStW5.png)
* Why Conv2D + MatMul are not suited to PIM:
    1. Most of their energy is spent on computation (67.5%)
    2. They would need fairly complex PIM logic

## 4. Programming PIM architectures: Key issues

* This section examines four key issues that affect the programmability of PIM architectures:
    1. The different granularities at which work can be offloaded to a PIM kernel
    2. How to share data between PIM kernels and CPU threads
    3. How to give PIM kernels efficient virtual memory address translation
    4. How to identify and offload PIM targets automatically

### 4.1 Offloading Granularity

* Four granularities are examined:
    * a single instruction
    * a bulk operation
    * an entire function
    * an entire application

#### 4.1.1 a single instruction

* PIM-enabled instructions (PEIs) can be added to an existing ISA; each PIM operation is expressed as a single instruction
* Fig. 6 shows an architecture that supports PEIs. In it, a PEI executes on a PCU (PEI Computation Unit)
    * A PCU is added to every host CPU and every vault
    * ![](https://i.imgur.com/KDUJNtD.png)
* The CPU only has to execute a single PEI. The PEI is sent to the PMU (central PEI Management Unit), which launches it on an appropriate PCU
* Three key points let PEIs run with only small changes to the system:
    1. PIM operations map one-to-one onto PEI instructions in the ISA. This removes the need for address translation in memory, because translation is already done before the CPU sends the PEI to memory
    2. A PIM operation may only operate on a single cache line, so cache coherence only has to be handled for that one cache line
    3. A PEI is atomic with respect to other PEIs, and memory fences are used to enforce atomicity between a PEI and normal CPU instructions
* PIM kernels and CPU threads can cooperate simply this way, but each operation can handle only a small amount of data and only simple computation; otherwise the overhead becomes large when many PIM operations are issued

#### 4.1.2 a bulk operation

* Bulk operations are offloaded to memory and carried out by multiple PCUs. Examples:
    * bulk copy and data initialization
    * bulk bitwise operations
    * simple arithmetic operations
* [Ambit]() is a processing-using-memory architecture that exploits the properties of DRAM cells to perform bulk bitwise operations. On database queries it is 3-12x faster than a CPU-only system
* There are two tradeoffs:
    1. A single bulk operation is restricted in what it can do; in Ambit, for example, an operation cannot work on less than one row of data at a time
    2. A processing-using-memory architecture is constrained by the properties of the memory array, so it can do less than a general-purpose core

#### 4.1.3 an entire function

* There are several ways to mark the code region that should run on PIM. One is to wrap the region in compiler directives, such as `#PIM_begin` and `#PIM_end`, so that the compiler generates a thread to execute on PIM
* This approach needs compiler or library support to dispatch a PIM kernel to memory, and the programmer needs some way of knowing which regions of the program should be given to PIM
* It also has to account for how the CPU and the PIM logic cooperate. In the paper's approach, the instructions to be offloaded to PIM are marked, and the CPU and the PIM logic may execute concurrently

#### 4.1.4 an entire application

* Offloading the whole application removes the need to handle cache coherence, but it restricts which applications qualify and requires changes across the whole system. Two examples:
    * Tesseract
    * GRIM-Filter
* Tesseract

### 4.2 Sharing data between PIM logic and CPUs

### 4.3 Virtual memory

### 4.4 Enabling programmers and compilers to find PIM targets

## 5. Related work

## 6. Future challenges
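
## Appendix: Sketch of the Section 3.2 screening criteria

The four candidate conditions and two rejection criteria from Section 3.2 can be written down as a small filter, which may make the decision order easier to follow. This is only a sketch, not the paper's toolflow: the `FunctionProfile` class, its field names, the `is_pim_target` function, the default area budget (taken from the lower per-vault figure in Section 3.1), and all numbers in the example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FunctionProfile:
    """Hypothetical per-function profile; all field names are illustrative."""
    name: str
    energy: dict          # energy per component, e.g. {"data_movement": 6.0, "compute": 2.0}
    llc_mpki: float       # last-level cache misses per kilo instruction
    slower_on_pim: bool   # True if performance degrades on simple PIM logic
    pim_area_mm2: float   # estimated area of the PIM logic this function needs

    @property
    def total_energy(self) -> float:
        return sum(self.energy.values())


def is_pim_target(fn, workload, area_budget_mm2=3.5):
    """Apply the Section 3.2 conditions to one function of a workload."""
    workload_energy = sum(f.total_energy for f in workload)
    data_movement = fn.energy.get("data_movement", 0.0)

    candidate = (
        # 1. consumes the most energy of all functions in the workload
        fn.total_energy == max(f.total_energy for f in workload)
        # 2. data movement is more than 20% of the workload's total energy
        and data_movement > 0.20 * workload_energy
        # 3. memory-intensive: MPKI above 10
        and fn.llc_mpki > 10
        # 4. data movement is the function's largest energy component
        and data_movement == max(fn.energy.values())
    )
    # A candidate is dropped if it slows down on PIM or does not fit the area.
    rejected = fn.slower_on_pim or fn.pim_area_mm2 > area_budget_mm2
    return candidate and not rejected


# Made-up profiles, only to exercise the filter.
copy_heavy = FunctionProfile(
    "copy_heavy", {"data_movement": 6.0, "compute": 2.0},
    llc_mpki=25.0, slower_on_pim=False, pim_area_mm2=0.3)
math_heavy = FunctionProfile(
    "math_heavy", {"data_movement": 1.5, "compute": 3.5},
    llc_mpki=4.0, slower_on_pim=True, pim_area_mm2=5.0)
workload = [copy_heavy, math_heavy]

for fn in workload:
    print(fn.name, is_pim_target(fn, workload))
```

In this sketch the rejection criteria are evaluated separately from the candidate conditions, mirroring the two-stage order in the notes: a function first has to qualify on energy and memory intensity, and only then is it checked against performance and area.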