Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation
0. Abstract
- 將 pointer chasing 的部份移到 memory 做,因為原先由 CPU 做會有兩個缺點:
- irregular access 造成頻繁的 CPU miss & TLB miss
- 每次 data 都要從 memory 端 send back,CPU 才能決定下一個 access 的 pointer
- 提出 IMPICA,在 memory 做 pointer chasing,有以下困難:
- 如何達到高平行執行
- 如何不透過 CPU's MMU,在 memory side 完成 virtual-to-physical address translation
1. Introduction
- The Parallelism Challenge
- 因為 dependent sequential access,即使在 memory 也很難做到 parallelism
- 當前一個 pointer chasing 在等待 memory access 的時間時,會先做下一個 address generation,稱為 address-access decoupling,以此來減少等待時間
- The Address Translation Challenge
- 因為 linked list 指向的位置也是 virtual address,若是 in-memory 來做,每次做完都要由 CPU 傳 physical address,那就有很大的 overhead
- 單純跟 CPU 維持同一份 TLB 也是有很大的 overhead,要解決 coherence、duplication 等問題
- 直接將 IMPICA 會使用到的 data structure map 到 virtual memory space 中連續的空間,並設計新的 translation mechanism,region-based page table
2. Motivation
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
3. Design Challenges
3.1. Challenge 1: Parallelism in the Accelerator
- simple accelerator 會比 multi-cores 來得慢,因為他要 serial 執行
- IMPICA 將 address generation 跟 memory access 使用兩個引擎分開,因此可以達到接近同步執行的效果
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
3.2. Challenge 2: Virtual Address Translation
- IMPICA 所使用的 page table 完全分離於 CPUs
- 只會在連續的 virtual address map IMPICA 需要 access 的空間,稱為 IMPICA region
- 若只要 map IMPICA region,那可以修改目前傳統的 page table,改為較少 overhead 的 IMPICA page table
4. IMPICA Architecture
4.1 IMPICA Core Architecture
- 將 core 分為兩部分:
- address engine,產生 pointer 的 address,包含所有 IMPICA's functional units
- access engine,使用 address engine 產生的 address 來 access memory,會透過 IMPICA page table 把 virtual address 轉成 physical
- 步驟:
- host CPU 將 pointer chasing operation 放入 main memory,並 enqueue request
- address engine load pointer chasing code into instruction RAM
- 執行 instruction RAM 的 code,並使用 data RAM 為 stack,所有不需要 memory access 的部份都由 address engine 來做,如 ALU operations、control flow,data RAM 中的 stack 決定了有幾個 pointer chasing operations 能夠同時執行
- 當 address engine 遇到 memory instruction,他 enqueue address 跟 data RAM stack pointer into access queue,access engine 將 virtual address 透過 IMPICA page table 轉成 physical 後送入 memory controller
- 當 address engine 從 memory controller 得到 data 後,將它存到 IMPICA cache
- 跟 return data 對應的 access queue entry 會從 access queue 移到 response queue
- address engine 會監視 response queue,當 response queue entry 好了, address engine 會去讀 return data
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
4.2 IMPICA Page Table
- region-based page table (RPT) 分為三個層級:
- first-level region table,7-bits,在 region 中找到對應的 flat table
- second-level flat table,每個 region 中較大的 page (2MB),20-bits,在 flat page table 中找到對應的 small page table
- third-level small page table
- 轉換成 physical address 步驟:
- [47:41],7-bits 用來 index region table,並以此找到對應的 flat page table
- [40:21],20-bits 用來 index flat page table,並提供 small page table 的位置
- [20:12],9-bits 用來 index small page table
- [11:0],11-bits 用來確定 physical page 中的 offset
- 不同大小的 pages 提供了放置上的彈性,OS 可以選擇要用 2MB 還是 4KB 作為 last level
- 此方法只會有兩個 misses,flat page table 跟 last-level page table
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
5. Interface and Design Considerations
5.1. CPU Interface and Programming Model
- 執行步驟:
- CPU 將包含 function call and parameters 的 packet 送入 memory
- packet 寫入特定的位置(data RAM 中的 predefined location),並以此 trigger execution
- IMPICA 讀取 predefined location,並 load 到 inst RAM
- IMPICA 完成 execution 後,將 Value 寫入 data RAM,CPU 會 periodically polls 這些 location
5.2. Page Table Management
- 使用特殊 API 來 allocate pointer-based data structures
- OS 會為了 IMPICA 保留一塊連續的空間
- OS 會保持 IMPICA RPT 跟 CPU page table 間的一致性,若有更新到 IMPICA region,會 shoot down IMPICA RPT
5.3. Cache Coherence
- 所有牽涉到 IMPICA region 的指令都在 IMPICA 做,因此 IMPICA 一定會看到最新 update 的資料
6. Methodology
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
7. Evaluation
9. Conclusion