Chapter_6 Thread Level Parallelism (TLP)

# Thread Level Parallelism (TLP) ## 1. Introduction ### 1.1 Definition - perform executions of multiple threads that are parallel > thread: 一個 application下的 sub-task，包含一連串 instructions - vs. DLP (data level parallelism): DLP perform identical operations on data, ex: image processing > ⭣ (Chatgpt) 1. TLP 是指在 program 中同時執行多個 threads 的能力。每個 thread 可以在不同的處理器核心上運行，或在同一處理器核心的不同時刻運行。 property： * 獨立性：每個線程通常是獨立的，具有自己的執行路徑和狀態。 * 多核處理器： TLP 主要利用多核處理器來並行執行不同的線程。 * thread parallelism: 主要用於將程序中的不同任務或工作分配給不同的線程，以便它們可以同時進行 2. DLP 是指在處理數據時對每個數據元素進行相同的操作的能力。這通常是在單一指令下對多個數據元素進行操作。 property： * data parallelism：操作是對數據集中的多個數據元素進行的，這些數據元素通常在內存中是連續的或結構化的。 * SIMD（單指令多數據）： DLP 通常利用 SIMD 指令集來實現，即單一指令同時作用於多個數據點。 * 高效性：對於可以進行相同操作的大量數據，DLP 可以大幅度提高運算效率。 ### 1.2 Multi-threading use TLP to improve uni-processor throughput → multiple threads share the functional unit → question: 怎麼分配 solution #### a. Fine-grain multi-threading also called interleaved multi-threading (IMT) - hardware must be able to switch threads every clock - advantage hide both short and long stalls, since instructions from other threads executed when one thread stalls - disadvantage: 每個thread完成的時間變久 - a thread ready to execute without stalls will be delayed by instructions from other threads. - 處理器需要擁有大量的 register，以便每個線程都有足夠的register 空間來儲存其狀態。 #### b. Coarse-grain multi-threading also called block multi-threading (BMT) - **hardware switches threads only on costly stalls, when pipeline refill << stall time** - advantage - relieve hardware cost (不用好幾套register) - 每個thread完成的時間不會像IMT那麼久 - disadvantage - 會有pipeline start-up cost - priority 問題 (要讓哪個thread做) #### c. Simultaneous multi-threading (SMT) **fine-grain multi-threading + multiple-issue + dynamically scheduled processor (fine-grain multi-threading + superscalar)** - **ILP + TLP** 1. virtual registers are used to hold the independent threads 2. register renaming 來排除不同 thread 在 datapath 的干擾 3. the thread is executed out of order → thread renaming table + separate PC + separate reorder buffer for each thread ![Untitled (1)](https://hackmd.io/_uploads/S1vFWc-a0.png) ## 2. TLP on multi-processor introduction ![Untitled (2)](https://hackmd.io/_uploads/HJQydBrTR.png) ### 2.1 loosely-coupled multiprocessor - processors that use distributed memory and can work on independent tasks in parallel - 通常是 message-passing multi-processor processor 之間用 package 做資料交換 ex: NoC (network on chip) ![Untitled-Diagram-621](https://hackmd.io/_uploads/SJQ4jBrp0.png) ### 2.2 tightly-coupled multiprocessor - **shared-memory**, 多個處理器或處理核心共享同一記憶體資源共享記憶體允許所有處理器直接訪問同一記憶體空間，這使得數據共享和通訊更為直接和高效 - 分類 1. UMA(uniform memory access time)/ SMP(symmetric multi-processors)/ centralized shared-memory 2. NUMA(non uniform memory access time)/ DSM(distributed shared memory) ![4FF080CE-E80D-4891-AC5F-06D87849C3BB](https://hackmd.io/_uploads/HksWqUSTR.jpg) - challenge: cache coherence problem 當多個處理器擁有對相同記憶體位置的 cache copies 時，如何確保所有處理器看到的數據是一致的。如果多個處理器同時或者分別對同一記憶體位置進行讀寫操作，而這些記憶體位置又被 cache 到了不同處理器的private cache中，就會出現數據不一致的問題。 ## 3. Cache coherence problem - define coherent memory system 1. preserve program order 當一個處理器 P 對某個記憶體位置 X 進行寫操作後，其後的讀操作（由相同或不同處理器執行）應該返回 P 寫入的值 2. coherent view of memory 如果處理器A 對記憶體位置X 的 write 後，另一處理器B對 X 的 read 應返回最後處理器A寫入的值，前提是這兩個操作之間有足夠的時間間隔，且期間沒有其他處理器寫位置 X 3. write serialization 兩次對同一記憶體位置的 write（由任意兩個處理器執行）在所有處理器上應按相同順序觀察到 (對同個 block 裡的位置，前個 write 還沒完成下個 write 不會開始) ![Untitled (2)](https://hackmd.io/_uploads/B1mW38B6C.png) > read 的順序不重要，只要能在下個write前讀到正確的值就好 - 2 classes of cache coherence protocols (hardware) 1. Snooping 2. Directory-based ## 4. Snooping - 每個 private cache 都有 cache controller，用來監控 snoop bus 上的資訊 - distributed control protocol - every cache with a copy of data also has a copy of sharing status of block: state, address, data ### 4.1 Write-through snooping 每次對 cache 的 write 都會同時更新主記憶體 ![5BD364A8-71D5-487C-BD4C-DD0891235D2B](https://hackmd.io/_uploads/rks4CLSaR.jpg) > step3: 因為是write through，會直接修改share memory，其他有該block的processor也會被invalidate - simpler, but higher write latency ### 4.2 Write-back snooping - FSM of Cache Block State: 描述了 cache 裡的 block 在不同狀態之間的轉變，包括 invalid, valid, dirty (dirty代表該block是最新的資料) - Broadcast Medium Transactions (e.g. bus) - 用來連接各個 private cache 到 share memory - let every cache controller observes every transaction - write: 需要知道其他處理器的 private cache 有沒有該block - 沒有: no need to place write on bus - 有: place invalidate on bus - write serialization: 當一個處理器獲得 broadcast medium transactions 控制權並更新某 block 時，它會使其他處理器的相同 block 失效 (invalid)，以確保所有的處理器在讀取或寫入該block前都有最新的資料 - 如果某個處理器擁有已被修改的最新數據，其他處理器發出read miss時，要回應該miss - FSM - CPU read/write的操作下，cache block的狀態 ![Untitled (4)](https://hackmd.io/_uploads/rJ-igvSpR.png) - bus 操作下，cache block的狀態 ![Untitled (5)](https://hackmd.io/_uploads/rkfTlDBpA.png) - combination ![Untitled (6)](https://hackmd.io/_uploads/S1fbWDSTC.png) - snooping protocol optimization 1. MESI(modified, exclusive, shared, invalid) 2. MOESI(modified, owned, exclusive, shared, invalid) - limitation - processor 數越多，coherent miss 越高 - review: Cache miss type - compulsory miss - capacity miss - conflict miss - multi-processor的情況下再加一個miss: coherence miss - coherence miss - True sharing misses - Invalidates due to first write to shared block - Reads by another CPU of exclusive block in different cache 當一個處理器修改了它的快取中的一個數據塊後，這個數據塊的新狀態為 exclusive，並invalidate其他人，所以當另一個處理器試圖read同一個數據塊時就會read miss - would still occur if block size were 1 word - False sharing misses - 因在同一個 block 但不同 word 而互相影響 - would not occur if block size were 1 word <img src="https://hackmd.io/_uploads/H1Zi-wBpR.png" width="40%"> ![Untitled (9)](https://hackmd.io/_uploads/Sk8eGvBTA.png) ## 5. Directory-based #### Introduction - dedicated message known source and destination vs. snoopy (known source and unknown destination) - explicit response 系統會要求接收者給出明確的響應 - Every memory block has associated directory information to keep track of copies blocks and their states ![Untitled (10)](https://hackmd.io/_uploads/rJglmwBTR.png) ![57F91830-8E02-46C7-9118-6748EC50BEDF](https://hackmd.io/_uploads/BJWr4vSTC.jpg) #### Directory protocol - three state 1. shared: ≥1 processors have data, 擁有最新的資料, dirty bit=0, presence bit 記錄哪些 private cache 有該筆資料 2. uncached: presence bit and dirty bit are all 0, means that no processor has it 3. exclusive: dirty bit=1, 只有一個presence bit=1(owner) - FSM ![Untitled (12)](https://hackmd.io/_uploads/SJh1IwH6R.png) ![Untitled (13)](https://hackmd.io/_uploads/SyvzUwBpR.png) ![Untitled (14)](https://hackmd.io/_uploads/SyZSLDraR.png) - Example1 ![Untitled (15)](https://hackmd.io/_uploads/B1-PUvHTC.png) Example2 1. first read (p1要讀) ![Untitled (16)](https://hackmd.io/_uploads/HkuqUwSaA.png) > p1: invalid → shared > p1 presence bit on 2. read share (p2也要讀那個block) ![Untitled (17)](https://hackmd.io/_uploads/SJORIwHp0.png) > p2: invalid → shared > p2 presence bit on 3. p1 write 該 block ![Untitled (18)](https://hackmd.io/_uploads/r1zgvPHaC.png) > p1: shared → exclusive > p2: shared → invalid > 該 block 的 dirty bit on 4. p2也 write 該 block (新owner誕生) ![Untitled (19)](https://hackmd.io/_uploads/rkpWDvS6R.png) > p1: exclusive → invalid > p2: invalid → exclusive > dirty bit still on ## 6. multiprocessor two-level hierarchy系統 (兩級結構) - node: processor + private cache + cache controller - 常見方法 across node 的 communication 通過 directory-based 的方法來實現 node 內部使用snoopy/ directory-based 來實現