# Thread Level Parallelism (TLP)

## 1. Introduction

### 1.1 Definition

- executing multiple threads in parallel
  > thread: a sub-task of an application, consisting of a sequence of instructions
- vs. DLP (data-level parallelism): DLP performs identical operations on many data elements, e.g. image processing

> ⭣ (ChatGPT)
> 1. TLP is the ability to run multiple threads of a program at the same time. Each thread can run on a different processor core, or on the same core at different times.
>    properties:
>    * independence: each thread is typically independent, with its own execution path and state
>    * multi-core processors: TLP mainly exploits multi-core processors to run different threads in parallel
>    * thread parallelism: mainly used to assign the different tasks of a program to different threads so they can proceed simultaneously
> 2. DLP is the ability to apply the same operation to every element of a data set, usually under a single instruction acting on multiple data elements.
>    properties:
>    * data parallelism: the operation is applied to many data elements, which are usually contiguous or structured in memory
>    * SIMD (single instruction, multiple data): DLP is typically realized with SIMD instruction sets, i.e. a single instruction operating on many data points at once
>    * efficiency: for large amounts of data that need the same operation, DLP can greatly improve throughput

### 1.2 Multi-threading

use TLP to improve uni-processor throughput → multiple threads share the functional units → question: how to schedule them?

solutions:

#### a. Fine-grain multi-threading

also called interleaved multi-threading (IMT)

- hardware must be able to switch threads every clock cycle
- advantage
  - hides both short and long stalls, since instructions from other threads are executed when one thread stalls
- disadvantages
  - each individual thread takes longer to finish: a thread ready to execute without stalls is still delayed by instructions from other threads
  - the processor needs many registers, so that every thread has enough register space to hold its state

#### b. Coarse-grain multi-threading

also called block multi-threading (BMT)

- **hardware switches threads only on costly stalls, when pipeline refill time << stall time**
- advantages
  - lower hardware cost (no need for multiple register sets)
  - each thread finishes sooner than under IMT
- disadvantages
  - pipeline start-up cost on every switch
  - priority problem: which thread should run next
 
#### c. Simultaneous multi-threading (SMT)

**fine-grain multi-threading + multiple-issue + dynamically scheduled processor (fine-grain multi-threading + superscalar)**

- **ILP + TLP**

1. virtual registers hold the independent threads
2. register renaming removes interference between different threads in the datapath
3. threads execute out of order → thread renaming table + separate PC + separate reorder buffer for each thread

![Untitled (1)](https://hackmd.io/_uploads/S1vFWc-a0.png)

## 2. TLP on multi-processors: introduction

![Untitled (2)](https://hackmd.io/_uploads/HJQydBrTR.png)

### 2.1 Loosely-coupled multiprocessor

- processors use distributed memory and can work on independent tasks in parallel
- usually a message-passing multiprocessor: processors exchange data via packets, e.g. NoC (network on chip)

![Untitled-Diagram-621](https://hackmd.io/_uploads/SJQ4jBrp0.png)

### 2.2 Tightly-coupled multiprocessor

- **shared memory**: multiple processors or cores share the same memory resource; shared memory lets every processor access the same address space directly, making data sharing and communication more direct and efficient
- two classes
  1. UMA (uniform memory access time) / SMP (symmetric multiprocessor) / centralized shared memory
  2. NUMA (non-uniform memory access time) / DSM (distributed shared memory)

![4FF080CE-E80D-4891-AC5F-06D87849C3BB](https://hackmd.io/_uploads/HksWqUSTR.jpg)

- challenge: the cache coherence problem
  when multiple processors hold cached copies of the same memory location, how do we ensure that every processor sees consistent data?
  If multiple processors read and write the same memory location, and that location is cached in different processors' private caches, the copies can become inconsistent.

## 3. Cache coherence problem

- a coherent memory system must satisfy
  1. program order
     after a processor P writes a memory location X, subsequent reads of X (by the same or another processor) should return the value P wrote
  2. coherent view of memory
     after processor A writes X, a read of X by processor B should return the value A wrote, provided enough time separates the two operations and no other processor writes X in between
  3. write serialization
     two writes to the same location (by any two processors) are observed in the same order by all processors
     (for locations in the same block, the next write does not begin before the previous write completes)

![Untitled (2)](https://hackmd.io/_uploads/B1mW38B6C.png)

> the order of reads does not matter, as long as each read returns the correct value before the next write

- 2 classes of cache coherence protocols (hardware)
  1. Snooping
  2. Directory-based
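The coherence problem above can be made concrete with a minimal sketch (a hypothetical model for illustration, not any real protocol): two private write-back caches each hold a copy of location `X`, and a write by one processor stays invisible to the other until copies are invalidated.

```python
# Minimal sketch of the cache coherence problem: two private caches,
# one shared memory. All names here are illustrative.

memory = {"X": 0}          # shared memory, one location

class Cache:
    """A write-back private cache holding copies of memory locations."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, addr):
        if addr not in self.data:
            self.data[addr] = memory[addr]   # miss: fetch from memory
        return self.data[addr]               # hit: may return a stale copy

    def write(self, addr, value):
        self.data[addr] = value              # write-back: private copy only

p1, p2 = Cache("P1"), Cache("P2")

p1.read("X")          # both caches obtain a copy of X = 0
p2.read("X")
p1.write("X", 1)      # P1 updates only its private copy

stale = p2.read("X")  # P2 still sees the old value: incoherent!
print(stale)          # → 0, even though P1 wrote 1

# An invalidate-based fix: a write also updates memory (write-through,
# for simplicity) and invalidates every other copy, so the next read
# by another processor misses and re-fetches the up-to-date value.
def coherent_write(writer, others, addr, value):
    writer.write(addr, value)
    memory[addr] = value
    for c in others:
        c.data.pop(addr, None)    # invalidate other cached copies

coherent_write(p1, [p2], "X", 2)
print(p2.read("X"))   # → 2: P2 misses and re-fetches the new value
```

This is exactly the situation the snooping and directory-based protocols below solve in hardware: detecting which other caches hold a copy, and invalidating or updating them on a write.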
## 4. Snooping

- every private cache has a cache controller that monitors (snoops) the traffic on the snoop bus
- a distributed control protocol
- every cache with a copy of the data also keeps a copy of the block's sharing status: state, address, data

### 4.1 Write-through snooping

every write to the cache also updates main memory immediately

![5BD364A8-71D5-487C-BD4C-DD0891235D2B](https://hackmd.io/_uploads/rks4CLSaR.jpg)

> step 3: because the cache is write-through, shared memory is updated directly, and the other processors holding that block are invalidated

- simpler, but higher write latency

### 4.2 Write-back snooping

- FSM of cache block states: describes how a cached block moves between the states invalid, valid, and dirty (dirty means this block holds the most up-to-date data)
- broadcast medium transactions (e.g. a bus)
  - connect each private cache to shared memory
  - let every cache controller observe every transaction
  - write: the writer must know whether any other processor's private cache holds the block
    - no: no need to place the write on the bus
    - yes: place an invalidate on the bus
  - write serialization: when a processor gains control of the broadcast medium and updates a block, it invalidates the copies of that block in other processors, ensuring every processor has the latest data before reading or writing the block
  - if a processor holds the modified, most recent copy of the data, it must respond when another processor issues a read miss for that block
- FSM
  - cache block state transitions driven by CPU reads/writes
    ![Untitled (4)](https://hackmd.io/_uploads/rJ-igvSpR.png)
  - cache block state transitions driven by bus transactions
    ![Untitled (5)](https://hackmd.io/_uploads/rkfTlDBpA.png)
  - combination
    ![Untitled (6)](https://hackmd.io/_uploads/S1fbWDSTC.png)
- snooping protocol optimizations
  1. MESI (modified, exclusive, shared, invalid)
  2. MOESI (modified, owned, exclusive, shared, invalid)
- limitation
  - the more processors, the higher the coherence miss rate
- review: cache miss types
  - compulsory miss
  - capacity miss
  - conflict miss
  - multiprocessors add a fourth type: coherence miss
- coherence misses
  - true sharing misses
    - invalidates due to the first write to a shared block
    - reads by another CPU of a block held exclusive in a different cache: after a processor modifies a block in its cache, the block becomes exclusive and the other copies are invalidated, so another processor's next read of that block misses
    - would still occur if the block size were 1 word
  - false sharing misses
    - caused by different words that happen to share the same block
    - would not occur if the block size were 1 word

<img src="https://hackmd.io/_uploads/H1Zi-wBpR.png" width="40%">

![Untitled (9)](https://hackmd.io/_uploads/Sk8eGvBTA.png)

## 5. Directory-based

#### Introduction

- dedicated messages with known source and destination, vs. snoopy (known source, unknown destination)
- explicit response: the system requires the receiver to give an explicit response
- every memory block has an associated directory entry that keeps track of the copies of the block and their states

![Untitled (10)](https://hackmd.io/_uploads/rJglmwBTR.png)

![57F91830-8E02-46C7-9118-6748EC50BEDF](https://hackmd.io/_uploads/BJWr4vSTC.jpg)

#### Directory protocol

- three states
  1. shared: ≥ 1 processors have the data, and it is up to date; dirty bit = 0; the presence bits record which private caches hold the block
  2. uncached: presence bits and dirty bit are all 0, meaning no processor has the block
  3. exclusive: dirty bit = 1, and exactly one presence bit = 1 (the owner)
- FSM

![Untitled (12)](https://hackmd.io/_uploads/SJh1IwH6R.png)

![Untitled (13)](https://hackmd.io/_uploads/SyvzUwBpR.png)

![Untitled (14)](https://hackmd.io/_uploads/SyZSLDraR.png)

- Example 1

![Untitled (15)](https://hackmd.io/_uploads/B1-PUvHTC.png)

- Example 2
  1. first read (P1 reads)
     ![Untitled (16)](https://hackmd.io/_uploads/HkuqUwSaA.png)
     > p1: invalid → shared
     > p1 presence bit on
  2. read share (P2 also reads that block)
     ![Untitled (17)](https://hackmd.io/_uploads/SJORIwHp0.png)
     > p2: invalid → shared
     > p2 presence bit on
  3. P1 writes the block
     ![Untitled (18)](https://hackmd.io/_uploads/r1zgvPHaC.png)
     > p1: shared → exclusive
     > p2: shared → invalid
     > the block's dirty bit turns on
  4. P2 also writes the block (a new owner is born)
     ![Untitled (19)](https://hackmd.io/_uploads/rkpWDvS6R.png)
     > p1: exclusive → invalid
     > p2: invalid → exclusive
     > dirty bit stays on
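The three directory states and the Example 2 sequence can be sketched as a tiny simulation (class and method names are illustrative, not from any real system; data movement and write-backs are omitted, only the directory entry is modeled):

```python
# Sketch of a three-state directory entry (shared / uncached / exclusive)
# with one presence bit per CPU and a dirty bit, replaying Example 2.

class Directory:
    """Directory entry for one memory block."""
    def __init__(self, n_cpus):
        self.dirty = False
        self.presence = [False] * n_cpus

    @property
    def state(self):
        if self.dirty:
            return "exclusive"                    # dirty bit = 1: one owner
        return "shared" if any(self.presence) else "uncached"

    def read(self, cpu):
        if self.dirty:
            # Fetch the latest copy from the owner and write it back;
            # afterwards both caches hold a clean copy.
            self.dirty = False
        self.presence[cpu] = True

    def write(self, cpu):
        # Invalidate every other sharer; the writer becomes the owner.
        self.presence = [i == cpu for i in range(len(self.presence))]
        self.dirty = True

d = Directory(n_cpus=2)

d.read(0)    # 1. first read:  uncached -> shared, P1 presence bit on
assert d.state == "shared" and d.presence == [True, False]

d.read(1)    # 2. read share:  still shared, P2 presence bit on
assert d.state == "shared" and d.presence == [True, True]

d.write(0)   # 3. P1 writes:   shared -> exclusive, P2 invalidated
assert d.state == "exclusive" and d.presence == [True, False]

d.write(1)   # 4. P2 writes:   ownership moves to P2, P1 invalidated
assert d.state == "exclusive" and d.presence == [False, True]
```

Note how step 4 keeps the dirty bit on while the single presence bit moves from P1 to P2, matching the "new owner is born" transition above.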