Problem C 文獻筆記

# Problem formulation ![image](https://hackmd.io/_uploads/rkXZ0Elzll.png) Ref: DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement  # 開發計畫 ## 可以發揮的地方 1. 改 Hyper param 動態調整的策略 2. 把 buffering 整合進 Global placement 的方式 3. 也許可以把 Placement site 的 legalization 也整合進GP?看能不能減少 legalization 造成的損失 4. 寫個很屌的 legalizer 5. 梯度 scaling 的策略 6. 改良 WL 、 Density 、 gate sizing objective ## 待解決的問題 1. Buffering 如何處理? (也許可參考陳煥沅的timing近似算法，在所有flow最一開始時算出criticality來決定要在哪裡放buf，但這樣還要決定要用兩個inv還是buffer cell) 2. Displacement penalty 和 PPA 如何兼顧? 加penalty term 的話權重怎麼決定? 3. 怎麼處理 macro? ## Code tracing (DreamPlace) 1. `Placer.py`: 程式執行的入口點，會先讀取檔案以及建立 OpenTimer: ``` placedb = PlaceDB.PlaceDB() placedb(params) timer(params, placedb) timer.update_timing() ``` 接下來會執行 `NonLinearPlace` 這個程式: ``` placer = NonLinearPlace.NonLinearPlace(params, placedb, timer) placer(params) #執行 NonLinearPlace.py::NonLinearPlace.__call__ ``` 2. `NonLinearPlace.py`：主要 placement 的演算法，包涵 GP & legalization。最重要的函式為 `one_descent_step`，分為以下四個步驟： - Forward 計算，包涵 HPWL, electric potential：`EvalMetrics::evaluate` - Backward 計算，計算加權後的狀態、計算合併後的 gradients: `PlaceObj::obj_and_grad_fn` - 更新參數：`optimizer.step()` - timing-driven net weighting：`timing_op.timer.update_timing()`。首先執行 `ops/timing/src/timing_cpp.cpp::timingCppLauncher` 更新 steiner tree 以及更新 OpenTimer 狀態。接著再做 net weighting: `updateNetWeightCppLauncher`。 3. `PlaceObj.py`：用於計算梯度的class。會使用到 `ops/` 下面的多個計算，例如 `logsumexp_wirelength`。裡面的每個 op 都是使用 C++/CUDA 撰寫的，可以先查看目錄下的 `.py` 檔案如何 import 的。例如： `logsumexp_wirelength.py` 使用了 `dreamplace.ops.logsumexp_wirelength.logsumexp_wirelength_cpp_merged`，可以至對應到的 `src/logsumexp_wirelength_merged.cpp` 查看 `PYBIND11_MODULE`的程式碼及對應的 C++ 函式。DreamPlace 使用 [PYBIND](https://pybind11.readthedocs.io/en/stable/reference.html) 將 C++ 函式跟 Python 綁定。Pytorch 在計算 gradients 的時候會執行每個 object 的 `backward` 函式，所以可以在多個 ops 裡面看到`backward`的定義，這邊就是 `PlaceObj::obj_and_grad_fn` 的 `obj.backward()` 會觸發到的。 ## 開發環境須先安裝 NVIDIA container toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html ### 最一開始的安裝方式如下： :::spoiler #### 執行 Docker container ```bash docker pull nvcr.io/nvidia/cuda:11.8.0-runtime-ubuntu20.04 docker run --gpus 1 -it -v $(pwd):/DREAMPlace nvcr.io/nvidia/cuda:11.8.0-runtime-ubuntu20.04 ``` #### 安裝 Docker container 環境 ```bash apt update && DEBIAN_FRONTEND=noninteractive apt install -y wget git cmake vim bison libboost-all-dev python3 python3-pip flex tcl libcairo2-dev liblapack-dev libblas-dev # Install NVIDIA toolkit umask 000 && cd /DREAMPlace wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run bash cuda_11.8.0_520.61.05_linux.run --silent --toolkit # check nvidia-smi nvidia-smi # NVML error: https://forums.developer.nvidia.com/t/nvida-container-toolkit-failed-to-initialize-nvml-unknown-error/286219 # check CUDA toolkit nvcc --version ``` #### 編譯 DREAMPlace 套件 ```bash umask 000 # 避免之後在 docker 建立的文件無法在 docker 外修改（檔案擁有者是 root） cd /DREAMPlace git clone --recursive git@github.com:I-am-zeroZ/2025-ICCAD-Problem-C.git DREAMPlace cd DREAMPlace pip3 install 'networkx<3' pip3 install torch==2.0.0+cu118 --index-url https://download.pytorch.org/whl/cu118 pip3 install 'numpy<2' pip3 install -r requirements.txt # Fix compile error: https://github.com/doctest/doctest/issues/473 sed -i.bak "s/SIGSTKSZ/16384/g" thirdparty/OpenTimer/unittest/doctest.h mkdir build cd build && rm -rf * cmake .. -DCMAKE_INSTALL_PREFIX=../install -DPython_EXECUTABLE=$(which python3) make -j$(nproc) make install chmod a+w -R ../install ``` ::: ### 經過改良的環境 #### 開發、測試用途關於測資要怎麼放請參考 https://github.com/samuel21119/Xplace_ICCAD2025PC?tab=readme-ov-file#21-data-preparation 的 readme ``` docker run --gpus 1 -it -v $(pwd):/Xplace samuel21119/iccad2025:cuda umask 000 cd /Xplace # 進行開發 git branch gpu_timer_opt pip3 install torchvision==0.15.1+cu118 --index-url https://download.pytorch.org/whl/cu118 pip3 install -r requirements.txt # 編譯 mkdir build cd build && rm -rf * cmake .. -DCMAKE_INSTALL_PREFIX=.. -DPython_EXECUTABLE=$(which python3) make -j$(nproc) make install # 執行 cd .. python3 main.py --dataset iccad2025 --design_name aes_cipher_top --timing_opt True python3 main.py --dataset iccad2015 --design_name superblue1 --timing_opt True ``` # Wirelength ## Weighted average model(WL objective 都用這一種) ![image](https://hackmd.io/_uploads/ry62n8lMlg.png) 提出 WA 的是這篇，從裡面的結果看起來 WA 比 LSE 準很多，應該可以直接沿用: [TSV-Aware Analytical Placement for 3-D IC Designs Based on a Novel Weighted-Average Wirelength Model](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6480847) ![image](https://hackmd.io/_uploads/rkD-vvlfll.png)  # Diffrentiable Timing Objective >ref: Differentiable-Timing-Driven Global Placement ## Timing model from TAU Ref :[TAU 2015 Contest on Incremental Timing Analysis ](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7372664) ### Net delay evaluation(Elmore delay model) 應該大家都知道 Elmore delay，先空著不寫  ### Net output slew evaluation (主要使用 TAU 的 notation) > [!Note]Slew 的用途是用來查表求 cell delay > Cell delay 與 input 訊號的 slew 和 output load 有關:\ > load 越小，iput slew 越短 -> cell delay 越小，cell output slew 越小。\ > 用 slew 和 output load 去查 LUT ，做內插或外插後可得 cell delay。給定一個輸入腳位(Port)為 Z，輸出腳位(Taps)為 $T_{1}$,$T_{2}$...。考慮其中一個 port $T。$已知 net input slew $s_{i}$，求 net output slew $s_{oT}$。$s_{oT}$ 會成為此 net 輸出端所 drive 的 cell/net 的 input slew。 1. Evaluate $\beta_{T}$\ 下一步 Impusle 計算要用到的某個項，意義感覺不重要。\ ![image](https://hackmd.io/_uploads/S1MQUTWmex.png)\ $N$ : 所有 node 的集合\ $R_{k\to T}$ : 如果 node k 在 $Z\to T$ 路徑上，則為 $k\to T$ 路徑上的電阻總和，否則為0\ $C_{k}$ : node k 的 downstream capacitance 總和\ $d_{k}$ : node k 的 Elmore delay >[!Note] TAU 和 Diffrentiable timing driven GP 3.4.2 的 notation 差異 > TAU -> Diffrentiable timing driven GP:\ >$C_{k}d_{k}$ -> $LDelay(k)$\ >$\beta_{T}$ -> $Beta(T)$  2. Evaluate Slew of **impulse** response $\hat{s_{oT}}$\ ![image](https://hackmd.io/_uploads/H1HNVpWXxe.png)\ $d_{T}$ : tap node T 的 Elmore delay 3. Evaluate **net output slew $s_{oT}$**\ ![image](https://hackmd.io/_uploads/By34H6-7xl.png)\ $s_{i}$ : tap node T 的 Elmore delay ### cell delay & cell output slew evaluation (LUT 內插) 圖中 $x$ 和 $y$ 是 LUT 的兩個欄位，應該就是對應 cell input slew 和 output load。\ 假設用 $(x,y)$ 去查表，若這種情況落在 $x_{2}<x<x_{3}$ ， $y_{2}<y<y_{1}$。其中 $x_{2}, x_{3}, y_{1}, y_{2}$ 是 LUT 中有紀錄的欄位，LUT 中對應的四個值分別是 $v_{21}, v_{22}, v_{31}, v_{32}$。可以先對 y 方向內插得到 $v_{2y}$、$v_{3y}$，再對 x 方向用 $v_{2y}$、$v_{3y}$ 內插得到 $v_{xy}$。$v_{xy}$ 可以被偏微取 LUT 梯度。 >[!Warning] 也許需要考慮外插的情況? >> ![image](https://hackmd.io/_uploads/HJi-FuM7xx.png) ## Forward & backward Propagation ![image](https://hackmd.io/_uploads/rk2QTLgMle.png) ### Forward steps 1. 用 FLUTE 生成各 net 的 RSMT 2. 用 RSMT 求出 wire rc 3. wire rc 代入 elmore model 算出 net delay/impulse(算slew要用)、net load 4. 用 3 算出的 net load 和 driving cell 輸出的 slew 查表，得到輸出端所接下一級 cell 的 delay 和 slew。 5. 對電路的每一級重複1234，得到各級的arrival time(AT) 和 slew ，直到算出 output port 的 AT 6. 用 5 算出的 arrival time 和 required arrival time(RAT) 相減算 slack 7. 用 6 算出的 slack 算 TNS、WNS ### Backward steps 先寫 pseudo code 記著 #### TNS/WNS gradient w.r.t. $AT(v)$ and $Slew(v)$ This is the initial step of backpropagation. First, we approximate the timing objectives by LSE: $WNS_{\gamma}=\max\limits_{v}(AT(v)-RAT(v))\ \cong\gamma\log(\sum_{v}\exp(\frac{AT(v)-RAT(v)}{\gamma}))$\ $TNS_{\gamma}=\sum_{v}\max(0,AT(v)-RAT(v))\cong\gamma\sum_{v}\log(1+\exp(\frac{AT(v)-RAT(v)}{\gamma}))$ $\frac{\partial WNS_{\gamma}}{\partial AT(v))}=\frac{\exp(\frac{AT(v)-RAT(v)}{\gamma}}{\sum_{v}\exp(\frac{AT(v)-RAT(v)}{\gamma})}$\ $\frac{\partial TNS_{\gamma}}{\partial AT(v))}=\frac{1}{1+\exp(-\frac{AT(v)-RAT(v)}{\gamma})}$\ $\frac{\partial WNS_{\gamma}}{\partial Slew(v)}=0$\ $\frac{\partial TNS_{\gamma}}{\partial Slew(v))}=0$ #### Net propagation For a net with a fanin port $u$ and fanout ports $v_{1}, v_{2}...$ , we can obtain the gradients on $u$ from the fanout gradients $\nabla AT(v), \nabla Slew(v)$: 1. Propagation to the local elmore delay model\ $\nabla Delay(v)=\nabla AT(v)$\ $\nabla Impulse^{2}(v)=\frac{1}{2Slew(v)} \nabla Slew(v)$ 2. Propagation to the fanin port u:\ Summing over all fanout ports in (10a) and (10c) to get full propagation:\ $\nabla AT(u) = \sum_{v}\nabla AT(v)$\ $\nabla Slew(u) = \sum_{v}\frac{Slew(u)}{Slew(v)} \nabla Slew(v)$\ $\nabla AT(u)$ and $\nabla Slew(u)$ will then be passed into cell propagation of the driver cell of u. #### Cell propagation For a cell with input pins $u$ and a output pin $v$ **(multiple outputs?)**, compute the gradient on u $\nabla AT(v)$ and $\nabla Slew(v)$: Propageted gradients:\ $\nabla Load(v)$: to the driven net's elmore delay model\ $\nabla Slew(u)$, $\nabla AT(u)$: to the fanin nets u's net propagation #### Elmore delay propagation Initial values: \ $\nabla Load(u)$ from the driving cell propagation\ $\nabla Delay(v)$, $\nabla Impulse^2(v)$ from the sink ports(net propagation) # Gate sizing >ref: Fusion of Global Placement and Gate Sizing with Differentiable Optimization # Buffering 還沒看 # References 1. Differentiable-Timing-Driven Global Placement, DAC 2022 2. Fusion of Global Placement and Gate Sizing with Differentiable Optimization, ICCAD 2024 3. DREAMPlace 4.0: Timing-Driven Placement With Momentum-Based Net Weighting and Lagrangian-Based Refinement, DATE 2022 (open sourced) 4. INSTA: An Ultra-Fast, Differentiable, Statistical Static Timing Analysis Engine for Industrial Physical Design Applications, DAC 2025 5. TAU 2015 contest on incremental timing analysis, IEEE 2015 6. DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement 7. GPU-Accelerated Static Timing Analysis, ICCAD 2020 ## 期末考後閱讀進度 1、2、5、6 ## TNS/WNS-oriented 這邊的論文均是針對單一指標進行最佳化，例如 TNS。 1. Timing-Driven Global Placement by Efficient Critical Path Extraction, DATE 2025 (open sourced) 2. DREAMPlace 4.0: Timing-Driven Placement With Momentum-Based Net Weighting and Lagrangian-Based Refinement, DATE 2022 (open sourced) 3. AutoDMP: Automated DREAMPlace-based Macro Placement, ISPD 2023, (considering macros, open sourced) ## Multi-metrics 使用綜合指標，例如 TNS, power，較符合題目的方向。 1. Differentiable-Timing-Driven Global Placement, DAC 2022 2. Fusion of Global Placement and Gate Sizing with Differentiable Optimization, ICCAD 2024 3. INSTA: An Ultra-Fast, Differentiable, Statistical Static Timing Analysis Engine for Industrial Physical Design Applications, DAC 2025