# Discussion
![](https://i.imgur.com/HEEJml2.png)
![](https://i.imgur.com/XZ5iPRu.png)

## 01/16
- How do we tell the circuit where the image data is stored?
    - Try mapping registers into DRAM and communicating through them

---

## 12/19
### Channel-wise
- Take 1/4 of the output first (4 of the channels in 16x1024x1024) and carry it through the following layers
    - Drawback: the data has to be fetched repeatedly

![](https://i.imgur.com/JSTa6TZ.jpg)
### Output-division
![](https://i.imgur.com/kN4Vqiu.jpg)

## 12/12
### 8Ch8W problems
#### Hardware
- The hardware that has been officially open-sourced so far is limited:
    - a bitstream file for 8ch8w (prebuilt, single file)
    - a predefined project for 2ch8w (can be synthesized and implemented directly to produce a bitstream)
    - Verilog source code for an unknown number of channels and ways (a project has to be created by hand before synthesis and implementation can start)
- An unofficial 8Ch8W project on GitHub
    - https://github.com/OpenSSD-CN/OpenSSD_8CH8WAY

#### Firmware
- The predefined 2Ch8W build runs without problems

![](https://i.imgur.com/oE3EAFZ.png)
- FTL initialization fails on the prebuilt 8Ch8W
    - Apart from channels 0 and 1, the channels have no free blocks (all blocks are marked bad); the log below reports channels 2, 3, 5, 6, and 7, and channel 4 never appears in it
        - The number of channels used by the FW can be limited to avoid the faulty channels
    - Is it a controller problem? A board problem? A flash module problem?
    - Test by changing the map address?

![](https://i.imgur.com/B0zSswA.png)
```=
[WARNING] There is no free block on Ch 2 Way 0!
[WARNING] There is no free block on Ch 3 Way 0!
[WARNING] There is no free block on Ch 5 Way 0!
[WARNING] There is no free block on Ch 6 Way 0!
[WARNING] There is no free block on Ch 7 Way 0!
[WARNING] There is no free block on Ch 2 Way 1!
[WARNING] There is no free block on Ch 3 Way 1!
[WARNING] There is no free block on Ch 5 Way 1!
[WARNING] There is no free block on Ch 6 Way 1!
[WARNING] There is no free block on Ch 7 Way 1!
[WARNING] There is no free block on Ch 2 Way 2!
[WARNING] There is no free block on Ch 3 Way 2!
[WARNING] There is no free block on Ch 6 Way 2!
[WARNING] There is no free block on Ch 7 Way 2!
[WARNING] There is no free block on Ch 2 Way 3!
[WARNING] There is no free block on Ch 3 Way 3!
[WARNING] There is no free block on Ch 5 Way 3!
[WARNING] There is no free block on Ch 6 Way 3!
[WARNING] There is no free block on Ch 7 Way 3!
[WARNING] There is no free block on Ch 2 Way 4!
[WARNING] There is no free block on Ch 3 Way 4!
[WARNING] There is no free block on Ch 5 Way 4!
[WARNING] There is no free block on Ch 6 Way 4!
[WARNING] There is no free block on Ch 7 Way 4!
[WARNING] There is no free block on Ch 2 Way 5!
[WARNING] There is no free block on Ch 3 Way 5!
[WARNING] There is no free block on Ch 5 Way 5!
[WARNING] There is no free block on Ch 6 Way 5!
[WARNING] There is no free block on Ch 7 Way 5!
[WARNING] There is no free block on Ch 2 Way 6!
[WARNING] There is no free block on Ch 3 Way 6!
[WARNING] There is no free block on Ch 5 Way 6!
[WARNING] There is no free block on Ch 6 Way 6!
[WARNING] There is no free block on Ch 7 Way 6!
[WARNING] There is no free block on Ch 2 Way 7!
[WARNING] There is no free block on Ch 3 Way 7!
[WARNING] There is no free block on Ch 5 Way 7!
[WARNING] There is no free block on Ch 6 Way 7!
[WARNING] There is no free block on Ch 7 Way 7!
```

---

## 12/5
### Correction: multiply-adds available per cycle
The 16 pixels read in each cycle allow one of three possible multiply-add counts:
![](https://i.imgur.com/s7L2PTM.png)
- case 1: top-left corner
- case 2: top and left borders
- case 3: everything else

In a 1024x1024 image, most positions fall under case 3 (enough data to give every 3x3 kernel 16 multiply-adds).

:::danger
This assumes that the data from previous cycles can be retained
=> at most 4x1024 bytes need to be kept
=> that should be enough
:::

### Correction: time required per patch
- patch size: 1024x1024x3 (RGB)
- kernel size: 3x3
- stride: 1

Multiply-adds along x = multiply-adds along y = 1022
=> the three channels of a patch need 3 * 1022^2 multiply-adds in total
=> every channel of a patch is multiply-added with the kernels of all 16 output channels
=> one patch needs 16 * 3 * 1022^2 = **50135232** multiply-adds

One cycle can do ==**16**== x 16 = **256** multiply-adds
=> one patch needs 50135232 / 256 = 195840.75 cycles
=> AXI frequency: 100 MHz
=> 100,000,000 cycles per second
=> one patch needs 195840.75 / 100,000,000 = **0.0019584075** seconds

:::success
Time per patch x total number of patches
=> 0.0019584075 x 30171
=> 59.0871126825 seconds
:::

:::warning
Original image = 87424x205952x3 pixels => 50.3058 GiB => 31.4411 seconds
30171 patches in total => 88.3916 GiB => 55.24475 seconds
:::

:::warning
Unet needs to know whether the next cycle falls on a corner, an edge, or the interior of the image
=> this affects how many multiply-adds can be done
:::

### Using LUTs and registers to avoid re-reading and recomputing the overlap between patches
![](https://i.imgur.com/oleA6qE.png)
![](https://i.imgur.com/bF4LLhu.jpg)
![](https://i.imgur.com/8cdiGU9.png)
> ![](https://i.imgur.com/sMLSdKO.png)
> https://www.mouser.com/datasheet/2/903/ds190-Zynq-7000-Overview-1595492.pdf

Overlap between adjacent patches: 256 x 1024 pixels = 262144 bytes
=> keeping only the raw data would still require recomputation -> wasted time
=> store it in two parts, "multiply-add results" and "raw data":
![](https://i.imgur.com/fxycpRR.png)
- Multiply-add results:
  259588 multiply-adds in total (= 254 x 1022) = 259588 results to store
  => multiply-add results: 259588 bytes
- Raw data:
  the stride is 1, so a 2x1024 strip must be kept to multiply-add with the non-overlapping part
  => raw data: 2048 bytes

---

## 1212
### LUT test
- 64K (data: 8-bit, depth: 65536) uses 11034 LUTs (8C8W has 117600 available, roughly ten times as many)
- 512K (data: 64-bit, depth: 65536): 94683 LUTs
    - 8 x 64K: 88272 LUTs
- 1024K (data: 128-bit, depth: 65536): 189275 LUTs
    - 16 x 64K: 176544 LUTs
- The patch overlap needs 261636 / 65536 = 3.99224853515625 blocks of 64K
- Can the patch conv output be kept in LUTs?
    - 16M (16x1024x1024) -> 256 blocks of 64K -> not enough
- Try widening 65536x8 -> 65536x64 (or more) and check the LUT usage

### 8bit * 8bit = 16bit (multiply), 16bit x 9 = 20bit (multiply-add)
- How many resources does the multiply-add circuit use?
    - FF: 0
    - LUT: 0
    - DSP: 256
- 20bit to 8bit????

### Source Code -> project

---

## Where to store the output of the first conv
- ping-pong buffer
- Registers cannot hold 32MB (16 x 1024 x 1024 x 8 bits = 16 MiB per buffer; the 32MB figure presumably counts both ping-pong buffers)
    - Each layer's feature map has to be stored in DRAM
- How do we communicate with the FW?
    - How does the status checker read the controller state?
        - Compile directly and check the address
            - `V2FMCRegisters` ?
    - Send a control signal to the flash controller?
        - Start storing the register output to DRAM
        - Once it is stored, UNet can continue
- register 1024K(8\*8\*16)

## The second half of Unet needs information from the first half
- Issue requests to DRAM and flash at the same time?
- DRAM to Unet_inference time
    - If the data has to be fetched from DRAM, how much time does going through the FW add?
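To put rough numbers on what the second half needs from the first half, the sketch below adds up the encoder outputs that would have to stay in DRAM until the decoder uses them. It assumes standard U-Net skip connections (conv1.5, conv2.5, conv3.5, and conv4.5 feeding the decoder), which the doubled input channels of the `upconv*.0` layers in the layer table of these notes suggest, and 8-bit activations. `SKIP_SHAPES`, `feature_map_bytes`, and `skip_buffer_report` are illustrative names, not part of the project.

```python
# Encoder outputs assumed to be reused by the decoder, as (channels, height, width).
# Shapes are copied from the layer table in these notes; which layers act as
# skip connections is an assumption.
SKIP_SHAPES = {
    "conv1.5": (16, 1024, 1024),
    "conv2.5": (32, 512, 512),
    "conv3.5": (64, 256, 256),
    "conv4.5": (128, 128, 128),
}
BYTES_PER_VALUE = 1  # assumption: 8-bit quantized activations


def feature_map_bytes(shape, bytes_per_value=BYTES_PER_VALUE):
    """Size in bytes of one feature map with the given (C, H, W) shape."""
    channels, height, width = shape
    return channels * height * width * bytes_per_value


def skip_buffer_report(shapes=SKIP_SHAPES):
    """Return per-layer sizes and the total DRAM needed to hold all of them."""
    sizes = {name: feature_map_bytes(shape) for name, shape in shapes.items()}
    return sizes, sum(sizes.values())


if __name__ == "__main__":
    sizes, total = skip_buffer_report()
    for name, size in sizes.items():
        print(f"{name}: {size / 2**20:.0f} MiB")
    print(f"total: {total / 2**20:.0f} MiB")
```

Under these assumptions the four maps add up to 30 MiB, of which the 16x1024x1024 map alone accounts for 16 MiB.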
## Estimated time for each patch to finish the first conv layer
```shell=
    module name    input shape   output shape     params  memory(MB)               MAdd             Flops   MemRead(B)  MemWrite(B)  duration[%]     MemR+W(B)
 0  conv1.0        3 1024 1024   16 1024 1024      448.0       64.00      905,969,664.0     469,762,048.0   12584704.0   67108864.0        0.88%  7.969357e+07
 1  conv1.1       16 1024 1024   16 1024 1024       32.0       64.00       67,108,864.0      33,554,432.0   67108992.0   67108864.0        0.12%  1.342179e+08
 2  conv1.2       16 1024 1024   16 1024 1024        0.0       64.00       16,777,216.0      16,777,216.0   67108864.0   67108864.0        0.20%  1.342177e+08
 3  conv1.3       16 1024 1024   16 1024 1024     2320.0       64.00    4,831,838,208.0   2,432,696,320.0   67118144.0   67108864.0        4.01%  1.342270e+08
 4  conv1.4       16 1024 1024   16 1024 1024       32.0       64.00       67,108,864.0      33,554,432.0   67108992.0   67108864.0        0.08%  1.342179e+08
 5  conv1.5       16 1024 1024   16 1024 1024        0.0       64.00       16,777,216.0      16,777,216.0   67108864.0   67108864.0        0.13%  1.342177e+08
 6  maxpool1      16 1024 1024   16  512  512        0.0       16.00       33,554,432.0      16,777,216.0   67108864.0   16777216.0        0.28%  8.388608e+07
 7  conv2.0       16  512  512   32  512  512     4640.0       32.00    2,415,919,104.0   1,216,348,160.0   16795776.0   33554432.0        0.77%  5.035021e+07
 8  conv2.1       32  512  512   32  512  512       64.0       32.00       33,554,432.0      16,777,216.0   33554688.0   33554432.0        0.05%  6.710912e+07
 9  conv2.2       32  512  512   32  512  512        0.0       32.00        8,388,608.0       8,388,608.0   33554432.0   33554432.0        0.07%  6.710886e+07
10  conv2.3       32  512  512   32  512  512     9248.0       32.00    4,831,838,208.0   2,424,307,712.0   33591424.0   33554432.0        1.96%  6.714586e+07
11  conv2.4       32  512  512   32  512  512       64.0       32.00       33,554,432.0      16,777,216.0   33554688.0   33554432.0        0.07%  6.710912e+07
12  conv2.5       32  512  512   32  512  512        0.0       32.00        8,388,608.0       8,388,608.0   33554432.0   33554432.0        0.08%  6.710886e+07
13  maxpool2      32  512  512   32  256  256        0.0        8.00       16,777,216.0       8,388,608.0   33554432.0    8388608.0        0.20%  4.194304e+07
14  conv3.0       32  256  256   64  256  256    18496.0       16.00    2,415,919,104.0   1,212,153,856.0    8462592.0   16777216.0        0.45%  2.523981e+07
15  conv3.1       64  256  256   64  256  256      128.0       16.00       16,777,216.0       8,388,608.0   16777728.0   16777216.0        0.02%  3.355494e+07
16  conv3.2       64  256  256   64  256  256        0.0       16.00        4,194,304.0       4,194,304.0   16777216.0   16777216.0        0.02%  3.355443e+07
17  conv3.3       64  256  256   64  256  256    36928.0       16.00    4,831,838,208.0   2,420,113,408.0   16924928.0   16777216.0        0.81%  3.370214e+07
18  conv3.4       64  256  256   64  256  256      128.0       16.00       16,777,216.0       8,388,608.0   16777728.0   16777216.0        0.02%  3.355494e+07
19  conv3.5       64  256  256   64  256  256        0.0       16.00        4,194,304.0       4,194,304.0   16777216.0   16777216.0        0.02%  3.355443e+07
20  maxpool3      64  256  256   64  128  128        0.0        4.00        8,388,608.0       4,194,304.0   16777216.0    4194304.0        0.07%  2.097152e+07
21  conv4.0       64  128  128  128  128  128    73856.0        8.00    2,415,919,104.0   1,210,056,704.0    4489728.0    8388608.0        0.25%  1.287834e+07
22  conv4.1      128  128  128  128  128  128      256.0        8.00        8,388,608.0       4,194,304.0    8389632.0    8388608.0        0.01%  1.677824e+07
23  conv4.2      128  128  128  128  128  128        0.0        8.00        2,097,152.0       2,097,152.0    8388608.0    8388608.0        0.01%  1.677722e+07
24  conv4.3      128  128  128  128  128  128   147584.0        8.00    4,831,838,208.0   2,418,016,256.0    8978944.0    8388608.0        0.47%  1.736755e+07
25  conv4.4      128  128  128  128  128  128      256.0        8.00        8,388,608.0       4,194,304.0    8389632.0    8388608.0        0.01%  1.677824e+07
26  conv4.5      128  128  128  128  128  128        0.0        8.00        2,097,152.0       2,097,152.0    8388608.0    8388608.0        0.00%  1.677722e+07
27  maxpool4     128  128  128  128   64   64        0.0        2.00        4,194,304.0       2,097,152.0    8388608.0    2097152.0        0.03%  1.048576e+07
28  upconv4.0    128   64   64  256   64   64   295168.0        4.00    2,415,919,104.0   1,209,008,128.0    3277824.0    4194304.0        0.15%  7.472128e+06
29  upconv4.1    256   64   64  256   64   64      512.0        4.00        4,194,304.0       2,097,152.0    4196352.0    4194304.0        0.01%  8.390656e+06
30  upconv4.2    256   64   64  256   64   64        0.0        4.00        1,048,576.0       1,048,576.0    4194304.0    4194304.0        0.00%  8.388608e+06
31  upconv4.3    256   64   64  256   64   64   590080.0        4.00    4,831,838,208.0   2,416,967,680.0    6554624.0    4194304.0        0.27%  1.074893e+07
32  upconv4.4    256   64   64  256   64   64      512.0        4.00        4,194,304.0       2,097,152.0    4196352.0    4194304.0        0.01%  8.390656e+06
33  upconv4.5    256   64   64  256   64   64        0.0        4.00        1,048,576.0       1,048,576.0    4194304.0    4194304.0        0.01%  8.388608e+06
34  ConvT4       256   64   64  128  128  128   295040.0        8.00    2,415,919,104.0               0.0          0.0          0.0        0.23%  0.000000e+00
35  upconv3.0    256  128  128  128  128  128   295040.0        8.00    9,663,676,416.0   4,833,935,360.0   17957376.0    8388608.0        1.42%  2.634598e+07
36  upconv3.1    128  128  128  128  128  128      256.0        8.00        8,388,608.0       4,194,304.0    8389632.0    8388608.0        0.01%  1.677824e+07
37  upconv3.2    128  128  128  128  128  128        0.0        8.00        2,097,152.0       2,097,152.0    8388608.0    8388608.0        0.01%  1.677722e+07
38  upconv3.3    128  128  128  128  128  128   147584.0        8.00    4,831,838,208.0   2,418,016,256.0    8978944.0    8388608.0        0.42%  1.736755e+07
39  upconv3.4    128  128  128  128  128  128      256.0        8.00        8,388,608.0       4,194,304.0    8389632.0    8388608.0        0.01%  1.677824e+07
40  upconv3.5    128  128  128  128  128  128        0.0        8.00        2,097,152.0       2,097,152.0    8388608.0    8388608.0        0.01%  1.677722e+07
41  ConvT3       128  128  128   64  256  256    73792.0       16.00    2,415,919,104.0               0.0          0.0          0.0        0.26%  0.000000e+00
42  upconv2.0    128  256  256   64  256  256    73792.0       16.00    9,663,676,416.0   4,836,032,512.0   33849600.0   16777216.0        1.85%  5.062682e+07
43  upconv2.1     64  256  256   64  256  256      128.0       16.00       16,777,216.0       8,388,608.0   16777728.0   16777216.0        0.02%  3.355494e+07
44  upconv2.2     64  256  256   64  256  256        0.0       16.00        4,194,304.0       4,194,304.0   16777216.0   16777216.0        0.02%  3.355443e+07
45  upconv2.3     64  256  256   64  256  256    36928.0       16.00    4,831,838,208.0   2,420,113,408.0   16924928.0   16777216.0        1.03%  3.370214e+07
46  upconv2.4     64  256  256   64  256  256      128.0       16.00       16,777,216.0       8,388,608.0   16777728.0   16777216.0        0.03%  3.355494e+07
47  upconv2.5     64  256  256   64  256  256        0.0       16.00        4,194,304.0       4,194,304.0   16777216.0   16777216.0        0.04%  3.355443e+07
48  ConvT2        64  256  256   32  512  512    18464.0       32.00    2,415,919,104.0               0.0          0.0          0.0        0.67%  0.000000e+00
49  upconv1.0     64  512  512   32  512  512    18464.0       32.00    9,663,676,416.0   4,840,226,816.0   67182720.0   33554432.0        4.98%  1.007372e+08
50  upconv1.1     32  512  512   32  512  512       64.0       32.00       33,554,432.0      16,777,216.0   33554688.0   33554432.0        0.05%  6.710912e+07
51  upconv1.2     32  512  512   32  512  512        0.0       32.00        8,388,608.0       8,388,608.0   33554432.0   33554432.0        0.05%  6.710886e+07
52  upconv1.3     32  512  512   32  512  512     9248.0       32.00    4,831,838,208.0   2,424,307,712.0   33591424.0   33554432.0        6.66%  6.714586e+07
53  upconv1.4     32  512  512   32  512  512       64.0       32.00       33,554,432.0      16,777,216.0   33554688.0   33554432.0        0.04%  6.710912e+07
54  upconv1.5     32  512  512   32  512  512        0.0       32.00        8,388,608.0       8,388,608.0   33554432.0   33554432.0        0.04%  6.710886e+07
55  ConvT1        32  512  512   16 1024 1024     4624.0       64.00    2,415,919,104.0               0.0          0.0          0.0        1.30%  0.000000e+00
56  upconv0.0     32 1024 1024   16 1024 1024     4624.0       64.00    9,663,676,416.0   4,848,615,424.0  134236224.0   67108864.0       37.24%  2.013451e+08
57  upconv0.1     16 1024 1024   16 1024 1024       32.0       64.00       67,108,864.0      33,554,432.0   67108992.0   67108864.0        0.10%  1.342179e+08
58  upconv0.2     16 1024 1024   16 1024 1024        0.0       64.00       16,777,216.0      16,777,216.0   67108864.0   67108864.0        0.15%  1.342177e+08
59  upconv0.3     16 1024 1024   16 1024 1024     2320.0       64.00    4,831,838,208.0   2,432,696,320.0   67118144.0   67108864.0        7.02%  1.342270e+08
60  upconv0.4     16 1024 1024   16 1024 1024       32.0       64.00       67,108,864.0      33,554,432.0   67108992.0   67108864.0        0.09%  1.342179e+08
61  upconv0.5     16 1024 1024   16 1024 1024        0.0       64.00       16,777,216.0      16,777,216.0   67108864.0   67108864.0        0.21%  1.342177e+08
62  output_1.0    16 1024 1024    1 1024 1024      145.0        4.00      301,989,888.0     152,043,520.0   67109444.0    4194304.0        4.42%  7.130375e+07
63  output_2.0    16 1024 1024    1 1024 1024      145.0        4.00      301,989,888.0     152,043,520.0   67109444.0    4194304.0       20.11%  7.130375e+07
    total                                      2161922.0     1622.00  103,681,097,728.0  47,202,697,216.0   67109444.0    4194304.0      100.00%  3.417049e+09
==============================================================================================================================================================
Total params: 2,161,922
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Total memory: 1622.00MB
Total MAdd: 103.68GMAdd
Total Flops: 47.2GFlops
Total MemR+W: 3.18GB
```

## TODO
- quantized weight => which student is responsible? 柏碩
- firmware hardware communication => mmap register
- Monitor: how should it be done?
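
One possible shape for the firmware/hardware communication item in the TODO is a small block of memory-mapped registers. The sketch below only simulates the idea in Python, with an anonymous `mmap` standing in for the real register block. The register layout (address, length, status), the status codes, the example DRAM address, and all function names are hypothetical and are not taken from the OpenSSD firmware.

```python
import mmap
import struct

# Hypothetical register layout, little-endian 32-bit words (an assumption,
# not the layout of any real controller):
#   0x00  DRAM address of the image data
#   0x04  length of the image data in bytes
#   0x08  status: 0 = idle, 1 = data ready, 2 = done
ADDR_OFF, LEN_OFF, STATUS_OFF = 0x00, 0x04, 0x08
STATUS_IDLE, STATUS_READY, STATUS_DONE = 0, 1, 2


def write_reg(regs, offset, value):
    """Write one 32-bit register at the given byte offset."""
    struct.pack_into("<I", regs, offset, value)


def read_reg(regs, offset):
    """Read one 32-bit register at the given byte offset."""
    return struct.unpack_from("<I", regs, offset)[0]


def fw_post_image(regs, dram_addr, length):
    """Firmware side: publish where the image data lives, then raise the flag."""
    write_reg(regs, ADDR_OFF, dram_addr)
    write_reg(regs, LEN_OFF, length)
    write_reg(regs, STATUS_OFF, STATUS_READY)  # status is written last on purpose


def hw_take_image(regs):
    """Circuit side: return (address, length) if data is ready, else None."""
    if read_reg(regs, STATUS_OFF) != STATUS_READY:
        return None
    job = (read_reg(regs, ADDR_OFF), read_reg(regs, LEN_OFF))
    write_reg(regs, STATUS_OFF, STATUS_DONE)  # simplified: marked done immediately
    return job


if __name__ == "__main__":
    regs = mmap.mmap(-1, mmap.PAGESIZE)  # anonymous mapping, simulation only
    print(hw_take_image(regs))           # nothing has been posted yet
    fw_post_image(regs, 0x10000000, 3 * 1024 * 1024)  # one 1024x1024x3 patch
    print(hw_take_image(regs))
    regs.close()
```

Writing the status word last means the circuit never sees a ready flag before the address and length are valid; on real hardware this ordering would also need whatever memory barriers the platform requires.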