# Discussion


## 01/16
- How do we tell the circuit where the image data is stored?
- Try mapping the registers into DRAM so the two sides can communicate
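
One way to picture the register-to-DRAM idea is a small descriptor at a fixed offset in a shared region: the firmware writes it, the circuit reads it. The sketch below models this in software only. The field layout (`addr`, `length`, `status`), the offset, the example address, and the function names are all hypothetical, and an anonymous `mmap` stands in for the real DRAM window.

```python
import mmap
import struct

# Hypothetical descriptor layout (not taken from the OpenSSD sources):
# uint64 image address, uint32 length in bytes, uint32 status flag.
DESC_FORMAT = "<QII"
DESC_SIZE = struct.calcsize(DESC_FORMAT)   # 16 bytes
DESC_OFFSET = 0x0                          # assumed offset inside the shared region
STATUS_READY = 1

def write_descriptor(region, addr, length, status=STATUS_READY):
    """Firmware side: publish where the image data lives."""
    struct.pack_into(DESC_FORMAT, region, DESC_OFFSET, addr, length, status)

def read_descriptor(region):
    """Circuit side (modelled in software): read the descriptor back."""
    return struct.unpack_from(DESC_FORMAT, region, DESC_OFFSET)

# An anonymous mapping stands in for the DRAM window shared with the circuit.
shared = mmap.mmap(-1, 4096)
write_descriptor(shared, addr=0x1000_0000, length=1024 * 1024 * 3)
addr, length, status = read_descriptor(shared)
```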
---
## 12/19
### Channel-wise
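- First take 1/4 of the output (4 of the 16 channels in the 16x1024x1024 feature map) and carry it through the following layers
- The data has to be fetched repeatedly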
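
The channel-wise split can be sketched as follows. This is only an illustration of the bookkeeping: the function name is made up, and the byte count assumes 8-bit activations.

```python
OUT_CHANNELS = 16
HEIGHT = WIDTH = 1024
GROUP_SIZE = OUT_CHANNELS // 4  # 4 channels per pass, as in the note above

def channel_groups(num_channels, group_size):
    """Return (start, stop) channel ranges, one per pass."""
    return [(start, min(start + group_size, num_channels))
            for start in range(0, num_channels, group_size)]

groups = channel_groups(OUT_CHANNELS, GROUP_SIZE)
# Bytes produced per pass, assuming 8-bit activations (an assumption here).
bytes_per_group = GROUP_SIZE * HEIGHT * WIDTH
# Every pass re-reads the same input, so input traffic scales with the group count.
input_reads = len(groups)
```

With these numbers each pass produces 4 MiB of output and the input is read four times, which is the repeated fetching noted above.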

### Output-division

## 12/12
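### 8Ch8W issues
#### Hardware
- The hardware that has been officially open-sourced so far is limited:
    - An 8ch8w bitstream file (prebuilt, single file)
    - A predefined 2ch8w project (can be synthesized and implemented directly to generate a bitstream)
    - Verilog source code for an unknown channel/way configuration (a project has to be created manually before synthesis and implementation can start)
- An unofficial 8Ch8W project on GitHub
    - https://github.com/OpenSSD-CN/OpenSSD_8CH8WAY
#### Firmware
- The predefined 2Ch8W runs without problems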

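- FTL initialization fails on the prebuilt 8Ch8W
    - No channel other than 0 and 1 has a free block (all blocks are bad); the log below lists Ch 2, 3, 5, 6, and 7, with no entries for Ch 4 or for Ch 5 Way 2
    - The FW channel count can be limited to avoid the problematic channels
    - Controller problem? Board problem? Flash module problem?
    - Test by modifying the map address?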

```=
[WARNING] There is no free block on Ch 2 Way 0!
[WARNING] There is no free block on Ch 3 Way 0!
[WARNING] There is no free block on Ch 5 Way 0!
[WARNING] There is no free block on Ch 6 Way 0!
[WARNING] There is no free block on Ch 7 Way 0!
[WARNING] There is no free block on Ch 2 Way 1!
[WARNING] There is no free block on Ch 3 Way 1!
[WARNING] There is no free block on Ch 5 Way 1!
[WARNING] There is no free block on Ch 6 Way 1!
[WARNING] There is no free block on Ch 7 Way 1!
[WARNING] There is no free block on Ch 2 Way 2!
[WARNING] There is no free block on Ch 3 Way 2!
[WARNING] There is no free block on Ch 6 Way 2!
[WARNING] There is no free block on Ch 7 Way 2!
[WARNING] There is no free block on Ch 2 Way 3!
[WARNING] There is no free block on Ch 3 Way 3!
[WARNING] There is no free block on Ch 5 Way 3!
[WARNING] There is no free block on Ch 6 Way 3!
[WARNING] There is no free block on Ch 7 Way 3!
[WARNING] There is no free block on Ch 2 Way 4!
[WARNING] There is no free block on Ch 3 Way 4!
[WARNING] There is no free block on Ch 5 Way 4!
[WARNING] There is no free block on Ch 6 Way 4!
[WARNING] There is no free block on Ch 7 Way 4!
[WARNING] There is no free block on Ch 2 Way 5!
[WARNING] There is no free block on Ch 3 Way 5!
[WARNING] There is no free block on Ch 5 Way 5!
[WARNING] There is no free block on Ch 6 Way 5!
[WARNING] There is no free block on Ch 7 Way 5!
[WARNING] There is no free block on Ch 2 Way 6!
[WARNING] There is no free block on Ch 3 Way 6!
[WARNING] There is no free block on Ch 5 Way 6!
[WARNING] There is no free block on Ch 6 Way 6!
[WARNING] There is no free block on Ch 7 Way 6!
[WARNING] There is no free block on Ch 2 Way 7!
[WARNING] There is no free block on Ch 3 Way 7!
[WARNING] There is no free block on Ch 5 Way 7!
[WARNING] There is no free block on Ch 6 Way 7!
[WARNING] There is no free block on Ch 7 Way 7!
```
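
To see at a glance which channel/way pairs the firmware reports, the warnings above can be grouped by channel. This is a minimal sketch: the function name is made up, and the regular expression simply mirrors the message format shown in the log.

```python
import re
from collections import defaultdict

WARNING_RE = re.compile(r"There is no free block on Ch (\d+) Way (\d+)!")

def ways_without_free_blocks(log_text):
    """Map each channel to the sorted list of ways reported without free blocks."""
    by_channel = defaultdict(set)
    for match in WARNING_RE.finditer(log_text):
        channel, way = int(match.group(1)), int(match.group(2))
        by_channel[channel].add(way)
    return {ch: sorted(ways) for ch, ways in sorted(by_channel.items())}

# A three-line excerpt of the log above, used as sample input.
sample = """\
[WARNING] There is no free block on Ch 2 Way 0!
[WARNING] There is no free block on Ch 3 Way 0!
[WARNING] There is no free block on Ch 2 Way 1!
"""
summary = ways_without_free_blocks(sample)
```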
---
## 12/5
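### Correction to the multiply-accumulates per cycle
The 16 pixels delivered each cycle can support three different numbers of multiply-accumulates:

- case 1: top-left corner
- case 2: top and left boundaries
- case 3: everything else

In a 1024x1024 image, most cycles fall under case 3 (enough data to feed 16 multiply-accumulates for each 3x3 kernel).
:::danger
This only holds if the data from earlier cycles can be retained => at most 4x1024 Bytes need to be kept => should be enough
:::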
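### Correction to the time required per patch
- patch size: 1024x1024x3 (RGB)
- kernel size: 3x3
- stride: 1

Multiply-accumulates along x = multiply-accumulates along y = 1024 - 3 + 1 = 1022 (no padding, counting one 3x3 window as one multiply-accumulate)
=> the three channels of a patch take 3 * 1022^2 multiply-accumulates in total
=> every channel of a patch is multiplied and accumulated with the kernels of all 16 output channels
=> one patch takes 16 * 3 * 1022^2 = **50135232** multiply-accumulates
One cycle can perform ==**16**== x 16 = **256** multiply-accumulates
=> one patch needs 50135232 / 256 = 195840.75 cycles
=> AXI frequency is 100 MHz => 100,000,000 cycles per second
=> one patch needs 195840.75 / 100,000,000 = **0.0019584075** seconds
:::success
Time per patch x total number of patches => 0.0019584075 x 30171 => 59.0871126825 seconds
:::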
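
The arithmetic above can be reproduced with a few lines. The constant names are illustrative, and the values are the ones quoted in the notes.

```python
PATCH = 1024
KERNEL = 3
IN_CHANNELS = 3
OUT_CHANNELS = 16
MACS_PER_CYCLE = 16 * 16
CLOCK_HZ = 100_000_000
NUM_PATCHES = 30171

positions = PATCH - KERNEL + 1                 # 1022 window positions per axis
macs_per_patch = OUT_CHANNELS * IN_CHANNELS * positions ** 2
cycles_per_patch = macs_per_patch / MACS_PER_CYCLE
seconds_per_patch = cycles_per_patch / CLOCK_HZ
total_seconds = seconds_per_patch * NUM_PATCHES
```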
:::warning
Original image = 87424x205952x3 pixels => 50.3058 GiB => 31.4411 seconds
30171 patches in total => 88.3916 GiB => 55.24475 seconds
:::
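
The sizes above follow from one byte per colour value. The sketch below recomputes them and also derives the throughput implied by the quoted times; that throughput is not stated in the notes, it is only back-calculated here.

```python
GIB = 2 ** 30

image_bytes = 87424 * 205952 * 3          # whole image, 1 byte per colour value
patch_bytes = 30171 * 1024 * 1024 * 3     # all patches, overlap included

image_gib = image_bytes / GIB
patch_gib = patch_bytes / GIB

# Throughput implied by the quoted 31.4411 s (derived, not stated in the notes).
implied_gib_per_s = image_gib / 31.4411
# Extra traffic caused by reading overlapping patches instead of the raw image.
overhead = patch_gib / image_gib
```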
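:::warning
The UNet needs to know whether the next cycle falls on a corner, an edge, or the interior of the image => this affects how many multiply-accumulates can be done
:::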
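### Using LUTs and registers to avoid re-reading and recomputing the overlap between patches

> https://www.mouser.com/datasheet/2/903/ds190-Zynq-7000-Overview-1595492.pdf

Overlap between adjacent patches: 256 x 1024 pixels = 262144 Bytes
=> keeping only the raw data still means recomputing -> wasted time
=> store it in two parts, "multiply-accumulate results" and "raw data":

- Multiply-accumulate results: 254 x 1022 = 259588 multiply-accumulates = 259588 results to store
=> multiply-accumulate results: 259588 Bytes
- Raw data: the stride is 1, so 2x1024 must be kept to multiply-accumulate with the non-overlapping part
=> raw data: 2048 Bytes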
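
The two storage figures can be checked as below. The names are illustrative, and the sketch keeps the notes' assumption of one byte per stored result.

```python
OVERLAP_SHORT = 256     # overlap extent between adjacent patches
OVERLAP_LONG = 1024     # patch edge length
KERNEL = 3
STRIDE = 1

overlap_bytes = OVERLAP_SHORT * OVERLAP_LONG               # raw pixels, 1 byte each
# Windows that fit entirely inside the overlap region.
result_count = (OVERLAP_SHORT - KERNEL + 1) * (OVERLAP_LONG - KERNEL + 1)
# Lines of raw data still needed by windows that straddle the boundary.
halo_lines = KERNEL - STRIDE
raw_bytes = halo_lines * OVERLAP_LONG
stored_bytes = result_count + raw_bytes                    # assuming 1 byte per result
```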
---
## 1212
### LUT test
- 64K (data: 8-bit, depth: 65536) uses 11034 LUTs (117600 are available with 8Ch8W, roughly ten times as many)
- 512K (data: 64-bit, depth: 65536): 94683 LUTs
- 8 x 64K: 88272 LUTs
- 1024K (data: 128-bit, depth: 65536): 189275 LUTs
- 16 x 64K: 176544 LUTs
- The patch overlap needs 261636 / 65536 = 3.99224853515625 blocks of 64K
- Can the patch conv output be placed in LUTs?
- 16M (16x1024x1024) -> 256 blocks of 64K -> not enough
- Try going from 65536x8 to 65536x64 (or more) and check the LUT usage
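
A rough budget check based on the measurements above. It assumes LUT cost scales linearly with the number of 64K blocks, which the 512K and 1024K measurements only roughly support (they come out about 7% higher than 8 and 16 separate blocks); the function names are made up.

```python
import math

LUTS_PER_64K = 11034        # measured: 8-bit data, depth 65536
LUTS_AVAILABLE = 117600     # available alongside the 8Ch8W design
BLOCK_BYTES = 65536

def blocks_needed(num_bytes):
    """Number of 64K blocks required to hold num_bytes."""
    return math.ceil(num_bytes / BLOCK_BYTES)

def luts_needed(num_bytes):
    """Estimate LUT usage, assuming cost scales linearly with 64K blocks."""
    return blocks_needed(num_bytes) * LUTS_PER_64K

overlap_luts = luts_needed(261636)               # patch overlap buffer
conv_out_luts = luts_needed(16 * 1024 * 1024)    # full first-conv output
```

Under that assumption the overlap buffer needs 4 blocks and fits in the available LUTs, while the full 16M conv output does not.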
### 8bit * 8bit = 16bit (multiply), 16bit x 9 = 20bit (multiply-accumulate)
- How many resources will the multiply-accumulate circuit use?
    - FF: 0
    - LUT: 0
    - DSP: 256
- 20bit to 8bit????
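
The bit widths in the heading can be derived as below, assuming unsigned operands. The `requantize` helper is only one possible answer to the open 20-bit-to-8-bit question (right shift, then saturate); it is a hypothetical sketch, not something decided in these notes.

```python
import math

def product_bits(a_bits, b_bits):
    """Bits needed for the product of two unsigned operands."""
    return a_bits + b_bits

def accumulator_bits(term_bits, num_terms):
    """Bits needed to sum num_terms unsigned values of term_bits each."""
    return term_bits + math.ceil(math.log2(num_terms))

def requantize(acc, shift, out_bits=8):
    """Hypothetical 20-bit -> 8-bit step: right shift, then saturate."""
    return min(acc >> shift, (1 << out_bits) - 1)

mul_bits = product_bits(8, 8)              # 16
acc_bits = accumulator_bits(mul_bits, 9)   # 16 + ceil(log2(9)) = 20
max_acc = 9 * 255 * 255                    # largest possible 3x3 sum
```

The largest possible sum, 9 x 255 x 255 = 585225, lies between 2^19 and 2^20, so 20 bits are both necessary and sufficient for unsigned inputs.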
### Source code -> project
---
## Where to store the output of the first conv
- ping-pong buffer
- Registers cannot hold 32 MB (16 x 1024 x 1024 x 8 bits = 16 MiB per buffer, two buffers for ping-pong)
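- Each layer's feature map has to be stored in DRAM
- How do we communicate with the FW?
- How does the status checker see the controller state?
- Compile directly and check the address
- `V2FMCRegisters` ?
- Send a control signal to the flash controller?
- Start storing the register output to DRAM
- Once the store is done, the UNet can continue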
- register 1024K(8\*8\*16)
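
The ping-pong idea can be modelled in software as two buffers whose roles swap: the conv fills one while the other is drained to DRAM. The class and method names below are made up for illustration, and the buffer is tiny on purpose.

```python
class PingPongBuffer:
    """Two equally sized buffers: one is written while the other is drained."""

    def __init__(self, size):
        self._buffers = [bytearray(size), bytearray(size)]
        self._write_index = 0

    @property
    def write_buffer(self):
        return self._buffers[self._write_index]

    @property
    def read_buffer(self):
        return self._buffers[1 - self._write_index]

    def swap(self):
        """Called once the writer is full and the reader has been drained."""
        self._write_index = 1 - self._write_index

# Small size for illustration; the notes discuss 16 x 1024 x 1024 bytes per map.
buf = PingPongBuffer(size=8)
buf.write_buffer[:] = b"layer-0!"   # the conv writes its output here
buf.swap()                          # hand the filled buffer to the DRAM writer
drained = bytes(buf.read_buffer)
```

After `swap()`, the conv can start writing into the other buffer while the filled one is being stored, which is the "done storing, the UNet can continue" handshake listed above.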
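## The second half of the UNet needs information from the first half
- Send to DRAM and flash at the same time?
- DRAM to Unet_inference time
- If data has to be fetched from DRAM, how long does going through the FW take?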
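
To size the data that the first half must keep for the second half, the encoder output shapes can be read off the layer table in these notes (the outputs of conv1.5, conv2.5, conv3.5, and conv4.5). The sketch assumes 8-bit activations and that exactly these four feature maps are reused; both are assumptions.

```python
# (channels, height, width) of each encoder output reused by the decoder.
SKIP_SHAPES = {
    "conv1": (16, 1024, 1024),
    "conv2": (32, 512, 512),
    "conv3": (64, 256, 256),
    "conv4": (128, 128, 128),
}
BYTES_PER_VALUE = 1  # assumes 8-bit activations

def feature_bytes(shape):
    """Storage for one feature map of the given (C, H, W) shape."""
    channels, height, width = shape
    return channels * height * width * BYTES_PER_VALUE

skip_mib = {name: feature_bytes(shape) / 2 ** 20
            for name, shape in SKIP_SHAPES.items()}
total_mib = sum(skip_mib.values())
```

Under these assumptions the four maps take 16, 8, 4, and 2 MiB, 30 MiB in total per patch.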
## Estimated time for each patch to finish the first conv layer
```shell=
module name input shape output shape params memory(MB) MAdd Flops MemRead(B) MemWrite(B) duration[%] MemR+W(B)
0 conv1.0 3 1024 1024 16 1024 1024 448.0 64.00 905,969,664.0 469,762,048.0 12584704.0 67108864.0 0.88% 7.969357e+07
1 conv1.1 16 1024 1024 16 1024 1024 32.0 64.00 67,108,864.0 33,554,432.0 67108992.0 67108864.0 0.12% 1.342179e+08
2 conv1.2 16 1024 1024 16 1024 1024 0.0 64.00 16,777,216.0 16,777,216.0 67108864.0 67108864.0 0.20% 1.342177e+08
3 conv1.3 16 1024 1024 16 1024 1024 2320.0 64.00 4,831,838,208.0 2,432,696,320.0 67118144.0 67108864.0 4.01% 1.342270e+08
4 conv1.4 16 1024 1024 16 1024 1024 32.0 64.00 67,108,864.0 33,554,432.0 67108992.0 67108864.0 0.08% 1.342179e+08
5 conv1.5 16 1024 1024 16 1024 1024 0.0 64.00 16,777,216.0 16,777,216.0 67108864.0 67108864.0 0.13% 1.342177e+08
6 maxpool1 16 1024 1024 16 512 512 0.0 16.00 33,554,432.0 16,777,216.0 67108864.0 16777216.0 0.28% 8.388608e+07
7 conv2.0 16 512 512 32 512 512 4640.0 32.00 2,415,919,104.0 1,216,348,160.0 16795776.0 33554432.0 0.77% 5.035021e+07
8 conv2.1 32 512 512 32 512 512 64.0 32.00 33,554,432.0 16,777,216.0 33554688.0 33554432.0 0.05% 6.710912e+07
9 conv2.2 32 512 512 32 512 512 0.0 32.00 8,388,608.0 8,388,608.0 33554432.0 33554432.0 0.07% 6.710886e+07
10 conv2.3 32 512 512 32 512 512 9248.0 32.00 4,831,838,208.0 2,424,307,712.0 33591424.0 33554432.0 1.96% 6.714586e+07
11 conv2.4 32 512 512 32 512 512 64.0 32.00 33,554,432.0 16,777,216.0 33554688.0 33554432.0 0.07% 6.710912e+07
12 conv2.5 32 512 512 32 512 512 0.0 32.00 8,388,608.0 8,388,608.0 33554432.0 33554432.0 0.08% 6.710886e+07
13 maxpool2 32 512 512 32 256 256 0.0 8.00 16,777,216.0 8,388,608.0 33554432.0 8388608.0 0.20% 4.194304e+07
14 conv3.0 32 256 256 64 256 256 18496.0 16.00 2,415,919,104.0 1,212,153,856.0 8462592.0 16777216.0 0.45% 2.523981e+07
15 conv3.1 64 256 256 64 256 256 128.0 16.00 16,777,216.0 8,388,608.0 16777728.0 16777216.0 0.02% 3.355494e+07
16 conv3.2 64 256 256 64 256 256 0.0 16.00 4,194,304.0 4,194,304.0 16777216.0 16777216.0 0.02% 3.355443e+07
17 conv3.3 64 256 256 64 256 256 36928.0 16.00 4,831,838,208.0 2,420,113,408.0 16924928.0 16777216.0 0.81% 3.370214e+07
18 conv3.4 64 256 256 64 256 256 128.0 16.00 16,777,216.0 8,388,608.0 16777728.0 16777216.0 0.02% 3.355494e+07
19 conv3.5 64 256 256 64 256 256 0.0 16.00 4,194,304.0 4,194,304.0 16777216.0 16777216.0 0.02% 3.355443e+07
20 maxpool3 64 256 256 64 128 128 0.0 4.00 8,388,608.0 4,194,304.0 16777216.0 4194304.0 0.07% 2.097152e+07
21 conv4.0 64 128 128 128 128 128 73856.0 8.00 2,415,919,104.0 1,210,056,704.0 4489728.0 8388608.0 0.25% 1.287834e+07
22 conv4.1 128 128 128 128 128 128 256.0 8.00 8,388,608.0 4,194,304.0 8389632.0 8388608.0 0.01% 1.677824e+07
23 conv4.2 128 128 128 128 128 128 0.0 8.00 2,097,152.0 2,097,152.0 8388608.0 8388608.0 0.01% 1.677722e+07
24 conv4.3 128 128 128 128 128 128 147584.0 8.00 4,831,838,208.0 2,418,016,256.0 8978944.0 8388608.0 0.47% 1.736755e+07
25 conv4.4 128 128 128 128 128 128 256.0 8.00 8,388,608.0 4,194,304.0 8389632.0 8388608.0 0.01% 1.677824e+07
26 conv4.5 128 128 128 128 128 128 0.0 8.00 2,097,152.0 2,097,152.0 8388608.0 8388608.0 0.00% 1.677722e+07
27 maxpool4 128 128 128 128 64 64 0.0 2.00 4,194,304.0 2,097,152.0 8388608.0 2097152.0 0.03% 1.048576e+07
28 upconv4.0 128 64 64 256 64 64 295168.0 4.00 2,415,919,104.0 1,209,008,128.0 3277824.0 4194304.0 0.15% 7.472128e+06
29 upconv4.1 256 64 64 256 64 64 512.0 4.00 4,194,304.0 2,097,152.0 4196352.0 4194304.0 0.01% 8.390656e+06
30 upconv4.2 256 64 64 256 64 64 0.0 4.00 1,048,576.0 1,048,576.0 4194304.0 4194304.0 0.00% 8.388608e+06
31 upconv4.3 256 64 64 256 64 64 590080.0 4.00 4,831,838,208.0 2,416,967,680.0 6554624.0 4194304.0 0.27% 1.074893e+07
32 upconv4.4 256 64 64 256 64 64 512.0 4.00 4,194,304.0 2,097,152.0 4196352.0 4194304.0 0.01% 8.390656e+06
33 upconv4.5 256 64 64 256 64 64 0.0 4.00 1,048,576.0 1,048,576.0 4194304.0 4194304.0 0.01% 8.388608e+06
34 ConvT4 256 64 64 128 128 128 295040.0 8.00 2,415,919,104.0 0.0 0.0 0.0 0.23% 0.000000e+00
35 upconv3.0 256 128 128 128 128 128 295040.0 8.00 9,663,676,416.0 4,833,935,360.0 17957376.0 8388608.0 1.42% 2.634598e+07
36 upconv3.1 128 128 128 128 128 128 256.0 8.00 8,388,608.0 4,194,304.0 8389632.0 8388608.0 0.01% 1.677824e+07
37 upconv3.2 128 128 128 128 128 128 0.0 8.00 2,097,152.0 2,097,152.0 8388608.0 8388608.0 0.01% 1.677722e+07
38 upconv3.3 128 128 128 128 128 128 147584.0 8.00 4,831,838,208.0 2,418,016,256.0 8978944.0 8388608.0 0.42% 1.736755e+07
39 upconv3.4 128 128 128 128 128 128 256.0 8.00 8,388,608.0 4,194,304.0 8389632.0 8388608.0 0.01% 1.677824e+07
40 upconv3.5 128 128 128 128 128 128 0.0 8.00 2,097,152.0 2,097,152.0 8388608.0 8388608.0 0.01% 1.677722e+07
41 ConvT3 128 128 128 64 256 256 73792.0 16.00 2,415,919,104.0 0.0 0.0 0.0 0.26% 0.000000e+00
42 upconv2.0 128 256 256 64 256 256 73792.0 16.00 9,663,676,416.0 4,836,032,512.0 33849600.0 16777216.0 1.85% 5.062682e+07
43 upconv2.1 64 256 256 64 256 256 128.0 16.00 16,777,216.0 8,388,608.0 16777728.0 16777216.0 0.02% 3.355494e+07
44 upconv2.2 64 256 256 64 256 256 0.0 16.00 4,194,304.0 4,194,304.0 16777216.0 16777216.0 0.02% 3.355443e+07
45 upconv2.3 64 256 256 64 256 256 36928.0 16.00 4,831,838,208.0 2,420,113,408.0 16924928.0 16777216.0 1.03% 3.370214e+07
46 upconv2.4 64 256 256 64 256 256 128.0 16.00 16,777,216.0 8,388,608.0 16777728.0 16777216.0 0.03% 3.355494e+07
47 upconv2.5 64 256 256 64 256 256 0.0 16.00 4,194,304.0 4,194,304.0 16777216.0 16777216.0 0.04% 3.355443e+07
48 ConvT2 64 256 256 32 512 512 18464.0 32.00 2,415,919,104.0 0.0 0.0 0.0 0.67% 0.000000e+00
49 upconv1.0 64 512 512 32 512 512 18464.0 32.00 9,663,676,416.0 4,840,226,816.0 67182720.0 33554432.0 4.98% 1.007372e+08
50 upconv1.1 32 512 512 32 512 512 64.0 32.00 33,554,432.0 16,777,216.0 33554688.0 33554432.0 0.05% 6.710912e+07
51 upconv1.2 32 512 512 32 512 512 0.0 32.00 8,388,608.0 8,388,608.0 33554432.0 33554432.0 0.05% 6.710886e+07
52 upconv1.3 32 512 512 32 512 512 9248.0 32.00 4,831,838,208.0 2,424,307,712.0 33591424.0 33554432.0 6.66% 6.714586e+07
53 upconv1.4 32 512 512 32 512 512 64.0 32.00 33,554,432.0 16,777,216.0 33554688.0 33554432.0 0.04% 6.710912e+07
54 upconv1.5 32 512 512 32 512 512 0.0 32.00 8,388,608.0 8,388,608.0 33554432.0 33554432.0 0.04% 6.710886e+07
55 ConvT1 32 512 512 16 1024 1024 4624.0 64.00 2,415,919,104.0 0.0 0.0 0.0 1.30% 0.000000e+00
56 upconv0.0 32 1024 1024 16 1024 1024 4624.0 64.00 9,663,676,416.0 4,848,615,424.0 134236224.0 67108864.0 37.24% 2.013451e+08
57 upconv0.1 16 1024 1024 16 1024 1024 32.0 64.00 67,108,864.0 33,554,432.0 67108992.0 67108864.0 0.10% 1.342179e+08
58 upconv0.2 16 1024 1024 16 1024 1024 0.0 64.00 16,777,216.0 16,777,216.0 67108864.0 67108864.0 0.15% 1.342177e+08
59 upconv0.3 16 1024 1024 16 1024 1024 2320.0 64.00 4,831,838,208.0 2,432,696,320.0 67118144.0 67108864.0 7.02% 1.342270e+08
60 upconv0.4 16 1024 1024 16 1024 1024 32.0 64.00 67,108,864.0 33,554,432.0 67108992.0 67108864.0 0.09% 1.342179e+08
61 upconv0.5 16 1024 1024 16 1024 1024 0.0 64.00 16,777,216.0 16,777,216.0 67108864.0 67108864.0 0.21% 1.342177e+08
62 output_1.0 16 1024 1024 1 1024 1024 145.0 4.00 301,989,888.0 152,043,520.0 67109444.0 4194304.0 4.42% 7.130375e+07
63 output_2.0 16 1024 1024 1 1024 1024 145.0 4.00 301,989,888.0 152,043,520.0 67109444.0 4194304.0 20.11% 7.130375e+07
total 2161922.0 1622.00 103,681,097,728.0 47,202,697,216.0 67109444.0 4194304.0 100.00% 3.417049e+09
==============================================================================================================================================================
Total params: 2,161,922
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Total memory: 1622.00MB
Total MAdd: 103.68GMAdd
Total Flops: 47.2GFlops
Total MemR+W: 3.18GB
```
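
As a sanity check on the table, the MAdd figures for conv1.0 and conv1.3 can be reproduced by hand, assuming the profiler counts multiplications and additions (including the bias add) separately. The last lines estimate the cycle count for conv1.0 by reusing the 256 windows per cycle and 100 MHz figures from the 12/5 notes; here the output is 1024x1024 as in the table, so this is only a rough sketch.

```python
def conv_madd(in_ch, out_ch, out_h, out_w, kernel=3, bias=True):
    """Multiplies plus adds per output value, times the number of output values."""
    muls = kernel * kernel * in_ch
    adds = muls - 1 + (1 if bias else 0)
    return (muls + adds) * out_ch * out_h * out_w

conv1_0 = conv_madd(3, 16, 1024, 1024)    # table: 905,969,664
conv1_3 = conv_madd(16, 16, 1024, 1024)   # table: 4,831,838,208

# Cycle estimate for conv1.0 (assumed throughput: 256 3x3 windows per cycle).
windows = 3 * 16 * 1024 * 1024
cycles = windows / 256
seconds = cycles / 100_000_000
```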
## TODO
- quantized weight => which student is responsible? 柏碩
- firmware hardware communication => mmap register
- How should the monitor be done?