可微邏輯門網絡（Differentiable Logic Gate Networks, DLGNs）

# 可微邏輯門網絡（Differentiable Logic Gate Networks, DLGNs） - paper link https://arxiv.org/abs/2210.08277 https://arxiv.org/html/2411.04732v1 - Learning with Differentiable Algorithms (ch6) https://arxiv.org/abs/2209.00616 - paper video https://www.youtube.com/watch?v=VulM891DhI4 - [x] paper - [x] 用 diff logic git做實驗 - [x] 整理問題 - [x] read source code # 跟老師討論 ## input 的資料型別 > difflogic 會先做 normalize 把 input 壓到 [-1, 1], type 是 float32, 不過論文有提到用 bf16. ![image](https://hackmd.io/_uploads/ByTStFmv1g.png) ## forward - input 打包的地方有點問題... 看作者應該是想把 9216 個 bit 打包到 long long 的 array, d 應該要是 d < 9216 / 64.. - logic_gate_net 是作者用 Python 用 script 產生的 function, 其中可以看到 `inp` 跟 `out`, inp 就是 image 攤平後的 array, out 是算完的結果. 其中 e.g., inp[161] | inp[160], 中間的op 就是作者用 fuzzy logic train 出來的. - 從 logic_gate_net 拿到 out 會再把內容做group sum, 從 paper 看起來是把 out 裡面的value 去做 bit_count 的運算 - 看起來作者有把 logic gate 的邏輯用 c 語言實現, 有看數位邏輯跟計算機組織的課程, 感覺是實現一個 half adder, 織的課程, 感覺是實現一個 half adder, 電路跟 c 的邏輯差不多都是用 and 跟 xor 去實現, > 這邊還不知道實作 half adder 的用途... ![image](https://hackmd.io/_uploads/Hyf9tELvJl.png) ![image](https://hackmd.io/_uploads/S1v5ON8PJl.png) ```c= void logic_gate_net(long long const *inp, long long *out) { const long long v0 = ~inp[161] | inp[160]; const long long v1 = ~inp[5595]; const long long v2 = inp[4072] | inp[4073]; const long long v3 = ~(inp[2259] & inp[2261]); const long long v4 = ~(inp[859] & inp[860]); const long long v5 = inp[149]; ...... .... out[11991] = ~v34183; out[11992] = v30818; out[11993] = ~v33111; out[11994] = ~v27064; out[11995] = v32836; out[11996] = v24432 & ~v24433; out[11997] = ~0LL; out[11998] = ~v33359; out[11999] = v31983 ^ v31984; } void apply_logic_gate_net (bool const *inp, int *out, size_t len) { long long *inp_temp = malloc(9216*sizeof(long long)); long long *out_temp = malloc(12000*sizeof(long long)); long long *out_temp_o = malloc(11*sizeof(long long)); for(size_t i = 0; i < len; ++i) { // input 是 3*3*32*32 個 bit, 會把64 個 bit 打包到 long long 裡面. 可以節省空間, 可以看到 input 是 bool* inp. // Converting the bool array into a bitpacked array for(size_t d = 0; d < 9216; ++d) { long long res = 0LL; for(size_t b = 0; b < 64; ++b) { res <<= 1; res += !!(inp[i * 9216 * 64 + (64 - b - 1) * 9216 + d]); } inp_temp[d] = res; } // Applying the logic gate net logic_gate_net(inp_temp, out_temp); // GroupSum of the results via logic gate networks for(size_t c = 0; c < 10; ++c) { // for each class // Initialize the output bits for(size_t d = 0; d < 11; ++d) { out_temp_o[d] = 0LL; } // Apply the adder logic gate network for(size_t a = 0; a < 1200; ++a) { long long carry = out_temp[c * 1200 + a]; long long out_temp_o_d; for(int d = 11 - 1; d >= 0; --d) { out_temp_o_d = out_temp_o[d]; out_temp_o[d] = carry ^ out_temp_o_d; carry = carry & out_temp_o_d; } } // Unpack the result bits for(size_t b = 0; b < 64; ++b) { const long long bit_mask = 1LL << b; int res = 0; for(size_t d = 0; d < 11; ++d) { res <<= 1; res += !!(out_temp_o[d] & bit_mask); } out[(i * 64 + b) * 10 + c] = res; } } } free(inp_temp); free(out_temp); free(out_temp_o); } ``` ### backward 作者是用 torch 去做訓練的, 從下面的code, 作者有另外實作 model 的 class, 只要在 model 裡面有實作 forward , pytorch 就會根據 forward 的結果去計算梯度. 可以參考下面的第二行. ```python= def train(model, x, y, loss_fn, optimizer): x = model(x) loss = loss_fn(x, y) optimizer.zero_grad() loss.backward() optimizer.step() return loss.item() - forward 可以參考下面的 code, ``` forward 有分 python 跟 cuda 版本的, 目前先看 python 版本的實作, 可以看到一開始會先從 image 的 arr, 各別隨機選出 a 跟 b 根據 op 的 table 去算出 ```python= def forward_python(self, x): assert x.shape[-1] == self.in_dim, (x[0].shape[-1], self.in_dim) if self.indices[0].dtype == torch.int64 or self.indices[1].dtype == torch.int64: print(self.indices[0].dtype, self.indices[1].dtype) self.indices = self.indices[0].long(), self.indices[1].long() print(self.indices[0].dtype, self.indices[1].dtype) a, b = x[..., self.indices[0]], x[..., self.indices[1]] if self.training: x = bin_op_s(a, b, torch.nn.functional.softmax(self.weights, dim=-1)) else: weights = torch.nn.functional.one_hot(self.weights.argmax(-1), 16).to(torch.float32) x = bin_op_s(a, b, weights) return x ``` 下面是簡單的流程圖. ![image](https://hackmd.io/_uploads/rJf0zBLwJe.png) - loss function. > 看 source code 是用 cross entropy. ## 可以用在 bitnet? > 目前我的想法應該是不可行, 因為作者的做法是用 logic gate 的op 去取代 conv, max pool. 而bitnet 已經把資料壓縮成 bit並且運算已經用 accumulate sum 去取代矩陣乘法. 理論上用硬體做應該已經最快了, 優化的空間可能也有限. ## 背景介紹`傳統邏輯門（如 AND、OR、XOR 等）在`神經網絡中的應用。 - 挑戰：`傳統的邏輯門無法直接進行反向傳播和梯度計算`，這使得它們無法直接用於深度學習模型中。 - 貢獻：`提出可微邏輯門網絡（DLGNs）`，該網絡可以將邏輯門運算符轉化為可微分形式，並能夠進行有效的梯度下降訓練。 - 主要成果：展示了在邏輯運算任務（如 XOR 和 AND）中，DLGN 可以學習邏輯關係，並有效地進行訓練。 - cnn vs LGN - cnn: input, receptive field, weight ![image](https://hackmd.io/_uploads/SyYt63BByl.png) - lgn ![image](https://hackmd.io/_uploads/SJLJChrB1x.png) ![image](https://hackmd.io/_uploads/rJX-J6Hr1l.png) - logical or pooling ![image](https://hackmd.io/_uploads/ByLIyaHS1x.png) --- ![image](https://hackmd.io/_uploads/r1kOakE4yl.png) --- ## 邏輯門的可微化 - 引入實值邏輯運算符（例如 T-norms 和 T-conorms）fuzzy logic, 來代替傳統的布爾邏輯。 - T-norms, T-conorms ![image](https://hackmd.io/_uploads/HJq4O0mV1g.png) - 16 logic operator table ![image](https://hackmd.io/_uploads/By4oQA7Nkg.png) - 模型結構：DLGN 模型的結構，包括邏輯門層、加權邏輯運算符、輸入和輸出層等。使用 softmax 來選擇每層的邏輯運算符，使得每個邏輯操作的選擇成為一個概率分佈。 - 加權平均讓logic net 可以微分 ![image](https://hackmd.io/_uploads/HymBi3HHJx.png) - aggregate of output neutron ![image](https://hackmd.io/_uploads/Hk__i3BH1e.png) ## example - 條件設置 ![image](https://hackmd.io/_uploads/ByNZF3BBye.png) --- - 梯度計算步驟 ![image](https://hackmd.io/_uploads/ry9NK3SBke.png) --- - 更新參數 ![image](https://hackmd.io/_uploads/SyPd5nHB1l.png) --- - 可微性保證：詳細介紹如何保證每個邏輯操作在訓練過程中是可微的，並且支持基於梯度的優化方法（如 Adam）。 ```bash= +------------------+ | Input Layer | | (a, b) | +------------------+ | | +-------------------------------------+ | Softmax Probabilities | | (選擇每個邏輯運算符的概率) | +-------------------------------------+ | v +-----------------------------------------------+ | Logic Gate Layer (16 possible operations) | | Operators: AND, OR, XOR, etc. | +-----------------------------------------------+ | v +------------------+ | Output Layer | | (final result) | +------------------+ ``` ``` ┌─────────────────────────────┐ │ Dataset │ │ (e.g., CIFAR-10, MNIST) │ └──────────────┬──────────────┘ │ ▼ ┌─────────────────────────────┐ │ Data Loader (PyTorch) │ │ - 加載並批次化資料 │ │ - 應用轉換 (Transform) │ └──────────────┬──────────────┘ │ ▼ ┌─────────────────────────────┐ │ Model │ │ (由多層 LogicLayer 和 │ │ 一個 GroupSum 組成) │ └──────────────┬──────────────┘ │ Forward Pass ▼ ┌─────────────────────────────┐ │ LogicLayer (CUDA) │ │ - 前向傳播使用 CUDA kernel │ │ - 計算邏輯運算的加權和 │ └──────────────┬──────────────┘ │ ▼ ┌─────────────────────────────┐ │ GroupSum │ │ - 將邏輯層輸出分組求和 │ │ - 產生最終的 logits │ └──────────────┬──────────────┘ │ ▼ ┌─────────────────────────────┐ │ Loss Function │ │ - 計算預測與真實標籤之間的損失│ └──────────────┬──────────────┘ │ Backward Pass ▼ ┌─────────────────────────────┐ │ Backward Pass │ │ - 反向傳播通過 CUDA kernel │ │ - 計算梯度 (對輸入和權重) │ └──────────────┬──────────────┘ │ ▼ ┌─────────────────────────────┐ │ Optimizer (e.g., Adam) │ │ - 更新模型參數 (權重) │ └──────────────┴──────────────┘ ``` - 邏輯運算符選擇和計算流程在每一層中，我們使用 softmax 來選擇每個邏輯運算符的選擇概率。每個邏輯運算符的概率會根據梯度下降優化來自動調整，選擇最適合的邏輯運算符。 - 邏輯運算符的選擇和加權計算例如，如果選擇了 AND，則對應的運算是 a * b；如果選擇了 OR，則對應的運算是 a + b - a * b。選擇的運算符會根據 softmax 輸出的概率進行加權。可以將其看作是從 16 個邏輯運算符中選擇一個來進行計算。 ``` Input: a = 0.7, b = 0.8 | AND | OR | XOR | ... | |-------|-------|-------|--------| | 0.4 | 0.3 | 0.2 | 0.1 | <- Softmax probabilities Selected logic operator: AND (because it has the highest probability) Calculation: AND: a * b = 0.7 * 0.8 = 0.56 ``` ``` Input a = 0.7, b = 0.8 | v +-----------------------+ | softmax(p) | <-- 概率分佈，選擇每個邏輯運算符的權重 +-----------------------+ | v +----------------------------+ | AND (p=0.4) OR (p=0.3) | <-- 選擇 AND 操作，p=0.4 最大 | XOR (p=0.2) ... | +----------------------------+ | v +------------------------+ | Logic Operation | | AND: a * b = 0.7 * 0.8 | +------------------------+ | v +-------------------------+ | Output: 0.56 | +-------------------------+ ``` ## limitation ![image](https://hackmd.io/_uploads/SksB8AX4ke.png) 邏輯門數量的增長特性表格右側的標註指出，隨著數據集的複雜度增加，邏輯門數量以指數速度增長。例如：在 CIFAR-10 的模型中，每層包含 1,024,000 個邏輯門，這顯著高於小模型的 12,000。參數總數（Total num. of p.）也隨之顯著增加，從幾萬增長到數百萬。 ## 研究成果 - result (binaried mnist) ![image](https://hackmd.io/_uploads/HJg0wJEN1x.png) - mnist ![image](https://hackmd.io/_uploads/Hk_CdJNVyl.png) https://drive.google.com/file/d/1Q3EfJ_Ld5ic_43HKd6-ojjaNEkC6lybm/view?usp=drive_link - mnist ![image](https://hackmd.io/_uploads/H1T_I8LS1e.png) ![image](https://hackmd.io/_uploads/SJRt88IHJg.png) - cifar 10 ![image](https://hackmd.io/_uploads/BJ5CUIUSkl.png) ![image](https://hackmd.io/_uploads/BydJPU8rkx.png) ![image](https://hackmd.io/_uploads/HkG-w8UHye.png) ![image](https://hackmd.io/_uploads/BkRbP8UHJl.png) ## source code - forward Flowchart - flatten pic - 從 picture 的 tensor 各選出兩個 indice - a: [0...51200] - b: [0...51200] - a 跟 b 會先根據 table 算 16 次, 再透過 random 的 weight 算出加權和 ```mermaid graph TD A["輸入圖片 Batch (B, 3, 32, 32)"] --> B["Flatten 成 (B, 3072)"] B --> C["檢查 x.shape[-1] == in_dim"] C --> D{"self.indices[0] 或 self.indices[1] 為 int64?"} D -->|是| E["轉換 self.indices 為 long()"] D -->|否| F["保持 self.indices 原樣"] E --> G["取出 a 和 b"] F --> G G["取出 a 和 b (B, 51200)"] --> H{"訓練模式?"} H -->|是| I["計算 softmax(weights)"] H -->|否| J["計算 one-hot(weights.argmax)"] I --> K["使用 bin_op_s(a, b, softmax(weights))"] J --> K["使用 bin_op_s(a, b, one-hot(weights.argmax))"] K --> L["輸出結果 pred (B, 51200)"] ``` - 實驗 - 先train model. difflogic 的 model 是 *.so - `python experiments/main.py -bs 100 -t 10 --dataset mnist20x20 -ni 200_000 -ef 1_000 -k 8_000 -l 6 --compile_model` - 如果成功產生 so 檔案 , 可以用 `apply_compiled_net.py` 去跑 inference - so 檔案只有 input 跟 op. input 是dataset 決定, op 是透過 softmax 決定機率最高的 - table ```cpp= // | id | Operator | AB=00 | AB=01 | AB=10 | AB=11 | // |----|----------------------|-------|-------|-------|-------| // | 0 | 0 | 0 | 0 | 0 | 0 | // | 1 | A and B | 0 | 0 | 0 | 1 | // | 2 | not(A implies B) | 0 | 0 | 1 | 0 | // | 3 | A | 0 | 0 | 1 | 1 | // | 4 | not(B implies A) | 0 | 1 | 0 | 0 | // | 5 | B | 0 | 1 | 0 | 1 | // | 6 | A xor B | 0 | 1 | 1 | 0 | // | 7 | A or B | 0 | 1 | 1 | 1 | // | 8 | not(A or B) | 1 | 0 | 0 | 0 | // | 9 | not(A xor B) | 1 | 0 | 0 | 1 | // | 10 | not(B) | 1 | 0 | 1 | 0 | // | 11 | B implies A | 1 | 0 | 1 | 1 | // | 12 | not(A) | 1 | 1 | 0 | 0 | // | 13 | A implies B | 1 | 1 | 0 | 1 | // | 14 | not(A and B) | 1 | 1 | 1 | 0 | // | 15 | 1 | 1 | 1 | 1 | 1 | template <typename T> __device__ __forceinline__ T bin_op_eval(const T a_, const T b_, const int op_idx) { switch (op_idx) { case 0: return static_cast<T>(0); case 1: return a_ & b_; case 2: return a_ & ~b_; case 3: return a_; case 4: return b_ & ~a_; case 5: return b_; case 6: return a_ ^ b_; case 7: return a_ | b_; case 8: return ~(a_ | b_); case 9: return ~(a_ ^ b_); case 10: return ~b_; case 11: return ~b_ | a_; case 12: return ~a_; case 13: return ~a_ | b_; case 14: return ~(a_ & b_); default: return ~static_cast<T>(0); } } ``` - logic_layer_cuda_forward_kernel ```cpp= template <typename scalar_t> __global__ void logic_layer_cuda_forward_kernel( torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> x, torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> a, torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> b, torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> w, torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> y ) { for ( // batch dim auto row = blockIdx.x * blockDim.x + threadIdx.x; row < y.size(1); row += blockDim.x * gridDim.x ) { for ( // neuron dim auto col = blockIdx.y * blockDim.y + threadIdx.y; col < y.size(0); col += blockDim.y * gridDim.y ) { const auto idx_a = a[col]; const auto idx_b = b[col]; const auto a_ = x[idx_a][row]; const auto b_ = x[idx_b][row]; const auto w_ = w[col]; y[col][row] = ( ((w_[1] * (a_ * b_) + w_[2] * (a_ - a_ * b_)) + (w_[3] * a_ + w_[4] * (b_ - a_ * b_))) + ((w_[5] * b_ + w_[6] * (a_ + b_ - static_cast<scalar_t>(2) * a_ * b_)) + (w_[7] * (a_ + b_ - a_ * b_) + w_[8] * (static_cast<scalar_t>(1) - (a_ + b_ - a_ * b_))))) + (((w_[9] * (static_cast<scalar_t>(1) - (a_ + b_ - static_cast<scalar_t>(2) * a_ * b_)) + w_[10] * (static_cast<scalar_t>(1) - b_)) + (w_[11] * (static_cast<scalar_t>(1) - b_ + a_ * b_) + w_[12] * (static_cast<scalar_t>(1) - a_))) + (w_[13] * (static_cast<scalar_t>(1) - a_ + a_ * b_) + w_[14] * (static_cast<scalar_t>(1) - a_ * b_) + w_[15]) ); }} } ``` - logic_layer_cuda_backward_w_kernel - logic_layer_cuda_backeard_x_kernel ```cpp= template <typename scalar_t> __global__ void logic_layer_cuda_backward_w_kernel( torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> x, torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> a, torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> b, torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> grad_y, torch::PackedTensorAccessor64<scalar_t, 3, torch::RestrictPtrTraits> grad_w_ ) { const auto row_ = blockIdx.x * blockDim.x + threadIdx.x; for ( // neuron dim auto col = blockIdx.y * blockDim.y + threadIdx.y; col < grad_y.size(0); col += blockDim.y * gridDim.y ) { const auto idx_a = a[col]; const auto idx_b = b[col]; scalar_t grad_w_local_1 = 0; scalar_t grad_w_local_3 = 0; scalar_t grad_w_local_5 = 0; scalar_t grad_w_local_15 = 0; for (int row = row_; row < grad_y.size(1); row += BACKWARD_W_BATCH_THREADS) { // batch dim const auto a_ = x[idx_a][row]; const auto b_ = x[idx_b][row]; const auto grad_y_ = grad_y[col][row]; // compute grad_w grad_w_local_1 += (a_ * b_) * grad_y_; grad_w_local_3 += a_ * grad_y_; grad_w_local_5 += b_ * grad_y_; grad_w_local_15 += grad_y_; } grad_w_[col][row_][0] = grad_w_local_1; grad_w_[col][row_][1] = grad_w_local_3; grad_w_[col][row_][2] = grad_w_local_5; grad_w_[col][row_][3] = grad_w_local_15; } } ``` - logic_layer_cuda_eval_kernel - tensor_packbits_cuda_kernel - groupbitsum_kernel -

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.