可微邏輯門網絡（Differentiable Logic Gate Networks, DLGNs）

paper link
https://arxiv.org/abs/2210.08277
https://arxiv.org/html/2411.04732v1
Learning with Differentiable Algorithms (ch6)
https://arxiv.org/abs/2209.00616
paper video
https://www.youtube.com/watch?v=VulM891DhI4
paper
用 diff logic git做實驗
整理問題
read source code

跟老師討論

input 的資料型別

difflogic 會先做 normalize 把 input 壓到 [-1, 1], type 是 float32, 不過論文有提到用 bf16.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

forward

input 打包的地方有點問題… 看作者應該是想把 9216 個 bit 打包到 long long 的 array, d 應該要是 d < 9216 / 64..
logic_gate_net 是作者用 Python 用 script 產生的 function, 其中可以看到 inp 跟 out, inp 就是 image 攤平後的 array, out 是算完的結果. 其中 e.g., inp[161] | inp[160], 中間的op 就是作者用 fuzzy logic train 出來的.
從 logic_gate_net 拿到 out 會再把內容做group sum, 從 paper 看起來是把 out 裡面的value 去做 bit_count 的運算
看起來作者有把 logic gate 的邏輯用 c 語言實現, 有看數位邏輯跟計算機組織的課程, 感覺是實現一個 half adder, 織的課程, 感覺是實現一個 half adder, 電路跟 c 的邏輯差不多都是用 and 跟 xor 去實現,

這邊還不知道實作 half adder 的用途…











































































void logic_gate_net(long long const *inp, long long *out) {
        const long long v0 = ~inp[161] | inp[160];
        const long long v1 = ~inp[5595];
        const long long v2 = inp[4072] | inp[4073];
        const long long v3 = ~(inp[2259] & inp[2261]);
        const long long v4 = ~(inp[859] & inp[860]);
        const long long v5 = inp[149];
        ......
        ....
        out[11991] = ~v34183;
        out[11992] = v30818;
        out[11993] = ~v33111;
        out[11994] = ~v27064;
        out[11995] = v32836;
        out[11996] = v24432 & ~v24433;
        out[11997] = ~0LL;
        out[11998] = ~v33359;
        out[11999] = v31983 ^ v31984;
}
   
void apply_logic_gate_net (bool const *inp, int *out, size_t len) {
    long long *inp_temp = malloc(9216*sizeof(long long));
    long long *out_temp = malloc(12000*sizeof(long long));
    long long *out_temp_o = malloc(11*sizeof(long long));

    for(size_t i = 0; i < len; ++i) {
        // input 是 3*3*32*32 個 bit, 會把64 個 bit 打包到 long long 裡面. 可以節省空間, 可以看到 input 是 bool* inp.
        // Converting the bool array into a bitpacked array
        for(size_t d = 0; d < 9216; ++d) {
            long long res = 0LL;
            for(size_t b = 0; b < 64; ++b) {
                res <<= 1;
                res += !!(inp[i * 9216 * 64 + (64 - b - 1) * 9216 + d]);
            }
            inp_temp[d] = res;
        }

        // Applying the logic gate net
        logic_gate_net(inp_temp, out_temp);

        // GroupSum of the results via logic gate networks
        for(size_t c = 0; c < 10; ++c) {  // for each class
            // Initialize the output bits
            for(size_t d = 0; d < 11; ++d) {
                out_temp_o[d] = 0LL;
            }

            // Apply the adder logic gate network
            for(size_t a = 0; a < 1200; ++a) {
                long long carry = out_temp[c * 1200 + a];
                long long out_temp_o_d;
                for(int d = 11 - 1; d >= 0; --d) {
                    out_temp_o_d  = out_temp_o[d];
                    out_temp_o[d] = carry ^ out_temp_o_d;
                    carry         = carry & out_temp_o_d;
                }
            }

            // Unpack the result bits
            for(size_t b = 0; b < 64; ++b) {
                const long long bit_mask = 1LL << b;
                int res = 0;
                for(size_t d = 0; d < 11; ++d) {
                    res <<= 1;
                    res += !!(out_temp_o[d] & bit_mask);
                }
                out[(i * 64 + b) * 10 + c] = res;
            }
        }
    }
    free(inp_temp);
    free(out_temp);
    free(out_temp_o);
}

backward

作者是用 torch 去做訓練的, 從下面的code, 作者有另外實作 model 的 class, 只要在 model 裡面有實作 forward , pytorch 就會根據 forward 的結果去計算梯度. 可以參考下面的第二行.









def train(model, x, y, loss_fn, optimizer):
    x = model(x)
    loss = loss_fn(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()
    - forward 可以參考下面的 code,

forward 有分 python 跟 cuda 版本的, 目前先看 python 版本的實作, 可以看到一開始會先從 image 的 arr, 各別隨機選出 a 跟 b 根據 op 的 table 去算出
















 def forward_python(self, x):
        assert x.shape[-1] == self.in_dim, (x[0].shape[-1], self.in_dim)

        if self.indices[0].dtype == torch.int64 or self.indices[1].dtype == torch.int64:
            print(self.indices[0].dtype, self.indices[1].dtype)
            self.indices = self.indices[0].long(), self.indices[1].long()
            print(self.indices[0].dtype, self.indices[1].dtype)

        a, b = x[..., self.indices[0]], x[..., self.indices[1]]
        if self.training:
            x = bin_op_s(a, b, torch.nn.functional.softmax(self.weights, dim=-1))
        else:
            weights = torch.nn.functional.one_hot(self.weights.argmax(-1), 16).to(torch.float32)
            x = bin_op_s(a, b, weights)
        return x

下面是簡單的流程圖.

loss function.

看 source code 是用 cross entropy.

可以用在 bitnet?

目前我的想法應該是不可行, 因為作者的做法是用 logic gate 的op 去取代 conv, max pool. 而bitnet 已經把資料壓縮成 bit並且運算已經用 accumulate sum 去取代矩陣乘法. 理論上用硬體做應該已經最快了, 優化的空間可能也有限.

背景

介紹傳統邏輯門（如 AND、OR、XOR 等）在神經網絡中的應用。

挑戰：傳統的邏輯門無法直接進行反向傳播和梯度計算，這使得它們無法直接用於深度學習模型中。
貢獻：提出可微邏輯門網絡（DLGNs），該網絡可以將邏輯門運算符轉化為可微分形式，並能夠進行有效的梯度下降訓練。
主要成果：展示了在邏輯運算任務（如 XOR 和 AND）中，DLGN 可以學習邏輯關係，並有效地進行訓練。
cnn vs LGN
- cnn: input, receptive field, weight
- lgn
logical or pooling

邏輯門的可微化

引入實值邏輯運算符（例如 T-norms 和 T-conorms）fuzzy logic, 來代替傳統的布爾邏輯。
T-norms, T-conorms
16 logic operator table
模型結構：DLGN 模型的結構，包括邏輯門層、加權邏輯運算符、輸入和輸出層等。使用 softmax 來選擇每層的邏輯運算符，使得每個邏輯操作的選擇成為一個概率分佈。
加權平均讓logic net 可以微分
aggregate of output neutron

example

條件設置

梯度計算步驟

更新參數

可微性保證：詳細介紹如何保證每個邏輯操作在訓練過程中是可微的，並且支持基於梯度的優化方法（如 Adam）。
























               +------------------+
               |  Input Layer     |
               |   (a, b)         |
               +------------------+
                        |
                        |
         +-------------------------------------+
         |    Softmax Probabilities           |
         |   (選擇每個邏輯運算符的概率)        |
         +-------------------------------------+
                        |
                        v
      +-----------------------------------------------+
      |    Logic Gate Layer (16 possible operations)  |
      |  Operators: AND, OR, XOR, etc.                |
      +-----------------------------------------------+
                        |
                        v
               +------------------+
               | Output Layer     |
               | (final result)   |
               +------------------+

┌─────────────────────────────┐
│          Dataset            │
│  (e.g., CIFAR-10, MNIST)    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│      Data Loader (PyTorch)  │
│ - 加載並批次化資料           │
│ - 應用轉換 (Transform)      │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│          Model              │
│  (由多層 LogicLayer 和      │
│    一個 GroupSum 組成)      │
└──────────────┬──────────────┘
               │ Forward Pass
               ▼
┌─────────────────────────────┐
│       LogicLayer (CUDA)     │
│ - 前向傳播使用 CUDA kernel   │
│ - 計算邏輯運算的加權和       │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│         GroupSum            │
│ - 將邏輯層輸出分組求和       │
│ - 產生最終的 logits         │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│          Loss Function      │
│ - 計算預測與真實標籤之間的損失│
└──────────────┬──────────────┘
               │ Backward Pass
               ▼
┌─────────────────────────────┐
│       Backward Pass         │
│ - 反向傳播通過 CUDA kernel   │
│ - 計算梯度 (對輸入和權重)    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│       Optimizer (e.g., Adam) │
│ - 更新模型參數 (權重)        │
└──────────────┴──────────────┘

邏輯運算符選擇和計算流程
在每一層中，我們使用 softmax 來選擇每個邏輯運算符的選擇概率。每個邏輯運算符的概率會根據梯度下降優化來自動調整，選擇最適合的邏輯運算符。
邏輯運算符的選擇和加權計算
例如，如果選擇了 AND，則對應的運算是 a * b；如果選擇了 OR，則對應的運算是 a + b - a * b。選擇的運算符會根據 softmax 輸出的概率進行加權。可以將其看作是從 16 個邏輯運算符中選擇一個來進行計算。

 Input:   a = 0.7, b = 0.8

    |  AND  |  OR   |  XOR  |  ...   |
    |-------|-------|-------|--------|
    | 0.4   | 0.3   | 0.2   | 0.1    |  <- Softmax probabilities

  Selected logic operator: AND (because it has the highest probability)

    Calculation:
      AND: a * b = 0.7 * 0.8 = 0.56

        Input a = 0.7, b = 0.8
               |
               v
    +-----------------------+
    | softmax(p)            |  <-- 概率分佈，選擇每個邏輯運算符的權重
    +-----------------------+
               |
               v
    +----------------------------+
    |  AND (p=0.4)  OR (p=0.3)   |  <-- 選擇 AND 操作，p=0.4 最大
    |  XOR (p=0.2)  ...           |
    +----------------------------+
               |
               v
    +------------------------+
    | Logic Operation        |
    | AND: a * b = 0.7 * 0.8 |
    +------------------------+
               |
               v
    +-------------------------+
    | Output: 0.56            |
    +-------------------------+

limitation

邏輯門數量的增長特性
表格右側的標註指出，隨著數據集的複雜度增加，邏輯門數量以指數速度增長。例如：
在 CIFAR-10 的模型中，每層包含 1,024,000 個邏輯門，這顯著高於小模型的 12,000。
參數總數（Total num. of p.）也隨之顯著增加，從幾萬增長到數百萬。

研究成果

result (binaried mnist)
mnist

https://drive.google.com/file/d/1Q3EfJ_Ld5ic_43HKd6-ojjaNEkC6lybm/view?usp=drive_link

mnist
cifar 10

source code

forward Flowchart
- flatten pic
- 從 picture 的 tensor 各選出兩個 indice
  - a: [0…51200]
  - b: [0…51200]
  - a 跟 b 會先根據 table 算 16 次, 再透過 random 的 weight 算出加權和

實驗
先train model. difflogic 的 model 是 *.so
python experiments/main.py -bs 100 -t 10 --dataset mnist20x20 -ni 200_000 -ef 1_000 -k 8_000 -l 6 --compile_model
如果成功產生 so 檔案 , 可以用 apply_compiled_net.py 去跑 inference
so 檔案只有 input 跟 op. input 是dataset 決定, op 是透過 softmax 決定機率最高的
table























































// | id | Operator             | AB=00 | AB=01 | AB=10 | AB=11 |
// |----|----------------------|-------|-------|-------|-------|
// | 0  | 0                    | 0     | 0     | 0     | 0     |
// | 1  | A and B              | 0     | 0     | 0     | 1     |
// | 2  | not(A implies B)     | 0     | 0     | 1     | 0     |
// | 3  | A                    | 0     | 0     | 1     | 1     |
// | 4  | not(B implies A)     | 0     | 1     | 0     | 0     |
// | 5  | B                    | 0     | 1     | 0     | 1     |
// | 6  | A xor B              | 0     | 1     | 1     | 0     |
// | 7  | A or B               | 0     | 1     | 1     | 1     |
// | 8  | not(A or B)          | 1     | 0     | 0     | 0     |
// | 9  | not(A xor B)         | 1     | 0     | 0     | 1     |
// | 10 | not(B)               | 1     | 0     | 1     | 0     |
// | 11 | B implies A          | 1     | 0     | 1     | 1     |
// | 12 | not(A)               | 1     | 1     | 0     | 0     |
// | 13 | A implies B          | 1     | 1     | 0     | 1     |
// | 14 | not(A and B)         | 1     | 1     | 1     | 0     |
// | 15 | 1                    | 1     | 1     | 1     | 1     |

template <typename T> __device__ __forceinline__ T bin_op_eval(const T a_, const T b_, const int op_idx) {
    switch (op_idx) {
    case 0:
        return static_cast<T>(0);
    case 1:
        return a_ & b_;
    case 2:
        return a_ & ~b_;
    case 3:
        return a_;
    case 4:
        return b_ & ~a_;
    case 5:
        return b_;
    case 6:
        return a_ ^ b_;
    case 7:
        return a_ | b_;
    case 8:
        return ~(a_ | b_);
    case 9:
        return ~(a_ ^ b_);
    case 10:
        return ~b_;
    case 11:
        return ~b_ | a_;
    case 12:
        return ~a_;
    case 13:
        return ~a_ | b_;
    case 14:
        return ~(a_ & b_);
    default:
        return ~static_cast<T>(0);
    }
}

logic_layer_cuda_forward_kernel

















































template <typename scalar_t>
__global__ void logic_layer_cuda_forward_kernel(
    torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> x,
    torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> a,
    torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> b,
    torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> w,
    torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> y
) {

    for (  // batch dim
        auto row = blockIdx.x * blockDim.x + threadIdx.x;
        row < y.size(1);
        row += blockDim.x * gridDim.x
    ) {
        for (  // neuron dim
            auto col = blockIdx.y * blockDim.y + threadIdx.y;
            col < y.size(0);
            col += blockDim.y * gridDim.y
        ) {

            const auto idx_a = a[col];
            const auto idx_b = b[col];
            const auto a_ = x[idx_a][row];
            const auto b_ = x[idx_b][row];

            const auto w_ = w[col];

            y[col][row] = (
                 ((w_[1] * (a_ * b_)
                 + w_[2] * (a_ - a_ * b_))
                + (w_[3] * a_
                 + w_[4] * (b_ - a_ * b_)))
               + ((w_[5] * b_
                 + w_[6] * (a_ + b_ - static_cast<scalar_t>(2) * a_ * b_))
                + (w_[7] * (a_ + b_ - a_ * b_)
                 + w_[8] * (static_cast<scalar_t>(1) - (a_ + b_ - a_ * b_)))))
              + (((w_[9] * (static_cast<scalar_t>(1) - (a_ + b_ - static_cast<scalar_t>(2) * a_ * b_))
                 + w_[10] * (static_cast<scalar_t>(1) - b_)) +
                  (w_[11] * (static_cast<scalar_t>(1) - b_ + a_ * b_)
                 + w_[12] * (static_cast<scalar_t>(1) - a_))) +
                  (w_[13] * (static_cast<scalar_t>(1) - a_ + a_ * b_)
                 + w_[14] * (static_cast<scalar_t>(1) - a_ * b_)
                 + w_[15])
            );
    }}
}

logic_layer_cuda_backward_w_kernel
logic_layer_cuda_backeard_x_kernel










































template <typename scalar_t>
__global__ void
logic_layer_cuda_backward_w_kernel(
    torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> x,
    torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> a,
    torch::PackedTensorAccessor64<int64_t, 1, torch::RestrictPtrTraits> b,
    torch::PackedTensorAccessor64<scalar_t, 2, torch::RestrictPtrTraits> grad_y,
    torch::PackedTensorAccessor64<scalar_t, 3, torch::RestrictPtrTraits> grad_w_
) {

    const auto row_ = blockIdx.x * blockDim.x + threadIdx.x;

    for (  // neuron dim
        auto col = blockIdx.y * blockDim.y + threadIdx.y;
        col < grad_y.size(0);
        col += blockDim.y * gridDim.y
    ) {
        const auto idx_a = a[col];
        const auto idx_b = b[col];
        scalar_t grad_w_local_1 = 0;
        scalar_t grad_w_local_3 = 0;
        scalar_t grad_w_local_5 = 0;
        scalar_t grad_w_local_15 = 0;
        for (int row = row_; row < grad_y.size(1); row += BACKWARD_W_BATCH_THREADS) {  // batch dim
            const auto a_ = x[idx_a][row];
            const auto b_ = x[idx_b][row];
            const auto grad_y_ = grad_y[col][row];

            // compute grad_w
            grad_w_local_1 += (a_ * b_) * grad_y_;
            grad_w_local_3 += a_ * grad_y_;
            grad_w_local_5 += b_ * grad_y_;
            grad_w_local_15 += grad_y_;
        }

        grad_w_[col][row_][0] = grad_w_local_1;
        grad_w_[col][row_][1] = grad_w_local_3;
        grad_w_[col][row_][2] = grad_w_local_5;
        grad_w_[col][row_][3] = grad_w_local_15;
    }
}

logic_layer_cuda_eval_kernel
tensor_packbits_cuda_kernel
groupbitsum_kernel

可微邏輯門網絡（Differentiable Logic Gate Networks, DLGNs）

跟老師討論

input 的資料型別

forward

backward

可以用在 bitnet?

背景

邏輯門的可微化

example

limitation

研究成果

source code

Read more

linux2023: https://github.com/Nigelwz

TPO1

Untitled

OSEK