---
# System prepended metadata

title: ClipQ

---

# ClipQ
- A flexible and efficient design and implementation of a CNN accelerator with 8-bit CLIP-Q quantization

[TOC]

## Progress 
:::spoiler Milestone
- [x] Set up the environment on the server (140.116.245.115)
- [x] Full precision training of NIN model on CIFAR-10/100
- [x] Fine-tuning with N-bit Clip-Q
- [x] Inference and check precision 
- [x] Save weight to run on FPGA
:::

![](https://i.imgur.com/aR8wQG3.png)

:::spoiler By week
- Week 15 (2020/12/14-12/18)
    - Identified the cause of the Week 14 error by studying PyTorch's autograd API
        - [PyTorch: Defining new autograd functions](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions)
        - [Extending torch.autograd](https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd)
        - [Solution found: Difference between apply an call for an autograd function](https://discuss.pytorch.org/t/difference-between-apply-an-call-for-an-autograd-function/13845)
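            - The API must be changed as follows:
            ```python=
            import torch

            class F_new(torch.autograd.Function):
                @staticmethod
                def forward(ctx, inp, gamma):
                    # Non-tensor arguments are stored on ctx for use in backward
                    ctx.gamma = gamma
                    pass

                @staticmethod
                def backward(ctx, grad_output):
                    # Must return one gradient per forward input (None for gamma)
                    pass

            # Old-style Function (instantiate, then call), now deprecated:
            F(gamma)(inp)
            # New-style Function (static methods, invoked through apply):
            F_new.apply(inp, gamma)
            ```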
            - Fixed in this [commit](https://github.com/WeiCheng14159/caid_clipQ/commit/05e0accf17aaebefc32f54baccd9d739ecf6f4c9) 
    - Fine-tuning with N-bit Clip-Q
        - Best Accuracy: 64.61%
- Week 14 (2020/12/7-12/11)
    - Stuck with the error `RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method.`
- Week 13 (2020/11/30-12/4)
    - Set up the software environment on the server
        - Problem: `pip install -r requirements.txt` fails due to a mismatch with the server's Python environment
            - Solution: Remove the version numbers in `requirements.txt` and install the latest versions
        - Problem: Installation hangs at `Building wheels for collected packages: opencv-python, PyYAML, scandir, visdom, wrapt`
            - Solution: Rerun the command
        - Problem: `numpy.core.multiarray failed to import`
            - Solution: Upgrade pip and reinstall numpy
        - Problem: `The NVIDIA driver on your system is too old (found version 10010).`
            - Reason: The installed torch build targets a newer CUDA version than the NVIDIA driver supports (driver version 10010 corresponds to CUDA 10.1)
            - Solution: Completely uninstall torch and reinstall the build for CUDA 10.1: `conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch`
            - Ref: [Torch official](https://pytorch.org/get-started/previous-versions/)
    - Full-precision training of the NIN model on CIFAR-100 and CIFAR-10
        - ![](https://i.imgur.com/LWNgPiL.png)
        - Command: `python3 main.py --lr 0.1 --opt SGD --full 1 --cifar 100 --epoch 200`
        - Result: `Best Accuracy: 67.46%` 
        - Command: `python3 main.py --lr 0.1 --opt SGD --full 1 --cifar 10 --epoch 200`
        - Result: `Best Accuracy: 89.30%` 
    - Fine-tuning with N-bit Clip-Q
        - Problem: `RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method.`
            - Solution: TBD
- Week 12 (2020/11/23-27)
    - Read project doc
    - Read research paper
    - Read core Verilog code (Draw FSM diagram)
    - Identify core Python code (NIN training & ClipQ)
:::

## Workflow overview
- Full precision training of NIN model on CIFAR-10/100
- Fine-tuning with N-bit Clip-Q
- Inference and check precision 
- Save weight to run on FPGA
## Software 
### NIN model
- Structure
    - Three 3x3 conv layers, each followed by two 1x1 conv layers
    - ![](https://i.imgur.com/Co5tCbx.png)
- Reason
    - Fewer parameters while keeping high accuracy
    - ![](https://i.imgur.com/vVtuAuJ.png)
- Quantized conv layer x9
    - Child class of [torch.nn.quantized](https://pytorch.org/docs/1.7.0/torch.nn.quantized.html?highlight=torch%20nn%20quantized#module-torch.nn.quantized)
- Why 1x1 convolution?
    - From [What does 1x1 convolution mean in a neural network?](https://stats.stackexchange.com/questions/194142/what-does-1x1-convolution-mean-in-a-neural-network) 
    - In terms of Google Inception model
        > Suppose this output is fed into a conv layer with $F_1$ 1x1 filters, zero padding and stride 1 ... So 1x1 conv filters can be used to change the dimensionality in the filter space. If $F_1 > F$ then we are increasing dimensionality, if $F_1 < F$ we are decreasing dimensionality, in the filter dimension.
    - In terms of channel extension/compression
        > A 1x1 convolution simply maps an input pixel with all it's channels to an output pixel, **not looking at anything around itself**. It is often used to **reduce the number of depth channels**, since it is often very slow to multiply volumes with extremely large depths. 

:::spoiler PyTorch code
```python=
import torch.nn as nn

# QConv2d is the project's quantized conv layer (not shown in this excerpt)
class Net(nn.Module):
    def __init__(self,f,cifar,write):
        super(Net, self).__init__()
        self.QCNN = nn.Sequential(
            QConv2d(  3,  96, kernel_size=3, stride=1, padding=1,layer = 1,full=f,w=write),
            QConv2d( 96, 160, kernel_size=1, stride=1, padding=0,layer = 2,full=f),
            QConv2d(160, 192, kernel_size=1, stride=1, padding=0,layer = 3,full=f),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0),

            QConv2d(192, 96 , kernel_size=3, stride=1, padding=1,layer = 4,full=f),
            QConv2d(96 , 192, kernel_size=1, stride=1, padding=0,layer = 5,full=f),
            QConv2d(192, 192, kernel_size=1, stride=1, padding=0,layer = 6,full=f),
            nn.AvgPool2d(kernel_size=2, stride=2, padding=0),

            QConv2d(192, 384, kernel_size=3, stride=1, padding=1,layer = 7,full=f),
            QConv2d(384, 192, kernel_size=1, stride=1, padding=0,layer = 8,full=f),
            QConv2d(192, int(cifar), kernel_size=1, stride=1, padding=0,layer = 9,full=f),
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0),
            )

    def forward(self, x ):
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d) or isinstance(m, nn.BatchNorm1d):
                if hasattr(m.weight, 'data'):
                    m.weight.data.clamp_(min=0.01)
        x = self.QCNN(x)
        x = x.view(x.size(0), -1)
        return x
```
:::
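
The channel-mixing behaviour of a 1x1 convolution described above can be illustrated with a minimal sketch in plain Python. The helper `conv1x1` and the toy data are hypothetical and not part of the project code; the point is only that each output pixel is a weighted sum over the channels of the same input pixel, with no spatial neighbours involved.

```python
def conv1x1(image, weights):
    """Apply a 1x1 convolution (stride 1, no padding, no bias).

    image:   nested list indexed [channel][row][col]
    weights: nested list indexed [out_channel][in_channel]
    Returns a nested list indexed [out_channel][row][col].
    """
    in_ch = len(image)
    rows, cols = len(image[0]), len(image[0][0])
    out = []
    for w_row in weights:  # one 1x1 filter per output channel
        plane = [
            [sum(w_row[c] * image[c][y][x] for c in range(in_ch))
             for x in range(cols)]
            for y in range(rows)
        ]
        out.append(plane)
    return out


# 3 input channels, 2x2 pixels each
image = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
# Two filters: compress 3 channels down to 2
weights = [
    [1, 0, 0],  # output channel 0 copies input channel 0
    [0, 1, 1],  # output channel 1 sums input channels 1 and 2
]
result = conv1x1(image, weights)
print(result)
```

Here three input channels are compressed to two, which corresponds to the $F_1 < F$ case in the quote: the spatial size is unchanged and only the channel dimension shrinks.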

### Clip-Q
- Idea:
    - 1) combines **network pruning** and **weight quantization** in a single learning framework that solves for both weight pruning and quantization jointly 
    - 2) makes flexible pruning and quantization decisions that adapt over time as the network structure changes
    - 3) performs pruning and quantization in parallel with **fine-tuning** the **full-precision weights**.
- Algorithm:
    - ![](https://i.imgur.com/5Lna4OC.png)
    - ![](https://i.imgur.com/HJewJAP.png)

- Implementation: `util_write.py`
:::spoiler ClipQ Python code
```python=
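# Imports this excerpt relies on (`m` is assumed to be the `math` module)
import math as m
import os
import time

import numpy as np
import torch

def ClipQ(self):
    for index in range(self.num_of_params):
        start = time.time()
        x = self.target_modules[index].data.cpu()
        p=0.4  # clipping ratio: fraction of positive / negative weights set to zero
        b=2    # bit width used to partition the remaining weights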
        x1=x.view(-1).numpy()
        x1s=np.sort(x1, axis=None)
        x1arg = np.argsort(x1, axis=None)
        pos = x1s[np.where(x1s>0)]
        pos_arg = x1arg[np.where(x1s>0)]
        neg = x1s[np.where(x1s<0)]
        neg_arg = x1arg[np.where(x1s<0)]

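        # Number of positive / negative weights to clip (the fraction p closest to zero)
        P_Znum = m.ceil(len(pos)*p)
        N_Znum = m.ceil(len(neg)*p)

        # Clipping thresholds: largest clipped positive, most negative clipped negative
        P_max = max(pos[:P_Znum])
        N_min = min(neg[-N_Znum:])

        # Zero the clipped weights
        x1[pos_arg[:P_Znum]]=0
        x1[neg_arg[-N_Znum:]]=0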


        x1s=np.sort(x1, axis=None)
        partb0 = m.pow(2,b-1)-1
        partb1 = m.pow(2,b-1)

        pos = x1s[np.where(x1s>0)]
        neg = x1s[np.where(x1s<0)]

        pos_s = (pos[len(pos)-1] - P_max)/partb0
        neg_s = (neg[0] - N_min)/partb1

        pos = pos - P_max
        neg = neg - N_min

        pos_d = {}
        neg_d = {}

        sum_pos = np.zeros(int(partb0))
        sum_neg = np.zeros(int(partb1))

        num_pos = np.zeros(int(partb0))
        num_neg = np.zeros(int(partb1))

        pos_max = np.zeros(int(partb0))
        neg_min = np.zeros(int(partb1))

        for i in range(int(partb0)):
          pos_d[i] = pos[np.where(np.floor(pos/pos_s) == i)] + P_max

        try:
          pos_d[partb0-1] = np.append(pos_d[partb0-1],[pos[len(pos)-1] + P_max])
        except:
          pos_d[partb0-1] = [pos[len(pos)-1] + P_max]

        for i in range(int(partb1)):
          neg_d[i] = neg[np.where(np.floor(neg/neg_s) == i)] + N_min


        try:
          neg_d[partb1-1] = np.append(neg_d[partb1-1],[neg[0] + N_min])
        except:
          neg_d[partb1-1] = [neg[0] + N_min] 

        pos_avg = {}
        neg_avg = {}

        for i in pos_d.items():
          pos_avg[i[0]] = sum(i[1])/(len(i[1])+0.0000001)
          try:
            pos_max[i[0]] = max(i[1])
          except:
            pass

        for i in neg_d.items():
          neg_avg[i[0]] = sum(i[1])/(len(i[1])+0.0000001)
          try:
            neg_min[i[0]] = min(i[1])
          except:
            pass

        xx1 = x1.copy()
        we = x1.copy()
        realW = []

        for i in range(int(partb0)):
          realW.append(pos_avg[i])
          if i==0:
            x1[np.logical_and(xx1>0, xx1<=pos_max[i])] = pos_avg[i]
            we[np.logical_and(xx1>0, xx1<=pos_max[i])] = 1
          else:
            x1[np.logical_and(xx1>pos_max[i-1], xx1<=pos_max[i])] = pos_avg[i]
            we[np.logical_and(xx1>pos_max[i-1], xx1<=pos_max[i])] = i+1


        for i in range(int(partb1)):
          realW.append(neg_avg[i])
          if i==0:
            x1[np.logical_and(xx1<0, xx1>=neg_min[i])] = neg_avg[i]
            we[np.logical_and(xx1<0, xx1>=neg_min[i])] = -1
          else:
            x1[np.logical_and(xx1<neg_min[i-1], xx1>=neg_min[i])] = neg_avg[i]
            we[np.logical_and(xx1<neg_min[i-1], xx1>=neg_min[i])] = -i-1

        x2 = torch.from_numpy(x1) 
        pa = torch.Tensor(realW)
        x=x2.view(x.size())
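        # 8-bit fixed point: 3 integer bits (including sign) and 5 fractional bits
        num_bits = 8
        num_int = 3
        qmin = -(2.**(num_int - 1))                                 # -4.0
        qmax = qmin + 2.**num_int - 1./(2.**(num_bits - num_int))   # 3.96875
        scale = 1/(2.**(num_bits - num_int))                        # 1/32
        # Truncate toward zero to a multiple of scale
        xx = x - torch.fmod(x,scale)
        pa = pa - torch.fmod(pa,scale)
        # Saturate to the representable range [qmin, qmax]
        xx[xx.le(qmin)] = qmin
        xx[xx.ge(qmax)] = qmax
        pa[pa.le(qmin)] = qmin
        pa[pa.ge(qmax)] = qmax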
        ww = torch.from_numpy(we)
        ww = ww.view(x.size())+2

        if index == 1:  
          if not os.path.exists('./H_data/W2.hex'):
            ch_fileW2(ww,'./H_data/W2.hex')
          if not os.path.exists('./H_data/W8.hex'):
            fileW8(pa,'./H_data/W8.hex',5)

        self.target_modules[index].data = xx.cuda()
```
:::
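
Two steps of the routine above are easy to lose in the NumPy indexing: the clipping of the smallest-magnitude weights and the final 8-bit fixed-point truncation. The following is a simplified sketch in plain Python, not the project implementation. The function names `clip_smallest` and `to_fixed_point` are hypothetical, the partition-and-average step is omitted, and ties at the threshold are all clipped, which may differ slightly from the index-based clipping in `ClipQ`.

```python
import math


def clip_smallest(weights, p):
    """Zero the fraction p of the positive weights and of the negative
    weights that lie closest to zero (the clipping step)."""
    pos = sorted(w for w in weights if w > 0)                  # ascending
    neg = sorted((w for w in weights if w < 0), reverse=True)  # closest to zero first
    n_pos = math.ceil(len(pos) * p)
    n_neg = math.ceil(len(neg) * p)
    c_pos = pos[n_pos - 1] if n_pos else 0.0  # largest clipped positive value
    c_neg = neg[n_neg - 1] if n_neg else 0.0  # most negative clipped value
    return [0.0 if c_neg <= w <= c_pos else w for w in weights]


def to_fixed_point(x, num_bits=8, num_int=3):
    """Truncate x toward zero to a multiple of 2**-(num_bits - num_int),
    then saturate to the signed fixed-point range."""
    scale = 1.0 / 2 ** (num_bits - num_int)  # 1/32 for the defaults
    qmin = -(2.0 ** (num_int - 1))           # -4.0
    qmax = qmin + 2.0 ** num_int - scale     # 3.96875
    truncated = x - math.fmod(x, scale)      # fmod keeps the sign of x
    return min(max(truncated, qmin), qmax)


weights = [0.1, 0.2, 0.3, 0.4, 0.5, -0.1, -0.2, -0.3, -0.4, -0.5]
clipped = clip_smallest(weights, p=0.4)
quantized = [to_fixed_point(w) for w in clipped]
print(clipped)
print(quantized)
print(to_fixed_point(5.0), to_fixed_point(-9.0))  # saturation at both ends
```

With `p=0.4`, two of the five positive weights and two of the five negative weights are zeroed, and the survivors are snapped to multiples of 1/32 as in the `num_bits = 8`, `num_int = 3` section of the original code.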

## Hardware Architecture
### Background
#### Line buffer for conv2d
- The line buffer method was first introduced in [Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing](https://ieeexplore.ieee.org/document/784091)
- ![](https://i.imgur.com/cUNdtUv.png)
- Used in [Going Deeper with Embedded FPGA Platform for Convolutional Neural Network](https://dl.acm.org/doi/10.1145/2847263.2847265)
- ![](https://i.imgur.com/6XByAIM.png)
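
As a rough software analogy of the line buffer idea, the sketch below streams pixels in raster order and keeps only the last K-1 rows plus K pixels, which is the storage a KxK line-buffer convolver needs. This is an illustrative model under simple assumptions (stride 1, no padding, no kernel flip as is usual in CNNs), not a description of either paper's exact design. The name `line_buffer_conv` is hypothetical.

```python
from collections import deque


def line_buffer_conv(pixels, width, kernel):
    """Stream pixels in raster order and yield each valid KxK result.

    Only the last (K-1) full rows plus K pixels are stored, mirroring
    the line buffers and window registers of the hardware scheme.
    """
    k = len(kernel)
    buf = deque(maxlen=(k - 1) * width + k)  # oldest pixel falls out automatically
    for idx, px in enumerate(pixels):
        buf.append(px)
        row, col = divmod(idx, width)
        if row >= k - 1 and col >= k - 1:  # window lies fully inside the image
            acc = 0
            for i in range(k):
                for j in range(k):
                    # buf[i * width + j] holds pixel (row-k+1+i, col-k+1+j)
                    acc += kernel[i][j] * buf[i * width + j]
            yield acc


width = 4
pixels = list(range(16))  # 4x4 image with values 0..15, row by row
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
outputs = list(line_buffer_conv(pixels, width, box))
print(outputs)
```

Each incoming pixel is stored once and reused by every window that covers it, which is the data-reuse property the comparison below refers to.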
#### PE method
- The PE (processing element) method is described in [A high performance FPGA-based accelerator for large-scale convolutional neural networks](https://ieeexplore.ieee.org/document/7577308)
- ![](https://i.imgur.com/nFNvEZV.png)
- ![](https://i.imgur.com/zmY0HTe.png)

#### Comparison
| Method         | Advantage                                  | Disadvantage                          |
| -------------- | ------------------------------------------ | ------------------------------------- |
| Line buffer    | Data reuse for weights (W) and inputs (In) | Not scalable, needs many HW resources |
| PE method      | Scalable                                   | No data reuse for weights (W)         |
### Details
![](https://i.imgur.com/rknRWuu.jpg)
#### Controller `ctrl.v`
- FSM diagram
```graphviz
digraph controller_fsm{
    graph [fontname=Arial];
    node [shape=record,style=filled, fillcolor=aquamarine];
    edge [fontcolor=red];
    // nodes
    s0 [label="S_IDLE"];
    s1 [label="S_READ_PARAM"];
    s2 [label="S_READ_W8"];
    s3 [label="S_READ_W2"];
    s4 [label="S_READ_INPUT"];
    s5 [label="S_EMPTY"];
    s6 [label="S_INPUT_CH"];
    s7 [label="S_FINISH"];

    // edges
    s0->s1 [label="start=1"];
    s1->s2 [label="s1_fin=1"];
    s2->s3 [label="s2_fin=1"];
    s3->s4 [label="s3_fin=1"];
    s4->s5 [label="s4_fin=1"];
    s5->s6 [label="s5_fin=1"];
    s6->s7 [label="s6_fin=1"];
    s6->s3 [label="s6_back=1"];
}
```
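
The state diagram above can also be read as a transition table. The sketch below transcribes the edges into Python so that a state sequence can be traced by hand. It is a reading aid, not a model of `ctrl.v`: the names `TRANSITIONS`, `next_state` and `run` are hypothetical, and staying in the current state when no listed signal fires is an assumption, since the diagram shows no self-loops.

```python
# (current state, asserted signal) -> next state, transcribed from the diagram
TRANSITIONS = {
    ("S_IDLE", "start"): "S_READ_PARAM",
    ("S_READ_PARAM", "s1_fin"): "S_READ_W8",
    ("S_READ_W8", "s2_fin"): "S_READ_W2",
    ("S_READ_W2", "s3_fin"): "S_READ_INPUT",
    ("S_READ_INPUT", "s4_fin"): "S_EMPTY",
    ("S_EMPTY", "s5_fin"): "S_INPUT_CH",
    ("S_INPUT_CH", "s6_fin"): "S_FINISH",
    ("S_INPUT_CH", "s6_back"): "S_READ_W2",
}


def next_state(state, signal):
    """Return the next state; hold the current one if no edge matches (assumption)."""
    return TRANSITIONS.get((state, signal), state)


def run(signals, state="S_IDLE"):
    """Apply a sequence of signals and return every state visited."""
    trace = [state]
    for sig in signals:
        state = next_state(state, sig)
        trace.append(state)
    return trace


# One pass that loops back to S_READ_W2 once before finishing
signals = ["start", "s1_fin", "s2_fin", "s3_fin", "s4_fin", "s5_fin",
           "s6_back", "s3_fin", "s4_fin", "s5_fin", "s6_fin"]
trace = run(signals)
print(" -> ".join(trace))
```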
#### Conv Unit `c_unit.v`
- Diagram
```graphviz
digraph accu_module{
    rankdir=TB;
    // splines=false;
    graph [fontname=Arial];
    node [shape=record,style=filled, 
        fillcolor=aquamarine,fontsize=20.0];
    edge [fontcolor=red, fontsize=20.0];
    
    subgraph cluster_0{
        label="c_unit_0";
        // node
        mult_ [label="*"];
        w_ [label="weight"];
        add_ [label="+"];
        out_prev_ [label="accu_prev"];
        out_ [label="accu"];
        in_ [label="input"];
        // edge
        mult_->add_[label="mul_r[15:0]"];
        w_->mult_[label="w_in[7:0]"];
        in_->mult_[label="in[7:0]"];    
        out_prev_->add_[label="accu_prev[31:0]"];
        add_->out_[label="accu[31:0]"];
        out_->out_prev[label=""];
    };
    
    subgraph cluster_1{
        label="c_unit_1";
        // node
        mult [label="*"];
        w [label="weight"];
        add [label="+"];
        out_prev [label="accu_prev"];
        out [label="accu"];
        in [label="input"];
        out_next [label="accu_next"];
        // edge
        mult->add[label="mul_r[15:0]"];
        w->mult[label="w_in[7:0]"];
        in->mult[label="in[7:0]"];    
        out_prev->add[label="accu_prev[31:0]"];
        add->out[label="accu[31:0]"];
        out->out_next[label=""];
    };
}
```
:::spoiler Verilog code
```verilog=
module c_unit(
  clk,rst,
  l_in,
  d_in,
  en_pu,
  en_in,
  en,
  zero_en,
  w_en,
  w_in,
  d_out,
  w_out
);

input  clk,rst;
input signed [7:0]  d_in;
input signed  [31:0] l_in;

input en_pu;
input en_in;
output reg en;
input zero_en;

input  [7:0]  w_in;
input         w_en;

output reg signed [31:0] d_out;
output reg signed [7:0]  w_out;

wire signed [ 7:0]in;
wire signed [31:0]la;

wire signed [15:0]mul_r;

reg signed [31:0]out_reg;


//w_out
always@(posedge clk or negedge rst)begin
  if(!rst)begin
    w_out <= 0;
  end
  else begin
    if(w_en)
      w_out <= w_in;
  end
end

//in
//assign in = en? d_in :  7'b0;
//assign la = en? l_in : 31'b0;

//mul_r & d_out
assign mul_r = w_out * d_in;

//out_reg
always@(posedge clk or negedge rst)begin
  if(!rst)begin
    out_reg <= 0;
  end
  else begin
    out_reg <= mul_r + l_in;
  end
end

//d_out
always@(*)begin
  d_out = out_reg;
  if(!en)
    d_out = 0;
end

//en
always@(posedge clk or negedge rst)begin
  if(!rst)begin
    en <= 0;
  end
  else begin
    if(zero_en)
      en <= 0;
    else if(en_pu)
      en <= en_in;
  end
end

endmodule
```
:::
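
To follow the register timing in `c_unit`, a cycle-level behavioural model can help. The sketch below mirrors the always-blocks of the Verilog above in plain Python. It is a simplification under stated assumptions: bit widths and signed overflow are ignored, reset is modelled by constructing a new object, and the class name `CUnit` is hypothetical.

```python
class CUnit:
    """Cycle-level model of one c_unit: a registered multiply-accumulate stage."""

    def __init__(self):
        # Register values after reset
        self.w_out = 0    # stored weight
        self.out_reg = 0  # registered product plus incoming partial sum
        self.en = 0       # output enable

    def tick(self, d_in, l_in, w_in=0, w_en=0, en_in=0, en_pu=0, zero_en=0):
        """Advance one clock edge; every register updates from pre-edge values."""
        new_out = self.w_out * d_in + l_in  # uses the weight held before the edge
        if w_en:
            self.w_out = w_in
        if zero_en:        # zero_en has priority over en_pu, as in the Verilog
            self.en = 0
        elif en_pu:
            self.en = en_in
        self.out_reg = new_out

    @property
    def d_out(self):
        """Combinational output: out_reg gated by en."""
        return self.out_reg if self.en else 0


unit = CUnit()
unit.tick(d_in=0, l_in=0, w_in=3, w_en=1, en_in=1, en_pu=1)  # load weight, enable
first = unit.d_out    # weight was still 0 when the product was taken
unit.tick(d_in=4, l_in=5)
second = unit.d_out   # 3 * 4 + 5
unit.tick(d_in=2, l_in=1, zero_en=1)
third = unit.d_out    # out_reg holds 7, but the output is gated off
print(first, second, third)
```

The first tick shows that a newly loaded weight only takes effect one cycle later, because `out_reg` is computed from the weight that was stored before the clock edge.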

:::spoiler 5x5 design diagram
![](https://i.imgur.com/VKJlbkW.png)
:::

- Average Weight Selection TBD