# ClipQ
- A flexible and efficient design and implementation of a CNN accelerator with 8-bit CLIP-Q quantization
[TOC]
## Progress
:::spoiler Milestone
- [x] Set up the environment on the server (140.116.245.115)
- [x] Full precision training of NIN model on CIFAR-10/100
- [x] Fine-tuning with N-bit Clip-Q
- [x] Inference and check precision
- [x] Save weight to run on FPGA
:::
![](https://i.imgur.com/aR8wQG3.png)
:::spoiler By week
- Week 15 (2020/12/14-12/18)
    - Identify the problem by understanding PyTorch
        - [PyTorch: Defining new autograd functions](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions)
        - [Extending torch.autograd](https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd)
        - [Solution found: Difference between apply an call for an autograd function](https://discuss.pytorch.org/t/difference-between-apply-an-call-for-an-autograd-function/13845)
    - The API must be changed as follows:
      ```python=
      import torch

      class F_new(torch.autograd.Function):
          @staticmethod
          def forward(ctx, args, gamma):
              # Extra arguments are stored on ctx instead of being passed to __init__
              ctx.gamma = gamma
              pass

          @staticmethod
          def backward(ctx, args):
              # Must return one gradient per forward input (None for gamma)
              pass

      # Using your old style Function from your code sample:
      F(gamma)(inp)
      # Using the new style Function:
      F_new.apply(inp, gamma)
      ```
    - Fixed in this [commit](https://github.com/WeiCheng14159/caid_clipQ/commit/05e0accf17aaebefc32f54baccd9d739ecf6f4c9)
    - Fine-tuning with N-bit Clip-Q
        - Best Accuracy: 64.61%
- Week 14 (2020/12/7-12/11)
    - Stuck on the error `RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method.`
- Week 13 (2020/11/30-12/4)
    - Set up the SW environment on the server
        - Problem: `pip install -r requirements.txt` fails due to a mismatched Python environment
            - Solution: Remove the version numbers in `requirements.txt` and install the latest versions
        - Problem: Installation hangs at `Building wheels for collected packages: opencv-python, PyYAML, scandir, visdom, wrapt`
            - Solution: Rerun the command
        - Problem: `numpy.core.multiarray failed to import`
            - Solution: Upgrade pip and reinstall numpy
        - Problem: `The NVIDIA driver on your system is too old (found version 10010).`
            - Reason: The installed torch build does not match the NVIDIA driver version
            - Solution: Completely uninstall torch and reinstall the build for CUDA 10.1: `conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch`
            - Ref: [Torch official](https://pytorch.org/get-started/previous-versions/)
    - Full precision training of NIN model on CIFAR-10/100
        - ![](https://i.imgur.com/LWNgPiL.png)
        - Command: `python3 main.py --lr 0.1 --opt SGD --full 1 --cifar 100 --epoch 200`
            - Result: `Best Accuracy: 67.46%`
        - Command: `python3 main.py --lr 0.1 --opt SGD --full 1 --cifar 10 --epoch 200`
            - Result: `Best Accuracy: 89.30%`
    - Fine-tuning with N-bit Clip-Q
        - Problem: `RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method.`
        - Solution: TBD
- Week 12 (2020/11/23-27)
    - Read project doc
    - Read research paper
    - Read core Verilog code (draw FSM diagram)
    - Identify core Python code (NIN training & ClipQ)
:::
## Workflow overview
- Full precision training of NIN model on CIFAR-10/100
- Fine-tuning with N-bit Clip-Q
- Inference and check precision
- Save weight to run on FPGA
## Software
### NIN model
- Structure
    - Three 3x3 conv layers, each followed by two 1x1 conv layers
    - ![](https://i.imgur.com/Co5tCbx.png)
- Reason
    - Fewer parameters, high accuracy
    - ![](https://i.imgur.com/vVtuAuJ.png)
- Quantized conv layer x9
    - Child class of [torch.nn.quantized](https://pytorch.org/docs/1.7.0/torch.nn.quantized.html?highlight=torch%20nn%20quantized#module-torch.nn.quantized)
- Why 1x1 convolution?
    - From [What does 1x1 convolution mean in a neural network?](https://stats.stackexchange.com/questions/194142/what-does-1x1-convolution-mean-in-a-neural-network)
    - In terms of the Google Inception model
      > Suppose this output is fed into a conv layer with $F_1$ 1x1 filters, zero padding and stride 1 ... So 1x1 conv filters can be used to change the dimensionality in the filter space. If $F_1 > F$ then we are increasing dimensionality, if $F_1 < F$ we are decreasing dimensionality, in the filter dimension.
    - In terms of channel extension/compression
      > A 1x1 convolution simply maps an input pixel with all its channels to an output pixel, **not looking at anything around itself**. It is often used to **reduce the number of depth channels**, since it is often very slow to multiply volumes with extremely large depths.
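
To make the channel-mixing view concrete, here is a minimal pure-Python sketch. It is not taken from the project code: `conv1x1` and the toy shapes are illustrative assumptions. A 1x1 convolution applies the same $F_1 \times F$ weight matrix to the channel vector of every pixel independently, so the spatial size stays the same while the channel count changes from $F$ to $F_1$.

```python
def conv1x1(image, weights):
    """Apply a 1x1 convolution (hypothetical helper, for illustration only).

    image:   list of rows; each row is a list of pixels; each pixel is a list of F channels
    weights: F_1 x F matrix, one row of F coefficients per output channel
    """
    return [
        [
            [sum(w * c for w, c in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in image
    ]


# Toy 2x2 image with F = 3 channels per pixel.
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 0]],
]
# F_1 = 2 output channels: the channel sum and (first channel - last channel).
weights = [
    [1, 1, 1],
    [1, 0, -1],
]
out = conv1x1(image, weights)
print(out)  # still 2x2 pixels, but each pixel now has 2 channels
```

Since $F_1 < F$ here, this is the "channel compression" case described in the quotes above.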
:::spoiler pytorch code
```python=
import torch.nn as nn

# QConv2d is the project's quantized conv layer (defined elsewhere in the repo).
class Net(nn.Module):
    def __init__(self, f, cifar, write):
        super(Net, self).__init__()
        self.QCNN = nn.Sequential(
            QConv2d(  3,  96, kernel_size=3, stride=1, padding=1, layer=1, full=f, w=write),
            QConv2d( 96, 160, kernel_size=1, stride=1, padding=0, layer=2, full=f),
            QConv2d(160, 192, kernel_size=1, stride=1, padding=0, layer=3, full=f),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0),
            QConv2d(192,  96, kernel_size=3, stride=1, padding=1, layer=4, full=f),
            QConv2d( 96, 192, kernel_size=1, stride=1, padding=0, layer=5, full=f),
            QConv2d(192, 192, kernel_size=1, stride=1, padding=0, layer=6, full=f),
            nn.AvgPool2d(kernel_size=2, stride=2, padding=0),
            QConv2d(192, 384, kernel_size=3, stride=1, padding=1, layer=7, full=f),
            QConv2d(384, 192, kernel_size=1, stride=1, padding=0, layer=8, full=f),
            QConv2d(192, int(cifar), kernel_size=1, stride=1, padding=0, layer=9, full=f),
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0),
        )

    def forward(self, x):
        # Keep batch-norm scale factors away from zero before every forward pass.
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d) or isinstance(m, nn.BatchNorm1d):
                if hasattr(m.weight, 'data'):
                    m.weight.data.clamp_(min=0.01)
        x = self.QCNN(x)
        # Flatten (N, classes, 1, 1) to (N, classes).
        x = x.view(x.size(0), -1)
        return x
```
:::
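As a quick sanity check on the layer table in the `Net` definition, the sketch below counts the convolution weights and tracks the spatial size through the pooling layers. It is a back-of-the-envelope calculation, not project code: the helper names are made up for illustration, only conv weights are counted (whatever bias or batch-norm parameters `QConv2d` holds are ignored), and a 32x32 CIFAR input is assumed.

```python
def nin_layers(num_classes):
    """(in_channels, out_channels, kernel_size) of the nine QConv2d layers."""
    return [
        (3, 96, 3), (96, 160, 1), (160, 192, 1),
        (192, 96, 3), (96, 192, 1), (192, 192, 1),
        (192, 384, 3), (384, 192, 1), (192, num_classes, 1),
    ]


def conv_weight_count(num_classes):
    """Count convolution weights only (assumption: bias/batch-norm ignored)."""
    return sum(cin * cout * k * k for cin, cout, k in nin_layers(num_classes))


def final_feature_size(input_size=32):
    """Spatial size after the three pooling layers."""
    size = input_size      # 3x3 convs use padding=1 and 1x1 convs padding=0, so convs keep the size
    size //= 2             # MaxPool2d(kernel_size=2, stride=2)
    size //= 2             # AvgPool2d(kernel_size=2, stride=2)
    size = size - 8 + 1    # AvgPool2d(kernel_size=8, stride=1)
    return size


print(conv_weight_count(10))    # 1009056
print(conv_weight_count(100))   # 1026336
print(final_feature_size(32))   # 1
```

The final 8x8 average pool collapses each class map to a single value, which is why `forward` can flatten the output directly into one score per class.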
### Clip-Q
- Idea:
    1. Combines **network pruning** and **weight quantization** in a single learning framework that solves for both weight pruning and quantization jointly
    2. Makes flexible pruning and quantization decisions that adapt over time as the network structure changes
    3. Performs pruning and quantization in parallel with **fine-tuning** the **full-precision weights**
- Algorithm:
    - ![](https://i.imgur.com/5Lna4OC.png)
    - ![](https://i.imgur.com/HJewJAP.png)
- Implementation: `util_write.py`
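
The clip, partition, and quantize steps can be sketched on a plain Python list. This is a simplified illustration, not the code in `util_write.py`: the function name `clip_q` is hypothetical, and the bin split (`2^(b-1) - 1` positive bins, `2^(b-1)` negative bins) and the defaults `p=0.4`, `b=2` are assumptions borrowed from the project implementation.

```python
import math


def clip_q(weights, p=0.4, b=2):
    """Return a clipped and quantized copy of weights (simplified sketch)."""
    out = [0.0] * len(weights)
    for sign, n_bins in ((1, 2 ** (b - 1) - 1), (-1, 2 ** (b - 1))):
        # Indices of weights with this sign, ordered by increasing magnitude.
        idx = sorted(
            (i for i, w in enumerate(weights) if sign * w > 0),
            key=lambda i: abs(weights[i]),
        )
        # Clip: the fraction p closest to zero stays at 0.
        n_clip = math.ceil(len(idx) * p)
        kept = idx[n_clip:]
        if not kept or n_bins == 0:
            continue
        # Partition: split [clip threshold, largest magnitude] into equal-width bins.
        lo = abs(weights[idx[n_clip - 1]]) if n_clip else 0.0
        hi = abs(weights[kept[-1]])
        width = (hi - lo) / n_bins
        bins = [[] for _ in range(n_bins)]
        for i in kept:
            if width == 0:
                k = n_bins - 1
            else:
                k = min(int((abs(weights[i]) - lo) / width), n_bins - 1)
            bins[k].append(i)
        # Quantize: every weight in a bin is replaced by the bin average.
        for members in bins:
            if members:
                avg = sum(weights[i] for i in members) / len(members)
                for i in members:
                    out[i] = avg
    return out


w = [0.1, 0.2, 0.5, 0.9, 1.0, -0.1, -0.3, -0.6, -1.0, -1.2]
q = clip_q(w)
print(q)
```

With `b=2` the result uses at most four distinct values (zero, one positive level, two negative levels), which is what a 2-bit code can index.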
:::spoiler ClipQ python Code
```python=
# Indentation reconstructed. Assumes the module imports: os, time, math as m,
# numpy as np, torch, plus the project helpers ch_fileW2 and fileW8.
def ClipQ(self):
    for index in range(self.num_of_params):
        start = time.time()
        x = self.target_modules[index].data.cpu()
        p = 0.4   # fraction of weights clipped to zero on each side
        b = 2     # number of quantization bits
        x1 = x.view(-1).numpy()
        x1s = np.sort(x1, axis=None)
        x1arg = np.argsort(x1, axis=None)
        pos = x1s[np.where(x1s > 0)]
        pos_arg = x1arg[np.where(x1s > 0)]
        neg = x1s[np.where(x1s < 0)]
        neg_arg = x1arg[np.where(x1s < 0)]
        # Clip: zero out the fraction p of positive/negative weights closest to zero.
        P_Znum = m.ceil(len(pos) * p)
        N_Znum = m.ceil(len(neg) * p)
        P_max = max(pos[:P_Znum])
        N_min = min(neg[-N_Znum:])
        x1[pos_arg[:P_Znum]] = 0
        x1[neg_arg[-N_Znum:]] = 0
        x1s = np.sort(x1, axis=None)
        # Partition: partb0 positive bins and partb1 negative bins.
        partb0 = m.pow(2, b - 1) - 1
        partb1 = m.pow(2, b - 1)
        pos = x1s[np.where(x1s > 0)]
        neg = x1s[np.where(x1s < 0)]
        pos_s = (pos[len(pos) - 1] - P_max) / partb0
        neg_s = (neg[0] - N_min) / partb1
        pos = pos - P_max
        neg = neg - N_min
        pos_d = {}
        neg_d = {}
        sum_pos = np.zeros(int(partb0))
        sum_neg = np.zeros(int(partb1))
        num_pos = np.zeros(int(partb0))
        num_neg = np.zeros(int(partb1))
        pos_max = np.zeros(int(partb0))
        neg_min = np.zeros(int(partb1))
        for i in range(int(partb0)):
            pos_d[i] = pos[np.where(np.floor(pos / pos_s) == i)] + P_max
        # The largest weight falls just outside the last bin; add it back.
        try:
            pos_d[partb0 - 1] = np.append(pos_d[partb0 - 1], [pos[len(pos) - 1] + P_max])
        except:
            pos_d[partb0 - 1] = [pos[len(pos) - 1] + P_max]
        for i in range(int(partb1)):
            neg_d[i] = neg[np.where(np.floor(neg / neg_s) == i)] + N_min
        try:
            neg_d[partb1 - 1] = np.append(neg_d[partb1 - 1], [neg[0] + N_min])
        except:
            neg_d[partb1 - 1] = [neg[0] + N_min]
        # Per-bin average (the quantized value) and per-bin boundary.
        pos_avg = {}
        neg_avg = {}
        for i in pos_d.items():
            pos_avg[i[0]] = sum(i[1]) / (len(i[1]) + 0.0000001)
            try:
                pos_max[i[0]] = max(i[1])
            except:
                pass
        for i in neg_d.items():
            neg_avg[i[0]] = sum(i[1]) / (len(i[1]) + 0.0000001)
            try:
                neg_min[i[0]] = min(i[1])
            except:
                pass
        # Quantize: x1 receives the bin averages, we receives the signed bin index.
        xx1 = x1.copy()
        we = x1.copy()
        realW = []
        for i in range(int(partb0)):
            realW.append(pos_avg[i])
            if i == 0:
                x1[np.logical_and(xx1 > 0, xx1 <= pos_max[i])] = pos_avg[i]
                we[np.logical_and(xx1 > 0, xx1 <= pos_max[i])] = 1
            else:
                x1[np.logical_and(xx1 > pos_max[i - 1], xx1 <= pos_max[i])] = pos_avg[i]
                we[np.logical_and(xx1 > pos_max[i - 1], xx1 <= pos_max[i])] = i + 1
        for i in range(int(partb1)):
            realW.append(neg_avg[i])
            if i == 0:
                x1[np.logical_and(xx1 < 0, xx1 >= neg_min[i])] = neg_avg[i]
                we[np.logical_and(xx1 < 0, xx1 >= neg_min[i])] = -1
            else:
                x1[np.logical_and(xx1 < neg_min[i - 1], xx1 >= neg_min[i])] = neg_avg[i]
                we[np.logical_and(xx1 < neg_min[i - 1], xx1 >= neg_min[i])] = -i - 1
        x2 = torch.from_numpy(x1)
        pa = torch.Tensor(realW)
        x = x2.view(x.size())
        # Truncate to 8-bit fixed point (3 integer bits, 5 fractional bits) and saturate.
        num_bits = 8
        num_int = 3
        qmin = -(2.**(num_int - 1))
        qmax = qmin + 2.**num_int - 1. / (2.**(num_bits - num_int))
        scale = 1 / (2.**(num_bits - num_int))
        xx = x - torch.fmod(x, scale)
        pa = pa - torch.fmod(pa, scale)
        xx[xx.le(qmin)] = qmin
        xx[xx.ge(qmax)] = qmax
        pa[pa.le(qmin)] = qmin
        pa[pa.ge(qmax)] = qmax
        # Shift the signed bin indices so they become non-negative codes.
        ww = torch.from_numpy(we)
        ww = ww.view(x.size()) + 2
        # Dump codes and quantized values once (nesting of these checks is reconstructed).
        if index == 1:
            if not os.path.exists('./H_data/W2.hex'):
                ch_fileW2(ww, './H_data/W2.hex')
            if not os.path.exists('./H_data/W8.hex'):
                fileW8(pa, './H_data/W8.hex', 5)
        self.target_modules[index].data = xx.cuda()
```
:::
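The last stage of `ClipQ` maps the averaged weights onto an 8-bit fixed-point grid. The scalar sketch below mirrors that arithmetic with `math.fmod` in place of `torch.fmod`; the function name `to_fixed_point` is hypothetical, while `num_bits=8` and `num_int=3` are the values used in the code above.

```python
import math


def to_fixed_point(x, num_bits=8, num_int=3):
    """Truncate x to a signed fixed-point grid and saturate (illustrative helper)."""
    frac_bits = num_bits - num_int
    scale = 1.0 / (2 ** frac_bits)        # step size: 1/32 for 5 fractional bits
    qmin = -(2.0 ** (num_int - 1))        # -4.0
    qmax = qmin + 2.0 ** num_int - scale  # 3.96875
    q = x - math.fmod(x, scale)           # drop the remainder: truncates toward zero
    return min(max(q, qmin), qmax)        # saturate to [qmin, qmax]


for value in (0.3, -0.3, 5.0, -7.5):
    print(value, "->", to_fixed_point(value))
```

Because `fmod` keeps the sign of its first argument, both positive and negative weights are truncated toward zero rather than rounded to the nearest step.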
## Hardware Architecture
### Background
#### Line buffer for conv2d
- The line buffer method was first introduced in [Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing](https://ieeexplore.ieee.org/document/784091)
    - ![](https://i.imgur.com/cUNdtUv.png)
- Used in [Going Deeper with Embedded FPGA Platform for Convolutional Neural Network](https://dl.acm.org/doi/10.1145/2847263.2847265)
    - ![](https://i.imgur.com/6XByAIM.png)
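
A software model can show why a line buffer gives input reuse. The sketch below is an illustration only (the function names and the 3x3 window are assumptions, and it models neither cited design exactly): pixels arrive one at a time in raster order, and a shift register holding two image lines plus three pixels always contains the full 3x3 window, so each pixel is fetched from memory exactly once.

```python
from collections import deque


def line_buffer_conv3x3(image, kernel):
    """Stream pixels in raster order; emit one output per valid 3x3 window."""
    height, width = len(image), len(image[0])
    # Two full lines plus three pixels cover every tap of a 3x3 window.
    buffer = deque(maxlen=2 * width + 3)
    outputs = []
    for r in range(height):
        for c in range(width):
            buffer.append(image[r][c])      # one new pixel per "clock"
            if r >= 2 and c >= 2:           # window lies fully inside the image
                acc = 0
                for kr in range(3):
                    for kc in range(3):
                        # Tap (kr, kc) sits kr lines and kc pixels after the oldest entry.
                        acc += kernel[kr][kc] * buffer[kr * width + kc]
                outputs.append(acc)
    return outputs


def direct_conv3x3(image, kernel):
    """Reference: index the image directly for every window."""
    height, width = len(image), len(image[0])
    return [
        sum(kernel[kr][kc] * image[r + kr][c + kc]
            for kr in range(3) for kc in range(3))
        for r in range(height - 2)
        for c in range(width - 2)
    ]


image = [[r * 5 + c for c in range(5)] for r in range(4)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
streamed = line_buffer_conv3x3(image, kernel)
print(streamed)
```

The buffer length grows with the image width, which is one way to read the "not scalable" entry in the comparison table.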
#### PE method
- PE method: [A high performance FPGA-based accelerator for large-scale convolutional neural networks](https://ieeexplore.ieee.org/document/7577308)
    - ![](https://i.imgur.com/nFNvEZV.png)
    - ![](https://i.imgur.com/zmY0HTe.png)
#### Comparison
|             | Advantage                                  | Disadvantage                          |
| ----------- | ------------------------------------------ | ------------------------------------- |
| Line buffer | Data reuse for weights (W) and inputs (In) | Not scalable; needs many HW resources |
| PE method   | Scalable                                   | No data reuse for weights (W)         |
### Details
![](https://i.imgur.com/rknRWuu.jpg)
#### Controller `ctrl.v`
- FSM diagram
```graphviz
digraph controller_fsm{
graph [fontname=Arial];
node [shape=record,style=filled, fillcolor=aquamarine];
edge [fontcolor=red];
// nodes
s0 [label="S_IDLE"];
s1 [label="S_READ_PARAM"];
s2 [label="S_READ_W8"];
s3 [label="S_READ_W2"];
s4 [label="S_READ_INPUT"];
s5 [label="S_EMPTY"];
s6 [label="S_INPUT_CH"];
s7 [label="S_FINISH"];
// edges
s0->s1 [label="start=1"];
s1->s2 [label="s1_fin=1"];
s2->s3 [label="s2_fin=1"];
s3->s4 [label="s3_fin=1"];
s4->s5 [label="s4_fin=1"];
s5->s6 [label="s5_fin=1"];
s6->s7 [label="s6_fin=1"];
s6->s3 [label="s6_back=1"];
}
```
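The diagram can also be read as a transition table. The sketch below transcribes its edges into a small Python simulator. It models only the state sequencing, not `ctrl.v` itself; the function names are made up, and the priority when both `s6_fin` and `s6_back` are asserted is an assumption (here `s6_fin` wins).

```python
# (current state, asserted signal) -> next state, transcribed from the FSM diagram.
TRANSITIONS = {
    ("S_IDLE", "start"): "S_READ_PARAM",
    ("S_READ_PARAM", "s1_fin"): "S_READ_W8",
    ("S_READ_W8", "s2_fin"): "S_READ_W2",
    ("S_READ_W2", "s3_fin"): "S_READ_INPUT",
    ("S_READ_INPUT", "s4_fin"): "S_EMPTY",
    ("S_EMPTY", "s5_fin"): "S_INPUT_CH",
    ("S_INPUT_CH", "s6_fin"): "S_FINISH",
    ("S_INPUT_CH", "s6_back"): "S_READ_W2",
}


def next_state(state, signals):
    """Return the next state given the set of signals asserted this cycle."""
    for (src, sig), dst in TRANSITIONS.items():
        if src == state and sig in signals:
            return dst
    return state  # no matching edge: hold the current state


def run(events, state="S_IDLE"):
    """Apply one set of signals per cycle and record the visited states."""
    trace = [state]
    for signals in events:
        state = next_state(state, signals)
        trace.append(state)
    return trace


events = [
    {"start"}, {"s1_fin"}, {"s2_fin"}, {"s3_fin"}, {"s4_fin"}, {"s5_fin"},
    {"s6_back"},                                   # loop back to S_READ_W2
    {"s3_fin"}, {"s4_fin"}, {"s5_fin"}, {"s6_fin"},
]
trace = run(events)
print(" -> ".join(trace))
```

The `s6_back` edge is the only loop in the diagram: it returns to `S_READ_W2`, so `S_READ_PARAM` and `S_READ_W8` are visited once per run.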
#### Conv Unit `c_unit.v`
- Diagram
```graphviz
digraph accu_module{
rankdir=TB;
// splines=false;
graph [fontname=Arial];
node [shape=record,style=filled,
fillcolor=aquamarine,fontsize=20.0];
edge [fontcolor=red, fontsize=20.0];
subgraph cluster_0{
label="c_unit_0";
// node
mult_ [label="*"];
w_ [label="weight"];
add_ [label="+"];
out_prev_ [label="accu_prev"];
out_ [label="accu"];
in_ [label="input"];
// edge
mult_->add_[label="mul_r[15:0]"];
w_->mult_[label="w_in[7:0]"];
in_->mult_[label="in[7:0]"];
out_prev_->add_[label="accu_prev[31:0]"];
add_->out_[label="accu[31:0]"];
out_->out_prev[label=""];
};
subgraph cluster_1{
label="c_unit_1";
// node
mult [label="*"];
w [label="weight"];
add [label="+"];
out_prev [label="accu_prev"];
out [label="accu"];
in [label="input"];
out_next [label="accu_next"];
// edge
mult->add[label="mul_r[15:0]"];
w->mult[label="w_in[7:0]"];
in->mult[label="in[7:0]"];
out_prev->add[label="accu_prev[31:0]"];
add->out[label="accu[31:0]"];
out->out_next[label=""];
};
}
```
:::spoiler Verilog code
```verilog=
module c_unit(
    clk, rst,
    l_in,
    d_in,
    en_pu,
    en_in,
    en,
    zero_en,
    w_en,
    w_in,
    d_out,
    w_out
);
    input clk, rst;
    input signed [7:0] d_in;
    input signed [31:0] l_in;
    input en_pu;
    input en_in;
    output reg en;
    input zero_en;
    input [7:0] w_in;
    input w_en;
    output reg signed [31:0] d_out;
    output reg signed [7:0] w_out;
    wire signed [ 7:0] in;
    wire signed [31:0] la;
    wire signed [15:0] mul_r;
    reg signed [31:0] out_reg;

    // w_out: weight register, loaded when w_en is high (rst is active low)
    always@(posedge clk or negedge rst) begin
        if(!rst) begin
            w_out <= 0;
        end
        else begin
            if(w_en)
                w_out <= w_in;
        end
    end

    //in
    //assign in = en? d_in : 7'b0;
    //assign la = en? l_in : 31'b0;

    // mul_r & d_out
    assign mul_r = w_out * d_in;

    // out_reg: product plus the partial sum from the previous unit
    always@(posedge clk or negedge rst) begin
        if(!rst) begin
            out_reg <= 0;
        end
        else begin
            out_reg <= mul_r + l_in;
        end
    end

    // d_out: gated to 0 while en is low
    always@(*) begin
        d_out = out_reg;
        if(!en)
            d_out = 0;
    end

    // en: cleared by zero_en, otherwise updated from en_in when en_pu is high
    always@(posedge clk or negedge rst) begin
        if(!rst) begin
            en <= 0;
        end
        else begin
            if(zero_en)
                en <= 0;
            else if(en_pu)
                en <= en_in;
        end
    end
endmodule
```
:::
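For checking the RTL by hand, a cycle-level Python model of `c_unit` can be useful. The sketch below is an approximation written from the Verilog above, not a verified golden model: the class and helper names are hypothetical, reset is represented by constructing a fresh object, and chaining two units through `d_out` and `l_in` is an assumption based on the diagram.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class CUnit:
    """Behavioral model of c_unit: weight register, MAC register, enable flag."""

    def __init__(self):
        self.w_out = 0     # signed 8-bit weight register
        self.out_reg = 0   # signed 32-bit accumulator register
        self.en = 0        # output-enable flag

    @property
    def d_out(self):
        # Combinational output: forced to 0 while en is low.
        return self.out_reg if self.en else 0

    def clock(self, d_in=0, l_in=0, w_in=0, w_en=0, en_pu=0, en_in=0, zero_en=0):
        """One rising clock edge; every register updates from the old values."""
        mul_r = wrap_signed(self.w_out * wrap_signed(d_in, 8), 16)
        next_out = wrap_signed(mul_r + l_in, 32)
        next_w = wrap_signed(w_in, 8) if w_en else self.w_out
        if zero_en:
            next_en = 0
        elif en_pu:
            next_en = en_in
        else:
            next_en = self.en
        self.w_out, self.out_reg, self.en = next_w, next_out, next_en


# Two chained units computing 3*5 + (-2)*4 with a one-cycle skew.
u0, u1 = CUnit(), CUnit()
u0.clock(w_in=3, w_en=1, en_pu=1, en_in=1)    # cycle 1: load weights, enable outputs
u1.clock(w_in=-2, w_en=1, en_pu=1, en_in=1)
prev = u0.d_out                               # cycle 2: u1 sees u0's old output (0)
u0.clock(d_in=5)
u1.clock(d_in=7, l_in=prev)
prev = u0.d_out                               # cycle 3: u0's product 15 reaches u1
u0.clock(d_in=0)
u1.clock(d_in=4, l_in=prev)
print(u1.d_out)  # 7
```

Sampling `u0.d_out` before clocking either unit mimics non-blocking assignment: both registers update from values that were stable before the edge.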
:::spoiler 5x5 design diagram
![](https://i.imgur.com/VKJlbkW.png)
:::
- Average Weight Selection TBD