
# ClipQ - A flexible and efficient design and implementation of CNN accelerator with 8-bit CLIP-Q quantization

[TOC]

## Progress

:::spoiler Milestone
- [x] Set up the env on server (140.116.245.115)
- [x] Full precision training of NIN model on CIFAR-10/100
- [x] Fine-tuning with N-bit Clip-Q
- [x] Inference and check precision
- [x] Save weight to run on FPGA
:::

![](https://i.imgur.com/aR8wQG3.png)

:::spoiler By week
- Week 15 (2020/12/14-12/18)
    - Identify the problem by understanding PyTorch
        - [PyTorch: Defining new autograd functions](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions)
        - [Extending torch.autograd](https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd)
        - [Solution found: Difference between apply an call for an autograd function](https://discuss.pytorch.org/t/difference-between-apply-an-call-for-an-autograd-function/13845)
    - API must be changed as follows
    ```python=
    class F_new(torch.autograd.Function):
        @staticmethod
        def forward(ctx, args, gamma):
            ctx.gamma = gamma
            pass

        @staticmethod
        def backward(ctx, args):
            pass

    # Using your old style Function from your code sample:
    F(gamma)(inp)
    # Using the new style Function:
    F_new.apply(inp, gamma)
    ```
    - Fixed in this [commit](https://github.com/WeiCheng14159/caid_clipQ/commit/05e0accf17aaebefc32f54baccd9d739ecf6f4c9)
    - Fine-tuning with N-bit Clip-Q
        - Best Accuracy: 64.61%
- Week 14 (2020/12/7-12/11)
    - Stuck with the error `RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method.`
- Week 13 (2020/11/30-12/4)
    - Set up the SW env on server
        - Problem: `pip install -r requirements.txt` fails because the pinned versions do not match the Python environment
            - Solution: Remove the version numbers in `requirements.txt` and install the latest versions
        - Problem: Install hangs at `Building wheels for collected packages: opencv-python, PyYAML, scandir, visdom, wrapt`
            - Solution: Rerun the command
        - Problem: `numpy.core.multiarray failed to import`
            - Solution: Upgrade pip and reinstall numpy
        - Problem: `The NVIDIA driver on your system is too old (found version 10010).`
            - Reason: The installed torch build does not match the NVIDIA driver version
            - Solution: Completely uninstall torch and reinstall the build for CUDA 10.1: `conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch`
            - Ref: [Torch official](https://pytorch.org/get-started/previous-versions/)
    - Full precision training of NIN model on CIFAR-10/100
        - ![](https://i.imgur.com/LWNgPiL.png)
        - Command: `python3 main.py --lr 0.1 --opt SGD --full 1 --cifar 100 --epoch 200`
            - Result: `Best Accuracy: 67.46%`
        - Command: `python3 main.py --lr 0.1 --opt SGD --full 1 --cifar 10 --epoch 200`
            - Result: `Best Accuracy: 89.30%`
    - Fine-tuning with N-bit Clip-Q
        - Problem: `RuntimeError: Legacy autograd function with non-static forward method is deprecated. Please use new-style autograd function with static forward method.`
        - Solution: TBD
- Week 12 (2020/11/23-27)
    - Read project doc
    - Read research paper
    - Read core Verilog code (Draw FSM diagram)
    - Identify core python code (NIN training & ClipQ)
:::

## Workflow overview

- Full precision training of NIN model on CIFAR-10/100
- Fine-tuning with N-bit Clip-Q
- Inference and check precision
- Save weight to run on FPGA

## Software

### NIN model

- Structure
    - Three 3x3 conv layers, each followed by two 1x1 conv layers
    - ![](https://i.imgur.com/Co5tCbx.png)
- Reason
    - Fewer parameters, high precision
    - ![](https://i.imgur.com/vVtuAuJ.png)
- Quantized conv layer x9
    - Child class of [torch.nn.quantized](https://pytorch.org/docs/1.7.0/torch.nn.quantized.html?highlight=torch%20nn%20quantized#module-torch.nn.quantized)
- Why 1x1 convolution?
    - From [What does 1x1 convolution mean in a neural network?](https://stats.stackexchange.com/questions/194142/what-does-1x1-convolution-mean-in-a-neural-network)
    - In terms of Google Inception model
        > Suppose this output is fed into a conv layer with $F_1$ 1x1 filters, zero padding and stride 1 ... So 1x1 conv filters can be used to change the dimensionality in the filter space. If $F_1 > F$ then we are increasing dimensionality, if $F_1 < F$ we are decreasing dimensionality, in the filter dimension.
    - In terms of channel extension/compression
        > A 1x1 convolution simply maps an input pixel with all it's channels to an output pixel, **not looking at anything around itself**. It is often used to **reduce the number of depth channels**, since it is often very slow to multiply volumes with extremely large depths.
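The channel-mixing view of a 1x1 convolution quoted above can be checked with a few lines of plain Python. This is only an illustrative sketch and is not taken from the project code: the helper name `conv1x1` and the toy tensors `x` and `w` are hypothetical, and the sketch assumes stride 1, no padding, and no bias. Each output pixel is a weighted sum over the channels of the same input pixel, so a filter bank with 2 rows reduces 3 input channels to 2 output channels.

```python
def conv1x1(x, w):
    """1x1 convolution with stride 1, no padding, no bias.

    x: input feature map as nested lists, shape [C_in][H][W]
    w: filter bank as nested lists, shape [C_out][C_in]
    returns: output feature map, shape [C_out][H][W]
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            # Weighted sum over the channels of pixel (i, j) only;
            # neighbouring pixels never contribute.
            [sum(w_o[c] * x[c][i][j] for c in range(c_in)) for j in range(width)]
            for i in range(height)
        ]
        for w_o in w
    ]


# Toy input: 3 channels, 2x2 pixels.
x = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
# Two 1x1 filters: one sums all channels, one passes channel 1 through.
w = [
    [1, 1, 1],
    [0, 1, 0],
]
y = conv1x1(x, w)
print(len(y), y)  # 2 [[[111, 222], [333, 444]], [[10, 20], [30, 40]]]
```

With more filter rows than input channels, the same function increases the channel count instead, which is the $F_1 > F$ case in the first quote.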
:::spoiler pytorch code
```python=
import torch.nn as nn

# QConv2d is the project's quantized conv layer (not shown here).


class Net(nn.Module):
    def __init__(self, f, cifar, write):
        super(Net, self).__init__()
        self.QCNN = nn.Sequential(
            # Block 1: one 3x3 conv followed by two 1x1 convs
            QConv2d(3, 96, kernel_size=3, stride=1, padding=1, layer=1, full=f, w=write),
            QConv2d(96, 160, kernel_size=1, stride=1, padding=0, layer=2, full=f),
            QConv2d(160, 192, kernel_size=1, stride=1, padding=0, layer=3, full=f),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0),

            # Block 2
            QConv2d(192, 96, kernel_size=3, stride=1, padding=1, layer=4, full=f),
            QConv2d(96, 192, kernel_size=1, stride=1, padding=0, layer=5, full=f),
            QConv2d(192, 192, kernel_size=1, stride=1, padding=0, layer=6, full=f),
            nn.AvgPool2d(kernel_size=2, stride=2, padding=0),

            # Block 3: the last 1x1 conv outputs one channel per class
            QConv2d(192, 384, kernel_size=3, stride=1, padding=1, layer=7, full=f),
            QConv2d(384, 192, kernel_size=1, stride=1, padding=0, layer=8, full=f),
            QConv2d(192, int(cifar), kernel_size=1, stride=1, padding=0, layer=9, full=f),
            # 32x32 input -> 8x8 map after two stride-2 pools -> 1x1 per class
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0),
        )

    def forward(self, x):
        # Keep batch-norm scale factors from shrinking below 0.01
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d) or isinstance(m, nn.BatchNorm1d):
                if hasattr(m.weight, 'data'):
                    m.weight.data.clamp_(min=0.01)
        x = self.QCNN(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, num_classes)
        return x
```
:::

### Clip-Q

- Idea:
    - 1) combines **network pruning** and **weight quantization** in a single learning framework that solves for both weight pruning and quantization jointly
    - 2) makes flexible pruning and quantization decisions that adapt over time as the network structure changes
    - 3) performs pruning and quantization in parallel with **fine-tuning** the **full-precision weights**.
- Algorithm:
    - ![](https://i.imgur.com/5Lna4OC.png)
    - ![](https://i.imgur.com/HJewJAP.png)
- Implementation: `util_write.py`

:::spoiler ClipQ python Code
```python=
# Imports assumed from usage below (`m` is taken to be the math module)
import math as m
import os
import time

import numpy as np
import torch


def ClipQ(self):
    for index in range(self.num_of_params):
        start = time.time()
        x = self.target_modules[index].data.cpu()
        p = 0.4  # fraction of weights clipped to zero on each side
        b = 2    # bits per weight index
        x1 = x.view(-1).numpy()
        x1s = np.sort(x1, axis=None)
        x1arg = np.argsort(x1, axis=None)
        pos = x1s[np.where(x1s > 0)]
        pos_arg = x1arg[np.where(x1s > 0)]
        neg = x1s[np.where(x1s < 0)]
        neg_arg = x1arg[np.where(x1s < 0)]

        # Clip: zero the smallest-magnitude fraction p of positive and negative weights
        P_Znum = m.ceil(len(pos) * p)
        N_Znum = m.ceil(len(neg) * p)
        P_max = max(pos[:P_Znum])    # positive clipping threshold
        N_min = min(neg[-N_Znum:])   # negative clipping threshold
        x1[pos_arg[:P_Znum]] = 0
        x1[neg_arg[-N_Znum:]] = 0

        # Partition: split the surviving weights into uniform-width bins
        x1s = np.sort(x1, axis=None)
        partb0 = m.pow(2, b - 1) - 1  # number of positive bins
        partb1 = m.pow(2, b - 1)      # number of negative bins
        pos = x1s[np.where(x1s > 0)]
        neg = x1s[np.where(x1s < 0)]
        pos_s = (pos[len(pos) - 1] - P_max) / partb0  # positive bin width
        neg_s = (neg[0] - N_min) / partb1             # negative bin width (negative)
        pos = pos - P_max
        neg = neg - N_min
        pos_d = {}
        neg_d = {}
        sum_pos = np.zeros(int(partb0))
        sum_neg = np.zeros(int(partb1))
        num_pos = np.zeros(int(partb0))
        num_neg = np.zeros(int(partb1))
        pos_max = np.zeros(int(partb0))
        neg_min = np.zeros(int(partb1))
        for i in range(int(partb0)):
            pos_d[i] = pos[np.where(np.floor(pos / pos_s) == i)] + P_max
        # The largest weight falls on the upper bin edge; put it in the last bin
        try:
            pos_d[partb0 - 1] = np.append(pos_d[partb0 - 1], [pos[len(pos) - 1] + P_max])
        except:
            pos_d[partb0 - 1] = [pos[len(pos) - 1] + P_max]
        for i in range(int(partb1)):
            neg_d[i] = neg[np.where(np.floor(neg / neg_s) == i)] + N_min
        try:
            neg_d[partb1 - 1] = np.append(neg_d[partb1 - 1], [neg[0] + N_min])
        except:
            neg_d[partb1 - 1] = [neg[0] + N_min]

        # Per-bin average (the quantized value) and per-bin boundary
        pos_avg = {}
        neg_avg = {}
        for i in pos_d.items():
            pos_avg[i[0]] = sum(i[1]) / (len(i[1]) + 0.0000001)
            try:
                pos_max[i[0]] = max(i[1])
            except:
                pass
        for i in neg_d.items():
            neg_avg[i[0]] = sum(i[1]) / (len(i[1]) + 0.0000001)
            try:
                neg_min[i[0]] = min(i[1])
            except:
                pass

        # Quantize: replace each weight by its bin average (x1) and record its bin code (we)
        xx1 = x1.copy()
        we = x1.copy()
        realW = []
        for i in range(int(partb0)):
            realW.append(pos_avg[i])
            if i == 0:
                x1[np.logical_and(xx1 > 0, xx1 <= pos_max[i])] = pos_avg[i]
                we[np.logical_and(xx1 > 0, xx1 <= pos_max[i])] = 1
            else:
                x1[np.logical_and(xx1 > pos_max[i - 1], xx1 <= pos_max[i])] = pos_avg[i]
                we[np.logical_and(xx1 > pos_max[i - 1], xx1 <= pos_max[i])] = i + 1
        for i in range(int(partb1)):
            realW.append(neg_avg[i])
            if i == 0:
                x1[np.logical_and(xx1 < 0, xx1 >= neg_min[i])] = neg_avg[i]
                we[np.logical_and(xx1 < 0, xx1 >= neg_min[i])] = -1
            else:
                x1[np.logical_and(xx1 < neg_min[i - 1], xx1 >= neg_min[i])] = neg_avg[i]
                we[np.logical_and(xx1 < neg_min[i - 1], xx1 >= neg_min[i])] = -i - 1
        x2 = torch.from_numpy(x1)
        pa = torch.Tensor(realW)
        x = x2.view(x.size())

        # 8-bit fixed point: 3 integer bits (incl. sign), 5 fractional bits,
        # range [-4, 4 - 1/32]; values are truncated toward zero
        num_bits = 8
        num_int = 3
        qmin = -(2.**(num_int - 1))
        qmax = qmin + 2.**num_int - 1. / (2.**(num_bits - num_int))
        scale = 1 / (2.**(num_bits - num_int))
        xx = x - torch.fmod(x, scale)
        pa = pa - torch.fmod(pa, scale)
        xx[xx.le(qmin)] = qmin
        xx[xx.ge(qmax)] = qmax
        pa[pa.le(qmin)] = qmin
        pa[pa.ge(qmax)] = qmax

        # Offset the bin codes by 2 (non-negative for b = 2)
        ww = torch.from_numpy(we)
        ww = ww.view(x.size()) + 2
        # Dump the 2-bit codes (W2) and the 8-bit bin values (W8) for index 1
        if index == 1:
            if not os.path.exists('./H_data/W2.hex'):
                ch_fileW2(ww, './H_data/W2.hex')
            if not os.path.exists('./H_data/W8.hex'):
                fileW8(pa, './H_data/W8.hex', 5)
        self.target_modules[index].data = xx.cuda()
```
:::

## Hardware Architecture

### Background

#### Line buffer for conv2d

- Line buffer method first introduced in [Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing](https://ieeexplore.ieee.org/document/784091)
    - ![](https://i.imgur.com/cUNdtUv.png)
- Used in [Going Deeper with Embedded FPGA Platform for Convolutional Neural Network](https://dl.acm.org/doi/10.1145/2847263.2847265)
    - ![](https://i.imgur.com/6XByAIM.png)

#### PE method

- PE method [A high performance FPGA-based accelerator for large-scale convolutional neural networks](https://ieeexplore.ieee.org/document/7577308)
    - ![](https://i.imgur.com/nFNvEZV.png)
    - ![](https://i.imgur.com/zmY0HTe.png)

#### Comparison

|                | Advantage                                   | Disadvantage                       |
| -------------- | ------------------------------------------- | ---------------------------------- |
| Line buffer    | Data reuse for weights (W) and inputs (In)  | Not scalable, lots of HW resources |
| PE method      | Scalable                                    | No data reuse for weights (W)      |

### Details

![](https://i.imgur.com/rknRWuu.jpg)

#### Controller `ctrl.v`

- FSM diagram

```graphviz
digraph controller_fsm{
    graph [fontname=Arial];
    node [shape=record,style=filled, fillcolor=aquamarine];
    edge [fontcolor=red];

    // nodes
    s0 [label="S_IDLE"];
    s1 [label="S_READ_PARAM"];
    s2 [label="S_READ_W8"];
    s3 [label="S_READ_W2"];
    s4 [label="S_READ_INPUT"];
    s5 [label="S_EMPTY"];
    s6 [label="S_INPUT_CH"];
    s7 [label="S_FINISH"];

    // edges
    s0->s1 [label="start=1"];
    s1->s2 [label="s1_fin=1"];
    s2->s3 [label="s2_fin=1"];
    s3->s4 [label="s3_fin=1"];
    s4->s5 [label="s4_fin=1"];
    s5->s6 [label="s5_fin=1"];
    s6->s7 [label="s6_fin=1"];
    s6->s3 [label="s6_back=1"];
}
```

#### Conv Unit `c_unit.v`

- Diagram

```graphviz
digraph accu_module{
    rankdir=TB;
    // splines=false;
    graph [fontname=Arial];
    node [shape=record,style=filled, fillcolor=aquamarine,fontsize=20.0];
    edge [fontcolor=red, fontsize=20.0];

    subgraph cluster_0{
        label="c_unit_0";
        // node
        mult_ [label="*"];
        w_ [label="weight"];
        add_ [label="+"];
        out_prev_ [label="accu_prev"];
        out_ [label="accu"];
        in_ [label="input"];
        // edge
        mult_->add_[label="mul_r[15:0]"];
        w_->mult_[label="w_in[7:0]"];
        in_->mult_[label="in[7:0]"];
        out_prev_->add_[label="accu_prev[31:0]"];
        add_->out_[label="accu[31:0]"];
        out_->out_prev[label=""];
    };

    subgraph cluster_1{
        label="c_unit_1";
        // node
        mult [label="*"];
        w [label="weight"];
        add [label="+"];
        out_prev [label="accu_prev"];
        out [label="accu"];
        in [label="input"];
        out_next [label="accu_next"];
        // edge
        mult->add[label="mul_r[15:0]"];
        w->mult[label="w_in[7:0]"];
        in->mult[label="in[7:0]"];
        out_prev->add[label="accu_prev[31:0]"];
        add->out[label="accu[31:0]"];
        out->out_next[label=""];
    };
}
```

:::spoiler Verilog code
```verilog=
module c_unit(
    clk,rst,
    l_in, d_in,
    en_pu, en_in, en, zero_en,
    w_en, w_in,
    d_out, w_out
);

input clk,rst;
input signed [7:0] d_in;     // input activation
input signed [31:0] l_in;    // partial sum from the previous unit
input en_pu;
input en_in;
output reg en;
input zero_en;
input [7:0] w_in;
input w_en;
output reg signed [31:0] d_out;
output reg signed [7:0] w_out;

wire signed [ 7:0]in;
wire signed [31:0]la;
wire signed [15:0]mul_r;
reg signed [31:0]out_reg;

//w_out: latch the weight when w_en is high
always@(posedge clk or negedge rst)begin
    if(!rst)begin
        w_out <= 0;
    end
    else begin
        if(w_en) w_out <= w_in;
    end
end

//in
//assign in = en? d_in : 7'b0;
//assign la = en? l_in : 31'b0;

//mul_r & d_out
assign mul_r = w_out * d_in;

//out_reg: registered multiply-accumulate result
always@(posedge clk or negedge rst)begin
    if(!rst)begin
        out_reg <= 0;
    end
    else begin
        out_reg <= mul_r + l_in;
    end
end

//d_out: forced to zero while the unit is disabled
always@(*)begin
    d_out = out_reg;
    if(!en) d_out = 0;
end

//en
always@(posedge clk or negedge rst)begin
    if(!rst)begin
        en <= 0;
    end
    else begin
        if(zero_en) en <= 0;
        else if(en_pu) en <= en_in;
    end
end

endmodule
```
:::

:::spoiler 5x5 design diagram
![](https://i.imgur.com/VKJlbkW.png)
:::

- Average Weight Selection: TBD
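
For checking expected values against simulation, the register behaviour of `c_unit.v` can be mirrored by a small cycle-level model in Python. This is a sketch under stated assumptions, not part of the project: the names `CUnitModel`, `to_signed`, and `clock` are hypothetical, the asynchronous active-low reset is not modeled (a fresh object stands for the state right after reset), and each call to `clock` stands for one rising clock edge in which all registers update from their pre-edge values.

```python
def to_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given bit width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


class CUnitModel:
    """Cycle-level sketch of c_unit: weight register plus one multiply-accumulate stage."""

    def __init__(self):
        self.w_out = 0    # 8-bit signed weight register
        self.out_reg = 0  # 32-bit signed accumulate register
        self.en = 0       # output enable register

    def clock(self, d_in, l_in, w_in=0, w_en=0, en_pu=0, en_in=0, zero_en=0):
        """Apply one rising clock edge with the given input values."""
        # Combinational product uses the weight stored before this edge.
        mul_r = to_signed(self.w_out * to_signed(d_in, 8), 16)
        next_out = to_signed(mul_r + to_signed(l_in, 32), 32)
        next_w = to_signed(w_in, 8) if w_en else self.w_out
        if zero_en:
            next_en = 0
        elif en_pu:
            next_en = en_in
        else:
            next_en = self.en
        self.w_out, self.out_reg, self.en = next_w, next_out, next_en

    @property
    def d_out(self):
        """Output is the accumulate register, forced to 0 while disabled."""
        return self.out_reg if self.en else 0


unit = CUnitModel()
# Edge 1: load weight 3 and enable the unit.
unit.clock(d_in=0, l_in=0, w_in=3, w_en=1, en_pu=1, en_in=1)
# Edge 2: multiply input 5 by the stored weight and add partial sum 100.
unit.clock(d_in=5, l_in=100)
print(unit.d_out)  # 115
```

Chaining several such objects, with each unit's `d_out` fed into the next unit's `l_in`, gives a software reference for the accumulate chain drawn in the diagram.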