CNN Circuit

# Introduction
- This project aims to construct a two-layer atrous convolution circuit.
- The tasks in **Layer 0** include: replicate padding, atrous convolution, and ReLU activation.
- ![layer0](https://hackmd.io/_uploads/rJaVoSGAke.png)
- The tasks in **Layer 1** include: max pooling and rounding up.
- ![layer1](https://hackmd.io/_uploads/rk3BoBG01x.png)
- According to the requirements, the synthesized area must not exceed 1000 logic elements. This implementation uses only 895 logic elements (2% area utilization).
- The full requirements and additional information are available [here](https://github.com/PoChuan994/Atrous_Convolution_Circuit/blob/main/2023_hw4.pdf), and the implementation can be found in this [GitHub Repository](https://github.com/PoChuan994/Atrous_Convolution_Circuit).

# Implementation
## Finite State Machine (FSM)
- The states are defined as follows:
```verilog=
localparam INIT = 0;
localparam ATCONV_9PIXELS = 1;
localparam LAYER0_WRITERELU = 2;
localparam MAXPOOL_4PIXELS = 3;
localparam LAYER1_WRITECEILING = 4;
localparam FINISH = 5;
```
- The state transition logic is shown below:
```verilog=
always @(*) begin
    case (state)
        INIT: nextState = (ready)? ATCONV_9PIXELS : INIT;
        ATCONV_9PIXELS: nextState = (counter == 4'd9)? LAYER0_WRITERELU : ATCONV_9PIXELS;
        LAYER0_WRITERELU: nextState = (center == 12'd4095)? MAXPOOL_4PIXELS : ATCONV_9PIXELS;
        MAXPOOL_4PIXELS: nextState = (counter == 4'd4)? LAYER1_WRITECEILING : MAXPOOL_4PIXELS;
        LAYER1_WRITECEILING: nextState = (caddr_wr == 12'd1023)? FINISH : MAXPOOL_4PIXELS;
        FINISH: nextState = FINISH;
        default: nextState = INIT;
    endcase
end
```
- The state transition diagram is shown below. After each write state, the FSM returns to the corresponding read state until the last pixel of the layer has been written:
```mermaid
graph TD;
    A((INIT))--->|ready == 1'd0| A
    A--->|ready == 1'd1| B((ATCONV_9PIXELS))
    B--->|counter < 4'd9| B
    B--->|counter == 4'd9| C((LAYER0_WRITERELU))
    C--->|center < 12'd4095| B
    C--->|center == 12'd4095| D((MAXPOOL_4PIXELS))
    D--->|counter < 4'd4| D
    D--->|counter == 4'd4| E((LAYER1_WRITECEILING))
    E--->|caddr_wr < 12'd1023| D
    E--->|caddr_wr == 12'd1023| F((FINISH))
    F---> F
```

## Register and Wire Declarations
### Kernel and bias setting
```verilog=
wire signed [12:0] kernel [1:9];
```
- Declares the atrous convolution kernel with 9 weights. Each weight is a 13-bit signed fixed-point number: the most significant 9 bits are the integer part and the least significant 4 bits are the fractional part.
```verilog=
assign kernel[1] = 13'h1FFF; assign kernel[2] = 13'h1FFE; assign kernel[3] = 13'h1FFF;
assign kernel[4] = 13'h1FFC; assign kernel[5] = 13'h0010; assign kernel[6] = 13'h1FFC;
assign kernel[7] = 13'h1FFF; assign kernel[8] = 13'h1FFE; assign kernel[9] = 13'h1FFF;
```
- Assigns the specified value to each kernel weight.
- ![image](https://hackmd.io/_uploads/BykFyzfR1e.png)
```verilog=
wire signed [12:0] bias;
assign bias = 13'h1FF4;
```
- Declares the bias and assigns its specified value.
- ![image](https://hackmd.io/_uploads/By1aJGfA1l.png)

### Register
```verilog=
//regs
reg [2:0] state, nextState;
reg [11:0] center;          // Coordinate (row, column) = (center[11:6], center[5:0])
reg [3:0] counter;
reg signed [25:0] convSum;  // {mul_integer(18bits), mul_fraction(8bits)}
```
- `state` and `nextState`: Represent the current and the next state of the finite state machine (FSM).
- `center`: Indicates the current index where the atrous convolution is being applied.
    - `center[11:6]`: The y-coordinate (row) of the current index.
    - `center[5:0]`: The x-coordinate (column) of the current index.
- `counter`: A 4-bit register (maximum value = 15). It counts iterations for:
    - 9 cycles in atrous convolution.
    - 4 cycles in max pooling.
- `convSum`: Stores the accumulated convolution result.
    - It is a 26-bit signed register.
    - It uses 18 bits for the integer part and 8 bits for the fractional part to prevent overflow during fixed-point accumulation.

### Constant parameter
```verilog=
localparam LENGTH = 6'd63;
localparam ZERO = 6'd0;
```
- `LENGTH`: A constant with the value 63, used to detect whether the current coordinate is at the image boundary (i.e., the last row or column of a 64×64 image) and as the clamped coordinate at the bottom and right borders.
- `ZERO`: A constant set to 0, used as the clamped coordinate (row 0 or column 0) when a sampling position would fall outside the top or left border.

### Coordinate operation
```verilog=
wire [5:0] cx_add2, cx_minus2, cy_add2, cy_minus2;
assign cy_add2 = center[11:6] + 6'd2;
assign cy_minus2 = center[11:6] - 6'd2;
assign cx_add2 = center[5:0] + 6'd2;
assign cx_minus2 = center[5:0] - 6'd2;
```
- These four wires hold the neighboring pixel coordinates offset by ±2 in the x and y directions. They are mainly used for replicate padding and for determining the sampling positions during atrous convolution (with dilation = 2).

## INIT
```verilog=
INIT: begin
    if (ready) begin
        busy <= 1'd1;
    end
end
```
- When the `ready` signal is high, indicating that the image has been successfully loaded into IMAGE_MEM, the `busy` signal is asserted (set high) to indicate that the convolution process is starting.

## ATCONV_9PIXELS
### Signal Setting
```verilog=
csel <= 1'd0;
crd <= 1'd1;
cwr <= 1'd0;
```
- Select Layer 0 by setting `csel <= 1'd0`.
- Enable memory read by setting `crd <= 1'd1`.
- Disable memory write by setting `cwr <= 1'd0` to prevent unintended memory writes.
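As background for the accumulation performed in this state, the fixed-point widths can be checked with a small worked example. It assumes that `idata` uses the same 13-bit format as the kernel weights (9 integer bits, 4 fractional bits); the pixel value 1.5 (`13'h0018`) is illustrative and not taken from the test data. The weights decode as two's-complement integers divided by $2^4$: `13'h0010` is $16/16 = 1.0$ and `13'h1FFF` is $-1/16 = -0.0625$. Multiplying two values with 4 fractional bits each gives a product with 8 fractional bits, which matches the 8-bit fraction of `convSum`:

$$
1.5 \times 1.0 \;\Rightarrow\; \frac{24}{2^{4}} \times \frac{16}{2^{4}} = \frac{384}{2^{8}} = 1.5
$$

$$
1.5 \times (-0.0625) \;\Rightarrow\; \frac{24}{2^{4}} \times \frac{-1}{2^{4}} = \frac{-24}{2^{8}} = -0.09375
$$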
### Update the next pixel (counter) to be read
```verilog=
if (counter > 4'd0) begin
    convSum <= convSum + idata*kernel[counter];
end
counter <= counter + 4'd1;
```
- Accumulate the convolution sum once the pixel value `idata` becomes available (`counter > 4'd0`).
    - Note: The requested `idata` is available on the clock cycle after the request, so the pixel requested at `counter = k` is multiplied by `kernel[k + 1]`.
- Increment the `counter` to proceed to the next kernel position.

### Update coordinate according to position (counter)
```verilog=
case (counter) // -> for y axis (row)
    0,1,2: iaddr[11:6] <= ((center[11:6] == 6'd0) || (center[11:6] == 6'd1))? ZERO : cy_minus2;
    3,4,5: iaddr[11:6] <= center[11:6];
    6,7,8: iaddr[11:6] <= ((center[11:6] == LENGTH - 6'd1) || (center[11:6] == LENGTH))? LENGTH : cy_add2;
endcase
case (counter) // -> for x axis (column)
    0,3,6: iaddr[5:0] <= ((center[5:0] == 6'd0) || (center[5:0] == 6'd1))? ZERO : cx_minus2;
    1,4,7: iaddr[5:0] <= center[5:0];
    2,5,8: iaddr[5:0] <= ((center[5:0] == LENGTH - 6'd1) || (center[5:0] == LENGTH))? LENGTH : cx_add2;
endcase
```
- Set both the x- and y-coordinates of the next requested pixel address (`iaddr`) for convolution.
- ![CNN circuit](https://hackmd.io/_uploads/S1V8SQf0yl.png)
- The y-coordinate selection logic for the next pixel is shown in the table below:

| `counter` | Y-coordinate (normal case) | Y-coordinate (edge case) |
| :-------: | :------------------------: | :----------------------: |
| 0 ~ 2 | `center[11:6]` - 2 | Row 0 (if `center` is in row 0 or 1) |
| 3 ~ 5 | `center[11:6]` | N/A |
| 6 ~ 8 | `center[11:6]` + 2 | Row 63 (if `center` is in row 62 or 63) |

- The x-coordinate selection logic for the next pixel is shown in the table below:

| `counter` | X-coordinate (normal case) | X-coordinate (edge case) |
| :-------: | :------------------------: | :----------------------: |
| 0, 3, 6 | `center[5:0]` - 2 | Column 0 (if `center` is in column 0 or 1) |
| 1, 4, 7 | `center[5:0]` | N/A |
| 2, 5, 8 | `center[5:0]` + 2 | Column 63 (if `center` is in column 62 or 63) |

## LAYER0_WRITERELU
### Signal Setting
```verilog=
csel <= 1'd0;
crd <= 1'd0;
cwr <= 1'd1;
caddr_wr <= center;
```
- Set `csel <= 1'd0` and `cwr <= 1'd1` to enable writing to *Layer 0 memory*, and disable memory read by setting `crd <= 1'd0`.
- Assign the write address using `caddr_wr <= center`.

### ReLU function
```verilog=
cdata_wr <= (convSum[25])? 13'd0 : convSum[16:4]; // ReLU
```
- The ReLU (Rectified Linear Unit) function is defined as:
$$
\text{ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
0 & \text{otherwise}
\end{cases}
$$
- According to the [specification](https://github.com/PoChuan994/Atrous_Convolution_Circuit/blob/main/2023_hw4.pdf), only 13-bit data should be stored in memory.
- The output value `convSum` is a 26-bit signed fixed-point number (18-bit integer + 8-bit fraction).
- This line applies the ReLU function and truncates the result to 13 bits: if the sign bit `convSum[25]` is set, 0 is written; otherwise `convSum[16:4]` is written, which keeps the lowest 9 integer bits and the highest 4 fractional bits of `convSum`.

### Update and Reinitialize
```verilog=
convSum <= {{9{1'b1}}, bias, 4'd0};
center <= center + 12'd1;
counter <= 4'd0;
```
- Reinitialize `counter` and `convSum` for the next convolution operation.
- `convSum` is preloaded with the bias value: the 13-bit `bias` is sign-extended to 26 bits with nine leading 1s (valid because the bias is negative) and padded with 4 zero fractional bits so that it aligns with the 8-bit fraction of `convSum`.
- Increment `center` to move to the next pixel position for atrous convolution.

## MAXPOOL_4PIXELS
- This step performs the *max pooling* operation, which extracts the maximum value from a 2×2 region of pixels in Layer 0 and writes it into Layer 1, as illustrated below:
- ![maxpooling](https://hackmd.io/_uploads/S1-BNSf01x.png)

### Signal Setting
```verilog=
csel <= 1'd0;
crd <= 1'd1;
cwr <= 1'd0;
```
- Select Layer 0 as the memory source by setting `csel <= 1'd0`, enable memory read with `crd <= 1'd1`, and disable memory write with `cwr <= 1'd0` to prevent unintended writes.

### Update the temporary maximum value and counter (next pixel to be read)
```verilog=
if (counter == 0) begin
    cdata_wr <= 13'd0;
end
else if (cdata_rd > cdata_wr) begin
    cdata_wr <= cdata_rd;
end
counter <= counter + 4'd1;
```
- Increment the `counter`, which records the number of pixels processed and determines the address of the next pixel to read.
- Clear the temporary maximum `cdata_wr` to 0 when `counter` is 0. On later cycles, update it whenever the data read from memory (`cdata_rd`) is greater than the stored maximum.
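The running-maximum update described above can be isolated as a small stand-alone module. This is a minimal sketch, not code from the repository: the module name `running_max` and the ports `clear`, `valid`, `din`, and `max` are hypothetical. It relies on the Layer 0 values being non-negative after ReLU, so an unsigned comparison starting from 0 is sufficient.

```verilog
// Hypothetical stand-alone module; names are illustrative, not from the repository.
module running_max (
    input  wire        clk,
    input  wire        clear,   // start a new 2x2 window
    input  wire        valid,   // din holds a pixel of the window this cycle
    input  wire [12:0] din,     // non-negative after ReLU, so an unsigned compare is enough
    output reg  [12:0] max      // largest value seen since the last clear
);
    always @(posedge clk) begin
        if (clear)
            max <= 13'd0;       // 0 is a safe lower bound for ReLU outputs
        else if (valid && din > max)
            max <= din;         // keep the larger of the stored and incoming value
    end
endmodule
```

In terms of the FSM code, `clear` would correspond to `counter == 0` and `valid` to `counter > 0`, with `din` driven by `cdata_rd` and `max` playing the role of `cdata_wr`.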
### Set the Next Requested Pixel Memory Address
```verilog=
case (counter) // -> for y axis (row)
    0,1: caddr_rd[11:6] <= {center[9:5], 1'd0};
    2,3: caddr_rd[11:6] <= {center[9:5], 1'd1};
endcase
case (counter) // -> for x axis (column)
    0,2: caddr_rd[5:0] <= {center[4:0], 1'd0};
    1,3: caddr_rd[5:0] <= {center[4:0], 1'd1};
endcase
```
- In the y-direction, the y-coordinate of the next pixel to be read from Layer 0 (for max pooling) is determined by the current `counter` value:

| `counter` | y-coordinate in Layer 0 |
| :-------: | :---------------------: |
| 0, 1 | row index of Layer 1 × 2 |
| 2, 3 | row index of Layer 1 × 2 + 1 |

- In the x-direction, the x-coordinate of the next pixel in Layer 0 is determined similarly:

| `counter` | x-coordinate in Layer 0 |
| :-------: | :---------------------: |
| 0, 2 | column index of Layer 1 × 2 |
| 1, 3 | column index of Layer 1 × 2 + 1 |

- These coordinates are computed by bit-shifting the Layer 1 `center` address, where `center[9:5]` is the Layer 1 row and `center[4:0]` is the Layer 1 column. For example, `{center[9:5], 1'b0}` is equivalent to `Layer1_row × 2`.

## LAYER1_WRITECEILING
- This step performs a *round-up* operation and writes the final value into the specified address in Layer 1 memory.
- The memory address to be written is taken from the `center` signal.

### Signal Setting
```verilog=
csel <= 1'd1;
crd <= 1'd0;
cwr <= 1'd1;
caddr_wr <= center;
```
- Select Layer 1 by setting `csel <= 1'd1` and enable the write operation with `cwr <= 1'd1`.
- Disable memory read by setting `crd <= 1'd0`.

### Round up value
```verilog=
cdata_wr <= { cdata_wr[12:4] + {8'd0,|cdata_wr[3:0]} , 4'd0 };
```
- This expression rounds a fixed-point number up to the next integer:
    - `cdata_wr[12:4]`: the integer part of a 13-bit fixed-point number (9 integer bits + 4 fractional bits).
    - `|cdata_wr[3:0]`: reduction OR of the fractional part; it adds 1 to the integer part if any fractional bit is set.
    - The result is padded with `4'd0` to preserve the fixed-point format (aligning with the 13-bit width).
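To check the round-up expression on concrete values, it can be wrapped in a small combinational module with a self-checking testbench. This is a sketch for simulation only: the module name `ceil_q9_4`, the testbench, and the test values are hypothetical and do not come from the repository or its test data.

```verilog
// Hypothetical stand-alone wrapper around the round-up expression.
// Module and port names are illustrative and do not appear in the repository.
module ceil_q9_4 (
    input  wire [12:0] din,   // 9 integer bits + 4 fractional bits
    output wire [12:0] dout   // din rounded up to the next integer
);
    // Add 1 to the integer part when any fractional bit is set,
    // then clear the fractional bits.
    assign dout = { din[12:4] + {8'd0, |din[3:0]}, 4'd0 };
endmodule

// Minimal self-checking testbench (simulation only).
module ceil_q9_4_tb;
    reg  [12:0] din;
    wire [12:0] dout;

    ceil_q9_4 dut (.din(din), .dout(dout));

    // Apply one value and report a mismatch against the expected result.
    task check(input [12:0] value, input [12:0] expected);
        begin
            din = value;
            #1;
            if (dout !== expected)
                $display("FAIL: %h -> %h (expected %h)", value, dout, expected);
        end
    endtask

    initial begin
        check(13'h0153, 13'h0160); // 21.1875 rounds up to 22.0
        check(13'h0150, 13'h0150); // 21.0 is already an integer
        check(13'h0001, 13'h0010); // 0.0625 rounds up to 1.0
        $display("done");
        $finish;
    end
endmodule
```

Because the addition is 9 bits wide, the integer part wraps around if it is already 511 and a fractional bit is set.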
### Update and Reinitialize
```verilog=
center <= center + 12'd1;
counter <= 4'd0;
```
- Move to the next coordinate in Layer 1 by incrementing `center`.
- Reset `counter` to zero for the next max-pooling operation.

## FINISH
```verilog=
busy <= 1'd0;
```
- Setting the `busy` signal to `1'd0` indicates that the processing task has been completed.