# Introduction
- This project aims to construct a two-layer atrous convolution circuit.
- The tasks in **Layer 0** include: replicate padding, atrous convolution, and ReLU activation.
- The tasks in **Layer 1** include: max pooling and rounding up.
- The requirement limits the synthesized area to 1000 logic elements; this implementation uses 895 logic elements, which occupy 2% of the area.
- The full requirements and additional information are available in the [specification](https://github.com/PoChuan994/Atrous_Convolution_Circuit/blob/main/2023_hw4.pdf), and the implementation can be found in this [GitHub Repository](https://github.com/PoChuan994/Atrous_Convolution_Circuit).
# Implementation
## Finite State Machine (FSM)
- The states are defined as follows:
```verilog=
localparam INIT = 0;
localparam ATCONV_9PIXELS = 1;
localparam LAYER0_WRITERELU = 2;
localparam MAXPOOL_4PIXELS = 3;
localparam LAYER1_WRITECEILING = 4;
localparam FINISH = 5;
```
- The state transition logic is shown below:
```verilog=
always @(*) begin
    case (state)
        INIT:                nextState = (ready)? ATCONV_9PIXELS : INIT;
        ATCONV_9PIXELS:      nextState = (counter == 4'd9)? LAYER0_WRITERELU : ATCONV_9PIXELS;
        LAYER0_WRITERELU:    nextState = (center == 12'd4095)? MAXPOOL_4PIXELS : ATCONV_9PIXELS;
        MAXPOOL_4PIXELS:     nextState = (counter == 4'd4)? LAYER1_WRITECEILING : MAXPOOL_4PIXELS;
        LAYER1_WRITECEILING: nextState = (caddr_wr == 12'd1023)? FINISH : MAXPOOL_4PIXELS;
        FINISH:              nextState = FINISH;
        default:             nextState = INIT;
    endcase
end
```
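- The combinational block above only computes `nextState`; a clocked process must load it into `state`. The following is a minimal sketch, assuming a clock named `clk` and an active-high asynchronous `reset` (the module name and both signal names are hypothetical and do not appear in the excerpt above):
```verilog=
// Hypothetical wrapper: the module name, clk, and reset are assumptions.
module fsm_state_reg (
    input  wire       clk,
    input  wire       reset,      // assumed active-high, asynchronous
    input  wire [2:0] nextState,  // driven by the combinational block above
    output reg  [2:0] state
);
    localparam INIT = 3'd0;

    always @(posedge clk or posedge reset) begin
        if (reset) state <= INIT;       // start in INIT after reset
        else       state <= nextState;  // take one transition per clock
    end
endmodule
```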
- The state transition diagram is shown below:
```mermaid
graph TD;
A((INIT))--->|ready == 1'd0| A
A--->|ready == 1'd1| B((ATCONV_9PIXELS))
B--->|counter < 4'd9| B
B--->|counter == 4'd9| C((LAYER0_WRITERELU))
C--->|center < 12'd4095| B
C--->|center == 12'd4095| D((MAXPOOL_4PIXELS))
D--->|counter < 4'd4| D
D--->|counter == 4'd4| E((LAYER1_WRITECEILING))
E--->|caddr_wr < 12'd1023| D
E--->|caddr_wr == 12'd1023| F((FINISH))
F---> F
```
## Registers & Wires Declaration
### Kernel and bias setting
```verilog=
wire signed [12:0] kernel [1:9];
```
- Declares the Atrous convolution kernel with 9 weights. Each weight is a 13-bit signed fixed-point number, where the most significant 9 bits represent the integer part, and the least significant 4 bits represent the fractional part.
```verilog=
assign kernel[1] = 13'h1FFF; assign kernel[2] = 13'h1FFE; assign kernel[3] = 13'h1FFF;
assign kernel[4] = 13'h1FFC; assign kernel[5] = 13'h0010; assign kernel[6] = 13'h1FFC;
assign kernel[7] = 13'h1FFF; assign kernel[8] = 13'h1FFE; assign kernel[9] = 13'h1FFF;
```
- Assigns specific values to each kernel weight.
```verilog=
wire signed [12:0] bias;
assign bias = 13'h1FF4;
```
- Declares the bias and assigns its specified value.
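- To make the fixed-point format concrete, the sketch below decodes the constants above into real numbers by dividing the signed 13-bit value by 2^4 = 16. The testbench name `tb_fixed_point` and the helper `q4_to_real` are hypothetical names introduced only for this illustration:
```verilog=
// Illustrative testbench (not part of the design): decodes the 13-bit
// signed constants, which carry 4 fractional bits, into real numbers.
module tb_fixed_point;
    // Hypothetical helper: signed 13-bit value with 4 fractional bits -> real.
    function real q4_to_real;
        input signed [12:0] v;
        integer i;
        begin
            i = v;                         // sign-extend to 32 bits
            q4_to_real = $itor(i) / 16.0;  // 4 fractional bits -> divide by 16
        end
    endfunction

    initial begin
        $display("corner 13'h1FFF = %f", q4_to_real(13'h1FFF)); // -1/16  = -0.0625
        $display("edge   13'h1FFE = %f", q4_to_real(13'h1FFE)); // -2/16  = -0.125
        $display("edge   13'h1FFC = %f", q4_to_real(13'h1FFC)); // -4/16  = -0.25
        $display("center 13'h0010 = %f", q4_to_real(13'h0010)); // 16/16  =  1.0
        $display("bias   13'h1FF4 = %f", q4_to_real(13'h1FF4)); // -12/16 = -0.75
    end
endmodule
```
- Note that the nine weights sum to zero (4 × −0.0625 + 2 × −0.125 + 2 × −0.25 + 1.0 = 0), so a flat image region contributes nothing beyond the bias.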
### Register
```verilog=
//regs
reg [2:0] state, nextState;
reg [11:0] center; // Coordinate (row, column) = (center[11:6], center[5:0])
reg [3:0] counter;
reg signed [25:0] convSum; // {mul_integer(18bits), mul_fraction(8bits)}
```
- `state` and `nextState`: Represent the current and the next state of the finite state machine (FSM).
- `center`: Indicates the current index where the atrous convolution is being applied.
    - `center[11:6]`: The y-coordinate (row) of the current index.
    - `center[5:0]`: The x-coordinate (column) of the current index.
- `counter`: A 4-bit register (maximum value = 15). It is used to count iterations for:
    - 9 cycles in atrous convolution.
    - 4 cycles in max pooling.
- `convSum`: Stores the accumulated convolution result.
    - It is a 26-bit signed register.
    - Uses 18 bits for the integer part and 8 bits for the fractional part to prevent overflow during fixed-point accumulation.
### Constant parameter
```verilog=
localparam LENGTH = 6'd63;
localparam ZERO = 6'd0;
```
- `LENGTH`: A constant with the value 63, used for detecting or validating if the current coordinate is at the image boundary (i.e., the last row or column of a 64×64 image).
- `ZERO`: A constant set to 0, used as the clamped coordinate (row 0 or column 0) when replicate padding is applied at the top or left image boundary.
### Coordinate operation
```verilog=
wire [5:0] cx_add2,cx_minus2,cy_add2,cy_minus2;
assign cy_add2 = center[11:6] + 6'd2;
assign cy_minus2 = center[11:6] - 6'd2;
assign cx_add2 = center[5:0] + 6'd2 ;
assign cx_minus2 = center[5:0] - 6'd2;
```
- These four wires are used to calculate the neighboring pixel coordinates offset by ±2 in the x and y directions. They are mainly used for replicate padding and for determining the sampling positions during atrous convolution (with dilation = 2).
## INIT
```verilog=
INIT: begin
    if (ready) begin
        busy <= 1'd1;
    end
end
```
- When the `ready` signal is high, indicating that the image has been successfully loaded into IMAGE_MEM, the `busy` signal is asserted (set to high) to indicate that the convolution process is starting.
## ATCONV_9PIXELS
### Signal Setting
```verilog=
csel <= 1'd0;
crd <= 1'd1;
cwr <= 1'd0;
```
- Select Layer 0 by setting `csel <= 1'd0`.
- Enable memory read by setting `crd <= 1'd1`.
- Disable memory write by setting `cwr <= 1'd0` to prevent unintended memory writes.
### Update the next pixel (counter) to be read
```verilog=
if (counter > 4'd0) begin
    // idata requested in the previous cycle is valid now
    convSum <= convSum + idata*kernel[counter];
end
counter <= counter + 4'd1;
```
- Accumulate the convolution sum once the pixel value `idata` becomes available (`counter > 4'd0`).
- Note: The requested `idata` will be available on the next clock cycle.
- Increment the `counter` to proceed to the next kernel position.
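- The accumulation can be checked in isolation with a small testbench. The sketch below is an assumption-laden illustration, not the original design: it feeds a flat 3×3 patch (every pixel equal to 1.0, i.e., `13'h0010`) and declares `idata` as signed, because Verilog evaluates an expression as unsigned as soon as one operand is unsigned. How `idata` is declared in the original port list is not shown in this document.
```verilog=
// Illustrative testbench (tb_mac is a hypothetical name).
module tb_mac;
    reg signed [12:0] kernel [1:9];
    reg signed [12:0] bias;
    reg signed [12:0] idata;     // assumed signed so the product stays signed
    reg signed [25:0] convSum;   // 18 integer bits + 8 fractional bits
    integer i;

    initial begin
        kernel[1] = 13'h1FFF; kernel[2] = 13'h1FFE; kernel[3] = 13'h1FFF;
        kernel[4] = 13'h1FFC; kernel[5] = 13'h0010; kernel[6] = 13'h1FFC;
        kernel[7] = 13'h1FFF; kernel[8] = 13'h1FFE; kernel[9] = 13'h1FFF;
        bias  = 13'h1FF4;        // -0.75
        idata = 13'h0010;        // flat patch: every pixel is 1.0

        // Preload the bias, aligned to 8 fractional bits.
        convSum = {{9{bias[12]}}, bias, 4'd0};

        // One multiply-accumulate per kernel tap, as in ATCONV_9PIXELS.
        for (i = 1; i <= 9; i = i + 1)
            convSum = convSum + idata * kernel[i];

        // The weights sum to zero, so only the bias remains:
        // -192 in units of 1/256, i.e. -0.75.
        $display("convSum = %0d", convSum);
    end
endmodule
```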
### Update coordinate according to position (counter)
```verilog=
case (counter) // -> for y axis (row)
    0,1,2: iaddr[11:6] <= ((center[11:6] == 6'd0) || (center[11:6] == 6'd1))? ZERO : cy_minus2;
    3,4,5: iaddr[11:6] <= center[11:6];
    6,7,8: iaddr[11:6] <= ((center[11:6] == LENGTH - 6'd1) || (center[11:6] == LENGTH))? LENGTH : cy_add2;
endcase
case (counter) // -> for x axis (column)
    0,3,6: iaddr[5:0] <= ((center[5:0] == 6'd0) || (center[5:0] == 6'd1))? ZERO : cx_minus2;
    1,4,7: iaddr[5:0] <= center[5:0];
    2,5,8: iaddr[5:0] <= ((center[5:0] == LENGTH - 6'd1) || (center[5:0] == LENGTH))? LENGTH : cx_add2;
endcase
```
- Set both the x- and y-coordinates of the next requested pixel address (`iaddr`) for convolution.
- The y-coordinate selection logic for the next pixel is shown in the table below:
| `counter` | Y-coordinate (normal case) | Y-coordinate (edge case) |
| :--------: | :------------------------: | :------------------------: |
| 0 ~ 2 | `center[11:6]` - 2 | $0^{\text{th}}$ row (if `center` is at $0^{\text{th}}$ or $1^{\text{st}}$ row) |
| 3 ~ 5 | `center[11:6]` | N/A |
| 6 ~ 8 | `center[11:6]` + 2 | $63^{\text{rd}}$ row (if `center` is at $62^{\text{nd}}$ or $63^{\text{rd}}$ row) |
- The x-coordinate selection logic for the next pixel is shown in the table below:
| `counter` | X-coordinate (normal case) | X-coordinate (edge case) |
| :--------: | :------------------------: | :------------------------: |
| 0, 3, 6 | `center[5:0]` - 2 | $0^{\text{th}}$ column (if `center` is at $0^{\text{th}}$ or $1^{\text{st}}$ column) |
| 1, 4, 7 | `center[5:0]` | N/A |
| 2, 5, 8 | `center[5:0]` + 2 | $63^{\text{rd}}$ column (if `center` is at $62^{\text{nd}}$ or $63^{\text{rd}}$ column) |
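- The two tables can be exercised with a small testbench. The sketch below is an illustration only: the testbench name `tb_replicate_pad` and the helper `tap_coord` are hypothetical, and the helper uses `<=` / `>=` comparisons, which select the same cases as the equality checks in the code above.
```verilog=
// Illustrative testbench: prints the 9 tap addresses for one center pixel.
module tb_replicate_pad;
    localparam LENGTH = 6'd63;
    localparam ZERO   = 6'd0;

    // Hypothetical helper: clamp one 6-bit coordinate.
    // sel = 0 -> offset -2, sel = 1 -> offset 0, sel = 2 -> offset +2.
    function [5:0] tap_coord;
        input [5:0] c;
        input [1:0] sel;
        begin
            case (sel)
                2'd0:    tap_coord = (c <= 6'd1)  ? ZERO   : c - 6'd2;
                2'd2:    tap_coord = (c >= 6'd62) ? LENGTH : c + 6'd2;
                default: tap_coord = c;
            endcase
        end
    endfunction

    reg [11:0] center;
    integer counter;

    initial begin
        center = {6'd0, 6'd63};  // row 0, column 63: top-right corner
        for (counter = 0; counter < 9; counter = counter + 1)
            // Row offset changes every 3 taps; column offset cycles every tap.
            $display("counter=%0d -> row=%0d col=%0d", counter,
                     tap_coord(center[11:6], counter / 3),
                     tap_coord(center[5:0],  counter % 3));
    end
endmodule
```
- For this corner pixel the rows are 0, 0, 2 (the top row is clamped) and the columns are 61, 63, 63 (the right column is clamped), which matches the edge-case columns of the tables.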
## LAYER0_WRITERELU
### Signal Setting
```verilog=
csel <= 1'd0;
crd <= 1'd0;
cwr <= 1'd1;
caddr_wr <= center;
```
- Set `csel <= 1'd0` and `cwr <= 1'd1` to enable writing to *Layer 0 memory*, and disable memory read by setting `crd <= 1'd0`.
- Assign the write address using `caddr_wr <= center`.
### ReLU function
```verilog=
cdata_wr <= (convSum[25])? 13'd0 : convSum[16:4]; // ReLU
```
- The ReLU (Rectified Linear Unit) function is defined as:
$$
\text{ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
0 & \text{otherwise}
\end{cases}
$$
- According to the [specification](https://github.com/PoChuan994/Atrous_Convolution_Circuit/blob/main/2023_hw4.pdf), only 13-bit data should be stored in memory.
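- The accumulated value `convSum` is a 26-bit signed fixed-point number (18-bit integer + 8-bit fraction).
- The line above applies the ReLU function and truncates the result to 13 bits: a negative sum (`convSum[25] == 1`) is replaced by zero; otherwise bits `[16:4]` are kept, i.e., the lower 9 integer bits and the upper 4 fractional bits.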
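- The ReLU and truncation step can be tried on a few values with the sketch below. The testbench name `tb_relu` and the function `relu_q4` are hypothetical wrappers around the same expression as the line above; inputs are given in units of 1/256 and outputs in units of 1/16.
```verilog=
// Illustrative testbench: ReLU followed by truncation to 13 bits.
module tb_relu;
    // Hypothetical helper wrapping the ReLU expression.
    function [12:0] relu_q4;
        input signed [25:0] sum;
        begin
            relu_q4 = sum[25] ? 13'd0 : sum[16:4];  // negative -> 0, else keep [16:4]
        end
    endfunction

    initial begin
        $display("%0d", relu_q4(-26'sd192)); // -0.75 is negative        -> 0
        $display("%0d", relu_q4(26'sd400));  // 400/256 = 1.5625         -> 25 (25/16 = 1.5625)
        $display("%0d", relu_q4(26'sd405));  // low 4 fraction bits drop -> 25
    end
endmodule
```
- Because only bits `[16:4]` are kept, any integer bits above bit 16 are discarded rather than saturated.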
### Update and Reinitialize
```verilog=
convSum <= {{9{1'b1}}, bias, 4'd0};
center <= center + 12'd1;
counter <= 4'd0;
```
- Reinitialize `counter` and `convSum` for the next convolution operation.
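- `convSum` is preloaded with the bias value, sign-extended to 26 bits and padded with 4 fractional zero bits so that it aligns with the 8 fractional bits of the products. The sign extension is hard-coded as nine 1 bits, which is correct because the bias (`13'h1FF4`) is negative.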
- Increment `center` to move to the next pixel position for atrous convolution.
## MAXPOOL_4PIXELS
- This step performs the *max pooling* operation, which extracts the maximum value from a 2×2 region of pixels in Layer 0 and writes it into Layer 1.
### Signal Setting
```verilog=
csel <= 1'd0;
crd <= 1'd1;
cwr <= 1'd0;
```
- Select Layer 0 as the memory source by setting `csel <= 1'd0`, enable memory read with `crd <= 1'd1`, and disable memory write with `cwr <= 1'd0` to prevent unintended writes.
### Update the temporary maximum value and counter (next pixel to be read)
```verilog=
if (counter==0) begin
    cdata_wr <= 13'd0;
end
else if (cdata_rd > cdata_wr) begin
    cdata_wr <= cdata_rd;
end
counter <= counter + 4'd1;
```
- Initialize and update the `counter`, which records the number of pixels processed and determines the address of the next pixel to read.
- Update the temporary maximum value `cdata_wr` if the current data read from memory (`cdata_rd`) is greater than the current stored maximum.
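- The running-maximum idea can be sketched on its own as follows. This is an illustration with made-up pixel values; the testbench name `tb_running_max` and the `window` array are hypothetical. Starting from zero is safe because every Layer 0 value has passed through ReLU and is therefore non-negative.
```verilog=
// Illustrative testbench: running maximum over one 2x2 window.
module tb_running_max;
    reg [12:0] window [0:3];  // made-up Layer 0 values for one 2x2 window
    reg [12:0] cdata_wr;      // running maximum
    integer i;

    initial begin
        window[0] = 13'h019; window[1] = 13'h004;
        window[2] = 13'h120; window[3] = 13'h0FF;

        cdata_wr = 13'd0;                  // reset, as done when counter == 0
        for (i = 0; i < 4; i = i + 1)
            if (window[i] > cdata_wr)      // keep the larger value
                cdata_wr = window[i];

        $display("max = %0d", cdata_wr);   // 13'h120 = 288
    end
endmodule
```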
### Set the Next Requested Pixel Memory Address
```verilog=
case(counter) // -> for y axis (row)
    0,1: caddr_rd[11:6] <= {center[9:5], 1'd0};
    2,3: caddr_rd[11:6] <= {center[9:5], 1'd1};
endcase
case(counter) // -> for x axis (column)
    0,2: caddr_rd[5:0] <= {center[4:0], 1'd0};
    1,3: caddr_rd[5:0] <= {center[4:0], 1'd1};
endcase
```
- In the y-direction, the y-coordinate of the next pixel to be read from Layer 0 (for max pooling) is determined by the current `counter` value:
| `counter` | y-coordinate in Layer 0 |
| :--------: | :----------------------: |
| 0, 1 | row index of Layer 1 × 2 |
| 2, 3 | row index of Layer 1 × 2 + 1 |
- In the x-direction, the x-coordinate of the next pixel in Layer 0 is determined similarly:
| `counter` | x-coordinate in Layer 0 |
| :--------: | :----------------------: |
| 0, 2 | column index of Layer 1 × 2 |
| 1, 3 | column index of Layer 1 × 2 + 1 |
- These coordinates are computed by bit-shifting the Layer 1 `center` address. For example, `{center[9:5], 1'b0}` is equivalent to `Layer1_row × 2`.
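- The address mapping can be verified for one Layer 1 position with the sketch below. The testbench name `tb_pool_addr` is hypothetical, and the two `case` statements are condensed into bit selects (`counter[1]` for the row, `counter[0]` for the column), which produce the same values as the tables above.
```verilog=
// Illustrative testbench: Layer 0 read addresses for one Layer 1 pixel.
module tb_pool_addr;
    reg [11:0] center;    // Layer 1 index; only bits [9:0] are used (0..1023)
    reg [11:0] caddr_rd;  // Layer 0 read address
    reg [2:0]  counter;

    initial begin
        center = 12'd33;  // Layer 1 row 1, column 1 (33 = 1*32 + 1)
        for (counter = 0; counter < 4; counter = counter + 1) begin
            caddr_rd[11:6] = {center[9:5], counter[1]};  // row*2 or row*2 + 1
            caddr_rd[5:0]  = {center[4:0], counter[0]};  // col*2 or col*2 + 1
            $display("counter=%0d -> caddr_rd=%0d", counter, caddr_rd);
        end
    end
endmodule
```
- The four addresses printed are 130, 131, 194, and 195, i.e., rows 2 and 3 and columns 2 and 3 of the 64×64 Layer 0 image.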
## LAYER1_WRITECEILING
- This step performs a *round-up* operation and writes the final value into the specified address in Layer 1 memory.
- The memory address to be written is updated using the `center` signal.
### Signal Setting
```verilog=
csel <= 1'd1;
crd <= 1'd0;
cwr <= 1'd1;
caddr_wr <= center;
```
- Select Layer 1 by setting `csel <= 1'd1` and enable write operation with `cwr <= 1'd1`.
- Disable memory read by setting `crd <= 1'd0`.
### Round up value
```verilog=
cdata_wr <= { cdata_wr[12:4] + {8'd0,|cdata_wr[3:0]} , 4'd0 };
```
- This expression performs rounding up from a fixed-point number:
    - `cdata_wr[12:4]`: the integer part of a 13-bit fixed-point number (9 integer bits + 4 fractional bits).
    - `|cdata_wr[3:0]`: reduction OR of the fractional part; adds 1 to the integer part if any fractional bit is set.
    - The result is padded with `4'd0` to preserve the fixed-point format (aligning with 13-bit width).
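- A few sample values make the rounding behavior easier to see. In the sketch below, the testbench name `tb_ceil` and the function `ceil_q4` are hypothetical wrappers around the same expression; inputs and outputs are in units of 1/16.
```verilog=
// Illustrative testbench: round a 13-bit value with 4 fractional bits up
// to the next integer.
module tb_ceil;
    // Hypothetical helper wrapping the round-up expression.
    function [12:0] ceil_q4;
        input [12:0] v;
        begin
            ceil_q4 = {v[12:4] + {8'd0, |v[3:0]}, 4'd0};
        end
    endfunction

    initial begin
        $display("%0d", ceil_q4(13'h019)); // 1.5625 -> 2.0 = 32/16
        $display("%0d", ceil_q4(13'h020)); // 2.0 is already an integer -> 32
        $display("%0d", ceil_q4(13'h000)); // 0.0 -> 0
    end
endmodule
```
- The 9-bit addition wraps around if the integer part is already 511 and a fractional bit is set; the expression does not saturate in that case.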
### Update and Reinitialize
```verilog=
center <= center + 12'd1;
counter <= 4'd0;
```
- Move to the next coordinate in Layer 1 by incrementing `center`.
- Reset `counter` to zero for the next max-pooling operation.
## FINISH
```verilog=
busy <= 1'd0;
```
- Setting the `busy` signal to `1'd0` indicates that the processing task has been completed.