# Lab3_FIR updated
[Reference: lab3_ver2 design](https://github.com/Raywang908/lab3_ver2)
## Design SPEC
* data_width = max 32 bit
* tap_width = max 32 bit
* addr_width = max 12 bit
* tap_number = max 32 unit
* RAM_addr_width = max 12 bit
* RAM_capacity = data_width * 32 space
* addr 0x00: for [2] ap_idle, [1] ap_done, [0] ap_start
* addr 0x10: data_length
* addr 0x14: tap_length
* addr 0x80 ~ 0xFF: tap, data
## Design Approach
### Axi_Lite waveform

This time diagram is correct and some further details can reference the 1st version of this part.
### Axi_Lite block diagram

This part is refine to some simple logic and I don't use FSM in this version.
### Axi_Stream waveform 1st

The concept of this waveform is similar with the version 1, I did not decouple the input x and the core engine. So, if I design it based on this idea, it will result in poor performance in Lab 4, because the CPU will have to wait for the core engine to finish the task before it can start another one.
### Axi_Stream waveform 2nd

This version adds x_buffer and y_buffer to decouple the input/output Bus and the engine, so this two system can do its own task.
:::warning
This waveform is not totally correct, but it can be used as an insight for the design.
:::
:::success
Although drawing waveforms before coding may take a lot of time and effort, it can give you a clear idea of the design details, so you won't need to revise your code frequently. However, the downside is that when the design becomes complicated, redrawing the waveforms can be very time-consuming.
:::
### Axi_Stream block diagram

### Overview block diagram

## 1st version Code
```verilog
module fir
#( parameter pADDR_WIDTH = 12,
parameter pDATA_WIDTH = 32,
parameter Tape_Num = 12
)
(
output wire awready,
output wire wready,
input wire awvalid,
input wire [(pADDR_WIDTH-1):0] awaddr,
input wire wvalid,
input wire [(pDATA_WIDTH-1):0] wdata,
output wire arready,
input wire rready,
input wire arvalid,
input wire [(pADDR_WIDTH-1):0] araddr,
output wire rvalid,
output wire [(pDATA_WIDTH-1):0] rdata,
input wire ss_tvalid,
input wire [(pDATA_WIDTH-1):0] ss_tdata,
input wire ss_tlast,
output wire ss_tready,
input wire sm_tready,
output wire sm_tvalid,
output wire [(pDATA_WIDTH-1):0] sm_tdata,
output wire sm_tlast,
// bram for tap RAM
output wire [3:0] tap_WE,
output wire tap_EN,
output wire [(pDATA_WIDTH-1):0] tap_Di,
output wire [(pADDR_WIDTH-1):0] tap_A,
input wire [(pDATA_WIDTH-1):0] tap_Do,
// bram for data RAM
output wire [3:0] data_WE,
output wire data_EN,
output wire [(pDATA_WIDTH-1):0] data_Di,
output wire [(pADDR_WIDTH-1):0] data_A,
input wire [(pDATA_WIDTH-1):0] data_Do,
input wire axis_clk,
input wire axis_rst_n
);
//========================== Declaration ==========================
//-------------------------- Axi-Lite -------------------------------
reg [(pADDR_WIDTH-1):0] araddr_tmp;
reg [(pADDR_WIDTH-1):0] araddr_next;
reg wready_tmp;
reg wready_next;
reg awready_tmp;
reg awready_next;
reg arready_tmp;
reg arready_next;
reg rvalid_tmp;
reg rvalid_next;
reg [(pDATA_WIDTH-1):0] rdata_tmp;
wire [4:0] condition; // used as the condition of the mux for rdata
reg [1:0] addr_define; // define address of the three register below into 2 bits
localparam [(pADDR_WIDTH-1):0] AP_ADDR = {{(pADDR_WIDTH-8){1'b0}}, 8'h00}; // address of ap_crtl register
localparam [(pADDR_WIDTH-1):0] DATA_ADDR = {{(pADDR_WIDTH-8){1'b0}}, 8'h10}; // address of data_length register
localparam [(pADDR_WIDTH-1):0] TAP_ADDR = {{(pADDR_WIDTH-8){1'b0}}, 8'h14}; // address of tap_length register
localparam [(pADDR_WIDTH-1):0] INVALID_ADDR = {{(pADDR_WIDTH-1){1'b0}}, 1'b1}; //return invalid address ex. 12'h01
localparam [(pDATA_WIDTH-1):0] INVALID_DATA = {(pDATA_WIDTH){1'b1}}; // return invalid number ex. 32'hffffffff
reg [(pDATA_WIDTH-1):0] data_length;
reg [(pDATA_WIDTH-1):0] data_length_next;
reg [(pDATA_WIDTH-1):0] tap_length;
reg [(pDATA_WIDTH-1):0] tap_length_next;
//-------------------------- ap_idle & ap_done & ap_start -------------------------------
reg ap_idle;
reg ap_idle_next;
reg ap_done;
reg ap_done_next;
wire ap_start;
wire [2:0] ap_crtl;
localparam VALID = 1'b1;
localparam PULLD = 1'b0 ; // pull down = 1'b0
localparam PULLU = 1'b1 ; // pull up = 1'b1
//-------------------------- Axi-Stream SS (input X) -------------------------------
reg data_filled;
wire data_filled_next;
reg [(pDATA_WIDTH-1):0] data_cnt_now;
reg [(pDATA_WIDTH-1):0] data_cnt_now_next; // we use this in the time diagram
reg [(pDATA_WIDTH-1):0] x_buffer;
reg [(pDATA_WIDTH-1):0] x_buffer_next; // we use this in the time diagram
reg ss_tdone;
wire ss_tdone_next;
reg ss_tready_tmp;
reg ss_tready_next;
//-------------------------- FIR engine -------------------------------
reg [(pDATA_WIDTH-1):0] data_cnt_state;
reg [(pDATA_WIDTH-1):0] data_cnt_state_next; // we use this in the time diagram
reg [(pDATA_WIDTH-1):0] data_cnt_norm;
reg [(pDATA_WIDTH-1):0] data_cnt_norm_next; // we use this in the time diagram
reg [(pADDR_WIDTH-1):0] tapA_cnter;
reg [(pADDR_WIDTH-1):0] tapA_cnter_next; // we use this in the time diagram
reg [(pADDR_WIDTH-1):0] dataA_cnter;
reg [(pADDR_WIDTH-1):0] dataA_cnter_next; // we use this in the time diagram
wire [(pDATA_WIDTH-1):0] data_A_tmp;
localparam [(pADDR_WIDTH-1):0] CNTERA_INVALID = 32; //max tap num
reg data_write;
reg write_sr; // data_write is shift right one cycle
reg write_sr_next;
reg [(pADDR_WIDTH-1):0] op_cnter1;
wire [(pADDR_WIDTH-1):0] op_cnter1_next;
reg [(pADDR_WIDTH-1):0] op_cnter2;
wire [(pADDR_WIDTH-1):0] op_cnter2_next;
reg [(pADDR_WIDTH-1):0] op_cnter3; // we use this in the time diagram
wire [(pADDR_WIDTH-1):0] op_cnter3_next;
reg [(pDATA_WIDTH-1):0] muli;
reg [(pDATA_WIDTH-1):0] muli_next;
reg mul_ctrl1;
wire mul_ctrl1_next;
reg mul_ctrl2;
wire mul_ctrl2_next;
reg sum_ctrl;
wire sum_ctrl_next;
reg [(pDATA_WIDTH-1):0] xi;
wire [(pDATA_WIDTH-1):0] xi_next;
reg [(pDATA_WIDTH-1):0] tapi;
wire [(pDATA_WIDTH-1):0] tapi_next;
reg [(pDATA_WIDTH-1):0] sumi;
reg [(pDATA_WIDTH-1):0] sumi_next;
reg [(pDATA_WIDTH-1):0] y_storage;
reg [(pDATA_WIDTH-1):0] y_storage_next;
reg [(pDATA_WIDTH-1):0] y_buffer;
reg [(pDATA_WIDTH-1):0] y_buffer_next;
reg y_locked;
reg y_locked_next;
reg y_change_valid;
wire y_change_valid_next;
reg sm_tdone;
wire sm_tdone_next;
//-------------------------- Axi-Stream SM (Output Y) -------------------------------
reg [(pDATA_WIDTH-1):0] y_cnter;
reg [(pDATA_WIDTH-1):0] y_cnter_next;
reg sm_tvalid_tmp;
reg sm_tvalid_next;
reg sm_tlast_tmp;
reg sm_tlast_next;
//========================== Function ==========================
//-------------------------- Axi-Lite -------------------------------
// assign response signals
assign wready = wready_tmp;
assign awready = awready_tmp;
assign arready = arready_tmp;
assign rvalid = rvalid_tmp;
// assign tap_RAM ouput signals
assign tap_WE = (ap_idle && awready && wready) ? 4'b1111 : 4'b0000;
assign tap_EN = (ap_idle) ? ((wready && awaddr[7]) || (rvalid && araddr_tmp[7])) : ((|data_cnt_state_next) || ss_tdone); // (|data_cnt_state) == !(data_cnt_state == 0)
assign tap_Di = wdata;
assign tap_A = (ap_idle) ? ((awvalid) ? awaddr[6:0] : araddr_tmp[6:0]) : (tapA_cnter_next << 2);
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
data_length <= 0;
tap_length <= 0;
wready_tmp <= 0;
awready_tmp <= 0;
arready_tmp <= 0;
rvalid_tmp <= 0;
araddr_tmp <= 0;
end else begin
data_length <= data_length_next;
tap_length <= tap_length_next;
wready_tmp <= wready_next;
awready_tmp <= awready_next;
arready_tmp <= arready_next;
rvalid_tmp <= rvalid_next;
araddr_tmp <= araddr_next;
end
end
always @(*) begin
// managing wready and awready
if (awvalid && wvalid && !awready) begin
wready_next = PULLU;
awready_next = PULLU;
end else begin
wready_next = PULLD;
awready_next = PULLD;
end
// managing arready
if (arvalid && !arready) begin
arready_next = PULLU;
end else begin
arready_next = PULLD;
end
// managing rvalid
if (arready) begin
rvalid_next = PULLU;
end else if (rready) begin
rvalid_next = PULLD;
end else begin
rvalid_next = rvalid_tmp;
end
// managing araddr_tmp (store the value of araddr until shakehand)
if (arvalid) begin
araddr_next = araddr;
end else if (rvalid && rready) begin
araddr_next = INVALID_ADDR;
end else begin
araddr_next = araddr_tmp;
end
end
// define address of the three register (ap_crtl, data_length, tap_length) into 2 bits
// which is used in "condition"
always @(*) begin
if (araddr_tmp == AP_ADDR) begin
addr_define = 2'b11;
end else if (araddr_tmp == DATA_ADDR) begin
addr_define = 2'b01;
end else begin
addr_define = 2'b10;
end
end
assign condition = {araddr_tmp[7], ap_idle, rready && rvalid, addr_define};
always @(*) begin
casez (condition)
5'b111??: begin
rdata_tmp = tap_Do;
end
5'b101??: begin
rdata_tmp = INVALID_DATA;
end
5'b0?101: begin
rdata_tmp = data_length;
end
5'b0?110: begin
rdata_tmp = tap_length;
end
5'b0?111: begin
rdata_tmp = ap_crtl;
end
default: begin
rdata_tmp = {(pDATA_WIDTH){1'b0}};
end
endcase
end
assign rdata = rdata_tmp;
always @(*) begin
// define the flipflop of data_length
if ((|tap_WE) && (awaddr == DATA_ADDR)) begin
data_length_next = wdata;
end else begin
data_length_next = data_length;
end
// define the flipflop of tap_length
if ((|tap_WE) && (awaddr == TAP_ADDR)) begin
tap_length_next = wdata;
end else begin
tap_length_next = tap_length;
end
end
//-------------------------- ap_idle & ap_done & ap_start -------------------------------
assign ap_crtl = {ap_idle, ap_done, ap_start};
// define the write in of ap_start
assign ap_start = ((|tap_WE) && (awaddr == AP_ADDR)) ? wdata[0] : PULLD;
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
ap_idle <= 1;
ap_done <= 0;
end else begin
ap_idle <= ap_idle_next;
ap_done <= ap_done_next;
end
end
always @(*) begin
// define the flipflop of ap_idle
if (ap_start) begin
ap_idle_next = PULLD;
end else if (sm_tlast && sm_tready) begin
ap_idle_next = PULLU;
end else begin
ap_idle_next = ap_idle;
end
// define the flipflop of ap_done
if (sm_tlast && sm_tready) begin
ap_done_next = PULLU;
end else if (ap_idle && ap_done && (araddr_tmp == AP_ADDR) && rready && rvalid) begin
ap_done_next = PULLD;
end else begin
ap_done_next = ap_done;
end
end
//-------------------------- Axi-Stream SS (input X) -------------------------------
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
data_cnt_now <= 0;
x_buffer <= INVALID_DATA;
data_filled <= 0;
ss_tdone <= 0;
ss_tready_tmp <= 0;
end else begin
data_cnt_now <= data_cnt_now_next;
x_buffer <= x_buffer_next;
data_filled <= data_filled_next;
ss_tdone <= ss_tdone_next;
ss_tready_tmp <= ss_tready_next;
end
end
assign ss_tready = ss_tready_tmp;
// used as a switch to turn off data_EN and tap_EN
assign ss_tdone_next = (ss_tready && ss_tlast) ? PULLU :
(sm_tready && sm_tlast) ? PULLD : ss_tdone;
// data_filled = 1 if data_RAM is filled
assign data_filled_next = (sm_tdone) ? PULLD :
(data_cnt_now_next == tap_length) && !ap_idle ? PULLU : data_filled;
// define data_cnt_now_next and x_buffer_next
always @(*) begin
if (ap_idle) begin
data_cnt_now_next = {(pDATA_WIDTH-1){1'b0}};
x_buffer_next = INVALID_DATA;
end else if (ss_tvalid && ss_tready) begin
data_cnt_now_next = data_cnt_now + 1;
x_buffer_next = ss_tdata;
end else if (sm_tdone) begin // ss_tdone
data_cnt_now_next = {(pDATA_WIDTH-1){1'b0}};
x_buffer_next = INVALID_DATA;
end else begin
data_cnt_now_next = data_cnt_now;
x_buffer_next = x_buffer;
end
end
// define ss_tready_next
always @(*) begin
if (!data_filled) begin
if (ss_tvalid && !ss_tready && !ap_idle) begin
ss_tready_next = PULLU;
end else begin
ss_tready_next = PULLD;
end
end else if ((data_cnt_now_next == data_cnt_state_next) && ss_tvalid && !ss_tready && !ap_idle && !ss_tdone) begin
ss_tready_next = PULLU;
end else begin
ss_tready_next = PULLD;
end
end
//-------------------------- FIR engine -------------------------------
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
data_cnt_state <= 0;
data_cnt_norm <= 0;
tapA_cnter <= CNTERA_INVALID;
dataA_cnter <= CNTERA_INVALID;
write_sr <= 0;
op_cnter1 <= CNTERA_INVALID;
op_cnter2 <= CNTERA_INVALID;
op_cnter3 <= CNTERA_INVALID;
mul_ctrl1 <= 0;
mul_ctrl2 <= 0;
xi <= INVALID_DATA;
tapi <= INVALID_DATA;
sum_ctrl <= 0;
sumi <= 0;
muli <= 0;
y_locked <= 0;
y_storage <= INVALID_DATA;
y_change_valid <= 0;
y_buffer <= INVALID_DATA;
sm_tdone <= 0;
end else begin
data_cnt_state <= data_cnt_state_next;
data_cnt_norm <= data_cnt_norm_next;
tapA_cnter <= tapA_cnter_next;
dataA_cnter <= dataA_cnter_next;
write_sr <= write_sr_next;
op_cnter1 <= op_cnter1_next;
op_cnter2 <= op_cnter2_next;
op_cnter3 <= op_cnter3_next;
mul_ctrl1 <= mul_ctrl1_next;
mul_ctrl2 <= mul_ctrl2_next;
xi <= xi_next;
tapi <= tapi_next;
sum_ctrl <= sum_ctrl_next;
sumi <= sumi_next;
muli <= muli_next;
y_locked <= y_locked_next;
y_storage <= y_storage_next;
y_change_valid <= y_change_valid_next;
y_buffer <= y_buffer_next;
sm_tdone <= sm_tdone_next;
end
end
// define data_cnt_state_next
always @(*) begin
if (ap_idle) begin // oridinally no ap_idle
data_cnt_state_next = 0;
end else if ((data_cnt_now_next == 1) && (data_cnt_now == 0)) begin
data_cnt_state_next = 1;
end else if (tapA_cnter == 1 && !(y_locked || y_locked_next) && (data_cnt_state < data_cnt_now_next)) begin
data_cnt_state_next = data_cnt_state + 1;
end else begin
data_cnt_state_next = data_cnt_state;
end
end
// define data_cnt_norm_next, which is the loop of 1 to tap_lentgh - 1
always @(*) begin
if (ap_idle) begin // sm_tdone
data_cnt_norm_next = 0;
end else if (!(data_cnt_state == data_cnt_state_next)) begin
if (data_cnt_norm == tap_length) begin
data_cnt_norm_next = 1;
end else begin
data_cnt_norm_next = data_cnt_norm + 1;
end
end else begin
data_cnt_norm_next = data_cnt_norm;
end
end
// define tapA_cnter_next
always @(*) begin
if (ap_idle) begin // sm_tdone
tapA_cnter_next = CNTERA_INVALID;
end else if (data_cnt_state_next == 0) begin
tapA_cnter_next = tapA_cnter;
end else if (!(data_cnt_state == data_cnt_state_next)) begin
tapA_cnter_next = 0;
end else if (!data_write && (tapA_cnter == 0)) begin
if (data_cnt_state_next < tap_length) begin
tapA_cnter_next = data_cnt_state_next;
end else begin
tapA_cnter_next = tap_length - 1;
end
end else if (tapA_cnter > 1 && !data_write) begin //only tapA_cnter > 1
tapA_cnter_next = tapA_cnter - 1;
end else begin
tapA_cnter_next = tapA_cnter;
end
end
// define dataA_cnter_next
always @(*) begin
if (ap_idle) begin // ss_tdone
dataA_cnter_next = CNTERA_INVALID;
end else if (data_cnt_state_next == 0) begin
dataA_cnter_next = dataA_cnter;
end else if (!(data_cnt_state == data_cnt_state_next)) begin
dataA_cnter_next = data_cnt_norm_next - 1;
end else if (!(tapA_cnter_next == tapA_cnter) && !data_write) begin
if (data_cnt_state_next < tap_length) begin
if (tapA_cnter == 0) begin
dataA_cnter_next = 0;
end else begin
dataA_cnter_next = dataA_cnter + 1;
end
end else begin
if (tapA_cnter == 0) begin
if (data_cnt_state_next % tap_length + 1 == tap_length) begin
dataA_cnter_next = 0;
end else begin
dataA_cnter_next = data_cnt_state_next % tap_length + 1;
end
end else begin
if (dataA_cnter + 1 == tap_length) begin
dataA_cnter_next = 0;
end else begin
dataA_cnter_next = dataA_cnter + 1;
end
end
end
end else begin
dataA_cnter_next = dataA_cnter;
end
end
// define write_sr_next
always @(*) begin
if (!(data_cnt_state == data_cnt_state_next) && !(data_cnt_state_next == data_cnt_now_next) && ss_tready) begin
write_sr_next = PULLU;
end else begin
write_sr_next = PULLD;
end
end
// define data_write
always @(*) begin
if (!data_filled) begin
if (ss_tready && (write_sr_next == PULLU)) begin
data_write = PULLD;
end else if (write_sr) begin
data_write = PULLU;
end else begin
data_write = ss_tready;
end
end else if ((data_cnt_state_next > tap_length) && !(data_cnt_state == data_cnt_state_next)) begin
data_write = PULLU;
end else begin
data_write = PULLD;
end
end
// assign data_RAM ouput signals
assign data_WE = {(4){data_write}};
assign data_EN = (|data_cnt_state_next) || ss_tdone;
assign data_Di = x_buffer_next;
assign data_A_tmp = (data_write && !data_filled) ? (data_cnt_now_next - 1) << 2 : dataA_cnter_next << 2;
assign data_A = data_A_tmp[(pADDR_WIDTH-1):0]; // bcs data_cnt_now_next is [(pDATA_WIDTH-1):0]
// assign xi and tapi for pipelining
assign xi_next = data_Do;
assign tapi_next = tap_Do;
// assign op_cnter_next
assign op_cnter1_next = tapA_cnter;
assign op_cnter2_next = op_cnter1;
assign op_cnter3_next = op_cnter2;
// define mul_ctrl
assign mul_ctrl1_next = !(tapA_cnter_next == tapA_cnter) ? PULLU : PULLD;
assign mul_ctrl2_next = mul_ctrl1;
// define sum_ctrl
assign sum_ctrl_next = mul_ctrl2;
// manage muli: do multiply operation
always @(*) begin
if (mul_ctrl2) begin
muli_next = xi * tapi;
end else begin
muli_next = muli;
end
end
// manage sumi: do sum operation
always @(*) begin
if (sum_ctrl && (op_cnter3 == CNTERA_INVALID)) begin
sumi_next = muli;
end else if (sum_ctrl && (op_cnter3 == 0)) begin
sumi_next = muli;
end else if (sum_ctrl) begin
sumi_next = muli + sumi;
end else begin
sumi_next = sumi;
end
end
// define y_locked
always @(*) begin
if (op_cnter2 == 1 && op_cnter1 == 0 && sm_tvalid && !sm_tready) begin
y_locked_next = PULLU;
end else if (sm_tready) begin
y_locked_next = PULLD;
end else begin
y_locked_next = y_locked;
end
end
// assign sm_tdone
assign sm_tdone_next = sm_tlast && sm_tready && ~sm_tdone;
// define y_storage
always @(*) begin
if (op_cnter3 == 1 && op_cnter2 == 0 && y_locked) begin
y_storage_next = sumi_next;
end else if (sm_tdone) begin
y_storage_next = INVALID_DATA;
end else begin
y_storage_next = y_storage;
end
end
// define y_change_valid
assign y_change_valid_next = ((y_locked_next == 0) && (y_locked == 1)) ? PULLU : PULLD;
// define y_buffer
always @(*) begin
if (sm_tdone || ap_idle) begin
y_buffer_next = INVALID_DATA;
y_cnter_next = 0;
end else if (op_cnter3 == CNTERA_INVALID && op_cnter2 == 0) begin
y_buffer_next = sumi_next;
y_cnter_next = y_cnter + 1;
end else if (op_cnter3 == 1 && op_cnter2 == 0 && !y_locked) begin
y_buffer_next = sumi_next;
y_cnter_next = y_cnter + 1;
end else if (y_change_valid) begin
y_buffer_next = y_storage;
y_cnter_next = y_cnter + 1;
end else begin
y_buffer_next = y_buffer;
y_cnter_next = y_cnter;
end
end
//-------------------------- Axi-Stream SM (Output Y) -------------------------------
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
sm_tvalid_tmp <= 0;
sm_tlast_tmp <= 0;
y_cnter <= 0;
end else begin
sm_tvalid_tmp <= sm_tvalid_next;
sm_tlast_tmp <= sm_tlast_next;
y_cnter <= y_cnter_next;
end
end
// assign to sm_tvalid
assign sm_tvalid = sm_tvalid_tmp;
// define sm_tvalid
always @(*) begin
if (op_cnter3 == CNTERA_INVALID && op_cnter2 == 0) begin
sm_tvalid_next = PULLU;
end else if (op_cnter3 == 1 && op_cnter2 == 0 && !y_locked) begin
sm_tvalid_next = PULLU;
end else if (y_change_valid && !ap_idle) begin
sm_tvalid_next = PULLU;
end else if (sm_tready) begin
sm_tvalid_next = PULLD;
end else begin
sm_tvalid_next = sm_tvalid_tmp;
end
end
// define sm_tlast
always @(*) begin
if (!ap_idle && y_cnter_next == data_length && sm_tvalid_next == PULLU) begin
sm_tlast_next = PULLU;
end else if (sm_tvalid_next == PULLD) begin
sm_tlast_next = PULLD;
end else begin
sm_tlast_next = sm_tlast_tmp;
end
end
// assign sm_tlast
assign sm_tlast = sm_tlast_tmp;
// assign to sm_tdata
assign sm_tdata = (sm_tready) ? y_buffer_next : INVALID_DATA;
endmodule
```
:::danger
Timing violation


The reason may be too many if else statement, causing mutliple stage MUX, which result in bad timing path. As follow:


:::
## Final version Code Analyse
### Axi_Lite

```verilog
//-------------------------- Axi-Lite -------------------------------
// assign response signals
assign wready = wready_tmp;
assign awready = awready_tmp;
assign arready = arready_tmp;
assign rvalid = rvalid_tmp;
// assign tap_RAM ouput signals
assign tap_WE = (ap_idle && awready && wready) ? 4'b1111 : 4'b0000;
assign tap_EN = (ap_idle) ? ((wready && awaddr[7]) || (rvalid && araddr_tmp[7])) :
((|data_cnt_state_next) || ss_tdone); // (|data_cnt_state) == !(data_cnt_state == 0)
assign tap_Di = wdata;
assign tap_A = (ap_idle) ? ((awvalid) ? awaddr[6:0] : araddr_tmp[6:0]) : (tapA_cnter_next << 2);
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
data_length <= 0;
tap_length <= 0;
wready_tmp <= 0;
awready_tmp <= 0;
arready_tmp <= 0;
rvalid_tmp <= 0;
araddr_tmp <= 0;
end else begin
data_length <= data_length_next;
tap_length <= tap_length_next;
wready_tmp <= wready_next;
awready_tmp <= awready_next;
arready_tmp <= arready_next;
rvalid_tmp <= rvalid_next;
araddr_tmp <= araddr_next;
end
end
always @(*) begin
// managing wready and awready
if (awvalid && wvalid && !awready) begin
wready_next = PULLU;
awready_next = PULLU;
end else begin
wready_next = PULLD;
awready_next = PULLD;
end
// managing arready
if (arvalid && !arready) begin
arready_next = PULLU;
end else begin
arready_next = PULLD;
end
// managing rvalid
if (arready) begin
rvalid_next = PULLU;
end else if (rready) begin
rvalid_next = PULLD;
end else begin
rvalid_next = rvalid_tmp;
end
// managing araddr_tmp (store the value of araddr until shakehand)
if (arvalid) begin
araddr_next = araddr;
end else if (rvalid && rready) begin
araddr_next = INVALID_ADDR;
end else begin
araddr_next = araddr_tmp;
end
end
// define address of the three register (ap_crtl, data_length, tap_length) into 2 bits
// which is used in "condition"
always @(*) begin
if (araddr_tmp == AP_ADDR) begin
addr_define = 2'b11;
end else if (araddr_tmp == DATA_ADDR) begin
addr_define = 2'b01;
end else begin
addr_define = 2'b10;
end
end
assign condition = {araddr_tmp[7], ap_idle, rready && rvalid, addr_define};
always @(*) begin
casez (condition)
5'b111??: begin
rdata_tmp = tap_Do;
end
5'b101??: begin
rdata_tmp = INVALID_DATA;
end
5'b0?101: begin
rdata_tmp = data_length;
end
5'b0?110: begin
rdata_tmp = tap_length;
end
5'b0?111: begin
rdata_tmp = ap_crtl;
end
default: begin
rdata_tmp = {(pDATA_WIDTH){1'b0}};
end
endcase
end
assign rdata = rdata_tmp;
always @(*) begin
// define the flipflop of data_length
if ((|tap_WE) && (awaddr == DATA_ADDR)) begin
data_length_next = wdata;
end else begin
data_length_next = data_length;
end
// define the flipflop of tap_length
if ((|tap_WE) && (awaddr == TAP_ADDR)) begin
tap_length_next = wdata;
end else begin
tap_length_next = tap_length;
end
end
```
### Ap_ctrl

```verilog
//-------------------------- ap_idle & ap_done & ap_start -------------------------------
assign ap_crtl = {ap_idle, ap_done, ap_start};
// define the write in of ap_start
assign ap_start = ((|tap_WE) && (awaddr == AP_ADDR)) ? wdata[0] : PULLD;
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
ap_idle <= 1;
ap_done <= 0;
end else begin
ap_idle <= ap_idle_next;
ap_done <= ap_done_next;
end
end
always @(*) begin
// define the flipflop of ap_idle
if (ap_start) begin
ap_idle_next = PULLD;
end else if (sm_tlast && sm_tready) begin
ap_idle_next = PULLU;
end else begin
ap_idle_next = ap_idle;
end
// define the flipflop of ap_done
if (sm_tlast && sm_tready) begin
ap_done_next = PULLU;
end else if (ap_idle && ap_done && (araddr_tmp == AP_ADDR) && rready && rvalid) begin
ap_done_next = PULLD;
end else begin
ap_done_next = ap_done;
end
end
```
### Axi_Stream SS (input)


We can see that input x has already input the 31th x, but our engine only done 6 stage of calculation.
```verilog
//-------------------------- Axi-Stream SS (input X) -------------------------------
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
data_cnt_now <= 0;
x_buffer <= INVALID_DATA;
data_filled <= 0;
ss_tdone <= 0;
ss_tready_tmp <= 0;
end else begin
data_cnt_now <= data_cnt_now_next;
x_buffer <= x_buffer_next;
data_filled <= data_filled_next;
ss_tdone <= ss_tdone_next;
ss_tready_tmp <= ss_tready_next;
end
end
assign ss_tready = ss_tready_tmp;
// used as a switch to turn off data_EN and tap_EN
assign ss_tdone_next = (ss_tready && ss_tlast) ? PULLU :
(sm_tready && sm_tlast) ? PULLD : ss_tdone;
// data_filled = 1 if data_RAM is filled
assign data_filled_next = (sm_tdone) ? PULLD :
(data_cnt_now_next == tap_length) && !ap_idle ? PULLU : data_filled;
// define data_cnt_now_next and x_buffer_next
always @(*) begin
if (ap_idle) begin
data_cnt_now_next = {(pDATA_WIDTH-1){1'b0}};
x_buffer_next = INVALID_DATA;
end else if (ss_tvalid && ss_tready) begin
data_cnt_now_next = data_cnt_now + 1;
x_buffer_next = ss_tdata;
end else if (sm_tdone) begin // ss_tdone
data_cnt_now_next = {(pDATA_WIDTH-1){1'b0}};
x_buffer_next = INVALID_DATA;
end else begin
data_cnt_now_next = data_cnt_now;
x_buffer_next = x_buffer;
end
end
// define ss_tready_next
always @(*) begin
if ((!data_filled || ((data_cnt_now_next == data_cnt_state_next) && !ss_tdone)) && ss_tvalid && !ss_tready && !ap_idle) begin
ss_tready_next = PULLU;
end else begin
ss_tready_next = PULLD;
end
end
```
### Fir engine

```verilog
//-------------------------- FIR engine -------------------------------
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
data_cnt_state <= 0;
data_cnt_norm <= 0;
tapA_cnter <= CNTERA_INVALID;
dataA_cnter <= CNTERA_INVALID;
cnter_indata <= 0;
write_sr <= 0;
op_cnter1 <= CNTERA_INVALID;
op_cnter2 <= CNTERA_INVALID;
op_cnter3 <= CNTERA_INVALID;
mul_ctrl1 <= 0;
mul_ctrl2 <= 0;
xi <= INVALID_DATA;
tapi <= INVALID_DATA;
sum_ctrl <= 0;
sumi <= 0;
muli <= 0;
y_locked <= 0;
y_storage <= INVALID_DATA;
y_change_valid <= 0;
y_buffer <= INVALID_DATA;
sm_tdone <= 0;
end else begin
data_cnt_state <= data_cnt_state_next;
data_cnt_norm <= data_cnt_norm_next;
tapA_cnter <= tapA_cnter_next;
dataA_cnter <= dataA_cnter_next;
cnter_indata <= cnter_indata_next;
write_sr <= write_sr_next;
op_cnter1 <= op_cnter1_next;
op_cnter2 <= op_cnter2_next;
op_cnter3 <= op_cnter3_next;
mul_ctrl1 <= mul_ctrl1_next;
mul_ctrl2 <= mul_ctrl2_next;
xi <= xi_next;
tapi <= tapi_next;
sum_ctrl <= sum_ctrl_next;
sumi <= sumi_next;
muli <= muli_next;
y_locked <= y_locked_next;
y_storage <= y_storage_next;
y_change_valid <= y_change_valid_next;
y_buffer <= y_buffer_next;
sm_tdone <= sm_tdone_next;
end
end
// define data_cnt_state_next
always @(*) begin
if (ap_idle) begin // oridinally no ap_idle
data_cnt_state_next = 0;
end else if ((data_cnt_now_next == 1) && (data_cnt_now == 0)) begin
data_cnt_state_next = 1;
end else if (tapA_cnter == 1 && !(y_locked || y_locked_next) && (data_cnt_state < data_cnt_now_next)) begin
data_cnt_state_next = data_cnt_state + 1;
end else begin
data_cnt_state_next = data_cnt_state;
end
end
// define data_cnt_norm_next, which is the loop of 1 to tap_lentgh - 1
always @(*) begin
if (ap_idle) begin // sm_tdone
data_cnt_norm_next = 0;
end else if (!(data_cnt_state == data_cnt_state_next) && (data_cnt_norm == tap_length)) begin
data_cnt_norm_next = 1;
end else if (!(data_cnt_state == data_cnt_state_next)) begin
data_cnt_norm_next = data_cnt_norm + 1;
end else begin
data_cnt_norm_next = data_cnt_norm;
end
end
// define tapA_cnter_next
always @(*) begin
if (ap_idle) begin // sm_tdone
tapA_cnter_next = CNTERA_INVALID;
end else if (!(data_cnt_state == data_cnt_state_next)) begin
tapA_cnter_next = 0;
end else if (!data_write && (tapA_cnter == 0) && (data_cnt_state_next < tap_length)) begin
tapA_cnter_next = data_cnt_state_next;
end else if (!data_write && (tapA_cnter == 0)) begin
tapA_cnter_next = tap_length - 1;
end else if (tapA_cnter > 1 && !data_write && !(tapA_cnter == CNTERA_INVALID)) begin //only tapA_cnter > 1
tapA_cnter_next = tapA_cnter - 1;
end else begin
tapA_cnter_next = tapA_cnter;
end
end
// define cnter_indata
always @(*) begin
if (ap_idle || (!(data_cnt_state == data_cnt_state_next) && (cnter_indata == tap_length - 1))) begin
cnter_indata_next = 0;
end else if (!(data_cnt_state == data_cnt_state_next) && data_cnt_state_next >= tap_length) begin
cnter_indata_next = cnter_indata + 1;
end else begin
cnter_indata_next = cnter_indata;
end
end
// define dataA_cnter_next
always @(*) begin
if (ap_idle) begin // ss_tdone
dataA_cnter_next = CNTERA_INVALID;
end else if (!(data_cnt_state == data_cnt_state_next)) begin
dataA_cnter_next = data_cnt_norm_next - 1;
end else if (!data_write && ((tapA_cnter == 0 && cnter_indata == 0) || (!(tapA_cnter == 0) && !(tapA_cnter_next == tapA_cnter) && (dataA_cnter + 1 == tap_length)))) begin
dataA_cnter_next = 0;
end else if (tapA_cnter == 0 && !data_write) begin
dataA_cnter_next = cnter_indata_next;
end else if (!(tapA_cnter_next == tapA_cnter) && !data_write) begin
dataA_cnter_next = dataA_cnter + 1;
end else begin
dataA_cnter_next = dataA_cnter;
end
end
// define write_sr_next
always @(*) begin
if (!(data_cnt_state == data_cnt_state_next) && !(data_cnt_state_next == data_cnt_now_next) && ss_tready) begin
write_sr_next = PULLU;
end else begin
write_sr_next = PULLD;
end
end
// define data_write
always @(*) begin
if ((!data_filled && write_sr) || ((data_cnt_state_next > tap_length) && !(data_cnt_state == data_cnt_state_next))) begin
data_write = PULLU;
end else if (!data_filled && !(write_sr_next == PULLU)) begin
data_write = ss_tready;
end else begin
data_write = PULLD;
end
end
```
* `data_cnt_now_next`: represent the nth input.
* `data_cnt_state_next`: represent the nth stage of calculation, the first tap and data cnter in the stage is the final thing to add for the n-1th stage. Meaning that our stage will pre-calculate the summation for the input x (nth) of the rest of the stage in (n-1th).
* `tapA_cnter_next` and `dataA_cnter_next`: the cnter for address input of `tap_A` and `data_A`
* `data_cnt_norm_next`: normalized state, which will be a loop of `1 ~ tap_length`, for the assignment of cnter address.
* `cnter_indata`: is the cnter in `dataA_cnter_next`, since `dataA_cnter_next` is not a simple cnter, it is composed of this `cnter_indata` and other condition.
* `write_sr_next`: Since the first cnter address of the stage is to do the final summation of the previous stage, and we use the change of the `data_cnt_state_next` to determine the change of the cnter. Therefore, we have to ensure that the `data_write` only write in the first cnter address if it is to write the input x of this stage.
* `data_write`: To write the values in `x_buffer` into SRAM.

```verilog
// assign data_RAM ouput signals
assign data_WE = {(4){data_write}};
assign data_EN = (|data_cnt_state_next) || ss_tdone;
assign data_Di = x_buffer_next;
assign data_A_tmp = (data_write && !data_filled) ? (data_cnt_now_next - 1) << 2 : dataA_cnter_next << 2;
assign data_A = data_A_tmp[(pADDR_WIDTH-1):0]; // bcs data_cnt_now_next is [(pDATA_WIDTH-1):0]
// assign xi and tapi for pipelining
assign xi_next = data_Do;
assign tapi_next = tap_Do;
// assign op_cnter_next
assign op_cnter1_next = tapA_cnter;
assign op_cnter2_next = op_cnter1;
assign op_cnter3_next = op_cnter2;
// define mul_ctrl
assign mul_ctrl1_next = !(tapA_cnter_next == tapA_cnter) ? PULLU : PULLD;
assign mul_ctrl2_next = mul_ctrl1;
// define sum_ctrl
assign sum_ctrl_next = mul_ctrl2;
// manage muli: do multiply operation
always @(*) begin
if (mul_ctrl2) begin
muli_next = xi * tapi;
end else begin
muli_next = muli;
end
end
// manage sumi: do sum operation
always @(*) begin
if ((sum_ctrl && (op_cnter3 == CNTERA_INVALID)) || (sum_ctrl && (op_cnter3 == 0))) begin
sumi_next = muli;
end else if (sum_ctrl) begin
sumi_next = muli + sumi;
end else begin
sumi_next = sumi;
end
end
// define y_locked
always @(*) begin
if (op_cnter2 == 1 && op_cnter1 == 0 && sm_tvalid && !sm_tready) begin
y_locked_next = PULLU;
end else if (sm_tready) begin
y_locked_next = PULLD;
end else begin
y_locked_next = y_locked;
end
end
// assign sm_tdone
assign sm_tdone_next = sm_tlast && sm_tready && ~sm_tdone;
// define y_storage
always @(*) begin
if (op_cnter3 == 1 && op_cnter2 == 0 && y_locked) begin
y_storage_next = sumi_next;
end else if (sm_tdone) begin
y_storage_next = INVALID_DATA;
end else begin
y_storage_next = y_storage;
end
end
// define y_change_valid
assign y_change_valid_next = ((y_locked_next == 0) && (y_locked == 1)) ? PULLU : PULLD;
// define y_buffer
always @(*) begin
if (sm_tdone || ap_idle) begin
y_buffer_next = INVALID_DATA;
y_cnter_next = 0;
end else if ((op_cnter3 == CNTERA_INVALID && op_cnter2 == 0) || (op_cnter3 == 1 && op_cnter2 == 0 && !y_locked)) begin
y_buffer_next = sumi_next;
y_cnter_next = y_cnter + 1;
end else if (y_change_valid) begin
y_buffer_next = y_storage;
y_cnter_next = y_cnter + 1;
end else begin
y_buffer_next = y_buffer;
y_cnter_next = y_cnter;
end
end
```
* `mul_ctrl2` and `sum_ctrl`: is when the mul and sum do operation, these two are determined by the change of `tapA_cnter`.
* `y_locked_next`: When y is ready to output but the CPU did not response, `y_locked` will be pull up, and stall the `data_cnt_state_next`.
* `y_storage_next`: Is to storage y that should be sent but CPU have not accept yet, and let the engine do the pre-calculation.
* `y_change_valid`: It is use to determine whether `y_buffer_next` should be the value of final summation or the value of y_storage.
### Axi-Stream SM (output)

```verilog
//-------------------------- Axi-Stream SM (Output Y) -------------------------------
always @(posedge axis_clk or negedge axis_rst_n) begin
if (!axis_rst_n) begin
sm_tvalid_tmp <= 0;
sm_tlast_tmp <= 0;
y_cnter <= 0;
end else begin
sm_tvalid_tmp <= sm_tvalid_next;
sm_tlast_tmp <= sm_tlast_next;
y_cnter <= y_cnter_next;
end
end
// assign to sm_tvalid
assign sm_tvalid = sm_tvalid_tmp;
// define sm_tvalid
always @(*) begin
if ((op_cnter3 == CNTERA_INVALID && op_cnter2 == 0) || (op_cnter3 == 1 && op_cnter2 == 0 && !y_locked) || (y_change_valid && !ap_idle)) begin
sm_tvalid_next = PULLU;
end else if (sm_tready) begin
sm_tvalid_next = PULLD;
end else begin
sm_tvalid_next = sm_tvalid_tmp;
end
end
// define sm_tlast
always @(*) begin
if (!ap_idle && y_cnter_next == data_length && sm_tvalid_next == PULLU) begin
sm_tlast_next = PULLU;
end else if (sm_tvalid_next == PULLD) begin
sm_tlast_next = PULLD;
end else begin
sm_tlast_next = sm_tlast_tmp;
end
end
// assign sm_tlast
assign sm_tlast = sm_tlast_tmp;
// assign to sm_tdata
assign sm_tdata = (sm_tready) ? y_buffer_next : INVALID_DATA;
```
## Waveform overview


## Syn Report
### Time

### Area
