SysArray - MatMul Device Part 1: The Systolic Array

# Systolic Array MatMul Device Part 1: The Systolic Array A systolic array is a general design pattern for computation that can speed up execution when compared to traditional sequential code with a form of pipelining - storing computation state in between clock cycles and passing this state between components to allow for parallel execution of these components. I designed a device that uses the systolic array pattern to implement simple matrix multiplication; i.e. executing $C = A * B + D$ - I'll use these notes to describe the design of device, building the device up from the core systolic array with additional features like threading, a controller, and ISA, and comms. The final idea is to have a separate device that can dynamically load memory (matrices) and instructions, apply those instructions (namely matrix multiplication on the loaded matrices), and be able to return back the memory to CPU - this video demonstrates the overall flow from a program running on a CPU interacting with the (virtual) device: [![Youtube link - Kademlia demo](https://img.youtube.com/vi/bAgslo_tx7I/0.jpg)](https://youtu.be/bAgslo_tx7I) ## High-Level Systolic Array Architecture A systolic array is a matrix of processing elements (PEs) - each PE stores data, accepts input from preceding PEs, executes computation, and writes output to successive PEs - the general idea is that data should "flow" through the systolic array from the top -> down and from left -> right: ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" "PE_1_0" "PE_2_0" "PE_3_0" "PE_4_0" [style=invis] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" "PE_1_1" "PE_2_1" "PE_3_1" "PE_4_1" [style=invis] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" "PE_1_2" "PE_2_2" "PE_3_2" "PE_4_2" [style=invis] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" "PE_1_3" "PE_2_3" "PE_3_3" "PE_4_3" [style=invis] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] } "PE_0_-1" -> "PE_0_0" [label="input data"] "PE_0_0" -> "PE_0_1" -> "PE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" [label="output data (unused)"] "PE_1_-1" -> "PE_1_0" [label="input data"] "PE_1_0" -> "PE_1_1" -> "PE_1_2" -> "PE_1_3" "PE_1_3" -> "PE_1_4" [label="output data (unused)"] "PE_2_-1" -> "PE_2_0" [label="input data"] "PE_2_0" -> "PE_2_1" -> "PE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" [label="output data (unused)"] "PE_3_-1" -> "PE_3_0" [label="input data"] "PE_3_0" -> "PE_3_1" -> "PE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" [label="output data (unused)"] "PE_-1_0" -> "PE_0_0" [label="input data"] "PE_0_0" -> "PE_1_0" -> "PE_2_0" -> "PE_3_0" "PE_3_0" -> "PE_4_0" [label="output data"] "PE_-1_1" -> "PE_0_1" [label="input data"] "PE_0_1" -> "PE_1_1" -> "PE_2_1" -> "PE_3_1" "PE_3_1" -> "PE_4_1" [label="output data"] "PE_-1_2" -> "PE_0_2" [label="input data"] "PE_0_2" -> "PE_1_2" -> "PE_2_2" -> "PE_3_2" "PE_3_2" -> "PE_4_2" [label="output data"] "PE_-1_3" -> "PE_0_3" [label="input data"] "PE_0_3" -> "PE_1_3" -> "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" [label="output data"] } ``` Each PE is identically defined and for the purposes of matrix multiplication, will execute scalar multiplication + addition. **The goal will be to store $B$ directly in the PEs, pass $A$ into the array from the left, pass $D$ into the array from the top, and orchestrate all of this correctly so that $C$ comes out the bottom.** ## Rearranging Matrix Multiplication for the Systolic Array All future references to matrix multiplication ($C = A * B + D$) will assume that $A, B, C, D \in \mathbb{Z}^{n \times n}$. Matrix multiplication ($C = A * B + D$) can be defined by using the dot product operator on the rows/columns of $A$ and $B$ respectively: $$ C_{ij} = D_{ij} + \sum_{k=1}^{n} A_{ik} * B_{kj}$$ It will actually be more useful to express this using $A^T$: $$ C_{ij} = D_{ij} + \sum_{k=1}^{n} A^T_{ki} * B_{kj}$$ $$ C_{ij} = D_{ij} + \mathbf{A^T_{i}} \cdot \mathbf{B_{j}}$$ So each element of $C$ can be defined using the dot product of columns of $A^T$ and $B$ (where $\mathbf{A^T_{k}}$ is the k'th column of $A^T$ treated as a vector). Additionally, in order to calculate every $C_{ij}$ we will need to calculate the dot product of every possible pair of columns $\mathbf{A^T_{i}}$ and $\mathbf{B_{j}}$. From here we can construct a rudimentary use of the systolic array - for now we'll assume that **there is not pipelining** from top -> bottom but **there is pipelining** from left -> right: - store $B$ directly in a register in the PEs in the systolic array (e.g. store $B_{00}$ in a register at `PE_0_0` above) - feed each column of $A^T$ as input from the left and move each column from left to right on each clock cycle; i.e. - Clock cycle 0: input - $\mathbf{A^T_0}$ to column 0 of the systolic array - Clock cycle 1: input - $\mathbf{A^T_0}$ to column 1 of the systolic array - $\mathbf{A^T_1}$ to column 0 of the systolic array - Clock cycle 2: input - $\mathbf{A^T_0}$ to column 2 of the systolic array - $\mathbf{A^T_1}$ to column 1 of the systolic array - $\mathbf{A^T_2}$ to column 0 of the systolic array - etc. - when column $\mathbf{A^T_{i}}$ is input to column $j$ of the systolic array (i.e., over the column of PEs that is storing $\mathbf{B_{j}}$) - feed $D_{ij}$ as input from the top (of that column) and at each PE compute: $$ out_{bottom} = in_{top} + reg * in_{left} $$ This is sufficient to correctly execute matrix multiplication - following each column of the systolic array from top -> bottom, each PE is performing a multiplication and an "accumulation," adding the result of the multiplication to the previous sum and passing it down. Since there isn't any top -> bottom pipelining (yet) the bottom output is set on the same clock cycle - e.g. for Clock cycle 0 above: - `PE_0_0`: $out_{bottom} = (D_{00}) + A^T_{00} * B_{00}$ - `PE_1_0`: $out_{bottom} = (D_{00} + A^T_{00} * B_{00}) + A^T_{10} * B_{10}$ - `PE_2_0`: $out_{bottom} = (D_{00} + A^T_{00} * B_{00} + A^T_{10} * B_{10}) + A^T_{20} * B_{20}$ - `PE_3_0`: $out_{bottom} = (D_{00} + A^T_{00} * B_{00} + A^T_{10} * B_{10} + A^T_{20} * B_{20}) + A^T_{30} * B_{30}$ which is exactly what we want to compute: $D_{00} + \mathbf{A^T_{0}} \cdot \mathbf{B_{0}} = C_{00}$ This is what that looks like at Clock cycles 0, 1, and 2 overlaid onto the systolic array graph (note that I've added registers for left -> right pipelining - the `RE` elements are clocked registers that will store the entries in $A$ and pass them to the successive PE on the next clock cycle). ### CLOCK CYCLE 0 ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" [label="PE_0_0\n (B_00)"] "PE_1_0" [label="PE_1_0\n (B_10)"] "PE_2_0" [label="PE_2_0\n (B_20)"] "PE_3_0" [label="PE_3_0\n (B_30)"] "PE_4_0" [style=invis] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" [label="PE_0_1\n (B_01)"] "PE_1_1" [label="PE_1_1\n (B_11)"] "PE_2_1" [label="PE_2_1\n (B_21)"] "PE_3_1" [label="PE_3_1\n (B_31)"] "PE_4_1" [style=invis] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" [label="PE_0_2\n (B_02)"] "PE_1_2" [label="PE_1_2\n (B_12)"] "PE_2_2" [label="PE_2_2\n (B_22)"] "PE_3_2" [label="PE_3_2\n (B_32)"] "PE_4_2" [style=invis] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" [label="PE_0_3\n (B_03)"] "PE_1_3" [label="PE_1_3\n (B_13)"] "PE_2_3" [label="PE_2_3\n (B_23)"] "PE_3_3" [label="PE_3_3\n (B_33)"] "PE_4_3" [style=invis] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] "RE_0_0" [shape=note] "RE_0_1" [shape=note] "RE_0_2" [shape=note] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] "RE_1_0" [shape=note] "RE_1_1" [shape=note] "RE_1_2" [shape=note] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] "RE_2_0" [shape=note] "RE_2_1" [shape=note] "RE_2_2" [shape=note] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] "RE_3_0" [shape=note] "RE_3_1" [shape=note] "RE_3_2" [shape=note] } # left -> right "PE_0_-1" -> "PE_0_0" [label="A^T_00"] "PE_0_0" -> "RE_0_0" [label="A^T_00"] "RE_0_0" -> "PE_0_1" "PE_0_1" -> "RE_0_1" "RE_0_1" -> "PE_0_2" "PE_0_2" -> "RE_0_2" "RE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" "PE_1_-1" -> "PE_1_0" [label="A^T_10"] "PE_1_0" -> "RE_1_0" [label="A^T_10"] "RE_1_0" -> "PE_1_1" "PE_1_1" -> "RE_1_1" "RE_1_1" -> "PE_1_2" "PE_1_2" -> "RE_1_2" "RE_1_2"-> "PE_1_3" "PE_1_3" -> "PE_1_4" "PE_2_-1" -> "PE_2_0" [label="A^T_20"] "PE_2_0" -> "RE_2_0" [label="A^T_20"] "RE_2_0" -> "PE_2_1" "PE_2_1" -> "RE_2_1" "RE_2_1" -> "PE_2_2" "PE_2_2" -> "RE_2_2" "RE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" "PE_3_-1" -> "PE_3_0" [label="A^T_30"] "PE_3_0" -> "RE_3_0" [label="A^T_30"] "RE_3_0" -> "PE_3_1" "PE_3_1" -> "RE_3_1" "RE_3_1" -> "PE_3_2" "PE_3_2" -> "RE_3_2" "RE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" # top -> bottom "PE_-1_0" -> "PE_0_0" [label="D_00"] "PE_0_0" -> "PE_1_0" [label="D_00\n + A^T_00 * B_00"] "PE_1_0" -> "PE_2_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10"] "PE_2_0" -> "PE_3_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20"] "PE_3_0" -> "PE_4_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20\n + A^T_30 * B_30\n -->C_00"] "PE_-1_1" -> "PE_0_1" [label="---"] "PE_0_1" -> "PE_1_1" "PE_1_1" -> "PE_2_1" "PE_2_1" -> "PE_3_1" "PE_3_1" -> "PE_4_1" "PE_-1_2" -> "PE_0_2" [label="---"] "PE_0_2" -> "PE_1_2" "PE_1_2" -> "PE_2_2" "PE_2_2" -> "PE_3_2" "PE_3_2" -> "PE_4_2" "PE_-1_3" -> "PE_0_3" [label="---"] "PE_0_3" -> "PE_1_3" "PE_1_3" -> "PE_2_3" "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" } ``` ### CLOCK CYCLE 1 ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" [label="PE_0_0\n (B_00)"] "PE_1_0" [label="PE_1_0\n (B_10)"] "PE_2_0" [label="PE_2_0\n (B_20)"] "PE_3_0" [label="PE_3_0\n (B_30)"] "PE_4_0" [style=invis] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" [label="PE_0_1\n (B_01)"] "PE_1_1" [label="PE_1_1\n (B_11)"] "PE_2_1" [label="PE_2_1\n (B_21)"] "PE_3_1" [label="PE_3_1\n (B_31)"] "PE_4_1" [style=invis] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" [label="PE_0_2\n (B_02)"] "PE_1_2" [label="PE_1_2\n (B_12)"] "PE_2_2" [label="PE_2_2\n (B_22)"] "PE_3_2" [label="PE_3_2\n (B_32)"] "PE_4_2" [style=invis] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" [label="PE_0_3\n (B_03)"] "PE_1_3" [label="PE_1_3\n (B_13)"] "PE_2_3" [label="PE_2_3\n (B_23)"] "PE_3_3" [label="PE_3_3\n (B_33)"] "PE_4_3" [style=invis] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] "RE_0_0" [shape=note] "RE_0_1" [shape=note] "RE_0_2" [shape=note] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] "RE_1_0" [shape=note] "RE_1_1" [shape=note] "RE_1_2" [shape=note] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] "RE_2_0" [shape=note] "RE_2_1" [shape=note] "RE_2_2" [shape=note] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] "RE_3_0" [shape=note] "RE_3_1" [shape=note] "RE_3_2" [shape=note] } # left -> right "PE_0_-1" -> "PE_0_0" [label="A^T_01"] "PE_0_0" -> "RE_0_0" [label="A^T_01"] "RE_0_0" -> "PE_0_1" [label="A^T_00"] "PE_0_1" -> "RE_0_1" [label="A^T_00"] "RE_0_1" -> "PE_0_2" "PE_0_2" -> "RE_0_2" "RE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" "PE_1_-1" -> "PE_1_0" [label="A^T_11"] "PE_1_0" -> "RE_1_0" [label="A^T_11"] "RE_1_0" -> "PE_1_1" [label="A^T_10"] "PE_1_1" -> "RE_1_1" [label="A^T_10"] "RE_1_1" -> "PE_1_2" "PE_1_2" -> "RE_1_2" "RE_1_2"-> "PE_1_3" "PE_1_3" -> "PE_1_4" "PE_2_-1" -> "PE_2_0" [label="A^T_21"] "PE_2_0" -> "RE_2_0" [label="A^T_21"] "RE_2_0" -> "PE_2_1" [label="A^T_20"] "PE_2_1" -> "RE_2_1" [label="A^T_20"] "RE_2_1" -> "PE_2_2" "PE_2_2" -> "RE_2_2" "RE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" "PE_3_-1" -> "PE_3_0" [label="A^T_31"] "PE_3_0" -> "RE_3_0" [label="A^T_31"] "RE_3_0" -> "PE_3_1" [label="A^T_30"] "PE_3_1" -> "RE_3_1" [label="A^T_30"] "RE_3_1" -> "PE_3_2" "PE_3_2" -> "RE_3_2" "RE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" # top -> bottom "PE_-1_0" -> "PE_0_0" [label="D_10"] "PE_0_0" -> "PE_1_0" [label="D_10\n + A^T_01 * B_00"] "PE_1_0" -> "PE_2_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10"] "PE_2_0" -> "PE_3_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20"] "PE_3_0" -> "PE_4_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20\n + A^T_31 * B_30\n -->C_10"] "PE_-1_1" -> "PE_0_1" [label="D_01"] "PE_0_1" -> "PE_1_1" [label="D_01\n + A^T_00 * B_01"] "PE_1_1" -> "PE_2_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11"] "PE_2_1" -> "PE_3_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21"] "PE_3_1" -> "PE_4_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21\n + A^T_30 * B_31\n -->C_01"] "PE_-1_2" -> "PE_0_2" [label="---"] "PE_0_2" -> "PE_1_2" "PE_1_2" -> "PE_2_2" "PE_2_2" -> "PE_3_2" "PE_3_2" -> "PE_4_2" "PE_-1_3" -> "PE_0_3" [label="---"] "PE_0_3" -> "PE_1_3" "PE_1_3" -> "PE_2_3" "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" } ``` ### CLOCK CYCLE 2 ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" [label="PE_0_0\n (B_00)"] "PE_1_0" [label="PE_1_0\n (B_10)"] "PE_2_0" [label="PE_2_0\n (B_20)"] "PE_3_0" [label="PE_3_0\n (B_30)"] "PE_4_0" [style=invis] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" [label="PE_0_1\n (B_01)"] "PE_1_1" [label="PE_1_1\n (B_11)"] "PE_2_1" [label="PE_2_1\n (B_21)"] "PE_3_1" [label="PE_3_1\n (B_31)"] "PE_4_1" [style=invis] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" [label="PE_0_2\n (B_02)"] "PE_1_2" [label="PE_1_2\n (B_12)"] "PE_2_2" [label="PE_2_2\n (B_22)"] "PE_3_2" [label="PE_3_2\n (B_32)"] "PE_4_2" [style=invis] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" [label="PE_0_3\n (B_03)"] "PE_1_3" [label="PE_1_3\n (B_13)"] "PE_2_3" [label="PE_2_3\n (B_23)"] "PE_3_3" [label="PE_3_3\n (B_33)"] "PE_4_3" [style=invis] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] "RE_0_0" [shape=note] "RE_0_1" [shape=note] "RE_0_2" [shape=note] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] "RE_1_0" [shape=note] "RE_1_1" [shape=note] "RE_1_2" [shape=note] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] "RE_2_0" [shape=note] "RE_2_1" [shape=note] "RE_2_2" [shape=note] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] "RE_3_0" [shape=note] "RE_3_1" [shape=note] "RE_3_2" [shape=note] } # left -> right "PE_0_-1" -> "PE_0_0" [label="A^T_02"] "PE_0_0" -> "RE_0_0" [label="A^T_02"] "RE_0_0" -> "PE_0_1" [label="A^T_01"] "PE_0_1" -> "RE_0_1" [label="A^T_01"] "RE_0_1" -> "PE_0_2" [label="A^T_00"] "PE_0_2" -> "RE_0_2" [label="A^T_00"] "RE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" "PE_1_-1" -> "PE_1_0" [label="A^T_12"] "PE_1_0" -> "RE_1_0" [label="A^T_12"] "RE_1_0" -> "PE_1_1" [label="A^T_11"] "PE_1_1" -> "RE_1_1" [label="A^T_11"] "RE_1_1" -> "PE_1_2" [label="A^T_10"] "PE_1_2" -> "RE_1_2" [label="A^T_10"] "RE_1_2"-> "PE_1_3" "PE_1_3" -> "PE_1_4" "PE_2_-1" -> "PE_2_0" [label="A^T_22"] "PE_2_0" -> "RE_2_0" [label="A^T_22"] "RE_2_0" -> "PE_2_1" [label="A^T_21"] "PE_2_1" -> "RE_2_1" [label="A^T_21"] "RE_2_1" -> "PE_2_2" [label="A^T_20"] "PE_2_2" -> "RE_2_2" [label="A^T_20"] "RE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" "PE_3_-1" -> "PE_3_0" [label="A^T_32"] "PE_3_0" -> "RE_3_0" [label="A^T_32"] "RE_3_0" -> "PE_3_1" [label="A^T_31"] "PE_3_1" -> "RE_3_1" [label="A^T_31"] "RE_3_1" -> "PE_3_2" [label="A^T_30"] "PE_3_2" -> "RE_3_2" [label="A^T_30"] "RE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" # top -> bottom "PE_-1_0" -> "PE_0_0" [label="D_20"] "PE_0_0" -> "PE_1_0" [label="D_20\n + A^T_02 * B_00"] "PE_1_0" -> "PE_2_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10"] "PE_2_0" -> "PE_3_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10\n + A^T_22 * B_20"] "PE_3_0" -> "PE_4_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10\n + A^T_22 * B_20\n + A^T_32 * B_30\n -->C_20"] "PE_-1_1" -> "PE_0_1" [label="D_11"] "PE_0_1" -> "PE_1_1" [label="D_11\n + A^T_01 * B_01"] "PE_1_1" -> "PE_2_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11"] "PE_2_1" -> "PE_3_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11\n + A^T_21 * B_21"] "PE_3_1" -> "PE_4_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11\n + A^T_21 * B_21\n + A^T_31 * B_31\n -->C_11"] "PE_-1_2" -> "PE_0_2" [label="D_02"] "PE_0_2" -> "PE_1_2" [label="D_02\n + A^T_00 * B_02"] "PE_1_2" -> "PE_2_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12"] "PE_2_2" -> "PE_3_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12\n + A^T_20 * B_22"] "PE_3_2" -> "PE_4_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12\n + A^T_20 * B_22\n + A^T_30 * B_32\n -->C_02"] "PE_-1_3" -> "PE_0_3" [label="---"] "PE_0_3" -> "PE_1_3" "PE_1_3" -> "PE_2_3" "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" } ``` ## Top Down Pipelining A big problem with the rudimentary design above is that the entire dot product $C_{ij} = D_{ij} + \mathbf{A^T_i} \cdot \mathbf{B_j}$ is is computed in a single clock cycle - this means that the critical path of the circuit will include the multiplication + accumulation steps for every PE in an entire column (in the examples above - 4). This can be improved by adding top->bottom pipelining; i.e., adding registers in between each PE in a column of the systolic array. The tradeoff now is that a single dot product will take 4 clock cycles instead of 1. A balance between both competing factors is to add **"tiles"** - only adding registers between each k PEs for some k | n; e.g. for k = 2, n = 4(each tile contains 2 PEs) then each dot product will take 2 clock cycles and each clock cycle will have 2 multiplication + accumulation ops. The downside of tiling is added complexity - not to the circuit, but to the input "feeding", since each column of $A^T$ shouldn't be sent in its entirety on the same clock cycle. Instead the **left input has to be staggered** so that the left input to a tile (a chunk of a column $\mathbf{A^T_i}$) matches up with the column the was passed to the above tile on the **previous clock cycle**. This is what that looks like at Clock cycles 0, 1, and 2: ### CLOCK CYCLE 0 ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" [label="PE_0_0\n (B_00)"] "PE_1_0" [label="PE_1_0\n (B_10)"] "PE_2_0" [label="PE_2_0\n (B_20)"] "PE_3_0" [label="PE_3_0\n (B_30)"] "PE_4_0" [style=invis] "TE_1_0" [shape=note] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" [label="PE_0_1\n (B_01)"] "PE_1_1" [label="PE_1_1\n (B_11)"] "PE_2_1" [label="PE_2_1\n (B_21)"] "PE_3_1" [label="PE_3_1\n (B_31)"] "PE_4_1" [style=invis] "TE_1_1" [shape=note] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" [label="PE_0_2\n (B_02)"] "PE_1_2" [label="PE_1_2\n (B_12)"] "PE_2_2" [label="PE_2_2\n (B_22)"] "PE_3_2" [label="PE_3_2\n (B_32)"] "PE_4_2" [style=invis] "TE_1_2" [shape=note] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" [label="PE_0_3\n (B_03)"] "PE_1_3" [label="PE_1_3\n (B_13)"] "PE_2_3" [label="PE_2_3\n (B_23)"] "PE_3_3" [label="PE_3_3\n (B_33)"] "PE_4_3" [style=invis] "TE_1_3" [shape=note] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] "RE_0_0" [shape=note] "RE_0_1" [shape=note] "RE_0_2" [shape=note] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] "RE_1_0" [shape=note] "RE_1_1" [shape=note] "RE_1_2" [shape=note] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] "RE_2_0" [shape=note] "RE_2_1" [shape=note] "RE_2_2" [shape=note] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] "RE_3_0" [shape=note] "RE_3_1" [shape=note] "RE_3_2" [shape=note] } # left -> right "PE_0_-1" -> "PE_0_0" [label="A^T_00"] "PE_0_0" -> "RE_0_0" [label="A^T_00"] "RE_0_0" -> "PE_0_1" "PE_0_1" -> "RE_0_1" "RE_0_1" -> "PE_0_2" "PE_0_2" -> "RE_0_2" "RE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" "PE_1_-1" -> "PE_1_0" [label="A^T_10"] "PE_1_0" -> "RE_1_0" [label="A^T_10"] "RE_1_0" -> "PE_1_1" "PE_1_1" -> "RE_1_1" "RE_1_1" -> "PE_1_2" "PE_1_2" -> "RE_1_2" "RE_1_2"-> "PE_1_3" "PE_1_3" -> "PE_1_4" "PE_2_-1" -> "PE_2_0" "PE_2_0" -> "RE_2_0" "RE_2_0" -> "PE_2_1" "PE_2_1" -> "RE_2_1" "RE_2_1" -> "PE_2_2" "PE_2_2" -> "RE_2_2" "RE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" "PE_3_-1" -> "PE_3_0" "PE_3_0" -> "RE_3_0" "RE_3_0" -> "PE_3_1" "PE_3_1" -> "RE_3_1" "RE_3_1" -> "PE_3_2" "PE_3_2" -> "RE_3_2" "RE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" # top -> bottom "PE_-1_0" -> "PE_0_0" [label="D_00"] "PE_0_0" -> "PE_1_0" [label="D_00\n + A^T_00 * B_00"] "PE_1_0" -> "TE_1_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10"] "TE_1_0" -> "PE_2_0" "PE_2_0" -> "PE_3_0" "PE_3_0" -> "PE_4_0" "PE_-1_1" -> "PE_0_1" [label="---"] "PE_0_1" -> "PE_1_1" "PE_1_1" -> "TE_1_1" "TE_1_1" -> "PE_2_1" "PE_2_1" -> "PE_3_1" "PE_3_1" -> "PE_4_1" "PE_-1_2" -> "PE_0_2" [label="---"] "PE_0_2" -> "PE_1_2" "PE_1_2" -> "TE_1_2" "TE_1_2" -> "PE_2_2" "PE_2_2" -> "PE_3_2" "PE_3_2" -> "PE_4_2" "PE_-1_3" -> "PE_0_3" [label="---"] "PE_0_3" -> "PE_1_3" "PE_1_3" -> "TE_1_3" "TE_1_3" -> "PE_2_3" "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" } ``` ### CLOCK CYCLE 1 ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" [label="PE_0_0\n (B_00)"] "PE_1_0" [label="PE_1_0\n (B_10)"] "PE_2_0" [label="PE_2_0\n (B_20)"] "PE_3_0" [label="PE_3_0\n (B_30)"] "PE_4_0" [style=invis] "TE_1_0" [shape=note] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" [label="PE_0_1\n (B_01)"] "PE_1_1" [label="PE_1_1\n (B_11)"] "PE_2_1" [label="PE_2_1\n (B_21)"] "PE_3_1" [label="PE_3_1\n (B_31)"] "PE_4_1" [style=invis] "TE_1_1" [shape=note] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" [label="PE_0_2\n (B_02)"] "PE_1_2" [label="PE_1_2\n (B_12)"] "PE_2_2" [label="PE_2_2\n (B_22)"] "PE_3_2" [label="PE_3_2\n (B_32)"] "PE_4_2" [style=invis] "TE_1_2" [shape=note] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" [label="PE_0_3\n (B_03)"] "PE_1_3" [label="PE_1_3\n (B_13)"] "PE_2_3" [label="PE_2_3\n (B_23)"] "PE_3_3" [label="PE_3_3\n (B_33)"] "PE_4_3" [style=invis] "TE_1_3" [shape=note] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] "RE_0_0" [shape=note] "RE_0_1" [shape=note] "RE_0_2" [shape=note] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] "RE_1_0" [shape=note] "RE_1_1" [shape=note] "RE_1_2" [shape=note] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] "RE_2_0" [shape=note] "RE_2_1" [shape=note] "RE_2_2" [shape=note] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] "RE_3_0" [shape=note] "RE_3_1" [shape=note] "RE_3_2" [shape=note] } # left -> right "PE_0_-1" -> "PE_0_0" [label="A^T_01"] "PE_0_0" -> "RE_0_0" [label="A^T_01"] "RE_0_0" -> "PE_0_1" [label="A^T_00"] "PE_0_1" -> "RE_0_1" [label="A^T_00"] "RE_0_1" -> "PE_0_2" "PE_0_2" -> "RE_0_2" "RE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" "PE_1_-1" -> "PE_1_0" [label="A^T_11"] "PE_1_0" -> "RE_1_0" [label="A^T_11"] "RE_1_0" -> "PE_1_1" [label="A^T_10"] "PE_1_1" -> "RE_1_1" [label="A^T_10"] "RE_1_1" -> "PE_1_2" "PE_1_2" -> "RE_1_2" "RE_1_2"-> "PE_1_3" "PE_1_3" -> "PE_1_4" "PE_2_-1" -> "PE_2_0" [label="A^T_20"] "PE_2_0" -> "RE_2_0" [label="A^T_20"] "RE_2_0" -> "PE_2_1" "PE_2_1" -> "RE_2_1" "RE_2_1" -> "PE_2_2" "PE_2_2" -> "RE_2_2" "RE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" "PE_3_-1" -> "PE_3_0" [label="A^T_30"] "PE_3_0" -> "RE_3_0" [label="A^T_30"] "RE_3_0" -> "PE_3_1" "PE_3_1" -> "RE_3_1" "RE_3_1" -> "PE_3_2" "PE_3_2" -> "RE_3_2" "RE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" # top -> bottom "PE_-1_0" -> "PE_0_0" [label="D_10"] "PE_0_0" -> "PE_1_0" [label="D_10\n + A^T_01 * B_00"] "PE_1_0" -> "TE_1_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10"] "TE_1_0" -> "PE_2_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10"] "PE_2_0" -> "PE_3_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20"] "PE_3_0" -> "PE_4_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20\n + A^T_30 * B_30\n-->C_00"] "PE_-1_1" -> "PE_0_1" [label="D_01"] "PE_0_1" -> "PE_1_1" [label="D_01\n + A^T_00 * B_01"] "PE_1_1" -> "TE_1_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11"] "TE_1_1" -> "PE_2_1" "PE_2_1" -> "PE_3_1" "PE_3_1" -> "PE_4_1" "PE_-1_2" -> "PE_0_2" [label="---"] "PE_0_2" -> "PE_1_2" "PE_1_2" -> "TE_1_2" "TE_1_2" -> "PE_2_2" "PE_2_2" -> "PE_3_2" "PE_3_2" -> "PE_4_2" "PE_-1_3" -> "PE_0_3" [label="---"] "PE_0_3" -> "PE_1_3" "PE_1_3" -> "TE_1_3" "TE_1_3" -> "PE_2_3" "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" } ``` ### CLOCK CYCLE 2 ```graphviz digraph sys_array { nodesep=1.0 node [color=Black,fontname=Arial,shape=box] edge [color=Black,fontname=Arial] labelloc = "t" # columns subgraph { "PE_-1_-1" [style=invis] "PE_0_-1" [style=invis] "PE_1_-1" [style=invis] "PE_2_-1" [style=invis] "PE_3_-1" [style=invis] "PE_4_-1" [style=invis] } subgraph { "PE_-1_0" [style=invis] "PE_0_0" [label="PE_0_0\n (B_00)"] "PE_1_0" [label="PE_1_0\n (B_10)"] "PE_2_0" [label="PE_2_0\n (B_20)"] "PE_3_0" [label="PE_3_0\n (B_30)"] "PE_4_0" [style=invis] "TE_1_0" [shape=note] } subgraph { "PE_-1_1" [style=invis] "PE_0_1" [label="PE_0_1\n (B_01)"] "PE_1_1" [label="PE_1_1\n (B_11)"] "PE_2_1" [label="PE_2_1\n (B_21)"] "PE_3_1" [label="PE_3_1\n (B_31)"] "PE_4_1" [style=invis] "TE_1_1" [shape=note] } subgraph { "PE_-1_2" [style=invis] "PE_0_2" [label="PE_0_2\n (B_02)"] "PE_1_2" [label="PE_1_2\n (B_12)"] "PE_2_2" [label="PE_2_2\n (B_22)"] "PE_3_2" [label="PE_3_2\n (B_32)"] "PE_4_2" [style=invis] "TE_1_2" [shape=note] } subgraph { "PE_-1_3" [style=invis] "PE_0_3" [label="PE_0_3\n (B_03)"] "PE_1_3" [label="PE_1_3\n (B_13)"] "PE_2_3" [label="PE_2_3\n (B_23)"] "PE_3_3" [label="PE_3_3\n (B_33)"] "PE_4_3" [style=invis] "TE_1_3" [shape=note] } # rows subgraph { rank=same "PE_-1_-1" [style=invis] "PE_-1_0" [style=invis] "PE_-1_1" [style=invis] "PE_-1_2" [style=invis] "PE_-1_3" [style=invis] "PE_-1_4" [style=invis] } subgraph { rank=same "PE_0_-1" [style=invis] "PE_0_0" "PE_0_1" "PE_0_2" "PE_0_3" "PE_0_4" [style=invis] "RE_0_0" [shape=note] "RE_0_1" [shape=note] "RE_0_2" [shape=note] } subgraph { rank=same "PE_1_-1" [style=invis] "PE_1_0" "PE_1_1" "PE_1_2" "PE_1_3" "PE_1_4" [style=invis] "RE_1_0" [shape=note] "RE_1_1" [shape=note] "RE_1_2" [shape=note] } subgraph { rank=same "PE_2_-1" [style=invis] "PE_2_0" "PE_2_1" "PE_2_2" "PE_2_3" "PE_2_4" [style=invis] "RE_2_0" [shape=note] "RE_2_1" [shape=note] "RE_2_2" [shape=note] } subgraph { rank=same "PE_3_-1" [style=invis] "PE_3_0" "PE_3_1" "PE_3_2" "PE_3_3" "PE_3_4" [style=invis] "RE_3_0" [shape=note] "RE_3_1" [shape=note] "RE_3_2" [shape=note] } # left -> right "PE_0_-1" -> "PE_0_0" [label="A^T_02"] "PE_0_0" -> "RE_0_0" [label="A^T_02"] "RE_0_0" -> "PE_0_1" [label="A^T_01"] "PE_0_1" -> "RE_0_1" [label="A^T_01"] "RE_0_1" -> "PE_0_2" [label="A^T_00"] "PE_0_2" -> "RE_0_2" [label="A^T_00"] "RE_0_2" -> "PE_0_3" "PE_0_3" -> "PE_0_4" "PE_1_-1" -> "PE_1_0" [label="A^T_12"] "PE_1_0" -> "RE_1_0" [label="A^T_12"] "RE_1_0" -> "PE_1_1" [label="A^T_11"] "PE_1_1" -> "RE_1_1" [label="A^T_11"] "RE_1_1" -> "PE_1_2" [label="A^T_10"] "PE_1_2" -> "RE_1_2" [label="A^T_10"] "RE_1_2"-> "PE_1_3" "PE_1_3" -> "PE_1_4" "PE_2_-1" -> "PE_2_0" [label="A^T_21"] "PE_2_0" -> "RE_2_0" [label="A^T_21"] "RE_2_0" -> "PE_2_1" [label="A^T_20"] "PE_2_1" -> "RE_2_1" [label="A^T_20"] "RE_2_1" -> "PE_2_2" "PE_2_2" -> "RE_2_2" "RE_2_2" -> "PE_2_3" "PE_2_3" -> "PE_2_4" "PE_3_-1" -> "PE_3_0" [label="A^T_31"] "PE_3_0" -> "RE_3_0" [label="A^T_31"] "RE_3_0" -> "PE_3_1" [label="A^T_30"] "PE_3_1" -> "RE_3_1" [label="A^T_30"] "RE_3_1" -> "PE_3_2" "PE_3_2" -> "RE_3_2" "RE_3_2" -> "PE_3_3" "PE_3_3" -> "PE_3_4" # top -> bottom "PE_-1_0" -> "PE_0_0" [label="D_20"] "PE_0_0" -> "PE_1_0" [label="D_20\n + A^T_02 * B_00"] "PE_1_0" -> "TE_1_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10"] "TE_1_0" -> "PE_2_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10"] "PE_2_0" -> "PE_3_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20"] "PE_3_0" -> "PE_4_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20\n + A^T_31 * B_30\n-->C_10"] "PE_-1_1" -> "PE_0_1" [label="D_11"] "PE_0_1" -> "PE_1_1" [label="D_11\n + A^T_01 * B_01"] "PE_1_1" -> "TE_1_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11"] "TE_1_1" -> "PE_2_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11"] "PE_2_1" -> "PE_3_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21"] "PE_3_1" -> "PE_4_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21\n + A^T_30 * B_31\n-->C_01"] "PE_-1_2" -> "PE_0_2" [label="D_02"] "PE_0_2" -> "PE_1_2" [label="D_02\n + A^T_00 * B_02"] "PE_1_2" -> "TE_1_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12"] "TE_1_2" -> "PE_2_2" "PE_2_2" -> "PE_3_2" "PE_3_2" -> "PE_4_2" "PE_-1_3" -> "PE_0_3" [label="---"] "PE_0_3" -> "PE_1_3" "PE_1_3" -> "TE_1_3" "TE_1_3" -> "PE_2_3" "PE_2_3" -> "PE_3_3" "PE_3_3" -> "PE_4_3" } ``` A key difference here is that it takes k - 1 clock cycles for the systolic array to "warm up" before it starts outputting results. This helps to quantify the tradeoff mentioned above - doubling the number of tiles would double the number of warmup clock cycles but halve the critical path of the systolic array. ## Loading $B$ + Wrapping up the PE The last consideration for the core systolic array is how to set it up - specifically how to load $B$ directly into the systolic array. $B$ will be loaded from the top of the systolic array - passing each row of $B$ (starting from the **bottom** row, not the top row) down through the systolic array until it reaches its corresponding row in the array and storing it in a register in the PE. It wastes clock cycles to completely load $B$ into the array before feeding $A$ and $D$ to the array but this is required since $B$ is required for the computation. A neat trick to forego this synchronization problem (and which will allow for threading later) is to **store multiple $B$ matrices in the systolic array** - in particular each PE will store 2 registers for 2 separate matrices $B_0$ and $B_1$. We still need to completely load a matrix $B_i$ before using it for computation - however with 2 $B$ matrices available it will be possible to use one matrix for computation while loading the other matrix - e.g. if $B_0$ is already loaded we can feed $A$ and $D$ into the systolic array and **at the same time** feed $B_1$ into the systolic array. Once the computation using $B_0$ is done we can then immediately use $B_1$ for computation without having to wait - this allows us to always have the computation processing on the systolic array without blocking on loading each $B_i$. This will be the last modification to the PE - we'll just need to add 3 new signals: - `B` - an input signal (from the top PE) to load the B data into the PE - `counter[3:0]` - an unsigned integer that indicates how many further rows the `B` signal should propagate before being stored - if `counter == 0` then the PE should store the `B` signal in its local register - `propagate` - indicates whether the top $B$ input should be written to local register 0 (for $B_0$) or 1 (for $B_1$) - additionally **the opposite** matrix will be used for the multiplication + accumulation ops with the top $D$ input and left $A$ input