# Systolic Array MatMul Device Part 1: The Systolic Array
A systolic array is a general design pattern for computation that can speed up execution when compared to traditional sequential code with a form of pipelining - storing computation state in between clock cycles and passing this state between components to allow for parallel execution of these components. I designed a device that uses the systolic array pattern to implement simple matrix multiplication; i.e. executing $C = A * B + D$ - I'll use these notes to describe the design of device, building the device up from the core systolic array with additional features like threading, a controller, and ISA, and comms.
The final idea is to have a separate device that can dynamically load memory (matrices) and instructions, apply those instructions (namely matrix multiplication on the loaded matrices), and be able to return back the memory to CPU - this video demonstrates the overall flow from a program running on a CPU interacting with the (virtual) device:
[](https://youtu.be/bAgslo_tx7I)
## High-Level Systolic Array Architecture
A systolic array is a matrix of processing elements (PEs) - each PE stores data, accepts input from preceding PEs, executes computation, and writes output to successive PEs - the general idea is that data should "flow" through the systolic array from the top -> down and from left -> right:
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0"
"PE_1_0"
"PE_2_0"
"PE_3_0"
"PE_4_0" [style=invis]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1"
"PE_1_1"
"PE_2_1"
"PE_3_1"
"PE_4_1" [style=invis]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2"
"PE_1_2"
"PE_2_2"
"PE_3_2"
"PE_4_2" [style=invis]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3"
"PE_1_3"
"PE_2_3"
"PE_3_3"
"PE_4_3" [style=invis]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
}
"PE_0_-1" -> "PE_0_0" [label="input data"]
"PE_0_0" -> "PE_0_1" -> "PE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4" [label="output data (unused)"]
"PE_1_-1" -> "PE_1_0" [label="input data"]
"PE_1_0" -> "PE_1_1" -> "PE_1_2" -> "PE_1_3"
"PE_1_3" -> "PE_1_4" [label="output data (unused)"]
"PE_2_-1" -> "PE_2_0" [label="input data"]
"PE_2_0" -> "PE_2_1" -> "PE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4" [label="output data (unused)"]
"PE_3_-1" -> "PE_3_0" [label="input data"]
"PE_3_0" -> "PE_3_1" -> "PE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4" [label="output data (unused)"]
"PE_-1_0" -> "PE_0_0" [label="input data"]
"PE_0_0" -> "PE_1_0" -> "PE_2_0" -> "PE_3_0"
"PE_3_0" -> "PE_4_0" [label="output data"]
"PE_-1_1" -> "PE_0_1" [label="input data"]
"PE_0_1" -> "PE_1_1" -> "PE_2_1" -> "PE_3_1"
"PE_3_1" -> "PE_4_1" [label="output data"]
"PE_-1_2" -> "PE_0_2" [label="input data"]
"PE_0_2" -> "PE_1_2" -> "PE_2_2" -> "PE_3_2"
"PE_3_2" -> "PE_4_2" [label="output data"]
"PE_-1_3" -> "PE_0_3" [label="input data"]
"PE_0_3" -> "PE_1_3" -> "PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3" [label="output data"]
}
```
Each PE is identically defined and for the purposes of matrix multiplication, will execute scalar multiplication + addition. **The goal will be to store $B$ directly in the PEs, pass $A$ into the array from the left, pass $D$ into the array from the top, and orchestrate all of this correctly so that $C$ comes out the bottom.**
## Rearranging Matrix Multiplication for the Systolic Array
All future references to matrix multiplication ($C = A * B + D$) will assume that $A, B, C, D \in \mathbb{Z}^{n \times n}$.
Matrix multiplication ($C = A * B + D$) can be defined by using the dot product operator on the rows/columns of $A$ and $B$ respectively:
$$ C_{ij} = D_{ij} + \sum_{k=1}^{n} A_{ik} * B_{kj}$$
It will actually be more useful to express this using $A^T$:
$$ C_{ij} = D_{ij} + \sum_{k=1}^{n} A^T_{ki} * B_{kj}$$
$$ C_{ij} = D_{ij} + \mathbf{A^T_{i}} \cdot \mathbf{B_{j}}$$
So each element of $C$ can be defined using the dot product of columns of $A^T$ and $B$ (where $\mathbf{A^T_{k}}$ is the k'th column of $A^T$ treated as a vector). Additionally, in order to calculate every $C_{ij}$ we will need to calculate the dot product of every possible pair of columns $\mathbf{A^T_{i}}$ and $\mathbf{B_{j}}$.
From here we can construct a rudimentary use of the systolic array - for now we'll assume that **there is not pipelining** from top -> bottom but **there is pipelining** from left -> right:
- store $B$ directly in a register in the PEs in the systolic array (e.g. store $B_{00}$ in a register at `PE_0_0` above)
- feed each column of $A^T$ as input from the left and move each column from left to right on each clock cycle; i.e.
- Clock cycle 0: input
- $\mathbf{A^T_0}$ to column 0 of the systolic array
- Clock cycle 1: input
- $\mathbf{A^T_0}$ to column 1 of the systolic array
- $\mathbf{A^T_1}$ to column 0 of the systolic array
- Clock cycle 2: input
- $\mathbf{A^T_0}$ to column 2 of the systolic array
- $\mathbf{A^T_1}$ to column 1 of the systolic array
- $\mathbf{A^T_2}$ to column 0 of the systolic array
- etc.
- when column $\mathbf{A^T_{i}}$ is input to column $j$ of the systolic array (i.e., over the column of PEs that is storing $\mathbf{B_{j}}$) - feed $D_{ij}$ as input from the top (of that column) and at each PE compute:
$$ out_{bottom} = in_{top} + reg * in_{left} $$
This is sufficient to correctly execute matrix multiplication - following each column of the systolic array from top -> bottom, each PE is performing a multiplication and an "accumulation," adding the result of the multiplication to the previous sum and passing it down. Since there isn't any top -> bottom pipelining (yet) the bottom output is set on the same clock cycle - e.g. for Clock cycle 0 above:
- `PE_0_0`: $out_{bottom} = (D_{00}) + A^T_{00} * B_{00}$
- `PE_1_0`: $out_{bottom} = (D_{00} + A^T_{00} * B_{00}) + A^T_{10} * B_{10}$
- `PE_2_0`: $out_{bottom} = (D_{00} + A^T_{00} * B_{00} + A^T_{10} * B_{10}) + A^T_{20} * B_{20}$
- `PE_3_0`: $out_{bottom} = (D_{00} + A^T_{00} * B_{00} + A^T_{10} * B_{10} + A^T_{20} * B_{20}) + A^T_{30} * B_{30}$
which is exactly what we want to compute: $D_{00} + \mathbf{A^T_{0}} \cdot \mathbf{B_{0}} = C_{00}$
This is what that looks like at Clock cycles 0, 1, and 2 overlaid onto the systolic array graph (note that I've added registers for left -> right pipelining - the `RE` elements are clocked registers that will store the entries in $A$ and pass them to the successive PE on the next clock cycle).
### CLOCK CYCLE 0
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0" [label="PE_0_0\n (B_00)"]
"PE_1_0" [label="PE_1_0\n (B_10)"]
"PE_2_0" [label="PE_2_0\n (B_20)"]
"PE_3_0" [label="PE_3_0\n (B_30)"]
"PE_4_0" [style=invis]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1" [label="PE_0_1\n (B_01)"]
"PE_1_1" [label="PE_1_1\n (B_11)"]
"PE_2_1" [label="PE_2_1\n (B_21)"]
"PE_3_1" [label="PE_3_1\n (B_31)"]
"PE_4_1" [style=invis]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2" [label="PE_0_2\n (B_02)"]
"PE_1_2" [label="PE_1_2\n (B_12)"]
"PE_2_2" [label="PE_2_2\n (B_22)"]
"PE_3_2" [label="PE_3_2\n (B_32)"]
"PE_4_2" [style=invis]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3" [label="PE_0_3\n (B_03)"]
"PE_1_3" [label="PE_1_3\n (B_13)"]
"PE_2_3" [label="PE_2_3\n (B_23)"]
"PE_3_3" [label="PE_3_3\n (B_33)"]
"PE_4_3" [style=invis]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
"RE_0_0" [shape=note]
"RE_0_1" [shape=note]
"RE_0_2" [shape=note]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
"RE_1_0" [shape=note]
"RE_1_1" [shape=note]
"RE_1_2" [shape=note]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
"RE_2_0" [shape=note]
"RE_2_1" [shape=note]
"RE_2_2" [shape=note]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
"RE_3_0" [shape=note]
"RE_3_1" [shape=note]
"RE_3_2" [shape=note]
}
# left -> right
"PE_0_-1" -> "PE_0_0" [label="A^T_00"]
"PE_0_0" -> "RE_0_0" [label="A^T_00"]
"RE_0_0" -> "PE_0_1"
"PE_0_1" -> "RE_0_1"
"RE_0_1" -> "PE_0_2"
"PE_0_2" -> "RE_0_2"
"RE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4"
"PE_1_-1" -> "PE_1_0" [label="A^T_10"]
"PE_1_0" -> "RE_1_0" [label="A^T_10"]
"RE_1_0" -> "PE_1_1"
"PE_1_1" -> "RE_1_1"
"RE_1_1" -> "PE_1_2"
"PE_1_2" -> "RE_1_2"
"RE_1_2"-> "PE_1_3"
"PE_1_3" -> "PE_1_4"
"PE_2_-1" -> "PE_2_0" [label="A^T_20"]
"PE_2_0" -> "RE_2_0" [label="A^T_20"]
"RE_2_0" -> "PE_2_1"
"PE_2_1" -> "RE_2_1"
"RE_2_1" -> "PE_2_2"
"PE_2_2" -> "RE_2_2"
"RE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4"
"PE_3_-1" -> "PE_3_0" [label="A^T_30"]
"PE_3_0" -> "RE_3_0" [label="A^T_30"]
"RE_3_0" -> "PE_3_1"
"PE_3_1" -> "RE_3_1"
"RE_3_1" -> "PE_3_2"
"PE_3_2" -> "RE_3_2"
"RE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4"
# top -> bottom
"PE_-1_0" -> "PE_0_0" [label="D_00"]
"PE_0_0" -> "PE_1_0" [label="D_00\n + A^T_00 * B_00"]
"PE_1_0" -> "PE_2_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10"]
"PE_2_0" -> "PE_3_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20"]
"PE_3_0" -> "PE_4_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20\n + A^T_30 * B_30\n -->C_00"]
"PE_-1_1" -> "PE_0_1" [label="---"]
"PE_0_1" -> "PE_1_1"
"PE_1_1" -> "PE_2_1"
"PE_2_1" -> "PE_3_1"
"PE_3_1" -> "PE_4_1"
"PE_-1_2" -> "PE_0_2" [label="---"]
"PE_0_2" -> "PE_1_2"
"PE_1_2" -> "PE_2_2"
"PE_2_2" -> "PE_3_2"
"PE_3_2" -> "PE_4_2"
"PE_-1_3" -> "PE_0_3" [label="---"]
"PE_0_3" -> "PE_1_3"
"PE_1_3" -> "PE_2_3"
"PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3"
}
```
### CLOCK CYCLE 1
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0" [label="PE_0_0\n (B_00)"]
"PE_1_0" [label="PE_1_0\n (B_10)"]
"PE_2_0" [label="PE_2_0\n (B_20)"]
"PE_3_0" [label="PE_3_0\n (B_30)"]
"PE_4_0" [style=invis]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1" [label="PE_0_1\n (B_01)"]
"PE_1_1" [label="PE_1_1\n (B_11)"]
"PE_2_1" [label="PE_2_1\n (B_21)"]
"PE_3_1" [label="PE_3_1\n (B_31)"]
"PE_4_1" [style=invis]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2" [label="PE_0_2\n (B_02)"]
"PE_1_2" [label="PE_1_2\n (B_12)"]
"PE_2_2" [label="PE_2_2\n (B_22)"]
"PE_3_2" [label="PE_3_2\n (B_32)"]
"PE_4_2" [style=invis]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3" [label="PE_0_3\n (B_03)"]
"PE_1_3" [label="PE_1_3\n (B_13)"]
"PE_2_3" [label="PE_2_3\n (B_23)"]
"PE_3_3" [label="PE_3_3\n (B_33)"]
"PE_4_3" [style=invis]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
"RE_0_0" [shape=note]
"RE_0_1" [shape=note]
"RE_0_2" [shape=note]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
"RE_1_0" [shape=note]
"RE_1_1" [shape=note]
"RE_1_2" [shape=note]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
"RE_2_0" [shape=note]
"RE_2_1" [shape=note]
"RE_2_2" [shape=note]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
"RE_3_0" [shape=note]
"RE_3_1" [shape=note]
"RE_3_2" [shape=note]
}
# left -> right
"PE_0_-1" -> "PE_0_0" [label="A^T_01"]
"PE_0_0" -> "RE_0_0" [label="A^T_01"]
"RE_0_0" -> "PE_0_1" [label="A^T_00"]
"PE_0_1" -> "RE_0_1" [label="A^T_00"]
"RE_0_1" -> "PE_0_2"
"PE_0_2" -> "RE_0_2"
"RE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4"
"PE_1_-1" -> "PE_1_0" [label="A^T_11"]
"PE_1_0" -> "RE_1_0" [label="A^T_11"]
"RE_1_0" -> "PE_1_1" [label="A^T_10"]
"PE_1_1" -> "RE_1_1" [label="A^T_10"]
"RE_1_1" -> "PE_1_2"
"PE_1_2" -> "RE_1_2"
"RE_1_2"-> "PE_1_3"
"PE_1_3" -> "PE_1_4"
"PE_2_-1" -> "PE_2_0" [label="A^T_21"]
"PE_2_0" -> "RE_2_0" [label="A^T_21"]
"RE_2_0" -> "PE_2_1" [label="A^T_20"]
"PE_2_1" -> "RE_2_1" [label="A^T_20"]
"RE_2_1" -> "PE_2_2"
"PE_2_2" -> "RE_2_2"
"RE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4"
"PE_3_-1" -> "PE_3_0" [label="A^T_31"]
"PE_3_0" -> "RE_3_0" [label="A^T_31"]
"RE_3_0" -> "PE_3_1" [label="A^T_30"]
"PE_3_1" -> "RE_3_1" [label="A^T_30"]
"RE_3_1" -> "PE_3_2"
"PE_3_2" -> "RE_3_2"
"RE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4"
# top -> bottom
"PE_-1_0" -> "PE_0_0" [label="D_10"]
"PE_0_0" -> "PE_1_0" [label="D_10\n + A^T_01 * B_00"]
"PE_1_0" -> "PE_2_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10"]
"PE_2_0" -> "PE_3_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20"]
"PE_3_0" -> "PE_4_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20\n + A^T_31 * B_30\n -->C_10"]
"PE_-1_1" -> "PE_0_1" [label="D_01"]
"PE_0_1" -> "PE_1_1" [label="D_01\n + A^T_00 * B_01"]
"PE_1_1" -> "PE_2_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11"]
"PE_2_1" -> "PE_3_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21"]
"PE_3_1" -> "PE_4_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21\n + A^T_30 * B_31\n -->C_01"]
"PE_-1_2" -> "PE_0_2" [label="---"]
"PE_0_2" -> "PE_1_2"
"PE_1_2" -> "PE_2_2"
"PE_2_2" -> "PE_3_2"
"PE_3_2" -> "PE_4_2"
"PE_-1_3" -> "PE_0_3" [label="---"]
"PE_0_3" -> "PE_1_3"
"PE_1_3" -> "PE_2_3"
"PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3"
}
```
### CLOCK CYCLE 2
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0" [label="PE_0_0\n (B_00)"]
"PE_1_0" [label="PE_1_0\n (B_10)"]
"PE_2_0" [label="PE_2_0\n (B_20)"]
"PE_3_0" [label="PE_3_0\n (B_30)"]
"PE_4_0" [style=invis]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1" [label="PE_0_1\n (B_01)"]
"PE_1_1" [label="PE_1_1\n (B_11)"]
"PE_2_1" [label="PE_2_1\n (B_21)"]
"PE_3_1" [label="PE_3_1\n (B_31)"]
"PE_4_1" [style=invis]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2" [label="PE_0_2\n (B_02)"]
"PE_1_2" [label="PE_1_2\n (B_12)"]
"PE_2_2" [label="PE_2_2\n (B_22)"]
"PE_3_2" [label="PE_3_2\n (B_32)"]
"PE_4_2" [style=invis]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3" [label="PE_0_3\n (B_03)"]
"PE_1_3" [label="PE_1_3\n (B_13)"]
"PE_2_3" [label="PE_2_3\n (B_23)"]
"PE_3_3" [label="PE_3_3\n (B_33)"]
"PE_4_3" [style=invis]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
"RE_0_0" [shape=note]
"RE_0_1" [shape=note]
"RE_0_2" [shape=note]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
"RE_1_0" [shape=note]
"RE_1_1" [shape=note]
"RE_1_2" [shape=note]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
"RE_2_0" [shape=note]
"RE_2_1" [shape=note]
"RE_2_2" [shape=note]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
"RE_3_0" [shape=note]
"RE_3_1" [shape=note]
"RE_3_2" [shape=note]
}
# left -> right
"PE_0_-1" -> "PE_0_0" [label="A^T_02"]
"PE_0_0" -> "RE_0_0" [label="A^T_02"]
"RE_0_0" -> "PE_0_1" [label="A^T_01"]
"PE_0_1" -> "RE_0_1" [label="A^T_01"]
"RE_0_1" -> "PE_0_2" [label="A^T_00"]
"PE_0_2" -> "RE_0_2" [label="A^T_00"]
"RE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4"
"PE_1_-1" -> "PE_1_0" [label="A^T_12"]
"PE_1_0" -> "RE_1_0" [label="A^T_12"]
"RE_1_0" -> "PE_1_1" [label="A^T_11"]
"PE_1_1" -> "RE_1_1" [label="A^T_11"]
"RE_1_1" -> "PE_1_2" [label="A^T_10"]
"PE_1_2" -> "RE_1_2" [label="A^T_10"]
"RE_1_2"-> "PE_1_3"
"PE_1_3" -> "PE_1_4"
"PE_2_-1" -> "PE_2_0" [label="A^T_22"]
"PE_2_0" -> "RE_2_0" [label="A^T_22"]
"RE_2_0" -> "PE_2_1" [label="A^T_21"]
"PE_2_1" -> "RE_2_1" [label="A^T_21"]
"RE_2_1" -> "PE_2_2" [label="A^T_20"]
"PE_2_2" -> "RE_2_2" [label="A^T_20"]
"RE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4"
"PE_3_-1" -> "PE_3_0" [label="A^T_32"]
"PE_3_0" -> "RE_3_0" [label="A^T_32"]
"RE_3_0" -> "PE_3_1" [label="A^T_31"]
"PE_3_1" -> "RE_3_1" [label="A^T_31"]
"RE_3_1" -> "PE_3_2" [label="A^T_30"]
"PE_3_2" -> "RE_3_2" [label="A^T_30"]
"RE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4"
# top -> bottom
"PE_-1_0" -> "PE_0_0" [label="D_20"]
"PE_0_0" -> "PE_1_0" [label="D_20\n + A^T_02 * B_00"]
"PE_1_0" -> "PE_2_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10"]
"PE_2_0" -> "PE_3_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10\n + A^T_22 * B_20"]
"PE_3_0" -> "PE_4_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10\n + A^T_22 * B_20\n + A^T_32 * B_30\n -->C_20"]
"PE_-1_1" -> "PE_0_1" [label="D_11"]
"PE_0_1" -> "PE_1_1" [label="D_11\n + A^T_01 * B_01"]
"PE_1_1" -> "PE_2_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11"]
"PE_2_1" -> "PE_3_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11\n + A^T_21 * B_21"]
"PE_3_1" -> "PE_4_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11\n + A^T_21 * B_21\n + A^T_31 * B_31\n -->C_11"]
"PE_-1_2" -> "PE_0_2" [label="D_02"]
"PE_0_2" -> "PE_1_2" [label="D_02\n + A^T_00 * B_02"]
"PE_1_2" -> "PE_2_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12"]
"PE_2_2" -> "PE_3_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12\n + A^T_20 * B_22"]
"PE_3_2" -> "PE_4_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12\n + A^T_20 * B_22\n + A^T_30 * B_32\n -->C_02"]
"PE_-1_3" -> "PE_0_3" [label="---"]
"PE_0_3" -> "PE_1_3"
"PE_1_3" -> "PE_2_3"
"PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3"
}
```
## Top Down Pipelining
A big problem with the rudimentary design above is that the entire dot product $C_{ij} = D_{ij} + \mathbf{A^T_i} \cdot \mathbf{B_j}$ is is computed in a single clock cycle - this means that the critical path of the circuit will include the multiplication + accumulation steps for every PE in an entire column (in the examples above - 4).
This can be improved by adding top->bottom pipelining; i.e., adding registers in between each PE in a column of the systolic array. The tradeoff now is that a single dot product will take 4 clock cycles instead of 1. A balance between both competing factors is to add **"tiles"** - only adding registers between each k PEs for some k | n; e.g. for k = 2, n = 4(each tile contains 2 PEs) then each dot product will take 2 clock cycles and each clock cycle will have 2 multiplication + accumulation ops.
The downside of tiling is added complexity - not to the circuit, but to the input "feeding", since each column of $A^T$ shouldn't be sent in its entirety on the same clock cycle. Instead the **left input has to be staggered** so that the left input to a tile (a chunk of a column $\mathbf{A^T_i}$) matches up with the column the was passed to the above tile on the **previous clock cycle**.
This is what that looks like at Clock cycles 0, 1, and 2:
### CLOCK CYCLE 0
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0" [label="PE_0_0\n (B_00)"]
"PE_1_0" [label="PE_1_0\n (B_10)"]
"PE_2_0" [label="PE_2_0\n (B_20)"]
"PE_3_0" [label="PE_3_0\n (B_30)"]
"PE_4_0" [style=invis]
"TE_1_0" [shape=note]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1" [label="PE_0_1\n (B_01)"]
"PE_1_1" [label="PE_1_1\n (B_11)"]
"PE_2_1" [label="PE_2_1\n (B_21)"]
"PE_3_1" [label="PE_3_1\n (B_31)"]
"PE_4_1" [style=invis]
"TE_1_1" [shape=note]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2" [label="PE_0_2\n (B_02)"]
"PE_1_2" [label="PE_1_2\n (B_12)"]
"PE_2_2" [label="PE_2_2\n (B_22)"]
"PE_3_2" [label="PE_3_2\n (B_32)"]
"PE_4_2" [style=invis]
"TE_1_2" [shape=note]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3" [label="PE_0_3\n (B_03)"]
"PE_1_3" [label="PE_1_3\n (B_13)"]
"PE_2_3" [label="PE_2_3\n (B_23)"]
"PE_3_3" [label="PE_3_3\n (B_33)"]
"PE_4_3" [style=invis]
"TE_1_3" [shape=note]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
"RE_0_0" [shape=note]
"RE_0_1" [shape=note]
"RE_0_2" [shape=note]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
"RE_1_0" [shape=note]
"RE_1_1" [shape=note]
"RE_1_2" [shape=note]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
"RE_2_0" [shape=note]
"RE_2_1" [shape=note]
"RE_2_2" [shape=note]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
"RE_3_0" [shape=note]
"RE_3_1" [shape=note]
"RE_3_2" [shape=note]
}
# left -> right
"PE_0_-1" -> "PE_0_0" [label="A^T_00"]
"PE_0_0" -> "RE_0_0" [label="A^T_00"]
"RE_0_0" -> "PE_0_1"
"PE_0_1" -> "RE_0_1"
"RE_0_1" -> "PE_0_2"
"PE_0_2" -> "RE_0_2"
"RE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4"
"PE_1_-1" -> "PE_1_0" [label="A^T_10"]
"PE_1_0" -> "RE_1_0" [label="A^T_10"]
"RE_1_0" -> "PE_1_1"
"PE_1_1" -> "RE_1_1"
"RE_1_1" -> "PE_1_2"
"PE_1_2" -> "RE_1_2"
"RE_1_2"-> "PE_1_3"
"PE_1_3" -> "PE_1_4"
"PE_2_-1" -> "PE_2_0"
"PE_2_0" -> "RE_2_0"
"RE_2_0" -> "PE_2_1"
"PE_2_1" -> "RE_2_1"
"RE_2_1" -> "PE_2_2"
"PE_2_2" -> "RE_2_2"
"RE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4"
"PE_3_-1" -> "PE_3_0"
"PE_3_0" -> "RE_3_0"
"RE_3_0" -> "PE_3_1"
"PE_3_1" -> "RE_3_1"
"RE_3_1" -> "PE_3_2"
"PE_3_2" -> "RE_3_2"
"RE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4"
# top -> bottom
"PE_-1_0" -> "PE_0_0" [label="D_00"]
"PE_0_0" -> "PE_1_0" [label="D_00\n + A^T_00 * B_00"]
"PE_1_0" -> "TE_1_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10"]
"TE_1_0" -> "PE_2_0"
"PE_2_0" -> "PE_3_0"
"PE_3_0" -> "PE_4_0"
"PE_-1_1" -> "PE_0_1" [label="---"]
"PE_0_1" -> "PE_1_1"
"PE_1_1" -> "TE_1_1"
"TE_1_1" -> "PE_2_1"
"PE_2_1" -> "PE_3_1"
"PE_3_1" -> "PE_4_1"
"PE_-1_2" -> "PE_0_2" [label="---"]
"PE_0_2" -> "PE_1_2"
"PE_1_2" -> "TE_1_2"
"TE_1_2" -> "PE_2_2"
"PE_2_2" -> "PE_3_2"
"PE_3_2" -> "PE_4_2"
"PE_-1_3" -> "PE_0_3" [label="---"]
"PE_0_3" -> "PE_1_3"
"PE_1_3" -> "TE_1_3"
"TE_1_3" -> "PE_2_3"
"PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3"
}
```
### CLOCK CYCLE 1
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0" [label="PE_0_0\n (B_00)"]
"PE_1_0" [label="PE_1_0\n (B_10)"]
"PE_2_0" [label="PE_2_0\n (B_20)"]
"PE_3_0" [label="PE_3_0\n (B_30)"]
"PE_4_0" [style=invis]
"TE_1_0" [shape=note]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1" [label="PE_0_1\n (B_01)"]
"PE_1_1" [label="PE_1_1\n (B_11)"]
"PE_2_1" [label="PE_2_1\n (B_21)"]
"PE_3_1" [label="PE_3_1\n (B_31)"]
"PE_4_1" [style=invis]
"TE_1_1" [shape=note]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2" [label="PE_0_2\n (B_02)"]
"PE_1_2" [label="PE_1_2\n (B_12)"]
"PE_2_2" [label="PE_2_2\n (B_22)"]
"PE_3_2" [label="PE_3_2\n (B_32)"]
"PE_4_2" [style=invis]
"TE_1_2" [shape=note]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3" [label="PE_0_3\n (B_03)"]
"PE_1_3" [label="PE_1_3\n (B_13)"]
"PE_2_3" [label="PE_2_3\n (B_23)"]
"PE_3_3" [label="PE_3_3\n (B_33)"]
"PE_4_3" [style=invis]
"TE_1_3" [shape=note]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
"RE_0_0" [shape=note]
"RE_0_1" [shape=note]
"RE_0_2" [shape=note]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
"RE_1_0" [shape=note]
"RE_1_1" [shape=note]
"RE_1_2" [shape=note]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
"RE_2_0" [shape=note]
"RE_2_1" [shape=note]
"RE_2_2" [shape=note]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
"RE_3_0" [shape=note]
"RE_3_1" [shape=note]
"RE_3_2" [shape=note]
}
# left -> right
"PE_0_-1" -> "PE_0_0" [label="A^T_01"]
"PE_0_0" -> "RE_0_0" [label="A^T_01"]
"RE_0_0" -> "PE_0_1" [label="A^T_00"]
"PE_0_1" -> "RE_0_1" [label="A^T_00"]
"RE_0_1" -> "PE_0_2"
"PE_0_2" -> "RE_0_2"
"RE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4"
"PE_1_-1" -> "PE_1_0" [label="A^T_11"]
"PE_1_0" -> "RE_1_0" [label="A^T_11"]
"RE_1_0" -> "PE_1_1" [label="A^T_10"]
"PE_1_1" -> "RE_1_1" [label="A^T_10"]
"RE_1_1" -> "PE_1_2"
"PE_1_2" -> "RE_1_2"
"RE_1_2"-> "PE_1_3"
"PE_1_3" -> "PE_1_4"
"PE_2_-1" -> "PE_2_0" [label="A^T_20"]
"PE_2_0" -> "RE_2_0" [label="A^T_20"]
"RE_2_0" -> "PE_2_1"
"PE_2_1" -> "RE_2_1"
"RE_2_1" -> "PE_2_2"
"PE_2_2" -> "RE_2_2"
"RE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4"
"PE_3_-1" -> "PE_3_0" [label="A^T_30"]
"PE_3_0" -> "RE_3_0" [label="A^T_30"]
"RE_3_0" -> "PE_3_1"
"PE_3_1" -> "RE_3_1"
"RE_3_1" -> "PE_3_2"
"PE_3_2" -> "RE_3_2"
"RE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4"
# top -> bottom
"PE_-1_0" -> "PE_0_0" [label="D_10"]
"PE_0_0" -> "PE_1_0" [label="D_10\n + A^T_01 * B_00"]
"PE_1_0" -> "TE_1_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10"]
"TE_1_0" -> "PE_2_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10"]
"PE_2_0" -> "PE_3_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20"]
"PE_3_0" -> "PE_4_0" [label="D_00\n + A^T_00 * B_00\n + A^T_10 * B_10\n + A^T_20 * B_20\n + A^T_30 * B_30\n-->C_00"]
"PE_-1_1" -> "PE_0_1" [label="D_01"]
"PE_0_1" -> "PE_1_1" [label="D_01\n + A^T_00 * B_01"]
"PE_1_1" -> "TE_1_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11"]
"TE_1_1" -> "PE_2_1"
"PE_2_1" -> "PE_3_1"
"PE_3_1" -> "PE_4_1"
"PE_-1_2" -> "PE_0_2" [label="---"]
"PE_0_2" -> "PE_1_2"
"PE_1_2" -> "TE_1_2"
"TE_1_2" -> "PE_2_2"
"PE_2_2" -> "PE_3_2"
"PE_3_2" -> "PE_4_2"
"PE_-1_3" -> "PE_0_3" [label="---"]
"PE_0_3" -> "PE_1_3"
"PE_1_3" -> "TE_1_3"
"TE_1_3" -> "PE_2_3"
"PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3"
}
```
### CLOCK CYCLE 2
```graphviz
digraph sys_array {
nodesep=1.0
node [color=Black,fontname=Arial,shape=box]
edge [color=Black,fontname=Arial]
labelloc = "t"
# columns
subgraph {
"PE_-1_-1" [style=invis]
"PE_0_-1" [style=invis]
"PE_1_-1" [style=invis]
"PE_2_-1" [style=invis]
"PE_3_-1" [style=invis]
"PE_4_-1" [style=invis]
}
subgraph {
"PE_-1_0" [style=invis]
"PE_0_0" [label="PE_0_0\n (B_00)"]
"PE_1_0" [label="PE_1_0\n (B_10)"]
"PE_2_0" [label="PE_2_0\n (B_20)"]
"PE_3_0" [label="PE_3_0\n (B_30)"]
"PE_4_0" [style=invis]
"TE_1_0" [shape=note]
}
subgraph {
"PE_-1_1" [style=invis]
"PE_0_1" [label="PE_0_1\n (B_01)"]
"PE_1_1" [label="PE_1_1\n (B_11)"]
"PE_2_1" [label="PE_2_1\n (B_21)"]
"PE_3_1" [label="PE_3_1\n (B_31)"]
"PE_4_1" [style=invis]
"TE_1_1" [shape=note]
}
subgraph {
"PE_-1_2" [style=invis]
"PE_0_2" [label="PE_0_2\n (B_02)"]
"PE_1_2" [label="PE_1_2\n (B_12)"]
"PE_2_2" [label="PE_2_2\n (B_22)"]
"PE_3_2" [label="PE_3_2\n (B_32)"]
"PE_4_2" [style=invis]
"TE_1_2" [shape=note]
}
subgraph {
"PE_-1_3" [style=invis]
"PE_0_3" [label="PE_0_3\n (B_03)"]
"PE_1_3" [label="PE_1_3\n (B_13)"]
"PE_2_3" [label="PE_2_3\n (B_23)"]
"PE_3_3" [label="PE_3_3\n (B_33)"]
"PE_4_3" [style=invis]
"TE_1_3" [shape=note]
}
# rows
subgraph {
rank=same
"PE_-1_-1" [style=invis]
"PE_-1_0" [style=invis]
"PE_-1_1" [style=invis]
"PE_-1_2" [style=invis]
"PE_-1_3" [style=invis]
"PE_-1_4" [style=invis]
}
subgraph {
rank=same
"PE_0_-1" [style=invis]
"PE_0_0"
"PE_0_1"
"PE_0_2"
"PE_0_3"
"PE_0_4" [style=invis]
"RE_0_0" [shape=note]
"RE_0_1" [shape=note]
"RE_0_2" [shape=note]
}
subgraph {
rank=same
"PE_1_-1" [style=invis]
"PE_1_0"
"PE_1_1"
"PE_1_2"
"PE_1_3"
"PE_1_4" [style=invis]
"RE_1_0" [shape=note]
"RE_1_1" [shape=note]
"RE_1_2" [shape=note]
}
subgraph {
rank=same
"PE_2_-1" [style=invis]
"PE_2_0"
"PE_2_1"
"PE_2_2"
"PE_2_3"
"PE_2_4" [style=invis]
"RE_2_0" [shape=note]
"RE_2_1" [shape=note]
"RE_2_2" [shape=note]
}
subgraph {
rank=same
"PE_3_-1" [style=invis]
"PE_3_0"
"PE_3_1"
"PE_3_2"
"PE_3_3"
"PE_3_4" [style=invis]
"RE_3_0" [shape=note]
"RE_3_1" [shape=note]
"RE_3_2" [shape=note]
}
# left -> right
"PE_0_-1" -> "PE_0_0" [label="A^T_02"]
"PE_0_0" -> "RE_0_0" [label="A^T_02"]
"RE_0_0" -> "PE_0_1" [label="A^T_01"]
"PE_0_1" -> "RE_0_1" [label="A^T_01"]
"RE_0_1" -> "PE_0_2" [label="A^T_00"]
"PE_0_2" -> "RE_0_2" [label="A^T_00"]
"RE_0_2" -> "PE_0_3"
"PE_0_3" -> "PE_0_4"
"PE_1_-1" -> "PE_1_0" [label="A^T_12"]
"PE_1_0" -> "RE_1_0" [label="A^T_12"]
"RE_1_0" -> "PE_1_1" [label="A^T_11"]
"PE_1_1" -> "RE_1_1" [label="A^T_11"]
"RE_1_1" -> "PE_1_2" [label="A^T_10"]
"PE_1_2" -> "RE_1_2" [label="A^T_10"]
"RE_1_2"-> "PE_1_3"
"PE_1_3" -> "PE_1_4"
"PE_2_-1" -> "PE_2_0" [label="A^T_21"]
"PE_2_0" -> "RE_2_0" [label="A^T_21"]
"RE_2_0" -> "PE_2_1" [label="A^T_20"]
"PE_2_1" -> "RE_2_1" [label="A^T_20"]
"RE_2_1" -> "PE_2_2"
"PE_2_2" -> "RE_2_2"
"RE_2_2" -> "PE_2_3"
"PE_2_3" -> "PE_2_4"
"PE_3_-1" -> "PE_3_0" [label="A^T_31"]
"PE_3_0" -> "RE_3_0" [label="A^T_31"]
"RE_3_0" -> "PE_3_1" [label="A^T_30"]
"PE_3_1" -> "RE_3_1" [label="A^T_30"]
"RE_3_1" -> "PE_3_2"
"PE_3_2" -> "RE_3_2"
"RE_3_2" -> "PE_3_3"
"PE_3_3" -> "PE_3_4"
# top -> bottom
"PE_-1_0" -> "PE_0_0" [label="D_20"]
"PE_0_0" -> "PE_1_0" [label="D_20\n + A^T_02 * B_00"]
"PE_1_0" -> "TE_1_0" [label="D_20\n + A^T_02 * B_00\n + A^T_12 * B_10"]
"TE_1_0" -> "PE_2_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10"]
"PE_2_0" -> "PE_3_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20"]
"PE_3_0" -> "PE_4_0" [label="D_10\n + A^T_01 * B_00\n + A^T_11 * B_10\n + A^T_21 * B_20\n + A^T_31 * B_30\n-->C_10"]
"PE_-1_1" -> "PE_0_1" [label="D_11"]
"PE_0_1" -> "PE_1_1" [label="D_11\n + A^T_01 * B_01"]
"PE_1_1" -> "TE_1_1" [label="D_11\n + A^T_01 * B_01\n + A^T_11 * B_11"]
"TE_1_1" -> "PE_2_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11"]
"PE_2_1" -> "PE_3_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21"]
"PE_3_1" -> "PE_4_1" [label="D_01\n + A^T_00 * B_01\n + A^T_10 * B_11\n + A^T_20 * B_21\n + A^T_30 * B_31\n-->C_01"]
"PE_-1_2" -> "PE_0_2" [label="D_02"]
"PE_0_2" -> "PE_1_2" [label="D_02\n + A^T_00 * B_02"]
"PE_1_2" -> "TE_1_2" [label="D_02\n + A^T_00 * B_02\n + A^T_10 * B_12"]
"TE_1_2" -> "PE_2_2"
"PE_2_2" -> "PE_3_2"
"PE_3_2" -> "PE_4_2"
"PE_-1_3" -> "PE_0_3" [label="---"]
"PE_0_3" -> "PE_1_3"
"PE_1_3" -> "TE_1_3"
"TE_1_3" -> "PE_2_3"
"PE_2_3" -> "PE_3_3"
"PE_3_3" -> "PE_4_3"
}
```
A key difference here is that it takes k - 1 clock cycles for the systolic array to "warm up" before it starts outputting results. This helps to quantify the tradeoff mentioned above - doubling the number of tiles would double the number of warmup clock cycles but halve the critical path of the systolic array.
## Loading $B$ + Wrapping up the PE
The last consideration for the core systolic array is how to set it up - specifically how to load $B$ directly into the systolic array. $B$ will be loaded from the top of the systolic array - passing each row of $B$ (starting from the **bottom** row, not the top row) down through the systolic array until it reaches its corresponding row in the array and storing it in a register in the PE.
It wastes clock cycles to completely load $B$ into the array before feeding $A$ and $D$ to the array but this is required since $B$ is required for the computation. A neat trick to forego this synchronization problem (and which will allow for threading later) is to **store multiple $B$ matrices in the systolic array** - in particular each PE will store 2 registers for 2 separate matrices $B_0$ and $B_1$.
We still need to completely load a matrix $B_i$ before using it for computation - however with 2 $B$ matrices available it will be possible to use one matrix for computation while loading the other matrix - e.g. if $B_0$ is already loaded we can feed $A$ and $D$ into the systolic array and **at the same time** feed $B_1$ into the systolic array. Once the computation using $B_0$ is done we can then immediately use $B_1$ for computation without having to wait - this allows us to always have the computation processing on the systolic array without blocking on loading each $B_i$.
This will be the last modification to the PE - we'll just need to add 3 new signals:
- `B` - an input signal (from the top PE) to load the B data into the PE
- `counter[3:0]` - an unsigned integer that indicates how many further rows the `B` signal should propagate before being stored - if `counter == 0` then the PE should store the `B` signal in its local register
- `propagate` - indicates whether the top $B$ input should be written to local register 0 (for $B_0$) or 1 (for $B_1$) - additionally **the opposite** matrix will be used for the multiplication + accumulation ops with the top $D$ input and left $A$ input