# Systolic Array MatMul Device Part 3: ISA
The ISA I made for the device was motivated by the design of the core systolic array - specifically, it cleanly isolates (i) loading $B$, (ii) writing $A$ and $D$ to execute the matrix multiplication computation, and (iii) reading the matrix $C$ as it is output from the systolic array.
## ISA
This is a 32-bit ISA (recall from Part 2 that each instruction takes up $4$ B in instruction memory) with 4 defined instructions:
### load
**loads matrix $B$** from block memory into the systolic array (note - thread $i$ will load the matrix into $B_i$ in the PEs in the systolic array)
```
|unused  |B_addr|code  |
|(22 b)  |(8 b) |(2 b) |
|31 -- 10|9 -- 2|1 -- 0|
```
- `code = 0b10`
- `B_addr`: bits `[9:2]` of the 32-bit block memory address from which $B$ is loaded
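As a minimal sketch of the layout above, a `load` word could be packed like this in Python. The helper names `addr_field` and `encode_load` are hypothetical (they are not part of the device or its driver), and the sketch assumes that address bits outside `[9:2]` are simply dropped.

```python
LOAD_CODE = 0b10  # opcode in bits [1:0]

def addr_field(addr: int) -> int:
    """Return bits [9:2] of a 32-bit block memory address (hypothetical helper)."""
    return (addr >> 2) & 0xFF

def encode_load(b_addr: int) -> int:
    """Pack a `load` word: B_addr in bits [9:2], code in bits [1:0]."""
    return (addr_field(b_addr) << 2) | LOAD_CODE

print(f"{encode_load(0x40):#010x}")  # 0x00000042
```

Because the field holds bits `[9:2]` of the address and also sits in bits `[9:2]` of the instruction, the address bits land in the same positions they occupy in the address itself.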
### comp (compute)
**initiates and waits** on the systolic array to compute $C = A * B_i + D$ by reading $A$ and $D$ from block memory and writing $C$ to block memory
```
|unused  |C_addr  |D_addr  |A_addr|code  |
|(6 b)   |(8 b)   |(8 b)   |(8 b) |(2 b) |
|31 -- 26|25 -- 18|17 -- 10|9 -- 2|1 -- 0|
```
- `code = 0b11`
- `A_addr`: bits `[9:2]` of the 32-bit block memory address from which $A$ is loaded
- `D_addr`: bits `[9:2]` of the 32-bit block memory address from which $D$ is loaded
- `C_addr`: bits `[9:2]` of the 32-bit block memory address into which $C$ is written
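The three address fields of `comp` can be packed the same way. This is only a sketch: `addr_field` and `encode_comp` are hypothetical names, and the example addresses are made up for illustration.

```python
COMP_CODE = 0b11  # opcode in bits [1:0]

def addr_field(addr: int) -> int:
    """Return bits [9:2] of a 32-bit block memory address (hypothetical helper)."""
    return (addr >> 2) & 0xFF

def encode_comp(a_addr: int, d_addr: int, c_addr: int) -> int:
    """Pack a `comp` word: C_addr [25:18], D_addr [17:10], A_addr [9:2], code [1:0]."""
    return (
        (addr_field(c_addr) << 18)
        | (addr_field(d_addr) << 10)
        | (addr_field(a_addr) << 2)
        | COMP_CODE
    )

# Illustrative addresses only: A at 0x00, D at 0x10, C at 0x20.
word = encode_comp(a_addr=0x00, d_addr=0x10, c_addr=0x20)
print(f"{word:#010x}")  # 0x00201003
```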
### write
writes (i) a $1$ B header and (ii) the full contents of an arbitrary matrix in block memory ($2^4$ B) to UART
```
|unused  |header  |addr  |code  |
|(14 b)  |(8 b)   |(8 b) |(2 b) |
|31 -- 18|17 -- 10|9 -- 2|1 -- 0|
```
- `code = 0b01`
- `addr`: bits `[9:2]` of the 32-bit block memory address from which the matrix to be written to UART is loaded
- `header`: the exact 8-bit header written to UART before the matrix - the idea is to allow the device to add a tag to the output in case it needs to be distinguished from the output of other threads writing to UART
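A `write` word carries the header byte next to the address field. The sketch below uses a hypothetical `encode_write` helper; the range check on `header` is my own addition, and the tag value `0xA5` is arbitrary.

```python
WRITE_CODE = 0b01  # opcode in bits [1:0]

def encode_write(addr: int, header: int) -> int:
    """Pack a `write` word: header in bits [17:10], addr in bits [9:2], code in [1:0]."""
    if not 0 <= header <= 0xFF:
        raise ValueError("header must fit in 8 bits")
    field = (addr >> 2) & 0xFF  # bits [9:2] of the block memory address
    return (header << 10) | (field << 2) | WRITE_CODE

# Illustrative values only: matrix at 0x20, tagged with header 0xA5.
print(f"{encode_write(addr=0x20, header=0xA5):#010x}")  # 0x00029421
```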
### term (terminate)
stops the thread - note that the thread will automatically stop once it reaches the final instruction in the script so this instruction is purely decorative
```
|unused |code  |
|(30 b) |(2 b) |
|31 -- 2|1 -- 0|
```
- `code = 0b00`
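Putting the four formats together, a decoder only needs the low two bits to pick the instruction and fixed shifts to pull out each 8-bit field. The following is a sketch, not the device's actual decode logic; `decode` and `MNEMONICS` are hypothetical names, and the returned values are the raw 8-bit fields rather than full addresses.

```python
MNEMONICS = {0b00: "term", 0b01: "write", 0b10: "load", 0b11: "comp"}

def decode(word: int) -> dict:
    """Split a 32-bit instruction word into its mnemonic and raw 8-bit fields."""
    def bits(lo: int) -> int:
        # 8-bit field whose least significant bit is at position `lo`
        return (word >> lo) & 0xFF

    name = MNEMONICS[word & 0b11]
    if name == "load":
        return {"op": name, "B_addr": bits(2)}
    if name == "comp":
        return {"op": name, "A_addr": bits(2), "D_addr": bits(10), "C_addr": bits(18)}
    if name == "write":
        return {"op": name, "addr": bits(2), "header": bits(10)}
    return {"op": name}  # term has no fields

print(decode(0x00201003))
# {'op': 'comp', 'A_addr': 0, 'D_addr': 4, 'C_addr': 8}
```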
## Scripts
Scripts tell the device driver how to execute a thread on the device - at a high level they describe the **full life cycle of a thread**. A script contains 3 sections:
- `META`: 2 integer parameters ($n / k$ and $k$ - recall from Part 1 that $n$ is the number of entries in a matrix row/column and $k$ is the number of tiles in the systolic array)
- `DATA`: a list of statically-defined matrices that the device driver running the script must load into the device block memory
- `TEXT`: a list of instructions that the device driver running the script must load into the device instruction memory
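One way to picture the three sections is as a small in-memory record. This is a hypothetical representation for illustration only: the class name `Script`, its field names, and the choice to store matrices as nested lists are all assumptions, and it says nothing about how scripts are actually stored or parsed.

```python
from dataclasses import dataclass, field
from typing import List

Matrix = List[List[int]]  # assumed representation of one static matrix

@dataclass
class Script:
    # META: the two integer parameters
    n_over_k: int
    k: int
    # DATA: matrices the driver loads into block memory
    data: List[Matrix] = field(default_factory=list)
    # TEXT: 32-bit instruction words the driver loads into instruction memory
    text: List[int] = field(default_factory=list)

    @property
    def n(self) -> int:
        """Recover n from the two META parameters (n = (n / k) * k)."""
        return self.n_over_k * self.k

script = Script(
    n_over_k=2,
    k=2,
    data=[[[0] * 4 for _ in range(4)]],  # one placeholder 4x4 matrix
    text=[0x00000042, 0x00000000],       # a load followed by a term
)
print(script.n, len(script.data), len(script.text))  # 4 1 2
```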
## The Central Idea Behind the Device
The general driver-device interaction flow is as follows:
- the device driver reads a script and writes the script to the device:
  - the device loads static data into **shared block memory**
  - the device loads instructions into **local (per-thread) instruction memory** starting at address `0x0000 0000`
- the device driver sends a control signal to the device indicating that it should start an indexed thread (i.e., the driver should specifically indicate if it wants to start thread 0, thread 1, or both)
- the device starts running the indexed thread, reading from its local instruction memory starting from the instruction at `0x0000 0000` and **completely stopping** once it hits a `term` instruction or runs through every instruction in its instruction memory
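The stopping rule in the last step can be modelled in a few lines. This is a toy software model, not the hardware: `run_thread` is a hypothetical name, `pc` counts instruction words rather than byte addresses, and the instructions are only named, not executed.

```python
MNEMONICS = {0b00: "term", 0b01: "write", 0b10: "load", 0b11: "comp"}

def run_thread(imem: list) -> list:
    """Walk instruction memory from word 0 and return the mnemonics visited."""
    trace = []
    pc = 0  # word index; the byte address would be 4 * pc
    while pc < len(imem):          # stop after the final instruction...
        op = MNEMONICS[imem[pc] & 0b11]
        trace.append(op)
        if op == "term":           # ...or as soon as a term is reached
            break
        pc += 1
    return trace

print(run_thread([0x42, 0x201003, 0x0, 0x29421]))  # ['load', 'comp', 'term']
print(run_thread([0x42, 0x201003]))                # ['load', 'comp']
```

In the first call the `write` after `term` is never reached; in the second the thread stops simply because it has run out of instructions.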
The motivating idea is that the device should be a **continuously running script consumer** - the driver could be constantly feeding new scripts to the device (taking advantage of the multiple running threads) + updating the thread instruction memory + running new threads.
The ISA is parsimonious - there are no branches or jumps, which means the device will run straight through any given script. This severely restricts the utility of any single script - but the idea is for **the driver to orchestrate a diversity of short scripts with custom, tailored logic and send them rapid-fire to the device** (alternating between updating a thread's instructions and allowing the thread to run). This is a fun (read: useless lol) contrast to the typical design for these sorts of devices, which has all threads on the device running the same code but just with different data inputs.