Sys-Array MatMul Device Part 3: ISA

# Systolic Array MatMul Device Part 3: ISA

The ISA I made for the device was motivated by the design of the core systolic array - specifically, to cleanly isolate (i) loading $B$, (ii) writing $A$ and $D$ to execute the matrix multiplication computation, and (iii) reading the matrix $C$ as it is output from the systolic array.

## ISA

This is a 32-bit ISA (recall from Part 2 that each instruction takes up $4$ B in instruction memory) with 4 defined instructions.

### load

**loads matrix $B$** from block memory into the systolic array (note - thread $i$ will load the matrix into $B_i$ in the PEs in the systolic array)

```
|unused  |B_addr  |code  |
|(22 b)  |(8 b)   |(2 b) |
|31 -- 10|9  --  2|1 -- 0|
```

- `code = 0b10`
- `B_addr`: bits `[9:2]` of the 32-bit block memory address from which $B$ is loaded

### comp (compute)

**initiates and waits** on the systolic array to compute $C = A * B_i + D$ by reading $A$ and $D$ from block memory and writing $C$ to block memory

```
|unused  |C_addr  |D_addr  |A_addr  |code  |
|(6 b)   |(8 b)   |(8 b)   |(8 b)   |(2 b) |
|31 -- 26|25 -- 18|17 -- 10|9  --  2|1 -- 0|
```

- `code = 0b11`
- `A_addr`: bits `[9:2]` of the 32-bit block memory address from which $A$ is loaded
- `D_addr`: bits `[9:2]` of the 32-bit block memory address from which $D$ is loaded
- `C_addr`: bits `[9:2]` of the 32-bit block memory address into which $C$ is written

### write

writes (i) a $1$ B header and (ii) the full contents of an arbitrary matrix in block memory ($2^4$ B) to UART

```
|unused  |header  |addr    |code  |
|(14 b)  |(8 b)   |(8 b)   |(2 b) |
|31 -- 18|17 -- 10|9  --  2|1 -- 0|
```

- `code = 0b01`
- `addr`: bits `[9:2]` of the 32-bit block memory address from which the matrix to be written to UART is loaded
- `header`: the exact 8-bit header written to UART before the matrix
  - the idea is to allow the device to add a tag to the output in case it needs to be distinguished from the output of other threads writing to UART

### term (terminate)

stops the thread - note that the thread will automatically stop once it reaches the final instruction in the script, so this instruction is purely decorative

```
|unused  |code  |
|(30 b)  |(2 b) |
|31 --  2|1 -- 0|
```

- `code = 0b00`

## Scripts

Scripts tell the device driver how to execute a thread on the device - at a high level they describe the **full life cycle of a thread**. A script contains 3 sections:

- `META`: 2 integer parameters ($n / k$ and $k$ - recall from Part 1 that $n$ is the number of entries in a matrix row/column and $k$ is the number of tiles in the systolic array)
- `DATA`: a list of statically-defined matrices that the device driver running the script must load into the device block memory
- `TEXT`: a list of instructions that the device driver running the script must load into the device instruction memory

## The Central Idea Behind the Device

The general driver-device interaction flow is as follows:

- the device driver reads a script and writes the script to the device:
  - the device loads static data into **shared block memory**
  - the device loads instructions into **local (per-thread) instruction memory** starting at address `0x0000 0000`
- the device driver sends a control signal to the device indicating that it should start an indexed thread (i.e., the driver should specifically indicate if it wants to start thread 0, thread 1, or both)
- the device starts running the indexed thread, reading from its local instruction memory starting from the instruction at address `0x0000 0000` and **completely stopping** once it hits a `term` instruction or runs through every instruction in its instruction memory

The motivating idea is that the device should be a **continuously running script consumer** - the driver could be constantly feeding new scripts to the device (taking advantage of the multiple running threads), updating the thread instruction memory, and running new threads.
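To make the instruction formats from the ISA section concrete, here is a minimal Python sketch of how a driver might pack the `TEXT` section of a script into 32-bit words. The function names (`addr_field`, `encode_load`, and so on) are hypothetical and not part of the actual driver, and the sketch simply drops any address bits outside `[9:2]`, which is an assumption about how out-of-range addresses are handled.

```python
# Opcodes taken from the `code` field (bits 1..0) of each instruction format.
CODE_TERM = 0b00
CODE_WRITE = 0b01
CODE_LOAD = 0b10
CODE_COMP = 0b11


def addr_field(addr: int) -> int:
    """Return bits [9:2] of a block memory address as an 8-bit field.

    Assumption: bits outside [9:2] are silently dropped in this sketch.
    """
    return (addr >> 2) & 0xFF


def encode_load(b_addr: int) -> int:
    # B_addr sits in instruction bits 9..2.
    return (addr_field(b_addr) << 2) | CODE_LOAD


def encode_comp(a_addr: int, d_addr: int, c_addr: int) -> int:
    # C_addr: bits 25..18, D_addr: bits 17..10, A_addr: bits 9..2.
    return (
        (addr_field(c_addr) << 18)
        | (addr_field(d_addr) << 10)
        | (addr_field(a_addr) << 2)
        | CODE_COMP
    )


def encode_write(addr: int, header: int) -> int:
    # header: bits 17..10, addr: bits 9..2.
    if not 0 <= header <= 0xFF:
        raise ValueError("header must fit in 8 bits")
    return (header << 10) | (addr_field(addr) << 2) | CODE_WRITE


def encode_term() -> int:
    # All unused bits are left as zero, so the word is just the opcode.
    return CODE_TERM


if __name__ == "__main__":
    # Hypothetical addresses, chosen only for illustration.
    program = [
        encode_load(0x010),
        encode_comp(a_addr=0x020, d_addr=0x030, c_addr=0x040),
        encode_write(0x040, header=0xAB),
        encode_term(),
    ]
    for word in program:
        print(f"{word:#010x}")
```

Because each address field occupies the same bit positions in the instruction as in the address itself, the `load` word is just the address masked to bits `[9:2]` with the opcode in the low two bits.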
The ISA is parsimonious - there are no branches or jumps, which means the device will run straight through any given script. This severely restricts the utility of any single script - but the idea is for **the driver to orchestrate a diversity of short scripts with custom, tailored logic and send them rapid-fire to the device** (alternating between updating a thread's instructions and allowing the thread to run). This is a fun (read: useless lol) contrast to the usual design for these sorts of devices, which typically has all threads on the device running the same code, just with different data inputs.
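The straight-line behavior can be illustrated with a small software model. This is only a sketch of the control flow, not of the hardware: `trace_thread` is a hypothetical helper that decodes the 2-bit `code` field of each word and records which instruction would run, stopping at `term` or at the end of instruction memory.

```python
from typing import List, Sequence

# Mnemonics keyed by the 2-bit `code` field in bits 1..0 of each instruction.
MNEMONICS = {0b00: "term", 0b01: "write", 0b10: "load", 0b11: "comp"}


def trace_thread(instr_mem: Sequence[int]) -> List[str]:
    """Return the mnemonics a thread would execute, in order.

    With no branches or jumps, the program counter only ever moves
    forward by one instruction, so a plain loop models the control flow.
    """
    trace = []
    for word in instr_mem:
        op = MNEMONICS[word & 0b11]
        trace.append(op)
        if op == "term":
            break  # nothing after `term` is ever reached
    return trace


if __name__ == "__main__":
    # Hypothetical instruction words, chosen only for illustration.
    script = [
        0x00000012,  # load,  B_addr field = 0x04
        0x00403023,  # comp,  C/D/A fields = 0x10 / 0x0C / 0x08
        0x0002AC41,  # write, header = 0xAB, addr field = 0x10
        0x00000000,  # term
        0x00000012,  # never executed
    ]
    print(trace_thread(script))
```

Each instruction runs at most once per thread start, so the only way to get different behavior is for the driver to load a different script.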