SysArray - MatMul Device Part 2: UART Comms

# Systolic Array MatMul Device Part 2: UART Comms + Memory

This part is a diversion from the last part on the core systolic array of the total device - however the systolic array, UART module, and memory modules are really the only subcomponents orchestrated by the top-level device controller (the controller and other trivial subcomponents will all come together in Part 4).

## UART Introduction

To communicate with the device driver, I used the **UART (Universal Asynchronous Receiver/Transmitter)** protocol. The key word to pick out here is **asynchronous** - we're going to define a baud rate (i.e. how many bits we send per second) that is slower than the clock rate running on our device, but we will not restrict the device and device driver to start the baud rate at the same instant and remain synced (we won't even restrict the device and device driver to have the same clock rate).

Also, recall that our device will have multiple running "threads" that all want to communicate with the device driver - however, since the UART protocol only permits one line of communication we will need to impose a more abstract interface over the communication lines that enforces access that is **synchronized** and **buffered**, to make sure multiple threads don't overlap writing to the device driver and sending a jumbled mess.

One last thing to note before describing the actual protocol - UART is a protocol for sending data in one direction, not for sending AND receiving data synchronously (i.e., in a question-answer flow). So in the overall system design, **the device and the device driver program will each have its own UART module** each writing with a separate higher purpose:

- the **device UART module** will strictly be used for writing output data (i.e.
the result of the `WRITE` command described in the next section)
- the **device driver UART module** will be used for writing instruction data, block memory data, and control signals to the device controller - the controller will have careful logic for decoding the byte-by-byte data sent over this UART connection into the correct logic to execute

## UART Protocol

The UART protocol is defined by the **baud rate**: the number of bits the UART transmitter sends per second. With frames fixed at 8 data bits plus one start and one stop bit, there are no other parameters. For ease of use we can define `SYMBOL_EDGE_TICKS = CLOCK_FREQUENCY / BAUD_RATE` (i.e., `SYMBOL_EDGE_TICKS` is the number of clock ticks per bit sent).

There are up to two lines used:

- `serial` (required): data line written to by transmitter
- `cts` (optional): control line written to by receiver (indicates that the receiver is able to receive data over the serial line; i.e. the transmitter is "clear to send")

The protocol the transmitter uses to send a single byte `B` cycles through the following steps:

- `WAITING`: send the high bit (`1`) until ready to write (and optionally until `cts = 1`)
- `START`: send the low bit (`0`)
- `READING`/`WRITING`: send each bit in `B` starting from the least significant bit
- `FINISH`: send the high bit (`1`), which returns the line to its idle level

So we send 10 total bits while the transmitter is not waiting to send (8 data bits bookended by 2 control bits). Additionally note that UART sends bits LSB-first (little-endian at the bit level) - it writes the data bits starting from the least significant bit.

The most important takeaway (reiterated) is that this protocol is **asynchronous**: both the transmitter and receiver are stuck in a waiting state (where the serial line is high) until the protocol is triggered on the exact tick where the serial line is set low. In particular, the transmitter/receiver do not periodically check the signal every `1 / BAUD_RATE` seconds (i.e., every `SYMBOL_EDGE_TICKS` clock ticks) except while a frame is in motion - free-running periodic checks could let the two drift out of sync, especially if they use different clock frequencies.
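
The framing and edge-triggered sampling described above can be sketched as a small behavioral model in Python. This is a minimal sketch, not the device's Verilog: the clock frequency, the baud rate, the helper names (`frame_byte`, `to_waveform`, `receive_byte`), and the choice to sample each bit at the middle of its symbol period are all assumptions made for illustration, and the stop bit follows standard UART framing (line high).

```python
# Illustrative values only (assumptions, not taken from the device).
CLOCK_FREQUENCY = 50_000_000  # Hz
BAUD_RATE = 115_200           # bits per second
SYMBOL_EDGE_TICKS = CLOCK_FREQUENCY // BAUD_RATE  # clock ticks per bit


def frame_byte(byte):
    """Return the 10 symbols for one byte: start, 8 data bits (LSB first), stop."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def to_waveform(symbols, ticks_per_symbol, idle_ticks=0):
    """Expand symbols into one sample of the serial line per clock tick."""
    wave = [1] * idle_ticks  # the line idles high while WAITING
    for symbol in symbols:
        wave.extend([symbol] * ticks_per_symbol)
    return wave


def receive_byte(wave, ticks_per_symbol):
    """Trigger on the first low tick, then sample the middle of each data bit."""
    start = wave.index(0)  # falling edge marks the beginning of START
    byte = 0
    for i in range(8):
        # Skip START (one symbol), then i data symbols, then half a symbol.
        sample_tick = start + (i + 1) * ticks_per_symbol + ticks_per_symbol // 2
        byte |= wave[sample_tick] << i
    return byte
```

For example, `frame_byte(0x2A)` gives `[0, 0, 1, 0, 1, 0, 1, 0, 0, 1]`, and feeding the expanded waveform to `receive_byte` recovers `0x2A` regardless of how many idle ticks precede the start bit, since the receiver only begins counting at the falling edge.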
This is what the serial line (and the transmitter state) would look like if it tried sending the byte `0x2A` (`0b00101010`) - note that, if viewing each "block" as a chunk of time, then **each block in the diagram will last for `SYMBOL_EDGE_TICKS` clock ticks:**

| |-2|-1|0|1|2|3|4|5|6|7|8|9|10|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|state|WAITING|WAITING|START|WRITING|WRITING|WRITING|WRITING|WRITING|WRITING|WRITING|WRITING|FINISH|WAITING|
|`serial`|1|1|0|0|1|0|1|0|1|0|0|1|1|

...

## UART Device-Level Considerations

Here we've been discussing transmitters/receivers, but each "device" (our local device and the device driver) has its own transmitter and receiver (since each will need to write to/read from the other). This means that there are now **4** lines of communication in total:

- `serial_in` (required): data line written to by the driver TX (and read from by the device RX)
- `serial_out` (required): data line written to by the device TX (and read from by the driver RX)
- `cts` (optional): control line written to by the driver RX (indicating the device TX is "clear to send")
- `rts` (optional): control line written to by the device RX (indicating the device RX "requests [the driver] to send")

Additionally, we will require that the device RX will not start running (i.e., it must stay in `WAITING`) until the device TX is in the `WAITING` state.

## UART Thread-Level Considerations

Our device has multiple threads running at a given time - this raises a synchronization problem since the device only has one UART and threads write multiple bytes at a time. We'll solve this with two additions to the overall **UART Controller**.

- **FIFO**: Reads are going to be buffered in a FIFO. In particular, the device RX will insert all read bytes from the UART to the back of the FIFO and pop bytes off as they are read by the threads.
- **Lock-level synchronization**: Access to the TX (for writes) will be protected by a lock, held by one thread at a time, that must be voluntarily relinquished before a different thread can access the TX/FIFO.

With these we can define the thread-level requirements for the UART controller:

- controller writes input byte from thread X to UART TX if
  - thread X holds the write lock
  - input byte from thread X is valid
- we will output the signal `write_ready` that indicates the TX is in `WAITING` state (it's the responsibility of the writing thread to only submit a data byte if the UART is ready to accept)
- the UART RX will only receive data if (i) the TX is not running (as above) and (ii) the read FIFO is not full

Note that we haven't added a FIFO for writing - this is because writing can be actively waited upon by the writing thread, since (i) it is beneficial for threads to be blocked by I/O to allow other threads access to computation resources, and (ii) this ensures a write FIFO backlog isn't hidden by the UART controller.

## Memory

The device has 2 separate memory modules: **block memory** and **instruction memory**. The block memory is used for storing matrices and instruction memory is used for storing (wait for it) instructions. Note that the memory parameters below are directly configurable in the Verilator simulation instead of being hardcoded into the Verilog design.

For all memory modules, **reads are asynchronous (unclocked)** while **writes are synchronous (clocked)**.

### Block Memory

- block memory contains $2^{16}$ B
- block memory **only** stores matrices of size $n^2$ B
  - in my implementation $n^2 = 4^2 = 2^4 = 16$ --> block memory can store $2^{16} / 2^4 = 2^{12} = 4096$ matrices
- block memory is **not generally addressable** (i.e.
there isn't an input signal that permits a module to access a byte at an arbitrary address)
  - each instruction in the ISA (described in the next part) has a specifically-defined block memory access pattern that restricts how many bytes can be retrieved for each operation
  - --> the block memory module has **specific logic defined for each possible instruction** (e.g. in the previous section when supplying the staggered $A$ input to the systolic array only $n / k$ bytes are retrieved from block memory for a given matrix)
- block memory is **shared** - all threads running on the device have read/write access to block memory (as well as the device controller - the controller loads matrices from the driver over UART and writes them into the block memory)
  - block memory writing is synchronized using a global lock (similar to the UART module) - only one thread can write to any address in block memory on a given clock cycle
  - reads are **not synchronized by a lock** - this is not a concern for the device design since reads are asynchronous

### Instruction Memory

- instruction memory contains $4 * 2^8 = 2^{10}$ B
- each instruction contains $4$ B --> instruction memory can store $2^8 = 256$ instructions
- instruction memory is **local** - each thread has a completely separate associated memory module (the justification for this design choice is provided in the next part)
-