Cherry FPGA DMA

In the sequence diagram below, responses are drawn as dotted lines. The protocol guarantees that responses arrive in order.

We're trying to make this look like PCIe, so moving to PCIe later will feel natural.

All messages are fire-and-forget, except for memory reads, which need a response. Device<->host communication uses UART or Ethernet.

A packet has a one-byte header that defines the packet type and length, followed by the payload.
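
The exact header encoding isn't spelled out here; one plausible reading is that the type byte implies a fixed payload length per packet type. Here's a minimal sketch under that assumption (the packet names and lengths below are made up for illustration, not the real encoding):

```python
from enum import IntEnum

# Hypothetical packet types; the real values live in the host driver / FPGA.
class PacketType(IntEnum):
    MEM_READ_REQUEST  = 0x01
    MEM_READ_RESPONSE = 0x02
    MEM_WRITE_REQUEST = 0x03
    UPLOAD_PROGRAM    = 0x04
    START_PROGRAM     = 0x05
    PROGRAM_FINISHED  = 0x06

# Assumed fixed payload length per type (bytes): the type byte alone tells the
# receiver how much to read, so no separate length field is needed.
PAYLOAD_LEN = {
    PacketType.MEM_READ_REQUEST:  8,    # e.g. a 64-bit address
    PacketType.MEM_READ_RESPONSE: 64,   # e.g. one burst of data
    PacketType.MEM_WRITE_REQUEST: 72,   # e.g. address + data
    PacketType.UPLOAD_PROGRAM:    256,  # e.g. program id + code
    PacketType.START_PROGRAM:     4,    # e.g. program id
    PacketType.PROGRAM_FINISHED:  4,    # e.g. program id
}

def encode_packet(ptype: PacketType, payload: bytes) -> bytes:
    # One header byte, then the payload.
    assert len(payload) == PAYLOAD_LEN[ptype]
    return bytes([ptype]) + payload

def decode_packet(stream) -> tuple[PacketType, bytes]:
    # `stream` is anything with read(), e.g. a pyserial port or a socket file.
    ptype = PacketType(stream.read(1)[0])
    return ptype, stream.read(PAYLOAD_LEN[ptype])
```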

Sometimes order matters, sometimes it doesn't:

  • Reordering memory reads and writes is bad.
  • The host should send its memory read responses back in order, to simplify the hardware (a software sketch of this follows the list). Later we can make the hardware stronger to remove this requirement.
  • Reordering program start commands is bad.
  • Reordering program uploads is fine.
  • Reordering program-finished messages is OK. The host will just assume previous programs have also finished.
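
Because read responses come back in request order, the requester only needs a FIFO of outstanding reads to pair each response with the read that asked for it. A minimal software model of that bookkeeping (the class and method names are made up):

```python
from collections import deque

class InOrderReadTracker:
    """Device-side bookkeeping model: since the host returns memory read
    responses in request order, a plain FIFO is enough to match them."""

    def __init__(self):
        self.outstanding = deque()  # addresses of reads still in flight

    def issue_read(self, addr: int, send_packet):
        # Remember the address, then fire the Memory Read Request at the host.
        self.outstanding.append(addr)
        send_packet(addr)

    def on_response(self, data: bytes) -> tuple[int, bytes]:
        # The oldest outstanding read owns this response -- no tags needed.
        addr = self.outstanding.popleft()
        return addr, data
```

If we later let the host reorder responses, each request would need a tag that the response echoes back; that's the "stronger hardware" direction mentioned above.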

"Host" refers to the host driver.
"Python" refers to the tinygrad/PyTorch runtime.

[Sequence diagram "DMA and Memory Movement" between Device, Host, and Python: the device sends a Memory Read Request, the host reads the 1GB tiled userspace array and returns a data tile, and the device puts the data in its L3 cache queue; program id #2 is uploaded (saved to the pcache) and started (enqueued); the device issues a Memory Write Request and reports "Program #2 finished"; on the Python side, tensor.to_gpu() reshapes the tensor to tile memory and writes the tiled tensor into the 1GB userspace array, and tensor.from_gpu() returns a row-contiguous tensor.]
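
Based on that message flow, here's a rough sketch of what the host driver's receive loop could look like, reusing the hypothetical PacketType / PAYLOAD_LEN / encode_packet / decode_packet helpers sketched earlier. None of this is the real driver; it just shows which packets the host has to answer.

```python
def host_loop(stream, tiled_mem: bytearray):
    """Rough host-driver receive loop over a UART/Ethernet stream.
    Assumes the hypothetical packet helpers defined above."""
    while True:
        ptype, payload = decode_packet(stream)
        if ptype == PacketType.MEM_READ_REQUEST:
            # Device asks for data: read it out of the 1GB tiled userspace array
            # and send the response back (in request order).
            addr = int.from_bytes(payload, "little")
            data = bytes(tiled_mem[addr:addr + PAYLOAD_LEN[PacketType.MEM_READ_RESPONSE]])
            stream.write(encode_packet(PacketType.MEM_READ_RESPONSE, data))
        elif ptype == PacketType.MEM_WRITE_REQUEST:
            # Device writes results back into the tiled array.
            addr = int.from_bytes(payload[:8], "little")
            tiled_mem[addr:addr + len(payload) - 8] = payload[8:]
        elif ptype == PacketType.PROGRAM_FINISHED:
            # Per the ordering notes: all earlier programs are assumed finished too.
            prog_id = int.from_bytes(payload, "little")
            print(f"program {prog_id} finished")
```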

There are two types of memory on the host PC.

One is tiled memory, which imitates how we'll structure memory in VRAM in the future. It's a 1GB array (emulating 1GB of VRAM), and tensors are stored tile-contiguous.

The other is normal row-contiguous memory with strides. This is how PyTorch and tinygrad store tensors.

Right now, to_gpu() moves a tensor from row-contiguous memory to the 1GB tiled memory to emulate a transfer to VRAM. In the future, it will move the tensor to real VRAM.
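
To make the layout concrete, here's a rough numpy model of that conversion. The 16x16 tile size, the flat float32 "VRAM" array, and the function signatures are assumptions for illustration, not the actual implementation:

```python
import numpy as np

TILE = 16  # assumed tile edge; the real tile size isn't specified here
VRAM = np.zeros((1 << 30) // 4, dtype=np.float32)  # 1GB emulated VRAM (float32 elements)

def to_gpu(t: np.ndarray, offset: int = 0) -> int:
    """Copy a row-contiguous 2D tensor into the tiled 1GB array.
    Pads to a multiple of the tile size and stores each TILE x TILE tile contiguously."""
    rows, cols = t.shape
    padded = np.pad(t, ((0, -rows % TILE), (0, -cols % TILE)))
    # (R/T, T, C/T, T) -> (R/T, C/T, T, T): tiles become the contiguous unit.
    tiled = padded.reshape(padded.shape[0] // TILE, TILE,
                           padded.shape[1] // TILE, TILE).swapaxes(1, 2)
    flat = tiled.ravel().astype(np.float32)
    VRAM[offset:offset + flat.size] = flat
    return offset  # "device address" of the tensor

def from_gpu(offset: int, shape: tuple[int, int]) -> np.ndarray:
    """Inverse of to_gpu(): rebuild the row-contiguous tensor from the tiled array."""
    rows, cols = shape
    R, C = rows + (-rows % TILE), cols + (-cols % TILE)
    tiled = VRAM[offset:offset + R * C].reshape(R // TILE, C // TILE, TILE, TILE)
    return tiled.swapaxes(1, 2).reshape(R, C)[:rows, :cols].copy()

# Round-trip check on a small tensor.
t = np.arange(12, dtype=np.float32).reshape(3, 4)
assert np.array_equal(from_gpu(to_gpu(t), t.shape), t)
```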