# Cherry FPGA DMA

In the sequence diagram below, responses are drawn as dotted lines. The protocol guarantees that responses arrive in order. We are trying to make this look like PCIe so that moving to PCIe later will feel natural.

All messages are fire-and-forget, except for memory reads, which need a response.

Device<->host communication uses UART or Ethernet. A packet has a 1-byte header that defines the packet type and length, followed by the payload (a minimal framing sketch appears at the end of this section).

Sometimes order matters, sometimes it doesn't:

* Reordering memory reads and writes is bad.
* The host should send its memory read responses back in order to simplify the hardware. Later we can make the hardware stronger to remove this requirement.
* Reordering program start commands is bad.
* Reordering program uploads is fine.
* Reordering program finished messages is fine. The host will just assume previous programs have also finished.

In the diagram, Host refers to the host driver and Python refers to the tinygrad/PyTorch runtime.

```sequence
Title: DMA and Memory Movement
Device->Host: Memory Read Request
Note right of Host: Reads 1GB tiled userspace array
Host-->Device: Returns data tile
Note left of Device: Puts data in L3 cache queue
Host->Device: Upload program id #2
Note left of Device: Saves program to pcache
Host->Device: Start program id #2
Note left of Device: Enqueues the program
Device->Host: Memory Write Request
Device->Host: Program #2 finished
Note over Python: tensor.to_gpu()
Note right of Python: Reshape to tile memory
Python->Host: tiled tensor to 1GB userspace array
Note over Python: tensor.from_gpu()
Host-->Python: Row contiguous tensor
```

There are two types of memory on the host PC. One is tiled memory, which imitates how we'll structure memory in VRAM in the future. It is a 1GB array (emulating 1GB of VRAM), and tensors are stored tile-contiguous. The other is normal row-contiguous memory with strides, which is how PyTorch and tinygrad store tensors. Right now `to_gpu()` moves a tensor from row-contiguous memory to the 1GB tiled memory to emulate a transfer to VRAM. In the future, it will move the tensor to the real VRAM.
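As an illustration of the tiled-memory emulation above, here is a minimal sketch of how a 2D tensor could be shuffled between a row-contiguous layout and a tile-contiguous 1GB userspace array. The 16x16 tile size, the `to_tiled`/`from_tiled` helpers, and the offset-as-device-pointer convention are assumptions made for this sketch; only the 1GB array and the `to_gpu()`/`from_gpu()` names come from this document.

```python
import numpy as np

TILE = 16                    # assumed tile edge; the real tile size isn't specified here
VRAM_BYTES = 1 << 30         # the 1GB userspace array emulating VRAM
vram = np.zeros(VRAM_BYTES // 4, dtype=np.float32)   # hypothetical backing store

def to_tiled(t: np.ndarray) -> np.ndarray:
    """Reshape a row-contiguous 2D tensor into tile-contiguous order.
    Assumes both dimensions are multiples of TILE."""
    h, w = t.shape
    # (h, w) -> (h/TILE, TILE, w/TILE, TILE), then order tiles first, elements within a tile last
    return t.reshape(h // TILE, TILE, w // TILE, TILE).swapaxes(1, 2).ravel()

def from_tiled(flat: np.ndarray, shape) -> np.ndarray:
    """Inverse of to_tiled: rebuild the row-contiguous tensor."""
    h, w = shape
    return flat.reshape(h // TILE, w // TILE, TILE, TILE).swapaxes(1, 2).reshape(h, w)

def to_gpu(t: np.ndarray, offset: int = 0) -> int:
    """Emulate a VRAM transfer: write the tiled tensor into the 1GB array."""
    flat = to_tiled(t)
    vram[offset:offset + flat.size] = flat
    return offset  # stands in for a device pointer

def from_gpu(offset: int, shape) -> np.ndarray:
    """Read back from the 1GB array and return a row-contiguous tensor."""
    h, w = shape
    return from_tiled(vram[offset:offset + h * w].copy(), shape)
```

Round-tripping a tensor, e.g. `from_gpu(to_gpu(x), x.shape)`, should return the original row-contiguous data unchanged.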
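The packet framing described earlier (a 1-byte header, then the payload) could look roughly like the sketch below. The type codes and per-type payload lengths are made up for illustration, and the sketch assumes the header byte identifies the type while the length is implied by the type, which is only one possible reading of the format.

```python
import struct
from enum import IntEnum

class PacketType(IntEnum):
    # Hypothetical type codes; the real encoding isn't specified in this doc.
    MEM_READ_REQ  = 0x01   # device -> host
    MEM_READ_RESP = 0x02   # host  -> device (the only response packet)
    MEM_WRITE_REQ = 0x03   # device -> host
    PROG_UPLOAD   = 0x04   # host  -> device
    PROG_START    = 0x05   # host  -> device
    PROG_FINISHED = 0x06   # device -> host

# Assumed fixed payload length per type, so the 1-byte header alone
# determines how many payload bytes follow. All values are placeholders.
PAYLOAD_LEN = {
    PacketType.MEM_READ_REQ:  12,
    PacketType.MEM_READ_RESP: 64,
    PacketType.MEM_WRITE_REQ: 76,
    PacketType.PROG_UPLOAD:   256,
    PacketType.PROG_START:    4,
    PacketType.PROG_FINISHED: 4,
}

def pack(ptype: PacketType, payload: bytes) -> bytes:
    """Prepend the 1-byte header to a payload of the expected length."""
    assert len(payload) == PAYLOAD_LEN[ptype]
    return struct.pack("B", ptype) + payload

def read_packet(stream) -> tuple[PacketType, bytes]:
    """Read one packet from a byte stream (e.g. a UART or socket file object)."""
    ptype = PacketType(stream.read(1)[0])
    return ptype, stream.read(PAYLOAD_LEN[ptype])
```

Keeping the payload length implied by the type keeps the header to a single byte, matching the description above; if payload lengths need to vary, an explicit length field would have to follow the header.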