Cherry FPGA DMA

In the sequence diagram below, responses are drawn as dotted lines. The protocol guarantees that responses arrive in order.

We're trying to make this look like PCIe, so moving to PCIe later will feel natural.

All messages are fire-and-forget, except for memory reads, which need a response. Device<->host communication uses UART or Ethernet.

A packet has a one-byte header that defines the packet type and length, followed by the payload.
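
The exact header encoding isn't spelled out here; one plausible reading is that the type byte implies a fixed payload length per packet type. Here's a minimal sketch under that assumption (the packet names and lengths below are made up for illustration, not the real encoding):

```python
from enum import IntEnum

# Hypothetical packet types; the real values live in the host driver / FPGA.
class PacketType(IntEnum):
    MEM_READ_REQUEST  = 0x01
    MEM_READ_RESPONSE = 0x02
    MEM_WRITE_REQUEST = 0x03
    UPLOAD_PROGRAM    = 0x04
    START_PROGRAM     = 0x05
    PROGRAM_FINISHED  = 0x06

# Assumed fixed payload length per type (bytes): the type byte alone tells the
# receiver how much to read, so no separate length field is needed.
PAYLOAD_LEN = {
    PacketType.MEM_READ_REQUEST:  8,    # e.g. a 64-bit address
    PacketType.MEM_READ_RESPONSE: 64,   # e.g. one burst of data
    PacketType.MEM_WRITE_REQUEST: 72,   # e.g. address + data
    PacketType.UPLOAD_PROGRAM:    256,  # e.g. program id + code
    PacketType.START_PROGRAM:     4,    # e.g. program id
    PacketType.PROGRAM_FINISHED:  4,    # e.g. program id
}

def encode_packet(ptype: PacketType, payload: bytes) -> bytes:
    # One header byte, then the payload.
    assert len(payload) == PAYLOAD_LEN[ptype]
    return bytes([ptype]) + payload

def decode_packet(stream) -> tuple[PacketType, bytes]:
    # `stream` is anything with read(), e.g. a pyserial port or a socket file.
    ptype = PacketType(stream.read(1)[0])
    return ptype, stream.read(PAYLOAD_LEN[ptype])
```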

Sometimes order matters, sometimes it doesn't:

  • Reordering memory reads and writes is bad.
  • The host should send its memory read responses back in order, to simplify the hardware (a software sketch of this follows the list). Later we can make the hardware stronger to remove this requirement.
  • Reordering program start commands is bad.
  • Reordering program uploads is fine.
  • Reordering program-finished messages is OK. The host will just assume previous programs have also finished.
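
Because read responses come back in request order, the requester only needs a FIFO of outstanding reads to pair each response with the read that asked for it. A minimal software model of that bookkeeping (the class and method names are made up):

```python
from collections import deque

class InOrderReadTracker:
    """Device-side bookkeeping model: since the host returns memory read
    responses in request order, a plain FIFO is enough to match them."""

    def __init__(self):
        self.outstanding = deque()  # addresses of reads still in flight

    def issue_read(self, addr: int, send_packet):
        # Remember the address, then fire the Memory Read Request at the host.
        self.outstanding.append(addr)
        send_packet(addr)

    def on_response(self, data: bytes) -> tuple[int, bytes]:
        # The oldest outstanding read owns this response -- no tags needed.
        addr = self.outstanding.popleft()
        return addr, data
```

If we later let the host reorder responses, each request would need a tag that the response echoes back; that's the "stronger hardware" direction mentioned above.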

"Host" refers to the host driver.
"Python" refers to the tinygrad/PyTorch runtime.

[Sequence diagram "DMA and Memory Movement" between Device, Host, and Python: the device sends a Memory Read Request, the host reads the 1GB tiled userspace array and returns a data tile, and the device puts the data in its L3 cache queue; program id #2 is uploaded (saved to the pcache) and started (enqueued); the device issues a Memory Write Request and reports "Program #2 finished"; on the Python side, tensor.to_gpu() reshapes the tensor to tile memory and writes the tiled tensor into the 1GB userspace array, and tensor.from_gpu() returns a row-contiguous tensor.]
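
Based on that message flow, here's a rough sketch of what the host driver's receive loop could look like, reusing the hypothetical PacketType / PAYLOAD_LEN / encode_packet / decode_packet helpers sketched earlier. None of this is the real driver; it just shows which packets the host has to answer.

```python
def host_loop(stream, tiled_mem: bytearray):
    """Rough host-driver receive loop over a UART/Ethernet stream.
    Assumes the hypothetical packet helpers defined above."""
    while True:
        ptype, payload = decode_packet(stream)
        if ptype == PacketType.MEM_READ_REQUEST:
            # Device asks for data: read it out of the 1GB tiled userspace array
            # and send the response back (in request order).
            addr = int.from_bytes(payload, "little")
            data = bytes(tiled_mem[addr:addr + PAYLOAD_LEN[PacketType.MEM_READ_RESPONSE]])
            stream.write(encode_packet(PacketType.MEM_READ_RESPONSE, data))
        elif ptype == PacketType.MEM_WRITE_REQUEST:
            # Device writes results back into the tiled array.
            addr = int.from_bytes(payload[:8], "little")
            tiled_mem[addr:addr + len(payload) - 8] = payload[8:]
        elif ptype == PacketType.PROGRAM_FINISHED:
            # Per the ordering notes: all earlier programs are assumed finished too.
            prog_id = int.from_bytes(payload, "little")
            print(f"program {prog_id} finished")
```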

There are two types of memory on the host PC.

One is tiled memory, which imitates how we'll structure memory in VRAM in the future. It's a 1GB array (emulating 1GB of VRAM), and tensors are stored tile-contiguous.

The other is normal row-contiguous memory with strides. This is how PyTorch and tinygrad store tensors.

Right now, to_gpu() moves a tensor from row-contiguous memory to the 1GB tiled memory to emulate a transfer to VRAM. In the future, it will move the tensor to real VRAM.
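
To make the layout concrete, here's a rough numpy model of that conversion. The 16x16 tile size, the flat float32 "VRAM" array, and the function signatures are assumptions for illustration, not the actual implementation:

```python
import numpy as np

TILE = 16  # assumed tile edge; the real tile size isn't specified here
VRAM = np.zeros((1 << 30) // 4, dtype=np.float32)  # 1GB emulated VRAM (float32 elements)

def to_gpu(t: np.ndarray, offset: int = 0) -> int:
    """Copy a row-contiguous 2D tensor into the tiled 1GB array.
    Pads to a multiple of the tile size and stores each TILE x TILE tile contiguously."""
    rows, cols = t.shape
    padded = np.pad(t, ((0, -rows % TILE), (0, -cols % TILE)))
    # (R/T, T, C/T, T) -> (R/T, C/T, T, T): tiles become the contiguous unit.
    tiled = padded.reshape(padded.shape[0] // TILE, TILE,
                           padded.shape[1] // TILE, TILE).swapaxes(1, 2)
    flat = tiled.ravel().astype(np.float32)
    VRAM[offset:offset + flat.size] = flat
    return offset  # "device address" of the tensor

def from_gpu(offset: int, shape: tuple[int, int]) -> np.ndarray:
    """Inverse of to_gpu(): rebuild the row-contiguous tensor from the tiled array."""
    rows, cols = shape
    R, C = rows + (-rows % TILE), cols + (-cols % TILE)
    tiled = VRAM[offset:offset + R * C].reshape(R // TILE, C // TILE, TILE, TILE)
    return tiled.swapaxes(1, 2).reshape(R, C)[:rows, :cols].copy()

# Round-trip check on a small tensor.
t = np.arange(12, dtype=np.float32).reshape(3, 4)
assert np.array_equal(from_gpu(to_gpu(t), t.shape), t)
```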