# `struct dma_fence` notes

# tl;dr

`struct dma_fence` is refcounted. It must be initialized after allocation, which will leave the object with a refcount of 1. When the refcount becomes zero the object is freed. The default implementation will free lazily with `kfree_rcu()`.

The structure can be embedded in other structures. If embedded as the first field, the refcounting of the embedded object can be used to manage the lifetime of the super object. This is the intended use.

The structure is intended to be dynamically allocated, although it might work stack allocated if the original owner never decrements the refcount.

The fence can be customized by populating a vtable. Most notable are the options to change the cleanup function that is called when the refcount becomes zero, and the default wait function.

The primary function provided by `struct dma_fence` is that of a semaphore. Multiple waiters can wait on the fence, to be released when the fence is signaled. As an additional feature, users of the fence can enqueue callbacks to be called when the fence is signaled.

Waiters of the fence must obtain a reference on the fence before waiting on it. Methods for obtaining refcounts in normal and RCU context are provided.

All operations on the fence are protected by a spinlock. This lock is provided by the user at initialization time as a pointer that is installed in the object.

# Details

## Files

* `include/linux/dma-fence.h`
* `drivers/dma-buf/dma-fence.c`
* `drivers/gpu/drm/virtio/virtgpu_drv.h`
* `drivers/gpu/drm/virtio/virtgpu_fence.c`
* `include/drm/drm_syncobj.h`
* `drivers/gpu/drm/drm_syncobj.c`

## Description

### Lifted from `dma-fence.c`

DOC: DMA fences overview

DMA fences, represented by &struct dma_fence, are the kernel internal synchronization primitive for DMA operations like GPU rendering, video encoding/decoding, or displaying buffers on a screen.

A fence is initialized using dma_fence_init() and completed using dma_fence_signal(). Fences are associated with a context, allocated through dma_fence_context_alloc(), and all fences on the same context are fully ordered.

Since the purposes of fences is to facilitate cross-device and cross-application synchronization, there's multiple ways to use one:

- Individual fences can be exposed as a &sync_file, accessed as a file descriptor from userspace, created by calling sync_file_create(). This is called explicit fencing, since userspace passes around explicit synchronization points.

- Some subsystems also have their own explicit fencing primitives, like &drm_syncobj. Compared to &sync_file, a &drm_syncobj allows the underlying fence to be updated.

- Then there's also implicit fencing, where the synchronization points are implicitly passed around as part of shared &dma_buf instances. Such implicit fences are stored in &struct dma_resv through the &dma_buf.resv pointer.
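The overview above names the three entry points of the fence lifecycle. Below is a minimal sketch (not lifted from the kernel docs) of allocating a context, initializing a fence on it, and handing it out; all `example_*` identifiers are hypothetical. In current kernels only the two name callbacks in `struct dma_fence_ops` are mandatory; waiting and release fall back to `dma_fence_default_wait()` and `dma_fence_free()`.

```c
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical single-timeline driver state. */
static DEFINE_SPINLOCK(example_lock);
static u64 example_context;	/* from dma_fence_context_alloc() */
static u64 example_seqno;

static const char *example_driver_name(struct dma_fence *f)
{
	return "example";
}

static const char *example_timeline_name(struct dma_fence *f)
{
	return "example-timeline";
}

static const struct dma_fence_ops example_fence_ops = {
	.get_driver_name = example_driver_name,
	.get_timeline_name = example_timeline_name,
	/* .wait and .release fall back to the defaults. */
};

static struct dma_fence *example_fence_create(void)
{
	struct dma_fence *fence = kzalloc(sizeof(*fence), GFP_KERNEL);

	if (!fence)
		return NULL;

	if (!example_context)
		example_context = dma_fence_context_alloc(1);

	/* Fences on the same context are ordered by their seqno. */
	dma_fence_init(fence, &example_fence_ops, &example_lock,
		       example_context, ++example_seqno);
	return fence;	/* refcount is 1 */
}
```

The producer would later complete the fence with `dma_fence_signal(fence)` and drop its reference with `dma_fence_put(fence)`.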
DOC: fence cross-driver contract

Since &dma_fence provide a cross driver contract, all drivers must follow the same rules:

* Fences must complete in a reasonable time. Fences which represent kernels and shaders submitted by userspace, which could run forever, must be backed up by timeout and gpu hang recovery code. Minimally that code must prevent further command submission and force complete all in-flight fences, e.g. when the driver or hardware do not support gpu reset, or if the gpu reset failed for some reason. Ideally the driver supports gpu recovery which only affects the offending userspace context, and no other userspace submissions.

* Drivers may have different ideas of what completion within a reasonable time means. Some hang recovery code uses a fixed timeout, others a mix between observing forward progress and increasingly strict timeouts. Drivers should not try to second guess timeout handling of fences from other drivers.

* To ensure there's no deadlocks of dma_fence_wait() against other locks drivers should annotate all code required to reach dma_fence_signal(), which completes the fences, with dma_fence_begin_signalling() and dma_fence_end_signalling().

* Drivers are allowed to call dma_fence_wait() while holding dma_resv_lock(). This means any code required for fence completion cannot acquire a &dma_resv lock. Note that this also pulls in the entire established locking hierarchy around dma_resv_lock() and dma_resv_unlock().

* Drivers are allowed to call dma_fence_wait() from their &shrinker callbacks. This means any code required for fence completion cannot allocate memory with GFP_KERNEL.

* Drivers are allowed to call dma_fence_wait() from their &mmu_notifier respectively &mmu_interval_notifier callbacks. This means any code required for fence completion cannot allocate memory with GFP_NOFS or GFP_NOIO. Only GFP_ATOMIC is permissible, which might fail.

Note that only GPU drivers have a reasonable excuse for both requiring &mmu_interval_notifier and &shrinker callbacks at the same time as having to track asynchronous compute work using &dma_fence. No driver outside of drivers/gpu should ever call dma_fence_wait() in such contexts.
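The signalling-annotation rule from the contract above can be made concrete with a small, hypothetical sketch; `example_complete_job()` is an invented name, while `dma_fence_begin_signalling()`/`dma_fence_end_signalling()` are the real helpers that let lockdep cross-check the critical section against `dma_fence_wait()` callers.

```c
#include <linux/dma-fence.h>

/*
 * Everything between begin/end must be able to run to dma_fence_signal()
 * without taking dma_resv locks or blocking on GFP_KERNEL/GFP_NOFS/GFP_NOIO
 * allocations, per the contract above. Lockdep will flag violations.
 */
static void example_complete_job(struct dma_fence *fence)
{
	bool cookie = dma_fence_begin_signalling();

	/* ... poke hardware, advance completion state machines, ... */

	dma_fence_signal(fence);
	dma_fence_end_signalling(cookie);
}
```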
### Structure Layout

```c
/**
 * struct dma_fence - software synchronization primitive
 * @refcount: refcount for this fence
 * @ops: dma_fence_ops associated with this fence
 * @rcu: used for releasing fence with kfree_rcu
 * @cb_list: list of all callbacks to call
 * @lock: spin_lock_irqsave used for locking
 * @context: execution context this fence belongs to, returned by
 *           dma_fence_context_alloc()
 * @seqno: the sequence number of this fence inside the execution context,
 * can be compared to decide which fence would be signaled later.
 * @flags: A mask of DMA_FENCE_FLAG_* defined below
 * @timestamp: Timestamp when the fence was signaled.
 * @error: Optional, only valid if < 0, must be set before calling
 * dma_fence_signal, indicates that the fence has completed with an error.
 *
 * the flags member must be manipulated and read using the appropriate
 * atomic ops (bit_*), so taking the spinlock will not be needed most
 * of the time.
 *
 * DMA_FENCE_FLAG_SIGNALED_BIT - fence is already signaled
 * DMA_FENCE_FLAG_TIMESTAMP_BIT - timestamp recorded for fence signaling
 * DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT - enable_signaling might have been called
 * DMA_FENCE_FLAG_USER_BITS - start of the unused bits, can be used by the
 * implementer of the fence for its own purposes. Can be used in different
 * ways by different fence implementers, so do not rely on this.
 *
 * Since atomic bitops are used, this is not guaranteed to be the case.
 * Particularly, if the bit was set, but dma_fence_signal was called right
 * before this bit was set, it would have been able to set the
 * DMA_FENCE_FLAG_SIGNALED_BIT, before enable_signaling was called.
 * Adding a check for DMA_FENCE_FLAG_SIGNALED_BIT after setting
 * DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT closes this race, and makes sure that
 * after dma_fence_signal was called, any enable_signaling call will have
 * either been completed, or never called at all.
 */
struct dma_fence {
	spinlock_t *lock;
	const struct dma_fence_ops *ops;
	/*
	 * We clear the callback list on kref_put so that by the time we
	 * release the fence it is unused. No one should be adding to the
	 * cb_list that they don't themselves hold a reference for.
	 *
	 * The lifetime of the timestamp is similarly tied to both the
	 * rcu freelist and the cb_list. The timestamp is only set upon
	 * signaling while simultaneously notifying the cb_list. Ergo, we
	 * only use either the cb_list or the timestamp. Upon destruction,
	 * neither are accessible, and so we can use the rcu. This means
	 * that the cb_list is *only* valid until the signal bit is set,
	 * and to read either you *must* hold a reference to the fence,
	 * and not just the rcu_read_lock.
	 *
	 * Listed in chronological order.
	 */
	union {
		struct list_head cb_list;
		/* @cb_list replaced by @timestamp on dma_fence_signal() */
		ktime_t timestamp;
		/* @timestamp replaced by @rcu on dma_fence_release() */
		struct rcu_head rcu;
	};
	u64 context;
	u64 seqno;
	unsigned long flags;
	struct kref refcount;
	int error;
};
```

## Use

`dma_fence` is a synchronization primitive. Its function can be customized by populating a vtable.

In the default configuration, after fence allocation and initialization, threads can wait for the fence to be signaled. Once the fence is signaled, all waiters are notified. The fence is one-shot.

Before waiting on the fence, callers must obtain a reference. This can be done with `dma_fence_get()` in normal context or `dma_fence_get_rcu()` in RCU read-side context.

## Implementation

`dma_fence` is implemented with a spinlock pointer (`fence->lock`) and a callback list. The callback list is a linked list of:

```c
struct dma_fence_cb {
	struct list_head node;
	dma_fence_func_t func;
};
```

Callbacks are added to the list through the default wait function `dma_fence_default_wait()`, which will add `dma_fence_default_wait_cb()` to the callback list, or through `dma_fence_add_callback()`, which will add an arbitrary callback function to the list.

> ⚠️ Callbacks cannot be added after the fence has been signaled. This will
> result in `-ENOENT`.

Entries are added to the list with `fence->lock` held.

The default wait function `dma_fence_default_wait()` (reached from `dma_fence_wait()` through the vtable) puts the function `dma_fence_default_wait_cb()` on the callback list. This function wakes up the task that called `dma_fence_default_wait()`. After queuing the callback, `dma_fence_default_wait()` goes to sleep via `schedule_timeout()`. On wake, the task takes `fence->lock` again and checks whether the fence was signaled. If it was not, the wait ends with a timeout or an interruption error. The caller of `dma_fence_wait()` must hold a reference to the fence before the call.

To signal the fence, a user has to call one of the signaling methods, the simplest being `dma_fence_signal()`. This will acquire `fence->lock`, take the callback list, call each callback, and release the lock.

The callback list of `struct dma_fence` is aliased with two other fields. After initialization, the location of the callback list is a `struct list_head`. After signaling, it becomes a `ktime_t` that stores the timestamp when the fence was signaled. When `dma_fence_release()` is called, the location becomes a `struct rcu_head` used for cleaning up.
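Putting the pieces above together, here is a hypothetical consumer that adds a callback and then waits with a timeout. The `-ENOENT` behavior and the reference requirement are as described above; `my_fence_cb` and `example_consume` are invented names.

```c
#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

static void my_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	/* Called with fence->lock held once the fence is signaled. */
	pr_info("fence signaled\n");
}

static long example_consume(struct dma_fence *fence)
{
	struct dma_fence_cb cb;
	long timeout;
	int ret;

	/* Hold a reference for the whole time we touch the fence. */
	dma_fence_get(fence);

	/* Returns -ENOENT if the fence was already signaled. */
	ret = dma_fence_add_callback(fence, &cb, my_fence_cb);
	if (ret == -ENOENT)
		pr_info("fence was already signaled\n");

	/* Interruptible wait: 0 on timeout, negative on interruption. */
	timeout = dma_fence_wait_timeout(fence, true,
					 msecs_to_jiffies(100));

	/* If our stack-allocated callback never ran, unlink it. */
	if (ret == 0)
		dma_fence_remove_callback(fence, &cb);

	dma_fence_put(fence);
	return timeout;
}
```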
### Initialization and cleanup

`struct dma_fence` is reference counted by an embedded `struct kref`.

`struct dma_fence` has to be initialized by a call to `dma_fence_init()`. This will initialize all fields of the structure, such as the lock, refcount, vtable, etc. After `dma_fence_init()`, the fence will have a refcount of 1.

The fence is cleaned up when the refcount goes to zero, by a call to `dma_fence_release()`. This will in turn call either a custom cleanup function via the vtable, or the default `dma_fence_free()`, which calls `kfree_rcu()`. This means that if the structure is stack allocated, the refcount must _never_ go to zero and the owner must never call `dma_fence_put()`. The object must have been allocated with `kmalloc()` or `kmem_cache_alloc()`.

If the structure is embedded within another structure, `fence->ops->release` _must_ be customized if the `struct dma_fence` is not the first field of the parent structure (see the sketch at the end of the next section).

## Embedding

`struct virtio_gpu_fence` is an example of embedding `struct dma_fence` for the purpose of customizing behavior. This driver batches fence signaling: to queue an event for emission, the driver adds a `struct virtio_gpu_fence` to a list; later, it signals all of the fences on the list that are ready.

The driver uses one lock for all its fences. Fences are freed through RCU by the default cleanup function once the refcount goes to zero.
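To make the release-customization rule concrete, here is a minimal hypothetical sketch (`struct my_job` and friends are invented, unlike `struct virtio_gpu_fence`) where the fence is deliberately not the first field, so a custom `release` hook is required to free the super object.

```c
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical super object; the fence is deliberately NOT the first field. */
struct my_job {
	void *payload;
	struct dma_fence fence;
	spinlock_t lock;
};

static const char *my_job_driver_name(struct dma_fence *f)
{
	return "my-driver";
}

static const char *my_job_timeline_name(struct dma_fence *f)
{
	return "my-timeline";
}

/*
 * The default dma_fence_free() would kfree_rcu() the fence pointer itself,
 * which is wrong here because the fence is embedded mid-structure. The
 * custom release hook recovers the super object with container_of() and
 * frees that instead, reusing the fence's rcu head.
 */
static void my_job_fence_release(struct dma_fence *f)
{
	struct my_job *job = container_of(f, struct my_job, fence);

	kfree_rcu(job, fence.rcu);
}

static const struct dma_fence_ops my_job_fence_ops = {
	.get_driver_name = my_job_driver_name,
	.get_timeline_name = my_job_timeline_name,
	.release = my_job_fence_release,
};
```

With this vtable installed by `dma_fence_init()`, the embedded fence's refcount manages the lifetime of the whole `struct my_job`, mirroring what the tl;dr describes for first-field embedding.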