# Entity Allocation and Reservation
Note that this is written in the context of the remote reservation v9 [PR](https://github.com/bevyengine/bevy/pull/18670), which is not yet merged.
## Background
### What specifically is an entity?
If you've worked with an ECS for a while, this might seem like something to skip over, but it's important to have a precise definition.
Here's my definition: an entity is a group of component values associated with a generally unique id, where "generally unique" means that aliasing is possible but rare.
In Bevy, an entity is specified by an `Entity` id (row and generation), an `ArchetypeId` (the types of components it has), and an `ArchetypeRow` (how to find the values of those components). This is all stored in `EntityMeta`, and in practice, much more is included there.
When an entity is spawned, it gets its own `Entity` id, an archetype, etc. When it is despawned, it is removed from its archetype, and its id is recycled to be reused by a new spawn (with an incremented generation).
Note that it is possible, even quite sensible, to have a set of `Entity` ids that do not correspond to entities. With respect to data, these `Entity` values are valid and unique, but the `ArchetypeId` for each of them is left `INVALID`. This is very useful for custom allocators, for reserving ids for components as entities, and for any other case where particular ids should be retained without a spawned entity behind them.
### How to get an `Entity` id
In the past, getting a new, unique entity id was fundamentally linked to an entity spawning. As a result, there were two methods: `alloc` and `reserve`. The `alloc` method would supply a unique `Entity` id and would require its archetype information to be configured immediately, requiring mutable access to `Entities`. Where mutable access was impossible (ex: `Commands`), `reserve` could be called to get a unique `Entity` id without associating with an archetype.
In order to give these reserved entities an archetype, they would need to be flushed, moving them all from no archetype (`INVALID`) to the empty archetype. Of course, most of them would then immediately be moved into a different archetype, but that's a separate issue.
This flushing process is generally slow and, on paper, unnecessary. The only reason it exists is that `alloc` needed mutable access to set archetypes immediately. The only reason to use it is to explicitly create an empty entity at some point in the future; not very useful, but necessary when the archetype cannot be set directly.
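To make the old two-path model concrete, here is a toy sketch of it. The names, signatures, and layout here are illustrative stand-ins, not Bevy's real API, and the free list, generations, and `INVALID` markers are omitted for brevity.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative stand-in for Bevy's `ArchetypeId`.
#[derive(Clone, Copy, PartialEq)]
struct ArchetypeId(u32);
impl ArchetypeId {
    const EMPTY: Self = ArchetypeId(0);
}

struct Entities {
    // One archetype entry per fully set up entity index.
    metas: Vec<ArchetypeId>,
    // Indices from `metas.len()` up to this cursor are reserved but not flushed.
    next_index: AtomicU32,
}

impl Entities {
    /// `alloc` requires `&mut self` because the archetype is set immediately.
    /// (Toy model: assumes no reservations are pending when `alloc` runs.)
    fn alloc(&mut self, archetype: ArchetypeId) -> u32 {
        let index = self.metas.len() as u32;
        self.metas.push(archetype);
        *self.next_index.get_mut() = index + 1;
        index
    }

    /// `reserve` only needs `&self` thanks to the atomic cursor; this is
    /// what lets `Commands` hand out unique ids without mutable access.
    fn reserve(&self) -> u32 {
        self.next_index.fetch_add(1, Ordering::Relaxed)
    }

    /// `flush` gives every reserved index a meta entry in the empty
    /// archetype; commands then immediately move most of them elsewhere.
    fn flush(&mut self) {
        let end = *self.next_index.get_mut();
        while (self.metas.len() as u32) < end {
            self.metas.push(ArchetypeId::EMPTY);
        }
    }
}
```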
### Remote reservation
To make matters more complicated, for assets as entities, an asset loader must be able to, from an async context, procure an `Entity` for sub assets. This can happen at any time, from anywhere.
That means the part of `Entities` that hands out `Entity` ids needs to be in an `Arc` so other contexts can reference it. As a result, it needs to use synchronization primitives for *everything*, even `alloc` and `reserve`, which is not ideal for performance.
The traditional `RwLock` is much too slow. Actually, even simple atomics are slow enough that using too many of them results in a 40% performance regression for some operations.
Further, the primary consumer of remote reservation, the asset system, is currently *fully* non-blocking, which is a major selling point and not something to sacrifice lightly. If at all possible, locks, awaits, and above all, blocking should be avoided.
## Road to an allocator
### What is the `Entity` allocator?
The allocator is the part of `Entities` responsible for handing out `Entity` ids. This is remarkably similar to a memory allocator: the allocator simply hands out items (memory or `Entity`) without overlapping with other previously handed out items.
The allocator can `alloc`, `free`, and `remote_alloc`. The only thing special about `remote_alloc` is that it can be called from anywhere at any time, whereas `free` and `alloc` must never be called at the same time.
What's left is to find an implementation of `alloc`, `free`, and `remote_alloc` that is fast, lockless, non-blocking, etc. From there, the allocator can concern itself with `Entity` ids, and `Entities` can handle tracking `EntityMeta`, tracking and flushing pending `Entity` ids (also allocated from the allocator), etc.
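As a sketch of that contract (these names and signatures are mine, not the PR's), everything takes `&self` because the allocator ultimately lives behind an `Arc`; the exclusivity of `alloc` and `free` is a promise the caller makes, not something the types enforce here:

```rust
/// Illustrative stand-in for Bevy's `Entity` id (index + generation).
#[derive(Clone, Copy)]
struct Entity {
    index: u32,
    generation: u32,
}

/// A sketch of the allocator's contract; the real type lives inside
/// `Entities` and behind an `Arc` for remote contexts.
struct EntityAllocator {
    // The free list and its packed state, described in the next sections.
}

impl EntityAllocator {
    /// Hands out a unique id. The caller guarantees this never runs
    /// concurrently with `free` (in practice, via `&mut` access to `Entities`).
    fn alloc(&self) -> Entity {
        todo!()
    }

    /// Returns an id for reuse (its generation is bumped before reuse).
    /// Must never run concurrently with `alloc`.
    fn free(&self, entity: Entity) {
        todo!()
    }

    /// Hands out a unique id from anywhere, at any time, possibly
    /// concurrently with `alloc` and `free` on other threads.
    fn remote_alloc(&self) -> Entity {
        todo!()
    }
}
```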
### Implementation attempts
I've worked on this for a while, so there's a lot I could share here. I'll instead just try to build some intuition for how we ended up with what we have now. I give a more comprehensive history [here](https://github.com/bevyengine/bevy/pull/18670#issuecomment-2894571387).
My first thought was to use separate allocators for remote vs local allocation. This makes everything trivially simple! But, long story short, it ends up making access to `EntityMeta` *slightly* slower, and slightly slower in such a hot path is too slow.
A next logical step is to not quite separate the allocators but make them mostly split, letting them re-balance each other at a few sync points. The trouble is that preventing race conditions at those sync points is hard to do efficiently; most of my attempts here were either too slow or unsound. Even if this had worked, having different allocators for remote vs local allocations could cause a slow buildup of memory consumption, effectively a memory leak. For example, if one allocator holds a free entity index, the other allocator can't see it, so it mints a brand-new index instead, growing the `EntityMeta` list.
From my experiments, splitting up the allocator for remote vs local allocations is not worth it. We need one big unified allocator for all kinds of allocation. This is hard; really hard.
I also tried an async-centered request/fulfill model. This works but is, again, slower than ideal. It also requires async and waiting, two additional restrictions on where it can be used.
At the end of the day, we need to keep a list of `Entity` ids that are free and ready to be allocated. That list can grow while another thread is using it (ex: if a `remote_alloc` interrupts a `free` or the other way around). To guard against that, the vector must be pinned in memory; that's puzzle number one. Puzzle number two is that we need to prevent `free` from pushing to the list while `remote_alloc` pops from the list.
My first attempt at this was a structure similar to `RwLock<Vec<Entity>>`, only with a custom, non-blocking twist. But this ended up suffering from the same memory leaking concerns.
Hopefully that gives some intuition for what motivated the current version. Before I dive into those details, why not just add a dependency? Is this really worth reinventing the wheel?
### Faster than a Rust dependency
Why not just use a `concurrent_queue` or some other off-the-shelf concurrent list? I and some others tried this: it's too slow (40% regression). What could we do better?
First, adding to the list does not need to be *as* concurrent as these dependencies make it. In off-the-shelf data structures, `free` and `alloc` would both be fully concurrent, but in the entity allocator, `free` and `alloc` never conflict; we only need to worry about `remote_alloc`! That saves a lot of atomics already.
Second, other libraries don't know their maximum capacity, but the number of `Entity` ids has a well-defined limit. This can be exploited to save even more atomics.
Finally, an `Entity` is only 8 bytes, which opens up a few more small optimizations. Each of these wins is small, but all three add up quickly!
## How the new allocator works
I'll save the technical details for the Rust docs, as they are subject to change, but it's worth building some conceptual intuition for how the allocator works.
I mentioned earlier that there are two puzzles to solve: a growable, pinned in memory list, and a way to prevent `remote_alloc` from stepping on the toes of `free` and `alloc`.
### Pinned lists
A pinned list is a collection that can grow without invalidating references to items in that list. A linked list is a classic example, but it would be way too slow to use here. Fundamentally, we need to expand a `Vec`'s capacity without reallocating its memory.
Typically, when a `Vec` runs out of capacity on a push, it doubles its capacity by reallocating to a different region of memory. What we need is a way to double the capacity without reallocating. The solution is a linked list of fixed-capacity buffers, where each buffer has twice the capacity of the last (doubling). If we run out of room in one buffer, we jump to the next.
Naturally, I was not the first to come up with this. This is called a [`SplitVec`](https://docs.rs/orx-split-vec/latest/orx_split_vec/). This is also used by `concurrent_queue`, etc.
However, we can do better. Since there will never be more than `u32::MAX` entities, we don't need a linked list of these buffers/chunks. Instead, we can have a flat array of atomic pointers to these chunks. Even better, because the chunks double in capacity, we can find the exact location (index of the chunk and index within that chunk) using a handy instruction: count leading zeros, which acts as a kind of inverse base-2 logarithm.
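For example, if chunk `k` holds `2^k` slots (capacities 1, 2, 4, 8, ...), the mapping from a flat index to a (chunk, offset) pair is just a couple of bit tricks. This particular layout is an assumption for illustration; the PR's real chunk sizes may differ.

```rust
/// Map a flat index to (chunk index, offset within chunk), assuming chunk
/// `k` holds `2^k` slots. Works for every index below `u32::MAX`.
fn chunk_indices(index: u32) -> (u32, u32) {
    // `ilog2` compiles down to a count-leading-zeros instruction:
    // chunk = 31 - (index + 1).leading_zeros().
    let chunk = (index + 1).ilog2();
    // Stripping the leading bit leaves the offset within the chunk.
    let offset = (index + 1) - (1 << chunk);
    (chunk, offset)
}

fn main() {
    // Indices 0 | 1 2 | 3 4 5 6 | 7 ... land in chunks 0 | 1 | 2 | 3 ...
    assert_eq!(chunk_indices(0), (0, 0));
    assert_eq!(chunk_indices(2), (1, 1));
    assert_eq!(chunk_indices(6), (2, 3));
    assert_eq!(chunk_indices(7), (3, 0));
}
```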
With an array of these chunks and an atomic length, this solves the first puzzle!
### Preventing race conditions
We want to keep `alloc` and `free` as fast as possible (since the caller prevents those functions from racing). If there's ever any potential for a race condition, `remote_alloc` will be the cause of it, and it should be the one to yield. Further, since these functions are small and fast, we can just let `remote_alloc` spin or yield to the scheduler as needed. This is waiting, but no more so than the CPU scheduler might keep the `alloc` or `free` call waiting (not long at all).
The tricky part is detecting when there is a conflict. The solution is to track the state of the list alongside its length: the length and a generation counter are packed into one 64-bit atomic integer. When we free, the length increases, which changes the packed value; when we pop/alloc, we change both the length and the generation. This makes each state of the list uniquely identified by its generation and length combination. As one reviewer summarized: "There's no ABA problem because once we push to the free list, we can't return to the same length without advancing the generation."
From there, it's simple to make `remote_alloc` compare-and-exchange the packed state instead of blindly decrementing it. If the compare fails, we can just spin and try again. There's currently a bit more to implementing this, but you get the idea.
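Here is a minimal sketch of that state bookkeeping, under an assumed layout (generation in the high 32 bits, length in the low 32); the names are mine, and the real code does more, especially to arbitrate `alloc` and `free` against `remote_alloc`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Assumed layout: generation in the high 32 bits, length in the low 32.
fn pack(generation: u32, len: u32) -> u64 {
    ((generation as u64) << 32) | len as u64
}
fn unpack(state: u64) -> (u32, u32) {
    ((state >> 32) as u32, state as u32)
}

struct FreeListState(AtomicU64);

impl FreeListState {
    /// `free` pushes: the length rises, so the packed value changes and any
    /// in-flight `remote_alloc` compare-exchange is forced to retry.
    /// (Guarding this load/store window against `remote_alloc` is part of
    /// the "bit more" in the real implementation.)
    fn on_push(&self) {
        let (generation, len) = unpack(self.0.load(Ordering::Acquire));
        self.0.store(pack(generation, len + 1), Ordering::Release);
    }

    /// `alloc` pops: the length falls *and* the generation advances, so a
    /// pop followed by a push can never reproduce an old packed value.
    fn on_pop(&self) {
        let (generation, len) = unpack(self.0.load(Ordering::Acquire));
        self.0
            .store(pack(generation.wrapping_add(1), len - 1), Ordering::Release);
    }

    /// `remote_alloc` pops with a compare-exchange, spinning on conflict.
    /// Returns the free-list slot it claimed, or `None` if the list is
    /// empty (the caller then mints a brand-new index instead).
    fn remote_pop(&self) -> Option<u32> {
        let mut state = self.0.load(Ordering::Acquire);
        loop {
            let (generation, len) = unpack(state);
            if len == 0 {
                return None;
            }
            let next = pack(generation.wrapping_add(1), len - 1);
            match self.0.compare_exchange_weak(
                state,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(len - 1),
                Err(current) => state = current, // raced; retry with fresh state
            }
        }
    }
}
```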
### Putting it all together
The allocator is just a composition of a free list and its state. `Entities` holds one copy of the allocator, using Rust's borrow rules to prevent `free` and `alloc` from conflicting. The only difference between `reserve` and `alloc` is that `reserve` also puts the `alloc`ated entity in a pending list to be flushed. Guarding against double flushes (which also allows early flushing!) is as simple as making sure the flushed entity still has an invalid archetype.
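As a tiny illustration of that last guard (the types and names here are stand-ins, not Bevy's internals), flushing is idempotent because it only touches entities whose archetype is still `INVALID`:

```rust
// Illustrative stand-ins for Bevy's internal types.
#[derive(Clone, Copy, PartialEq)]
struct ArchetypeId(u32);
impl ArchetypeId {
    const INVALID: Self = ArchetypeId(u32::MAX);
    const EMPTY: Self = ArchetypeId(0);
}

struct EntityMeta {
    archetype_id: ArchetypeId,
}

/// Flushing only moves entities still in the INVALID archetype, so flushing
/// the same entity twice (or flushing it early) is harmless.
fn flush_one(meta: &mut EntityMeta) {
    if meta.archetype_id == ArchetypeId::INVALID {
        meta.archetype_id = ArchetypeId::EMPTY;
    }
}
```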
With all of that done, entities can be allocated and freed in a fast, centralized way, while still offering remote allocation and reservation. It's lockless, it's non-blocking, and it never awaits.