# Thesis Diary

Here we write down things we've learned/discovered/researched/implemented that should be mentioned, covered or taken into account during the writing of the thesis.

**Suggested format**:

## Heading describing milestone (e.g. Scheduler Prototype)

A description of what the milestone represents.

* Notes can be as simple as these bullet points
* Notes don't need dates, just write them as bullets (or subheadings if necessary)
* Don't write things down too formally or in too much detail (i.e. don't write as if this is the thesis)
* Write enough so that we can read it a few weeks down the line and be able to write about it in the thesis
* ! Link sources used, if any !
* ...

## Component Storage Prototype

### Explore sparse sets\#12

A sparse set is a data structure that allows for efficient initialization O(1), insertion O(1), deletion O(1), look-up O(1) and iteration O(m) of the m integers present in a mostly empty set, where the integers are in the range 0 to n-1. In ECS, a sparse set can be used to quickly query which entities are to be iterated by a system. Another application is to store components of the same type in a sparse set; this is described further down below.

The data structure serves the same purpose as a *bit vector* of size n, but with the benefit of constant initialization time and an iteration time of O(m) instead of O(n), where m << n. (I'm not sure if the initialization benefit is possible in Rust, since it is based on the idea of not cleaning the memory area that is used; cleaning it would take O(n) instead of O(1).)

#### Summary of inner workings of sparse set

Or read https://research.swtch.com/sparse

The data structure is made up of two arrays called `sparse[n]` and `dense[n]`, as well as a counter called `size` which keeps track of the number of integers in the set. When an integer is inserted into the set, it is pushed to the dense array and its index is written to the integer's position in the sparse array, meaning that `dense[index] = number` and `sparse[number] = index`. To check if an integer is part of the set, you simply check that `sparse[number] < size && dense[sparse[number]] == number`. The first check makes sure that the number has not been removed, since stale entries are never cleaned and could otherwise still point at a valid-looking slot. Iteration is quick since the dense array contains all integers of the set in a packed array.

A downside of this data structure is that it uses two arrays of n integers instead of the n bits a bit vector would use, but what is gained is iteration speed, which is more important than memory usage in the case of ECS.

To use the sparse set to store components, a third array is added that keeps the components ordered in the same way as the dense array, providing packed memory and thus efficient iteration of the components.
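A minimal sketch of the operations described above (illustrative only; `dense.len()` plays the role of `size`, and `new` zero-initializes `sparse` in O(n) since Rust does not let us leave the memory dirty):

```rust
/// Minimal sparse set over integers in 0..n.
struct SparseSet {
    sparse: Vec<usize>, // sparse[number] = index into `dense`
    dense: Vec<usize>,  // dense[index] = number; packed, so iteration is O(m)
}

impl SparseSet {
    fn new(n: usize) -> Self {
        // O(n) initialization; the classic O(1) trick relies on reading
        // uninitialized memory, which safe Rust does not allow.
        Self { sparse: vec![0; n], dense: Vec::new() }
    }

    fn contains(&self, number: usize) -> bool {
        let index = self.sparse[number];
        index < self.dense.len() && self.dense[index] == number
    }

    fn insert(&mut self, number: usize) {
        if !self.contains(number) {
            self.sparse[number] = self.dense.len();
            self.dense.push(number);
        }
    }

    fn remove(&mut self, number: usize) {
        if self.contains(number) {
            // Swap the last element into the removed slot to keep `dense` packed.
            let index = self.sparse[number];
            let last = *self.dense.last().unwrap();
            self.dense[index] = last;
            self.sparse[last] = index;
            self.dense.pop();
        }
    }
}
```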
Other references:

* How it is used in EnTT: https://skypjack.github.io/2020-08-02-ecs-baf-part-9/
* Sparse set implemented in Rust: https://docs.rs/crate/sparseset/1.0.0
* Sparse sets are used in Bevy: https://github.com/bevyengine/bevy/blob/main/crates/bevy_ecs/src/storage/sparse_set.rs

### Explore entity archetypes\#11

The idea of archetypes is to store all entities with the same components in the same arrays. This makes it easy to find and iterate entities with a given subset of components. Depending on how the subsets are spread out in memory, archetypes can benefit from multi-threading, where each array can be operated on concurrently. However, one must be careful, as the number of threads must be balanced against how sparse the data is.

A drawback of archetypes is that if components are frequently added and removed, the components must be moved from one array to another. The pattern might benefit from storing archetypes that are relatively constant together, while storing other components elsewhere. Another problem is that the components that are stored together might not be used together, that is, they are not grouped based on usage patterns.

References:

* https://skypjack.github.io/2019-03-07-ecs-baf-part-2/
* https://ajmmertens.medium.com/building-an-ecs-2-archetypes-and-vectorization-fe21690805f9
* https://maz.digital/archetype-what-complex-word-for-nothing
* https://bevyengine.org/news/bevy-0-5/

### Notes for table (and sparse set)

Table Storage

* optimal for query iteration
* if a component is mostly accessed through iteration over its data across many entities, this gives the best performance
* relatively slow when adding/removing table components to existing entities
* creates a new table and copies over all existing table components to a new memory location
* default since most component types are rarely added/removed

Sparse-set Storage

* optimal for fast adding/removing of components to already existing entities
* slower querying
* ex. marker components indicating if an enemy is aware of the player or not
* query filters
* doesn't affect the memory layout of tables → components used on few entities or as temporary effects fit this type of storage

Table Fragmentation

* diverse entities with a varied mix of component types → fragmented memory and less efficient iteration
* sparse sets do not affect the memory layout, so systems that do not care about components stored in sparse sets will not be affected by them existing

References:

* [Bevy component storage](https://bevy-cheatbook.github.io/patterns/component-storage.html)
* [Tables in Bevy](https://github.com/bevyengine/bevy/blob/main/crates/bevy_ecs/src/storage/table.rs)
* [Archetypes and Tables the same](https://bevyengine.org/news/bevy-0-5/)

### Table storage (archetypes)

Table storage is another storage type that Bevy has implemented for their archetypes, but it follows a matrix pattern rather than a sparse set. You have "tables" of entities that share the same set of components. In each table, every row is an entity and every column is a type of component. Instead of iterating all entities, you iterate the tables, which are fewer, matching on the "archetype" of each table, and then iterate over the entities within that table. If a component is removed from or added to an entity, the entity changes its archetype and is moved to a new table.

Iteration is faster than with sparse sets since the components have a fixed column index, making lookup faster. It is also parallelism-friendly: entities only exist in one archetype at a time, so systems that access the same components but in different archetypes can run in parallel.

The problem is that it takes a lot of time to remove/add components to an entity, since the entity needs to be moved to a new archetype table. Therefore, entities with components that are often removed or added (say, a "spotted by enemy" marker component) could use a mixed approach: by storing the often added/removed components in, for example, a sparse set, while the permanent components are table stored, this can be improved. A sketch of the table layout is shown below.
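A hypothetical sketch of the table layout described above. Real implementations such as Bevy's `Table` use type-erased contiguous columns rather than `Box<dyn Any>`; the names here are ours:

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// One table per archetype: one row per entity, one column per component type.
struct Table {
    /// Row index -> entity id; all entities in a table share the same archetype.
    entities: Vec<u32>,
    /// Component type -> packed column with one component per row.
    columns: HashMap<TypeId, Vec<Box<dyn Any>>>,
}

struct World {
    tables: Vec<Table>,
}

impl World {
    /// A query only visits tables whose archetype contains all the requested
    /// component types, then iterates the rows within each matching table.
    fn matching_tables(&self, required: &[TypeId]) -> Vec<&Table> {
        self.tables
            .iter()
            .filter(|table| required.iter().all(|t| table.columns.contains_key(t)))
            .collect()
    }
}
```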
References:

* [Bevy component storage](https://bevy-cheatbook.github.io/patterns/component-storage.html)
* [Tables in Bevy](https://github.com/bevyengine/bevy/blob/main/crates/bevy_ecs/src/storage/table.rs)
* [Archetypes and Tables the same](https://bevyengine.org/news/bevy-0-5/)
* [ECS back and forth](https://skypjack.github.io/2019-03-07-ecs-baf-part-2/)

### Archetype Implementation\#26

* Might be of interest to look into how memory is being utilized by archetypes.
* https://rust-analyzer.github.io/blog/2020/12/04/measuring-memory-usage-in-rust.html

### Archetype Implementation that we ended up making

Archetypes are a way to sort our entities into groups, where each group contains the entities that have the same set of component types. The archetypes themselves are kept as a (Insert the way we store them, vector? hashmap?) inside the world, so all archetypes can be accessed and iterated through by systems.

To cut down on lookup time, the archetypes make use of edges. Every time an entity gains/loses a component, the entity needs to be moved to a new archetype that fits its new component type set. When this is done, the old archetype is recorded by the new archetype as a neighbor in an edge-hashmap/vector/sparse set. This significantly reduces the lookup time for an archetype when an entity needs to be moved between these archetypes, since only the edge-vector/hashmap/sparse set needs to be consulted to find the new archetype. A sketch of the edge lookup is shown below this list.

* How we identify archetypes, entities and components
* Multiple storage options for components (sparse sets or archetype tables)
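A minimal sketch of one possible shape for the edge lookup. The exact storage choices are still open (see above), so all names here are illustrative:

```rust
use std::any::TypeId;
use std::collections::HashMap;

/// Identifies an archetype in the world's archetype collection.
type ArchetypeId = usize;

struct Archetype {
    /// The set of component types this archetype stores.
    component_types: Vec<TypeId>,
    /// Edges to neighboring archetypes: where an entity moves to when a
    /// given component type is added to or removed from it.
    add_edges: HashMap<TypeId, ArchetypeId>,
    remove_edges: HashMap<TypeId, ArchetypeId>,
}

/// Moving an entity after adding a component first checks the edge, and only
/// falls back to a full search (or creation) of the target archetype on a miss.
fn target_after_add(archetype: &Archetype, component: TypeId) -> Option<ArchetypeId> {
    archetype.add_edges.get(&component).copied()
}
```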
## Scheduler prototype

### Atomic memory orderings

Data accesses are defined as any reads or writes to a location in memory (i.e. a variable). Compilers will and do reorder data accesses:

```
x = 1;
y = 3;
x = 2;
```

will be optimized into

```
x = 2;
y = 3;
```

This is okay for sequential programs but causes problems for concurrent ones. Another thread might have expected `x = 1` before `y` was assigned to `3`.

Modern CPUs also reorder data accesses. This is due to the cache hierarchy, where a value might be in the cache of one core but not another one. For example, if two threads are running these programs:

```
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```

then you would expect the normal interleavings to produce either the result `y = 3` or `y = 6`. But because the CPU does not guarantee data access ordering, there is a third possible result, `y = 2`, which can happen when thread 2 knows about `x = 1` from thread 1 but not `y = 3`.

Different CPU architectures provide different guarantees; for example, x86-64 guarantees what is known as _strong ordering_ while ARM does not.

To solve this, there are different memory ordering restrictions defined in Rust and C++20 that instruct both the compiler and hardware not to perform different kinds of reorderings. The strictest one is `SeqCst`, Sequentially Consistent, which restricts the ordering such that data accesses that happen before it stay before it and data accesses that happen after it stay after it. It also guarantees that all threads see these data accesses in the same order. Other memory orderings like `Acquire` and `Release` give subsets of the guarantees that `SeqCst` does, but are faster. `Relaxed` ordering means that both the compiler and CPU are free to reorder the operation as they like, but it is still different from an ordinary data access as it still occurs atomically.

Because of its strong ordering, the `Relaxed` memory ordering is rarely cheaper on x86-64 architectures, since all atomic operations are ordered there by default.

There is also the concept of _atomic fences_, which act as a form of _memory barrier_, forcing the CPU to synchronize memory across different threads.
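As a concrete illustration (our own minimal sketch, not taken from the sources), the two-thread example above can be made safe in Rust by publishing `x` with `Release` and reading it with `Acquire`, which rules out the anomalous `y = 2` outcome:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicU32 = AtomicU32::new(1);

fn main() {
    let t1 = thread::spawn(|| {
        Y.store(3, Ordering::Relaxed);
        // `Release` makes the write to Y above visible to any thread that
        // later reads X with `Acquire` and observes `true`.
        X.store(true, Ordering::Release);
    });
    let t2 = thread::spawn(|| {
        // `Acquire` pairs with the `Release` store: if we see `true`, we are
        // guaranteed to also see Y == 3, so only y = 3 or y = 6 can result.
        if X.load(Ordering::Acquire) {
            let y = Y.load(Ordering::Relaxed);
            Y.store(y * 2, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
}
```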
References:

* [Rust Nomicon about atomics](https://doc.rust-lang.org/nightly/nomicon/atomics.html)
* [Rust core ordering documentation](https://doc.rust-lang.org/nightly/core/sync/atomic/enum.Ordering.html)
* [Rust stdlib fence documentation](https://doc.rust-lang.org/std/sync/atomic/fn.fence.html)
* [Guide to memory orderings in C++11](https://preshing.com/20120913/acquire-and-release-semantics/)

### Rayon concurrency

Seems to work fine with our current setup. Needs benchmarking against other methods, and optimal chunk sizes for iterating over components need to be determined. We have not checked yet, but a combination of Rayon and Crossbeam might be good, if not hard to implement, since both use hardware threads.

### DAG schedule

Automatically generating a schedule with optimal parallelization seems to be quite difficult. At the moment the schedule is constructed in the order the user added the systems, with edges pointing towards the most recently added system. A slower "staged" scheduler also exists, which can probably achieve more efficient parallelization, but each stage needs to be terminated by a barrier. This staged scheduler does not ensure that each system is executed in user-added order, which needs review to determine whether it is adequate. User-added ordering may not be required, but this also needs review to ensure that the output is not messed up. An option to let the user add a custom ordering might be required if such systems exist.

### Thread pool and the Thundering herd problem

The thread pool currently wakes up all workers whenever a new task is added, which can lead to a thundering herd. To prevent this, only a single worker could be notified. This might have downsides and **todo:** needs further exploration.

References:

* [Thundering herd problem](https://en.wikipedia.org/wiki/Thundering_herd_problem)

### Dynamic vs static scheduling

One way of scheduling tasks is to generate a precedence graph at initialization time and then simply execute it in batches. The downside of this is that the batches cannot respond to changes in execution time, like one very slow task bottlenecking its batch. The benefit is that no computation is required during runtime to determine the batches, as they can be precomputed.

Another way is to dynamically dispatch tasks based on when they're ready to execute. That is, when all incoming dependencies are finished, a task is immediately dispatched. This has the benefit of being able to dynamically respond to some tasks being slower than others. The downside is that it requires more computation during runtime to determine which dependencies have been freed up upon completion of a system.

## Querying prototype

### Mutable/Immutable Access to Components

Part of the prototype was to create a design that allows for both mutable and immutable access to the component data in a way that is both performant and ergonomic. However, Rust's ownership and borrowing rules make this challenging, as it is difficult to ensure that multiple accesses to the same data are both safe and efficient. Many different approaches have been tried, but none worked out in the end. Some of the approaches used interior mutability with `Cell` or `RefCell`. Some made use of traits and/or enumerations, but these were in many cases less ergonomic than the basic access we already had, or did not work at all.

The conclusion drawn from this part of the prototype was that, with the way we want to access components, we will have two different structs: one for mutable queries and one for immutable queries.

### Implementation

* Using the Rust type system to provide a user-friendly API
* Based on this guide: https://blog.logrocket.com/rust-bevy-entity-component-system/
* Parameters are generalized using the trait `SystemParameter`
  * Each system parameter has a state, initialized by `init_state`
  * To fetch a parameter, the state and the index of an entity are used. The result is an `Option` that is `None` if the entity does not have the requested component type
  * This abstraction allows the number of system execution functions to grow linearly when the number of parameters increases, instead of exponentially. To handle up to 16 parameters, 16 functions are needed instead of 2^17 − 1 = 131 071.
* `Query` and `QueryMut`
  * Contain an immutable/mutable reference to the component
  * `init_state` borrows the component vector for the requested component type immutably/mutably from a `RefCell` in a `World` reference
  * `fetch_parameter` fetches the component using the state and the entity index
* Execute system
  * First, all states are initialized. States are mutable because that is needed for mutable access.
  * Thereafter, all system parameters are fetched, and if all return `Some(param)`, the system function is executed with the parameters.
  * This function is generated using macros for up to 16 parameters
* Filters
  * Use the same `SystemParameter` trait
  * `With`/`Without`
    * Store no data, only a `PhantomData` marking the component type for the compiler
    * `init_state` borrows the requested component vector (same as `Query`/`QueryMut`)
    * `fetch_parameter` returns `Some(Self)`/`None` depending on whether the entity has the requested component
  * Binary operations (`And`/`Or`/`Xor`)
    * Constructed with two `SystemParameter`s, marked using `PhantomData`
    * Store a left- and a right-state
    * When `fetch_parameter` is executed, `fetch_parameter` is called on both the left and the right `SystemParameter` using their respective states, only looking at whether the result is `Some(_)` or `None`
    * Generated using macros because all binary operations have the same implementation
  * Unary operation (`Not`)
    * Constructed with one `SystemParameter`, marked using `PhantomData`
    * `fetch_parameter` replaces `Some(_)` with `None`, and `None` with `Some(Self)`

A rough sketch of this shape is shown below.
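This is a rough sketch of the `SystemParameter` trait and the `Not` filter as described in the notes above; the real signatures involve lifetimes and borrow the component vectors from the world, so everything here is simplified and approximate:

```rust
use std::marker::PhantomData;

/// Stub world; the real one holds `RefCell`-wrapped component vectors.
struct World;

/// Simplified sketch of the `SystemParameter` abstraction.
trait SystemParameter: Sized {
    type State;

    /// Borrows whatever the parameter needs from the world (e.g. a component vector).
    fn init_state(world: &World) -> Self::State;

    /// Fetches the parameter for one entity; `None` filters the entity out,
    /// so the system is simply not run for that entity.
    fn fetch_parameter(state: &mut Self::State, entity_index: usize) -> Option<Self>;
}

/// The `Not` filter inverts another parameter's result.
struct Not<P>(PhantomData<P>);

impl<P: SystemParameter> SystemParameter for Not<P> {
    type State = P::State;

    fn init_state(world: &World) -> Self::State {
        P::init_state(world)
    }

    fn fetch_parameter(state: &mut Self::State, entity_index: usize) -> Option<Self> {
        match P::fetch_parameter(state, entity_index) {
            Some(_) => None,                // inner matched -> filter the entity out
            None => Some(Not(PhantomData)), // inner did not match -> let it through
        }
    }
}
```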