# Introduction
Worlds are explored as much with our ears as they are with our eyes. In games, audio is just as important to immersion as graphics, if not more so; it is therefore necessary that Bevy's audio engine is taken care of as well as (if not better than) its graphics engine.
For various reasons we will list below, the current audio engine of Bevy is lacking in features and performance, and an overhaul is needed to bring it up to speed with the rest of the engine.
This document will go into detail about why the audio engine needs to change, how we will change it, and the scope of the initial "Better Audio" working group, which will work to bring the new engine forward for inclusion as a first-party audio engine in Bevy.
## Why the current audio engine is insufficient
There are several reasons why we think the current audio engine lacks the capability to be extended easily and meet the needs of Bevy users:
1. **Insufficient performance**. `rodio`, the audio library Bevy currently uses to provide its audio engine, has notable technical limitations which prevent more advanced features from being implemented:
- `rodio` does not follow the rules of audio programming, which restricts the performance ceiling of the audio engine and can cause glitches when the audio engine and/or the whole host system is under load. Audio programming places strict restrictions on the kinds of algorithms and data synchronization methods that can be used, because any performance drop is noticeable as audio glitches by the end user.
2. **Lack of extensibility**. While effects can be applied to audio sources, they can only be applied to a single source at a time. Because no audio bussing features are available (and such a feature would be hard to implement given the architecture of `rodio`), each effect necessarily has to be duplicated per source, which is a deal-breaker for heavier effects (e.g. reverb effects are notoriously heavy on processing and memory, as each internal part of the effect is constantly evolving and reading data off of delay buffers).
3. **Incompatible type-safety**. Because `rodio` declares all of its components as separate Rust types, it is hard to write generic code that handles audio in the general sense (e.g. for an editor). Since each effect applied to a source results in a new, distinct type, there can potentially be as many concrete audio types in use in any given project as there are audio sources.
Bevy has reflection features which would help in those cases; however, reflection on foreign types is not available, which means we would need to duplicate all those types to allow them to be reflected by Bevy.
4. **Incompatible with animation features**. Unlike the graphics engine, where any value can be animated relatively easily by changing it once before rendering, the audio engine has to integrate tightly with the animation features, as it needs to be able to evaluate animation curves at multiple points within a single "frame". Therefore, curves should be accessible to the audio engine itself so that animators can set up audio animations.
5. **Not ECS-friendly**. From a purist point of view, entities and components should be the ground truth for all the systems in the application, including graphics and audio. Currently, nearly all of the audio-related component data is hidden from the ECS: instead of plain old data, the audio components are handles that systems use through setters to manipulate the audio engine.
Of all the reasons above, this is the least compelling, as we could easily compromise on purity here; it still limits how data can be shared between systems, however, and should be kept in mind.
## Initial plan of action: integrating `kira`
This working group was originally created to replace `rodio` with `kira` as the first-party audio backend for Bevy. However, while `kira` is a good library for game audio in individual projects, it shares some of the issues discussed above for `rodio`, namely the type-safety and lack-of-reflection issues, as well as the ECS-friendliness concerns.
Furthermore, `kira` complicates integration by having two "kinds" of parameters: one only used during initialization, and the other only available at runtime. This means control is spread across two places, and the dichotomy cannot easily be hidden or resolved on the integration side; it has to be exposed to the user in order to keep the flexibility.
## New audio engine
This new audio engine needs to be designed with advanced features in mind, even though this document, and the working group for this project, is only concerned with an initial implementation. This is because we need to focus on extensibility in order not to impede future developments of the engine's audio features. Features like audio busses, mixing and effect processing, as well as deeper sound spatialization, need to be implementable in the future. This is why, in this document, the "features" are distinct from the "scope" of the project: the former describes the long-term goals of the audio engine, while the latter is focused on the work to be done by this working group.
### Features
- **Audio sources**
- Static and streaming of audio files
- Gapless playback and looping
- Automatic prioritization of sound playback based on user-defined priorities, distance (when spatial), and maximum polyphony
- Third-party custom sources
- **Spatial audio**
- v1 with basic panning and attenuation, v2 with HRTF and atmospheric absorption model
- Multiple listeners
- **Audio mixer**
- Customizable track inputs (either from spatial audio listeners or directly from sounds configured to output to the track)
- Effect rack for serial audio processing per-track and globally (through the Master track)
- Third-party custom effects
### Scope
The goal for the initial version of the new audio engine is to have feature-parity with the existing one, that is:
- Custom audio sources
- Basic spatial audio
- Per-source volume control
- Global volume control
It should also be extensible, allowing both first-party and third-party implementations of audio effects.
# Implementation
:::info
In the following sections, "component" refers to both effects and sources, as the two are similar in implementation.
:::
In order to be user-friendly and intuitive to use, the audio engine will have to integrate naturally into the ECS framework of the rest of Bevy. On the other hand, the restrictions of real-time programming make providing a reliable and high-quality audio engine hard.
Bridging the gap between those two ends will require specific solutions which this document will go over.
Controlling an audio component may require access to any number of sources of data from the game; that is, an audio component cannot be treated as a standalone, isolated component.
In fact, each audio effect should be considered its own ECS system, running every frame to update state on the audio engine side from the game engine. This allows the ECS to retain the "ground truth" while not restricting the implementation of this state-syncing mechanism. Additionally, audio components should be stored in separate Bevy entities in order to simplify the lifecycle of Bevy components associated with audio components (meaning that despawning an entity can intuitively trigger resource cleanup on the audio engine side).
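As a rough illustration of that shape, here is a minimal sketch of an audio component and its per-frame sync system on the ECS side, using current Bevy scheduling APIs. The names (`LowPassFilter`, `AudioEngineHandle`, `set_parameter`) are hypothetical placeholders rather than a proposed API; the actual state-syncing mechanism is the subject of the sections below.

```rust
use bevy::prelude::*;

/// Hypothetical audio component: plain old data, living on its own entity.
#[derive(Component)]
struct LowPassFilter {
    cutoff_hz: f32,
}

/// Hypothetical handle to the communication layer (detailed later in this document).
#[derive(Resource, Default)]
struct AudioEngineHandle;

impl AudioEngineHandle {
    fn set_parameter(&mut self, entity: Entity, cutoff_hz: f32) {
        // Placeholder: a real implementation would push an update through
        // the real-time-safe communication layer discussed below.
        let _ = (entity, cutoff_hz);
    }
}

/// The effect's own system: runs every frame and pushes changed ECS state
/// over to the audio engine side.
fn sync_low_pass_filters(
    mut engine: ResMut<AudioEngineHandle>,
    query: Query<(Entity, &LowPassFilter), Changed<LowPassFilter>>,
) {
    for (entity, filter) in &query {
        engine.set_parameter(entity, filter.cutoff_hz);
    }
}

fn main() {
    App::new()
        .init_resource::<AudioEngineHandle>()
        .add_systems(Update, sync_low_pass_filters)
        .run();
}
```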
Strictly speaking, there isn't much of an official API that Bevy is required to provide in order to implement audio effect systems, as the communication layer effectively operates completely outside of the ECS itself. However, because the goal of Bevy Audio is to make implementing audio effects as ergonomic as possible, API design and ergonomics should still be part of this design document.
## Interlude: audio programming rules
In the rest of the technical discussion of this design document, there will be several references to "rules of audio programming", "real-time programming", or "real-time safety". While there are subtleties in the different definitions of each, they will be used interchangeably here.
Real-time audio processing requires processing a stream of audio fast enough that audio devices continually have enough data to drive their outputs. When the audio process feeding that data does not compute fast enough, the user hears audio dropouts, or microloops, which are symptomatic of the audio process taking too long to process the requested audio stream.
Real-time programming is a set of rules which, when followed, guarantees that code **will** terminate in a finite amount of time. The rules can be boiled down to "no unbounded loops and no system calls", which means no locking of mutexes (and no spinlocks), no allocations[^1], and always having bounds on loops (such as iterating through structures or an explicit range).
[^1]: This refers specifically to allocating through the system, as it is a system call. You can set up a custom allocator, and as long as *it* does not go through system calls, you can allocate within it.
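To make these rules concrete, here is a small, illustrative sketch of an audio callback written with them in mind. Nothing here is proposed API (the `GainProcessor` name and structure are hypothetical); it only contrasts allowed and disallowed operations inside the audio-thread hot path.

```rust
/// Illustrative processor state. Everything it needs is allocated up front,
/// on a non-real-time thread, before the audio callback ever runs.
struct GainProcessor {
    gain: f32,
    scratch: Vec<f32>, // pre-allocated once; never resized on the audio thread
}

impl GainProcessor {
    fn new(max_block_size: usize) -> Self {
        Self { gain: 0.5, scratch: vec![0.0; max_block_size] }
    }

    /// Called from the audio thread for every block of samples.
    fn process(&mut self, output: &mut [f32]) {
        // OK: bounded loop over slices, no locks, no allocations, no system calls.
        for (out, tmp) in output.iter_mut().zip(self.scratch.iter_mut()) {
            *tmp = *out * self.gain;
            *out = *tmp;
        }
        // NOT OK in this context (each would break real-time safety):
        // - self.scratch.push(0.0);    // may reallocate
        // - std::thread::sleep(...);   // blocks
        // - println!("...");           // locks stdout and may block
        // - some_mutex.lock();         // may block on another thread
    }
}

fn main() {
    let mut processor = GainProcessor::new(512);
    let mut block = vec![1.0_f32; 512]; // allocated on the non-real-time side
    processor.process(&mut block);
    assert_eq!(block[0], 0.5);
}
```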
A good article to read on the subject is [this one](http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing) by Ross Bencina, which details the why and the how in more detail.
## The communication layer
Communicating with an audio engine from the outside is tricky; not only does it involve cross-thread communication, it also needs to follow the rules of audio programming explained above. This means that most methods of cross-thread communication (e.g. mutexes, standard channels) are out.
The vast majority of solutions use a circular (or ring) buffer data structure. This simple structure allows for channel-like communication by having one producer writing to shared memory while keeping track of its write position, and one consumer reading from the shared memory while keeping track of its read position. The "circular" part comes from the fact that the positions of the reader and writer (or "heads", by analogy with a tape machine) wrap around the shared memory.
By implementing the read and write heads as atomic pointers (or atomic indices, e.g. `AtomicUsize` in Rust), both ends of the circular buffer can work in complete autonomy by following a few simple rules; most importantly, there is no mechanism by which one end has to wait for the other. Reading (resp. writing) a value either succeeds or fails immediately.
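Purely to make that mechanism concrete, here is a minimal, illustrative single-producer/single-consumer ring restricted to `f32` values, using monotonically increasing atomic indices as the read and write heads. It is a teaching sketch under simplifying assumptions (power-of-two capacity, exactly one pushing thread and one popping thread), not something we would ship; see the next paragraph.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::sync::Arc;

/// Minimal SPSC ring of `f32` samples, stored as raw bits in atomics.
/// Exactly one thread may push and exactly one may pop; real crates enforce
/// this by splitting the buffer into separate Producer/Consumer halves.
struct F32Ring {
    slots: Box<[AtomicU32]>,
    mask: usize,        // capacity - 1 (capacity must be a power of two)
    write: AtomicUsize, // write head: only ever advanced by the producer
    read: AtomicUsize,  // read head: only ever advanced by the consumer
}

impl F32Ring {
    fn new(capacity: usize) -> Arc<Self> {
        assert!(capacity.is_power_of_two());
        let slots = (0..capacity).map(|_| AtomicU32::new(0)).collect();
        Arc::new(Self {
            slots,
            mask: capacity - 1,
            write: AtomicUsize::new(0),
            read: AtomicUsize::new(0),
        })
    }

    /// Producer side: never waits; reports failure when the ring is full.
    fn push(&self, value: f32) -> bool {
        let w = self.write.load(Ordering::Relaxed);
        let r = self.read.load(Ordering::Acquire);
        if w.wrapping_sub(r) == self.slots.len() {
            return false; // full: the caller decides what to do (retry next frame, drop, ...)
        }
        self.slots[w & self.mask].store(value.to_bits(), Ordering::Relaxed);
        self.write.store(w.wrapping_add(1), Ordering::Release); // publish the slot
        true
    }

    /// Consumer side: never waits; reports failure when the ring is empty.
    fn pop(&self) -> Option<f32> {
        let r = self.read.load(Ordering::Relaxed);
        let w = self.write.load(Ordering::Acquire);
        if r == w {
            return None; // empty
        }
        let bits = self.slots[r & self.mask].load(Ordering::Relaxed);
        self.read.store(r.wrapping_add(1), Ordering::Release); // hand the slot back
        Some(f32::from_bits(bits))
    }
}

fn main() {
    let ring = F32Ring::new(8);
    assert!(ring.push(0.25));
    assert_eq!(ring.pop(), Some(0.25));
    assert_eq!(ring.pop(), None);
}
```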
There are a number of high-quality circular buffer crates available, and we shouldn't reinvent the wheel here.
We can use the circular buffer as the basis for a channel data structure by having the ECS side hold on to a "Sender" or "Producer" struct which writes to shared memory and manages its write head, and the audio side hold on to a "Receiver" struct which reads from shared memory and manages its read head. These work very similarly to Event Writers and Event Readers in Bevy, and as such their APIs should mirror them as closely as the difference in storage allows.
To change a parameter in the audio engine, the ECS system sends an "event" (represented by an enum) through the circular buffer; the audio component implementation periodically checks its end of the circular buffer and updates itself accordingly whenever new events have been sent. The meaning of each event is left to the enum itself and to the audio component implementation.
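A sketch of that pattern is shown below, using the [`rtrb`](https://crates.io/crates/rtrb) crate as one example of an off-the-shelf SPSC ring buffer (API as of recent versions). The event enum and the `LowPassDsp` type are hypothetical illustrations, not proposed types.

```rust
use rtrb::{Consumer, Producer, RingBuffer};

/// Hypothetical parameter events for a low-pass filter component.
enum LowPassEvent {
    SetCutoffHz(f32),
    SetResonance(f32),
    Bypass(bool),
}

/// Audio-thread side of the hypothetical component.
struct LowPassDsp {
    cutoff_hz: f32,
    resonance: f32,
    bypassed: bool,
    events: Consumer<LowPassEvent>,
}

impl LowPassDsp {
    /// Called at the start of every processing block: drain pending events
    /// without ever blocking, then carry on processing audio.
    fn apply_pending_events(&mut self) {
        while let Ok(event) = self.events.pop() {
            match event {
                LowPassEvent::SetCutoffHz(hz) => self.cutoff_hz = hz,
                LowPassEvent::SetResonance(q) => self.resonance = q,
                LowPassEvent::Bypass(on) => self.bypassed = on,
            }
        }
    }
}

/// ECS side: push an event; if the queue is full, the update is simply
/// dropped (or retried next frame) rather than blocking.
fn send_cutoff_update(sender: &mut Producer<LowPassEvent>, hz: f32) {
    let _ = sender.push(LowPassEvent::SetCutoffHz(hz));
}

fn main() {
    let (mut producer, consumer) = RingBuffer::new(64);
    let mut dsp = LowPassDsp {
        cutoff_hz: 20_000.0,
        resonance: 0.7,
        bypassed: false,
        events: consumer,
    };

    send_cutoff_update(&mut producer, 440.0);
    dsp.apply_pending_events();
    assert_eq!(dsp.cutoff_hz, 440.0);
}
```

Note that the event variants above deliberately carry only plain `Copy` data; the next paragraph covers what happens when that is not the case.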
Another problem with sending events is that arbitrary data cannot be sent. For example, if a `Vec<_>` is sent through a circular buffer and the data is not moved out when the event is read, the `Vec` will be dropped, causing a deallocation, which with the default allocator results in a system call to reclaim memory. **This breaks the audio programming rule of no system calls**.
### An API for communicating between audio components and Bevy
Therefore, we need a way to "collect" data after its use. For this use case, the [`basedrop`](https://crates.io/crates/basedrop) crate provides smart pointers which delay the `Drop` implementation of their inner data until it has been sent back from the audio thread. This allows audio components to move data from events into their internal state. However, wrapping all heap-allocated data this way is difficult to enforce through enums, meaning users may accidentally provoke (de-)allocations when communicating through events; this introduces hard-to-debug footguns, and is thus not ergonomic.
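As an illustration, here is a minimal sketch of how `basedrop`'s `Collector`, `Handle`, and `Owned` types could be used for this; the `AudioEvent` type and the surrounding structure are hypothetical.

```rust
use basedrop::{Collector, Handle, Owned};

/// Hypothetical event carrying heap-allocated data wrapped in `Owned`, so
/// dropping it on the audio thread defers the actual deallocation.
enum AudioEvent {
    ReplaceImpulseResponse(Owned<Vec<f32>>),
}

/// Audio-thread side: take ownership of the new data. The old buffer is an
/// `Owned` too, so dropping it here only queues it for later collection.
fn handle_event(current_ir: &mut Owned<Vec<f32>>, event: AudioEvent) {
    match event {
        AudioEvent::ReplaceImpulseResponse(new_ir) => *current_ir = new_ir,
    }
}

fn main() {
    // Game-thread side: the collector lives outside the audio thread.
    let mut collector = Collector::new();
    let handle: Handle = collector.handle();

    let mut current_ir = Owned::new(&handle, vec![0.0_f32; 4]);
    let event = AudioEvent::ReplaceImpulseResponse(Owned::new(&handle, vec![1.0_f32; 8]));

    // (In practice `event` would travel through the ring buffer first.)
    handle_event(&mut current_ir, event);

    // Back on the game thread, reclaim anything dropped on the audio thread.
    collector.collect();
}
```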
Another solution is using serialization to pass data into the audio engine. This works for enforcing allocation rules in the communication layer (by controlling the format of the serialized data, we can ensure no de-allocating types are transferred). However, this moves the problem to the audio component itself: it may unknowingly allocate in the process of deserializing the data back into a proper type.
A different solution could involve "parameter stores": a shared space is pre-allocated with all possible parameters (as enumerated by an enum), with their values serialized into a specific format, and both sides can read and write them using triple-buffered storage (allowing reading and writing to happen independently, at the cost of making parameters *eventually* consistent). This has the advantage of relaxing the rules on what can be stored, as well as on the serialization format (any self-describing format can be used here, e.g. `serde_json::Value`). This allows any type implementing `serde::Serialize + serde::Deserialize<'a>` to effectively be stored as a parameter, from simple float values to entire structures, as needed.
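A rough sketch of what such a store could look like is shown below, using the [`triple_buffer`](https://crates.io/crates/triple_buffer) crate for the triple-buffered slots and `serde_json::Value` as the self-describing format. The parameter enum, the store layout, and the exact crate calls (whose signatures have varied between crate versions) are all assumptions for illustration, not a proposal.

```rust
use serde::{Deserialize, Serialize};
use triple_buffer::TripleBuffer;

/// Hypothetical parameters of some effect, enumerated up front so that every
/// slot can be pre-allocated before the audio thread starts.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Param {
    Cutoff = 0,
    Envelope = 1,
}
const PARAM_COUNT: usize = 2;

/// Hypothetical structured parameter, stored as a self-describing value.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Envelope {
    attack: f32,
    release: f32,
}

fn main() {
    // One triple-buffered slot per parameter, created on the game thread.
    // Note: the exact `TripleBuffer` constructor signature differs between
    // versions of the crate; treat this as pseudocode-level.
    let mut writers = Vec::new();
    let mut readers = Vec::new();
    for _ in 0..PARAM_COUNT {
        let (input, output) = TripleBuffer::new(&serde_json::Value::Null).split();
        writers.push(input);
        readers.push(output);
    }

    // Game-thread side: serialize and publish; never waits on the reader.
    let envelope = Envelope { attack: 0.01, release: 0.3 };
    writers[Param::Envelope as usize].write(serde_json::to_value(&envelope).unwrap());

    // Audio-thread side: read the latest value whenever convenient.
    // (Deserializing here can still allocate, which is the caveat discussed below.)
    let latest = readers[Param::Envelope as usize].read().clone();
    let decoded: Envelope = serde_json::from_value(latest).unwrap();
    assert_eq!(decoded, envelope);
}
```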
The downside to this approach is that serializing/deserializing can be expensive, and deserializing on the audio thread could still incur system calls (e.g. allocations from a `Vec<_>`). Additionally, features like change detection need extra data to be stored per parameter and checked explicitly, rather than falling out naturally from the event-based nature of the previous solutions. It also remains the case that dropping deserialized structures can be a source of deallocations.
**TODO**: choose method to implement as communication layer.
## Bevy systems
**TODO**: Internal systems resulting from the choice of communication layer above
## Audio engine
The audio engine is the part that entirely resides within the audio thread, controlled by the OS audio APIs.
**TODO**