Records vs Structs for representing internal state in Membrane

# Records vs Structs for representing internal state in Membrane ###### tags: `membrane_core` `performance` `state` > see also GitHub issue [#316](https://https://github.com/membraneframework/membrane_core/issues/316) ## Motivation Through profiling applications built on top of `membrane_core` we came to conclusion that we spend a lot of time on retrieving and managing internal state of `Core.Element`s - see e.g. [Roberto's report](https://swmansion.slack.com/archives/C01UXP1GA9Z/p1626967790061000). That having been said, it seems it could be benefitial to find performance gains by optimizing this piece of the framework. ## The current state Currently, all the `Membrane.Core` components are internally represented by Elixir `struct`s, which are de facto `map`s with additional compile-time checks. It means that performance-wise they are completely equivalent to the latter. ## Maps (structs) In BEAM, `map`s with `MAP_SMALL_MAP_LIMIT=32` or less items are stored internally as a list of tuples with lookup time of `O(n)`*. That is the case of our `Core.Element`s internal state, where we store exactly 14 fields. ### Access Given the aforementioned upper limit, we can expect constant access time in practice - always at most 14 comparison ops. Still, its up to 14 times more in comparison with direct access. ### Modification In our case, modifying a map is just copying the existing one with the modified item updated. Therefore, we can expect 14 assignment operations plus additional overhead of allocating a new structure on the heap. #### Note on BEAM internals* As you may noticed, when creating a map containing up to 32 elements in Elxir, you get a map that is sorted by its keys. I was wondering if by any chance something like binary search is applied for lookup operations. Reading the C implementation for map BIFs, I've found out it seems like this is not the case - keys are compared in linear manner, what you can see for example in this [part of erts_maps_get()](https://github.com/erlang/otp/blob/d1f76d30832d70c3aaf113559fea1487d275010c/erts/emulator/beam/erl_map.c#L186). In the implementation linked above, maps of size greater than `MAP_SMALL_MAP_LIMIT` are reffered to as "*hashmaps*", when the smaller ones are called "*flatmaps*". You can find internal structure of the latter [here](https://github.com/erlang/otp/blob/d1f76d30832d70c3aaf113559fea1487d275010c/erts/emulator/beam/erl_map.h#L38). ## Records (tuples) `Record`s are just tuples with the first element being an atom. Elixir, as well as Erlang, add just some macros on top of it, so you can refer to its elements by keys rather than indexes. ### Access A `tuple` provides direct access to its elements, so in terms of retrieving an item it is as fast as you can get. ### Modification Similarly to `map`s consisting of 32 or less items, BEAM needs to copy the whole structure. I couldn't find much about implementation details, but I can point out that declaring fixed length and removing the need to keep a key along with a value lets `tuple`s implementation be at least a little bit more performant also in case of modification. Also, there's a possibility to optimize consequent updates of a `tuple`, so it gets copied only once*. #### Note on BEAM internals* If you update a tuple in consequent calls, and the indexes you are referring to are given in descending order, you can count on BEAM to copy the tuple only once. You can read more [here](http://erlang.org/doc/efficiency_guide/commoncaveats.html#setelement-3). It seems like Elixir Records take care of setting the right update order for you if given multiple keys - see [Record.ex](https://github.com/elixir-lang/elixir/blob/a3b12428f98f6e64573f518c3a2c8e38c4685a69/lib/elixir/lib/record.ex#L472). ## Alternative approach On the [`structs-to-records`](https://github.com/membraneframework/membrane_core/tree/structs-to-records) branch you can find my take on rewriting `Membrane.Core` to use Records instead of Structs for keeping the internal state of `Core.{Bin, Element, Pipeline}` components. Keep in mind that **this is only a draft**. The goal here was just to research the possibility to do such thing and get benchmarks with this alternative implementation up and running. ### The implementation After rewriting `Core.{Bin, Element, Pipeline}.State` to records, there was need for a module that could dynamically dispatch calls to `Record` macros depending on component's type (`parent`, `child` or `any`). This is the exact responsibility of [`Core.StateDispatcher`](https://github.com/membraneframework/membrane_core/blob/d7cbcc82c8b00bacb7d45a4e32fdacc01279bb5b/lib/membrane/core/state_dispatcher.ex#L1), which works as a gateway to retrieve or modify a component's state. The functionality is provided by a set of macros, so performance impact is minimalized. ## Profiling All the tests were run in Mix production environment. ### membrane_core_optimization_lab [`membrane_core_optimization_lab`](https://github.com/membraneframework/membrane_core_optimization_lab) was my playground for profiling the alternative implementation. Basically, it is just a pipeline with a source spawning buffers filled with random bytes, that are piped through 20 filters. Looking at the results of `fprof`, the number of calls, as well as time spent on retrieving an element from a map through i.e. `Map.get/3` but also `Access.get/2`, `Kernel.get_in/2` was reduced by about 30%-50%. Unfortunately, there are no corresponding calls to `:erlang.getelement` on the `fprof` report what is kinda weird, given that the function is explicitly inserted by [Record.ex](https://github.com/elixir-lang/elixir/blob/a3b12428f98f6e64573f518c3a2c8e38c4685a69/lib/elixir/lib/record.ex#L490). In this case, we can't really tell how big (if at all) the gains are. When it comes to updating the map, we can that time spent on executing "updating functions" is reduced at least by 30% (keep in mind that `update_in` is [a macro replaced by calls to `{Access, Map}.get_and_update/3`](https://github.com/elixir-lang/elixir/blob/9137fd1cb2368bb1bc3a9c590440dffc9f5fed62/lib/elixir/lib/kernel.ex#L2897)). This time the corresponding `:erlang.setelement/3` function is present on the list, but it's accumulated execution time is not very significant in comparison to potential savings. If you are interested in more details, I shared the `fprof` results of 10- and 30-second runs on Pastebin: [master 10s](https://pastebin.com/G9QwTGVu) / [structs-to-records 10s](https://pastebin.com/gWhbwsiL) [master 30s](https://pastebin.com/ZZurqXKR) / [structs-to-records 30s](https://pastebin.com/8Z5aXeyW) #### Update 08/11/2021 Additional profiling results containing `uS / CALLS` metric (`eprof_total`), call tree for `fprof` and cachegrind files can be found [here](https://swmansion.slack.com/files/U027FP71UP3/F02BF428WEL/profile.zip). Also, I have replaced `master` branch with`v0.7.0` tag in this comparison. ### membrane_demo/webrtc/videoroom Profiling was run with 5 testRTC bots in videoroom. After the bots had joined, I've gave the server a minute to warm up, then ran `eprof` and `fprof`. To convert `fprof` trace to cachegrind I used [`erlgrind`](https://github.com/isacssouza/erlgrind). The server was started in production environment, but in interactive mode to allow me to easily run profilers. `interactive mode` ```shell= MIX_ENV=prod iex -S mix phx.server ``` `eprof` ```elixir {:ok, p} = :eprof.start() :eprof.start_profiling(:erlang.processes() -- [p, self()]) Process.sleep(30_000) :eprof.stop_profiling() :eprof.log("eprof.log") :eprof.analyze(:total) :eprof.stop() ``` `fprof` ```elixir {:ok, p} = :fprof.start() :fprof.trace([:start, {:procs, :erlang.processes() -- [p, self()]}]) Process.sleep(15_000) :fprof.trace([:stop]) :fprof.profile() :fprof.stop() ``` `erlgrind` ```shell= ./erlgrind fprof.trace FILENAME.cgrind ``` [And here you can download the results.](https://swmansion.slack.com/files/U027FP71UP3/F02APHNB3L6/videoroom_profile.zip) To open \*.cgrind files you can use `qcachegrind` on Mac (`brew install qcachegrind`). ## Benchmark To benchmark `Core` in "real-life scenario" I used [the videoroom WebRTC demo](https://github.com/membraneframework/membrane_demo/tree/master/webrtc/videoroom) from Membrane Framework repository. CPU and memory usage of the process running videoroom server was measured every 1s by Python module [`psrecord`](https://github.com/astrofrog/psrecord). ### Server specs - 2,5GHz 32-Core AMD EPYC 7502P - 128GB of 2933MHz DDR4 RAM - Ubuntu 20.04.2 LTS ### Methodology I've used testRTC to spawn 10 bots streaming `Salsa 720p`\*. The bots have been joining in ~1s interval and streaming for 3 minutes. Then, they had about 30 seconds to leave the videoroom. I've been starting `psrecord` after the testRTC manager was ready to send bots and stopping it when the testRTC test was complete. `run the server` ```shell= MIX_ENV=prod mix phx.server ``` `run psrecord (in another tab)` ```shell= export PID=pid_of_server psrecord $PID --log ${PID}.txt --plot ${PID}.png \ --interval 1 --include-children ``` \* 1280x720p 9667Kbps video + 192Kbps audio ### Results Below you can find the plots of server's CPU and memory usage running `v0.7.0` tag and `structs-to-records` branch of Membrane Framework. #### Run 1 `v0.7.0` [(txt)](https://pastebin.com/jCm2Xr7h) ![v0.7.0](https://i.imgur.com/b12rewQ.png) `structs-to-records` [(txt)](https://pastebin.com/FJ41T3ZR) ![structs-to-records](https://i.imgur.com/vWmaSvl.png) #### Run 2 `v0.7.0` [(txt)](https://pastebin.com/iTWLH1uY) ![v0.7.0](https://i.imgur.com/cTgfy7J.png) `structs-to-records` [(txt)](https://pastebin.com/yVFZtr9C) ![structs-to-records](https://i.imgur.com/VyvaFBe.png) #### Run 3 `v0.7.0` [(txt)](https://pastebin.com/514BQvQe) ![v0.7.0](https://i.imgur.com/3X7TZkm.png) `structs-to-records` [(txt)](https://pastebin.com/BvA0vvRk) ![structs-to-records](https://i.imgur.com/dD4IGov.png) ## Conclusion While looking at the internals of Elixir/Erlang it may seem benefitial to replace `struct`s with `record`s, in our case **it didn't bring much in terms of performance**. Even if it would, there are another tradeoffs like less clean tracebacks or reduced code readability (you have to extract a field before using functions like `{get, update, put, delete}_in` on underlying maps). Therefore, I would recommend giving up on `record`s for holding internal state of elements in Membrane.