Making OCaml 5 succeed for the users

*Author: KC Sivaramakrishnan, CTO, Tarides* OCaml recently won the [ACM SIGPLAN PL Software award](https://tarides.com/blog/2023-06-20-ocaml-receives-the-acm-programming-languages-software-award/). The award recognizes a software system that has had significant impact on programming language implementation, research and tools. It is especially notable that [4 out of the 14](https://twitter.com/kc_srk/status/1670849062684467202/photo/3) named OCaml compiler developers are affiliated with Tarides: Anil, David, Jérôme and me. In this post, I discuss the wider effort afoot at Tarides in order to make OCaml 5, the latest release of the OCaml programming language, succeed for the users. I should note that I shall specifically focus on the new OCaml 5 features and omit important developments such as Tarides' work on the OCaml platform, which is discussed [elsewhere](https://discuss.ocaml.org/t/a-roadmap-for-the-ocaml-platform-seeking-your-feedback/12238). I started hacking on OCaml when I joined Anil, Stephen and Leo (who are also named in the SIGPLAN PL software award) at OCaml Labs in the University of Cambridge in 2014 to work on the Multicore OCaml project. The aim of the Multicore OCaml project was to add native support for concurrency and parallelism to the OCaml programming language. Multicore OCaml compiler was maintained as a fork of the OCaml compiler for many years before it merged with the mainline OCaml compiler in [January 2022](https://github.com/ocaml/ocaml/pull/10831). After almost an year of work stabilising the features, OCaml 5.0 was finally released in [December 2022](https://discuss.ocaml.org/t/ocaml-5-0-0-is-out/10974), nearly 8 years after the first commit. Has the Multicore OCaml project succeeded with the release of OCaml 5.0? The short answer is ***No***. There is a long road to making OCaml 5 succeed for the users. The goal of making OCaml 5 succeed for the users is a two step process: 1. Help users transition existing programs to OCaml 5 2. Help users take advantage of new concurrency and parallelism features in OCaml 5 ## Transitioning users to OCaml 5 Even with the arrival of OCaml 5, most OCaml programs will remain sequential forever. It is important that the current OCaml users can successfully transition over to OCaml 5, even if they don't plan to use the new features. We have carefully designed OCaml 5 such that the breaking changes are minimized. In particular, we eschewed a potentially more scalable GC design since it broken the C FFI compatibility (see section 7 "Discussion" in the [ICFP 2020 paper](https://kcsrk.info/papers/retro-parallel_icfp_20.pdf) on the new GC design). The only breaking changes in OCaml 5 were the removal of the support for naked pointers and unrelated removal of deprecated functions from the standard library. We released a [dynamic detector for naked pointers](https://discuss.ocaml.org/t/ann-a-dynamic-checker-for-detecting-naked-pointers/5805) to help users find and remove naked pointers from their codebase. ### Restoring unimplemented features OCaml 5.0 was an experimental release with many features unimplemented. In particular, OCaml 5.0 only supported the [x86](https://github.com/ocaml/ocaml/pull/10831) and [ARM](https://github.com/ocaml/ocaml/pull/10972) backends. With impressive efforts from the community, the OCaml maintainers have **restored support for All Tier 1 platforms in 5.1** including [RISC-V](https://github.com/ocaml/ocaml/pull/11418), [s390x](https://github.com/ocaml/ocaml/pull/11712) and [Power](https://github.com/ocaml/ocaml/pull/12276). Tarides helped implement or review the support for all of these backends. Tarides engineers also restored other important features such as [GC mark loop pre-fetching](https://github.com/ocaml/ocaml/pull/11827) and [frame-pointer support for x86 backend](https://github.com/ocaml/ocaml/pull/11144). All of these features will be in OCaml 5.1. We have also been working on restoring other big ticket items such as compaction and statmemprof which were not implemented for OCaml 5.0. In OCaml, compaction is the only time when the runtime releases memory it allocated for the heap back to the operating system. Many long running programs have an initialisation phase where they use a lot of memory followed by a steady state phase where they operate for a long time with less memory. It is a common practice to call `Gc.compact()` after the initialisation phase so that the steady state memory usage of the program remains low. Without compaction, the steady state will also use as much memory as the peak memory usage. This problem was [reported](https://discuss.ocaml.org/t/ocaml-5-gc-releasing-memory-back-to-the-os/11293) by the [Infer](https://fbinfer.com/) team at Meta (who were otherwise able to switch to OCaml 5 easily thanks to our focus on backwards compatibility). Tarides engineers have [opened a PR](https://github.com/ocaml/ocaml/pull/12193) for restoring compaction. The compaction feature is slated to be restored in OCaml 5.2. We have also been working on restoring statmemprof, the statistical memory profiler for OCaml. We are hoping to have a PR ready for this in the coming weeks. ### Fixing performance regressions OCaml 5 is a major rewrite of the runtime system and comes with a completely new allocator and a garbage collector (GC). As a result, some large OCaml projects such as [Frama-C](https://github.com/ocaml/ocaml/issues/11662), [Pyre](https://github.com/ocaml/ocaml/issues/11913), [EasyCrypt](https://github.com/EasyCrypt/easycrypt/issues/390) and [Infer](https://discuss.ocaml.org/t/ocaml-5-gc-releasing-memory-back-to-the-os/11293/16) have reported performance regressions. We have been steadily fixing these issues and have not encountered any serious challenges here. Many of the fixes have been incorporated into 5.1, and we expect more performance fixes to land in 5.2. The very fact that large open-source projects can build and test their code on OCaml 5 is itself a testament to our careful backwards compatible implementation of OCaml 5. #### Allocator performance A potential source of performance regressions is the allocator. OCaml 5 uses a new parallelism-aware allocator written from scratch and is different from the well-performing best-fit allocator available in OCaml since 4.10. Major industrial users of OCaml have [reported](https://blog.janestreet.com/memory-allocator-showdown/) that best-fit performs better than the earlier first-fit and next-fit allocators. In our benchmarking efforts, we observed that OCaml 5 allocator performs as well as the best-fit allocator as both allocators utilise size-segmented pages for the allocator. But [our benchmarks](https://github.com/ocaml-bench/sandmark/tree/main/benchmarks) are admittedly much smaller than industrial OCaml workloads. In order to derisk the transition to OCaml 5, we have [backported](https://github.com/sadiqj/ocaml/tree/backport_alloc) the OCaml 5 allocator to OCaml 4 compiler. The backported allocator helps industrial users run their workloads in OCaml 4 with only the allocator changed, which helps identify any regressions. <span style="color:red"> We are working with Jane Street to test the backported allocator on their internal workloads. </span> We hope to identify regressions which only show up at scale and fix them for everyone using OCaml 5. ### Continuously benchmarking compiler quality One of the goals with OCaml 5 is that, for sequential programs, the performance of those programs running on OCaml 5 is no worse than running on OCaml 4. Not only can the users compile and run their existing sequential code in OCaml 5, but the expectation is that the performance is also similar. To this end, we have been doing nightly benchmarking of compiled code using [Sandmark](https://sandmark.tarides.com/), a benchmarking service consisting of real-world open-source OCaml programs. Sandmark monitors multitude of performance parameters related to the running time, memory usage and GC latency. The benchmarks and the related repository of OCaml packages are constructed in such a way that it can build with both OCaml 4 and OCaml 5. This lets the compiler developers quickly identify any regressions that may be introduced in OCaml 5 with respect to the *same code* compiled under OCaml 4. Tarides is working to turn this into a Github bot that will make it easier for compiler developers to trigger benchmarking runs on development branches. ### Better observability Another strong reason to move to OCaml 5 from OCaml 4, even if you plan to remain sequential, is the better observability tools that come with OCaml 5. Starting from OCaml 5, the compiler supports [*runtime events*](https://v2.ocaml.org/manual/runtime-tracing.html) feature which brings deep introspection capabilities for OCaml programs running in production. Runtime events adds a series of probes to the OCaml program that emits data at specific events. This lets the *consumers* of these events produce interesting insights into the running programs. For example, [olly](https://github.com/tarides/runtime_events_tools/tree/main#olly) is a consumer that reports GC statistics including latency distribution. Olly can also produce traces of OCaml program runs visualising the GC behaviours. An important aspect of runtime events is that the cost of the probes in the fast path (when the probes are not emitting data) is so low that it is available for every OCaml 5 program. In particular, you do not need to recompile your programs with special options to enable event collection. Hence, every OCaml 5 program can be introspected at runtime for interesting events using olly. By default, the only probes available are to do with GC events. OCaml 5.1 also brings in support for [*Custom events*](https://github.com/ocaml/ocaml/pull/11474), where the user can describe new probes. It unlocks exciting possibilities for application specific introspection. For example, [meio](https://github.com/tarides/meio) is command-line tool that lets the user monitor the status of their application built using [Eio](https://github.com/ocaml-multicore/eio), a new concurrency library built using OCaml 5 features, at a per fiber (lightweight task) granularity. ## Taking advantage of OCaml 5 features We anticipate two kinds of users trying to take advantage of OCaml 5 years. 1. Users who want to use the new features in their existing code. 2. Users who want to write new code using the new features. While there is an increase of positive noise around OCaml recently, which may bring in new users, given the millions of line of existing OCaml code, our aim is to tackle (1) first. We hope that the experience with helping (1) succeed will inform what we should focus on for (2). ### Primitive features It is important at this point to note that OCaml 5 brings in distinct features for native concurrency and parallelism support in OCaml. For concurrency, OCaml 5 adds *effect handlers* and for parallelism it adds *domains* to the language. These features are spartan by design and our aim to build expressive libraries on top of these featurs which will live outside the compiler distribution. The OCaml manual pages on [effect handlers](https://v2.ocaml.org/manual/effects.html) and [parallelism](https://v2.ocaml.org/manual/parallelism.html) gives a good overview of these primitive features. I also discuss the approach we've taken in retrofitting concurrency to OCaml in the [ICFP 2022 Keynote](https://icfp22.sigplan.org/details/icfp-2022-papers/48/Retrofitting-Concurrency-Lessons-from-the-Engine-Room). ### Concurrency libraries #### Eio -- IO library For asynchronous, non-blocking IO, OCaml 4 has two industrial strength libraries such as Lwt and Async. These libraries simulate concurrency using a monad. These libraries are both very successful and OCaml code that uses asynchronous IO uses one of these libraries. These libraries do have some downsides in that, due to the use of a monad, they don't produce useful backtraces and OCaml's built-in exceptions cannot be used. Separation of synchronous and asynchronous code (function colours) and the lack of easy-to-use, higher-kinded polymorphism in OCaml means that one ends up with two versions of useful functions, one for monadic code and another for non-monadic code. This leads to code duplication such as the need to have a separate [Lwt's list module](https://ocsigen.org/lwt/latest/api/Lwt_list). These libraries can continue to be used in OCaml 5, but given that these libraries are not paralleism-safe, one cannot write parallel code that take advantage of them out of the box. [Eio](https://github.com/ocaml-multicore/eio) is a new direct-style IO library built using effect handlers. It avoids function colouring by using native stacks provided by effect handlers unlike Lwt and Async which simulate it using a monad. Thanks to this eio produces faster code, supports built-in exceptions, produces good backtraces and avoids code duplication. Eio also is built to be parallelism-safe. Eio provides a generic cross-platform API that can utilise optimised backends on different platforms such as [io_uring on Linux](https://en.wikipedia.org/wiki/Io_uring). One particular aspect that I would like to highlight is that Eio provides bridges for Async and Lwt so that existing code can be *incrementally* translated to Eio. This aspect is crucial for adoption as we believe that it is impractical to translate a large Lwt or Async codebase over to Eio in one go. Tarides is currently working towards the goal of [Eio 1.0](https://github.com/ocaml-multicore/eio/issues/388), which we expect to be released by Q3 2023. If you are interested in using Eio, Tarides engineers are running [a hands-on tutorial](https://icfp23.sigplan.org/details/icfp-2023-tutorials/4/Porting-Lwt-applications-to-OCaml-5-and-Eio) on porting Lwt applications over to Eio at ICFP 2023. #### Saturn -- Parallel data structures An essential component in the parallel programming toolkit is a library of parallel data structures. A sequential stack or queue data structure is fairly uncontroversial and it is common to have only a single stack or queue implementation in the language. Indeed, we have a single [stack](https://v2.ocaml.org/api/Stack.html) and a single [queue](https://v2.ocaml.org/api/Queue.html) data structure in the OCaml standard library. The addition of parallelism brings in an explosion of possibilities and challenges: * **Correctness** -- the addition of concurrency makes it much harder to reason about the correctness of the data structures. * **Specialisation** -- the performance of the data structure varies widely based on the number of parallel threads accessing the data structure. Hence, it is common to have specialised data structure that are optimised for a limited number of threads and capacity such as single or multiple producer and consumer to bounded or unbounded queues. * **Progress** -- should a pop operation on an empty queue block the caller or should it return immediately with a `None`? Both options are useful in different circumstances, but supporting one or the other will mean a very different tradeoffs and hence, different implementations. Moreover, the non-blocking options are [further classified](https://en.wikipedia.org/wiki/Non-blocking_algorithm) in literature based on the progress in the presence of concurrent operations. * **Composability** -- In a typical parallel data structure each of the individual operations such as a push or a pop are atomic. What if our application demands that multiple operations be performed atomically? Putting a lock around the entire thing does not often work since it affects performance non-trivially and introduces correctness issues such as deadlocks. There are other mechanism for well-behaved composition such as software transactional memory. In other languages, this explosion in the state space often leads to a multitude of concurrency libraries, with overlapping features and different trade-offs, often not clearly labeled. Users often face a challenge choosing the right library with the right trade-off. The correctness of the implementations are also often unclear. At Tarides, we have been working towards [Saturn](https://github.com/ocaml-multicore/lockfree/pull/67), a library called that brings together all of our efforts at building parallelism-safe libraries. Saturn will consist of lock-free and lock-based, blocking and non-blocking, composable and non-composable parallel data structures under one roof. Each of the different data structures will have a default version that is good enough to be used for parallelism and will have well-documented variants with clearly labeled tradeoffs. Our composable atomic data structures are built over the [kcas](https://github.com/ocaml-multicore/kcas) library which provides a software transactional memory (STM) on top of lock-free multi-word compare-and-swap (MCAS) primitive. While the kcas library implements MCAS in software efficiently, with the arrival of [Power backed in OCaml 5](https://github.com/ocaml/ocaml/pull/12276), we plan to explore the promise to utilise [hardware transactions](https://research.ibm.com/publications/transactional-memory-support-in-the-ibm-power8-processor) for MCAS. To ensure correctness, Saturn data structures are model-checked using [dscheck](https://github.com/ocaml-multicore/dscheck), an experimental model checker for OCaml that cleverly exploits effect handlers to mock and control parallel scheduling. We also plan to [continuously benchmark](https://autumn.ocamllabs.io/) the data structure to monitor any performance regressions. We expect Saturn to be released in Q3 2023. #### Domain-local await With OCaml 5, there are several notions of concurrency: * Domains -- OS threads potentially running in parallel on different cores * Systhreads -- OS threads on a given domain that time share a domain * Fibers -- Lightweight language-level threads implemented by the concurrency library. Each concurrency library may have its own scheduler. This makes the task of writing blocking data structures such as blocking channels challenging as the blocking mechanism is specific to each notion of concurrency. Ideally, we would like to write blocking data structures that are parametric over the blocking mechanism so that we can describe blocking channels once and for all for all of the different notions of concurrency. To this end, Tarides has been developing [domain-local await](https://github.com/ocaml-multicore/domain-local-await) (DLA), a scheduler independent mechanism for blocking. The goal is that concurrency libraries provide the implementation of the DLA interface, and with this, they can use blocking data structures from Saturn. For example, with the [implementation of DLA interface in Eio](https://github.com/ocaml-multicore/eio/pull/494), it is able to utilise [blocking transactions in kcas](https://github.com/ocaml-multicore/kcas/pull/32). By separating out the blocking mechanism from the blocking data structures, different concurrency libraries such as eio and [domainslib](https://github.com/ocaml-multicore/domainslib) may communicate easily. At Tarides, we are exploring other scheduler independent mechanisms for [timeout](https://github.com/ocaml-multicore/domain-local-timeout) and [io](https://github.com/polytypic/io). ### Multicore testing tools The task of moving a large OCaml codebase to take advantage of new OCaml 5 features may seem daunting. It is likely that none of the existing code was written with concurrency and parallelism in mind. Tarides has been working to empower users with multicore testing tools to ease the process of using the new OCaml 5 features. #### Thread Santizer When parallelism is introduced in a code base there is the risk of introducing *data races*. A data race is said to occur when two accesses to a memory location with at least one of them being a write and there is not synchronization between the accesses. For example, the following program: ```ocaml= let r = ref 0 let d = Domain.spawn (fun _ -> r := 1) let v = !r ``` has a data race since the main domain and the newly spawn domain `d` race to access the reference `r` and there is no synchronization between the accesses. OCaml being a pragmatic language encourages the use of mutable state with primitive operations such as reference cells, mutable record fields, arrays and standard library data structures such as hash tables, stacks and queues with in-place modification. Thus, it is likely that the addition of parallelism to an OCaml code base will introduce data races. In C++, the behaviour of program with data races is undefined. In OCaml, the situation is much better. OCaml program with data races have [well-defined semantics](https://v2.ocaml.org/manual/memorymodel.html). In particular, a program with data races will not violate type safety and will not crash. That said, the programs with data races may produce behaviours that cannot be explained only by the interleaving of operations from different threads. Hence, it is important that data races are detected and removed from the code base. To this end, Tarides has developed Thread Santizer (TSan) support for OCaml. TSan is an approach [developed by Google](https://dl.acm.org/doi/10.1145/1791194.1791203) to locate data races originally for C++ code bases. It works by instrumenting executables to keep a history of previous memory accesses (at a certain performance cost), in order to detect data races, even when they have no visible effect on the execution. TSan instrumentation has been implemented in various compilers (gcc, clang, as well as the Go and Swift compilers), and has proved very effective in detecting hundreds of concurrency bugs in large projects. Executables instrumented with TSan report data races *without false positives*. However, data races in code paths that are not visited will not be detected. Tarides engineers have used TSan successfully to port large non-trivial code bases such as [Irmin](https://github.com/mirage/irmin) to multicore. The response from the developers using TSan has been overwhelmingly positive. A particularly attractive feature of TSan in OCaml is the ease of use. All the developer needs to do is to install a different compiler switch with TSan enabled, and without any additional work, TSan reports data races with accurate backtraces for the conflicting accesses. A [PR](https://github.com/ocaml/ocaml/pull/12114) for adding TSan support for OCaml is currently open. TSan support for OCaml is likely to appear in OCaml 5.2. #### Property-based testing Data races are just one of the hazards with parallel programming. Even without data races the program may produce different results across several runs due to non-determinism. How can the users gain more confidence about the correctness of their implementations? To this end, we have been developing two property-based testing libraries namely [Lin and STM](https://github.com/ocaml-multicore/multicoretests). In property-based testing, the programmer provides a specification about the program that should remain true and the system tests that the properties hold under large number of different executions, typically randomly generated inputs. In the case of Lin and STM, the program is tested under different interleavings of domains. Lin tests whether the results obtained under parallel execution corresponds to the same operations applied one after the other in a sequential execution. STM take a pure model description and compares the results to the actual results seen in a parallel execution. Both libraries have been [extremely effective](https://github.com/ocaml-multicore/multicoretests#issues) in identifying issues in the standard library under parallel execution. The OCaml standard library was implemented without parallel execution in mind. While much of the standard library is not parallelism-safe, we do not expect parallel access to standard library to crash. Lin and STM have been particularly successful in identifying crashes. We believe that Lin and STM will help OCaml 5 users gain more confidence that their code is correct under parallel execution. ## Call for action If you are an OCaml user with an existing code base, please try OCaml 5 today. If you find regressions, please file a issue on the [OCaml github repo](https://github.com/ocaml/ocaml/issues/). If you are considering utilising the new OCaml 5 features, please give the concurrency libraries and the tools a go. We would love to hear whether the libraries and tools work for you. File issues in corresponding repos if you find anything that is amiss. If you are looking for commercial support on any of these topics, do not hesitate to [contact us](https://tarides.com/company/). All of the work discussed in this post are open-source. If you wish to contribute to these efforts, please look for "good first issue" tag in any of these repos. If you are looking to learn, please head over to the [community section](https://ocaml.org/community) to ask us questions and share and discuss OCaml-related topics. Happy hacking!