# A rust implementation of bzip2

## Abstract: Can you explain the whole project and its expected outcome(s).

The `bzip2` compression format is not the most popular nowadays, but it is used in many legacy settings. Consequently, it is part of the supply chain of many projects. The `.bz2` format is still occasionally used to distribute files and as a compression method for HTTP traffic, and bzip2 is a possible compression method for `.zip`, `.deb` and `.rpm` files.

There are two components to bzip2: the `libbz2` dynamic library, and the `bzip2` command line application. We propose the following deliverables:

- a rust crate for bzip2 encoding and decoding
- a `libbz2-rs` crate that defines an interface identical to `libbz2`, and produces byte-for-byte identical output for identical input
- a safe rust `bzip2` binary

The rust `libbz2` library and `bzip2` binary are explicitly meant to be drop-in replacements for their C equivalents. We guarantee that in the non-error case, their output will be byte-for-byte identical to that of their C counterparts.

We plan to use the [`c2rust`](https://c2rust.com/) tool for this project. This tool automatically converts C to semantically equivalent rust. We believe that direct translation from C to rust is effective for certain kinds of libraries with non-trivial logic and an architecture that is close to what would be idiomatic in rust. Example domains are compression and media codecs.

A major advantage of this approach is a working rust implementation on day one: an implementation that can be tested and fuzzed. However, the output of `c2rust` is far from idiomatic rust, and the majority of the work is translating the `c2rust` output to safe, idiomatic rust.

It has been difficult in the past to estimate how long this cleanup process takes. Because this project is limited in size, it is a good benchmark, and the data it provides will be used to motivate future projects.
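To make the cleanup step concrete, here is a minimal sketch of the kind of rewrite involved, using the bzip2 block checksum (CRC-32 with polynomial `0x04c11db7`, processed most-significant-bit first) as the example. The first function only imitates the raw-pointer style that mechanical translation tends to produce; it is not actual `c2rust` output. The second is a safe equivalent. All function names here are hypothetical and chosen for illustration.

```rust
/// Builds the 256-entry lookup table for the bzip2 CRC
/// (polynomial 0x04c11db7, most significant bit first).
const fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut i = 0;
    while i < 256 {
        let mut crc = (i as u32) << 24;
        let mut bit = 0;
        while bit < 8 {
            crc = if crc & 0x8000_0000 != 0 {
                (crc << 1) ^ 0x04c1_1db7
            } else {
                crc << 1
            };
            bit += 1;
        }
        table[i] = crc;
        i += 1;
    }
    table
}

static CRC_TABLE: [u32; 256] = make_table();

/// Imitation of mechanically translated code (hypothetical, not real
/// c2rust output): a raw pointer plus a length, so the caller must
/// guarantee that `buf` points to `len` readable bytes.
pub unsafe fn crc32_translated(mut buf: *const u8, mut len: usize) -> u32 {
    let mut crc: u32 = 0xffff_ffff;
    while len > 0 {
        unsafe {
            crc = (crc << 8) ^ CRC_TABLE[((crc >> 24) ^ *buf as u32) as usize];
            buf = buf.offset(1);
        }
        len -= 1;
    }
    !crc
}

/// Safe rewrite: the slice carries its own length, so no `unsafe`
/// is needed and out-of-bounds reads are impossible.
pub fn crc32_safe(data: &[u8]) -> u32 {
    !data.iter().fold(0xffff_ffff_u32, |crc, &byte| {
        (crc << 8) ^ CRC_TABLE[((crc >> 24) ^ u32::from(byte)) as usize]
    })
}
```

Because both versions must return the same value for every input, the translated version can serve as a test oracle for the safe one during the cleanup, for example in a fuzz target that feeds the same bytes to both.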
## Have you been involved with projects or organisations relevant to this project before? And if so, can you tell us a bit about your contributions?

We are currently working on an implementation of zlib in rust: https://github.com/memorysafety/zlib-rs. There is overlap on the technical level (e.g. huffman codes), and the two projects are also similar in terms of code architecture.

We have good relations with [Immunant, inc.](https://immunant.com/), the maintainers of `c2rust`. We've also contributed to their [rav1d](https://github.com/memorysafety/rav1d) project, a port of the `dav1d` AV1 decoder that used `c2rust` for the initial translation work.

## Explain what the requested budget will be used for?

TODO

### Does the project have other funding sources, both past and present?

No. This is an entirely new project, and it has no other funding sources.

## Compare your own project with existing or historical efforts.

We are the first rust project to explicitly aim for being a drop-in replacement of the C implementation. There is however some prior art in the rust ecosystem.

### Official rust version

It looks like a start has been made on a rust version of bzip2, but this effort seems to have fizzled out. The https://gitlab.com/bzip2/bzip2/-/tree/rustify?ref_type=heads branch has not seen activity for over 4 years and only implements the crc32 algorithm in rust, none of the bzip2-specific logic.

### C bindings

The `libbz2` C dynamic library can be used from rust with the https://crates.io/crates/bzip2 crate.

This library has over 19 million downloads. It compiles the C code from source and links it into the rust binary. This process requires unsafe code, and cannot (reliably) be used for cross-compilation.

### Pure rust decoder

There is a bzip2 decoder implementation: https://github.com/paolobarbolini/bzip2-rs

It looks reasonably complete. However, it cannot be used without a default allocator or without the standard library. And of course it does not support encoding.
The last change to the repository was made more than a year ago.

### Pure rust encoder

There is a bzip2 encoder implementation, but it is not production quality: https://github.com/jgbyrne/banzai/

Its author is cautious about making claims regarding the quality of the implementation:

> banzai is a bzip2 encoder with linear-time complexity, written entirely in safe Rust. It is currently alpha software, which means that it is not battle-hardened and is not guaranteed to perform well and not eat your data. That's not to say, however, that I don't care about performance or reliability - bug reports are warmly appreciated! In the long term I would like to get this library to a state where it can be relied upon in production software.

The last changes to the repository are 2 years old.

## What are significant technical challenges you expect to solve during the project, if any?

Overall this project has low technical risk.

Distributing a C dynamic library based on a rust project is not very common. In particular, we need to verify that our library is ABI-compatible with the C version.

We will demonstrate that `c2rust` is an effective way of building a solid rust implementation of existing C libraries.

## Describe the ecosystem of the project, and how you will engage with relevant actors and promote the outcomes?

Given our connections in the rust ecosystem, we believe we have a decent shot at making our pure rust implementation the default choice when using bzip2 in rust.

Through our work on ntpd-rs, we have both knowledge about and connections within the debian and fedora package ecosystems. We are confident that we can get our implementation picked up there.

The inactive `rustify` branch in the bzip2 gitlab repository shows that the bzip2 maintainers are at least willing to consider a rust implementation. We'd be open to eventually upstreaming our implementation, but believe that we should first show independently that our implementation is solid.
## Work items

**initial refactoring (2 days)**

- run `c2rust`
- run the full bzip2 test suite against our rust implementation
    - set this up on CI
    - enable codecov
- C ABI testing (using libabigail)
- restructure modules
    - remove the use of `extern "C"` for rust imports
    - three crates: `bzip2` binary, `libbz2-rs` C API, and the actual logic

**initial testing (2 days)**

- convert the test suite into rust (it's a collection of python scripts today)
- set up fuzzing (on CI)

**the `libbz2-rs` dynamic library (15 days)**

- remove `extern "C"` libc functions and replace with rust `core`, `std` or `libc` libraries
- make the internal logic of the rust library (encoding and decoding) safe
- improve the test suite, in particular with more fine-grained tests
- implement allocator logic
    - no default allocator means zero dependencies
    - default to the rust allocator when used as a rust crate (requires `alloc`)
    - otherwise `libc` is required for (aligned) malloc and free

**the `bzip2` binary (5 days)**

- make the logic of the binary safe

**distribution (7 days)**

- documentation for the project as a whole (e.g. README.md)
- documentation for the rust crates
- publish rust crates on crates.io
- build and distribute the `libbz2-rs` library and `bzip2` binary

**audit (3 days)**

- determine scope
- assist
- process findings

**project management (???)**

in kind?

**blog posts (1 day)**

- general announcement + project results (in-kind)
- c2rust experience, ABI checking
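As an illustration of the allocator work item above, the following is a rough sketch of how a rust-allocator fallback could be exposed through C-style callbacks. The callback shapes follow the `bzalloc` and `bzfree` fields of libbz2's `bz_stream`; everything else (the names `rust_bzalloc` and `rust_bzfree`, the 16-byte header, the use of `std::alloc`) is an assumption made for this example and not a description of the planned implementation.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::ffi::c_void;
use std::os::raw::c_int;
use std::ptr;

/// Bytes reserved in front of every block to remember its total size.
/// Also used as the alignment (assumption: 16 bytes suffices here).
const HEADER: usize = 16;

/// Hypothetical default allocation callback backed by the rust allocator.
/// Returns null on negative arguments, overflow, or allocation failure.
pub unsafe extern "C" fn rust_bzalloc(
    _opaque: *mut c_void,
    items: c_int,
    size: c_int,
) -> *mut c_void {
    if items < 0 || size < 0 {
        return ptr::null_mut();
    }
    let Some(total) = (items as usize)
        .checked_mul(size as usize)
        .and_then(|n| n.checked_add(HEADER))
    else {
        return ptr::null_mut();
    };
    let Ok(layout) = Layout::from_size_align(total, HEADER) else {
        return ptr::null_mut();
    };
    unsafe {
        let base = alloc(layout);
        if base.is_null() {
            return ptr::null_mut();
        }
        // Record the total size so the free callback can rebuild the layout.
        base.cast::<usize>().write(total);
        base.add(HEADER).cast::<c_void>()
    }
}

/// Hypothetical matching free callback. Accepts null, like C's `free`.
/// `block` must come from `rust_bzalloc` and must not be freed twice.
pub unsafe extern "C" fn rust_bzfree(_opaque: *mut c_void, block: *mut c_void) {
    if block.is_null() {
        return;
    }
    unsafe {
        let base = block.cast::<u8>().sub(HEADER);
        let total = base.cast::<usize>().read();
        dealloc(base, Layout::from_size_align_unchecked(total, HEADER));
    }
}
```

The size header is needed because the rust allocator requires the original layout on deallocation, whereas the C-style free callback only receives a pointer. A `libc`-backed variant would not need the header, which is one reason the two configurations are listed separately in the work items.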