# Zip linting & bzip2 in Rust

## Abstract: Can you explain the whole project and its expected outcome(s).

This project combines improvements for tooling around two file formats: zip and bzip2.

1) Zip is a widely-used format for distributing files. It is a rather permissive file format, opening the door to various attacks such as zip bombs.

We propose the following deliverables:

- a tool that extracts extra fields from a zip file
- a zip linter checking for suspicious file contents
- a python interface to these tools

2) The `bzip2` compression format is still used in many legacy settings. Consequently, it is part of the supply chain of many projects. The `.bz2` format is still occasionally used to distribute files, as a compression method for http traffic, and is a supported compression method for `.zip`, `.deb` and `.rpm` files.

There are two components to bzip2: the `libbz2` dynamic library, and the `bzip2` command line application. We propose the following deliverables:

- a rust crate for bzip2 encoding and decoding
- a `libbz2-rs` crate that defines an interface identical to `libbz2`, and produces byte-for-byte identical output for identical input
- a safe rust `bzip2` binary

The rust `libbz2` library and `bzip2` binary are explicitly meant to be drop-in replacements for their C equivalents.

## Have you been involved with projects or organisations relevant to this project before? And if so, can you tell us a bit about your contributions?

Our project is a joint effort between Tweede golf (Folkert de Vries and others) and Armijn Hemel.

---

While developing the [Binary Analysis Next Generation (BANG)](https://github.com/armijnhemel/binaryanalysis-ng) project, many different variants of ZIP files were encountered, including files that shouldn't exist according to the specifications, and files that were (likely) fine but couldn't be unpacked by the standard unzipping tools.
Some tool vendors have treated the specifications merely as advice, rather than as a specification.

---

Tweede golf is currently working on an implementation of zlib in rust: https://github.com/memorysafety/zlib-rs. There is overlap on the technical level (e.g. huffman codes), and the two projects are also similar in terms of code architecture.

We will use the [`c2rust`](https://c2rust.com/) tool to automatically convert `bzip2`'s C source code to semantically equivalent rust. We believe that direct translation from C to rust is effective for certain kinds of libraries with non-trivial logic and an architecture that is close to what would be idiomatic in rust. Example domains are compression and media codecs.

We have good relations with [Immunant, inc.](https://immunant.com/), the maintainers of `c2rust`. We've also contributed to their [rav1d](https://github.com/memorysafety/rav1d) project, a port of the `dav1d` AV1 decoder that used `c2rust` for the initial translation work.

## Explain what the requested budget will be used for?

### zip

**knowledge transfer (5 days)**

Knowledge needs to be transferred from Armijn Hemel to Tweede golf, especially regarding where the specifications, tools and reality do not match. We expect to spend 1 day in person and the rest virtually, spread across several weeks on an "as needed" basis.

**extract "extra fields" (7 days)**

- set up a basic parser using `rc-zip` that can find/parse the extra fields
- dump these fields into some structured format (e.g.
json)
- expose this functionality to python as a python extension

**compile/create collection of interesting zip files (7 days)**

- files that existing unpackers choke on
- files that existing malware scanners choke on
- files that hide malware in creative ways

**zip format checks (20 days)**

- correctly parse all files from the previous section
- implement lints that detect problems (if any) in files from the previous section
- combine the individual checks into a linter for zip files
- expose the linting functionality to python

**audit (2 days)**

We don't think there will be much to audit at the end of this project. Rather, we'd like to use the auditor's experience earlier on in the development process to help validate and improve the tool (e.g. does it accurately detect malicious zip files).

---

### bzip2

**initial refactoring (2 days)**

- run `c2rust`
- run the full bzip2 test suite against our rust implementation
- set this up on CI
- enable codecov
- C ABI testing (using libabigail)
- restructure modules
- remove the use of `extern "C"` for rust imports
- three crates: `bzip2` binary, `libbz2-rs` c api, and the actual logic

**initial testing (2 days)**

- convert the test suite into rust (it's a collection of python scripts today)
- set up fuzzing (on CI)

**the `libbz2-rs` dynamic library (15 days)**

- remove `extern "C"` libc functions and replace with rust `core`, `std` or `libc` libraries
- make the internal logic of the rust library (encoding and decoding) safe
- improve the test suite, in particular with more fine-grained tests
- implement allocator logic
    - no default allocator means zero dependencies
    - default to the rust allocator when used as a rust crate (requires `alloc`)
    - otherwise `libc` is required for (aligned) malloc and free

**the `bzip2` binary (5 days)**

- make the logic of the binary safe

**distribution (7 days)**

- documentation for the project as a whole (e.g.
README.md)
- documentation for the rust crates
- publish rust crates on crates.io
- build and distribute the `libbz2-rs` library and `bzip2` binary

**audit (3 days)**

- determine scope
- assist
- process findings

**blog posts (1 day)**

- general announcement + project results (in-kind)
- c2rust experience, abi checking

The project total comes to 77 days, or EUR 40040 (EUR 65/hour).

### Does the project have other funding sources, both past and present?

No. This is an entirely new project, and it has no other funding sources.

The zip component takes inspiration and guidance from the earlier [ZIP-format](https://nlnet.nl/project/ZIP-format/) project, but that project was merely about documenting quirks in the ZIP file format and various unpacking tools. No development was done as part of that project.

## Compare your own project with existing or historical efforts.

The recent "Using ZIP files to smuggle malware through scanners undetected" project by a student at OS3 investigated how the different malware scanners in VirusTotal respond to ZIP files that contain malware. Example configurations are files where the malware was stored in different ZIP file fields, or where different compression methods were used. The results showed that many malware scanners cannot properly process ZIP files that have been intentionally crafted to confuse the decoder, but still follow the specifications. The project also shows that it is possible to hide malware in plain sight. Inspection of malware scanners for which source code was available (such as ClamAV) confirmed that ZIP support is incomplete.

---

We are the first rust project to explicitly aim to be a drop-in replacement for the C implementation of `bzip2`. There is, however, some prior art in the rust ecosystem.

### Official rust version

It looks like a start has been made on a rust version of bzip2, but this effort seems to have fizzled out.
The https://gitlab.com/bzip2/bzip2/-/tree/rustify?ref_type=heads branch has not seen activity for over 4 years and only implements the crc32 algorithm in rust, none of the bzip2-specific logic.

### C bindings

The `libbz2` C dynamic library can be used from rust with this crate: https://crates.io/crates/bzip2

This library has over 19 million downloads. It compiles the C code from source and links it into the rust binary. This process requires unsafe code, and cannot (reliably) be used for cross-compilation.

### Pure rust decoder

There is a bzip2 decoder implementation: https://github.com/paolobarbolini/bzip2-rs

It looks reasonably complete. However, it cannot be used without a default allocator or without the standard library. And of course it does not support encoding. The last changes to the repository were made more than a year ago.

### Pure rust encoder

There is a bzip2 encoder implementation, but it is not production quality: https://github.com/jgbyrne/banzai/

Its author is cautious about making claims regarding the quality of the implementation:

> banzai is a bzip2 encoder with linear-time complexity, written entirely in safe Rust. It is currently alpha software, which means that it is not battle-hardened and is not guaranteed to perform well and not eat your data. That's not to say, however, that I don't care about performance or reliability - bug reports are warmly appreciated! In the long term I would like to get this library to a state where it can be relied upon in production software.

The last changes to the repository are 2 years old.

## What are significant technical challenges you expect to solve during the project, if any?

The purpose of our zip linter is to handle malicious zip files. We will need to make sure that our library can handle all kinds of input, and not blow up or consume excessive resources.

---

Overall implementing `bzip2` has low technical risk.

Distributing a C dynamic library based on a rust project is not very common.
In particular we need to verify that our library is ABI-compatible with the C version.

A major advantage of automatic translation with `c2rust` is a working rust implementation on day one; an implementation that can be tested and fuzzed. But the output of `c2rust` is far from idiomatic rust, and the majority of the work is refactoring the `c2rust` output to safe, idiomatic rust. It has been difficult in the past to estimate how long this cleanup process takes. Because this project is limited in size, it is a good benchmark, and the data it provides will be used to motivate future projects.

## Describe the ecosystem of the project, and how you will engage with relevant actors and promote the outcomes?

We will actively promote the tool, for example on sites such as LWN.net, and look for interesting conferences to talk about challenges (NLUUG spring conference on May 25 will be the first conference). By having it packaged in NixOS we want to encourage adoption.

---

Given our connections in the rust ecosystem, we believe we have a decent shot at making our pure rust implementation the default choice when using bzip2 in rust.

Through our work on ntpd-rs, we have both knowledge about and connections within the debian and fedora package ecosystems. We are confident that we can get our implementation picked up there.

The inactive `rustify` branch in the bzip2 gitlab repository shows that the bzip2 maintainers are at least willing to consider a rust implementation. We'd be open to eventually upstreaming our implementation, but believe that we should first show independently that our implementation is solid.

## Letters of support

- Per Larsen (Immunant)
- Prossimo (c2rust knowledge)
- https://github.com/alexcrichton (maintainer of the bzip2 crate)