# Proposal · Bindep, a Binary Dependency Discovery System

*Vlad-Stefan Harbuz*

*Nov 2025*

This document outlines a proposal for the development of Bindep, a collection of tools for discovering binary dependencies, revealing previously unmapped dependency relationships across the global software ecosystem. Here is a summary of the proposal, with technical details following below. This is a living document, and technical details are being added as I work them out.

## Status Quo

Most software packages depend on a wide variety of third-party packages. Understanding these dependency relationships across the global software ecosystem is valuable for multiple reasons.

First, security. If a company is developing a piece of software that relies on an Open Source package, and that Open Source package has a security vulnerability, then the company's software is also vulnerable, and it's important to uncover this.

Second, sustainable funding. For the global software ecosystem to be [stable](https://openpath.quest/2024/the-open-source-sustainability-crisis/), the development of the most-relied-on packages should be [well-funded](https://opensourcepledge.com/blog/not-paying-open-source-maintainers-is-expensive/) to ensure those packages are receiving the care they require. For this to be possible, we must know which packages are most depended on. When [keystone packages](https://vlad.website/keystone-maintainers-keep-the-internet-going/) used by millions are unsustainably maintained, leading to issues such as maintainer burnout, [global security incidents](https://en.wikipedia.org/wiki/XZ_Utils_backdoor) arise.

Currently, the best way to reconstruct these dependency relationships is to use information from package manager manifest files, like `package.json`, `go.mod`, `pyproject.toml` and others.
Such techniques have been employed by [ecosyste.ms](http://ecosyste.ms) to [index](https://packages.ecosyste.ms/) over 12 million packages, as well as by [thanks.dev](https://thanks.dev/home) and others.

## Problem

Many software dependency relationships are impossible to discover using package manager manifest files. This is because these manifests only contain information about *source dependency relationships*, ie when a package includes the source code of another package. For example, inspecting *pandas*'s [`pyproject.toml`](https://github.com/pandas-dev/pandas/blob/main/pyproject.toml) helpfully shows us that it depends on *numpy*.

But, in addition to these source dependencies, many projects have *binary dependencies*, ie when a package requires another package's *binary files* to function. For example, *numpy* depends on *OpenBLAS* — but this information is impossible to find in *numpy*'s manifests. *Binary dependencies* also include *system dependencies*, which are libraries that programs expect to be installed using the system package manager, such as *PostgreSQL*.

Many of the global software ecosystem's keystone packages are depended upon as binary dependencies, but these dependency relationships are not currently widely mapped or understood, and this in turn compromises the security and stability of the global software ecosystem.

## Solution Overview

I propose a two-part solution for discovering binary dependencies, using a combination of what I call a *global linker* and a *build recipe analyser*.

First, the global linker identifies each version of each package across a variety of systems, and mines each package's [binaries](https://blog.stephenmarz.com/2020/06/22/dynamic-linking/) for [symbols](https://noratrieb.dev/blog/posts/elf-linkage/#the-linkage-process). For example, analysing *OpenBLAS* in this way would let us know which symbols are associated with *OpenBLAS* (say, `scipy_openblas_get_parallel_64_`).
The symbols associated with each package are stored in a global database. Then, binary dependency relationships between packages are reconstructed by analysing which symbols are used. *numpy* uses the symbol `scipy_openblas_get_parallel_64_`, so we now know it has a binary dependency on *OpenBLAS*.

Second, the build recipe analyser identifies dependency relationships that the global linker misses. Because analysing binaries can only tell us which dependencies are *actually* used, we cannot identify optional dependencies. To solve this problem, we can statically analyse build recipes, for example [Meson](https://mesonbuild.com/)'s [`meson.build`](https://mesonbuild.com/Tutorial.html) file, to reconstruct which dependencies would be loaded under all possible circumstances.

## Impact Overview

Developing this tool would enable the software community to more accurately reconstruct the full dependency tree of specific programs, and of the Open Source ecosystem as a whole. Once the dependency analysis process has been implemented, *auxiliary tools* can easily be developed as part of this proposal to broaden Bindep's impact. For example:

1. **Security Analysis Tools:** Storing CVEs for each package, for example as part of [ecosyste.ms](http://ecosyste.ms), and combining this information with the dependency tree generated by Bindep, would allow developers to identify the impact of any given CVE across the software ecosystem much more reliably than before, by enumerating all affected packages.
2. **Funding Analysis Tools:** Initiatives such as [ecosyste.ms Funds](https://funds.ecosyste.ms/), [thanks.dev](https://thanks.dev) and the [Open Source Endowment](https://endowment.dev/) aim to fund the [keystone maintainers](https://vlad.website/keystone-maintainers-keep-the-internet-going/) that the global software ecosystem relies on. Incorporating the dependency tree generated by Bindep would allow these valuable initiatives to fund crucial but previously unidentified keystone maintainers, of which there are many. This would further the goal of Open Source sustainability, especially by promoting Bindep using channels such as the [Open Source Pledge](https://opensourcepledge.com/).

# Technical Details

Making the global Open Source ecosystem more secure and [sustainable](https://openpath.quest/2024/the-open-source-sustainability-crisis/), for example as part of initiatives like [ecosyste.ms Funds](https://funds.ecosyste.ms/), the [Open Source Endowment](https://endowment.dev), and [thanks.dev](https://thanks.dev), requires information about what dependencies are in a certain project's dependency tree.

For example, [*React*](https://github.com/facebook/react) depends on [*eslint*](https://eslint.org/), and we know this because JavaScript projects usually use manifest files that list dependencies and where to find them. In *React*'s case, that's a [`package.json`](https://github.com/facebook/react/blob/main/package.json) file, like with most JavaScript projects. There are other such manifests for various ecosystems — `requirements.txt` and `pyproject.toml` for Python, `go.mod` for Go, `Cargo.toml` for Rust and so on.

These kinds of dependencies are *source dependencies* — each of these manifest files points to where a dependency's source code can be obtained, and this source code is then downloaded and compiled or interpreted along with the main project's code.

But there's also a different kind of dependency: *binary dependencies*. Instead of including dependencies' *source code* as part of compilation/interpretation, some projects expect to be able to find *compiled binary forms* of each of their dependencies.
In order to make use of these dependencies, a project must know where each dependency's compiled binary is on the system, which symbols within that binary it would like to use (\~function names etc), as well as the [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in use, which are all given to a [linker](https://en.wikipedia.org/wiki/Linker_\(computing\)) or [FFI](https://en.wikipedia.org/wiki/Foreign_function_interface) mechanism (like [cffi](https://cffi.readthedocs.io/en/stable/)) that correctly [wires up](https://noratrieb.dev/blog/posts/elf-linkage/#the-linkage-process) eg calls to functions located within dependencies. Using binary dependencies is common in languages such as C and C++.

But, when it comes to reconstructing dependency trees, this is a problem, because projects that use binary dependencies typically do not have a manifest file. This makes binary dependencies very difficult to identify.

A further complication is that dependency trees sometimes span different ecosystems. [*pandas*](https://github.com/pandas-dev/pandas) depends on [*numpy*](https://github.com/numpy/numpy), and both are Python projects. But *numpy* depends on a variety of libraries that implement [*Basic Linear Algebra Subprograms*](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms), and those libraries are written not in Python, but in C or C++. So to fully work out *pandas*'s dependency tree, we need to identify *binary dependencies* from *different ecosystems* than Python.

Another thing to take into account is that some dependencies are optional. *numpy* can use one of various BLAS libraries, like [*OpenBLAS*](https://github.com/OpenMathLib/OpenBLAS), [*flexiblas*](https://github.com/mpimd-csc/flexiblas), [*LAPACK*](https://www.netlib.org/lapack/) or [*Intel MKL*](https://docs.cirrus.ac.uk/software-libraries/intel_mkl/). Not all of these dependencies are “hard” dependencies, because only one of these BLAS implementations is needed.
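One way a dependency tree could encode such “one of several” relationships is with an alternatives group. The sketch below is a minimal, hypothetical Python representation, not Bindep's actual data model:

```python
from dataclasses import dataclass

# Hypothetical model: a dependency edge satisfied by any one of several
# alternative providers, as with numpy's interchangeable BLAS backends.
@dataclass
class DependencyGroup:
    alternatives: list[str]

    def satisfied_by(self, installed: set[str]) -> bool:
        # The edge is satisfied as soon as any single alternative is present.
        return any(pkg in installed for pkg in self.alternatives)

blas = DependencyGroup(["openblas", "flexiblas", "lapack", "mkl"])
blas.satisfied_by({"openblas", "glibc"})  # True: one backend suffices
blas.satisfied_by({"glibc"})              # False: no BLAS backend present
```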
A well-constructed dependency tree should incorporate this information.

It is also desirable to have a solution that can construct dependency trees for a wide range of arbitrary, never-before-seen repositories; autonomously, so without manual intervention; and at large scales, covering as many projects as possible. These are prerequisites for solutions that might be used to create a model of dependencies across the global Open Source ecosystem, sampling as many projects as possible.

There are various strategies that might meet the above requirements. This document describes an approach combining two proposed strategies: a *global linker* and a *build recipe analyser*. Strategies that have been considered but rejected can be found in Appendix 1, because even rejected strategies can provide helpful insights.

## Global Linker

Code that has binary dependencies calls into these dependencies using specific symbols. For example, *numpy* might look in the compiled dynamic library `libscipy_openblas64_-8fb3d286.so` for the symbol `scipy_openblas_set_num_threads64_`.

How can we identify that *numpy* depends on *openblas64*? Searching for the `.so` dynamic library file is not reliable, not only because its filename is not predictable, but also because the calling code does not need to call into a dynamically linked `openblas64.so` file — the *openblas64* code could even be statically compiled into the same binary as the calling code. But the symbols that a library is made up of, such as `scipy_openblas_set_num_threads64_`, *would* probably collectively identify the library correctly.

The name “global linker” is an analogy to a regular linker. A [linker](https://en.wikipedia.org/wiki/Linker_\(computing\)) resolves the connections between a program and the libraries it depends on. A *global* linker resolves the connections between all packages and all libraries, ideally. Our global linker is, in all other respects, very different from a regular linker.
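To make the matching step concrete, here is a minimal Python sketch. The symbol index here is invented for illustration; in Bindep it would be the global database built by mining every package release's binaries:

```python
from collections import Counter

# Hypothetical symbol → package index. The real index would be a global
# database populated by extracting symbols from every package's binaries.
SYMBOL_INDEX = {
    "scipy_openblas_set_num_threads64_": "openblas64",
    "scipy_openblas_get_parallel_64_": "openblas64",
    "sqlite3_open_v2": "sqlite",
}

def resolve_binary_deps(undefined_symbols):
    """Map a binary's undefined symbols to the packages that define them."""
    hits = Counter()
    for symbol in undefined_symbols:
        package = SYMBOL_INDEX.get(symbol)
        if package is not None:
            hits[package] += 1
    return hits

# Undefined symbols as they might be extracted from a numpy extension
# module with a tool like `nm --dynamic --undefined-only`:
deps = resolve_binary_deps([
    "scipy_openblas_set_num_threads64_",
    "scipy_openblas_get_parallel_64_",
    "sqlite3_open_v2",
])
```

A real matcher would likely weight matches by what fraction of a library's symbols appear, rather than treating any single hit as conclusive.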
This strategy is very broadly applicable. It would, however, require building a global index mapping symbols to the library release they belong to.

✨ **Read more:** Technical details continued in [ecosyste-ms/packages\#1261](https://github.com/ecosyste-ms/packages/issues/1261).

## Build Recipe Analyser

Build tool recipes generally have some way to specify a dependency, and these specifications are then read by the build tool itself. For example, in Meson, dependencies are [specified](https://mesonbuild.com/Dependencies.html) by writing something like:

```
dependency('zlib', version : '>=1.2.8')
```

One might think a trivial static analysis, such as simply grepping for `dependency('\([a-zA-Z0-9-_]+\)'`, would get us the dependencies, but consider the following excerpt from [*numpy*'s `meson.build`](https://github.com/numpy/numpy/blob/main/numpy/meson.build):

```
foreach _name : blas_order
  if _name == 'mkl'
    blas = dependency('mkl',
                      modules: ['cblas'] + blas_interface + mkl_opts,
                      required: false,  # may be required, but we need to emit a custom error message
                      version: mkl_version_req,
    )
    if not blas.found() and mkl_may_use_sdl
      blas = dependency('mkl', modules: ['cblas', 'sdl: true'], required: false)
    endif
  else
    if _name == 'flexiblas' and use_ilp64
      _name = 'flexiblas64'
    endif
    blas = dependency(_name, modules: ['cblas'] + blas_interface, required: false)
  endif
  if blas.found()
    break
  endif
endforeach
```

Clearly, the syntax of build recipe files is complex enough to require actual parsing and evaluation. Such a static analysis of build recipes is possible, though. Lightweight interpreters for build recipes already exist, such as [parser.c](https://git.sr.ht/~lattis/muon/tree/master/item/src/lang/parser.c) from [muon](https://git.sr.ht/~lattis/muon), which is a lightweight Meson implementation. In fact, Meson itself provides an [IntrospectionInterpreter](https://github.com/mesonbuild/meson/blob/master/mesonbuild/ast/introspection.py) capable of identifying dependencies.
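Even a slightly smarter scan than a plain grep illustrates the problem. This hypothetical Python sketch collects the first argument of each `dependency()` call, but it can only *flag* computed names such as `_name`, not resolve them:

```python
import re

# Hypothetical sketch: find each dependency(...) call in a Meson recipe and
# classify its first argument as a literal name or a computed expression.
DEPENDENCY_CALL = re.compile(r"dependency\(\s*([^,)\s]+)")

def scan_recipe(recipe_text):
    literal_names, dynamic_args = set(), set()
    for match in DEPENDENCY_CALL.finditer(recipe_text):
        arg = match.group(1)
        if arg[0] in "'\"":
            literal_names.add(arg.strip("'\""))  # e.g. 'mkl'
        else:
            dynamic_args.add(arg)  # e.g. the variable _name
    return literal_names, dynamic_args

excerpt = """
blas = dependency('mkl', modules: ['cblas'], required: false)
blas = dependency(_name, modules: ['cblas'], required: false)
"""
literals, dynamics = scan_recipe(excerpt)
# literals == {'mkl'}, dynamics == {'_name'}
```

A real analyser would instead evaluate the recipe's structure so that, for instance, `_name` can be resolved to each element of `blas_order` in turn, which is exactly what makes parsing and evaluation necessary.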
Such interpreters could be used to turn build recipes into [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree)s, which can then be evaluated using custom rules that do nothing but collect the names of all dependencies referred to in the build recipe. Of course, such a parser-evaluator would have to be written for each build system, but once the most popular build systems, such as CMake and Meson, are covered, it seems likely that the dependencies of a good proportion of C/C++ projects could be reconstructed.

There are still caveats and limitations. Most likely, this approach would only yield the *names* of dependencies, and not necessarily the URLs to their repositories, so we would have to build an index containing, for each name, the most likely repository or repositories to be associated with that name.

*(More technical details will be added here as they are developed…)*

✨ **Implementation:** I've started an implementation of this approach in the [meson](https://codeberg.org/vladh/bindep/src/branch/main/meson) directory of the [Bindep repository](https://codeberg.org/vladh/bindep).

# Appendix 1: Rejected Strategies

## Actually Build Each Project

Instead of going through the trouble of writing code to statically analyse build recipes, one could make use of a build recipe parser that already exists — the build system itself. One could patch the build system so that, whenever a dependency specification is encountered, that dependency is printed in some convenient way, in addition to the normal build process.

In fact, this may not even require patching. CMake will print a list of encountered dependencies when `CMakeLists.txt` specifies `set_property(GLOBAL PROPERTY GLOBAL_DEPENDS_DEBUG_MODE 1)`. And CMake can even print out an illustration containing a graph of dependencies, when called using `cmake --graphviz=graph.dot`. This is promising, since CMake is probably the most widely used C/C++ build system.
But this strategy is infeasible because it requires *actually building* the project we're trying to get a dependency tree for. In addition to being computationally intensive and having unknown side effects, most projects simply cannot be autonomously built, because they require manual intervention such as config files being written, system packages being manually installed, and other prerequisites. So although this approach is interesting and informative, it is not sufficient.

## Parse Infrastructure Recipes

Infrastructure recipes such as `Dockerfile`s specify the system dependencies that must be installed for a project to work. This does not exhaust all binary dependencies, but it is a good start. However, these dependencies can be specified in many different ways. Consider this excerpt from [linkding](https://github.com/sissbruecker/linkding)'s [`Dockerfile`](https://github.com/sissbruecker/linkding/blob/master/docker/default.Dockerfile):

```
RUN apt-get update && apt-get -y install build-essential pkg-config libpq-dev libicu-dev libsqlite3-dev wget unzip libffi-dev libssl-dev curl
...
# install uv, use installer script for now as distroless images are not availabe for armv7
ADD https://astral.sh/uv/0.8.13/install.sh /uv-installer.sh
...
COPY pyproject.toml uv.lock ./
RUN /root/.local/bin/uv sync --no-dev --group postgres
...
ARG SQLITE_RELEASE_YEAR=2023
ARG SQLITE_RELEASE=3430000
...
RUN wget https://www.sqlite.org/${SQLITE_RELEASE_YEAR}/sqlite-amalgamation-${SQLITE_RELEASE}.zip && \
    unzip sqlite-amalgamation-${SQLITE_RELEASE}.zip && \
    cp sqlite-amalgamation-${SQLITE_RELEASE}/sqlite3.h ./sqlite3.h && \
    cp sqlite-amalgamation-${SQLITE_RELEASE}/sqlite3ext.h ./sqlite3ext.h && \
    wget https://www.sqlite.org/src/raw/ext/icu/icu.c?name=91c021c7e3e8bbba286960810fa303295c622e323567b2e6def4ce58e4466e60 -O icu.c && \
    gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` -o libicu.so
...
RUN apt-get update && apt-get -y install media-types libpq-dev libicu-dev libssl3t64 curl
```

This excerpt contains information about many dependencies such as *libpq* and *libssl*. But parsing this recipe is problematic in many ways:

* It is not straightforward to parse package manager commands such as those for *apt*, especially when many different package managers are used across distributions
* The same project can be packaged under many different names in many different package managers — although this could be solved by building an index of such names and using heuristics
* It is not at all straightforward to parse non-package-manager installation steps such as *wget*, *unzip* etc

And in any case, not all projects will have a `Dockerfile`. So the strategy of using infrastructure recipes has serious limitations.

## Create a New Standard

Instead of attempting to glean information from sources that were not made to be parsed in this way, such as `CMakeLists.txt`, it might be best to *create a specification* for a new manifest file format to be used in projects that make use of binary dependencies. Such a file format would allow developers to specify binary dependencies in a machine-readable format that is understood by everyone, which would make such dependencies much easier to parse. Ideally, such a specification would easily interoperate with existing build tools where this is required or useful.

While this is probably a good idea, it would require widespread adoption, which is not feasible in the short term, so this strategy would not help us meet our Open Source security and sustainability goals anytime soon.