Conda-forge's Long-term Goals and Plans
This document contains the long-term goals and plans for conda-forge.
It exists to help the core team and outside entities understand where conda-forge is as a whole, where it is going, and, most importantly, how the community can further support its goals.
The first version of this document was developed by the core team over a period of several months starting in late 2020 and extending into early 2021.
This process, and the construction of this document, were motivated by a few developments in the community at the time.
First, conda-forge has grown spectacularly since its inception (see, e.g., this conda-forge year-in-review blog post).
This growth has occurred in multiple ways, including the number of artifacts we host, the number of community members maintaining those artifacts, the number of people downloading them, the diversity of the community, and the diversity of the kinds of software conda-forge hosts.
Second, people and organizations beyond Anaconda Inc. started making core contributions to the conda ecosystem tooling. In particular, QuantStack is contributing a huge amount of effort to building efficient tooling and an open-source conda package server (see mamba-org).
This effort has transformed and enabled the conda ecosystem in numerous ways. Third, due to its increasing presence in the community, conda-forge started receiving offers of financial support from a broader group of people and organizations.
Until now, we have lacked a coherent message about what help we needed and how these groups could help.
While this document cannot possibly answer all of these questions definitively, it can distill our thinking at the time and our current thoughts, and hopefully express clearly what is important in the opinion of the core team. We expect this to be a living document that is updated and improved over time. We also hope the community finds it useful and engages with conda-forge on these items going forward.
– conda-forge/core
Growth is causing a variety of failures in our tooling and maintenance workflows such as:
The numbers behind these observations can be checked in the conda-forge/by-the-numbers repository.
🚧 Quansight and QuantStack have submitted a grant (pending decision) to work on this 🚧
Summary: conda-forge depends upon a large amount of infrastructure configuration spread across multiple GitHub repositories, external CI services, and Heroku instances.
The configuration and provisioning of this infrastructure need to be centralized in a tool like Terraform to enable better security, reliability, and recovery from adverse events.
Effort / cost: medium (estimated at 1 FTE over a year or so)
Priority: High
Context: As conda-forge has grown, the infrastructure that powers various user-facing services (e.g., admin commands) and backend services like artifact validation and builds has grown organically too.
This situation has resulted in an array of bot accounts, API keys, and bespoke configuration settings spread across Azure DevOps, GitHub, TravisCI, Drone.io, CircleCI, and Heroku.
Further, very little to no documentation exists on how to re-provision any of these services should one of them encounter some serious event. We also cannot easily perform basic tasks, like rotating API tokens.
Description: N/A
References: N/A
Contact info: N/A
🚧 QuantStack is working on this item 🚧
Summary: conda-forge hosts its artifacts on Anaconda.org. They are additionally served through a Cloudflare CDN, but Anaconda.org is, in principle, the sole source for all conda-forge packages.
It would be advisable to have backups and mirrors to prevent single-point-of-failure issues.
Effort / cost:
Priority:
Context:
Description:
References:
Contact info:
"Wolf Vollprecht" This could be quite nicely and easily done as an additional push to an OCI registry. The source is anyways indexed by SHA256 from the recipe.
Summary: Most conda-forge packages obtain their source code from official origins (git repository, packaging index…).
However, these sources are not guaranteed to be available forever, threatening the reproducibility of the conda-forge packaging efforts.
This item would involve a mechanism to save and keep the sources used for each package pushed to the conda-forge channel.
Effort / cost: TBD
Priority:
Context: conda-forge often undergoes ABI migrations, which require rebuilding packages (same version) with different build-time dependencies.
If the source has become unavailable for unrelated reasons, the migration process is interrupted until the new location of the source is found (in the best case).
Description: Note this would probably need its own Terms of Service.
References:
Contact info:
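The archiving mechanism described in this item could be sketched as a content-addressed store keyed by the SHA256 digest that recipes already record for their sources. This is only an illustration under assumptions: the function names `sha256_of` and `archive_source`, the two-level directory layout, and the use of a local directory instead of an OCI registry or other remote backend are all hypothetical and not part of any existing conda-forge tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_source(source: Path, expected_sha256: str, archive_root: Path) -> Path:
    """Copy a downloaded source into a store keyed by its SHA256 digest.

    Hypothetical layout: <archive_root>/<first two hex chars>/<full digest>.
    The copy is skipped when the digest is already archived.
    """
    actual = sha256_of(source)
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch for {source.name}: "
            f"expected {expected_sha256}, got {actual}"
        )
    target = archive_root / actual[:2] / actual
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return target
```

Because the key is the digest rather than the upstream URL, a rebuild during a migration could look up the source by the checksum in the recipe even if the original location has moved.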
🚧 Quansight is working on some of these items 🚧
Summary: Provide infrastructure to build packages with specific technical requirements, like needing a GPU or copious amounts of RAM / storage not available in free services.
Effort / cost:
Priority:
Context: conda-forge uses freely available CI resources (like Azure Pipelines or GitHub Actions) to build its packages.
These generous services have some limitations (allowed execution time, disk space, available RAM, processors…), which, in some cases, prevent some packages from being built.
Notorious examples include qt, pytorch, and tensorflow. The current workaround is to run the build script locally on somebody's machine, which introduces issues such as lack of standardization of the build machine, availability of volunteer time, and the trust chain.
Description:
Currently needed infrastructure:
References:
Contact info:
Context: Anaconda used to be the sole provider of tooling in the conda ecosystem, but other organizations are producing their own tooling to either complement or substitute existing utilities (QuantStack, Quansight, conda-incubator, among others).
Tasks:
conda index could be provided on its own, not as part of conda build.
🚧 The conda and mamba teams are working on this 🚧
Summary: More efficient serving of the ever-growing repodata.json files with incremental updates
Effort / cost:
Priority:
Context: The package metadata used by the conda solvers is published upfront for each conda channel in a JSON file called repodata.json. The more packages published in a channel, the bigger the JSON file. This file is downloaded every time conda needs to install something. Popular channels like conda-forge store a lot of packages and are updated often, so caching and compression only help to some extent.
Description:
References:
Contact info:
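The idea behind incremental updates can be illustrated with a minimal sketch. It assumes a simplified repodata layout with a single `packages` mapping from artifact file name to metadata record. The helper names `diff_packages` and `apply_delta` and the delta format are hypothetical, and this is not the protocol the conda and mamba teams are working on.

```python
def diff_packages(old: dict, new: dict) -> dict:
    """Compute a delta between two simplified repodata dictionaries.

    The delta lists records that were added or modified ("changed")
    and file names that disappeared ("removed").
    """
    old_pkgs = old.get("packages", {})
    new_pkgs = new.get("packages", {})
    changed = {
        name: record
        for name, record in new_pkgs.items()
        if name not in old_pkgs or old_pkgs[name] != record
    }
    removed = sorted(name for name in old_pkgs if name not in new_pkgs)
    return {"changed": changed, "removed": removed}


def apply_delta(old: dict, delta: dict) -> dict:
    """Rebuild the newer repodata from a cached copy plus a delta."""
    packages = dict(old.get("packages", {}))
    for name in delta["removed"]:
        packages.pop(name, None)
    packages.update(delta["changed"])
    return {**old, "packages": packages}
```

With such a scheme, a client that already holds a cached copy would only need to download the delta, whose size depends on how much changed rather than on the total number of packages in the channel.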
Summary: Creating snapshots of conda-forge repodata at regular time intervals
Effort / cost:
Priority:
Context: The channel metadata changes when new packages are added, but it can also be patched retroactively to fix metadata problems introduced in the past. For conda-forge, the patches are submitted via conda-forge/conda-forge-repodata-patches-feedstock and then propagated to the CDN.
Description:
References:
Contact info:
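One way to picture periodic snapshots is a small routine that writes a timestamped copy of the metadata and skips the write when nothing changed since the previous snapshot. The sketch below is an assumption-laden illustration: the function name `write_snapshot`, the file naming scheme, and storage in a local directory are hypothetical and do not describe existing conda-forge infrastructure.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_snapshot(repodata: dict, snapshot_dir: Path, when: datetime) -> Optional[Path]:
    """Store a timestamped snapshot unless it matches the latest one.

    `when` should be a timezone-aware datetime. Hypothetical file names
    look like repodata-<UTC timestamp>-<digest prefix>.json, so sorting
    by name gives chronological order.
    """
    # Serialize deterministically so equal metadata gives equal digests.
    payload = json.dumps(repodata, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]

    snapshot_dir.mkdir(parents=True, exist_ok=True)
    existing = sorted(snapshot_dir.glob("repodata-*.json"))
    if existing and existing[-1].name.endswith(f"-{digest}.json"):
        return None  # unchanged since the most recent snapshot

    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = snapshot_dir / f"repodata-{stamp}-{digest}.json"
    target.write_bytes(payload)
    return target
```

Keeping every distinct state, including those produced by retroactive patches, would make it possible to solve an environment against the metadata as it was at a given point in time.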
Summary: conda-forge provides the MinGW compilers, built at a certain epoch (timestamp). Updating to a more recent build would be desirable.
Effort / cost:
Priority:
Context:
Description: This requires rebuilding MinGW, updating repodata, and rebuilding all downstream packages that depend on the MinGW toolchain.
References:
Contact info:
A lot of institutional knowledge changes often without being consolidated in writing. It needs to be enumerated and written down. Some examples include:
The following items are no longer part of the conda-forge roadmap because they were funded and/or completed!
.conda format
The new package format is live on Anaconda.org, as of XX.XX.XXXX. conda-forge supports it.
https://github.com/conda-forge/conda-forge.github.io/issues/1586
https://github.com/conda-forge/conda-forge.github.io/issues/877