Conda-forge's Long-term Goals and Plans

Purpose

This document contains the long-term goals and plans for conda-forge.
It exists to help the core team and outside entities understand where conda-forge is as a whole, where it is going, and, most importantly, how the community can further support its goals.

Background

The first version of this document was developed by the core team over a period of several months starting in late 2020 and extending into early 2021.
Both the process and the resulting document were motivated by a few things happening in the community at the time.

First, conda-forge has grown spectacularly since its inception (see e.g. this Conda-forge year-in-review blog post).
This growth has occurred in multiple ways, including the number of artifacts we host, the number of community members maintaining those artifacts, the number of people downloading them, the diversity of the community, and the diversity of the kinds of software conda-forge hosts.
Second, people and organizations beyond Anaconda Inc. started making core contributions to the conda ecosystem tooling. In particular, QuantStack is contributing a huge amount of effort to building efficient tooling and an open-source conda package server (see mamba-org).
This effort has transformed and enabled the conda ecosystem in numerous ways.
Third, due to its increasing presence in the community, conda-forge started receiving offers of financial support from a broader group of people and organizations.

Until now, we have lacked a coherent message about what help we need and how these groups can provide it.

While this document cannot possibly answer all of these questions definitively, it can distill our thinking at the time, our current thoughts, and hopefully express clearly what is important in the opinion of the core team. We expect this document to be living in the sense that it is updated and improved over time. We also hope the community finds it useful and engages with conda-forge on these items going forward.

conda-forge/core

Context

Growth is straining our tooling and maintenance workflows in a variety of ways, such as:

  • Significantly increased maintenance burdens
  • Creation of new or unexpected demands (e.g., driven by the community and users, as well as by international policies)
  • Increase in potential security and reliability concerns

The numbers behind these observations can be checked in the conda-forge/by-the-numbers repository.


Continuous Integration Infrastructure and Cloud Services

Infrastructure as configuration

🚧 Quansight and QuantStack have submitted a grant (pending decision) to work on this 🚧

Summary: conda-forge depends upon a large amount of infrastructure configuration spread across multiple GitHub repositories, external CI services, and Heroku instances.
Both this configuration and the provisioning of the infrastructure need to be centralized in a tool like Terraform to enable better security, reliability, and recovery from adverse events.

Effort / cost: medium (estimated at 1 FTE over a year or so)

Priority: High

Context: As conda-forge has grown, the infrastructure that powers various user-facing services (e.g., admin commands) and backend services like artifact validation and builds has grown organically too.
This situation has resulted in an array of bot accounts, API keys, and bespoke configuration settings spread across Azure DevOps, GitHub, Travis CI, Drone.io, CircleCI, and Heroku.
Further, little to no documentation exists on how to re-provision any of these services should one of them suffer a serious incident. We also cannot easily perform basic tasks, like rotating API tokens.

Description: N/A

References: N/A

Contact info: N/A


Mirroring

🚧 QuantStack is working on this item 🚧

Summary: conda-forge hosts its artifacts on Anaconda.org. They are then served through a Cloudflare CDN, but Anaconda.org remains, in principle, the sole source for all conda-forge packages.
It would be advisable to have backups and mirrors to avoid this single point of failure.

Effort / cost:

Priority:

Context:

Description:

References:

Contact info:


Source tarball hosting

Comment from Wolf Vollprecht: This could be done quite nicely and easily as an additional push to an OCI registry. The source is indexed by SHA256 from the recipe anyway.

Summary: Most conda-forge packages obtain their source code from official origins (git repository, packaging index).
However, these sources are not guaranteed to be available forever, threatening the reproducibility of the conda-forge packaging efforts.
This item would involve a mechanism to save and keep the used sources for each package pushed to the conda-forge channel.
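
As a rough illustration of the idea, the following Python sketch stores a downloaded source archive under its SHA256 digest after checking it against the hash expected by the recipe. It is only a sketch under stated assumptions: the function names (sha256_of, archive_source) and the local "store" directory are hypothetical, and a real implementation would likely push to a remote backend such as an OCI registry rather than a local folder.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_source(source: Path, expected_sha256: str, store: Path) -> Path:
    """Copy `source` into `store` under its SHA256, after verifying the hash.

    `store` is a hypothetical local directory standing in for whatever
    backend would actually hold the archived sources.
    """
    actual = sha256_of(source)
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    # Shard by the first two hex characters to keep directories small.
    target = store / actual[:2] / actual
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        tarball = tmp_path / "example-1.0.tar.gz"
        tarball.write_bytes(b"not a real tarball, just demo bytes")
        recipe_sha256 = sha256_of(tarball)  # in practice, read from the recipe
        stored = archive_source(tarball, recipe_sha256, tmp_path / "store")
        print("archived as", stored.name)
```

Keying the archive by digest means a rebuild can look up the source from the hash already recorded in the recipe, regardless of whether the original URL still resolves.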

Effort / cost: TBD

Priority:

Context: conda-forge undergoes ABI migrations often, which require rebuilding packages (same version) with different build-time dependencies.
If the source is not available for unrelated reasons, the migration process is interrupted until the new location of the source is found (best case).

Description: Note this would probably need its own Terms of Service.

References:

Contact info:


Specialized CI needs

🚧 Quansight is working on some of these items 🚧

Summary: Provide infrastructure to build packages with specific technical requirements, like needing a GPU or copious amounts of RAM / storage not available in free services.

Effort / cost:

Priority:

Context: conda-forge uses freely available CI resources (like Azure Pipelines or GitHub Actions) to build its packages.
These generous services have some limitations (allowed execution time, disk space, RAM available, processors), which, in some cases, prevent some packages from being built.
Notorious examples include qt, pytorch, and tensorflow. The current workaround is to run the build script locally on somebody's machine, which introduces issues such as a non-standardized build machine, dependence on volunteer time, and a weaker chain of trust.

Description:

Currently needed infrastructure:

  • GPU builds:
    • Linux x64: WIP (Quansight)
  • Long builds:
    • Linux: WIP (Quansight)
    • Windows
    • macOS x64
  • Native architectures:
    • macOS arm64
    • Linux aarch64
    • Linux ppc64le

References:

Contact info:


Software and Internal Tooling

Conda and Mamba tooling

Context: Anaconda used to be the sole provider of tooling in the conda ecosystem, but other organizations are producing their own tooling to either complement or substitute existing utilities (QuantStack, Quansight, conda-incubator, among others).

Tasks:

  • For better interoperability, some of Anaconda's tools could be split into smaller pieces. For example, conda index could be provided on its own, not as part of conda build.
  • Create schemas for the different file formats involved. (In progress 🚧 )
  • Standardize the de facto behaviors into technical specs tools can reimplement.

Incremental repodata updates

🚧 The conda and mamba teams are working on this 🚧

Summary: More efficient serving of the ever-growing repodata.json files with incremental updates

Effort / cost:

Priority:

Context: The package metadata used by the conda solvers is published upfront for each conda channel in a JSON file called repodata.json. The more packages published in a channel, the bigger the JSON file. This file is downloaded every time conda needs to install something. Popular channels like conda-forge store a lot of packages and are updated often, so caching and compression only help to some extent.
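
To make the idea of incremental updates concrete, here is a minimal Python sketch that computes a patch between two repodata.json snapshots and applies it on the client side. It does not describe the mechanism the conda and mamba teams are building: the patch layout ("upsert" / "remove") and the function names are invented for illustration, and the only assumption about repodata.json is that it maps package file names to records under the "packages" and "packages.conda" keys.

```python
import json


def diff_repodata(old: dict, new: dict) -> dict:
    """Compute a small patch that turns the package records of `old` into `new`.

    Only the "packages" and "packages.conda" mappings are compared.
    """
    patch = {}
    for section in ("packages", "packages.conda"):
        old_records = old.get(section, {})
        new_records = new.get(section, {})
        patch[section] = {
            # Records that are new or whose metadata changed.
            "upsert": {
                name: record
                for name, record in new_records.items()
                if old_records.get(name) != record
            },
            # Records that disappeared from the channel.
            "remove": sorted(set(old_records) - set(new_records)),
        }
    return patch


def apply_patch(old: dict, patch: dict) -> dict:
    """Return a copy of `old` with the patch applied."""
    updated = json.loads(json.dumps(old))  # deep copy via a JSON round trip
    for section, changes in patch.items():
        records = updated.setdefault(section, {})
        for name in changes["remove"]:
            records.pop(name, None)
        records.update(changes["upsert"])
    return updated


if __name__ == "__main__":
    cached = {"packages": {"a-1.0-0.tar.bz2": {"depends": ["python"]}}}
    latest = {
        "packages": {
            "a-1.0-0.tar.bz2": {"depends": ["python"]},
            "b-2.0-0.tar.bz2": {"depends": []},
        }
    }
    patch = diff_repodata(cached, latest)
    # Only the added record travels over the wire, not the whole file.
    print(json.dumps(patch["packages"], indent=2))
```

With a scheme along these lines, a client that already holds a cached copy would download only the changed records, so the transfer size tracks the amount of change rather than the total size of the channel.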

Description:

References:

Contact info:


Periodic snapshots of conda-forge repodata

Summary: Creating snapshots of conda-forge repodata at regular time intervals

Effort / cost:

Priority:

Context: The channel metadata changes when new packages are added, but it can also be patched retroactively to fix metadata problems introduced in the past. For conda-forge, the patches are submitted via conda-forge/conda-forge-repodata-patches-feedstock and then propagated to the CDN.
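
One possible shape for such a snapshot job is sketched below in Python. It is an assumption-laden illustration, not an existing conda-forge tool: the function name, the file naming scheme, and the local snapshot directory are all hypothetical. Each snapshot is named by its UTC timestamp plus a short content hash, and a new file is written only when the content differs from the latest snapshot, so a retroactive metadata patch produces a new snapshot even when no packages were added.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_snapshot(repodata: dict, directory: Path, when: datetime) -> Optional[Path]:
    """Store `repodata` as a timestamped snapshot; skip it if nothing changed."""
    payload = json.dumps(repodata, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    directory.mkdir(parents=True, exist_ok=True)
    # File names start with a sortable UTC timestamp, so the last one is the newest.
    existing = sorted(directory.glob("repodata-*.json"))
    if existing and existing[-1].stem.endswith(digest[:12]):
        return None  # the latest snapshot already has this exact content
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = directory / f"repodata-{stamp}-{digest[:12]}.json"
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshots = Path(tmp)
        repodata = {"packages": {"a-1.0-0.tar.bz2": {"depends": ["python"]}}}
        first = write_snapshot(repodata, snapshots, datetime(2021, 1, 1, tzinfo=timezone.utc))
        # A retroactive patch changes existing metadata without adding packages.
        repodata["packages"]["a-1.0-0.tar.bz2"]["depends"] = ["python >=3.6"]
        second = write_snapshot(repodata, snapshots, datetime(2021, 1, 2, tzinfo=timezone.utc))
        print(first.name)
        print(second.name)
```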

Description:

References:

Contact info:


Windows toolchain

Summary: conda-forge provides the MinGW compilers, built at a certain epoch (timestamp). Updating to a more recent build would be desirable.

Effort / cost:

Priority:

Context:

Description: This requires rebuilding MinGW, updating repodata, and rebuilding all downstream packages that depend on the MinGW toolchain.

References:

Contact info:


Recipe generation

  • Generate R recipes with Grayskull
  • Generate multi-outputs for pip-extras
  • Generate multi-outputs for headers / dynamic / static libraries

Recipe maintenance


Supply chain security

  • Package signing on Quetz
  • Running an X-ray security scan on all the artifacts

Documentation

A lot of institutional knowledge changes often without being consolidated in written form. It needs to be enumerated and written down. Some examples include:

  • Infrastructure deployment
  • Compilers
  • Migration process
  • Cross-organization tooling
  • Onboarding
  • Staged-recipes handbook

Achieved goals

The following items are no longer part of the conda-forge roadmap because they were funded and/or completed!

The .conda format

The new package format is live on Anaconda.org, as of XX.XX.XXXX. conda-forge supports it.

https://github.com/conda-forge/conda-forge.github.io/issues/1586
https://github.com/conda-forge/conda-forge.github.io/issues/877
