Next gen conda recipe spec

Cons of the current spec:

It combines three programming languages: YAML, Jinja and the custom selector syntax
Jinja and selectors makes it hard to parse using standard YAML parsers from Python
Parsing is a multi-stage process in Python that iterates until all values are filled
Version numbers are interpreted as floating point values ("0.14") in YAML

more ideas (convo w/ MB)

test matrix for noarch for different platforms / versions
keep all steps independent
more formal spec so that we can validate recipes, get inputs directly

Specification design process

(MBargull) If we want to come up with a "more sound" specification that is easier to parse (for machine as well as humans), avoids many complications of the current format and allows future extensions/workflow integrations, we should (on a higher level, IMHO) try to discern needed structures and processes as clearly as feasible.
This means:

For the overall build process:
- Identify and separate distinct phases, e.g. (just as an example, not meant to be correct/complete list),
  - Gather build context defined outside the actual package build specs (e.g., pinnings),
  - Parse build specs,
  - Gather build input/environment definitions,
  - Build package contents,
  - Post-process/packaging,
  - Testing,
  - ???
- And then clearly define inputs and outputs for such phases
  - Helps with separation of concerns,
  - Allows definition of build workflows, e.g., dispatching actual building and testing to workers,
  - Removes "unexpected" implicit behavior.
For the build specs (assuming a declarative tree-based format) (and its serialization format):
- Define required input information for current and possible future needs.
  I.e., support everything that is currently supported (however, maybe drop existing concepts
  if they compete with base goals of the improved format) and keep future extensions in mind.
- For clarity, start with a conceptual spec, i.e., don't jump right into YAML vs TOML vs ???:
  - E.g., for one of the requirements sections define something like
```
NODE[X] := X | SELECT_NODE[X]
SELECT_NODE[X] := [CONDITION, X]
NODE[REQUIREMENTS] := LIST[REQUIREMENT_NODE]
NODE[REQUIREMNT] := [PACKAGE_NAME, PACKAGE_VERSION, PACKAGE_BUILD, ...]
```
  - This allows a specification (somewhat) independent of the serialization format,
  - Avoids introducing format-specific constrictions early on, e.g.,
    if we go with the "sel(cond)": entry route, we assume to go with a format that makes it
    easy (or even allows) putting arbitrary content in mapping keys.
    (e.g., {if: cond, then: entry} would be a more flexible serialization alternative)
  - Pure tree-like definition allows to better identify data dependencies etc.
    And to decide on where what kind of dynamic content is necessary, e.g.,
    we don't have to stick with the more complicated TEXT -> Jinja2 -> YAML -> [OBJECTS]
    route but can do TEXT -> YAML -> [OBJECTS] -> [Jinja2(obj) for obj in OBJECTS].
    Plus, we can see that pin_compatible/pin_subpackage don't have to be processed by the
    template engine at all (hint: they have static content!).
  - Lets us identify where we can artifically restrict the format to avoid complexity.
    (And plan for lifting restrictions for possible future features.)

This should also help us identify which legacy concepts/features we can remove or replace by
more general approaches (e.g., I'd like us to remove unnecessarily domain-specific things like
noarch: python (should be {arch: noarch, package_type: python} or the like), CONDA_* and
PYTHON/{{ python }} env/jinja2 vars etc.) and to make implicit behavior more explicit etc.

Design parameters

Declarative file or script file?
- declarative is much simpler to parse, but less flexible. What dynamic features do we need? How do these mesh with declarative syntax?
  - Is a templating engine of any kind supported?
  - If so, what kind, and how are limits on that going to be imposed?
- script file can lead to impossible situations, like setup.py, and the inability to parse package metadata. This is largely avoided by embedding fully-rendered (static) metadata with built packages, but evaluating the metadata (rendering) can be slow for other things, like building graphs from just recipes.
- Could provide both, like Jenkins: https://www.edureka.co/community/54705/difference-between-declarative-pipeline-scripted-pipeliine
What dimensions of variability are necessary?
- Version pins from a central configuration
- Operating system
- CPU architecture + optimization flags
Producing related packages
- If a build script produces a set of files that belongs in several split packages, how is that specified?
- If one recipe relies on another, how can the version constraints be expressed?
- How is the dependency tree for related packages computed? Can it be done in a non-iterative way?
What is the model for metadata?
- Conda-build 2.1 introduced split packages, with the implicit top-level recipe being treated as an output to maintain compatibility with the existing body of recipes.
- Forcing everything to have explicit outputs would be nice
- Top-level metadata could then be understood to only be for build, and would remain present as a snapshot for any of the output steps.
- How does this model get affected by the various dimensions of variability?
- This is related to the "Alternative to bld.bat / build.sh" section below.
What workflows are supported? How does a recipe translate into:
- packages
- test environments (and how do these get re-run?)
- CI build/test/upload reports
- For recipes with interdependent split packages, are the dependency relationships fulfilled if only building one output? In other words, to "develop" an output of a recipe, are all of its requirements built and put in place so that the build of the ouptut can proceed?
Can mutex metapackage structures be simplified/streamlined?
- These have way too many facets to understand and get right. The pattern is well-established now, and recipes could include some shorthand for this.
Specify environment variables to be set at runtime. This is a feature that must be implemented in conda and conda-build simultaneously.

Additional questions, ideas

Support more of the match spec syntax. At least namespace (if it is ever implemented) will be mandatory and I can see the appeal of channel support as well (Wolf)
- channel:namespace:pkg 0.3.1 XXXXXX
- Think about supporting build numbers better. With the full match spec one can write mypackage[build_number>5] which seems impossible with the current conda_build spec
bld.bat / build.sh ↦ unify script names to build.bat and build.sh
Use a shell agnostic "scripting" language?

Extending lists with lists

(WV) The new selector syntax may tempt to write code like this:

requirements:
  host:
    - python
    - "sel(win)":
      - curl
      - flask
      "sel(linux)":
      - wget
      - bash

It would require some code to make this work as expected. I wonder if we should support it or not.

MRB: No!

Old School env vars?

At some point in time variants could be set with CONDA_PY, CONDA_R… environment vars. This apparently leads to conda-build having to check and regex the build scripts if these env vars are used. I think it's time to get rid of them and not include them with v2.

Dave Hirschfeld: Optional Dependencies

It would be great if there was a better story for optional dependencies.

The special-casing of test requirements is inconsistent with the requirements for host/build/run. The fact that you can't independently install the test dependencies is a huge usability wart and suggests there should be a better way of supporting them.

I think that rather than special-casing test dependencies, that test should simply be another sub-category under requirements alongside host/build/run. Taking this idea further, a user could define any sub-category they wanted under requirements - e.g. dev, docs, etc…

In the case of dask they would define sub-categories array, bag, dataframe, distributed, diagnostics, delayed:
https://github.com/dask/dask/blob/master/setup.py#L10-L28

As it is, conda forces you to take an all or nothing approach - e.g.
https://github.com/conda-forge/dask-feedstock/blob/master/recipe/meta.yaml

Even if all you want to use is dask.array you're forced to install the dependencies for everything. The answer often given is to use outputs, and that's what we do ourselves - every package builds pkg-test, pkg-docs and pkg-dev metapackages along with pkg itself. Whilst this can be made to work, IMHO it's a poor substitute for proper support for optional dependencies. Ideally a recipe maintainer could specify whatever sub-category under requirements they wanted and conda would allow you to install that sub-category independently.

e.g. using pip syntax it would be

conda install dask[array,bag]

It has been argued that the pip syntax conflicts with existing conda syntax which is a fair argument, but there could be other ways to specify optional dependencies - e.g.

conda install dask --include-deps array --include-deps bag

conda install dask could then be used as an alias for conda install dask --include-deps run

This proposal requires a change to the yaml spec and also support from conda/mamba. The yaml changes could be made backwards compatible by allowing test requirements to be specified either under test/requirements or requriements/test with a suitable deprecation period.

MRB: we should not further complicate recipes with optional deps or arbitrary dep sections. the gain here is marginal and the costs are big

Output handling

Outputs can currently be two things:

they can split a "super"-build up into multiple packages by selecting files specific to e.g. the python bindings of a C++ library
they can act as standalone build with their own build scripts, define dependencies …

These two functions and the (implicit?) super-build are confusing and not very intuitive.
One suggestion is to have explicit "transient" builds / outputs but the rules for this are also not clear.

MRB: what do you mean the rules are not clear? The rules are

any transient output cannot be a run or test dependence
its artifact is always deleted at the end
it must always be pinned exactly in any host or build section

Jinja functions

conda-build defines the following jinja functions:

load_setup_py_data (used by 0 packages on cf)
load_setuptools (used by 1 package on cf, conda_smithy)
load_npm (used by 0 packages on cf)
load_file_regex (used by 0 packages)
installed
pin_compatible (IMPORTANT)
pin_subpackage (IMPORTANT)
compiler (IMPORTANT)
cdt (IMPORTANT)
resolved_packages
time -> this might not be great for reproducible builds? (MRB: we have to keep these, they are very important for builds that need to autoincrement the version)
datetime -> this might not be great for reproducible builds? (MRB: we have to keep these, they are very important for builds that need to autoincrement the version)
environ (IMPORTANT)

Maybe pin_compatible and pin_subpackage should not be implemented as Jinja functions but rather as a preprocessing step in the build / host environment configuration? The syntax would stay the same but the {{ & }} brackets would be removed. The reasoning is that the output of those two functions depends on the solve and hashing process which is already a part of the processing that conda build does (vs. string manipulation and templating that Jinja does). A clean seperation could lead to code that becomes easier to digest.

The syntax to continue to use variables would change to pin_subpackage({{ name }}, max_pin='x.x')

MRB: The change of pin_subpackage and pin_compatible to non-jinja2 is pretty invasive and not needed. We can rearange the parsing to explicitly use the graph of deps and proceed in topological order to avoid the difficulties here.

WV: Yes, I agree we can keep the Jinja syntax, but internally the jinja might just modify the dependency to some other machine readable output, e..g

{{ pin_subpackage(name, max_pin='x.x') }} => PIN_SUBPACKAGE[somelib, max_pin=x.x, exact=False, ...] which we can parse in a later step. This would remove the need of Jinja and solving to be intermangled.

MRB: FWIW, jinja2 gives you access to their AST. They break the text up in a way that has nothing to do with the YAML AFAICT.

MRB: My goal here is to make sure that all of these changes can be easily put into conda-build. Thus keeping them minimal is very important. The parsing and solving being intermingled is OK once we have a faster solver around.

WV: I think having clean seperations between the stages is much preferred though. Right now it's very difficult to follow the conda build code. The different stages that (theoretically) should exist are probably:

Parse YAML
Evaluate Jinja
Sort outputs & recipes (this requires inspecting the dependencies …)
For each output / recipe, in order
- create all variants
- solve dependencies
- Insert pin_subpackage & pin_compatible
- build recipe & add output to local package cache

MRB: So all I am saying is that in step 3 above (sorting outputs and getting deps), we render with stub jinja2 functions for pin_subpackage and pin_compatible that just return the name. Then in stage 4, we use the information from the previous build. In other words, we can support the parsing + rendering + solving + building steps above with the jinja2 just fine.

WV: no, I don't think so because e.g. for sorting we need the right package name inserted. Imagine an output using {{ pin_compatible(name) }} then we would (for sorting) have to evaluate the jinja etc. It would be much easier (implementation wise) to evaluate this to mypackage PIN_COMPATIBLE[max_pin='x.x', ...] and we still have a way to figure out the correct sorting order without going back to Jinja. Jinja should execute only once.

MRB: This is a minor implementation detail. Executing jinja2 once to build a special string that we then parse again is no different from having the second step of parsing the string again be done in jinja2 as well. At the end of the day, you have to turn PIN_COMPATIBLE[max_pin='x.x', ...] into an actual package requirement string.

WV: sure but you are missing that I was using name as a variable. We can throw away the jinja context and all that overhead after the first parsing step is done.

MRB: No I was not. The context is one extra dictionary that is around. Seems like a small thing.

WV: it seems small, but it complicates the architecture. Clean seperation of the stages is very important to me. But it's an implementation detail, I agree with that and one is free to chose to do it differently. The key is that we want to engineer something that will be maintainable for a long time.

MRB: (edit:) implementing PIN_COMPATIBLE[max_pin='x.x', ...] in conda build as it is now is a huge change.

WV: ew don't want to support PIN_COMPATIBLE. The end user still writes {{ pin_compatible( ... ) }} but internally boa / conda-build might choose to change it to an internal representation and do as it pleases.

MRB: I mean implementing the internal rep, not supporting it for users. Sorry! :D

WV: for reference, that's how I am planning to do it with boa, and then generate the conda-build MetaData class which I hopefully can feed to conda-build to create packages :)

Alternative to `bld.bat` / `build.sh`

Chris Burr: This might be out of scope for this proposal but it would be nice for the scripts used for building to become more automatic. Nix, portage and many other package managers have a mechanism by which builds are split into "phases". Packages can modify how other stages behave, for example depending on cmake causes the configure stage to run something like cmake "$source_dir" "${cmake_flags[@]}". The cmake_flags array can then be set to a sensible default. This makes it easier to write correct recipes without knowing the details of how to use every build system and also makes it easier to modify how all recipes are built. Both Nix and Portage ultimately implement this using bash scripting. I'll include a suggestion of how this might look but I'm not advocating for it to be exactly this way. I'm especially not sure how to map this over to Windows builds.

If the pip package is defined like:

build:
    export:  # Tells dependent recipes to change how they run some stages
        install: python -m pip install "${src_dir}"

and cmake is defined like:

build:
    export_env:  # All exported environment changes are sourced before building any recipes
        declare -a cmake_flags
        cmake_flags+=("-DCMAKE_INSTALL_LIBDIR=lib")
        cmake_flags+=("-DCMAKE_INSTALL_PREFIX=${PREFIX}")
    export_phases:
        configure: |
            cmake \
                "${cmake_flags[@]}" \
                 "${src_dir}"
        build: cmake --build . -j${CPU_COUNT}
        install: cmake --build . --target install -j${CPU_COUNT}

Pure python without using pip

build:
    phases:
        build: python setup.py build
        install: python setup.py install
        check: |
            my_installed_comand --help
            my_installed_comand --version
        imports:
            - my_package

Python using pip

build:
    phases:  # Install phase is done automatically
        check: |
            my_installed_comand --help
            my_installed_comand --version
        imports:
            - my_package

Package that uses CMake but that should have a non-standard feature (SOME_FLAG) enabled

build:
    env: |
        cmake_flags+=("-DSOME_FLAG=ON")
    # No need to define any phases, everything is done automatically

Providing environment changes

Currently if a package needs to provide an environment variable it has to be done by writing two bash+fish+csh+bat+… scripts that will be sourced during activation/deactivation. Many packages don't do this for all shells making them broken in some shells. It would be nicer to do this declaratively in the recipe metadata.

Licensing

Currently some packages that provide static libraries or are header-only require downstream packages to include their licenses.
This is difficult to do correctly in practice therefore it might be useful to have a license_exports field.
See this comment.

Development environments

A common request is to be able to use the same recipe file local development or to provide nightly builds.
Tools such as conda-devenv have been made to ease this use case but this duplicates the maintanence work.

Version constraints

Currently when using specifying a version constraint such as numpy >=1.15 it conflicts with the global conda-forge pinning and causes the latest version to be used. It would be nice to be able to minimise versions in host to maximise compatibility later on. Or at least to be able to combine the global pins with the recipe ones, i.e. in this case where the global pin is 1.14 on x86 and 1.16 on ARM/POWER, 1.15 should be chosen on x86 and 1.16 should be chosen on alternative architectures.

Proposal v2

This one combines some of the great ideas from v1, but tries to make more minimal changes to the
spec.

rules:

always generate valid yaml
never embed functional information in comments
three-step parsing: 1) context constants, 2) generate context vars, 3) the rest of the recipe
no jinja2 control-flow elements
versions always have quotes
no implicit outputs (either every output is in outputs or not)
test requirements are always listed in test.requirements
each output can only have sections that are the same as the top-level ones (except the context block)
top-level sections are replaced by output level ones on a per key basis down one level (i.e., dict.update in python)
selectors are implemented as dictionaries whose keys are evaled in python - see the comments below on how they are
interpreted

context:
  name: blah
  version: "1.2"
  build_num: 3
  major_ver: "{{ version.split('.')[0] }}" 

# indicates that we are using version 2 of the recipe format
recipe_format: 2

package:
  name: {{ name|lower }}
  version: "{{ version }}"  # we have to have quotes here or conda build errors

source:
  - git_url: https://github.com/blib/blah
    git_rev: master
    patches:
      - 001_blah_to_blih.patch
      - sel(win): 002_blib_to_blob.patch

build:
  number: 0
  script: "{{ python }} -m pip install . --no-deps --ignore-installed -vv"
  
requirements:
  host:
    - pip
    - python >=3.5
    # this is a selector
    # it can be as many keys as needed
    # when parsing, only one key is allowed to eval to true, if more than one does 
    # an error is raised - the value in the dict with the true key is inserted for the 
    # element
    # this construct can appear anywhere in the recipe
    - sel(win or osx): rust =1
      sel(linux): rust >=1
  run:
    - python
    - "{{ pin_subpackage('libxgboost', exact=True) }}"
   
test:
  requirements:
    - nose
  imports:
    - blah

about:
  home: "https://palletsprojects.com/p/click/"
  license: "BSD"
  license_family: "BSD"
  license_file: "LICENSE.txt"
  summary: "Composable command line interface toolkit"

extra:
  recipe-maintainers:
    - FrankSinatra
    - ElvisPresley

Here is one with outputs:

context:
  name: blah
  version: "1.2"
  build_num: 3
  major_ver: "{{ version.split('.')[0] }}" 

package:
  version: "{{ version }}"

build:
  number: 0
  binary_relocation: False
  script: "{{ python }} -m pip install . --no-deps --ignore-installed -vv"

requirements:
  host:
   - pip
   - python >=3.5
   - "sel(win or osx)": rust =1
     "sel(linux)": rust >=1
  run:
   - python
   - "{{ pin_subpackage('libxgboost', exact=True) }}"

test:
  requirements:
   - nose
  imports:
   - blah

# all of these outputs use the same requirements as above
# along with the test requirements, but not imports
outputs
  - package:
      name: out1
      # has the same version as above
    build:
      binary_relocation: True
      script: out1_build.sh
    test:
      imports:
        - blah.out2
      commands:
        - echo "this format is nice!"
  
  - package:
      name: ou2
      version: "{{ version }}.1"
    build:
      script: out2_build.sh
    test:
      imports:
        - blah.out2

about:
  home: https://palletsprojects.com/p/click
  license: BSD
  license_family: BSD
  license_file: LICENSE.txt
  summary: Composable command line interface toolkit

extra:
  recipe-maintainers:
    - "FrankSinatra"
    - "ElvisPresley"

Comments on v1 (from MRB)

This proposal is a very logical reaction to complexity. complexity arises for many reasons and i (mrb) don't fully understand where the complexity in the orignal conda recipe format came from. IIUIC selectors predate jinja2 in the recipes so part of it might simply be the accumulation of features over time. I worry that another part of the complexity is from actual needs that we have not fully enumerated or understood. Thus it seems like we should either try to make minimal changes or we need to take a census of sorts of recipes, both simple and complicated, to ensure that we can support everything that is needed.

the version being a string is something we can test in conda-build itself and error otherwise. this in and of itself is not a motivation for moving to TOML or the like.
the selectors here are implemented too narrowly given how our recipes use them in practice.
I really like the very structured context block. It solves issues around enforcing good jinja2 usage and clearly stating constants at the top.
We will still need jinja2 munging of stuff in the context block. i think a three stage parsing process makes sense. 1) get the context, 2) eval munged context with jinja2, 3) render the rest of the recipe with the context.
this spec doesn't address a bunch of other mildy confusing things (eg test commands w/ test scripts too, requires versus requirements in the test section, the py selector value, how build scripts are specified for outputs, …)
i'd like to see all jinja2 control flow commands deprecated.

Proposal v1

Switch to a more restricted markup language (TOML) and get rid of some of the custom features.

Rules:

Never produce invalid TOML
Never use comments as functional elements
2-step parsing: first extract "context" variables, then render the rest with Jinja

# initialize a variable context (has to come first)
[context]
name = "click"
version = "7.0"

# Now you can use the context variables
[package]
name = "{{ name|lower }}"
version = "{{ version }}"

[source]
url = "https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/{{ name }}-{{ version }}.tar.gz"
sha256 = "5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7"

[build]
number = 0
script = "{{ PYTHON }} -m pip install . --no-deps --ignore-installed -vv "

[requirements]
host = [
  "pip",
  "python >=3.5",
  "rust =1 [win or osx]",
  "rust >=1 [linux]"
]

run = [
  "python",
  "{{ pin_subpackage('libxgboost', exact=True) }}"
]

[test]
imports = [
  "click"
]

[about]
home = "https://palletsprojects.com/p/click/"
license = "BSD"
license_family = "BSD"
license_file = "LICENSE.txt"
summary = "Composable command line interface toolkit"
doc_url = "https://palletsprojects.com/p/click/"
description = """
This is a long description
of the package that is described by this 
meta.yaml file and it spans multiple lines.
"""

[extra]
recipe-maintainers = [
  "FrankSinatra", "ElvisPresley"
]

TODO

Clarify how outputs are handled – implicit first level output, but if [outputs] key specified, no implicit outputs?!
- MCS: https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#implicit-metapackages and https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#outputs-section

Dave Hirschfeld

2020/05/30 04:21:00

I would love to have better support for optional dependencies. Since my proposal falls foul of the 500 character limit here I posted it as a gist: https://gist.github.com/dhirschfeld/c1aa9fb1fb19ae4b26f2c9579bdfab8c (Edited)

Marcel Bargull

2020/06/09 07:33:16

> Version .. interpreted as floating point ... in YAML Yes and no. `conda-build` doesn't adhere to the [Core Schema](https://yaml.org/spec/1.2/spec.html#id2804923) completely and uses [Failsafe Schema](https://yaml.org/spec/1.2/spec.html#id2802346) for things like versions. Nonetheless, this can (and does) confuse users that are familiar with YAML because the core schema is the recommended YAML schema. (Edited)

Guest Simpson2020/06/19 14:48:43

Top-level metadata could then be understood to only be for build, and would remain present as a snapshot for any of the output steps.

You still want to be able to have different build envs per-output here IMHO. (Edited)

Wolf Vollprecht

2020/05/30 19:44:51

Dave, I added your comments as a block in the additional questions & ideas section. hope you don't mind. (Edited)

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.

Next gen conda recipe spec

more ideas (convo w/ MB)

Specification design process

Design parameters

Additional questions, ideas

Extending lists with lists

Old School env vars?

Dave Hirschfeld: Optional Dependencies

Output handling

Jinja functions

Alternative to bld.bat / build.sh

Providing environment changes

Licensing

Development environments

Version constraints

Proposal v2

Comments on v1 (from MRB)

Proposal v1

TODO

Alternative to `bld.bat` / `build.sh`