Summary

We propose adding a new top-level field, extraSchemas, to the notebook JSON schema, to allow schema validation beyond the basic notebook format.

Motivation

Today, the metadata fields in the notebook format are black boxes that cannot be validated or reasoned about by general tools. Furthermore, there is no notion of ownership of these metadata fields among extensions; extensions could conflict over the same names, with little indication to the user that this is happening.

By adding an extraSchemas field, this JEP would permit third-party extension authors to define a formal specification for the kinds of data that they wish to store within a notebook. Additionally, by defining the schema of these data, it is possible for notebook consumers (e.g. frontends) to generate rich user interfaces that modify the notebook metadata.

Where organisations wish to impose additional restrictions on internally authored notebooks, e.g. banning keywords in notebook sources or requiring organisation-specific metadata keys, a per-notebook schema would make this possible.

Guide-level explanation

Jupyter Notebooks are already validated against JSON Schemas by the nbformat library. An example schema is the v4.5 schema. These schemas provide the ability to specify the concept of a "valid" notebook using a well-established, declarative format.

The extraSchemas field allows document authors to introduce additional constraints on the notebook contents, extending the base notebook schema. In addition to the root $schema, a notebook must satisfy every schema identified in its top-level extraSchemas field, i.e. the effective schema is given by allOf($schema, *extraSchemas). These extra schemas MUST satisfy a particular metaschema <TODO> that prohibits the addition of top-level properties to the cell subschema or to the top-level notebook schema. This restriction MAY be relaxed in a future JEP.
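The allOf composition described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name effective_schema is hypothetical, the two schemas are heavily trimmed stand-ins rather than the real notebook schema, and the URIs listed in extraSchemas are assumed to have already been resolved to schema documents (this JEP does not specify how resolution happens).

```python
def effective_schema(base_schema: dict, extra_schemas: list) -> dict:
    """Wrap the base schema and every resolved extra schema in one allOf.

    Hypothetical helper: a notebook is valid against the result only if it
    is valid against the base schema and against each extra schema.
    """
    return {"allOf": [base_schema, *extra_schemas]}


# Trimmed, hypothetical stand-in for the base notebook schema.
base = {"type": "object", "required": ["metadata", "cells"]}

# Assumed shape of one resolved extra schema: it only constrains metadata.
extension = {
    "properties": {
        "metadata": {"type": "object", "required": ["my-extension"]}
    }
}

combined = effective_schema(base, [extension])
```

The resulting dictionary is an ordinary JSON Schema, so it could be handed to any existing validator (for example jsonschema.validate(notebook, combined)) without special support for extraSchemas.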

Example of a valid notebook in 4.7 format:

{
    "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
    "extraSchemas": [
        "my-extension-schema-uri"
    ],
    "metadata": {
        "my-extension": { ... }
    },
    "cells": [ ... ]
}

Example of a schema referenced in extraSchemas ("my-extension-schema-uri"):

{
    "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
    "metadata": { 
        "type": "object",
        "required": ["my-extension"]
    }
}

Example of an invalid notebook in 4.7 format:

{
    "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
    "extraSchemas": [
        "my-extension-schema-uri"
    ],
    "metadata": {},
    "cells": [ ... ]
}

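The second notebook is invalid because its metadata lacks the my-extension key that the extra schema requires. The required list in any one schema need not be exhaustive; separate schemas may each require a different set of properties.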
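As a rough check of the two notebook examples, the sketch below validates trimmed versions of them with the jsonschema package (the JSON Schema library for Python). Several things here are assumptions rather than part of the proposal: BASE_SCHEMA is a tiny stand-in for the real notebook schema, the extension schema is assumed to nest its metadata constraint under the standard properties keyword, Draft 7 is chosen arbitrarily, and the elided notebook contents are replaced with empty placeholders.

```python
from jsonschema import Draft7Validator

# Hypothetical, heavily trimmed stand-in for the base notebook schema.
BASE_SCHEMA = {"type": "object", "required": ["metadata", "cells"]}

# Assumed form of the extension schema, with the metadata constraint
# nested under the standard "properties" keyword.
EXTENSION_SCHEMA = {
    "properties": {
        "metadata": {"type": "object", "required": ["my-extension"]}
    }
}

# The effective schema is the allOf of the base and the extra schemas.
validator = Draft7Validator({"allOf": [BASE_SCHEMA, EXTENSION_SCHEMA]})

valid_nb = {
    "extraSchemas": ["my-extension-schema-uri"],
    "metadata": {"my-extension": {}},  # placeholder contents
    "cells": [],
}
invalid_nb = {
    "extraSchemas": ["my-extension-schema-uri"],
    "metadata": {},  # "my-extension" is missing
    "cells": [],
}


def error_messages(notebook):
    """Collect human-readable validation errors for a notebook dict."""
    return [error.message for error in validator.iter_errors(notebook)]
```

Calling error_messages(valid_nb) yields an empty list, while error_messages(invalid_nb) reports the missing my-extension key.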

Reference-level explanation

If schemas conflict with each other (for example, two schemas define incompatible restrictions on a metadata key), the notebook validation will fail.
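A minimal sketch of such a conflict, again using the jsonschema package: two hypothetical extension schemas constrain the same metadata key (here called shared-key, an invented name) to incompatible types, so no value of that key can satisfy both. The helper failing_schemas is an assumption for illustration, not an nbformat API; it reports which extra schemas a notebook fails, which is one way a tool might point at the offending extension.

```python
from jsonschema import Draft7Validator

# Two hypothetical extension schemas with incompatible rules for one key.
SCHEMA_A = {
    "properties": {
        "metadata": {"properties": {"shared-key": {"type": "string"}}}
    }
}
SCHEMA_B = {
    "properties": {
        "metadata": {"properties": {"shared-key": {"type": "integer"}}}
    }
}
EXTRA_SCHEMAS = [SCHEMA_A, SCHEMA_B]


def failing_schemas(notebook, schemas):
    """Return the indices of the schemas that the notebook does not satisfy."""
    return [
        index
        for index, schema in enumerate(schemas)
        if not Draft7Validator(schema).is_valid(notebook)
    ]


with_string = {"metadata": {"shared-key": "text"}, "cells": []}
with_integer = {"metadata": {"shared-key": 3}, "cells": []}
without_key = {"metadata": {}, "cells": []}
```

Whichever type shared-key takes, one of the two schemas rejects the notebook, so validation against the allOf of both fails. Only a notebook that omits the key entirely passes, because neither schema marks it as required.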

Rationale and alternatives

Prior art

Unresolved questions

See Generalise cell types JEP

Questions from the workshop:

  1. Why do we use a list and allOf, instead of extraSchemas being a dictionary with each key scoping to a specific part of the document?
    • Multiple extensions should be able to add things to this list without conflicting.
    • We wanted to conform to the allOf semantics on document validation, which makes it easier to use existing validation tooling.
    • We wanted to preserve maximum flexibility moving forward for extra schemas possibly modifying various parts of the document. We'll start with extra schemas specifying metadata keys.
  2. What happens if the notebook does not validate against one of the extra schemas?
    • Right now, nbformat may refuse to open the notebook. Going forward, perhaps validation failure of the extra schemas should be a warning to the user rather than a hard refusal to open the notebook.
  3. How do we handle conflicts of extra schemas?
    • Part of this proposal is to surface conflicts that can currently cause silent corruption (for example, two plugins using the same metadata key). Because the extra schemas state the intended structure, we can inform users and plugin developers of these conflicts before they arise in practice.
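One way the conflict reporting from question 3 might work is to compare which metadata keys each extra schema mentions, before any notebook is validated. The sketch below is a heuristic under stated assumptions: this JEP does not define how a schema declares ownership of a key, so the code simply treats any key listed under properties.metadata.properties or properties.metadata.required as claimed. The function names and schema URIs are hypothetical.

```python
from collections import defaultdict


def claimed_metadata_keys(schema):
    """Return the notebook metadata keys that a schema mentions.

    Assumption: a key counts as claimed if it appears under
    properties.metadata.properties or properties.metadata.required.
    """
    metadata = schema.get("properties", {}).get("metadata", {})
    return set(metadata.get("properties", {})) | set(metadata.get("required", []))


def overlapping_claims(schemas_by_uri):
    """Map each metadata key claimed by more than one schema to its claimants."""
    owners = defaultdict(list)
    for uri, schema in schemas_by_uri.items():
        for key in sorted(claimed_metadata_keys(schema)):
            owners[key].append(uri)
    return {key: uris for key, uris in owners.items() if len(uris) > 1}


# Hypothetical resolved extra schemas, keyed by their URI.
schemas = {
    "ext-a-uri": {
        "properties": {
            "metadata": {"properties": {"shared-key": {"type": "string"}}}
        }
    },
    "ext-b-uri": {
        "properties": {
            "metadata": {
                "properties": {"shared-key": {"type": "integer"}},
                "required": ["only-b"],
            }
        }
    },
}

conflicts = overlapping_claims(schemas)
```

Here conflicts maps shared-key to both URIs. An overlap is not necessarily an incompatibility, so a tool would more plausibly present it to the user as a warning than treat it as an error.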