---
title: Add `extraSchemas` to notebook format
authors: Nick, Jason, Angus, Filip
issue-number: <pre-proposal-issue-number>
pr-number: <proposal-pull-request-number>
date-started: 2023-03-01
---
# Summary
<!-- Summarize proposal in 1 paragraph -->
We propose adding a new top-level field, `extraSchemas`, to the notebook JSON schema, to allow further schema validation beyond the basic notebook format.
# Motivation
<!-- Why are we doing this? What use cases does it support? What is the expected outcome? -->
Today, the metadata fields in the notebook format are black boxes that cannot be validated or reasoned about by general tools. Furthermore, there is no notion of ownership over these metadata fields between extensions; extensions could conceivably conflict over the same names, with little indication to the user that this takes place.
By adding an `extraSchemas` field, this JEP would permit third-party extension authors to define a formal specification for the kinds of data that they wish to store within a notebook. Additionally, by defining the schema of these data, it is possible for notebook consumers (e.g. frontends) to generate rich user interfaces that modify the notebook metadata.
A per-notebook schema would also allow organisations to impose additional restrictions on internally authored notebooks, e.g. banning keywords in notebook sources or requiring the presence of organisation-specific metadata keys.
# Guide-level explanation
Jupyter Notebooks are already validated against [JSON Schemas](https://json-schema.org/) by the [nbformat]() library. An example schema is the [v4.5 schema](https://raw.githubusercontent.com/jupyter/nbformat/main/nbformat/v4/nbformat.v4.5.schema.json). These schemas provide the ability to specify the concept of a "valid" notebook using a well-established, declarative format.
The `extraSchemas` field allows document authors to introduce additional constraints on the notebook contents, extending the base notebook schema. In addition to the root `$schema`, a notebook must satisfy each schema identified in its top-level `extraSchemas` field, i.e. the true _schema_ is given by `allOf(schema, *schemas)`. These extra schemas MUST satisfy a particular metaschema `<TODO>` that prohibits adding top-level properties to either the `cell` subschema or the top-level notebook schema. This restriction MAY be relaxed in a future JEP.
Example of _valid_ notebook in 4.7 format:
```
{
  "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
  "extraSchemas": [
    "my-extension-schema-uri"
  ],
  "metadata": {
    "my-extension": { ... }
  },
  "cells": [ ... ]
}
```
Example of schema referenced in `extraSchemas` (`"my-extension-schema-uri"`):
```
{
  "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
  "metadata": {
    "type": "object",
    "required": ["my-extension"]
  }
}
```
Example of _invalid_ notebook in 4.7 format:
```
{
  "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
  "extraSchemas": [
    "my-extension-schema-uri"
  ],
  "metadata": {},
  "cells": [ ... ]
}
```
In the above example, the `required` property need not be exhaustive; separate schemas may each require a different set of properties.
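
The `allOf(schema, *schemas)` semantics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not nbformat's implementation: `validate` is a hypothetical toy validator that understands only the `type: object`, `required`, `properties`, and `allOf` keywords; `BASE_SCHEMA` is a heavily simplified stand-in for the notebook schema; `SCHEMA_REGISTRY` is a hard-coded stand-in for resolving `extraSchemas` URIs; and the extension schema is assumed to nest its `metadata` constraint under `properties`, as standard JSON Schema requires.

```python
def validate(instance, schema, path="$"):
    """Toy validator: return a list of error strings (empty means valid)."""
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return [f"{path}: expected an object"]
    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    # allOf: the instance must satisfy every listed subschema.
    for subschema in schema.get("allOf", []):
        errors.extend(validate(instance, subschema, path))
    return errors


# Simplified stand-in for the base notebook schema (assumption).
BASE_SCHEMA = {"type": "object", "required": ["metadata", "cells"]}

# Hypothetical registry mapping extraSchemas URIs to schema documents.
SCHEMA_REGISTRY = {
    "my-extension-schema-uri": {
        "properties": {
            "metadata": {"type": "object", "required": ["my-extension"]}
        }
    }
}


def effective_schema(notebook):
    """Combine the base schema with every schema listed in extraSchemas."""
    extras = [SCHEMA_REGISTRY[uri] for uri in notebook.get("extraSchemas", [])]
    return {"allOf": [BASE_SCHEMA, *extras]}


valid_nb = {
    "extraSchemas": ["my-extension-schema-uri"],
    "metadata": {"my-extension": {}},
    "cells": [],
}
invalid_nb = {
    "extraSchemas": ["my-extension-schema-uri"],
    "metadata": {},
    "cells": [],
}

valid_errors = validate(valid_nb, effective_schema(valid_nb))
invalid_errors = validate(invalid_nb, effective_schema(invalid_nb))
print(valid_errors)    # []
print(invalid_errors)  # ["$.metadata: missing required property 'my-extension'"]
```

In practice the combined schema would be handed to a full JSON Schema validator rather than a hand-written one; the point here is only that the extra schemas are appended to the base schema under a single `allOf`.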
<!--Explain the proposal as if it was already implemented and you were
explaining it to another community member. That generally means:
- Introducing new named concepts.
- Adding examples for how this proposal affects people's experience.
- Explaining how others should *think* about the feature, and how it should impact the experience using Jupyter tools. It should explain the impact as concretely as possible.
- If applicable, provide sample error messages, deprecation warnings, or migration guidance.
- If applicable, describe the differences between teaching this to existing Jupyter members and new Jupyter members.
For implementation-oriented JEPs, this section should focus on how other Jupyter
developers should think about the change, and give examples of its concrete impact. For policy JEPs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms.
-->
# Reference-level explanation
If schemas conflict with each other (for example, two schemas define incompatible restrictions on a metadata key), the notebook validation will fail.
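
As a concrete illustration of such a conflict, suppose two hypothetical extensions both constrain the metadata key `shared-key`, one requiring a string and the other an object. No value satisfies both, so any notebook that sets the key and lists both schemas fails validation. The sketch below uses a toy validator (`check`, supporting only `type`, `properties`, and `allOf`) rather than a real JSON Schema library; the schema names and the key are invented for illustration.

```python
# JSON Schema type names mapped to Python types (subset; toy validator only).
JSON_TYPES = {"object": dict, "string": str, "array": list}


def check(instance, schema, path="$"):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, JSON_TYPES[expected]):
        errors.append(f"{path}: expected type '{expected}'")
    if isinstance(instance, dict):
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(check(instance[key], sub, f"{path}.{key}"))
    for sub in schema.get("allOf", []):
        errors.extend(check(instance, sub, path))
    return errors


# Two hypothetical extension schemas with incompatible demands on one key.
SCHEMA_A = {"properties": {"metadata": {"properties": {"shared-key": {"type": "string"}}}}}
SCHEMA_B = {"properties": {"metadata": {"properties": {"shared-key": {"type": "object"}}}}}
combined = {"allOf": [SCHEMA_A, SCHEMA_B]}

results = []
for value in ("text", {"nested": True}):
    notebook = {"metadata": {"shared-key": value}, "cells": []}
    results.append(check(notebook, combined))
    print(repr(value), "->", results[-1])
```

A notebook that omits `shared-key` would still validate in this sketch, since neither schema marks the key as required; the conflict surfaces only once the key is actually used.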
<!-- This is the technical portion of the JEP. Explain the design in
sufficient detail that:
- Its interaction with other features is clear.
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work. -->
# Rationale and alternatives
<!-- - Why is this choice the best in the space of possible designs?
- What other designs have been considered and what is the rationale for not choosing them?
- What is the impact of not doing this? -->
# Prior art
<!-- Discuss prior art, both the good and the bad, in relation to this proposal.
A few examples of what this can include are:
- Does this feature exist in other tools or ecosystems, and what experience have their community had?
- For community proposals: Is this done by some other community and what were their experiences with it?
- For other teams: What lessons can we learn from what other communities have done here?
- Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.
This section is intended to encourage you as an author to think about the lessons from other languages, provide readers of your JEP with a fuller picture.
If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if it is an adaptation from other languages. -->
# Unresolved questions
<!-- - What parts of the design do you expect to resolve through the JEP process before this gets merged?
- What related issues do you consider out of scope for this JEP that could be addressed in the future independently of the solution that comes out of this JEP? -->
# Future possibilities
<!-- Think about what the natural extension and evolution of your proposal would
be and how it would affect the Jupyter community at-large. Try to use this section as a tool to more fully consider all possible
interactions with the project and language in your proposal.
Also consider how this all fits into the roadmap for the project
and of the relevant sub-team.
This is also a good place to "dump ideas", if they are out of scope for the
JEP you are writing but otherwise related.
If you have tried and cannot think of any future possibilities,
you may simply state that you cannot think of anything.
Note that having something written down in the future-possibilities section
is not a reason to accept the current or a future JEP; such notes should be
in the section on motivation or rationale in this or subsequent JEPs. -->
See [`Generalise cell types` JEP](https://hackmd.io/EmDM0wm1Tli3VVW7KrTwJQ?both)
Questions from the workshop:
1. Why do we use a list and `allOf`, instead of `extraSchemas` being a dictionary with each key scoping to a specific part of the document?
- Multiple extensions should be able to add things to this list without conflicting.
- We wanted to conform to the `allOf` semantics on document validation, which makes it easier to use existing validation tooling.
- We wanted to preserve maximum flexibility moving forward for extra schemas possibly modifying various parts of the document. We'll start with extra schemas specifying metadata keys.
2. What happens if the notebook does not validate against one of the extra schemas?
- Right now, nbformat may refuse to open the notebook. Going forward, perhaps a validation failure against the extra schemas should be a warning to the user rather than a hard refusal to open the notebook.
3. How do we handle conflicts of extra schemas?
- Part of this proposal is to surface conflicts that today can cause silent corruption (for example, two plugins using the same metadata key). Because the extra schemas declare the intended structure, we can inform users and plugin developers of these conflicts _before_ they cause problems in practice.
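
One way such early conflict reporting might work is to inspect the extra schemas themselves, before any notebook is validated, and flag metadata keys that more than one schema declares. The sketch below is an assumption-laden illustration: `claimed_metadata_keys` and `find_overlapping_claims` are hypothetical helpers, the schema URIs are invented, and schemas are assumed to declare their keys under `properties.metadata`. An overlap is only a potential conflict, so it is reported as a warning rather than an error.

```python
import warnings
from collections import defaultdict


def claimed_metadata_keys(schema):
    """Metadata keys a schema declares under properties.metadata (assumed layout)."""
    metadata = schema.get("properties", {}).get("metadata", {})
    return set(metadata.get("properties", {})) | set(metadata.get("required", []))


def find_overlapping_claims(schemas_by_uri):
    """Map each metadata key claimed by more than one schema to the claiming URIs."""
    claims = defaultdict(list)
    for uri, schema in schemas_by_uri.items():
        for key in claimed_metadata_keys(schema):
            claims[key].append(uri)
    return {key: sorted(uris) for key, uris in claims.items() if len(uris) > 1}


# Hypothetical extension schemas; two of them declare the same "tags" key.
schemas = {
    "ext-a-schema-uri": {
        "properties": {"metadata": {"properties": {"tags": {"type": "array"}}}}
    },
    "ext-b-schema-uri": {
        "properties": {
            "metadata": {
                "required": ["tags"],
                "properties": {"tags": {"type": "object"}},
            }
        }
    },
    "ext-c-schema-uri": {
        "properties": {"metadata": {"properties": {"ext-c": {"type": "object"}}}}
    },
}

overlaps = find_overlapping_claims(schemas)
for key, uris in overlaps.items():
    warnings.warn(
        f"metadata key '{key}' is declared by multiple schemas: {', '.join(uris)}"
    )
```

Reporting overlaps as warnings matches the suggestion in question 2 that problems with extra schemas need not block opening the notebook.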