# Migration requirements

## Why this document

This document summarizes my experience with the migration data structure in PartSeg and may be treated as an introduction to bringing such a mechanism to the napari environment. If there is interest, I propose to put it in a separate package, called perhaps `napari-migration-engine` (aka `nme`) or `napari-storage-engine` (aka `nse`).

Part of the inspiration for this document is the Django data model and its migrations: https://docs.djangoproject.com/en/4.0/topics/migrations/

## PartSeg story

PartSeg is software for reproducible ROI extraction (segmentation) from masked 3D multichannel images and for measuring various parameters of the resulting ROI. For reproducibility, it must be possible to save the ROI extraction pipeline between sessions. The pipeline is a sequence of high-level steps needed to calculate the ROI.

To preserve the ROI extraction pipeline between sessions, allow copying it to another machine, and keep the possibility of editing it from the interface later, the decision was made to serialize it to JSON. This gives a more experienced user the option of inspecting the file manually. The choice of JSON over YAML or TOML is based only on the author's experience with the former format in Python.

## The problem

During the development of PartSeg, a few changes broke backward compatibility. Before a basic migration mechanism was introduced, loading an outdated pipeline could end with an error, or even worse: the pipeline could load without an exception, but the result of the calculation could differ from the one produced by the old version.

## The solution

Migration is a mechanism that checks whether data, after loading and before deserialization, is outdated and, if so, applies a transformation to the intermediate form of the data. The best option is to define migrations per class, rather than migrating high-level structures like the whole pipeline in one step, to keep the code readable.
Sample data and a migration for a field rename may look like this.

Old version:

```python
import pydantic


class SampleClass(pydantic.BaseModel):
    some_non_descriptive_name: int
```

New version:

```python
import pydantic


class SampleClass(pydantic.BaseModel):
    some_descriptive_name: int
```

Migration function:

```python
import typing


def migrate_sample_class(data: typing.Dict[str, typing.Any]) -> typing.Dict[str, typing.Any]:
    data["some_descriptive_name"] = data.pop("some_non_descriptive_name")
    return data
```

## Details

These are the situations which I have met and which need to be supported by the migration mechanism:

1. **Adding a new field** which does not have a default value, or whose default value differs from the one needed to preserve the behavior of the older version. A PartSeg example is adding a convex hull to the nucleus segmentation step. It was enabled by default, but old workflows should have this option disabled.

2. **Field rename/removal/type update**, shown as the example in the previous section. In PartSeg, while a paper was being written, the decision was made to update some of the names for better consistency with other tools. After the names in the interface were updated, the variable names in the code were updated as well. A code example that reduces the number of attributes:

    Old version:

    ```python
    import pydantic


    class SampleClass(pydantic.BaseModel):
        field1: int
        pa_field1: int
        pa_field2: float
        pb_field1: int
        pb_field2: float
    ```

    New version:

    ```python
    import pydantic


    class SubFieldClass(pydantic.BaseModel):
        field1: int
        field2: float


    class SampleClass(pydantic.BaseModel):
        field1: int
        pa: SubFieldClass
        pb: SubFieldClass
    ```

    Migration:

    ```python
    import typing


    def migrate_fun(data: typing.Dict[str, typing.Any]) -> typing.Dict[str, typing.Any]:
        data["pa"] = SubFieldClass(field1=data.pop("pa_field1"), field2=data.pop("pa_field2"))
        data["pb"] = SubFieldClass(field1=data.pop("pb_field1"), field2=data.pop("pb_field2"))
        return data
    ```

3. **Moving a class to another subpackage or another package.**
During the development of PartSeg, there were a few situations when a big module was transformed into a package. There was also a decision to split PartSeg into multiple packages; in that case the whole `PartSeg.utils` package was transformed into the `PartSegCore` package.

## Where migrations should be stored

The migration should not be part of the class implementation, to keep the class readable. On the other hand, I see no profit in keeping it in a separate module called `migrations`, so I am fine with keeping it in the same module as the class. Storing it in the same module requires referring to it directly in the registration step, but it could be more readable for less experienced plugin creators.

## Other questions

1. Should it be possible to register migrations outside the package that implements them? On the one hand, this could simplify code reuse; on the other, it increases the probability of double registration from two packages.
2. How should classes be versioned: with a custom version number, or based on the package version?

## Additional data stored when saving to disc

1. For the package from which the serialized objects come, information about the version of the package used may be stored (this should simplify debugging and allow better error messages).
2. For a pydantic-based class, the schema may be saved as well?

## Current idea of PartSeg migrations

This document is partially the result of my work transforming the current messy migration system into something clean. This job is here (not finished yet): https://github.com/4DNucleome/PartSeg/pull/462

The current idea is that a migration is a tuple of a version number and a function which gets the dict of constructor kwargs and returns a modified dict. A class that needs migrations is then registered in a global register using a decorator that accepts the list of migrations and the list of old names. The data serialized to disc contains, in addition to the kwargs, the path to the class and the version number (defaulting to `"0.0.0"`).
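The serialized form described above might look roughly like this. This is a sketch only; the `__class__` and `__version__` key names are my assumption for illustration, not the actual PartSeg format:

```python
import json

# Besides the constructor kwargs, the payload carries the import path of the
# class and a version number (defaulting to "0.0.0" for files written before
# versioning was introduced). Key names here are illustrative assumptions.
payload = {
    "__class__": "example_package.example_module.SampleClass",
    "__version__": "0.0.0",
    "some_non_descriptive_name": 5,
}

serialized = json.dumps(payload, indent=2)

# The file stays human-readable JSON, so a round trip is trivial.
restored = json.loads(serialized)
assert restored["__version__"] == "0.0.0"
```

Keeping the class path and version next to the kwargs means the loader can decide which migrations to run before it ever tries to construct the object.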
During the deserialization process, the list of migrations is selected based on the version stored in the file and then applied in version order. Maybe it is a good starting point.
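The whole flow above could be sketched as follows. Names such as `REGISTRY`, `register_class`, and `deserialize` are hypothetical and do not reflect the actual PartSeg API; this is a minimal illustration of the idea, not its implementation:

```python
import typing

# Hypothetical global register: maps a class path to the class and its
# sorted list of (version, migration function) pairs.
REGISTRY: typing.Dict[str, typing.Tuple[type, list]] = {}


def _parse(version: str) -> typing.Tuple[int, ...]:
    # Minimal version parsing; a real implementation would use packaging.version.
    return tuple(int(part) for part in version.split("."))


def register_class(version: str, migrations=(), old_paths=()):
    """Decorator registering a class with its migrations and old names."""

    def decorator(cls):
        path = f"{cls.__module__}.{cls.__qualname__}"
        entry = (cls, sorted(migrations, key=lambda m: _parse(m[0])))
        REGISTRY[path] = entry
        for old_path in old_paths:  # files written before a move/rename still load
            REGISTRY[old_path] = entry
        return cls

    return decorator


def deserialize(payload: typing.Dict[str, typing.Any]):
    data = dict(payload)
    cls, migrations = REGISTRY[data.pop("__class__")]
    stored = _parse(data.pop("__version__", "0.0.0"))
    # Apply, in version order, only the migrations newer than the stored version.
    for version, migrate in migrations:
        if _parse(version) > stored:
            data = migrate(data)
    return cls(**data)


@register_class(
    version="0.0.1",
    migrations=[
        ("0.0.1", lambda d: {"some_descriptive_name": d.pop("some_non_descriptive_name"), **d}),
    ],
)
class SampleClass:
    def __init__(self, some_descriptive_name: int):
        self.some_descriptive_name = some_descriptive_name


obj = deserialize({
    "__class__": f"{SampleClass.__module__}.{SampleClass.__qualname__}",
    "__version__": "0.0.0",
    "some_non_descriptive_name": 7,
})
assert obj.some_descriptive_name == 7
```

Sorting the migrations by parsed version and filtering against the stored version is what lets data from any past release catch up to the current class definition in one pass.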