
Digital Twin Proposals/Discussion

tags: digital-twin-working-group

Background

We have quite a few digital twins at BlockScience, and Danilo pioneered the format for the cadCAD version of these. The steps of the execution logic are:

  1. Prepare data
    • $\text{GetData}(S^m_{t<\tau}, P^s)$
  2. Backtest Model in cadCAD
    • $\text{Backtest}(S^m_{t<\tau}, P^s, M^b) \rightarrow S_{t<\tau}$
  3. Fit Stochastic Model
    • FitSignal
  4. Extrapolate Exogenous Signals
    • ExtrapolateSignal
  5. Run extrapolations in cadCAD
    • ExtrapolateState
  6. Create HTML reports from Jupyter notebooks with Papermill
    • Report
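
A minimal sketch of how these six steps might chain together in Python is shown below; all of the function names and signatures are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the six execution steps; every name and signature here
# is an illustrative assumption, not the actual implementation.

def get_data():                                    # 1. Prepare data
    return {}, {}                                  # placeholder (state, parameters)

def backtest(state, params, model=None):           # 2. Backtest model in cadCAD
    return {}

def fit_signal(state):                             # 3. Fit stochastic model
    return {}

def extrapolate_signal(signal_fit):                # 4. Extrapolate exogenous signals
    return {}

def extrapolate_state(state, params, signals):     # 5. Run extrapolations in cadCAD
    return {}

def report(backtest_results, extrapolated_state):  # 6. Build HTML reports via Papermill
    pass

def execute():
    state, params = get_data()
    backtest_results = backtest(state, params)
    signal_fit = fit_signal(state)
    signals = extrapolate_signal(signal_fit)
    extrapolated = extrapolate_state(state, params, signals)
    report(backtest_results, extrapolated)
```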

Terminology

  • $S^m_t$: Mechanism state at time $t$
  • $S^b_t$: Behavioural state at time $t$
  • $S: S^b + S^m$ (the full state combines the behavioural and mechanism states)
  • $P^m$: Mechanism parameters
  • $P^b$: Behavioural parameters
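
One way this split could be encoded in code is with data classes; the sketch below assumes the class and field names, which are not taken from any existing model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Sketch of the state/parameter split; all names are assumptions.

@dataclass
class MechanismState:       # S^m_t
    values: Dict[str, Any] = field(default_factory=dict)

@dataclass
class BehaviouralState:     # S^b_t
    values: Dict[str, Any] = field(default_factory=dict)

@dataclass
class State:                # S = S^b + S^m
    mechanism: MechanismState
    behavioural: BehaviouralState

@dataclass
class Parameters:
    mechanism: Dict[str, Any] = field(default_factory=dict)    # P^m
    behavioural: Dict[str, Any] = field(default_factory=dict)  # P^b
```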

Summary from @danlessa

  • Define a hierarchy that categorizes and allows for documenting and testing each part of the data ETL pipeline.
  • Use Notebooks with documentation and examples as reference points for implementation of specific features on the model.
  • Use objects/structs to encapsulate the digital twin and its associated functions in terms of their signatures
  • Use unit, data, and integration tests

Purpose of Discussions

It may be very helpful in the long run to come to a consensus on the format of digital twins. There may also be a good way to integrate it within cadCAD or cadCAD tools for ease of use by other users. Standardized digital twins lower the learning curve for other people.

In addition, a standardized model can make it much easier to build out things like the integration testing that Sean/Emanuel are looking to work on for the Filecoin digital twin.

Topic 1: Data Functionality Formatting

Summary: Defining a general best practice type of data hierarchy can lead to cleaner code and support easier use of both documentation + testing.

Using the Filecoin Digital Twin as an example, the hierarchy used there was the following:

  1. Orchestration Data Functions: These serve primarily as wrappers that aggregate data pulls and data processing, and they are the interface to what is pulled and downloaded in step 1. To the extent possible, data should be grouped together, but this is not always possible. For example, five data pull/data processing functions are grouped into the main data pull for the digital twin, but there is also a separate aggregate data function that pulls a reward schedule, since that data is two-dimensional rather than a time series.
  2. Utility Functions: These are functions shared across the other data functions. An example is a function that truncates the datetime to the correct frequency based on the digital twin's epoch length parameter.
  3. Data Pull Functions: Functions built purely around pulling the data, with no transformations. In the Filecoin digital twin these primarily consist of building SQL queries and executing them.
  4. Data Processing Functions: For every data pull function there is a data processing function which computes any additional fields, modifies data, etc.
  5. Data Formatting Functions: Any functions that run after the data is pulled and saved to convert it into the format the digital twin ingests (since we use data classes, we want a class for each state rather than a pandas DataFrame).

This might seem excessive, but structuring the code this way lets us document each part of the pipeline and build tests that pinpoint whether our data is incorrect because of the data pull (a field could be wrong), the data processing (dropping null values might eject data that should be kept), or the aggregation (a left join against a dataset with missing dates will drop data from the other query).
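
As a rough illustration of how these five layers might compose, here is a skeletal example; the function names, fields, and data source are hypothetical.

```python
import pandas as pd

# Skeletal example of the data-function hierarchy; all names are hypothetical.

# 2. Utility function: shared helper used by several data functions.
def truncate_to_epoch(df: pd.DataFrame, epoch_length: str = "1D") -> pd.DataFrame:
    out = df.copy()
    out["timestamp"] = out["timestamp"].dt.floor(epoch_length)
    return out

# 3. Data pull function: query only, no transformations.
def pull_block_rewards() -> pd.DataFrame:
    # In practice this would build and execute a SQL query.
    return pd.DataFrame({"timestamp": pd.to_datetime([]), "reward": []})

# 4. Data processing function: derived fields, cleaning.
def process_block_rewards(raw: pd.DataFrame, epoch_length: str) -> pd.DataFrame:
    processed = truncate_to_epoch(raw, epoch_length)
    return processed.dropna(subset=["reward"])

# 1. Orchestration data function: aggregates the pulls and processing.
def get_main_data(epoch_length: str = "1D") -> pd.DataFrame:
    rewards = process_block_rewards(pull_block_rewards(), epoch_length)
    # Additional pulls would be processed and joined here on the truncated timestamp.
    return rewards

# 5. Data formatting function: convert the saved data into the twin's state objects.
def format_for_twin(df: pd.DataFrame) -> list:
    return df.to_dict(orient="records")
```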

Topic 2: Model Specific Documentation

Summary: Defining a set of Jupyter notebook documentation guides to build, with examples/templates, would set a known expectation of what to build. A tentative set of guides would be: data, PSUBs, and inputs/parameters. Every digital twin would get these as templates to be built out as the model is iterated on.

This can be better defined after initial conversations, but examples for data/PSUBs can be found within the Filecoin digital twin. This documentation would be live in the sense that it describes what is going on, renders the source code for the functionality using inspect, and then includes live examples of using each piece so that a user can see how the underlying functionality works.
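
For instance, a documentation notebook cell might render a function's source with `inspect` and then call it on a small sample so the docs never drift from the code; the function shown here is hypothetical.

```python
import inspect

# Hypothetical data processing function being documented.
def process_block_rewards(raw, epoch_length="1D"):
    """Truncate timestamps to the epoch length and drop null rewards."""
    out = raw.copy()
    out["timestamp"] = out["timestamp"].dt.floor(epoch_length)
    return out.dropna(subset=["reward"])

# Render the implementation inline in the notebook.
print(inspect.getsource(process_block_rewards))

# The next cell would run the function on a small sample DataFrame so the
# reader sees real inputs and outputs alongside the source.
```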

Topic 3: Converting to object-oriented

Summary: An object-oriented approach could allow for clear designation of what functionality should be passed into the model. The caveat is it might reduce flexibility.

The idea behind a class for the digital twin is that either an instance could be created or another class could inherit from the main class; functionality would then be filled in, and different pieces of the digital twin could be tested individually.

For example, one input might be the data function(s) that pull in the data for the model. Another could be the mapping of notebooks to build into HTML. With that kind of setup, the components would be self-contained, and the execute function of the class would call steps 1-6, which could also easily be called individually when working through different pieces of the model.
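
A rough sketch of such a class is shown below; the class name, attributes, and method names are assumptions.

```python
from typing import Any, Callable, Dict, List

# Rough sketch of a digital twin base class; names and attributes are assumptions.
class DigitalTwin:
    def __init__(self,
                 data_functions: List[Callable[[], Any]],
                 notebook_map: Dict[str, str]):
        # Data functions that pull in the model's data (step 1).
        self.data_functions = data_functions
        # Mapping of notebook paths to output HTML paths for reporting (step 6).
        self.notebook_map = notebook_map

    # Each step is its own method so it can be run and tested individually.
    def prepare_data(self) -> Dict[str, Any]:
        return {fn.__name__: fn() for fn in self.data_functions}

    def backtest(self, data): ...
    def fit_signals(self, data): ...
    def extrapolate_signals(self, fits): ...
    def extrapolate_state(self, data, signals): ...
    def report(self): ...

    def execute(self):
        data = self.prepare_data()
        self.backtest(data)
        fits = self.fit_signals(data)
        signals = self.extrapolate_signals(fits)
        self.extrapolate_state(data, signals)
        self.report()
```

A model-specific twin could then subclass this and override individual step methods, or simply be instantiated with its own data functions and notebook mapping.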

Topic 4: Testing

Summary: Defining a set of tests we want built for models, and then the process to set up integration testing (testing that only approves branches that pass), can help avoid breaking models.

There is active work being done on this by Sean/Emanuel around building out integration testing for the Filecoin digital twin, which they can speak to. The basic idea is to have data functionality tests on both data pulls and data processing, as well as tests for the mechanisms used in the PSUBs.
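
As a rough illustration, tests at each level might look something like the following pytest sketch; the functions under test are inline placeholders standing in for the real (hypothetical) data functions and mechanisms.

```python
import pandas as pd

# Placeholder implementations standing in for real data functions and
# mechanisms; in practice these would be imported from the twin's modules.

def pull_block_rewards() -> pd.DataFrame:
    return pd.DataFrame({"timestamp": pd.to_datetime(["2023-01-01"]), "reward": [1.0]})

def process_block_rewards(raw: pd.DataFrame, epoch_length: str = "1D") -> pd.DataFrame:
    out = raw.copy()
    out["timestamp"] = out["timestamp"].dt.floor(epoch_length)
    return out.dropna(subset=["reward"])

def p_mint_rewards(params, substep, state_history, previous_state) -> dict:
    return {"minted": max(previous_state.get("supply", 0), 0)}

# Data pull test: the query returns the expected fields.
def test_pull_block_rewards_schema():
    assert {"timestamp", "reward"} <= set(pull_block_rewards().columns)

# Data processing test: cleaning does not silently drop valid rows.
def test_process_block_rewards_keeps_rows():
    raw = pull_block_rewards()
    assert len(process_block_rewards(raw)) == len(raw)

# Mechanism test: a PSUB policy behaves as expected on a known state.
def test_mint_rewards_is_nonnegative():
    signal = p_mint_rewards({}, 0, [], {"supply": 10})
    assert signal["minted"] >= 0
```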