# MARVIN-RFC 01: TFX Integration
Info | Description
:------------- | :-------
**Author(s)** | Lucas Cardoso Silva ([cardosolucas61.lcs@gmail.com](mailto:cardosolucas61.lcs@gmail.com))
**Updated** | 2021-01-13
## Objective
* Facilitate the adoption of TensorFlow Extended (TFX) in Apache Marvin-AI, abstracting its orchestration as much as possible;
* Enable the DASFE design pattern, used by the Apache Marvin-AI platform, to work with TFX components.
To achieve these main objectives:
* Create a custom interactive context to serve as an interface between TFX and Apache Marvin-AI, adapting metadata persistence to the DASFE standard;
* Generate a Docker image combining the TFX base with the Marvin daemon base to enable the execution of TFX components;
* Use Apache Airflow to generate a DAG executor for DASFE, replacing TFX's AirflowDAGRunner with a simplified alternative with a higher degree of abstraction;
* Reuse as many TFX components as possible, generating code only for interfacing purposes.
## Motivation
* Apache Marvin-AI's main objective is to facilitate building Machine Learning solutions, using the DASFE pattern;
* TFX has excellent support for the efficient and distributed execution of different Machine Learning tasks, in a way that is transparent to the programmer, but it is somewhat harder to use than Apache Marvin-AI;
* Thus, the motivation is to combine what the two frameworks do best: the ease of programming of Apache Marvin-AI/DASFE and the execution support of TFX.
## Existing solutions
### Use TFX or Marvin-AI frameworks in a standalone way
TFX allows you to run components interactively, viewing the results of each component within a Jupyter Notebook or Colab. There is also the possibility of creating a component pipeline together with the persistence settings for metadata and parameters, building a DAG, and running it with a pipeline orchestrator (such as KFP or Apache Airflow).
In addition, conventional use of TFX separates interactive execution from orchestrated execution of the component DAG. The code structure and settings are completely different between the two. If the developer opts for interactive execution, there is not much difficulty, but setting up a correct orchestration requires considerable effort and knowledge, which is often not part of the data scientist's skill set.
On the other hand, when using Marvin standalone, there is the ease of DASFE, which eliminates the need to configure a DAG for orchestration. However, the full orchestration and execution capacity of TFX is not available, since Marvin's orchestration model is much simpler and fixed.
There is no obstacle to using TFX with Marvin nowadays, but doing so doesn't make much sense. It would be enough to follow the steps for using TFX standalone inside a Marvin engine, as if it were just a simple notebook. But then it would not be possible to use, for example, Marvin Toolbox facilities such as dryrun, model serving, the executor, and artifact versioning. As a consequence, all the potential for abstraction that DASFE offers and the simplicity of taking a prototype to production are wasted.
## Design proposal
### Marvin TFX Context
By including an interface between TFX and Marvin, and opting for standardizing the pipeline along the lines of DASFE over the extensibility of TFX, we can abstract a good part of the verbosity of the configuration required to build a DAG in TFX.
The proposed class is MarvinTfxContext; its structure will be entirely based on TFX's InteractiveContext class.
The MarvinTfxContext class runs TFX components in a directory and records these executions in MLMD (ML Metadata). In this respect, there is not much difference between the proposed class and the InteractiveContext offered by default in the TFX library.
```python=
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.utils.dsl_utils import external_input

context = MarvinTfxContext()

# _data_root is assumed to point at a directory containing CSV files.
example_gen = CsvExampleGen(input=external_input(_data_root))
context.run(example_gen)

# Compute statistics over the ingested examples.
statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
```
The difference is that, when running interactively, persistence is done in a temporary directory. However, when the solution is orchestrated by Marvin, a root directory with a unique identifier is generated to persist these artifacts. Metadata in interactive mode is also persisted in a SQLite database in a temporary directory, while the orchestrated solution will have a standard database configuration (SQLite persisted in Marvin's standard directory structure), which the user can adjust according to his needs, such as other DBMSs or file paths, as sketched below.
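As a minimal, hypothetical sketch of the orchestrated configuration (the `pipeline_root` and `metadata_connection_config` constructor parameters are assumptions of this RFC, mirroring TFX's InteractiveContext, and the paths are illustrative):
```python=
from tfx.orchestration import metadata

# Hypothetical: an explicit pipeline root and metadata connection replace
# the temporary defaults used in interactive mode.
context = MarvinTfxContext(
    pipeline_root='/opt/marvin/engines/my_engine/pipeline_root',
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        '/opt/marvin/engines/my_engine/metadata.db'))
```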
The user can export his code in the DASFE standard to an Apache Airflow DAG containing the steps Acquisitor, Training Preparator, Trainer, and Evaluator. The DAG code will be made available to the user, and it is up to him to modify it according to his needs or use it as generated. Orchestration with Marvin also has the benefit of allowing code outside the TFX components to be executed between steps in the pipeline.
The TFX integration in Marvin will also feature a class for accessing TFX artifacts directly from the metadata. This allows the unification of the interactive and orchestrated modes, since in the latter the variables are reset at each action, as the actions are executed in functions within different classes. We can see the usage of this solution in the code below.
```python=
# Fetch the latest upstream artifacts directly from the metadata store.
artifacts = TFXArtifacts(context)
transform = Transform(
    examples=artifacts.get_examples(),
    schema=artifacts.get_schema(),
    module_file=os.path.abspath(_taxi_transform_module_file))
context.run(transform)
```
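One possible shape for TFXArtifacts is sketched below, under the assumption that the context exposes its MLMD connection configuration (`context.metadata_connection_config` and the latest-by-id heuristic are illustrative, not part of any existing API); the real class would still need to wrap the returned artifacts in TFX channels before feeding them to components:
```python=
from ml_metadata.metadata_store import metadata_store

class TFXArtifacts:
    """Hypothetical accessor for the latest TFX artifacts in MLMD."""

    def __init__(self, context):
        # Assumption: the context exposes its MLMD connection config.
        self._store = metadata_store.MetadataStore(
            context.metadata_connection_config)

    def _latest(self, type_name):
        # Naive heuristic: the highest id is the most recent artifact.
        artifacts = self._store.get_artifacts_by_type(type_name)
        return max(artifacts, key=lambda artifact: artifact.id)

    def get_examples(self):
        return self._latest('Examples')

    def get_schema(self):
        return self._latest('Schema')
```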
### Daemon + TFX Image
TFX has a Docker image that is used by orchestrators to run its components independently, offering a standardized form of input and output.
Although this pattern is similar to the one used by the Marvin daemon, since it can run each action independently, passing the action as a parameter through a gRPC call, the way each image acts is different.
That said, we can produce a new daemon image based on the TFX base image, containing the dependencies of both. While TFX has its own execution mode for Docker (the **docker_component_launcher**), which consists of sending a component to the image via gRPC with standardized input and output, we will use the execution method provided in InteractiveContext (the **in_process_component_launcher**), which executes components as in-process calls, so the container is not limited to a single TFX component and can adapt to Marvin's actions.
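For reference, a sketch of pinning a TFX runner to in-process execution with the TFX 0.2x configuration API (how MarvinTfxContext would consume this configuration is an assumption of this RFC):
```python=
from tfx.orchestration.config import pipeline_config
from tfx.orchestration.launcher import in_process_component_launcher

# Restrict component launching to in-process execution, as opposed to
# the Docker-based launcher used by TFX's standard container flow.
config = pipeline_config.PipelineConfig(
    supported_launcher_classes=[
        in_process_component_launcher.InProcessComponentLauncher,
    ])
```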
Another possibility for the user is to build a Beam DAG within Marvin's actions. This makes it possible to run components that support parallel execution, such as ExampleValidator and Transform, in this way. The code for building the Beam DAG is below.
```python=
# Components that support Beam-based parallel execution.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

_taxi_transform_module_file = 'taxi_transform.py'
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_taxi_transform_module_file))

# Run both components as a single Beam DAG.
pipeline = [
    example_validator,
    transform,
]
context.run_beam_dag(pipeline)
```
During the setup phase of the image and the container, the user will be asked which base image he wants to use.
### Serving
TensorFlow has its own serving API (TensorFlow Serving); Marvin will only be responsible for facilitating its use by provisioning forms of deployment. When defining the actions, if the user chooses TFX, only batch-type actions need to be coded, since TensorFlow Serving does not require writing predictor code: only the model is passed, as the example below illustrates.
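To illustrate what the user is spared from writing, here is a minimal call to a model already deployed on TensorFlow Serving's standard REST endpoint (host, port, model name, and input shape are placeholders):
```python=
import requests

# TensorFlow Serving exposes /v1/models/<name>:predict on its REST port.
response = requests.post(
    'http://localhost:8501/v1/models/my_model:predict',
    json={'instances': [[1.0, 2.0, 3.0]]})
print(response.json()['predictions'])
```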
### Airflow Orchestration
Apache Airflow is a tool for building and monitoring workflows in a simplified and extensible way, allowing the generation of dynamic DAGs and the extension of its workflow components, known as Operators.
Some Operators are offered by the tool itself, but it also allows users to build their own. So here we have two alternatives:
* Use DockerOperator to run a Python script inside a Docker container. This container would run a dryrun for each Marvin batch action and share a volume where the TFX pipeline would be stored. In this case, a container is created for each action executed and finished right after the execution.
* Define a set of Marvin's own Operators that initialize the container once and finish it at the end of the execution.
Choosing either of these alternatives does not affect any quality attribute required by this RFC: initializing the daemon container multiple times adds an overhead of a few seconds per execution, which is irrelevant for the vast majority of use cases.
The DAG code generated by Marvin will be available for the user to adapt, for example with more complex tasks for data acquisition or customized artifact management, but it will also be ready to use as generated. A sketch of the first alternative above is shown below.
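As a hypothetical illustration of the DockerOperator alternative (the image name, volume path, and exact dryrun flags are assumptions, not the final generated code):
```python=
from datetime import datetime
from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

# One DockerOperator per DASFE batch action; each container runs a
# dryrun of a single action and shares the engine volume.
with DAG('marvin_engine_dag',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    actions = ['acquisitor', 'tpreparator', 'trainer', 'evaluator']
    tasks = [
        DockerOperator(
            task_id=action,
            image='marvin-daemon-tfx:latest',
            command='marvin engine-dryrun --action {}'.format(action),
            volumes=['/opt/marvin:/opt/marvin'])
        for action in actions
    ]
    # Chain the actions in the fixed DASFE order.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```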
## Discussions
To enable a discussion on how the design proposal of this RFC impacts the use of Apache Marvin-AI, some usage scenarios are presented below. For each scenario, the quality attributes that are expected to be impacted by the design proposal are presented. The following quality attributes must be assessed for this RFC:
* Interoperability: assess whether the resources of the two solutions (Apache Marvin-AI and TFX) can be used in their entirety, or at least at an acceptable level, so that the user can take advantage of the combined resources in the best possible way.
* Usability: generate a usable solution with a small or nonexistent learning curve for those who use the two solutions separately.
* Performance: solutions must present performance similar to their standalone use when integrated.
For each scenario, a trade-off analysis of each important architectural decision that makes up the design proposal is also presented:
* Marvin TFX Context: creation of the MarvinTfxContext class, as described above;
* Daemon + TFX Image: usage of the **in_process_component_launcher** mode;
* Generation of an Airflow DAG in the DASFE standard with some configuration possibilities.
The analysis consists of filling in the tables presented with:
* Risk: a point/component of the design considered a risk to the design decision.
* Non-risk: a point/component of the design that is not considered a risk and/or nullifies some risk considered in the scenario.
* Sensitivity points: an architectural decision involving an architectural component and/or a relationship between components that directly influences a quality attribute.
* Tradeoff points: an architectural decision that affects more than one quality attribute and is also a sensitivity point influencing more than one architectural component.
The idea is that new scenarios can be added to further promote the discussion of this RFC. Contributors are requested to:
* Analyze the scenarios;
* Fill in the tables;
* Propose new scenarios.
### Scenario 01: Basic usage
**Related quality attributes**: Interoperability, usability and performance.
**Scenario**:
The data scientist starts prototyping in the Marvin toolbox notebook, already inside the development container. Intending to use TFX, he prototypes a solution interactively inside the notebook and, when finished, organizes the code cells and markup according to the DASFE pattern.
After this first procedure, the data scientist closes the notebook and runs the **engine-dryrun** command to run the entire pipeline outside the prototyping environment. When successful, the user will have versioned TFX artifacts, a database with execution metadata, and a model ready for deployment.
Architectural decision | Risk | Non-risk | Sensitivity points | Tradeoff points
:------ | :------ | :------ | :------ | :------
Interactive development | | NR1 | | T1
Marvin-AI's Pipeline orchestration | | | | T1
**Reasoning**:
* **NR1**: The prototyping nature of the solution does not influence the performance of the tasks, since Marvin's entire architecture is designed to provide prototyping performance close to ideal.
* **T1**: Although interactive development allows the user to closely monitor all the visualizations and statistics generated by TFX components, more complex pipelines must be robustly orchestrated. Interpretability is sacrificed in favor of performance and continuous training.
### Scenario 02: Daemon + TFX Image
**Related quality attributes**: Interoperability, usability and performance.
**Scenario**:
The data scientist starts prototyping in the Marvin toolbox notebook, already inside the development container. Intending to use TFX, he prototypes a solution interactively inside the notebook and, when finished, organizes the code cells and markup according to the DASFE pattern.
At the beginning of the scenario, the user is already using the Marvin daemon, which means that, if he has configured it, prototyping may have been carried out in a remote environment. This type of execution makes it possible to use machines or environments more powerful than the user's workstation.
The structure of Marvin-AI's actions also allows code that is not part of the TFX components to be incorporated naturally into the DAG, something that built-in DAG runners do not enable quickly or easily. In this scenario, the user inserts a custom logging task that records some attributes of the components' execution without any problems.
Architectural decision | Risk | Non-risk | Sensitivity points | Tradeoff points
:------ | :------ | :------ | :------ | :------
Marvin-AI's actions usage | | NR2 | |
Remote development | | | | T2
Development environment ease of configuration | | | | T2
**Reasoning**:
* **NR2**: The usage of Marvin's actions allows the user to insert code outside the TFX components in the pipeline, something that is not trivial within the standard orchestrations offered by the TFX library.
* **T2**: Allowing remote development outside a fully managed platform such as AWS or GCP requires extra configuration that the skills of data scientists generally do not cover, but this type of tool gives the user more autonomy over the computational cost of the algorithms used.
### Scenario 03: Airflow DAG Generation in the DASFE standard with some configuration possibilities
**Related quality attributes**: Interoperability, usability.
**Scenario**:
The data scientist starts prototyping in the Marvin toolbox notebook, already inside the development container. Intending to use TFX, he prototypes a solution interactively inside the notebook and, when finished, organizes the code cells and markup according to the DASFE pattern.
Once the code to be executed is defined, the user identifies the DAG configuration file inside the engine folder. He could proceed with the implementation of the DAG and schedule it as intended; however, he decides to modify the steps of the pipeline, adding new ones to perform a complex data acquisition. In this execution, queries will be made against several data sources, followed by a join. Apache Airflow is the perfect tool for this task, due to its capacity and ease of building DAGs even for complex tasks and their specificities.
Architectural decision | Risk | Non-risk | Sensitivity points | Tradeoff points
:------ | :------ | :------ | :------ | :------
Extensibility of Marvin DAG | | NR3 | |
Scheduling | | | SP1 |
**Reasoning**:
* **NR3**: Extensibility does not affect any quality attribute assessed by the scenario. Usability and interoperability remain the same within Marvin's actions.
* **SP1**: There is no way to predict the time spent executing the DAG except by running a manual dryrun, which would not work if the training data grew with each execution. This factor can create uncertainty when scheduling high-cost pipelines.