
Submarine Roadmap

Submarine SDK Overhaul

Local cache (0.8.0) [P1]

  • Download S3/HDFS data to the local file system before training
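The download-before-training step above can be sketched as a stdlib-only download-on-miss cache; the `fetch` helper and cache layout are illustrative assumptions, not the actual Submarine implementation, and the `download` callback would in practice be an fsspec-based transfer.

```python
import os

def fetch(remote_path: str, download, cache_dir: str = "/tmp/submarine-cache") -> str:
    """Return a local copy of remote_path, downloading only on a cache miss."""
    # derive a flat, filesystem-safe cache key from the remote URL
    key = remote_path.replace("://", "_").replace("/", "_")
    local = os.path.join(cache_dir, key)
    if not os.path.exists(local):      # cache miss: fetch once
        os.makedirs(cache_dir, exist_ok=True)
        download(remote_path, local)   # e.g. an fsspec fs.get(...) call
    return local                       # cache hit: reuse the local copy
```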

Dataset API (0.9.0) [P1]

  • We can leverage fsspec:

```python
import fsspec

class DataStore:
    """Sketch of the Dataset API surface."""

    def exists(self, path: str) -> bool: ...

    def get_random_remote_path(self) -> str: ...

    def download(self, remote_path: str, local_path: str, recursive: bool = False):
        # fsspec.filesystem() expects a protocol name (e.g. "s3"), not a full URL
        fs = fsspec.filesystem(fsspec.utils.get_protocol(remote_path))
        fs.get(remote_path, local_path, recursive=recursive)

    def upload(self, local_path: str, remote_path: str, recursive: bool = False):
        fs = fsspec.filesystem(fsspec.utils.get_protocol(remote_path))
        # note: upload must use put() (local -> remote), not get()
        fs.put(local_path, remote_path, recursive=recursive)

submarine.data.upload("/tmp/file", "s3://bucket/key")
```

Training API (TBD) [0.9.0] P2

  • There are some benefits to implementing a Submarine base trial class:
    1. Automatically save the model when training is done.
    2. Automatically log metrics and parameters (learning rate, batch size, model metrics).
    3. Automatically load a checkpoint if one exists.
    4. Get the TF/PyTorch config internally.
```python
from typing import Union

import pandas as pd
import tensorflow as tf

from submarine import keras

class CIFARTrial(keras.TFKerasTrial):
    def build_model(self):
        model = build_model(
            layer1_dropout=self.context.get_hparam("layer1_dropout"),
            layer2_dropout=self.context.get_hparam("layer2_dropout"),
            layer3_dropout=self.context.get_hparam("layer3_dropout"),
        )
        ...
        return model

    def training_data_loader(self) -> Union[tf.data.Dataset, pd.DataFrame]: ...

    def validation_data_loader(self) -> Union[tf.data.Dataset, pd.DataFrame]: ...
```
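To make the `get_hparam` calls above concrete, here is a toy stand-in for the trial context; `TrialContext` is a hypothetical name for illustration, not the actual Submarine class.

```python
class TrialContext:
    """Hypothetical stand-in for the context a base trial class would inject."""

    def __init__(self, hparams: dict):
        self._hparams = hparams

    def get_hparam(self, name: str):
        # the real context would also handle defaults and searcher-chosen values
        return self._hparams[name]

ctx = TrialContext({"layer1_dropout": 0.2, "layer2_dropout": 0.3, "layer3_dropout": 0.4})
ctx.get_hparam("layer2_dropout")  # 0.3
```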

Reference

Model serving

  • Model quality monitoring [0.8.0] P3
  • A/B testing [0.9.0] P3
  • Serverless (auto scale model endpoints) [0.9.0] P3
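The A/B-testing item could start as simple weighted traffic splitting between model endpoints; the endpoint names and the 90/10 split below are purely illustrative assumptions.

```python
import random

def pick_variant(weights=None, rng=random.random):
    """Route one request to a model variant according to traffic weights."""
    weights = weights or {"model-a": 0.9, "model-b": 0.1}  # illustrative split
    r = rng()
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding at the tail
```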

Submarine Experiment

  • XGBoost training [0.8.0] P0
  • Model checkpoint (Recover experiment) [0.8.0] P0

Submarine operator

  • Replace Traefik with Istio
  • Submarine-operator v3 [0.8.0] P0

Submarine workbench

  • Angular -> React [0.8.0] P2
  • Web socket [0.8.0] P2

Submarine CLI

  • Use cli to start Submarine [0.9.0] P3
  • Start Submarine in k3s [0.9.0] P3

Environment Overhaul [0.8.0] P0

Currently, creating an environment requires users to set a Docker image and a Conda YAML file. However, users can't set an arbitrary image; the image must be apache/submarine:jupyter-notebook

From the users' perspective, they only care about:

  1. Python version
  2. Python packages (TensorFlow, pandas)
  3. CPU/GPU
  4. CUDA version

Solution:

  1. On the workbench, users will combine the configuration options above to create an environment.
  2. Provide some environments (images) that cover most users' needs.
    • python-3.8-tensorflow-cpu, python-3.9-pytorch-gpu
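The mapping from the four user-facing fields onto the prebuilt image names could be sketched like this; `EnvironmentSpec` and `resolve_image` are hypothetical names, not the actual Submarine API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EnvironmentSpec:
    python_version: str                 # 1. Python version, e.g. "3.8"
    packages: List[str]                 # 2. Python packages, e.g. ["tensorflow", "pandas"]
    device: str = "cpu"                 # 3. "cpu" or "gpu"
    cuda_version: Optional[str] = None  # 4. only relevant when device == "gpu"

def resolve_image(spec: EnvironmentSpec) -> str:
    """Pick one of the prebuilt images covering most users' needs."""
    framework = "pytorch" if any("torch" in p for p in spec.packages) else "tensorflow"
    return f"python-{spec.python_version}-{framework}-{spec.device}"

resolve_image(EnvironmentSpec("3.8", ["tensorflow", "pandas"]))  # python-3.8-tensorflow-cpu
```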

Link: https://github.com/apache/submarine/issues/892 https://github.com/apache/submarine/issues/895
Issue: https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1213

Notebook [0.8.0] P1

  • Stop notebook server if idle for a long time
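The idle-stop check could be as simple as comparing a last-activity timestamp against a timeout; the one-hour threshold and the function name below are assumptions, not the actual implementation in the linked commit.

```python
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(hours=1)  # assumed threshold; would be configurable

def should_stop(last_activity: datetime, now: datetime,
                timeout: timedelta = IDLE_TIMEOUT) -> bool:
    """True when the notebook server has been idle longer than the timeout."""
    return now - last_activity > timeout

now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
should_stop(now - timedelta(hours=2), now)  # True
```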

Link: https://github.com/apache/submarine/issues/853
Issue: https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1167
Solution: https://github.com/shangyuantech/submarine/commit/35db197e20ac7604bed224f2c3d066d46115c824

Refactor

  • Remove duplicate code in experiment [0.8.0] P4
  • Remove legacy code [0.8.0] P4

UI/UX [0.9.0] P4

  • (Workbench) Remove Data Dict, Department, Workspace, Interpreter?
  • Running an experiment without building a Docker image. The entire flow will be:
    1. Create an environment or use predefined environment
    2. Create a notebook, start developing the model
    3. Create an experiment
      • Choose a notebook
      • Mount the code to the experiment pods

Metrics

  • Experiment CPU/memory usage [0.8.0] P0
  • Model CPU, memory, disk, and network I/O [0.9.0] P0

Example

  • Model serving [0.8.0] P0
  • Tracking example [0.8.0] P0

Workflow Orchestrator [0.8.0] P4

Issue: https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1264

Docs [0.8.0] P3

Bug Bash

  • Fix E2E flaky test [0.8.0] P2
  • Improve test coverage (SonarCloud) [0.8.0] P2
tags: Submarine Roadmap