# Submarine Roadmap
## Submarine SDK Overhaul
### Local cache [0.8.0] P1
- Download S3/HDFS data to the local file system before training
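A minimal sketch of how such a cache could behave, assuming a download-once policy keyed by the remote URL. The `cached_path` function and the injected `fetch` callable are hypothetical names used for illustration; in practice `fetch` could wrap an fsspec `get` call.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_path(remote_url: str, cache_dir: str,
                fetch: Callable[[str, str], None]) -> Path:
    """Return a local copy of remote_url, downloading only on a cache miss.

    `fetch(remote_url, local_path)` is a hypothetical downloader supplied
    by the caller; it is not part of any existing Submarine API.
    """
    # Hash the URL so any remote path maps to a safe, unique file name.
    key = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()
    target = Path(cache_dir) / key
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        tmp = target.with_name(target.name + ".part")
        fetch(remote_url, str(tmp))  # download under a temporary name first
        tmp.replace(target)          # rename so partial files are never served
    return target
```

Training code would then read from the returned path, and repeated runs on the same node would skip the download.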
### Dataset API [0.9.0] P1
- We can leverage **fsspec**
```python=
from fsspec.core import url_to_fs

def exists(self, path: str) -> bool:
    ...

def get_random_remote_path() -> str:
    ...

def download(self, remote_path: str, local_path: str, recursive: bool = False):
    # url_to_fs picks the filesystem (s3, hdfs, ...) from the URL scheme
    fs, path = url_to_fs(remote_path)
    fs.get(path, local_path, recursive=recursive)

def upload(self, local_path: str, remote_path: str, recursive: bool = False):
    fs, path = url_to_fs(remote_path)
    fs.put(local_path, path, recursive=recursive)

# Example call through the proposed submarine.data module
submarine.data.upload("/tmp/file", "s3://bucket/key")
```
### Training API (TBD) [0.9.0] P2
- Implementing a Submarine base trial class has several benefits:
1. Automatically save the model when training is done.
2. Automatically output metrics and parameters (learning rate, batch size, model metrics).
3. Automatically load a checkpoint if one exists.
4. Get the TensorFlow or PyTorch config internally.
```python=
from typing import Union

import pandas as pd
import tensorflow as tf

from submarine import keras  # proposed API (TBD)


class CIFARTrial(keras.TFKerasTrial):
    def build_model(self):
        # build_model() here resolves to a module-level helper, not this method
        model = build_model(
            layer1_dropout=self.context.get_hparam("layer1_dropout"),
            layer2_dropout=self.context.get_hparam("layer2_dropout"),
            layer3_dropout=self.context.get_hparam("layer3_dropout"),
        )
        ...
        return model

    def training_data_loader(self) -> Union[tf.data.Dataset, pd.DataFrame]:
        ...

    def validation_data_loader(self) -> Union[tf.data.Dataset, pd.DataFrame]:
        ...
```
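The automatic checkpointing and metric logging listed above can be illustrated without any ML framework. The following is a rough sketch only: `BaseTrial`, `train_step`, `run`, and the JSON checkpoint file are all hypothetical names, and a small state dictionary stands in for a real model.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class BaseTrial:
    """Hypothetical base class; subclasses only implement train_step()."""

    def __init__(self, hparams: Dict[str, Any], workdir: str):
        self.hparams = hparams
        self.workdir = Path(workdir)
        self.workdir.mkdir(parents=True, exist_ok=True)
        self.checkpoint = self.workdir / "checkpoint.json"
        self.state: Dict[str, Any] = {"step": 0}  # stands in for model weights
        self.metrics: List[Dict[str, Any]] = []

    def train_step(self, step: int) -> Dict[str, float]:
        raise NotImplementedError

    def run(self, num_steps: int) -> None:
        # Resume from a checkpoint if one exists.
        if self.checkpoint.exists():
            self.state = json.loads(self.checkpoint.read_text())
        for step in range(self.state["step"], num_steps):
            # Record metrics together with the hyperparameters.
            self.metrics.append({"step": step, **self.hparams, **self.train_step(step)})
            # Persist progress without any user code.
            self.state["step"] = step + 1
            self.checkpoint.write_text(json.dumps(self.state))


class ToyTrial(BaseTrial):
    def train_step(self, step: int) -> Dict[str, float]:
        return {"loss": 1.0 / (step + 1)}
```

A second `ToyTrial` pointed at the same `workdir` resumes at the saved step instead of starting from zero, which is the recovery behavior a real base class would provide for model checkpoints.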
#### Reference
- [Determined AI training API](https://docs.determined.ai/latest/tutorials/tf-mnist-tutorial.html)
- [SageMaker training API](https://sagemaker.readthedocs.io/en/stable/overview.html)
## Model serving
- Model quality monitoring [0.8.0] P3
- A/B testing [0.9.0] P3
- Serverless (auto scale model endpoints) [0.9.0] P3
## Submarine Experiment
- XGBoost training [0.8.0] P0
- Model checkpoint (Recover experiment) [0.8.0] P0
## Submarine operator
- Replace Traefik with Istio
- Submarine-operator v3 [0.8.0] P0
## Submarine workbench
- Angular -> React [0.8.0] P2
- WebSocket [0.8.0] P2
## Submarine CLI
- Use the CLI to start Submarine [0.9.0] P3
- Start Submarine in k3s [0.9.0] P3
## Environment Overhaul [0.8.0] P0
Currently, creating an environment requires users to set a **Docker image** and a **conda YAML** file. However, users can't set an arbitrary image; the image must be `apache/submarine:jupyter-notebook`.
From the user's perspective, they only care about:
1. Python version
2. Python packages (TensorFlow, pandas)
3. CPU, GPU
4. CUDA version
Solution:
1. On the workbench, users set the options above to create an environment.
2. Provide some environments (images) that cover most users' needs.
- `python-3.8-tensorflow-cpu`, `python-3.9-pytorch-gpu`
Link: https://github.com/apache/submarine/issues/892 https://github.com/apache/submarine/issues/895
Issue: https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1213
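One way the user-facing options could map onto predefined image names such as `python-3.8-tensorflow-cpu` is sketched below. `EnvironmentSpec` and `image_tag` are hypothetical names, and how the CUDA version would appear in the tag is not specified in this roadmap, so the sketch only validates it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical container for the options a user picks on the workbench."""
    python_version: str                 # e.g. "3.8"
    framework: str                      # e.g. "tensorflow" or "pytorch"
    gpu: bool = False
    cuda_version: Optional[str] = None  # only meaningful when gpu is True

    def image_tag(self) -> str:
        # Assumption: a CUDA version without a GPU is a configuration error.
        if self.cuda_version and not self.gpu:
            raise ValueError("cuda_version requires gpu=True")
        device = "gpu" if self.gpu else "cpu"
        return f"python-{self.python_version}-{self.framework.lower()}-{device}"


print(EnvironmentSpec("3.8", "tensorflow").image_tag())
print(EnvironmentSpec("3.9", "pytorch", gpu=True, cuda_version="11.2").image_tag())
```

Keeping the spec separate from the image name would let the workbench validate choices before any image is pulled.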
## Notebook [0.8.0] P1
- Stop notebook server if idle for a long time
Link: https://github.com/apache/submarine/issues/853
Issue: https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1167
Solution: https://github.com/shangyuantech/submarine/commit/35db197e20ac7604bed224f2c3d066d46115c824
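The idle check itself can be sketched as a pure function, assuming some component already records a last-activity timestamp per notebook. The function name, the input mapping, and the threshold are assumptions for illustration; the linked commit may take a different approach.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_idle_notebooks(last_activity: Dict[str, datetime],
                        now: datetime,
                        max_idle: timedelta) -> List[str]:
    """Return the names of notebooks idle for longer than max_idle.

    `last_activity` maps a notebook name to its most recent activity time;
    how that time is collected is outside the scope of this sketch.
    """
    return sorted(name for name, seen in last_activity.items()
                  if now - seen > max_idle)
```

A periodic job could call this and stop each returned notebook server.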
## Refactor
- Remove duplicate code in experiment [0.8.0] P4
- Remove legacy code [0.8.0] P4
## UI/UX [0.9.0] P4
- (Workbench) Remove Data dict, Department, Workspace, Interpreter?
- Run experiments without building a Docker image. The entire flow would be:
1. Create an environment or use predefined environment
2. Create a notebook, start developing the model
3. Create an experiment
- Choose a notebook
- Mount the code to the experiment pods
## Metrics
- Experiment CPU/memory usage [0.8.0] P0
- Model CPU, memory, disk, and network I/O [0.9.0] P0
## Example
- Model serving [0.8.0] P0
- Tracking example [0.8.0] P0
## Workflow Orchestrator [0.8.0] P4
- Airflow operator for Submarine experiment ([Creating a custom Operator](https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html))
Issue: https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1264
## Docs [0.8.0] P3
- Auto-generate pysubmarine API docs from comments (refer to https://realpython.com/documenting-python-code/)
## Bug Bash
- Fix flaky E2E tests [0.8.0] P2
- Improve test coverage ([sonarcloud](https://sonarcloud.io/project/overview?id=apache_submarine)) [0.8.0] P2
###### tags: `Submarine` `Roadmap`