## Tue demo:
Opening:
- Unified workflow
- Pipeline-based HP tuning workflow starting from your notebook, with caching
- Single unified workflow: in making this, we made lots of fixes to components to enforce cohesion and uniformity. Some of these components don't look exactly how we expected them to.
0. Explain what we are going to demo.
- Our focus has been to make Kubeflow feel like a cohesive platform, where all of its components are organized semantically so that users can navigate in a reasonable way. We want Kubeflow to feel like a single, well-structured product and not just a "collection" of separate components.
- A "semantic" organization, as opposed to a "component" organization, means that multiple components can provide services or [UI] views for the same objects and should thus be "grouped" together. This of course has implications both for the UX and for the data linkage and references between components.
- The first step to achieve this is to remove any "inner" sidebar, which breaks the continuity and consistency of the overall UX. Kubeflow Pipelines and Katib are the two components that have a sidebar, and for obvious reasons, since they can be deployed as stand-alone applications. We propose that, when deployed as part of Kubeflow, these sidebars are hidden and all the deep links become reachable from the central dashboard.
- These changes form our opinionated distribution of Kubeflow, and what we will show is our actual deployment of that distribution running on EKS.
1. Show the new Centraldashboard sidebar
- Show how we grouped the Katib and KFP experiments together, and how we think about *Runs* in the same way. Note that there is no Katib entry there, since Katib does not (yet?) provide a "Trials" view.
- Mention where the cluster is running: we have an opinionated distribution of Kubeflow running on EKS, and our customers are currently deploying their workflows in this environment. What you see is also a fully multi-user environment, based on namespace isolation and the latest multi-user improvements developed for Kubeflow Pipelines.
2. Connect to a Notebook Server
3. Show the new Kale UI
- We have improved upon our last release of Kale. We are evolving what Kale has been until now - a tool to convert notebooks to Kubeflow Pipelines - into a more general data science workflow tool to orchestrate jobs in Kubeflow. We want Kale to become the place where a data scientist can do all of their work, from prototyping, to training at scale, and finally to serving models.
- We want to run HP tuning -> we need pipeline parameters -> we define them in a cell (see the sketch after this list) -> they become the search space of the tuning job.
- Brand new UI to manage HP tuning jobs.
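- A rough illustration of how such a parameters cell could look (assuming Kale's `pipeline-parameters` cell tag; the variable names and values are placeholders, not the actual demo notebook):

  ```python
  # Notebook cell tagged as the pipeline-parameters cell (illustrative values).
  # Kale turns these plain assignments into pipeline parameters, which can then
  # be exposed as the search space of an HP tuning job.
  LR = 0.001        # learning rate
  BATCH_SIZE = 64   # training batch size
  EPOCHS = 5        # number of training epochs
  ```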
4. Configure Katib
- To demonstrate this, let's see how you can now run Kubeflow pipelines as part of a Katib job, directly from Kale (sketched below).
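- A minimal sketch of the kind of HP tuning configuration this involves (the field names mirror the Katib Experiment spec, but the exact schema Kale uses and all values here are illustrative):

  ```python
  # Illustrative HP tuning configuration: a search space over the pipeline
  # parameters defined in the notebook, plus the objective and the algorithm.
  katib_config = {
      "parameters": [
          {"name": "LR", "parameterType": "double",
           "feasibleSpace": {"min": "0.0001", "max": "0.01"}},
          {"name": "BATCH_SIZE", "parameterType": "discrete",
           "feasibleSpace": {"list": ["32", "64", "128"]}},
      ],
      "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
      "algorithm": {"algorithmName": "random"},
      "maxTrialCount": 12,
      "parallelTrialCount": 3,
      "maxFailedTrialCount": 3,
  }
  ```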
5. Start a Katib experiment
6. Show the Katib UI
- Say that we want to contribute to Katib to bring it in line with these recent developments and the UI language employed across Kubeflow applications.
7. Show the KFP UI
- Show a run by opening it in a new tab. This will also trigger our new "forced" iframe redirection, reinforcing the argument that we want everything to be well integrated.
- Click the toggle to simplify the graph view.
- Show the MLMD
- Show the generated Jupyter Artifacts
8. Show that caching works
- **How should we describe this, especially with respect to the existing KFP caching implementation?**
- Running workflows at scale, especially in the context of HP tuning, requires using resources wisely to speed up computation. We have implemented a caching mechanism that works with PVCs and relies on the snapshotting capabilities of Rok, our data management platform. The implementation is generic and can be extended to support other storage vendors. Our implementation is a superset of what KFP caching does: KFP caching assumes everything in the PVC is immutable, whereas our PVCs are clones of immutable snapshots.
- It exploits the snapshotting capabilities of our Rok data platform; any storage vendor could come in and integrate with our caching controller. We take a snapshot of each step before and after its execution, and for caching we only care about the latter. We compute an execution hash (similarly to how the KFP cache does) and search for it in MLMD (see the sketch after this list). The execution hash is based on the workflow template and ...
- We make the pod unschedulable and have our caching controller mark it as succeeded both in terms of Kubernetes and in terms of Argo (the pod is completed from the k8s perspective, and it has the proper annotations and labels for Argo to consider it completed).
- Caching logs: we change the pod that the KFP UI looks for (pointing it to the cached one).
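- A rough sketch of the cache-lookup logic described above (the MLMD client wrapper and the `cache_key` property are hypothetical; the real controller also handles the Rok snapshots and the Argo/Kubernetes pod status):

  ```python
  import hashlib
  import json

  def execution_hash(step_template: dict, parameters: dict) -> str:
      """Deterministic hash of a step, similar in spirit to KFP caching:
      it covers the workflow step template and its resolved parameters."""
      payload = json.dumps(
          {"template": step_template, "parameters": parameters}, sort_keys=True)
      return hashlib.sha256(payload.encode()).hexdigest()

  def lookup_cached_execution(mlmd_client, step_hash: str):
      """Search MLMD for a previous execution with the same hash.
      On a hit, the step is skipped and its post-step snapshot is reused."""
      for execution in mlmd_client.get_executions():  # hypothetical wrapper
          if execution.custom_properties.get("cache_key") == step_hash:
              return execution  # cache hit: reuse the post-execution snapshot
      return None               # cache miss: run the step, snapshot afterwards
  ```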
9. Show artifacts (MLMD)
- MLMD is not yet isolated, but we are aware of this, we know how to fix it, and we will work on it.