# Roadmap

###### tags: `Main`

* Setup Team
  * Join the hackMD team
  * Join the slack workspace
  * Choose your partner (2 persons per team) for installation, assignments, etc.
* Prepare your own testbed
  * Install OS, K8S, Kubeflow
* Study Kubernetes
  * Online:
    * [K8S documentation](https://kubernetes.io/docs/home/)
    * [Kubectl book](https://kubectl.docs.kubernetes.io/)
    * [VMware KubeAcademy](https://kube.academy/courses)
  * Books:
    * Programming Kubernetes: Developing Cloud-Native Applications
    * Kubernetes: Up and Running: Dive into the Future of Infrastructure
    * Kubernetes Cookbook: Building Cloud Native Applications
  * KubeCon:
    * [Attend the online workshop on 2020-8-17~20](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/)
    * Search online for recordings of previous talks
* K8S Programming assignments
  * Controller, Operator Patterns
  * Scheduler framework
  * Device plugin framework
  * Webhook (a minimal validating-webhook sketch appears at the end of this roadmap)
* Study Kubeflow & ML pipeline
  * Understand AI platform and ML pipeline services
  * Know how to use Kubeflow and its features
    * TF-operator (distributed training, Horovod)
    * Pipeline (Argo)
    * Serving (TFServing, KFServing)
    * Katib (hyperparameter tuning, neural architecture search)
  * Build ML pipeline use cases to serve as test cases for Kubeflow
* Paper reading
  * Horovod: fast and easy distributed deep learning in TensorFlow
  * Parallel and Distributed Deep Learning
  * Tiresias: A GPU Cluster Manager for Distributed Deep Learning
  * Gandiva: Introspective Cluster Scheduling for Deep Learning
  * Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
  * Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
  * Resource Elasticity in Distributed Deep Learning
* Understand the LSALAB projects from [github](https://github.com/NTHU-LSALAB)
  * KubeShare
  * Gemini
  * DRAGON
* Known issues for Kubeflow (priority)
  * Only supports namespace isolation for the notebook service (High)
  * Lack of admin management functionality for creating projects and managing project resources/datasets (High)
  * Bug in KFServing when launching instances (High)
  * Lack of a user monitoring service for tracking job execution and resource usage (Mid)
  * Lack of a dataset management service (Mid)
  * Lack of data collection from output files for debugging (Low)
* Extended Study
  * Istio (service mesh)
  * Knative
  * Helm (deployment toolkit)
  * Prometheus
* [Apply for a Google Summer of Code project](https://www.kubeflow.org/docs/about/gsoc/)
  * 1/15: Organization Applications Open
  * 2/6: Organization Application Deadline
  * 2/21: Organizations Announced
  * 3/17-4/1: Student Application Period
  * 4/1-5/5: Application Review Period
  * 5/5: Student Projects Announced
  * 5/5-6/2: Community Bonding
  * 6/2-8/25: Coding
  * 8/25-9/1: Students Submit Code and Final Evaluations
  * 9/1-9/8: Mentors Submit Final Evaluations
  * 9/9: Results Announced
* Other references
  * [Markdown cheat sheet](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf)
* Overview
  * ![Roadmap (updated on 2020/8/3)](https:// "title")
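The "Webhook" item in the programming assignments above refers to Kubernetes admission webhooks. As a rough orientation only (not part of the assignment spec), a validating webhook boils down to an HTTPS handler that decodes an `AdmissionReview`, applies a policy, and echoes back a response carrying the request UID. The `/validate` path, the port, and the toy policy below are illustrative assumptions; TLS setup and the `ValidatingWebhookConfiguration` registration are omitted.

```go
// Minimal validating admission webhook sketch: decode the AdmissionReview
// sent by the API server, apply a toy policy, and echo back a response
// carrying the request UID. TLS and webhook registration are omitted.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func validate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	// Toy policy: allow the request as long as it carries an object payload.
	allowed := len(review.Request.Object.Raw) > 0
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID, // must echo the request UID
		Allowed: allowed,
		Result:  &metav1.Status{Message: "toy policy: object payload required"},
	}
	json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/validate", validate) // path is an assumption
	// A real webhook must serve HTTPS with a certificate the API server trusts.
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```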
---

# Agenda

### Next meeting time: 9/15 7-9pm
### Next meeting location: Delta 601

---

## 2020/8/3 1:30pm@delta601

* Kickoff meeting [slides](https://docs.google.com/presentation/d/1mRCGwOi_1FiiHJbEr9rw4EKbnteJk6_w61FkWuK98Zk/edit?usp=sharing)
* Confirm your hackMD, Slack, and Google Drive emails
* Decide the regular (next) meeting time: **Thursday 4-5pm at Delta 601**
* ToDo
  * Decide team members and enter the info in [notes](/@lsalab-k8s-2020/notes)
  * Apply for accounts
    * [Microsoft Office 365](http://learning.cc.nthu.edu.tw/p/412-1319-12292.php?Lang=zh-tw)
    * [AWS Educate](https://www.awseducate.com/signin/SiteLogin)
  * Study AI platform services from public clouds
    * AWS: [SageMaker](https://aws.amazon.com/tw/sagemaker/)
    * Azure: [ML Service](https://azure.microsoft.com/zh-tw/services/machine-learning/)
    * Google Cloud Platform: [AI Platform](https://cloud.google.com/ai-platform)
  * The URL of a pre-installed Kubeflow test site is posted in the Slack channel
  * Arrange a one-day event to install machines and prepare environments according to the [installation guide](/@lsalab-k8s-2020/installations)
  * Study basic background knowledge in cloud, containers, and deep learning, and answer the questions in the [study notes](/@lsalab-k8s-2020/notes)
  * Study Kubernetes and answer the questions in the [study notes](/@lsalab-k8s-2020/notes)
  * Attend KubeCon and recap the talks in the study notes
    * Every team should have at least one person attend every assigned talk
    * Everyone will take notes for two talks

## 2020/8/17-20 [KubeCon](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/)

* 8/17 7pm~10pm:
  * KubeAcademy: Kubernetes Application and Container Workflows hosted by VMware
* 8/17 9:05pm~10:25pm: Tutorial
  * From Notebook to Kubeflow Pipelines with HP Tuning: A Data Science Journey - Stefano Fioravanzo & Ilias Katsakioris, Arrikto
* 8/18 7:45pm~9:05pm: Machine Learning + Data
  * Is Sharing GPU to Multiple Containers Feasible? - Samed Güner, SAP
  * Enabling Multi-user Machine Learning Workflows for Kubeflow Pipelines - Yannis Zarkadas, Arrikto & Yuan Gong, Google
* 8/18 8:30pm~9:05pm: Serverless
  * Expanding Serverless to Scale-out Kubeflow Pipelines - Yaron Haviv, Iguazio
* 8/19 7:45pm~0:15am: Machine Learning + Data
  * Taming Data/State Challenges for ML Applications and Kubeflow - Skyler Thomas, Hewlett Packard Enterprise
  * How to Use Kubernetes to Build a Data Lake for AI Workloads - Peter MacKinnon & Uday Boppana, Red Hat
  * Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA - Madhukar Korupolu & Sanjay Chatterjee, NVIDIA
  * Pwned By Statistics: How Kubeflow & MLOps Can Help Secure Your ML Workloads - David Aronchick, Microsoft
* 8/20 8:30pm~9:05pm: Machine Learning + Data
  * Kubeflow 1.0 Update by a Kubeflow Community Product Manager - Josh Bottum, Arrikto
* 8/21 0:05~0:45am: Machine Learning + Data
  * MLPerf Meets Kubernetes - Xinyuan Huang & Elvira Dzhuraeva, Cisco

## 2020/8/18 4pm@delta601

* Review [study notes](/@lsalab-k8s-2020/notes)
* Demonstrate AI platform services from public clouds & discuss what Kubeflow should improve
  * AWS
  * Azure
  * GCP
* Discuss issues from installation
* Announce the scheduler assignment
* ToDo
  * Study advanced Kubernetes and answer the questions in the [study notes](/@lsalab-k8s-2020/notes)
  * Follow the "Experiment with the Pipelines Samples" tutorial to test your Kubeflow installation
  * Implement the [assignment for scheduler](/@lsalab-k8s-2020/assignment-scheduling), due on 9/3 (a toy scheduling sketch follows at the end of this section)
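As a rough companion to the scheduler assignment above (not the assignment's required design), the sketch below shows the minimal mechanic any external scheduler needs: find pods that name it in `spec.schedulerName` and have no node assigned yet, then bind each one through the pod `Binding` subresource. The scheduler name `my-scheduler`, the hard-coded node `worker-1`, and the one-shot List (instead of a watch loop) are illustrative assumptions.

```go
// Toy external scheduler sketch: bind pods that ask for the hypothetical
// scheduler "my-scheduler" and have no node yet. A real scheduler would
// watch pods continuously and pick nodes based on resources and policy.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the scheduler runs inside the cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List pods across all namespaces that are not scheduled to any node yet.
	pods, err := client.CoreV1().Pods("").List(context.TODO(),
		metav1.ListOptions{FieldSelector: "spec.nodeName="})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		if pod.Spec.SchedulerName != "my-scheduler" { // leave other pods to kube-scheduler
			continue
		}
		binding := &corev1.Binding{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
			Target:     corev1.ObjectReference{Kind: "Node", Name: "worker-1"}, // placeholder node
		}
		if err := client.CoreV1().Pods(pod.Namespace).Bind(
			context.TODO(), binding, metav1.CreateOptions{}); err != nil {
			fmt.Println("bind failed:", err)
			continue
		}
		fmt.Printf("bound %s/%s to worker-1\n", pod.Namespace, pod.Name)
	}
}
```

With something like this in place, a pod opts in by setting `schedulerName: my-scheduler` in its spec; pods without that field keep using the default kube-scheduler.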
## 2020/8/27

* Review [study notes](/@lsalab-k8s-2020/@notes) for K8S basics
* Review [study notes](/@lsalab-k8s-2020/@notes) for K8S advanced part I
* ToDo
  * Study advanced Kubernetes and answer the questions in the [study notes](/@lsalab-k8s-2020/notes)

## 2020/9/3

* Review the scheduler assignment
* Controller tutorial
* Announce the controller assignment
* Review [study notes](/@lsalab-k8s-2020/@notes) for K8S advanced part I
* ToDo
  * Implement the [assignment for controller](/@lsalab-k8s-2020/@assignments), due on 9/15 (a bare-bones informer sketch follows at the end of this section)
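As a rough starting point for the controller assignment (not the assignment solution itself), the sketch below wires up a client-go shared informer and reacts to events. The watched resource (Deployments), the resync period, and the kubeconfig location are illustrative assumptions; a production controller adds a workqueue and reconcile logic on top of this.

```go
// Bare-bones controller sketch: a shared informer that watches Deployments
// and reacts to add/update events. Real controllers add a workqueue and
// reconcile logic; this only shows the informer wiring.
package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location when run outside the cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	informer := factory.Apps().V1().Deployments().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			d := obj.(*appsv1.Deployment)
			fmt.Printf("deployment added: %s/%s\n", d.Namespace, d.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			d := newObj.(*appsv1.Deployment)
			fmt.Printf("deployment updated: %s/%s\n", d.Namespace, d.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)                               // run the informer in the background
	cache.WaitForCacheSync(stop, informer.HasSynced)  // wait for the initial list
	select {}                                         // block; a real controller runs workers here
}
```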
## 2020/9/15

* Discuss thoughts and questions from KubeCon
* Review [study notes](/@lsalab-k8s-2020/@notes) for K8S advanced part II
* ToDo
  * Implement the [assignment for controller](/@lsalab-k8s-2020/@assignments), due on 9/22

## 2020/9/23

* Review the controller assignment
* ToDo
  * Implement the [assignment for go-micro](https://www.notion.so/justin0u0/Go-Micro-K8s-Assignment-ffff78401d99453ab72b68a6c4b5138c), due on 10/15
  * Continue to work on the [assignment for controller](/@lsalab-k8s-2020/@assignments); deadline extended to 10/15

## 2020/10/15

* Discuss the assignments for controller and go-micro
* ToDo
  * Complete the assignments for controller and go-micro. Final deadline extended to 10/28.

## 2020/10/28

* Review the assignments for controller and go-micro
* ToDo
  * Paper reading: present on 12/1
    * GPU Sharing: [paper](https://drive.google.com/file/d/1iHvm-nrB6H6flsesEtPMqq6mRBJbMm87/view?usp=sharing); [slides](https://drive.google.com/file/d/1auW_PKNRbAaPXUc75YKAIYJd8Gc3PIAR/view?usp=sharing); [video](https://www.youtube.com/watch?v=1WQMKCGN9j4) ==> team1
    * Scheduling: [DRAGON paper](https://drive.google.com/file/d/1MBV3r_X3BldXOvJTVqelVlSlYA_m0nOL/view?usp=sharing); [DRAGON slides](https://drive.google.com/file/d/12_u1Kgb9fjal7KrQ3MHeLW7ZlpZYORng/view?usp=sharing); [Optimus paper](https://drive.google.com/file/d/1vHALQsHM7gxxst-uBT88raS9AAJSniyf/view?usp=sharing) ==> team3
    * Distributed model training: [general concept](http://www.juyang.co/distributed-model-training-ii-parameter-server-and-allreduce/); [Horovod](https://arxiv.org/abs/1802.05799) ==> team2
    * Aii workshop talk: [slides](https://drive.google.com/file/d/1l-VlSRqx04CCGIxuEhmVC9aWNnRu5t50/view?usp=sharing); [video](https://drive.google.com/file/d/1hGf-kQVUCFMeA_-nifR-FvohCdMnL9yl/view?usp=sharing) ==> all teams

## 2020/11/10, 17, 26

* Paper and progress discussion

## 2020/12/1

* Paper presentation

## 2020/12/15

* Announce the DL pipeline tutorial
  * Link: https://github.com/NTHU-LSALAB/DL-Pipeline-Tutorial
  * Due on 12/30

## 2020/12/30

* Discuss the DL pipeline tutorial

## 2021/2/4

* Everyone briefly shares what they think about the topics
* Each person presents one of the topics below:
  * [Prometheus](https://k2r2bai.com/2018/06/10/cncf/prometheus/) & [Grafana](https://grafana.com/grafana/) (羅家濬)
    * Briefly introduce what these tools are and how you can use them. https://www.notion.so/Prometheus-and-Grafana-c725efaeb0e9424da7db73a7380da245
  * Successive Halving (SHA) & ASHA [algorithms](https://arxiv.org/pdf/1810.05934.pdf) (王劭元)
    * Just use some examples to illustrate how the algorithms work (a tiny successive-halving sketch appears at the end of this section). https://www.notion.so/SHA-ASHA-714d5040bd7b48f2b45f0e19a256e286
  * [CephFS](https://docs.ceph.com/en/latest/cephfs/) (闊光)
    * Explain what kind of file system it is and what important features it provides. https://www.notion.so/kerwenwwer/Ceph-f109c14af6914e47bc5ef8d8827dbde4
  * [Volcano project](https://github.com/volcano-sh/volcano) (陳劭愷)
    * Introduce what kind of scheduling system it is and what scheduling algorithms or features it provides. https://www.notion.so/justin0u0/Volcano-e431034c6dee47d4918576fa32797984
  * [BERT tutorial](https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03) (唐晏湄)
    * Give an overview of what this tutorial is trying to do. You don't necessarily need to complete it; just give everyone an idea of what it covers. https://www.notion.so/BERT-for-dummies-Step-by-Step-Tutorial-6d986f73146d411e85ebdba783139ae5
  * [InterSpeech tutorial](https://github.com/espnet/interspeech2019-tutorial) (王領崧)
    * Same as the BERT tutorial. https://www.notion.so/ESPNet-c162dbc80695453299db7533486e2f6c#004406e1686c46f39f98b493174aadbb
* Topic and roadmap discussion
  * [document](https://drive.google.com/file/d/1RA4qWWmuQtFP-kpS99hvKuUnKyt65fz2/view?usp=sharing)
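To complement the SHA & ASHA topic above, here is a tiny, self-contained illustration of plain Successive Halving (the synchronous core that ASHA relaxes): evaluate many configurations on a small budget, keep the top 1/eta fraction, and multiply the budget each round. The `evaluate` function, the single `lr` hyperparameter, and the constants are made-up stand-ins for real trials.

```go
// Tiny illustration of Successive Halving (SHA), the idea behind ASHA:
// start many hyperparameter configurations on a small budget, keep the
// top 1/eta fraction, and multiply the budget each round. The "training"
// here is a stand-in function; real usage would launch actual trials.
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

type config struct {
	lr    float64 // a made-up hyperparameter
	score float64 // higher is better
}

// evaluate pretends to train a configuration for `budget` epochs.
func evaluate(c config, budget int) float64 {
	// Stand-in: the score grows with budget and depends noisily on lr.
	return float64(budget) * (1.0 - c.lr) * (0.9 + 0.2*rand.Float64())
}

func main() {
	const eta = 3 // keep the top 1/eta configurations each round
	budget := 1   // initial per-trial budget (e.g., epochs)
	configs := make([]config, 27)
	for i := range configs {
		configs[i] = config{lr: rand.Float64()}
	}

	for round := 0; len(configs) > 1; round++ {
		for i := range configs {
			configs[i].score = evaluate(configs[i], budget)
		}
		sort.Slice(configs, func(i, j int) bool { return configs[i].score > configs[j].score })
		keep := len(configs) / eta
		if keep < 1 {
			keep = 1
		}
		fmt.Printf("round %d: %d configs at budget %d, keeping %d\n",
			round, len(configs), budget, keep)
		configs = configs[:keep]
		budget *= eta
	}
	fmt.Printf("best lr = %.3f\n", configs[0].lr)
}
```

ASHA differs mainly in that it promotes configurations asynchronously instead of waiting for every trial in a round to finish, which matters when trials run in parallel on a cluster.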
## 2021/3/1-2021/7/31

* AI Platform implementation:
  * [Components](https://www.notion.so/justin0u0/Project-AI-Platform-55d2926251b34997b5ffa7b1b71f0721)
    * Local Docker repository
    * Account management
    * Kubeflow pipeline
    * Monitoring system (Prometheus) (a minimal metrics-endpoint sketch appears after the next section)
    * Jupyter notebook
    * Object storage (MinIO)

## 2021/8/1-2021/9/16

* AI Platform implementation:
  * Fine-tune the implementation and UI.
  * Prepare the demo clip and presentation.
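For the "Monitoring system (Prometheus)" component above, one plausible building block (an assumption, not the agreed platform design) is instrumenting each platform service with `prometheus/client_golang` and exposing a `/metrics` endpoint for Prometheus to scrape; Grafana then charts the scraped series. A minimal sketch, with a hypothetical counter name and port:

```go
// Minimal Prometheus instrumentation sketch: expose a custom counter on
// /metrics so a Prometheus server can scrape it. The metric name and port
// are illustrative, not part of the platform design.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var jobsSubmitted = promauto.NewCounter(prometheus.CounterOpts{
	Name: "platform_jobs_submitted_total", // hypothetical metric
	Help: "Number of training jobs submitted through the platform.",
})

func main() {
	http.HandleFunc("/submit", func(w http.ResponseWriter, r *http.Request) {
		jobsSubmitted.Inc() // count each submission
		w.Write([]byte("submitted\n"))
	})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```

Prometheus would pick this up via a scrape config, or via a ServiceMonitor if the cluster runs the Prometheus Operator.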
* Paper reading
  * [Shared folder for the papers](https://drive.google.com/drive/folders/1XYvF1HAzdVaizbu-WuFhJ9mpACv4SsJ0?usp=sharing)
  * Guideline
    * [Ten simple rules for reading a scientific paper](https://pdfs.semanticscholar.org/45c0/1decf2c24d98c0e08dff54877758bd1a39af.pdf?_ga=2.30152895.1837442392.1627564936-331049104.1624582993)
    * [How to Read a Paper](http://ccr.sigcomm.org/online/files/p83-keshavA.pdf)
  * [Notes & Questionnaire](https://hackmd.io/@lsalab-k8s-2020/PapersNotes-DistributedSystems)
  * 8/12: CAP Theorem, Consistency, Consensus
    * CAP Theorem
    * CAP Twelve Years Later: How the "Rules" Have Changed
    * Consistency, Availability, and Convergence
    * Time, Clocks, and the Ordering of Events in a Distributed System
    * Paxos Made Live
    * Paxos Made Simple
    * The Part-Time Parliament (optional)
  * 8/19: Consistency, Consensus, Convergence
    * Raft: In Search of an Understandable Consensus Algorithm
    * The Chubby Lock Service for Loosely-Coupled Distributed Systems
    * Weighted Voting for Replicated Data
    * The Load, Capacity and Availability of Quorum Systems (optional)
    * Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
    * The Potential Dangers of Causal Consistency
    * Chord
    * Consistent Hashing and Random Trees
  * 8/26: Distributed Data Storage Systems
    * Ceph
    * GFS
    * DynamoDB
    * Bigtable
    * Spanner
  * 9/2: Distributed Data Processing & Communication Systems
    * Kafka
    * Modern Messaging for Distributed Systems
    * Spark
    * MapReduce
    * Disk-Locality in Datacenter Computing Considered Irrelevant
    * A Message System Supporting Fault Tolerance
    * Survey of Publish/Subscribe Event Systems
    * The Many Faces of Publish/Subscribe
  * 9/9: Resource Managers
    * Large-scale Cluster Management at Google with Borg
    * Omega: Flexible, Scalable Schedulers for Large Compute Clusters
    * Borg, Omega, and Kubernetes
    * Mesos
    * Apache Hadoop YARN
    * SLURM: Simple Linux Utility for Resource Management
  * 9/16: Caching & Scheduling
    * A Comparison of List Scheduling for Parallel Processing Systems
    * Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors
    * DistCache
    * Design Considerations for Distributed Caching on the Internet
    * Scaling Memcache at Facebook
  * 9/23: Erasure Coding
    * AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems
    * Erasure Coding in Windows Azure Storage
    * Rethinking Erasure Codes for Cloud File Systems
  * Other good papers
    * Cluster-Based Scalable Network Services
    * A Majority Consensus Approach to Concurrency Control
    * Session Guarantees for Weakly Consistent Replicated Data
    * Providing High Availability Using Lazy Replication
    * Coda: A Highly Available File System for a Distributed Workstation Environment
    * Hive: A Petabyte Scale Data Warehouse Using Hadoop
    * Oozie: Towards a Scalable Workflow Management System for Hadoop
    * Apache Hadoop YARN: Yet Another Resource Negotiator
    * CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
    * TetriSched: Global Rescheduling with Adaptive Plan-Ahead in Dynamic Heterogeneous Clusters
    * Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing
    * Sparrow: Distributed, Low Latency Scheduling
    * Quincy: Fair Scheduling for Distributed Computing Clusters
    * Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
    * I/O-Aware Batch Scheduling for Petascale Computing Systems
    * The Slab Allocator: An Object-Caching Kernel Memory Allocator
    * Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis
    * [The Log: What every software engineer should know about real-time data's unifying abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)
    * [awesome-distributed-systems](https://github.com/theanalyst/awesome-distributed-systems)
    * [Raft](https://raft.github.io/)
* Excellent list of papers for machine learning systems
  * [The list](https://jeongseob.github.io/readings_mlsys.html). You should read every paper in the first five categories, from "Framework" to "Scheduling & Resource Management".