# Cloud Native Data Science with Pangeo
## Competitive analysis
1. Please list at least three books that might compete with your book. For each book, list the author, title, publisher, and publication year.
- [Cloud Computing for Science and Engineering](https://www.amazon.com/Computing-Science-Engineering-Scientific-Computation/dp/0262037246), Ian Foster, Dennis Gannon, 2017
- [Modeling and Simulation in HPC and Cloud Systems](https://www.amazon.com/Modeling-Simulation-Cloud-Systems-Studies/dp/331973766X), 2017
1. Describe how your book differs from the competition and indicate what advantages your book has over the competition.
1. Opinionated: We will describe a very specific software stack and architecture for cloud-based science. We will not attempt to provide a general overview of all the cloud has to offer.
2. Practical: This is a user's guide. Specific scientific use cases are laid out.
3. Open source: Our book will focus exclusively on open-source, community-developed tools.
4. Vendor agnostic: While some of the use cases will refer to particular vendors, the core of the methods should be portable across vendors and infrastructure configurations.
5. Dynamic: Cloud technology is moving extremely fast. A static book will be out of date in a year. Our book will be a living document, maintained and updated by the community.
What has been written on this topic
The most relevant book is probably Cloud Computing for Science and Engineering:
https://cloud4scieng.org/chapters/
Online documentation from UW; it has a very ambitious outline but is not yet complete:
https://cloudmaven.github.io/documentation/index.html
Another marginally relevant section from the AWS AI book:
https://d2l.ai/chapter_appendix/aws.html
There will be _lots_ of possible overlap with Jupyter documentation, since Jupyter is such a key part of our stack.
- User side: https://jupyterlab.readthedocs.io/en/stable/
- Admin side: https://zero-to-jupyterhub.readthedocs.io/en/latest/
How do we manage this? Do we simply point people to those docs? Vendor them?
## Draft Outline
- Intro:
- What is the cloud?
Cloud computing is defined by its use of APIs to provision resources on an as-needed basis, unlike non-cloud systems, which provide a fixed set of resources and require human intervention to change them. Resources could be compute systems or data, and increasingly could be higher-level services for various common activities, such as DevOps, machine learning, etc.
For the purposes of this book, a _cloud_ system is one that runs Kubernetes. Kubernetes is an open-source container management system that manages and runs Docker containers, each of which provides a set of functionality. The Kubernetes APIs provide a set of runtime and administrative interfaces that allow containers to interact and scale.
- Why cloud?
- Collaboration
- Worldwide operations
- Cost management
- Dynamic scaling up of resources when needed and scaling down when not needed
- Treat infrastructure as software that can be versioned and rolled back if needed
- Support reproducible workloads by encapsulating software dependencies
- Increased dependability through continuous integration, testing, and deployment
- The Pangeo Principles
- Move data as little as possible
- Separate concerns and specialize late
- Scale compute elastically
- Analyze data lazily
- Federate data platforms
- Part I: Data
- What is data?
- Data models
- NetCDF / CF
- Arrow
- Data containers
- Legacy Formats
- Cloud-Optimized Formats
- Parquet
- COG
- Zarr
- TileDB
- etc.
- Cloud Data Storage Services
- Object Stores
- Figshare / Zenodo
- Data APIs (e.g. OPeNDAP)
- Data Catalogs
- Part II: User's Guide
- Prerequisites
- The scientific workflow
- Discovering data
- Loading data
- Analyzing data
- Visualizing data
- Scaling out with Dask
- Chunks chunks chunks
- Using the dashboard
- Sharing your code
- Making your code reproducible
- Part III: Use Cases
- Spatiotemporal Analysis of Ocean Sea Surface Height
- Conditional Sampling of Updrafts in Large Eddy Simulations
- Trend Analysis of NCAR Large Ensemble
- Part IV: Cloud Administrator's Guide
- Prerequisites
- Kubernetes
- Helm
- Setting up a cluster
- Part V: HPC Administrator's Guide
- Conda and the software environment
- Configure and deploy JupyterLab for a single user
- Configure and deploy JupyterHub for a team
- Deploy Dask parallelism on job schedulers via Dask-jobqueue
## Tech considerations
* [A link to the Jupyter Book grant we wrote](https://www.dropbox.com/s/mi601wyggtkr8e8/proposal_jupyterbook.pdf?dl=0)
* [NeuroLibre, an open neuroscience publishing platform with Binder and Jupyter](https://conp-pcno.github.io)
* [Quantitative economics open textbook](https://lectures.quantecon.org/py/)
* [National-scale computing in Canada, from Syzygy](https://blog.jupyter.org/national-scale-interactive-computing-2c104455e062)