# Cloud Native Data Science with Pangeo

## Competitive analysis

1. Please list at least three books that might compete with your book. For each book, list the author, title, publisher, and publication year.
   - [Cloud Computing for Science and Engineering](https://www.amazon.com/Computing-Science-Engineering-Scientific-Computation/dp/0262037246), Ian Foster, Dennis Gannon, 2017
   - [Modeling and Simulation in HPC and Cloud Systems](https://www.amazon.com/Modeling-Simulation-Cloud-Systems-Studies/dp/331973766X), 2017
2. Describe how your book differs from the competition and indicate what advantages your book has over the competition.
   1. Opinionated: We will describe a very specific software stack and architecture for cloud-based science. We will not attempt to provide a general overview of all the cloud has to offer.
   2. Practical: This is a user's guide. Specific scientific use cases are laid out.
   3. Open Source: Our book will focus exclusively on open-source, community-developed tools.
   4. Vendor agnostic: While some of the use cases will refer to particular vendors, the core of the methods should be portable across vendors and infrastructure configurations.
   5. Dynamic: Cloud tech is moving extremely fast. A static book will be out of date in a year. Our book will be a living document, maintained and updated by the community.

What has been written on this topic

The most relevant book is probably Cloud Computing for Science and Engineering: https://cloud4scieng.org/chapters/

Online doc from UW; it has a very ambitious outline but is not quite complete yet: https://cloudmaven.github.io/documentation/index.html

Another marginally relevant section from the AWS AI book: https://d2l.ai/chapter_appendix/aws.html

There will be _lots_ of possible overlap with Jupyter documentation, since Jupyter is such a key part of our stack.

- User side: https://jupyterlab.readthedocs.io/en/stable/
- Admin side: https://zero-to-jupyterhub.readthedocs.io/en/latest/

How do we manage this? Do we simply point people to those docs? Vendor them?

## Draft Outline

- Intro:
  - What is the cloud?

    Cloud computing is defined by its use of APIs to provision resources on an as-needed basis. This is unlike non-cloud systems, which provide a fixed set of resources and require human intervention to change those resources. Resources could be compute systems or data, and increasingly could be higher-level services for various common activities, such as DevOps, machine learning, etc. For the purposes of this book, a _cloud_ system is one that runs Kubernetes. Kubernetes is an open-source container management system that runs and manages Docker containers, which provide a set of functionality. Kubernetes provides runtime and administrative APIs that allow containers to interact and scale.
  - Why cloud?
    - Collaboration
    - Worldwide operations
    - Cost management
    - Dynamic scaling up of resources when needed and scaling down when not needed
    - Treat infrastructure as software that can be versioned and rolled back if needed
    - Support reproducible workloads by encapsulating software dependencies
    - Increased dependability through continuous integration, testing, and deployment
  - The Pangeo Principles
    - Move data as little as possible
    - Separate concerns and specialize late
    - Scale compute elastically
    - Analyze data lazily
    - Federate data platforms
- Part I: Data
  - What is data?
  - Data models
    - NetCDF / CF
    - Arrow
  - Data containers
    - Legacy Formats
    - Cloud-Optimized Formats
      - Parquet
      - COG
      - Zarr
      - TileDB
      - etc.
  - Cloud Data Storage Services
    - Object Stores
    - Figshare / Zenodo
  - Data APIs (e.g. OPeNDAP)
  - Data Catalogs
- Part II: User's Guide
  - Prerequisites
  - The scientific workflow
    - Discovering data
    - Loading data
    - Analyzing data
    - Visualizing data
  - Scaling out with Dask
    - Chunks chunks chunks
    - Using the dashboard
  - Sharing your code
  - Making your code reproducible
- Part III: Use Cases
  - Spatiotemporal Analysis of Ocean Sea Surface Height
  - Conditional Sampling of Updrafts in Large Eddy Simulations
  - Trend Analysis of NCAR Large Ensemble
- Part IV: Cloud Administrator's Guide
  - Prerequisites
    - Kubernetes
    - Helm
  - Setting up a cluster
- Part V: HPC Administrator's Guide
  - Conda and the software environment
  - Configure and deploy JupyterLab for a single user
  - Configure and deploy JupyterHub for a team
  - Deploy Dask parallelism on job schedulers via Dask-jobqueue

## Tech considerations

* [A link to the Jupyter Book grant we wrote](https://www.dropbox.com/s/mi601wyggtkr8e8/proposal_jupyterbook.pdf?dl=0)
* [Neurolibre, open neuro publishing platform w/ Binder and Jupyter](https://conp-pcno.github.io)
* [Quantitative economics open textbook](https://lectures.quantecon.org/py/)
* [National-scale computing in Canada from Syzygy](https://blog.jupyter.org/national-scale-interactive-computing-2c104455e062)
