# Google Cloud HPC Day
_March 20, 2019_
_Sunnyvale, CA_
###### tags: cloud, training, google, HPC, slurm
## Overview
- Preemptible VMs (up to ~80% cheaper, but instances can be preempted with 30 s notice, so codes need to support checkpoint/restart; see the gcloud sketch after this list)
- Cirq
- Compute Optimized VMs
- High Throughput Computing with HTCondor
- http://cloud.google.com/hpc
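- a minimal sketch (not from the talk) of launching a preemptible instance with `gcloud`; the instance name, zone, and machine type below are placeholders
```
> # preemptible: up to ~80% cheaper, but can be reclaimed with ~30 s notice
> gcloud compute instances create my-preemptible-node \
    --machine-type=n1-standard-2 \
    --zone=us-central1-a \
    --preemptible
```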
## Bursting HPC jobs to GCP with Slurm
- Slurm: resource manager, grew into a scheduler
- https://github.com/schedmd/slurm-gcp
- 3 parts
- scheduler daemon (slurmctld)
- database daemon (slurmdbd)
- tracks access control policies
- usage / accounting
- node daemon (slurmd)
- shepherds the workload on each compute node
- e.g. keeps track of MPI processes
- supports Federation
- "burst to cloud": e.g. start running on local infrastructure and send some jobs to cloud
- setting up slurm on GCP
- clone repo
```
> git clone https://github.com/schedmd/slurm-gcp.git
```
- edit the yaml config file (slurm-cluster.yaml, used in the deploy command below)
```yaml
cluster_name: g1 # prepended to all node names
static_node_count: 2 # nodes that are always up
max_node_count: 10 # maximum number of compute nodes
machine_type: n1-standard-2
preemptible_bursting: False
suspend_time: 300 # seconds a node can sit idle before it is powered down
default_users: lheagy@berkeley.edu, XXX # comma separated list
```
- deploy
```
> gcloud deployment-manager deployments create g1 --config slurm-cluster.yaml --project=slurm-gcp
```
- VM instances
- compute image
- compute nodes
- scheduler
- login node (can ssh into this)
- to get info about the cluster (if `sinfo` is not available yet, the cluster hasn't finished spinning up)
```
> sinfo
NODES STATE NODELIST
8     idle~ g1-compute[3-10] # ~ means power-save (idle and powered down)
2     idle  g1-compute[1-2]
> squeue # jobs in the queue
```
- submit a job
```
> time srun -N2 hostname
```
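- the `srun` above runs interactively; batch jobs go through `sbatch`. A minimal sketch (job name, output pattern, and mail address are placeholders, not from the talk):
```
> cat job.sh
#!/bin/bash
#SBATCH --job-name=hostname-test     # placeholder name
#SBATCH --nodes=2                    # same node count as the srun example
#SBATCH --output=%x-%j.out           # log file named after job name and id
#SBATCH --mail-type=END,FAIL         # email on completion/failure (see Q & A below)
#SBATCH --mail-user=you@example.com  # placeholder address
srun hostname
> sbatch job.sh
```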
### Q & A
- federation: multiple clusters managed by slurm
- a scheduler runs on each federated cluster
- submit jobs to the "master" controller --> it distributes work across the clusters (see the sketch after this list)
- job failures: you can get email notifications on job completion, failure, or abort (e.g. the --mail-type flags in the sbatch sketch above)
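- a rough sketch of the federation setup (cluster and federation names are placeholders; exact workflow is in the Slurm federation docs, not from the talk)
```
> sacctmgr add federation myfed clusters=g1,onprem   # register both clusters with the shared slurmdbd
> squeue --federation                                # view jobs across the whole federation
> sbatch --clusters=g1 job.sh                        # pin a job to one cluster
```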
## Customer Panel
#### Research workloads at Stanford
- genomics, medical data --> BigQuery
- 1200 storage buckets across Stanford
- Singularity Hub
- shout-out to jupyterhub :tada:
#### FluidNumerics
- Slurm clusters, hackathons, and GPUs
- Dr. Joseph Schoonover
#### Google Brain
- Zak Stone, Cloud TPU Product Manager
- OpenAI blog post: "AI and Compute"
- BERT: natural language processing
- bfloat16 (the reduced-precision floating-point format used on TPUs)
#### Julia Computing
- Julia on TPU: At the intersection of HPC and ML
- Keno Fischer
- https://github.com/FluxML/Flux.jl
- Neural ODEs: NeurIPS paper
- Can we do traditional HPC simulations on TPUs (reduced precision on TPUs vs. full precision on CPUs)?
## Code Labs
- https://storage.googleapis.com/hpc_day/Google%20HPC%20Day%20Codelab%20Guide.pdf
- [Creating a virtual machine](https://codelabs.developers.google.com/codelabs/cloud-create-a-vm/index.html?index=..%2F..%2Findex#0)
- [Creating a persistent disk](https://codelabs.developers.google.com/codelabs/cloud-persistent-disk/index.html?index=..%2F..%2Findex#0)
- Q: When creating a disk, how do I know which one is my persistent disk?
```
username@gcelab:~$ ls -l /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Feb 27 02:24 google-persistent-disk-0 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 27 02:24 google-persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 Feb 27 02:25 google-persistent-disk-1 -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 27 02:24 scsi-0Google_PersistentDisk_persistent-disk-0 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 27 02:24 scsi-0Google_PersistentDisk_persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 Feb 27 02:25 scsi-0Google_PersistentDisk_persistent-disk-1 -> ../../sdb
```
- A: `sdb` is the secondary disk (so the persistent disk I created); a format/mount sketch follows below
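- a sketch of the usual next step: format the secondary disk and mount it (the mount point and filesystem choice are my assumptions, not part of the lab output above)
```
username@gcelab:~$ sudo mkfs.ext4 -F /dev/sdb   # format the new persistent disk (wipes it)
username@gcelab:~$ sudo mkdir -p /mnt/my-disk   # placeholder mount point
username@gcelab:~$ sudo mount /dev/sdb /mnt/my-disk
username@gcelab:~$ df -h /mnt/my-disk           # confirm it is mounted
```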