# Google Cloud HPC Day

_March 20, 2019_ _Sunnyvale, CA_

###### tags: cloud, training, google, HPC, slurm

## Overview

- Preemptible VMs (can save 80% on cost, but you can get kicked off with 30 s notice - need codes that can be checkpointed and restarted)
- Cirq
- Compute-optimized VMs
- High-throughput computing with HTCondor
- http://cloud.google.com/hpc

## Bursting HPC jobs to GCP with Slurm

- Slurm: started as a resource manager, grew into a scheduler
- https://github.com/schedmd/slurm-gcp
- 3 parts
    - scheduler daemon
    - database
        - tracks access control methods
        - usage
    - slurm daemon
        - shepherds the workload
        - e.g. keeps track of MPI processes
- supports federation
    - "burst to cloud": e.g. start running on local infrastructure and send some jobs to the cloud
- setting up Slurm on GCP
    - clone the repo
      ```
      > git clone https://github.com/schedmd/slurm-gcp.git
      ```
    - edit the yaml file
      ```yaml
      cluster_name: g1              # prepended to all node names
      static_node_count: 2          # nodes that are always up
      max_node_count: 10            # maximum number of nodes
      machine_type: n1-standard-2
      preemptible_bursting: False
      suspend_time: 300             # how long compute nodes should stay up
      default_users: lheagy@berkeley.edu, XXX  # comma-separated list
      ```
    - deploy
      ```
      gcloud deployment-manager deployments --project=slurm-gcp create g1 --config slurm-cluster.yaml
      ```
    - VM instances created
        - compute image
        - compute nodes
        - scheduler
        - login node (can ssh into this)
    - to get info about the cluster (if `sinfo` isn't there yet, the cluster hasn't finished spinning up)
      ```
      > sinfo
      NODES
      8 idle~ g1-compute[8-10]   # tilde means power-controlled (e.g. idle and powered off)
      2 idle  g1-compute[1-2]
      > squeue                   # jobs in the queue
      ```
    - submit a job (a batch-script version of this is sketched at the end of these notes)
      ```
      > time srun -N2 hostname
      ```

### Q & A

- federation: multiple clusters managed by Slurm
    - a scheduler runs on both
    - submit jobs to the "master controller" --> it distributes the work
- job failures: you can get email notifications on success, failure, abort

## Customer Panel

#### Research workloads at Stanford

- genomics, medical --> BigQuery
- 1200 storage buckets across Stanford
- Singularity Hub
- shout-out to JupyterHub :tada:

#### FluidNumerics

- Slurm clusters, hackathons and GPUs
- Dr. Joseph Schoonover

#### Google Brain

- Zak Stone, Cloud TPU Product Manager
- OpenAI blog: AI and Compute
- BERT: natural language processing
- bfloat16

#### Julia Computing

- Julia on TPU: At the intersection of HPC and ML
- Keno Fischer
- https://github.com/FluxML/Flux.jl
- Neural ODEs: NeurIPS paper
- Can we do traditional HPC simulations on TPUs (half precision on TPUs vs. CPU)?

## Code Labs

- https://storage.googleapis.com/hpc_day/Google%20HPC%20Day%20Codelab%20Guide.pdf
- [Creating a virtual machine](https://codelabs.developers.google.com/codelabs/cloud-create-a-vm/index.html?index=..%2F..%2Findex#0)
- [Creating a persistent disk](https://codelabs.developers.google.com/codelabs/cloud-persistent-disk/index.html?index=..%2F..%2Findex#0)
- Q: When creating a disk, how do I know which one is my persistent disk?
  ```
  username@gcelab:~$ ls -l /dev/disk/by-id/
  lrwxrwxrwx 1 root root  9 Feb 27 02:24 google-persistent-disk-0 -> ../../sda
  lrwxrwxrwx 1 root root 10 Feb 27 02:24 google-persistent-disk-0-part1 -> ../../sda1
  lrwxrwxrwx 1 root root  9 Feb 27 02:25 google-persistent-disk-1 -> ../../sdb
  lrwxrwxrwx 1 root root  9 Feb 27 02:24 scsi-0Google_PersistentDisk_persistent-disk-0 -> ../../sda
  lrwxrwxrwx 1 root root 10 Feb 27 02:24 scsi-0Google_PersistentDisk_persistent-disk-0-part1 -> ../../sda1
  lrwxrwxrwx 1 root root  9 Feb 27 02:25 scsi-0Google_PersistentDisk_persistent-disk-1 -> ../../sdb
  ```
- A: `sdb` is the secondary disk (so the persistent one I created)
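- Follow-up (my note, not from the session): once the persistent disk is identified, the usual next step in the codelab is to format and mount it. A minimal sketch, assuming the disk really is `/dev/sdb` as above and using `/mnt/disks/data` as a made-up mount point:
  ```
  # WARNING: mkfs erases everything on the disk - only run it on the new, empty persistent disk
  sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb

  # mount point is arbitrary; /mnt/disks/data is just an example
  sudo mkdir -p /mnt/disks/data
  sudo mount -o discard,defaults /dev/sdb /mnt/disks/data
  ```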
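- Appendix (my note, not from the session): a minimal sketch of a batch-script version of the `srun -N2 hostname` example from the Slurm section above. The job name, output pattern, time limit, and email address are placeholders; the `--mail-*` options are the standard way to get the success/failure emails mentioned in the Q & A:
  ```bash
  #!/bin/bash
  #SBATCH --job-name=hostname-test     # placeholder job name
  #SBATCH --nodes=2                    # same node count as the srun example
  #SBATCH --time=00:05:00              # placeholder wall-time limit
  #SBATCH --output=%x-%j.out           # stdout/stderr file: <jobname>-<jobid>.out
  #SBATCH --mail-type=END,FAIL         # email on completion or failure
  #SBATCH --mail-user=me@example.com   # placeholder address

  srun hostname
  ```
  Submit it from the login node with `sbatch hostname-test.sh` and check on it with `squeue`.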