AWS / Container Instances (EC2)
===
###### tags: `ML / Platform`
###### tags: `ML`, `AWS`, `EC2`, `container`, `GPU`

<br>

[TOC]

<br>

## AWS Entry Point
https://signin.aws.amazon.com/console

### EC2 Entry Point
https://ap-southeast-1.console.aws.amazon.com/ec2/v2/home
- [Amazon EC2 G4 Instances](https://aws.amazon.com/tw/ec2/instance-types/g4/)
  Described by AWS as the industry's most cost-effective GPU instances

<br>
<hr>
<br>

## EC2 Walkthrough

### Step 1: Choose an Amazon Machine Image (AMI)
[![](https://i.imgur.com/vWyzoLr.png)](https://i.imgur.com/vWyzoLr.png)

<br>

### Step 2: Choose an Instance Type
[![](https://i.imgur.com/ltiVNRo.png)](https://i.imgur.com/ltiVNRo.png)

<br>

### Step 3: Configure Instance Details
Skipped.

<br>

### Step 4: Add Storage
[![](https://i.imgur.com/jzK8HGG.png)](https://i.imgur.com/jzK8HGG.png)

:::warning
:warning: **An 8 GB root volume is too small; a single `apt update` fills it up**
`failed to write status database record about 'python3-oauthlib' to '/var/lib/dpkg/status': No space left on device`
```
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.7G  7.7G     0 100% /    <--------- full
devtmpfs        7.7G     0  7.7G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           1.6G  860K  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/loop0       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop1       62M   62M     0 100% /snap/core20/1242
/dev/loop3       56M   56M     0 100% /snap/core18/2253
/dev/loop2       68M   68M     0 100% /snap/lxd/21835
/dev/loop4       43M   43M     0 100% /snap/snapd/14066
tmpfs           1.6G     0  1.6G   0% /run/user/1000
```
Test 1: at least 18 GB is needed
![](https://i.imgur.com/KEu7DVJ.png)

Test 2: at least 24 GB is needed (unclear whether this includes temporary files)
![](https://i.imgur.com/UKsLyIp.png)
:::

:::info
:bulb: **Some installed programs live on the root volume**
To speed up program startup, General Purpose SSD (gp3) may be the better choice (unverified).
:::

<br>

### Step 5: Add Tags
Skipped.

<br>

### Step 6: Configure the Security Group
[![](https://i.imgur.com/L520DLq.png)](https://i.imgur.com/L520DLq.png)

<br>

### Step A: Select an Existing Key Pair or Create a New One
![](https://i.imgur.com/MYTlgE6.png)

On later launches, select the existing pem
![](https://i.imgur.com/ekm4TDU.png)

<br>

### Step B: Connect
```bash=
$ ssh -i parabricks.pem ubuntu@13.213.40.194
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!
@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0664 for 'parabricks.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "parabricks.pem": bad permissions
ubuntu@13.213.40.194: Permission denied (publickey).

# Fix: make the key readable by its owner only, then reconnect
$ chmod 400 parabricks.pem
$ ssh -i parabricks.pem ubuntu@13.213.40.194
```

<br>

### Step C: Format a Newly Attached ("Brand-New") Disk
> References:
> - [[HackMD] Azure / Virtual Machines (VM) (incl. GPU) / Disk Mounting / Step3 - Format the disk](https://hackmd.io/bMasy0__T3-lqFnNFklgvw#Step3---%E6%A0%BC%E5%BC%8F%E5%8C%96%E7%A3%81%E7%A2%9F)
> - [Make instance store volumes available on your instance](https://docs.aws.amazon.com/zh_tw/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#making-instance-stores-available-on-your-instances)
> ![](https://i.imgur.com/k7HPMYx.png)

- ### Find the attached disk
```
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.7G  1.5G  6.3G  19% /
devtmpfs        479M     0  479M   0% /dev
tmpfs           485M     0  485M   0% /dev/shm
tmpfs            97M  812K   97M   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           485M     0  485M   0% /sys/fs/cgroup
/dev/loop1       56M   56M     0 100% /snap/core18/2253
/dev/loop0       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop4       43M   43M     0 100% /snap/snapd/14066
/dev/loop3       68M   68M     0 100% /snap/lxd/21835
/dev/loop2       62M   62M     0 100% /snap/core20/1242
tmpfs            97M     0   97M   0% /run/user/1000
```
`df -h` lists only mounted filesystems, so the newly attached disk does not appear; use `lsblk` instead.
```
$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0     7:0    0   25M  1 loop /snap/amazon-ssm-agent/4046
loop1     7:1    0 55.5M  1 loop /snap/core18/2253
loop2     7:2    0 61.9M  1 loop /snap/core20/1242
loop3     7:3    0 67.2M  1 loop /snap/lxd/21835
loop4     7:4    0 42.2M  1 loop /snap/snapd/14066
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0  500G  0 disk    <---
```
`xvdb` has no mount point yet.

- ### Partition and format
```bash=
# "xfspart" and "xfs" are only the partition name and type hint; the filesystem itself is created by mkfs below
$ sudo parted /dev/xvdb --script mklabel gpt mkpart xfspart xfs 0% 100%
```
```bash=
$ sudo mkfs -t ext4 /dev/xvdb1
mke2fs 1.45.5 (07-Jan-2020)
Creating filesystem with 131071488 4k blocks and 32768000 inodes
Filesystem UUID:
d7cd9c94-786d-4c6a-b968-f4e9921a95e3
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
```
- `mkfs -t ext4 /dev/xvdb1` is equivalent to `mkfs.ext4 /dev/xvdb1`
- References
  - [mount: wrong fs type, bad option, bad superblock](https://unix.stackexchange.com/questions/315063)

```bash=
$ sudo partprobe /dev/xvdb1
```
```bash=
$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0     7:0    0   25M  1 loop /snap/amazon-ssm-agent/4046
loop1     7:1    0 55.5M  1 loop /snap/core18/2253
loop2     7:2    0 61.9M  1 loop /snap/core20/1242
loop3     7:3    0 67.2M  1 loop /snap/lxd/21835
loop4     7:4    0 42.2M  1 loop /snap/snapd/14066
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0  500G  0 disk
└─xvdb1 202:17   0  500G  0 part    <--- formatted, but no mount point yet
```

- ### Mount the disk
```bash=
$ sudo mkdir /workspace
$ sudo mount /dev/xvdb1 /workspace/
```

<br>

### Step D: Upload Data
> References
> - [[HackMD][Parabricks-v3.5] Datasets](/Bx3R_i2wSfmDl1cQE0w3OA)

```bash
$ rsync -e "ssh -i aws/parabricks.pem" \
    --progress --partial --append -zh ./datasets/WGS-LIS-AI018A_R* \
    ubuntu@13.213.40.194:/workspace/datasets/
```
```bash
$ wget -O parabricks_sample.tar.gz \
    "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
```

<br>

### Step E: Alternatively, Mount an Existing Disk
- [Attach an Amazon EBS volume to an instance](https://docs.aws.amazon.com/zh_tw/AWSEC2/latest/UserGuide/ebs-attaching-volume.html)
  ![](https://i.imgur.com/Qic4fq0.png)
- Commands
```bash=
$ sudo mkdir /workspace    # /workspace is owned by root at this point
$ sudo chown $(id -u):$(id -g) /workspace
$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0     7:0    0   25M  1 loop /snap/amazon-ssm-agent/4046
loop1     7:1    0 55.5M  1 loop /snap/core18/2253
loop2     7:2    0 61.9M  1 loop /snap/core20/1242
loop3     7:3    0 67.2M  1 loop /snap/lxd/21835
loop4     7:4    0 42.2M  1 loop
/snap/snapd/14066
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0  500G  0 disk
└─xvdb1 202:17   0  500G  0 part    <--- not mounted
$ sudo mount /dev/xvdb1 /workspace/
```

<br>
<hr>
<br>

## Storage I/O Performance
> - [[aws] Amazon EBS volume types](https://docs.aws.amazon.com/zh_tw/AWSEC2/latest/UserGuide/ebs-volume-types.html)
> - [AWS EBS: old and new service types compared](https://www.ithome.com.tw/news/141769)
> [![](https://i.imgur.com/AnWgi2D.png)](https://i.imgur.com/AnWgi2D.png)

### Measured Performance by gp3 Setting
Measured with fio (`rw=randrw`, 4 KiB blocks, `iodepth=64`); the raw output is under Test Data. At 4 KiB per I/O the workload is IOPS-bound: 12.3k read IOPS x 4 KiB is about 48 MiB/s, far below the configured throughput, and read plus write IOPS add up to roughly the configured IOPS.

| AWS gp3 setting<br>(IOPS, MiB/s) | Read IOPS | Read BW (MiB/s) | Write IOPS | Write BW (MiB/s) |
|-----:|-----|-----|-----|-----|
| **16000, 625** | 12.3k | 48.2 | 4124 | 16.1 |
| **16000, 625** | 12.4k | 48.3 | 4126 | 16.1 |
||
| **14706, 573** | 11.4k | 44.3 | 3790 | 14.8 |
||
| **14600, 571**<br>TWCC NFS performance | 11.3k | 44.0 | 3763 | 14.7 |
||
| **14600, 570** | 11.3k | 44.0 | 3763 | 14.7 |
| **14600, 570** | 11.3k | 44.0 | 3763 | 14.7 |
||
| **3000, 125** | 2252 | 8.80 | 752 | 2.94 |
| **3000, 125** | 2302 | 8.99 | 768 | 3.00 |
| **3000, 125** | 2302 | 8.99 | 768 | 3.00 |

### Test Data
- :::spoiler `gp3: 16000 IOPS, 625 MiB/s`, round1
```
fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=52.1MiB/s,w=17.7MiB/s][r=13.3k,w=4541 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=1): err= 0: pid=14334: Fri Mar 18 04:36:12 2022
   read: IOPS=12.3k, BW=48.2MiB/s (50.6MB/s)(6141MiB/127296msec)
   bw (  KiB/s): min=48128, max=95408, per=99.96%, avg=49379.12, stdev=4023.26, samples=254
   iops        : min=12032, max=23852, avg=12344.78, stdev=1005.81, samples=254
  write: IOPS=4124, BW=16.1MiB/s (16.9MB/s)(2051MiB/127296msec)
   bw (  KiB/s): min=15224, max=31920, per=99.96%, avg=16489.83, stdev=1374.04, samples=254
   iops        : min= 3806, max= 7980, avg=4122.46, stdev=343.51, samples=254
  cpu          : usr=4.97%, sys=9.98%, ctx=1537463, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%,
16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwt: total=1572145,525007,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=48.2MiB/s (50.6MB/s), 48.2MiB/s-48.2MiB/s (50.6MB/s-50.6MB/s), io=6141MiB (6440MB), run=127296-127296msec
  WRITE: bw=16.1MiB/s (16.9MB/s), 16.1MiB/s-16.1MiB/s (16.9MB/s-16.9MB/s), io=2051MiB (2150MB), run=127296-127296msec

Disk stats (read/write):
  xvdb: ios=1566200/523779, merge=3657/464, ticks=5984982/2080847, in_queue=3290932, util=99.91%
```
:::

- :::spoiler `gp3: 3000 IOPS, 125 MiB/s`, round1
fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
Jobs: 1 (f=0): [f(1)][100.0%][r=10.0MiB/s,w=3723KiB/s][r=2805,w=930 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=1): err= 0: pid=14347: Fri Mar 18 04:49:25 2022
read: IOPS=2302, BW=9209KiB/s (9430kB/s)(6141MiB/682886msec)
bw ( KiB/s): min= 8696, max=25792, per=99.99%, avg=9207.12, stdev=486.68, samples=1365
iops : min= 2174, max= 6448, avg=2301.77, stdev=121.67, samples=1365
write: IOPS=768, BW=3075KiB/s (3149kB/s)(2051MiB/682886msec)
bw ( KiB/s): min= 2560, max= 8968, per=99.98%, avg=3074.48, stdev=218.64, samples=1365
iops : min= 640, max= 2242, avg=768.61, stdev=54.66, samples=1365
cpu : usr=1.30%, sys=2.95%, ctx=1999740, majf=0, minf=6
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=1572145,525007,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=9209KiB/s (9430kB/s), 9209KiB/s-9209KiB/s (9430kB/s-9430kB/s), io=6141MiB (6440MB), run=682886-682886msec
WRITE: bw=3075KiB/s (3149kB/s), 3075KiB/s-3075KiB/s
(3149kB/s-3149kB/s), io=2051MiB (2150MB), run=682886-682886msec
Disk stats (read/write):
xvdc: ios=1567423/524271, merge=4032/536, ticks=32288915/11245506, in_queue=39371552, util=100.00%
:::

<br>
<hr>
<br>

## EC2 / g4dn.xlarge
> - ### [Amazon EC2 G4 Instances](https://aws.amazon.com/tw/ec2/instance-types/g4/)
>   Described by AWS as the industry's most cost-effective GPU instances
>   Product details:
>   [![](https://i.imgur.com/3IrPfJg.png)](https://i.imgur.com/3IrPfJg.png)

### Check GPU Information
> GPU 0: Tesla T4 (15360MiB)
>
> References:
> - [[HackMD] Azure / Virtual Machines (VM) (incl. GPU) / Check GPU information](https://hackmd.io/bMasy0__T3-lqFnNFklgvw#查看-GPU-資訊)

- ### `$ nvcc --version` (optional)
```bash
$ nvcc --version

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

$ sudo apt update
$ sudo apt install -y nvidia-cuda-toolkit
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

# nvidia-smi is still unavailable at this point
```

- ### `$ nvidia-smi`
[Install the NVIDIA Toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=deb_local)
![](https://i.imgur.com/GgEixjY.png)

Select Ubuntu 18.04
:::warning
:warning: **Installing Parabricks on Ubuntu 20.04 fails with the following error:**
`libboost_filesystem.so.1.65.1`: No such file or directory
:::

```
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda-repo-ubuntu1804-11-6-local_11.6.1-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-6-local_11.6.1-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
```

```bash
$ nvidia-smi
Wed Mar 16 04:25:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI
510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |      0MiB / 15360MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
```
$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e19257bd-b662-c236-01e8-c5ffdc38127c)
```

It is still best to reboot after the installation
```
$ sudo reboot
```

<br>

### Check CPU Information
> Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
> 1 socket x 2 cores per socket x 2 hyperthreads per core = 4 threads

- ### Number of physical CPUs (unit: sockets)
```
$ cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
1
```

- ### Number of logical CPUs (unit: threads)
```
$ cat /proc/cpuinfo | grep "processor"
processor       : 0
processor       : 1
processor       : 2
processor       : 3
```
```
$ cat /proc/cpuinfo | grep "processor" | wc -l
4
```

- ### Number of cores per CPU (unit: cores per socket)
```
$ cat /proc/cpuinfo | grep "cores" | uniq
cpu cores       : 2
```

- ### Full CPU details
:::spoiler `$ cat /proc/cpuinfo`
```
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
stepping        : 7
microcode       : 0x500320a
cpu MHz         : 2499.998
cache size      : 36608 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 4999.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor       : 1
processor       : 2
processor       : 3
(and so on)
```
:::

<br>

### Check RAM Information
> 16GB

- `$ cat /proc/meminfo`
:::spoiler
```
MemTotal:       16085448 kB
MemFree:         9060944 kB
MemAvailable:   15481172 kB
Buffers:           52188 kB
Cached:          6471144 kB
SwapCached:            0 kB
Active:           362572 kB
Inactive:        6286904 kB
Active(anon):        828 kB
Inactive(anon):   133028 kB
Active(file):     361744 kB
Inactive(file):  6153876 kB
Unevictable:       23000 kB
Mlocked:           18464 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             13900 kB
Writeback:             0 kB
AnonPages:        149280 kB
Mapped:           132076 kB
Shmem:               860 kB
KReclaimable:     241468 kB
Slab:             295964 kB
SReclaimable:     241468 kB
SUnreclaim:        54496 kB
KernelStack:        2944 kB
PageTables:         3036 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8042724 kB
Committed_AS:     973540 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       10980 kB
VmallocChunk:          0 kB
Percpu:             3232 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:       75688 kB
DirectMap2M:     4859904 kB
DirectMap1G:    11534336 kB
```
:::

- `$ lsmem`
:::spoiler
```
RANGE                                  SIZE  STATE REMOVABLE  BLOCK
0x0000000000000000-0x00000000bfffffff    3G online       yes   0-23
0x0000000100000000-0x000000042fffffff 12.8G online       yes 32-133

Memory block size:       128M
Total online memory:    15.8G
Total offline memory:      0B
```
:::

- `$ free -g`
:::spoiler
```
              total        used        free      shared  buff/cache   available
Mem:             15           0           8           0           6          14
Swap:             0           0           0
```
:::

<br>
<hr>
<br>

## Upload an Image (AMI)
### [Using a custom Amazon Machine Image (AMI)](https://docs.aws.amazon.com/zh_tw/elasticbeanstalk/latest/dg/using-features.customenv.html)

<br>
<hr>
<br>

## Install Parabricks
> [[HackMD] AWS / Container Instances (EC2) / Install the Parabricks software](https://hackmd.io/qrkD86oqR0eVe09PxC9boA#Install-Parabricks)
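
Before running the installer, it may help to re-check the three things these notes tripped over: root volume size (Step 4), a working `nvidia-smi`, and the Ubuntu release. The following is only a sketch. The function names `size_ok`, `fs_size_gb`, and `preflight`, and the two threshold variables, are hypothetical and chosen for illustration; the 24 GB and 18.04 values are assumptions taken from the observations above, not Parabricks requirements. It relies on GNU `df` as shipped with Ubuntu.

```bash
#!/bin/sh
# Pre-flight sketch for a fresh EC2 instance before installing Parabricks.
# Thresholds and function names are assumptions for illustration only.

MIN_ROOT_GB=24         # assumption: the larger of the two root-volume tests in Step 4
EXPECTED_UBUNTU=18.04  # assumption: the release these notes installed CUDA for

# Succeeds when integer $1 is at least integer $2.
size_ok() {
  [ "$1" -ge "$2" ]
}

# Prints the total size, in whole GiB, of the filesystem mounted at $1 (GNU df).
fs_size_gb() {
  df -BG --output=size "$1" | tail -n 1 | tr -dc '0-9'
}

preflight() {
  root_gb=$(fs_size_gb /)
  if size_ok "${root_gb:-0}" "$MIN_ROOT_GB"; then
    echo "root volume: ${root_gb}G (ok)"
  else
    echo "root volume: ${root_gb:-?}G (below ${MIN_ROOT_GB}G, apt may run out of space)"
  fi

  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
  else
    echo "nvidia-smi not found: install the NVIDIA driver / CUDA package first"
  fi

  if [ -r /etc/os-release ]; then
    # /etc/os-release defines VERSION_ID, for example 18.04
    . /etc/os-release
    if [ "${VERSION_ID:-}" = "$EXPECTED_UBUNTU" ]; then
      echo "OS release: ${VERSION_ID} (ok)"
    else
      echo "OS release: ${VERSION_ID:-unknown} (these notes were written against ${EXPECTED_UBUNTU})"
    fi
  fi
}

preflight
```

The script only reports; it never exits non-zero, so it is safe to paste into an SSH session on a half-configured instance.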