AWS / Container Instances (EC2)
===
###### tags: `ML / Platform`
###### tags: `ML`, `AWS`, `EC2`, `container`, `GPU`
<br>
[TOC]
<br>
## AWS entry point
https://signin.aws.amazon.com/console
### EC2 entry point
https://ap-southeast-1.console.aws.amazon.com/ec2/v2/home
- [Amazon EC2 G4 instances](https://aws.amazon.com/tw/ec2/instance-types/g4/)
AWS bills these as the industry's most cost-effective GPU instances.
<br>
<hr>
<br>
## EC2 walkthrough
### Step 1: Choose an Amazon Machine Image (AMI)
![](https://i.imgur.com/vWyzoLr.png)
<br>
### Step 2: Choose an instance type
[](https://i.imgur.com/ltiVNRo.pn)
<br>
### Step 3: Configure instance details
Skipped.
<br>
### Step 4: Add storage
![](https://i.imgur.com/jzK8HGG.png)
:::warning
:warning: **The default 8 GB root volume is too small: it fills up almost as soon as you start running `apt update` and installing packages.**
` failed to write status database record about 'python3-oauthlib' to '/var/lib/dpkg/status': No space left on device`
```
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 7.7G 7.7G 0 100% / <--------- full
devtmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 1.6G 860K 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/loop0 25M 25M 0 100% /snap/amazon-ssm-agent/4046
/dev/loop1 62M 62M 0 100% /snap/core20/1242
/dev/loop3 56M 56M 0 100% /snap/core18/2253
/dev/loop2 68M 68M 0 100% /snap/lxd/21835
/dev/loop4 43M 43M 0 100% /snap/snapd/14066
tmpfs 1.6G 0 1.6G 0% /run/user/1000
```
Test 1: at least 18 GB is needed.

Test 2: at least 24 GB is needed (unclear whether this figure includes temporary files).

:::
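Because the root volume filled up in the middle of an install, it may help to check free space before running `apt`. A minimal sketch, assuming a POSIX `df`; the function name `root_free_gib` and the 10 GiB threshold are illustrative and not part of the original notes:

```bash
# Print the free space on the root filesystem in whole GiB.
# `df -Pk` gives POSIX-format output in 1 KiB blocks; column 4 is "Available".
root_free_gib() {
  df -Pk / | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Hypothetical threshold: warn when fewer than this many GiB are free.
MIN_FREE_GIB=10

free_gib=$(root_free_gib)
if [ "$free_gib" -lt "$MIN_FREE_GIB" ]; then
  echo "Only ${free_gib} GiB free on /; consider a larger root volume." >&2
else
  echo "${free_gib} GiB free on /."
fi
```

The threshold is only a starting point; the 18 GB and 24 GB figures above describe total volume size, not free space.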
:::info
:bulb: **Some installed programs live on the root volume.**
To speed up program startup, General Purpose SSD (gp3) is probably the better choice (not verified).
:::
<br>
### Step 5: Add tags
Skipped.
<br>
### Step 6: Configure the security group
![](https://i.imgur.com/L520DLq.png)
<br>
### Step A: Select an existing key pair or create a new one

On later launches, select the existing `.pem` key.

<br>
### Step B: Connect
```bash=
$ ssh -i parabricks.pem ubuntu@13.213.40.194
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0664 for 'parabricks.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "parabricks.pem": bad permissions
ubuntu@13.213.40.194: Permission denied (publickey).
$ chmod 400 parabricks.pem
$ ssh -i parabricks.pem ubuntu@13.213.40.194
```
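The session above fails because the key file is readable by group and others, and `chmod 400` fixes it. The same check can be scripted before connecting. A minimal sketch, assuming GNU coreutils `stat` (as on Ubuntu); the helper name `key_is_private` is hypothetical:

```bash
# Return 0 if the key file grants no permissions to group or others,
# which is the condition the ssh client enforces on private keys.
# Assumes GNU coreutils `stat` (the `-c` flag).
key_is_private() {
  mode=$(stat -c '%a' "$1") || return 2
  case "$mode" in
    *00 | 0) return 0 ;;   # e.g. 400 or 600
    *) return 1 ;;         # e.g. 664 or 644
  esac
}
```

On the client it could be used as `key_is_private parabricks.pem || chmod 400 parabricks.pem` before calling `ssh`.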
<br>
### Step C: Format a brand-new attached disk
> References:
> - [[HackMD] Azure / Virtual Machines (VM) (including GPU) / Mounting a disk / Step3 - Format the disk](https://hackmd.io/bMasy0__T3-lqFnNFklgvw#Step3---%E6%A0%BC%E5%BC%8F%E5%8C%96%E7%A3%81%E7%A2%9F)
> - [Make instance store volumes available on your instance](https://docs.aws.amazon.com/zh_tw/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#making-instance-stores-available-on-your-instances)
> 
- ### Find the attached disk
```
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 7.7G 1.5G 6.3G 19% /
devtmpfs 479M 0 479M 0% /dev
tmpfs 485M 0 485M 0% /dev/shm
tmpfs 97M 812K 97M 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 485M 0 485M 0% /sys/fs/cgroup
/dev/loop1 56M 56M 0 100% /snap/core18/2253
/dev/loop0 25M 25M 0 100% /snap/amazon-ssm-agent/4046
/dev/loop4 43M 43M 0 100% /snap/snapd/14066
/dev/loop3 68M 68M 0 100% /snap/lxd/21835
/dev/loop2 62M 62M 0 100% /snap/core20/1242
tmpfs 97M 0 97M 0% /run/user/1000
```
`df -h` does not show the attached disk because it lists only mounted filesystems; use `lsblk` instead.
```
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 25M 1 loop /snap/amazon-ssm-agent/4046
loop1 7:1 0 55.5M 1 loop /snap/core18/2253
loop2 7:2 0 61.9M 1 loop /snap/core20/1242
loop3 7:3 0 67.2M 1 loop /snap/lxd/21835
loop4 7:4 0 42.2M 1 loop /snap/snapd/14066
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
xvdb 202:16 0 500G 0 disk <---
```
`xvdb` has no mount point yet.
- ### Format the disk
```bash=
$ sudo parted /dev/xvdb --script mklabel gpt mkpart extpart ext4 0% 100%
```
```bash=
$ sudo mkfs -t ext4 /dev/xvdb1
mke2fs 1.45.5 (07-Jan-2020)
Creating filesystem with 131071488 4k blocks and 32768000 inodes
Filesystem UUID: d7cd9c94-786d-4c6a-b968-f4e9921a95e3
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks):
done
Writing superblocks and filesystem accounting information: done
```
- `mkfs -t ext4 /dev/xvdb1` is equivalent to `mkfs.ext4 /dev/xvdb1`.
- Reference
  - [mount: wrong fs type, bad option, bad superblock](https://unix.stackexchange.com/questions/315063)
```bash=
$ sudo partprobe /dev/xvdb1
```
```bash=
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 25M 1 loop /snap/amazon-ssm-agent/4046
loop1 7:1 0 55.5M 1 loop /snap/core18/2253
loop2 7:2 0 61.9M 1 loop /snap/core20/1242
loop3 7:3 0 67.2M 1 loop /snap/lxd/21835
loop4 7:4 0 42.2M 1 loop /snap/snapd/14066
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
xvdb 202:16 0 500G 0 disk
└─xvdb1 202:17 0 500G 0 part <--- formatted, but no mount point yet
```
- ### Mount the disk
```bash=
$ sudo mkdir /workspace
$ sudo mount /dev/xvdb1 /workspace/
```
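A mount made with `mount` lasts only until the next reboot. One common way to make it persistent is an `/etc/fstab` entry keyed on the filesystem UUID that `mkfs` printed above. A minimal sketch, assuming ext4 and the `/workspace` mount point; the helper `fstab_entry` is hypothetical, and the `nofail` option is an assumption chosen so the instance still boots if the volume is detached:

```bash
# Build an /etc/fstab line for a filesystem identified by UUID.
# Arguments: UUID, mount point, filesystem type (defaults to ext4).
fstab_entry() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "${3:-ext4}"
}

# UUID taken from the mkfs output above; on a live system it can be read
# with `sudo blkid -s UUID -o value /dev/xvdb1`.
fstab_entry d7cd9c94-786d-4c6a-b968-f4e9921a95e3 /workspace
```

To apply it, append the printed line to `/etc/fstab` (for example with `sudo tee -a /etc/fstab`) and verify with `sudo mount -a` before rebooting.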
<br>
### Step D: Upload data
> References
> - [[HackMD][Parabricks-v3.5] Datasets](/Bx3R_i2wSfmDl1cQE0w3OA)
```bash
$ rsync -e "ssh -i aws/parabricks.pem" \
--progress --partial --append -zh \
./datasets/WGS-LIS-AI018A_R* \
ubuntu@13.213.40.194:/workspace/datasets/
```
```bash
$ wget -O parabricks_sample.tar.gz \
"https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
```
<br>
### Step E: Alternatively, mount an existing disk
- [Attach an Amazon EBS volume to an instance](https://docs.aws.amazon.com/zh_tw/AWSEC2/latest/UserGuide/ebs-attaching-volume.html)

- Commands
```bash=
$ sudo mkdir /workspace # /workspace is created with root ownership
$ sudo chown $(id -u):$(id -g) /workspace
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 25M 1 loop /snap/amazon-ssm-agent/4046
loop1 7:1 0 55.5M 1 loop /snap/core18/2253
loop2 7:2 0 61.9M 1 loop /snap/core20/1242
loop3 7:3 0 67.2M 1 loop /snap/lxd/21835
loop4 7:4 0 42.2M 1 loop /snap/snapd/14066
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
xvdb 202:16 0 500G 0 disk
└─xvdb1 202:17 0 500G 0 part <--- not mounted
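$ sudo mount /dev/xvdb1 /workspace/
$ sudo chown $(id -u):$(id -g) /workspace # run again after mounting: the mounted filesystem's root directory hides the ownership set earlier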
```
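Spotting the unmounted device by eye in the `lsblk` listing can also be scripted. A minimal sketch that reads `lsblk -P -o NAME,PKNAME,TYPE,MOUNTPOINT` output on stdin and prints disks or partitions that have no mount point and no child partitions; the function name `unmounted_leaves` is hypothetical, and mount points containing spaces are not handled:

```bash
# List block devices (disk or part) that are not mounted and have no
# child partitions. Expects `lsblk -P -o NAME,PKNAME,TYPE,MOUNTPOINT`
# key="value" pairs on stdin.
unmounted_leaves() {
  awk '
    # Return the value of key="value" on the current line, without quotes.
    function field(key,    i, kv) {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) { gsub(/"/, "", kv[2]); return kv[2] }
      }
      return ""
    }
    {
      name = field("NAME"); parent = field("PKNAME")
      dtype[name] = field("TYPE"); mnt[name] = field("MOUNTPOINT")
      order[NR] = name
      if (parent != "") has_child[parent] = 1
    }
    END {
      for (n = 1; n <= NR; n++) {
        d = order[n]
        if ((dtype[d] == "disk" || dtype[d] == "part") && mnt[d] == "" && !(d in has_child))
          print d
      }
    }'
}
```

On the instance this would be run as `lsblk -P -o NAME,PKNAME,TYPE,MOUNTPOINT | unmounted_leaves`; for the listing above it should print `xvdb1`.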
<br>
<hr>
<br>
## Storage I/O performance
> - [[aws] Amazon EBS volume types](https://docs.aws.amazon.com/zh_tw/AWSEC2/latest/UserGuide/ebs-volume-types.html)
> - [Comparison of old and new AWS EBS volume types](https://www.ithome.com.tw/news/141769)
> ![](https://i.imgur.com/AnWgi2D.png)
### Measured performance
| gp3 provisioned<br>(IOPS, MiB/s) | Read IOPS | Read BW (MiB/s) | Write IOPS | Write BW (MiB/s) |
|-----:|-----|-----|-----|-----|
| **16000, 625** | 12.3k | 48.2 | 4124 | 16.1 |
| **16000, 625** | 12.4k | 48.3 | 4126 | 16.1 |
||
| **14706, 573** | 11.4k | 44.3 | 3790 | 14.8 |
||
| **14600, 571**<br>(TWCC NFS performance) | 11.3k | 44.0 | 3763 | 14.7 |
||
| **14600, 570** | 11.3k | 44.0 | 3763 | 14.7 |
| **14600, 570** | 11.3k | 44.0 | 3763 | 14.7 |
||
| **3000, 125** | 2252 | 8.80 | 752 | 2.94 |
| **3000, 125** | 2302 | 8.99 | 768 | 3.00 |
| **3000, 125** | 2302 | 8.99 | 768 | 3.00 |
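
The runs use fio with a 4 KiB block size (see the test data), so bandwidth is just IOPS times 4 KiB. That is why the measured MiB/s stay far below the provisioned throughput: these runs are limited by IOPS. Read plus write IOPS also add up to roughly the provisioned IOPS (for example 12.3k + 4124 ≈ 16.4k against 16000). A small sketch of the arithmetic; the helper name `iops_to_mibs` is illustrative:

```bash
# Convert IOPS at a given block size (bytes, default 4096) to MiB/s.
iops_to_mibs() {
  awk -v iops="$1" -v bs="${2:-4096}" 'BEGIN { printf "%.2f\n", iops * bs / 1048576 }'
}

iops_to_mibs 12344.78   # average read IOPS, 16000-IOPS run
iops_to_mibs 4122.46    # average write IOPS, 16000-IOPS run
iops_to_mibs 2301.77    # average read IOPS, 3000-IOPS run
```

This prints 48.22, 16.10 and 8.99, in line with the 48.2, 16.1 and 8.99 MiB/s in the table.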
### Test data
- :::spoiler `gp3: 16000 IOPS, 625 MiB/s`, round1
```
fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=52.1MiB/s,w=17.7MiB/s][r=13.3k,w=4541 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=1): err= 0: pid=14334: Fri Mar 18 04:36:12 2022
read: IOPS=12.3k, BW=48.2MiB/s (50.6MB/s)(6141MiB/127296msec)
bw ( KiB/s): min=48128, max=95408, per=99.96%, avg=49379.12, stdev=4023.26, samples=254
iops : min=12032, max=23852, avg=12344.78, stdev=1005.81, samples=254
write: IOPS=4124, BW=16.1MiB/s (16.9MB/s)(2051MiB/127296msec)
bw ( KiB/s): min=15224, max=31920, per=99.96%, avg=16489.83, stdev=1374.04, samples=254
iops : min= 3806, max= 7980, avg=4122.46, stdev=343.51, samples=254
cpu : usr=4.97%, sys=9.98%, ctx=1537463, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=1572145,525007,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=48.2MiB/s (50.6MB/s), 48.2MiB/s-48.2MiB/s (50.6MB/s-50.6MB/s), io=6141MiB (6440MB), run=127296-127296msec
WRITE: bw=16.1MiB/s (16.9MB/s), 16.1MiB/s-16.1MiB/s (16.9MB/s-16.9MB/s), io=2051MiB (2150MB), run=127296-127296msec
Disk stats (read/write):
xvdb: ios=1566200/523779, merge=3657/464, ticks=5984982/2080847, in_queue=3290932, util=99.91%
```
:::
- :::spoiler `gp3: 3000 IOPS, 125 MiB/s`, round1
```
fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
Jobs: 1 (f=0): [f(1)][100.0%][r=10.0MiB/s,w=3723KiB/s][r=2805,w=930 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=1): err= 0: pid=14347: Fri Mar 18 04:49:25 2022
read: IOPS=2302, BW=9209KiB/s (9430kB/s)(6141MiB/682886msec)
bw ( KiB/s): min= 8696, max=25792, per=99.99%, avg=9207.12, stdev=486.68, samples=1365
iops : min= 2174, max= 6448, avg=2301.77, stdev=121.67, samples=1365
write: IOPS=768, BW=3075KiB/s (3149kB/s)(2051MiB/682886msec)
bw ( KiB/s): min= 2560, max= 8968, per=99.98%, avg=3074.48, stdev=218.64, samples=1365
iops : min= 640, max= 2242, avg=768.61, stdev=54.66, samples=1365
cpu : usr=1.30%, sys=2.95%, ctx=1999740, majf=0, minf=6
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=1572145,525007,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=9209KiB/s (9430kB/s), 9209KiB/s-9209KiB/s (9430kB/s-9430kB/s), io=6141MiB (6440MB), run=682886-682886msec
WRITE: bw=3075KiB/s (3149kB/s), 3075KiB/s-3075KiB/s (3149kB/s-3149kB/s), io=2051MiB (2150MB), run=682886-682886msec
Disk stats (read/write):
xvdc: ios=1567423/524271, merge=4032/536, ticks=32288915/11245506, in_queue=39371552, util=100.00%
```
:::
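
The fio command line itself is not recorded in these notes. The sketch below reconstructs a plausible one from the output header (`rw=randrw`, 4 KiB blocks, `ioengine=libaio`, `iodepth=64`, job name `fiotest`). The 8 GiB size and the 75% read mix are inferred from the 6141 MiB read plus 2051 MiB written; `--direct=1` and the test file path are assumptions. The helper `fio_randrw_cmd` is hypothetical and only prints the command so it can be reviewed first:

```bash
# Print (rather than run) a fio command consistent with the recorded header.
# Argument: path of the test file on the volume under test (assumed default).
fio_randrw_cmd() {
  printf 'fio --name=fiotest --filename=%s --ioengine=libaio --direct=1 --rw=randrw --rwmixread=75 --bs=4k --iodepth=64 --size=8G\n' \
    "${1:-/workspace/fiotest.bin}"
}

fio_randrw_cmd /workspace/fiotest.bin
```

After checking the printed line, paste it into the shell on the instance to run the benchmark against the mounted volume.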
<br>
<hr>
<br>
## EC2 / g4dn.xlarge
> - ### [Amazon EC2 G4 instances](https://aws.amazon.com/tw/ec2/instance-types/g4/)
> AWS bills these as the industry's most cost-effective GPU instances.
> Product details:
> ![](https://i.imgur.com/3IrPfJg.png)
### Check GPU information
> GPU 0: Tesla T4 (15360MiB)
>
> References:
> - [[HackMD] Azure / Virtual Machines (VM) (including GPU) / Check GPU information](https://hackmd.io/bMasy0__T3-lqFnNFklgvw#查看-GPU-資訊)
- ### `$ nvcc --version` (optional)
```bash
$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
$ sudo apt update
$ sudo apt install -y nvidia-cuda-toolkit
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
# nvidia-smi is still unavailable at this point
```
- ### `$ nvidia-smi`
[Install the NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=deb_local)

Select Ubuntu 18.04.
:::warning
:warning: **Installing Parabricks on Ubuntu 20.04 fails with the following error:**
`libboost_filesystem.so.1.65.1`: No such file or directory
:::
```
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda-repo-ubuntu1804-11-6-local_11.6.1-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-6-local_11.6.1-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
```
```bash
$ nvidia-smi
Wed Mar 16 04:25:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 40C P0 27W / 70W | 0MiB / 15360MiB | 10% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
```
$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e19257bd-b662-c236-01e8-c5ffdc38127c)
```
It is still best to reboot after the installation.
```
$ sudo reboot
```
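After the reboot, a quick scripted check can confirm that the driver still sees the GPU. A minimal sketch that counts the devices listed by `nvidia-smi -L`; the function name `count_gpus` is illustrative, and it reads the listing on stdin so it can be tried without a GPU:

```bash
# Count GPU entries in `nvidia-smi -L` output supplied on stdin.
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:'
}

# Sample line copied from the output above.
echo 'GPU 0: Tesla T4 (UUID: GPU-e19257bd-b662-c236-01e8-c5ffdc38127c)' | count_gpus
```

On the instance the call would be `nvidia-smi -L | count_gpus`; a g4dn.xlarge should report 1.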
<br>
### Check CPU information
> Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
> 1 (socket) x 2 (cores/socket) x 2 (hyperthreads/core) = 4 threads
- ### Number of physical CPUs (unit: sockets)
```
$ cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
1
```
- ### Number of logical CPUs (unit: threads)
```
$ cat /proc/cpuinfo | grep "processor"
processor : 0
processor : 1
processor : 2
processor : 3
```
```
$ cat /proc/cpuinfo | grep "processor" | wc -l
4
```
- ### Number of cores per CPU (unit: cores/socket)
```
$ cat /proc/cpuinfo | grep "cores" | uniq
cpu cores : 2
```
- ### Full CPU details
:::spoiler `$ cat /proc/cpuinfo`
```
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
stepping : 7
microcode : 0x500320a
cpu MHz : 2499.998
cache size : 36608 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
processor : 2
processor : 3
(and so on)
```
:::
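
The three `grep` checks above can be combined into one pass that reproduces the "sockets x cores x threads" line. A minimal sketch, assuming x86-style `/proc/cpuinfo` fields (`processor`, `physical id`, `cpu cores`); the function name `cpu_topology` is hypothetical:

```bash
# Summarise CPU topology from /proc/cpuinfo-style text on stdin.
# Usage on the instance: cpu_topology < /proc/cpuinfo
cpu_topology() {
  awk -F':' '
    /^processor/   { threads++ }          # one entry per logical CPU
    /^physical id/ { sockets[$2] = 1 }    # distinct IDs are sockets
    /^cpu cores/   { cores = $2 + 0 }     # cores per socket
    END {
      n = 0
      for (s in sockets) n++
      if (n == 0 || cores == 0) { print "incomplete cpuinfo"; exit 1 }
      printf "%d socket(s) x %d core(s)/socket x %d thread(s)/core = %d threads\n",
             n, cores, threads / (n * cores), threads
    }'
}
```

For the g4dn.xlarge listing above this yields 1 socket, 2 cores per socket and 2 threads per core, for 4 threads in total.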
<br>
### Check RAM information
> 16GB
- `$ cat /proc/meminfo`
:::spoiler
```
MemTotal: 16085448 kB
MemFree: 9060944 kB
MemAvailable: 15481172 kB
Buffers: 52188 kB
Cached: 6471144 kB
SwapCached: 0 kB
Active: 362572 kB
Inactive: 6286904 kB
Active(anon): 828 kB
Inactive(anon): 133028 kB
Active(file): 361744 kB
Inactive(file): 6153876 kB
Unevictable: 23000 kB
Mlocked: 18464 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 13900 kB
Writeback: 0 kB
AnonPages: 149280 kB
Mapped: 132076 kB
Shmem: 860 kB
KReclaimable: 241468 kB
Slab: 295964 kB
SReclaimable: 241468 kB
SUnreclaim: 54496 kB
KernelStack: 2944 kB
PageTables: 3036 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8042724 kB
Committed_AS: 973540 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 10980 kB
VmallocChunk: 0 kB
Percpu: 3232 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 75688 kB
DirectMap2M: 4859904 kB
DirectMap1G: 11534336 kB
```
:::
- `$ lsmem`
:::spoiler
```
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000042fffffff 12.8G online yes 32-133
Memory block size: 128M
Total online memory: 15.8G
Total offline memory: 0B
```
:::
- `$ free -g`
:::spoiler
```
total used free shared buff/cache available
Mem: 15 0 8 0 6 14
Swap: 0 0 0
```
:::
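
The instance is nominally 16 GB, yet `free -g` shows 15. Converting `MemTotal` from kB to GiB shows why: the memory visible to the OS is a little over 15 GiB, and `free -g` reports whole GiB. A small sketch of the conversion; the helper name `kb_to_gib` is illustrative:

```bash
# Convert a kB figure from /proc/meminfo (1 kB = 1024 bytes there) to GiB.
kb_to_gib() {
  awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1048576 }'
}

kb_to_gib 16085448   # MemTotal from the listing above
```

This prints 15.34, which also sits below the 15.8G of total online memory reported by `lsmem`.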
<br>
<hr>
<br>
## Upload an image (AMI)
### [Using a custom Amazon Machine Image (AMI)](https://docs.aws.amazon.com/zh_tw/elasticbeanstalk/latest/dg/using-features.customenv.html)
<br>
<hr>
<br>
## Install Parabricks
> [[HackMD] AWS / Container Instances (EC2) / Install Parabricks software](https://hackmd.io/qrkD86oqR0eVe09PxC9boA#Install-Parabricks)