# Usage Guide for the H100 Pilot Run
## Announcement
| Announce Date | Type | Duration |
| ------------- | ------------------ | ------------------------------------ |
| 2024/11/21 | System maintenance | 2024/11/22 07:00 ~ 2024/11/23 18:00 |
| 2024/11/23 | System online | 2024/11/23 23:00 |
## [Pilot Program Simplified User Manual (2024/11/11)](https://narlabshq-my.sharepoint.com/:b:/g/personal/1203087_narlabs_org_tw/Eb96Z_CzOh1Pv8yI18p6QzIBIbnJBzEdqt5mNoz2asPKgg?e=jaah54)
## Login information
| Name | IP | Note |
| ------------- | ------------- | ------------------- |
| lgn01 | 140.110.148.3 | Login node |
| xdata | 140.110.148.8 | Data transfer node |
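A minimal connection sketch follows; the project account `p00acy00` is taken from the example prompts in this guide, so replace it with your own account.
```
# Interactive work goes through the login node (lgn01).
ssh p00acy00@140.110.148.3

# Bulk data transfers should go through the data transfer node (xdata)
# rather than the login node; rsync resumes cleanly if interrupted.
rsync -avP ./dataset/ p00acy00@140.110.148.8:~/dataset/
```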
## CPU-GPU-NIC Affinity
```
[p00acy00@lgn01 src]$ nvidia-smi topo -m
GPU0 GPU1 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE SYS SYS SYS SYS 58-111,170-223 1 N/A
GPU1 NODE X SYS SYS SYS SYS 58-111,170-223 1 N/A
NIC0 SYS SYS X PIX NODE NODE
NIC1 SYS SYS PIX X NODE NODE
NIC2 SYS SYS NODE NODE X PIX
NIC3 SYS SYS NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
```
```
[p00acy00@hgpn06 ~]$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PXB PXB NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PXB PXB NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PXB PXB SYS SYS SYS SYS SYS SYS 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE PXB PXB SYS SYS SYS SYS SYS SYS 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PXB PXB NODE NODE NODE NODE 56 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS PXB PXB NODE NODE NODE NODE 56 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB PXB 56 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB PXB 56 1 N/A
NIC0 PXB PXB NODE NODE SYS SYS SYS SYS X PXB NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB NODE NODE SYS SYS SYS SYS PXB X NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE X PIX NODE NODE SYS SYS SYS SYS SYS SYS
NIC3 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE PIX X NODE NODE SYS SYS SYS SYS SYS SYS
NIC4 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE NODE X PXB SYS SYS SYS SYS SYS SYS
NIC5 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE NODE PXB X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS SYS X PXB NODE NODE NODE NODE
NIC7 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS SYS PXB X NODE NODE NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE X PIX NODE NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE PIX X NODE NODE
NIC10 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X PXB
NIC11 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
```
```
[p00acy00@hgpn06 ~]$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
cpubind: 0 1
nodebind: 0 1
membind: 0 1
[p00acy00@hgpn06 ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1031217 MB
node 0 free: 972645 MB
node 1 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 1 size: 1032173 MB
node 1 free: 1014453 MB
node distances:
node 0 1
0: 10 21
1: 21 10
```
```
[p00acy00@hgpn13 ~]$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PXB NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS 0-47 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS 0-47 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS 0-47 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS 0-47 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PXB NODE NODE NODE NODE NODE 56-103 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE PXB NODE NODE 56-103 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB NODE 56-103 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE PXB 56-103 1 N/A
NIC0 PXB NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC3 NODE PXB NODE NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS SYS SYS
NIC4 NODE NODE PXB NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS SYS SYS
NIC5 NODE NODE NODE PXB SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE X PIX NODE NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE PIX X NODE NODE NODE
NIC9 SYS SYS SYS SYS NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE X NODE NODE
NIC10 SYS SYS SYS SYS NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X NODE
NIC11 SYS SYS SYS SYS NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
```
```
[p00acy00@hgpn13 ~]$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
cpubind: 0 1
nodebind: 0 1
membind: 0 1
[p00acy00@hgpn13 ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1031325 MB
node 0 free: 743296 MB
node 1 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 1 size: 1032118 MB
node 1 free: 747882 MB
node distances:
node 0 1
0: 10 21
1: 21 10
```
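One way to use the topology tables above is to keep each process on the CPUs, memory, and HCA closest to its GPU. The sketch below is an illustration only, based on the hgpn13 layout shown above; `train.py` is a placeholder workload, and pinning via Slurm's own binding options instead of `numactl` is equally valid.
```
# Illustration only: run on GPU4 of an hgpn13-style node.
# From the topology table, GPU4 sits on NUMA node 1 (CPUs 56-103) and is
# PXB-attached to NIC6 (mlx5_6), so keep CPU, memory, and HCA on that side.
export CUDA_VISIBLE_DEVICES=4
export NCCL_IB_HCA=mlx5_6
numactl --cpunodebind=1 --membind=1 python train.py
```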
## Schedule
| Phase | Duration | Description |
|-------|-----------------------|------------------------------------------------|
| 1 | 2024/11/25~2024/12/25 | General-purpose test run. |
| 2 | 2024/12/26~2025/01/06 | Large-scale test run (special users first). |
[Phase 1 trial schedule for the 晶創 host](https://narlabshq-my.sharepoint.com/:x:/g/personal/1203087_narlabs_org_tw/EQrupbr4DvxChaRhkfgKtXkBR_DGiaJFLKNy-26J2Qwf2Q?e=pIZZfp)
## Queuing Policies
### Phase 1 (general-purpose test run)
| Partition | GPUs/Partition | Executing-Jobs/Partition | Scheduled-Jobs/Partition | Walltime (Hours) | QoS Factor | Preemptible |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| dev | 16 | 2 | 2 | 2 | 10 | no |
| normal | 32 | 2 | 2 | 24 | 1 | no |
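As a starting point, a Phase 1 submission might look like the sketch below. The account name is a placeholder, and the `--gres` syntax is an assumption based on the `gres/gpu` entries in the partition configuration shown further down; check the user manual linked above for the exact options.
```
#!/bin/bash
#SBATCH --job-name=dev-test
#SBATCH --partition=dev            # dev partition: 2-hour walltime limit (see table)
#SBATCH --account=<your_project>   # placeholder, replace with your project account
#SBATCH --nodes=1
#SBATCH --gres=gpu:8               # assumed gres syntax for requesting GPUs
#SBATCH --time=02:00:00            # must not exceed the dev walltime

srun python train.py               # train.py is a placeholder workload
```
Submit with `sbatch job.sh`; for longer runs, switch to `--partition=normal` and a `--time` value within its 24-hour limit.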
### Phase 2 (large-scale test run; special users first)
This time slot is prioritized for special users conducting large-scale computational tests. General users may still submit jobs to the dev partition; however, these jobs may be preempted by jobs from higher-priority users under the preemptive scheduling policy.
| Partition | GPUs/Partition | Executing-Jobs/Partition | Scheduled-Jobs/Partition | Walltime (Hours) | QoS Factor | Preemptible |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| dev | 16 | 2 | 2 | 2 | 1 | yes |
| normal | infinite | infinite | infinite | infinite | 1 | no |
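During Phase 2, dev jobs should be written so that being preempted is not fatal. The flag below is only a defensive suggestion; whether preempted jobs are requeued or cancelled depends on the site's `PreemptMode`, which is not stated here.
```
# Sketch: ask Slurm to requeue the job if it is preempted, and checkpoint
# often enough that a restart loses little work. job.sh is a placeholder.
sbatch --partition=dev --gres=gpu:8 --time=02:00:00 --requeue job.sh
```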
### Special Projects
| Partition | GPUs/Partition | Executing-Jobs/Partition | Scheduled-Jobs/Partition | Walltime (Hours) | QoS Factor | Preemptible |
|:----------------:|:---------------------:|:-------------------------------:|:-------------------------------:|:---------------------------:|:-----------------:|:------------------:|
| Taide | 40 | infinite | infinite | infinite | 1 | no |
```
[p00acy00@lgn01 ~]$ scontrol show partition
PartitionName=dev
AllowGroups=ALL DenyAccounts=gov112003,gov112009,gov113008,gov113026 AllowQos=ALL
AllocNodes=ALL Default=NO QoS=p_dev
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=02:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=hgpn[06-10,13-21]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1568 TotalNodes=14 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=1512,mem=28000000M,node=14,billing=1512,gres/gpu=112
PartitionName=normal
AllowGroups=ALL DenyAccounts=gov112003,gov112009,gov113008,gov113026 AllowQos=ALL
AllocNodes=ALL Default=NO QoS=p_normal
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=hgpn[06-10,13-21]
PriorityJobFactor=10 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1568 TotalNodes=14 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=1512,mem=28000000M,node=14,billing=1512,gres/gpu=112
PartitionName=taide
AllowGroups=ALL AllowAccounts=gov112003,gov112009,gov113008,gov113026,gov113080 AllowQos=ALL
AllocNodes=ALL Default=NO QoS=p_taide
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=hgpn[01-05]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=560 TotalNodes=5 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=540,mem=10000000M,node=5,billing=540,gres/gpu=40
```
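Before submitting, it can help to check the current partition limits and your own queue state. The commands below are standard Slurm queries; the `sinfo` format string is just one possible choice.
```
# Partition availability, time limit, node count, and GRES per partition.
sinfo -p dev,normal,taide -o "%P %a %l %D %G"

# Your own pending and running jobs.
squeue -u $USER

# Full limits for a single partition, as in the output above.
scontrol show partition dev
```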
### Revision history
[Revision history for schedule and policy](/RS8Vi9fcSSe8z-cVeOI-Ug)