# How to Connect to NCHC and Use Its Computing Resources

## Glossary

- NCHC official website [https://www.nchc.org.tw/](https://www.nchc.org.tw/)
  Announcements and other public information.
  ![01](https://hackmd.io/_uploads/ByLBjbpSC.png)
- iService [https://iservice.nchc.org.tw/nchc_service/index.php](https://iservice.nchc.org.tw/nchc_service/index.php)
  Accounts, projects, and billing (important).
  ![02](https://hackmd.io/_uploads/SJFCobTBA.png)
- TWCC [https://www.twcc.ai/](https://www.twcc.ai/)
  For creating cloud computing environments, containers, virtual machines, and so on (not used in this course).
  ![03](https://hackmd.io/_uploads/SyichbarR.jpg)
- Taiwania, Forerunner, and the other large systems
  ![image](https://hackmd.io/_uploads/SJuepZaSA.png)
  These are the machines we will actually work on, normally accessed over an encrypted connection (SSH).

|                    | Taiwania2       | Taiwania3      | Forerunner1     |
|--------------------|-----------------|----------------|-----------------|
| Login node         | 203.145.219.98  | 203.145.216.53 | 140.110.122.196 |
| Data-transfer node | 203.145.219.101 | 203.145.216.61 | TBD             |
| Graphical node     | NA              | 203.145.216.53 | 140.110.122.206 |

## Workflow

1. Register an iService account

[How to register as a member](https://iservice.nchc.org.tw/nchc_service/nchc_service_qa_single.php?qa_code=32)

Registration uses your mobile phone as personal identification, so one account cannot be shared by several people. Verification codes are sent by SMS at registration and for any later account-related changes, so make sure your phone and number settings do not block these messages. If you do not receive the SMS, work through the guide below first.

[Not receiving the verification SMS](https://iservice.nchc.org.tw/nchc_service/nchc_service_qa_single.php?qa_code=200)

Several sign-up methods are offered, but we recommend creating the account the traditional way rather than through external accounts such as Google, Facebook, or AIHub.

2. Set up a device for OTP

[Registering and using the two-factor authentication app](https://iservice.nchc.org.tw/nchc_service/nchc_service_qa_single.php?qa_code=774)

Because of the Cyber Security Management Act we must log in with two-factor authentication (2FA); until the regulation exempts NCHC, we can only accept this policy. Please do not try to bypass the check by other means: every possible workaround carries security risks, which is why they have been disabled.

3. Join a project

Ask your lab's project administrator (your professor or a senior lab member) to add you to the project, or ask for the lab's project ID and apply to join it yourself.

By default you only have a trial project (TRIxxxxxx), and access to many resources, such as the preinstalled academically licensed software, is restricted.

![04](https://hackmd.io/_uploads/Sy8lQGprA.png)

4. Prepare the connection software

The programs you may need are summarized below.

|                | description | solution1 | solution2 | solution3 |
|----------------|-------------|-----------|-----------|-----------|
| SSH Client     | Establish a secure connection to the cluster | [MobaXterm](https://mobaxterm.mobatek.net/) | [PuTTY](https://putty.org/) | WSL |
| File Transfer  | Exchange files with the cluster | [MobaXterm](https://mobaxterm.mobatek.net/) | [WinSCP](https://winscp.net/eng/index.php) | [FileZilla](https://filezilla-project.org/) |
| X server       | Display graphical programs running on the cluster on your local machine | [MobaXterm](https://mobaxterm.mobatek.net/) | [Xming](http://www.straightrunning.com/XmingNotes/) | [VcXsrv](https://github.com/marchaesen/vcxsrv) |
| Remote Desktop | Connect to a graphical desktop environment on the remote system | [ThinLinc](https://www.cendio.com/) | | |

5. Connect

[Demo video](https://youtu.be/7Hf20-XRaio)

-- MobaXterm
![image](https://hackmd.io/_uploads/Hy2GX46HC.png)

-- ThinLinc
![image](https://hackmd.io/_uploads/SyeBQE6BR.png)
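If you prefer a plain terminal (for example WSL or MobaXterm's local shell), the connection can also be made directly with the `ssh` command. A minimal sketch, using the Taiwania3 login node from the table above and the placeholder account `p00acy00` (substitute your own iService account); you will be prompted for your password and the OTP from step 2:

```
# Log in to the Taiwania3 login node (IP from the table in the Glossary).
ssh p00acy00@203.145.216.53

# Optional: add -X to forward X11 so graphical programs on the cluster can
# display locally (requires a local X server such as MobaXterm, Xming, or VcXsrv).
ssh -X p00acy00@203.145.216.53
```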
6. Upload files

[Demo video](https://youtu.be/Hisg85RiFzE)

![05](https://hackmd.io/_uploads/Bkve2XTH0.png)

The two example files below are what we will upload and use in the following steps.

`particles.c`
```
/******************************************************************************
 *                                                                            *
 * Basic MPI Example - Interacting Particles                                  *
 *                                                                            *
 * Simulate the interaction between 10k electrically charged particles.      *
 *                                                                            *
 ******************************************************************************
 *                                                                            *
 * The original code was written by Gustav at University of Indiana in 2003. *
 *                                                                            *
 * The current version has been tested/updated by the HPC department at      *
 * the Norwegian University of Science and Technology in 2011.               *
 *                                                                            *
 ******************************************************************************/

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <mpi.h>

#define FALSE 0
#define TRUE 1
#define MASTER_RANK 0
#define MAX_PARTICLES 10000
#define MAX_PROCS 128
#define EPSILON 1.0E-10
#define DT 0.01
#define N_OF_ITERATIONS 20

int main ( int argc, char **argv )
{
   int pool_size, my_rank;
   int i_am_the_master = FALSE;
   extern double drand48();
   extern void srand48();

   typedef struct {
      double x, y, z, vx, vy, vz, ax, ay, az, mass, charge;
   } Particle;

   Particle particles[MAX_PARTICLES]; /* Particles on all nodes       */
   int counts[MAX_PROCS];             /* Number of ptcls on each proc */
   int displacements[MAX_PROCS];      /* Offsets into particles       */
   int offsets[MAX_PROCS];            /* Offsets used by the master   */
   int particle_number, i, j, my_offset, true_i;
   int total_particles;               /* Total number of particles    */
   int count;                         /* Count time steps             */
   MPI_Datatype particle_type;

   double dt = DT;                    /* Integration time step        */
   double comm_time, start_comm_time, end_comm_time, start_time, end_time;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &pool_size);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   if (my_rank == MASTER_RANK) i_am_the_master = TRUE;

   particle_number = MAX_PARTICLES / pool_size;

   if (i_am_the_master)
      printf ("%d particles per processor\n", particle_number);

   MPI_Type_contiguous ( 11, MPI_DOUBLE, &particle_type );
   MPI_Type_commit ( &particle_type );

   MPI_Allgather ( &particle_number, 1, MPI_INT, counts, 1, MPI_INT,
                   MPI_COMM_WORLD );

   displacements[0] = 0;
   for (i = 1; i < pool_size; i++)
      displacements[i] = displacements[i-1] + counts[i-1];
   total_particles = displacements[pool_size - 1]
                     + counts[pool_size - 1];
   if (i_am_the_master)
      printf ("total number of particles = %d\n", total_particles);

   my_offset = displacements[my_rank];

   MPI_Gather ( &my_offset, 1, MPI_INT, offsets, 1, MPI_INT, MASTER_RANK,
                MPI_COMM_WORLD );

   if (i_am_the_master) {
      printf ("offsets: ");
      for (i = 0; i < pool_size; i++)
         printf ("%d ", offsets[i]);
      printf("\n");
   }

   srand48((long) (my_rank + 28));

   /* Here each process initializes its own particles. */

   for (i = 0; i < particle_number; i++) {
      particles[my_offset + i].x = drand48();
      particles[my_offset + i].y = drand48();
      particles[my_offset + i].z = drand48();
      particles[my_offset + i].vx = 0.0;
      particles[my_offset + i].vy = 0.0;
      particles[my_offset + i].vz = 0.0;
      particles[my_offset + i].ax = 0.0;
      particles[my_offset + i].ay = 0.0;
      particles[my_offset + i].az = 0.0;
      particles[my_offset + i].mass = 1.0;
      particles[my_offset + i].charge = 1.0 - 2.0 * (i % 2);
   }

   start_time = MPI_Wtime();
   comm_time = 0.0;

   for (count = 0; count < N_OF_ITERATIONS; count++) {

      if (i_am_the_master) printf("Iteration %d.\n", count + 1);

      /* Here processes exchange their particles with each other. */

      start_comm_time = MPI_Wtime();

      MPI_Allgatherv ( particles + my_offset, particle_number, particle_type,
                       particles, counts, displacements, particle_type,
                       MPI_COMM_WORLD );

      end_comm_time = MPI_Wtime();
      comm_time += end_comm_time - start_comm_time;

      for (i = 0; i < particle_number; i++) {

         true_i = i + my_offset;

         /* initialize accelerations to zero */

         particles[true_i].ax = 0.0;
         particles[true_i].ay = 0.0;
         particles[true_i].az = 0.0;

         for (j = 0; j < total_particles; j++) {

            /* Do not evaluate interaction with yourself. */

            if (j != true_i) {

               /* Evaluate forces that j-particles exert on the i-particle. */

               double dx, dy, dz, r2, r, qj_by_r3;

               /* Here we absorb the minus sign by changing the order
                  of i and j. */

               dx = particles[true_i].x - particles[j].x;
               dy = particles[true_i].y - particles[j].y;
               dz = particles[true_i].z - particles[j].z;

               r2 = dx * dx + dy * dy + dz * dz;
               r = sqrt(r2);

               /* Quench the force if the particles are too close. */

               if (r < EPSILON) qj_by_r3 = 0.0;
               else qj_by_r3 = particles[j].charge / (r2 * r);

               /* accumulate the contribution from particle j */

               particles[true_i].ax += qj_by_r3 * dx;
               particles[true_i].ay += qj_by_r3 * dy;
               particles[true_i].az += qj_by_r3 * dz;
            }
         }
      }

      /*
       * We advance particle positions and velocities only *after*
       * we have evaluated all accelerations using the *old* positions.
       */

      for (i = 0; i < particle_number; i++) {

         double qidt_by_m, dt_by_2, vx0, vy0, vz0;

         true_i = i + my_offset;

         /* Save old velocities */

         vx0 = particles[true_i].vx;
         vy0 = particles[true_i].vy;
         vz0 = particles[true_i].vz;

         /* Now advance the velocity of particle i */

         qidt_by_m = particles[true_i].charge * dt / particles[true_i].mass;
         particles[true_i].vx += particles[true_i].ax * qidt_by_m;
         particles[true_i].vy += particles[true_i].ay * qidt_by_m;
         particles[true_i].vz += particles[true_i].az * qidt_by_m;

         /* Use average velocity in the interval to advance
            the particles' positions */

         dt_by_2 = 0.5 * dt;
         particles[true_i].x += (vx0 + particles[true_i].vx) * dt_by_2;
         particles[true_i].y += (vy0 + particles[true_i].vy) * dt_by_2;
         particles[true_i].z += (vz0 + particles[true_i].vz) * dt_by_2;
      }
   }

   MPI_Barrier(MPI_COMM_WORLD);
   end_time = MPI_Wtime();

   if (i_am_the_master) {
      printf ("Communication time %8.5f seconds\n", comm_time);
      printf ("Computation time   %8.5f seconds\n",
              end_time - start_time - comm_time);
      printf ("\tEvaluated %d interactions\n",
              N_OF_ITERATIONS * total_particles * (total_particles - 1));
   }

   MPI_Finalize ();
   exit(0);
}
```

`impi.sh`
```
#!/bin/bash
#SBATCH --job-name=particles
#SBATCH --output=%x-%j.out
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=56
#SBATCH --partition=ct112
#SBATCH --no-requeue
#SBATCH --account=GOV113006

module purge
module load intel/2024_01_46

echo "Job script is as follows:"
cat "$0"    # $0 points to the spooled copy of this batch script
echo "End of job script."

echo "Your job starts at `date`"
mpiexec -n $SLURM_NTASKS ./particles.exe
echo "Your job completed at `date`"
```
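Besides the graphical transfer tools listed in step 4, you can also push these two files from a local terminal. A minimal sketch with `scp`, assuming the files are in your current local directory and using the Taiwania3 data-transfer node from the Glossary table, the placeholder account `p00acy00`, and the remote directory `test_case` used in the next step:

```
# Create the target directory on the cluster first (via the login node).
ssh p00acy00@203.145.216.53 "mkdir -p ~/test_case"

# Copy the source file and the job script through the data-transfer node.
scp particles.c impi.sh p00acy00@203.145.216.61:~/test_case/
```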
7. Compile the code

[Demo video](https://youtu.be/pyrXx1KcWJw)

```
11:16:43 p00acy00@ilgn01:~$ module av

-------------------------- /pkg/modulefiles/software ---------------------------
   apps/abaqus/2023                       libs/hdf5/1.14.3
   apps/abaqus/2024                   (D) tools/code-server
   apps/adf/2024.101-intelmpi-intel       tools/jupyterlab
   apps/adf/2024.102-openmpi-intel    (D) tools/miniconda3
   apps/gaussian/g16                      tools/paraview/client/5.12.0
   apps/lsdyna/R12.0.0                    tools/paraview/server/5.12.0
   apps/lsdyna/R13.0.0                    tools/rstudio/4.4.0
   apps/lsdyna/R13.1.0                (D) tools/singularity-ce/4.1.1

--------------------- /pkg/modulefiles/middleware/compiler ---------------------
   gcc/8.5.0  (D)    gcc/11.2.0      intel/2022_3_1    intel/2024_01_46 (L,D)
   gcc/10.4.0        intel/2019_u5   intel/2023_2

----------------------------- /pkg/x86/modulefiles -----------------------------
   gcc/13.2.0                nvhpc-hpcx-cuda12/23.9    oneapi/2023
   gsl/2.7.1                 nvhpc-hpcx/23.9           oneapi/2024_01_46 (D)
   mvapich/3.0               nvhpc-nompi/23.9          openmpi/4.1
   nvhpc-byo-compiler/23.9   nvhpc/23.9                openmpi/5.0.0     (D)

  Where:
   L:  Module is loaded
   D:  Default Module

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

11:30:12 p00acy00@ilgn01:test_case$ module load intel/2024_01_46
11:30:27 p00acy00@ilgn01:test_case$ mpiic
mpiicc   mpiicpc  mpiicpx  mpiicx
11:30:27 p00acy00@ilgn01:test_case$ mpiicx particles.c -o particles.exe
particles.c:36:15: warning: a function declaration without a prototype is deprecated in all versions of C and is treated as a zero-parameter prototype in C2x, conflicting with a previous declaration [-Wdeprecated-non-prototype]
   36 | extern void srand48();
      |             ^
/usr/include/stdlib.h:481:13: note: conflicting prototype is here
  481 | extern void srand48 (long int __seedval) __THROW;
      |             ^
1 warning generated.
11:30:49 p00acy00@ilgn01:test_case$ ls
impi.sh  particles.c  particles.exe
```
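The listing above also shows GCC and Open MPI modules. If you would rather build with that toolchain, a minimal sketch might look like the following; the module versions are taken from the listing, while the wrapper name and flags are assumptions, so check the module's help text if anything differs on your system:

```
# Hypothetical alternative build with GCC + Open MPI (versions from `module av` above).
module purge
module load gcc/13.2.0 openmpi/5.0.0

# mpicc is Open MPI's C compiler wrapper; link libm explicitly for sqrt().
mpicc -O2 particles.c -o particles.exe -lm
```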
8. Prepare a job script

![06](https://hackmd.io/_uploads/BybSeNpH0.png)

|                  | PC / supercomputer                                    | Dining at home / at a restaurant                  |
|------------------|-------------------------------------------------------|---------------------------------------------------|
| Resource manager | Scheduling software (LSF, PBS, SGE, Slurm)            | Front-of-house manager                            |
| Requests         | Number of CPUs, memory, number of GPUs, rough runtime | Party size, window seat or not, vegetarian or not |

[Demo video](https://)

Check the system status to decide what resources to request.

```
11:30:53 p00acy00@ilgn01:test_case$ sinfo -s
PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
alphatest    inact  4:00:00     412/82/58/552   icpnp[101-156,201-256,301-348],icpnq[101-156,201-256,301-356,401-456,501-556,601-656,701-756]
betatest     up     2-00:00:00  384/62/2/448    icpnp[101-156,201-256],icpnq[101-156,201-256,301-356,401-456,501-556,601-656]
development  up     2:00:00     28/10/0/38      icpnp[311-348]
ct112        up     4-00:00:00  109/2/1/112     icpnp[101-156,201-256]
ct448        up     4-00:00:00  109/2/1/112     icpnp[101-156,201-256]
ct1k         up     4-00:00:00  109/2/1/112     icpnp[101-156,201-256]
visual-dev   up     2:00:00     0/6/0/6         gpn[01-06]
visual       up     2-00:00:00  0/4/0/4         gpn[03-06]
```

queue_table_v20240613
![queue_table_v20240613](https://hackmd.io/_uploads/SyWL376r0.jpg)

9. Submit and check jobs

```
11:41:37 p00acy00@ilgn01:test_case$ ls
impi.sh  particles.c  particles.exe
11:42:05 p00acy00@ilgn01:test_case$ sbatch impi.sh
Submitted batch job 38122
11:42:16 p00acy00@ilgn01:test_case$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
38072          particle      ct112  gov113006        112  COMPLETED      0:0
38072.batch       batch             gov113006         56  COMPLETED      0:0
38072.extern     extern             gov113006        112  COMPLETED      0:0
38072.0      hydra_bst+             gov113006        112  COMPLETED      0:0
38122         particles      ct112  gov113006        112    PENDING      0:0
```

The job state codes that Slurm may report, and what they mean, are listed below:

| Status    | Code  | Explanation                                                            |
| --------- | :---: | ---------------------------------------------------------------------- |
| COMPLETED | `CD`  | The job has completed successfully.                                     |
| COMPLETING| `CG`  | The job is finishing but some processes are still active.              |
| FAILED    | `F`   | The job terminated with a non-zero exit code and failed to execute.    |
| PENDING   | `PD`  | The job is waiting for resource allocation. It will eventually run.    |
| PREEMPTED | `PR`  | The job was terminated because of preemption by another job.           |
| RUNNING   | `R`   | The job is currently allocated to a node and is running.               |
| SUSPENDED | `S`   | A running job has been stopped with its cores released to other jobs.  |
| STOPPED   | `ST`  | A running job has been stopped with its cores retained.                |

A full list of these job state codes can be found in [Slurm's documentation](https://slurm.schedmd.com/squeue.html#lbAG).
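While a job is pending or running you can also watch or cancel it from the command line; a short sketch, reusing job ID 38122 from the transcript above:

```
# List only your own jobs, including their current state.
squeue -u $USER

# Cancel a job you no longer need.
scancel 38122
```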
10. References

[Taiwania 2 user guide](https://man.twcc.ai/@twccdocs/doc-twnia2-main-zh/https%3A%2F%2Fman.twcc.ai%2F%40twccdocs%2Ftwnia2-overview-zh)

[Taiwania 3 user guide](https://man.twcc.ai/@TWCC-III-manual/H1bEXeGcu)

[Usage guide for pilot run for Forerunner1](https://hackmd.io/tu_fk2RDQSyukJiLVBNzhg)