
NAMD Basics and Hands-On Parallel Computing

Bad news: the 2080 Ti does not support P2P

The RTX 2080 and 2080 Ti do not support P2P:

on Titan RTX and also Turing-family GeForce GPUs such as 2080/2080Ti, P2P is only supported if/when the NVLink bridge is in place (i.e. only over NVLink). For Turing family GeForce GPUs without a NVLink bridge option, P2P is not supported.

https://forums.developer.nvidia.com/t/does-titan-rtx-support-p2p-access-w-o-nvlink/70065

Preliminary P2P check

https://github.com/NVIDIA/cuda-samples/tree/5f97d7d0dff880bc6567faa4c5e62e389a6d6999/Samples/0_Introduction/simpleP2P

Check that GPU P2P works correctly:

git clone https://github.com/NVIDIA/cuda-samples.git
cd ~/cuda-samples/Samples/0_Introduction/simpleP2P
make
./simpleP2P


bandwidth test

cd ~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
make
./p2pBandwidthLatencyTest


nvidia-smi topo -m


https://developer.nvidia.com/gpudirectforvideo

NAMD 2.14

Note!!! The binary and benchmark must be placed in a folder shared by all nodes (e.g. /opt), otherwise multi-node parallel runs will not work.

Download and extract the source code

wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Source.tar.gz
tar xzf NAMD_2.14_Source.tar.gz
cd NAMD_2.14_Source
tar xf charm-6.10.2.tar

Single node, CPU + GPU

wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Linux-x86_64-multicore-CUDA.tar.gz
tar xzf NAMD_2.14_Linux-x86_64-multicore-CUDA.tar.gz

Single node, CPU only

wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Linux-x86_64-multicore.tar.gz
tar xzf NAMD_2.14_Linux-x86_64-multicore.tar.gz

verbs-smp-CUDA

wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Linux-x86_64-verbs-smp-CUDA.tar.gz
tar xzf NAMD_2.14_Linux-x86_64-verbs-smp-CUDA.tar.gz
cd NAMD_2.14_Linux-x86_64-verbs-smp-CUDA

Choose the Charm++ version to build (optional)

Build and test the Charm++/Converse library (single-node multicore version):

cd charm-6.10.2
./build charm++ multicore-linux-x86_64 --with-production
cd multicore-linux-x86_64/tests/charm++/megatest
make pgm
./pgm +p4   (multicore does not support multiple nodes)
cd ../../../../..

Build and test the Charm++/Converse library (InfiniBand verbs version):

cd charm-6.10.2
./build charm++ verbs-linux-x86_64 --with-production
cd verbs-linux-x86_64/tests/charm++/megatest
make pgm
./charmrun ++mpiexec +p4 ./pgm   (uses mpiexec to launch processes)
cd ../../../../..

Build and test the Charm++/Converse library (InfiniBand UCX OpenMPI PMIx version):

cd charm-6.10.2
./build charm++ ucx-linux-x86_64 ompipmix --with-production
cd ucx-linux-x86_64-ompipmix/tests/charm++/megatest
make pgm
mpiexec -n 4 ./pgm   (run as for an OpenMPI program on your cluster)
cd ../../../../..

Download the TCL and FFTW libraries (required)

Download and install TCL and FFTW libraries:
(cd to NAMD_2.14_Source if you're not already there)

wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
tar xzf fftw-linux-x86_64.tar.gz
mv linux-x86_64 fftw
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
tar xzf tcl8.5.9-linux-x86_64.tar.gz
tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
mv tcl8.5.9-linux-x86_64 tcl
mv tcl8.5.9-linux-x86_64-threaded tcl-threaded

Choose the version to configure (must match the Charm++ build above)

Set up build directory and compile:
multicore version:

./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64 --with-cuda

InfiniBand verbs version:

./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64 --with-cuda

InfiniBand UCX version:

./config Linux-x86_64-g++ --charm-arch ucx-linux-x86_64-ompipmix --with-cuda

Build that version with make (required)

cd Linux-x86_64-g++
make   #(or gmake -j4, which should run faster)

If you get an FFTW-related build error, open arch/Linux-x86_64-g++.arch:

vim arch/Linux-x86_64-g++.arch

Add -no-pie to the CXXOPTS line:

NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux-x86_64

CXX = g++ -m64 -std=c++0x
CXXOPTS = -O3 -fexpensive-optimizations -ffast-math -no-pie
CC = gcc -m64
COPTS = -O3 -fexpensive-optimizations -ffast-math

Copy the binary to /usr/bin so it can be run from anywhere:

sudo cp namd2 /usr/bin
namd2

Testing a single node

Download the stmv benchmark (old and new versions differ slightly):

wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar xzf stmv.tar.gz
cd stmv

Other benchmarks:
https://www.ks.uiuc.edu/Research/namd/utilities/

Run namd2 (with a single GPU)

namd2 +p4 +setcpuaffinity +devices 0 stmv.namd
  • +devices takes the IDs of the GPUs to use; if omitted, all GPUs are used by default

Single node, multiple GPUs

namd2 +p4 +setcpuaffinity +devices 0,1 stmv.namd

https://www.ks.uiuc.edu/Research/namd/2.14/notes.html

Testing multiple nodes

Note!!! The binary and benchmark must be placed in a folder shared by all nodes (e.g. /opt), otherwise multi-node parallel runs will not work.

Running the program with charmrun

The Charm++ runtime system underlying NAMD supports a variety of underlying networks, so be sure to choose the NAMD/Charm++ build best suited to your hardware platform. In general, we recommend that users avoid MPI-based NAMD builds.

Set up a nodelist file for multi-node runs

nodelist

host hpc1
host hpc2

pathfix setting (rarely needed)

  • pathfix <dir1> <dir2> - replaces the directory the host uses for mounts: dir1 is replaced with dir2
  • the pathfix path must match the output directory in the config file
group main ++pathfix /tmp_mnt /
host alpha1
host alpha2

Single process

./charmrun ++p 36 ++ppn 9 ++nodelist nodelist ./namd2 +setcpuaffinity +ignoresharing /opt/stmv/stmv.namd
  • ++p & ++ppn
    • ++p: total number of PEs (worker threads)
    • ++ppn: number of PEs (worker threads) per process

An explicit PE/communication-thread/GPU mapping can also be given:

++p 80 ++ppn 4 ./namd2 +setcpuaffinity +pemap 0-3,4-7,8-11,12-15,16-19,20-23,24-27,28-31 +commap 32,33,34,35,36,37,38,39 +devices 0,1,2,3,4,5,6,7

Multi process

./charmrun ++ppn 4 ++n 9 ++nodelist nodelist ./namd2 +setcpuaffinity +ignoresharing /opt/stmv/stmv.namd
  • ++n & ++ppn
    • ++n: total number of processes
    • ++ppn: number of PEs (worker threads) per process
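The arithmetic relating these options can be checked directly (the numbers mirror the two charmrun examples above):

```shell
# ++n * ++ppn = total PEs (multi-process form)
# ++p / ++ppn = number of processes (++p form)
echo $((9 * 4))    # ++n 9  ++ppn 4 -> 36 PEs in total
echo $((36 / 9))   # ++p 36 ++ppn 9 -> 4 processes
```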

Other options:

  • ++mpiexec
    • Use the cluster's mpiexec job launcher instead of the built-in ssh method.

https://www.nvidia.cn/data-center/gpu-accelerated-applications/namd/

https://www.ks.uiuc.edu/Research/namd/2.14/ug/node102.html

https://charm.readthedocs.io/en/latest/charm++/manual.html#launching-programs-with-charmrun

Verifying the result

The output reports how many hosts were used in total; this can serve as evidence that the multi-node run succeeded.
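One way to check the host count is to grep the log. NAMD 2.14 prints a line like "Info: Running on N processors, M nodes, H physical nodes." near the top of the output (exact wording assumed from 2.14-era logs; the log file name here is hypothetical):

```shell
# Simulate a NAMD log line for illustration; on a real run,
# point the grep at your actual output file instead.
echo "Info: Running on 36 processors, 4 nodes, 2 physical nodes." > stmv.log

# Extract the physical host count:
grep -o '[0-9]* physical nodes' stmv.log
```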

tuning step 1&2

https://www.ks.uiuc.edu/Research/namd/2.14/ug/node94.html

1. Choose the number of CPU cores

namd2 +p4 +setcpuaffinity +devices 0,1 stmv.namd

+p sets the number of CPU cores to use. The official guidance is to allocate about 8 cores per GPU for the best performance, then adjust the count up or down to find the optimum.

If the core count exceeds the patch count (PATCH GRID IS nx BY ny BY nz in the log), try setting it to nx × ny × nz + 1 (the first sweet spot). If even more cores are available, try 2 × nx × ny × nz + 1.
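The sweet-spot rule can be written out explicitly (an illustrative sketch; read nx, ny, nz from the PATCH GRID line of your own log):

```shell
# First and second "sweet spot" core counts for a patch grid,
# e.g. a log line "PATCH GRID IS 8 BY 8 BY 8" (nx=ny=nz=8):
nx=8; ny=8; nz=8
patches=$((nx * ny * nz))
echo $((patches + 1))       # first sweet spot  -> 513
echo $((2 * patches + 1))   # second sweet spot -> 1025
```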

2. Choose some launch options (essentially no effect on performance)

  • +ignoresharing
    • argument must be used to disable the shared-device error message.

tuning step 3: adjust the config file

Config file parameter reference:

https://www.ks.uiuc.edu/Research/namd/2.14/ug/node110.html

Background

NAMD has two different metrics:

  • performance
    • wallclock, ns/day
  • accuracy
    • no way to output accuracy has been found yet

There are three kinds of interactions between molecules, in order of magnitude:

  1. bonded interactions
  2. electrostatic interactions
  3. van der Waals forces

Since bonded interactions are fixed (the atoms are connected), the computation focuses on non-contact forces, i.e. electrostatics and van der Waals. And because electrostatic forces act on molecules far more strongly than van der Waals forces, they also have a larger impact on both performance and accuracy, so they are the best place to start.

1. Find a suitable PME grid size

Particle Mesh Ewald (PME) computes the Coulomb (electrostatic) forces between molecules.

The PME grid size is the number of small cubes the system is divided into.

  • A finer grid gives higher resolution and accuracy, but compute is wasted on empty regions that contain no molecules
  • A coarser grid gives lower resolution and accuracy
  • The grid must be fine enough to ensure accuracy, but excessive resolution increases the computational cost

PME splits the long-range Coulomb interaction into a short-range and a long-range part.

  • The short-range part is computed in real space, since it decays quickly at close distances.
  • The long-range part is computed in Fourier space, exploiting periodic boundary conditions to handle distant interactions efficiently.

PMEGridSpacing

NAMD provides a PMEGridSpacing parameter that automatically picks a PME grid size based on the value you supply.

The officially recommended starting value is 1.0 (in Å, ångström); 1.5 is already considered quite large.

PMEGridSpacing 1.0

After setting the parameter above, the output also reports the grid size the system chose for you.

Try values from 1.0 to 1.5, compare performance, then move on to fine-tuning.

PMEGridSizeX,Y,Z

PMEGridSizeX,Y,Z set the grid dimension along each axis:

PME                  on
PMEGridSizeX         216
PMEGridSizeY         216
PMEGridSizeZ         216

Start from the best-performing PMEGridSpacing found above and fine-tune; the official recommendation is to use grid sizes whose only prime factors are 2, 3, and 5.

If the PME grid is smaller than the number of cores, the following parameters can improve PME scalability:

  • Example: with a grid of 4x4x5 = 80 on a system given 96 CPU cores (+p96), the settings below are recommended:
twoAwayX             yes
twoAwayY             yes
twoAwayZ             yes
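The 2/3/5 factor rule can be checked with a short script (a helper of my own, not part of NAMD; PMEGridSpacing does an equivalent search internally):

```shell
# Find the smallest valid PME grid size >= a target, where "valid"
# means its only prime factors are 2, 3 and 5 (the official rule).
next_grid_size() {
    s=$1
    while :; do
        n=$s
        for p in 2 3 5; do
            while [ $((n % p)) -eq 0 ]; do n=$((n / p)); done
        done
        [ "$n" -eq 1 ] && { echo "$s"; return; }
        s=$((s + 1))
    done
}

next_grid_size 216   # 216 = 2^3 * 3^3, already valid -> prints 216
next_grid_size 217   # next valid size -> prints 225 (3^2 * 5^2)
```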

PMETolerance (normally left alone)

PMETolerance controls the splitting between the short-range and long-range parts; the default is 10^-6. It affects accuracy.

PMETolerance 1.0e-6

PMEProcessors (special cases)

When performance is far below expectations, it can sometimes be improved by restricting the amount of parallelism used.

PMEProcessors must be larger than PMEGridSizeX and PMEGridSizeY, and must not exceed the number of available cores.

PMEProcessors 8

2. Adjust cutoff and pairlist

cutoff

cutoff is a distance: taking a molecule as the center, the program computes the electrostatic and van der Waals interactions with every molecule inside a sphere of that radius.

Molecules move between timesteps, so recomputing which molecules fall within the cutoff at every step would cost too much; the program therefore only redoes this once every n timesteps:

timestep            1.0 # in femtoseconds (fs)
stepspercycle       20

pairlist

Since molecules keep moving, a molecule that is inside the cutoff may have left it by the next cycle. Hence the pairlist: the shell between pairlistdist and cutoff holds the molecules that may enter or leave the cutoff range.

pairlistdist must be larger than cutoff. In practice we want no molecule to move farther than pairlistdist - cutoff within a single cycle, while keeping pairlistdist as close to cutoff as possible.

The pairlist distance (pairlistdist) is usually set to cutoff + 1.5 (the official recommendation). Reducing it shrinks the patch size, but make sure it does not become so small that it hurts simulation accuracy.

cutoff              12
pairlistdist        13.5

pairlistTrigger, pairlistGrow, pairlistShrink

Setting the following three parameters lets the simulation dynamically adjust the pairlist distance during the run; whether an adjustment is needed is decided by a value called the pairlist tolerance.

The pairlist tolerance is defined as:

  • initially (pairlistdist - cutoff)/2, but refined during the run

Definitions of the three parameters:

pairlistTrigger

  • Default 0.3: when the distance molecules move exceeds (or stays under) 30% of the pairlist tolerance, the condition triggers and the pairlist distance is adjusted; see the two parameters below

pairlistGrow (exceeded)

  • Default 0.01: if the distance exceeds 30% of the pairlist tolerance, the pairlist distance grows by 1%

pairlistShrink (under)

  • Default 0.01: if the distance stays under 30% of the pairlist tolerance, the pairlist distance shrinks by 1%
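Written out as a config fragment, the three parameters with their documented defaults look like this (shown explicitly only for illustration; omitting them gives the same values):

```
pairlistTrigger      0.3
pairlistGrow         0.01
pairlistShrink       0.01
```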

Adjusting switchdist:

If switching is set to on and switchdist to 7.0 Å, the switching function takes effect from 7.0 Å, making the van der Waals potential smoothly approach 0 at 8.0 Å.

Setting these parameters sensibly makes the nonbonded calculations more accurate and maintains energy conservation.

switching must be set to on for switchdist to take effect:

switching on
switchdist          10

The electrostatic calculation follows a similar idea.

https://www.ks.uiuc.edu/Research/namd/2.6/olddocs/ug/node24.html#section:electdesc

Experimental data (validity to be confirmed):

cutoff 12, pairlistdist 13.5, switchdist 10 (defaults):
WallClock: 85.830368
cutoff 10, pairlistdist 11.5, switchdist 9:
WallClock: 86.828651
cutoff 13, pairlistdist 14.5, switchdist 10:
WallClock: 83.484566

Adjusting margin (purpose to be confirmed)

margin < extra length in patch dimension (Å) >
Acceptable Values: positive decimal
Default Value: 0.0


Description: An internal tuning parameter used in determining the size of the cubes of space with which NAMD partitions the system. The value of this parameter will not change the physical results of the simulation. Unless you are very motivated to get the very best possible performance, just leave this value at the default.

http://www.ks.uiuc.edu/Research/namd/wiki/?NamdPerformanceTuning

https://www.ks.uiuc.edu/Research/namd/2.14/ug/node110.html

Others

Adjusting how the output is handled may improve performance; the following settings are worth trying:

shiftIOToOne yes 
ldbUnloadOne yes
noPatchesOnOne yes

CPU only, 1000 steps
20 cores: WallClock: 559.907898 CPUTime: 558.425598 Memory: 4691.003906 MB
40 cores: WallClock: 419.788544 CPUTime: 416.760803 Memory: 6124.734375 MB

Test, 2000 steps
Single node, 2 GPUs, +p4: WallClock: 200.125610
Single node, 2 GPUs, +p6: WallClock: 163.561081
Single node, 2 GPUs, +p8: WallClock: 132.111237
Single node, 2 GPUs, +p16: WallClock: 113.672607

Test, 1000 steps
Single node, 2 GPUs, +p20: WallClock: 96.107086
Single node, 2 GPUs, +p20: WallClock: 106.593513

Single node, 1 GPU, +p2: WallClock: 320.530579
Single node, 1 GPU, +p4: WallClock: 195.832718

Visualization

Download VMD

wget https://www.ks.uiuc.edu/Research/vmd/alpha/vmd-1.9.4a55.bin.LINUXAMD64-CUDA102-OptiX650-OSPRay185-RTXRTRT.opengl.tar.gz
tar xzf vmd-1.9.4a55.bin.LINUXAMD64-CUDA102-OptiX650-OSPRay185-RTXRTRT.opengl.tar.gz

Build and install

cd vmd-1.9.4a55
./configure
cd src
sudo make install