RTX 2080 and 2080 Ti do not support P2P (except over NVLink):
on Titan RTX and also Turing-family GeForce GPUs such as 2080/2080Ti, P2P is only supported if/when the NVLink bridge is in place (i.e. only over NVLink). For Turing family GeForce GPUs without a NVLink bridge option, P2P is not supported.
https://forums.developer.nvidia.com/t/does-titan-rtx-support-p2p-access-w-o-nvlink/70065
Check that GPU P2P is working properly
git clone https://github.com/NVIDIA/cuda-samples.git
cd ~/cuda-samples/Samples/0_Introduction/simpleP2P
make
./simpleP2P
bandwidth test
cd ~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
make
./p2pBandwidthLatencyTest
nvidia-smi topo -m
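Newer drivers can also print a P2P capability matrix directly; a quick check, assuming your driver version supports the -p2p flag (see nvidia-smi topo --help):
nvidia-smi topo -p2p r   # P2P read capability between every GPU pair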
https://developer.nvidia.com/gpudirectforvideo
Note!!! The binary and the benchmark must both be placed in a folder shared by all nodes (e.g. /opt), otherwise multi-node parallel runs will not work.
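A minimal sketch of sharing /opt over NFS, assuming an NFS server is installed and the two hosts hpc1/hpc2 used in the nodelist later on:
# on hpc1 (the host that owns /opt):
echo "/opt hpc2(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra
# on hpc2:
sudo mount -t nfs hpc1:/opt /opt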
Download and extract the source code
wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Source.tar.gz
tar xzf NAMD_2.14_Source.tar.gz
cd NAMD_2.14_Source
tar xf charm-6.10.2.tar
Single-node CPU + GPU
wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Linux-x86_64-multicore-CUDA.tar.gz
tar xzf NAMD_2.14_Linux-x86_64-multicore-CUDA.tar.gz
Single-node CPU
wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Linux-x86_64-multicore.tar.gz
tar xzf NAMD_2.14_Linux-x86_64-multicore.tar.gz
verbs-smp-CUDA
wget https://www.ks.uiuc.edu/Research/namd/2.14/download/946183/NAMD_2.14_Linux-x86_64-verbs-smp-CUDA.tar.gz
tar xzf NAMD_2.14_Linux-x86_64-verbs-smp-CUDA.tar.gz
cd NAMD_2.14_Linux-x86_64-verbs-smp-CUDA
Choose which version to build (optional)
Build and test the Charm++/Converse library (single-node multicore version):
cd charm-6.10.2
./build charm++ multicore-linux-x86_64 --with-production
cd multicore-linux-x86_64/tests/charm++/megatest
make pgm
./pgm +p4 (multicore does not support multiple nodes)
cd ../../../../..
Build and test the Charm++/Converse library (InfiniBand verbs version):
cd charm-6.10.2
./build charm++ verbs-linux-x86_64 --with-production
cd verbs-linux-x86_64/tests/charm++/megatest
make pgm
./charmrun ++mpiexec +p4 ./pgm (uses mpiexec to launch processes)
cd ../../../../..
Build and test the Charm++/Converse library (InfiniBand UCX OpenMPI PMIx version):
cd charm-6.10.2
./build charm++ ucx-linux-x86_64 ompipmix --with-production
cd ucx-linux-x86_64-ompipmix/tests/charm++/megatest
make pgm
mpiexec -n 4 ./pgm (run as for an OpenMPI program on your cluster)
cd ../../../../..
Download and install the TCL and FFTW libraries (required):
(cd to NAMD_2.14_Source if you're not already there)
wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
tar xzf fftw-linux-x86_64.tar.gz
mv linux-x86_64 fftw
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
tar xzf tcl8.5.9-linux-x86_64.tar.gz
tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
mv tcl8.5.9-linux-x86_64 tcl
mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
Choose which version to compile (same as the Charm++ build chosen above)
Set up build directory and compile:
multicore version:
./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64 --with-cuda
InfiniBand verbs version:
./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64 --with-cuda
InfiniBand UCX version:
./config Linux-x86_64-g++ --charm-arch ucx-linux-x86_64-ompipmix --with-cuda
Build that version (required)
cd Linux-x86_64-g++
make #(or gmake -j4, which should run faster)
If you get FFTW-related errors, open arch/Linux-x86_64-g++.arch
vim arch/Linux-x86_64-g++.arch
and append -no-pie to the CXXOPTS line:
NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux-x86_64
CXX = g++ -m64 -std=c++0x
CXXOPTS = -O3 -fexpensive-optimizations -ffast-math -no-pie
CC = gcc -m64
COPTS = -O3 -fexpensive-optimizations -ffast-math
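The same edit can be scripted; a sketch with sed, assuming the stock arch file where the options line starts with "CXXOPTS = ":
sed -i '/^CXXOPTS = /s/$/ -no-pie/' arch/Linux-x86_64-g++.arch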
Copy the executable into /usr/bin so it can be run from anywhere:
sudo cp namd2 /usr/bin
namd2
Download the stmv benchmark (older and newer versions differ slightly)
wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar xzf stmv.tar.gz
cd stmv
Other benchmarks:
https://www.ks.uiuc.edu/Research/namd/utilities/
Run namd2 (using a single GPU)
namd2 +p4 +setcpuaffinity +devices 0 stmv.namd
Single node, multiple GPUs
namd2 +p4 +setcpuaffinity +devices 0,1 stmv.namd
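To confirm that both GPUs are actually busy during the run, watch utilization from a second terminal:
watch -n 1 nvidia-smi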
https://www.ks.uiuc.edu/Research/namd/2.14/notes.html
Note!!! The binary and the benchmark must both be placed in a folder shared by all nodes (e.g. /opt), otherwise multi-node parallel runs will not work.
Running programs with charmrun
The Charm++ runtime system that NAMD is built on supports many different underlying networks, so be sure to choose the NAMD/Charm++ build best suited to your hardware platform. In general, users are advised to avoid MPI-based NAMD builds.
Set up a nodelist file to run across multiple nodes
nodelist
host hpc1
host hpc2
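charmrun launches the remote processes over ssh by default, so passwordless ssh to every host in the nodelist must work first; a quick sanity check for the hosts above:
ssh hpc1 hostname
ssh hpc2 hostname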
pathfix setting (rarely needed)
pathfix <dir1> <dir2>
- Replaces the directory a host uses for mounting: dir1 is replaced with dir2.
- The pathfix path must match the output directory in the configuration file.
Example:
group main ++pathfix /tmp_mnt /
host alpha1
host alpha2
single process
./charmrun ++p 36 ++ppn 9 ++nodelist nodelist ./namd2 +setcpuaffinity +ignoresharing /opt/stmv/stmv.namd
++p & ++ppn
++p: total number of PEs (worker threads), i.e. the total number of CPU cores used
++ppn: number of PEs (worker threads) per process
++p 80 ++ppn 4 ./namd2 +setcpuaffinity +pemap 0-3,4-7,8-11,12-15,16-19,20-23,24-27,28-31 +commap 32,33,34,35,36,37,38,39 +devices 0,1,2,3,4,5,6,7
multi process
./charmrun ++ppn 4 ++n 9 ++nodelist nodelist ./namd2 +setcpuaffinity +ignoresharing /opt/stmv/stmv.namd
++n & ++ppn
++n: how many processes to launch in total
++ppn: number of PEs (worker threads) per process
Other options:
++mpiexec
https://www.nvidia.cn/data-center/gpu-accelerated-applications/namd/
https://www.ks.uiuc.edu/Research/namd/2.14/ug/node102.html
https://charm.readthedocs.io/en/latest/charm++/manual.html#launching-programs-with-charmrun
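Since ++p is just ++n multiplied by ++ppn, the two launches below should be equivalent (a sketch: 4 processes * 9 PEs each = 36 PEs in total):
./charmrun ++p 36 ++ppn 9 ++nodelist nodelist ./namd2 +setcpuaffinity +ignoresharing /opt/stmv/stmv.namd
./charmrun ++n 4 ++ppn 9 ++nodelist nodelist ./namd2 +setcpuaffinity +ignoresharing /opt/stmv/stmv.namd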
The output shows how many hosts are being used in total, which is a good way to verify that the multi-node run actually started.
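Another quick check is to confirm that namd2 processes are actually running on the other hosts, e.g.:
ssh hpc2 pgrep -a namd2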
https://www.ks.uiuc.edu/Research/namd/2.14/ug/node94.html
namd2 +p4 +setcpuaffinity +devices 0,1 stmv.namd
+p sets the number of CPU cores to use. The official guidance is that around 8 cores per GPU gets close to the best performance; from there, add or remove cores step by step to find the optimum.
If the number of cores is greater than the patch count (nx * ny * nz from the log line PATCH GRID IS nx BY ny BY nz),
you can try setting the core count to PATCH GRID + 1 (the first sweet spot).
If you still have cores to spare, try 2 * PATCH GRID + 1, as in the worked example below.
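A worked example with hypothetical numbers: if the log prints PATCH GRID IS 6 BY 6 BY 6, the system has 6 * 6 * 6 = 216 patches, so the first sweet spot to try is +p217, and with plenty of cores available, 2 * 216 + 1 = 433.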
+ignoresharing
https://www.ks.uiuc.edu/Research/namd/2.14/ug/node110.html
NAMD tuning balances two different criteria: performance and accuracy.
There are three kinds of forces between atoms; from strongest to weakest they are bonded forces, electrostatic forces, and van der Waals forces.
Because the bonded forces are fixed in magnitude (the atoms stay connected), the computational effort goes into the non-contact forces, i.e. electrostatics and van der Waals. And since the electrostatic force is far stronger than the van der Waals force, it has the larger impact on both performance and accuracy, so it is the natural place to start tuning.
Particle Mesh Ewald (PME) is responsible for computing the Coulomb (electrostatic) interactions between atoms.
The PME grid size is the number of small cells the system is partitioned into.
PME decomposes the long-range Coulomb interaction into a short-range part and a long-range part.
PMEGridSpacing
NAMD provides a PMEGridSpacing parameter that automatically derives a PME grid size from the value you supply.
The officially recommended starting value is 1.0 (in Å, ångströms); 1.5 already counts as a very large value.
PMEGridSpacing 1.0
After setting this parameter, the output also reports the grid size the system chose for you.
Try values between 1.0 and 1.5, compare performance, and then move on to finer tuning; a rough worked example follows.
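A rough worked example, assuming a cubic cell of about 216 Å per side (roughly what the stmv benchmark uses): PMEGridSpacing 1.0 needs at least 216 / 1.0 = 216 grid points per dimension, and 216 = 2^3 * 3^3 has only 2 and 3 as prime factors, so a 216 x 216 x 216 grid is a natural fit.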
PMEGridSizeX, Y, Z
PMEGridSizeX/Y/Z specify the grid size along each dimension:
PME on
PMEGridSizeX 216
PMEGridSizeY 216
PMEGridSizeZ 216
Start from the best-performing PMEGridSpacing value found above and fine-tune from there; the official advice is that each grid size should have no prime factors other than 2, 3, and 5.
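GNU coreutils' factor command is a quick way to screen candidate sizes:
factor 216   # 216: 2 2 2 3 3 3 -> only 2s and 3s, fine
factor 220   # 220: 2 2 5 11 -> contains 11, avoid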
If the PME grid size is smaller than the number of cores (e.g. +p96), the following settings are recommended to increase PME scalability (explained further below):
twoAwayX yes
twoAwayY yes
twoAwayZ yes
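As I understand it, each twoAway option splits the patches along that axis, roughly doubling the patch count in that dimension: a 6 x 6 x 6 grid (216 patches) becomes about 12 x 12 x 12 (1728 patches) with all three enabled, giving a large run such as +p96 more patches to spread across its cores.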
PMETolerance (essentially never changed)
PMETolerance governs the split between the short-range and long-range parts; the default value is:
PMETolerance 1.0e-6
PMEProcessors (special cases)
When performance is far below what you expect, you can sometimes improve it by restricting the amount of parallelism used.
The PMEProcessors value must be greater than PMEGridSizeX and PMEGridSizeY, and must not exceed the number of available cores.
PMEProcessors 8
cutoff
cutoff is a distance: taking an atom and using the cutoff value as a radius, the program computes electrostatic and van der Waals interactions only with the atoms inside that sphere.
Atoms move between timesteps, and redoing the neighbor search for these forces at every step would be too expensive, so the program only recomputes it once every n timesteps (stepspercycle). With the settings below that is every 20 steps, i.e. every 20 fs:
timestep 1.0 # in femtoseconds (fs)
stepspercycle 20
pairlist
Because atoms are constantly moving, an atom inside the cutoff may have left the cutoff range by the next cycle; this is why the pairlist exists. The shell between the cutoff and the pairlist distance holds the atoms that might enter or leave the cutoff range.
The pairlist distance must be greater than the cutoff. In practice we want no atom to move farther than pairlistdist - cutoff within one cycle, while keeping the pairlist value as close to the cutoff as possible.
The pairlist distance (pairlistDist) is usually set to cutoff + 1.5 (the official recommendation). Reducing the pairlist distance shrinks the patch size, but make sure it is not so small that it compromises the accuracy of the simulation.
cutoff 12
pairlistdist 13.5
pairlistTrigger, pairlistGrow, pairlistShrink
Setting the following three parameters lets the simulation dynamically adjust the pairlist distance on its own. Whether the pairlist needs adjusting is judged against a quantity called the pairlist tolerance. The three values are defined as follows:
- pairlistTrigger: the trigger threshold, 30% of the pairlist tolerance; once the condition is triggered, the pairlist distance is adjusted according to the two parameters below (see also the sketch after this list).
- pairlistGrow: when movement exceeds 30% of the pairlist tolerance, the pairlist distance grows by 1%.
- pairlistShrink: when movement stays below 30% of the pairlist tolerance, the pairlist distance shrinks by 1%.
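If I read the user guide correctly, the percentages above correspond to the documented defaults; a sketch worth double-checking against your NAMD version:
pairlistTrigger 0.3    # trigger threshold: 30% of the pairlist tolerance
pairlistGrow 0.01      # grow pairlistdist by 1% when triggered
pairlistShrink 0.01    # shrink pairlistdist by 1% when below the threshold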
Adjusting switchdist:
If switching is on and switchdist is set to 7.0 Å, the switching function kicks in at 7.0 Å, making the van der Waals potential smoothly approach 0 at 8.0 Å (the cutoff in that example).
Setting these parameters sensibly makes the nonbonded calculations more accurate and preserves energy conservation.
switching
switching must be set to on for switchdist to take effect:
switching on
switchdist 10
A similar idea applies when computing the electrostatic interactions:
https://www.ks.uiuc.edu/Research/namd/2.6/olddocs/ug/node24.html#section:electdesc
Experimental data (validity unconfirmed):
cutoff 12, pairlistdist 13.5, switchdist 10 (default): WallClock: 85.830368
cutoff 10, pairlistdist 11.5, switchdist 9: WallClock: 86.828651
cutoff 13, pairlistdist 14.5, switchdist 10: WallClock: 83.484566
Adjusting margin (purpose unconfirmed)
margin
Acceptable Values: positive decimal
Default Value: 0.0
Description: An internal tuning parameter used in determining the size of the cubes of space with which NAMD uses to partition the system. The value of this parameter will not change the physical results of the simulation. Unless you are very motivated to get the very best possible performance, just leave this value at the default.
http://www.ks.uiuc.edu/Research/namd/wiki/?NamdPerformanceTuning
https://www.ks.uiuc.edu/Research/namd/2.14/ug/node110.html
Adjusting how output is handled with the settings below may improve performance:
shiftIOToOne yes
ldbUnloadOne yes
noPatchesOnOne yes
CPU only, 1000 steps
20 cores: WallClock: 559.907898 CPUTime: 558.425598 Memory: 4691.003906 MB
40 cores: WallClock: 419.788544 CPUTime: 416.760803 Memory: 6124.734375 MB
Test, 2000 steps
Single node, 2 GPUs, +p4: WallClock: 200.125610
Single node, 2 GPUs, +p6: WallClock: 163.561081
Single node, 2 GPUs, +p8: WallClock: 132.111237
Single node, 2 GPUs, +p16: WallClock: 113.672607
Test, 1000 steps
Single node, 2 GPUs, +p20: WallClock: 96.107086
Single node, 2 GPUs, +p20: WallClock: 106.593513
Single node, 1 GPU, +p2: WallClock: 320.530579
Single node, 1 GPU, +p4: WallClock: 195.832718
Download VMD
wget https://www.ks.uiuc.edu/Research/vmd/alpha/vmd-1.9.4a55.bin.LINUXAMD64-CUDA102-OptiX650-OSPRay185-RTXRTRT.opengl.tar.gz
tar xzf vmd-1.9.4a55.bin.LINUXAMD64-CUDA102-OptiX650-OSPRay185-RTXRTRT.opengl.tar.gz
Build and install
cd vmd-1.9.4a55
./configure
cd src
sudo make install
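After installation the vmd launcher should be on your PATH; as a sanity check you can load the benchmark structure (a sketch, assuming the usual stmv file names):
vmd -f /opt/stmv/stmv.psf /opt/stmv/stmv.pdb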