# LAMMPS Competition Notes
## Competition Inputs
* in.lj
    * Default build (no extra packages needed)
    * Simple pair-wise model -> similar to a van der Waals fluid
    * Captures realistic physics of liquid-vapor-solid behavior
    * Like noble gas modeling, e.g. argon
    * Essentially a box of spherical particles moving around
* in.rhodo
    * Needs the MOLECULE, KSPACE, and RIGID packages installed (requires fix shake, long-range electrostatics, and molecule topology)
    * A protein (rhodopsin) solvated in water
    * Uses long-range electrostatics to capture charges
* **Note -> Since only part of the pppm kspace style is GPU accelerated, it may be faster to use GPU acceleration only for pair styles with long-range electrostatics. See the "pair/only" keyword of the package command for a shortcut to do that. The work between kspace on the CPU and non-bonded interactions on the GPU can be balanced through adjusting the Coulomb cutoff without loss of accuracy.**
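A minimal sketch of that shortcut for in.rhodo with the GPU package (rank and GPU counts are illustrative; the keyword value follows the package-command documentation):
```
mpiexec -np 16 ./lmp_exe -in in.rhodo -sf gpu -pk gpu 4 pair/only on
```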
## Competition Requirements
Performance metrics:
* tau/day
* timesteps/s
-> For both, higher is better.
## Configuration Changes allowed
* Number of MPI ranks
* MPI and thread affinity
* Number of OpenMP threads per MPI rank
* Compiler optimization flags
* Can compile with “-default-stream per-thread”
* FFT library
* MPI library
* Compiler version
* CUDA version
* CUDA-aware flags
* CUDA MPS on/off
* Can use any LAMMPS accelerator package
* Any package option (see https://lammps.sandia.gov/doc/package.html), except precision
* Coulomb cutoff
    * The work between kspace on the CPU and non-bonded interactions on the GPU can be balanced through adjusting the Coulomb cutoff without loss of accuracy.
* Can use atom sorting
    * atom_modify command
* Newton flag on/off
* mpirun command
* Can add load balancing
* Can use LAMMPS “processors” command
* Can turn off tabulation in pair_style (i.e. "pair_modify table 0")
* Can use multiple Kokkos backends (e.g. CUDA + OpenMP)
* Can use “kk/device” or “kk/host” suffix for any kernel to run on CPU or GPU
## Configuration Changes not allowed
* Modifying any style: pair, fix, kspace, etc.
* Number of atoms
* Timestep value
* Number of timesteps
* Neighborlist parameters (except binsize)
* neigh_modify command
* Changing precision (must use double precision FP64)
* LJ charmm cutoff
## Tasks and Submissions
1. 4 nodes or 4 GPUs
2. Visualize
3. Run an IPM LAMMPS profile on one of the clusters on 4 nodes; what are the main MPI calls used? (A sketch of an IPM run follows below.)
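A minimal sketch of profiling a run with IPM, assuming IPM is built as a preloadable library (the library path, rank count, and report filename are placeholders):
```
##preload IPM so it intercepts MPI calls; adjust the path for the cluster
export LD_PRELOAD=/opt/ipm/lib/libipm.so
mpirun -np 128 ./lmp_exe -in in.lj
##IPM writes an XML report at job end; ipm_parse can render it as an HTML summary
ipm_parse -html <ipm_xml_report>
```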
## To Clarify
MPI task = MPI ranks
OpenMP thread
All-atom or MD simulation
## Performance Estimation and Notes
Reference:
[Rist-lammps](https://www.hpci-office.jp/documents/appli_software/K_LAMMPS_performance.pdf)
It notes that for large problem sizes, adding OpenMP threads can effectively improve performance.
## Performance Measurement
1. Loop time: the time spent inside the MD loop (the number of seconds it took)
2. Performance: tau/day or timesteps/s
### Parameter Reference
Neighbor list -> full/half neighbor list, newton flag
Kspace -> FFT library, cutoff, etc.
* On GPUs, the timing breakdown won't be accurate without CUDA_LAUNCH_BLOCKING=1 (but setting it slows down the simulation and prevents overlap); see the sketch below
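A minimal sketch of a profiling-only run with that variable set (rank and GPU counts are illustrative; do not use this for the timed submission runs):
```
##profiling only: synchronous kernel launches make the per-category timing breakdown accurate
export CUDA_LAUNCH_BLOCKING=1   # may need the MPI launcher's env-forwarding flag (e.g. Open MPI's -x) on multi-node runs
mpiexec -np 4 ./lmp_exe -in in.lj -sf gpu -pk gpu 4
```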
## Notes on LAMMPS Itself
ISC21 introduction video: {%youtube ssyLwwHYK-c%}
### How LAMMPS Parallelizes
The simulation box is split up as in the figure below,
using [domain decomposition](https://en.wikipedia.org/wiki/Domain_decomposition_methods) to divide the computation.
Each region is a subdomain.

During the computation, subdomains overlap at their boundaries (the ghost regions).

After computing, the subdomains communicate with each other (through MPI) and iterate.
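The competition rules also allow the LAMMPS `processors` command, which controls how the box is split into subdomains. A minimal sketch (the 4x4x2 grid is illustrative and its product must equal the total MPI rank count):
```
##in the input script: force a 4x4x2 decomposition for 32 MPI ranks
processors 4 4 2
```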
### Neighbor list
Each atom's neighbor list is built out to the force cutoff plus a skin distance.
The skin means the neighbor lists only need to be rebuilt every several timesteps instead of every timestep.
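A minimal sketch of how the skin is set in an input script (the 0.3 value is illustrative, in LJ distance units):
```
##skin = 0.3 beyond the force cutoff; "bin" selects binned neighbor-list construction
neighbor 0.3 bin
```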

### Newton option (performance-related)
For two atoms that sit on different processors:
Newton flag on -> only one processor computes the force between the two atoms
Newton flag off -> both processors compute the force
ON -> less computation, but more communication
OFF -> more computation, but less communication
OFF usually works better on GPUs (see the sketch after the list below).
Which setting performs better may depend on:
* problem size
* cutoff lengths
* GPU or CPU
* compute/communication ratio
* processor used
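A minimal sketch of toggling the flag in the input script (whether it helps depends on the factors listed above):
```
##turn off the Newton's-third-law optimization for pairwise (and bonded) interactions
newton off
```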
#### Newton flag on (CPU)
Typically uses a half neighbor list to avoid redundant computation:
each pair is stored only once.
Using this approach on a GPU or with multithreading requires atomic operations for thread safety.

#### Newton flag off (GPU)
Typically uses full neighbor lists:
each pair is stored twice -> the computation is doubled, but communication time is reduced.
No thread-safety concerns (no atomic operations needed).

### SHAKE algorithm
When the model contains bonds and angles, the SHAKE algorithm can be used
-> fix shake
It constrains bond lengths and angles, removing the fastest (highest-frequency) vibrations, which reduces computation time and allows a larger timestep.
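A minimal sketch of the command form (the fix ID, tolerance, iteration count, and bond/angle type numbers are illustrative, not taken from in.rhodo):
```
##syntax: fix ID group shake tol maxiter printfreq b <bond types> a <angle types>
##constrain bond type 1 and angle type 1 to their equilibrium values
fix constrainH all shake 0.0001 20 0 b 1 a 1
```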

### Long-range electrostatics
Coulombic interactions extend beyond the cutoff, so they are hard to compute directly.
A kspace solver (e.g. PPPM, as in the note above) can be used
to make the answer more accurate.
You can vary the Coulomb cutoff and get the same answer,
because LAMMPS automatically combines the short-range cutoff part with the long-range kspace part and keeps the total consistent.
EX: on a GPU you can increase the short-range cutoff so that more of the work happens in the (GPU-accelerated) short-range computation.
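A minimal sketch of what shifting that balance could look like in an input script, assuming a CHARMM-style pair style with PPPM (all numeric values are illustrative; note the LJ charmm cutoffs themselves may not be changed under the rules):
```
##inner/outer LJ switching cutoffs stay fixed; only the optional third value (the Coulomb cutoff) is raised
pair_style   lj/charmm/coul/long 8.0 10.0 12.0
kspace_style pppm 1.0e-4   # PPPM handles everything beyond the Coulomb cutoff
```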
### Basic MD timestep
What happens in each timestep:
1. Initial integrate
2. MPI communication
3. Compute forces (pair, bond, kspace, etc.)
4. Additional MPI communication (if newton flag on)
5. Final integrate
6. Output (if required on this timestep)
### Versions
Stable version: more thoroughly tested
Development version: latest features and bug fixes
### Compiling
Needs a C++ compiler (GNU, Intel, Clang, nvcc)
Needs an MPI library, or the bundled "STUBS" dummy library for serial builds
Build with Make or CMake (see the sketch below)
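A minimal sketch of both build flows, assuming the standard LAMMPS source tree layout (the -j value is illustrative):
```
##traditional make, run from lammps/src
make -j 16 mpi                    # builds lmp_mpi with the bundled MPI makefile
##CMake, run from a fresh build directory beside lammps/cmake
cmake ../cmake -D BUILD_MPI=yes
cmake --build . -j 16
```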
### Packages
* Traditional make:
    * make yes-molecule
    * make no-molecule
* CMake:
    * -D PKG_MOLECULE=yes
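A minimal sketch of enabling everything in.rhodo needs (per the input list at the top), with either build system:
```
##make
make yes-molecule yes-kspace yes-rigid
##cmake
-D PKG_MOLECULE=yes -D PKG_KSPACE=yes -D PKG_RIGID=yes
```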
### Accelerator packages
CUDA for GPUs
OpenMP for multithreaded CPUs
Five accelerator packages:
* USER-OMP
* USER-INTEL
* OPT
* GPU
* KOKKOS
#### OPT package (CPU)
C++ templated code -> roughly 10%+ reduction in overhead
* Methods rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code
* Code also vectorizes better than the regular CPU version
* Contains 9 pair styles including Lennard-Jones
* No GPU support
:::info
RUNNING
-> Compile LAMMPS with the OPT package:
```
##make
make yes-opt    # enable the package
make no-opt     # disable the package
##cmake
-D PKG_OPT=yes
```
ex:
```
mpiexec -np 8 ./lmp_exe -in in.lj -sf opt
```
:::
Use it for anything that has an /opt version (pair styles, etc.).
With this accelerator package, whenever a command has an /opt variant, that variant is used automatically.
#### USER-OMP package (CPU)
Uses OpenMP to enable multithreading on CPUs.
<font color="#fd2e02">**LAMMPS uses MPI parallelization for the domain decomposition; the OpenMP threading sits on top of MPI.
Typically OpenMP isn't faster than straight MPI in LAMMPS, because of threading overhead and thread-safety costs.**</font>
* With a big simulation on many cores of a supercomputer, OpenMP helps by reducing the MPI communication load
* If hyper-threading is enabled, OpenMP can also make use of the hardware threads
:::info
RUNNING -> compile with this package
EX:(2 MPI with 2 OpenMP threads)
```
export OMP_NUM_THREADS=2
mpiexec -np 2 ./lmp_exe -in in.lj -sf omp
```
:::
#### USER-INTEL package (CPU)
Gives better performance on Intel CPUs through vectorization (with or without OpenMP threading).
Can be used together with the USER-OMP package.
<font color="#fd2e02">Usually the best-performing CPU accelerator package.</font>
:::info
RUNNING -> compile with this package
EX: (2 MPI ranks and 2 threads on an Intel CPU)
As before, append -sf intel; the -pk package command can pass additional options:
```
mpiexec -np 2 ./lmp_exe -in in.lj -pk intel 0 omp 2 mode double -sf intel
```
-pk intel 0 omp 2 mode double -> no coprocessor offload (0), two OpenMP threads, double precision
:::
[package command](https://lammps.sandia.gov/doc/package.html)
#### GPU package (GPU)
Typically suited to one (or a few) GPUs paired with many CPU cores.
With this package only the pair force computation (plus neighbor lists, part of the long-range electrostatics, etc.) runs on the GPU; everything else runs on the CPU.
-> Per-atom data is transferred back and forth between the GPU and the CPU.
It automatically tries to overlap the pairwise (pair-style) work on the GPU with kspace on the CPU (neither computation depends on the other, which is why this gives a speedup).
-> Provides NVIDIA CUDA and more general OpenCL support.
:::info
RUNNING->
1. build the GPU library
2. compile lammps with GPU package
3. run
EX:(Run with 16 MPI and 4 GPUs)
```
mpiexec -np 16 ./lmp_exe -in in.lj -sf gpu -pk gpu 4
```
-pk gpu 4 = the node has 4 GPUs; use all 4 of them
:::
<font color="#fd2e02">Best to use CUDA MPS (Multi-Process Service) when running multiple MPI ranks per GPU</font>
-> reduces overhead
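A minimal sketch of wrapping a run with MPS (the daemon commands are the standard NVIDIA ones; rank and GPU counts are illustrative):
```
nvidia-cuda-mps-control -d            # start the MPS daemon
mpiexec -np 16 ./lmp_exe -in in.lj -sf gpu -pk gpu 4
echo quit | nvidia-cuda-mps-control   # shut the daemon down afterwards
```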
#### KOKKOS package
Consists of two parts:
1. Parallel dispatch—threaded kernels are launched and mapped onto backend languages such as CUDA or OpenMP
2. Kokkos views—polymorphic memory layouts that can be optimized for a specific hardware -> switch different layout at compile time
* Used on top of existing MPI parallelization (MPI + X)
:::warning
Supports OpenMP and GPUs
Runs everything on the GPU, so data transfer between GPU and CPU is minimal
:::
package command -> -pk kokkos [command]
:::info
RUNNING ->
Compile LAMMPS with the KOKKOS package
EX:
```
##(Run with 4 MPI and 4 GPUs)
mpiexec -np 4 ./lmp_exe -in in.lj -k on g 4 -sf kk
##(Run with 4 OpenMP threads)
./lmp_exe -in in.lj -k on t 4 -sf kk
```
-k on -> enable Kokkos at runtime
g 4 -> use four GPUs per node
t 4 -> use four OpenMP threads per MPI task
:::
* Overlap in the KOKKOS package:
    * -pk kokkos pair/only on -> behaves like the GPU package (see the sketch below)
    * Needs to be compiled with the `--default-stream per-thread` flag to achieve overlap
* Append /kk/host to a command to force it to run on the CPU, /kk/device to run it on the GPU
* g 4 and t 4 can be used at the same time (compile with both the CUDA and OpenMP backends)
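A minimal sketch of that GPU-package-like mode with KOKKOS (rank, GPU, and thread counts are illustrative; assumes a build with both the CUDA and OpenMP backends):
```
##(4 MPI ranks, 4 GPUs for the pair style, 8 host threads per task for the rest)
mpiexec -np 4 ./lmp_exe -in in.rhodo -k on g 4 t 8 -sf kk -pk kokkos pair/only on
```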
:::info
Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer. If Hyper-Threading (HT) is enabled, then the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores * hardware threads.
:::
### FFT library
LAMMPS needs an FFT library for the PPPM kspace method.
Default -> KISS FFT
Other, potentially faster options: FFTW, MKL, cuFFT, etc.
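A minimal sketch of selecting one at build time with CMake (FFTW3 shown; the library must already be installed on the system):
```
cmake ../cmake -D PKG_KSPACE=yes -D FFT=FFTW3
```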
### Processes and thread affinity
<font color="#fd2e02">**Determines how MPI tasks and threads are assigned to cores and nodes**</font>
-> `--bind-to-core` or `--bind-to-socket` (MPI launcher options)
OpenMP variables:
`OMP_PROC_BIND` or `OMP_PLACES`
* Pay attention to the NUMA bindings between tasks, cores, and GPUs
For example -> on a dual-socket system => the MPI tasks driving a GPU should be on the same socket as that GPU
[NUMA binding](https://www.kernel.org/doc/Documentation/devicetree/bindings/numa.txt)
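A minimal sketch of pinning a hybrid MPI+OpenMP run (Open MPI-style flags; exact option names vary by MPI library, and the counts are illustrative):
```
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
mpirun -np 8 --map-by socket --bind-to socket ./lmp_exe -in in.lj -sf omp
```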
###### tags: `lammps`