# LAMMPS Competition Notes
## Competition Inputs
* in.lj
    * Default build (no extra packages needed)
    * Simple pair-wise model -> similar to a van der Waals fluid
    * Captures realistic physics of liquid-vapor-solid behavior
    * Like noble gas modeling, e.g. argon
    * Essentially a box of spherical particles moving around
* in.rhodo
    * Needs the MOLECULE, KSPACE, and RIGID packages installed (requires fix shake, long-range electrostatics, and molecule topology)
    * A protein (rhodopsin) solvated in water
    * Uses long-range electrostatics to capture charges
* **Note -> Since only part of the pppm kspace style is GPU accelerated, it may be faster to use GPU acceleration only for pair styles with long-range electrostatics. See the "pair/only" keyword of the package command for a shortcut to do that. The work between kspace on the CPU and non-bonded interactions on the GPU can be balanced through adjusting the Coulomb cutoff without loss of accuracy.**
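A minimal sketch of that shortcut for in.rhodo with the GPU package (rank and GPU counts are illustrative; the keyword value follows the package-command documentation):
```
mpiexec -np 16 ./lmp_exe -in in.rhodo -sf gpu -pk gpu 4 pair/only on
```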
## Competition Requirements
Performance metrics:
* tau/day
* timesteps/s
-> For both, higher is better.
## Configuration Changes allowed
* Number of MPI ranks
* MPI and thread affinity
* Number of OpenMP threads per MPI rank
* Compiler optimization flags
* Can compile with “-default-stream per-thread”
* FFT library
* MPI library
* Compiler version
* CUDA version
* CUDA-aware flags
* CUDA MPS on/off
* Can use any LAMMPS accelerator package
* Any package option (see https://lammps.sandia.gov/doc/package.html), except precision
* Coulomb cutoff
    * The work between kspace on the CPU and non-bonded interactions on the GPU can be balanced through adjusting the Coulomb cutoff without loss of accuracy.
* Can use atom sorting
    * atom_modify command
* Newton flag on/off
* mpirun command
* Can add load balancing
* Can use LAMMPS “processors” command
* Can turn off tabulation in pair_style (i.e. "pair_modify table 0")
* Can use multiple Kokkos backends (e.g. CUDA + OpenMP)
* Can use “kk/device” or “kk/host” suffix for any kernel to run on CPU or GPU
## Configuration Changes not allowed
* Modifying any style: pair, fix, kspace, etc.
* Number of atoms
* Timestep value
* Number of timesteps
* Neighborlist parameters (except binsize)
* neigh_modify command
* Changing precision (must use double precision FP64)
* LJ charmm cutoff
## Tasks and Submissions
1. 4 nodes or 4 GPUs
2. Visualize
3. Run an IPM LAMMPS profile on one of the clusters on 4 nodes; what are the main MPI calls used? (A sketch of an IPM run follows below.)
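A minimal sketch of profiling a run with IPM, assuming IPM is built as a preloadable library (the library path, rank count, and report filename are placeholders):
```
##preload IPM so it intercepts MPI calls; adjust the path for the cluster
export LD_PRELOAD=/opt/ipm/lib/libipm.so
mpirun -np 128 ./lmp_exe -in in.lj
##IPM writes an XML report at job end; ipm_parse can render it as an HTML summary
ipm_parse -html <ipm_xml_report>
```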
## To Clarify
MPI task = MPI ranks
OpenMP thread
All-atom or MD simulation
## Performance Estimation and Notes
Reference:
[Rist-lammps](https://www.hpci-office.jp/documents/appli_software/K_LAMMPS_performance.pdf)
It notes that for large problem sizes, adding OpenMP threads can effectively improve performance.
## Performance Measurement
1. Loop time: the time spent inside the MD loop (the number of seconds it took)
2. Performance: tau/day or timesteps/s
### Parameter Reference
Neighbor list -> full/half neighbor list, newton flag
Kspace -> FFT library, cutoff, etc.
* On GPUs, the timing breakdown won't be accurate without CUDA_LAUNCH_BLOCKING=1 (but setting it slows down the simulation and prevents overlap); see the sketch below
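A minimal sketch of a profiling-only run with that variable set (rank and GPU counts are illustrative; do not use this for the timed submission runs):
```
##profiling only: synchronous kernel launches make the per-category timing breakdown accurate
export CUDA_LAUNCH_BLOCKING=1   # may need the MPI launcher's env-forwarding flag (e.g. Open MPI's -x) on multi-node runs
mpiexec -np 4 ./lmp_exe -in in.lj -sf gpu -pk gpu 4
```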
## Notes on LAMMPS Itself
ISC21 introduction video: {%youtube ssyLwwHYK-c%}
### How LAMMPS Parallelizes
The simulation box is split up as in the figure below,
using [domain decomposition](https://en.wikipedia.org/wiki/Domain_decomposition_methods) to divide the computation.
Each region is a subdomain.

During the computation, subdomains overlap at their boundaries (the ghost regions).

After computing, the subdomains communicate with each other (through MPI) and iterate.
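The competition rules also allow the LAMMPS `processors` command, which controls how the box is split into subdomains. A minimal sketch (the 4x4x2 grid is illustrative and its product must equal the total MPI rank count):
```
##in the input script: force a 4x4x2 decomposition for 32 MPI ranks
processors 4 4 2
```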
### Neighbor list
Each atom's neighbor list is built out to the force cutoff plus a skin distance.
The skin means the neighbor lists only need to be rebuilt every several timesteps instead of every timestep.
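A minimal sketch of how the skin is set in an input script (the 0.3 value is illustrative, in LJ distance units):
```
##skin = 0.3 beyond the force cutoff; "bin" selects binned neighbor-list construction
neighbor 0.3 bin
```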

### Newton option (performance-related)
For two atoms that sit on different processors:
Newton flag on -> only one processor computes the force between the two atoms
Newton flag off -> both processors compute the force
ON -> less computation, but more communication
OFF -> more computation, but less communication
OFF usually works better on GPUs (see the sketch after the list below).
Which setting performs better may depend on:
* problem size
* cutoff lengths
* GPU or CPU
* compute/communication ratio
* processor used
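A minimal sketch of toggling the flag in the input script (whether it helps depends on the factors listed above):
```
##turn off the Newton's-third-law optimization for pairwise (and bonded) interactions
newton off
```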
#### Newton flag on (CPU)
Typically uses a half neighbor list to avoid redundant computation:
each pair is stored only once.
Using this approach on a GPU or with multithreading requires atomic operations for thread safety.

#### Newton flag off (GPU)
Typically uses full neighbor lists:
each pair is stored twice -> the computation is doubled, but communication time is reduced.
No thread-safety concerns (no atomic operations needed).

### SHAKE algorithm
When the model contains bonds and angles, the SHAKE algorithm can be used
-> fix shake
It constrains bond lengths and angles, removing the fastest (highest-frequency) vibrations, which reduces computation time and allows a larger timestep.
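A minimal sketch of the command form (the fix ID, tolerance, iteration count, and bond/angle type numbers are illustrative, not taken from in.rhodo):
```
##syntax: fix ID group shake tol maxiter printfreq b <bond types> a <angle types>
##constrain bond type 1 and angle type 1 to their equilibrium values
fix constrainH all shake 0.0001 20 0 b 1 a 1
```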

### Long-range electrostatics
Coulombic interactions extend beyond the cutoff, so they are hard to compute directly.
A kspace solver (e.g. PPPM, as in the note above) can be used
to make the answer more accurate.
You can vary the Coulomb cutoff and get the same answer,
because LAMMPS automatically combines the short-range cutoff part with the long-range kspace part and keeps the total consistent.
EX: on a GPU you can increase the short-range cutoff so that more of the work happens in the (GPU-accelerated) short-range computation.
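A minimal sketch of what shifting that balance could look like in an input script, assuming a CHARMM-style pair style with PPPM (all numeric values are illustrative; note the LJ charmm cutoffs themselves may not be changed under the rules):
```
##inner/outer LJ switching cutoffs stay fixed; only the optional third value (the Coulomb cutoff) is raised
pair_style   lj/charmm/coul/long 8.0 10.0 12.0
kspace_style pppm 1.0e-4   # PPPM handles everything beyond the Coulomb cutoff
```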
### Basic MD timestep
What happens in each timestep:
1. Initial integrate
2. MPI communication
3. Compute forces (pair, bond, kspace, etc.)
4. Additional MPI communication (if newton flag on)
5. Final integrate
6. Output (if required on this timestep)
### Versions
Stable version: more thoroughly tested
Development version: latest features and bug fixes
### Compiling
Needs a C++ compiler (GNU, Intel, Clang, nvcc)
Needs an MPI library, or the bundled "STUBS" dummy library for serial builds
Build with Make or CMake (see the sketch below)
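A minimal sketch of both build flows, assuming the standard LAMMPS source tree layout (the -j value is illustrative):
```
##traditional make, run from lammps/src
make -j 16 mpi                    # builds lmp_mpi with the bundled MPI makefile
##CMake, run from a fresh build directory beside lammps/cmake
cmake ../cmake -D BUILD_MPI=yes
cmake --build . -j 16
```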
### Packages
* Traditional make:
    * make yes-molecule
    * make no-molecule
* CMake:
    * -D PKG_MOLECULE=yes
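A minimal sketch of enabling everything in.rhodo needs (per the input list at the top), with either build system:
```
##make
make yes-molecule yes-kspace yes-rigid
##cmake
-D PKG_MOLECULE=yes -D PKG_KSPACE=yes -D PKG_RIGID=yes
```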
### Accelerator packages
CUDA for GPUs
OpenMP for multithreaded CPUs
Five accelerator packages:
* USER-OMP
* USER-INTEL
* OPT
* GPU
* KOKKOS
#### OPT package (CPU)
C++ templated code -> roughly 10%+ reduction in overhead
* Methods rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code
* Code also vectorizes better than the regular CPU version
* Contains 9 pair styles including Lennard-Jones
* No GPU support
:::info
RUNNING
-> Compile LAMMPS with the OPT package:
```
##make
make yes-opt    # enable the package
make no-opt     # disable the package
##cmake
-D PKG_OPT=yes
```
ex:
```
mpiexec -np 8 ./lmp_exe -in in.lj -sf opt
```
:::
Use it for anything that has an /opt version (pair styles, etc.).
With this accelerator package, whenever a command has an /opt variant, that variant is used automatically.
#### USER-OMP package (CPU)
Uses OpenMP to enable multithreading on CPUs.
<font color="#fd2e02">**LAMMPS uses MPI parallelization for the domain decomposition; the OpenMP threading sits on top of MPI.
Typically OpenMP isn't faster than straight MPI in LAMMPS, because of threading overhead and thread-safety costs.**</font>
* With a big simulation on many cores of a supercomputer, OpenMP helps by reducing the MPI communication load
* If hyper-threading is enabled, OpenMP can also make use of the hardware threads
:::info
RUNNING -> compile with this package
EX:(2 MPI with 2 OpenMP threads)
```
export OMP_NUM_THREADS=2
mpiexec -np 2 ./lmp_exe -in in.lj -sf omp
```
:::
#### USER-INTEL package (CPU)
Gives better performance on Intel CPUs through vectorization (with or without OpenMP threading).
Can be used together with the USER-OMP package.
<font color="#fd2e02">Usually the best-performing CPU accelerator package.</font>
:::info
RUNNING -> compile with this package
EX: (2 MPI ranks and 2 threads on an Intel CPU)
As before, append -sf intel; the -pk package command can pass additional options:
```
mpiexec -np 2 ./lmp_exe -in in.lj -pk intel 0 omp 2 mode double -sf intel
```
-pk intel 0 omp 2 mode double -> no coprocessor offload (0), two OpenMP threads, double precision
:::
[package command](https://lammps.sandia.gov/doc/package.html)
#### GPU package (GPU)
Typically suited to one (or a few) GPUs paired with many CPU cores.
With this package only the pair force computation (plus neighbor lists, part of the long-range electrostatics, etc.) runs on the GPU; everything else runs on the CPU.
-> Per-atom data is transferred back and forth between the GPU and the CPU.
It automatically tries to overlap the pairwise (pair-style) work on the GPU with kspace on the CPU (neither computation depends on the other, which is why this gives a speedup).
-> Provides NVIDIA CUDA and more general OpenCL support.
:::info
RUNNING->
1. build the GPU library
2. compile lammps with GPU package
3. run
EX:(Run with 16 MPI and 4 GPUs)
```
mpiexec -np 16 ./lmp_exe -in in.lj -sf gpu -pk gpu 4
```
-pk gpu 4 = the node has 4 GPUs; use all 4 of them
:::
<font color="#fd2e02">Best to use CUDA MPS (Multi-Process Service) when running multiple MPI ranks per GPU</font>
-> reduces overhead
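A minimal sketch of wrapping a run with MPS (the daemon commands are the standard NVIDIA ones; rank and GPU counts are illustrative):
```
nvidia-cuda-mps-control -d            # start the MPS daemon
mpiexec -np 16 ./lmp_exe -in in.lj -sf gpu -pk gpu 4
echo quit | nvidia-cuda-mps-control   # shut the daemon down afterwards
```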
#### KOKKOS package
Consists of two parts:
1. Parallel dispatch—threaded kernels are launched and mapped onto backend languages such as CUDA or OpenMP
2. Kokkos views—polymorphic memory layouts that can be optimized for a specific hardware -> switch different layout at compile time
* Used on top of existing MPI parallelization (MPI + X)
:::warning
Supports OpenMP and GPUs
Runs everything on the GPU, so data transfer between GPU and CPU is minimal
:::
package command -> -pk kokkos [command]
:::info
RUNNING ->
Compile LAMMPS with the KOKKOS package
EX:
```
##(Run with 4 MPI and 4 GPUs)
mpiexec -np 4 ./lmp_exe -in in.lj -k on g 4 -sf kk
##(Run with 4 OpenMP threads)
./lmp_exe -in in.lj -k on t 4 -sf kk
```
-k on -> enable Kokkos at runtime
g 4 -> use four GPUs per node
t 4 -> use four OpenMP threads per MPI task
:::
* Overlap in the KOKKOS package:
    * -pk kokkos pair/only on -> behaves like the GPU package (see the sketch below)
    * Needs to be compiled with the `--default-stream per-thread` flag to achieve overlap
* Append /kk/host to a command to force it to run on the CPU, /kk/device to run it on the GPU
* g 4 and t 4 can be used at the same time (compile with both the CUDA and OpenMP backends)
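A minimal sketch of that GPU-package-like mode with KOKKOS (rank, GPU, and thread counts are illustrative; assumes a build with both the CUDA and OpenMP backends):
```
##(4 MPI ranks, 4 GPUs for the pair style, 8 host threads per task for the rest)
mpiexec -np 4 ./lmp_exe -in in.rhodo -k on g 4 t 8 -sf kk -pk kokkos pair/only on
```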
:::info
Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer. If Hyper-Threading (HT) is enabled, then the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores * hardware threads.
:::
### FFT library
LAMMPS needs an FFT library for the PPPM kspace method.
Default -> KISS FFT
Other, potentially faster options: FFTW, MKL, cuFFT, etc.
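A minimal sketch of selecting one at build time with CMake (FFTW3 shown; the library must already be installed on the system):
```
cmake ../cmake -D PKG_KSPACE=yes -D FFT=FFTW3
```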
### Processes and thread affinity
<font color="#fd2e02">**Determines how MPI tasks and threads are assigned to cores and nodes**</font>
-> `--bind-to-core` or `--bind-to-socket` (MPI launcher options)
OpenMP variables:
`OMP_PROC_BIND` or `OMP_PLACES`
* Pay attention to the NUMA bindings between tasks, cores, and GPUs
For example -> on a dual-socket system => the MPI tasks driving a GPU should be on the same socket as that GPU
[NUMA binding](https://www.kernel.org/doc/Documentation/devicetree/bindings/numa.txt)
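A minimal sketch of pinning a hybrid MPI+OpenMP run (Open MPI-style flags; exact option names vary by MPI library, and the counts are illustrative):
```
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
mpirun -np 8 --map-by socket --bind-to socket ./lmp_exe -in in.lj -sf omp
```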
###### tags: `lammps`