# Adaptive String Method: tutorial
See the [GitHub repository](https://github.com/kzinovjev/string-amber/blob/master/tutorial.md#string-file) for a tutorial, explanations and some scripts, and the [article](https://pubs.acs.org/doi/10.1021/acs.jpca.7b10842) describing the method.
## Table of contents
[toc]
## Preparation of the system
Before running any ASM calculation, the system must be prepared and equilibrated, for example with Amber or GROMACS. It is best to have equilibrated reactant and product structures for the string, or at least a transition-state structure.
## Directory architecture
The code needs to access several files; it is best to keep them all in the directory where the string will run and to avoid an overly intricate directory tree.
Here is an example of a good architecture:
```
./Project
├── 1_energy_mini
├── 2_equilibration
└── 3_ASM
    ├── 0_string
    │   ├── starting_struc/
    │   │   ├── file1.rst7
    │   │   └── file2.rst7
    │   ├── STRING
    │   ├── guess
    │   ├── in.sh
    │   ├── job.sh
    │   ├── file.top
    │   ├── model.in
    │   └── CVs
    ├── 1_string_continue
    │   ├── starting_struc/
    │   │   ├── file1.rst7
    │   │   └── file2.rst7
    │   ├── STRING
    │   ├── guess
    │   ├── continue_in.sh
    │   ├── job.sh
    │   ├── file.top
    │   ├── model.in
    │   └── CVs
    └── 2_string_continue
        └── ...
```
## Required files for ASM
### CVs
Below is an example of a CVs file with 3 CVs:
```
3
$COLVAR
COLVAR_type = "BOND"
atoms = 12 ,14
$END
$COLVAR
COLVAR_type = "BOND"
atoms = 14, 17
$END
$COLVAR
COLVAR_type = "BOND"
atoms = 12 ,17
$END
```
The first line is the number of CVs defined in the file; each CV is then specified according to the following syntax:
```
$COLVAR
COLVAR_type = "[CVs_type]"
atoms = [atom1_mask] ,[atom2_mask]
$END
```
where `[CVs_type]` is one of:
* BOND - interatomic distance (2 atoms)
* ANGLE - bond angle (3 atoms)
* DIHEDRAL - dihedral angle (4 atoms)
* PPLANE - signed point plane distance (4 atoms)
The `[atomX_mask]` values are the indices of the atoms involved. They can be found with AmberTools `cpptraj` (the `parm` and `atominfo` commands are typed at the cpptraj prompt):
>$ cpptraj
>$ parm [file].prmtop
>$ atominfo :[index_of_the_residue]
### guess
This file gives the value of each CV at each string point, with point 1 being the initial state and point $N$ the final one.
```
<number of guess points>
<CV_1 in point 1> <CV_2 in point 1> ... <CV_D in point 1>
<CV_1 in point 2> <CV_2 in point 2> ... <CV_D in point 2>
...
<CV_1 in point N> <CV_2 in point N> ... <CV_D in point N>
```
The starting and ending values can be generated with [trj_to_cvs.py](https://github.com/kzinovjev/string-amber/blob/master/utils/trj_to_cvs.py) and unbiased MD trajectories of reactants and products.
If there are more string nodes than guess points, the code will interpolate between the points provided.
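For instance, a guess file for the 3-CV example above that only provides the two end points could look like this (distances in Å, values purely illustrative):
```
2
1.60 3.20 2.95
3.10 1.65 2.95
```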
### STRING
A file to write manually, in which you can specify instructions for the string.
Here is the simplest example:
```
$STRINGSETUP
dir = "results/"
guess_file = "guess"
$END
```
Other options can be specified:
```
fix_ends = {.true., .false.}   (default: .true.)
preparation_steps = {int}      (preparation/equilibration steps before string data is collected; default unclear, the analysis scripts below assume 1000-2000)
buffer_size = {int}            (appears to set how often the .string/.PMF files are written, in steps; see checkstringV2.py below)
```
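Putting these together, a more complete STRING file might look like the following sketch (the `CV_file` and `only_PMF` keywords are inferred from how checkstringV2.py below parses this file; all values are illustrative):
```
$STRINGSETUP
dir = "results/"
guess_file = "guess"
CV_file = "CVs"
fix_ends = .true.
preparation_steps = 2000
buffer_size = 2000
only_PMF = .false.
$END
```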
### Input parameters
This is the model input file (`model.in`) containing all the sander input parameters and options; it is used by `in.sh` to generate the input files for every node. If using sander from AmberTools21, details can be found in the [Amber21 manual](https://ambermd.org/doc12/Amber21.pdf), starting from Section 21.3.
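A minimal sketch of what `model.in` might contain for a QM/MM string run is given below. All values are purely illustrative and must be adapted to your system (consult the Amber manual and the GitHub tutorial for the string-specific settings); the important points here are the `XXXXX` placeholder, which `in.sh` replaces with a random seed (assumed here to set `ig`), and the deliberately large `nstlim` discussed in the "Continuing the string" section.
```
&cntrl
  imin = 0,
  nstlim = 500000,    ! unreasonably large: the string job is stopped manually
  dt = 0.001,
  cut = 8.0,
  ntb = 1,
  ntt = 3, gamma_ln = 5.0, temp0 = 300.0,
  ntpr = 500,
  ntwx = 500,         ! trajectory writing frequency (needed by reorder_try.py below)
  ntwr = 500,
  ifqnt = 1,
  ig = XXXXX,         ! random-seed placeholder, replaced by in.sh
 /
&qmmm
  qmmask = ':1-2',
  qm_theory = 'DFTB3',
  qmcharge = -1,
 /
```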
### Generate inputs
The shell script below generates a sander input file for each node, as well as the groupfile used to run the jobs of all nodes together.
<details open>
<summary>in.sh</summary>
```bash=
#!/bin/bash
# If groupfile exists, remove it
if [ -f string.groupfile ];
then
rm string.groupfile
fi
PARM=starting_structures/dinuc.prmtop # path to topology file
REACT=starting_structures/qmmmeq.rst7 # path to reactants structure
PROD=starting_structures/qmmmeq.rst7 # path to products structure
# Number of string nodes is provided as command line argument
nodes=$1
for i in `seq 1 $nodes`; do
# generate sander input for node $i
sed "s/XXXXX/$RANDOM/g" model.in > $i.in
# use reactants structure for the first half of the nodes and products for the rest
if [ $i -le $(($nodes/2)) ];
then
crd=$REACT
else
crd=$PROD
fi
# write out the sander arguments for node $i to the groupfile
echo "-O -rem 0 -i $i.in -o $i.out -c $crd -r $i.rst7 -x $i.nc -inf $i.mdinfo -p $PARM" >> string.groupfile
done
echo "N REPLICAS = $i"
```
</details>
Then, to run it:
>$ chmod u+x in.sh
>$ ./in.sh [number_of_string_nodes]
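The generated `string.groupfile` is then consumed by a multi-replica sander run, one replica per string node; in the job script below, the actual simulation line should therefore look something like this (a sketch; adjust the replica count to your number of string nodes, as done in job1.sh further below):
>$ mpirun -np 40 sander.MPI -ng 40 -groupfile string.groupfile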
### Job script
This is the script used to submit a job on baal (LBT-IBPC). The header and the USER part need to be modified according to the job you want to run.
<details open>
<summary>job.sh</summary>
```bash=
#!/bin/bash
#PBS -S /bin/bash
#PBS -N string_QMMM
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
#PBS -q gpu_40c_1n
#PBS -l nodes=1:ppn=40
#PBS -l walltime=12:00:00
#PBS -A simlab_project
#PBS -l epilogue=/shared/scripts/ADMIN__epilogue-qsub.example
### The header above should be adapted by the user; the simlab_project account applies to everyone except the Stirnemann team ###
### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###
WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)
if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
echo "At least one variable is needed but not defined. Please touch your manager about."
exit 1
else
if [ $NUM_NODES -le 1 ]; then
WORKDIR+="scratch/"
export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
mkdir $WORKDIR
rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
/shared/scripts/ADMIN__auto-rsync.example 200 & # sync frequency (secondes)
else
export WORKDIR=$PBS_O_WORKDIR
fi
fi
echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)
cd $WORKDIR
# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
#if [ $NUM_NODES -eq 1 ]; then
# export PSM_DEVICES=self,shm
# export OMPI_MCA_mtl=^psm
# export OMPI_MCA_btl=shm,self
#else
# Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
# we have to share PSM contexts between processes
# CIN is here the number of cores in node
# CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
# if [ $(($CIN/16)) -ge 2 ]; then
# PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
# if [ $CIN -eq 40 ]; then
# export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
# elif [ $CIN -eq 32 ]; then
# export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
# else
# echo "This computing node is not supported by this script"
# fi
# echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
# else
# echo "no PSM_SHAREDCONTEXTS_MAX to define"
# fi
#fi
function get_gpu-ids() {
if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
echo "0,1" && return
fi
if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
else
FILE=""
fi
if [ -e $FILE ]; then
if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
echo "0" && return
else
echo "1" && return
fi
else
echo "0,1" && return
fi
}
gpus=$(get_gpu-ids)
## END-DO
##
## USER part (i should do my stuff here)
##
#module load torque/gnu/6.1.2
module load Ambertools-ASM/gnu/21
module load gcc/8.3.0
module load openmpi/gnu/3.1.4
#Run
cd $WORKDIR/
rm -rf results
mkdir results
export AMBERHOME=/shared/software/Ambertools-ASM/21/gnu
source $AMBERHOME/amber.sh
mpirun sander.MPI -ng 40 -groupfile string.groupfile # USER: adapt this line; for the string, run one replica per node (-ng = number of entries in string.groupfile)
cd $WORKDIR
date
## DO NOT MODIFY THE PART OF SCRIPT: you will be accountable for any damage you cause
# At the term of your job, you need to get back all produced data synchronizing workdir folder with you starting job folder and delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
cd $PBS_O_WORKDIR
rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
rm -rf $WORKDIR
fi
## END-DO
```
</details>
Then, to run it on baal:
>$ qsub job.sh
## Troubleshooting
If something went wrong, check that:
- You have all the required files **in your working directory**
    - [ ] CVs
    - [ ] STRING
    - [ ] guess (original or from the last [num].string)
    - [ ] in.sh
    - [ ] model.in
    - [ ] job.sh
    - [ ] starting structures
    - [ ] topology file
    - [ ] string.groupfile (generated by in.sh)
    - [ ] (restraint file)
- You requested a walltime shorter than the limit set on the cluster
- You checked the groupfile and scripts (correct paths, number of cores, etc.)
- You used and pointed to the correct starting structures
- Your system is centered in the box (Amber manages PBC, but ASM doesn't)
## After the first run
Once the job is finished, run the following Python script to check for convergence:
<details open>
<summary>checkstringV2.py</summary>
```python=
import re
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
def get_nnodes():
"""
Assumes that for each node there is an 'in' file (1.in, 2.in etc) and
there are no other 'in' files with numeric names.
"""
regex = re.compile(r"([0-9]*)\."+"in")
paths = filter(lambda x: regex.match(x) and os.path.isfile(x), os.listdir('.'))
return len(list(paths))
def parse_string():
preparation_steps = None
results_dir = '.'
only_pmf = False
CV_file = 'CVs'
frequency_PMF = int(2/0.001)
with open('STRING') as f:
for raw_line in f:
line = raw_line.strip()
if line.startswith('dir'):
results_dir = line.split()[-1].strip('"')
if line.startswith('preparation_steps'):
preparation_steps = int(line.split()[-1])
if line.startswith('only_PMF'):
only_pmf = line.split()[-1].lower() == '.true.'
if line.startswith('buffer_size'):
frequency_PMF = int(line.split()[-1])
if line.startswith('CV_file'):
CV_file = line.split()[-1].strip('"')
return preparation_steps, results_dir, only_pmf, frequency_PMF, CV_file
def state_string():
last_string_step = int(np.loadtxt(os.path.join(results_dir, 'convergence.dat'))[-1][0])
start_data_analysis = frequency_PMF
end_data_analysis = last_string_step - last_string_step%frequency_PMF
if end_data_analysis >= 6 * int(frequency_PMF):
start_data_analysis = end_data_analysis - 4 * frequency_PMF
return start_data_analysis, end_data_analysis
def CVs_lists():
colvars = []
BOND = []
ANGLES = []
with open(CV_file) as f:
counter = 0
for raw_line in f:
line = raw_line.strip()
if 'COLVAR_type' in line:
CV = line.split()[-1].strip('"')
counter += 1
colvars.append(counter)
if CV == 'BOND' or CV == 'PPLANE':
BOND.append(counter)
if CV == 'ANGLE' or CV == 'DIHEDRAL':
ANGLES.append(counter)
return BOND, ANGLES, colvars
def time_step():
    """Return the MD timestep (ps) read from 1.in, defaulting to 0.001 if not found."""
    time_step = 0.001
    with open('1.in') as f:
        for raw_line in f:
            line = raw_line.strip()
            if line.startswith('dt'):
                time_step = float(line.split()[-1].strip(","))
    return time_step
def nodes_vlimit_exceeded():
"""
    If vlimit is exceeded, there is something wrong with the structures,
    so this function checks whether it was exceeded in any string node.
"""
print ('Checking rst integrity')
for node in range (1, get_nnodes() + 1):
with open(str(node)+'.out') as f:
if 'vlimit exceeded' in f.read():
print ('check node', node)
print ('Not mentioned structures are OK')
def canvas_grid_definition():
if only_pmf == True:
x, y = 1, 1
elif len(BOND) != 0 and len(ANGLES) !=0:
x, y = 4, 1
else:
x, y = 3, 2
grid = plt.GridSpec( x, y, wspace=0.1, hspace=0.4)
return grid
def CVs_plot(CV):
if CV == BOND:
plt.title('Distances along the string')
for string_value in range(start_data_analysis,end_data_analysis+1,frequency_PMF):
strings_data = pd.read_csv(f'results/{string_value}.string',delim_whitespace=True, names=colvars)
sns.lineplot(data=strings_data[CV], linewidth=0.7,legend=False, dashes=False)
guess_data = pd.read_csv('results/0.string', delim_whitespace=True, names=colvars)
sns.lineplot(data=guess_data[CV], dashes=False)
plt.xlabel('node')
plt.ylabel('$\AA$')
else:
plt.title('Angles along the string')
for string_value in range(start_data_analysis, end_data_analysis + 1, frequency_PMF):
strings_data = pd.read_csv(f'results/{string_value}.string', delim_whitespace=True, names=colvars)
sns.lineplot(data=strings_data[CV], linewidth=0.7, legend=False, dashes=False)
guess_data = pd.read_csv('results/0.string', delim_whitespace=True, names=colvars)
sns.lineplot(data=guess_data[CV], dashes=False)
plt.xlabel('node')
plt.ylabel('Degrees')
return
nodes_vlimit_exceeded()
preparation_steps, results_dir, only_pmf, frequency_PMF, CV_file = parse_string()
header_names = ['s','N_bin','Pop_bin','Force', 'Force_CI (95%)','PMF','PMF_CI (95%)']
if only_pmf:
last_step_PMF = len(open(f'{results_dir}/1_final.dat').readlines())
final_PMF = last_step_PMF - last_step_PMF % frequency_PMF
final_PMF_data = pd.read_csv(f'{results_dir}/{final_PMF}_final.PMF', delim_whitespace=True, skiprows=(0, 1),
header=(0), names=header_names)
plt.title(f'{final_PMF}*{time_step()} ps of PMF')
sns.lineplot(data=final_PMF_data, x="s", y="PMF")
plt.fill_between(final_PMF_data['s'], final_PMF_data['PMF'] - final_PMF_data['PMF_CI (95%)'],
final_PMF_data['PMF'] + final_PMF_data['PMF_CI (95%)'], alpha=0.2)
preparation_steps, results_dir, only_pmf, frequency_PMF, CV_file = parse_string()
start_data_analysis, end_data_analysis = state_string()
BOND, ANGLES, colvars = CVs_lists()
grid = canvas_grid_definition()
print('start {0} ends {1} freq {2}'.format(state_string()[0],state_string()[1], frequency_PMF))
plt.subplots(constrained_layout=True)
#plt.subplot(x,y,1)
plt.subplot(grid[0, 0])
plt.title('Convergence')
convergence_data = pd.read_csv('results/convergence.dat', delim_whitespace=True, names=['time','convergence'])
convergence_data['time'] = convergence_data['time']*time_step()
sns.lineplot(x='time',y='convergence',data=convergence_data)
plt.xlabel('time (ps)')
plt.ylabel('s ($amu^{1/2}$$\AA$)')
#plt.subplot(x, y, 2)
plt.subplot(grid[1, 0])
if os.path.isfile('results/STOP_STRING') and os.path.isfile(f'results/{frequency_PMF}_final.PMF'):
last_step_PMF = len(open('results/1_final.dat').readlines())
final_PMF = last_step_PMF - last_step_PMF % frequency_PMF
plt.title('FINAL PMF')
final_PMF_data = pd.read_csv(f'results/{final_PMF}_final.PMF', delim_whitespace=True, skiprows=(0, 1),
header=(0), names=header_names)
sns.lineplot(data=final_PMF_data, x="s", y="PMF")
plt.fill_between(final_PMF_data['s'], final_PMF_data['PMF'] - final_PMF_data['PMF_CI (95%)'],
final_PMF_data['PMF'] + final_PMF_data['PMF_CI (95%)'], alpha=0.2)
else:
plt.title('PMFs along the string')
for PMF in range(start_data_analysis,end_data_analysis+1,frequency_PMF):
PMF_data = pd.read_csv(f'results/{PMF}.PMF', delim_whitespace=True, skiprows=(0,1),header=(0), names=header_names)
sns.lineplot(x='s',y='PMF' ,data=PMF_data,linewidth=0.7, label=PMF)
if len(BOND) != 0 and len(ANGLES) !=0:
#plt.subplot(x,y,3)
plt.subplot(grid[2, 0])
CVs_plot(BOND)
#plt.subplot(x,y,4)
plt.subplot(grid[3, 0])
CVs_plot(ANGLES)
else:
REX_list = ['step']
#plt.subplot(x, y, 4)
plt.subplot(grid[:, 1])
plt.title('replica exchange along the string')
for node in range(1, get_nnodes() + 1):
REX_list.append(node)
replica_exchange_data = pd.read_csv('results/plot.REX', delim_whitespace=True, names=REX_list)
for node in range(1, get_nnodes() + 1):
sns.lineplot(data=replica_exchange_data[node], linewidth=0.7, dashes=False)
plt.xlabel('time')
plt.ylabel('String Node')
if len(BOND) != 0:
plt.subplot(grid[2, 0])
#plt.subplot(x, y, 3)
CVs_plot(BOND)
else:
#plt.subplot(x, y, 3)
plt.subplot(grid[2, 0])
CVs_plot(ANGLES)
mng = plt.get_current_fig_manager()
mng.resize(1800,1000)
plt.show()
```
</details>
If you are executing this on baal, please make sure you have loaded the module providing python3. To see which module you should load:
>$ module avail
Also make sure you activate the proper conda environment (should be 'sci_tools') to access the required packages.
:::warning
:warning: **Job submission**
Although this script is quite fast to run and does not take a lot of resources, please avoid running it directly on the head node with:
>$ ./checkstringV2.py
as it is bad practice to run jobs without going through the job manager.
To properly run your job **on baal**, please use:
>$ qsub -I -l nodes=1:ppn=40 -A simlab_project
This starts an interactive session that logs you into a compute node.
Press CTRL+D to leave the compute node and return to the head node.
:::
After running this script, you should see a pop-up window showing plots like the following, *assuming you enabled X forwarding (`ssh -X`) when logging in*:

The most important plot is the convergence plot: it tells you how close your string is to converging. To decide when the string should be stopped, look for a plateau like this one from the [GitHub](https://github.com/kzinovjev/string-amber/blob/master/tutorial.md#string-file) tutorial:

## Continuing the string
The [GitHub](https://github.com/kzinovjev/string-amber/blob/master/tutorial.md#string-file) tutorial states that "*the job should always be killed manually by the user, after checking that no more sampling is needed. So, it is better to put some unreasonably large number for the total number of steps (`nstlim`).*"
But clusters usually impose a walltime limit (on baal it depends on the compute nodes; see the details [here](http://www-lbt.ibpc.fr/wiki/doku.php?id=cluster-lbt:limits)). Therefore, you may need to continue the string several times until convergence is reached, and then stop it with the proper procedure.
First, create a directory named `sorted/` in the directory where your string finished running, and copy into it the following script, which reorders the restart structures according to the replica-exchange record:
<details open>
<summary>order.py</summary>
```python=
#!/usr/bin/python
import os
with open('../results/plot.REX', 'r') as f:
idx = f.readlines()[-1].split()[1:]
for i in range(len(idx)):
os.system('cp ../' + str(i+1) + '.rst7 ' + idx[i] + '.rst7')
```
</details>
Then, create a new directory following the suggested architecture, and move the `sorted/` directory into it, renaming it `starting_struct/`:
>$ mv ./0_string/sorted/ ./1_string_continue/starting_struct/
After that, go into `0_string/results/` and run `ls -1v *.string`; the last file listed is the latest `[file].string`.
This file will be the new `guess` file: copy it into the new directory and rename it:
>$ cp ./0_string/results/[file].string ./1_string_continue/guess
Then add a new line at the top of this file containing the number of string nodes.
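A quick way to prepend that line (here assuming 48 nodes; adjust to your case) is the same `sed` trick used by `cp_mkdir_continu_string.sh` in the Automation section:
>$ sed -i '1i 48' ./1_string_continue/guess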
Next, copy over the required files:
- [ ] CVs
- [ ] STRING
- [ ] model.in
- [ ] job.sh
- [ ] topology file
Finally, the input-generation script must point to the starting structures obtained from the previous run. Here is the modified script:
<details open>
<summary>continue_in.sh</summary>
```bash=
#!/bin/bash
# script modified by LK 28/02/23
# If groupfile exists, remove it
if [ -f string.groupfile ];
then
rm string.groupfile
fi
PARM=dinuc.prmtop # path to topology file
REACT=./starting_struct/ # path to starting structures
# Number of string nodes is provided as command line argument
nodes=$1
for i in `seq 1 $nodes`; do
# generate sander input for node $i
sed "s/XXXXX/$RANDOM/g" model.in > $i.in
# write out the sander arguments for node $i to the groupfile
echo "-O -rem 0 -i $i.in -o $i.out -c $REACT$i.rst7 -r $i.rst7 -x $i.nc -inf $i.mdinfo -p $PARM" >> string.groupfile
done
echo "N REPLICAS = $i"
```
</details>
And run it:
>$ ./continue_in.sh [number of nodes]
>$ qsub job.sh
## Visualise your trajectory
To get the last frame of each string node, execute:
<details open>
<summary>order.py</summary>
```python=
#!/usr/bin/python
import os
with open('../results/plot.REX', 'r') as f:
idx = f.readlines()[-1].split()[1:]
for i in range(len(idx)):
os.system('cp ../' + str(i+1) + '.rst7 ' + idx[i] + '.rst7')
```
</details>
:::danger
:warning: If you are using this script after stopping the optimization, you will need to change the filename 'plot.REX' to 'plot_final.REX' in the script.
Otherwise, the last frames will not be reordered properly.
:::
After executing this script, you will have as many .rst7 files (restart files) as string nodes. This helps to check that the structures look reasonable overall.
To load them in VMD, first load the .prmtop file as a new molecule. Then type the following command in the VMD Tk console (example with 40 string nodes):
> for {set i 1} {$i < 41} {incr i} {mol addfile ${i}.rst7 type rst7 waitfor -1}
:::info
:information_source: Note that VMD counts from 0 (both frames and atom indices), whereas Amber counts from 1.
:::
Or you can use this Tcl script, which automates everything for you:
<details open>
<summary>vizualise_LF_string.tcl</summary>
```tcl=
#!/usr/bin/tclsh
# This script is meant to load the last frame of the string and the topology all at once, LK 21/04/23
# execute it in the terminal: vmd -e vizualise_LF_string.tcl path/to/topfile.top and follow the instructions in the terminal
# ask how many frames to load
puts "How many frames should be loaded?"
# get input and add +1 for the loop later
gets stdin frame
set nframe [expr $frame+1]
# loop to load all the frames
for {set i 1} {$i<$nframe} {incr i} {mol addfile ${i}.rst7 type rst7 waitfor -1}
# create 1st rep
mol delrep 0 top
mol representation CPK 1.700000 0.300000 12.000000 12.000000
mol color Name
mol selection {not water} ;# change the selection accordingly
mol material Opaque
mol addrep top
# create 2nd rep
mol representation CPK 0.600000 0.300000 12.000000 12.000000
mol color Name
mol selection {same residue as water and within 3 of not water} ;# change the selection accordingly
mol material Opaque
mol addrep top
mol selupdate 1 top 1
mol colupdate 1 top 0
```
</details>
To execute it, simply type in a terminal:
>$ vmd -e vizualise_LF_string.tcl path/to/topology_file.top
and indicate the number of frames to load when prompted in the terminal.
To get the reordered trajectories of each string node, use the following script:
<details open>
<summary>reorder_try.py</summary>
```python3=
#!/usr/bin/env python
import argparse
import os
import shutil
import re
from operator import itemgetter
from itertools import groupby
TRAJECTORY_EXTENSION = 'nc'
SANDER_INPUT_EXTENSION = 'in'
OUTPUT_DIR = 'reorder_trj_results'
TMP_DIR = '.reorder_trj_tmp'
CPPTRAJ_IN_NAME = '.reorder_cpptraj_tmp.in'
def new_fragment(trj, node, start, stop):
return {'trj': trj, 'node': node, 'start': start, 'stop': stop}
def step_to_frame(step, steps_per_frame, offset):
return (step + offset) // steps_per_frame
def data_to_fragments(data, steps_per_frame, offset):
ntrj = len(data[0][1:])
fragments = []
first_frame = step_to_frame(0, steps_per_frame, offset) + 1
for trj in range(1, ntrj + 1):
trj_fragments = []
last_frame = -1
for row in data:
frame = step_to_frame(row[0], steps_per_frame, offset)
if frame == last_frame or frame < first_frame:
continue
node = row[trj]
if len(trj_fragments) == 0:
trj_fragments.append(new_fragment(trj, node, first_frame, frame))
elif trj_fragments[-1]['node'] == node:
trj_fragments[-1]['stop'] = frame
else:
trj_fragments.append(new_fragment(trj, node, frame, frame))
last_frame = frame
fragments.append(trj_fragments)
return fragments
def group_fragments_by_node(fragments):
flat_fragments = [_ for trj_fragments in fragments for _ in trj_fragments]
sorted_fragments = sorted(flat_fragments, key=itemgetter('node', 'start'))
return groupby(sorted_fragments, key=itemgetter('node'))
def assign_fragments_filenames(fragments, path):
for node, node_fragments in group_fragments_by_node(fragments):
for index, fragment in enumerate(node_fragments):
fragment['filename'] = os.path.join(path, '{}_{}.nc'.format(node, index))
def break_trj(in_filename, parm, trj_filename, fragments):
with open(in_filename, 'w') as f:
f.write('parm {}\n'.format(parm))
f.write('trajin {}\n'.format(trj_filename))
for fragment in fragments:
f.write('trajout {filename} start {start} stop {stop}\n'
.format(**fragment))
os.system('/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools21-ASM/bin/cpptraj -i {} >> break.tmp'.format(in_filename))
def combine_node_trj(in_filename, parm, node_trj_filename, node_fragments):
with open(in_filename, 'w') as f:
f.write('parm {}\n'.format(parm))
for fragment in node_fragments:
f.write('trajin {filename}\n'.format(**fragment))
f.write('trajout {}\n'.format(node_trj_filename))
os.system('/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools21-ASM/bin/cpptraj -i {} >> combine.tmp'.format(in_filename))
def mkdir(name):
try:
os.mkdir(name)
except OSError:
print('{} exists and will be removed'.format(name))
shutil.rmtree(name)
os.mkdir(name)
def move_dir_from_tmp(name):
tmp_path = os.path.join(TMP_DIR, name)
output_path = os.path.join(OUTPUT_DIR, name)
os.system('mv {} {}'.format(tmp_path, output_path))
def reorder_trajectories(topology, filenames, trj_period,
offset, data, subdir, output_format):
fragments = data_to_fragments(data, trj_period, offset)
assign_fragments_filenames(fragments, TMP_DIR)
for i, trj_fragments in enumerate(fragments):
break_trj(CPPTRAJ_IN_NAME, topology, filenames[i], trj_fragments)
print('Splitting {} of {} done'.format(i+1, len(filenames)))
path = os.path.join(TMP_DIR, subdir)
mkdir(path)
for node, node_fragments in group_fragments_by_node(fragments):
node_trj_path = os.path.join(path, str(node) + '.' + output_format)
combine_node_trj(CPPTRAJ_IN_NAME, topology, node_trj_path, node_fragments)
print('Combining {} of {} done'.format(node, len(filenames)))
move_dir_from_tmp(subdir)
if subdir == 'string':
os.system("rm {}/*".format(TMP_DIR))
def read_data(filename):
with open(filename) as f:
return [[int(x) for x in line.strip().split()] for line in f]
def get_nnodes():
"""
Assumes that for each node there is an 'in' file (1.in, 2.in etc) and
there are no other 'in' files with numeric names.
"""
regex = re.compile(r"([0-9]*)\."+SANDER_INPUT_EXTENSION)
paths = filter(lambda x: regex.match(x) and os.path.isfile(x),
os.listdir('.'))
return len(list(paths))
def trajectory_filenames(nnodes):
return ['{}.{}'.format(i, TRAJECTORY_EXTENSION) for i in range(1, nnodes+1)]
def assert_file_exist(filename):
if not os.path.exists(filename):
print('{} not found. Exiting.'.format(filename))
exit()
def parse_string():
preparation_steps = None
results_dir = '.'
only_pmf = False
assert_file_exist('STRING')
with open('STRING') as f:
for raw_line in f:
line = raw_line.strip()
if line.startswith('dir'):
results_dir = line.split()[-1].strip('"')
if line.startswith('preparation_steps'):
preparation_steps = int(line.split()[-1])
if line.startswith('only_PMF'):
only_pmf = line.split()[-1].lower() == '.true.'
return preparation_steps, results_dir, only_pmf
def parse_in(in_file):
ntwx = None
dt = None
assert_file_exist(in_file)
with open(in_file) as f:
ntwx_regex = re.compile(r".*=(\D*)([0-9]*)")
dt_regex = re.compile(r".*=(\D*)([0-9.]*)")
for line in f:
if line.strip().startswith('ntwx'):
ntwx = int(ntwx_regex.match(line.strip()).groups()[1])
elif line.strip().startswith('dt'):
dt = float(dt_regex.match(line.strip()).groups()[1])
return ntwx, dt
def parse_stop_string(stop_string):
try:
with open(stop_string) as f:
return int(f.readlines()[-1].strip())
except IOError:
return None
def main():
parser = argparse.ArgumentParser(
description=""
"Reorders trajectories generated by a string method calculation.\n"
"Must be executed from the working directory (the one where the STRING \n"
"file is located)",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("prmtop", help="Amber topology file")
parser.add_argument("--format", help="Format of the output trajectories",
default='nc')
parser.add_argument("--nodes", type=int,
help="Number of nodes (use if automatic detection fails)")
args = parser.parse_args()
topology = args.prmtop
# Parse STRING file to extract preparation_steps (default 2000),
# results directory and only_PMF
preparation_steps, results_dir, only_pmf = parse_string()
# Parse first input file to know how often the frames were saved to the
# trajectories (ntwx) and get the timestep (dt) to calculate
# preparation_steps
in_file = '1.{}'.format(SANDER_INPUT_EXTENSION)
ntwx, dt = parse_in(in_file)
if not ntwx:
print('ntwx not set in {}, so no trajectories were written. Exiting.'
.format(in_file))
exit()
if not preparation_steps:
preparation_steps = int(1. / dt)
# Parse convergence.dat (if found) to know after how many steps
# the job switched to pmf
try:
with open(os.path.join(results_dir, 'convergence.dat')) as f:
last_string_step = int(f.readlines()[-1].split()[0])
except IOError:
last_string_step = None
has_string_data = os.path.exists(os.path.join(results_dir, 'plot.REX'))
has_pmf_data = os.path.exists(os.path.join(results_dir, 'plot_final.REX'))
if not (has_string_data or has_pmf_data):
print('plot.REX or plot_final.REX not found in {}. Exiting.'
.format(results_dir))
exit()
string_offset = preparation_steps
    pmf_offset = string_offset + (last_string_step or 0)
if only_pmf:
has_string = False
has_pmf = True
elif last_string_step:
has_string = True
has_pmf = True
else:
has_string = True
has_pmf = False
nnodes = args.nodes or get_nnodes()
print("{} nodes".format(nnodes))
trajectory_names = trajectory_filenames(nnodes)
mkdir(TMP_DIR)
mkdir(OUTPUT_DIR)
if has_string:
string_data = read_data(os.path.join(results_dir, 'plot.REX'))
print("Extracting string trajectories...")
reorder_trajectories(topology, trajectory_names, ntwx,
string_offset, string_data, 'string', args.format)
if has_pmf:
pmf_data = read_data(os.path.join(results_dir, 'plot_final.REX'))
print("Extracting PMF trajectories...")
reorder_trajectories(topology, trajectory_names, ntwx,
pmf_offset, pmf_data, 'pmf', args.format)
shutil.rmtree(TMP_DIR)
if __name__ == '__main__':
main()
```
</details>
And execute it with the following command:
>$ ./reorder_try.py file.prmtop
From there, you will have the trajectory of each string node reconstructed across the replica exchanges (the frames come from different replicas, so these trajectories are not continuous).
## Get the PMF
Once the string has converged, you will need to stop the code manually by creating a file named STOP_STRING containing 2 values: the first one is the step number at which you estimate the string has converged, and the second one is the step number at which the code should stop optimising the string position and switch to Umbrella Sampling (US).
To do so, start a continuation as explained above and let the optimization run for a few picoseconds. Then, once the code has been running for a bit and the `results/` folder has been created, create the STOP_STRING file inside `results/`. There should be at least 2 ps between the step at which the string is stopped and the beginning of the US, and the US should last ~100 ps.
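For example (the step numbers are illustrative and match those used in job_ASM_24h.sh further below; with dt = 0.001 ps they leave 4 ps between stopping the string and the start of the US):
>$ echo "12000" > results/STOP_STRING
>$ echo "16000" >> results/STOP_STRING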
Then, create a WHAM directory and copy into it all the `results/*_final.dat` files. Copy the two following scripts there as well:
<details open>
<summary>convert_dat.py</summary>
```python3=
#!/usr/bin/python
import sys
with open(sys.argv[1], 'r') as f:
data = f.readlines()
print(data[0][:31])
for line in data[1:]:
print(line[:16])
```
</details>
<details open>
<summary>wham.f90</summary>
```fortran=
program wham
real*8, parameter :: T = 300. !Temperature
! real*8, parameter :: R = 0.0083144621 !Gas constant
real*8, parameter :: R = 0.0083144621 / 4.184 !Gas constant
real*8, parameter :: bin_width = 0.1 !Width of the bins (in the same units as the coordinate)
real*8, parameter :: threshold = 0.000001 !Threshold for convergence to be achieved (kcal/mol)
integer, parameter :: niter = 100000000 !Maximum number of iterations
integer, parameter :: minidx = 0 !Lowest index of input file (for example, dat.0)
integer, parameter :: maxidx = 1000 !Highest index of input file (for example, dat.1000)
character(len=*), parameter :: filename = ".dat" ! Postfix for the input filenames
integer :: nsims, nbins ! Number of simulations, number of bins
real*8, dimension(:), allocatable :: Kf ! force constant in each simulation
real*8, dimension(:), allocatable :: f, ftmp ! Normalizing constant for each simulation
real*8, dimension(:), allocatable :: x0 ! reference value for each simulation
real*8, dimension(:), allocatable :: x ! Coordinate values in each bin
real*8, dimension(:), allocatable :: P ! probabilities in each bin
real*8, dimension(:), allocatable :: PMF ! PMF in each bin
integer, dimension(:), allocatable :: Ns ! Number of points in each simulation
integer, dimension(:), allocatable :: Nb ! Total number of points in each bin
real*8, dimension(:,:), allocatable :: c ! biasing factors
integer :: i, j, k, bin
real*8 :: tmp, RT, thr, min_x, max_x, maxPMF, minPMF
logical, dimension(minidx:maxidx) :: exist
character*40 :: si
!---Figuring out which input files exist and counting them---
do i = minidx, maxidx
write(si,*) i
si = adjustl( si )
inquire( file = trim( si )//filename, exist = exist(i) )
end do
nsims = count( exist )
!------------------------------------------------------------
allocate( Kf(nsims), f(nsims), ftmp(nsims), x0(nsims), Ns(nsims) )
!---Determining force constants, reference values, largest and---
!----smallest x values and number of data points in each file----
max_x = -huge( 0.0_8 )
min_x = huge( 0.0_8 )
j = 0
do i = minidx, maxidx
if ( exist(i) ) then
j = j + 1
k = 0
write(si,*) i
si = adjustl( si )
open( unit = 10, file = trim( si )//filename, status = "old" )
read(10,*) Kf(j), x0(j)
do
read(10,*, end=10) tmp
min_x = min( tmp, min_x )
max_x = max( tmp, max_x )
k = k + 1
end do
10 continue
Ns(j) = k
end if
end do
!---------------------------------------------------------------
nbins = int( (max_x - min_x)/bin_width ) + 1
allocate( x(nbins), P(nbins), PMF(nbins), Nb(nbins) )
!---Finally, obtaining number of points in each bin---
Nb = 0
j = 0
do i = minidx, maxidx
if ( exist(i) ) then
j = j + 1
write(si,*) i
si = adjustl( si )
open( unit = 10, file = trim( si )//filename, status = "old" )
read(10,*) tmp, tmp
do k = 1, Ns(j)
read(10,*) tmp
bin = int( (tmp - min_x)/bin_width ) + 1
Nb(bin) = Nb(bin) + 1
end do
end if
end do
!-----------------------------------------------------
!---Calculating x values in the center of each bin---
do i = 1, nbins
x(i) = min_x+( real(i, kind = 8 ) - 0.5)*bin_width
end do
!----------------------------------------------------
!---Calculating the biasing factors---
allocate( c(nsims,nbins) )
RT = R*T
do i = 1, nsims
c(i,:) = exp(-0.5_8*Kf(i)*(x-x0(i))**2/RT) ! c(i,j)=exp(-Vi(xj)/RT)
end do
!-------------------------------------
f = 1.
thr = threshold/RT ! converting threshold
!---WHAM!---
do i = 1, niter
ftmp = f
do j = 1, nbins
P(j) = real( Nb(j), kind = 8 )/sum( real( Ns, kind = 8 )*f*c(:,j) )
end do
f = 1._8/matmul( c, P )
if ( maxval( abs( log(f/ftmp) ) ) < thr ) then
write(*,"(A,i6,A)") "#converged in ",i," iterations"
exit
end if
if (i == niter) write(*,"(A)") "#not converged!"
end do
!-----------
!---Calculating the PMF---
P = P / sum( P )
do i = 1, nbins
if( P(i) > 0 ) then
PMF(i) = -R*T*log(P(i))
else
PMF(i) = huge( 0.0_8 )
end if
end do
PMF = PMF
!-------------------------
!---Determining the barrier height---
maxPMF = maxval( PMF(10:nbins-10) ) !First and last 10 bins are ignored
i = 10
do while ( PMF(i) < maxPMF )
i = i + 1
end do
minPMF = PMF(i)
do j = 10, i-1
minPMF = min( minPMF, PMF(j) )
end do
write(*,"(A17,F23.10)") "#Barrier height: ", maxPMF-minPMF
!------------------------------------
write(*,"(A)") "#PMF:"
PMF = PMF - minPMF
do i = 1, nbins
write(*,"(I20,2F20.10)") Nb(i), x(i), PMF(i)
end do
close( 10 )
deallocate( Kf, f, ftmp, x0, Ns, x, P, PMF, Nb, c )
end program
```
</details>
And finally, execute the following command in the terminal (here for 64 string nodes):
>$ for i in \`seq 1 64\`; do python convert_dat.py ${i}_final.dat > $i.dat; done; gfortran -o wham.exe wham.f90; ./wham.exe > PMF
You then have a file named PMF in which the first column is the bin population, the second column is the reaction coordinate (the progression of the reaction), and the third column is the PMF in kcal/mol.
Or, more simply, use this wrapper script:
<details open>
<summary>make_wham.sh</summary>
```bash=
#!/bin/bash
# This script automates the sequence of commands used to make the PMF for the ASM, LK 05/05/23
# Execute it in the /WHAM directory
cp ../results/*final.dat .
cp ~/Documents/scripts/001_python_script/convert_dat.py .
cp ~/Documents/scripts/002_fortran_script/wham.f90 .
echo "How many string nodes are there?"
read n_nodes
for i in `seq 1 $n_nodes`;do python convert_dat.py ${i}_final.dat > $i.dat; done; gfortran -o wham.exe wham.f90; ./wham.exe > PMF
```
</details>
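To get a quick look at the resulting profile (assuming gnuplot is available; the lines starting with # in the PMF file are skipped automatically), you can plot column 3 against column 2:
>$ gnuplot -e "set xlabel 's'; set ylabel 'PMF (kcal/mol)'; plot 'PMF' using 2:3 with lines; pause -1"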
## Automation
To run an optimisation and then an optimisation + US in one shot, use the following scripts in their respective directories.
You will need to adapt the directory names and arguments in the optimisation job script and in the script that creates and copies the directories (`cp_mkdir_continu_string.sh`).
```
./Project
├── job_cpmkdir.sh
├── cp_mkdir_continu_string.sh
├── 000_string_opti
│   └── job1.sh
└── 001_string_opti_et_US
    └── job_ASM_24h.sh
```
<details open>
<summary>cp_mkdir_continu_string.sh</summary>
```bash=
#!/bin/bash
###>> This script is meant for preparing all the files to continue a string in opti+US mode after optimisation, LK 26/05/23 <<###
###>> use it in the same dir as $prev_dir and $dest_dir and run it: cp_mkdir_continu_string.sh <prev_dir> <dest_dir> <top_file> <string_nodes> <name_job> <<###
###>> this script makes assumption on some file names, please check before using it <<###
prev_dir="$1" # like ./TEST_004_string_opti_agrstrt
dest_dir="$2" # like ./TEST_005_string_opti_et_US_agrstrt
top_file="$3" # just its name
string_nodes="$4" # just the number
name_job="$5" # just its name
mkdir "$dest_dir"
cp "$prev_dir"/CVs "$prev_dir"/STRING "$prev_dir"/model "$prev_dir"/"$top_file" "$dest_dir/"
### check if there is a restraint file and cp it if it is there or skip it
rstrt_file="restraint"
if [ -f "$prev_dir"/"$rstrt_file" ]; then
echo "restraint file found. Copying..."
cp "$prev_dir"/"$rstrt_file" "$dest_dir/"
else
echo "No restraint file found. Skipping..."
fi
### sort the .rst7 file and copy them as starting structures
mkdir "$prev_dir"/sorted
cp ~/scripts/001_python_script/mod_order.py "$prev_dir"/sorted
cd "$prev_dir"/sorted || exit
python3 mod_order.py
cd ../../ || exit
cp -r "$prev_dir"/sorted "$dest_dir/starting_struct"
### get the highest string file and copy it as guess
cd "$prev_dir"/results/ || exit
new_guess=$(ls -1v *.string | tail -n 1)
echo "$new_guess"
cd ../../ || exit
cp "$prev_dir"/results/"$new_guess" "$dest_dir/guess"
sed -i "1i $string_nodes" "$dest_dir/guess"
### edit STRING and model files
sed -i 's/.true./.false./g' "$dest_dir/STRING"
sed -i 's/ nstlim = 50000, ! Steps, for 25 ps/ nstlim = 200000, ! Steps, for 100 ps/g' "$dest_dir/model"
### cp and edit the job.sh file
cp ~/scripts/003_bash_script/job_ASM_24h.sh $dest_dir/
sed -i "s/NAME_JOB/$name_job/g" $dest_dir/job_ASM_24h.sh
### generate the input files
cp ~/scripts/003_bash_script/continue_in.sh "$dest_dir/"
cd "$dest_dir" || exit
./continue_in.sh "$string_nodes"
```
</details>
<details open>
<summary>job1.sh</summary>
```bash=
#!/bin/bash
#MSUB -r DOPO_angOPrst_ASMopti # Request name
#MSUB -N 1 # number of nodes
#MSUB -n 64 # number of cpus
#MSUB -T 86400 #i.e 24h Elapsed time limit in seconds
#MSUB -o log.%I.o # Standard output. %I is the job id
#MSUB -e log.%I.e # Error output. %I is the job id
#MSUB -q rome
#MSUB -A gen11021 # Project ID
#MSUB -m scratch
set -x
module purge
module load feature/mkl/mpi feature/mkl/multi-threaded
module load gnu mpi fftw3/mkl/
AMBERHOME='/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools22-ASM'
source $AMBERHOME/amber.sh
rm -rf results
mkdir results
ccc_mprun $AMBERHOME/bin/sander.MPI -ng 64 -groupfile ./string.groupfile
cd ..
ccc_msub job_cpmkdir.sh
```
</details>
<details open>
<summary>job_cpmkdir.sh</summary>
```bash=
#!/bin/bash
#MSUB -r DOPO_cpmkdir # Request name
#MSUB -N 1 # number of nodes
#MSUB -o log.%I.o # Standard output. %I is the job id
#MSUB -e log.%I.e # Error output. %I is the job id
#MSUB -q rome
#MSUB -A gen11021 # Project ID
#MSUB -m scratch
set -x
ccc_mprun ./cp_mkdir_continu_string.sh ./006_string_OPTI_ang_et_OP_rstrt ./007_string_OPTI_US_ang_et_OP_rstrt RD5N_R3H.prmtop 64 DOPO_angOPrst_ASMoptiUS
cd ./007_string_OPTI_US_ang_et_OP_rstrt/
ccc_msub job_ASM_24h.sh
```
</details>
<details open>
<summary>job_ASM_24h.sh</summary>
```bash=
#!/bin/bash
#MSUB -r DOPO_angOPrst_ASMoptiUS # Request name
#MSUB -N 1 # number of nodes
#MSUB -n 64 # number of cpus
#MSUB -T 86400 #i.e 24h Elapsed time limit in seconds
#MSUB -o log.%I.o # Standard output. %I is the job id
#MSUB -e log.%I.e # Error output. %I is the job id
#MSUB -q rome
#MSUB -A gen11021 # Project ID
#MSUB -m scratch
set -x
module purge
module load feature/mkl/mpi feature/mkl/multi-threaded
module load gnu mpi fftw3/mkl/
AMBERHOME='/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools22-ASM'
source $AMBERHOME/amber.sh
rm -rf results
mkdir results
echo "12000" > results/STOP_STRING
echo "16000" >> results/STOP_STRING
ccc_mprun $AMBERHOME/bin/sander.MPI -ng 64 -groupfile ./string.groupfile
```
</details>
## Analysis
To analyse your trajectories, you can use the script [get_pmf_cv_values.py](https://github.com/kzinovjev/string-amber/blob/master/utils/get_pmf_cv_values.py) provided by Kirill, but it only works for the CV(s) defined in the CV file of the string. It can also cause issues, as PBC are not taken into account.
An alternative is the two scripts below, which give more freedom in the choice of CV(s).
The first one processes the Amber trajectories and extracts the CV(s) of each frame using cpptraj, which automatically takes PBC into account.
To use it, you will need a cpptraj template file, e.g.:
```
# cpptraj_template.in
parm RD5N_POC_Mg.prmtop
trajin {input_file}
distance Mg_O1P @6490 @15 out {output_file}
distance O2_P @12 @14 out {output_file}
distance P_O5 @14 @17 out {output_file}
distance O2_O5 @14 @12 out {output_file}
angle O2_P_O5 @12 @14 @17 out {output_file}
distance Mg_Ow1 @6490 @4303 out {output_file}
distance Mg_Ow2 @6490 @4204 out {output_file}
distance Mg_Ow3 @6490 @4369 out {output_file}
distance Mg_Ow4 @6490 @4321 out {output_file}
distance Mg_Ow5 @6490 @3991 out {output_file}
run
```
Be careful to respect the naming convention for the placeholders (e.g. "{input_file}") in the cpptraj template file.
You can also choose the naming of the generated files. To see all the options, run `python3 compute_CV_from_traj.py --help`.
<details open>
<summary>compute_CV_from_traj.py</summary>
```python3=
import subprocess
import argparse
def main():
"""
Main function for processing Amber trajectories using cpptraj.
Parses command-line arguments, reads the template file, and processes each trajectory.
Parameters:
None
Returns:
None
"""
parser = argparse.ArgumentParser(description='Extract the values of CV(s) from Amber trajectory using cpptraj.')
parser.add_argument('num_file', type=int, help='Number of trajectories to be processed')
parser.add_argument('template_file', help='Template file of the cpptraj commands to be executed.')
parser.add_argument('topology_file', help='Path to the topology file (e.g., RD5N_POC_Mg.prmtop).')
parser.add_argument('-o', '--output', help='Path to the output file (optional). Default is "CV_{index}.txt".', default='CV_{index}.txt')
args = parser.parse_args()
# Define the cpptraj commands template
template_file_path = args.template_file
with open(template_file_path, "r") as template_file:
cpptraj_commands_template = template_file.read()
# Loop over your input files
num_file = args.num_file
for i in range(1, num_file + 1):
process_trajectory(i, args.output, args.template_file, args.topology_file, cpptraj_commands_template)
def process_trajectory(i, output_format, template_file, topology_file, template_content):
"""
Process a single trajectory.
Parameters:
- i (int): Index of the trajectory.
- output_format (str): Output file format with placeholder for index (e.g., 'output_{index}.txt').
- template_file (str): Path to the template file of cpptraj commands.
- topology_file (str): Path to the topology file (e.g., RD5N_POC_Mg.prmtop).
- template_content (str): Content of the cpptraj commands template.
Returns:
None
"""
# Define input and output file names
input_file = f"{i}.nc"
output_file = output_format.format(index=i)
# Create the cpptraj commands with the specific input and output files
cpptraj_commands = template_content.format(input_file=input_file, index=i, output_file=output_file)
# Save the cpptraj commands to a file
cpptraj_script_path = f"cpptraj_script_{i}.in"
with open(cpptraj_script_path, "w") as script_file:
script_file.write(cpptraj_commands)
# Run cpptraj using subprocess with the specified topology file
cpptraj_command = f"cpptraj -p {topology_file} -i {cpptraj_script_path}"
subprocess.run(cpptraj_command, shell=True)
# Remove the temporary script file
subprocess.run(f"rm {cpptraj_script_path}", shell=True)
if __name__ == "__main__":
main()
```
</details>
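A typical invocation, assuming 64 node trajectories (1.nc ... 64.nc) and the template above, would be:
>$ python3 compute_CV_from_traj.py 64 cpptraj_template.in RD5N_POC_Mg.prmtop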
Then, to compute the mean and standard deviation of each CV, you can use this other Python script:
<details open>
<summary>compute_meanSTD_fromCV.py</summary>
```python3=
import pandas as pd
import os
from natsort import natsorted
import argparse
import re
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser(description='Process CV files and compute mean and std.')
parser.add_argument('-o', '--output', help='Custom output filename. Default is: "CV_{index}.txt"', default='CV_{index}.txt')
parser.add_argument('num_file', type=int, help='Number of files to be processed')
parser.add_argument('-p', '--path', help='Path of the file to be processed. Default is current directory', default='./')
parser.add_argument('-n', '--name', help='Name to save the dataframe as csv file. Default is: "result_dataframe.csv"', default='result_dataframe.csv')
args = parser.parse_args()
def process_cv_file(file_path, result_index, output_filename):
"""
Process CV file and compute mean and std.
Parameters:
- file_path (str): Path to the CV file.
- result_index (int): Index for the result DataFrame.
- output_filename (str): Output filename pattern.
Returns:
- pd.DataFrame: Result DataFrame.
"""
try:
file_data = pd.read_csv(file_path, delim_whitespace=True, index_col=False, encoding='utf-8')
file_data = file_data.drop(columns=["#Frame"])
column_means = file_data.mean()
column_stds = file_data.std()
result_df = pd.DataFrame(index=[result_index+1])
for key in column_means.index:
result_df[f'{key}_mean'] = column_means[key]
result_df[f'{key}_std'] = column_stds[key]
return result_df
except pd.errors.EmptyDataError:
logger.warning(f"The file {file_path} is empty. Skipping.")
except Exception as e:
logger.error(f"Error processing file {file_path}: {e}")
def main(base_directory, output_filename=args.output):
"""
Main function to process CV files in a directory.
Parameters:
- base_directory (str): Directory containing CV files.
- output_filename (str): CV filename pattern.
Returns:
- pd.DataFrame: Final result DataFrame.
"""
final_result_df = pd.DataFrame()
all_files = os.listdir(base_directory)
output_pattern = re.escape(output_filename).replace(re.escape("{index}"), r"\d+")
matching_files = natsorted([file_name for file_name in all_files if re.match(output_pattern, file_name)])
for index, file_name in enumerate(matching_files[:args.num_file]):
file_path = os.path.join(base_directory, file_name)
logger.info(f"Processing File: {file_path}")
file_result_df = process_cv_file(file_path, index, output_filename.format(index=index))
if file_result_df is not None:
final_result_df = pd.concat([final_result_df, file_result_df], axis=0)
return final_result_df
if __name__ == "__main__":
data_directory = args.path
result_dataframe = main(data_directory, args.output)
result_dataframe.to_csv(data_directory + args.name)
```
</details>
Again, to see the options and arguments, run the script with `--help`.
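For example, to process the 64 CV files generated above in the current directory and save the summary to result_dataframe.csv (arguments illustrative):
>$ python3 compute_meanSTD_fromCV.py 64 -p ./ -o "CV_{index}.txt" -n result_dataframe.csv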
## Generate structures
To generate structures from the PMF stage of a string, use Rolf's script: `/home/kantin/Documents/scripts/001_python_script/RD_generate_string.py`
Then, in the extracted directory created by his script, organise the files:
<details open>
<summary>organise_extrac_rst7.sh</summary>
```bash=
#!/bin/bash
#### for organising in separate directories all the .rst7 files generated after extraction of the frames with RD_generate_string.py ####
for i in {1..20}; do
# Create the destination directory if it doesn't exist
mkdir -p "./${i}"
# Move files starting with the current number to the destination directory
mv ./${i}_*.rst7 "${i}/"
# Count the number of files in the subdirectory
    count=$(ls -1 "./${i}" | wc -l)
echo "Number of files in subdirectory ${i}: $count"
done
```
</details>
Then we need to relax the structures:
<details open>
<summary>continue_in.sh</summary>
```bash=
#!/bin/bash
# script modified by LK 28/02/23
# If groupfile exists, remove it
if [ -f string.groupfile ];
then
rm string.groupfile
fi
PARM=RD5N_POC.prmtop # path to topology file
REACT=./starting_struct/ # path to starting structures
# Number of string nodes is provided as command line argument
nodes=$1
for i in `seq 1 $nodes`; do
# generate sander input for node $i
sed "s/XXXXX/$RANDOM/g" model > $i.in
# write out the sander arguments for node $i to the groupfile
echo "-O -rem 0 -i $i.in -o $i.out -c ${REACT}NUM_$i.rst7 -r NUM_$i.rst7 -x $i.nc -inf $i.mdinfo -p $PARM" >> string.groupfile
done
echo "N REPLICAS = $i"
```
</details>
All can be found at /ccc/scratch/cont003/gen11021/kantinla/M2_stage_srtch/001_RD5N_POC/003_ASM_B3LYP/extracted.
After runnig the 50 steps, take the rst7 and from them extract the xyz files using cpptraj:
<details open>
<summary>all_make_and_sort.sh</summary>
```bash=
#!/bin/bash
####################################################
## generate the xyz files, move them in a new dir ##
####################################################
# LK, 19/07/2023
for dir in {1..20}
do
cd $dir
run_num=$(basename $dir)
for file in *.rst7
do
filename="${file%.*}"
cpptraj -p RD5N_POC.prmtop -y "$file" << EOF
strip !(:1-2)
trajout ${filename}_nW.xyz [xyz] nobox notitle append
run
EOF
tail -n +3 ${filename}_nW.xyz > ${filename}_qm.xyz
rm ${filename}_nW.xyz
# Move the stripped XYZ file to the directory named after the run number
mkdir -p ${run_num}_xyz_files
mv ${filename}_qm.xyz ${run_num}_xyz_files/
done
cd ..
    mv ${dir}/${run_num}_xyz_files/ ../
done
```
</details>
For generating orca inputs:
<details open>
<summary>gen_orc_in.sh</summary>
```bash=
#!/bin/bash
for dir in {1..20}; do
cd "${dir}_xyz_files" || {
echo "Failed to change directory to $dir" >&2
continue
}
if [ ! -f orca_input_model.txt ]; then
echo "orca_input_model.txt not found in directory $dir" >&2
cd ..
continue
fi
for file in *.xyz; do
if [ ! -f "$file" ]; then
echo "Skipping non-existent file: $file" >&2
continue
fi
filename="${file%.*}"
cat orca_input_model.txt "$file" > "${filename}_sp_orc.in" && echo "*" >> "${filename}_sp_orc.in"
done
mkdir "${dir}_orca_inputs"
mv *_sp_orc.in "${dir}_orca_inputs"
mv "${dir}_orca_inputs" ..
cd ..
done
```
</details>
For sqm we need 3 scripts:
<details open>
<summary>add_znum_to_xyz.py</summary>
```python=
import glob
atomic_numbers = {
"C": 6,
"H": 1,
"O": 8,
"P": 15,
}
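# NB: only C, H, O and P are listed above; add any other element present in your QM region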
file_pattern = "1_*_qm.xyz"
# Find all files matching the pattern
file_list = glob.glob(file_pattern)
for input_file in file_list:
output_file = input_file.replace("_qm.xyz", "_processed.xyz")
with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
for line in f_in:
atom_data = line.strip().split()
if len(atom_data) >= 4:
atom_symbol = atom_data[0]
if atom_symbol in atomic_numbers:
atomic_number = atomic_numbers[atom_symbol]
new_line = f"{atomic_number} {' '.join(atom_data)}"
f_out.write(new_line + "\n")
else:
f_out.write(line)
else:
f_out.write(line)
```
</details>
<details open>
<summary>all_add_znum.sh</summary>
```bash=
#!/bin/bash
for i in {1..20}; do
echo "Processing directory ${i}_xyz_files"
cp add_znum_to_xyz.py "${i}_xyz_files"
sed -i "s|file_pattern = \"1_|file_pattern = \"${i}_|g" "${i}_xyz_files/add_znum_to_xyz.py"
cd "${i}_xyz_files/" || { echo "Cannot change to directory ${i}_xyz_files"; continue; }
echo "Running add_znum_to_xyz.py"
python add_znum_to_xyz.py 2>&1 || echo "Failed to run python script in ${i}_xyz_files"
cd ..
done
```
</details>
<details open>
<summary>gen_sqm_in.sh</summary>
```bash=
#!/bin/bash
for dir in {1..20}; do
cd "${dir}_xyz_files" || {
echo "Failed to change directory to $dir" >&2
continue
}
if [ ! -f sqm_input_model.txt ]; then
echo "sqm_input_model.txt not found in directory $dir" >&2
cd ..
continue
fi
for file in *_processed.xyz; do
if [ ! -f "$file" ]; then
echo "Skipping non-existent file: $file" >&2
continue
fi
filename="${file%.*}"
cat sqm_input_model.txt "$file" > "${filename}_sp_sqm.in"
sed -i 's/^/ /' "${filename}_sp_sqm.in"
done
mkdir "${dir}_sqm_inputs"
mv *_sp_sqm.in "${dir}_sqm_inputs"
mv "${dir}_sqm_inputs" ..
cd ..
done
```
</details>
Then to run the SP calculations for Orca:
<details open>
<summary>job_template_sp_orc.sh</summary>
```bash=
#!/bin/bash
#MSUB -r {extract}_{node}_orc_sp # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -T 86400 # Elapsed time limit in seconds 24h
#MSUB -o %I.out # Standard output. %I is the job id
#MSUB -e %I.err # Error output. %I is the job id
#MSUB -q rome
#MSUB -A gen11021 # Project ID
#MSUB -m scratch
ml sw mpi/openmpi/4.1.1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ccc/work/cont003/gen11021/dubouede/softwares/orca_5_0_3_linux_x86-64_shared_openmpi411/
set -x
/ccc/work/cont003/gen11021/dubouede/softwares/orca_5_0_3_linux_x86-64_shared_openmpi411/orca {extract}_{node}_qm_sp_orc.in > {extract}_{node}_qm_sp_orc.log
```
</details>
<details open>
<summary>gen_jobs.sh</summary>
```bash=
#!/bin/bash
# Path to your input files.
input_path="./"
# Name of your level of theory directory
level_of_theory="SPcalc_B3LYP_6311G"
# Name of your template job script.
template="job_template_sp_orc.sh"
# First set of loops: create directories
for extract in $(seq -f "%02g" 1 20)
do
for node in $(seq -f "%02g" 1 64)
do
# Create directories
dir_path="${level_of_theory}/${extract}_extracted/${node}_node"
mkdir -p $dir_path
done
done
# Second set of loops: copy and modify files
for extract in {1..20}
do
for node in {1..64}
do
# Copy the input files into their appropriate directories
new_extract=$(printf "%02d" $extract)
new_node=$(printf "%02d" $node)
dir_path="${level_of_theory}/${new_extract}_extracted/${new_node}_node"
cp "${input_path}/${extract}_orca_inputs/${extract}_${node}_qm_sp_orc.in" "${dir_path}/${new_extract}_${new_node}_qm_sp_orc.in"
# Copy the template job file and replace placeholders
cp $template "${dir_path}/job.sh"
sed -i "s/{extract}/$new_extract/g" "${dir_path}/job.sh"
sed -i "s/{node}/$new_node/g" "${dir_path}/job.sh"
done
done
```
</details>
And then, in SPcalc_B3LYP_6311G/, run:
<details open>
<summary>gen_jobs.sh</summary>
```bash=
#!/bin/bash
for extract in {01..20}; do
for node in {01..64}; do
cd "${extract}_extracted/${node}_node/"
ccc_msub job.sh
cd ../../
done
done
```
</details>
For sqm, the procedure is the same; only the file names change.