# Adaptive String Method: tutorial

See the [GitHub repository](https://github.com/kzinovjev/string-amber/blob/master/tutorial.md#string-file) for the original tutorial, explanations and some scripts, and the [article](https://pubs.acs.org/doi/10.1021/acs.jpca.7b10842) describing the method.

## Table of contents

[toc]

## Preparation of the system

Before doing any ASM, it is necessary to prepare and equilibrate the system, using Amber or GROMACS for example. It is best to have the reactant and product state structures for the string, or at least the transition state.

## Directory architecture

The code will need to access several files, and it is best to keep them all in the directory where the string will be running and to avoid an intricate directory structure. Here is an example of a good architecture:

```
./Project
├── 1_energy_mini
├── 2_equilibration
└── 3_ASM
    ├── 0_string
    │   ├── starting_struc/
    │   │   ├── file1.rst7
    │   │   └── file2.rst7
    │   ├── STRING
    │   ├── guess
    │   ├── in.sh
    │   ├── job.sh
    │   ├── file.top
    │   ├── model.in
    │   └── CVs
    ├── 1_string_continue
    │   ├── starting_struc/
    │   │   ├── file1.rst7
    │   │   └── file2.rst7
    │   ├── STRING
    │   ├── guess
    │   ├── continue_in.sh
    │   ├── job.sh
    │   ├── file.top
    │   ├── model.in
    │   └── CVs
    └── 2_string_continue
    .
    .
    .
```

## Required files for ASM

### CVs

Below is an example of a CVs file with 3 CVs:

```
3
$COLVAR
COLVAR_type = "BOND"
atoms = 12, 14
$END
$COLVAR
COLVAR_type = "BOND"
atoms = 14, 17
$END
$COLVAR
COLVAR_type = "BOND"
atoms = 12, 17
$END
```

The first line is the number of CVs defined in the file; each CV is then specified with the following syntax:

```
$COLVAR
COLVAR_type = "[CVs_type]"
atoms = [atom1_mask], [atom2_mask]
$END
```

With `[CVs_type]` one of:

* BOND - interatomic distance (2 atoms)
* ANGLE - bond angle (3 atoms)
* DIHEDRAL - dihedral angle (4 atoms)
* PPLANE - signed point-plane distance (4 atoms)

The `[atomX_mask]` entries are the indices of the atoms involved; they can be found using AmberTools:

>$ cpptraj
>$ parm [file].prmtop
>$ atominfo :[index_of_the_residue]

### guess

This file gives the value of each CV at each string point, point 1 being the initial state and point $N$ the final one:

```
<number of guess points>
<CV_1 in point 1> <CV_2 in point 1> ... <CV_D in point 1>
<CV_1 in point 2> <CV_2 in point 2> ... <CV_D in point 2>
...
<CV_1 in point N> <CV_2 in point N> ... <CV_D in point N>
```

The starting and ending values can be generated with [trj_to_cvs.py](https://github.com/kzinovjev/string-amber/blob/master/utils/trj_to_cvs.py) and unbiased MD trajectories of reactants and products. If there are more string nodes than guess points, the code will interpolate between the points (a small interpolation sketch is also given just before the in.sh script below).

### STRING

A file written manually, in which you specify instructions for the string. Here is the simplest example:

```
$STRINGSETUP
dir = "results/"
guess_file = "guess"
$END
```

Other options can be specified:

```
fix_ends = {.true., .false.} (default: .true.)
preparation_steps = {int} (default: 1000?)
buffer_size = {int} ??
```

### Input parameters

The model file (model$.$in) contains all the input parameters and options; it is used by in$.$sh to generate the input files. If you are using sander from AmberTools21, the parameters are documented in the [Amber21 manual](https://ambermd.org/doc12/Amber21.pdf), starting from section 21.3.

### Generate inputs

in$.$sh is the shell script that generates the input file for each node, as well as the groupfile used to run the jobs of all nodes together.
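Going back to the guess file for a moment: if you only have the endpoint CV values (e.g. averages from the reactant and product trajectories), the intermediate guess points can be generated by simple linear interpolation. Below is a minimal sketch; the file names `reactant.dat` and `product.dat` and the number of points are placeholders, not files produced elsewhere in this tutorial.

```python=
#!/usr/bin/env python3
# Sketch: build a guess file by linear interpolation between the endpoint CV values.
# reactant.dat and product.dat are placeholder names: each is expected to contain
# one line with the D CV values, in the same order as in the CVs file.
import numpy as np

n_points = 10                                       # number of guess points to write
react = np.atleast_1d(np.loadtxt('reactant.dat'))   # CV values in the reactant state
prod = np.atleast_1d(np.loadtxt('product.dat'))     # CV values in the product state

with open('guess', 'w') as f:
    f.write('{}\n'.format(n_points))
    for lam in np.linspace(0.0, 1.0, n_points):
        point = (1.0 - lam) * react + lam * prod
        f.write(' '.join('{:.4f}'.format(x) for x in point) + '\n')
```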
<details open> <summary>in.sh</summary> ```bash= #!/bin/bash # If groupfile exists, remove it if [ -f string.groupfile ]; then rm string.groupfile fi PARM=starting_structures/dinuc.prmtop # path to topology file REACT=starting_structures/qmmmeq.rst7 # path to reactants structure PROD=starting_structures/qmmmeq.rst7 # path to products structure # Number of string nodes is provided as command line argument nodes=$1 for i in `seq 1 $nodes`; do # generate sander input for node $i sed "s/XXXXX/$RANDOM/g" model.in > $i.in # use reactants structure for the first half of the nodes and products for the rest if [ $i -le $(($nodes/2)) ]; then crd=$REACT else crd=$PROD fi # write out the sander arguments for node $i to the groupfile echo "-O -rem 0 -i $i.in -o $i.out -c $crd -r $i.rst7 -x $i.nc -inf $i.mdinfo -p $PARM" >> string.groupfile done echo "N REPLICAS = $i" ``` </details> Then to run it: > chmod u+x in$.$sh > ./in.sh [number_of_string_nodes] ### Job script It is the script to submit a job on baal (LBT-IBPC). The top and user part need to be modified according to the job you want to run. <details open> <summary>job.sh</summary> ```bash= #!/bin/bash #PBS -S /bin/bash #PBS -N string_QMMM #PBS -o $PBS_JOBID.out #PBS -e $PBS_JOBID.err #PBS -q gpu_40c_1n #PBS -l nodes=1:ppn=40 #PBS -l walltime=12:00:00 #PBS -A simlab_project #PBS -l epilogue=/shared/scripts/ADMIN__epilogue-qsub.example ### Above should be specified by the user, simlab is everyone else except stirnemann team ### ### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ### WORKDIR="/" NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l) if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then echo "At least one variable is needed but not defined. Please touch your manager about." exit 1 else if [ $NUM_NODES -le 1 ]; then WORKDIR+="scratch/" export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/" mkdir $WORKDIR rsync -ap $PBS_O_WORKDIR/ $WORKDIR/ /shared/scripts/ADMIN__auto-rsync.example 200 & # sync frequency (secondes) else export WORKDIR=$PBS_O_WORKDIR fi fi echo "your current dir is: $PBS_O_WORKDIR" echo "your workdir is: $WORKDIR" echo "number of nodes: $NUM_NODES" echo "number of cores: "$(cat $PBS_NODEFILE|wc -l) echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done) cd $WORKDIR # If you're using only one node, it's counterproductive to use IB network for your MPI process communications #if [ $NUM_NODES -eq 1 ]; then # export PSM_DEVICES=self,shm # export OMPI_MCA_mtl=^psm # export OMPI_MCA_btl=shm,self #else # Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts # we have to share PSM contexts between processes # CIN is here the number of cores in node # CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l) # if [ $(($CIN/16)) -ge 2 ]; then # PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l) # if [ $CIN -eq 40 ]; then # export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4)) # elif [ $CIN -eq 32 ]; then # export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2)) # else # echo "This computing node is not supported by this script" # fi # echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX" # else # echo "no PSM_SHAREDCONTEXTS_MAX to define" # fi #fi function get_gpu-ids() { if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then echo "0,1" && return fi if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then FILE="/dev/cpuset/torque/$PBS_JOBID/cpus" elif [ -e 
/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus" else FILE="" fi if [ -e $FILE ]; then if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then echo "0" && return else echo "1" && return fi else echo "0,1" && return fi } gpus=$(get_gpu-ids) ## END-DO ## ## USER part (i should do my stuff here) ## #module load torque/gnu/6.1.2 module load Ambertools-ASM/gnu/21 module load gcc/8.3.0 module load openmpi/gnu/3.1.4 #Run cd $WORKDIR/ rm -rf results mkdir results export AMBERHOME=/shared/software/Ambertools-ASM/21/gnu source $AMBERHOME/amber.sh mpirun sander.MPI -O -i qmmm.in -o qmmm.out -p dinuc.prmtop -c dinuc.rst7 -x traj.nc # command i should specify cd $WORKDIR date ## DO NOT MODIFY THE PART OF SCRIPT: you will be accountable for any damage you cause # At the term of your job, you need to get back all produced data synchronizing workdir folder with you starting job folder and delete the temporary one (workdir) if [ $NUM_NODES -le 1 ]; then cd $PBS_O_WORKDIR rsync -ap $WORKDIR/ $PBS_O_WORKDIR/ rm -rf $WORKDIR fi ## END-DO ``` </details> Then, to run it on baal: > qsub job$.$sh ## Trouble shooting If something went wrong, please be sure that: - You have all the required files **in your working directory** - [ ] CVs - [ ] STRING - [ ] guess (original or from last [num].string ) - [ ] in$.$sh - [ ] model$.$in - [ ] job$.$sh - [ ] starting structures - [ ] toplogy file - [ ] string$.$groupfile (generated by in$.$sh) - [ ] (restraint file) - You did ask for a simulation time shorter than the limite time set on the cluster - You checked the groupfile and scripts (correct paths, number of cores, etc.) - You took and indicated the correct starting structures - Your system is centered in the box (Amber manages PBC, but ASM doesn't) ## After the first run Once the job is finish, run the following python script to check for convergence: <details open> <summary>checkstringV2.py</summary> ```python= import re import os import numpy as np import matplotlib.pyplot as plt import seaborn as sns import pandas as pd def get_nnodes(): """ Assumes that for each node there is an 'in' file (1.in, 2.in etc) and there are no other 'in' files with numeric names. """ regex = re.compile(r"([0-9]*)\."+"in") paths = filter(lambda x: regex.match(x) and os.path.isfile(x), os.listdir('.')) return len(list(paths)) def parse_string(): preparation_steps = None results_dir = '.' only_pmf = False CV_file = 'CVs' frequency_PMF = int(2/0.001) with open('STRING') as f: for raw_line in f: line = raw_line.strip() if line.startswith('dir'): results_dir = line.split()[-1].strip('"') if line.startswith('preparation_steps'): preparation_steps = int(line.split()[-1]) if line.startswith('only_PMF'): only_pmf = line.split()[-1].lower() == '.true.' 
if line.startswith('buffer_size'): frequency_PMF = int(line.split()[-1]) if line.startswith('CV_file'): CV_file = line.split()[-1].strip('"') return preparation_steps, results_dir, only_pmf, frequency_PMF, CV_file def state_string(): last_string_step = int(np.loadtxt(os.path.join(results_dir, 'convergence.dat'))[-1][0]) start_data_analysis = frequency_PMF end_data_analysis = last_string_step - last_string_step%frequency_PMF if end_data_analysis >= 6 * int(frequency_PMF): start_data_analysis = end_data_analysis - 4 * frequency_PMF return start_data_analysis, end_data_analysis def CVs_lists(): colvars = [] BOND = [] ANGLES = [] with open(CV_file) as f: counter = 0 for raw_line in f: line = raw_line.strip() if 'COLVAR_type' in line: CV = line.split()[-1].strip('"') counter += 1 colvars.append(counter) if CV == 'BOND' or CV == 'PPLANE': BOND.append(counter) if CV == 'ANGLE' or CV == 'DIHEDRAL': ANGLES.append(counter) return BOND, ANGLES, colvars def time_step(): with open('1.in') as f: for raw_line in f: line = raw_line.strip() if 'dt' in line: time_step = (line.split()[-1].strip(",")) else: time_step = 0.001 return time_step def nodes_vlimit_exceeded(): """ If vlimit is exceeded theres is something wrong with the structures So this function checks if exceeded in some string node """ print ('Checking rst integrity') for node in range (1, get_nnodes() + 1): with open(str(node)+'.out') as f: if 'vlimit exceeded' in f.read(): print ('check node', node) print ('Not mentioned structures are OK') def canvas_grid_definition(): if only_pmf == True: x, y = 1, 1 elif len(BOND) != 0 and len(ANGLES) !=0: x, y = 4, 1 else: x, y = 3, 2 grid = plt.GridSpec( x, y, wspace=0.1, hspace=0.4) return grid def CVs_plot(CV): if CV == BOND: plt.title('Distances along the string') for string_value in range(start_data_analysis,end_data_analysis+1,frequency_PMF): strings_data = pd.read_csv(f'results/{string_value}.string',delim_whitespace=True, names=colvars) sns.lineplot(data=strings_data[CV], linewidth=0.7,legend=False, dashes=False) guess_data = pd.read_csv('results/0.string', delim_whitespace=True, names=colvars) sns.lineplot(data=guess_data[CV], dashes=False) plt.xlabel('node') plt.ylabel('$\AA$') else: plt.title('Angles along the string') for string_value in range(start_data_analysis, end_data_analysis + 1, frequency_PMF): strings_data = pd.read_csv(f'results/{string_value}.string', delim_whitespace=True, names=colvars) sns.lineplot(data=strings_data[CV], linewidth=0.7, legend=False, dashes=False) guess_data = pd.read_csv('results/0.string', delim_whitespace=True, names=colvars) sns.lineplot(data=guess_data[CV], dashes=False) plt.xlabel('node') plt.ylabel('Degrees') return nodes_vlimit_exceeded() preparation_steps, results_dir, only_pmf, frequency_PMF, CV_file = parse_string() header_names = ['s','N_bin','Pop_bin','Force', 'Force_CI (95%)','PMF','PMF_CI (95%)'] if only_pmf == '.true.': last_step_PMF = len(open(f'{results_dir}/1_final.dat').readlines()) final_PMF = last_step_PMF - last_step_PMF % frequency_PMF final_PMF_data = pd.read_csv(f'{results_dir}/{final_PMF}_final.PMF', delim_whitespace=True, skiprows=(0, 1), header=(0), names=header_names) plt.title(f'{final_PMF}*{time_step()} ps of PMF') sns.lineplot(data=final_PMF_data, x="s", y="PMF") plt.fill_between(final_PMF_data['s'], final_PMF_data['PMF'] - final_PMF_data['PMF_CI (95%)'], final_PMF_data['PMF'] + final_PMF_data['PMF_CI (95%)'], alpha=0.2) preparation_steps, results_dir, only_pmf, frequency_PMF, CV_file = parse_string() start_data_analysis, 
end_data_analysis = state_string() BOND, ANGLES, colvars = CVs_lists() grid = canvas_grid_definition() print('start {0} ends {1} freq {2}'.format(state_string()[0],state_string()[1], frequency_PMF)) plt.subplots(constrained_layout=True) #plt.subplot(x,y,1) plt.subplot(grid[0, 0]) plt.title('Convergence') convergence_data = pd.read_csv('results/convergence.dat', delim_whitespace=True, names=['time','convergence']) convergence_data['time'] = convergence_data['time']*time_step() sns.lineplot(x='time',y='convergence',data=convergence_data) plt.xlabel('time (ps)') plt.ylabel('s ($amu^{1/2}$$\AA$)') #plt.subplot(x, y, 2) plt.subplot(grid[1, 0]) if os.path.isfile('results/STOP_STRING') and os.path.isfile(f'results/{frequency_PMF}_final.PMF'): last_step_PMF = len(open('results/1_final.dat').readlines()) final_PMF = last_step_PMF - last_step_PMF % frequency_PMF plt.title('FINAL PMF') final_PMF_data = pd.read_csv(f'results/{final_PMF}_final.PMF', delim_whitespace=True, skiprows=(0, 1), header=(0), names=header_names) sns.lineplot(data=final_PMF_data, x="s", y="PMF") plt.fill_between(final_PMF_data['s'], final_PMF_data['PMF'] - final_PMF_data['PMF_CI (95%)'], final_PMF_data['PMF'] + final_PMF_data['PMF_CI (95%)'], alpha=0.2) else: plt.title('PMFs along the string') for PMF in range(start_data_analysis,end_data_analysis+1,frequency_PMF): PMF_data = pd.read_csv(f'results/{PMF}.PMF', delim_whitespace=True, skiprows=(0,1),header=(0), names=header_names) sns.lineplot(x='s',y='PMF' ,data=PMF_data,linewidth=0.7, label=PMF) if len(BOND) != 0 and len(ANGLES) !=0: #plt.subplot(x,y,3) plt.subplot(grid[2, 0]) CVs_plot(BOND) #plt.subplot(x,y,4) plt.subplot(grid[3, 0]) CVs_plot(ANGLES) else: REX_list = ['step'] #plt.subplot(x, y, 4) plt.subplot(grid[:, 1]) plt.title('replica exchange along the string') for node in range(1, get_nnodes() + 1): REX_list.append(node) replica_exchange_data = pd.read_csv('results/plot.REX', delim_whitespace=True, names=REX_list) for node in range(1, get_nnodes() + 1): sns.lineplot(data=replica_exchange_data[node], linewidth=0.7, dashes=False) plt.xlabel('time') plt.ylabel('String Node') if len(BOND) != 0: plt.subplot(grid[2, 0]) #plt.subplot(x, y, 3) CVs_plot(BOND) else: #plt.subplot(x, y, 3) plt.subplot(grid[2, 0]) CVs_plot(ANGLES) mng = plt.get_current_fig_manager() mng.resize(1800,1000) plt.show() ``` </details> If you are executing this on baal please be sure that your loading the module to use python3. To see which module you should load: >$ module avail And that you activate the proper conda environment (should be 'sci_tools') to acces the required packages. :::warning :warning: **Job submition** Although this script is quite fast to run and does not take a lot of ressources, please avoid running it on the head node by doing: >$ ./checkstringV2.py As it is a bad practice to run job without using the job manager. To properly run your job **on baal**, please use: >$ qsub -I -l nodes=1:ppn=40 -A simlab_project This will start an interactive session which will log you inside a computing node. CTRL+D to quit the computing node and go back to the head node. ::: After running this script, you should see a pop-up window showing plots like so, *assuming you specified -X when loging via ssh*: ![](https://i.imgur.com/8ofFQhy.png) The most important one is the Convergence, it tells you how close your string is to converging. 
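Besides eyeballing the convergence plot, you can make a quick numerical check for the plateau discussed just below. This is only a rough sketch: the 20% window and the 0.05 relative-drift threshold are arbitrary choices, not part of the method.

```python=
#!/usr/bin/env python3
# Sketch: crude numerical check for a plateau in results/convergence.dat.
# The 20% window and the 0.05 relative-drift threshold are arbitrary choices.
import numpy as np

data = np.loadtxt('results/convergence.dat')   # columns: step, convergence measure
conv = data[:, 1]

window = max(1, len(conv) // 5)                # look at the last ~20% of the run
recent = conv[-window:]
drift = (recent.max() - recent.min()) / abs(recent.mean())

print('relative drift over the last {} points: {:.3f}'.format(window, drift))
print('looks like a plateau' if drift < 0.05 else 'still drifting, keep sampling')
```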
To spot when the string should be stopped, you should see a plateau like this one from the [GitHub](https://github.com/kzinovjev/string-amber/blob/master/tutorial.md#string-file) tutorial:

![](https://i.imgur.com/pjLth2i.png)

## Continuing the string

Normally, the [GitHub](https://github.com/kzinovjev/string-amber/blob/master/tutorial.md#string-file) tutorial says that "*the job should always be killed manually by the user, after checking that no more sampling is needed. So, it is better to put some unreasonably large number for the total number of steps (`nstlim`).*" But on clusters, it is common to have a walltime limit (on baal it depends on the computing nodes; see the details [here](http://www-lbt.ibpc.fr/wiki/doku.php?id=cluster-lbt:limits)). Therefore, you might need to continue the string until you obtain convergence, and only then stop it with the correct procedure.

First, create a directory named `sorted` in the directory where your string has finished running. Copy the following script into `sorted/` to copy and reorder the structures:

<details open>
<summary>order.py</summary>

```python=
#!/usr/bin/python
import os

with open('../results/plot.REX', 'r') as f:
    idx = f.readlines()[-1].split()[1:]

for i in range(len(idx)):
    os.system('cp ../' + str(i+1) + '.rst7 ' + idx[i] + '.rst7')
```
</details>

Then, create a new directory following the suggested architecture, and move the `sorted/` directory into this new directory, renaming it `starting_struct/`:

>$ mv ./0_string/sorted/ ./1_string_continue/starting_struct/

After that, go into `0_string/results/` and execute `ls -1v *.string` to find the latest `[file].string`. This file will be the new `guess` file, so once you have identified it, copy it into the new directory and rename it:

>$ cp ./0_string/results/[file].string ./1_string_continue/guess

Then add a new line at the top of this file with the number of string nodes. Next, copy over the required files:

- [ ] CVs
- [ ] STRING
- [ ] model$.$in
- [ ] job$.$sh
- [ ] topology file

Finally, the in$.$sh script should point to the starting structures obtained from the previous run. Here is the script:

<details open>
<summary>continue_in.sh</summary>

```bash=
#!/bin/bash
# script modified by LK 28/02/23

# If groupfile exists, remove it
if [ -f string.groupfile ]; then
  rm string.groupfile
fi

PARM=dinuc.prmtop        # path to topology file
REACT=./starting_struct/ # path to starting structures

# Number of string nodes is provided as command line argument
nodes=$1

for i in `seq 1 $nodes`; do
  # generate sander input for node $i
  sed "s/XXXXX/$RANDOM/g" model.in > $i.in
  # write out the sander arguments for node $i to the groupfile
  echo "-O -rem 0 -i $i.in -o $i.out -c $REACT$i.rst7 -r $i.rst7 -x $i.nc -inf $i.mdinfo -p $PARM" >> string.groupfile
done
echo "N REPLICAS = $i"
```
</details>

And run it:

>$ ./continue_in.sh [number of nodes]
>$ qsub job$.$sh

## Visualise your trajectory

To get the last frame of each string node, execute:

<details open>
<summary>order.py</summary>

```python=
#!/usr/bin/python
import os

with open('../results/plot.REX', 'r') as f:
    idx = f.readlines()[-1].split()[1:]

for i in range(len(idx)):
    os.system('cp ../' + str(i+1) + '.rst7 ' + idx[i] + '.rst7')
```
</details>

:::danger
:warning: If you are using this script after stopping the optimization, you will need to change the file name 'plot.REX' to 'plot_final.REX' in the script. Otherwise, the last frames will not be properly reordered.
:::
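To avoid editing the script by hand when switching between 'plot.REX' and 'plot_final.REX', a small variant of order.py can take the REX file as a command-line argument. This is a sketch based on the script above (the name `order_arg.py` is arbitrary), not part of the original tutorial:

```python=
#!/usr/bin/env python3
# Sketch: same logic as order.py above, but the REX file is given as an argument,
# so the same script works for plot.REX and plot_final.REX.
# Usage: python3 order_arg.py ../results/plot_final.REX
import shutil
import sys

rex_file = sys.argv[1] if len(sys.argv) > 1 else '../results/plot.REX'

with open(rex_file) as f:
    # last line of the REX file: step number followed by the node index of each replica
    idx = f.readlines()[-1].split()[1:]

for i, node in enumerate(idx, start=1):
    shutil.copy('../{}.rst7'.format(i), '{}.rst7'.format(node))
```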
After executing this script, you will have as many .rst7 files (restart files) as string nodes. This helps to check whether the structures are overall OK.

To load them in VMD, start by loading the .prmtop file as a new molecule. Then, type the following command in the Tk Console of VMD (example with 40 string nodes):

> for {set i 1} {$i < 41} {incr i} {mol addfile ${i}.rst7 type rst7 waitfor -1}

:::info
:information_source: Note that VMD counts frames from 0, so the frame numbering is shifted with respect to the node numbering.
:::

Or you can use this Tcl script that automates everything for you:

<details open>
<summary>vizualise_LF_string.tcl</summary>

```tcl=
#!/usr/bin/tclsh
# This script is meant to load the last frame of the string and the topology all at once, LK 21/04/23
# Execute it in the terminal: vmd -e vizualise_LF_string.tcl path/to/topfile.top and follow the instructions in the terminal

# ask for the number of frames to load
puts "How many frames should be loaded?"
# get input and add +1 for the loop later
gets stdin frame
set nframe [expr $frame+1]
# loop to load all the frames
for {set i 1} {$i<$nframe} {incr i} {mol addfile ${i}.rst7 type rst7 waitfor -1}
# create 1st rep
mol delrep 0 top
mol representation CPK 1.700000 0.300000 12.000000 12.000000
mol color Name
# change the selection accordingly
mol selection {not water}
mol material Opaque
mol addrep top
# create 2nd rep
mol representation CPK 0.600000 0.300000 12.000000 12.000000
mol color Name
# change the selection accordingly
mol selection {same residue as water and within 3 of not water}
mol material Opaque
mol addrep top
mol selupdate 1 top 1
mol colupdate 1 top 0
```
</details>

To execute it, simply type in the terminal:

>$ vmd -e vizualise_LF_string.tcl path/to/topology_file.top

And indicate the number of frames to load when prompted in the terminal.
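If you prefer not to count frames or type the loop by hand, a small helper can also generate a VMD loading script from the topology name and the number of nodes. This is a sketch; `make_vmd_loader.py` and `load_string.tcl` are hypothetical names:

```python=
#!/usr/bin/env python3
# Sketch: write a VMD loading script for the topology and the per-node rst7 files.
# Usage: python3 make_vmd_loader.py path/to/topology_file.prmtop 40 > load_string.tcl
#        vmd -e load_string.tcl
import sys

top = sys.argv[1]
n_nodes = int(sys.argv[2])

print('mol new {} type parm7'.format(top))
for i in range(1, n_nodes + 1):
    print('mol addfile {}.rst7 type rst7 waitfor -1'.format(i))
```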
To get the reordered trajectories of each string node, use the following script: <details open> <summary>reorder_try.py</summary> ```python3= #!/usr/bin/env python ​ import argparse import os import shutil import re from operator import itemgetter from itertools import groupby ​ TRAJECTORY_EXTENSION = 'nc' SANDER_INPUT_EXTENSION = 'in' OUTPUT_DIR = 'reorder_trj_results' TMP_DIR = '.reorder_trj_tmp' CPPTRAJ_IN_NAME = '.reorder_cpptraj_tmp.in' ​ ​ def new_fragment(trj, node, start, stop): return {'trj': trj, 'node': node, 'start': start, 'stop': stop} ​ ​ def step_to_frame(step, steps_per_frame, offset): return (step + offset) // steps_per_frame ​ ​ def data_to_fragments(data, steps_per_frame, offset): ntrj = len(data[0][1:]) fragments = [] ​ first_frame = step_to_frame(0, steps_per_frame, offset) + 1 ​ for trj in range(1, ntrj + 1): trj_fragments = [] last_frame = -1 for row in data: ​ frame = step_to_frame(row[0], steps_per_frame, offset) if frame == last_frame or frame < first_frame: continue ​ node = row[trj] if len(trj_fragments) == 0: trj_fragments.append(new_fragment(trj, node, first_frame, frame)) elif trj_fragments[-1]['node'] == node: trj_fragments[-1]['stop'] = frame else: trj_fragments.append(new_fragment(trj, node, frame, frame)) last_frame = frame ​ fragments.append(trj_fragments) ​ return fragments ​ ​ def group_fragments_by_node(fragments): flat_fragments = [_ for trj_fragments in fragments for _ in trj_fragments] sorted_fragments = sorted(flat_fragments, key=itemgetter('node', 'start')) return groupby(sorted_fragments, key=itemgetter('node')) ​ ​ def assign_fragments_filenames(fragments, path): for node, node_fragments in group_fragments_by_node(fragments): for index, fragment in enumerate(node_fragments): fragment['filename'] = os.path.join(path, '{}_{}.nc'.format(node, index)) ​ ​ def break_trj(in_filename, parm, trj_filename, fragments): with open(in_filename, 'w') as f: f.write('parm {}\n'.format(parm)) f.write('trajin {}\n'.format(trj_filename)) for fragment in fragments: f.write('trajout {filename} start {start} stop {stop}\n' .format(**fragment)) os.system('/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools21-ASM/bin/cpptraj -i {} >> break.tmp'.format(in_filename)) ​ ​ def combine_node_trj(in_filename, parm, node_trj_filename, node_fragments): with open(in_filename, 'w') as f: f.write('parm {}\n'.format(parm)) for fragment in node_fragments: f.write('trajin {filename}\n'.format(**fragment)) f.write('trajout {}\n'.format(node_trj_filename)) os.system('/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools21-ASM/bin/cpptraj -i {} >> combine.tmp'.format(in_filename)) ​ ​ def mkdir(name): try: os.mkdir(name) except OSError: print('{} exists and will be removed'.format(name)) shutil.rmtree(name) os.mkdir(name) ​ ​ def move_dir_from_tmp(name): tmp_path = os.path.join(TMP_DIR, name) output_path = os.path.join(OUTPUT_DIR, name) os.system('mv {} {}'.format(tmp_path, output_path)) ​ ​ def reorder_trajectories(topology, filenames, trj_period, offset, data, subdir, output_format): ​ fragments = data_to_fragments(data, trj_period, offset) assign_fragments_filenames(fragments, TMP_DIR) ​ for i, trj_fragments in enumerate(fragments): break_trj(CPPTRAJ_IN_NAME, topology, filenames[i], trj_fragments) print('Splitting {} of {} done'.format(i+1, len(filenames))) ​ path = os.path.join(TMP_DIR, subdir) mkdir(path) ​ for node, node_fragments in group_fragments_by_node(fragments): node_trj_path = os.path.join(path, str(node) + '.' 
+ output_format) combine_node_trj(CPPTRAJ_IN_NAME, topology, node_trj_path, node_fragments) print('Combining {} of {} done'.format(node, len(filenames))) move_dir_from_tmp(subdir) if subdir == 'string': os.system("rm {}/*".format(TMP_DIR)) ​ ​ def read_data(filename): with open(filename) as f: return [[int(x) for x in line.strip().split()] for line in f] ​ ​ def get_nnodes(): """ Assumes that for each node there is an 'in' file (1.in, 2.in etc) and there are no other 'in' files with numeric names. """ regex = re.compile(r"([0-9]*)\."+SANDER_INPUT_EXTENSION) paths = filter(lambda x: regex.match(x) and os.path.isfile(x), os.listdir('.')) return len(list(paths)) ​ ​ def trajectory_filenames(nnodes): return ['{}.{}'.format(i, TRAJECTORY_EXTENSION) for i in range(1, nnodes+1)] ​ ​ def assert_file_exist(filename): if not os.path.exists(filename): print('{} not found. Exiting.'.format(filename)) exit() ​ ​ def parse_string(): preparation_steps = None results_dir = '.' only_pmf = False assert_file_exist('STRING') with open('STRING') as f: for raw_line in f: line = raw_line.strip() if line.startswith('dir'): results_dir = line.split()[-1].strip('"') if line.startswith('preparation_steps'): preparation_steps = int(line.split()[-1]) if line.startswith('only_PMF'): only_pmf = line.split()[-1].lower() == '.true.' return preparation_steps, results_dir, only_pmf ​ ​ def parse_in(in_file): ntwx = None dt = None assert_file_exist(in_file) with open(in_file) as f: ntwx_regex = re.compile(r".*=(\D*)([0-9]*)") dt_regex = re.compile(r".*=(\D*)([0-9.]*)") for line in f: if line.strip().startswith('ntwx'): ntwx = int(ntwx_regex.match(line.strip()).groups()[1]) elif line.strip().startswith('dt'): dt = float(dt_regex.match(line.strip()).groups()[1]) return ntwx, dt ​ ​ def parse_stop_string(stop_string): try: with open(stop_string) as f: return int(f.readlines()[-1].strip()) except IOError: return None ​ ​ def main(): ​ parser = argparse.ArgumentParser( description="" "Reorders trajectories generated by a string method calculation.\n" "Must be executed from the working directory (the one where the STRING \n" "file is located)", formatter_class=argparse.ArgumentDefaultsHelpFormatter ) ​ parser.add_argument("prmtop", help="Amber topology file") parser.add_argument("--format", help="Format of the output trajectories", default='nc') parser.add_argument("--nodes", type=int, help="Number of nodes (use if automatic detection fails)") args = parser.parse_args() topology = args.prmtop ​ # Parse STRING file to extract preparation_steps (default 2000), # results directory and only_PMF preparation_steps, results_dir, only_pmf = parse_string() ​ # Parse first input file to know how often the frames were saved to the # trajectories (ntwx) and get the timestep (dt) to calculate # preparation_steps in_file = '1.{}'.format(SANDER_INPUT_EXTENSION) ntwx, dt = parse_in(in_file) if not ntwx: print('ntwx not set in {}, so no trajectories were written. Exiting.' .format(in_file)) exit() if not preparation_steps: preparation_steps = int(1. / dt) ​ # Parse convergence.dat (if found) to know after how many steps # the job switched to pmf try: with open(os.path.join(results_dir, 'convergence.dat')) as f: last_string_step = int(f.readlines()[-1].split()[0]) except IOError: last_string_step = None ​ has_string_data = os.path.exists(os.path.join(results_dir, 'plot.REX')) has_pmf_data = os.path.exists(os.path.join(results_dir, 'plot_final.REX')) if not (has_string_data or has_pmf_data): print('plot.REX or plot_final.REX not found in {}. 
Exiting.' .format(results_dir)) exit() ​ string_offset = preparation_steps pmf_offset = string_offset + last_string_step or 0 if only_pmf: has_string = False has_pmf = True elif last_string_step: has_string = True has_pmf = True else: has_string = True has_pmf = False ​ nnodes = args.nodes or get_nnodes() print("{} nodes".format(nnodes)) ​ trajectory_names = trajectory_filenames(nnodes) ​ mkdir(TMP_DIR) mkdir(OUTPUT_DIR) ​ if has_string: string_data = read_data(os.path.join(results_dir, 'plot.REX')) ​ print("Extracting string trajectories...") reorder_trajectories(topology, trajectory_names, ntwx, string_offset, string_data, 'string', args.format) ​ if has_pmf: pmf_data = read_data(os.path.join(results_dir, 'plot_final.REX')) ​ print("Extracting PMF trajectories...") reorder_trajectories(topology, trajectory_names, ntwx, pmf_offset, pmf_data, 'pmf', args.format) ​ shutil.rmtree(TMP_DIR) ​ ​ if __name__ == '__main__': main() ``` </details> And execute it with the following command: >$ ./reorder_try.py file.prmtop From there, you will have every string node trajectories reconstructed after the replica exchange, so not continuous trajectories. ## Get the PMF Once the string has converge, you will need to manually stop the code by creating a file named STOP_STRING in which you will indicate 2 values: the first one corresponds to the step number at which you estimate the string has converged, and the second one is the step number at which the code should stop optimising the string position and switch to Umbrella Sampling. ###To do so, start a continuation as explain above and let the optimization run for a few picosecond. Then, once the code has run a bit and the results/ folder is created, create the STROP_STRING file inside the results/ folder. There should be at least 2ps between when the string is stopped and the beginning of the US. The US should last ~100ps.### Then, create a WHAM directory and copy in there all the /result/*final.dat files. Copy these two following scripts as well: <details open> <summary>convert_dat.py</summary> ```python3= #!/usr/bin/python import sys with open(sys.argv[1], 'r') as f: data = f.readlines() print(data[0][:31]) for line in data[1:]: print(line[:16]) ``` </details> <details open> <summary>wham.f90</summary> ```fortran= program wham real*8, parameter :: T = 300. !Temperature ! real*8, parameter :: R = 0.0083144621 !Gas constant real*8, parameter :: R = 0.0083144621 / 4.184 !Gas constant real*8, parameter :: bin_width = 0.1 !Width of the bins (in the same units as the coordinate) real*8, parameter :: threshold = 0.000001 !Threshold for convergence to be achieved (kcal/mol) integer, parameter :: niter = 100000000 !Maximum number of iterations integer, parameter :: minidx = 0 !Lowest index of input file (for example, dat.0) integer, parameter :: maxidx = 1000 !Highest index of input file (for example, dat.1000) character(len=*), parameter :: filename = ".dat" ! Postfix for the input filenames integer :: nsims, nbins ! Number of simulations, number of bins real*8, dimension(:), allocatable :: Kf ! force constant in each simulation real*8, dimension(:), allocatable :: f, ftmp ! Normalizing constant for each simulation real*8, dimension(:), allocatable :: x0 ! reference value for each simulation real*8, dimension(:), allocatable :: x ! Coordinate values in each bin real*8, dimension(:), allocatable :: P ! probabilities in each bin real*8, dimension(:), allocatable :: PMF ! PMF in each bin integer, dimension(:), allocatable :: Ns ! 
Number of points in each simulation integer, dimension(:), allocatable :: Nb ! Total number of points in each bin real*8, dimension(:,:), allocatable :: c ! biasing factors integer :: i, j, k, bin real*8 :: tmp, RT, thr, min_x, max_x, maxPMF, minPMF logical, dimension(minidx:maxidx) :: exist character*40 :: si !---Figuring out which input files exist and counting them--- do i = minidx, maxidx write(si,*) i si = adjustl( si ) inquire( file = trim( si )//filename, exist = exist(i) ) end do nsims = count( exist ) !------------------------------------------------------------ allocate( Kf(nsims), f(nsims), ftmp(nsims), x0(nsims), Ns(nsims) ) !---Determining force constants, reference values, largest and--- !----smallest x values and number of data points in each file---- max_x = -huge( 0.0_8 ) min_x = huge( 0.0_8 ) j = 0 do i = minidx, maxidx if ( exist(i) ) then j = j + 1 k = 0 write(si,*) i si = adjustl( si ) open( unit = 10, file = trim( si )//filename, status = "old" ) read(10,*) Kf(j), x0(j) do read(10,*, end=10) tmp min_x = min( tmp, min_x ) max_x = max( tmp, max_x ) k = k + 1 end do 10 continue Ns(j) = k end if end do !--------------------------------------------------------------- nbins = int( (max_x - min_x)/bin_width ) + 1 allocate( x(nbins), P(nbins), PMF(nbins), Nb(nbins) ) !---Finally, obtaining number of points in each bin--- Nb = 0 j = 0 do i = minidx, maxidx if ( exist(i) ) then j = j + 1 write(si,*) i si = adjustl( si ) open( unit = 10, file = trim( si )//filename, status = "old" ) read(10,*) tmp, tmp do k = 1, Ns(j) read(10,*) tmp bin = int( (tmp - min_x)/bin_width ) + 1 Nb(bin) = Nb(bin) + 1 end do end if end do !----------------------------------------------------- !---Calculating x values in the center of each bin--- do i = 1, nbins x(i) = min_x+( real(i, kind = 8 ) - 0.5)*bin_width end do !---------------------------------------------------- !---Calculating the biasing factors--- allocate( c(nsims,nbins) ) RT = R*T do i = 1, nsims c(i,:) = exp(-0.5_8*Kf(i)*(x-x0(i))**2/RT) ! c(i,j)=exp(-Vi(xj)/RT) end do !------------------------------------- f = 1. thr = threshold/RT ! converting threshold !---WHAM!--- do i = 1, niter ftmp = f do j = 1, nbins P(j) = real( Nb(j), kind = 8 )/sum( real( Ns, kind = 8 )*f*c(:,j) ) end do f = 1._8/matmul( c, P ) if ( maxval( abs( log(f/ftmp) ) ) < thr ) then write(*,"(A,i6,A)") "#converged in ",i," iterations" exit end if if (i == niter) write(*,"(A)") "#not converged!" 
end do !----------- !---Calculating the PMF--- P = P / sum( P ) do i = 1, nbins if( P(i) > 0 ) then PMF(i) = -R*T*log(P(i)) else PMF(i) = huge( 0.0_8 ) end if end do PMF = PMF !------------------------- !---Determining the barrier height--- maxPMF = maxval( PMF(10:nbins-10) ) !First and last 10 bins are ignored i = 10 do while ( PMF(i) < maxPMF ) i = i + 1 end do minPMF = PMF(i) do j = 10, i-1 minPMF = min( minPMF, PMF(j) ) end do write(*,"(A17,F23.10)") "#Barrier height: ", maxPMF-minPMF !------------------------------------ write(*,"(A)") "#PMF:" PMF = PMF - minPMF do i = 1, nbins write(*,"(I20,2F20.10)") Nb(i), x(i), PMF(i) end do close( 10 ) deallocate( Kf, f, ftmp, x0, Ns, x, P, PMF, Nb, c ) end program ``` </details> And finally, execute the following command in the terminal, here it is for 64 string nodes: > $for i in \`seq 1 64\`;do python convert_dat.py ${i}_final.dat > $i.dat; done; gfortran -o wham.exe wham.f90; ./wham.exe > PMF You then have a file named PMF with the second column corresponding to the energy in (kcal/mol) and the third column corresponds to the proggression of the reaction. Or simply : <details open> <summary>make_wham.sh</summary> ```bash= #!/bin/bash # This script is meant to automatise the list of command to make the PMF for the ASM, LK 05/05/23 # Execute it in the /WHAM directory cp ../results/*final.dat . cp ~/Documents/scripts/001_python_script/convert_dat.py . cp ~/Documents/scripts/002_fortran_script/wham.f90 . echo "How many string nodes are there?" read n_nodes for i in `seq 1 $n_nodes`;do python convert_dat.py ${i}_final.dat > $i.dat; done; gfortran -o wham.exe wham.f90; ./wham.exe > PMF ``` </details> ## Automatisation To run an optimsation and then an optimisation + US in one shot, use these scripts in their correct directories. You will need to change in the optimisation job and the one that mkdir and cp the directories names and arguments. ``` ./Project ├ job_cpmkdir.sh ├ cp_mkdir_continu_string.sh ├── 000_string_opti │ └── job1.sh ├── 001_string_opti_et_US │ └── job_ASM_24h.sh ``` <details open> <summary>cp_mkdir_continu_string.sh</summary> ```bash= #!/bin/bash ###>> This script is meant for preparing all the files to continue a string in opti+US mode after optimisation, LK 26/05/23 <<### ###>> use it in the same dir as $prev_dir and $dest_dir and run it: continu_string_mkdirCP.sh <prev_dir> <dest_dir> <top_file> <string_nodes> <name_job> <### ###>> this script makes assumption on some file names, please check before using it <<### prev_dir="$1" # like ./TEST_004_string_opti_agrstrt dest_dir="$2" # like ./TEST_005_string_opti_et_US_agrstrt top_file="$3" # just its name string_nodes="$4" # just the number name_job="$5" # just its name mkdir "$dest_dir" cp "$prev_dir"/CVs "$prev_dir"/STRING "$prev_dir"/model "$prev_dir"/"$top_file" "$dest_dir/" ### check if there is a restraint file and cp it if it is there or skip it rstrt_file="restraint" if [ -f "$prev_dir"/"$rstrt_file" ]; then echo "restraint file found. Copying..." cp "$prev_dir"/"$rstrt_file" "$dest_dir/" else echo "No restraint file found. Skipping..." 
fi ### sort the .rst7 file and copy them as starting structures mkdir "$prev_dir"/sorted cp ~/scripts/001_python_script/mod_order.py "$prev_dir"/sorted cd "$prev_dir"/sorted || exit python3 mod_order.py cd ../../ || exit cp -r "$prev_dir"/sorted "$dest_dir/starting_struct" ### get the highest string file and copy it as guess cd "$prev_dir"/results/ || exit new_guess=$(ls -1v *.string | tail -n 1) echo "$new_guess" cd ../../ || exit cp "$prev_dir"/results/"$new_guess" "$dest_dir/guess" sed -i "1i $string_nodes" "$dest_dir/guess" ### edit STRING and model files sed -i 's/.true./.false./g' "$dest_dir/STRING" sed -i 's/ nstlim = 50000, ! Steps, for 25 ps/ nstlim = 200000, ! Steps, for 100 ps/g' "$dest_dir/model" ### cp and edit the job.sh file cp ~/scripts/003_bash_script/job_ASM_24h.sh $dest_dir/ sed -i "s/NAME_JOB/$name_job/g" $dest_dir/job_ASM_24h.sh ### generate the input files cp ~/scripts/003_bash_script/continue_in.sh "$dest_dir/" cd "$dest_dir" || exit ./continue_in.sh "$string_nodes" ``` </details> <details open> <summary>job1.sh</summary> ```bash= #!/bin/bash #MSUB -r DOPO_angOPrst_ASMopti # Request name #MSUB -N 1 # number of nodes #MSUB -n 64 # number of cpus #MSUB -T 86400 #i.e 24h Elapsed time limit in seconds #MSUB -o log.%I.o # Standard output. %I is the job id #MSUB -e log.%I.e # Error output. %I is the job id #MSUB -q rome #MSUB -A gen11021 # Project ID #MSUB -m scratch set -x module purge module load feature/mkl/mpi feature/mkl/multi-threaded module load gnu mpi fftw3/mkl/ AMBERHOME='/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools22-ASM' source $AMBERHOME/amber.sh rm -rf results mkdir results ccc_mprun $AMBERHOME/bin/sander.MPI -ng 64 -groupfile ./string.groupfile cd .. ccc_msub job_cpmkdir.sh ``` </details> <details open> <summary>job_cpmkdir.sh</summary> ```bash= #!/bin/bash #MSUB -r DOPO_cpmkdir # Request name #MSUB -N 1 # number of nodes #MSUB -o log.%I.o # Standard output. %I is the job id #MSUB -e log.%I.e # Error output. %I is the job id #MSUB -q rome #MSUB -A gen11021 # Project ID #MSUB -m scratch set -x ccc_mprun ./cp_mkdir_continu_string.sh ./006_string_OPTI_ang_et_OP_rstrt ./007_string_OPTI_US_ang_et_OP_rstrt RD5N_R3H.prmtop 64 DOPO_angOPrst_ASMoptiUS cd ./007_string_OPTI_US_ang_et_OP_rstrt/ ccc_msub job_ASM_24h.sh ``` </details> <details open> <summary>job_ASM_24h.sh</summary> ```bash= #!/bin/bash #MSUB -r DOPO_angOPrst_ASMoptiUS # Request name #MSUB -N 1 # number of nodes #MSUB -n 64 # number of cpus #MSUB -T 86400 #i.e 24h Elapsed time limit in seconds #MSUB -o log.%I.o # Standard output. %I is the job id #MSUB -e log.%I.e # Error output. %I is the job id #MSUB -q rome #MSUB -A gen11021 # Project ID #MSUB -m scratch set -x module purge module load feature/mkl/mpi feature/mkl/multi-threaded module load gnu mpi fftw3/mkl/ AMBERHOME='/ccc/work/cont003/gen11021/gen11021/softwares/Ambertools22-ASM' source $AMBERHOME/amber.sh rm -rf results mkdir results echo "12000" > results/STOP_STRING echo "16000" >> results/STOP_STRING ccc_mprun $AMBERHOME/bin/sander.MPI -ng 64 -groupfile ./string.groupfile ``` </details> ## Analysis To analyse your trajectories, you can use the script [get_pmf_cv_values.py](https://github.com/kzinovjev/string-amber/blob/master/utils/get_pmf_cv_values.py) provided by Kirill, but it works only for the CV(s) defined in the CV file of the string. It can also cause issue as PBC are not considered. Another solution are these two scripts that give more freedom regarding the CV(s). 
The first one processes the amber trajectories and extract the CV(s) for each frame using cpptraj, which automatically takes into account the PBC: To use it, you will need a cpptraj template file, e.g.: ``` # cpptraj_template.in parm RD5N_POC_Mg.prmtop trajin {input_file} distance Mg_O1P @6490 @15 out {output_file} distance O2_P @12 @14 out {output_file} distance P_O5 @14 @17 out {output_file} distance O2_O5 @14 @12 out {output_file} angle O2_P_O5 @12 @14 @17 out {output_file} distance Mg_Ow1 @6490 @4303 out {output_file} distance Mg_Ow2 @6490 @4204 out {output_file} distance Mg_Ow3 @6490 @4369 out {output_file} distance Mg_Ow4 @6490 @4321 out {output_file} distance Mg_Ow5 @6490 @3991 out {output_file} run ``` Be carefull to respect the naming conventions for the placeholders (e.g. "{input_file}") in the cpptraj template file. You can also specify the naming for the generated files. To see all the options "python3 compute_CV_from_traj.py --help". <details open> <summary>compute_CV_from_traj.py</summary> ```python3= import subprocess import argparse def main(): """ Main function for processing Amber trajectories using cpptraj. Parses command-line arguments, reads the template file, and processes each trajectory. Parameters: None Returns: None """ parser = argparse.ArgumentParser(description='Extract the values of CV(s) from Amber trajectory using cpptraj.') parser.add_argument('num_file', type=int, help='Number of trajectories to be processed') parser.add_argument('template_file', help='Template file of the cpptraj commands to be executed.') parser.add_argument('topology_file', help='Path to the topology file (e.g., RD5N_POC_Mg.prmtop).') parser.add_argument('-o', '--output', help='Path to the output file (optional). Default is "CV_{index}.txt".', default='CV_{index}.txt') args = parser.parse_args() # Define the cpptraj commands template template_file_path = args.template_file with open(template_file_path, "r") as template_file: cpptraj_commands_template = template_file.read() # Loop over your input files num_file = args.num_file for i in range(1, num_file + 1): process_trajectory(i, args.output, args.template_file, args.topology_file, cpptraj_commands_template) def process_trajectory(i, output_format, template_file, topology_file, template_content): """ Process a single trajectory. Parameters: - i (int): Index of the trajectory. - output_format (str): Output file format with placeholder for index (e.g., 'output_{index}.txt'). - template_file (str): Path to the template file of cpptraj commands. - topology_file (str): Path to the topology file (e.g., RD5N_POC_Mg.prmtop). - template_content (str): Content of the cpptraj commands template. 
Returns: None """ # Define input and output file names input_file = f"{i}.nc" output_file = output_format.format(index=i) # Create the cpptraj commands with the specific input and output files cpptraj_commands = template_content.format(input_file=input_file, index=i, output_file=output_file) # Save the cpptraj commands to a file cpptraj_script_path = f"cpptraj_script_{i}.in" with open(cpptraj_script_path, "w") as script_file: script_file.write(cpptraj_commands) # Run cpptraj using subprocess with the specified topology file cpptraj_command = f"cpptraj -p {topology_file} -i {cpptraj_script_path}" subprocess.run(cpptraj_command, shell=True) # Remove the temporary script file subprocess.run(f"rm {cpptraj_script_path}", shell=True) if __name__ == "__main__": main() ``` </details> Then to compute the means and STDs of each CVs, you can use this other python script: <details open> <summary>compute_meanSTD_fromCV.py</summary> ```python3= import pandas as pd import os from natsort import natsorted import argparse import re import logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) parser = argparse.ArgumentParser(description='Process CV files and compute mean and std.') parser.add_argument('-o', '--output', help='Custom output filename. Default is: "CV_{index}.txt"', default='CV_{index}.txt') parser.add_argument('num_file', type=int, help='Number of files to be processed') parser.add_argument('-p', '--path', help='Path of the file to be processed. Default is current directory', default='./') parser.add_argument('-n', '--name', help='Name to save the dataframe as csv file. Default is: "result_dataframe.csv"', default='result_dataframe.csv') args = parser.parse_args() def process_cv_file(file_path, result_index, output_filename): """ Process CV file and compute mean and std. Parameters: - file_path (str): Path to the CV file. - result_index (int): Index for the result DataFrame. - output_filename (str): Output filename pattern. Returns: - pd.DataFrame: Result DataFrame. """ try: file_data = pd.read_csv(file_path, delim_whitespace=True, index_col=False, encoding='utf-8') file_data = file_data.drop(columns=["#Frame"]) column_means = file_data.mean() column_stds = file_data.std() result_df = pd.DataFrame(index=[result_index+1]) for key in column_means.index: result_df[f'{key}_mean'] = column_means[key] result_df[f'{key}_std'] = column_stds[key] return result_df except pd.errors.EmptyDataError: logger.warning(f"The file {file_path} is empty. Skipping.") except Exception as e: logger.error(f"Error processing file {file_path}: {e}") def main(base_directory, output_filename=args.output): """ Main function to process CV files in a directory. Parameters: - base_directory (str): Directory containing CV files. - output_filename (str): CV filename pattern. Returns: - pd.DataFrame: Final result DataFrame. 
""" final_result_df = pd.DataFrame() all_files = os.listdir(base_directory) output_pattern = re.escape(output_filename).replace(re.escape("{index}"), r"\d+") matching_files = natsorted([file_name for file_name in all_files if re.match(output_pattern, file_name)]) for index, file_name in enumerate(matching_files[:args.num_file]): file_path = os.path.join(base_directory, file_name) logger.info(f"Processing File: {file_path}") file_result_df = process_cv_file(file_path, index, output_filename.format(index=index)) if file_result_df is not None: final_result_df = pd.concat([final_result_df, file_result_df], axis=0) return final_result_df if __name__ == "__main__": data_directory = args.path result_dataframe = main(data_directory, args.output) result_dataframe.to_csv(data_directory + args.name) ``` </details> Again, to see the options and arguments use "-help". ## Generate structures On the PMF stage a string, use Rolf's script: `/home/kantin/Documents/scripts/001_python_script/RD_generate_string.py` Then in the extracted directory created by his script, organise the files: <details open> <summary>organise_extrac_rst7.sh</summary> ```bash= #!/bin/bash #### for organising in separate directories all the .rst7 files generated after extraction of the frames with RD_generate_string.py #### for i in {1..20}; do # Create the destination directory if it doesn't exist mkdir -p "./${i}" # Move files starting with the current number to the destination directory mv ./${i}_*.rst7 "${i}/" # Count the number of files in the subdirectory count=$(ls -1 "1/$i" | wc -l) echo "Number of files in subdirectory ${i}: $count" done ``` </details> Then we need to relax the structures: <details open> <summary>continue_in.sh</summary> ```bash= #!/bin/bash # script modified by LK 28/02/23 # If groupfile exists, remove it if [ -f string.groupfile ]; then rm string.groupfile fi PARM=RD5N_POC.prmtop # path to topology file REACT=./starting_struct/ # path to starting structures # Number of string nodes is provided as command line argument nodes=$1 for i in `seq 1 $nodes`; do # generate sander input for node $i sed "s/XXXXX/$RANDOM/g" model > $i.in # write out the sander arguments for node $i to the groupfile echo "-O -rem 0 -i $i.in -o $i.out -c ${REACT}NUM_$i.rst7 -r NUM_$i.rst7 -x $i.nc -inf $i.mdinfo -p $PARM" >> string.groupfile done echo "N REPLICAS = $i" ``` </details> All can be found at /ccc/scratch/cont003/gen11021/kantinla/M2_stage_srtch/001_RD5N_POC/003_ASM_B3LYP/extracted. After runnig the 50 steps, take the rst7 and from them extract the xyz files using cpptraj: <details open> <summary>all_make_and_sort.sh</summary> ```bash= #!/bin/bash #################################################### ## generate the xyz files, move them in a new dir ## #################################################### # LK, 19/07/2023 for dir in {1..20} do cd $dir run_num=$(basename $dir) for file in *.rst7 do filename="${file%.*}" cpptraj -p RD5N_POC.prmtop -y "$file" << EOF strip !(:1-2) trajout ${filename}_nW.xyz [xyz] nobox notitle append run EOF tail -n +3 ${filename}_nW.xyz > ${filename}_qm.xyz rm ${filename}_nW.xyz # Move the stripped XYZ file to the directory named after the run number mkdir -p ${run_num}_xyz_files mv ${filename}_qm.xyz ${run_num}_xyz_files/ done cd .. mv ${run_num}_xyz_files/ ../ done ``` </details> For generating orca inputs: <details open> <summary>gen_orc_in.sh</summary> ```bash= #!/bin/bash for dir in {1..20}; do cd "${dir}_xyz_files" || { echo "Failed to change directory to $dir" >&2 continue } if [ ! 
-f orca_input_model.txt ]; then echo "orca_input_model.txt not found in directory $dir" >&2 cd .. continue fi for file in *.xyz; do if [ ! -f "$file" ]; then echo "Skipping non-existent file: $file" >&2 continue fi filename="${file%.*}" cat orca_input_model.txt "$file" > "${filename}_sp_orc.in" && echo "*" >> "${filename}_sp_orc.in" done mkdir "${dir}_orca_inputs" mv *_sp_orc.in "${dir}_orca_inputs" mv "${dir}_orca_inputs" .. cd .. done ``` </details> For sqm we need 3 scripts: <details open> <summary>add_znum_to_xyz.py</summary> ```python= import glob atomic_numbers = { "C": 6, "H": 1, "O": 8, "P": 15, } file_pattern = "1_*_qm.xyz" # Find all files matching the pattern file_list = glob.glob(file_pattern) for input_file in file_list: output_file = input_file.replace("_qm.xyz", "_processed.xyz") with open(input_file, "r") as f_in, open(output_file, "w") as f_out: for line in f_in: atom_data = line.strip().split() if len(atom_data) >= 4: atom_symbol = atom_data[0] if atom_symbol in atomic_numbers: atomic_number = atomic_numbers[atom_symbol] new_line = f"{atomic_number} {' '.join(atom_data)}" f_out.write(new_line + "\n") else: f_out.write(line) else: f_out.write(line) ``` </details> <details open> <summary>all_add_znum.sh</summary> ```bash= #!/bin/bash for i in {1..20}; do echo "Processing directory ${i}_xyz_files" cp add_znum_to_xyz.py "${i}_xyz_files" sed -i "s|file_pattern = \"1_|file_pattern = \"${i}_|g" "${i}_xyz_files/add_znum_to_xyz.py" cd "${i}_xyz_files/" || { echo "Cannot change to directory ${i}_xyz_files"; continue; } echo "Running add_znum_to_xyz.py" python add_znum_to_xyz.py 2>&1 || echo "Failed to run python script in ${i}_xyz_files" cd .. done ``` </details> <details open> <summary>gen_sqm_in.sh</summary> ```bash= #!/bin/bash for dir in {1..20}; do cd "${dir}_xyz_files" || { echo "Failed to change directory to $dir" >&2 continue } if [ ! -f sqm_input_model.txt ]; then echo "sqm_input_model.txt not found in directory $dir" >&2 cd .. continue fi for file in *_processed.xyz; do if [ ! -f "$file" ]; then echo "Skipping non-existent file: $file" >&2 continue fi filename="${file%.*}" cat sqm_input_model.txt "$file" > "${filename}_sp_sqm.in" sed -i 's/^/ /' "${filename}_sp_sqm.in" done mkdir "${dir}_sqm_inputs" mv *_sp_sqm.in "${dir}_sqm_inputs" mv "${dir}_sqm_inputs" .. cd .. done ``` </details> Then to run the SP calculations for Orca: <details open> <summary>job_template_sp_orc.sh</summary> ```bash= #!/bin/bash #MSUB -r {extract}_{node}_orc_sp # Request name #MSUB -n 16 # Number of tasks to use #MSUB -T 86400 # Elapsed time limit in seconds 24h #MSUB -o %I.out # Standard output. %I is the job id #MSUB -e %I.err # Error output. %I is the job id #MSUB -q rome #MSUB -A gen11021 # Project ID #MSUB -m scratch ml sw mpi/openmpi/4.1.1 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ccc/work/cont003/gen11021/dubouede/softwares/orca_5_0_3_linux_x86-64_shared_openmpi411/ set -x /ccc/work/cont003/gen11021/dubouede/softwares/orca_5_0_3_linux_x86-64_shared_openmpi411/orca {extract}_{node}_qm_sp_orc.in > {extract}_{node}_qm_sp_orc.log ``` </details> <details open> <summary>gen_jobs.sh</summary> ```bash= #!/bin/bash # Path to your input files. input_path="./" # Name of your level of theory directory level_of_theory="SPcalc_B3LYP_6311G" # Name of your template job script. 
template="job_template_sp_orc.sh" # First set of loops: create directories for extract in $(seq -f "%02g" 1 20) do for node in $(seq -f "%02g" 1 64) do # Create directories dir_path="${level_of_theory}/${extract}_extracted/${node}_node" mkdir -p $dir_path done done # Second set of loops: copy and modify files for extract in {1..20} do for node in {1..64} do # Copy the input files into their appropriate directories new_extract=$(printf "%02d" $extract) new_node=$(printf "%02d" $node) dir_path="${level_of_theory}/${new_extract}_extracted/${new_node}_node" cp "${input_path}/${extract}_orca_inputs/${extract}_${node}_qm_sp_orc.in" "${dir_path}/${new_extract}_${new_node}_qm_sp_orc.in" # Copy the template job file and replace placeholders cp $template "${dir_path}/job.sh" sed -i "s/{extract}/$new_extract/g" "${dir_path}/job.sh" sed -i "s/{node}/$new_node/g" "${dir_path}/job.sh" done done ``` </details> And then in SPcalc_B3LYP_6311G/ run : <details open> <summary>gen_jobs.sh</summary> ```bash= #!/bin/bash for extract in {01..20}; do for node in {01..64}; do cd "${extract}_extracted/${node}_node/" ccc_msub job.sh cd ../../ done done ``` </details> For sqm it's the same but the names change.
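Once the single-point jobs are done, the energies can be gathered into a single table for later analysis along the string. Below is a minimal sketch; it assumes the directory layout created by gen_jobs.sh and that each ORCA log contains a "FINAL SINGLE POINT ENERGY" line (the output name `sp_energies.csv` is arbitrary).

```python=
#!/usr/bin/env python3
# Sketch: gather the ORCA single-point energies into one CSV (sp_energies.csv is an arbitrary name).
# Assumes the layout created by gen_jobs.sh, i.e.
# SPcalc_B3LYP_6311G/{extract}_extracted/{node}_node/{extract}_{node}_qm_sp_orc.log,
# and that each log contains a "FINAL SINGLE POINT ENERGY" line.
import csv
import glob
import re

energy_re = re.compile(r'FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)')

with open('sp_energies.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['extract', 'node', 'energy_hartree'])
    for log in sorted(glob.glob('SPcalc_B3LYP_6311G/*_extracted/*_node/*_sp_orc.log')):
        extract, node = re.findall(r'(\d+)_extracted/(\d+)_node', log)[0]
        with open(log) as f:
            energies = energy_re.findall(f.read())
        if energies:
            writer.writerow([extract, node, energies[-1]])   # last match = final SCF energy
```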