MDTutorials
In this tutorial, we focus on a detailed procedure for preparing input files for advanced sampling simulation. There are 2 parts in this tutorial:
Preparation of the Input Files for Advanced Sampling Simulations - Part 1
This part of the tutorial focuses on the workflow starting from downloading the PDB file from Protein Data Bank to prepare input files for advanced sampling methods in general, including a coordinate file, a topology file, and an index file. We choose PLCC/PLCG1 protein-ligand binding complex as our system of interest in this part of the tutorial. Sections in this part of the tutorial include:
Preparation of the Input Files for Advanced Sampling Simulations - Part 2 (unpublished for now)
In this part of the tutorial, we will first introduce methods specifically for preparing the input files for Hamiltonian replica exchange and expanded ensemble simulations, given the files prepared in the first part of the tutorial using PLCC/PLCG1 protein-ligand binding complex as the system of interest. Then, we will utilize the workflows in both parts of the tutorial to different but similar systems, including the protein-ligand binding complexes like PLCC/ErbB2-pep and PLCC/IR-pep in a more brief way compared to PLCC/PLCG1-pep.
Sections in this part of the tutorial include:
In this part of tutorial, our system of interest, PLCC/PLCG1-pep is the binding complex between the C-terminal SH2 domain of Phosphoplipase C, 1 (PLCC) and its own intraprotein recognition site (PLCG1-pep). The X-ray crystallography structure of this binding complex could be downloaded from Protein Data Bank (PDB) with PDB ID as 4k45. The target of this tutorial is to provide a detailed method to prepare the GROMACS input files for further advanced sampling simulations, in our case, Hailtonian replica exchange and expanded ensemble simulations. These input files include a topology file (.top
), a configuration file (.gro
) file and an index file (.ndx
) file. If you are already familiar with the reasons for each step in the whole process and just want a quick look at all the command used, please directly jump to Section 6. For information about the installation of the software used in this tutorial, please refer to the wiki page in Shirts group.
When simulating a protein-ligand binding complex, the first thing to check is the missing residues and the missing atoms, which should be documented in the header of the PDB file. For example, REMARK 465
shows the missing residues of the binding complex as follows:
As shown above, 4 residues of the protein (PLCC, chain A) were missing (one at the beginning and 3 in the end). On the other hand, the first 10 residues of the peptide (PLCG1-pep, chain B) were missing. These residues are missing generally because they are too flexible and floppy, so the electron density for these residues is too low to resolve the structure. To make the structure for the simulation consistent with the one investigated by the experimental methods, our goal in Section 2., in this case, is to use the software MODELLER to rebuild all the 4 missing residues of the protein, and the asparagine residue of the peptide (residue 779). Once the rebuilding of the missing residues is completed, you should also found that MODELLER also automatically rebuilt all the missing atoms of side chains of the protein. To start with, make a folder Build
, download the PDB file 4k45.pdb
from the link above to the folder and follow the steps below in the folder.
To generate an alignment file, first, we have to generate a sequence file from 4k45.pdb
. To do so, run the Python script sequence.py
which has the following content
As a result, a sequence file 4k45.seq
will be produced, which contains the sequence of the structure in 4k45.pdb
(not including the missing residues) as below.
Note that the sequences should be ended up with an *
, and the forward-slash /
in the sequence serves as a breaker between different chains. So the part before the slash is the sequence of the protein, while the remaining part is the sequence of the peptide. In the sequence of the peptide, it should be noticed that the phosphotyrosine residue was ignored by MODELLER, which is a result of the HETATM records of the phosphotyrosine.
In the alignment file alignment.ali
, which can be manually made from the content in 4k45.seq
, there should be two sequences shown as follows:
As shown above, several changes need to be made to make an alignment file from 4k45.seq
:
4k45
in the alignment file, just copy the content of 4k45.seq
, identify the missing residues and add a -
to represent each residue..
to represent the HETATM residue (e.g. DNA, RNA, ligand, etc.), which is the phosphotyrosine residue in this case. In MODELLER, .
is called BLK or block residue type. Accordingly, since the phosphotyrosine was ignored in 4k45.seq
, we have to correct the number of residues once we include phosphotyrosine. Therefore, change A:+109
into A:+110
.4k45_fill
is a complete sequence with all the missing residues filled that serves as a template. Make sure that the two sequences have equal numbers of residues and are aligned with each other.automodel
After getting the alignment file, we should be able to use it as the input to the modeling using automodel
. Specifically, we run the script build.py
which has the following content:
Remember that by default all HETATM records are ignored. Therefore, to read all the HETATM from the template, we have to turn on env.io.hetatm
by setting env.io.hetatm = True
. After executing build.py
, we should get only one PDB file, 4k45_fill.B99990001.pdb
as a result of setting a.starting_model = 1
and a.ending_model = 1
. (Also note that all the water molecules have been removed.) Use VMD (Visual Molecular Dynamics) to visualize the structure to see it seems reasonable, which means that there should be no broken molecules, missing atoms, or other anomalies. (In this case, check if the phosphotyrosine still stays inside the binding pocket composed of 4 arginine residues.) As a record, the following is the output of using build.py
:
Note that if a wrong is used in build.py
, then the atoms might be erroneously added, which might lead an error when using H++ web sever to perform parameterization. (An error like FATAL: Atom .R<NASN 107>.A<OXT 17> does not have a type.
might occur.)
To prepare the topology file of a structure, we first have to decide which force field to use. In the case of protein-ligand binding complexes, one of the frequently chosen force fields is AMBER99SB-ILDN, which is an all-atom force field. However, simply using the GROMACS program pdb2gmx
will not work, since the peptide in our case contains a non-standard residue, phosphotyrosine (PTR), which is not included in the residue topology database (aminoacids.rtp
in the folder /usr/share/gromacs/top/amber99sb-ildn.ff
). of AMBER99SB-ILDN force field. Generally, there are multiple ways to deal with this issue, and the most intuitive way might be manually adding the parameters of PTR to aminoacids.rtp
and modifying residuetypes.dat
in the folder /usr/share/gromacs/top
then running pdb2gmx
. However, it might be tedious to get the parameters of PTR, so it could be a better choice to rely on other tools (so that pdb2gmx
is not required).
When it comes to parameterizing a small molecule or a non-standard amino acid, Amber is often a useful tool. Amber is another molecule dynamics simulation engine like GROMACS and it also contains package (AmberTools) that is able to generate topology and coordinate files convertible to GROMACS input files. While the procedure of parameterization using AmberTools could be complicated, a web server H++ developed by Onufriev et al. is able to automize the whole process of the parameterization of either proteins or non-protein structures. However, it is still recommended to know how Amber works to parametrize a structure rather than using H++ as a black box. Therefore, here we introduce the core concepts of parameterization using AmberTools as below. If you are new to Amber or AmberTools, it is highly recommended to refer the manual of Amber 2018 or 2019, especially Section 1, 12, 13, 14, and 15. Also, the following tutorials are recommended to get familiarized with the fundamentals using AmberTools, especially parameterization:
Generally, when parameterizing a small molecule (instead of a polypeptide), we have to separate the part of the small molecule from the original structure and use several programs in Amber following the procedures as below to independently parameterize it:
antechamber
to assign charges and atom types.
antechamber -i ${name}.pdb -fi pdb -o ${name}.mol2 -fo mol2 -c bcc -s 2 -nc ${nc} -pl 15
.mol2
file.parmchk2
to build the missing parameters.
parmchk2 -i ${name}.mol2 -f mol2 -o ${name}.frcmod
.mol2
file; Output: a .frcmod
file (parameter file)tleap -f tleap.in
, which requires a tLEaP input file looks like follows:
.prmtop
file) and an Amber coordinate file (.inpcrd
) file. Both files are convertible using InterMol or acpype.py
. Note that the missing atoms and the hydrogens will be builts by tLEaP automatically.Parameterization/GAFF
)In our case, however, the ligand is a peptide with a non-standard amino acid (PTR). To generate a topology file of the whole binding complex, there are several ways to do using AmberTools (without using H++):
pdb2gmx
to parameterize the protein (with AMBER99SB-ILDN force field), then combine the coordinate files and the topology files output by different parameterizing tools.pdb2gmx
, then combine the coordinate files and the topology files. However, in this case, PTR is not a terminal residue, which means that the peptide ligand will be broken into two pieces when we separate PTR from the original PDB for AmberTools parametrization. To parameterize the remaining part other than PTR, we might have to label the two pieces of the peptide as two different chains (since there is no covalent bond in between).The procedure of parameterizing using AmberTools in all of these three methods resembles the procedure of parameterizing the small molecule as shown previously, except that antechamber
and parmchk2
might not be required. Note that, however, it could be tedious and error-prone when combining the coordinate files if different parts of the peptide are parameterized by different tools. Therefore, the last method is not recommended.
While H++ web server is able to process the whole complex at a time, it is recommended to parameterize the protein and the peptide ligand separately so that we can have these two entities as two different [ molecultype ]
directives in the topology file. Parameterizing the whole complex would lead to only one [ moleculetype ]
directive corresponding to the whole complex, which makes it hard for us to only decouple the interactions of the peptide ligand in replica exchange or expanded ensemble simulations.
To separately parameterize the protein and the peptide ligand, we first have to separate the coordinate file. To do so, execute the following command in folder Topology
to copy 4k45_fill.B99990001.pdb
and save as PLCC.pdb
. (Before this, go back to the parent folder of the current folder Build
and make another folder Topology
, in which all the steps in this section will be carried out.)
Then, cut the part of the peptide ligand and save it as pep7.pdb
.
Note that if there are several images/copies in the original PDB file downloaded from Protein Data Bank, usually there is only one image chosen to proceed to simulations. Since the chosen image is usually not at the center of the simulation box, it is recommended to use the GROMACS program editconf
to center the structure to prevent the structure cross the periodic boundaries during energy minimization or other following steps:
However, in our case, there is only one image in the original PDB file, which should be already close to the center of the box, so it is not necessary to do so.
When preparing the structure input file for any service using AmberTools, there are several things to note:
reduce
and LEaP utilities in AmberTools add hydrogens automatically with predefined names. Having hydrogens in PDB files with names that LEaP does not recognize within its residue libraries leads to a total mess.pep7.pdb
have been removed by MODELLER automatically.)pdb4amber
, which is introduced in Section 12.4 in the manual.Obviously, different concerns might need to be taken care of in different cases. More details about the system preparation for AmberTools are elaborated in Section 12 in the manual of Amber 2018. To name just a few,
With PLCC.pdb
and pep7.pdb
to start with, now we can use H++ the parameterize them separately.
First, let's parameterize the protein part (PLCC.pdb
) using H++. To begin with, click on the tab "Process a structure" on the H++ web server, upload PLCC.pdb
and click on "Process File". After that, the user will be directed to a page in which the user can specify parameters used during the parameterization, including salinity, internal dielectric and external dielectric constants, pH value and so on. H++ will decide the protonation states of each residue based on these parameters (mainly the pH value).
For typical physiological conditions, using the default value of 80 for the external dielectric (water) and salinity = 0.15 M is reasonable. The situation is less straightforward with the internal dielectric. If you are mostly interested in deeply buried residues, a lower value is recommended such as 4. If your focus is residues closer to the surface, a higher value is appropriate, such as 10 or even 20. In our case, we have to make sure that the parameters used here are consistent with the experimental conditions. According to Section 2.2.6. (page 37) of Dr. McKercher's thesis, PLCC was dialyzed overnight at 4 Β°C into buffer containing 25 mM Tris, pH 7.4, and 200 mM NaCl and the binding reactions were performed at 25 Β°C (298 K). Here, we will just ignore Tris buffer and set the salinity in the web server as 0.2 (M) and the pH value as 7.4. As for the dieletric constants, we would just use the default values. Note that here we adopt the experimental conditions of ITC instead of X-ray crystallography because what we want to simulate is the binding complex in a solution instead of in a crystal form. ITC and NMR experiments are conducted in solution phase, so we have to make sure that our simulation parameters consistent with the ones used in at least one of these experiments.
Note that H++ is more than parameterizing a structure. As you can see on the page for specifying parameters, it is also able to construct a simulation box, solvate the system and add ions to the box. However, the box types available in H++ are limited to cubic and octahedral boxes. While the volume of an octahedral box is about 77% of a cubic box and can, therefore, decrease computational cost, it is not the best choice compared with a rhombic dodecahedral box available in GROMACS, whose volume is only about 70.7% of a cubic box. Therefore, we are not going to use H++ to solvate the system in a simulation box, but just parameterize the structure and use GROMACS to subsequently complete solvation and neutralization. After setting up all these parameters, click on the tab "Process β¦" to process the file. (Note that it is recommended to register an account on H++, which would make it easier to manage the submissions.)
As a result, in the results page, the titration sites and their titration curves, isoelectric point (8.69 in our case) and the total charge at pH 7.4 (+3 in our case) will be shown. Also, there are some key files available to be downloaded, including an Amber topology file 0.2_80_10_pH7.4_complex.top
and an Amber coordinate file 0.2_80_10_pH7.4_complex.crd
. Accordingly, save the Amber topology file and the Amber coordinate file as PLCC.top
and PLCC.crd
, respectively. However, note that these files are for Amber simulation engine instead of GROMACS, we have to use convert the files to the extensions that are compatible with GROMACS. Since InterMol is only able to convert prmtop
file, here we have to use acpype.py
to convert the files. To do this, clone the repository that is called "useful-scripts" in Shirts group, copy the folder GAFF
in the folder Parameterization
and execute the command
As a result, several files will be output, including a GROMACS topology file PLCC_GMX.top
, a GROMACS coordinate file PLCC_GMX.gro
, and two parameter files for testing the output files with GROAMCS (em.mdp
and md.mdp
). Since we will also perform energy minimization and short MD simulation later, we are not going to use em.mdp
and md.mdp
generated here, but the files PLCC_GMX.gro
and PLCC_GMX.top
are exactly what we want for now.
Basically, the parameters (salinity, pH value and so on) used in the parameterization of the peptide ligand are exactly the same as the ones used in the parameterization of the protein except that several changes have to be made:
pep7.pdb
are HETATM instead of ATOM. Since the HETATM records will be ignored by AmberTools during parameterization and would cause a discontinuity in the peptide in our case, we have to change HETATM records of PTR into ATOM records.pep7.pdb
using H++ using exactly the same settings as we used previously in the parameterization of PLCC.pdb
, the following error would occur and be shown in 0.15_80_10_pH6_pep7.tleap.err
:
This error indicates that HB1 atom in PTR does not have a type. Note that this atom was not present in the input PDB file but added by the reduce
program in AmberTools used by H++ to correct the orientation of the ASN residue. To avoid this error, we have to deselect the option "Correct orientation of ASN, GLN and HIS groups β¦" on the page that we specify the parameters. (All the other parameters should remain the same.) While this option is selected by default, skipping it is generally ok. It may affect the accuracy of the predicted protonation state, but the predicted protonation state is an approximation anyway. If you are concerned, you can run with the option turned on. Then look in the reduce.log
file under "listings". Search for "FLIP" in this file. It will tell you which groups were flipped. Then you can load your coordinate files in VMD or other viewers to see which groups are near the flipped groups. if the pKa for the nearby groups are far from your pH of interest, then it should not affect predicted protonation state. Very rarely is this not the case. But if it is you can manually change the orientation of the groups.Given that all the considerations above are taken care of, the parameterization should work well and the total charge at pH 7.4 is -3. Similarly, save the Amber topology file and the Amber coordinate file as pep7.top
and pep7.crd
, respectively, and execute the following command to convert the files:
As a result, a GROMACS topology file pep7_GMX.top
and a GROMACS coordinate file pep7_GMX.gro
will be produced.
Since the simulations will be performed on the whole binding complex, not either only the protein or only the peptide ligand, we have to combine the topology files and the coordinate files of these two entities. To begin with, execute the following command:
To combine the coordinate files, we only have to cut the coordinates in pep7_GMX.gro
and append to the bottom of complex_GMX.gro
, change the number of atoms shown at the top of the file from 1744 to 1878 and save the file. Note that the numbering of the residues of the peptide ligand starts from 1 instead of 107. This is exactly what we want, since this is how the topology file that we are going to make distinguishes different molecule types defined in [ moleculetype ]
directive. The numbering of the peptide also starts from 1 instead of 1745, but this will be fixed in the next step using the GROMACS program editconf
when we solvate the system. Note that even if the box dimensions in PLCC_GMX.gro
and pep7_GMX.gro
are different, but it is just a result of different molecule sizes. The relative distance between the protein and the peptide remains the same. To examine this, it is recommended to load complex_GMX.gro
and 4k45_fill.B99990001.pdb
in VMD. As shown below, it can be seen that the two structures align with each other perfectly. (The one in blue is complex_GMX.gro
while the other in red is 4k45_fill.B99990001.pdb
.)
Combining the topology files is also straightforward. While making pep7_GMX.top
into an .itp
file (include topology file) and include the .itp
file in complex_GMX.top
(using the statement like #include "pep7.itp"
) should work, here we will just copy the content of pep_GMX.top
and paste that to complex_GMX.top
so that we don't have to always have both the .itp
and .top
file in the same directory. However, there are still several things to note. The first is that in complex_GMX.top
, there should only be one [ atomtypes ]
directive that includes all the atom types in both the protein and the peptide ligand. Therefore, here we adopt the union set of the [ atomtypes ]
directives of the two topology files. To do so, we have to append the atom types that only appear in pep7_GMX.top
to the [ atomtypes ]
directive of complex_GMX.top
. Specifically, the appended atom types are:
Then, copy all the directives other than [ defaults ]
and [ atomtypes ]
in pep7_GMX.top
, including [ atoms ]
, [ pairs ]
, [ bonds ]
, [ angles ]
and [ dihedrals ]
(including propers and impropers) and paste them right after the last directive (should be [ dihedrals ]
directive for the impropers) of the protein. At last, change the system name in [ system ]
directive from PLCC
to complex in water
and add pep7 1
to the bottom of the file. (Note that the molecule type of the peptide ligand is pep7
defined in the [ moleculetype ]
directive of the ligand.) As a result, the bottom of the topology file should be:
Up to this point, the combination of the coordinate files and the topology files should be complete.
To solvate the system in water molecules, we first have to construct a simulation box. (Accordingly, create a new folder Solvation
in the directory same level as Topology
.) As mentioned, a rhombic dodecahedral box is one of the best choices. To center the protein in a dodecahedral box with a volume such that the distance between the solute and the box is 1.0 nm, execute the following command in folder Solvation
:
As a result, a coordinate file complex_box.gro
will be generated and it is shown that the volume of the box is 212.339 .
Then, to add water molecules to the system, execute:
Accordingly, a coordinate file complex_sol.gro
and a new topology file complex_GMX.top
(in folder Topology
) will be produced, both with water molecules added. Note that the old topology file is not overwritten but renamed as #complex_GMX.top.1#
. (Note that it is recommnded to make sure that [ molecules ]
directive in complex_GMX.top
is well-formatted after the solvents are added.)
After adding water molecules in the simulation box, we then have to add counterions to neutralize the system, which can be done by the GROMACS program genion
. To run genion
, we need a .tpr
file, which can be produced by the grompp
command. Basically, grompp
command reads a molecule topology file, checks the validity of the file, expands the topology from a molecular description to an atomic description. However, at this point, our topology file does not have atom types and molecule types (like SOL
in complex_GMX.top
) for the ions and solvent molecules, but only the ones for the binding complex. Therefore, to enable GROMACS to recognize moleculetype
such as SOL
and CL
when we run the grompp
command, we have to modify the topology file complex_GMX.top
.
In this case, we will borrow these parameters (moleculetype
and atomtypes
) from AMBER99SB-ILDN force field and use TIP3P water model. Accordingly, add the following lines to the topology file complex_GMX.top
, right before the [ system ]
directive:
If these lines are not added, errors like No such moleculetype SOL
(which could be avoided by including tip3p.itp
) and No such moleculetype CL
(which could be avoided by including ions.itp
would occur. (Note that it is not recommended to delete the [ defaults ]
directive in the topology file and add #include "amber99sb-ildn.ff/forcefield.itp"
, since there might be duplicates of atom types and the original ones might be overridden.)
To make the include topology tip3p.itp
and ions.itp
effective, all the atom types used in defining the moleculetypes
in these files must be defined in the [ atomtypes ]
directive in our topology file complex_GMX.top
. However, obviously most atom types used in these two include topology files are not defined in complex_GMX.top
. Since all the atom types used in these .itp
files can be found in ffnonbonded.itp
, to solve the issue, just copy the parameters needed from ffnonbonded.itp
and append to the [ atomtypes ]
directive in complex_GMX.top
with a format consistent with the existing parameters. Specifically, the following lines are add:
If the lines above are not added, an error like Atomtype XX not found
would occur. One thing to note here is that in complex_GMX.top
, the mass and charge in the [ atomtypes ]
directive can be modified in [ atoms ]
directive. If masses are not provided in [ atoms ]
, they will be referenced from [ atomtypes ]
. On the other hand, charges present in [ atomtypes ]
are never used for anything and must be specified in [ atoms ]
. Therefore, it is fine to have zero masses and charges in [ atomtypes ]
, as you can see in complex_GMX.top
in our case. (MODELLER has automatically assigned zero masses and charges.) However, at this point, [ atoms ]
directive only contains atoms in the binding complex. Therefore, when grompp
after the ions (like sodium ion(type Na) or chloride ion(type Cl)) are added, the [ atomtypes ]
will be referenced for the mass of the ions. Accordingly, to avoid errors like atom CL (Res CL-1) has mass 0 (state A) / 0 (state B)
, the mass of the atom types Cl
and Na
should not be zero, as you can see in the added lines above. (For the mass of any atom type, refer to ffnonbonded.itp
.) For more information, check one of the discussion on GROMACS mailing list.
In addition, note that it is recommended to check if there are duplicates of atom types and if the atom types in each residue defined in aminoacids.rtp
match the atom types of the equivalent atoms in the topology file complex_GMX.top
.
Note that in aminoacids.rtp
, the atomtype of the CA atom of LEU is CT, while that of LEU is CX in complex_GMX.top
. Similarly, the atom type of the CB atom of LEU in aminoacids.rtp
and complex_GMX.top
is CT and A2C, respectively. Since A2C and CX defined in complex_GMX.top
and CT in ffnonbonded.itp
have exactly the same parameters (sigma being 3.39967e-01 and epsilon being 4.57730e-01), these three atom types are all effectively the same. At the moment, it is not clear why MODELLER assigned A2C and CX instead of CT, but as long as the parameters are faithfully assigned and the process ends up being straightforward, all other details are flexible and leaving the atom types like this should be fine. After modifying the topology as above, no errors relevant to atom types or molecule types will be produced in the following simulations.
In a lot of cases, it is common to have net charges on the binding complex and the neutralization of the system is required. Generally, the process is carried out by adding sodium ions and the chloride. These added ions not only neutralize th system, but also specify the salinity of the system. Therefore, even if we happen to have neutral binding complex (protein: +3, peptide: -3 at pH 7.4.), we still have to add ions to the system to reach the salinity of 200 mM of NaCl as used in the ITC experiment. To add ions, we first have to run the following command to generate a .tpr
file:
where the parameter file ions.mdp
has the following content:
In our case, no errors should occur when generating a .tpr
file. However, in some other cases, it is possible the execution of the command will be stopped due to a warning like below: (The warning below is from the case of the binding complex at pH 6.)
If this warning occurs, generally it should be fine to just ignore it, since the next step is to use genion
program in GROMACS to neutralize the system. After all, we have to run grompp
before running genion
, so it is natural to get this warning at this point. To ignore the warning, just use the flag -maxwarn
so that the command would be
However, before running the command above, usually there is another thing that we should take care of. In some cases, you might see that in addition to a warning, there is also a note in the STDOUT
of grompp
which looks like this:
This note indicates that the total charge of the system is not an integer as it should be. In fact, it is common to get charges that are not exactly an integer, which is just a result of rounding. While the difference between the charges in the topology file (2.999915) and the real value (in this case, 3) is very small, it might be better to spread this difference throughout several atoms. Since the difference is quite small, it would not matter too much which atoms are selected and modified accordingly. To minimize the influence of adjusting the charges, here we can choose to spread the charge difference (0.000085) throughout the protein that is furthest from the peptide, which is residue 106 THR. Since THR has 15 atoms, to spread out the charge difference as even as possible, we choose to add a charge of 0.000005 to each of the first 5 atoms and 0.000006 to the remaining 10 atoms by modifying the topology file complex_GMX.top
. Although the note will still probably present after the adjustment (showing that a total charge is closer to 3), at least we do out best we can do here. In fact, such a small charge difference should not cause any observable influence in the following simulations.
To use genion
to add sodium and chloride ions, we first have to estimate how many ions should be added, which requires the consideration of the volume of the box and the physiological concentration used. Since we just set the salinity as 0.15 M when using H++ (which means that [NaCl] = 0.15 M), given the volume of the box (222.58 ), we can calculate the number of the sodium ions as follows:
As a result, since the net charge of the whole binding complex is 0 at pH 7.4 (+3 for the protein and -3 for the peptide ligand), 26 sodium ions and 26 chloride ions should be added. Accordingly, in the same folder Solvation
, execute the following command and select Group 15 (SOL
) as the continuous group of solvent molecules in the prompt:
Similarly, in folder Topology
, a new topology file complex_GMX.top
will be produced with ions added and the old one will be renamed as #complex_GMX.top.2#
.
Having the solvated, electroneutral system assembled, before we begin dynamics, we must ensure that the system has no steric clashes or inappropriate geometry by relaxing the solvated system to a low-energy state through energy minimization. To start with, in a newly created folder EM
, execute a grompp
command first to get a .tpr
file:
where the parameter file em.mdp
has the following content:
Then, run mdrun
command that requires the .tpr
file just produced:
As a result, the following information will be printed out as the energy minimization is finished:
Energy minimization ensures that we have a reasonable starting structure, in terms of geometry and solvent orientation. To begin real dynamics, we must equilibrate the solvent and ions around the protein and the ligand. If we were to attempt unrestrained dynamics at this point, the system may collapse. The reason is that the solvent is mostly optimized within itself, and not necessarily with the solute. It needs to be brought to the temperature we wish to simulate and establish the proper orientation about the solute (the protein and the ligand). Since NPT ensemble is relatively close to a typical experimental setup, in our case, we will perform equilibration and the subsequent short MD simulation both in NPT ensemble.
While this is generally not necessary, instead of performing NPT equilibration directly, in this case, we choose to perform NVT equilibration first (and then NPT equilibration), since it is often more robust to first equilibrate the temperature of the system under an NVT ensemble before applying a barostat to control the temperature. The simultaneous combination of velocity generation and coordinate scaling under the influence of barostat can introduce instabilities in a system that may be far from equilibrium. To get started, create a folder called Equil
in the directory same level of Topology
. Then, in folder Equil
, create another tow folders: NVT
and NPT.
To run an NVT equilibration, execute the following command in the folder NVT
to generate a .tpr
file to run mdrun
:
where nvt_equil.mdp
has the following content:
With complex_equil.tpr
, run the following command to perform NVT equilibration:
As a result, it took about 9 to 10 minutes to finish the equilibration. After the equilibration is complete, we can use gmx energy
to output xvg
files containing the data of potential and temperature as a function of time. As shown below, it is clear that the temperature of the system quickly reached the target value (298 K) and remained stable over the remainder of the equilibration. (Although there is fluctuation, the average is no longer drifting. The average temperature over the last 100 ps of equilibration is 297.316 K.) Similarly, the potential energy quickly reached a plateau, which indicates that the NVT equilibration here should be adequate.
After the NVT equilibration is completed, we then want to run another NPT equilibration to reach the same ensemble in the MD simulation later (also NPT ensemble). To generate a .tpr
file for running an NPT equilibration, in folder NPT
, execute the following command:
Note that in the command above, the checkpoint file NVT/state.cpt
from NVT equilibration is passed to the -t
option. The checkpoint file contains a complete description of the thermodynamic state of the system, in high precision. To continue a simulation, the checkpoint file must be passed to grompp
. If it is omitted, information regarding velocities, thermostat variables, and high-precision coordinates will be lost.
In addition, the parameter file md.mdp
is exactly the same as nvt_equil.mdp
except (check one of the discussions on GROMACS mailing list or a GROMACS tutorial by Dr. Lemkul for more information about NPT equilibration):
pcoupl = no
in the NVT equilibration, here we use Berendsen barostat (pcoupl = Berendsen
). Although Berendsen barostat is actually not a good barostat for MD simulation since it artificially suppresses fluctuations, in the equilibration process fluctuation, suppression is wanted. (In production MD simulation, Parrinello-Rahman barostat is recommended.)continuation = yes
continuation
to yes
. (In the previous NVT equilibration, the default value no
was adopted, since the constraints need to be solved at the first time step.) This setting indicates that this simulation run is the continuation of a previous run (NVT equilibration), in which constraints were applied. To preserve continuity, the constraints should not be re-solved, and the input coordinate should be assumed to satisfy the constraint algorithm being applied.gen_vel = no
With complex_equil.tpr
, we can perform an NPT equilibration by running the following command:
As a result, it also took about 9 to 10 minutes to complete the equilibration. Similarly, using gmx energy
, we can analyze the pressure and the volume as a function of time. As shown below, the pressure of the system is (bar), which seems very different from our target pressure (1 bar). However, the pressure itself is a quantity that fluctuates wildly in a microscopic system. As a result, instantaneous values are going to vary considerably, and average values may not correspond exactly to the desired target value. However, given the huge standard deviation of pressure during NPT equilibration, the average obtained during the simulation is statistically indistinguishable from the target of 1 bar. An additional indicator of the convergence of NPT equilibration is the density or the volume of the system, which are less prone to the fluctuations observed in the instantaneous pressure values. As shown in the figure on the right, the volume shrank to about 209 and the average stopped drifting after about 30 ps of equilibration. (The average volume over the last 100 ps is 208.793 .) Since the volume is essentially constant over the duration of NPT equilibration, it is reasonable to conclude that the system is adequately equilibrated and unrestrained simulation can be initiated.
To provide a reasonable structure as the input to further advanced sampling simulations, after equilibration, we conduct a short MD simulation of 5 ns, which should be sufficient for the system to relax. To run the simulation, first, create another folder short_MD
and in the directory same level as Topology
and execute the following command in the folder:
Similarly, -t
flag is used to preserve the previous work of NPT equilibration. The parameter file (md.mdp
) used here is exactly the same as npt_equil.mdp
except for the number of time steps (2500000, 5 ns instead of 100000, 200 ps) and the barostat (Parrinello-Rahman barostat instead of Berendsen barostat). As mentioned, a P-R barostat is generally recommended in production MD simulations.
Index groups are necessary for almost every GROMACS program. All these programs can generate default index groups. However, when analyzing the data or apply restraint as we will do in the following advanced sampling simulations, sometimes we need index groups other than the ones defined in the default index file. In this case, we have to use make_ndx
command to define our own special index groups.
First of all, we might need to center the whole binding complex in the simulation box when visualizing the trajectory later. While there is a group called "Protein" in the default index file that should include both the protein and the peptide ligand in our case, it actually does not include phosphotyrosine in the ligand, since phosphotyrosine was not recognized by GROMACS but with aids of AmberTools as we perform parameterization previously. Therefore, we have to define a group for the binding complex on our own. To start with, execute the command:
In the prompt, type a 1-1878
so that a new group, Group 24, will be created. To rename this newly created group, type name 24 Binding_Complex
. To save the file and exit, type q
.
Note that, however, we still need other special index groups defined in the index file. Specifically, in both Hamiltonian replica exchange and expanded ensemble, we will have to apply a harmonic distance restraint between the phosphorus atom of PTR and each of the 4 arginine residues that compose the binding pocket to restrain the peptide in the binding pocket to prevent the greasy core problem. Therefore, we need to define these groups using the command below:
Note that in the command above, the newly created index file we just made is passed by the -n
flag and a new index file with the same name will be created. (In this case, the old index file will be renamed as #complex.ndx.1#
.) Without using the flag -n
, the group Binding_Complex
we just defined in the old index file will not appear in the new index file. To define the groups for applying the distance restraints, we first have to find out atoms involved in the 4 arginine residues and the phosphorus atom in PTR. Typically, it is safer to identify these atoms from a .gro
file with the help of VMD visualization (as elaborated in the next section) to make sure that the right atoms are chosen. The following are the commands for choosing the atoms involved in each entity participating the distance restraints to be applied in advanced sampling simulations:
pep7
, not needed for the restraints we are going to use but recommended to be set up): a 1742-1878
, name 25 pep7
P_atom
): a 1817
, name 26 P_atom
Arg1
: ri 15
, name 27 Arg1
Arg2
: ri 34
, name 28 Arg2
Arg3
: ri 36
, name 29 Arg3
Arg4
: ri 56
, name 30 Arg4
After the execution of the command make_ndx
, the bottom part of the index file complex.ndx
should be as follows (the following is the output of the command tail complex.ndx -n 25
):
(Note: the figures and the data shown in Section 5-4-2. are actually from the simulation of the binding complex at pH6, but the visuzliation method and the things we should care about are exactly the same.) After the simulation is complete, it is crucial to see if the simulation is at least visually reasonable. Using Visual Molecular Dynamics (VMD), we are able to visualize the output structure from the short MD simulation. As a result, as you can see below, the box is rectangular rather than dodecahedral as we specified when using editconf
command, which is a result of the fact that GROMACS always keeps the particles in a brick-shaped volume for efficiency even when simulation using a triclinic box. (Note: rhombic dodecahedra are special cases of triclinic unit cells.) One can use trjconv
program to convert the trajectory to a different unit-cell representation. (See GROMACS documentation) for more information.)
From the figure, it can also be observed that the protein is out of the box and detached from the peptide ligand, this is not a result of the peptide unbound from the protein, but a result of the protein crossing the periodic boundaries of the box. (To confirm that, just observe which side of the protein the peptide is on and compare that to the initial PDB structure, say complex_GMX.gro
.) To recenter the protein and wrap the periodic boundaries, execute the command below in the Tk Console in VMD:
As a result shown below, a triclinic representation of the system is recovered, with protein centered.
In addition, we can use the command as follows in Tk Console to save the structure as a .gro
file if needed after the adjustment so that we don't have to adjust the visualization every time for this structure:
As in any simulation conducted with periodic boundary conditions, molecules may appear "broken" or may "jump" back and forth across the box. To recenter the protein and rewrap the molecules within the unit cell to recover the desired rhombic dodecahedral shape, invoke trjconv
in the following two-step procedures (Create the folder animated_traj
in short_MD
to start with.):
-pbc nojump
-pbc mol -ur compact
There are are two prompts for both commands. the first one is "Select group for centering" and the second one is "Select group for output". Enter 24
and 0
sequentially to select Binding_Complex
and System
, respectively.
To visualize the trajectory, execute the command:
As a result, after using trjconv
commands, the dodecahedral box is recovered and no molecules are broken. (Originally, both the protein and the ligand were broken in the trajectory traj_comp.xtc
. Check this by the command vmd complex_md.gro traj_comp.xtc
.)
However, as can be seen in the figure above, the binding complex is not at the center of the box. To fix this, execute the exactly the same command that we've just used in Tk Console in VMD:
Also, it is recommended to output a trajectory (complex_center_vmd.trr
) after the adjustment, so we don't have to recenter the complex every time we visualize it:
(Note that VMD is not about to write an .xtc
file.)
From the animate trajectory, we can see that the peptide is stably bound to the protein and the N-terminal of the peptide has a much larger fluctuation than the C-terminal, which just validated the reason that the first 10 residues were missing in the X-ray crystallography experiment. (Note that only the ASN residue was rebuilt though.)
One important thing to note here is that GROMACS didn't do anything wrong in our simulation. The "problems" we saw before the adjustment by VMD are all just a matter of VMD visualization. The correct physics is still preserved in complex_md.gro
and traj_comp.xtc
. However, having more straightforward visualization, complex_center_vmd.gro
and complex_center_vmd.trr
do not have the correct physics after the adjustment, so the files are only for visualization purposes. To validate this, we can examine the energy as a function of time of the trajectory before and after the adjustment. (Also note that the atom coordinates in complex_center_vmd.gro
are different from the ones in complex_md.gro
.) To get the energy data of complex_md.gro
, execute the following command and enter 13
to choose to output the data of Total-Energy
to the .xvg
file:
However, right now we don't have the energy data of the trajectory after the VMD adjustment (complex_center_vmd.trr
), so we have to execute the following command the generate an .edr
file for the adjusted trajectory:
However, the following error would occur:
Using gmx energy
to get the data of the Coulombic potential (enter 9 to choose Coulombic-(SR)
) of the original trajectory (traj_comp.xtc
), we can obtain that the average of the Coulombic potential is -371339. Therefore, the VMD adjustment didn't change the Coulombic potential too much and the high total energy almost came from the excessively high/erroneous Lennard-Jones potential. Therefore, it is apparent that the trajectory after VMD adjustment is not physically correct and can only serve as a trajectory for visualization purposes.
Similarly, we can also use -rerun
option to get .edr
files for complex_nojump.xtc
and comple_center.xtc
and use gmx energy
to get the energy data.
As shown above, although the energy of complex_center.xtc
and complex_nojump.xtc
seems reasonable, trjconv
command still changed the physics somehow. In addition, for both the "rerun" simulations, the following warning was shown:
Therefore, to extract a snapshot that is physically correct to proceed with further advanced sampling simulations, utilizing the original trajectory traj_comp.xtc
is the safest way.
As mentioned, the goal of this tutorial is to provide input files for Hamiltonian replica exchange and expanded ensemble simulations. As running these two types of advanced sampling simulations in an NPT ensemble would generate problematic results, here we must perform the simulations in an NVT ensemble. Since the vanilla simulation was performed in an NPT ensemble, we have to make sure that the free energy in the NPT vanilla simulation is equivalent to the free energy in the NVT Hamiltonian replica exchange/expanded ensemble simulations. To do this, consider:
Obviously, the Gibbs free energy in an NPT ensemble is equal to the Helmholtz free energy in an NVT ensemble given that the
fixed in the NVT ensemble corresponds to the pressure in the NPT ensemble. Therefore, to conduct an NVT Hamiltonian replica exchange/expanded ensemble simulation, we have to extract a snapshot from the trajectory of the NPT vanilla simulation and make sure that the volume of that snapshot is close to the average volume in the NPT simulation. Extracting the energy data using gmx energy
from md.edr
and using gmx_plot2d.py
to analyze the data, we can obtain that the average volume over 5 ns of MD simulation is 208.8357 and the snapshot that has a volume closet to this value is the snapshot at 3.478 ns, whose volume is 208.8350 . Therefore, we extract this snapshot using the following command (enter 0
to select the whole system in the prompt):
This is the structure file that can finally serve as the input structure of further advanced sampling simulations!
To see how it looks like, we similarly use trjconv
commands as below and use pbc wrap
command in the Tk Console of VMD:
After we finally obtain all the input files required for further advanced sampling simulations, it is recommended to collect all the final input files in one folder. To do this, start with creating the folder Final_inputs
and execute the following commands in the folder:
In this section, we provide the reader a quick look at all the commands used in the preparation process in Section 2 to Section 5 in case that the reader is already familiar with the process.
Modeling of the missing residues
python sequence.py
to get an alignment file.
A:+109
into A:+110
in the sequence file.)python build.py
to model the missing residues
Topology preparation
Topology
.cp ../Build/4k45_fill.B99990001.pdb ./PLCC.pdb
pep7.pdb
.PLCC.pdb
with H++ web server (pH = 7.4).PLCC.top
and PLCC.crd
. Convert the files by executing python GAFF/acpype.py -x PLCC.crd -p PLCC.top
.pep7.pdb
to ATOM records.pep7.top
and pep7.crd
. Convert the files by executing python GAFF/acpype.py -x pep7.crd -p pep7.top
.pep7_GMX.gro
and append to the bottom of complex_GMX.gro
. Change the numer of atoms from 1744 to 1878 and save the file.complex_GMX.top
and pep7_GMX.top
and compare the [ atomtypes ]
directives, append the atom types that only appear in pep7_GMX.top
to the [ atomtypes ]
directive of complex_GMX.top
. The appdended atom types are:
[ defaults ]
, [ atomtypes ]
, [ system ]
and [ molecules ]
in pep7_GMX.top
, including [ moleculetype ]
, [ atoms ]
, [ pairs ]
, [ bonds ]
, [ angles ]
and [ dihedrals ]
(including the propers and the impropers). Paste all these directives to complex_GMX.top
.[ system ]
from PLCC
to complex in water
and add pep7 1
to the bottom of the file. Save the file to complete the combination of the topology files.complex_GMX.top
[ system ]
directive, add:
[ atomtypes ]
directive:
Solvation and neutralization of the system
gmx editconf -f ../Topology/complex_GMX.gro -o complex_box.gro -bt dodecahedron -d 1.0 -c
gmx solvate -cp complex_box.gro -p ../Topology/complex_GMX.top -o complex_sol.gro -cs
[ molecules ]
directive in complex_GMX.top
is well-formatted.ions.mdp
and execute gmx grompp -f ions.mdp -c complex_sol.gro -p ../Topology/complex_GMX.top -o complex_ions.tpr -maxwarn 1
ions.mdp
:
gmx genion -s complex_ions.tpr -o complex_ions.gro -p ../Topology/complex_GMX.top -pname NA -nname CL -np 26 -nn 26
to add counterions. (Eneter 15
in the prompt to choose SOL
.)Energy minimization, equilibration and MD simulation
em.mdp
and execute gmx grompp -f em.mdp -c ../Solvation/complex_ions.gro -p ../Topology/complex_GMX.top -o complex_em.tpr
em.mdp
:
gmx mdrun -s complex_em.tpr -o complex_em.trr -c complex_em.gro -g em.log -e em.edr
.nvt_equil.mdp
and execute gmx grompp -f nvt_equil.mdp -c ../../EM/complex_em.gro -p ../../Topology/complex_GMX.top -o complex_equil.tpr
.nvt_equil.mdp
:
gmx mdrun -s complex_equil.tpr -o complex_equil.trr -c complex_equil.gro -g equil.log -e equil.edr
.npt_equil.mdp
and execute gmx grompp -f npt_equil.mdp -c ../NVT/complex_equil.gro -t ../NVT/state.cpt -p ../../Topology/complex_GMX.top -o complex_equil.tpr
gmx mdrun -s complex_equil.tpr -o complex_equil.trr -c complex_equil.gro -g equil.log -e equil.edr
md.mdp
and execute gmx grompp -f md.mdp -c ../Equil/NPT/complex_equil.gro -p ../Topology/complex_GMX.top -o complex_md.tpr -t ../Equil/NPT/state.cpt
md.mdp
:
gmx mdrun -s complex_md.tpr -o complex_md.trr -c complex_md.gro -g md.log -e md.edr
Postprocessing of the MD simulation
Preparation of the index file
Execute gmx make_ndx -f complex_md.gro -o complex.ndx
. In the prompt, specify the index groups as follows:
Extraction of a reasonable snapshot
gmx energy -f complex_md.gro -o volume.xvg
and select Volume
in the prompt.gmx_plot2d
to plot the volume as a function of time and determine the average volume and the sanpshop having a volume cloest to the average volume.gmx trjconv -s complex_md.tpr -f traj_comp.xtc -o complex_avg.gro -dump 3478
to extract the snapshot.Collection of all the final input files
Execute the following commands
This is the end of the first part of the tutorial! For detailed methods for preparing simulation inputs for advanced sampling methods and the application of the workflows introduced here to other similar systems, please refer to the second part of the tutorial.
If you found the article useful, you're very welcome to share the article or leave a comment below so that I will be motivated to write more articles! If you have any questions about the tutorial, please feel free to contact weitse.hsu@colorado.edu. Thank you for reading this far!