AptaLoop - HackMD

# Team DTU-Denmark 2023 Software Tool: AptaLoop ## Description This repository contains code for aptamer design _in silico_ developed by the **DTU Biobuilders 2023** iGEM team. The AptaLoop pipeline consists of 4 different modules: 1a. Secondary/tertiary structure prediction 1b. Making Aptamers Without Selex (MAWS) 2. Docking 3. Molecular Dynamics # AptaLoop ## DTU BioBuilders 2023 ### General Introduction Single-stranded DNA and RNA can be designed to bind with high affinity to target molecules (Ref). These types of nucleic acids are called **aptamers** and are generally found through a laborious and costly experimental procedure named **SELEX** (Systematic Evolution of Ligands by Exponential Enrichment). This experimental selection process is an iterative one where random sequence libraries are used to find and enrich aptamers that bind to the target molecule. Several established _in silico_ methods for aptamer-ligand interactions are currently used in the literature (Refs). These include folding predictions, docking simulations, and molecular dynamics. These methods can be used to further understand how the aptamer binds to the ligand and can aid in optimizing aptamer sequences. Some _in silico_ methods have also been developed by past iGEM teams. For example, the Heidelberg iGEM team in 2015 developed an _in silico_ alternative to SELEX called **MAWS** (Making Aptamers Without SELEX), where _de novo_ aptamer sequences could be generated through an entropy minimization algorithm. This software was then updated in 2017 to include further functionality and adaptability. However, in 2023 the code that this team created was non-functional and outdated. Our team mission was to take the main _in silico_ methods for aptamer-ligand dynamics (folding, docking, and MD) and integrate them into a user-friendly pipeline. Furthermore, we wanted to improve the MAWS software by updating the code, adding new functionality, making the software streamlined for future iGEM teams, and integrating it into our pipeline. Thus, our pipeline, named **AptaLoop**, takes an input molecule, generates a _de novo_ aptamer, and then runs this aptamer and the target molecule through a series of docking and MD simulations, providing researchers with a holistic view of aptamer-ligand dynamics. --- ### MAWS **MAWS** (Making Aptamers Without SELEX) functions by using an entropy minimization algorithm. This algorithm was based on a paper published in 2011 by Tseng _et al._ where entropic minimization was used as a metric for binding stability in aptamer-ligand binding dynamics. This approach has been named the **entropic fragment-based approach (EFBA)** and was adapted by the Heidelberg 2015 and 2017 iGEM teams to create the software known as MAWS. We can begin by taking the equation for relative entropy (Kullback-Leibler divergence), which is a measure of how a probability distribution $ P $ diverges from another distribution $ Q $: $$ S[P|Q] = -\int P \cdot \log\left(\frac{P}{Q}\right) dX $$ Distribution $ P $, in this case, would be the probability of an aptamer to bind to our target molecule, and $ Q $ would be a uniform distribution to compare our aptamer to. To adapt this concept to include relevant physical parameters, we can replace our theoretical relative entropy equation with a novel entropy-like distance metric based on thermodynamic parameters. $$ P(x) = \frac{e^{-\beta \cdot \Delta G}}{\mathcal{Z}} $$ Here we define the probability of observing conformation X of the aptamer-ligand complex and introduce the inverse temperature $\beta$ and Gibbs free energy. We also introduce $\mathcal{Z}$, which is defined as: $$ \mathcal{Z} = \int e^{-\beta \Delta G(x)} dx $$ $\mathcal{Z}$, or the partition function, is a key member of the canonical ensemble, which is a way to describe the thermodynamic properties of a system in thermal equilibrium. We can use the canonical ensemble to describe the probability distribution of aptamers in different conformations to assess their thermodynamic feasibility. Specifically, $\mathcal{Z}$ is the sum over all possible states _i_ that the system can be in, and acts as a normalization factor for the expression as it ensures that the probabilities P(x) sum to 1 across all states. This makes P(x) a valid probability distribution, allowing us to compare the likelihood of different states directly. Thus, given: $$ S[P|Q] = -\int P \cdot \log \left( \frac{P}{Q} \right) d\vec{\chi} $$ We can replace Q with our uniform distribution and use a natural logarithm to yield: $$ \overline{S}[P] := -\int -P \cdot \beta \Delta G(X) + \ln(QZ) $$ From these equations we can derive the following distance metric: $$ g[P,P'] = \left| \bar{S}[P'] - \bar{S}[P] \right| $$ This distance metric serves to give basis for aptamer selection, as the conformation P(x) with the higher probability is selected through its entropic favorability. In essence the selection through entropic criteria can be explained through the example of a rock and a magnet. If a rock is thrown against a magnet, the resultant collision will be highly disordered, which is to say the entropy of this system will be high. If another magnet is thrown instead of a rock, then the resultant system will be less disordered than the rock, as there is a degree of attraction between the two objects. Thus, the selection criteria for the aptamer sequence is the system calculated to have the minimal amount of entropy. What this means is that a lower degree of disorder is taken as a metric for binding stability, as lower entropy indicates more stability in aptamer ligand binding. In summary, lower entropy increases the probability of binding P(x), which through the distance metric, is compared to alternative conformations and the best overall conformation is selected for. # MAWS Algorithm The MAWS algorithm functions through iterative nucleotide sampling over a target molecule within a bounded box. The process begins with the addition of one of four nucleotides (A, T, C, G) and iterates each nucleotide over a series of binding positions around the target molecule. The energies calculated for each position is saved, and parameters Z, P, and S are all computed and the minimal entropy aptamer selection criteria is run to choose the aptamer (figure 1). ![](https://static.igem.org/mediawiki/2015/d/d3/Algorithm_flowchart.svg) Figure 1: Flowchart outlining MAWS algorithm (https://2015.igem.org/Team:Heidelberg/software/maws) This results in a nucleotide being chosen based on its overall thermodynamic favorability. This process is repeated by adding another nucleotide module to the previous nucleotide and repeating the process, such that the adjoined 2 nucleotide aptamer has its thermodynamic favorability assessed in each location for each base pair, with the best overall nucleotide being chosen to be the second nucleotide in the aptamer. It is important to note that this method of selection does not yield a binding location, as it selects nucleotides based on a generalized thermodynamic favorability. ## Our improvements to the code We first attempted to run the MAWS algorithm from the **Heidelberg 2017 iGEM team** ([github repo](https://github.com/igemsoftware2017/AiGEM_TeamHeidelberg2017/tree/master/sharksome-suite)) following the [guide](https://docs.google.com/document/d/1VpqD0gc2ZrxZVhDIr6PMhXtEJ7jFILcskNtPZiLjlmw/edit ) that the **iGEM NU Kazakhstan 2022 team** made last year. However, we could not manage to make it work as it was, so we had to do some modifications to make it functional again. Mainly, the problem was related to package version incompatibilities, which was a bit difficult to debug since the original software did not contain a requirements.txt file. Also, some minor changes were needed to some path variables and functions. When we had the first version running, we decided to first improve it by translating the code from python2 to **python3** and correct some indentation issues to make the software easier to use and maintain in the future. We then proceeded by implementing the possibility of generating **DNA aptamers**, so that the user was not limited to RNA. And we also added more types of ligand molecules, specifically **lipids** and **small molecules**, so that the user was not limited to choosing a protein for the aptamer target. In order to properly add all these new capabilities to the software, we also incorporated the **appropriate force fields** to use depending on each case. Other improvements include correcting the output naming, fixing compatibility issues within some functions and updating the argument parser to add more options (type of aptamer, type of ligand and conda environment name). Importantly, we generated an **environment.yml** file in order to make the environment reproducible and compatible in any other machine. Finally, we created a **Jupyter implementation** to make the software easier to use and to prove the reproducibility of our conda environment. # Docking Molecular docking is a computational technique used in structural biology and drug discovery to predict how two or more molecules interact with each other and form a stable complex. Docking algorithms aim to predict the most energetically favorable binding mode or conformation of the ligand within the protein's binding site. In most cases, docking is used to study the binding site and binding affinity of proteins and ligands, but other molecules can also be explored. In the case of AptaLoop, we performed docking between RNA molecules (our aptamers) and a specific ligand (PFOA). Several software tools have been developed to perform docking. For AptaLoop, we chose AutoDock Vina, one of the most widely used docking engines, because of its ease of implementation, known accuracy, and because it is open source, allowing for all the users of AptaLoop to perform docking without any restrictions. In order to test the accuracy of the results we obtained with AutoDock Vina, we also performed docking between our aptamers and PFOA using Haddock. We deemed it possible that Haddock was more reliable, since it allows for the tuning of plenty of parameters depending on the input, but it is more complex to configure and requires the users to ask for licenses. Thus, we decided not to include Haddock in AptaLoop, but to use it as a way to validate the AutoDock Vina results. ## AutoDock Vina As stated above, we chose AutoDock Vina for AptaLoop because we considered that it would be very easy for everybody to implement. It only requires structure files of the molecules being docked and the specification of the search space in which to look for the binding site. Besides, even if it is mostly used for protein-ligand docking, it was specifically designed for receptor-ligand docking, so it is also applicable to aptamer-ligand docking. First, it is important to ascertain that both the receptor and ligand files contain all the hydrogen atoms present in the molecule. While the positions of H in the output are arbitrary because AutoDock Vina uses a scoring function that only involves the heavy atoms (united-atom scoring function), the H in the input files are used to determine which atoms can be H bond donors or acceptors. AutoDock Vina requires PDBQT files for both the receptor and the ligand. These files can be obtained both from PDB (more suitable for the receptors) or SDF (more suitable for small ligands) files using different methods. For AptaLoop, we chose the Openbabel software because it was efficient and easy to implement in a pipeline (reference). AutoDock Vina searches for possible sites and positions of docking within a defined grid that the user has to provide as input. The more reduced the grid is, the less the computation effort and the better the results. However, as was the case with AptaLoop, if the user cannot determine the approximate area where the ligand will bind, it is necessary to define a grid that covers the whole receptor. A very large search space can lead to increased computational demands (and therefore slower docking), but, most importantly, it implies a risk of missing potential binding sites and reduced accuracy. Thus, if possible, it is always recommended to constrain the searching space as much as possible. A good approach for the large grid problem is to first run AutoDock Vina a few times searching for the binding site across the totality of the receptor, and then selecting the best binding site and run the software again, narrowing down the searching space to the area of the predicted binding site. How to define the grid? What values to use? AutoDock Vina asks for “X center”, “Y center”, and “Z center”, the coordinates of the center of the center of the grid in the three-dimensional space; and for “X size”, “Y Size”, and “Z size”, which determine the dimensions of the grid box along each axis. All these values can be determined using Chimera (link) (exaplain how with details?) AutoDock Vina calculates the energy associated with each ligand pose, considering factors like steric clashes, hydrogen bonds, and electrostatic interactions. The software then ranks these poses based on their calculated energy scores, providing insights into the most likely binding modes and estimating the binding affinity between the ligand and protein. To do so, it uses an empirical scoring function that aims to estimate the binding affinity of a receptor-ligand complex. It takes into account Van der Waals interactions (attractive and repulsive forces between atoms, helps to evaluate steric clashes), hydrogen bonding (hydrogen bond interactions), and electrostatic interactions. Interestingly, AutoDock Vina ignores the partial charges supplied by the users in the input files. Instead, it addresses electrostatic interactions through the hydrophobic and hydrogen bonding terms. See the documentation for details of the scoring function. The software first places the ligand in an initial position, and uses an optimization algorithm (BFGS) to refine its position and orientation within the binding site. For each ligand pose, AutoDock Vina calculates uses the scoring function to calculate the energy. The sites with highest binding affinity get the lower scores. ## Haddock Haddock is a computational software developed by researchers at the University of Amsterdam and the VU University Medical Center in the Netherlands that stands out for its ability to predict and model biomolecular interactions with exceptional accuracy. Moreover, Haddock allows customization of the docking run to a particular scenario, such as protein-ligand, protein-protein, protein-nucleic acid, or, as in our case, nucleic acid-ligand. Haddock follows 3 steps: it0 (in which the interacting partners are treated as rigid bodies, with all geometrical parameters frozen), it1 (in which flexibility is introduced to the interacting partners, and a third optional step that performs refinement in cartesian space with an explicit solvent (water). For evaluating the binding affinity between receptor and ligand, Haddock calculates a score using a linear combination of various energies and surface area. For more information, refer to: https://www.bonvinlab.org/software/haddock2.4/scoring/ . The best structure gets the lowest score. In the particular case of RNA-ligand docking, it is recommended to adjust the parameters the following way: | Description | Parameter | Protein/Small molecule | Protein/DNA | DNA/small molecule | |------------------------------------------------------|---------------------|-------------------------|-------------|--------------------| | Dielectric constant for it0 | dielec_0 | rdie | cdie | rdie | | Dielectric constant for it1 | dielec_1 | rdie | cdie | rdie | | MD steps for rigid body high temperature TAD | initiosteps | 500 | 0 | 500 | | MD steps during first rigid body cooling stage | cool1_steps | 500 | 0 | 500 | | Initial temperature for second TAD cooling step | tadinit2_t | 1000 | 500 | 1000 | | with flexible side-chain at the interface | | | | | | Initial temperature for third TAD cooling step | tadinit3_t | 1000 | 300 | 1000 | | with fully flexible interface | | | | | | Weight of the intermolecular van der Waals energy | w_vdw_0 | 0.01 | 1 | 0.01 | | for scoring at the rigid-body docking stage | | | | | | Weight of the intermolecular electrostatic energy | w_elec_2 | 0.2 | 0.1 | 0.2 | | for scoring at the final stage | | | | | | Dielectric constant for rigid-body docking | epsilon_0 | 10 | 10 | 78 | | Dielectric constant for the semi-flexible refinement | epsilon_1 | 10 | 10 | 78 | ## Results: FluoroLoop is based on the binding of an aptamer to PFOA, an event that triggers the conformational change of the TMS and allows for PFAS detection. Thus, it was very important to perform docking between PFOA and the aptamers we wanted to work with and evaluate if the experiments would be successful according to computational tools. Thus, we performed docking using AutoDock Vina for all ten aptamers described in (paper), selected after performing systematic evolution of ligands by exponential enrichment (SELEX)as well as for the aptamer that MAWS specifically designed for PFOA. To test the veracity and reproducibility of the results, we used the same molecules to perform docking using Haddock. The results are displayed in Table _ : | | Haddock | AutoDock Vina | |---------|-------------|---------| | Aptamer 1 | 6.9 ± 6.4 | -6.693 | | Aptamer 2 | -0.1 ± 5.6 | -6.547 | | Aptamer 3 | 3.3 ± 7.5 | -6.574 | | Aptamer 4 | 2.7 ± 3.7 | -6.626 | | Aptamer 5 | -4.1 ± 2.4 | -6.446 | | Aptamer 6 | 3.1 ± 5.5 | -6.271 | | Aptamer 7 | 8.0 ± 2.1 | -5.323 | | Aptamer 8 | -1.6 ± 1.0 | -7.439 | | Aptamer 9 | 21.1 ± 4.1 | -5.665 | | Aptamer 10 | -0.7 ±-1.9 | -6.657 | | Aptamer MAWS | 9.1 ± 6.2 | -6.082 | As can be observed in Table _, the aptamer that binds PFOA with the highest affinity according to AutoDock Vina is Aptamer 8, while Haddock points to Aptamer 5. Taking into account both docking methods, it is possible to conclude that aptamers 8, 5, 10, and 2 bind PFOA with the highest affinity, while aptamers 9 and 7 do a terrible job at it. These results differ to the conclusions found by (cite) et al, who stated that the aptamers with the highest affinity to PFOA were 2, 8, and 3, and the aptamers with the lowest binding affinity to said ligand were 1, 5, and 6. After the docking results, we decided to analyze the predicted binding sites for the best aptamers (i.e., aptamers 2, 5, 8, and 10), as well as for the aptamer designed by MAWS. For aptamers 5, 8, and 10, Haddock and AutoDock Vina predict very similar binding sites for PFOA, but they orient the ligand in opposite directions. In the case of aptamer 2, the binding sites for the best results are different, but the second best binding site for AutoDock Vina agrees with Haddock’s best prediction. The fact that the predictions from AutoDock Vina and Haddock are not far from each other allows us to rely on the AutoDock Vina results from AptaLoop. # Molecular Dynamics Molecular dynamics was performed to understand the behavior of our aptamer in water. It was performed on the aptamer generated by MAWS (and used by the korean paper?). Molecular dynamics (MD) had its roots in the late 1950s and early 1960s when pioneers like Alder, Wainwright, and Rahman developed it as one of the earliest simulation techniques. Over time, MD has evolved into a crucial tool in the fields of physics and chemistry, thanks to remarkable advancements in computer technology and algorithmic improvements. Since the 1970s, MD has been extensively utilized to probe the structure and behavior of large molecules, including proteins and nucleic acids. MD methods fall into two primary categories, each characterized by the model and mathematical framework used to represent physical systems. In the "classical" mechanics approach to MD simulations, molecules are treated as classical entities, resembling the familiar "ball and stick" model. In this representation, atoms are envisioned as soft balls, and the bonds between them are depicted as elastic sticks, with the system's dynamics governed by classical mechanics principles. Conversely, "quantum" or "first-principles" MD simulations, pioneered in the 1980s by Car and Parinello, explicitly account for the quantum nature of chemical bonds. These simulations involve the computation of the electron density function for valence electrons, which are pivotal in bonding, using quantum equations. Simultaneously, the dynamics of ions (nuclei and inner electrons) are tracked using classical mechanics. For conducting MD simulations, a variety of software programs are available, including both licensed options like MAPS and Discovery Studio, as well as free alternatives like AMBER, CHARMM, and GROMACS. We chose GROMACS for its accessibility, extensive documentation, and compatibility with other tools like PyMol. Moreover, evidence of its effectiveness in aptamer optimization (as seen in references [1] and [2]) further supported our choice. GROMACS finds application in various scenarios, including: - Investigating protein/aptamer-ligand interactions. - Stabilizing DNA aptamers through ligand binding. - Developing DNA aptamers for visualization. - Computational modeling of peptides. - Exploring peptide-aptamer binding. - In silico design processes. The MD simulation process begins with the creation of a 3D model for the PFAS molecule and the aptamer sequence. This involves generating crucial components such as the molecule's topology, a position restraint file, and a processed structure file. Subsequently, the simulation box dimensions are defined using the editconf module, and water is added to the box using the solvate module. To maintain charge neutrality, ions are introduced as needed. Prior to initiating the dynamic simulation, thorough checks are conducted to identify any clashes or geometry-related issues in the system. This validation process, termed energy minimization (EM), involves compiling the structure, topology, and simulation parameters into an input file. Once these preliminary steps are completed, temperature and pressure equilibration are performed to prevent potential system anomalies. Following this comprehensive preparation, the molecular dynamics simulation can be executed, and the resulting data can be analyzed. In the context of aptamers, the choice of force fields is crucial. AMBER force fields, tailored for RNA or DNA aptamers, are considered among the best options. Additionally, specific water models such as TIP3P, SPC/E, and TIP4P/Ewald are incorporated to enhance the accuracy of the simulations.