# Molecular Docking
*Molecular docking is used to solve binding energy calculations and visualize binding conformations between ligands and receptors*
---
Tools: `Pymol`, `p2rank`, `Foldseek`, `ESMFold`, `Vina`, `MGLtools`.
# Step 1. Domain structure prediction
Using the `ESMFold` to predict the 3D structure by **API** server:
`cat txt.list | parallel -j 1 'curl -X POST --data "@{}" https://api.esmatlas.com/foldSequence/v1/pdb/ > {.}.pdb'`
**txt.list** include all amino-acid file names and **"@{}"** means enter the file by its name.
plus, if there is an error with the **API** network, try adding the `-k` parameter.
> **API** server just allow 400aa maximum length, if you want to predict more than it, please using the **Colab** in **Google** or build your own `ESMFold` platform (**Requires CUDA**)
# Step 2. Domain structure network construction
Using the `Foldseek` to caculate the similarity score by structure **easy-search** algorithm:
`foldseek easy-search PDB_database PDB_database outfile tmp --exhaustive-search -e 1e-5 --format-output "query,target,fident,alnlen,evalue,bits,alntmscore,lddt"`
The example used **all to all** search method, you should create index by `Foldseek` **createdb/createindex** module. Finally, we got the **bits** and normalization by custome `Python` script:
```
import pandas as pd
import argparse
def process_csv(input_csv, output_csv):
# 读取CSV文件
df = pd.read_csv(input_csv)
# 对每个源进行分组,并按权重降序排序
df = df.sort_values(['source', 'weight'], ascending=[True, False])
# 计算每个源的最大权重,并将其缩减到300
max_weights = df.groupby('source')['weight'].transform(max)
reduction_factor = 300 / max_weights
# 应用缩减因子
df['weight'] = df['weight'] * reduction_factor
# 输出CSV文件
df.to_csv(output_csv, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Process a CSV file.')
parser.add_argument('--input', type=str, help='Input CSV file')
parser.add_argument('--output', type=str, help='Output CSV file')
args = parser.parse_args()
process_csv(args.input, args.output)
```
Why we do this? After getting the weight score, all the weights are not uniform among themselves, we need to normalization , i.e. set the standard with the maximum value (the highest score for yourself over yourself). This script makes the standard value of the maximum value **300** (cf. `vContact2` maximum value), which can also be customized.
# Step 3. Network visualization
Build **node.csv** & **edge.csv** with gephi visualization. You need konw **edge.csv** include **source**, **target** and **weight** column, they were shown in your `Foldseek` output table: **query**, **subject** and **bits**. When you import your **edge.csv** file, you can get the **Node.csv** file, and than, you will add extra informations in your **Node** file.
# Step 4. Molecular Docking
## (1) Predicted active pocket which is the predicted ligand location.
The `p2rank` software is used to predict the active pockets, which produces a **prediction.csv** file with the center coordinates of the active box, and the box side lengths are set uniformly according to the size of the small ligand.
`./prank predict ARGs.ds -m alphafold.model -o P2rank`
>ARGs.ds represents a list containing the names of all **pdb** files in the folder; `-m` specifies the model; `-o` specifies the output path.
## (2) Processing of ligands and receptor.
You should download your ligand through the Zinc website here (https://zinc.docking.org/).
Using **prepare_ligand4.py** & **prepare_receptor4.py** in `mgltools`, the steps are rotational bond customization for ligands, hydrogenation and dehydration and hydrogenation of receptors (proteins or domains).
`prepare_receptor4.py -l penicillin.pdb `
`-l` need to add the **pdb** files.
`prepare_ligand4.py -r IMGVR.pdb -A hydrogens -U waters`
`-A` represents the addition of hydrogen,`-U` represents the removal of water.
## (3) Molecular Docking ligands to receptors.
We using `Vina` to molecular docking. According the active pocket, we decide the active box parameters contain **X_size**, **Y_size** and **Z_size** (depend on your size of ligands). the most important parameters are **xyz coordinates** which showed in `p2rank` output files.
`vina --receptor 1bue_ENTEROBACTER_PF13354.pdbqt --ligand penicillin.pdbqt --center_x 18.5401 --center_y 18.1729 --center_z 6.8200 --size_x 20.0 --size_y 20.0 --size_z 20.0 --cpu 8 --exhaustiveness 32 --num_modes 9 --energy_range 4 --log 1bue_ENTEROBACTER.log --out 1bue_ENTEROBACTER_.out`
Here, my active box information included box size 20.0 and center coordinates. There are two output file: **log file** represented the **affinity** and **out file** represented the **new ligand pdb** file which is located in receptor.
We can write a `Python` script to run molecular docking in bulk:
```
import os
import subprocess
import pandas as pd
#pdbqt文件所在的目录
pdbqt_directory = '/mnt/d/Molecular_docking'
#csv文件所在的目录
csv_directory = '/mnt/d/Molecular_docking/P2rank'
commands = []
for filename in os.listdir(pdbqt_directory):
if filename.endswith(".pdbqt"):
receptor = filename
csv_file = os.path.join(csv_directory, filename.split('.')[0] + '.pdb_predictions.csv')
if not os.path.exists(csv_file):
print(f"CSV file {csv_file} does not exist.")
continue
try:
df = pd.read_csv(csv_file)
except Exception as e:
print(f"Failed to read CSV file {csv_file}: {e}")
continue
df.columns = df.columns.str.strip() # 去除列名中的空格
df['name'] = df['name'].str.strip() # 去除'name'列中值的空格
row = df[df['name'] == 'pocket1'].iloc[0]
center_x = row['center_x']
center_y = row['center_y']
center_z = row['center_z']
log_file = receptor.split('.')[0]
command = f"vina --receptor {os.path.join(pdbqt_directory, receptor)} --ligand P2rank/penicillin.pdbqt --center_x {center_x} --center_y {center_y} --center_z {center_z} --size_x 20.0 --size_y 20.0 --size_z 20.0 --cpu 8 --exhaustiveness 32 --num_modes 9 --energy_range 4 --log {log_file}.log --out {log_file}.out"
commands.append(command)
#使用parallel来并行运行vina命令,每次运行2个线程
for i in range(0, len(commands), 2): # 将commands列表分解成更小的列表
commands_str = '\n'.join(commands[i:i+2])
subprocess.run(f"echo \"{commands_str}\" | parallel -j 2", shell=True)
```
## (4) Visualization
Just using 3D structure visualization software as you like! I recommend `pymol`.
# Reference
Krivák, R., Hoksza, D., 2018. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics 10, 39. https://doi.org/10.1186/s13321-018-0285-8
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., Rives, A., 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130. https://doi.org/10.1126/science.ade2574
Irwin, J.J., Tang, K.G., Young, J., Dandarchuluun, C., Wong, B.R., Khurelbaatar, M., Moroz, Y.S., Mayfield, J., Sayle, R.A., 2020. ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model. 60, 6065–6073. https://doi.org/10.1021/acs.jcim.0c00675
van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., Steinegger, M., 2023. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 1–4. https://doi.org/10.1038/s41587-023-01773-0