Duc Nguyen - Colby College '24
Updated on 2024-01-08
# Tips for protein searching on MGnify database
#### **What is MGnify ?**
>The MGnify protein database comprises non-redundant sequences predicted from assemblies generated from publicly available metagenomic datasets. It is a good place to search for putative proteins from uncultured unknown organisms, mostly bacteria and archaea. The current version is 2023_02, containing nearly 3 billion sequences and over 700 million clusters.
The developer provided a user-friendly website [here](https://www.ebi.ac.uk/metagenomics/sequence-search/search/phmmer) that uses HMMER to search against the database ([details](https://emg-docs.readthedocs.io/en/latest/sequence-search.html)) However, only version 2018_06 is available. The MGnify browser also does not support ID-based protein searching.
I came up with a fairly robust pipeline below to search for homologs of proteins of interest and the contig/scaffold they came from. It works on any version of the MGnify database for now.
### Download database
- Download the MGnify version of your interest:
+ All databases are hosted on European Bioinformatics Institute's FTP site [here](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database).
+ In order to download the 2023_02 version, use `wget` or `aria2c` (faster) to get the `mgy_clusters.fa` file. This file contains all sequences with their unique IDs (MGYP).
```
wget -nc http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/mgy_clusters.fa.gz
gunzip mgy_clusters.fa.gz
```
+ Also download the `mgy_assemblies.tsv` file, which contains accessions of assemblies (ERZ) in which each sequence is found.
```
wget -nc http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/mgy_assemblies.tsv.gz
gunzip mgy_assemblies.tsv.gz
```
### Sequence search
- You can use any sequence-based search tool, eg. BLAST, HMMER, etc. Since building a BLAST database from this giant amount of sequences is not recommended, I use HMMER instead.
+ Install HMMER locally follwing [this guide](http://hmmer.org/documentation.html).
+ Search with pHMMER, using parameters similar to the MGnify website. Record IDs of the best hits.
```
phmmer -E 1 --domE 1 --incE 0.01 --incdomE 0.03 --mx BLOSUM62 \
--pextend 0.4 --popen 0.02 --cpu 8 \
--tblout <output file in tbl format> \
<input sequence in FASTA format> mgy_clusters.fa
```
+ If you want to search using multiple proteins from the same family, it's better to make a multiple sequence alignment with MUSCLE/MAFFT/Clustal Omega, build a HMM-profile with `hmmbuild`, and search against the database using `hmmsearch` with this protein profile as input.
### Full protein sequence
- The output of HMMER does not contain the full amino acid sequences of the hits, as its search algorithm is domain-based.
+ You can use `samtools` or `seqkit grep` to obtain the full sequence given the ID *from a 120 GB file* :grimacing:.
+ A much faster way is to obtain the full sequence via ESM Metagenomic Atlas API (interactively or via `curl`). It was synchronized with the MGnify 2023_02 release.
`curl https://api.esmatlas.com/fetchSequence/<MGYP ID>`
### ORF and Scaffold
- Search for the respective assembly ID (ERZ) from `mgy_assemblies.tsv`. Simple `grep` should work.
`grep -m 1 <MGYP ID> mgy_assemblies.tsv`
- Next, search for the assembly files from the European Nucleotide Archive (ENA) browser using ERZ ID. Access via `https://www.ebi.ac.uk/ena/browser/view/<ERZ ID>`. Download all the files and unzip them (again, use `aria2c` or `wget`).
If you want more detailed information about the assembly, go to `https://www.ebi.ac.uk/metagenomics/assemblies/<ERZ ID>`
- Unzip the assembly (concatenate into a single FASTA file if necessary), and make a BLAST database from it.
```
module load blast+
makeblastdb -in <assembly FASTA file> -dbtype nucl -out <database name>
```
- Search for the scaffold where the protein sequence resides with `tblastn`.
```
tblastn -query <MGYP sequence in FASTA file> \
-db <database name> -outfmt 6 -out <result file>
```
The scaffold containing the putative protein should be the top hit with E-value = 0 and identity score = 100.
- Obtain the scaffold sequence from the BLAST database with `samtools`, `seqkit`, or `blastdbcmd`. For example,
```
seqkit grep -r -p <scaffold ID> <assembly FASTA file> > <output FASTA file>
```
:::info
I tried to make the steps fairly easy to be combined into a *bash* script or so. Additional codes are needed between some steps, e.g. extract the protein sequence and assembly ID automatically. Have fun :smiley:.
:::