Tips for protein searching on MGnify

Duc Nguyen - Colby College '24 Updated on 2024-01-08 # Tips for protein searching on MGnify database #### **What is MGnify ?** >The MGnify protein database comprises non-redundant sequences predicted from assemblies generated from publicly available metagenomic datasets. It is a good place to search for putative proteins from uncultured unknown organisms, mostly bacteria and archaea. The current version is 2023_02, containing nearly 3 billion sequences and over 700 million clusters. The developer provided a user-friendly website [here](https://www.ebi.ac.uk/metagenomics/sequence-search/search/phmmer) that uses HMMER to search against the database ([details](https://emg-docs.readthedocs.io/en/latest/sequence-search.html)) However, only version 2018_06 is available. The MGnify browser also does not support ID-based protein searching. I came up with a fairly robust pipeline below to search for homologs of proteins of interest and the contig/scaffold they came from. It works on any version of the MGnify database for now. ### Download database - Download the MGnify version of your interest: + All databases are hosted on European Bioinformatics Institute's FTP site [here](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database). + In order to download the 2023_02 version, use `wget` or `aria2c` (faster) to get the `mgy_clusters.fa` file. This file contains all sequences with their unique IDs (MGYP). ``` wget -nc http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/mgy_clusters.fa.gz gunzip mgy_clusters.fa.gz ``` + Also download the `mgy_assemblies.tsv` file, which contains accessions of assemblies (ERZ) in which each sequence is found. ``` wget -nc http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2023_02/mgy_assemblies.tsv.gz gunzip mgy_assemblies.tsv.gz ``` ### Sequence search - You can use any sequence-based search tool, eg. BLAST, HMMER, etc. Since building a BLAST database from this giant amount of sequences is not recommended, I use HMMER instead. + Install HMMER locally follwing [this guide](http://hmmer.org/documentation.html). + Search with pHMMER, using parameters similar to the MGnify website. Record IDs of the best hits. ``` phmmer -E 1 --domE 1 --incE 0.01 --incdomE 0.03 --mx BLOSUM62 \ --pextend 0.4 --popen 0.02 --cpu 8 \ --tblout <output file in tbl format> \ <input sequence in FASTA format> mgy_clusters.fa ``` + If you want to search using multiple proteins from the same family, it's better to make a multiple sequence alignment with MUSCLE/MAFFT/Clustal Omega, build a HMM-profile with `hmmbuild`, and search against the database using `hmmsearch` with this protein profile as input. ### Full protein sequence - The output of HMMER does not contain the full amino acid sequences of the hits, as its search algorithm is domain-based. + You can use `samtools` or `seqkit grep` to obtain the full sequence given the ID *from a 120 GB file* :grimacing:. + A much faster way is to obtain the full sequence via ESM Metagenomic Atlas API (interactively or via `curl`). It was synchronized with the MGnify 2023_02 release. `curl https://api.esmatlas.com/fetchSequence/<MGYP ID>` ### ORF and Scaffold - Search for the respective assembly ID (ERZ) from `mgy_assemblies.tsv`. Simple `grep` should work. `grep -m 1 <MGYP ID> mgy_assemblies.tsv` - Next, search for the assembly files from the European Nucleotide Archive (ENA) browser using ERZ ID. Access via `https://www.ebi.ac.uk/ena/browser/view/<ERZ ID>`. Download all the files and unzip them (again, use `aria2c` or `wget`). If you want more detailed information about the assembly, go to `https://www.ebi.ac.uk/metagenomics/assemblies/<ERZ ID>` - Unzip the assembly (concatenate into a single FASTA file if necessary), and make a BLAST database from it. ``` module load blast+ makeblastdb -in <assembly FASTA file> -dbtype nucl -out <database name> ``` - Search for the scaffold where the protein sequence resides with `tblastn`. ``` tblastn -query <MGYP sequence in FASTA file> \ -db <database name> -outfmt 6 -out <result file> ``` The scaffold containing the putative protein should be the top hit with E-value = 0 and identity score = 100. - Obtain the scaffold sequence from the BLAST database with `samtools`, `seqkit`, or `blastdbcmd`. For example, ``` seqkit grep -r -p <scaffold ID> <assembly FASTA file> > <output FASTA file> ``` :::info I tried to make the steps fairly easy to be combined into a *bash* script or so. Additional codes are needed between some steps, e.g. extract the protein sequence and assembly ID automatically. Have fun :smiley:. :::