owned this note
owned this note
Published
Linked with GitHub
# Map/align metagenome short reads to a reference database
BBTools is a powerful collection of commands to analyze metagenomic sequences. Download the software [here](https://sourceforge.net/projects/bbmap/) and find a user guide [here](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/).
We will use the tool BBMap to align short reads to a reference database. The idea is to retrieve reads that belong to certain microbial clade thereby removing reads of organisms that are not of interest.
In this example we are interested in sulfate reducers. We have already created a [reference database](https://hackmd.io/4V9elLP3Rr2gUAHLf2DPuw) with genomes belonging to known sulfate-reducing bacteria.
## Map reads to reference database
To map metagenomic reads to a reference database we first build an index. You created your 'reads.fna' in the chapter [Create genome reference database](https://hackmd.io/9JxrczlwRtCm_hFPtyh1hw?view).
```
module load bbtools
module load samtools
bbmap.sh ref=your-database-file.fna
```
Note:`ref=your-database-file.fna` is named differently and might have the extension 'gz', e.g. `ref=desulfo_genomes_DB.fna.gz`.
The output looks like this:
```
No output file.
Writing reference.
Executing dna.FastaToChromArrays2 [desulfo_genomes_DB.fna, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Set genome to 1
Loaded Reference: 0.154 seconds.
Loading index for chunk 1-2, build 1
No index available; generating from reference genome: /users/eruff/Spart2020_Student/genome_databases/desulfo/ref/index/1/chr1-2_index_k13_c2_b1.block
Indexing threads started for block 0-2
Indexing threads finished for block 0-2
Generated Index: 177.753 seconds.
Finished Writing: 30.043 seconds.
No reads to process; quitting.
Total time: 224.825 seconds.
```
The command produces a directory called `ref` with several sub-directories.
Now we can map the metagenomic reads from metagenomes to the index using the following command. This will create two `bam` files (short for Binary Alignment Map), one containing mapped and one containing unmapped reads. Note: The mapping of 1GB metagenomic reads takes around 10 minutes on an average cluster using 12 threads.
```
bbmap.sh in=/groups/spartina2020/genome_databases/MG.fastq outu=MG_unmapped.bam outm=MG_mapped.bam
```
In case the metagenome consists of paired reads it is recommended to process them as such. Most bbtools can handle paired reads, check [here](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/usage-guide/). Paired reads can be mapped and turned into a single interleaved file using this command:
```
bbmap.sh in1=/groups/spartina2020/genome_databases/MG_R1.fastq in2=/groups/spartina2020/genome_databases/MG_R2.fastq outu=MG_unmapped.bam outm=MG_mapped.bam
```
The console output looks somewhat like this:
```
------------------ Results ------------------
Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 7261066 (1063955727 bases)
Mapping: 602.129 seconds.
Reads/sec: 12058.99
kBases/sec: 1766.99
...
Total time: 639.210 seconds.
```
The bam file of the mapped reads is converted back to fastq files using samtools
```
samtools fastq MG_mapped.bam > MG_mapped.fastq
```
## Process multiple sequences at once
Because it is inefficient to process every sample on its own, let's tell bbmap to map/align multiple metagenomes on one index file. For this we will create a `for loop`. Note: Even if multiple input files are processed the index file needs to be created only once. Find more info [here](https://github.com/BioInfoTools/BBMap/blob/master/sh/bbmap.sh).
### Create a list with all metagenomes that need to be processed
This can be done using a number of commands. In ths case we just need a list of the names of all `fastq.gz` files. This will do:
`ls /path/to/dir/*.fastq.gz > /path/to/dir/file_list.txt`
If you need a list with files and their path names use, e.g this:
```
find /path/to/dir/ -type f -name '*.fastq.gz' -exec cat {} + > /path/to/dir/file_list.txt
```
### Now let's write a loop
The following `for loop` will take each metagenome in the above list, and maps/aligns it to our index. (Remeber, we do not have to specify the location of the index as long as we are in the same directory as the index). The loop creates separate bam files for the `mapped` and `unmapped` reads. The bam files will be named after the metagenome they are coming from so that we don't loose track. Find more info on a very similar loop [here](https://www.biostars.org/p/381398/).
```
for file in $(cat IGERT_mgm_list.txt); do bbmap.sh in=/groups/spart2020/maptest/IGERT_mgm/${file} outm=./bam/${file}_mapped.bam outu=./bam/${file}_unmapped.bam; done
```