---
title: Removal of duplicated regions
tags: OMM
description: Personal hackmd notes
image: https://partechshaker.com/wp-content/uploads/2018/10/logo_square.png
robots: noindex, nofollow
GA: UA-165598729-1
---
---
[TOC]
---
Removal of duplicated regions
===
evidence for ambigous mapping regions in *IGV viewer*
---
On *bwa mem* mapped files, white marked reads are reada that would map to more than one position.

Get duplicated stats
---
:::warning
**Note:** The duplicationess here refers to PCR duplicates and not ambigous aligned reads? That whould explain why removing duplicates would not change the alignment at all
:::
> Picard MarkDuplicates marks duplicates, rather than removing them. Instead, when it finds a duplicate read it sets the duplicate flag to true and then outputs it. To remove the duplicates you need to add `REMOVE_DUPLICATES=true`
But for now just the statistics:
```bash!
cd /net/sgi/oligomm_ab/OligoMM-report
java -jar -Xmx30g /net/sgi/oligomm_ab/tools/picard.jar MarkDuplicates INPUT=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted.bam OUTPUT=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_picard.bam METRICS_FILE=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/output.dup_metrics CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT
```
|alignment|UNPAIRED_READS_EXAMINED | READ_PAIRS_EXAMINED | SECONDARY_OR_SUPPLEMENTARY_RDS |UNMAPPED_READS | UNPAIRED_READ_DUPLICATES | READ_PAIR_DUPLICATES| READ_PAIR_OPTICAL_DUPLICATES | PERCENT_DUPLICATION |ESTIMATED_LIBRARY_SIZE|
| --------| -------- | -------- | -------- |-------- |-------- |-------- |-------- |-------- |-------- |
|bowtie2|230684 | 13956964 | 0 | 258552 |146668 | 903320 | 69776 | 0.069403 | 111006234|
|bwa mem| 3912 | 14193668 | 295389 |11916 |2556 | 917967 |70885 | 0.064756 | 112973229|
Seems there are $6.9\%$ duplicated reads, and no big difference between *bowtie2* and *bwa mem*
|BIN | CoverageMult | all_sets | optical_sets | non_optical_sets|
| -------- | -------- | -------- |-------- |-------- |
|1.0 | 1.004715 | 12209717 | 0 | 12268158|
|2.0 | 1.890725 | 788501 | 67794 | 740148|
|3.0 | 2.672055 | 51725 | 957 | 42765|
|4.0 | 3.361071 | 3450 | 20 | 2430|
|5.0 | 3.968682 | 238 | 2 | 140|
|6.0 |4.504504 | 11 | 0 | 2|
|7.0 | 4.97702 |2 | 0 | 1|
|8.0 | 5.393709 | 0 | 0 | 0|
|9.0 | 5.761167 | 0 | 0 | 0|
|10.0 | 6.08521| 0 | 0 | 0|
|11.0 | 6.370968 | 0 | 0 | 0|
> Q: What is the meaning of the histogram produced by MarkDuplicates?
A: MarkDuplicates estimates the return on investment for sequencing a library to a higher coverage than the observed coverage. The first column is the coverage multiple, and the second column is the multiple of additional actual coverage for the given coverage multiple. The first row (1x, i.e. the actual amount of sequencing done) should have ROI of approximately 1. The next row estimates the ROI for twice as much sequencing of the library. As one increases the amount of sequencing for a library, the ROI for additional sequencing diminishes because more and more of the reads are duplicates.
same for bwa
```bash!
cd /net/sgi/oligomm_ab/OligoMM-report
java -jar -Xmx30g /net/sgi/oligomm_ab/tools/picard.jar MarkDuplicates INPUT=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted.bam OUTPUT=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_picard.bam METRICS_FILE=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/output_bwa.dup_metrics CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT
```
and create bam file with removed duplicates
```bash!
cd /net/sgi/oligomm_ab/OligoMM-report
java -jar -Xmx30g /net/sgi/oligomm_ab/tools/picard.jar MarkDuplicates INPUT=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted.bam OUTPUT=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_picard_removed.bam METRICS_FILE=/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/output_bwa.dup_metrics CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
```
But when loading it to IGV viewer, no changes can be seen between `/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_picard_removed.bam` and `/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted.bam`
Mark duplicated reads using samtools
---
[Samtools `markdup`](http://www.htslib.org/doc/samtools-markdup.html) marks duplicate alignments in a coordinate sorted file.
```bash!
samtools markdup positionsort.bam markdup.bam
```
However, input files need to be in the right format as described [here](http://www.htslib.org/doc/samtools-markdup.html).
For *bowtie2* mapping
```bash!
# This first collate command can be omitted if the file is already name ordered or collated
samtools collate -o /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_namecollate.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted.bam
# Add ms and MC tags for markdup to use later
samtools fixmate -m /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_namecollate.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_fixmate.bam
# Markdup needs position order
samtools sort -o /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_positionsort.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_fixmate.bam
# Finally mark duplicates
samtools markdup /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_positionsort.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_markdup.bam
# and use -r parameter to create a bam file with removed duplicates
samtools markdup -r /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_positionsort.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_markdup_removed.bam
# index for IGV viewer
samtools index /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bowtie2_mapped_omm_sorted_markdup_removed.bam
```
For *bwa mem* mapping
```bash!
# This first collate command can be omitted if the file is already name ordered or collated
samtools collate -o /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_namecollate.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted.bam
# Add ms and MC tags for markdup to use later
samtools fixmate -m /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_namecollate.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_fixmate.bam
# Markdup needs position order
samtools sort -o /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_positionsort.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_fixmate.bam
# Finally mark duplicates
samtools markdup /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_positionsort.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_markdup.bam
# and use -r parameter to create a bam file with removed duplicates
samtools markdup -r /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_positionsort.bam /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_markdup_removed.bam
# index for IGV viewer
samtools index /net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted_markdup_removed.bam
```
Other way would be to use `samtools view` with filter `-F 1024` or `-F 0x0400`
Visualization of `*_mardup_removed.bam` filen in *IGV viewer*
See [IGV Viewer](https://hackmd.io/YMMKtwsaQ7OLaxYitbUjVQ).

Flanking white regions are both *IS21 Transposase*
> Autonomous mobile genetic elements such as transposons or insertion sequences (IS) encode an enzyme, called transposase, required for excising and inserting the mobile element. The IS21 is among the largest of bacterial IS elements and encodes a transposase designated as istA (ORF1) as well as a second protein istB (ORF2) and represents the IS21/IS408/IS1162 family
Alignment tacks:
1. bowtie2
2. bowtie2 with removed duplicated reads via `samtools markdup -r`
3. bwa mem
4. bwa mem with removed duplicated reads via `samtools markdup -r`
## Removal of duplicated regions with _Mauve_
```bash!
cd /home/aime/projects/oligomm-claudia/reference_comparison
/home/aime/Downloads/mauve_snapshot_2015-02-13/linux-x64/progressiveMauve --output=joined_reference_curated.xmfa /home/aime/projects/oligomm-claudia/databases/omm_new/joined_reference_curated.fasta
```