Monday Update - Shao Group
===
###### `Metagenome assembly` `Benchmarking` `Documentation`
:::info
- **Date:** Jan 17, 2022 2:30 PM (EST)
- **Agenda**
1. Last week's work
2. Plan for this week
3. Help
4. Misc
5. References
- **Reference:** [Last Monday's update](/s/template-meeting-note)
:::
## Last week's work
1) Updated thesis documentation
i) Methods
ii) Results
2) Hifiasm and Raven do not output enough information to assess their resulting assemblies, so I ran QUAST to obtain N50, N75, L50, L75 and contig information.
3) **BUSCO and QUAST output comparison** for all assemblers (metaFlye, HiCanu, Raven and Hifiasm-meta) on 6 benchmark datasets (E.coli, ATCC, Zymo, Sheep, Human and Chicken):
(QUAST labels: assembly -> metaFlye; X.contigs -> HiCanu; X_hifi.p_ctg -> Hifiasm-meta; X_hifi_assembly -> Raven)
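
As a reminder of what the QUAST contiguity metrics in item 2 mean, here is a minimal sketch of how Nx and Lx can be computed from a list of contig lengths. The function name `nx_lx` and the example lengths are hypothetical, and this is not QUAST's own implementation.

```python
def nx_lx(contig_lengths, fraction):
    """Return (Nx, Lx) for a fraction of the total assembly length.

    fraction=0.50 gives (N50, L50); fraction=0.75 gives (N75, L75).
    """
    # Walk the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            # Nx is the length of the contig that crosses the threshold,
            # Lx is how many contigs were needed to get there.
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (bp), for illustration only.
lengths = [100, 80, 60, 40, 20]
n50, l50 = nx_lx(lengths, 0.50)
n75, l75 = nx_lx(lengths, 0.75)
print(f"N50={n50} L50={l50} N75={n75} L75={l75}")
```
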
### On E.coli dataset
- BUSCO

• Raven produces an assembly with **contig lengths < 5000 bp** for E.coli. Hence, it isn't being read by BUSCO for quality analysis.
• HiCanu is able to identify **13 complete single-copy genes**, which is better than metaFlye and Hifiasm-meta.
- QUAST

• Hifiasm-meta produces a more contiguous assembly than the others, with a total of **129 contigs**.
### On ATCC dataset
- BUSCO

• There are exactly **20 strain staggered metagenomes** in the ATCC dataset.
• HiCanu and Hifiasm-meta are closest to the correct count, but both fail to identify the correct number of single-copy genes and instead report more duplicated genes. In that sense, metaFlye does better, identifying **12 single-copy genes** and fewer duplicates. Raven overshoots the complete count.
• Recovery of many duplicates generally indicates erroneous assembly of haplotypes.
- QUAST

• Hifiasm-meta produces a more contiguous assembly than the others, with a total of **72765 contigs**.
### On Zymo dataset
- BUSCO

• There are exactly **21 strains** in the Zymo dataset.
• HiCanu, metaFlye and Hifiasm-meta produce results for both lineage datasets, eukaryota_odb10 and saccharomycetes_odb10.
• Not sure why the total complete count is so high.
- QUAST

### On Sheep dataset
- BUSCO

•
- QUAST

### On Human dataset
- BUSCO

•
- QUAST

### On Chicken dataset
- BUSCO

•
- QUAST

## Plan for this week
1) Thesis work:
- Calculate F1 score, precision and recall for binning output.
- Understand alternatives to the N50 metric for assessing genome assembly. Generally, higher N50 values are better, but N50 alone is not a sufficient measure of contiguity.
- Finalize results (mostly the benchmarking table and conclusion) for thesis documentation.
- Update references
- Ask Dr. Koslicki and Dr. Medvedev to serve on the thesis panel.
- Start preparing presentation slides.
2) Journal Club work:
- Identify paper for journal club.
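
For the binning evaluation planned under the thesis work, a minimal sketch of precision, recall and F1 for a single bin is shown here. It assumes each bin has already been mapped to a reference genome and counted in base pairs; the function name and the counts are hypothetical, not taken from the benchmark results.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall and F1 from raw counts (e.g. base pairs).

    true_positive:  bp in the bin that belong to its assigned genome
    false_positive: bp in the bin that belong to other genomes
    false_negative: bp of the assigned genome that are missing from the bin
    """
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one bin, for illustration only.
p, r, f = precision_recall_f1(true_positive=80, false_positive=20, false_negative=120)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Per-bin values like these can then be averaged across bins to get one number per binner, which may be easier to put in a benchmarking table.
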
## Help
- Possibility of a research assistantship for Spring 2022?
- Comparing CheckM bin output is difficult because plenty of bins are produced, each having its own % completeness, % contamination etc. How do I address this? Here's a photo for clarification - A total of 509 bins are produced.
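
One possible way to make the bin comparison manageable is to collapse the per-bin values into counts per quality tier, so each binner is summarized by a few numbers. The sketch below assumes completeness and contamination have already been parsed into tuples. The tier cutoffs (over 90% / under 5% for "high", at least 50% / under 10% for "medium") are an assumption loosely following commonly used MAG quality thresholds, and the bin values are hypothetical.

```python
from collections import Counter


def quality_tier(completeness, contamination):
    """Assign a bin to a quality tier (cutoffs are assumed, see lead-in)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def summarize_bins(bins):
    """Count bins per tier from (bin_id, completeness %, contamination %) tuples."""
    return Counter(quality_tier(comp, cont) for _, comp, cont in bins)


# Hypothetical bins, for illustration only (not real CheckM output).
bins = [
    ("bin.1", 98.2, 1.1),
    ("bin.2", 72.5, 4.0),
    ("bin.3", 35.0, 0.5),
    ("bin.4", 95.0, 12.3),
]
summary = summarize_bins(bins)
print(dict(summary))
```
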


## Misc
--x--
## References
1. Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, Evgeny M. Zdobnov, BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs, Bioinformatics, Volume 31, Issue 19, 1 October 2015, Pages 3210–3212, https://doi.org/10.1093/bioinformatics/btv351
2. Yue, Y., Huang, H., Qi, Z. et al. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinformatics 21, 334 (2020). https://doi.org/10.1186/s12859-020-03667-3
3. Meyer, F., Lesker, TR., Koslicki, D. et al. Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit. Nat Protoc 16, 1785–1801 (2021). https://doi.org/10.1038/s41596-020-00480-3
4. https://busco.ezlab.org/busco_userguide.html#interpreting-the-results
5. http://quast.sourceforge.net/docs/manual.html