---
title: Notes on Papers
tags: Knowledgement
description: Personal hackmd notes
image: https://partechshaker.com/wp-content/uploads/2018/10/logo_square.png
robots: noindex, nofollow
GA: UA-165598729-1
---
Paper Notes
===
:::info
Notes on papers I have read
:::
[TOC]
:page_facing_up: Evolution of microbiota strains and substrains in experimental mice
---
### Methods variant calling
1. take OMM ref genomes[^ref]
2. deduplicate reads[^duplication]
3. align reads with BWA to ref
4. use _Pilon_ to refine reference genomes [^pilon]
5. They do 2 rounds of _LoFreq_ and use the first round for _base quality score recalibration_.[^recalibration]
6. After variant calling they remove positions with more than one alternative allele [^alternative]
7. Statistical test for AF changes between years with multiple-testing correction
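
Step 6 can be sketched in a few lines of Python. This is only my reading of "more than one alternative allele", not the authors' code. I assume a multi-allelic position can show up either as a comma-separated ALT field or as several VCF records with the same CHROM and POS. The function name `drop_multiallelic` and the example records are made up for illustration.

```python
from collections import defaultdict


def drop_multiallelic(vcf_lines):
    """Return VCF lines with multi-allelic positions removed.

    Assumption: a position is multi-allelic if it carries more than one
    distinct ALT allele, either comma-separated in one record or spread
    over several records that share CHROM and POS.
    """
    header, records = [], []
    alts_at = defaultdict(set)  # (chrom, pos) -> set of ALT alleles seen
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        key = (fields[0], int(fields[1]))    # CHROM, POS
        alts_at[key].update(fields[4].split(","))  # ALT column
        records.append((key, line))
    kept = [line for key, line in records if len(alts_at[key]) == 1]
    return header + kept


# Hypothetical records, only to show the behaviour.
example = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t100\t.\tA\tG\t50\tPASS\tAF=0.10",
    "contig1\t200\t.\tC\tT\t60\tPASS\tAF=0.20",
    "contig1\t200\t.\tC\tG\t40\tPASS\tAF=0.05",
    "contig1\t300\t.\tG\tA,T\t70\tPASS\tAF=0.30",
]
filtered = drop_multiallelic(example)
```

With these records only position 100 survives: position 200 has two records with different ALT alleles and position 300 has two ALT alleles in one record.
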
### General comments
- Not sure which test they performed to test AF changes between years; they just report that it's a "two-stage" setup and that it's corrected for multiple testing. Maybe they should add the name of the test used.
- Not sure how big the change due to _Pilon_ refinement is. I will check it on our data using the new resequencing and also a whole metagenome alignment (I guess this is what they have done).
#### Should they have the same issues as we have?
- In the "Estimate relative abundance of species" part they write they remove "none-uniquely mapped reads". This is exactly what one should do to prevent our problem. Since they just mention it in this RA chapter and not in the variant calling chapter above, I _guess_ they run into the same problems as we do.
- However, I do not understand which input alignment they use for _Pilon_. The whole metagenome alignment to one reference? The resequenced NGS data they generated later? At least this should be specified in the methods. _Maybe_ this will sort out these problems but I don't think so.
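
A minimal sketch of what removing non-uniquely mapped reads could look like on BWA output, as a note to self for our own pipeline. The paper does not say how they did it, so the criteria are my assumptions: drop unmapped, secondary and supplementary alignments, drop reads with MAPQ below a threshold (BWA gives MAPQ 0 to reads with equally good hits elsewhere), and, as a deliberately strict choice, drop anything with an `XA` tag (alternative hits). The function name `is_uniquely_mapped` and the example SAM lines are hypothetical.

```python
def is_uniquely_mapped(sam_line, min_mapq=1):
    """Heuristic uniqueness check for one SAM alignment line.

    Assumptions (not from the paper): min_mapq=1 as threshold, and any
    XA:Z tag (BWA alternative hits) counts as non-unique.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if flag & 0x4:                      # read is unmapped
        return False
    if flag & 0x100 or flag & 0x800:    # secondary or supplementary alignment
        return False
    if mapq < min_mapq:                 # ambiguous placement
        return False
    # Optional tags start at column 12 (index 11).
    return not any(tag.startswith("XA:Z:") for tag in fields[11:])


# Hypothetical SAM records, only to show the behaviour.
sam = [
    "r1\t0\tcontig1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
    "r2\t0\tcontig1\t50\t0\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXA:Z:contig2,+7,4M,0;",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t256\tcontig2\t7\t0\t4M\t*\t0\t0\t*\t*",
]
unique_reads = [line.split("\t")[0] for line in sam if is_uniquely_mapped(line)]
```

Here only `r1` is kept. Treating every `XA` tag as non-unique is stricter than a MAPQ filter alone, since `XA` can also list suboptimal hits.
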
[^ref]: So no assembly at all.
[^duplication]: They _do_ filter out "duplicate marked" reads with picard-tools. But as far as I know these duplicates are technical duplicates of the same read due to PCR and not "duplications" in terms of their mappings to different locations. I tested the same (?) picard filtering on our samples ([see my report](https://hackmd.io/@pmuench/HJHJw0Ut8)) and it did not solve our problem, since this refers to PCR "duplications" (same read multiple times).
[^pilon]: They run _Pilon_ to detect problematic structural variants. _Pilon requires as input a FASTA file of the genome along with one or more BAM files of reads aligned to the input FASTA file. Pilon uses read alignment analysis to identify inconsistencies between the input genome and the evidence in the reads. It then attempts to make improvements to the input genome, including: (1) Single base differences (2) Small indels (3) Larger indel or block substitution events (4) Gap filling (5) Identification of local misassemblies, including optional opening of new gaps_
[^recalibration]: Was not aware this is a thing. But [It seems they are right](https://csb5.github.io/lofreq/faq/) and we should do this too. Citation from LoFreq: "We recommend GATK base-recalibration even for non-human data (or targeted sequencing), even though GATK requires the input of known variant sites (a circular problem actually), which are not known for many organisms (use dbsnp for human data). One option is to run LoFreq first and use its predictions as “known” variant input to GATK and then run LoFreq again. The other alternative is to simply use an empty vcf file."
[^alternative]: Not sure about that yet.