GGG 201b - lab 7 - Variant calling and assembly workflows discussion

---
tags: ggg, ggg2020, ggg201b
---

# GGG 201b - lab 7 - Variant calling and assembly workflows discussion

*Feb 21, 2020*

[Variant calling workflow](https://github.com/ngs-docs/2020-ggg-201b-variant-calling/blob/master/Snakefile)

[Assembly workflow](https://github.com/ngs-docs/2020-ggg-201b-assembly/blob/master/Snakefile)

what does each workflow do?

how do we know they worked?

- computer saying "no errors happened!" is usually necessary but not sufficient
- need to examine outputs, information loss, and have evaluation metrics
- controls on whole process - how?

variant calling:

* look at mapped reads & some regions
* find low coverage regions
* count and/or look at unmapped reads
* (technique for looking at medium data: take 10 randomly selected results, e.g. unmapped reads, and ask what they are!)

assembly:

* how does megahit work? what does it do?
* (what if megahit just output all the reads again?)
* try mapping reads
* look for known genes
* (black box view)
* how many k-mers don't show up in final? how abundant are they?
* assemblathon result; usually a tradeoff between various parameters

big data is kind of like wet bench experiments: you often can't look at it directly, so you need to develop indirect ways of doing quality checks.

- summary statistics for various stages
- information loss estimates
- visualization
- pos/neg controls
- can we do hold-out / train+test+validate? not so easily in some cases.
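
The "take 10 randomly selected results" technique from the variant calling list can be sketched with reservoir sampling, which draws a uniform random sample in a single pass without loading the whole file. This is only a minimal illustration: `read_fastq` and `sample_records` are hypothetical helper names (not part of either workflow), and the sketch assumes plain 4-line FASTQ records, e.g. unmapped reads that have already been exported to FASTQ.

```python
import random


def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ lines.

    Assumes the simple 4-lines-per-record layout (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def sample_records(records, n=10, seed=None):
    """Return a uniform random sample of up to n records in one pass.

    Reservoir sampling: keep the first n records, then replace a random
    slot with decreasing probability as more records stream past.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = rec
    return reservoir


if __name__ == "__main__":
    # Toy input: 100 made-up reads, standing in for a real FASTQ file.
    demo = []
    for k in range(100):
        demo += [f"@read{k}\n", "ACGT\n", "+\n", "IIII\n"]
    for name, seq, _ in sample_records(read_fastq(demo), n=10, seed=1):
        print(name, seq)
```

With a real file, the same functions could be fed an open file handle instead of the toy list; the point is to end up with a handful of reads small enough to inspect by hand and ask what they are.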
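
The k-mer question from the assembly list ("how many k-mers don't show up in final? how abundant are they?") is one concrete form of an information loss estimate. Below is a rough sketch that assumes the reads and contigs fit in memory as plain strings; the function names (`canonical_kmers`, `kmer_loss`, `summarize`) and the choice of canonical k-mers are illustrative assumptions, not how either workflow does it. Real datasets would need a dedicated k-mer counting tool.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield the canonical form (min of k-mer and its reverse complement)
    of every k-mer in seq, skipping k-mers with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield min(kmer, revcomp(kmer))


def kmer_loss(reads, contigs, k=21):
    """Count read k-mers and return those absent from the assembly."""
    read_counts = Counter()
    for read in reads:
        read_counts.update(canonical_kmers(read, k))
    assembly = set()
    for contig in contigs:
        assembly.update(canonical_kmers(contig, k))
    missing = {km: n for km, n in read_counts.items() if km not in assembly}
    return read_counts, missing


def summarize(read_counts, missing):
    """Summary statistics: how many k-mers were lost, and how abundant."""
    hist = Counter(missing.values())  # abundance -> number of lost k-mers
    total = len(read_counts)
    return {
        "distinct_read_kmers": total,
        "distinct_missing": len(missing),
        "fraction_missing": len(missing) / total if total else 0.0,
        "missing_abundance_histogram": dict(sorted(hist.items())),
    }


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
    contigs = ["ACGTAC"]
    print(summarize(*kmer_loss(reads, contigs, k=4)))
```

Lost k-mers that appear only once in the reads are often sequencing errors, while abundant lost k-mers are more worrying. Note that an "assembler" that simply output all the reads again would score zero loss here, so this metric is a necessary check rather than a sufficient one.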