# **Part I. Marine Turtle Rapture Data Analysis notes and workflow**

- *New DATA, received July 2017*
- *N.B. Insert metadata on library prep/capture and sequencing (UC Davis HiSeq, 150 bp PE reads)*
- *IMPT: See also the D. coriacea README file, which has notes and code from what Mike and I actually did in Feb for designing probes, etc. Much of the same code is reused here, but some important added steps need clarifying.*

### **Intro & Initial QC**

- Got 3 files from Mike of the entire lane (compressed data) from the Genome Center
- No. of lines:

```
1583820688  SOMM190_S1_L001_R1_001.fastq   # Forward reads
1583820688  SOMM190_S1_L001_R2_001.fastq   # Plate barcodes (NEBNext adaptor sequences)
1583820688  SOMM190_S1_L001_R3_001.fastq   # Reverse reads
```

- So in total we have 395,955,172 reads (1,583,820,688 lines / 4), pretty close to the ~400 million expected for one HiSeq lane

#### I. Demultiplex full lane into plates, initial QC checks/library

```
gunzip *fastq.gz
nohup ./run_split.sh SOMM190_S1_L001_R1_001.fastq SOMM190_S1_L001_R2_001.fastq SOMM190_S1_L001_R3_001.fastq SOMM190 &
```

- The bash script calls the perl script BarcodeSplitList3Files.pl; *see the one with my initials for an annotated version*
- Note that this script won't work with gzipped files, so unzip them first (above)
- It looks for the barcode sequences in the R2 file, then assigns the corresponding sequences in R1 and R3 to the matching plate
- These scripts look for all ~118 plate barcodes the Miller lab uses. Just keep the files for the barcodes you used, and check that no barcodes that aren't supposed to be there show up in significant numbers

**Libraries:**

```
G01 AGCTTA    G02 GAATTA    G03 TGTGTA    G04 CCCGTA    G05 ATACTA
G06 GTGATA    G07 CTTTGA    G08 GCAGGA    G09 AATCGA    G10 TGGCGA
G11 GTCCGA    G12 CCGAGA    H01 AGAAGA    H02 GGTTCA    H03 TTGTCA
```

- *Confirmed with Sean that the barcode for the added library was TGACCA (NEBNext index 4)*

##### 1. Check file sizes, lines (should be the same between R1, R2, R3), headers, other QC checks

###### Number of lines:

*N.B. If you do a line count on the compressed .gz files, counts can differ between F and R due to how compression works.*

- Ran line checks on the target files as well as the ones assigned to non-existent plates (see the sketch below for turning these counts into read counts and flagging unexpected barcodes):

```
wc -l *fastq >By_plate_linecheck
wc -l *fastq >Extra_Seq_linecheck
```

- For non-target files, see the *Extra_Seq_linecheck.xlsx* file; in total 605,692 sequences were assigned to these ghost barcodes (~0.15% of reads lost)
- For target plate files, all the file sizes and line counts match up - see the *By_target_plate_linecheck.xlsx* file. Calculations for library proportions etc. are looked at in more depth in the QC steps below; this is just a first quick check to ID any red flags
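A minimal sketch for turning the `wc -l` output into read counts (reads = lines/4 for uncompressed fastq) and flagging unexpected barcode files; it assumes the plate barcode is the last underscore-separated chunk of the file name (e.g., `SOMM190_R1_AGCTTA.fastq`), and the `expected` list just restates the library barcodes above:

```
expected="AGCTTA GAATTA TGTGTA CCCGTA ATACTA GTGATA CTTTGA GCAGGA AATCGA TGGCGA GTCCGA CCGAGA AGAAGA GGTTCA TTGTCA TGACCA"
awk -v list="$expected" '
  BEGIN { n = split(list, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
  $2 ~ /fastq$/ {
    m = split($2, f, "_"); bc = f[m]; sub(/\.fastq$/, "", bc)   # barcode = last "_"-separated piece of the file name
    printf "%s\t%d reads\t%s\n", $2, $1 / 4, (bc in ok) ? "expected" : "UNEXPECTED"
  }' By_plate_linecheck
```

The R2 (barcode) files will show up in the output too; they can simply be ignored.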
###### Check headers:

- Look for any obvious file corruption, issues, etc.

```
head -n 8 *.fastq >plate_head_n8.txt
less plate_head_n8.txt
```

*All looks normal. Note: you can also look at the head of gzipped files without decompressing them all:*

```
gzip -cd SOMM190_R1_TGTGTA.fastq.gz | head
```

###### Look at file sizes to assess evenness of library sequencing:

- *Again, this is just a quick and dirty check; look into it further with the FASTQC read counts and other QC graphics below*

```
du -h *fastq | sort -h
```

*(R2s removed from the list; R1/R3 paired per plate)*

```
3.4G  SOMM190_R1_AGAAGA.fastq   3.4G  SOMM190_R3_AGAAGA.fastq
5.0G  SOMM190_R1_TTGTCA.fastq   5.0G  SOMM190_R3_TTGTCA.fastq
5.2G  SOMM190_R1_TGACCA.fastq   5.2G  SOMM190_R3_TGACCA.fastq
5.5G  SOMM190_R1_GGTTCA.fastq   5.5G  SOMM190_R3_GGTTCA.fastq
7.3G  SOMM190_R1_CCCGTA.fastq   7.3G  SOMM190_R3_CCCGTA.fastq
7.4G  SOMM190_R1_TGGCGA.fastq   7.4G  SOMM190_R3_TGGCGA.fastq
8.1G  SOMM190_R1_AGCTTA.fastq   8.1G  SOMM190_R3_AGCTTA.fastq
8.1G  SOMM190_R1_GAATTA.fastq   8.1G  SOMM190_R3_GAATTA.fastq
8.2G  SOMM190_R1_GTGATA.fastq   8.2G  SOMM190_R3_GTGATA.fastq
8.2G  SOMM190_R1_TGTGTA.fastq   8.2G  SOMM190_R3_TGTGTA.fastq
8.6G  SOMM190_R1_ATACTA.fastq   8.6G  SOMM190_R3_ATACTA.fastq
8.8G  SOMM190_R1_CCGAGA.fastq   8.8G  SOMM190_R3_CCGAGA.fastq
8.8G  SOMM190_R1_GCAGGA.fastq   8.8G  SOMM190_R3_GCAGGA.fastq
8.9G  SOMM190_R1_GTCCGA.fastq   8.9G  SOMM190_R3_GTCCGA.fastq
9.8G  SOMM190_R1_CTTTGA.fastq   9.8G  SOMM190_R3_CTTTGA.fastq
10G   SOMM190_R1_AATCGA.fastq   10G   SOMM190_R3_AATCGA.fastq
```

- *F/R sizes per plate match up; there is substantial variation in file sizes across plates - look at this further with FASTQC and after demultiplexing to see if some samples failed, etc.*

##### 2. Running FASTQC (by plate/library):

```
nohup fastqc -f fastq -t 5 *fastq > fastqc.stdout 2> fastqc.stderr &
```

###### Running MULTIQC (by plate/library) to aggregate summary stats:

[MultiQC Documentation](http://multiqc.info/docs/#introduction)

```
nohup /usr/local/python/bin/multiqc . &
```

- You can specify file types etc.; the default "." just looks for any analysis files in the current directory. Used MultiQC v1.2, updated on ahi Aug 25, 2017
- The default report is just called 'multiqc_report.html'; I renamed it to 'Rapture_newdata_byplate_multiqc_report.html' (see the sketch at the end of this subsection for setting the name up front)
- Look at the documentation for options for naming, batching, etc. Really thorough and helpful documentation.
- Toolbox options are helpful for interactivity - turning samples on/off etc. to see where issues are
- *Note that MultiQC can aggregate results from a large number of HTS programs at the raw-sequence, filtered/trimmed, and pre/post-alignment processing stages (see the full list in the documentation, link above), so this may be something to come back to later to look at additional files*
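A small sketch of the same call with the report name and output folder set up front, so no manual renaming is needed afterwards; `-n/--filename` and `-o/--outdir` are standard MultiQC options (the names below are just examples):

```
nohup /usr/local/python/bin/multiqc . -n Rapture_newdata_byplate_multiqc_report -o multiqc_byplate &
```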
###### A few summary notes on report *(see file for more details)*

- Good quality scores; note the dip at the beginning - those are just the R2s (barcodes), included because everything in the folder was run; they can be hidden with the toolbox functions. Not surprisingly, R3s overall have lower quality than R1s, but still good
- GC content is around 45% and length 150 bp for all
- % duplicates/duplication numbers are very high, but I think this may be due to how FASTQC calculates them - e.g., we always start at the RAD cut site and are looking at the same sites across samples in a plate. Not worrying about this for now, but will do the calculations when removing duplicates with samtools. If needed, come back and figure out how FASTQC is calculating it
- The 'sequence duplication levels' graph is, I think, just graphically showing what was in the summary table above for the duplicates - all fail because sequences are duplicated >10-500 times (but that would be the capture of the same targets across samples, I think)
- Similar comment for 'per base sequence content' - all samples 'fail' the module, but that looks like it could be due to differences at the beginning (cut site/well barcode), then seems to even out
- Interestingly, despite the high duplication, coverage seems even - no over-represented sequences/loci coming out
- Note the differences in total sequences across plates (exported the table and saved with plate IDs in *Dcor_newdata_readcountsFASTQC_byplate.xlsx*); the H01-H04 plates have lower total sequences. Some differences are likely due to input sample numbers (we normalized input according to sample number because some weren't full plates), but others are likely due to quality. Plate 9 has the highest, double or triple the others. Will look at this further with the QC code below

---

#### II. Demultiplex plates into wells (in loop), initial QC checks/library

```
nohup ../Bysample.demultiplex.wbashloop.sh &
```

##### 1. General background on process

- This bash script is just *run_BestRadSplit.sh* with a loop added to repeat for each plate (library) and label outputs as such
- It calls *BarcodeSplitListBestRadPairedEnd.pl* to demultiplex each plate by well *(see the one saved with my initials for annotation; of course make sure the relative paths are all correct for sample files and scripts)*
- Outputs are demultiplexed .fastq files for each sample (F & R)
- If running for just one plate, use *run_BestRadSplit.sh* and set arguments on the command line for the input F and R files and the prefix you want added to all the files *(see script for details)*
- This does the flip trim as well as demultiplexing for each sample *(see script for details)*
- As per discussion with Mike, we don't need to do adaptor and quality trimming here because adaptor contamination isn't as much of an issue for RAD data, and ANGSD takes the quality scores into account
- However, we can also test for adaptor contamination/barcode hopping rates below

##### 2. Results/details relating to reads with double barcodes

- Looked at the *nohup.out* file: there are 897,263 lines of "Double Barcode!" warnings
- I annotated the Perl script to understand (roughly) what it's doing: this happens when a barcode is found in both the F and R sequences, which shouldn't happen. When it does, those sequences are not written to any files. This translates to ~0.23% of the sequences from the lane (~395 million total, see above). Seems reasonable, but can follow up with Mike to ask whether this is the typical percentage they expect to see.
- To make sure it wasn't my loop causing the issue, I ran just one plate and saw the same result.
- Printed the plate IDs and their line numbers:

```
grep -vn "Double" nohup.out >test2
```

- Then just calculated in Excel the number of lines assigned to each plate (a command-line alternative is sketched after the table):

| Line No. | Plate | No. Seq. "Double Barcode" error |
| -------- | ----- | ------------------------------- |
| 1 | AATCGA | 26766 |
| 26768 | AGAAGA | 31968 |
| 58737 | AGCTTA | 53149 |
| 111887 | ATACTA | 85635 |
| 197523 | CCCGTA | 24556 |
| 222080 | CCGAGA | 76933 |
| 299014 | CTTTGA | 114422 |
| 413437 | GAATTA | 58672 |
| 472110 | GCAGGA | 74227 |
| 546338 | GGTTCA | 15182 |
| 561521 | GTCCGA | 109201 |
| 670723 | GTGATA | 69517 |
| 740241 | TGACCA | 12528 |
| 752770 | TGGCGA | 80383 |
| 833154 | TGTGTA | 57795 |
| 890950 | TTGTCA | 6329 |

*(897,279 total lines)*
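For next time, the same per-plate tally can be done directly from *nohup.out*; a minimal sketch, assuming (as the line numbers above suggest) that the loop prints one non-warning line per plate before that plate's "Double Barcode!" warnings:

```
awk '
  !/Double Barcode/ { plate = $0; next }   # non-warning line = current plate label
  { count[plate]++ }                       # "Double Barcode!" warning line
  END { for (p in count) printf "%s\t%d\n", p, count[p] }
' nohup.out | sort
```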
"Double Barcode" error | | -------- | -------- | -------- | | 1 | AATCGA | 26766 | |26768|AGAAGA |31968| |58737 |AGCTTA |53149| |111887 |ATACTA |85635| |197523 |CCCGTA |24556| |222080 |CCGAGA |76933| |299014 |CTTTGA |114422| |413437 |GAATTA |58672| |472110 |GCAGGA |74227| |546338 |GGTTCA |15182| |561521 |GTCCGA |109201| |670723 |GTGATA |69517| |740241 |TGACCA |12528| |752770 |TGGCGA |80383| |833154 |TGTGTA |57795| |890950 |TTGTCA |6329| |*(897279 total lines)*| - So some plates have more or less, looks like may be partially related to # of samples and possibly sample quality - Also, I went back to check this was roughly the same with the old data-reran demultiplexing script on SOMM117_R1_AATCGA.fastq and SOMM117_R3_AATCGA.fastq; resulting nohup.out has 9651 lines, so yeah, same as before, probably totally normal. ##### 3. Rename files with species and sample info - At this point, rename to be able to parse out by species for downstream analyses - Amended rename bash script and metadata files to rename the demuliplexed .fastq files - See *Rapturerenamefastqarchive09-2017data.sh *script - Created new combined metadata key file with new library added: *BestRADBarcodesCorrected_RaptureNEWSept2017_combinedmetadata.txt* - Note that I added the new library added May 2017 and called it "H04"; and some of the metadata is manually manipulated so see the sample original input spreadsheet if any questions/clarification need *New_RAD_library_sampleinfoLK_SB_June2017.xlsx* -e.g., the 'turtleIDs' are from the spreadsheet to ID mother/offspring vs. the database, and most didn't have D_IDs listed at the time so added "NYA" for not yet assigned-can go back and determine from database/notes from Shreya's extractions etc as needed - Labeling with species codes, D_IDs, plate and well #s - N.B. put all the possible metadata I'd want into these, so later can easily make bamlists to parse out by location, etc. (downside is the names are long, so can amend this script as desired in future uses with shorter names) *Sept 13 2017: ran above amended script and appeared to work (all the files were renamed & moved), but Rich got crash error emails from Ahi...trying to figure out what was/is the issue so dont screw up cluster in future* - Can gzip files above or here (or whenever) to save space-especially for archiving but in general - Moved files back into combined folder of RAs and RBs for ease of downstream analyses - If need to batch gzip/unzip these, use the recursive options: ``` nohup gzip -r Renamed_fastq & #filename is the directory/folder for all the files to compress ``` **TO DO:** - [ ] Send Phil info to run their script to generate z# and put into Genetics-NGS-archive folder for NOAA archiving ##### 4. Running FASTQC (by sample) & aggregate with MultiQC: ``` cd Renamed_fastq nohup fastqc -f fastq -t 10 *fastq > fastqc.stdout 2> fastqc.stderr & nohup /usr/local/python/bin/multiqc . & ``` - Overall looking at MultiQC, there are clear issues of violations of expectations of a 'random library', however, ours is not a random library bc we are targeting...e.g., looking at sequence duplication rates-majority of samples 'fail', but actually (if I am thinking about this correctly), that is a good thing-that is telling us about our X coverage. However, it's not clear to me what some of the other flags are being caused by-e.g., the wacky GC issues. Need to think about these a bit more, in concert with other QC below. 
##### 4. Running FASTQC (by sample) & aggregate with MultiQC:

```
cd Renamed_fastq
nohup fastqc -f fastq -t 10 *fastq > fastqc.stdout 2> fastqc.stderr &
nohup /usr/local/python/bin/multiqc . &
```

- Overall, looking at MultiQC there are clear violations of the expectations for a 'random library'; however, ours is not a random library because we are targeting loci. E.g., looking at sequence duplication rates, the majority of samples 'fail', but actually (if I am thinking about this correctly) that is a good thing - it is telling us about our X coverage. However, it's not clear to me what is causing some of the other flags - e.g., the wacky GC issues. Need to think about these a bit more, in concert with the other QC below.
- Repeating MultiQC by species/location so the smaller batches enable interactive plots and let me see patterns better, zoom/hide/highlight particular samples, etc.

```
nohup /usr/local/python/bin/multiqc ./LV* &
mv multiqc_report.html ../LV_multiqc_report.html
mv multiqc_data ../LV_multiqc_data
#and so on for each species
```

- Renamed the output with the species code for each
- I further split up DC into regions using the file-list option because there were still too many files for interactive plots

```
nohup /usr/local/python/bin/multiqc --file-list EPAC_fastqclist.txt &
nohup /usr/local/python/bin/multiqc --file-list WPAC_fastqclist.txt &
nohup /usr/local/python/bin/multiqc --file-list ATL_fastqclist.txt &
```

- Note: these sample lists have to point to the fastqc.zip files (not the html) produced by FASTQC

**TO DO:**

- [ ] Go through MultiQC results, add relevant notes here

##### 5. Generate some additional quick summary stats

```
nohup du -h DC*fastq | sort -h >filesize_bysample.txt &
nohup wc -l *fastq >Linecount_bysample_renamed.txt &
```

- File sizes and line counts (skimming) all look fine - they match between RA and RB, with substantial variation across samples as expected.
- Downloaded the output, renamed as *filesize_bysampleNEWData_LK.txt* and *Linecount_bysample_renamedNEWData_LK.txt*; tweaked headings/format to import into R below
- Initial QC checks:
  - See **Section I** of R script: *Rapture_basicstats_NEWdataSept2017.R*
  - Summary stats of raw read numbers for the 'empty wells' confirm that they were actually empty:

| Mean | Median | Min/Max |
| ---- | ------ | ------- |
| 244.85 | 211.5 | 27/818 |

  - One hawksbill sample was somehow way oversequenced (just like in the original data): *EI, D_ID 15048, pG09, wA08*
  - Read distributions vary by plate, but not such that any one library/plate obviously failed
    - See the *Turtle Rapture-Read distribution by plate renamed samples NEWDATA.pdf* and *Turtle Rapture-No. of Reads renamed samples by plateNEWDATA.pdf* graphs
  - Same finding when parsed by species - no particular species failed
    - See *Turtle Rapture-Read distribution by species renamed samplesNEWDATA.pdf* and *Turtle Rapture-No. of Reads renamed samples by speciesNEWDATA.pdf*
  - When we bin samples based on an estimated pass/fail threshold (10K), we similarly see a bimodal distribution for most plates - i.e., many samples in each library 'worked', and then some didn't
    - See *Turtle Rapture-No. of Reads Binned renamed samples by plateNEWDATA.pdf* and *Turtle Rapture-No. of Reads Binned LOW renamed samples by plateNEWDATA.pdf*
  - We do see a few libraries with a higher proportion of failures: G09, H02
  - 10K seems to be a reasonable pass/fail threshold, as samples are usually clearly above or below it (with the exception of the G09 plate)
  - Wrote failed samples to file, combined with metadata, to look for patterns in the 'failed' samples
    - See *No.seq_QC_NEWDATA.csv* & *Failed_Rapturesamples_10Kthreshold_NEWDATA.csv*
  - Also, in later analysis after mapping, created a secondary filter (see below) and wrote to file:
    - *Failed_borderline_samples_NEWDATA.csv* (samples that either had <10K raw reads or <5K filtered alignments)
    - *QC_passed_samples_NEWDATA.csv* (samples that had >10K raw reads and >5K filtered alignments)
  - Summary:

```
> table(failed$species)
  Ccar     CM     DC     Ei     EI hybrid     LK     LV
     1      3    203      0      7      0      0      1
```

- **See R script and notes; summary points:**
1. Many samples are missing year info so are undetermined, but samples known to be collected prior to 1994 didn't work - however, that is just 5 samples - several from Peru and one from an Australian museum
2. No clear failure by country/location except one sample from Cyprus; other countries where more of the bycatch samples come from also have lower medians (American Samoa, South Africa, Peru, etc.)
3. The Australian museum sample didn't work (not surprisingly); oddly, strandings overall came out better than hatchlings - however, there is likely something going on there with low input and/or samples labeled hatchling being old (e.g., Brian Bowen's)... would have to go back and look at which samples are labeled in each of the categories - look at this more if planning to do more hatchling sampling in the future
4. If we want to look at this further, we may need to do some additional metadata sleuthing and QC, because so much is missing from the straight database compilation

---

### Mapping to Reference Genome

#### I. Download/Index Green Turtle genome

- Already done for the past dataset (and indexing is put directly into the bwa script below), but if it needs repeating in the future the command is:

```
nohup bwa index Cmyd.v1.1.fa &
```

- *Program, reference files and versions used:*
  - Program: bwa (alignment via Burrows-Wheeler transformation)
  - Version: 0.7.5a-r405
  - Contact: Heng Li <lh3@sanger.ac.uk>
  - Reference genome used: Huang, Z; Li, C; Li, W; Niimura, Y; Pascual-Anaya, J; Wang, Z; White, S; Zadissa, A; Fang, D; Xiong, Z; Wang, B; Ming, Y; Chen, Y; Zheng, Y; Kuraku, S; Herrero, J; Pignatelli, M; Beal, K; Nozawa, M; Li, Q; Wang, J; Zhang, H; Yu, L; Shigenobu, S; Wang, J; Liu, J; Flicek, P; Searle, S; Wang, J; Kuratani, S; Yin, Y; Aken, B; Zhang, G; Irie, N (2014): Genomic data of the green sea turtle (Chelonia mydas). GigaScience Database. http://dx.doi.org/10.5524/100085
  - Related manuscript: doi:10.1038/ng.2615 (PubMed: 23624526)
  - Accessions (data included in GigaDB): SRA: SRP011574; BioProject: PRJNA104937

#### II. Align PE reads to reference genome using BWA

##### 1. Align and output to sam files

```
ln -s /home/Dcor_Rapture_NEWDATA/ RapNEW_SC
nohup ../RapS_SC/Rapture_bwa_NEWdata.sh &
```

- Just created a relative link *(RapS_SC)* so scripts in the Rapture folder for the old data can be used/amended easily (the shape of the per-sample call is sketched below)
- Suggest testing on a few samples (copied into a test folder) before running on large datasets (e.g., here we have ~1500 samples)
- Check the stderr files (capturing what was on screen for each sample, plus the commands) to look for errors/issues, then delete them if all looks good
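For orientation only, a minimal sketch of the kind of per-sample call a wrapper like *Rapture_bwa_NEWdata.sh* makes; the real options, paths, and any read-group handling are in the script itself, and both `bwa mem` and the `_RA`/`_RB` naming here are assumptions:

```
for ra in Renamed_fastq/*_RA.fastq; do
  rb="${ra%_RA.fastq}_RB.fastq"                        # matching reverse-read file
  sample="$(basename "${ra%_RA.fastq}")"
  bwa mem Cmyd.v1.1.fa "$ra" "$rb" > "${sample}.sam" 2> "${sample}.bwa.stderr"
done
```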
##### 2. Convert sam files to bam file format and sort

- *N.B. Mapping and creating/sorting bam files can be combined into one step/script to streamline things once each piece is confirmed to work properly (they are separated here just to check input/output at each step and allow easy troubleshooting of errors)*

```
nohup ../RapS_SC/Rapture_StB_sort.sh &
```

- Delete the .sam files when done
- *N.B. Unmapped reads can be extracted for de novo assembly/SNP discovery etc. if/when of interest later on - skipping for now, but see the samtools documentation:*

```
samtools view -b -f 4 input_sorted.bam > output.unmapped.bam
samtools bam2fq output.unmapped.bam > reads.fq
# For paired-end reads:
cat reads.fq | grep '^@.*/1$' -A 3 --no-group-separator > reads1.fastq
cat reads.fq | grep '^@.*/2$' -A 3 --no-group-separator > reads2.fastq
```

### *TO DO: add step to filter for proper pairs only before removing PCR duplicates*

Code to add (the commands run immediately before and after are included, commented out, for reference - see the *D.coriacea_readme_file* also; a combined per-sample sketch is at the end of this section):

```
#samtools view -bS ${c1}.sam | samtools sort - ${c1}_sorted
samtools view -b -f 0x2 ${c1}_sorted.bam > ${c1}_sorted_proper.bam #this line is what needs to be added to the script
#samtools rmdup ${c1}_sorted_proper.bam ${c1}_sorted_flt.bam
```

#### III. Run summary stats, remove duplicates & index

##### 1. Print bam file alignment statistics for each sorted, unfiltered .bam file

```
nohup ../RapS_SC/Rapture_bam_stats.sh &
```

##### 2. Concatenate them all together

```
nohup ../RapS_SC/Rapture_combine_sortedflagstat_files.sh &
```

- *Concatenate, but make them tabular so they can be manipulated more easily with awk and in R for summary mapping stats etc.:*

```
nohup ../RapS_SC/Rapture_combine_sortedflagstat_files_tabbed.sh &
```

##### 3. Generate file with just the desired columns

- First look at how many columns there are; this should be the same across samples:

```
awk '{print NF}' All_Rapture_sort_combined.flagstat_tabbed.txt
```

- It parses into 88 columns (every whitespace run becomes a field separator); pick the columns that match what I want:

```
awk '{print $88 "\t" $1 "\t" $23 "\t" $27 "\t"$30 "\t"$44 "\t" $49 "\t" $52 "\t"$60 "\t" $64 "\t" $67 "\t" $77 }' All_Rapture_sort_combined.flagstat_tabbed.txt >All_Rapture_flagstat_reformat.txt
```

- Get rid of the opening parentheses and % signs in the percentage columns:

```
sed -i 's/(//g' All_Rapture_flagstat_reformat.txt #-i changes the original file in place; could test by sending to another outfile first
sed -i 's/%//g' All_Rapture_flagstat_reformat.txt
```

- Concatenate the tab-delimited header file onto the reformatted table:

```
cat short_flagstat_headers All_Rapture_flagstat_reformat.txt >All_head_flagstat_reformat.txt
```

##### 4. Create new bam file with clonal reads (aka PCR duplicates) removed; index

- *Indexing is just for looking at files in IGV, pulling out specific regions, etc.*

```
nohup ../RapS_SC/Rapture_sortedbam_rmdup.sh &
nohup ../RapS_SC/Rapture_bam_index.sh &
```

- Index both the filtered and unfiltered bams above if you want to look at/compare specific regions in IGV etc.
- If not, delete the unfiltered bam files to save space

##### 5. Repeat flagstat/awk steps from above to compare unfiltered/filtered

```
nohup ../RapS_SC/Rapture_bam_stats.sh &
nohup ../RapS_SC/Rapture_combine_sortedflagstat_files_tabbed.sh &
awk '{print $88 "\t" $1 "\t" $23 "\t" $27 "\t"$30 "\t"$44 "\t" $49 "\t" $52 "\t"$60 "\t" $64 "\t" $67 "\t" $77 }' All_Rapture_sortfltr_combined.flagstat_tabbed.txt >All_Rapture_sortfltr_flagstat_reformat.txt
sed -i 's/(//g' All_Rapture_sortfltr_flagstat_reformat.txt
sed -i 's/%//g' All_Rapture_sortfltr_flagstat_reformat.txt
cat short_flagstat_headers All_Rapture_sortfltr_flagstat_reformat.txt >All_head_sortfltr_flagstat_reformat.txt
```

- *Individual flagstat files can be deleted once combined, for organization*
- Downloaded and merged the filtered/unfiltered files in R *(see below)* with a key denoting failed/good/high thresholds, to determine the proportion of raw reads that are generating usable data
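Putting the TO DO above together with the existing steps, a minimal per-sample sketch of sort, keep proper pairs, remove PCR duplicates, and index, written in the older samtools 0.1.x-style syntax used in these notes (adjust if the cluster runs samtools >=1.x):

```
for sam in *.sam; do
  c1="${sam%.sam}"
  samtools view -bS "${c1}.sam" | samtools sort - "${c1}_sorted"         # sam -> sorted bam
  samtools view -b -f 0x2 "${c1}_sorted.bam" > "${c1}_sorted_proper.bam" # keep properly paired reads only
  samtools rmdup "${c1}_sorted_proper.bam" "${c1}_sorted_flt.bam"        # remove PCR duplicates
  samtools index "${c1}_sorted_flt.bam"
done
```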
---

### EDA & Generate Basic Summary Stats & Graphics

#### I. Basic Mapping & Filtering Statistics

- See **Section II** of R script:
  - *Rapture_basicstats_NEWdataSept2017.R*
  - *N.B. Compared to the sections above, this section mostly has companion notes to the R script rather than the actual code and analysis steps*
- Evaluating the proportion of total reads mapped after filtering
  - Median/mean 22-27%, i.e., we lose a lot to duplicates
  - Parsed by species and plate; removed empty wells and the one hawksbill outlier on G09
  - Summary tables and graphics - see the output PDFs and the script to recreate the summary tables
  - Distributions are as expected, see:
    - *Turtle Rapture-Mapped distributions by species NEW DATA.pdf*
    - *Turtle Rapture-Mapped reads by species NEW DATA.pdf*
  - By species: no obvious patterns of mapping success across species, except more outliers with Dcor, but this could simply be because we tried all our samples vs. picking a smaller number for other species (analogous to raw reads)
  - High percentages of reads mapping; no obvious correlation between lower read counts and number mapped, similar for filtered/unfiltered
    - See *Turtle Rapture-No. Reads & Percent Mapped Correlation NEW DATA.pdf*
  - By plate: see the summary tables in the script and *Turtle Rapture-filtered Mapped reads by plate NEW DATA.pdf*
    - Same pattern as raw reads - the 40 ng/ul plates (G10-H01) are the most consistent in read counts and mapping counts, with very high % mapping; but lower concentrations, while showing higher failure rates/variability between samples, also worked frequently.
    - G09 has lower counts - this could also be related to the many reads sucked up by the high Ei outlier (not shown in the graph - subsetted out), perhaps due to a DNA normalization error upstream?
    - H02 has more low outlier mapping percentages - could be related to lower concentrations and more room for contamination, etc.?
- Removed failed samples (<10K raw reads, see below) to look at patterns in the proportion of filtered mapped alignments / raw reads
  - As you get more reads, you are sequencing more duplicates (as expected).
  - See *Prop Filtered Alignments of Raw Reads by PlateNEWDATA.csv* and *Turtle Rapture-prop raw reads f-MappedNEW DATA.pdf*; but there is not necessarily a strong correlation/clear threshold for targeting in the future; the majority of samples cluster around a proportion of 0.2-0.3, not different from before we removed the failed samples
  - BUT there is a relationship with original DNA input: if you start with a higher concentration of DNA, you get fewer clones (this seems obvious, but since we add the same total ng to the reaction, I was surprised the relationship was this strong).
    - See boxplot: *Turtle Rapture-prop filtered mapped-seq frag by DNA input boxplotNEW DATA.pdf*
- Under/over-representation of certain libraries, and/or variability across samples within libraries?
  - Are certain libraries under/over-represented: prop.sample and prop.seq should be approximately in the same ballpark; they mostly are, except H02 is low and H03 and H04 are high
  - Under/over-represented samples within a library: of course, failed samples are under-represented, but interestingly some good and high ones are also; it doesn't seem to be an issue by species; more variability in plates with lower concentrations

#### II. Explore subset of samples visually/semi-quantitatively in IGV

- Downloaded a handful of filtered DC bam files of medium size
- Looked across a smattering of loci; coverage seems consistent.
- Some loci may have SNPs in the cut site that prevent cleavage; however, we were still able to capture and get sequence there, so this avoids null-allele issues (at least in the small number of samples/loci I looked at)
- Some SNPs show up as expected, others not, but looking at just a couple of samples they may not be heterozygotes/alternates
- As expected, best coverage is out to roughly +/- 130 bases

#### III. Determine coverage over intervals, and ID poor-performing loci, etc.

- Using Bedtools
- References:
  - [Bedtools Documentation](http://bedtools.readthedocs.io/en/latest/)
  - [Getting Genetics Done Blog](http://www.gettinggeneticsdone.com/2014/03/visualize-coverage-exome-targeted-ngs-bedtools.html)
  - [Github Bedtools Protocols](https://github.com/arq5x/bedtools-protocols/blob/master/bedtools.md#ap2b-assessing-coverage-in-exome-capture-experiments)
- *N.B. If you want to run just a subset (by species, region, etc.), you can easily move all the files from a list into a new folder, e.g.:*

```
xargs -a DC_STX_bamlist mv -t ./bedtools_test/
```

- N.B. Below is the streamlined version decided on with Mike, focusing on one representative position; see the May2017.md file for extended documentation of file interpretation/exploration and code for target regions, graphics, etc., and these scripts:
  - Rapturebedtoolscoverage.sh
  - Rapturebedtoolscoveragebytarget.sh
  - Exome-coverage-multi.R (makes plots for targeted regions across samples; see the [Getting Genetics Done blogpost](http://www.gettinggeneticsdone.com/2014/03/visualize-coverage-exome-targeted-ngs-bedtools.html) on it)
  - Bedtools_coverageplots.R (makes plots for individual files; see the [bedtools protocols tutorials](https://github.com/arq5x/bedtools-protocols/blob/master/bedtools.md#ap2b-assessing-coverage-in-exome-capture-experiments-) and the [Bedtools ReadtheDocs](http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html))

**Look at coverage at a representative base within target regions: picked relative position 20 to create a matrix of coverage by loci and samples**

##### 1. Generate .txt files listing coverage for each locus per sample:

```
nohup ../RapS_SC/Rapture_bedtools_coverage_pos20.sh &
```

##### 2. Once all the files are generated, pull out the informative columns, add the sample name as a new column, and save as a .bed file:

```
cd ../bedtools_pos20_cov
nohup ../RapS_SC/SNPcov.parse.sh &
```

##### 3. Concatenate all together to generate a long dataframe:

```
cat *.bed >All.combined.pos20.bed
grep -v 'all' All.combined.pos20.bed >All.combined.pos20.2.bed #get rid of the 'all' summary lines
wc -l All.combined.pos20.2.bed
```

###### Check: 3,082,752 lines = 2007 loci x 1536 individuals

##### 4. Open in R to manipulate into wide format, generate summary results & graphs:

- See **Section III. Bedtools coverage summaries and graphics** in:
  - the *Rapture_basicstats_NEWdataSept2017.R* script
- Combined with the flagstat failed/passed files generated above to look at coverage patterns for all vs. 'QC ok only' samples
- Created a matrix of coverage, sample x locus (a rough command-line cross-check of this pivot is sketched below)
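A minimal sketch of the same long-to-wide pivot on the command line, useful only as a spot check on the matrix built in R; it assumes the locus ID is in columns 1-2 of *All.combined.pos20.2.bed*, the depth is in the next-to-last column, and the sample name is in the last column (the real column order comes from *SNPcov.parse.sh*, so adjust accordingly):

```
awk 'BEGIN { OFS = "\t" }
  {
    locus = $1":"$2; sample = $NF; depth = $(NF-1)   # column positions are assumptions
    cov[locus, sample] = depth; loci[locus]; samples[sample]
  }
  END {
    printf "locus"; for (s in samples) printf OFS "%s", s; print ""
    for (l in loci) {
      printf "%s", l
      for (s in samples) printf OFS "%s", ((l, s) in cov) ? cov[l, s] : 0
      print ""
    }
  }' All.combined.pos20.2.bed > pos20_coverage_matrix.txt
```

Sample column order here is whatever awk's array traversal gives; the R pivot remains the authoritative version.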
##### a. Calculating coverage per locus at different thresholds

- The main goal here was to examine general patterns in the coverage we are getting across samples & loci
- Thresholds: 4x, 8x, 16x, 24x
- Wrote to file: *Pos20coverage_bylocusGENREF_all_NEWDATA.csv*
- Calculated proportions and graphed for all samples, see:
  - *Coverage by locus GENREF hist.allsamp.NEWDATA.pdf*
  - With all samples, we see that the majority of loci are covered at 8x in 50-70% of the samples
  - With the 16x/24x thresholds this drops to 25-50% of samples
- Calculated proportions and graphed for QC-passed samples, see:
  - *Coverage by locus GENREF hist.QCok.NEWDATA.pdf*
  - With just the QC-passed samples, the majority of loci are covered at 8x in ~65-80% of samples
  - With the 16x/24x thresholds this drops to 50-75% of samples
  - But this is still several hundred loci, which may be totally fine for analyses (depending on goals)
- **Take home:** seems decent; would have liked to see higher numbers here, but some loci may just not be consistent and/or we could sequence more to increase depth of coverage if desired

##### b. Calculating coverage per sample at different thresholds

- Thresholds: 4x, 8x, 16x, 24x
- Wrote to file: *Relpos20_GENREF_bysample.allsamp.NEWDATA.csv* and *Pos20coverage_bylocus_GENREF_QCok_NEWDATA.csv*
- Calculated proportions and graphed for all loci, see:
  - *Coverage by sample GENREF hist.allsamp_NEWDATA.pdf*
  - With all samples we see a U-shaped distribution of the proportion of loci covered at 8x = *some samples failed, some low numbers with intermediate coverage throughout, and then several hundred samples have >75% of loci covered*
  - With the 16x/24x thresholds this drops, so there are fewer samples in the upper tail, and 300+ samples have 0 loci covered at this level :(
- Calculated proportions and graphed for QC-passed samples, see:
  - *Coverage by locus GENREF hist.QCok.NEWDATA.pdf*
  - With just the QC-passed samples, the bottom of the U goes away (not surprisingly - we removed the samples that didn't sequence well and thus didn't have coverage at many loci), and the rest of the results/distributions are the same
- **Take home:** again seems decent, with several hundred samples covered at >=50% of loci, but I would have liked it to be higher; again, could do more sequencing if desired, but it depends on what we want the data for
- **Think about/maybe talk to Mike: any reason I can't concatenate the old & new data to increase coverage at least for some loci and not throw away those data?**

##### c. Comparing coverage at thresholds collapsed by # raw reads

- Combining thresholds by sample, see output graphics:
  - *Turtle Rapture-Coverage pos20 Plot 2NEW DATA.pdf*
  - *Turtle Rapture-Coverage pos20 Plot 2NEW DATA QCokonly.pdf*
- The raw reads vs. higher coverage relationship reaches an asymptote pretty quickly, but is variable by sample; see the further exploration of the 10K threshold

##### d. Comparing coverage (pos 20) by mapped alignment #s

- Graphing the # of mapped alignments vs. the no. of loci covered at >=4X
- The main goals of this were to:
1. See if there's a clear threshold for the number of mapped, filtered alignments you need to get good enough coverage across targets
   - See output files:
     - *Coverage pos 20-take2NEWDATA.pdf*
     - *Coverage pos 20-take2_OTRM_NEWDATA.pdf* (outlier removed to visualize better)
     - *Coverage pos 20-good threshold explorationNEWDATA.pdf*
   - We get about half of the loci covered at 4X with ~25K filtered alignments
   - We get close to all loci covered at 4X with ~50-75K filtered alignments
     - See file: *Coverage pos 20-allsamplesforMS_OTRM_NEWDATA.pdf*
   - But it's not clear what is going on with the samples that had >10K reads but a quarter or less of the loci covered at 4X
   - Samples marked as failed by the 10K threshold do generally have fewer than 200 loci covered. However, a substantial number of samples above that threshold also have low target coverage. This is something we may want to consider later when determining which samples to omit from analyses (see more below)
   - Compared to the old dataset, patterns are similar, but the new data clearly has better on-target recovery and consistency
2. Is our 10K read threshold generally discerning of what we should call a 'failed' sample? *(i.e., are samples with <10K reads the 'failures', or are there some with higher raw reads but poor coverage of target loci, and vice versa?)*
   - See output files:
     - *Coverage pos 20-take2_OTRM_NEWDATA.pdf*
     - *Coverage pos 20.samp<500loci4x_NEWDATA.pdf*
     - *Coverage pos 20.samp<500loci4x_thresholdlines_NEWDATA.pdf*
   - Looking at the relationship between these suggests that our 'failed' filter at 10K is a good one: although there are some samples with higher numbers of reads that aren't getting good recovery, there are some with only ~12K reads that have 400+ loci covered.
   - One option is to add a secondary filter - i.e., 'failed' = <10K reads OR <5K filtered alignments.
3. Why are some samples maxing out at ~1750 loci despite higher coverage?
   - Looked at summary stats; nothing by plate or well
   - Species summary:

```
  Ccar     CM     DC     Ei     EI hybrid     LK     LV
    18      0     15      5     15      2      3      2
```

   - Notice that there are no green turtles and proportionally fewer DC - likely polymorphisms/indels in RAD cut sites/bait sequences in other species relative to the green turtle reference cause certain loci to drop out (?)
   - See output graph: *Coverage pos 20.byspecies_NEWDATA.pdf* - it shows nicely the fallout of different species at the lower loci numbers; note the green turtles up high and two hybrids - would be interesting to go back and see if those hybrids are green mixes (probably).
4. Are the same loci dropping out from the samples maxing out at ~1750?
   - Generated summary tables & histograms *(see R script)*
   - **Take Home:** There doesn't seem to be one specific set of loci that failed for all of these samples (we would expect a sharp frequency at or near 0), but there is a set of 151 loci that failed for half or more of the individuals in this group *(i.e., failed for 30+/60 individuals - see the histograms/summary tables generated).*
   - Could look at the complete sample set to see whether these are just consistently poor loci across most samples, or whether something special is going on with these ones
   - Additionally, there are loci that worked for 2/3 or fewer of the samples (315 loci; 50/60 individuals), and these may be species-specific or have other causes. Can look into this further later by species etc. as desired.
---

- Also created graphics of loci covered at >=4x coverage by # filtered alignments, and vice versa by sample
  - Used these to generate a list of 'failed' samples based on certain read and coverage thresholds - *insert results here*
  - Used these to generate a list of 'failed' loci based on coverage thresholds - *insert results here*

#### IV. Examine on/off target coverage & GC content issues

- *Align RA reads (only) to the bait reference to check on/off target coverage across loci and GC content issues (see the notes on the previous data for details)*
- *Also looked at coverage by locus and sample as above, but keep in mind that since these are RA-only alignments, duplicates have not been removed and coverage counts are inflated*

#### A. Evaluating % on/off target alignments

##### 1. Aligned Rapture data using the 120 Rapture baits as reference.fasta

- Same code/steps as above for mapping with BWA, just a different reference

##### 2. Pulling in (previously) aligned RAD data & old Rapture data using the 120 Rapture baits as reference.fasta

- See *RAD_bwa_baitref.sh*

##### 3. Sam to bam/sort for steps (1) and (2)

- *N.B. Same as above, except skipping the rmdup step for all of these since they are RA-only reads*

##### 4. Calculate summary stats for each with *flagstat*

##### 5. Process flagstat to get collated summary tables following the steps above:

```
nohup ../RapS_SC/Rapture_bwa_BS_baitref.sh &
nohup ../RapS_SC/Rapture_StB_sort_baitref.sh &
nohup ../RapS_SC/Rapture_bam_stats.sh &
nohup ../RapS_SC/Rapture_combine_sortedflagstat_files_tabbed.sh &
awk '{print $88 "\t" $1 "\t" $23 "\t" $27 "\t"$30 "\t"$44 "\t" $49 "\t" $52 "\t"$60 "\t" $64 "\t" $67 "\t" $77 }' All_Rapture_sort_baitref_combined.flagstat_tabbed.txt >All_Rapture_sort_baitref_combined.flagstat_reformat.txt
sed -i 's/(//g' All_Rapture_sort_baitref_combined.flagstat_reformat.txt
sed -i 's/%//g' All_Rapture_sort_baitref_combined.flagstat_reformat.txt
cat short_flagstat_headers All_Rapture_sort_baitref_combined.flagstat_reformat.txt >All_Rapture_head_sort_baitref_combined.flagstat_reformat.txt
```

##### 6. Combine RAD and Rapture flagstat to compare across datasets

- See **Section IV.** of R script:
  - *Rapture_basicstats_NEWdataSept2017.R*
- Created the graph in R, with/without filtering out 'failed' samples
- **See output graphs in "On-off Target Capture.pdf"**
- Similar patterns as before, except better - the new data has higher proportions of reads mapping to targets (a quick per-sample check is sketched below)
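A quick, rough way to eyeball per-sample on-target percentages straight from the bait-reference flagstat files before the R step; the `*baitref*flagstat*` glob is an assumption about how the per-sample flagstat files from step 4 are named:

```
for f in *baitref*flagstat*; do
  pct=$(awk '/ mapped \(/ {gsub(/[(%]/, "", $5); print $5; exit}' "$f")   # overall "mapped (NN.NN%)" line
  printf "%s\t%s%%\n" "$f" "$pct"
done | sort -t$'\t' -k2,2n
```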
#### B. Look at coverage at representative base within target regions

- Following steps analogous to the samples mapped to the reference genome above with bedtools
- In brief, picked relative position 20 to create a matrix of coverage by loci and samples & generate .txt files listing coverage for each locus per sample:

```
nohup ../RapS_SC/Rapture_bedtools_coverage_pos20baitref.sh &
```

- Once all the files are generated, pull out the informative columns, add the sample name as a new column, and save as a .bed file:

```
cd baitref_relpos20cov
nohup ../../RapS_SC/SNPcovbaitref.parse.sh &
```

- Concatenate all together to generate a long dataframe:

```
cat *.bed >All.combined.pos20.baitref.bed
grep -v 'all' All.combined.pos20.baitref.bed >All.combined.pos20.baitref.2.bed #get rid of the 'all' summary lines
wc -l All.combined.pos20.baitref.2.bed
```

###### Check: 2,693,394 lines = 2007 loci x 1342 individuals

- Move into R to manipulate into wide format and generate summary results & graphs, same as for the reference-genome mapping above, but see the companion script:
  - the *RaptureSept2017_baitref_coverageQC.R* script
- **N.B. I did this for RELATIVE comparisons with the previous dataset's GC content issue, as well as with the new duplicates-removed dataset mapped to the genome, keeping in mind that the absolute coverage numbers here are inflated**
- Combined with the QC failed/passed files generated above to look at coverage patterns for all vs. 'QC ok only' samples
- Comparing between the two, removing the <10K read samples removed ones that had zero or very low coverage of target loci, see:
  - Coverage by locus bait reference hist.QCok.NEWDATA.pdf
  - Coverage by locus bait reference hist.allsamp.NEWDATA.pdf
  - Coverage by locus bait reference histograms.pdf (OLD DATA - compare distributions; the new data are better!)
- However, some retained samples still have <25% of loci covered even at 4 or 8X, so we may need additional filtering/to ID these samples post SNP calling, because they will likely present issues of high proportions of missing data
  - Can look in the saved summary matrices to further ID which samples they are (e.g., should we exclude samples that are missing decent coverage data for more than some threshold number of targets?)
- Similar findings with loci - removing failed samples shifts the distribution to the right; this makes sense, because even the really great loci are unlikely to work consistently in bad-quality samples.
  - Coverage by sample bait reference hist.QCokNEWDATA.pdf
  - Coverage by sample bait reference hist.allsamp_NEWDATA.pdf
  - The majority of loci now seem to have coverage in ~>75% of the samples (compare to figures from the previous Rapture data)
- Wrote to file so we have coverage matrices and summaries by locus and sample for future use:
  - Pos20coveragebylocusbaitrefQCokNEWDATA.csv
  - Pos20coveragebylocusbaitrefallNEWDATA.csv
  - Pos20coveragebaitrefbysampleQCokNEWDATA.csv
  - Pos20coveragebylocusbaitrefQCokNEWDATA.csv
- Can use these to generate lists of 'failed' samples and/or loci based on certain read and coverage thresholds

#### C. Bait AT/GC content issue:

- Pulled in GC content info to match with coverage, see:
  - the *RaptureSept2017_baitref_coverageQC.R* script
- And output:
  - *GC content by coverage_newdata.pdf*; *GC_Rap_oldvnew.png*
- The new baits and/or polymerase fixed most of the problem, though there is still higher failure with increased GC content, so good to know for future bait design - compare the graphs (an alternative command-line GC calculation is sketched below)
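If the per-bait GC values ever need regenerating outside R, a minimal sketch for computing GC fraction from the bait fasta; this is not how the GC info was actually pulled in (that was done in R per the script above), it assumes one sequence line per record, and the fasta file name is a placeholder:

```
awk '/^>/ { name = substr($0, 2); next }
     {
       gc = gsub(/[GCgc]/, ""); at = gsub(/[ATat]/, "")
       printf "%s\t%.3f\n", name, (gc + at) ? gc / (gc + at) : 0
     }' rapture_baits_120.fasta > bait_gc_content.txt   # bait fasta name is a placeholder
```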
---

### Rapture Raw & Mapped Data QC Take Home Summary

#### Used the coverage/quality metrics/graphs/spreadsheets generated throughout the QC data analysis above to determine:

1. Raw files look normal, with no obvious signs of corruption/sequencing problems
2. Consistency across libraries/samples, barcode issues, etc.
   - Low percentages of sequences assigned to empty wells, ghost barcodes, and double barcodes
   - Generated summaries of reads per plate/# samples = are libraries under/over-represented?
     - See file: *Library under-over-representation summary NEWDATA*
     - prop.sample and prop.seq should be approximately in the same ballpark; they mostly are, except H02 is low and H03 and H04 are high
     - Under/over-represented samples within a library: of course, failed samples are under-represented, but interestingly some good and high ones are also; it doesn't seem to be an issue by species; more variability in plates with lower concentrations
   - Reads for each sample normalized to total reads per library = under/over-represented samples within a library?
     - *See file: Flagstat_combined_filt_unfilt_NEWDATA_LK_plus_under_over_repsamples.xlsx*
     - Failed samples are below, not surprisingly, but in general some samples are under- and some over-represented, and one is really high - likely a normalization error
     - Would like this to be more consistent in the future, but it's fine for these data
3. Established 10K reads as the threshold for designating 'failed' samples; could also add a secondary filter - i.e., 'failed' = <10K reads OR <5K filtered alignments.
4. Generated lists of passed, failed, and borderline (potential redo) samples
   - Failed samples = below thresholds, at the low end
   - Borderline samples = pass the base threshold, but have lower read numbers and coverage
   - Wrote to file:
     - *Failed_Rapturesamples_10Kthreshold_NEWDATA.csv* (samples that had <10K raw reads)
     - *Failed_borderline_samples_NEWDATA.csv* (samples that either had <10K raw reads or <5K filtered alignments)
     - *QC_passed_samples_NEWDATA.csv* (samples that had >10K raw reads and >5K filtered alignments)
   - No clear pattern of failure by plate, well, location, or species; mostly older samples and lower concentrations (but not always)
   - Higher DNA input plates (40 ng/ul) have higher consistency in success
5. Generated summaries of raw reads vs. filtered mapped reads to understand the proportion of sequences generating data
   - See file *Flagstat_combined_filt_unfilt_NEWDATA.csv* and the summary graphics (boxplots, histograms) and tables by plate (not printed to file - in the R script if numbers are needed)
   - After mapping & filtering: ~25% of sequences are left (though there may be something in the calculations here for raw reads vs. mapped alignments where they are double counted)
   - Similar patterns to raw reads; no clear issues with mapping across species, plate, etc.
   - Very high percentages of mapping
6. Estimated coverage across samples and loci
   - Overall, coverage seems decent, with several hundred samples covered at >=50% of loci
   - Get close to 4x coverage for all loci with ~50-75K filtered alignments
   - ~200 loci drop out for the other species (not CM or DC)
7. Evaluated on/off target % and GC content issues
   - High on-target capture
   - Failure of loci with high GC content has largely gone away

---

### *See Part II. notes for genotyping assessments & continued analyses...*