# Quality Control (FastQC and pycoQC)
###### tags: `code`

### 1. Short Reads (<300 bp)
- NovaSeq, HiSeq, NextSeq, MiSeq -> Illumina
- Ion Proton/S5 - Life Technologies
- MGISEQ/BGISEQ-500 - BGI

### 2. Long Reads (>350 bp - kilo/mega bases)

## Let's work with short read Illumina

a. Raw data folders (1 GB to 1.2 TB)

b. 100,000 to 700,000 files per data folder

c. Still in a compressed form, so we'll need to decompress

### FASTQ:
This is the format provided by the sequencing facility - the "finished" product format. It is basically a text file; the largest are ~650 GB each, compressed.

A FASTQ record consists of 4 lines:
1. Header: always starts with an @ sign, then a long string including the serial number of the sequencer : run number : serial number of the flow cell : lane number : tile : coordinates, then the read number and further fields (`read number:?:?:`)
2. Sequence data
3. +
4. Quality line, in ASCII code: Q-score + 33 = the number that corresponds to an ASCII character (NOTE: "I" is good ;)). Every base has a corresponding character that encodes its quality score. The Q-score is an integer mapping of p, the estimated probability that the corresponding base call is incorrect (Q = -10 log10(p)). Q40 is very good: it corresponds to p = 0.0001.

#### More on Q-Score Scales:
Corbin talks a bit about things that can affect quality scores and how flow cells work...

## Data Analysis workflow
## READS > quality control > trimming? > mapping or assembly > variant calling/counting > visualization

Note on trimming READS:

- R1_Read1 R1_Read2 R1_Read3
- R2_Read1 R2_Read2 R2_Read3

If trimming removes a read, you have to remove it from BOTH R1 and R2. Otherwise one file will shift and the other won't, and the read pairs will no longer line up.

## 1. Import your reads

## 2. QC your data using FastQC (this is for Illumina data)

*Note:* you can use **multiqc** to compile a single report that combines LOTS of FastQC reports together (it also understands output from more tools than just FastQC, e.g. tools that work on BAM files).

Activate the correct Conda environment, if you are using one:
```
conda activate bioinfo
```
To find exactly where your program is (in this case fastqc):
```
which fastqc
```
We need to specify the sequence files, an output directory (-o), and the number of threads to use (-t):
```
mkdir -p ~/QC
fastqc data/seqnameR1.fastq.gz data/seqnameR2.fastq.gz -o ~/QC -t 14
```
Now use multiqc. It searches the directory you give it, so run it from the directory that holds the FastQC reports (here `~/QC`):
```
multiqc .
```
Execute commands in a Jupyter notebook:
```
source ~/class_project/project_env/bin/activate
jupyter notebook
```
Now that we've opened the Jupyter notebook in the first terminal, open a new terminal (Shell > New Tab) and log in again using this specific command:
```
ssh -L 8888:localhost:8888 labuser@34.148.164.14 -i my.key
```
Now you can copy and paste the URL from your first terminal tab into your local browser (Google Chrome, Internet Explorer, etc.).

What does the FastQC report mean?

a. Basic Statistics
- %GC content - we need to compare this to expectations to make sure we're working with the right data
- sequence length: can tell you whether the data has been trimmed already (if already trimmed, you will see a range of lengths)

b. Per Base Sequence Quality - typically quality scores drop toward the ends of each read (why? as sequencing proceeds, the clusters get 'fat' over time and the software may start having trouble distinguishing them)

c. Per sequence quality scores

d. Per base sequence content: the first few cycles of the reads (position in bp) tend to show wonky, biased A/T and C/G content - this is normal

e. Per sequence GC content: sometimes looks askew from expectations because of rRNA content, or because you've captured another organism's genome (like Wolbachia in flies)

f. Sequence duplication levels: if you were to remove sequencing duplicates, this tells you how much would remain as unique sequence. Keep in mind what TYPE of dataset you are using when looking at this.

g. Overrepresented sequences - if you have any, they are listed in a table

h. Adapter content - toward the end of the reads there is some adapter sequence that we'll want to remove.

## pycoQC (this is for Nanopore data)
- is a QC program for Nanopore data
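
Going back to the FASTQ quality line from the FASTQ section: the "Q-score + 33" rule can be tried out in a notebook cell. This is only a minimal sketch, assuming the Phred+33 encoding described in these notes; the function names `decode_quality` and `error_probability` are made up for illustration and are not part of FastQC.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding, as described in the notes


def decode_quality(quality_line: str) -> list[int]:
    """Turn the 4th line of a FASTQ record into one Q-score per base."""
    return [ord(char) - PHRED_OFFSET for char in quality_line]


def error_probability(q_score: int) -> float:
    """Invert Q = -10 * log10(p) to get the estimated error probability p."""
    return 10 ** (-q_score / 10)


# "I" has ASCII code 73, so it decodes to Q40 (the "I is good" note).
scores = decode_quality("II5#")
print(scores)                  # [40, 40, 20, 2]
print(error_probability(40))   # about 0.0001, i.e. 1 error in 10,000 calls
```

A run of "I" characters in the quality line therefore means high-confidence base calls, while characters near the start of the printable ASCII range (like "#") mark bases that are likely wrong.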
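
Going back to the note on trimming paired reads: the reason a dropped read has to be removed from BOTH files is that R1 and R2 are matched by position. The sketch below illustrates the idea on plain Python lists of sequences; `filter_pairs` and `min_length` are hypothetical names for this example, and in practice a trimming tool run in its paired-end mode would normally take care of this.

```python
def filter_pairs(r1_reads, r2_reads, min_length=20):
    """Keep a pair only if BOTH mates are at least min_length long.

    r1_reads[i] and r2_reads[i] are assumed to be mates of the same
    fragment, so a read is never dropped from one list alone.
    """
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 must contain the same number of reads")

    kept_r1, kept_r2 = [], []
    for mate1, mate2 in zip(r1_reads, r2_reads):
        if len(mate1) >= min_length and len(mate2) >= min_length:
            kept_r1.append(mate1)
            kept_r2.append(mate2)
    return kept_r1, kept_r2


# Toy data: after trimming, pair 2 has a short R1 and pair 3 a short R2.
r1 = ["ACGTACGT", "ACG", "ACGTTTGA"]
r2 = ["TTGACCAA", "GGCATCGA", "CA"]

kept_r1, kept_r2 = filter_pairs(r1, r2, min_length=5)
print(kept_r1)  # ['ACGTACGT']
print(kept_r2)  # ['TTGACCAA']
```

If the short reads had been removed from each list independently, R1 and R2 would each keep two reads, but the second read in R1 would no longer be the mate of the second read in R2.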