# BI345 Lab 2 Goal: quality check, process, and align illumina sequencing reads to a reference genome - This is imporant because it is the foundational step of many different kinds of genomic data analyses - it will affect later conclusions based on data We will use BWA for alignment but general steps will be similar regardless of what aligner you choose in the future After alignment --> we process alignment and call variants NOTE: - for lab exercises: working directory should be ``` /courses/bi345/sdivit25 ``` - you can either work directly in that level or w/in "Private" folder - Remember that everyone can see into each other's directories at the top level ==> we will use this system in the future for shared projects This week, we are working with subset of data from Conwill et al paper - Data is in: ``` /courses/bi345/Course_Materials/wk_03 ``` - in this folder, you'll also find metadata that is associated with these data in file: conwill_subset.txt - NOTE: Understand what this data is - NOTE: for the SRR# that contained data for more than a million reads, Prof Noh took only 1 million read subset (just for purpose of this lab) - she already ran through the entire pipeline for 7 out of 8 samples in subset of data - so today: I am processing the last sample's data ## Exercise 1 - Quality Controll of Ilumina/NGS Reads Gen Notes: - use fastp to do quality control of our raw illumina reads - raw sequencing reads are typically not used as is - instead, we want to... - remove sequencing adapters and barcodes (not part of organism) - remove any reads/parts of reads that don't pass a quality filter Website with examples of "good" and "bad" illumina data: https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/ 1.1 Find out basic usage of fastp. I need at least these options (which are input and output files): ``` --in1 --out1 --in2 --out2 ``` ![](https://i.imgur.com/9JuSRep.png) Essentially fastp is a fastq processer --> quality profiling, adapter trimming, read filtering and base correction 1.2 Execute quality control for the input SRR16361397*.fastq files in this week's folder. NOTE: you should be able to do this without moving the original files, but you need to redirect the output to your working directory. ``` fastp -i /courses/bi345/Course_Materials/SRR16361397_1.fastq -o out.SRR16361397_1.fastq -I /courses/bi345/Course_Materials/SRR16361397_2.fastq -O out.SRR16361397_2.fastq ``` 1.3 Look at the html report that's in your working directory now. The data looks fine to Pof Noh but compare what you see with examples in website (given above) and make own judgement. - Looks good to me Q) What is the base quality scores threshold used by fastp? What do these scores mean? See here (https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores) for some discussion about where you might establish a Phred score threshold. A: The quality score is an integer typically in the range between 2-40. A high-quality score seems to have a value higher than 30. A score of 20 or above is acceptable, which means that whatever it qualifies is 99% accurate. In general, the higher score indicates a higher probabiity that a particular decision is correct. Thus, a lower score indicates a higher probability that the decision is incorrect. ## Exercise 2 - Get and index your reference genome Gen notes: - now we can align our clean QC'd read to our ref genome for Cutibacterium acnes C1 - Prof Noh downloaded the fasta file for this genome from NCBI based on the info provided in Conwill et all - Note: we are using BWA for our aligner - general instructions for using BWA: https://github.com/lh3/bwa#getting-started - we are using the newest version (this link tells you about what is new in this version: https://github.com/bwa-mem2/bwa-mem2) - as always, type in bwa in terminal to see more instructions on what commmands are possible/see the built-in manual 2.1) Normally in this step, you would index the ref genome, BUT Prof Noh has already this so we can share the reference (it will generate a bunch of associated files that include the BWT genome & suffix array): ``` #bwa-mem2 index Cutibacterium_acnes_C1.norm.fasta ``` note for future labs: # means that we don't have to run it NOTE: if later in alignment process, there's an error saying something about a "diff line length" regarding fasta file, you need to run a command like this and redo everything from this point on: ``` picard NormalizeFasta I=inputFile O=outputFile ``` ## Exercise 3 - Align read to your reference genome Gen Notes: - here, we will simultaneously align clean (QC'd) paired end FASTQ files to the reference genome - if you had more than one sample (ie up to two paired end files at a time), you'd this one sample at a time - here Prof Noh used default alignment parameters BUT you can changes these if desired (be sure to reference manual) - NOTE: if this was a single end alignment, you'd list just the one fastq file after the reference genome 3.1) Before running the alignment, look at the following command to see how Prof Noh is adding READ GROUP info to this alignment using the -R option. Based n metadata, all of these SRR# belong to the same subject but are different samples. The simplest way to differentiate the samples is to note which pore (p) and which colony (c) each sample came from. This is why the filename and readgroup ID, SM, and LB are all denoted with p05c01 for the specific SRR# used below. You need to change these accordingly for the SRR# you align. 3.2) Align SRR# files you just cleaned. ``` bwa-mem2 mem -M -R '@RG\tID:p11c01\tSM:p11c01\tLB:p11c01' /courses/bi345/Course_Materials/new_03/Cutibacterium_acnes_C1.norm.fasta out.SRR16361397_1.fastq out.SRR16361397_2.fastq > p11c01.bwa.sam ``` 3.3)Take a look at the first few lines of your alignment file. It is in a format called SAM. The binary version of an alignment file is called BAM file. Prof Noh example output (scroll to the side in the lab manual): ![](https://i.imgur.com/KBYWL6z.png) My output: ![](https://i.imgur.com/31zeBOx.png) Note: header contains info regarding what is in file & how it was created - line by line data that follow are the alignment info for each read 3.4) Use table 1 to decode what one line of data from your SAM file means. Notice that your aligned raeds are paired and have bitwise flags that indicate whether the paired reads were properly aligned or not. Definitely follow the link (https://broadinstitute.github.io/picard/explain-flags.html) and enter your flag (the number) to see what it means. Q2) Decode line of SAM file. ![](https://i.imgur.com/bhA0MC3.png) The sample with a query name of SRR16361397.2 has a flag number of 83 (which means that there is a read paired (0x1), read mapped in proper pair (0x2), read reverse strand (0x10), and first in pair (0x40)).The reference sequence name is NC_018707.1 and the 1-based leftmost position/coordinate of the clipped seqeuence is 1932506. The Phred quality score is 60 (indicating a very high quality sample). The extended CIGAR string is 101M and the marte reference sequence name is the same as the reference sequence name. The 1-based Mate position is 1932315. The inferred insert size is 292. The sequence on the same strand as the reference is GATCTCCTGGATGTTCGAGACCACCTTGTCATGAAGCTGATCTCGCCCCTTCGGTGGCACAACGACGAAGACCGGGGTGCCCTCCTCGATGACGGCTATGN. The query quality is AS:i:100. ## Exercise 4 - Refine your alignment Gen notes: - you wil keep only the properly aligned reads from your alignment - we are using "Picard" to do this - to read more about Picard: https://broadinstitute.github.io/picard/index.html - then we will use "Samtools" to examine your alignment - to read more about Samtools: http://www.htslib.org/doc/samtools.html 4.1) First you need to sort the reads into a compressed alignment format (BAM) using a modified version of this command. Currently, the alignment are in the same order as the reads from your fastq file. ``` picard SortSam INPUT=p11c01.bwa.sam OUTPUT=p11c01.bwa.sorted.bam SORT_ORDER=coordinate ``` 4.2) Collect summary stats about how well the alignment went for your file: ``` picard CollectAlignmentSummaryMetrics R=/courses/bi345/Course_Materials/new_03/Cutibacterium_acnes_C1.norm.fasta I=p11c01.bwa.sorted.bam O=p11c01.bwa.stats.txt ``` 4.3) Take a look at the resulting text file with cat and with a modified version of this command to look specifically at the last couple colums: ``` cut -f 17,18 p11c01.bwa.stats.txt ``` Output: ![](https://i.imgur.com/2F4TFPZ.png) Ideally, most of your paired reads are now aligned in proper pairs. This is a reality check regarding the quality of your sequencing library! - basically you want to make sure that the DNA/RNA mostly came from your organism & the paired end reads are not chimeric (2 pieces of DNA from diff parts of a genome that somehow got pasted together) 4.4) Index your mapped file for the next step (it won't work unless we do this step!) using this command: ``` samtools index p11c01.bwa.sorted.bam ``` 4.5) Now look at the graphical representation of the initial alignment using a modified version of this command: ``` samtools tview p11c01.bwa.sorted.bam ``` ^you can use arrow keys or any other option (ie ? to see what they are) to just explore Prof Noh prefers to just quickly check that alignment worked & that all reads are piled up on top of each other in an alignment 4.6) Before moving on to call variants (SNPs or indels), we need to mark duplicates in our alignment Duplicates occur in NGS for 2 broad reasons: 1) via PCR steps that are common in NGS library construction protocols 2) via "optical" duplication - where sequencer reads a single spot of bridge-amplified reads as two separate spots NOTE: for whole-genome resequencing (DNA-seq), these duplicates are mre of a nuisance basically than anything terribly harmful for the following steps leading up to variant calling **Most researchers agree that it's best to mark any reads that are potential duplicates - b/c at the least an excess of only some kinds of duplicates can indicate that there was something wrong with the sequencing run REMEMBER: sequencing depth is a way of estimating confidence in a variant call (on top of BaseQ, MapQ) so it's best to be aware of duplicates during variant calling when gauging your confidence in a variant 4.7) Mark duplicates using a modified versino of the following command: ``` picard MarkDuplicates I=p11c01.bwa.sorted.bam O=p11c01.bwa.marked.bam METRICS_FILE= p11c01.bwa.duplicate.metrics ASSUME_SORT_ORDER=coordinate TAGGING_POLICY=All VALIDATION_STRINGENCY=LENIENT ``` Lab manual about histogram in metric file: FAQ link ---> https://broadinstitute.github.io/picard/faq.html ![](https://i.imgur.com/f3wtbPE.png) ## Exercise 5 - Call variants Gen notes: - to call variants, we will use "bcftools" - to read more specifically abt variant calling: https://samtools.github.io/bcftools/howtos/variant-calling.html 5.1) We can call variants for our alignments BUT we want to do this for all our alignments at once. This is considered to be better than calling one file at a time because the variant caller can consider info from all alignment files to determine which variants are real vs artifacts. Prof Noh has done this for the 7 other samples and their alignments and placed the vcf file in this week's folder. ``` #bcftools mpileup -Ou -EI -C 50 -Q 30 -f ../wk_03/Cutibacterium_acnes_C1.norm.fasta ``` Because bacteria are haploid, Prof Noh indicated this in the second command: ``` --ploidy 1 ``` Note: make sure you understand what the other options are as well 5.2) Call variants for the 8th alignment, the on you've been working on, by modifying the command above ``` bcftools mpileup -Ou -EI -C 50 -Q 30 -f /courses/bi345/Course_Materials/new_03/Cutibacterium_acnes_C1.norm.fasta p11c01.bwa.marked.bam | bcftools call -vc -O z --ploidy 1 -o p11c01.bwa.mpileup.vcf.gz ``` 5.3) Index this last vcf file you've just made using the command "bcftools index" ``` bcftools index p11c01.bwa.mpileup.vcf.gz ``` 5.4) Now merge the variant files using the command "bcftools merge" ``` bcftools merge p11c01.bwa.mpileup.vcf.gz 7samples.bwa.mpileup.vcf.gz ``` 5.5) Remove some variants that are most likely to be of too low quality to use: ``` bcftools filter -O v -o all.bwa.mpileup.filtered.vcf -s LOWQUAL -i'%QUAL>10' 8samples.bwa.mpileup.vcf.gz ``` 5.6) The final output file (or any other vcf file) can be viewed using a command like this: ``` bcftools view all.bwa.mpileup.filtered.vcf ``` NOTE: we'll do more with vcf file next week ## Exercise 6 - File Compression Formats in Genomic Data Notes: - lot of genomic data files will end with unfamiliar file formats that use diff file compression methods (sometimes in combination): - tar - Tape ARchive - bz2 - BZIP2 - gz - GNU zip - example of how to uncompresss a file: ``` tar xvzf some_reference_genome_Current_Release.tgz ``` - combination of tar & x (eXtract) and z (GNU zip) are able to extract this file that we just downloaded (.tgz is a shortened version of .tar.gz) - if this had been a .tar.bz2 file, you would substitute j for z NOTE: for today's lab, no need to extract any of the compressed files today (more alignment software are being written to deal with compressed data and decompress on the fly) - for future reference: check your software manual before decompressing bc the purpose of compression is to save you some storage space