Bi345 Lab #2 (wk-03)

# Bi345 Lab #2 (wk-03) https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2 ### Quality control of Illumina NGS Reads In this lab, we will be using fastp in order to QC our raw illumina reads in order to **remove sequencing adapters and barcodes**, and **remove any reads or parts of reads that don't pass a quality filter.** ``` mkdir /courses/bi345/kyamad23/wk_03 cd /courses/bi345/kyamad23/wk_03 cp /courses/bi345/Course_Materials/wk_03 ./ fastp --in1 SRR16361397_1.fastq --in2 SRR16361397_2.fastq --out1 SRR16361397_1.fastp.fastq.gz --out2 SRR16361397_2.fastp.fastq.gz ``` The basic fastp command takes two inputs and two output files for paired reads. **Q1:** The default value for the quality score threshold used by fastp is 15. This is on the Phred scale which is used to represent how confident we are in the assignment of each base call by the sequencer. Phred is related to -10 * log of the error. A Phred Quality Score of 15 indicates a 3% chance of error (i.e. 1/31.62) ### Indexing reference genome In order to align the cleaned and quality controlled reads to the reference genome, we can use the BWA package as an aligner. While the reference genome has been downloaded in this instance, the process of indexing the reference genome has been outlined below ``` bwa-mem2 index Cutibacterium_acnes_C1.norm.fasta ``` If an error for "different line lengths" arises, we can normalize the fasta file by running the following command ``` picard NormalizeFasta I=Cutibacterium_acnes_C1.norm.fasta O=Cutibacterium_acnes_C1.norm2.fasta ``` ### Aligning reads to reference genome The output of the fastp command were cleaned, paired end fastp files that we can now align to the reference genome. ``` bwa-mem2 mem -M -R '@RG\tID:p05c01\tSM:p05c01\tLB:p05c01' ./Cutibacterium_acnes_C1.norm.fasta ./SRR16361397_1.fastp.fastq.gz ./SRR16361397_2.fastp.fastq.gz > p05c01.bwa.sam ``` The output of this file is in a format called SAM. Below are the first few lines of the alignment file: ![](https://i.imgur.com/x2uhLQi.png) Here, we decode the first line of the SAM file: ``` SRR16361397.2 83 NC_018707.1 1932506 60 101M = 1932315 -292 GATCTCCTGGATGTTCGAGACCACCTTGTCATGAAGCTGATCTCGCCCCTTCGGTGGCACAACGACGAAGACCGGGGTGCCCTCCTCGATGACGGCTATGN >DBBBCDDDC@>89DDDDDDBC@B?@CCCCECCCCC@ACBBDDDDBBB@>CDFEEFHGFHJIJJGJJJJJJJJJJIJIGJIJJJJJJJHHHHHFDDFD=4# NM:i:1 MD:Z:100G0 AS:i:100 XS:i:0 RG:Z:p05c01 ``` QNAME: SRR16361397.2 This is the name of the query (pair) name FLAG: 83 This is the propreitary bitwise FLAG that can be used to get information on the type of aligned read was performed. The following is an output from the bitwise flag dictionary [webtool](https://broadinstitute.github.io/picard/explain-flags.html): ![](https://i.imgur.com/PNuVMyi.png) RNAME: NC_018707.1 This is the NCBI ID of the reference genome that was used. In this case, it is of Cutibacterium acnes C1 POS: 1932506 This is the 1-based leftmost coordinate of the clipped sequence. This should always be between 1 and the value of the LN field in the header. MAPQ: 60 This is the Phred-scaled mapping quality of the sequence. A value of 60 corresponds to a chance of error of 0.0001%. CIGAR: 101M The extended CIGAR string provides information on the position of insertions, deletions, and matches in the alignment. The string above indicates that there was an insertion followed by an alignment, finalized by another insertion. ![](https://i.imgur.com/6wsUyKw.png) MRNM: = This means that the name of mate pair is equivalent in this paired-end read. MPOS: 1932315 This is the 1-based position of the Mate sequence ISIZE:-292 The inferred size of the insert. This gives the size of the template. ### Refining Alignment We will only keep the properly aligned reads from the alignment using Picard. The command below will sort the reads into a compressed alignment format. We will be using coordinate in order to determine the sort order. ``` picard SortSam INPUT=p05c01.bwa.sam OUTPUT=p05c01.bwa.sorted.bam SORT_ORDER=coordinate ``` In order to collect some summary statistics about the sorted alignment read, we can use the following command: ``` picard CollectAlignmentSummaryMetrics R=./Cutibacterium_acnes_C1.norm.fasta I=p05c01.bwa.sorted.bam O=p05c01.bwa.stats.txt ``` By looking at the output file, we can see how well our paired reads aligned into proper pairs: ``` cut -f 17,18 p05c01.bwa.stats.txt ``` ![](https://i.imgur.com/B4yLqq0.png) From the results, we see that 99.9% of all reads were aligned into proper pairs. We can now index the sorted aligned read by using the samtools index command: ``` samtools index p05c01.bwa.sorted.bam ``` This opens the graphical representation of the initial alignment. ![](https://i.imgur.com/HgO03pc.png) This step helps us ensure that the reads in our alignment are piled on top of each other. In order to mark duplicates in our alignment, we can mark them using picard MarkDuplicates. Duplicates occur in the PCR steps that are common in next-gen sequencing library construction. ``` picard MarkDuplicates I=p05c01.bwa.sorted.bam O=p05c01.bwa.marked.bam METRICS_FILE=p05c01.bwa.duplicate.metrics ASSUME_SORT_ORDER=coordinate TAGGING_POLICY=All VALIDATION_STRINGENCY=LENIENT ``` ### Call Variants In order to call variants, we will use the bcftools package. First create a new text file with the name of the alignment file(s) we would like to call variants for. ``` vi bam.list #enter file names in this file ``` The mpileup command will consider information from all alignment files specified in the bam.list file.: ``` bcftools mpileup -Ou -EI -C 50 -Q 30 -f ./Cutibacterium_acnes_C1.norm.fasta -b bam.list | bcftools call -vc -O z --ploidy 1 -o 8samples.bwa.mpileup.vcf.gz ``` This will call the variants of the 8th alignment we have conducted in the previous steps. We then have to index and merge the variant files provided in the Course Materials directory. ``` bcftools merge 7samples.bwa.mpileup.vcf.gz 8samples.bwa.mpileup.vcf.gz > all.bwa.mpileup.vcf.gz ``` We can now remove low-quality variants (QC score less than 10): ``` bcftools filter -O v -o all.bwa.mpileup.filtered.vcf -s LOWQUAL -i'%QUAL>10' all.bwa.mpileup.vcf.gz ``` In order to view the final output file, we can use the view command from the bcftools package. ``` bcftools view all.bwa.mpileup.filtered.vcf ``` ### File Compression This section covers different filetypes that genomic data can be compressed to. The three most commonly used file-formats are tar, bz2, and gz. In order to uncompress a tar file, use a command such as the one below. Many alignment software are written to handle compressed data and decompress without using up disk space. ``` tar xvzf some_reference_genome_Current_release.tgz ```