BI345 Lab 3 - HackMD

# BI345 Lab 3 Goal: - align illumina RNA-sequencing reads to a reference genome and to process these alignments for differential expression analysis - go back to last week’s VCF file and work with it to understand differences among those samples Gen Notes: - This week’s data are Prof Noh's unpublished data from an experiment she did with a colleague - They infected three strains of D. discoideum amoebas with 4 strains of Paraburkholderia facultative symbionts - The negative controls were uninfected Questions that came up during the data: - What are the common infection responses by all hosts (qs18, qs864, and qs9)? - What are differences in response to each burk? - b70 and b159 are P. agricolaris - b859 is P. bonniea - b11 is P. hayleyella - P. bonniea and P. hayleyella are sister species to each other Lab Specific Notes: - sample file name has structure: - dicty_burk - controls are noted as a1-c2 instead of burk strain id ![](https://i.imgur.com/HzTIO53.png) NOTE: - You will be using a subset of these data (your choice) for your first guided project. - You will need to select the appropriate samples for comparison to answer a specific research question. - Goal for this project is to learn how to write a paper based on a functional genomics data analysis. ## Exercise 1 - QC 1) This was already done, but Prof Noh wanted to give you an alternative way of processing your data. This method uses an older software called FASTX-Toolkit. ``` # this works with an uncompressed file, otherwise have to uncompress then recompress after #cat INFILE | fastx_clipper -Q 33 -v -l 20 | fastq_quality_filter -Q 33 -v -q 30 -p 33 -o clipped/SAMPLE.clip.fastq ``` ^The first part removes reads that are less than 20 bp long. The second part requires a minimum base quality score of 30, then at least 1/3 of the read must have this minimum base quality score. ## Exercise 2 - Index your Reference Genome 2) This was also already done, but Prof Noh wanted to show you how this is done for STAR so you have an RNA-seq aligner as well. She prefers to use splice-aware aligners like STAR for RNA-seq data, despite bwa apparently working fine for this purpose too. ``` #STAR --runMode genomeGenerate --genomeDir /courses/bi345/Course_Materials/wk_04/STAR_index --genomeFastaFiles /courses/bi345/Course_Materials/wk_04/dicty_chromosomal_num_nodup.fa --sjdbGTFfile /courses/bi345/Course_Materials/wk_04/dicty_chromosomal_num_fixed.gtf ``` Notes: - To do this for another ref genome, you need to have 2 files: 1) FASTA file for genome sequence & 2) GTF file for gene predictions (=annotations) - GTF file --> you've seen this when you ran Prokka in BI278 - To use most splice-ware RNA-seq aligners, you will need to have this type of file so the aligner can tell where the splice junctions are supposed to be - STAR = unique b/c it will do a good job at detecting splice junctions on its own - read more about this in paper/manual BUT for "on-the-fly" splice junction prediction, recommended to run alignment a second time using the newly predicted splice junctions as an additional input file - 2nd purpose of GTF file = count # of reads that appear to belong to specific genes - these counts & comparing them = basis of differential gene expression analysis - KEY to examining these files: chromosome (contig) names are the same between the two files - ex: they will not work together if chromosomes 1-6 are called “1 - 6” in one file but “chr1 - chr6” in another ## Exercise 3 - Align your RNA-seq Reads 3) Find 1 FASTQ file in this week's folder. These data are single-end data. If you had paired-end data, you would need to provide both read FASTQ files as input files, separated by a space. ``` STAR --genomeDir /courses/bi345/Course_Materials/wk_04/STAR.index --readFilesIn /courses/bi345/Course_Materials/wk_04/qs9_b859.clip.fastq --outFileNamePrefix qs9_b859. --outSAMtype BAM Unsorted --outSAMae ributes All ``` ## Exercise 4 - Sort and Fix Read Groups *These samples were sequenced on 6 lanes (Table 1). We need to add this information to our alignment as a part of their read group. 4.1) First, we want to exclude unmapped reads & also sort the reads by coordinate. ``` samtools view -F 0x4 -h qs9_b859.Aligned.out.bam | samtools sort -o qs9_b859.sorted.bam ``` ^First part of this (before "|") will exclude unmapped reads You then "|" (pipe) the output of this command directly to samtools sort NOTE: Make sure to send sort output to bam file (rather than sam) to save space 4.2) We want to add read group info since we didn't do this during alignment. Use table above to choose appropriate lane (ID) to modify command below. ``` picard AddOrReplaceReadGroups I=qs9_b859.sorted.bam O=qs9_b859.readgroupFiles.bam RGPU=1952_7 RGSM=qs9_b11 RGLB=qs9_b11 RGPL=illumina ``` 4.3) Now, we want to take primary alignment only ``` samtools view -F 0x100 -h qs9_b859.readgroupFiles.bam -o qs9_b859.primaryAlignmentOnly.bam ``` Note: - Here, assuming your sorted primary alignment is not an empty file, go ahead and delete all of your intermediate files. They are no longer necessary. - If you want to first compare your files to Prof Noh's intermediate files, you can find these in the subfolder intermediate_files. ## Exercise 5 - Count Reads Notes: - To count the number of reads that aligned to each predicted gene, we will use htseq-count. - This is obviously a crucial part of an RNA-seq pipeline. - You need to make sure you are counting in a way that makes sense for how your libraries were prepared. 5.1) Assuming you are starting with a bam file, a modified version of the following command will properly count these data. Here the bam file is sent to standard out (this is normally your screen), but then immediately piped to the htseq-count command. So you will only see the output from that. ``` samtools view qs9_b859.primaryAlignmentOnly.bam | htseq-count -s reverse -a 20 - /courses/bi345/Course_Materials/wk_04/dicty_chromosomal_num_fixed.gtf > qs9_b859.countReads.txt ``` NOTE: - This is probably the most time-consuming step of a RNA-seq pipeline so be patient while your reads get counted. - Prof Noh recommend reading through the manual above, as well as this paper on why for this dataset we used the option -s reverse. - alternative is to use "-s yes" - indicates you have a stranded library that was created in the other direction - when Prof Noh tried this, the tail of her counts file looked like: ![](https://i.imgur.com/K579Pry.png) - this is what the correct counts file looked like: ![](https://i.imgur.com/bVnj1ZL.png) NOTE: This is a more streamlined series of steps compared to what we were doing last week. You can insert steps in this pipeline similar to last week’s, to ensure that everything looks ok at each step. - only thing we didn't do that's imp for genomic data analysis is MarkDuplicates" - The one thing we did NOT do that is important for many genomic data analysis is MarkDuplicates. There are interesting discussions whether or not this should be done, should not be done, or how you might decide what to do. In practice, I don’t remove duplicates for RNA-seq data but would for DNA-seq data when the goal is variant discovery. Inform yourself on which seems better and why. 5.2) Once your counts are done, you can compare them to mine. All counts for the files in the table above are available in the subfolder final_counts. We will come back to these counts next week for differential expression analysis. ## Exercise 6 - Back to Last Week's Lab NOTE: This part of lab is in Rstudio - go to bi345.colby.edu - either in the command prompt (>) or in a script, execute the following steps - We’ll use our vcf file from last week to understand how the samples we aligned appear to be related to each other. - we will use R package called SNPRelate - read its vignette here: http://bioconductor.org/packages/release/bioc/vignettes/SNPRelate/inst/doc/SNPRelate.html 6.1) First load the necessary packages/ libraries. ``` library(gdsfmt) library(SNPRelate) ``` 6.2) Read in vcf file ``` setwd("/courses/bi345/sdivit25/Lab_3") snpgdsVCF2GDS("all.bwa.mpileup.filtered.vcf", "conwill.gds", method="biallelic.only") ``` 6.3) Get a summary of this new file format and save it as a variable ``` snpgdsSummary("conwill.gds") genofile <- snpgdsOpen("conwill.gds") ``` 6.4) Create a new dataframe with metadata. Make sure this matches the order of samples in your vcf file. ``` metadata <- data.frame(id = c("p11c01","p05c01","p04c01","p03c01","p28c01","p26c01","p19c02","p17c01"), region = as.factor(c("right-cheek","forehead","forehead","forehead","near-mouth","near-mouth","chin","near-mouth"))) ``` 6.5) Create a matrix that uses the SNPs to figure out identity-by-state. Two samples will be IBS for a given region if they have the same SNP within it. We are using this to understand which samples appear to be related to each other. You can read more about this here: https://en.wikipedia.org/wiki/Identity_by_descent ``` ibs <- snpgdsIBS(genofile, autosome.only = FALSE, remove.monosnp = TRUE) ``` 6.6) Next, create a dendrogram and plot it. We are taking into account the facial region from which each sample was collected. ``` ibs.hc <- snpgdsHCluster(ibs) rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(metadata$region)) plot(rv$dendrogram) legend("topright", legend=levels(metadata$region), col=1:nlevels(metadata$region), pch=19, ncol=2) ``` 6.7) Let’s also try a PCA visualization. First, run a PCA on these data. Save the results to a table. ``` conwill.pca <- snpgdsPCA(genofile, autosome.only = FALSE, remove.monosnp=TRUE, maf=NaN, missing.rate=NaN) tab <- data.frame(sample.id = conwill.pca$sample.id, region = as.factor(metadata$region), EV1 = conwill.pca$eigenvect[,1], EV2 = conwill.pca$eigenvect[,2]) ``` 6.8) Save the PC loads to another variable for plotting. ``` pc.percent <- conwill.pca$varprop*100 lbls <- paste("PC", 1:4, " (", format(pc.percent[1:4], digits=2), "%)", sep="") ``` 6.9) Plot these data again. ``` plot(tab$EV1, tab$EV2, col=as.integer(tab$region), xlab=lbls[1], ylab=lbls[2]) legend("bottomright", legend=levels(tab$region), pch="o", col=1:nlevels(tab$region)) ``` Q1. Are these results consistent with what was discussed in the Conwill et al paper? Explain why or why not. From what I understood, the paper was talking about how the C.acnes strain types were not specific to sample type or to skin regions. Our dendrogram shows that pretty much all of the samples are related. There are no clear distinctions – all samples are intertwined with each other in terms of relationship. In the PCA, we can also see that there is no clear distinction or clusters. Thus, our results are consistent with what was discussed in the Conwill et al paper. Dendogram (w/out legend & w/ legend): ![](https://i.imgur.com/8yam7ax.png) ![](https://i.imgur.com/7ysz1Zh.png) PCA plot(w/out legend & w/ legend): ![](https://i.imgur.com/QYVNTZC.png) ![](https://i.imgur.com/HwIyrSb.png)