# BI345 Lab 03 (week 04)
### QC
This is similar to the quality control steps taken in lab 02 but done using the older toolkit, FASTX instead.
This step has already been completed but it inputs a fasta file and outputs to the directory specified. The middle part of the command removes reads that are less than 20 bp long. The last part filters the reads to a minimum qualtiy score of 30 with at least 1/3 of the read having this base quality score or higher.
```
mkdir clipped
cat INPUT | fastx_clipper -Q 33 -v -l 20 | fastq_quality_filter -Q 33 -v -q 30 -p 33 -o ./clipped/SAMPLE.clip.fastq
```
### Indexing your reference genome
This was also pre-completed for us but the STAR aligner will input a fasta file for the genome sequence and a gtf file for the gene annotations. STAR is an example of a splice-aware aligner compared to the alternative, bwa.
```
#STAR --runMode genomeGenerate --genomeDir /courses/bi345/Course_Materials/wk_04/STAR_index --genomeFastaFiles /courses/bi345/Course_Materials/wk_04/dicty_chromosomal_num_nodup.fa --sjdbGTFfile /courses/bi345/Course_Materials/wk_04/dicty_chromosomal_num_fixed.gtf
```
### Aligning RNA-seq reads
We will be aligining the single-end RNA-seq data provided as a clipped FASTQ file. We will also need to specify the genome directory that the command should point to.
```
STAR --genomeDir /courses/bi345/Course_Materials/wk_04/STAR.index --readFilesIn ./qs9_b859.clip.fastq --outFileNamePrefix qs9_b859. --outSAMtype BAM Unsorted --outSAMattributes All
```
### Sort and fix read groups
The samples were sequenced on 6 lanes and we need to add some information to the alignment.
We output the sorted file to a bam in order to save space. We also need to use the -h flag in the samtools view command in order to include the header in the output.
```
samtools view -h -F 0x4 qs9_b859.Aligned.out.sam | samtools sort > qs9_b859.sorted.bam
```
To add read group information, we use the table above to choose the appropriate lane ID in order to add it to the output
```
picard AddOrReplaceReadGroups I=qs9_b859.sorted.bam O=qs9_b859.readFiles.bam RGPU=1952_8 RGSM=qs9_b859 RGLB=qs9_b859 RGPL=illumina
```
Now, we will take only the primary alignment.
```
samtools index qs9_b859.readFiles.bam
samtools view -h -F 0x100 qs9_b859.readFiles.bam > output.bam
```
### Count reads
In order to count the number of reads that align to each gene, we can use the htseq-count package.
This is the most time consuming step in an RNA-seq pipeline. We use the option ```-s reverse``` to signify how the stranded library was created.
```
samtools view output.bam | htseq-count -s reverse -a 20 - /courses/bi345/Course_Materials/wk_04/dicty_chromosomal_num_fixed.gtf > counted.bam
```
One important step that is missing from this streamlined pipeline is Mark Duplicates. In this case, since we are looking for variants, we do not remove duplicates.
### Back to last week's lab
The following sections will be in Rstudio. First, connect to the bi345 R-studio web client at the address bi345.colby.edu
```
library(gdsfmt)
library(SNPRelate)
setwd("/courses/bi345/kyamad23/bi345_lab3")
snpgdsVCF2GDS("all.bwa.mpileup.filtered.vcf", "conwill.gds", method="biallelic.only")
```
To get a summary of the gds fileformat, we can use the following command:
```
snpgdsSummary("conwill.gds")
genofile <- snpgdsOpen("conwill.gds")
```

We can create a new dataframe of meta data that matches the ordre of samples in the vcf file.
```
metadata <- data.frame(id = c("p11c01","p05c01","p04c01","p03c01","p28c01","p26c01","p19c02","p17c01"),
region = as.factor(c("right-cheek","forehead","forehead","forehead","near-mouth","near-mouth","chin","near-mouth")))
```
In order to create a matrix that uses the single nucleotide polymorphisms to figure out the identity by state, we can specify two samples as IBS for a given region when they have the same SNP within it.
```
ibs <- snpgdsIBS(genofile, autosome.only=FALSE, remove.monosnp=TRUE)
```

We can create a dendrogram of this data and plot it with the facial region each sample was collected from. This figure includes a legend that displays more information of the plot.
```
ibs.hc <- snpgdsHCluster(ibs)
rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(metadata$region))
par(mfrow = c(1,1))
plot(rv$dendrogram)
legend("topright", legend=levels(metadata$region), col=1:nlevels(metadata$region), pch=19, ncol=2)
```

To create a principal component analysis visualization, we can first run PCA on the conwill data.
```
conwill.pca <- snpgdsPCA(genofile, autosome.only = FALSE, remove.monosnp=TRUE, maf=NaN, missing.rate=NaN)
tab <- data.frame(sample.id = conwill.pca$sample.id,
region = as.factor(metadata$region),
EV1 = conwill.pca$eigenvect[,1],
EV2 = conwill.pca$eigenvect[,2])
```

We can save the principal component loads to another variable in order to plot it.
```
pc.percent <- pca$varprop*100
lbls <- paste("PC", 1:4, " (", format(pc.percent[1:4], digits=2), "%)", sep="")
```
Plot:
```
plot(tab$EV1, tab$EV2, col=as.integer(tab$region), xlab=lbls[1], ylab=lbls[2])
legend("bottomright", legend=levels(tab$region), pch="o", col=1:nlevels(tab$region))
```

**Question 1:** These results are consistent with that in the Conwill et al paper because there is clear segregation among the facial regions in the principal component plot. If the bacterial populations in pores were not segregated by pore (i.e. opposite of the Conwill conclusion), we would see more clustering of the principal components.