# Project 1 Application
Now that you understand how sequence alignment algorithms work under the hood, we can start exploring how they are used in computational biology research! In this section, we will use a tool called bowtie2 to align bulk RNA-seq reads to the *D. melanogaster* genome and then explore how we can use the results to gain insight regarding the function of a proto-oncogene.
# Setup
You should have already pulled the Project 1 stencil but if you have not, then run `git pull` in your `cs181-projects25` repo to get the scripts and support files. To enter the cs181 docker container, run `./run-docker` in the `cs181-projects25` folder.
# RNA-Seq Analysis
## Background
In homework 1, you learned how Next-Generation Sequencing (NGS) is used to rapidly sequence genomes. NGS can also be used to sequence transcriptomes, even though most sequencing platforms are designed to sequence DNA. Messenger RNAs (mRNAs) derived from tissue samples can be reverse transcribed into complimentary DNA (cDNA), creating sequencing reads which can be fed into a sequencing machine. If you are interested in learning more about RNA-sequencing protocols, check out [this paper](https://www.nature.com/articles/s41576-019-0150-2)!
:::spoiler Why RNA-seq??
Generally, RNA-seq experiments are performed to determine how gene expression changes in different experimental conditions. For instance, a researcher may choose to create a mutant strain of a model organism in which a transcription factor of interest has been knocked out. Comparing transcription levels in the mutant strain vs. a control strain may reveal genes that the transcription factor regulates. Researchers may also choose to explore how expression varies at different stages in development, in response to stress, or in disease vs. healthy states.
:::
## Biological background
*D. melanogaster* neural stem cells, called neuroblasts (NBs), have high proliferative potential. Under normal conditions, these cells will differentiate into specific neuronal and glial subtypes, which divide very rarely. However, if mutations are present in genes that are critical for driving neuronal differentiation, these NBs can revert to an abberant cell type known as dedifferentiated NBs (dNBs), which have highly maligant properties.
The transcripition factor Chinmo is expressed in NBs in the early larval stages, where it promotes the stem-cell qualities of NBs (eg. high proliferative potential and lack of cell-type specificity). Chinmo expression is turned off in late-stage larvae, allowing for neuronal differentiation to procede. However, if differentiation is halted and differentiating NBs revert to dNBs, rapid cell division will occur in cells that continue to express Chinmo, leading to the formation of neural tumors. Chinmo is therefore a proto-oncogene. While it is necessary for NB function in early developmental stages, abberant overexpression of Chinmo in late-stage larvae and early adult-stage flies can result in cancer.
## From FASTQ files to counts table
The output of a run on a sequencing machine is a FASTQ file (extension .fastq or. fq). FASTQ files contain the sequence of each read, as well as information about the sequencing run, cluster, and quality of each read. The sequencer also assigns a quality score to each base it calls that reflects how confident it is that it is a A, T, G, or C. In a typical sequence alignment workflow, a quality control step is performed in which low-quality portions of reads are removed.
Here, we will use RNA-seq data from [this paper](https://elifesciences.org/articles/13463) to examine how gene expression shifts in *D. melanogaster* larvae after knocking down Chinmo expression using [RNA interference](https://www.ncbi.nlm.nih.gov/probe/docs/techrnai/) (RNAi). We have provided you with 6 FASTQ files (3 experimental replicates, 3 control replicates), each containing 17000 high-quality reads. FASTQ files can contain hundreds of millions of reads, however aliging FASTQs this large with bowtie2 is a task best left for a supercomputer. We've created abridged FASTQs for this task that are laptop-friendly :)
**Note:** bowtie2 uses global end-to-end global alignment, meaning that the entirety of read is aligned globally to the reference genome (each read is about 50bp long in this case).
:::success
**Task:** Fill in a short bash script to create a RNA-seq counts table from FASTQ files.
Open the `align_reads.sh` script located in ```/home/cs181-user/project1/scripts/```. In this script, you will write 3 lines of code, each described below:
- Step 1 - Use `bowtie2` to align each FASTQ file to the *D. melanogaster* genome. Run ```bowtie2 -h``` in your terminal to view the function's documentation. Only focus on the first section (just the 3 main arguments). We will not use any additional options. **Hints**:
- The *D. melanogaster* genome is split into 6 bowtie2 genome index files in order to increase alignment efficiency (sort of like splitting a book into chapters to make it easier to find information). Use `../data/genomes/dmel_all_chromosome` for the **index filename prefix**.
- We are using FASTQ files with **unpaired reads**. Use `../data/fastqs/${i}` to identify a single FASTQ file.
- `bowtie2` will output a SAM file. A SAM file contains alignment information for each read (ie. the genomic coordinates that the read aligned to). Use `../outputs/${i}.sam` for the path to the **file for SAM output**. Note that ```${i}``` references the name of an original FASTQ file (eg. ```control_2.fq``` or ```RNAi_3.fq```), so using it in our filename allows us to keep track of which sample the alignments came from.
- Step 2 - Use `samtools sort` to convert the SAM file outputted by `bowtie2` into a sorted BAM file (a BAM file is a compressed SAM). By sorted, we mean that the alignments are arranged in order of their genomic coordinates. Run `samtools sort -h` in your terminal to view the function's documentation. **Hints:**
- use the `-o` flag to specify the output file as `../outputs/${i}.sorted.bam`. There's no need to specify any other options.
- Note that the documentation represents the input file as `[in.bam]`, however the function will also take a .sam file as input
- Step 3 - Use `featureCounts` to construct an RNA-seq counts table from the sorted BAM files. run `featureCounts -h` to view the function's documentation. **Hints:**
- Before specifying the main arguments, specify the following optional argument: `-t gene` . This indicates that we want our output table to contain counts for each gene.
- Use `../data/genomes/dmel-all-r6.42.gtf` for the **annotation file**.
- The path to the output file should be `../outputs/abridged_counts.txt`
- You can get a full list of the sorted BAM files generated previously using `../outputs/*.sorted.bam`
Once you have completed each step, run the script with the following line of code:
```shell=
$ sh align_reads.sh
```
As your script runs, it will output information about the fraction of reads that aligned to a unique genomic position for each FASTQ file. After the sript is finished, open your `abridged_counts.txt` file. You will notice that the counts for each gene will seem quite sparse (most genes will not have any counts in all 6 replicates). This is because the FASTQs we provided are a very small subset of the original FASTQs. For the remainder of this application, you will use a counts table generated using the full-length FASTQ files (see `full_counts.csv` in the `data` folder).
::::
## Analyzing RNA-seq counts data
Once alignment is complete and a counts table has been compiled, the next step in an RNA-seq analysis workflow is to run a differential expression analysis. One of the most popular tools for this purpose the R package DESeq2, which we will use here.
:::warning
**Note:** This section is meant to illustrate how an important application of sequence alignment that can lead to exciting biological insights, but it is not meant to be a comprehensive review of RNA-seq analysis. If you are interested in learning more, check out [this article](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#ma-plot) from the creators of DESeq2!
:::
### Differential Gene Expression (DGE) analysis
Performing a DGE analysis is one of the most common and important reasons that researchers collect RNA-seq data. The aim is to identify genes who's level of expression is significantly different in different conditions. In our case, we will be identifying genes that are upregulated and downregulated in response to decreasing Chinmo expression with RNAi. These genes are likely to be directly or indirectly regulated by Chinmo.
For this portion of the assignment, you will be running an R script to perform differential expression anaysis with DESeq2, and answering questions about the plots and tables that are produced. Please write your answers in a file named **application.pdf**.
:::success
**Task:** In ```/home/cs181-user/project1/scripts```, run the following line of code in you terminal:
```shell=
Rscript rnaseq.R
```
The script will take about 20 seconds to run. While we are not asking you to write any R code, we encourage you to open the script and examine the code. All files generated by the script will be in the `outputs` folder.
:::
### Principal Component Analysis (PCA)
:::spoiler Learn About PCA!
In order to have confidence in the results of our DGE analysis, it is important to verify that our data is of high quality. One of the ways this is done is by examining the output of a principle component analysis (PCA).
PCA is a dimensionality-reduction technique that allows us to observe relative variation between samples. Each of our 6 samples is ultimately vector of length 17874 (corresponding to the 17874 annotated features of the *D. melanogaster* genome that reads were mapped to). It is impossible to visualize a where our samples lie in a 17874-dimensional space. PCA identifies the axes (principal components) along which the samples vary the most. By plotting our samples along the first 2 principal components, we can get a general sense of how similar they are to each other.
If RNA-seq data is of high quality, samples in the experimental group should have similar levels of gene-expression, as should samples in the control group. Therefore, the experimental group and the control group should form separate clusters on a PCA plot.
:::
<br>
:::success
**Task:** Open the `chinmo_RNAi_pca.png` image and comment on the quality of our RNA-seq data based on the PCA plot.
:::
### Volcano Plot
:::spoiler Learn about Volcano Plots!
There are many ways to visualize the results of a DGE analysis, and one of the most common ones is with a volcano plot. A volcano plot is a scatterplot of genes with fold change on the horizontal axis and statistical significance on the vertical axis. Fold-change is simply expression in experimental condition divided by expression in control condition. Fold-change values are log-transformed (with base 2) so a fold change of 0 indicates no expression change, a positive fold change indicates upregulation in the experimental condition, and a negative fold change indicates downregulation in the experimental condition. Statistical significance is measured in terms of p-values that are log-transformed (with base 10) and negated. This means that genes that are differentially expressed with high statistical significance will appear higher on the vertical axis, and genes that are differentially expressed with low significance will appear lower.
You will see that the volcano plot has dotted lines defining thresholds for differential expression and significance. Genes that pass both thresholds (have a large enough fold change and a small enough p-value) appear red. Genes with a small yet statistically significant expression change appear blue, and genes with a larger but insignificant expression change appear green. Genes that don't pass either of the thresholds are colored gray.
:::
<br>
:::success
**Task:** Open the `chinmo_RNAi_volcano_plot.png` image and answer the following questions:
1. Do you think it is more likely that Chinmo is primarily an upregulator or downregulator of its target genes in the late larval stage? Why?
2. Name 3 genes that might be interesting to investigate further. Why did you choose these genes based on the volcano plot? (A separate explanation for each gene is not necessary)
:::
### Functional Enrichment Analysis
After running a DGE analysis, we get a list of significantly upregulated and downregulated genes. In our case, these genes are likely regulated directly or indirectly by Chinmo. In order to gain insight into the function of Chinmo, it would be wonderful if we could get a general sense of the biological processes that Chinmo's targets are involved in. This is where functional enrichment analysis is a big help!
:::spoiler Learn About Functional Enrichment!
To perform functional enrichment analysis, we will use the classifications provided by the [Gene Ontology Consortium](https://geneontology.org/). The consortium maintains a set of Gene Ontology (GO) terms that describe genes in terms of 3 separate categories: molecular function (eg. "transport," "DNA-binding"), biological process (eg. "signal transduction," "DNA-repair"), and cellular component (eg. "nucleus," "mitochondrion"). Based on research from hundreds of thousands of published papers, genes from several common model organisms are associated with GO terms. GO terms follow a hierarchy of specificity. Some terms are quite general, such as "signal transduction," while others are very specific, like "pyrimidine nucleobase biosynthetic process." Genes that have not been very heavily researched are associated with very broad GO terms, while genes about which much is known are likely to be associated with several very specific terms.
In a functional enrichment analysis, a set of genes is used as a query, and a statistical enrichment analysis is performed to find GO terms that are "over-represented" by the genes in the query.
:::
<br>
We will use a very popular tool called g:Profiler to perform functional enrichment analysis. We will use g:Profiler's web-based interface, but there is also an [R package](https://cran.r-project.org/web/packages/gprofiler2/index.html) available which allows for easy integration of functional enrichment analysis into RNA-seq analysis pipelines.
:::success
**Task:** Perform functional enrichment analysis using g:Profiler's g:GOSt tool.
Steps:
1. Go to [https://biit.cs.ut.ee/gprofiler/gost](https://biit.cs.ut.ee/gprofiler/gost). Make sure you have the g:GOST ("Functional profiling") tool selected. Change the organism to *Drosophila melanogaster*.
2. Open the `upregulated_genes.csv` file, and copy and paste the list of genes in the "gene_symbol" column into the g:GOSt query-box.
3. Click the "Run query" button. It may take several seconds for the analysis to run.
4. When results are ready, scroll down and select the "Detailed Results" tab. You will see 3 lists of GO terms, one for each category described above.
5. **Open a new tab on your browser** and repeat steps 1-4, but this time use the `downregulated_genes.csv` file.
When you've completed both functional enrichment analyses, scroll down to the list of GO terms under the heading "GO:BP" (for Biological Process). Re-read the "Biological Background" section from the beginning of this handout, and compare the list of BP GO terms for upregulated and downregulated genes. How can you relate the results of the functional enrichment analyses to your background knowledge of Chinmo? Reference your results for both upregulated and downregulated genes, and use specific GO terms in your answer.
:::
# ChIP-Seq Analysis
## Background
Another application of NGS is Chromatin Immunoprecipitation followed by Sequencing, or ChIP-seq for short. ChIP-seq is a powerful molecular biology technique used to study the interactions between proteins and DNA within a cell. It allows scientists to pinpoint where specific proteins bind to the genome, uncovering critical information about gene regulation, epigenetics, and more. In this section of the application, we'll delve into the analysis of ChIP-seq data to help you understand its significance and how it works. To learn more about how this data is acquired, read [this manual](https://www.illumina.com/techniques/sequencing/dna-sequencing/chip-seq.html) from Illumina!
Generally, ChIP-seq experiments are performed to determine where a transcription factor is binding in the genome. Analyzing this data can help researchers understand patterns in a TF's behavior: localizing to a specific chromosome, binding at transcription start/end sites, and more. Similar to the RNA-seq experiments, analysis of ChIP-seq manner can be done in a comparitive manner. In this section, we will compare the ChIP-seq profiles of three different transcription factors!
### The Data
If you read through the Illumina methods page on ChIP-seq you might have learned that the output of this experiment is raw sequence files. This is because the DNA fragments bound to the corresponding protein are subjected to high-throughput DNA sequencing. This step generates millions of short DNA sequences, providing a snapshot of the genomic regions associated with the protein of interest.
When these short DNA sequences are aligned to the genome, peaks are generated! The more reads aligned to a specific region, the larger the peak. Because the alignment step was performed, we can create a file that tells us the position of a peak in the genome and statistics/measurements specific to that peak. These steps have been performed for you and are presented in Browser Extensible Data (BED) files in the `project1/data/` directory.
If you take a quick peak at this data, you will see that each row contains multiple columns of information with the first three being the chromosome, start index, and end index. For the sake of this assignment we will only use those first three columns but other analytics regarding TF peak height and statistical significance may use this other information. There are three BED files provided - each corresponding to a different transcription factor!
### The Tools
In this portion, you will write some simple R code utilizing the `ChIPseeker` package to plot this binding data. In particular you will use three plotting functions: `covplot`, `plotPeakProf2` and `plotHeatmap`. Here is some brief information on each function:
* `covplot`: Calculates the coverage of peak regions over chromosomes and generates a figure to visualize the binding of a protein across the genome.
* `plotPeakProf2`: Supports the plotting of profiles of peaks binding to different regions. For our purpose, we are interested in protein binding to genes.
* `peakHeatmap`: Visualizes where a protein might be binding with respect to a specific genomic feature. That will be the transcription start site, or TSS, in our application.
For the purposes of this assignment, you will notice that using R is not unlike using Python (though this is not the case in general). The only change you might notice is that instead of using `=` for variable assignment, `<-` is seen in R.
## Analyzing ChIP-seq Data
You have been tasked with identifying the names of three transcription factors given the BED Files from ChIP-seq experiments! You know the following facts about these proteins and will have to use them to figure out the proteins' identities. Transciption Start Site Specificity indicates how far from the TSS the binding site can typically found.
| Name | Localization | TSS Specificity (bp) |
| -------- | -------------- |:--------------------:|
| VanPra-13 | Whole Genome | $\pm$ 0 |
| SORIN-2 | One Chromosome | $\pm$ 0 |
| JClaw-25 | Whole Genome | $\pm$ 500 |
### Plotting Binding Data
:::success
**Task:** Fill in `chipseq.R` based on the comments provided in the file. Refer to the [ChIPseeker tutorial](https://bioconductor.org/packages/release/bioc/vignettes/ChIPseeker/inst/doc/ChIPseeker.html#functional-enrichment-analysis) to help complete the TODO items indicated.
:::
:::danger
When running some of the ChIPSeeker code, you may run into an out of memory issue (the command line might say `Killed.`). To combat this, either reduce the number of bins or the visualization space (i.e. upstream/downstream).
:::
Now that your code is complete, you are ready to plot your data!
:::success
**Task:** In your terminal run the following:
```console=
$ Rscript student_chipseq.R
```
:::
### Making Conclusions!
Now, we are in the final steps of the application project! Now that you have written your code, it is important that you are able to look at your plots and make conclusions from them. Without this analysis, these plots do not carry enough meaning!
::: success
**Task:** In the same **application.pdf** document, answer the following questions:
1. Assign TFs *A*, *B*, and *C* to their corresponding name in the table above.
2. Defend your decisions using the plots that you generated. Make sure to include any referenced plots in your discussion within the PDF.
3. If you had to omit one type of plot (covplot, heatmap, or peak profile) from your analysis, which would you choose? Explain your choice. (Hint: Which plot could you lose and still clearly articulate your argument to part 2.)
**Note**: Make sure at least one example of each plot is included in your discussion.
:::
:::warning
The plots that you generate will not include a title. Make sure it is clear for us what plot we are looking at when you include them in your discussion! You can do this by editing the images to contain a title or just including a brief footnote naming the image.
:::
## Wrapping Up
You're done! We hope you learned something about how alignment plays a crucial role in bioinformatic analysis pathways.
:::warning
**Optional Task:** In your **application.pdf** document, leave any recommendations you have to make this application section better!
:::
Developed by jcurrie2 and pmahable, 2023 and 2024