# A Hands-On Introduction to Running RNAseq in the Cloud

The goal of today's lab meeting (March 15, 2021) is to enable you to:

- **create** virtual cancer cohorts using the NIH Common Fund-supported Gabriella Miller **Kids First Data Portal (KF Portal)**
- **analyze** differential gene expression (DGE) on **Cavatica**, an integrated cloud-based platform

This hands-on session follows the [CFDE training lesson on running RNAseq on Cavatica](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_1/).

A typical RNAseq workflow is shown in the schematic diagram below. The orange boxes highlight the steps you will do in this workshop!

![](https://i.imgur.com/V0Zmr3e.png)

## Set Up

To get started you need:

- [ ] Kids First Data Portal account
    - Go to https://portal.kidsfirstdrc.org/
        - Sign in to an existing account
        - Click **Join now** to register for a new account
    - Step by step lesson: [Click here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/Kids-First/Portal-Setup-And-Permissions/KF_3_KF_Registration/)

:::info
You can use Gmail or any other login credential to set up accounts. You can also use two separate login credentials for the KF Portal and Cavatica.
:::

:::success
Put up :raised_hand: on Zoom when you have successfully logged in to your KF Portal account.
:::

- [ ] Cavatica account
    - Go to https://cavatica.sbgenomics.com/
        - Sign in to an existing account
        - Click **Create an account** to register a new account
    - Step by step lesson: [Click here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/Kids-First/Portal-Setup-And-Permissions/KF_4_Cavatica_Registration/)

:::success
Put up :raised_hand: on Zoom when you have successfully logged in to your Cavatica account.
:::

- [ ] Integrate KF Portal and Cavatica accounts
    - Check the **Settings** option under your name (top right corner) in the KF Portal to ensure Cavatica is integrated
    - Log into Cavatica
    - Click on the **Developer** tab
    - Select **Authentication token** from the drop-down menu
    - Click the "copy to clipboard" icon next to the authentication token
    - In your KF Portal account, click on your name (top right corner) and select **Settings**
    - Under Application Integration, click on **Connect** for Cavatica
    - Paste the authentication token in the popup box under **Cavatica Authentication Token** and click **Connect**
    - Step by step lesson: [Click here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/Kids-First/Portal-Setup-And-Permissions/KF_5_ConnectingAccounts/)

:::success
Put up :raised_hand: on Zoom when you have successfully connected your Cavatica account to your KF Portal account.
:::

## Selecting cancer cohort : KF Portal

For today's lesson we will compare two pediatric cancers: **Medulloblastoma** & **Ependymoma**.

Medulloblastoma

- is a common malignant childhood brain tumor
- typically occurs in the 4th ventricle region of the brain
- has five different histological types
- subtypes impact the prognosis and response to therapy

Ependymoma

- is a broad group of tumors
- often arises from the lining of the ventricles in the brain
- can also occur in the central canal of the spinal cord
- anatomical distribution impacts prognosis

:::info
Detailed lesson on filtering the cohort, with screenshots & video walkthrough: [Click Here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_3/)
:::

### Option 1: Run through all filters

1. Select the **File Repository** tab
2. Select the **Browse All** option next to Filter
3. Select the **Access** filter listed under the **FILE** field
4. Select the **Open** value
5. Click View Results to update the selection.
6.
   Apply **File Filters**
    - Experimental Strategy --> RNA-Seq
    - Data Type --> Gene expression
    - File Format --> tsv
7. Switch to **Clinical Filters**
    - Diagnosis (Source Text) --> Medulloblastoma and Ependymoma
    - Gender --> Male
    - Race --> White
8. Click **ANALYZE IN CAVATICA**
9. Click **CREATE A PROJECT** and enter a name for the project
10. Click **SAVE**
11. Click **COPY AUTHORIZED**

### Option 2: Use the query short URL to skip the filter steps

1. Log into your KF account
2. [Click on the query link](https://p.kfdrc.org/s/6ic)
3. Click **ANALYZE IN CAVATICA**
4. Click **CREATE A PROJECT** and enter a name for the project
5. Click **SAVE**
6. Click **COPY AUTHORIZED**

:::success
Put up :raised_hand: on Zoom when you can see a "Success" popup box summarizing the copy details along with a link to view the project on Cavatica.
:::

## View, Filter, Tag and Download files : Cavatica

:::info
Detailed lesson on filtering, tagging and downloading files from Cavatica: [Click Here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_4/)
:::

You can use the **Files** tab within the project folder to view the copied files. Let's further subset the cohort to remove possible sources of variation.

You can change the columns visible in the table from the default to any columns available in the metadata column list. Let's select the following columns:

- Age at diagnosis
- Vital status
- tumor_location
- histology
- histology_type

:::success
Put up :raised_hand: on Zoom if you have 99 tsv.gz files with the above column headers.
:::

We will apply the following filters:

- Vital status --> Alive
- histology --> Medulloblastoma & Ependymoma
- histology_type --> Initial CNS Tumor
- tumor_location --> all values **without** `;`, `Not Reported`, or `Other locations NOS`

:::success
Put up :raised_hand: on Zoom if you have 50 tsv.gz files after applying all the filters.
:::

1. Select all filtered files and click the **Tags** tab
2.
   Type the name of the tag and click **Add new tag**.
3. Click **Apply**
4. Click on the ... button in the right corner.
5. Select **Export metadata manifest from filtered files**.

:::success
Put up :raised_hand: on Zoom if you have the metadata file for the filtered data on your local machine.
:::

## Setup DESeq2 app : Cavatica

:::info
Detailed lesson on setting up the DESeq2 public app on Cavatica: [Click Here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_5/)
:::

DESeq2 is offered as a standalone app on Cavatica. We will now go through the steps to obtain a copy of this app in our project folder.

1. Click the **Apps** tab, which is currently empty, and click the **Add App** button, which opens the list of Public Apps.
2. Type "DESEQ" in the search bar.
3. In the DESeq2 app box, select the **Other versions** drop-down box and click on version 1.18.1.
4. This opens the app in a new tab, where you can click the `...` in the right-hand corner and click **Copy**.
5. Select the project folder that contains the tsv.gz files and click **Copy**.
6. Navigate to your project Dashboard using the Projects drop-down menu and view the app under the Apps tab. You can also click the project link in the popup box that appears at the top of the page.

:::success
Put up :raised_hand: on Zoom if you have successfully copied the DESeq2 app into the project folder with the tsv.gz files.
:::

We also need the reference gene annotation file in GTF format as one of the inputs for the DESeq2 app. First, we will check the reference genome information associated with the tsv.gz files, and then follow the steps to obtain a copy of the GTF file in the project folder.

1. We can update the table in the **Files** tab to add the Reference genome column. All files in this dataset used the GRCh38 (hg38) Homo sapiens genome assembly released by the Genome Reference Consortium.
2. Click **Add Files**, which automatically selects the Public Files tab.
3.
   Select the **Category:All** filter and check the box next to **reference**.
4. Select the **Type:All** filter and check the **GTF** box.
5. From the list, select **Homo_sapiens.GRCh38.84.gtf** and click **Copy to Project**.
6. In the popup box, click **Copy**.

:::success
Put up :raised_hand: on Zoom if you have successfully copied the reference GTF file into the project folder with the tsv.gz files.
:::

## Modify phenotype file : Local Machine

:::info
Detailed lesson on modifying the phenotype file and uploading it to Cavatica: [Click Here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_6/)
:::

We previously [downloaded the metadata file for the filtered cohort](https://hackmd.io/Ytzuu3ahTAqzI2ZU2gh3sA#View-Filter-Tag-and-Download-files--Cavatica). The DESeq2 app requires the phenotype file in CSV format with the sample ID in the first column. The downloaded metadata manifest has id in the first column by default, so it needs to be rearranged.

1. Rearrange the column order by using cut/insert to move sample_id to the first column.
2. Create a new column age_at_diagnosis_yrs and enter the formula `=ROUND(N2/365,3)` to convert age in days to years.
3. Sort age_at_diagnosis_yrs from largest to smallest value.
4. Delete the row corresponding to BS_BA6AZWB3 (sample collected from a patient at 36.5 yrs of age).
5. Create a new column diagnosis_age_range and enter the formula `=LOOKUP(Z2,{0,5,10,15},{"0-5","5-10","10-15","15-20"})` to create five-year bins of diagnosis age.
6. Select all rows and columns with values, select File --> Save As, and save the file in CSV format.

:::success
Put up :raised_hand: on Zoom if you have successfully modified and saved the metadata manifest as a CSV file.
:::

## Upload phenotype file : Cavatica

1. Access the **Files** tab in your project folder on Cavatica and click **Add files**.
2. Select **Your Computer** as the source to add files.
3. Click on **Start upload** to add the files to Cavatica.
4. Click on the `x` in the top right-hand corner after a successful upload.
5.
   From the **Files** tab, select **Type:CSV** to bring up the file. You can click on it to preview the contents.

:::success
Put up :raised_hand: on Zoom if you have successfully uploaded the phenotype file in CSV format to Cavatica.
:::

## Analysis with DESeq2 app

:::info
Detailed lesson on setting up the DESeq2 app, and exploring and downloading the output files from Cavatica: [Click Here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_7/)
:::

1. Access the DESeq2 app under **Apps**.
2. Click **Run** to open the app task page.
3. Under Inputs, click the **Select files** icon next to each data type.
    - For Expression data, use the Type option to choose TSV.GZ files and subset using Tags to select the tag you gave to the filtered data. Select all filtered files and click **Save selection**.
    - For Gene annotation, the file list is updated to show the GTF file. Choose the file and click **Save selection**.
    - For Phenotype data, the file list is updated to show the CSV file. Choose the file and click **Save selection**.
4. Update the app settings
    - Provide Analysis title
    - Control variables: tumor_location & diagnosis_age_range
    - Covariate of interest: histology
    - FDR cutoff: 0.05
    - Factor level - reference: Ependymoma
    - Factor level - test: Medulloblastoma
    - Quantification tool: kallisto
    - IgnoreTxVersion: True
    - log2 fold change shrinkage: True
5. Click **Run** in the right-hand corner to initiate the analysis.

:::success
Put up :raised_hand: on Zoom if the status on the DESeq2 app shows RUNNING.
:::

The app takes ~36 minutes for this set of files. The cost of the analysis is around $0.15. An email will be sent to the address you used to register for Cavatica when the app run is completed.

## Analysis with Data Cruncher (Optional)

An alternative method is to use interactive analysis on an instance running the RStudio computational environment. The DGE workflow is run using an analysis script to generate reports and outputs.
:::info
Detailed lesson on setting up and using the Data Cruncher on Cavatica: [Click Here](https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/RNAseq-on-Cavatica/rna_seq_8/)
:::
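
If you prefer scripting to a spreadsheet, the edits described under "Modify phenotype file : Local Machine" can be approximated with a short Python script. The sketch below is not part of the CFDE lesson and rests on a few assumptions: the age-in-days column is given the hypothetical name `age_at_diagnosis` (check the header of your own manifest), the demo rows are made up (only `BS_BA6AZWB3` comes from the lesson), and Python's `round` may differ from the spreadsheet `ROUND` on exact ties.

```python
import csv
import io
from bisect import bisect_right

# Bin edges and labels copied from the spreadsheet LOOKUP formula.
EDGES = [0, 5, 10, 15]
LABELS = ["0-5", "5-10", "10-15", "15-20"]


def age_bin(years):
    """Return the label of the last edge that is <= years, as LOOKUP does."""
    idx = bisect_right(EDGES, years) - 1
    if idx < 0:
        raise ValueError(f"age {years} is below the first bin edge")
    return LABELS[idx]


def prepare_phenotype(rows, age_col="age_at_diagnosis",
                      drop_ids=("BS_BA6AZWB3",)):
    """Reorder, convert, filter, bin and sort manifest rows.

    `age_col` is an ASSUMED column name holding the age in days.
    """
    prepared = []
    for row in rows:
        if row["sample_id"] in drop_ids:
            continue  # step 4: drop the sample collected at 36.5 yrs
        years = round(float(row[age_col]) / 365, 3)  # step 2: days -> years
        new_row = {"sample_id": row["sample_id"]}  # step 1: sample_id first
        new_row.update((k, v) for k, v in row.items() if k != "sample_id")
        new_row["age_at_diagnosis_yrs"] = years
        new_row["diagnosis_age_range"] = age_bin(years)  # step 5: 5-year bins
        prepared.append(new_row)
    # step 3: sort from largest to smallest age
    prepared.sort(key=lambda r: r["age_at_diagnosis_yrs"], reverse=True)
    return prepared


def to_csv(rows):
    """Serialize the prepared rows to CSV text (step 6)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Tiny made-up manifest; only BS_BA6AZWB3 is an ID taken from the lesson.
demo_manifest = io.StringIO(
    "id,name,sample_id,age_at_diagnosis\n"
    "f1,a.tsv.gz,BS_AAAA0001,1825\n"
    "f2,b.tsv.gz,BS_BA6AZWB3,13323\n"
    "f3,c.tsv.gz,BS_AAAA0002,4000\n"
)
demo_rows = prepare_phenotype(csv.DictReader(demo_manifest))
demo_csv = to_csv(demo_rows)
print(demo_csv)
```

To use it on a real manifest, replace the in-memory `demo_manifest` with an opened file and write the output of `to_csv` to a `.csv` file before uploading it to Cavatica. Note that, like the spreadsheet formula, `age_bin` places any age of 15 years or more in the `15-20` bin, which is why the 36.5-year-old sample is removed first.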