20200811 - LSCI RNAseq training - Course Outline
==
###### tags: `LSCI6101`
---
- Notes from last meeting: https://hackmd.io/BjhBpwJgS3KUY4HHNFfNNA
## Galaxy login
- Access from Lab WiFi
- Without TFC VPN : https://192.168.52.116:4443/
- With TFC VPN : https://192.168.52.116:4443/
- Access from Lab Desktop
- https://192.168.52.116:4443/
- Access from CUHK WiFi hotspots
- CUHK1x : https://137.189.51.116:2220/
- eduroam: follow "Access from outside internet" below
- Access from CUHK Intranet (e.g. Dr Chan's office)
- https://137.189.51.116:2220/
- Access from outside internet:
- With CUHK VPN : https://137.189.51.116:2220/ (Mainly for students)
- With TFC VPN : https://192.168.52.116:4443/
- Without VPN(!) : (Contingency, not enabled for now)
- Password for students
| Email Address | Password | Student Name | Galaxy Public Name |
|-----------------------------|----------|----------------------|----------------------|
| 1155136097@link.cuhk.edu.hk | n8epn4up | Gongli CAI | gongli_cai |
| 1009044520@link.cuhk.edu.hk | rm2s44r8 | Bimal GURUNG | bimal_gurung |
| 1155093103@link.cuhk.edu.hk | zqdnr36r | Man Ip HO | man_ip_ho |
| 1155093165@link.cuhk.edu.hk | k5ztfkbs | Nelson KEI | nelson_kei |
| 1155094962@link.cuhk.edu.hk | haxw9cv8 | Ching Ying KWOK | ching_ying_kwok |
| 1155094057@link.cuhk.edu.hk | k5d4enzu | Oi Yan LAU | oi_yan_lau |
| 1155135691@link.cuhk.edu.hk | phu6wgux | Jiaming Li | jiaming_li |
| 1155151916@link.cuhk.edu.hk | nakyg95e | Qiaoxia LIANG | qiaoxia_liang |
| 1155151912@link.cuhk.edu.hk | 5kjxns66 | Yongchao NIU | yongchao_niu |
| 1155136084@link.cuhk.edu.hk | 65azux5e | Kaike REN | kaike_ren |
| 1155151920@link.cuhk.edu.hk | kzcmu649 | Ka Li Jacquelyne SUN | ka_li_jacquelyne_sun |
| 1155091877@link.cuhk.edu.hk | qz68sdyf | Yunjia ZHANG | yunjia_zhang |
---
## Introducing Galaxy platform - Eugene
- Data management (Upload/Download/Organize)
- The workflow interface
- Viewing first 1MB of large datafiles
## RNA-seq data pre-processing (QC, Alignment) - Eugene
- Overview (Lecture)
- Why alignment is required to utilize information in RNA-seq data
- RNA-seq data can have QC issues that affect downstream analysis
- Downloading from SRA (Demonstration)
- Provide students with a Drosophila RNA-seq dataset, subsampled for the workshop
- (NCBI might block us for 12 simultaneous downloads during workshop)
- Structure of a sequencing library molecule (Lecture)
- Relationship between adapters, Insert and sequencing read
- FastQ format (Lecture)
- QC and trimming (Hands-on)
- Adapter/Quality trimming recommended settings
- Reference alignment (Hands-on)
- (Demonstration) Reference indexing, provide a ready-made Drosophila reference
- Short-read Alignment (Target:<20 min)
- Genomic Coordinate System (Lecture, during alignment wait time)
- Highlight how it differs from an mRNA-centric or protein-sequence-centric coordinate system (commonly used by biologists), and why the difference matters
- 0-base/1-base systems
- Projection of transcriptome elements (gene, transcript, exon) onto genomic coordinate system
- Sorting and indexing SAM/BAM files (Hands-on)
- Brief overview of each information column
- Mapping quality and multi-mapping phenomenon
- Extract uniquely-aligned reads in specific region
- SAM/BAM QC (Hands-on)
- Generate QC reports
- error rate, 3’/5’ bias, exonic reads proportion
- (Optional: visualize BAM file in genome browser, if not covered by other parts)
- Download BAM files/References and Using IGV
- Proposed Assessment
- 1. Complete the entire RNA-seq pre-processing workflow on their own
- Paired-end yeast RNA-seq data (from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/, 48 replicates available)
- SRA data download --> extract uniquely-aligned reads in sub-region, generate QC report
- 2. Case-study - ["Effect of RNA integrity on uniquely mapped reads in RNA-Seq"](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213542/)
- Low-RNA-integrity samples combined with a polyA RNA-seq workflow are known to introduce 3' bias in reads
- Provide students with reference material (e.g. paper)
- Require them to explain in paragraphs
- How low integrity input and polyA selection lead to 3' bias
- Why low integrity input does not introduce 3' bias when using an rRNA-depletion protocol
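
For the 0-base/1-base item in the genomic coordinate lecture above, a small sketch may help students see the off-by-one arithmetic. It assumes the usual conventions that BED intervals are 0-based and half-open while GFF/GTF intervals are 1-based and closed; the function names are illustrative and not part of any course material.

```python
from typing import Tuple


def bed_to_gff(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start, end]."""
    return start0 + 1, end0


def gff_to_bed(start1: int, end1: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start, end)."""
    return start1 - 1, end1


def length_bed(start0: int, end0: int) -> int:
    """Length of a 0-based half-open interval."""
    return end0 - start0


def length_gff(start1: int, end1: int) -> int:
    """Length of a 1-based closed interval (note the +1)."""
    return end1 - start1 + 1


# The first 100 bases of a chromosome in both systems.
bed_interval = (0, 100)
gff_interval = bed_to_gff(*bed_interval)
print(gff_interval)
print(length_bed(*bed_interval), length_gff(*gff_interval))
```

Only the start coordinate shifts between the two systems; the end coordinate is numerically the same, which is a common source of confusion when mixing annotation formats.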
## Quantification and differential expression analysis - Jizhou
- Quantification
- Basics in RNA-seq quantification
- Absolute quantification and relative quantification
- Gene and transcript level quantification
- Common units for RNA-seq quantification: read counts / FPKM / TPM
- Tools introduction: alignment-based and alignment-free quantification
- Demonstration: using htseq-count to quantify gene expression
- Modes in htseq-count
- Result interpretation
- Post-QC of the replicates using expression data (PCA)
- DE analysis
- Biological significance of DE analysis
- Data normalization
- Tools introduction
- Demonstration: using DESeq2
- DE genes filtering using P-value and FDR
- Result interpretation and visualization (volcano plot, venn diagram …)
- Functional pathway and GO term annotation using selected DE genes
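
To make the "read counts / FPKM / TPM" item above concrete, the following is a minimal sketch of how the two normalized units relate to raw counts. It assumes single-sample counts and uses annotated gene length in bp as a stand-in for effective length; the gene names, counts and lengths are made up for illustration.

```python
from typing import Dict


def fpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped fragments")
    # count / (length in kb) / (total in millions) == count * 1e9 / (length_bp * total)
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Transcripts per million: length-normalize first, then scale to sum to 1e6."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    rate_sum = sum(rates.values())
    if rate_sum == 0:
        raise ValueError("no mapped fragments")
    return {g: rates[g] / rate_sum * 1e6 for g in rates}


# Hypothetical toy data: counts scale with length, so per-base coverage is equal.
counts = {"geneA": 100, "geneB": 200, "geneC": 700}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 7000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
for gene in counts:
    print(gene, counts[gene], round(fpkm_values[gene], 1), round(tpm_values[gene], 1))
```

In this toy example the raw counts differ sevenfold, but both FPKM and TPM are identical across the three genes because the differences are explained entirely by gene length. TPM values always sum to one million within a sample, which FPKM values do not.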
## Isoform / splicing analysis - Alan
1. Genome annotation format
Recap the concept of genome and transcript in a bioinformatic sense, which is mainly about various genome annotation formats.
Introduce why isoform / splicing analysis is useful, even though quantification can already be done as covered in the previous session.
2. Transcriptome assembly, reference-based and de novo
Use StringTie2 as an example for the reference-based approach.
Use Trinity as an example for the de novo approach.
Introduce the concepts in layman's terms and compare the two approaches.
3. Isoform classification (reference-based)
Use the reference-based approach as an example.
Talk about isoforms that are missed by the genome annotation and why they can be important.
4. Splicing analysis (reference-based)
Use the reference-based approach as an example.
Discuss why splicing analysis is more appropriate when using NGS data.
5. Practice: transcriptome assembly and isoform / splicing analysis
Use the read alignment results (bam files) from the previous session.
Use StringTie2 and gffcompare for the analysis.
6. A very brief introduction to long-read RNA-seq
Introduce a bit about long-read RNA-seq by discussing the main drawback of the NGS approach in isoform analysis.
Make a transition to the following session.
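
As a possible aid for the genome annotation format recap in item 1, here is a minimal sketch of reading one GTF record. It assumes the standard nine tab-separated GTF columns with `key "value";` attribute pairs; the example line, its source label and its IDs (`GENE1`, `GENE1.1`) are hypothetical, and a real analysis would more likely rely on an established parser.

```python
import re
from typing import Dict, NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


# Matches attribute pairs of the form: key "value";
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line: str) -> GtfRecord:
    """Parse one non-comment GTF line into a GtfRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, _source, feature, start, end, _score, strand, _frame, attrs = fields
    return GtfRecord(
        seqname=seqname,
        feature=feature,
        start=int(start),
        end=int(end),
        strand=strand,
        attributes=dict(_ATTRIBUTE.findall(attrs)),
    )


# Hypothetical exon record, for illustration only.
example = '2L\texample\texon\t7529\t8116\t.\t+\t.\tgene_id "GENE1"; transcript_id "GENE1.1";'
record = parse_gtf_line(example)
print(record.feature, record.attributes["transcript_id"], record.end - record.start + 1)
```

Grouping exon records by `transcript_id` and transcripts by `gene_id` is what lets tools report several isoforms for the same gene.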
## Latest technologies: Long reads (PacBio IsoSeq + Nanopore) - Andy
- Background
- What's the difference between long reads and short reads
- The benefits of using long reads and the problems with them
- How to analyse PacBio IsoSeq/Nanopore data
- The files required
- The software that can be used to perform the analysis
- PacBio
- [SQANTI3](https://github.com/ConesaLab/SQANTI3)
- [TAMA](https://github.com/GenomeRIK/tama)
- [LoReAn](https://github.com/lfaino/LoReAn)
- [IsoSeq3](https://github.com/PacificBiosciences/IsoSeq)
- Nanopore
- [Medaka](https://github.com/nanoporetech/medaka)
- [pipeline-transcriptome-de](https://github.com/nanoporetech/pipeline-transcriptome-de)
- [pipeline-nanopore-ref-isoforms](https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms)
- The workflow in the analysis. A nice tutorial for PacBio IsoSeq is [here](https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md)
- Data processing and analysis (similar to short read approach)
- <i>de novo</i> based
- Step 1: Circular Consensus Sequence calling
- Step 2: Primer removal and demultiplexing
- Step 3: Refine
- Step 4: Clustering
- Alignment based
- Map long reads to a reference
- Minimap2
- deSALT
- GMAP
- STAR
- BLAT
- uLTRA
## Data visualization - Claire
- Data visualization principles
- Why necessary
- EDA vs publication
- Common data and corresponding plot types
- Discrete groups vs Time series
- Comparison of discrete groups
- Bar chart (skip through, example use: pathway / GO analysis, ie infer functions, note for error bars)
- Box plot / Swarm plot
- Indicating statistical significance
- Basic graphs for RNAseq
- Volcano plot, MA plot (just mention, echoing Jizhou's part)
- Heatmap
- Venn Diagram (again just recap)
- Unsupervised association with massive data
- Clustering, dimension reduction: e.g. PCA, tSNE, UMAP
- Network (introduce Cytoscape)
- Phylogenetic trees (skip)
- Genome Browser (mention, maybe included)
- Plotting tools options with Pros & Cons
- Excel (Easy for everyone; Good for simple EDA, but has many limitations)
- Prism, Galaxy (Good for generating publication grade figures, but not easy to tweak)
- Python, R (Most flexible, streamlined data feeding and figure tweaking, but need basic knowledge in the programming language)
- Figure legend writing
- Making scientifically sound & attractive figures
- Plots: The good, the bad, and the ugly
- Perception: visualization is to aid comprehension
- Be intuitive: Aid rather than distract
- Post-editing
- When & Why necessary
- What is good and acceptable and what is not
- **Hands-on**: Heatmap on DEG on Galaxy
- **Assessment**:
- Plot analysis (Q&A)
- Creating Boxplot? (different genes from same dataset)
---
## Suggestions for Dr Chan's lecture topics
- (Please list the slide numbers you think Dr Chan should keep)
- (Please suggest any background topics Dr Chan should cover in addition to the existing content)
- Slide numbers
- Genomics(4-19,51),Transcriptomics(3-28,66-72)
- Additional suggested background topics:
- Considerations when planning an RNA-seq experiment (#replicates, conditions, sequencing depth)
- Limitations of RNA-seq on detecting low-abundance transcripts (e.g. rare lncRNAs)
- Fractionation-then-sequencing technology
- Public RNA-seq resources (e.g. GTEx, ENCODE) for human/non-human species
- Skip:(Mention of presence of) Handy skills to ease bioinformatic analysis (e.g. basic Linux BASH, Python/R)
---
## Dataset preparation
- Common for pre-processing, alignment & isoform analysis
- Drosophila
- Prep: time the alignment beforehand
- Hands-on with a smaller file, proceed to next step with complete file
## AOB
- 116 reserved for course: from Aug 21 (temporary set)
- Fix Galaxy login not via lab desktops
- Finalise this note, ping on Whatsapp & send
---
## Notes from 20200817 meeting (UPDATED)
- Prepare slides of our sessions
- Eugene will send Dr Chan slides of preparation by students (Done)
- Leave off-lesson work for students
- Dr Chan will open BlackBoard and add all as instructors (Done)
- Upload PPT to Blackboard
- Best to stick to one or two environments
- Long-read: no need to be comprehensive
- Jizhou & Alan switch order
- Claire to simplify visualization part
- [New] Can you guys provide some suggested readings (books)?
### Schedule
- Dr Chan (3h)
- Eugene (3h)
- Alan (1.5h) + Jizhou (1.5h)
- Andy (1.5h) + Claire (1.5h)