RNA-Seq stands for RNA sequencing. It is a technology for studying RNA species. In a cell, the majority of RNA is ribosomal RNA (rRNA), which accounts for roughly 90% of total RNA. Conversely, messenger RNA (mRNA) accounts for only about 1-2% of total RNA. The central dogma of biology tells us that DNA is transcribed into mRNA, and mRNA is then translated into protein, which carries out biological functions. mRNA is the only RNA species known to encode protein sequences. Studying mRNA levels lets us measure gene expression and infer protein expression levels. Therefore, most of the time we are only interested in the level of mRNA and want to deplete rRNA from the total RNA.
Preprocessing RNA samples before sequencing includes total RNA extraction, mRNA enrichment, fragmentation, first-strand cDNA synthesis, RNA strand digestion, second-strand cDNA synthesis, 3' end repair, adenylation, and adapter ligation. In the total RNA extraction step, the extracted RNA should show no signs of degradation. In mRNA enrichment, poly-T beads or poly-T columns are used to separate mRNA from other RNA species, since mRNA carries a poly-A tail that rRNA and most other RNA species lack. The mRNAs are fragmented, chemically or by sonication, into a size range that suits the read length of the sequencer. The fragments are primed with random hexamers and reverse-transcribed into first-strand cDNA, followed by RNA strand digestion and second-strand cDNA synthesis. To make a strand-specific RNA-seq library, dTTP is replaced with dUTP in the second-strand cDNA synthesis step, so that the incorporated uracils mark the second strand. Because of the nature of polymerase synthesis, the double-stranded cDNA product is missing several bases at the 3' end of each strand. The 3' ends are repaired by a DNA end repair enzyme, and a single adenine is then added. Y-shaped adapters with a 3'-T overhang are ligated to the 3'-A overhang of the double-stranded cDNA. To reduce complexity in short read assembly, the dUTP-marked cDNA strands are digested with uracil-N-glycosylase (UNG). The remaining single-stranded cDNAs are then subjected to sequencing.
In the sequencing step, a primer targeting the adapter sequence anneals to one end of the cDNA. Fluorescence-labelled nucleotides (dATP, dCTP, dGTP, or dTTP) are incorporated one at a time, and each incorporation is photographed. The stack of photographs is analyzed by computer to determine which nucleotide was added to the growing strand in each cycle. For paired-end sequencing, a second primer targeting the adapter on the other end is added and the fragment is sequenced again. The result is two separate files, one with reads from the forward strand and the other with reads from the reverse strand.
Raw reads obtained from the sequencer should undergo quality control before downstream analysis. Raw read data are normally reported in FASTQ format, which stores 4 lines of information for each read. The first line is the read name, the second line is the sequence, the third line is a "+" sign, and the fourth line is a string of ASCII characters encoding the quality scores. For Illumina sequencers, the quality score is reported in the Phred+33 scheme: to obtain the numerical quality score, convert each ASCII character to its numeric code and subtract 33. The quality score represents how confident we can be that a base was called correctly. A Phred score Q corresponds to an error probability of 10^(-Q/10), so a quality score of 40 means a 1 in 10,000 chance that the base was called incorrectly. Thus, a higher quality score indicates a higher chance that the base was called correctly. Usually, we want to trim off bases with a quality score lower than 20. Low-quality base calls usually occur at the 3' end of reads, since by then the polymerase attaches less firmly to the template and is more prone to adding wrong nucleotides.
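The Phred+33 conversion and the quality-20 trimming rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular trimming tool: the function names and the sample record are made up for this example, and real trimmers typically use more elaborate rules (for instance sliding windows) than this simple cut from the 3' end.

```python
def phred33_scores(quality_line):
    """Convert a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_line, min_q=20):
    """Drop trailing bases whose quality is below min_q (simple 3' trimming)."""
    scores = phred33_scores(quality_line)
    end = len(scores)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_line[:end]


if __name__ == "__main__":
    # A made-up FASTQ record (4 lines), used only for illustration.
    record = ["@read1", "ACGTACGTAC", "+", "IIIIIIII5#"]
    seq, qual = record[1], record[3]
    print(phred33_scores(qual))    # numeric quality per base
    print(error_probability(40))   # error chance for a score of 40
    print(trim_3prime(seq, qual))  # read after 3' trimming
```

In the sample record, the character "I" decodes to 40, "5" to 20, and "#" to 2, so only the last base falls below the threshold and is removed.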
FastQC is a commonly used tool that gives an overview of read quality. It reports per base quality, per sequence quality, per base nucleotide content, per sequence GC content, number of duplicated sequences, overrepresented sequences, and adapter content metrics. The per base quality metric shows the distribution of quality scores as a box plot for each base position. The upper boundary of the yellow box is the third quartile of the quality scores, and the lower boundary is the first quartile. The red line in the middle of the box is the median quality score. The per sequence quality metric averages the base qualities of each read and shows the distribution of per-read quality. For good-quality reads, the distribution should be skewed toward the high-quality side. The per base nucleotide content metric shows the frequency of each nucleotide at each base position. In principle, the RNA has been randomly fragmented, so the frequency of each nucleotide should be even across the whole read length. However, RNA-Seq data are an exception: research has found that priming with random hexamers in the cDNA synthesis step is not fully random in which fragmented RNA sequences it selects, which results in biased nucleotide content at the 5' end of reads. These bases are still real bases from mRNAs, not artifacts, so it is fine to keep them for short read assembly. The per sequence GC content metric calculates the GC content of each read and reports the distribution of per-read GC content. GC content is like a fingerprint of a species: it has a roughly constant value for a given species. This also applies to the GC content of reads. The distribution of per-read GC content should peak at a value equal to that species' GC content. If the peak shifts away from the expected GC content, the cause may be contamination of the read dataset by a foreign species. The number of duplicated sequences metric shows how many reads are exact matches to each other.
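To make the per base quality and per sequence GC content metrics more concrete, here is a rough Python sketch of how such summaries could be computed from reads already loaded into memory. It does not reproduce FastQC's own calculations. The function names are hypothetical, the quartiles use the "inclusive" method of Python's statistics module (other methods give slightly different values), and the quality strings are assumed to be Phred+33.

```python
import statistics
from collections import Counter


def per_base_quality_summary(quality_lines):
    """For each read position, return (first quartile, median, third quartile)."""
    longest = max(len(q) for q in quality_lines)
    summary = []
    for pos in range(longest):
        # Collect the Phred+33 score at this position from every read long enough.
        scores = [ord(q[pos]) - 33 for q in quality_lines if len(q) > pos]
        if len(scores) < 2:
            # A single observation has no spread; report it for all three values.
            summary.append((scores[0], scores[0], scores[0]))
            continue
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summary.append((q1, median, q3))
    return summary


def gc_content(sequence):
    """Fraction of G and C bases in a read."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(sequences):
    """Count reads per rounded GC percentage (0 to 100)."""
    return Counter(round(100 * gc_content(s)) for s in sequences)


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    seqs = ["ATGC", "GCAT", "GGCC"]
    quals = ["IIII", "III5", "II5#"]
    print(per_base_quality_summary(quals))
    print(gc_histogram(seqs).most_common(1))  # most common GC percentage
```

Plotting the three values per position gives the box boundaries and median line of the per base quality plot, and the most common bin of the histogram is the peak to compare against the expected GC content of the species.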
For RNA-seq data sets, highly expressed transcripts are usually sequenced repeatedly, resulting in many duplicated reads. The overrepresented sequences metric reports sequences that appear repeatedly and account for more than 0.1% of total reads. Adapters are the sequences most likely to be captured by this metric. The adapter content metric aligns reads against publicly known adapter sequences and reports the alignment hits.
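The idea behind the duplicated and overrepresented sequence metrics can be illustrated with a short Python sketch that counts exact-match reads and flags sequences above the 0.1% threshold. This is only an approximation of the concept under simple assumptions: all reads fit in memory, duplicates are exact string matches, and the function name and threshold parameter are made up for this example. It does not attempt to reproduce FastQC's exact procedure.

```python
from collections import Counter


def duplication_summary(sequences, threshold=0.001):
    """Count exact-duplicate reads and list sequences above a frequency threshold."""
    counts = Counter(sequences)
    total = len(sequences)
    # Reads beyond the first copy of each distinct sequence count as duplicates.
    duplicated_reads = total - len(counts)
    # Keep (sequence, count, fraction of all reads) for frequent sequences.
    overrepresented = [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > threshold
    ]
    return duplicated_reads, overrepresented


if __name__ == "__main__":
    # Made-up reads; a high threshold is used because the example is tiny.
    reads = ["AAAA", "AAAA", "AAAA", "CCCC"]
    print(duplication_summary(reads, threshold=0.5))
```

Sequences reported this way can then be compared against known adapter sequences to decide whether adapter trimming is needed.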