## Steps 1-2: Mapping Reference Genome
We downloaded the reference files and used them to generate the STAR genome index. It took some effort to figure out which FASTA and GTF files to use:
* Homo_sapiens.GRCh38.dna.toplevel.fa
* Homo_sapiens.GRCh38.110.gtf
The command we ran to generate the genome index (using 14 of the 16 cores on our laptop):
```bash
STAR --runThreadN 14 \
     --runMode genomeGenerate \
     --genomeDir ./ \
     --genomeFastaFiles Homo_sapiens.GRCh38.dna.toplevel.fa \
     --sjdbGTFfile Homo_sapiens.GRCh38.110.gtf \
     --sjdbOverhang 49
```
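The STAR manual recommends setting `--sjdbOverhang` to the read length minus 1, so the value 49 above corresponds to 50 bp reads. A minimal sketch for checking this against your own data, assuming standard 4-line FASTQ records; the function name `max_read_length` and the file name in the usage comment are hypothetical:

```bash
#!/usr/bin/env bash
# Print the longest read length found in a FASTQ file.
# Assumes standard 4-line FASTQ records; .gz input is decompressed on the fly.
max_read_length() {
    local fastq="$1"
    if [[ "$fastq" == *.gz ]]; then
        gzip -dc "$fastq"
    else
        cat "$fastq"
    fi | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
              END { print max + 0 }'   # sequence is line 2 of each record
}

# Usage (sample.fastq is a hypothetical file name):
#   len=$(max_read_length sample.fastq)
#   echo "read length: $len, --sjdbOverhang: $((len - 1))"
```

Scanning the whole file is slow for large FASTQ files; piping only the first few thousand lines through the same `awk` filter is usually enough when all reads have the same length.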
## Steps 3-6: Mapping the FastQ files
### Steps 3-5: Preparing the Perl Scripts and .txt files
We downloaded the GitHub repo: https://github.com/MolBioBioinformatics/RNA_seq_analysis
We then modified the `sample_list.txt` and `configure.txt` files to match our setup.
### Step 6: Running the `RNAseq_align.pl` script
```bash
./RNAseq_align.pl sample_list.txt configure.txt
```
This script generated a separate, very large `.sam` file for each sample.
## Step 7: Quality Check
```bash
./RNAseq_qc.pl sample_list.txt configure.txt
```
This converted each SAM file to a BAM file, ran some quality checks, and generated a single Excel file.
All of this was done with Picard tools (which run on Java), and it ran on a single core.
- [x] Explore whether this step can be sped up using multiple cores
  - Yes, this can be (and has been) parallelized. Since the 16 files are processed independently, the script can simply be run on them in parallel using GNU `parallel`. I wrote a shell script for this. However, the bottleneck remains reading and writing the very large SAM files on the hard disk and converting them to BAM files. Nevertheless, we achieved some speed-up.
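
The parallel run can be sketched roughly as follows. This is not the exact script we used, only a minimal outline under stated assumptions: `sample_list.txt` has one sample per line with no header, `RNAseq_qc.pl` accepts a single-sample list, and per-sample outputs do not overwrite each other. The function names `split_sample_list` and `run_qc_parallel` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: split the sample list into one-line files, then run the QC
# script on each of them concurrently with GNU parallel.
# Assumptions (not confirmed by the protocol): one sample per line,
# no header line, and RNAseq_qc.pl works on a single-sample list.

split_sample_list() {
    local list="$1" outdir="$2"
    mkdir -p "$outdir"
    # Skip blank lines; write sample_001.txt, sample_002.txt, ...
    awk -v dir="$outdir" 'NF {
        n++
        f = sprintf("%s/sample_%03d.txt", dir, n)
        print > f
        close(f)
    }' "$list"
}

run_qc_parallel() {
    local list="$1" config="$2" jobs="${3:-4}"
    local chunks
    chunks=$(mktemp -d)
    split_sample_list "$list" "$chunks"
    # {} is replaced by each per-sample list; -j caps the concurrent jobs.
    parallel -j "$jobs" ./RNAseq_qc.pl {} "$config" ::: "$chunks"/sample_*.txt
}

# Usage:
#   run_qc_parallel sample_list.txt configure.txt 4
```

Because disk I/O on the SAM files is the bottleneck, a small job count may perform as well as a large one; it is worth timing a few values of the third argument.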
# References:
* Primary Reference:
  * Ji, Fei, and Ruslan I. Sadreyev. “RNA-Seq: Basic Bioinformatics Analysis.” Current Protocols in Molecular Biology 124, no. 1 (October 2018): e68. https://doi.org/10.1002/cpmb.68.
* https://www.youtube.com/playlist?list=PLi1VnGoeDGjvHvl83QySD2oAQYFHPRYso : a good RNA-seq tutorial by Sanbomics
* https://useast.ensembl.org/info/data/ftp/index.html : from where we downloaded the FASTA and GTF Files
* https://github.com/MolBioBioinformatics/RNA_seq_analysis
* https://github.com/broadinstitute/picard/releases/tag/1.120 : use exactly this release of Picard
* Download the refFlat.txt from here: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/
* https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
* https://github.com/danielecook/Awesome-Bioinformatics
* http://www.bioinformatics-brazil.org/r-peridot/
# Learning Resources and Courses:
* https://rnabio.org/course/ : They also organize workshops every year. Watch out for them :eyes: