Siddu Paaji's Prelim Mission

## Steps 1-2: Indexing the Reference Genome

We downloaded the reference files and used them to build the STAR genome index. It took some effort to figure out which FASTA and GTF files to use:

* Homo_sapiens.GRCh38.dna.toplevel.fa
* Homo_sapiens.GRCh38.110.gtf

The command that we ran to generate the index (using 14 cores on our 16-core laptop):

```bash
STAR --runThreadN 14 --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles Homo_sapiens.GRCh38.dna.toplevel.fa --sjdbGTFfile Homo_sapiens.GRCh38.110.gtf --sjdbOverhang 49
```

## Steps 3-6: Mapping the FastQ files

### Steps 3-5: Preparing the Perl Scripts and .txt files

We downloaded the GitHub repo: https://github.com/MolBioBioinformatics/RNA_seq_analysis

We then modified the `sample_list.txt` and `configure.txt` files to match our configuration.

### Step 6: Running the `RNAseq_align.pl` script

```bash
./RNAseq_align.pl sample_list.txt configure.txt
```

This generated a humongous `.sam` file for each sample separately.

## Step 7: Quality Check

```bash
./RNAseq_qc.pl sample_list.txt configure.txt
```

This converted each SAM file to a BAM file, ran some quality checks, and then generated a single Excel file. All of this was done using Picard tools (which run on Java). This step ran on a single core.

- [x] Explore if this step can be sped up using multiple cores
    - Yes, this can be (and has been) parallelized, since the 16 files are processed independently. Just run the script in parallel using GNU `parallel`. I wrote a shell script for this. However, the bottleneck remains reading the humongous SAM files from the hard disk and writing the converted BAM files back. Damn! Nevertheless, we achieved some speed-ups.

# References:

* Primary Reference:
    * Ji, Fei, and Ruslan I. Sadreyev. “RNA-Seq: Basic Bioinformatics Analysis.” Current Protocols in Molecular Biology 124, no. 1 (October 2018): e68. https://doi.org/10.1002/cpmb.68.
* https://www.youtube.com/playlist?list=PLi1VnGoeDGjvHvl83QySD2oAQYFHPRYso : a good RNA-seq tutorial by Sanbomics
* https://useast.ensembl.org/info/data/ftp/index.html : from where we downloaded the FASTA and GTF files
* https://github.com/MolBioBioinformatics/RNA_seq_analysis
* https://github.com/broadinstitute/picard/releases/tag/1.120 : use exactly this release of Picard
* Download the refFlat.txt from here: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/
* https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
* https://github.com/danielecook/Awesome-Bioinformatics
* http://www.bioinformatics-brazil.org/r-peridot/

# Learning Resources and Courses:

* https://rnabio.org/course/ : They also organize workshops every year. Watch out for them :eyes:
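
# Appendix: Sketch of the Parallel QC Run

Step 7 mentions running the QC script in parallel with GNU `parallel`. The shell script we actually used is not reproduced here; what follows is only a minimal sketch of the idea. It assumes that `sample_list.txt` holds one sample per line and that `RNAseq_qc.pl` accepts a list containing a single sample, neither of which has been checked against the repo. The names `split_sample_list`, `run_qc_in_parallel`, `JOBS`, and `CHUNK_DIR` are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: run the QC script once per sample, several samples at a time.
# ASSUMPTION: sample_list.txt has one sample per line, and RNAseq_qc.pl
# works when given a list that contains a single sample.
set -eu

SAMPLE_LIST="${1:-sample_list.txt}"
CONFIG="${2:-configure.txt}"
JOBS="${JOBS:-4}"                    # concurrent samples (hypothetical default)
CHUNK_DIR="${CHUNK_DIR:-qc_chunks}"  # hypothetical scratch directory

# Write each non-empty line of the list to its own one-line file
# (sample_01.txt, sample_02.txt, ...) and print how many were written.
split_sample_list() {
    local list="$1" outdir="$2" n=0 line
    mkdir -p "$outdir"
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            continue
        fi
        n=$((n + 1))
        printf '%s\n' "$line" > "$outdir/sample_$(printf '%02d' "$n").txt"
    done < "$list"
    echo "$n"
}

# Launch one RNAseq_qc.pl process per single-sample list, JOBS at a time.
run_qc_in_parallel() {
    split_sample_list "$SAMPLE_LIST" "$CHUNK_DIR" > /dev/null
    parallel --jobs "$JOBS" ./RNAseq_qc.pl {} "$CONFIG" ::: "$CHUNK_DIR"/sample_*.txt
}

# Do nothing unless the sample list exists and GNU parallel is installed.
if [ -f "$SAMPLE_LIST" ] && command -v parallel > /dev/null 2>&1; then
    run_qc_in_parallel
else
    echo "Nothing run: need $SAMPLE_LIST and GNU parallel on PATH" >&2
fi
```

Since disk I/O rather than CPU was the bottleneck for us, a small `JOBS` value is probably a sensible starting point. If `RNAseq_qc.pl` writes its summary to a fixed filename, per-sample runs would overwrite each other, so check the output naming before relying on this.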