--- robots: noindex, nofollow lang: pt-br breaks: true --- {%hackmd theme-dark %} # From SRA to read counts [TOC] ## Get the tools required ``` mkdir tools cd tools ``` ### The SRA Toolkit ``` wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.6/sratoolkit.3.0.6-ubuntu64.tar.gz tar -xzf sratoolkit.3.0.6-ubuntu64.tar.gz ``` ### FastQC ``` wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip unzip fastqc_v0.12.1.zip ``` ### Trimmomatic ``` wget https://github.com/usadellab/Trimmomatic/archive/refs/tags/v0.39.tar.gz tar -xzf v0.39.tar.gz ``` ### STAR ``` wget https://github.com/alexdobin/STAR/archive/2.7.11a.tar.gz tar -xzf 2.7.11a.tar.gz ``` ### Picard tools ``` wget https://github.com/broadinstitute/picard/releases/download/3.1.0/picard.jar ``` ## Download data ### Using a kart file from SRA selector ``` mkdir data_folder tools/sratoolkit.3.0.6-ubuntu64/bin/prefetch -p -X 999G --outdir data_folder --cart cart_DARXXXXXX_XXXXXXXXXXXX.krt > status.txt # if a key is needed tools/sratoolkit.3.0.6-ubuntu64/bin/prefetch -p --ngc /path/to/key/prj_XXXX.ngc -X 999G --outdir data_folder --cart cart_DARXXXXXX_XXXXXXXXXXXX.krt > status.txt ``` ### From SRA IDs ``` mkdir data_folder tools/sratoolkit.3.0.6-ubuntu64/bin/prefetch -p -X 999G --outdir data_folder SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX # if a key is needed tools/sratoolkit.3.0.6-ubuntu64/bin/prefetch -p --ngc /path/to/key/prj_XXXX.ngc -X 999G --outdir data_folder SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX ``` ## Convert SRA to fastq ``` cd data_folder mkdir ../fastq_folder readonly NPROC=8 tools/sratoolkit.3.0.0-ubuntu64/bin/fasterq-dump -x -e $NPROC -f -p --ngc /path/to/key/prj_XXXX.ngc --outdir ../fastq_folder SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX ``` ## Check read quality using FastQC ``` readonly DATADIR=~/fastq_folder readonly RUNDIRpre=$DATADIR/pre readonly NPROC=8 # create the running Directories if [ ! -d "$RUNDIRpre" ]; then echo "Creating RUNDIRpre" mkdir -p $RUNDIRpre fi tools/FastQC/fastqc -o $RUNDIRpre -t $NPROC $DATADIR/*.fastq ``` ## Use trimmomatic to remove adapters and filter reads ``` readonly DATADIR=~/fastq_folder readonly RUNDIR=$DATADIR/trimmomatic readonly NPROC=8 if [ ! -d '$RUNDIR' ]; then mkdir $RUNDIR fi for base in $(ls -d SRR*); do echo echo "Start of $base" echo java -jar tools/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 $DATADIR/${base}_1.fastq \ $DATADIR/${base}_2.fastq \ $RUNDIR/${base}_1_trim_pair.fastq \ $RUNDIR/${base}_1_trim_unpair.fastq \ $RUNDIR/${base}_2_trim_pair.fastq \ $RUNDIR/${base}_2_trim_unpair.fastq \ ILLUMINACLIP:tools/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \ SLIDINGWINDOW:4:15 MINLEN:30 -threads $NPROC done ``` ## Check read quality after filtering ``` readonly DATADIR=~/fastq_folder readonly DATADIRpos=$DATADIR/trimmomatic readonly RUNDIRpos=$DATADIR/pos readonly NPROC=8 # create the running Directories if [ ! -d "$RUNDIRpos" ]; then echo "Creating RUNDIRpos" mkdir -p $RUNDIRpos fi tools/FastQC/fastqc -o $RUNDIRpos -t $NPROC $DATADIRpos/*_trim_pair.fastq ``` ## Align reads and calculate read counts using STAR Ref files in the [GDC Reference Files](https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files) page: * [GRCh38.d1.vd1.fa](https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834); * [GDC.h38.d1.vd1 STAR2 Index Files (v36)](https://api.gdc.cancer.gov/data/c0008693-0583-4eac-bd5c-583070763893); ``` readonly dir_fq=~/fastq_folder readonly index=~/ref_files/star-2.7.5c_GRCh38.d1.vd1_gencode.v36 readonly ref_file=~/ref_files/GRCh38.d1.vd1.fa readonly NPROC=8 # STAR Alignment for base in SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX SRRXXXXXXXX; do echo $base # define R1 fastq filename fq1=$dir_fq/${base}_1_trim_pair.fastq # define R2 fastq filename fq2=$dir_fq/${base}_2_trim_pair.fastq tools/STAR-2.7.5c/bin/Linux_x86_64/STAR --readFilesCommand cat --runThreadN $NPROC --outFileNamePrefix $base"_" --twopassMode Basic --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes --quantMode GeneCounts TranscriptomeSAM --outSAMtype BAM Unsorted --outSAMunmapped Within --genomeLoad NoSharedMemory --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType Junctions SeparateSAMold WithinBAM SoftClip --chimOutJunctionFormat 1 --chimMainSegmentMultNmax 1 --outSAMattributes NH HI AS nM NM ch --readFilesIn $fq1 $fq2 --genomeDir $index # Remove fastq to save space (OPTIONAL) rm $fq1 $fq2 done ``` ### Using Picard to evaluate the aligment ``` java -jar tools/picard.jar CollectAlignmentSummaryMetrics -R $ref_file -I *.bam -O ${base}_picard.txt ``` ### Using FastQC to evaluate the aligment ``` readonly DATADIR=~/fastq_folder readonly RUNDIR=$DATADIR/bam_fastqc readonly NPROC=8 # create the running Directories if [ ! -d "$RUNDIR" ]; then mkdir -p $RUNDIR fi tools/FastQC/fastqc -o $RUNDIR -t $NPROC *.bam ``` ## Tips! ### Mounting a remote folder using sshfs ``` sshfs user@remove_address:remote_folder local_folder ``` ### Using nohup to detach from the terminal ``` # Ex. nohup wget https://github.com/broadinstitute/picard/releases/download/3.1.0/picard.jar & ``` ### Using MultiQC to visualize FastQC results in a single file ``` # Assuming a conda env is already setup mamba install -c conda-forge "pygments<3.0.0,>=2.14.0" mamba install -c conda-forge lzstring mamba install -c bioconda multiqc multiqc ~/fastq_folder/pre ```
×
Sign in
Email
Password
Forgot password
or
By clicking below, you agree to our
terms of service
.
Sign in via Facebook
Sign in via Twitter
Sign in via GitHub
Sign in via Dropbox
Sign in with Wallet
Wallet (
)
Connect another wallet
New to HackMD?
Sign up