# CRC Script for genome assembly

Notes for class on 8/18/20 and 8/20/20

[Assembly workflow](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/)

## Assembly script

```
#!/bin/bash
#$ -M <netid>@nd.edu        # Email address for job notification
#$ -m abe                   # Send mail when job begins, ends and aborts
#$ -pe smp 8                # Specify parallel environment and legal core size
#$ -q long                  # Specify queue
#$ -N simple_assem_script   # Specify job name - no spaces, put netid for easy id

module load bio/2.0

fastqc -t 8 *.gz
# look at quality of files - open in browser through FastX

# if needed, trim
# PE output order is: paired 1, unpaired 1, paired 2, unpaired 2
trimmomatic PE -threads 8 SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz \
    SRR2584863_1.trim.fastq SRR2584863_1.untrim.fastq \
    SRR2584863_2.trim.fastq SRR2584863_2.untrim.fastq \
    SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15

# check options for parameters with velveth --help
velveth Assem 31 -shortPaired -fastq -separate \
    SRR2584863_1.trim.fastq SRR2584863_2.trim.fastq

# check options for parameters with velvetg --help
# velvetg takes the directory that velveth created
velvetg Assem
```

---

**Trimmomatic Note:**
Trimmomatic needs an adapter file (NexteraPE-PE.fa in this command). This file won't be in your directory by default, so you need to either 1) copy it there, or 2) include the path to the file in your script. On the CRC, you can find the adapter files at the path

```
/afs/crc.nd.edu/x86_64_linux/b/bio/install/share/trimmomatic-0.39-1/adapters
```

and copy whichever file you need to the directory where you need it.

**FastQC Note:**
When FastQC completes successfully, you should see an html file and a zip file for each sequence file (i.e., each file that ends in fastq.gz). To look at these:

- Open a browser on your local machine and enter crcfe01.crc.nd.edu
- Log in with your netid/pw
- Select Gnome
- You should see a virtual desktop. Select 'files', then navigate to where your html files are
- Double click on these files just like on your local machine

What are you looking for?
General quality of the sequences, length, and adapter content. This will vary some with datasets, but you'll always want to trim poor quality bases off.

---

## Glimmer

```
# line to run in command line to see how it works, what results to expect
trainGlimmerHMM Asperg.fasta Asperg.cds -d train_aspeg
# -d names the directory where the training output is written
```

Get the 'Asperg' files by logging into crcfe02 (`<netid>@crcfe02.crc.nd.edu`) and copying the files to your working directory

```
cp /tmp/ND_ICG_AUG_20/* .
```

### Glimmer Script

```
#!/bin/bash
#$ -M <netid>@nd.edu    # Email address for job notification
#$ -m abe               # Send mail when job begins, ends and aborts
#$ -pe smp 3            # Specify parallel environment and legal core size
#$ -q long              # Specify queue
#$ -N PA42_glimmer

module load bio/2.0

# train on the PA42 genome and coding sequences, then predict genes
trainGlimmerHMM PA42.fasta PA42.cds -d train_PA42
glimmerhmm PA42.fasta train_PA42
```

#### Finding orthologs

eggnog5.embl.de/#/app/seqscan

Using the example file protein.fasta (copied with the files above), copy and paste the sequence into eggnog and search.

Alternatively, use [Blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi)

- Nucleotide to nucleotide
- Paste contents of protein.fasta into text box
- Select Daphnia (taxid:6668) under Organism
- Somewhat dissimilar sequences
- Click 'Blast'

## Running Scripts on the CRC

Submit a script by using the command ```qsub``` followed by the name of your script

```qsub assem.job```

### Checking job status

```qstat```

```qstat -u <netid>```

### After your job is finished

When you get an email that your job has ended, you can quickly check for errors by looking at the exit status. Exit status = 0 means your job ran without errors. If you get a nonzero exit status (for example, exit status = 1), or if the output is not what you expected, look for the output file from the job. It will look something like jobname.o850233 and should be in the directory you were in when you submitted the script.

### Other things you can do with header variables in job script

These lines let you submit array jobs

```
#$ -t 1-100   # Run tasks numbered 1 through 100
#$ -tc <N>    # Allow at most N tasks to run at the same time
```
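
As a minimal sketch of how an array job might use those header lines: Grid Engine sets the environment variable `SGE_TASK_ID` to a different number for each task, and a script can use that number to pick its own input. The sample names (`sampleA`, `sampleB`, `sampleC`), the job name, and the `pick_sample` helper are hypothetical and not part of the class files. The script only prints what it would process, so it is safe to try on the command line before submitting.

```shell
#!/bin/bash
#$ -N array_sketch   # Hypothetical job name
#$ -t 1-3            # Three tasks, numbered 1, 2 and 3
#$ -tc 2             # Run at most two tasks at the same time

# pick_sample TASK_ID NAME...
# Print the NAME at position TASK_ID (counting from 1).
pick_sample() {
    idx=$1
    shift
    # After the shift, "$#" is the number of sample names
    if [ "$idx" -lt 1 ] || [ "$idx" -gt "$#" ]; then
        echo "task id $idx is out of range" >&2
        return 1
    fi
    shift $((idx - 1))
    printf '%s\n' "$1"
}

# The scheduler sets SGE_TASK_ID for array tasks. It is unset on the
# command line and "undefined" in a non-array job, so fall back to 1.
task=${SGE_TASK_ID:-1}
if [ "$task" = "undefined" ]; then
    task=1
fi

# Hypothetical sample names, one per task
sample=$(pick_sample "$task" sampleA sampleB sampleC)
echo "Task $task would process ${sample}_1.fastq.gz and ${sample}_2.fastq.gz"
```

To do real work, the `echo` line would be replaced by the commands to run on that sample, and the range given to `-t` would need to match the number of sample names.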