# BI278 Lab 1 - Unix, genome files, and shell scripts

## 1. Course Unix Environment

### 1.1 Connecting to the BI278 Unix Environment

    ssh sdivit25@bi278

After this, you will be asked for a password (your email password). ssh stands for secure shell, which lets you connect to a remote computing environment from any computer.

You know you're in if it displays:

    [sdivit25@vcacbi278 ~]$

This is how you leave bi278:

    exit

### 1.2 Navigate and find my Colbyhome

`~` is the home directory.

    pwd                 #prints current working directory

ls variations:

    ls                  #lists everything in your current working directory
    ls ~
    ls /home/sdivit25
    ls .                #'.' is the 'current directory'
    #^these 3 lines give the same listing when you are in your home directory
    ls -lh              #asks for a more detailed list of directory contents (includes file permissions, file size, & date last modified)

    ls colbyhome
    ls /personal/sdivit25
    #^these two lines are the same
    #they are a link to my personal directory on Colby's fileserver (Filer)

    ls /courses/bi278   #how to access the BI278 course folder on the fileserver

cd is used to change directory:

    cd /personal/sdivit25   #example

### 1.3 Organizing my files

*Important point to remember:* Don't use spaces in directory or file names. Use an underscore, a period, etc. to separate words and keep names readable.
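The naming rule above can be tried out safely with a small sketch like the one below. It assumes `mktemp` is available and works in a throwaway directory; the directory names and the variables `scratch` and `n_dirs` are made up for illustration and are not part of the lab.

```shell
# Work in a throwaway directory so nothing in your real folders is touched
scratch=$(mktemp -d)
cd "$scratch"

mkdir lab_01_notes     # underscores: one name, one directory
mkdir lab 01 notes     # unquoted spaces: the shell sees three separate names
mkdir "lab 01 notes"   # quotes keep it as one name, but every later command needs the quotes too

ls -1                  # list what was actually created, one entry per line

# Count the entries (tr strips the padding some versions of wc add)
n_dirs=$(ls -1 | wc -l | tr -d ' ')
echo "directories created: $n_dirs"
```

The unquoted `mkdir lab 01 notes` quietly makes three directories instead of one, which is the kind of surprise the no-spaces rule avoids.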
Useful commands for organization:

    mkdir "name"       #makes a directory named whatever you list after this command
    rmdir "name"       #removes a specific directory; only works if it's empty
    rm -r "name"       #use this instead if the directory is not empty
    cp "a" "b"         #copies file a to b
                       #can be used to change location & filename at the same time
    mv "a" "b"         #moves a file from a to b
                       #can also be used to change a filename
    rm "filename"      #removes whatever you list after it; CANNOT BE REVERSED
    cat "filename"     #concatenates/prints to screen the entire contents of a file, as long as it is a text file
    less "filename"    #displays contents of a file one screen length at a time; use arrows to scroll up/down one line at a time
                       #exit by typing 'q'
    head "filename"    #prints to screen the top 10 lines of a file
    tail "filename"    #prints to screen the bottom 10 lines of a file
    man "command"      #basically a user manual for most Unix commands

Useful shortcuts for moving around/typing:

    .     #current location
    ..    #one directory above
    ~     #home; this takes you to whichever folder you started in when you connected to bi278
    *     #a wildcard that matches any string of characters (A-z; 0-9; etc.)
          #you can also use the wildcard to search for stuff:
          # "*eps" will give you things that end in eps
          # "fig*eps" will give you things that start with fig and end with eps

Example: Make 2 new directories and copy over the entire contents of this first lab folder (=directory), using a command that copies all the files (*) in this week's lab folder to one of your new folders.

    mkdir lab_01a lab_01b
    cp /courses/bi278/Course_Materials/lab_01a/* ./lab_01a
    #this copies the contents of the course materials folder lab_01a into your personal lab_01a folder; lab_01b stays empty

## 2. Collect basic genome statistics for multiple genomes

The lab_01b directory in the bi278 course materials has some genomes in text files:

    /courses/bi278/Course_Materials/lab_01b

### 2.1 Get basic info about genomes

Useful Unix commands:

    grep "pattern" "filename"   #finds a specific pattern within a file
                                #if the pattern is complicated, use quotes (' or ") to contain the pattern
                                #check out the manual for more options, esp. option -v
    wc "filename"               #essentially a word count of a file
                                #can be used to count lines (-l) in a file
                                #can be used to count characters (-c) in a file; strictly bytes, which is the same thing for plain-text sequence files
    tr                          #translates or deletes sets of characters
                                #play around w/ this one & check what it's actually doing when it "translates"

WARNING: In Unix, > sends the results of the command that preceded it to a file that you specify after it, and it will write over that file if it already exists. So put quotes around it (">") if you want to use > as a pattern.

    "some command" > "filename"

USEFUL: | is used to send (or "pipe") the output of one command directly to another command without saving the intermediary file. You'll see this in use in some of the Unix commands below.

Genome files are usually in FASTA format. Individual chromosomes/contigs or genes are labeled in a FASTA file by this symbol: >

#### 2.1.1 Kinds of sequences in each file

For a given genome, this command shows what kinds of sequences are included in the file:

    grep ">" /courses/bi278/Course_Materials/lab_01b/"filename"

#### 2.1.2 First two columns of table on Lab 1 pdf

The last column holds the number of G and C bases reported by the GC pipeline; divide it by the genome size to get the GC fraction.

| Organism | Strain | Contig count | Genome size (bp) | G+C count (bp) |
| -------- | ------ | ------------ | ---------------- | -------------- |
| P. agricolaris | baqs159 | 2 | 8721420 | 5375334 |
| P. bonniea | bbqs859 | 2 | 4098182 | 2406657 |
| P. bonniea | bbqs395 | 2 | 9058983 | 5593928 |
| P. bonniea | bbqs433 | 2 | 7829542 | 4948909 |
| P. fungorum | ATCC BAA-463 | 4 | 9058983 | 5593928 |
| P. hayleyella | bhqs11 | 2 | 4125700 | 2444079 |
| P. hayleyella | bhqs171 | 35 | 4088457 | 2421259 |
| P. hayleyella | bhqs21 | 35 | 4088512 | 2421281 |
| P. hayleyella | bhqs22 | 45 | 4084312 | 2418627 |
| P. hayleyella | bhqs23 | 36 | 4090401 | 2422298 |
| P. hayleyella | bhqs530 | 2 | 4118722 | 2439957 |
| P. hayleyella | bhqs69 | 2 | 4125852 | 2444184 |
| P. hayleyella | bhqs155 | 2 | 10062489 | 6230535 |
| P. sprentiae | WSM5005 | 5 | 7829542 | 4948909 |
| P. terrae | DSM 17804 | 4 | 10062489 | 6230535 |
| P. xenovorans | LB400 | 3 | 9702951 | 6077288 |

**KNOW WHICH FILE BELONGS TO WHICH ORGANISM**

    GCF_009455635.1_ASM945563v1_genomic.fna  = P. agricolaris
    GCF_009455625.1_ASM945562v1_genomic.fna  = P. bonniea bbqs859
    GCF_000961515.1_ASM96151v1_genomic.fna   = P. bonniea bbqs395
    GCF_001865575.1_ASM186557v1_genomic.fna  = P. bonniea bbqs433
    GCF_000961515.1_ASM96151v1_genomic.fna   = P. fungorum
    GCF_009455685.1_ASM945568v1_genomic.fna  = P. hayleyella bhqs11
    P.hayleyella_bhqs171.nanopore.fasta      = P. hayleyella bhqs171
    P.hayleyella_bhqs21.nanopore.fasta       = P. hayleyella bhqs21
    P.hayleyella_bhqs22.nanopore.fasta       = P. hayleyella bhqs22
    P.hayleyella_bhqs23.nanopore.fasta       = P. hayleyella bhqs23
    P.hayleyella_bhqs530.nanopore.fasta      = P. hayleyella bhqs530
    P.hayleyella_bhqs69.pacbio.fasta         = P. hayleyella bhqs69
    GCF_002902925.1_ASM290292v1_genomic.fna  = P. hayleyella bhqs155
    GCF_001865575.1_ASM186557v1_genomic.fna  = P. sprentiae
    GCF_002902925.1_ASM290292v1_genomic.fna  = P. terrae
    GCF_000756045.1_ASM75604v1_genomic.fna   = P. xenovorans

Partial commands, using test.fa:

    grep ">" /courses/bi278/Course_Materials/lab_01b/test.fa
    #test

    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa
    #-v keeps only the lines that do NOT contain ">", so this returns the 5 lines of sequence (the genetic code)

    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc
    #returns only Gs and Cs
    #nothing is translated here: -c takes everything NOT in the set GCgc and -d deletes it (As, Ts, and line breaks included)

    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc | wc -c
    #returns a number (# of characters), i.e. the number of Gs and Cs

    awk 'BEGIN {print (253/400)}'
    #returns the GC fraction (0.6325, i.e. 63.25% GC); multiply by 100 to get GC%
    #basically (number of G+C) / (number of A+T+G+C), using awk

Summary:

GC%

    grep -v ">" PATH/test.fa | tr -d -c GCgc | wc -c
    awk 'BEGIN {print (253/400)}'

Genome size

    grep -v ">" PATH/test.fa | tr -d -c ATGCatgc | wc -c

### 2.2 Write and run Unix script to automate collection of genome statistics

To open up the editor:

    nano

To close/exit the editor: Ctrl + X

Start with the following line on top to tell Unix that this is a bash shell script:

    #!/bin/bash

Type in 3 commands:

    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc | wc -c
    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c ATGCatgc | wc -c
    #awk 'BEGIN {print (253/400)}'

When you exit, nano asks you to save and type in a filename; use .sh as the file ending. The file should be in your working directory.

To execute the file, type:

    sh GC%_GenomeSize.sh   #only if you're in the directory

To make the script more flexible, change the filename into a variable that can be designated at execution:

    nano GC%_GenomeSize.sh   #changed test.fa to $1

To run this script using the new variable:

    sh GC%_GenomeSize.sh "fasta_filename"
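As a rough sketch of where the `$1` version of the script could go next, the block below wraps the same grep/tr/wc pipelines in a function and lets awk do the division, so the counts do not have to be typed in by hand. The function name `genome_stats`, the tab-separated output order, and the tiny demo FASTA are assumptions made for illustration; none of them come from the lab handout. It also assumes the file contains at least one base, since awk would otherwise divide by zero.

```shell
#!/bin/bash
# Sketch only: genome_stats is a hypothetical helper, not part of the lab handout.
genome_stats() {
    fasta="$1"   # first argument = FASTA file, the same role $1 plays in the script

    contigs=$(grep -c ">" "$fasta")                            # number of header lines
    size=$(grep -v ">" "$fasta" | tr -d -c ATGCatgc | wc -c)   # genome size in bases
    gc=$(grep -v ">" "$fasta" | tr -d -c GCgc | wc -c)         # number of G+C bases

    # Hand the shell variables to awk with -v and print:
    # file, contig count, genome size, G+C count, GC% (two decimals)
    awk -v f="$fasta" -v n="$contigs" -v size="$size" -v gc="$gc" \
        'BEGIN { printf "%s\t%d\t%d\t%d\t%.2f\n", f, n, size, gc, 100 * gc / size }'
}

# Tiny made-up FASTA so the sketch runs anywhere (NOT one of the lab genomes)
demo=$(mktemp)
printf '>contig1\nGGCCAATT\n>contig2\nGCGCGCAT\n' > "$demo"

stats=$(genome_stats "$demo")
echo "$stats"
```

In a script file saved as described above, the lines inside the function could sit at the top level and read `$1` straight from the command line. Running it once per file in /courses/bi278/Course_Materials/lab_01b (for example with a `for` loop) would be one way to collect numbers for every genome, though that step is an assumption and not something the lab spells out.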