# BI278 lab 1 – Unix, genome files, and shell scripts

## 1.1 Connect to bi Unix environment

- `ssh cgsnow25@bi278`: connect to the server.
- `exit`: log out when you are done.

## 1.2 Navigate and find colbyhome

When you log in, the prompt should read `[cgsnow25@vcacbi278 ~]$`. The `~` changes as you move around the directory structure.

- `pwd`: print working directory. Right after login, `pwd` shows `/home/cgsnow25`, so running `cd ~` and then `pwd` brings you back home from anywhere.
- `ls`, `ls ~`, `ls /home/students/d/cgsnow25`, `ls .`: each asks for a list of all of the files and directories (folders) in a location. With no argument, `ls` lists the current directory.
- `ls -lh`: asks for a more detailed list of the directory contents, with file sizes in human-readable units.
- `ls colbyhome`, `ls /personal/cgsnow25`: `colbyhome` is a link to your personal directory on Colby's fileserver, Filer. You can also mount this folder on a Mac: from the desktop, press Command-K, enter `smb://filer.colby.edu`, and use Finder to navigate to the folder.
- `ls /courses/bi278`: lets you access the Courses fileserver.
- `cd`: "change directory to". `cd colbyhome` or `cd /personal/cgsnow25` takes you to colbyhome, where you can type `ls` to find files.
- `cd ..` then `pwd`: moves you up one directory, to the parent directory. `.` is the current directory, so `cd .` then `pwd` leaves you in the same place.

Once you're in colbyhome, it won't be a visible destination from your current location, but you can navigate the directory structure using `cd`.

## 1.3 Organize your files

Don't use spaces in directory or file names.
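
To see why spaces cause trouble, here is a minimal sketch that can be run safely in a throwaway directory. The directory names and the `workdir` variable are made up for illustration and are not part of the lab materials.

```shell
#!/bin/bash
# Work in a temporary directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create one directory with a space in its name and one without.
mkdir "My Documents" my_documents

# Unquoted, the shell splits on the space, so ls looks for "My" and "Documents".
ls My Documents 2>/dev/null || echo "ls could not find 'My' or 'Documents'"

# Escaping or quoting the space works, but it is easy to forget.
ls -d My\ Documents
ls -d "My Documents"

# A name without spaces needs no special handling.
ls -d my_documents
```

The first `ls` fails because it receives two separate arguments; the last three succeed.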

Example: a folder named `My Documents` would have to be typed in Unix as `ls My\ Documents`. Use an underscore, a period, or capitalization instead: `my_documents`, `my.documents`, `myDocuments`.

Example: make two directories and copy over the entire contents of the lab folder/directory.

- `mkdir lab_01a lab_01b`
- `cp /courses/bi278/Course_Materials/lab_01a/* ./lab_01a`
- `cd lab_01a`, then `ls`: go to the directory and list what was copied.

Commands for organizing a directory:

- `mkdir name`: make a directory named whatever you list after the command.
- `rmdir name`: remove a specific (empty) directory.
- `rm -r name`: remove a directory that is not empty, along with its contents.
- `cp a b`: copy file `a` to `b`; `b` can be a new location, a new file name, or both.
- `mv a b`: move a file from `a` to `b`; both must be specified.
- `rm filename`: remove whatever you list after it. This is irreversible.
- `cat filename`: concatenate, or print to screen the entire contents of a file.
- `less filename`: display the contents of a file one screen length at a time.
- `head filename`: print to screen the top 10 lines of a file.
- `tail filename`: print to screen the bottom 10 lines of a file.
- `man command`: manual for most Unix commands; shows how the command works.
- `command --help`: access usage information.

Useful shortcuts:

- `.` (current location)
- `..` (one directory above)
- `~` (home)
- `*` (a wildcard that will match any string of characters)

Press the up arrow key to step back through your most recent commands, or type `history` to see a list of them. Autocomplete saves time: if you start typing a command or filename and press Tab, the Unix shell will complete the name for you.

Example: moving all eps files into a folder.

- `ls *eps`: finds all files whose names end in eps.
- `mkdir eps_files`: creates a new directory.
- `mv *eps eps_files`: moves all files whose names end in eps into the new directory.

## 2. Collect basic genome statistics for multiple genomes

Don't copy the files over from the course folder. Refer to them by specifying where each file is (its PATH).

- `grep pattern filename`: find a specific pattern within a file; put the pattern in quotes.
- `wc filename`: count the words in a file; can also be used to count lines (`-l`) or characters (`-c`).
- `tr`: translate or delete sets of characters.

In Unix, `>` sends the results of the command that preceded it into a file that you specify after it, so use quotes if you want to use `>` as a pattern:

`grep ">" /courses/bi278/Course_Materials/lab_01b/filename`

Run this command for a given genome to see what kinds of sequences are in each file (for example, to see all GCF files, type `GCF*` for `filename`). In a GCF file, the headers (the lines that start with `>`) show how many chromosomes there are and name the organism and the strain. Other files have the organism and strain identity in the file name but not in the header.

| Organism | Strain | Contig count | Genome size (bp) | G+C count (bp) |
| -------------- | ------- | ------------ | ---------------- | ---- |
| P. agricolaris | baqs159 | 2 | 8721420 | 5375334 |
| P. bonniea | bbqs859 | 2 | 4098182 | 2406657 |
| P. bonniea | bbqs395 | 2 | 9058983 | 5593928 |
| P. bonniea | bbqs433 | 2 | 7829542 | 4948909 |
| P. fungorum | ATCC BAA-463 | 4 | 9058983 | 5593928 |
| P. hayleyella | bbqs155 | 2 | 10062489 | 6230535 |
| P. hayleyella | bhqs171 | 35 | 4088457 | 2421259 |
| P. hayleyella | bhqs21 | 35 | 4088512 | 2421281 |
| P. hayleyella | bhqs22 | 45 | 4084312 | 2418627 |
| P. hayleyella | bhqs23 | 36 | 4090401 | 2422298 |
| P. hayleyella | bhqs530 | 2 | 4118722 | 2439957 |
| P. hayleyella | bhqs69 | 2 | 4125852 | 2444184 |
| P. terrae | DSM17804 | 4 | 10062489 | 6230535 |
| P. xenovorans | LB400 | 3 | 9702951 | 6077288 |
| P. sprentiae | WSM505 | 5 | 7829542 | 4948909 |
| P. hayleyella | bhqs11 | 2 | 4125700 | 2444079 |

What is the size of your genome (how many total bases)?

`grep -v ">" PATH/test.fa | tr -d -c ATGCatgc | wc -c`

What is your GC%?

`awk 'BEGIN {print (253/400)}'`

Commands that help:

- `grep ">" PATH/test.fa`: prints the header line (test)
- `grep -v ">" PATH/test.fa`: the sequence lines, with the header removed
- `grep -v ">" PATH/test.fa | tr -d -c GCgc`: only the Gs and Cs
- `grep -v ">" PATH/test.fa | tr -d -c GCgc | wc -c`: G and C count, 253
- `awk 'BEGIN {print (253/400)}'`: GC%, 0.6325 (G and C count divided by total bases)

## Write and run a Unix script to automate your collection of genome statistics

A script bundles commands together. To open an editor within the Unix shell, type `nano`. Use Control-X to exit back to the normal terminal.

Start the file with `#!/bin/bash`, then type the commands:

    #!/bin/bash
    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c ATGCatgc | wc -c
    grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc | wc -c
    #awk 'BEGIN {print (253/400)}'

When asked for a file name, type `GC%_genomesize.sh`, and save it to the folder `~/lab_01b/`.

To execute (only works if you are in the directory `~/lab_01b`): `sh GC%_genomesize.sh`

Next, change the file name into a variable that can be designated at execution. Run `nano GC%_genomesize.sh` to go back and edit, change the file name (`test.fa`) in each line to `$1`, and exit while saving:

    grep -v ">" /courses/bi278/Course_Materials/lab_01b/$1 | tr -d -c ATGCatgc | wc -c
    grep -v ">" /courses/bi278/Course_Materials/lab_01b/$1 | tr -d -c GCgc | wc -c
    #awk 'BEGIN {print (253/400)}'

To execute and get the numbers for calculating GC%, type at the prompt: `sh GC%_genomesize.sh fasta_filename`

When done working in bi278, type `exit`.
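
As a possible next step, the two counts and the division could be combined so that GC% is printed directly instead of being typed into `awk` by hand. The sketch below is one way to do that, not the lab's required solution. The function name `genome_stats` and the tiny demo FASTA file are made up for illustration; the pipelines are the same ones used above.

```shell
#!/bin/bash
# Sketch: print genome size, G+C count, and GC fraction for one FASTA file.
# genome_stats is a hypothetical helper name, not part of the lab handout.
genome_stats() {
    fasta="$1"
    # Drop header lines, keep only A/T/G/C characters, and count them.
    size=$(grep -v ">" "$fasta" | tr -d -c ATGCatgc | wc -c)
    # Same pipeline, but keep only G and C.
    gc=$(grep -v ">" "$fasta" | tr -d -c GCgc | wc -c)
    # awk does the decimal division; the counts are passed in as variables.
    awk -v s="$size" -v g="$gc" 'BEGIN {
        if (s > 0) printf "%d\t%d\t%.4f\n", s, g, g / s
        else print "no bases found"
    }'
}

# Demo on a tiny made-up FASTA file (not one of the course genomes).
demo_fasta=$(mktemp)
printf '>seq1 demo\nATGCGC\nGGCCAT\n' > "$demo_fasta"
genome_stats "$demo_fasta"    # prints: 12  8  0.6667 (tab-separated)
```

Inside a script file, the last line could instead be `genome_stats "$1"`, so that the FASTA path is supplied at execution in the same way as in the script above.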