# BI278 Lab 1 - Unix, genome files, and shell scripts
1. Course Unix Environment
1.1 Connecting to the BI278 Unix Environment
ssh sdivit25@bi278
After this, you will be asked for a password (your email password). ssh stands for secure shell, which lets you connect to a remote computing environment from any computer.
You know you're in if it displays:
[sdivit25@vcacbi278 ~]$
This is how you leave bi278.
exit
1.2 Navigate and find my Colbyhome.
'~' #'home directory'
pwd #prints current working directory
ls variations
ls #lists everything in your current working directory
ls ~
ls /home/sdivit25
ls . #'.' is the 'current directory'
#^these commands all list the same directory as long as you are in your home directory
ls -lh #asks for a more detailed list of directory contents (includes file permissions, file size, & date last modified)
ls colbyhome
ls /personal/sdivit25
#^these two lines are the same
#they are a link to my personal directory on Colby's fileserver (Filer)
ls /courses/bi278 #how to access the BI278 course directory on the fileserver
cd changes the current directory to the path you give it
cd /personal/sdivit25 #example
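A small practice run of pwd, cd, and ls, as a sketch only. It works inside a throwaway directory made with mktemp -d, and the names project and data are made up for the demo (they are not part of the bi278 setup).

```shell
# Throwaway directory so nothing real is touched (assumption for this demo)
scratch=$(mktemp -d)
cd "$scratch"

mkdir project
mkdir project/data

cd project/data                 # move two levels down
here=$(basename "$(pwd)")       # last part of the current path
echo "$here"                    # data

cd ..                           # '..' is one directory above
parent=$(basename "$(pwd)")
echo "$parent"                  # project

cd "$scratch"                   # jump back using a full path
ls project                      # data
```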
1.3 Organizing my files
*Important point to remember: Don't use spaces in directory or file names. Use an underscore or period instead to separate words and keep names readable.
Useful commands for organization:
mkdir "name" #makes a directory named whatever you list after this command
rmdir "name" #removes a specific directory, only works if it's empty
rm -r "name" #removes a directory and everything inside it; use this when the directory is not empty **CANNOT BE REVERSED**
cp "a" "b" #copies file a to b
#can be used to change location & filename at same time
mv "a" "b" #move a file from a to b
#can also be used to change filename
rm "filename" #removes whatever you list after it **CANNOT BE REVERSED**
cat "filename" #concatenate/print to screen the entire contents of a file as long as it is a text file
less "filename" #displays contents of a file one screen length at a time; use arrows to scroll up/down one line at a time
#exit by typing 'q'
head "filename" #print to screen the top 10 lines of a file
tail "filename" #print to screen the bottom 10 lines of a file
man "command" #basically a user manual for most Unix commands
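To try the commands above safely, here is a minimal sketch that runs in a throwaway directory (mktemp -d). The file and directory names (draft.txt, notes_dir, etc.) are made up for illustration.

```shell
# Throwaway directory so nothing real is touched (assumption for this demo)
scratch=$(mktemp -d)
cd "$scratch"

mkdir notes_dir                       # make a directory
echo "first line" > draft.txt         # create a small text file
cp draft.txt notes_dir/copy.txt       # copy and change the filename in one step
mv draft.txt final.txt                # rename a file
cat final.txt                         # first line

# rmdir only works on empty directories, so this is refused
rmdir notes_dir 2>/dev/null || echo "rmdir refused: notes_dir is not empty"

rm -r notes_dir                       # removes the directory and its contents
remaining=$(ls)
echo "$remaining"                     # final.txt
```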
Useful shortcuts for moving around/typing
. #current location
.. #one directory above
~ #home, the directory you start in when you connect to bi278
* #a wildcard that will match any string of characters (letters, digits, etc.)
#you can also use the wildcard to search for stuff:
# "*eps" will give you things that end in eps
# "fig*eps" will give you things that start with fig and end with eps
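A quick way to see what the wildcard matches, sketched with empty files created by touch in a throwaway directory. The filenames are invented for the demo.

```shell
# Throwaway directory with a few empty, made-up files
scratch=$(mktemp -d)
cd "$scratch"
touch fig1.eps fig2.eps table1.eps fig_notes.txt

ls *eps        # fig1.eps fig2.eps table1.eps
ls fig*eps     # fig1.eps fig2.eps
ls fig*        # fig1.eps fig2.eps fig_notes.txt

# Count the matches (tr removes padding that some versions of wc add)
ends_in_eps=$(ls *eps | wc -l | tr -d ' ')
fig_eps=$(ls fig*eps | wc -l | tr -d ' ')
echo "$ends_in_eps $fig_eps"    # 3 2
```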
Example:
Make 2 new directories and copy over the entire contents of this first lab folder (=directory) using a command that says to copy all the files (*) in this week's lab folder to one of your new folders.
mkdir lab_01a lab_01b
cp /courses/bi278/Course_Materials/lab_01a/* ./lab_01a
#this copies the contents of the course materials directory lab_01a into your own lab_01a directory. lab_01b stays empty.
2. Collect basic genome statistics for multiple genomes
lab_01b directory in course materials in bi278 has some genomes in text files:
/courses/bi278/Course_Materials/lab_01b
2.1 Get basic info about genomes
Useful Unix commands:
grep "pattern" "filename" #find a specific pattern w/in a file
#if the pattern is complicated, use quotes (' or ") to contain the pattern
#check out manual for more options: esp option -v
wc "filename" #word count of a file essentially
#can be used to count lines (-l) in a file
#can be used to count characters (-c) in a file
tr #translate or delete sets of characters
#play around w/ this one & check what it's actually doing when it "translates"
WARNING:
In Unix, > sends the output of the command before it to the file you name after it, and it will overwrite that file if it already exists. Put quotes around it (">") when you want to use > as a pattern.
"some command" > "filename"
USEFUL:
| is used to send (or "pipe") the output of one command directly to another command without saving the intermediary file. You'll see this in use in some of the Unix commands below.
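The difference between > and | can be seen in a small sketch like the one below. It uses a made-up file fruit.txt in a throwaway directory. The >> operator (append instead of overwrite) is not covered in the notes above and is included here only for contrast.

```shell
# Throwaway directory so the overwrite is harmless
scratch=$(mktemp -d)
cd "$scratch"

printf 'apple\nbanana\ncherry\n' > fruit.txt   # > creates fruit.txt
printf 'kiwi\n' > fruit.txt                    # > again: the old contents are gone
after_overwrite=$(cat fruit.txt)
echo "$after_overwrite"                        # kiwi

printf 'apple\nbanana\ncherry\n' >> fruit.txt  # >> appends to the end instead

# | passes output straight to the next command, no intermediate file needed
line_count=$(cat fruit.txt | wc -l | tr -d ' ')
banana_hits=$(grep "an" fruit.txt | wc -l | tr -d ' ')
echo "$line_count $banana_hits"                # 4 1
```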
Genome files = usually in FASTA format
Individual chromosomes/contigs or genes are labeled in a FASTA file by this symbol: >
2.1.1 This command for a given genome will show what kinds of sequences are included in each file:
grep ">" /courses/bi278/Course_Materials/lab_01b/"filename"
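To see what this does without touching the course files, here is a sketch on a tiny made-up FASTA file (toy.fa, two invented records). grep -c counts matching lines instead of printing them, which is one possible way to get a contig count.

```shell
# Tiny hypothetical FASTA file with two records
scratch=$(mktemp -d)
cd "$scratch"
printf '>contig_1 chromosome\nATGCGC\nGGCC\n>contig_2 plasmid\nATAT\n' > toy.fa

# Print the header lines; the quotes stop > from being treated as a redirect
grep ">" toy.fa

# Count the header lines instead of printing them
contig_count=$(grep -c ">" toy.fa)
echo "$contig_count"    # 2
```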
2.1.2 First two columns of table on Lab 1 pdf
| Organism | Strain | Contig Count | Genome size (bp) | GC count (bp) |
| -------- | ------ | ------------ | ---------------- | ---- |
| P. agricolaris|baqs159| 2 | 8721420 | 5375334 |
|P. bonniea|bbqs859| 2 | 4098182 | 2406657 |
|P. bonniea|bbqs395| 2 | 9058983 | 5593928 |
|P. bonniea|bbqs433| 2 | 7829542 | 4948909 |
|P. fungorum|ATCC BAA-463 | 4|9058983|5593928|
|P. hayleyella|bhqs11| 2 |4125700|2444079 |
|P. hayleyella|bhqs171| 35 | 4088457 | 2421259 |
|P. hayleyella|bhqs21| 35 | 4088512 | 2421281 |
|P. hayleyella|bhqs22| 45 |4084312 | 2418627 |
|P. hayleyella|bhqs23| 36 | 4090401 | 2422298 |
|P. hayleyella|bhqs530| 2 | 4118722 | 2439957 |
|P. hayleyella|bhqs69| 2 | 4125852 | 2444184 |
|P. hayleyella|bhqs155|2 | 10062489 | 6230535 |
|P. sprentiae|WSM5005| 5 | 7829542 | 4948909 |
|P. terrae|DSM 17804| 4 |10062489 | 6230535 |
|P. xenovorans|LB400| 3 | 9702951 | 6077288 |
** KNOW WHICH FILE BELONGS TO WHICH ORGANISM **
GCF_009455635.1_ASM945563v1_genomic.fna = P. agricolaris
GCF_009455625.1_ASM945562v1_genomic.fna = P. bonniea bbqs859
GCF_000961515.1_ASM96151v1_genomic.fna = P. bonniea bbqs395
GCF_001865575.1_ASM186557v1_genomic.fna = P. bonniea bbqs433
GCF_000961515.1_ASM96151v1_genomic.fna = P. fungorum
GCF_009455685.1_ASM945568v1_genomic.fna = P. hayleyella bhqs11
P.hayleyella_bhqs171.nanopore.fasta = P. hayleyella bhqs171
P.hayleyella_bhqs21.nanopore.fasta = P. hayleyella bhqs21
P.hayleyella_bhqs22.nanopore.fasta = P. hayleyella bhqs22
P.hayleyella_bhqs23.nanopore.fasta = P. hayleyella bhqs23
P.hayleyella_bhqs530.nanopore.fasta = P. hayleyella bhqs530
P.hayleyella_bhqs69.pacbio.fasta = P. hayleyella bhqs69
GCF_002902925.1_ASM290292v1_genomic.fna = P. hayleyella bhqs155
GCF_001865575.1_ASM186557v1_genomic.fna = P. sprentiae
GCF_002902925.1_ASM290292v1_genomic.fna = P. terrae
GCF_000756045.1_ASM75604v1_genomic.fna = P. xenovorans
test.fa
Partial commands:
grep ">" /courses/bi278/Course_Materials/lab_01b/test.fa
#test
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa
#-v inverts the match, so this returns every line without ">", i.e. only the sequence (5 lines for test.fa)
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc
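#returns only Gs and Cs
#-d deletes characters and -c takes the complement of the set, so everything except G, C, g, c (including line breaks) is deleted; nothing is translated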
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc | wc -c
#returns a number (# of characters)
awk 'BEGIN {print (253/400)}'
#returns the GC fraction (0.6325, i.e. 63.25%)
#i.e. (number of Gs and Cs) / (number of As, Ts, Gs and Cs), computed with awk
Summary:
GC%
grep -v ">" PATH/test.fa | tr -d -c GCgc | wc -c
awk 'BEGIN {print (253/400)}'
Genome size
grep -v ">" PATH/test.fa | tr -d -c ATGCatgc | wc -c
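Instead of typing the two counts into awk by hand, they could be stored in variables and passed along. This is a sketch under assumptions: toy.fa is a made-up 14-base file, and the variable names gc, size, and gc_percent are my own.

```shell
# Tiny hypothetical FASTA file: 14 bases in total, 8 of them G or C
scratch=$(mktemp -d)
cd "$scratch"
printf '>seq1\nATGCGC\nGGCC\n>seq2\nATAT\n' > toy.fa

# Same pipelines as in the summary, with the results saved in variables
gc=$(grep -v ">" toy.fa | tr -d -c GCgc | wc -c | tr -d ' ')
size=$(grep -v ">" toy.fa | tr -d -c ATGCatgc | wc -c | tr -d ' ')

# -v hands shell values to awk; printf rounds to two decimals
gc_percent=$(awk -v gc="$gc" -v size="$size" 'BEGIN {printf "%.2f", 100 * gc / size}')

echo "GC bases: $gc  genome size: $size  GC%: $gc_percent"
# GC bases: 8  genome size: 14  GC%: 57.14
```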
2.2 Write and run unix script to automate collection of genome statistics
To open up editor
nano
To close/exit editor
ctrl + X
Start w/ following line on top to tell Unix that this is a bash shell script:
#!/bin/bash
Type in 3 commands:
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgc | wc -c
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c ATGCatgc | wc -c
#awk 'BEGIN {print (253/400)}'
When you exit, nano asks for a filename; use .sh as the file ending
File should be in working directory
To execute the file, you type in:
sh GC%_GenomeSize.sh
#this works only if you are in the directory that contains the script
To make script more flexible, change filename into variable that can be designated at execution:
nano GC%_GenomeSize.sh
#change test.fa to $1, which stands for the first argument typed after the script name
To run this script, using the new variable:
sh GC%_GenomeSize.sh "fasta_filename"
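Putting it together, a script using $1 might look like the sketch below. It is not the exact lab script: the name gc_stats.sh and the test file toy.fa are made up, and here $1 is used as the whole path to the FASTA file rather than being appended to the course materials path.

```shell
# Tiny hypothetical FASTA file to run the script on
scratch=$(mktemp -d)
cd "$scratch"
printf '>seq1\nATGCGC\nGGCC\n>seq2\nATAT\n' > toy.fa

# Write the script to a file; quoting 'EOF' keeps $1 literal inside it
cat > gc_stats.sh <<'EOF'
#!/bin/bash
# $1 is the FASTA file named on the command line
gc=$(grep -v ">" "$1" | tr -d -c GCgc | wc -c | tr -d ' ')
size=$(grep -v ">" "$1" | tr -d -c ATGCatgc | wc -c | tr -d ' ')
echo "GC bases: $gc"
echo "Genome size: $size"
awk -v gc="$gc" -v size="$size" 'BEGIN {print "GC fraction:", gc / size}'
EOF

# Run it the same way as in the notes, with the filename as the argument
result=$(sh gc_stats.sh toy.fa)
echo "$result"
```

Because the filename is an argument, the same script can be pointed at any of the genome files without editing it.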