Bi278 Lab Num 1 v2

## Bi278 Lab Num 1 (v2)
### By Lee Ferenc 9/12/2023
I just want to preface this by saying I have taken half of this course before and still have my notes, so I'll use them if that's okay.

## Exercise 1. Organization
### Connecting to the Unix Environment
Work is done in the terminal (macOS). Enter the remote computing environment through the secure shell (ssh) with `ssh userid@bi278`.
>`ssh enfere24@bi278` connects through the secure shell
>`exit` exits the session

Once connected, the directory you land in is your working directory.
#### Basic commands
>`pwd` print working directory
>`cd` change directory
>`ls` list all files in a directory (add `-lh` for a more detailed listing)
>`mkdir (name)` make directory
>`rmdir (name)` remove directory
>`cp a b` copy a to b
>`mv a b` move a to b
>`cat filename` concatenate or print the entire contents of a file (e.g. a .txt file)
>`less filename` display a file one screen length at a time; arrows to scroll, q to exit
>`head filename` print the first 10 lines
>`tail filename` print the last 10 lines
>`man command` manual for most Unix commands
#### Shortcuts
>`.` current location
>`..` one directory above
>`~` home, the folder you land in when you first connect to bi278
>`*` wildcard that matches any string of characters

(Unix directories are nested. Note that colbyhome stands for /personal/colbyid, which for me is /personal/enfere24.)
#### Example: making folders and copying the lab_01a files
>`mkdir lab_01a lab_01b` (makes two folders named lab_01a and lab_01b under /personal/enfere24, because that's my home directory (`cd ~`))
>`cp /courses/bi278/Course_Materials/lab_01a/* ./lab_01a` (copies all the files from the original class folder to my new "cooler" folder)

## Exercise 2. Working with Genome Files
Most genome files are in FASTA format. Each chromosome or contig starts with a header line beginning with `>`.
#### Lab Unix Commands
>`grep pattern filename` (finds a pattern within a file; if the pattern is complex, use quotes to contain it.
Grep manual: https://www.gnu.org/software/grep/manual/grep.html)
>`wc filename` count the lines, words, and bytes in a file; count only lines with `-l` or only bytes (characters, in a plain ASCII file) with `-c`
>`tr` translate or delete sets of characters

(Note that an unquoted `>` sends the results of a command to whatever file is named after it, overwriting that file, so be careful. Always quote `">"` when it is a grep pattern.)
### 1. An Example
##### Getting basic information about a genome
>`grep ">" /courses/bi278/Course_Materials/lab_01b/"filename"`
#### Question 1
###### Example 1
>`grep ">" /courses/bi278/Course_Materials/lab_01b/GCF_009455625.1_ASM945562v1_genomic.fna`

Prints out:
```
>NZ_CP008760.1 Paraburkholderia xenovorans LB400 chromosome 1, complete sequence
>NZ_CP008762.1 Paraburkholderia xenovorans LB400 chromosome 2, complete sequence
>NZ_CP008761.1 Paraburkholderia xenovorans LB400 chromosome 3, complete sequence
```
So each `>` means one chromosome or contig.
###### Example 2
>`grep ">" /courses/bi278/Course_Materials/lab_01b/P.bonniea_bbqs395.nanopore.fasta`

Prints out:
```
>1 length=3124304 depth=1.00x
>2 length=884981 depth=0.94x circular=true
```
(I somehow chose the same files as the lab manual for both.)
#### Question 2

| Organism | Strain | Contig Count | Genome Size (bp) | GC % |
| --- | --- | --- | --- | --- |
| B. agricolaris | BaQS159 | 2 | 8721420 | 61.2% |
| P. bonniea | Bbqs859 | 2 | 4098182 | 58.7% |
| P. bonniea | bbqs395 | 2 | 4009285 | 58.8% |
| P. bonniea | bbqs433 | 2 | 4013203 | 58.8% |
| P. fungorum | ATCC BAA-463 | 4 | 9058983 | 61% |
| P. hayleyella | BhQS11 | 2 | 4125700 | 59.2% |

>`grep ">" /courses/bi278/Course_Materials/lab_01b/*` (I used the `*` wildcard so I didn't have to do it for every single file)

This gets the name, strain, and contig count.
>`wc -c /courses/bi278/Course_Materials/lab_01b/*`

This is pretty serviceable for a rough genome size, but it overcounts: if you look at the file, the header lines (which aren't ATCG sequence) and the line endings are counted too.
>`grep -v ">" PATH/lab_01b/filenamex | tr -d "\n" | wc -c`

I go into further detail below for the GC%, but `tr -d "\n"` deletes the line breaks.
That extra step isn't needed for the command below, because `-c` inverts the set, so the line breaks get deleted along with everything else.
>`grep -v ">" PATH/lab_01b/filenamex | tr -d -c GCgc | wc -c`

This version is much better for GC: it gets the GC count, which is then divided by the genome size for the GC%. `grep -v ">"` excludes the header lines. `tr -d -c GCgc` gets rid of every character except G, C, g, and c: `-d` means delete and `-c` means complement (the inverse of the set). Basically, instead of acting on all of x, it acts on everything but x, so it turns everything but GCgc into nothing, pesky line breaks included. Then `wc -c` counts the bytes (the same as characters here, because the file is plain ASCII with no multi-byte glyphs).

(TLDR: gets rid of the header lines, removes everything but G, C, g, and c, and then counts how many characters are left.)
###### Lab 2 will include the automation of the above
(Hint: I used nano with `wc -m $1` and `grep -v ">" $1 | tr -d -c GCgc | wc -c`, but I had difficulty automating the %, so I had to do it in the terminal or on a calculator.)
#### Question 3
##### 3a
The size of the genome is 400.
>`grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d "\n" | wc -c`
##### 3b
0.6325 or 63%
>`grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -c -d GCgc | wc -c`

This gets 253.
>`awk 'BEGIN {print (253/400)}'`

253/400 is 0.6325 or 63%. (I didn't use 4 digits in the table, oops.)
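
The percentage step can probably be automated by handing both counts to `awk`, since the shell itself only does integer math. Below is a minimal sketch of that idea, not necessarily what Lab 2 expects. The function name `gc_stats` and the demo sequence are made up for illustration; the `grep`, `tr`, `wc`, and `awk` calls are the same ones used in the notes.

```shell
# gc_stats is a hypothetical helper name, not something from the lab materials.
# Prints three fields: genome_size gc_count gc_fraction
gc_stats() (
    fasta="$1"
    # Drop header lines, delete line breaks, count the remaining bytes.
    size=$(grep -v ">" "$fasta" | tr -d "\n" | wc -c)
    # Drop header lines, delete everything except G/C/g/c, count what is left.
    gc=$(grep -v ">" "$fasta" | tr -d -c GCgc | wc -c)
    # The shell only does integer math, so awk handles the division.
    awk -v size="$size" -v gc="$gc" 'BEGIN {
        if (size == 0) { print "0 0 NA"; exit }
        printf "%d %d %.4f\n", size, gc, gc / size
    }'
)

# Demo on a throwaway sequence (made-up data, not a lab file).
demo=$(mktemp)
printf '>seq1 demo\nGGCC\nAATT\n' > "$demo"
gc_stats "$demo"    # prints: 8 4 0.5000
rm -f "$demo"
```

Given the counts from Question 3 (400 and 253), running `gc_stats /courses/bi278/Course_Materials/lab_01b/test.fa` should print `400 253 0.6325`.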