## Bi278 Lab Num 1 (v2)
### By Lee Ferenc 9/12/2023
I just want to preface this by saying I have taken half of this course before and still have my notes, so I'll use them if that's okay.
## Exercise 1. Organization
### Connecting to the Unix Environment
Work is done in the terminal (macOS). Enter the remote computing environment through the secure shell (ssh) with `ssh userid@bi278`.
> `ssh enfere24@bi278` Secure Shell
> `exit` Exits
After connecting, you land in your home directory, which is your starting working directory.
#### Basic commands
>`pwd` print working directory
>`cd` change directory
>`ls` list all files in a directory (add -lh for a more detailed description)
>`mkdir (name)` make directory
>`rmdir (name)` remove directory (must be empty)
>`cp a b` copy a to b
>`mv a b` move a to b
>`cat filename` concatenate or print the entire contents of a file (.txt file)
>`less filename` display file one screen length at a time, arrows to scroll, q to exit
>`head filename` print first 10 lines
>`tail filename` print last 10 lines
>`man command` (manual for most Unix commands)
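
Here is a minimal sketch of how these commands fit together, run in a throwaway directory so nothing in my home directory gets touched. The names `scratch`, `notes`, and `numbers.txt` are made up for practice, and `rm` (remove a file) is not in the list above but is needed before `rmdir` will work.

```shell
# Work in a temporary practice directory (hypothetical, created by mktemp).
scratch=$(mktemp -d)
cd "$scratch"
pwd                                  # print the working directory

mkdir notes                          # make a directory
# Write a 25-line practice file, one number per line.
awk 'BEGIN { for (i = 1; i <= 25; i++) print i }' > notes/numbers.txt
ls -lh notes                         # detailed listing of the directory

cp notes/numbers.txt copy.txt        # copy a to b
mv copy.txt renamed.txt              # move (here: rename) a to b

# head prints the first 10 lines; tr -d ' ' strips padding some wc versions add.
first_lines=$(head renamed.txt | wc -l | tr -d ' ')
last_line=$(tail -n 1 renamed.txt)   # -n 1 asks tail for only the final line
echo "$first_lines $last_line"

rm notes/numbers.txt                 # rmdir only removes empty directories
rmdir notes
```
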
#### Shortcuts
>`.` Current location
>`..` One directory above
>`~` home, takes you to the first folder you land in when connecting to bi278
>`*` wildcard that will match any string of characters
(Unix directories are nested. Note that colbyhome stands for /personal/colbyid, which for me is /personal/enfere24)
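
A quick sketch of the shortcuts in action, assuming a made-up practice tree (`outer/inner` plus three empty files) built in a temporary directory:

```shell
scratch=$(mktemp -d)                 # hypothetical practice directory
mkdir -p "$scratch/outer/inner"
touch "$scratch/outer/a.txt" "$scratch/outer/b.txt" "$scratch/outer/c.fna"

cd "$scratch/outer/inner"
cd ..                                # .. moves one directory up, into outer
here=$(basename "$(pwd)")
echo "$here"

# . is the current directory and * matches any string, so ./*.txt is every .txt file here.
txt_count=$(ls ./*.txt | wc -l | tr -d ' ')
echo "$txt_count"

cd ~                                 # ~ is the home directory
pwd
```
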
#### Making folder example and moving lab_01a to it
>`mkdir lab_01a lab_01b` (makes two directories named lab_01a and lab_01b under /personal/enfere24 because that's my home directory (`cd ~`))
>`cp /courses/bi278/Course_Materials/lab_01a/* ./lab_01a` (copies all the files from the original class folder to my new "cooler" folder)
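
To try the same pattern without access to bi278, here is a sketch that fakes the class folder inside a temporary directory. The `base` directory and the files `file1.txt` and `file2.txt` are stand-ins I made up, not the real course materials.

```shell
base=$(mktemp -d)                              # stand-in for the home directory
mkdir -p "$base/Course_Materials/lab_01a"      # stand-in for the class folder
touch "$base/Course_Materials/lab_01a/file1.txt" \
      "$base/Course_Materials/lab_01a/file2.txt"

cd "$base"
mkdir lab_01a lab_01b                          # one mkdir call, two new directories
cp "$base"/Course_Materials/lab_01a/* ./lab_01a   # * expands to every file in the source folder

copied=$(ls lab_01a | wc -l | tr -d ' ')       # how many files arrived
echo "$copied"
```
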
## Exercise 2. Working with Genome Files
Most genome files are in FASTA format. Each chromosome or contig starts with a header line that begins with `>`.
#### Lab Unix Commands
>`grep pattern filename` (finds a pattern within a file; if the pattern is complex, use quotes to contain the pattern. Grep manual: https://www.gnu.org/software/grep/manual/grep.html)
>`wc filename` counts the lines, words, and bytes in a file; can count only lines with `-l` or only bytes with `-c` (use `-m` for characters)
>`tr` translate or delete sets of characters
(Note that an unquoted `>` sends the results of a command to whatever file comes after it, overwriting that file, so be careful and keep the quotes in `grep ">"`)
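
Here is a small sketch of each command on a tiny made-up FASTA file (two fake contigs), so the behavior can be checked by eye. The file and its contents are invented for practice.

```shell
fasta=$(mktemp)                                # hypothetical practice file
printf '>contig1 example\nACGT\nGGCC\n>contig2 example\nTTAA\n' > "$fasta"

headers=$(grep ">" "$fasta")                   # quoted ">" is a search pattern
line_count=$(wc -l < "$fasta" | tr -d ' ')     # count lines only
no_t=$(printf 'ACGT\n' | tr -d T)              # tr -d deletes every T
lower=$(printf 'ACGT\n' | tr ACGT acgt)        # tr with two sets translates one into the other

# Unquoted > is a redirect: the output goes into the named file, replacing its contents.
grep ">" "$fasta" > "$fasta.headers"

echo "$headers"
echo "$line_count $no_t $lower"
```
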
### 1. An Example
##### Getting basic information about a genome
>`grep ">" /courses/bi278/Course_Materials/lab_01b/filename`
#### Question 1
###### Example 1
>`grep ">" /courses/bi278/Course_Materials/lab_01b/GCF_009455625.1_ASM945562v1_genomic.fna`
Prints out:
```
>NZ_CP008760.1 Paraburkholderia xenovorans LB400 chromosome 1, complete sequence
>NZ_CP008762.1 Paraburkholderia xenovorans LB400 chromosome 2, complete sequence
>NZ_CP008761.1 Paraburkholderia xenovorans LB400 chromosome 3, complete sequence
```
So each `>` line marks one chromosome or contig.
###### Example 2
>`grep ">" /courses/bi278/Course_Materials/lab_01b/P.bonniea_bbqs395.nanopore.fasta`
Prints out:
```
>1 length=3124304 depth=1.00x
>2 length=884981 depth=0.94x circular=true
```
(I somehow chose the same as the lab manual for both)
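
Since each header line is one contig, the contig count can also be read off directly with grep's `-c` option, which counts matching lines instead of printing them. A sketch on a made-up two-contig file (the headers imitate the format above, but the sequences are invented):

```shell
fasta=$(mktemp)                                # hypothetical practice file
printf '>1 length=8 depth=1.00x\nACGTACGT\n>2 length=4 depth=0.94x circular=true\nGGCC\n' > "$fasta"

# ^ anchors the match to the start of the line, where FASTA headers begin.
contig_count=$(grep -c "^>" "$fasta")
echo "$contig_count"
```
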
#### Question 2
| Organism | Strain | Contig Count | Genome Size (bp) | GC % |
| - | - | - | - | - |
| B. agricolaris | BaQS159 | 2 | 8721420 | 61.2% |
| P. bonniea | Bbqs859 | 2 | 4098182 | 58.7% |
| P. bonniea | bbqs395 | 2 | 4009285 | 58.8% |
| P. bonniea | bbqs433 | 2 | 4013203 | 58.8% |
| P. fungorum | ATCC BAA-463 | 4 | 9058983 | 61% |
| P. hayleyella | BhQS11 | 2 | 4125700 | 59.2% |
>`grep ">" /courses/bi278/Course_Materials/lab_01b/*` (I used the * wildcard so I didn't have to do it for every single file) This gets the name, strain, and contig count.
>`wc -c /courses/bi278/Course_Materials/lab_01b/*` is pretty serviceable for the genome size, but if you look at the file you should notice that the header lines aren't ATCG and that line endings are counted too.
>`grep -v ">" PATH/lab_01b/filenamex | tr -d "\n" | wc -c` gives the genome size without the headers. I go into further detail below for the GC%, but `tr -d "\n"` deletes the line breaks. It isn't needed in the GC command because `-c -d` there already deletes everything that isn't G or C, line breaks included.
>`grep -v ">" PATH/lab_01b/filenamex | tr -d -c GCgc | wc -c` is much better and gets the GC count, to be divided by the genome size for the GC%. `grep -v ">"` excludes the header lines. `tr -d -c GCgc` gets rid of every character but G, C, g, and c: `-c` takes the complement of the set (everything but GCgc) and `-d` deletes it, which also removes the pesky line breaks. Basically, instead of deleting all of x it deletes everything but x. Then `wc -c` counts the bytes (the same as characters here because the file is plain ASCII). (TLDR: Gets rid of the header lines, removes everything but G, C, g, and c, and then counts how many characters are left)
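
To see why the plain `wc -c` overcounts, here is a sketch comparing the three counts on a tiny made-up FASTA file with 12 bases, 6 of them G or C:

```shell
fasta=$(mktemp)                                # hypothetical practice file
printf '>contig1\nACGT\nGGCC\n>contig2\nTTAA\n' > "$fasta"

# Raw byte count: includes both header lines and every line break.
raw_bytes=$(wc -c < "$fasta" | tr -d ' ')
# Sequence only: drop header lines, delete line breaks, then count.
genome_size=$(grep -v ">" "$fasta" | tr -d "\n" | wc -c | tr -d ' ')
# G and C only: delete everything that is not in the set GCgc, then count.
gc_count=$(grep -v ">" "$fasta" | tr -d -c GCgc | wc -c | tr -d ' ')

echo "$raw_bytes $genome_size $gc_count"       # prints: 33 12 6
```

The trailing `tr -d ' '` is only there because some versions of `wc` pad the number with spaces.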
###### Lab 2 will include the automation of the above (Hint: I used nano with `wc -m $1` and `grep -v ">" $1 | tr -d -c GCgc | wc -c`, but I had difficulty automating the %, so I had to do it in the terminal or on a calculator)
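
One way the percentage step might be automated is to hand both counts to awk, since shell arithmetic only does integers. This is a sketch and not the Lab 2 solution: the function name `gc_percent` is made up, it assumes the file contains at least one base, and it is tested here on an invented practice file. Inside a script saved with nano, `$1` would be the file name in the same way.

```shell
# gc_percent FILE: print the GC percentage of a FASTA file (hypothetical helper).
gc_percent() {
    # Sequence length: no header lines, no line breaks.
    total=$(grep -v ">" "$1" | tr -d "\n" | wc -c)
    # Number of G and C bases, upper or lower case.
    gc=$(grep -v ">" "$1" | tr -d -c GCgc | wc -c)
    # awk does the floating-point division and formats it with one decimal place.
    awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.1f%%\n", 100 * gc / total }'
}

fasta=$(mktemp)                                # hypothetical practice file
printf '>contig1\nACGT\nGGCC\n>contig2\nTTAA\n' > "$fasta"
result=$(gc_percent "$fasta")
echo "$result"                                 # prints: 50.0%
```
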
#### Question 3
##### 3a
The size of the genome is 400 bp.
>`grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d "\n" | wc -c`
##### 3b
0.6325 or 63%
>`grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -c -d GCgc | wc -c` returns 253.
>`awk 'BEGIN {print (253/400)}'` 253/400 is 0.6325 or 63%
(I didn't use 4 digits in the table oops)