# Lab 01
## Unix, Genome Commands, and Shell Scripts
## Exercise 1: Our Course Unix Environment
### 1.1 Connecting to the Unix Environment
```
ssh kyamad23@bi278
exit
```
The first command above opens an SSH connection to the bi278 server and prompts for login credentials. Once the credentials are accepted, we can work with the files on the server directly from the terminal.
The exit command ends the session and logs the terminal out of the bi278 server.
### 1.2 Navigate and find colbyhome
```
# fails with a "No such file or directory" error
ls colbyhome
ls /home/students/k/kyamad23
```
The colbyhome link is currently unavailable, but "/home/students/k/kyamad23" is the location the link should point to. As long as I am logged onto the bi278 server I can access "/home/students/k/kyamad23", which is where the rest of the lab will be conducted.
### 1.3 Organizing Your Files
```
pwd
# returns '/home/students/k/kyamad23'
mkdir lab_01a lab01_b
# created two directories named "lab_01a" and "lab01_b"
```
The mkdir command in Unix creates new folders, or "directories", in the current working directory. The path of the current working directory can be printed with the command "pwd".
```
cp /courses/bi278/Course_Materials/lab_01a/* ./lab_01a
cd ./lab_01a/
pwd
```
The cp command takes the location of the file to copy as the first argument and the destination as the second argument. The "*" symbol is a "wildcard" that matches every file in a directory (i.e. all files that appear when using the command "ls").
The cd command changes the current working directory to the newly created "/home/students/k/kyamad23/lab_01a", and the pwd command lets us verify where our current working directory is.
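
The wildcard copy can be tried safely in a scratch directory. The sketch below is only an illustration: the directory and file names are made up, and `mktemp -d` is used so that nothing in the real lab folders is touched. It also shows that "*" skips hidden files, just as a plain "ls" does.

```shell
# Scratch area with hypothetical file names (not the course files)
demo=$(mktemp -d)
mkdir "$demo/source" "$demo/dest"
touch "$demo/source/plot.eps" "$demo/source/analysis.R" "$demo/source/.hidden"

# The shell expands * into the matching file names before cp runs
cp "$demo/source"/* "$demo/dest"

# Count what arrived; -A also lists dot files, so a copied .hidden would show up
count=$(ls -A "$demo/dest" | wc -l | tr -d ' ')
echo "files copied: $count"
```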
The files that were copied over from "/courses/bi278/Course_Materials/lab_01a/" contain a mix of figures, raw data files, and R scripts. We will create three new directories to make the workflow easier to follow.
```
mkdir figure data scripts
```
The mkdir command creates new directories in the current working directory with the names provided in the arguments. The code above will create the new directories "./figure", "./data", and "./scripts" in the folder "/home/students/k/kyamad23/lab_01a".
```
head *.R
tail *.R
less *.R
cat *.R
```
The commands head, tail, less, and cat all display the contents of a specified file in the terminal. "head" prints the first ten lines, "tail" prints the last ten lines, "less" shows the file one screen at a time with the ability to scroll using the arrow keys (press "q" to quit), and "cat" dumps the entire contents of the file onto the terminal at once.
The wildcard symbol "*" combined with ".filetype" selects all files in the current working directory with that file extension. In this case "*.R" matches every R script in the current working directory.
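
To see the ten-line defaults in action, here is a minimal sketch on a made-up 25-line file (the file name and contents are hypothetical stand-ins for an R script). The `-n` option of head changes how many lines are printed.

```shell
demo=$(mktemp -d)
# Hypothetical 25-line file: line i contains the number i
awk 'BEGIN { for (i = 1; i <= 25; i++) print i }' > "$demo/example.R"

# head with no options prints the first ten lines
head_lines=$(head "$demo/example.R" | wc -l | tr -d ' ')
# tail with no options prints the last ten lines, so it starts at line 16
tail_start=$(tail "$demo/example.R" | head -n 1)
# -n picks a different number of lines
short_lines=$(head -n 3 "$demo/example.R" | wc -l | tr -d ' ')

echo "$head_lines $tail_start $short_lines"
```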
```
cp *.R ./scripts
rm *.R
ls
```
All files with the ".R" extension were copied into the ./scripts directory. Now that the copies exist in the scripts directory, we can remove the original files from the current working directory. We did not need to specify the exact location of the R script files in the remove command because we never changed our current working directory.
```
mv *.eps ./figure
mv ./figure ./figures
man mv
```
An alternative to the "copy-and-remove" method is the "mv" command, which moves all files specified in the first argument to the new location specified in the second argument.
The second line of code shows how the mv command can also be used to rename a file or directory.
The man command can be used to get more information about the command specified in the argument. This includes things like different flags and arguments that can be provided for the command.
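
The two uses of mv (moving and renaming) can be checked in a scratch directory. This is a sketch with made-up file names, run inside a subshell so the cd does not affect the rest of the session.

```shell
demo=$(mktemp -d)
(
  cd "$demo" || exit 1
  mkdir figure
  touch fig1.eps fig2.eps      # hypothetical figure files

  mv *.eps ./figure            # move: the files leave the current directory
  mv ./figure ./figures        # rename: same directory, new name
)
ls "$demo/figures"
```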
```
cd ./data
mkdir achybrid_signal_and_preference raw_preference
ls
```
I changed the current working directory to data and created the new directories "achybrid_signal_and_preference" and "raw_preference" to house the pertinent data and the README files.
```
mv ../achybrid_* ./achybrid_signal_and_preference/
mv ../raw_preference_* ./raw_preference/
mv ../*.txt ./
```
Since we have changed our working directory, we use the symbol ".." to point the move command towards the parent directory, "/home/students/k/kyamad23/lab_01a". The pattern "achybrid_*" selects all files whose names begin with "achybrid_", and we move them to our newly created "achybrid_signal_and_preference" directory in the data folder. We do the same for any file that begins with "raw_preference_".
The last command moves the remaining ".txt" files from the lab directory into the data directory itself, not into a subdirectory.
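
The following sketch reproduces the same pattern on made-up files to show exactly what a prefix wildcard picks up from the parent directory. All names here are hypothetical; note that a file with "achybrid" in the middle of its name is not matched by "achybrid_*".

```shell
demo=$(mktemp -d)
mkdir -p "$demo/lab/data/achybrid"
touch "$demo/lab/achybrid_signal.csv" \
      "$demo/lab/notes_achybrid.csv" \
      "$demo/lab/README.txt"
(
  cd "$demo/lab/data" || exit 1
  mv ../achybrid_* ./achybrid/   # only names that START with achybrid_
  mv ../*.txt ./                 # only names that END with .txt
)
ls "$demo/lab" "$demo/lab/data" "$demo/lab/data/achybrid"
```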
## Exercise 2: Collect Basic Genome Statistics from Multiple Genomes
```
cd /home/students/k/kyamad23/lab01_b
pwd
ls
```
We change our current working directory to the lab01_b directory. The pwd command confirms that we have successfully changed our working directory, and the ls command lists all files in it. This folder is currently empty.
```
grep ">" /courses/bi278/Course_Materials/lab_01b/GCF_000756045.1_ASM75604v1_genomic.fna
```
The grep command searches for the pattern given in the first argument in the file given in the second argument (similar to how Cmd+F works in a browser) and prints every line that contains it. In this case, we look for lines containing the symbol ">" and get the following result:
```
>NZ_CP008760.1 Paraburkholderia xenovorans LB400 chromosome 1, complete sequence
>NZ_CP008762.1 Paraburkholderia xenovorans LB400 chromosome 2, complete sequence
>NZ_CP008761.1 Paraburkholderia xenovorans LB400 chromosome 3, complete sequence
```
This shows us that there are three chromosomes in the GCF_000756045.1_ASM75604v1_genomic.fna file. The header after the ">" symbol tells us which organism and strain the sequences come from, as well as the accession used to find them in the NCBI database. This particular .fna file corresponds to the organism P. xenovorans, strain LB400, with a contig count of 3 since there are three headers.
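
Instead of counting header lines by eye, grep can count them with its `-c` option. The sketch below uses a made-up two-record FASTA file rather than the course data, and anchors the pattern with "^" so only lines that start with ">" are counted.

```shell
demo=$(mktemp -d)
# Hypothetical FASTA file with two records
printf '>contig_1 example\nACGT\nGGCC\n>contig_2 example\nTTAA\n' > "$demo/toy.fna"

# -c prints the number of matching lines instead of the lines themselves
contigs=$(grep -c '^>' "$demo/toy.fna")
echo "contig count: $contigs"
```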
```
grep ">" /courses/bi278/Course_Materials/lab_01b/P.bonniea_bbqs395.nanopore.fasta
```
We look for the same pattern in the file, "P.bonniea_bbqs395.nanopore.fasta" and get the following result:
```
>1 length=3124304 depth=1.00x
>2 length=884981 depth=0.94x circular=true
```
This file corresponds to the organism P. bonniea, strain bbqs395, and has a contig count of 2 since it has two headers. We repeat this process for all files in the lab_01b folder in order to fill out the following table:
|Organism | Strain | Contig count |
| ------------- | -------- | ------------- |
|P. agricolaris | baqs159 | 2 |
|P. bonniea | bbqs859 | 2 |
|P. bonniea | bbqs433 | 2 |
|P. bonniea | bbqs395 | 2 |
|P. fungorum | baa463 | 4 |
|P. hayleyella | bhqs11 | 2 |
|P. hayleyella | bhqs155 | 2 |
|P. hayleyella | bhqs22 | 45 |
|P. hayleyella | bhqs23 | 36 |
|P. hayleyella | bhqs171 | 35 |
|P. hayleyella | bhqs530 | 2 |
|P. hayleyella | bhqs21 | 35 |
|P. hayleyella | bhqs69 | 2 |
|P. sprentiae | WSM5005 | 5 |
|P. terrae | DSM 17804 | 4 |
|P. xenovorans | LB400 | 3 |
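
Repeating the header count for every file can also be done in one pass with a for loop. This is only a sketch of that idea on two made-up files; the file names and the ".fasta" pattern are assumptions, and the course folder mixes ".fna" and ".fasta" files, so the wildcard would need adjusting there.

```shell
demo=$(mktemp -d)
# Two hypothetical genome files
printf '>a1\nACGT\n>a2\nGGCC\n' > "$demo/strain_a.fasta"
printf '>b1\nACGT\n' > "$demo/strain_b.fasta"

# For each file, print its name followed by the number of header lines
counts=$(for f in "$demo"/*.fasta; do
  printf '%s %s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
done)
echo "$counts"
```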
```
# returns ">test"
grep ">" /courses/bi278/Course_Materials/lab_01b/test.fa
# returns everything aside from ">test"
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa
# only returns A's, G's, C's, and T's:
# tr -d -c deletes every character NOT in the listed set, including line end characters
grep -v ">" /courses/bi278/Course_Materials/lab_01b/test.fa | tr -d -c GCgcATat
# PATH in the commands below is a placeholder for the directory that holds test.fa
# counts up only the bases and does not count the line end characters
grep -v ">" PATH/test.fa | tr -d -c GCgcATat | wc -c
# counts up only the G's and C's
grep -v ">" PATH/test.fa | tr -d -c GCgc | wc -c
# divides the G+C count by the total base count to get the GC fraction
awk 'BEGIN {print (253/400)}'
```
The first command returns the header line of the file "test.fa". The "-v" flag in the second command inverts the match, so it returns every line except the header. The third command pipes those lines into tr, which deletes every character that is not an A, C, G, or T (upper or lower case), including the invisible line end characters. The fourth command counts the characters that remain after removing the header and the line end characters. This count is the total genome size contained in the file test.fa. The fifth command counts only the G's and C's, with the tr command in the pipe removing all other characters.
The final command prints the quotient 253/400, that is, the G+C count divided by the total base count. The result, 0.6325, is the GC content as a fraction; multiplying by 100 gives the GC percentage (63.25%).
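
The counts do not have to be typed into awk by hand. As a minimal sketch, the two counts can be stored in shell variables and passed to awk with `-v`. The toy file and the variable names below are made up for illustration; the toy sequence has 12 bases, 6 of which are G or C.

```shell
demo=$(mktemp -d)
# Hypothetical FASTA file: 12 bases in total, 6 of them G or C
printf '>toy\nGGCC\nATAT\nGCAT\n' > "$demo/toy.fa"

# Total number of bases (header and line ends removed); tr -d ' ' strips padding some wc versions add
total=$(grep -v ">" "$demo/toy.fa" | tr -d -c GCgcATat | wc -c | tr -d ' ')
# Number of G and C bases only
gc=$(grep -v ">" "$demo/toy.fa" | tr -d -c GCgc | wc -c | tr -d ' ')

# Pass both counts into awk and print the percentage with one decimal place
percent=$(awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.1f", 100 * gc / total }')
echo "size=$total gc=$gc percent=$percent"
```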
```
nano
```
The nano command opens a text editor inside the Unix shell. Enter the commands from the previous code block and save the file with the ".sh" extension.
Once we are back at the main terminal prompt, we run the following command to execute the script.
```
sh script_name.sh
```
The results of the script will print to the terminal window.
By replacing the hard-coded filename in the script with "$1", the script takes the filename as its first command-line argument, so the same script can be run on any file. We will now use this script to fill out the rest of the table.
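
One possible shape for such a script is sketched below. It is not necessarily the script used in the lab: the name "genome_stats.sh", the output format, and the toy input file are all assumptions made for illustration. The script is written to a scratch directory with a here-document and then run on a made-up FASTA file.

```shell
demo=$(mktemp -d)

# Write a hypothetical script; the quoted 'EOF' keeps $1 from being expanded here
cat > "$demo/genome_stats.sh" <<'EOF'
#!/bin/sh
# $1 is the first argument given on the command line (the FASTA file)
total=$(grep -v ">" "$1" | tr -d -c GCgcATat | wc -c)
gc=$(grep -v ">" "$1" | tr -d -c GCgc | wc -c)
# Print genome size, G+C count, and GC percentage
awk -v gc="$gc" -v total="$total" \
  'BEGIN { printf "%d %d %.1f\n", total, gc, 100 * gc / total }'
EOF

# Hypothetical input: 12 bases, 6 of them G or C
printf '>toy\nGGCCAT\nATATGC\n' > "$demo/toy.fa"

result=$(sh "$demo/genome_stats.sh" "$demo/toy.fa")
echo "$result"
```

Running the same script on a different genome only requires changing the argument after the script name.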
|Organism | Strain | Contig count | Genome Size (bp) | GC %|
| ------------- | -------- | ------------- | ---------------- | --- |
|P. agricolaris | baqs159 | 2 | 8721420 | 61.6|
|P. bonniea | bbqs859 | 2 | 4098182 | 58.7|
|P. bonniea | bbqs433 | 2 | 4013203 | 58.7|
|P. bonniea | bbqs395 | 2 | 4009285 | 58.8|
|P. fungorum | baa463 | 4 | 9058983 | 61.8|
|P. hayleyella | bhqs11 | 2 | 4125700 | 59.2|
|P. hayleyella | bhqs155 | 2 | 4118676 | 59.2|
|P. hayleyella | bhqs22 | 45 | 4084312 | 59.2|
|P. hayleyella | bhqs23 | 36 | 4090401 | 59.2|
|P. hayleyella | bhqs171 | 35 | 4088457 | 59.2|
|P. hayleyella | bhqs530 | 2 | 4118722 | 59.2|
|P. hayleyella | bhqs21 | 35 | 4088512 | 59.2|
|P. hayleyella | bhqs69 | 2 | 4125852 | 59.2|
|P. sprentiae | WSM5005 | 5 | 7829542 | 63.2|
|P. terrae | DSM 17804 | 4 | 10062489 | 61.2|
|P. xenovorans | LB400 | 3 | 9702951 | 62.6|