# bi278 fall2023 lab practical
First, make a copy of the markdown version of this page. Then fill in the commands to execute each step within a:
```
#code block
```
Fill all empty code blocks and tables, and answer all questions.
> Reminder that `ls` and autocomplete are super useful and should be tools you use all the time, to make sure you are where you think you are, that files are where you think they are, that commands executed successfully, etc.
0. SSH onto bi278: `ssh tsbroa25@bi278`.
1. Make a new directory in your home (`~`) called: practical
```
mkdir practical
ls
```
2. Now go into this directory.
```
cd practical
```
3. Copy only the `*.count` files from `/courses/bi278/Course_Materials/practical` to your current location.
```
cp /courses/bi278/Course_Materials/practical/*.count .
```
Prove that you've done the above by including a screenshot of the path and contents of your working directory.

4. Among the fasta files (`*.fasta` and `*.fna`) in the courses' practical directory above, find out which organisms the two genome files belong to.
```
grep ">" /courses/bi278/Course_Materials/practical/*.fna
```
This command found Burkholderia multivorans (SRR8885150) and Burkholderia cepacia (SRR2558789).
5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why you think so.
Finished. I believe they are most likely finished genomes, because their involvement in cystic fibrosis gives researchers a medical motivation to study these organisms in depth. Additionally, when running `vdb-dump --info` followed by the SRR number, the reads look fairly complete based on the information the command reports, considering that these are single-celled organisms.
6. Find the genome size and GC% for the genome files.
To get the genome sizes I used `grep -v ">" /courses/bi278/Course_Materials/practical/GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c ATGC | wc -c` for B. cepacia and `grep -v ">" /courses/bi278/Course_Materials/practical/GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c ATGC | wc -c` for B. multivorans. B. cepacia has a genome size of 8,195,038 bp and B. multivorans has a genome size of 6,322,859 bp.
Next I used `grep -v ">" /courses/bi278/Course_Materials/practical/GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c CG | wc -c` to count the G and C bases in B. cepacia, then divided that count by the genome size with `awk 'BEGIN {print (GC_count/total_count)}'` (the two names stand in for the numbers from the previous commands). This gives a GC content of 66.965% for B. cepacia. Repeating the same steps for B. multivorans gives a GC content of 67.242%.
```
[tsbroa25@vcacbi278 ~]$ grep -v ">" /courses/bi278/Course_Materials/practical/GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c ATGC| wc -c
8195038
[tsbroa25@vcacbi278 ~]$ grep -v ">" /courses/bi278/Course_Materials/practical/GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c CG| wc -c
5487852
[tsbroa25@vcacbi278 ~]$ awk 'BEGIN {print (5487852/8195038)}'
0.669655
[tsbroa25@vcacbi278 ~]$ grep -v ">" /courses/bi278/Course_Materials/practical/GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c ATGC| wc -c
6322859
[tsbroa25@vcacbi278 ~]$ grep -v ">" /courses/bi278/Course_Materials/practical/GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c CG| wc -c
4251619
[tsbroa25@vcacbi278 ~]$ awk 'BEGIN {print (4251619/6322859)}'
0.67242
```
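The three separate commands above (count all bases, count G and C, divide) can also be folded into a single pass. The following is a minimal sketch rather than the method the lab requires: the function name `gc_stats` is made up for illustration, it uses one `awk` pass instead of `tr` and `wc`, and it assumes uppercase A/C/G/T sequence, ignoring any other characters just as the `tr -d -c ATGC` filter does.

```shell
# gc_stats FILE: print "<genome size><TAB><GC percent>" for one FASTA file.
# Hypothetical helper; assumes uppercase A/C/G/T and skips header lines.
gc_stats() {
    awk '!/^>/ {
            total += gsub(/[ACGT]/, "&")   # count A/C/G/T on this line
            gc    += gsub(/[GC]/, "&")     # count G/C on this line
         }
         END { if (total > 0) printf "%d\t%.3f\n", total, 100 * gc / total }' "$1"
}
```

It would be called as `gc_stats /courses/bi278/Course_Materials/practical/GCF_020419785.1_ASM2041978v1_genomic.fna`, once per genome file.
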
7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)** https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
```
fasterq-dump -p --fasta --outdir ~/practical SRR2558789
```
8. SRA reads have already been downloaded for you in the courses' practical directory as `*SRR*.fasta` files. How many reads are included in each file?
Burkholderia multivorans (SRR8885150) has 1,636,959 reads and Burkholderia cepacia (SRR2558789) has 2,608,479 reads.
```
vdb-dump --info SRR8885150
vdb-dump --info SRR2558789
```
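Since the question asks about the local `*SRR*.fasta` files, the read count can also be checked directly from those files, because each read in a FASTA file has exactly one header line starting with `>`. A minimal sketch, assuming standard FASTA formatting; the function name `count_reads` is made up for illustration.

```shell
# count_reads FILE: number of FASTA records, counted as lines starting with ">".
# Hypothetical helper; assumes one header line per read.
count_reads() {
    grep -c "^>" "$1"
}
```

It would be called as `count_reads /courses/bi278/Course_Materials/practical/bceno_SRR2558789.fasta`, and the result can be compared against the number reported by `vdb-dump --info`.
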
9. `jellyfish count` has already been run for you on both SRA files and left in the courses' practical directory. Recreate the commands used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
```
jellyfish count -t 2 -C -s 1G -m 29 -o bceno.m29.count /courses/bi278/Course_Materials/practical/bceno_SRR2558789.fasta
jellyfish count -t 2 -C -s 1G -m 29 -o bmulti.m29.count /courses/bi278/Course_Materials/practical/bmulti_SRR8885150.fasta
```
10. Run `jellyfish histo` on both of the `*.count` files in your current directory.
```
cd ~/practical
jellyfish histo -o bceno.m29.histo bceno.m29.count
jellyfish histo -o bmulti.m29.histo bmulti.m29.count
ls
```
11. Import the resulting `*.histo` files into `R` and estimate each genome size based on their kmer curves. No need to report `R` code back but fill in the table.
| Organism | Genome size (basepair count from step 6) | Genome size (kmer estimate) |
| -------- | -------- | -------- |
| Burkholderia cepacia | 8,195,038 | 9,662,230 |
| Burkholderia multivorans | 6,322,859 | 6,357,976 |
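
One common way to turn a k-mer histogram into a genome size estimate is to divide the total number of k-mers in the main peak region by the multiplicity at the peak. The sketch below shows that arithmetic in `awk` and is not necessarily the exact procedure used in `R` for the table above. The function name `kmer_genome_size` and the cutoff argument are hypothetical; the cutoff is the multiplicity where the low-count error k-mers end, which has to be read off the plotted curve.

```shell
# kmer_genome_size HISTO MIN: estimate genome size from a jellyfish histo file.
# HISTO has two columns: k-mer multiplicity and number of distinct k-mers.
# MIN is a hypothetical cutoff below which k-mers are treated as errors.
kmer_genome_size() {
    awk -v min="$2" '
        $1 >= min {
            total += $1 * $2                          # k-mers seen at this multiplicity
            if ($2 > best) { best = $2; peak = $1 }   # multiplicity with most distinct k-mers
        }
        END { if (peak > 0) printf "%d\n", total / peak }' "$1"
}
```

It would be called as `kmer_genome_size bceno.m29.histo 5`, where `5` is only a placeholder for the cutoff chosen from the plot.
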
12. Exit out of your SSH connection.
`exit`