# BI278 Lab 3/Practical
# bi278 fall2022 lab practical
First, make a copy of the markdown version of this page. Then fill in the commands to execute each step with a:
```
code block
```
Answer any questions and fill any tables.
*Reminder that `ls` and autocomplete will be your best friends.*
0. SSH onto `bi278`.
1. Make a new directory called 'practical'.
```
mkdir practical
```
2. Now go into this directory.
```
cd practical
```
3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.
```
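# Already inside practical/ after step 2, so the destination is the current directory (.)
cp /courses/bi278/Course_Materials/new_practical/*.fasta /courses/bi278/Course_Materials/new_practical/*.fna .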
```
4. Find out which organisms the two genome files belong to.
```
grep ">" /courses/bi278/Course_Materials/new_practical/GCF_003019965.1_ASM301996v1_genomic.fna
grep ">" /courses/bi278/Course_Materials/new_practical/GCF_020419785.1_ASM2041978v1_genomic.fna
```
`bceno_SRR2558789.fasta` belongs to Burkholderia cepacia.
Link: https://www.ncbi.nlm.nih.gov/sra/SRR2558789/
`bmulti_SRR8885150.fasta` belongs to Burkholderia multivorans.
Link: https://www.ncbi.nlm.nih.gov/sra/?term=SRR8885150
5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.
I inspected the sequence headers in the two `.fna` genome files. The Burkholderia cepacia genome appears to be a draft, while the Burkholderia multivorans genome appears to be finished. The Burkholderia multivorans file contains two complete chromosomes and one complete plasmid. The Burkholderia cepacia file, by contrast, is a collection of many whole genome shotgun sequences of varying lengths, which indicates an assembly that has not been closed. This also makes sense given that the two organisms are close relatives found in the lungs of cystic fibrosis patients: once one species has a finished genome that can serve as a reference, leaving its close relative as a draft may save time and cost.
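
One way to back up a draft-versus-finished call is to list each sequence record with its length: a finished bacterial genome usually has a handful of long records (chromosomes and plasmids), while a draft has many contigs of varying lengths. The sketch below is a minimal example on a made-up file (`toy_genome.fna` and its contents are hypothetical, not course data); on the server you would point the same `awk` command at the `.fna` files.

```shell
# Build a tiny, made-up FASTA file so the example is self-contained.
# (toy_genome.fna is hypothetical and stands in for a real .fna file.)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_genome.fna" <<'EOF'
>seq1 toy chromosome 1
ACGT
ACGT
>seq2 toy chromosome 2
GGCC
>seq3 toy plasmid
ATAT
EOF

# For each record, print its ID (first word of the header, minus ">")
# and the total number of bases across all of its sequence lines.
summary=$(awk '
  /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name, len }
' "$tmpdir/toy_genome.fna")

echo "$summary"
```

A short list of long records points toward a finished genome; hundreds of short records point toward a draft.
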
6. Find the genome size and GC% for the genome files.
For `bceno_SRR2558789.fasta`, the total size is 506435753 bp, of which 328757998 bp are G or C, so the GC% is approximately 65% ((328757998/506435753)*100).
For `bmulti_SRR8885150.fasta`, the total size is 239967045 bp, of which 161999073 bp are G or C, so the GC% is approximately 68% ((161999073/239967045)*100).
```
# Run from inside ~/practical: copy the lab 2 script into the current directory
cp ~/lab_02/GC%_GenomeSize.sh .
nano GC%_GenomeSize.sh
# In nano, change the path from "home2/sdivit25/lab_02" to "home2/sdivit25/practical"
sh GC%_GenomeSize.sh bceno_SRR2558789.fasta
sh GC%_GenomeSize.sh bmulti_SRR8885150.fasta
```
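The contents of `GC%_GenomeSize.sh` are not shown on this page, so the following is only a sketch of one plausible way to get the same two numbers with `grep`, `tr`, `wc`, and `awk`. The file `toy.fasta` and its sequences are made up for illustration.

```shell
# Build a tiny, made-up FASTA file so the example is self-contained.
# (toy.fasta is hypothetical and stands in for a real FASTA file.)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fasta" <<'EOF'
>read1
ACGTGGCC
>read2
ATATGC
EOF

# Total bases: drop header lines, delete newlines, count what is left.
# (tr -d ' ' strips the padding some versions of wc add.)
total=$(grep -v "^>" "$tmpdir/toy.fasta" | tr -d '\n' | wc -c | tr -d ' ')

# G+C bases: drop header lines, keep only G and C (either case), count them.
gc=$(grep -v "^>" "$tmpdir/toy.fasta" | tr -cd 'GCgc' | wc -c | tr -d ' ')

# GC% = 100 * (G+C bases) / (total bases), rounded to one decimal place.
gc_pct=$(awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.1f", 100 * gc / total }')

echo "total bases: $total"
echo "G+C bases:   $gc"
echo "GC%:         $gc_pct"
```

The same arithmetic applied to the counts reported above gives the GC% values in the answer, for example 100 * 328757998 / 506435753.
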
7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)** https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
```
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/practical SRR2558789
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/practical SRR8885150
```
8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?
The "bceno_SRR2558789.fasta" file includes 2608479 reads.
The "bmulti_SRR8885150.fasta" file includes 1636959 reads.
```
tail -20 bceno_SRR2558789.fasta
tail -20 bmulti_SRR8885150.fasta
```
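Reading the last read ID with `tail` works when the reads are numbered consecutively. A more direct check, sketched here on a made-up file (`toy_reads.fasta` is hypothetical), is to count the header lines, since each read in a FASTA file has exactly one line starting with `>`.

```shell
# Build a tiny, made-up reads file so the example is self-contained.
# (toy_reads.fasta is hypothetical and stands in for a *SRR*.fasta file.)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_reads.fasta" <<'EOF'
>SRRTOY.1 1 length=4
ACGT
>SRRTOY.2 2 length=4
GGCC
>SRRTOY.3 3 length=4
TTAA
>SRRTOY.5 5 length=4
CCGG
EOF

# grep -c prints the number of matching lines; "^>" matches FASTA headers only.
n_reads=$(grep -c "^>" "$tmpdir/toy_reads.fasta")
echo "reads: $n_reads"
```

In this toy file the last read ID is 5 but only 4 reads are present, which is the kind of gap that counting headers avoids.
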
9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
```
jellyfish count -t 2 -C -s 1G -m 29 -o bceno.m29.count bceno_SRR2558789.fasta
jellyfish count -t 2 -C -s 1G -m 29 -o bmulti.m29.count bmulti_SRR8885150.fasta
```
10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.
```
jellyfish histo -o bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
jellyfish histo -o bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
```
11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.
| Organism | Genome size (bp, base count from step 6) | Genome size (bp, kmer estimate) |
| -------- | -------- | -------- |
| Burkholderia cepacia | 506435753 | 9662230 |
| Burkholderia multivorans | 239967045 | 6357976 |
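
The R code behind the kmer estimates is not reported here. As a rough sketch of the usual calculation (total kmers in the main curve divided by the kmer frequency at its peak), the `awk` example below works on a made-up histogram in the two-column format that `jellyfish histo` writes (frequency, then number of distinct kmers). The file `toy.histo`, its values, and the `start` cutoff are all assumptions for illustration; with real data the cutoff is chosen by looking at the plotted curve.

```shell
# Build a tiny, made-up histogram: column 1 = kmer frequency,
# column 2 = number of distinct kmers seen at that frequency.
# (toy.histo is hypothetical and stands in for a real *.histo file.)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.histo" <<'EOF'
1 1000
2 50
3 10
4 40
5 100
6 40
7 10
EOF

# Hypothetical cutoff: first frequency past the low-frequency error kmers.
start=3

# Sum frequency * count from the cutoff onward (total kmers in the main curve),
# find the frequency with the highest count (the peak), and divide.
est=$(awk -v start="$start" '
  $1 >= start {
    total += $1 * $2
    if ($2 > peak_count) { peak_count = $2; peak_freq = $1 }
  }
  END { printf "%d", total / peak_freq }
' "$tmpdir/toy.histo")

echo "estimated genome size: $est bp"
```

Here the main curve holds 1000 kmers and peaks at frequency 5, so the estimate is 1000 / 5 = 200 bp.
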
12. Exit out of your SSH connection.
```
exit
```