# Practical #1
0. SSH onto `bi278`.
1. Make a new directory called 'practical'.
```
mkdir practical
```
2. Now go into this directory.
```
cd practical
```
3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.
```
cd /courses/bi278/Course_Materials/new_practical
cp bceno_SRR2558789.fasta ~/practical
cp bmulti_SRR8885150.fasta ~/practical
cp GCF_003019965.1_ASM301996v1_genomic.fna ~/practical
cp GCF_020419785.1_ASM2041978v1_genomic.fna ~/practical
# Check that all four files were copied
cd ~
cd practical
ls
```
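Since the step asks for files matching `*.fasta` and `*.fna`, the same copy can probably be done with shell globs instead of naming each file. A minimal sketch, using throwaway temporary directories and made-up file names as stand-ins for the course directory and `~/practical` so that it runs anywhere:

```shell
# Hypothetical stand-ins for the remote course directory and ~/practical
src=$(mktemp -d)
dest=$(mktemp -d)

# Made-up example files: two FASTA files and one file that should not be copied
touch "$src/reads.fasta" "$src/genome.fna" "$src/reads.count"

# Each glob expands to every matching file, so only *.fasta and *.fna are copied
cp "$src"/*.fasta "$src"/*.fna "$dest"

# List the destination to confirm what arrived
copied=$(ls "$dest")
echo "$copied"
```

On `bi278` the same pattern would apply with the course directory as the source and `.` as the destination, run from inside `practical`.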
4. Find out which organisms the two genome files belong to.
```
cat GCF_020419785.1_ASM2041978v1_genomic.fna
# organism: Burkholderia cepacia
cat GCF_003019965.1_ASM301996v1_genomic.fna
# organism: Burkholderia multivorans
```
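Printing a whole genome with `cat` scrolls past millions of bases. The organism name sits on the FASTA header lines, which start with `>`, so it may be quicker to print only those. A small sketch on a made-up two-record file (the sequence names and contents are invented for illustration):

```shell
# Build a tiny, invented FASTA file so the sketch runs anywhere
fa=$(mktemp)
printf '>seq1 Example organism chromosome 1\nACGT\n>seq2 Example organism plasmid\nGGCC\n' > "$fa"

# Print only the header lines
headers=$(grep ">" "$fa")
echo "$headers"

# Or print just the first line of the file
first=$(head -n 1 "$fa")
echo "$first"
```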
5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.
Both genomes are finished. According to the `cat` output, the whole genome sequence is present in each file, with no gaps. Bacterial genomes are also more easily sequenced, so gaps in the sequence are unlikely and each file most likely holds a finished genome.
6. Find the genome size and GC% for the genome files.
```
# Burkholderia cepacia
# G+C count
grep -v ">" ./GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c GCgc | wc -c
# total A+C+G+T count (genome size)
grep -v ">" ./GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c GCATgcat | wc -c
awk 'BEGIN {print (5487852/8195038)*100}'
# genome size: 8195038 bp
# GC%: 66.9655

# Burkholderia multivorans
grep -v ">" ./GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c GCgc | wc -c
grep -v ">" ./GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c GCATgcat | wc -c
awk 'BEGIN {print (4251619/6322859)*100}'
# genome size: 6322859 bp
# GC%: 67.242
```
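The per-genome commands can also be chained so that the two counts feed straight into the percentage, which avoids retyping numbers by hand. A minimal sketch on an invented 12-base FASTA file; the variable names are illustrative only:

```shell
# Build a tiny, invented FASTA file so the sketch runs anywhere
fa=$(mktemp)
printf '>contig1\nGGCCAATT\n>contig2\nGCGC\n' > "$fa"

# Count G/C bases and all A/C/G/T bases, ignoring header lines
gc=$(grep -v ">" "$fa" | tr -d -c GCgc | wc -c | tr -d ' ')
total=$(grep -v ">" "$fa" | tr -d -c GCATgcat | wc -c | tr -d ' ')

# GC% = 100 * (G+C) / (A+C+G+T)
pct=$(awk -v gc="$gc" -v total="$total" 'BEGIN {printf "%.2f", 100 * gc / total}')
echo "genome size: $total bp, GC%: $pct"
```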
7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)** https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
```
# Download the raw sequencing reads without the quality scores
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/FOLDERNAME SRR2558789
# FOLDERNAME is the folder you want to download into. Here it would probably be practical if the download were being run; for a lab week, for example, mine would have been labwk2.
```
8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?
```
bceno_SRR2558789.fasta
2608479 reads
bmulti_SRR8885150.fasta
1636959 reads
```
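Each read in a FASTA file has exactly one header line starting with `>`, so counting those lines counts the reads. A sketch on an invented three-read file:

```shell
# Build a tiny, invented FASTA file of reads so the sketch runs anywhere
fa=$(mktemp)
printf '>read.1\nACGT\n>read.2\nTTGA\n>read.3\nCCGA\n' > "$fa"

# -c makes grep print the number of matching lines instead of the lines themselves
reads=$(grep -c ">" "$fa")
echo "$reads reads"
```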
9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
```
jellyfish count -t 2 -C -s 1G -m 29 -o bceno.m29.count bceno_SRR2558789.fasta
# or
jellyfish count -t 2 -C -s 1G -m 29 -o bmulti.m29.count bmulti_SRR8885150.fasta
```
10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.
```
jellyfish histo -o bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
jellyfish histo -o bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
```
11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.
| Organism | Genome size (basepair count from step 6) | Genome size (kmer estimate) |
| -------- | -------- | -------- |
| Burkholderia cepacia | 8195038 | 8655747 |
| Burkholderia multivorans | 6322859 | 6357976 |
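
One common way to turn a k-mer histogram into a genome size estimate is to divide the total number of k-mers by the multiplicity at the coverage peak, after discarding the low-multiplicity k-mers that mostly come from sequencing errors. The sketch below does this in `awk` on an invented six-line histogram in the two-column layout that `jellyfish histo` writes (multiplicity, then number of distinct k-mers). The cutoff of 3 and all the numbers are assumptions for illustration, and this is not necessarily the calculation used for the table above.

```shell
# Invented histogram: column 1 = multiplicity, column 2 = number of distinct k-mers
histo=$(mktemp)
printf '1 500\n2 40\n3 10\n4 20\n5 60\n6 20\n' > "$histo"

# Hypothetical cutoff: rows below this multiplicity are treated as error k-mers
cutoff=3

estimate=$(awk -v cutoff="$cutoff" '
  $1 >= cutoff {
    total += $1 * $2                        # k-mers contributed by this row
    if ($2 > max) { max = $2; peak = $1 }   # track the most common multiplicity
  }
  END { printf "%d", total / peak }         # genome size = total k-mers / peak
' "$histo")
echo "$estimate"
```

With real data the cutoff would be chosen by looking at the plotted curve, at the dip between the error k-mers and the main peak.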
12. Exit out of your SSH connection.
```
exit
```