# bi278 fall2022 lab practical
First, make a copy of the markdown version of this page. Then fill in the commands to execute each step with a:
```
code block
```
Answer any questions and fill any tables.
*Reminder that `ls` and autocomplete will be your best friends.*
0. SSH onto `bi278`.
```
ssh klpast23@bi278
```
1. Make a new directory called 'practical'.
```
mkdir practical
```
2. Now go into this directory.
```
cd practical/
```
3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.
```
cp /courses/bi278/Course_Materials/new_practical/*.fasta ~/practical/
cp /courses/bi278/Course_Materials/new_practical/*.fna ~/practical/
```
4. Find out which organisms the two genome files belong to.
```
#in the folder ~/practical
grep ">" GCF_020419785.1_ASM2041978v1_genomic.fna
#Burkholderia cepacia strain AU1368
grep ">" GCF_003019965.1_ASM301996v1_genomic.fna
#Burkholderia multivorans strain FDAARGOS_246
```
5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.
```
grep ">" -c GCF_020419785.1_ASM2041978v1_genomic.fna
#returns 35
grep ">" -c GCF_003019965.1_ASM301996v1_genomic.fna
#returns 3
```
The B. cepacia genome file contains 35 contigs, while the B. multivorans genome file contains 3 sequences: 2 chromosomes and an unnamed plasmid. Since these organisms are close relatives, we would expect a similar chromosome count, which indicates that the B. cepacia file is a draft genome. Running `grep ">" GCF_003019965.1_ASM301996v1_genomic.fna` returns three header lines, each labeled as a complete sequence, so the B. multivorans genome is finished. The B. cepacia headers do not say "complete sequence" and instead read "whole genome shotgun sequence."
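The header check behind this answer can be sketched as a small helper. The function name `assembly_status` and the demo headers below are made up for illustration; on the real files you would pass each `*.fna` file instead.

```shell
# Hypothetical helper: count FASTA records and how many headers say "complete"
assembly_status() {
    awk '/^>/ { n++; if (/complete/) c++ }
         END { printf "%d records, %d marked complete\n", n, c }' "$1"
}

# Made-up demo file standing in for a finished genome
demo=$(mktemp)
printf '>chr1 Example chromosome 1, complete sequence\nACGT\n' > "$demo"
printf '>p1 Example plasmid, complete sequence\nACGT\n' >> "$demo"
assembly_status "$demo"    # prints: 2 records, 2 marked complete
rm -f "$demo"
```

Many records with none marked complete points toward a draft assembly; a few records all marked complete points toward a finished one.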
6. Find the genome size and GC% for the genome files.
```
# Copied over nano script from lab_01b
cp ~/lab01b/BPCounter.sh ~/practical/
```
Contents of `BPCounter.sh`:
```
#prints the contig/chromosome headers (indicates organism)
grep ">" "$1"
#finds the total BP count (upper- and lowercase bases)
grep -v ">" "$1" | tr -d -c GCATgcat | wc -c
#finds the total GC count (upper- and lowercase bases)
grep -v ">" "$1" | tr -d -c GCgc | wc -c
```
Running the shell script to find genome size and GC%:
```
#B.cepacia
sh BPCounter.sh GCF_020419785.1_ASM2041978v1_genomic.fna
#total BP: 8195038
#total GC: 5487852
#GC% = 5487852/8195038 = 67.0%
#B.multivorans
sh BPCounter.sh GCF_003019965.1_ASM301996v1_genomic.fna
#total BP: 6322859
#total GC: 4251619
#GC% = 4251619/6322859 = 67.2%
```
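The GC% arithmetic above was done by hand. As a sketch, the same counts and the division can be done in one step. The function name `gc_percent` and the tiny demo FASTA are hypothetical and not part of `BPCounter.sh`.

```shell
# Hypothetical helper: print "total_bp gc_count gc_percent" for one FASTA file
gc_percent() {
    total=$(grep -v ">" "$1" | tr -d -c 'GCATgcat' | wc -c)
    gc=$(grep -v ">" "$1" | tr -d -c 'GCgc' | wc -c)
    # awk does the floating-point division and rounds to one decimal place
    awk -v t="$total" -v g="$gc" 'BEGIN { printf "%d %d %.1f\n", t, g, 100 * g / t }'
}

# Made-up two-record FASTA so the sketch runs anywhere
demo=$(mktemp)
printf '>contig1\nGGCCAATT\n>contig2\nGCGC\n' > "$demo"
gc_percent "$demo"    # prints: 12 8 66.7
rm -f "$demo"
```

On the course server this would be called as `gc_percent GCF_020419785.1_ASM2041978v1_genomic.fna`, after defining the function in the current shell.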
7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)** https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
```
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/practical/ SRR2558789
```
8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?
```
#copy the *SRR*.fasta files to ~/practical
cp /courses/bi278/Course_Materials/new_practical/*SRR*.fasta ~/practical/
#count the number of reads (in ~/practical/)
grep -c ">" bmulti_SRR8885150.fasta
#returns 1636959 reads
grep -c ">" bceno_SRR2558789.fasta
#returns 2608479 reads
```
9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
```
jellyfish count -t 2 -C -s 1G -m 29 -o bceno.m29.count bceno_SRR2558789.fasta
jellyfish count -t 2 -C -s 1G -m 29 -o bmulti.m29.count bmulti_SRR8885150.fasta
```
10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.
```
jellyfish histo -o ~/practical/bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
jellyfish histo -o ~/practical/bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
```
11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.
Peak of the kmer curve: bceno around 43, bmulti around 29.
| Organism | Genome size (basepair count from step 6) | Genome size (kmer estimate) |
| -------- | -------- | -------- |
| B.cepacia | 8,195,038 | 9,662,230 |
| B.multivorans| 6,322,859 | 6,357,976 |
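
The R code for the kmer estimates is not reported here. As a rough sketch of one common approach (an assumption, not necessarily the calculation used for the table), the estimate is the total number of kmers divided by the multiplicity at the peak of the curve, after dropping low-multiplicity kmers that are likely sequencing errors. The function name `kmer_genome_size`, the cutoff argument, and the demo histogram are all hypothetical.

```shell
# Hypothetical helper: estimate genome size from a jellyfish histo file
# (two columns: multiplicity, number of distinct kmers with that multiplicity).
# $1 = histo file, $2 = lowest multiplicity to keep (assumed error cutoff)
kmer_genome_size() {
    awk -v start="$2" '
        $1 >= start {
            total += $1 * $2                        # kmers seen at this multiplicity
            if ($2 > max) { max = $2; peak = $1 }   # tallest point of the curve
        }
        END { if (peak > 0) printf "%d %.0f\n", peak, total / peak }
    ' "$1"
}

# Made-up histogram: error kmers at multiplicity 1-2, true peak at 5
demo=$(mktemp)
printf '1 1000\n2 100\n3 10\n4 20\n5 50\n6 20\n7 10\n' > "$demo"
kmer_genome_size "$demo" 3    # prints: 5 110
rm -f "$demo"
```

The first number printed is the peak multiplicity and the second is the estimated genome size. The cutoff has to be chosen by looking at where the error kmers end on the plotted curve, so different cutoffs will give somewhat different estimates.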
12. Exit out of your SSH connection.
```
exit
```