bi278 fall2022 lab practical

# bi278 fall2022 lab practical

First, make a copy of the markdown version of this page. Then fill in the commands to execute each step with a:
```
code block
```
Answer any questions and fill in any tables.

*Reminder that `ls` and autocomplete will be your best friends.*

0. SSH onto `bi278`.

1. Make a new directory called 'practical'.
```
mkdir practical
```

2. Now go into this directory.
```
cd practical
```

3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.
```
#copy all the .fasta files from the bi278 folder into the current directory
cp /courses/bi278/Course_Materials/new_practical/*.fasta .
#copy all the .fna files from the bi278 folder into the current directory
cp /courses/bi278/Course_Materials/new_practical/*.fna .
```

4. Find out which organisms the two genome files belong to.
```
#To find out the first one, I use the following command on this .fna file
grep ">" GCF_003019965.1_ASM301996v1_genomic.fna
```
And I get

>NZ_CP020397.1 Burkholderia multivorans strain FDAARGOS_246 chromosome 1, complete sequence
>NZ_CP020398.1 Burkholderia multivorans strain FDAARGOS_246 chromosome 2, complete sequence
>NZ_CP020399.1 Burkholderia multivorans strain FDAARGOS_246 plasmid unnamed, complete sequence

So this organism is **Burkholderia multivorans**.
```
#To find out the second one, I use the following command on this .fna file
grep ">" GCF_020419785.1_ASM2041978v1_genomic.fna
```
And I get

>NZ_JAIZPY010000010.1 Burkholderia cepacia strain AU41368 NODE_10_length_352747_cov_44.178166, whole genome shotgun sequence
>NZ_JAIZPY010000011.1 Burkholderia cepacia strain AU41368 NODE_11_length_310558_cov_41.633896, whole genome shotgun sequence
>NZ_JAIZPY010000012.1 Burkholderia cepacia strain AU41368 NODE_12_length_236094_cov_42.966580, whole genome shotgun sequence
>NZ_JAIZPY010000013.1 Burkholderia cepacia strain AU41368 NODE_13_length_209724_cov_39.668087, whole genome shotgun sequence
>....

So this organism is **Burkholderia cepacia**.

5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.

I think Burkholderia multivorans is a finished genome and Burkholderia cepacia is a draft genome. For Burkholderia multivorans, every header line ends in "complete sequence", and the file holds only three sequences (two chromosomes and one plasmid), so each replicon has been fully assembled. For Burkholderia cepacia, the headers say "whole genome shotgun sequence" and name many separate assembler contigs (`NODE_10`, `NODE_11`, ...), each with its own length and coverage. That suggests the assembly is still a set of unjoined contigs and has not been finalized.

6. Find the genome size and GC% for the genome files.

For Burkholderia cepacia:

1. Raw reads file (`bceno_SRR2558789.fasta`):
```
#find the GC number
grep -v ">" bceno_SRR2558789.fasta | tr -d -c GCgc | wc -c
#I get 328757998
#find the total number of bases
grep -v ">" bceno_SRR2558789.fasta | tr -d -c ATatGCgc | wc -c
#I get 506435753
#So the GC% is 328757998 / 506435753 = 0.649
```
2. Genome file (`GCF_020419785.1_ASM2041978v1_genomic.fna`):
```
#find the GC number
grep -v ">" GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c GCgc | wc -c
#I get 5487852
#find the genome size
grep -v ">" GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c ATatGCgc | wc -c
#I get 8195038
#So the GC% is 5487852 / 8195038 = 0.670
```

For Burkholderia multivorans:

1. Raw reads file (`bmulti_SRR8885150.fasta`):
```
#find the GC number
grep -v ">" bmulti_SRR8885150.fasta | tr -d -c GCgc | wc -c
#I get 161999073
#find the total number of bases
grep -v ">" bmulti_SRR8885150.fasta | tr -d -c ATatGCgc | wc -c
#I get 239967045
#So the GC% is 161999073 / 239967045 = 0.675
```
2. Genome file (`GCF_003019965.1_ASM301996v1_genomic.fna`):
```
#find the GC number
grep -v ">" GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c GCgc | wc -c
#I get 4251619
#find the genome size
grep -v ">" GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c ATatGCgc | wc -c
#I get 6322859
#So the GC% is 4251619 / 6322859 = 0.672
```

7. What is the appropriate command to download the raw sequencing reads from this sample?
**(but don't run it)** https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
```
vdb-config --interactive
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/practical SRR2558789
```

8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?

When we run this command for **Burkholderia cepacia**
```
grep -c ">" bceno_SRR2558789.fasta
```
The command reports **2608479** header lines, meaning that there are **2608479** reads.

Similarly, when we run this command for **Burkholderia multivorans**:
```
grep -c ">" bmulti_SRR8885150.fasta
```
The command reports **1636959** header lines, meaning that there are **1636959** reads.

9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
```
jellyfish count -t 2 -C -s 1G -m 29 -o bceno.m29.count bceno_SRR2558789.fasta
```

10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.

For **Burkholderia cepacia**
```
jellyfish histo -o bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
```
For **Burkholderia multivorans**
```
jellyfish histo -o bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
```

11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.

| Organism | Genome size (basepair count from step 6) | Genome size (kmer estimate) |
| -------- | -------- | -------- |
| Burkholderia cepacia | 8195038 | 9662230 |
| Burkholderia multivorans | 6322859 | 6357976 |

12. Exit out of your SSH connection.
```
exit
```
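
The GC% calculation in step 6 repeats the same two pipelines for every file, so it can be wrapped in a small shell function. The sketch below is only an illustration: `fasta_gc` and `toy.fasta` are hypothetical names that are not part of the course materials, and it assumes a POSIX shell with `grep`, `tr`, `wc` and `awk` available. It anchors the header match with `^>` rather than a bare `>`, which is a small change from the commands above.

```shell
#!/bin/sh
# fasta_gc: hypothetical helper, not from the lab handout.
# Prints three fields for one FASTA file:
#   <count of A/C/G/T bases> <count of G/C bases> <GC fraction>
fasta_gc() {
    # Drop header lines, then keep only A, C, G, T (either case) and count bytes.
    total=$(grep -v '^>' "$1" | tr -d -c ATatGCgc | wc -c)
    gc=$(grep -v '^>' "$1" | tr -d -c GCgc | wc -c)
    # awk does the floating-point division; guard against an empty file.
    awk -v gc="$gc" -v total="$total" \
        'BEGIN { printf "%d %d %.3f\n", total, gc, (total > 0 ? gc / total : 0) }'
}

# Toy input so the sketch runs anywhere; use a real file name on bi278 instead.
printf '>seq1 toy\nATGC\nGGCC\n>seq2 toy\natat\n' > toy.fasta

fasta_gc toy.fasta        # size, GC count, GC fraction
grep -c '^>' toy.fasta    # number of sequences (reads), as in step 8
```

Run on one of the files from step 6, for example `fasta_gc GCF_003019965.1_ASM301996v1_genomic.fna`, the function should report the same counts as the separate commands, since it uses the same `grep | tr | wc` pipelines.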