# bi278 fall2022 lab practical

First, make a copy of the markdown version of this page. Then fill in the commands to execute each step with a:
```
code block
```
Answer any questions and fill in any tables.

*Reminder that `ls` and autocomplete will be your best friends.*

0. SSH onto `bi278`.
```
ssh xmu23@bi278
```

1. Make a new directory called 'practical'.
```
mkdir practical
```

2. Now go into this directory.
```
cd practical
```

3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.
```
cp /courses/bi278/Course_Materials/new_practical/{*.fasta,*.fna} .
```

4. Find out which organisms the two genome files belong to.
```
grep ">" GCF_020419785.1_ASM2041978v1_genomic.fna
grep ">" GCF_003019965.1_ASM301996v1_genomic.fna
```
One is Burkholderia cepacia. The other is Burkholderia multivorans. The names of the two fasta files also suggest this.

5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.

I think the two genomes differ in status. The Burkholderia multivorans headers are labeled as complete sequences, so that genome is finished. The Burkholderia cepacia headers are labeled as whole genome shotgun sequences, a label NCBI typically uses for the contigs of an unfinished assembly, so that genome is most likely a draft.

6. Find the genome size and GC% for the genome files.

bceno_SRR2558789.fasta:
```
grep -v ">" bceno_SRR2558789.fasta | tr -d -c ATGCatgc | wc -c
```
Total base count is 506435753 (this file holds raw reads, so this is the number of sequenced bases rather than a genome size).
```
grep -v ">" bceno_SRR2558789.fasta | tr -d -c GCgc | wc -c
```
GC count is 328757998.
```
awk 'BEGIN {print (328757998/506435753)}'
```
GC fraction is 0.64916 (about 64.9%).

bmulti_SRR8885150.fasta:
```
grep -v ">" bmulti_SRR8885150.fasta | tr -d -c ATGCatgc | wc -c
```
Total base count (raw reads): 239967045
```
grep -v ">" bmulti_SRR8885150.fasta | tr -d -c GCgc | wc -c
```
GC count is 161999073.
```
awk 'BEGIN {print (161999073/239967045)}'
```
GC fraction: 0.675089 (about 67.5%)

GCF_003019965.1_ASM301996v1_genomic.fna:
```
grep -v ">" GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c ATGCatgc | wc -c
```
Genome size: 6322859
```
grep -v ">" GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c GCgc | wc -c
```
GC count is 4251619.
```
awk 'BEGIN {print (4251619/6322859)}'
```
GC fraction: 0.67242 (about 67.2%)

GCF_020419785.1_ASM2041978v1_genomic.fna:
```
grep -v ">" GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c ATGCatgc | wc -c
```
Genome size: 8195038
```
grep -v ">" GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c GCgc | wc -c
```
GC count is 5487852.
```
awk 'BEGIN {print (5487852/8195038)}'
```
GC fraction: 0.669655 (about 67.0%)

7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)**
https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
```
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir . SRR2558789
```

8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?
```
grep -c ">" bceno_SRR2558789.fasta
```
It has 2608479 reads.
```
grep -c ">" bmulti_SRR8885150.fasta
```
It has 1636959 reads.

9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
```
jellyfish count -t 2 -C -s 1G -m 29 -o bmulti.m29.count bmulti_SRR8885150.fasta
```

10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.
```
jellyfish histo -o bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
```
```
jellyfish histo -o bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
```

11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.

| Organism                 | Genome size (basepair count from step 6) | Genome size (kmer estimate) |
|:------------------------ | ---------------------------------------- | --------------------------- |
| Burkholderia cepacia     | 8195038                                  | 9662230                     |
| Burkholderia multivorans | 6322859                                  | 6357976                     |

12. Exit out of your SSH connection.
```
exit
```
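
Step 6 runs three commands per file (base count, GC count, division). As a minimal sketch, the same numbers could be gathered in a single pass with `awk`. The function name `fasta_gc` and its output format are hypothetical, not part of the course materials, and the sketch assumes header lines start with `>`.

```shell
# Hypothetical helper (not from the course materials): print the number of
# A/C/G/T bases, the G+C count, and the GC fraction for one FASTA file.
fasta_gc() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            seq = toupper($0)              # treat lowercase bases like uppercase
            gc += gsub(/[GC]/, "", seq)    # gsub returns how many characters it replaced
            at += gsub(/[AT]/, "", seq)
        }
        END {
            total = gc + at                # other symbols, such as N, are ignored
            if (total == 0) { print "no bases found"; exit 1 }
            printf "bases=%d gc=%d gc_fraction=%.6f\n", total, gc, gc / total
        }
    ' "$1"
}
```

It would be called as `fasta_gc GCF_003019965.1_ASM301996v1_genomic.fna`. Like the `tr -d -c ATGCatgc` pipeline in step 6, it counts only A, C, G and T, so the two approaches would be expected to agree on a well-formed FASTA file.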
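
Step 11 asks for a genome size estimate from each k-mer curve. One common approach, which may differ from the R workflow used in the course, divides the total number of k-mers under the main peak by the frequency at which that peak sits. The sketch below assumes the two-column output of `jellyfish histo` (k-mer frequency, then number of distinct k-mers). The function name `kmer_genome_size` and its cutoff argument are hypothetical; the cutoff marks where the low-frequency error k-mers end and has to be read off the plotted curve.

```shell
# Hypothetical helper (not from the course materials).
# $1 = histo file with two columns: k-mer frequency, number of distinct k-mers
# $2 = lowest frequency to keep (assumed to lie past the error k-mers)
kmer_genome_size() {
    awk -v min="$2" '
        $1 >= min {
            total += $1 * $2                          # k-mer occurrences at this frequency
            if ($2 > best) { best = $2; peak = $1 }   # frequency with the most distinct k-mers
        }
        END {
            if (peak == 0) { print "no peak found"; exit 1 }
            printf "peak=%d estimate=%.0f\n", peak, total / peak
        }
    ' "$1"
}
```

A call might look like `kmer_genome_size bmulti.m29.histo 5`, where the cutoff of 5 is only an illustration. The estimate is sensitive to the cutoff, so it is worth trying a few values near the valley between the error k-mers and the main peak.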