# bi278 fall2022 lab practical

### Completed by Lee Ferenc

First, make a copy of the markdown version of this page. Then fill in the commands to execute each step with a:

```
code block
```

Answer any questions and fill any tables.

*Reminder that `ls` and autocomplete will be your best friends.*

0. SSH onto `bi278`.

1. Make a new directory called 'practical'.

```
cd colbyhome/Genomics
mkdir practical
#or you can do mkdir colbyhome/Genomics/practical
```

2. Now go into this directory.

```
cd practical
```

3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.

```
cp /courses/bi278/Course_Materials/new_practical/*.fasta /home2/enfere24/colbyhome/Genomics/practical
cp /courses/bi278/Course_Materials/new_practical/*.fna /home2/enfere24/colbyhome/Genomics/practical
#to check:
ls /home2/enfere24/colbyhome/Genomics/practical
```

4. Find out which organisms the two genome files belong to.

```
head /courses/bi278/Course_Materials/new_practical/*.fna
```

#### Burkholderia multivorans (strain FDAARGOS_246) and Burkholderia cepacia (strain AU41368) (files GCF_003019965.1_ASM301996v1_genomic.fna and GCF_020419785.1_ASM2041978v1_genomic.fna, respectively)

5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.

When running:

```
grep ">" *.fasta
```

So many header lines were printed for both files that I had to interrupt the command with Ctrl+C. Each read was around 200 to 250 bp long. Therefore it seems that both genomes are drafts.

6. Find the genome size and GC% for the genome files.

```
#instead of using another nano script or creating one, I did:
grep -v ">" GCF_003019965.1_ASM301996v1_genomic.fna | tr -d -c GCgc | wc -c
wc -m GCF_003019965.1_ASM301996v1_genomic.fna
awk 'BEGIN{print(4251619/6402175)}'
```

#### The size of B. multivorans is 6402175 bp (4251619 of them G or C) with a GC fraction of 0.66409.

```
grep -v ">" GCF_020419785.1_ASM2041978v1_genomic.fna | tr -d -c GCgc | wc -c
wc -m GCF_020419785.1_ASM2041978v1_genomic.fna
awk 'BEGIN{print(5487852/8301827)}'
```

#### The size of B. cepacia is 8301827 bp (5487852 of them G or C) with a GC fraction of 0.661041.

7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)**

https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]

```
#-X 3 limits the dump to the first 3 spots and -Z prints to standard output;
#omit both to download the full run to a file
fastq-dump -X 3 -Z SRR2558789
```

8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?

```
#I couldn't figure out how to get the read number from fastq-dump, so I got creative
tail -5 bceno_SRR2558789.fasta
```

> SRR2558789.2608479.1 2608479 length=251 ... (genetic code) ...

```
#something went wrong and I noticed both files had the exact same output (SRR and all)
#it seems the file got overwritten
#but I'm still documenting it as a lesson to read your data
rm bmulti_SRR8885150.fasta
cp /courses/bi278/Course_Materials/new_practical/bmulti_SRR8885150.fasta /home2/enfere24/colbyhome/Genomics/practical
tail -5 bmulti_SRR8885150.fasta
```

> SRR8885150.1636959.1 1636959 length=150 ... (genetic code) ...

#### The number just before `length` in the last header is the read number, so B. cepacia has 2608479 reads and B. multivorans has 1636959 reads.

9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.

```
jellyfish count -t 2 -C -s 1G -m 29 -o /courses/bi278/Course_Materials/new_practical/bceno.m29.count bceno_SRR2558789.fasta
```

10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.

```
jellyfish histo -o bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
jellyfish histo -o bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
```

11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.

| Organism | Genome size (basepair count from step 6) | Genome size (kmer estimate) |
| -------- | -------- | -------- |
| B. cepacia | 8301827 | 2905426 |
| B. multivorans | 6402175 | 6357976 |

12. Exit out of your SSH connection.

```
exit
```
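A note on step 6: `wc -m` on the whole file also counts the header lines and every newline character, so the sizes it gives are slight overestimates. Below is a minimal sketch of a one-pass alternative, assuming a plain-text FASTA file. The function name `fasta_stats` is hypothetical and not part of the course materials.

```shell
# fasta_stats: hypothetical helper, not from the course materials.
# Prints three tab-separated values for one FASTA file:
#   total bases, G+C count, GC fraction
# Header lines (starting with ">") are skipped and newlines are not counted.
fasta_stats() {
    awk '
        /^>/ { next }                     # skip FASTA header lines
        {
            total += length($0)           # bases on this line, newline excluded
            gc += gsub(/[GCgc]/, "")      # gsub returns how many G/C it matched
        }
        END {
            if (total == 0) { print "no sequence found"; exit 1 }
            printf "%d\t%d\t%.5f\n", total, gc, gc / total
        }
    ' "$1"
}
```

It could be called as `fasta_stats GCF_003019965.1_ASM301996v1_genomic.fna`. The base count it reports should come out a little smaller than the `wc -m` figure, because headers and newlines are excluded.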