# bi278 fall2022 lab practical

First, make a copy of the markdown version of this page. Then fill in the commands to execute each step with a:
```
code block
```
Answer any questions and fill in any tables.

*Reminder that `ls` and autocomplete will be your best friends.*

0. SSH onto `bi278`.

1. Make a new directory called 'practical'.
   ```
   ssh kyamad23@bi278
   mkdir practical
   ```

2. Now go into this directory.
   ```
   cd ./practical
   ```

3. Copy only the fasta files (`*.fasta` and `*.fna`) from `/courses/bi278/Course_Materials/practical` to your current location.
   ```
   cp /courses/bi278/Course_Materials/new_practical/*.fasta ./
   cp /courses/bi278/Course_Materials/new_practical/*.fna ./
   ```

4. Find out which organisms the two genome files belong to.
   ```
   grep ">" GCF_003019965.1_ASM301996v1_genomic.fna
   grep ">" GCF_020419785.1_ASM2041978v1_genomic.fna
   ```
   The first genome file belongs to Burkholderia multivorans strain FDAARGOS_246, and the second belongs to Burkholderia cepacia strain AU41368.

5. These two organisms are close relatives, often found in the lungs of cystic fibrosis patients. Given this fact and based on what is contained in these genome files, what would you determine is the status of each genome? Choose between the options: draft or finished. Explain why.

   The first few header lines of the B. multivorans file describe the first chromosome as a complete sequence, which indicates a finished genome. The headers of the B. cepacia file are still labeled "whole genome sequence" rather than "complete", and the number of contigs returned by the `grep` command in step 4 also indicates a draft genome.

6. Find the genome size and GC% for the genome files.

   `bishbashbosh.sh` is a shell script written in the first lab that uses `grep` to count the GC frequency and the genome length in bp.
   ```
   sh bishbashbosh.sh GCF_003019965.1_ASM301996v1_genomic.fna
   sh bishbashbosh.sh GCF_020419785.1_ASM2041978v1_genomic.fna
   ```
   The B. multivorans genome is 6,322,859 bp and the B. cepacia genome is 8,195,038 bp. Their GC contents are 67.2% and 67.0%, respectively.

7. What is the appropriate command to download the raw sequencing reads from this sample? **(but don't run it)**

   https://www.ncbi.nlm.nih.gov/sra/SRX1304848[accn]
   ```
   fastq-dump -X 3 -Z SRR2558789
   ```
   As written, `-X 3` limits the dump to the first three spots and `-Z` writes them to standard output; omitting both options downloads the full run to a file.

8. SRA reads have already been downloaded for you. How many reads are included in each `*SRR*.fasta` file?
   ```
   grep -c "^>" bceno_SRR2558789.fasta
   grep -c "^>" bmulti_SRR8885150.fasta
   ```
   `bceno_SRR2558789.fasta`: 2,608,479 reads. `bmulti_SRR8885150.fasta`: 1,636,959 reads.

9. `jellyfish count` has already been run for you on both SRA files and left in the remote "practical" directory above. Recreate at least one of the commands that was used to do this task **(but don't run it)**. Make sure the input and output file names correspond to the files in the remote directory.
   ```
   jellyfish count -t 2 -C -s 1G -m 29 -o /courses/bi278/Course_Materials/new_practical/bceno.m29.count /courses/bi278/Course_Materials/new_practical/bceno_SRR2558789.fasta
   ```

10. Run `jellyfish histo` on both of the `*.count` files still in the remote directory, without copying them to your current directory.
    ```
    jellyfish histo -o bceno.m29.histo /courses/bi278/Course_Materials/new_practical/bceno.m29.count
    jellyfish histo -o bmulti.m29.histo /courses/bi278/Course_Materials/new_practical/bmulti.m29.count
    ```

11. Import the resulting `*.histo` files into R and estimate each genome size based on their kmer curves. No need to report R code back but fill in the table.

    | Organism | Genome size (base pair count from step 6) | Genome size (kmer estimate) |
    | -------- | -------- | -------- |
    | B. cepacia | 8,195,038 bp | 3,077,599 bp |
    | B. multivorans | 6,322,859 bp | 1,523,813 bp |

12. Exit out of your SSH connection.
    ```
    exit
    ```
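
Step 6 relies on `bishbashbosh.sh`, whose contents are not shown on this page. The block below is a minimal sketch of how such a script could compute genome length and GC% with `grep`, `tr`, and `wc`. The function name `genome_stats` and its output format are assumptions made for illustration and are not the course script.

```shell
#!/bin/sh
# Hypothetical stand-in for bishbashbosh.sh (NOT the course script):
# report total sequence length and GC% for one FASTA file.
genome_stats() {
    fasta="$1"
    # Drop header lines (">"), strip line endings, and count what is left.
    total=$(grep -v "^>" "$fasta" | tr -d '\n\r' | wc -c)
    # Keep only G and C (either case) from the sequence lines and count them.
    gc=$(grep -v "^>" "$fasta" | tr -cd 'GCgc' | wc -c)
    # awk does the floating-point division; guard against an empty file.
    awk -v gc="$gc" -v total="$total" 'BEGIN {
        if (total > 0)
            printf "length_bp=%d gc_percent=%.1f\n", total, 100 * gc / total
    }'
}
```

After sourcing the function, it could be called as `genome_stats GCF_003019965.1_ASM301996v1_genomic.fna`. In this sketch ambiguous bases such as `N` count toward the length but not toward GC, so a draft genome with many `N`s would report a slightly lower GC%.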
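
Step 11 estimates genome size from the kmer curve in R, and that R code is not reported here. As a rough cross-check on the command line, the sketch below applies one common approach to a two-column `jellyfish histo` file (kmer multiplicity, number of distinct kmers): ignore low-multiplicity kmers that are likely sequencing errors, take the multiplicity with the most distinct kmers as the coverage peak, and divide the total kmer count by that peak. The function name `kmer_genome_size` and the default cutoff of 5 are assumptions, and this may differ from the method used in the lab.

```shell
#!/bin/sh
# Hypothetical helper: estimate genome size from a jellyfish histo file.
# $1 = histo file (columns: multiplicity, number of distinct kmers)
# $2 = lowest multiplicity to trust (assumed default: 5)
kmer_genome_size() {
    histo="$1"
    min_cov="${2:-5}"
    awk -v min="$min_cov" '
        $1 >= min {
            total += $1 * $2                         # total kmers above the cutoff
            if ($2 > best) { best = $2; peak = $1 }  # multiplicity at the curve peak
        }
        END {
            if (peak > 0)
                printf "peak=%d size_bp=%d\n", peak, total / peak
        }' "$histo"
}
```

It could be run as `kmer_genome_size bceno.m29.histo 5`. The cutoff should sit in the valley between the error kmers and the main peak, so it is worth checking it against a plot of the curve before trusting the number.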