# Lab 02

Log in using `ssh slpuro26@colby.edu`.

**Exercise 1**

Went to NCBI and browsed the BioProject list. Searched for *Burkholderia pseudomallei* and then sorted the results so I could find the genome sequencing projects.

I then chose one and downloaded the genome to my desktop by clicking the *Genome* and then *genome* links.

Mounted the filer using smb://filter.colby.edu.

Opened a new terminal from the desktop and used `cp Downloads/GCF_000756125.1_ASM75612v1_genomic.fna` to copy the download.

`ssh slpuro26@bi278.colby.edu`

`mkdir labwk2` to make a new folder for this week.

`mv /personal/slpuro26/GCF_000756125.1_ASM75612v1_genomic.fna ~/labwk2` to move the downloaded file into the new folder I just made.

I believe the script I made last week was made incorrectly, so I used the commands directly to get the genome stats:

`grep -v ">" ./GCF_000756125.1_ASM75612v1_genomic.fna | tr -d -c GCgc | wc -c` to get the GC count, which was 4836611,

and then `grep -v ">" ./GCF_000756125.1_ASM75612v1_genomic.fna | tr -d -c GCATgcat | wc -c` to get the total base count (the 7085397 used below),

and then `awk 'BEGIN {print (94836611/7085397)}'` to get the GC % for this set of data, which printed 13.3848. That cannot be right: the numerator has a stray leading 9. With the GC count above, `awk 'BEGIN {print (4836611/7085397)}'` gives 0.682617, so the GC content is about 68.3%.

I then opened the BioProjects from NYU Langone and searched using the accession ID **PRJNA650245** so I could use the same set as the example, which would be easier to follow.

I clicked on SRA Experiments to open the experimental datasets and then clicked on one to get the run ID, which in my case was **SRR12380026**.

Used `fastq-dump -X 3 -Z SRR12380026` to preview the first 3 spots of the run.
The output was:

    Read 3 spots for SRR12380026
    Written 3 spots for SRR12380026
    @SRR12380026.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=202
    CTACAGCATTCTGTGAATTATAAGGTGAAATAAAGACAGCTTTTAATTCACAGAATGCTGTAGCCTCAAAGATTTTGGGACTACCAACTCAAACTGTTGATGCTCTGGTAATAGCAACATTAAATCTGTTTACATTACAAGAGTGAGCTGTTTCAGTGGTTTGAGTGAATATGACATAGTCATATTCTGAGCCCTGTGATGA
    +SRR12380026.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=202
    FFFF:FFFFFFFFFF:FFFFFFFFFFF::FF:F,FFFFFFFFFFFFFF:FFFFFFFF:FFFFFF:F:FF,FFFFFFFFF:FFF:FFF:F:FF,FFFFFFFFFFFFFFFFFFF,FFF:FFFFF,:F:FFFFFF,FFF:FF:F:FFF,F,FFFFFFFFFFFFFFFFFFF::F,FFF:FF:FFFFFFFFFFF,FF:FFFFFFFF,
    @SRR12380026.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=202
    TTACATCGTTGATAGTGTTACAGTGAAGAATGGTTCCACCCATCTTTACTTTGATAAAGCTGGTCAAAAGACTTATGAAAGACATTCTCTCTCTCATTTTGCTACCATCAAAAACTATAACATTAATAGGCAATGAACCTTTAGTGTTATTAGCTCTCAGGTTGTCTAAGTTAACAAAATGAGAGAGAGAATGTCTTTCATA
    +SRR12380026.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=202
    FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:,:F:FFFFFFFFFF,FFFFFFFFFFFFF:FFFF,FFFFFFFF:::F,FFFF:FFFFFF:F,FFFFFFF:FFFFFFFFFFFFFFF,FFFFFF,FFFFFF:F:FFFFF:FFF:FFF,F:F:FFF:FFFFFFFFFFFF
    @SRR12380026.3 A00427:328:HKM35DRXX:1:2101:1127:17378 length=202
    GGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCATCTGTTTGTGGACCTAAAATGTCTCTGCCAAATTGTTGGAAAGGCAGAAACTTTTTGTTAGACTCAGTAAGAACACCTGTGCCTGTTAATCCATTGAAGTTGAAATTGACACATTTGTTTT
    +SRR12380026.3 A00427:328:HKM35DRXX:1:2101:1127:17378 length=202
    FFFFFFFFFF:F::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFF,,FF:FFFF:,F:,FFFFFFFF:,FF:FF,:,,FFFFFFFFFF,:FF,F:F:F,FFF::F:FF::FFFFF,,F:FFFFF:,FFF::FFFF,FF:FFFFFFFFF

I then used the command `fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR12380026` to preview the data with the two reads of each pair separated, which gave this output:

    Read 3 spots for SRR12380026
    Written 3 spots for SRR12380026
    @SRR12380026.1.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
    CTACAGCATTCTGTGAATTATAAGGTGAAATAAAGACAGCTTTTAATTCACAGAATGCTGTAGCCTCAAAGATTTTGGGACTACCAACTCAAACTGTTGAT
    +SRR12380026.1.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
    FFFF:FFFFFFFFFF:FFFFFFFFFFF::FF:F,FFFFFFFFFFFFFF:FFFFFFFF:FFFFFF:F:FF,FFFFFFFFF:FFF:FFF:F:FF,FFFFFFFF
    @SRR12380026.1.2 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
    GCTCTGGTAATAGCAACATTAAATCTGTTTACATTACAAGAGTGAGCTGTTTCAGTGGTTTGAGTGAATATGACATAGTCATATTCTGAGCCCTGTGATGA
    +SRR12380026.1.2 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
    FFFFFFFFFFF,FFF:FFFFF,:F:FFFFFF,FFF:FF:F:FFF,F,FFFFFFFFFFFFFFFFFFF::F,FFF:FF:FFFFFFFFFFF,FF:FFFFFFFF,
    @SRR12380026.2.1 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
    TTACATCGTTGATAGTGTTACAGTGAAGAATGGTTCCACCCATCTTTACTTTGATAAAGCTGGTCAAAAGACTTATGAAAGACATTCTCTCTCTCATTTTG
    +SRR12380026.2.1 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
    FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:,:F:FFFFFFFFFF,FFFFFFFFFFFFF:FFFF,
    @SRR12380026.2.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
    CTACCATCAAAAACTATAACATTAATAGGCAATGAACCTTTAGTGTTATTAGCTCTCAGGTTGTCTAAGTTAACAAAATGAGAGAGAGAATGTCTTTCATA
    +SRR12380026.2.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
    FFFFFFFF:::F,FFFF:FFFFFF:F,FFFFFFF:FFFFFFFFFFFFFFF,FFFFFF,FFFFFF:F:FFFFF:FFF:FFF,F:F:FFF:FFFFFFFFFFFF
    @SRR12380026.3.1 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
    GGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCATCTGTTTGTGGACCTAAAAT
    +SRR12380026.3.1 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
    FFFFFFFFFF:F::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFF,,
    @SRR12380026.3.2 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
    GTCTCTGCCAAATTGTTGGAAAGGCAGAAACTTTTTGTTAGACTCAGTAAGAACACCTGTGCCTGTTAATCCATTGAAGTTGAAATTGACACATTTGTTTT
    +SRR12380026.3.2 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
    FF:FFFF:,F:,FFFFFFFF:,FF:FF,:,,FFFFFFFFFF,:FF,F:F:F,FFF::F:FF::FFFFF,,F:FFFFF:,FFF::FFFF,FF:FFFFFFFFF

Then I proceeded to download the raw sequencing reads without the quality scores using the command `fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/labwk2 SRR12380026`.

Used `head -20 ./*.fasta` to look at the first 20 lines.

Then I previewed another set, this time for *P. sprentiae*, using `fastq-dump -X 3 -Z SRR3927471`.

Output:

    Read 3 spots for SRR3927471
    Written 3 spots for SRR3927471
    @SRR3927471.1 1 length=152
    NCGGCGAACAGCCCGGTGTCGGGATCGAGCGCTTCTGGCGCGCAGTCCGCTTCGTTGACGCATCAGCAGGCGCTTGCGAGTTCAAGNNNNNNNNNNGCATCGGAAAAAGCNNNNNNGCACTCATTTTTTTCCATCNNNNAGGCTGTCGCTAT
    +SRR3927471.1 1 length=152
    ############################################################################GE4GECCBBB##########0141>:B>BB:49:######/45+.=AEBBGGDA9<=68####3366012?AA??A
    @SRR3927471.2 2 length=152
    NCCGCTTGCCGATGCCAGTCCCGAGACATTCGTCGCCGCGCTGAACGAAGCACTCACGACCGCAACAGAACAGGACTCGATCAGCAGTACGCGCTCCTTGCTTCGGGGCTCGAGTAGCGCGACGGGGCAACAGGGTTGCGGGGCACGTATTG
    +SRR3927471.2 2 length=152
    ############################################################################4?64?B@:5?AB?;;<>4=BED=:FGDEE:4BD:DA?==4BCB@4;,=;+BB5>B;@###################
    @SRR3927471.3 3 length=152
    NCAGACCTGCGCGCACGCGGTGAAGCCTAAGGAGTAAATCATGGCGAAAGGCGCACGCGACAAGATCAAGCTGGAACTTGGTTTCCTTGTAGTCAACGTGCTTGCGGATGACAGGATCAAATTTTTTGATCAGCATCTTTTCCGGCATGTTG
    +SRR3927471.3 3 length=152
    #---+12222@@5@@::::::::::<<<:<777775775558858::<<<8@@7@@@@@@@@@@@@@@?@37757@GGGGGBGGGGIIGHHIIIHIHIIIIIIBIIIEHI?IIGIHGHGIDHIEIIIGIIIBIEIIIIIHIHIII@HGIHII

Command: `fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR3927471`

Output:

    Read 3 spots for SRR3927471
    Written 3 spots for SRR3927471
    @SRR3927471.1.1 1 length=76
    NCGGCGAACAGCCCGGTGTCGGGATCGAGCGCTTCTGGCGCGCAGTCCGCTTCGTTGACGCATCAGCAGGCGCTTG
    +SRR3927471.1.1 1 length=76
    ############################################################################
    @SRR3927471.1.2 1 length=76
    CGAGTTCAAGNNNNNNNNNNGCATCGGAAAAAGCNNNNNNGCACTCATTTTTTTCCATCNNNNAGGCTGTCGCTAT
    +SRR3927471.1.2 1 length=76
    GE4GECCBBB##########0141>:B>BB:49:######/45+.=AEBBGGDA9<=68####3366012?AA??A
    @SRR3927471.2.1 2 length=76
    NCCGCTTGCCGATGCCAGTCCCGAGACATTCGTCGCCGCGCTGAACGAAGCACTCACGACCGCAACAGAACAGGAC
    +SRR3927471.2.1 2 length=76
    ############################################################################
    @SRR3927471.2.2 2 length=76
    TCGATCAGCAGTACGCGCTCCTTGCTTCGGGGCTCGAGTAGCGCGACGGGGCAACAGGGTTGCGGGGCACGTATTG
    +SRR3927471.2.2 2 length=76
    4?64?B@:5?AB?;;<>4=BED=:FGDEE:4BD:DA?==4BCB@4;,=;+BB5>B;@###################
    @SRR3927471.3.1 3 length=76
    NCAGACCTGCGCGCACGCGGTGAAGCCTAAGGAGTAAATCATGGCGAAAGGCGCACGCGACAAGATCAAGCTGGAA
    +SRR3927471.3.1 3 length=76
    #---+12222@@5@@::::::::::<<<:<777775775558858::<<<8@@7@@@@@@@@@@@@@@?@37757@
    @SRR3927471.3.2 3 length=76
    CTTGGTTTCCTTGTAGTCAACGTGCTTGCGGATGACAGGATCAAATTTTTTGATCAGCATCTTTTCCGGCATGTTG
    +SRR3927471.3.2 3 length=76
    GGGGGBGGGGIIGHHIIIHIHIIIIIIBIIIEHI?IIGIHGHGIDHIEIIIGIIIBIEIIIIIHIHIII@HGIHII

Then downloaded this second run using `fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/labwk2 SRR3927471`.

I then removed the extra files using `rm`, which were `SRR3927471_pass_2.fasta` and `SRR12380026_pass_2.fasta`.

**Exercise 2**

Attempted to run `jellyfish count -t 2 -C -s 1G -m 23 -o SRR3927471_pass.fasta.m23.count SRR3927471_pass.fasta` but there was not enough memory.

Skipped ahead to the histo file: `Burkholderia_pseudomallei_K96243_123.gbk` `pfung.m29.histo`

Went back to jellyfish count: `jellyfish count -t 2 -C -s 1G -m 23 -o SRR3927471_pass.fasta.m23.count SRR3927471_pass.fasta`

Then I ran `jellyfish histo -o SRR3927471_pass.fasta.m23.histo SRR3927471_pass.fasta.m23.count` to get the frequency distribution, and `rm SRR3927471_pass.fasta.m23.count` to remove the count file from the previous step because it takes up a lot of room.

Used `less` on the histo file so I didn't have to look at the whole thing at once.

Stopped before Exercise 3.
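
For reference, the GC calculation from Exercise 1 can be sketched as one small script, so the two counts feed straight into `awk` instead of being retyped by hand. This is only a sketch: the FASTA here is a made-up toy file written to a temporary path, and the variable names (`fasta`, `gc_count`, `total_count`, `gc_percent`) are my own, not from the lab handout.

```shell
#!/usr/bin/env bash
# Toy FASTA in a temporary file (made-up sequence, NOT the real genome).
fasta=$(mktemp)
printf '>toy_contig\nGGCCAATT\nGCgcATat\n' > "$fasta"

# Drop header lines (those containing ">"), delete every character that is
# not G/C (or not A/C/G/T), and count what is left.
# The trailing tr strips the padding some wc implementations add.
gc_count=$(grep -v ">" "$fasta" | tr -d -c 'GCgc' | wc -c | tr -d ' ')
total_count=$(grep -v ">" "$fasta" | tr -d -c 'GCATgcat' | wc -c | tr -d ' ')

# GC percent = 100 * GC bases / total bases; awk does the floating-point math.
gc_percent=$(awk -v gc="$gc_count" -v total="$total_count" \
  'BEGIN { printf "%.2f", 100 * gc / total }')

echo "GC: $gc_count  total: $total_count  GC%: $gc_percent"
rm -f "$fasta"
```

Plugging in the counts recorded in Exercise 1 (4836611 GC bases out of 7085397) gives about 68.26% with the same formula. Passing the counts with `-v` avoids copying long numbers into the `awk` expression by hand.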
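
As an alternative to paging through the histo file with `less`, a quick `awk` summary can pull out one number from it. The sketch below assumes the two-column layout that `jellyfish histo` writes (k-mer multiplicity, then number of distinct k-mers with that multiplicity). The numbers are made up for illustration, and skipping multiplicity 1 is a hypothetical cutoff I chose, not something from the lab.

```shell
#!/usr/bin/env bash
# Toy histogram in "multiplicity count" layout (made-up numbers, not real output).
histo=$(mktemp)
printf '1 900\n2 120\n3 40\n4 75\n5 210\n6 95\n' > "$histo"

# Ignore multiplicity 1 (hypothetical cutoff for likely sequencing errors),
# then keep the multiplicity whose count column is largest.
peak=$(awk '$1 > 1 && $2 > max { max = $2; best = $1 } END { print best }' "$histo")

echo "peak multiplicity: $peak"
rm -f "$histo"
```

On a real run the file name would be `SRR3927471_pass.fasta.m23.histo` in place of the temporary file.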