# Lab 02
log in using `ssh slpuro26@colby.edu`
**Exercise 1**
moved into NCBI and browsed the BioProject list
searched Burkholderia pseudomallei and then sorted the results by genome sequencing
I then chose one and downloaded it to my desktop by clicking the *Genome* and then *genome* links respectively
mounted filer using smb://filter.colby.edu
opened new terminal from desktop
`cp Downloads/GCF_000756125.1_ASM75612v1_genomic.fna` to copy the download
`ssh slpuro26@bi278.colby.edu`
`mkdir labwk2` to make a new folder for this week
`mv /personal/slpuro26/GCF_000756125.1_ASM75612v1_genomic.fna ~/labwk2` to move the downloaded file into the new folder I just made
I believe the script I made last week was made incorrectly, so I used the commands directly to get the genome stats
`grep -v ">" ./GCF_000756125.1_ASM75612v1_genomic.fna | tr -d -c GCgc | wc -c` to get the GC count, which was 4836611
and then `grep -v ">" ./GCF_000756125.1_ASM75612v1_genomic.fna | tr -d -c GCATgcat | wc -c` to get the total base count, which was 7085397
and then `awk 'BEGIN {print (4836611/7085397)}'` to get the GC fraction for this genome, which came out to 0.682617 (about 68.3% GC)
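The three commands above can be rolled into one small shell function so the counts never have to be retyped by hand. This is only a minimal sketch, not the script from last week: the function name `gc_percent` and the toy sequence are made up for illustration, and the `grep | tr | wc` pipeline is the same one used above.

```shell
# gc_percent FILE: print the GC content of a FASTA file as a percentage.
# (Hypothetical helper name; same grep | tr | wc pipeline as in the notes.)
gc_percent() {
  local fasta="$1"
  local gc total
  # Drop header lines, delete everything except G/C, count what is left.
  gc=$(grep -v ">" "$fasta" | tr -d -c 'GCgc' | wc -c)
  # Same again keeping all four bases (ambiguous bases such as N are ignored).
  total=$(grep -v ">" "$fasta" | tr -d -c 'GCATgcat' | wc -c)
  # Let awk do the floating-point division.
  awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.2f\n", 100 * gc / total }'
}

# Toy example (made-up sequence, not real data): 6 of the 10 bases are G or C.
toy=$(mktemp)
printf '>toy\nGGCCA\nTTGCA\n' > "$toy"
gc_percent "$toy"   # prints 60.00
rm -f "$toy"
```

With the counts recorded above, `awk 'BEGIN { printf "%.2f\n", 100 * 4836611 / 7085397 }'` prints 68.26.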
I then opened up the BioProjects from NYU Langone and searched using the accession ID **PRJNA650245** so I could use the same set as the example, which would make it easier to follow
I clicked on SRA Experiments to open up the experimental datasets and then clicked on one to get the run ID, which in my case was **SRR12380026**
used command `fastq-dump -X 3 -Z SRR12380026` to preview the first three spots of the run. The output was:
Read 3 spots for SRR12380026
Written 3 spots for SRR12380026
@SRR12380026.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=202
CTACAGCATTCTGTGAATTATAAGGTGAAATAAAGACAGCTTTTAATTCACAGAATGCTGTAGCCTCAAAGATTTTGGGACTACCAACTCAAACTGTTGATGCTCTGGTAATAGCAACATTAAATCTGTTTACATTACAAGAGTGAGCTGTTTCAGTGGTTTGAGTGAATATGACATAGTCATATTCTGAGCCCTGTGATGA
+SRR12380026.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=202
FFFF:FFFFFFFFFF:FFFFFFFFFFF::FF:F,FFFFFFFFFFFFFF:FFFFFFFF:FFFFFF:F:FF,FFFFFFFFF:FFF:FFF:F:FF,FFFFFFFFFFFFFFFFFFF,FFF:FFFFF,:F:FFFFFF,FFF:FF:F:FFF,F,FFFFFFFFFFFFFFFFFFF::F,FFF:FF:FFFFFFFFFFF,FF:FFFFFFFF,
@SRR12380026.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=202
TTACATCGTTGATAGTGTTACAGTGAAGAATGGTTCCACCCATCTTTACTTTGATAAAGCTGGTCAAAAGACTTATGAAAGACATTCTCTCTCTCATTTTGCTACCATCAAAAACTATAACATTAATAGGCAATGAACCTTTAGTGTTATTAGCTCTCAGGTTGTCTAAGTTAACAAAATGAGAGAGAGAATGTCTTTCATA
+SRR12380026.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=202
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:,:F:FFFFFFFFFF,FFFFFFFFFFFFF:FFFF,FFFFFFFF:::F,FFFF:FFFFFF:F,FFFFFFF:FFFFFFFFFFFFFFF,FFFFFF,FFFFFF:F:FFFFF:FFF:FFF,F:F:FFF:FFFFFFFFFFFF
@SRR12380026.3 A00427:328:HKM35DRXX:1:2101:1127:17378 length=202
GGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCATCTGTTTGTGGACCTAAAATGTCTCTGCCAAATTGTTGGAAAGGCAGAAACTTTTTGTTAGACTCAGTAAGAACACCTGTGCCTGTTAATCCATTGAAGTTGAAATTGACACATTTGTTTT
+SRR12380026.3 A00427:328:HKM35DRXX:1:2101:1127:17378 length=202
FFFFFFFFFF:F::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFF,,FF:FFFF:,F:,FFFFFFFF:,FF:FF,:,,FFFFFFFFFF,:FF,F:F:F,FFF::F:FF::FFFFF,,F:FFFFF:,FFF::FFFF,FF:FFFFFFFFF
I then used the command `fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR12380026` to preview the same data with each spot split into its two paired reads, which gave this output:
Read 3 spots for SRR12380026
Written 3 spots for SRR12380026
@SRR12380026.1.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
CTACAGCATTCTGTGAATTATAAGGTGAAATAAAGACAGCTTTTAATTCACAGAATGCTGTAGCCTCAAAGATTTTGGGACTACCAACTCAAACTGTTGAT
+SRR12380026.1.1 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
FFFF:FFFFFFFFFF:FFFFFFFFFFF::FF:F,FFFFFFFFFFFFFF:FFFFFFFF:FFFFFF:F:FF,FFFFFFFFF:FFF:FFF:F:FF,FFFFFFFF
@SRR12380026.1.2 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
GCTCTGGTAATAGCAACATTAAATCTGTTTACATTACAAGAGTGAGCTGTTTCAGTGGTTTGAGTGAATATGACATAGTCATATTCTGAGCCCTGTGATGA
+SRR12380026.1.2 A00427:328:HKM35DRXX:1:2101:1054:18004 length=101
FFFFFFFFFFF,FFF:FFFFF,:F:FFFFFF,FFF:FF:F:FFF,F,FFFFFFFFFFFFFFFFFFF::F,FFF:FF:FFFFFFFFFFF,FF:FFFFFFFF,
@SRR12380026.2.1 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
TTACATCGTTGATAGTGTTACAGTGAAGAATGGTTCCACCCATCTTTACTTTGATAAAGCTGGTCAAAAGACTTATGAAAGACATTCTCTCTCTCATTTTG
+SRR12380026.2.1 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:,:F:FFFFFFFFFF,FFFFFFFFFFFFF:FFFF,
@SRR12380026.2.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
CTACCATCAAAAACTATAACATTAATAGGCAATGAACCTTTAGTGTTATTAGCTCTCAGGTTGTCTAAGTTAACAAAATGAGAGAGAGAATGTCTTTCATA
+SRR12380026.2.2 A00427:328:HKM35DRXX:1:2101:1072:3004 length=101
FFFFFFFF:::F,FFFF:FFFFFF:F,FFFFFFF:FFFFFFFFFFFFFFF,FFFFFF,FFFFFF:F:FFFFF:FFF:FFF,F:F:FFF:FFFFFFFFFFFF
@SRR12380026.3.1 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
GGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCATCTGTTTGTGGACCTAAAAT
+SRR12380026.3.1 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
FFFFFFFFFF:F::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFF,,
@SRR12380026.3.2 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
GTCTCTGCCAAATTGTTGGAAAGGCAGAAACTTTTTGTTAGACTCAGTAAGAACACCTGTGCCTGTTAATCCATTGAAGTTGAAATTGACACATTTGTTTT
+SRR12380026.3.2 A00427:328:HKM35DRXX:1:2101:1127:17378 length=101
FF:FFFF:,F:,FFFFFFFF:,FF:FF,:,,FFFFFFFFFF,:FF,F:F:F,FFF::F:FF::FFFFF,,F:FFFFF:,FFF::FFFF,FF:FFFFFFFFF
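Each record in the FASTQ output above takes four lines: an `@` header, the sequence, a `+` line, and a quality string the same length as the sequence. A rough sketch of a checker for that layout is below. The function name `fastq_lengths` and the toy record are made up for illustration, and it assumes every record is exactly four lines with no wrapped sequences.

```shell
# fastq_lengths: read FASTQ on stdin and print "read_id length status" per record.
# (Hypothetical helper; assumes strict 4-line records.)
fastq_lengths() {
  awk '
    NR % 4 == 1 { id = substr($1, 2) }    # header line: strip the leading "@"
    NR % 4 == 2 { seqlen = length($0) }   # sequence line
    NR % 4 == 0 {                         # quality line closes the record
      status = (length($0) == seqlen) ? "ok" : "MISMATCH"
      print id, seqlen, status
    }
  '
}

# Toy record (made up, not taken from the run above).
printf '@toy.1.1 example length=8\nACGTACGT\n+toy.1.1 example length=8\nFFFF:FFF\n' | fastq_lengths
# prints: toy.1.1 8 ok
```

It could be fed from the preview command, e.g. `fastq-dump -X 3 -Z SRR12380026 | fastq_lengths`, assuming the `Read`/`Written` status lines are not part of the piped stream.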
then I proceeded to download the raw sequencing reads without the quality scores using the command `fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/labwk2 SRR12380026`
used command `head -20 ./*.fasta` to look at the first 20 lines
then moved on to download another set, this time for *P. sprentiae*, using the same set of commands
`fastq-dump -X 3 -Z SRR3927471`
output:
Read 3 spots for SRR3927471
Written 3 spots for SRR3927471
@SRR3927471.1 1 length=152
NCGGCGAACAGCCCGGTGTCGGGATCGAGCGCTTCTGGCGCGCAGTCCGCTTCGTTGACGCATCAGCAGGCGCTTGCGAGTTCAAGNNNNNNNNNNGCATCGGAAAAAGCNNNNNNGCACTCATTTTTTTCCATCNNNNAGGCTGTCGCTAT
+SRR3927471.1 1 length=152
############################################################################GE4GECCBBB##########0141>:B>BB:49:######/45+.=AEBBGGDA9<=68####3366012?AA??A
@SRR3927471.2 2 length=152
NCCGCTTGCCGATGCCAGTCCCGAGACATTCGTCGCCGCGCTGAACGAAGCACTCACGACCGCAACAGAACAGGACTCGATCAGCAGTACGCGCTCCTTGCTTCGGGGCTCGAGTAGCGCGACGGGGCAACAGGGTTGCGGGGCACGTATTG
+SRR3927471.2 2 length=152
############################################################################4?64?B@:5?AB?;;<>4=BED=:FGDEE:4BD:DA?==4BCB@4;,=;+BB5>B;@###################
@SRR3927471.3 3 length=152
NCAGACCTGCGCGCACGCGGTGAAGCCTAAGGAGTAAATCATGGCGAAAGGCGCACGCGACAAGATCAAGCTGGAACTTGGTTTCCTTGTAGTCAACGTGCTTGCGGATGACAGGATCAAATTTTTTGATCAGCATCTTTTCCGGCATGTTG
+SRR3927471.3 3 length=152
#---+12222@@5@@::::::::::<<<:<777775775558858::<<<8@@7@@@@@@@@@@@@@@?@37757@GGGGGBGGGGIIGHHIIIHIHIIIIIIBIIIEHI?IIGIHGHGIDHIEIIIGIIIBIEIIIIIHIHIII@HGIHII
command `fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR3927471`
output:
Read 3 spots for SRR3927471
Written 3 spots for SRR3927471
@SRR3927471.1.1 1 length=76
NCGGCGAACAGCCCGGTGTCGGGATCGAGCGCTTCTGGCGCGCAGTCCGCTTCGTTGACGCATCAGCAGGCGCTTG
+SRR3927471.1.1 1 length=76
############################################################################
@SRR3927471.1.2 1 length=76
CGAGTTCAAGNNNNNNNNNNGCATCGGAAAAAGCNNNNNNGCACTCATTTTTTTCCATCNNNNAGGCTGTCGCTAT
+SRR3927471.1.2 1 length=76
GE4GECCBBB##########0141>:B>BB:49:######/45+.=AEBBGGDA9<=68####3366012?AA??A
@SRR3927471.2.1 2 length=76
NCCGCTTGCCGATGCCAGTCCCGAGACATTCGTCGCCGCGCTGAACGAAGCACTCACGACCGCAACAGAACAGGAC
+SRR3927471.2.1 2 length=76
############################################################################
@SRR3927471.2.2 2 length=76
TCGATCAGCAGTACGCGCTCCTTGCTTCGGGGCTCGAGTAGCGCGACGGGGCAACAGGGTTGCGGGGCACGTATTG
+SRR3927471.2.2 2 length=76
4?64?B@:5?AB?;;<>4=BED=:FGDEE:4BD:DA?==4BCB@4;,=;+BB5>B;@###################
@SRR3927471.3.1 3 length=76
NCAGACCTGCGCGCACGCGGTGAAGCCTAAGGAGTAAATCATGGCGAAAGGCGCACGCGACAAGATCAAGCTGGAA
+SRR3927471.3.1 3 length=76
#---+12222@@5@@::::::::::<<<:<777775775558858::<<<8@@7@@@@@@@@@@@@@@?@37757@
@SRR3927471.3.2 3 length=76
CTTGGTTTCCTTGTAGTCAACGTGCTTGCGGATGACAGGATCAAATTTTTTGATCAGCATCTTTTCCGGCATGTTG
+SRR3927471.3.2 3 length=76
GGGGGBGGGGIIGHHIIIHIHIIIIIIBIIIEHI?IIGIHGHGIDHIEIIIGIIIBIEIIIIIHIHIII@HGIHII
then downloaded this second file using `fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/labwk2 SRR3927471`
I then removed the extra files, SRR3927471_pass_2.fasta and SRR12380026_pass_2.fasta, using `rm`
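A quick sanity check on whatever FASTA files are left is to count the header lines, since each read has exactly one line starting with `>`. This is just a sketch: the function name `count_fasta_records` and the toy file are made up for illustration.

```shell
# count_fasta_records FILE: number of sequences = number of lines starting with ">".
# (Hypothetical helper name.)
count_fasta_records() {
  grep -c '^>' "$1"
}

# Toy file (made up): two records, the second wrapped over two lines.
toy=$(mktemp)
printf '>read.1\nACGT\n>read.2\nGGCC\nTTAA\n' > "$toy"
count_fasta_records "$toy"   # prints 2
rm -f "$toy"
```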
**Exercise 2**
attempted to run `jellyfish count -t 2 -C -s 1G -m 23 -o SRR3927471_pass.fasta.m23.count SRR3927471_pass.fasta` but there was not enough memory or storage space for it to finish
skipped ahead to the histo file Burkholderia_pseudomallei_K96243_123.gbk pfung.m29.histo
went back to jellyfish count and ran `jellyfish count -t 2 -C -s 1G -m 23 -o SRR3927471_pass.fasta.m23.count SRR3927471_pass.fasta` again
then I ran the jellyfish histo command, `jellyfish histo -o SRR3927471_pass.fasta.m23.histo SRR3927471_pass.fasta.m23.count`, to get the k-mer frequency distribution
and `rm SRR3927471_pass.fasta.m23.count` to remove the count file from the previous step because it takes up a lot of room
used `less` to page through the histo file so I didn't have to look at the whole thing at once
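Besides paging through it with `less`, the histo file can be summarized with a little awk. This sketch assumes the usual two-column layout from `jellyfish histo` (k-mer multiplicity, then the number of distinct k-mers seen that many times). The function name `histo_summary` and the toy numbers are made up for illustration.

```shell
# histo_summary FILE: total up a two-column k-mer histogram.
# Assumed columns: multiplicity, number of distinct k-mers with that multiplicity.
histo_summary() {
  awk '
    { distinct += $2; total += $1 * $2 }
    END {
      printf "distinct_kmers %.0f\n", distinct
      printf "total_kmers %.0f\n", total
    }
  ' "$1"
}

# Toy histogram (made-up numbers, not from the real run).
toy=$(mktemp)
printf '1 10\n2 5\n3 20\n4 5\n' > "$toy"
histo_summary "$toy"
# prints:
# distinct_kmers 40
# total_kmers 100
rm -f "$toy"
```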
stopped before Exercise 3