# BI345 Lab 1

Goal: 1) refresh Unix skills 2) relearn how to download raw sequencing reads from NCBI SRA

### Exercise #1 BI345 Unix Environment

1.1 Use terminal app to connect to bi345 Unix environment

```
ssh sdivit25@bi345
```
^then enter your password

1.2 Find out where you are (your working directory) in this file system

```
#find your current working directory
pwd
```

1.3 Look at what's in your current working directory

```
#list everything in your current working directory
ls
```

Variations on ls:

```
ls .               #a single dot means here
ls ../             #two dots mean one level above
ls ../../          #this means two levels above
ls -lh             #get ls to show more information in table-like format
ls /courses/bi345  #where all the course files will be
```

1.4 Use cd to move around the file system

```
cd .
pwd
cd ../
pwd
cd ../../
pwd
cd ~
pwd
```

1.5 Go back to home directory, make a directory/folder & move into it

```
#go to your home directory
cd ~
mkdir bi345_wk02
cd bi345_wk02
```

1.6 Next = download some files from NCBI SRA (Sequence Read Archive)

### Exercise #2 SRA-tools

General Notes:
- data for projects/articles is usually available through a specific BioProject number
- https://www.ncbi.nlm.nih.gov/bioproject/PRJNA771717 ==> for our discussion article this week, there are 1159 data samples available
- clicking on the number (https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=771717) gives you the individual data samples
- In this example, Prof Noh wanted "paired end data" so she clicked on "paired" below Library Layout in the left-hand column
- then she clicked on one of the runs (https://www.ncbi.nlm.nih.gov/sra/SRX12639428%5Baccn%5D) to find the SRR number
- NOTE: most raw data files you can download from NCBI will have a number that starts with SRR
- if you click on the SRR number (https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR16360998&display=metadata), you'll be taken to a colorful page that gives you more info (metadata) about the particular data file
- here, the
metadata told Prof Noh that this sample was coded as "subj-3pore-119col-2" & is Illumina whole genome sequencing (WGS) data
- now, we will download this file

2.1 Two commands we can use: fastq-dump & fasterq-dump. First, examine the file using fastq-dump with a few options: 1) -X, to get a small sample of the data file and 2) -Z, to print this sample onto the screen (stdout)

Ex: Check 3 spots

```
fastq-dump -X 3 -Z SRR16360998
```
^from the metadata, we expected the reads to be longer ==> paired end reads w/ 75 bp each (150 bp total), so each line of DNA seq should be about 150 bp long. The shorter output tells us that something is not quite right

2.2 To figure out the problem, we use a third command, vdb-dump. This command lets us look at the first read to see how it is coded.

```
vdb-dump SRR16360998 -R 1 -C READ_TYPE
```

Result:

![](https://i.imgur.com/71ctzbp.png)

^for some reason, this data file is labeled incorrectly in the system. It should be 2 biological reads, not one technical & one biological. To get the complete data, we need to work around the mislabeling.

2.3 Basic command to download these data is fasterq-dump. Use the --help option to see more info.

```
fasterq-dump --help
```

2.4 Syntax for fasterq-dump is simple but there are many additional useful options. First, b/c the data are paired end, we need the R1 & R2 files to be separated from each other. Otherwise your aligner won't be able to use them.

```
fasterq-dump --split-files SRR########
```

**MINE DID NOT WORK. Prof Noh said we'll fix it next week.
We tried these commands:

```
fastq-dump --split-3 SRR16360998
fastq-dump --split-spot SRR16360998
```
^neither resulted in two reads; we only got one output

2.5 Now we know the data is mislabeled, so we need to get it to output the (incorrectly labeled) 'technical' reads as well:

```
fasterq-dump --split-files --include-technical SRR########
```
^ALSO DID NOT WORK b/c we do not have proper course access -> hopefully getting fixed next week

NOTE: Prof Noh said it's okay to stop here and fix next week

**Rest of the lab is concerned with Unix commands and tricks including autocomplete, wildcards, leaving Unix, etc. - familiar w/ these from last semester
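
Extra note for the read-length check in 2.1: instead of eyeballing the sequence lines, the lengths can be printed directly. This is only a minimal sketch. The function name `read_lengths` and the toy reads are made up for illustration (they are NOT real output from SRR16360998), and it assumes plain 4-line FASTQ records.

```shell
#!/usr/bin/env bash
# read_lengths: hypothetical helper, not part of SRA-tools.
# Reads FASTQ text on stdin and prints the length of every sequence line.
# Assumes 4-line records: header, sequence, '+' line, qualities.
read_lengths() {
  awk 'NR % 4 == 2 { print length($0) }'
}

# Toy data invented for this example (NOT real reads from SRR16360998).
toy_fastq='@toy.1
ACGTACGTAC
+
IIIIIIIIII
@toy.2
ACGTA
+
IIIII'

# Print one length per read in the toy data.
printf '%s\n' "$toy_fastq" | read_lengths
```

On the course server the same function could probably be fed real data, e.g. `fastq-dump -X 3 -Z SRR16360998 | read_lengths`. If the numbers come out well below the ~150 expected from the metadata, that points to the same problem seen in 2.1.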