
# BI278 Lab 1 - Unix, NCBI, and genome files

8/31

## Unix Environment

### Connecting to the BI278 Unix Environment

To connect to the course Unix environment from Terminal, use

```
ssh cmnguy22@bi278
```

Let's break this command down. ```ssh``` stands for secure shell. It allows you to connect to a remote computing environment from any computer. The part before ```@``` is the user name and the part after it is the host to connect to.

To exit the ```bi278``` course environment, simply use the command ```exit```.

### Navigating and Finding ColbyHome (aka Filer)

```colbyhome``` stands for ```/home/colbyid```.

Unix file systems are made of nested directories. When you first log into ```bi278``` you will be in your _home_ directory, denoted by `~`. This is what the home directory looks like in the terminal prompt:

```[cmnguy22@vcacbi278~]```

**The directory shown in the prompt is the working directory.**

```pwd``` print working directory

```ls``` list all files and folders in the current working directory

```ls -lh``` detailed list of directory contents

```ls /courses/bi278/``` list the BI278 course material

```cd``` change directory

```cd ..``` move up to the parent directory

```.``` current directory

### Organizing Files

Escape spaces in file names with ```\```, for example ```ls My\ Documents```.

```cp /courses/bi278/Course_Materials/lab_01a/* .``` copy all files (```*```) to the current directory (```.```)

```mkdir``` make a directory

```rmdir``` remove a directory

```mv``` move a file; also used for renaming files

```cp``` copy a file; can also rename the copy

```rm``` remove **(cannot be reversed)**

```cat``` concatenate/print the entire contents of a file to the screen

```less``` display the contents of a file one screen at a time; exit with ```q```

```head``` print the first 10 lines

```tail``` print the last 10 lines

```man``` manual for a command

```--help``` works like the manual

#### Workflow

1. Make a directory within ColbyHome for BI278: ```mkdir BI278```. Move into it (```cd BI278```) and make directories for finished figures, raw files, processed files, and raw preference files (```finished_figs```, ```raw_files```, ```processed_files```, ```raw_preference_files```), for example ```mkdir finished_figs```.
2. ```cd ..``` Move back to colbyhome, where your copied files are stored.
3. Move all figure files (.eps) into the correct directory within BI278.
   ```
   mv *.eps BI278/finished_figs/
   ```
4. Move all raw preference files into the correct directory within BI278.
   ```
   mv raw_preference_* BI278/raw_preference_files/
   ```
5. Move raw and processed files into their respective folders using ```mv``` and calling them by name.
   ```
   mv all_songs_prefs_merged.txt playback_signals_measured.txt preference_predictions.txt BI278/raw_files
   mv achybrid_signal_and_preference.txt achybrid_signal_and_preference_README.txt BI278/processed_files/
   ```
6. Move the remaining files into the ```BI278``` directory.

## Downloading public genomic data from NCBI

For this lab I will be using this example of DNA-Seq of SARS-CoV-2 taken on an Illumina MiSeq in the Philippines.

The data for this lab was taken from NCBI's Sequence Read Archive (SRA) [BioProject list](https://www.ncbi.nlm.nih.gov/bioproject/browse).

Clicking on a project brings up a table of Project Data that shows how many SRA experiments are included. Clicking on the number in the sequence data cell shows all of the experimental datasets. Clicking on an experimental dataset shows details about it, including the type of sequencing, the data source, and the sequencing read layout.

You need to download this data using its run ID.

Take a look at the data before downloading (first 3 spots of the run):

```
fastq-dump -X 3 -Z SRR12532546
```

Download the data (here the first 100 spots, ```-X 100```) into the home directory (```--outdir ~/```):

```
fastq-dump -X 100 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --outdir ~/ SRR12532546
```

Based on the trace, the sequences vary widely in base-pair length per "spot".
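One way to check the observation about read lengths is to summarize them directly from the downloaded FASTQ file. The sketch below is only an illustration: the function name ```read_length_summary``` is made up for these notes, and the example file name in the final comment is an assumption about how ```fastq-dump``` names its output, so check ```ls ~/``` for the actual name. It relies on each FASTQ record being four lines, with the sequence on the second line.

```shell
#!/bin/bash
# Sketch: summarize read lengths in a 4-line-per-record FASTQ file.
# read_length_summary is a hypothetical helper, not part of the SRA Toolkit.
read_length_summary() {
    # Line 2 of every 4-line record holds the sequence.
    awk 'NR % 4 == 2 {
             len = length($0)
             if (n == 0 || len < min) min = len
             if (len > max) max = len
             total += len
             n++
         }
         END {
             if (n > 0) printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
         }' "$1"
}

# Example call (assumed file name; adjust to what fastq-dump actually wrote):
# read_length_summary ~/SRR12532546_pass_1.fastq
```

A large gap between ```min``` and ```max``` would be consistent with the varying spot lengths noted above.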
## Collect basic genome statistics for multiple genomes

### Get basic information

```grep pattern filename``` finds lines matching a specific pattern within a file; use quotes for complicated patterns

```wc``` count words in a file; can also count lines (```-l```) or characters (```-m```)

```tr``` translate/delete sets of characters

**Caution: if you need to use ```>``` or ```<``` as a pattern, put it in quotes (">"). Unquoted, the shell treats it as redirection.**

**```|``` is used to send/pipe the output of one command to another command without saving an intermediate file.**

a. ```grep ">" test.fa | wc -l``` prints the number of header lines, which is the number of contigs (sets of overlapping DNA segments that together make up a consensus region of DNA). **1 contig**

b. ```grep -v ">" test.fa | tr -cd '[:upper:]' | wc -m``` first keeps only the lines that do not contain the > symbol (the sequence lines), then deletes every character that is not an uppercase letter (a base), and counts what is left. **400 bases**

c. ```grep -v ">" test.fa | tr -d -c GCgc | wc -c``` deletes the complement of the set GCgc, that is, everything except G and C, since we only want to count these bases, then counts them. **253**

```awk 'BEGIN {print (253/400)}'``` divides the number of GC bases by the total number of bases and prints the result. **0.6325**

### Write and run basic Unix script to automate your collection of genome statistics

To write a script use ```nano```. To indicate to Unix that this is a bash shell script, type ```#!/bin/bash``` on the first line. Change filenames to ```$1``` in scripts so they can handle any filename passed as the first argument.

To run a script: ```sh script_name.sh fasta_file```.

**Note:** Keep your scripts in your own directory and point them at files in a separate data directory (this safeguards the data). ```sh ~/count_gc.sh``` runs the script stored in your home directory, no matter which directory you are working in.

```csvpreview {header="true"}
Genome, Contig count, Genome size (bp), GC%
P. agricolaris, 2, 8721420, 61.6%
P. bonniea, 2, 4098182, 58.7%
P. fungorum, 4, 9058983, 61.8%
P. megapolitana, 32, 7576279, 62.3%
P. phenoliruptrix, 3, 7651131, 63.1%
P. phymatum, 4, 8676562, 62.3%
P. terrae, 247, 9838322, 62.5%
P. xenovorans, 3, 9731138, 62.6%
```
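The ```count_gc.sh``` script itself is not reproduced in these notes. The sketch below shows what such a script could look like if it simply chains the three commands from steps a to c and takes the FASTA file as ```$1```. The function name ```genome_stats``` and the output layout are assumptions chosen to match the table columns. It also counts all letters with ```[:alpha:]``` rather than ```[:upper:]```, so that lowercase bases are included in the genome size.

```shell
#!/bin/bash
# count_gc.sh (hypothetical sketch): contig count, genome size, and GC% for one FASTA file.
# Usage: sh count_gc.sh genome.fasta
genome_stats() {
    fasta="$1"
    # Header lines contain ">", one per contig.
    contigs=$(grep -c ">" "$fasta")
    # Genome size: drop header lines, keep only letters, count them.
    size=$(grep -v ">" "$fasta" | tr -cd '[:alpha:]' | wc -c)
    # GC count: drop header lines, keep only G and C (either case), count them.
    gc=$(grep -v ">" "$fasta" | tr -cd 'GCgc' | wc -c)
    # Print "contigs, size, GC%" with GC% rounded to one decimal place.
    awk -v c="$contigs" -v s="$size" -v g="$gc" \
        'BEGIN { printf "%d, %d, %.1f%%\n", c, s, (s > 0 ? 100 * g / s : 0) }'
}

# Run only when a file name is supplied on the command line.
if [ "$#" -gt 0 ]; then
    genome_stats "$1"
fi
```

Run against each genome file in turn, a script like this would print one row of the table per call, for example ```sh ~/count_gc.sh test.fa```.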