###### tags: `BI278`
# Lab 1 notebook (8/31)

## section 1

### 1.1 Connect to the BI278 Unix Environment
This Unix environment is separate from Colby filer and can be accessed with
```
ssh lmdrep23@bi278
```
It will always bring you to your home directory when you enter this. Whenever you come onto this environment, make sure to enter the following command when you are done:
```
exit
```

### 1.2 Navigating and finding colbyhome
Terminal recognizes the home directory as /home/lmdrep23 or ~. You can find this out by entering ```pwd``` (print working directory). This is not to be confused with colbyhome: colbyhome can be accessed through this directory without connecting to filer (it is a link to the Personal folder in filer), where it is denoted as /home/lmdrep23/colbyhome. You can access the Personal folder itself (which has the exact same set of files/folders) at ```/personal/lmdrep23```.

### 1.3 Organizing Files
A useful command to know is how to copy the materials from the course folder to the current folder (my home directory in the BI278 Unix environment):
```
cp /courses/bi278/Course_Materials/lab_01a/* .
```
Note that * means everything and . means the current directory.

*note: . means current directory and .. means the previous (parent) directory, so if you accidentally go into one of the folders within a directory you can go back to the parent with cd ..*

To organize the files within the directory I'm located in, I first entered the command below to see all files in this directory.
```
ls
```
I then saw that one of these files ended in ".R", so I used the following command to see its first 10 lines and confirm that it is an R script:
```
head submitted_version_achybrid_revised_2015.R
```
After confirming, I viewed the whole file one screen length at a time using the command below. (Note: you scroll to go through each screen and press q to exit.)
```
less submitted_version_achybrid_revised_2015.R
```
The R file revealed that all_songs_prefs_merged.txt, playback_signals_measured.txt, and preference_predictions.txt are raw files, so I made a directory titled raw_files and moved these files into that directory with the following commands:
```
mkdir raw_files
mv all_songs_prefs_merged.txt raw_files
mv playback_signals_measured.txt raw_files
mv preference_predictions.txt raw_files
```
Note that I did this while in the directory these files were being moved out of, and that the syntax is ```mv filetitle newdestination```. I then confirmed that these files successfully moved into the new directory with ```ls raw_files```. It also works to do ```ls ~``` to make sure they are no longer loose files in the home directory.

I then repeated this process to put the file "achybrid_signal_and_preference.txt" in a directory called processed_files.

Next, I moved all of the files that contained raw processing data (as well as the file with an overview of the data in each file) into a directory called raw_preference_data with the following commands. Since all of these file names began with "raw_preference", using the asterisk moved all of these files with the second command:
```
mkdir raw_preference_data
mv raw_preference* raw_preference_data
```
I repeated the same steps for the files containing figures, which all began with "fig", following the pattern sketched below.
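That repeated step looked like the following (a minimal sketch; the directory name fig_files is just a placeholder, not necessarily the name actually used):
```
# make a directory for the figure files and move everything starting with "fig" into it
mkdir fig_files
mv fig* fig_files
# confirm the files moved
ls fig_files
```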
Viewing the file achybrid_signal_and_preference_README.txt in the terminal revealed that it contains instructions on how to navigate achybrid_signal_and_preference.txt. I therefore moved this file into the same directory as that file with the following command:
```
mv achybrid_signal_and_preference_README.txt processed_files
```
I also used the less command on achybrid_reconciled_playback_predicted_final_2015.txt and saw data that would be good to have in the same place, so I moved it to that folder as well. The result of all of this work is that /home/lmdrep23 is organized by folders instead of having all loose files.

## section 2

### 2.2 Downloading datasets found in 2.2 (outside of terminal) with the SRA toolkit
I found a data set of interest using the NCBI site and noted its run number. The following command downloaded only the first few spots of data (the number after -X sets how many):
```
fastq-dump -X 3 SRR12546753
```
The following command was then used to separate the paired data:
```
fastq-dump -X 100 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --outdir ~/ SRR12546753
```
Dissecting this command:
1. It is from the SRA toolkit, and it converts SRA data (what is in the NCBI database) to the fastq/fasta format.
2. Note the Run code (SRR12546753) specifying which data set to fetch and the ~ specifying the output location.
3. The 100 after -X limits the download to the first 100 spots; it does not cut the reads into 100 base-pair lengths, which is why the read lengths did not change.

I then used the following command to remove this file from my directory:
```
rm SRR12546753.fastq
```

## section 3

### 3.1 Getting basic info about genome
I moved to the directory with the genomic files with the command
```
cd /courses/bi278/Course_Materials/lab_01b/
```
The following commands allowed me to see the files in this directory and then open one of these files:
```
ls /courses/bi278/Course_Materials/lab_01b/
less test.fa
```
To find the contigs in the test file (contig names begin with >), I used the following command. The output was just >test, revealing that there is only one contig and it is called test.
```
grep ">" test.fa
```
The following command showed a character count of 411 for test.fa; this includes the ">test" header line and the newline characters, so there are actually 400 bases:
```
wc test.fa
```
Side note: the quotes around ">" were necessary because ```some_command > filename``` puts the output of a command into a file!

To find the percentage of Gs and Cs in the sequence, the following command gives the total number of Gs and Cs in the .fa file:
```
grep -v ">" test.fa | tr -d -c GCgc | wc -c
```
Dissecting this command:
1. -v searches for all lines NOT containing >; you don't want to include the line with the contig name in your GC count.
2. tr -d -c GCgc deletes (-d) every character that is not in the set GCgc (-c complements the set), so only the Gs and Cs are left.
3. Removing the | wc -c would just print the string of Gs and Cs in the file; piping into wc -c counts those characters instead.

This command gives a count of 253, which is 63.25% GC since there was a total of 400 bases in this file. Note that you can calculate a percentage from a fraction using the following command (it gives the result as a decimal). This is useful in calculating GC%.
```
awk 'BEGIN {print (253/400)}'
```
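Putting the pieces of this section together, the GC percentage can be computed in one step by capturing both counts in shell variables and handing them to awk. This is a minimal sketch that combines the commands above, not a command from the lab handout (the variable names gc and total are arbitrary, and the second tr set assumes the sequence contains only A, T, G, and C bases):
```
# count only the Gs and Cs (header line excluded by grep -v, newlines removed by tr)
gc=$(grep -v ">" test.fa | tr -d -c GCgc | wc -c)
# count all bases
total=$(grep -v ">" test.fa | tr -d -c GCgcATat | wc -c)
# divide and print the GC percentage
awk -v g="$gc" -v t="$total" 'BEGIN {print (g/t)*100}'
```
For test.fa this prints 63.25, matching the calculation above.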
### 3.2 Writing and running a Unix script to automate collection of genome statistics
To write a script, you enter nano with the simple command ```nano```. You hit control-X whenever you are done with nano (it will ask if you want to save if you typed anything in). For the last section of lab, I entered the following commands into nano, exchanging test.fa for the filename of each data set. The first grep counts the contigs, the second command counts the total number of bases, and the last counts the total number of Gs and Cs in the file. Note that tr is case-sensitive, so both the uppercase and lowercase bases have to be listed.
```
#!/bin/bash
grep -c ">" test.fa
grep -v ">" test.fa | tr -d -c GCgcATat | wc -c
grep -v ">" test.fa | tr -d -c GCgc | wc -c
```
To use my script on the files, I used the following command, changing out the last section for each unique file name. I was able to run it on these files because I opened my script for editing again with ```nano par_genome.sh``` and replaced ```test.fa``` with ```$1```, so that the file name is a variable that is designated when you run the script. Generally, running a script requires ```sh scriptname.sh```.
```
sh par_genome.sh /courses/bi278/Course_Materials/lab_01b/P.bonniea_QS859_assembly_2_0.fna
```
Running my script on each of the fasta files gave the following data.
```csvpreview {header="true"}
Genome, Contig count, Genome size (bp), GC proportion
P. agricolaris, 2, 8721420, 0.6163
P. bonniea, 2, 4098182, 0.5872
P. fungorum, 4, 9058983, 0.6175
P. megapolitana, 32, 7607319, 0.6207
P. phenoliruptrix, 3, 7651131, 0.6315
P. phymatum, 4, 8676562, 0.6229
P. terrae, 247, 9925782, 0.6196
P. xenovorans, 3, 9731138, 0.6263
```
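To avoid retyping the command for each genome, the script could also be run in a loop. A minimal sketch, assuming every assembly in the course directory ends in .fna (test.fa, which ends in .fa, would need its own pattern):
```
# run par_genome.sh on every .fna assembly in the course directory
for f in /courses/bi278/Course_Materials/lab_01b/*.fna
do
    echo "$f"            # print the file name so the three counts that follow are labeled
    sh par_genome.sh "$f"
done
```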