## ANGUS: Analyzing High Throughput Sequencing Data
July 2 - July 14, 2018
UC Davis
Workshop website:
---
# BEARS day 1 & 2 notes
---
Welcome to the mighty Bears!
We will use this HackMD to share links and snippets of code, take notes, ask and answer questions, and whatever else comes to mind.
## Announcements
Bathrooms: gender neutral bathrooms and lactation rooms upstairs on 2nd floor
Code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
Google Poll:
https://goo.gl/forms/F6BnIj558PaeTHcE2
### Instructors:
* Tristan De Buysscher
* [Lisa Johnson](https://twitter.com/monsterbashseq)
### TA's:
* Kat Fowler
* Sateesh Peri
* Asher Hudson
* Elias
* Raquel
### Introductions: Name, Institution, Email (optional), Twitter (optional). What is the last thing that you made?
* Elias Oziolor, UCD, emoziolor@ucdavis.edu, twitter: [eliasoziolor](https://twitter.com/eliasoziolor), I made soyrizo scrambled eggs this morning.
* Kat Fowler, NIH, kat.ffowler@gmail.com, @firstaidkat (I only lurk I never post), the last thing I made was probably a fool of myself
* Jessica Zehnpfennig, Central Michigan University, Zehnp1jr@cmich.edu, Twitter: [@jZehnpfennig](https://twitter.com/JZehnpfennig) I made myself workout this morning.
* Pearl Lam, University of Arizona (Tucson, AZ). wplam@email.arizona.edu - made a collage on illustrator
* Sateesh Peri, University of Nevada, Reno (The Biggest Little city in the world), perisateesh@nevada.unr.edu, I made a painting :)
* Maggi Brisbin, Okinawa Institute of Science and Technology, margaret.marsbrisbin@oist.jp, Twitter = [@MargaretBrisbin](https://twitter.com/MargaretBrisbin) , Last thing I made: [Plankton>Plastic](https://www.instagram.com/p/Bity5rCFLJB/?taken-by=mars_adventure)
* Tristan De Buysscher, University of North Carolina at Chapel Hill, tristand@email.unc.edu, [@Joiry](https://twitter.com/Joiry) . 3d printed scenery at UNC's maker space
* Mike Maniscalco, UC Santa Barbara, mmaniscalco@ucsb.edu, the last thing i made was some chili-mac and cheese in yosemite before heading up here
* Jen Reeve, University of Colorado Boulder, jennifer.reeve@colorado.edu, [@seejenscience](https://twitter.com/seejenscience), garden beds for my nice little (late start) garden
* Lisa Ma, University of California Davis, llama@ucdavis.edu, I don't get Twitter. I made zucchini bread for the amazing staff running my sample through the FACS sorter.
* Anna James, University of California Santa Barbara, anna.james.k@gmail.com, a puffy-paint goblet
* Joshua Kling, University of Southern California, Joshuakl@usc.edu, [@KlingJoshua](https://twitter.com/KlingJoshua), watermelon martinis
* Rogan Grant Morimoto Lab, Northwestern University; rogangrant2022@u.northwestern.edu; I have a twitter, but I honestly don't like the platform enough to share; last thing I made was a gigantic mess demolishing my floors
* Nick Jensen, UC Davis Microbiology Graduate Group/Dept. of Food Science & Technology, yenjensen@ucdavis.edu , I just made a Twitter ([@FeedTheMicrobes](https://twitter.com/FeedTheMicrobes)) today so it's uninformative, chicken bone broth and pickled cucumbers
* Richard Harris, University of Southampton - UK, rjh1n18@soton.ac.uk, @rich_j_h, I made some raised beds for the garden.
* Katarina Stuart, University of New South Wales, katarina.stuart@student.unsw.edu.au, Twitter: [@Katarina_Stuart](https://twitter.com/Katarina_Stuart), I made the decision not to have 5 doughnuts for breakfast
* Suren Ambegaokar, Ohio Wesleyan University, ssambega@owu.edu. I made a sandwich for lunch (turkey & cheese for those who are curious).
* Lisa Johnson, UC Davis, ljcohen@ucdavis.edu, [@monsterbashseq](https://twitter.com/monsterbashseq). I made coffee this morning!
* Amy Young, UC Davis, ayoung@ucdavis.edu, [@ayoung1411](https://twitter.com/ayoung1411), blueberry muffins
* Michael Mann, University of New Mexico, mimann@unm.edu [@dse_fungi](https://twitter.com/dse_fungi). Not much!
* Emma Wear, University of Montana; emma.wear@flbs.umt.edu; @emmakwear; sadly, lab gear
* Asher Hudson, UC Davis, aihudson@ucdavis.edu, [@AsherHudson](https://twitter.com/AsherHudson), microscopy sections of a bunch of plants
* Jeff Rodzen, Fisheries Geneticist, CA Dept of Fish and Wildlife, jeff.rodzen@wildlife.ca.gov, smoked brisket
* Jillian Adkins, CA Department of Fish and Wildlife; Wildlife Forensic Specialist jillian.adkins@wildlife.ca.gov I made black bean falafel on Sunday
* Natalie Swinford, UC Davis Department of Evolutionary Anthropology -- Henn Lab (human population and quantitative genetics), naswinford@ucdavis.edu, roasted potatoes and scrambled eggs yesterday morning (with hot sauce).
* Yu-Chieh David Chen; Neuroscience PhD student, UC Riverside; ychen107@ucr.edu; [@YCDavidChen](https://twitter.com/YCDavidChen); toast
* Sari Mäntynen; Department of Microbiology and Molecular Genetics, UC Davis; sari.s.mantynen@jyu.fi; I made a sandwich
* Andrew Essin; I installed an electric car charger
General notes
==
* Interested in XSEDE credits, talk to Lisa
* Here is a public collection of successfully-funded XSEDE allocation proposals, feel free to use and distribute! https://github.com/ljcohen/jetstream-xsede-illo/tree/master/xsede_applications
* Gitbash Tips:
Copy pasting in bash: Need to turn on the quickedit mode (right click top left icon > Defaults > QuickEdit mode > okay )
Can also use Shift + Insert
* Trimmomatic command for PE data, see tutorial for [nonmodel RNAseq workshop](http://rnaseq-workshop-2017.readthedocs.io/en/latest/quality-trimming.html):
```
for filename in *_R1_*.fastq.gz
do
    # first, make the base by removing fastq.gz
    base=$(basename $filename .fastq.gz)
    echo $base

    # now, construct the R2 filename by replacing R1 with R2
    baseR2=${base/_R1_/_R2_}
    echo $baseR2

    # finally, run Trimmomatic
    TrimmomaticPE ${base}.fastq.gz ${baseR2}.fastq.gz \
        ${base}.qc.fq.gz s1_se \
        ${baseR2}.qc.fq.gz s2_se \
        ILLUMINACLIP:TruSeq3-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
done
```
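The `basename` and `${base/_R1_/_R2_}` steps in the loop above can be checked on their own before running Trimmomatic; a minimal sketch with an invented filename:

```shell
# Hypothetical paired-end R1 filename (not from a real dataset)
filename="sample01_S1_L001_R1_001.fastq.gz"

# basename strips the directory and the given suffix
base=$(basename "$filename" .fastq.gz)
echo "$base"      # sample01_S1_L001_R1_001

# ${var/pattern/replacement} substitutes the first match,
# turning the R1 name into its R2 mate
baseR2=${base/_R1_/_R2_}
echo "$baseR2"    # sample01_S1_L001_R2_001
```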
Questions for instructors/helpers ?
===========
Q_2) How do you find the link to a new genome or data set? Ie where do I find the link to investigate the Sequoia genome rather than the mouse genome? I can download the file from NCBI, but what if I want to have my script do that for me?
A_2)This is an excellent question, and on a broad level, is something that a lot of people want to know! Some genomes are on NCBI. Others are only located on investigator websites, or in the Supplementary materials for papers. Sometimes you have to contact specific investigators to figure out where the genomes are located. It's a big problem to know where to go to search for genome assembly and annotation files. To answer your specific question about how to get the script to download the file for you, once you've found the url address for the genome file(s) you want, use this command in your script:
```
wget --level=1 --recursive --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/Assembled_chromosomes/seq/*.gz
gunzip *
```
Then you will have the assembly files for the cow genome (I couldn't find Sequoia), separated by chromosomes.
Link to NCBI genomes: ftp://ftp.ncbi.nlm.nih.gov/genomes/
The NCBI taxonomy browser is a good place to start with your organism of interest to see what is available: https://www.ncbi.nlm.nih.gov/taxonomy
Log of shell commands
==
if you want more information on any command, type "man" before the command name and press enter (takes you to the manual entry for that command)
* pwd
* "print working directory"... tells you where you are
* ls
* "list"... prints the contents of your working directory
* lots of flags to change the format of the output
* `ls -l` shows you the permissions, byte size, etc of your files
* `ls -1` forces all of the contents to print in one column... easier to read, and possibly useful if you're trying to save the output as a file
* `ls -sh1` shows single line, with human readable file sizes
* `ls -lah` shows hidden files, in a list, in human readable file sizes
* ssh
* stands for something like "secure shell"... how you connect to cloud computing environments from the terminal on your own computer
* chmod
* changes the permissions of a file, directory, computing environment, etc...
* here's a [link](http://www.zz9.co.za/chmod-permissions-flags-explained-600-0600-700-777-100-etc/) if you want to know more about the chmod shorthands
* `-i` flag stands for identity file (indicates that the private key file will be provided after the `-i`)
* mkdir
* "make directory" ... makes a new directory (ie folder) within your working directory
* curl
* "see URL"... uses a URL to download a file (usually data) or a package directly from the internet
* `-o` flag stands for "output"; allows you to specify a file name for your download
* there are other similar commands, e.g. wget. some folks like wget because if you type in the wrong URL, the computer will try to find the correct web address
* "Up arrow" will retrieve the previous command typed
* redirect symbol, `>`
* sends the output of the command on its left into the file named on its right (e.g. `head file.txt > first_lines.txt`); to feed one command's output into another command instead, use the pipe symbol, `|`
* `cat <filename>` will display all lines of <filename> to the output on the screen
* `less <filename>` will open a pager to display <filename> contents; move line-by-line with the arrow keys or page-by-page with the space bar. Use `q` to exit from `less`. You can use `less -S <filename>` for a more organized view of files with long lines (it turns off line wrapping)
* `.tsv` = "tab-separated value" formatted file
* `.csv` = "comma-separated value" formatted file
* md5sum checks whether a file download was complete (sometimes there are network errors that cause abnormal breaks in the data files that can go undetected until you really need your file to be complete! So, please use this!!!)
```
md5sum -c err_md5sum.txt
```
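If you want to see both sides of the checksum exchange, you can generate a checksum file yourself (in the command above, `err_md5sum.txt` would have come from whoever sent you the data); a minimal sketch with a throwaway file:

```shell
# Make a small test file and record its checksum
echo "ACGT" > reads_test.txt
md5sum reads_test.txt > my_md5sum.txt

# -c re-computes the checksum and compares;
# prints "reads_test.txt: OK" if the file is intact
md5sum -c my_md5sum.txt
```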
Output format 6 header:
* http://www.metagenomics.wiki/tools/blast/blastn-output-format-6
---
### Polls
1) List your favourite UNIX command
* ls
* uniq...cause it's uniq...
* uh... cd?
* exit
* nano
* ls -sh1
* cat/ls
* ls
* man
* nano (not vi)
* ls
* sed
* sl
* doshistory
* ls --color
* ls -sh1
* ls
* `*` as a regex
* grep (and piping)
* screen
* sleep
* dos2unix
* cowsay
2) Here is a list of UNIX commands, place an X next to the command if you know how the command works or what it does:
* pwd xxxxxxxxxxxxxxxxxx
* ls xxxxxxxxxxxxxxxxxxxxxxxxxx
* cd xxxxxxXxxxxxxxxxxxxxx
* mkdir xxxxxxxxxx
* curl xxxxxxxxxxxxxxx(0.5x)
* wget xxxxxxxxxxxxxx(0.5x)
* cut xxxxx(0.5x)x
* grep xxxxxxxxx(0.25x)
* man xxxxxxxxxxxxxxx
* less xxxxxxxxxxxxxxxxxxx
* nano xxxxxxx
* whoami
## ice breaker
If you could meet any living person for a chat over a shared dinner, who would you pick and why?
Elias Oziolor
Kat Fowler
Jessica Zehnpfennig
Pearl Lam
Sateesh Peri
Maggi Brisbin
Tristan De Buysscher
Mike Maniscalco
Jen Reeve
Lisa Ma
Anna James
Joshua Kling
Rogan Grant
Nick Jensen
Richard Harris
Katarina Stuart
Suren Ambegaokar
Lisa Johnson
Amy Young
Michael Mann
Emma Wear
Asher Hudson
Jeff Rodzen
Jillian Adkins
Natalie Swinford
Yu-Chieh David Chen
Sari Mäntynen
Andrew Essin
---
## Private Notes Section
Tristan
==
my name is in bold!!!!
Lisa J.
==
This is a [link](http://angus.readthedocs.io/en/2018/)
* point one
* point two
* point three
Lisa Ma
==
Graduate student at UC Davis studying a genetic neurodegenerative disease in humans
Apologize in advance for overly detailed notes because I forget everything.
Booting a Jetstream Computer Instance:
* Jetstream: NSF's free cloud computing service that you can apply to using an Xsede Application
* For this class, we are using an educational application
* Go to [this page](https://use.jetstream-cloud.org/application) and click login at the top right, and continue with the Xsede institution
* User: dibbears, password in Slack
* In the future, if I want to submit an Xsede application, talk to Lisa Johnson
* Create a new Project to keep my computer separated from others: Click Projects, Create new project, then use your name and add a description
* Boot an instance: Click on Images, Search DIBSI for DIBSI 2018 workshop image, and click on it. This image already has preloaded the basic software we need for the workshop. Ubuntu is a type of Linux that has a stable environment with support. It's our core OS. Docker has a set of installs in a package. Bioconda is a way to install a lot of Python packages quickly. Click Launch, and change the name and select your project. Launch and wait until status says active with a green circle (<15min).
* Vocab: a dependency is when you write a program that depends on using someone else's software, so if that someone else changes something, that affects you
* We will shut down the instance at the end of the day.
* Default size of instance is m1.medium. Based on what we are doing, we may need larger instances.
* Keying into your instance using a private key: download the private key for ANGUS using the link provided in the tutorial. On your instance, find your IP address. On Windows, use MobaXTerm to start a new session. Put IP address for Remote Host, enter class name for class group name. Click advanced SSH settings, and under use private key, put the exact location of the private key file on my computer. Click OK.
* To save the Jetstream workspace for later in the day, click Suspend under the instance details screen.
* To wake up your Jetstream instance, click Resume under the instance details screen.
* To save your instance for another day: this will save your files, but any info in your memory (history in command line) will be gone. Click Stop under the instance details screen. Wait until activity says powering-off and status says shutoff.
* To restart your instance another day, click Start under the instance details screen. Wait until activity says powering-on and status says active.
* To delete an instance, click Delete under the instance details screen.
Running command line BLAST:
* ```cd ~/```
* cd: go to directory
* ~/: my home directory
* ```mkdir -p ~/blast```
* mkdir: make a subdirectory
* -p: make parent directories as needed. If you already have a directory with the same name, it will continue to make it anyway instead of giving you an error
* ~/blast: in the home directory, and call it blast
* ```man mkdir```
* man: access the manual
* mkdir: to see what mkdir means
* hit q to quit
* ```ls```
* list the directories in your current directory
* ```pwd```
* print working directory, so shows you what you're in
* ```cd ~/blast ```
* cd: go to directory
* ~/blast: subdirectory blast
* ```conda install -y blast```
* conda: the conda package manager (with the bioconda channel), whose install function we use here
* install: install
* -y: answers yes to install all the dependent packages needed for the BLAST package
* blast: BLAST package
* ```curl -o mouse.1.protein.faa.gz -L https://osf.io/v6j9x/download```
* curl: download a file from the web
* -o: writes output to the file name of the following name
* mouse.1.protein.faa.gz: name it this
* -L: follow redirects; if the location of the webpage has moved, curl downloads from the new location
* ```ls -l```
* ls: list things in current directory
* -l: list in long format
* ```ls -1```
* -1: list it in 1 column
* ```ls -sh1```
* -sh1: lists things with total sizes of files and names of files in a column
* ```ls -lah```
* -lah: shows hidden files as well
* ```gunzip *.faa.gz```
* gz in the file name means that the files are gzipped, or compressed
* gunzip: unzip a compressed file
* *: wildcard
* *.faa.gz: anything that ends in .faa.gz
* ```head mouse.1.protein.faa```
* head: show me the first 10 lines of the file
* mouse.1.protein.faa: of this file
* Note: the files are in FASTA format, which is just a text file with records where each new record starts with >
* ```tail mouse.1.protein.faa```
* tail: shows that last 10 lines of the file
* Note: to copy the last thing you typed, just press the up arrow
* ```head -n 11 mouse.1.protein.faa > mm-first.faa```
* -n: specifies how many lines to show, which is 11 in this case
* mouse.1.protein.faa: of this file
* > mm-first.faa: take the output and put it into a file named mm-first.faa
* ```cat mm-first.faa```
* cat: show me the contents of this file
* ```less mm-first.faa```
* less: show me the contents of this file in a separate screen
* press q to exit
* ```makeblastdb -in zebrafish.1.protein.faa -dbtype prot```
* makeblastdb -in: make the following file a blast database
* zebrafish.1.protein.faa: this file
* -dbtype prot: the type of this database is a protein database
* Note: notice that many commands start with -something, then follows with what that command is supposed to work on
* ```blastp -query mm-first.faa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt```
* blastp: call BLAST to do a search
* -query mm-first.faa: look for the sequences in this file
* -db zebrafish.1.protein.faa: look in the database named this file
* -out mm-first.x.zebrafish.txt: name the output as filename this
* Note: to autocomplete a file name, press tab in the middle of typing the file name, then enter
* ```less mm-first.x.zebrafish.txt```
* look at this file
* press space to scroll down and q to exit
* ```head -n 498 mouse.1.protein.faa > mm-second.faa```
* same as above, except grabbing 498 lines
* ```blastp -query mm-second.faa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt```
* BLASTing the larger file
* notice it takes a lot longer
* ```blastp -query mm-second.faa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tsv -outfmt 6```
* saves the output as file mm-second.x.zebrafish.tsv
* tsv: a tab separated file, which separates values with a tab
* -outfmt 6: use an output format 6, which is a tsv format
* it won't know you want that just from the name of the file
* this [webpage](http://www.metagenomics.wiki/tools/blast/blastn-output-format-6) shows the options for blastp format 6:
* ```blastp -query mm-first.faa -db zebrafish.1.protein.faa -out mm.first.x.zebrafish_one_hit_per_line.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1```
* -evalue 1e-5: sets a threshold e-value, i.e. returns fewer hits that are of better quality
* -max_target_seqs 1: only show 1 sequence, the best hit per query
* type exit to exit your instance
* to start again, press R
* to see your history, type history
* to save your history to a text file, type S
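The `head`/`>` record-grabbing step above works on any FASTA file; a minimal sketch with a toy file (sequences invented, not real mouse proteins):

```shell
# Build a tiny FASTA file: each record starts with ">"
printf ">seq1\nMKVL\n>seq2\nGATT\n>seq3\nPQRS\n" > toy.faa

# Take the first record (header + sequence) and redirect it into a new file
head -n 2 toy.faa > toy-first.faa
cat toy-first.faa    # prints >seq1 then MKVL
```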
Random notes on UNIX commands
* to exit vi use :q
* whoami display your user name
Visualizing BLAST score distributions in R:
* ```sudo passwd $USER```
* sudo passwd: "superuser do"; runs the command as root, forcing your OS to do it. You wouldn't have this access on an institutional HPC. Be very careful using it because it can mess things up.
* passwd: changes a password. Without sudo, it would just change the password for you, not for other accounts on the system
* $USER: an environment variable holding the name of the current user, which here is dibbears. The $ tells the shell to expand a variable to its stored value
* To cancel out of a typed command without running, use ctrl+C at the end before pressing enter
* ```echo $USER```
* echo: prints its arguments; since $USER is a variable, the shell expands it first, so this prints dibbears
* After running this, you choose a new password and type it twice (I chose my favorite rodent)
* ```echo My username is $USER```
* echo: play back what I followed up with
* This time, it should play back My username is dibbears
* you could use this trick to have the computer tell you what sample you are on or something
* ```echo http://$(hostname):8787/```
* this prints a URL built from your instance's host name; 8787 is the port RStudio Server listens on
* enter into your web browser that allows you to sign into Rstudio using dibbears as user and your chosen password
* now we have launched Rstudio directly from your instance, so we should have all the files we made from yesterday
* From here on for the rest of today, commands will have a different format because they will be in Rstudio, not UNIX
* Notice in Rstudio that the Console section is where you type in R commands, and the next tab over is terminal, which is our UNIX interface that we can use
* ```blast_out <- read.table('blast/mm-second.x.zebrafish.tsv', sep='\t')```
* blast_out <-: assigns the output of the function that follows to a variable named blast_out
* In R, you assign a value to a name with <-
* read.table(): reads a file in a table format and creates a data frame with it
* 'blast/mm-second.x.zebrafish.tsv': reads the file that is in the blast folder called mm-second.x.zebrafish.tsv
* sep='\t': tells read.table that values on each line are separated by tabs
* ```getwd()```: in R, displays the current working directory
* R will try to autocomplete commands, and tab complete for file names also works in R
* You can type in commands into your source box (top left) and it won't run it unless you put your cursor in that box and hit run or hit Ctl+enter
* ```View(blast_out)```
* pulls up blast_out file to view
* ```class(blast_out)```
* tells you what type of data blast_out is
* blast_out is a dataframe, which is kind of a spreadsheet with columns and rows
* ```dim(blast_out)```
* tells you what dimensions blast_out has
* ```colnames(blast_out) <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")```
* colnames(blast_out): designates column names for the table blast_out
* the other half of the arrow, you are designating all of the column names from left to right
* ```hist(blast_out$evalue)```
* hist: creates a histogram
* $ in R extracts a column from a data frame by name
* blast_out$evalue: the evalue column; the histogram bins these values along the x axis and counts them on the y axis
* evalue is a scoring system: the likelihood that you would get that matching score by chance, so you want it to be low
* we would want to change the scale though because the axes ranges aren't very useful
* ```hist(-log(blast_out$evalue))```
* this stretches out the binning or distribution along the x axis
* Lisa Johnson likes to set her default e-value at 1e-5, depending on what you are looking at
* sometimes it's not easy to troubleshoot if you have lots of parentheses nested in parentheses. One way to make the pipeline clearer is to store intermediate outputs in new variables
* ```ev <- blast_out$evalue```
* ```hist(-log(ev))```
* ```hist(blast_out$bitscore)```
* bitscore is a scoring system for how much of your query aligns with the database
* so a higher bitscore is better
* ```plot(blast_out$pident, blast_out$bitscore)```
* plot(): plots a graph, with the first variable on the x axis and the 2nd variable on the y axis
* the pident column stands for percent identity, which is also a measure of how similar 2 sequences are
* you would expect a correlation where incr bitscore goes with incr percent identity. But we don't see that so much because bitscores take percent identity and length into account, so there's some difference depending on the lengths
* ```plot(blast_out$pident * (blast_out$qend - blast_out$qstart), blast_out$bitscore)```
* ```blast_out$qend - blast_out$qstart```: subtracts the two so that we get the length of the protein
* by multiplying that with the pident, it normalizes by length
* so we get a much better correlation now
* You can save this script using save as so you can pull up those commands again later
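The length normalization in the last plot, `pident * (qend - qstart)`, can also be computed outside R, e.g. with awk on the tab-separated outfmt-6 file (pident is column 3, qstart and qend are columns 7 and 8). A sketch on a made-up two-row table:

```shell
# Two invented outfmt-6 rows (qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore), tab-separated
printf 'q1\ts1\t90.0\t100\t5\t1\t1\t101\t1\t101\t1e-50\t200\n' > toy_blast.tsv
printf 'q2\ts2\t80.0\t50\t8\t2\t1\t51\t1\t51\t1e-10\t90\n' >> toy_blast.tsv

# pident * (qend - qstart): the x-axis values from the R plot
awk -F'\t' '{ print $1, $3 * ($8 - $7) }' toy_blast.tsv
# q1 9000
# q2 4000
```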
Short read quality and trimming:
* After the sequencing center runs your data on Illumina, you get a big fastq file back
* you maybe submitted a DNA strand 400-600bp long, and you get back 50 bp segments back, or if you did paired end, you have 50 bp on each end
* but those 50bp on each end may have some errors
* or if those 50bp came from an original strand only 80bp long, that's probably an adapter sequence, and the sequencing could run off the front or back end
* also, read quality can drop off later in the cycles, so you could have more errors on one end of the sequence as your quality drops off. That's because at that point in the cycles, there are too many underlying fluorophores to stably stack more sequence on top, and those underlying fluorophores have higher chance of interfering
* ```cd ~/```
* go to your home directory
* `conda install -y fastqc multiqc trimmomatic`
* uses bioconda to install fastqc, multiqc, and trimmomatic programs
* multiqc will combine all our fastqc reports
* trimmomatic is our trimming tool
* `cd ~/`
`mkdir -p data`
`cd data`
* goes home, makes a subdirectory called data, and goes to that subdirectory
* `curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz`
* pulls in file from this website, and names it ERR458493.fastq.gz
* the 6 files we pulled in are a yeast RNAseq dataset that we are going to use a lot after this
* `md5sum *.fastq.gz`
* md5sum: computes and checks md5 message digest
* md5: cryptographic hash function that can be used to check message integrity. It calculates a new value for each file that is unique to that file.
* you would compare the md5sum numbers you got to the numbers that the sender of the file got
* `md5sum -c err_md5sum.txt`
* this would check your md5sum values with anothers (the sequencer's) md5sum values, and if no problems, it would return only OK values
* `chmod a-w *`
* this will prevent the files from being writeable, so I can't introduce typos or errors in the data
* a-w: this indicates which permissions we want to modify. a means all users (owner, group, and others), w is the write permission, and the - means take that permission away
* *: wildcard, so telling it to apply it to everything. Could also have written *.gz to mean do it to everything that ends in .gz
* `mkdir -p ~/quality`
`cd ~/quality`
* this makes a subdirectory called quality and goes to it
* `ln -fs ~/data/* .`
* ln: makes a virtual copy, but really it's just linking back to the original data file
* -fs: f removes existing destination files with the same name, and s means symbolic link instead of a permanent link
* ~/data/*: links to all the files in the data directory
* .: links from the current directory
* now if you use ls, you will see these files are now accessible from your current directory
* you can also use `less` to look at the data files
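The symlink pattern can be tried with throwaway files first (directory names here are invented):

```shell
# A stand-in "data" directory with one file, and a separate working directory
mkdir -p toy_data toy_quality
echo "some reads" > toy_data/sample.fq

# -s: symbolic link; -f: replace an existing link of the same name
ln -fs "$PWD"/toy_data/sample.fq toy_quality/

# Reading through the link reaches the original file
cat toy_quality/sample.fq    # some reads
```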
* FASTQ files store biological sequences and their quality scores
* 4 lines total
* line 1: starts with @; descriptor of the file. If getting from a sequencing center, it usually includes the following separated by colons: a name that represents which sequencer it is, the run ID, the flowcell ID, the flowcell lane, tile number in the lane, x coordinate of the cluster, y coordinate of the cluster, indexed number if multiplexed, the member of a pair
* line 2: raw sequence itself
* line 3: begins with +; separator. But + is also a quality value, so need to be careful with that, especially if working with early sequencing files
* line 4: quality scores that correspond to the sequence letters in line 2. These are PHRED quality scores; they are log 10 values that represent the probable accuracy of the base call. Consider whether they are PHRED 33 or PHRED 64 scores, because that will determine which symbols mean what score
* these files aren't really readable
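The PHRED+33 encoding on line 4 can be checked by hand: the quality score is the character's ASCII code minus 33. A minimal sketch with an invented record:

```shell
# A toy 4-line FASTQ record
printf '@read1\nACGT\n+\nIIII\n' > toy.fastq

# printf '%d' with a leading quote prints a character's ASCII code,
# so for quality character I (ASCII 73), PHRED+33 gives 73 - 33 = 40
q=$(printf '%d' "'I")
echo $((q - 33))    # 40
```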
* `fastqc ERR458493.fastq.gz`
`fastqc ERR458500.fastq.gz`
* type fastqc and just give it the name of the file you want it to operate on
* now if you type ls, you will see that it created a new fastqc zip file
* also creates html files that if you open in a browser using Rstudio, you can see
* `ls -d *fastqc.zip*`
* shows files that end in fastqc.zip, which are the two that fastqc just created
* -d: tells it to list the directories themselves, not the contents
* When you open the html file in a web browser
* on the left side, you can see a check, an x, or an exclamation point, but those may not mean anything because the program doesn't know what you are trying to do, and some stats are only calculated for subsets of the data (i.e. duplicated sequences only calculated from the first 100K sequences in the file)
* per base sequence quality: shows quality of bases across the sequence
* notice the first 10 scores tend to be lower
* and there's usually a slow shallow drop off in quality at the end too
* per tile sequence quality shows where on the flowcell may have had lower quality
* per sequence quality scores: shows the average quality score across an entire read. You want to see a peak at the top, but you may see a small lip towards the middle showing a few bad reads.
* per base sequence content: again, notice the first 10 bases are a little all over the place
* per sequence GC content: they don't know what species you are working with so may not be helpful
* per base N content: if the quality score is so bad that it has no idea what it was, it will assign an N
* sequence length distribution: shows how long each sequence is. Not helpful with untrimmed data because you chose that
* sequence duplication levels: possibility of a colony invading or taking over its neighbors, and you may not care if doing amplicon sequencing or only expecting to see sequences from one part of the genome
* overrepresented sequences: none here
* adapter content: sometimes can guess what adapters are there in overrepresented sequences
* `cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-PE.fa .`
* cp: copy files and directories
* /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-PE.fa: from this location
* .: to my current location
* we need a reference for what the adapter sequences are, so we are bringing this file in. Most trimming packages include it, but this is the location this trimming package has it.
* if you need to make custom adapters for your experiment, you will need your own file of adapter sequences
* Lisa Johnson concatenates all the adapter files in trimmomatic into one adapter file
```
trimmomatic SE ERR458500.fastq.gz \
    ERR458500.qc.fq.gz \
    ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
    LEADING:2 TRAILING:2 \
    SLIDINGWINDOW:4:2 \
    MINLEN:25
```
* notice these are not the standard UNIX conventions. These are the conventions chosen by the programmer who created trimmomatic
* backslash indicates that you aren't done yet, so you can separate things out to make it more readable
* trimmomatic has a webpage that tells you all these functions
* SE: single-end mode; for paired-end files, use PE instead (with R1 and R2 inputs, as in the TrimmomaticPE loop earlier in these notes)
* ERR458500.fastq.gz: starting file name
* ERR458500.qc.fq.gz: ending file name
* ILLUMINACLIP:TruSeq2-PE.fa:2:40:15: the adapter-clipping step; the path to the adapter sequences, then up to 2 seed mismatches allowed, a palindrome clip threshold of 40 (for read-through into ligated adapters), and a simple clip threshold of 15 (for a direct adapter match)
* LEADING:2 TRAILING:2: starting at beginning and end, if quality score is 2 or lower, cut it off
* SLIDINGWINDOW:4:2: scan with a 4-base sliding window and cut once the average quality within the window drops below 2
* MINLEN:25: min length; drop read if below length 25
```
for filename in *.fastq.gz
do
    # first, make the base by removing fastq.gz
    base=$(basename $filename .fastq.gz)
    echo $base

    trimmomatic SE ${base}.fastq.gz \
        ${base}.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
done
```
* you can use a for loop to do this for all the files faster
* filename: this is the temporary name you give to each file ending in .fastq.gz that the loop runs through
* in *.fastq.gz: using anything that ends with .fastq.gz
* do: do the following for each iteration of the loop
* #: allows you to make a note to yourself that the program can't see. You are creating base names that consist of just the first alphanumeric sections of the .fastq.gz file
* base=$(): make "base" be a variable that consists of
* basename $filename .fastq.gz:
* basename: command that strips directory and suffix from filenames
* $filename: whatever the filename the loop is on
* .fastq.gz: tells the function to cut this off
* echo $base: show that base name
* curly brackets around the variable name (${base}) mark where the name ends, so you can glue text like .qc.fq.gz directly onto it
* done: ends the loop
* To save the for loop so you can rerun it: open nano, paste the loop in, and save it ("write out") under a name of your choice; you can then run that saved file with `source` followed by its name
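A minimal sketch of that save-and-run workflow, with `echo` standing in for the trimmomatic call so it runs anywhere (the `.fastq.gz` file names here are made up for illustration; a heredoc replaces nano to keep it non-interactive):

```shell
# two empty dummy input files to loop over (names made up)
touch ERR_A.fastq.gz ERR_B.fastq.gz

# write the loop to a script file (nano works too; a heredoc does it non-interactively)
cat > trim-all.sh << 'EOF'
for filename in *.fastq.gz
do
    # strip the directory and the .fastq.gz suffix to get the base name
    base=$(basename $filename .fastq.gz)
    echo "would trim ${base}.fastq.gz -> ${base}.qc.fq.gz"
done
EOF

# run the saved loop in the current shell
source trim-all.sh
```

The `echo` line is exactly where the real trimmomatic command would go; everything around it (the glob, `basename`, `${base}`) is unchanged from the loop above.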
* perform fastqc again to see the changes you made
* notice that since we were working with pretty high-quality sequence to begin with, and used extremely permissive trimming parameters, it didn't really trim much. The only place you can see the difference is the sequence length distribution, which now shows that a small number of reads got trimmed down or thrown out
* MultiQC will aggregate results across many samples into a single report (takes all the FastQC files and puts them into 1 report)
* `multiqc .`
* runs multiqc on the current directory, finds all the FastQC files, and makes a report out of them
* you can open the results in a web browser using R
* you can see the different naming to compare the trimmed vs. untrimmed files
* M Seqs: total sequence counts
* FastQC Mean quality scores: this is the graph that matters. Since they are all so close together, we don't see 4 different traces.
* FastQC per sequence quality scores: here you can see the difference after trimming
* per base sequence content: can see at the beginning there's some inconsistency in what bases were called
Jen R
==
CU Boulder, PhD student
Cyanobacterial experimental evolution
Booting from Jetstream:
* NSF's answer to Amazon Web services
* free!
* XSEDE is the blanket organization
* apply to XSEDE and then choose the kind of computing you need to use
* Jetstream is one of these kinds of computing, elastic cloud computing
* all you need to do is submit a paragraph of project description and then XSEDE gives you access (no rejections known)
* can apply for education resources (like this course)
How to use:
* XSEDE log in
* Username: dibbears
* Password: davis9.2018
* launch new instance
* make own project because we are all using the same account
* click projects, create new project (give name and description)
* they already got an image set up with everything we need for the class (search DIBSI 2018), it has lots (Ubuntu, docker, bioconda)
* click launch, change the image name and project and launch the image
* wait (a while, like 10 minutes, go grab coffee, enjoy the outdoors etc) until active with a green dot
Launching this from your terminal (w/ Windows):
* download mobaxterm
* confusion
* DONE! (I followed the instructions in the docs and it seems to have worked)
Command line meanings/tips/hints
* mkdir = make directory
* -p = parents, makes the necessary parent directories
* man + name of command gives you the manual (ie man mkdir)
* to get out of this press q for quit
* scroll/move around with the spacebar and arrow keys
* cd = change directory
* ~ home directory
* use tab to autocomplete directory names etc
* ls = list directories, shows you all the directories inside the directory you are currently in
* conda is a complicated package with a lot of secondary commands
* can use the install function to install the blast package
* -y answers yes to the installation of all the necessary packages in the installation process for the blast package (in "conda install -y blast")
* curl can go get files for us from the web
* -1 (number) gives you the information in one column
* -l (letter) gives you the long output (with a lot of info you probably don't need), shows you the permissions etc
* -sh1 gives a single column (1) of file sizes (s) in human-readable units (h)
* -a displays the hidden files
* the asterisk is the UNIX wildcard: \*.faa.gz tells the command to act on anything that ends with .faa.gz
* head displays the first 10 lines of the file
* tail displays the last 10 lines of the file
* tip: use the history, hit the up arrow to bring up the last command (which you can now edit)
* cat spews the contents of a file to the screen
* less opens a virtual screen to view the file in, rather than spewing
* leave this by pressing q
* move around with spacebar and arrow keys
* exit to log out
* history | tail will show you the last 10 commands in your history
* history | head will show you the first 10 commands in your history
BLAST:
* basic local alignment search tool
* very good at making comparisons between sets of sequences and IDing differences
* computationally intensive
* in command line we can access ALL the options that exist for BLAST, whereas the website only shows you some of the options
* makeblastdb tells blast we are making a database from a file and tells it the specifics of the database
* -in to explicitly tell it what file we are using
* after specifying the file we are using, define the type using -dbtype and the type (prot for protein)
* blastp to run a protein query against the database
* -query and the file we want to use as our source
* -db for the database we are searching against
* -out for where to save the output
* -outfmt (output format) changes how it saves the output
* for the details (ie column names) of option 6 (used here) click [here](http://www.metagenomics.wiki/tools/blast/blastn-output-format-6)
* -max_target_seqs 1 gives you only the best hit per query, can change the number as you wish
* -evalue sets the threshold e value, 1e-5 is a good option
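To get a feel for the `-outfmt 6` table, here is a sketch that filters a made-up results file by e-value with awk (the file contents are invented for illustration; real output comes from the blast query above). Column 11 of format 6 is the e-value:

```shell
# a made-up two-line BLAST result in outfmt 6 (tab-separated, 12 columns)
printf 'q1\ts1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n' >  hits.tsv
printf 'q2\ts2\t40.0\t80\t48\t2\t1\t80\t5\t84\t0.5\t30\n'     >> hits.tsv

# column 11 of outfmt 6 is the e-value; keep only hits below 1e-5
awk -F'\t' '$11 < 1e-5' hits.tsv
```

Only the first line (e-value 1e-50) survives the filter; the 0.5 hit is dropped.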
Day 2:
Recap
Commmand line stuff:
* sudo: super user do
* runs the command as the root user, overriding normal permission checks
* sudo passwd $USER sets a password
* the letters will not show up! this is normal
* $ before a name means it's a variable (e.g. $USER)
* echo means print
* chmod to change permissions
* the a prefix says we want to modify permissions for all users (owner, group, and others)
* a-w removes write permission
* a-r removes read permission
* create a virtual copy of the contents of one directory in another
* `ln -fs ~/data/* .`: -s makes symbolic links, -f overwrites anything with the same name in the destination; ~/data/* is what we are linking from, and . (the current directory) is where the links go
* the linked files tend to have all permissions granted
* why link instead of copying?
* avoids taking up a ton of file space
* also doesn't write over our original data
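A small self-contained sketch of the linking idea, using a local `data/` directory (made up here) instead of `~/data`:

```shell
# a fake data directory with one file in it
mkdir -p data
echo "ACGT" > data/reads.txt

# link (not copy) everything from data/ into a working directory
mkdir -p work
cd work
ln -fs ../data/* .
cd ..

# the link behaves like the original file but stores no second copy
cat work/reads.txt
```

`work/reads.txt` is a symbolic link, so reading it follows the pointer back to `data/reads.txt`; editing the original would be visible through the link too.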
BLAST stuff:
* the e-value is the number of hits of at least that score you would expect to see by chance; basically we want it to be low
* cut-offs can be useful, but where to set one depends on the situation; starting with 10^-5 is a good option
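One way to see why the -log10 scale is handy for e-values is a quick awk check (awk is assumed available, as on the class instances): 1e-5 maps to 5 and 1e-50 to 50, stretching the tiny values apart.

```shell
# -log10 turns tiny e-values into big, plottable numbers:
# -log10(1e-5) = 5, -log10(1e-50) = 50
awk 'BEGIN { printf "%.1f %.1f\n", -log(1e-5)/log(10), -log(1e-50)/log(10) }'
```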
R stuff:
* getwd() aka get working directory tells you where you are (like pwd in Unix)
* R is case sensitive
* use class() to determine the variable type
* use dim() to get the dimensions of a variable
* use View() to display the variable
* use colnames() to give useful names to the columns of a dataframe
* $ selects a named column from a dataframe (e.g. df$colname), unlike in Unix where it marks a variable
* log() is the natural log by default
* hist() plots the histogram of the variable given
* plot(a,b) plots the variables a and b against each other (with a on the x axis and b on the y axis)
* use ctrl-1 and ctrl-2 to move between the console and script
Short read quality trimming
* you've gotten libraries back from the sequencing place and have a big fastq file
* the file is filled with fragments of DNA that are 50-100 bp long (lots and lots of fragments)
* potential problems
* sequencing errors
* adaptor sequences
* read quality drop off as you move down the fragment (less of an issue with newer tech)
* installing fastqc, multiqc and trimmomatic with conda
* multiqc: can aggregate fastqc scores
* trimmomatic: trimming program of choice for this
* md5sum calculates a new value unique to each file based on the properties of the file
* this can be used as a quality check, often included as a separate file when given data to check against
* to run a check: md5sum -c followed by the checksum file given to you
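A self-contained sketch of the checksum round-trip (the file names here are made up; in practice the `.md5` file usually arrives alongside your data):

```shell
# make a file and record its checksum
echo "some sequence data" > sample.fq
md5sum sample.fq > checksums.md5

# later, e.g. after a download, verify the file is intact
md5sum -c checksums.md5
```

If the file were corrupted or truncated in transfer, the check would report FAILED instead of OK.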
* FASTQC files are not meant to be human readable, so don't worry too much
* Quality scores
* log10 scale: Q = -10 * log10(p)
* basically gives you the probability p that a base was called wrong (as determined by the source of the data, i.e. Illumina)
* generally want scores above 30ish
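Since Q = -10 * log10(p), the error probability is p = 10^(-Q/10); a quick awk table (awk assumed available) shows why ~30 is the usual target: Q30 means a 1-in-1000 chance the base call is wrong.

```shell
# error probability for a Phred score Q is p = 10^(-Q/10)
# e.g. Q10 -> 0.1, Q20 -> 0.01, Q30 -> 0.001, Q40 -> 0.0001
awk 'BEGIN { for (q = 10; q <= 40; q += 10) printf "Q%d -> p = %g\n", q, 10^(-q/10) }'
```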
* fastqc command provides you with a .zip and .html file that has quality info
* can open html from R
* per base sequence quality
* first few bp are often lower quality
* quality also often drops off over the last few bp
* per tile sequence
* generally just for the sequencing center
* per sequence quality score
* tells you the average quality score across each read; want this to be in the 30s/40s
* sequence duplication levels
* not always very useful
* only based on first 100,000 reads
* overrepresented sequence
* most likely this will tell you about the adapter sequences
* trimmomatic
* what happens if you don't trim your data?
* hard to align data that hasn't been trimmed/cleaned up
* aligning data also kinda filters data
* more about rescuing data, this helps make some data useful when not all the data is good, this used to be more important (but still good practice)
* if you are working with transcriptome data etc you may have a harder time aligning with the genome
* trimming out adapter sequences
* need reference to what the adapter sequences actually are
* trimmomatic has a lot of options
* a backslash at the end of a line continues the command on the next line, keeping a particularly long command readable
* [documentation!](http://www.usadellab.org/cms/?page=trimmomatic)
Twitter tutorial:
* lists
* data!
Rogan
===
login string: ssh -i angus_private_key dibbears@129.114.16.141
* Main focus: expression of proteostasis network genes across tissues/regions/cell-types, and throughout aging
Raquel
===
if you use less -S file you have a more organized view of the file.
if you put backticks around `code` you can emphasize code in your text.
Here is a markdown [cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).
Peri
===
Elias Oziolor
Kat Fowler
Jessica Zehnpfennig
Pearl Lam
Sateesh Peri
Maggi Brisbin
Tristan De Buysscher
Mike Maniscalco
Jen Reeve
Lisa Ma
Anna James
Joshua Kling
Rogan Grant
Nick Jensen
Richard Harris
Katarina Stuart
Suren Ambegaokar
Lisa Johnson
Amy Young
Michael Mann
Emma Wear
Asher Hudson
Jeff Rodzen
Jillian Adkins
Natalie Swinford
Yu-Chieh David Chen
Sari Mäntynen
Andrew Essin
Michael M
===
Andrew
==
ssh -i angus_private_key dibbears@149.165.170.39
rstudio link: http://js-170-39.jetstream-cloud.org:8787/
Joshua_K
==
Ph.D. Candidate University of Southern California
Temperature resilience of marine phytoplankton
*ubuntu = flavor of linux; docker = specific set of installs and packages
* cd && mv FILENAME . to move a file into your current directory
* chmod 600 leaves read and write permission for the owner only, removing all permissions for everyone else
* adding -y to conda automatically answers 'yes' to installation of any needed dependencies
* curl does not follow redirects by default (add -L), whereas wget will (e.g. an http vs https redirect)
* ls -sh1 is helpful, gives human readable size of files
* $ = variable in unix
* echo 'some text' $SOMEVARIABLE can let you print out things in long scripts that are human readable
* evalue = the number of equally good hits you would expect by chance, not a percentage; lower is better
* use -log to stretch out lots of low values, eg separating out evalues of blast output...
* what evalue you use as a cutoff depends on the question; for exploratory studies 10^-5 works, but when annotating with blast you may want to be more conservative, e.g. 10^-9
* LAST is a faster, more efficient alternative to BLAST
* ln = link which creates a virtual copy (symbolic link) to files in another directory... doesn't take up more space