---
tags: workshop, notepad
---
# Welcome to Data Carpentry Genomics Workshop UNR 2019
January 15-16th 2019
MIKC-107, UNR
workshop lessons: https://unr-omics.readthedocs.io/en/latest/
---
Post workshop survey: https://www.surveymonkey.com/r/dcpostworkshopassessment?workshop_id=2019-01-15-reno
---
## We will use this HackMD to share links and snippets of code, take notes, ask and answer questions, and whatever else comes to mind.
### Modes
The page displays a screen with three major parts:
Edit: See only the editor.
View: See only the result.
Both: See both in split view.
## Announcements
Bathrooms: gender neutral bathrooms - down the hallway
Code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
### Instructors:
* Elias Ozolor
* Sateesh Peri
### TA's:
* Richard Tillett
* Gurlaz Kaur
* Jeremiah Reyes
* Ning Chang
* Kyle Wang
### Introductions: Name, Institution, Email (optional), Twitter (optional). What is the last thing that you made?
Yue Wang, UNR, ywang2@nevada.unr.edu, yue97734575
Marina MacLean, UNR, mmaclean@unr.edu, zucchini chicken enchiladas
Vanessa Gutierrez, UNR, igutierrez@med.unr.edu, garlic-lemon mahi-mahi
Stephanie Otto, UNR, sotto@nevada.unr.edu
Jessica Reimche, UNR, reimchej@gmail.com, omelette
Salome Manska, UNR, salomemanska@gmail.com, Made my flight!
Tara Radniecki, UNR, tradniecki@unr.edu, cross-stich of avengers
Isadora Batalha, UNR, isadora.batalha@nevada.unr.edu, eggs
Anson Call, UNR, anson@nevada.unr.edu
Samantha Romanick, UNR, sromanick@nevada.unr.edu, yogurt
Chrystle Weigand, UNR, chrystleweigand@gmail.com, chow mein
Andrew Hagemann, UNR, andrew.hagemann@nevada.unr.edu, eggs
Chandra Sarkar, UNR, csarkar@nevada.unr.edu, https://twitter.com/genes_o_me, instant noodles
Joshua Hallas, UNR, jhallas@nevada.unr.edu
Mustafa Solmaz, UNR, msolmaz@nevada.unr.edu, tea!
Jess Danger, UNR Micro Dept, JDanger@med.unr.edu, last thing I made: knit dog sweaters
Bruna Alves, UNR, brunaalves@cabnr.unr.edu, tea
Avery Grant, UNR, averygrant@nevada.unr.edu, tea
Chanchanok (Saw) Sudta, UNR, csudta@nevada.unr.edu
Erica Shebs, UNR, eshebs@cabnr.unr.edu, chocolate cake!
Hayden McSwiggin, UNR, hmcswiggin@nevada.inr.edu,
Lauryn Eggleston, UNR, leggleston@nevada.unr.edu, blueberry waffles!!!
Hedy Wang, UNR, hetanwa@163.com
Lana Sheta, UNR, lsheta@nevada.unr.edu
Jennifer Schoener, UNR, jschoener@nevada.unr.edu, coffee
General notes
==
* Sometimes logging in to Atmosphere can lag and take a moment. Sateesh advises letting the page try to load for a bit, and if it fails, try again.
* **TODO: Add/link bioconda lesson in the TOC.** Until then, can be found at https://unr-omics.readthedocs.io/en/latest/bioconda-config.html
Questions for instructors/helpers?
===========
Q) _What size of instance on Atmosphere?_
A) **Medium2**
Q) What other types of data can you work with in CyVerse? For example, can you process neuroimaging (MRI) data using CyVerse if it's command line based?
A) CyVerse has more capabilities beyond Atmosphere; I will introduce CyVerse's Discovery Environment, which does not require command-line expertise
A) But yes, if things are command-line based, you can process them in CyVerse
Q) How do you decide how many CPUs and how much RAM to request?
A) It likely depends on your data and on the storage and hard drive space required by the tools you will run on it
A2) Beyond reading the papers, documentation, and forums related to a bioinformatics tool, it can sometimes help to simply test its performance on a single sample or other subset of your data. Often you can extrapolate from that to your entire set. -- RLT
Q) What are k-mers?
A) k-mers are all the possible subsequences of length k contained in a read; for example, the sequence ACGTA contains the 3-mers ACG, CGT, and GTA. Assemblers such as Trinity build their graphs from k-mers rather than from whole reads.
Q) I understand the purpose of doing a log transformation of RNA expression data. But why do we do a log2 transformation and not log10? Does it make any difference which base is used?
A) The base does not change the statistics, only the units; log2 is conventional because it makes fold changes easy to read: a log2 fold change of 1 means a doubling, and -1 means a halving.
Log of shell commands
==
if you want more information on any command, type `man` before the command name and press enter (this takes you to the manual entry for that command)
### How to take notes:
Sateesh Peri
==
This is a [link](http://angus.readthedocs.io/en/2018/)
* point one
* point two
* point three
## Logging onto Atmosphere Cloud
With genomics projects, data are too large to be handled by local resources, so there is a need for high-performance computing clusters, e.g. Pronghorn, UNR's HPC. But you cannot easily install software on HPCs, and you have to wait in a queue for your jobs to finish. An alternative is to use virtual machines in the cloud, such as the CyVerse Atmosphere instances we are using for this workshop. You can get your own allocation for your lab for free by requesting it in a form from CyVerse. You can choose the number of CPUs, hard drive, storage, etc. on demand and delete the instance once you finish analysing your data.
We will start by setting up our instances on CyVerse. Go to https://atmo.cyverse.org/application/images and log in; the website may take a minute to show up. Go to the Projects tab and create a new project; you can name it anything you like, e.g. DC_Genomics. Once that is done, go to the New tab and select Instance. There is already an image with all the software/tools needed for this workshop installed: go to the Show All tab and select the UNR-RNAseq image. Now we shop for the specifications of our instance (sounds fun!). Select the Medium2 instance size, which is big enough to run all the lessons of this workshop; 16 GB of RAM is required for assembling the transcriptomes. Once you launch your instance, it takes quite some time for it to become active.
The virtual machine we set up is remote, so how do we keep our data safe? SSH is a secure way to do that. Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network. Typical applications include remote command-line login and remote command execution, but any network service can be secured with SSH (Wikipedia definition).
### Create the RSA Key Pair
Now we need to set up a key to log in to our instances from the terminal on the local machine (our laptops). For that we will follow the commands from the lesson. Windows users should have MobaXterm running by now so that they can run the command lines to talk to their operating system. Mac users can open up the Terminal and follow the lesson commands to generate a random key, which is a secure and easier way to log in to our instances. Once the key is generated, we need to copy it into Atmosphere: below your user name go to Settings, click Show More, and paste your key in the SSH Configuration section.
We have a key on our local machine, and the matching public key is deposited in our instance, so we can log in to our instance from our laptops. The key generated is random and hard to crack, so your data should be safe.
Now we go back to the terminal on our laptop and type ssh followed by your CyVerse userid@[IP address of your instance], then press enter. It will ask a few yes/no questions. Windows may ask for a password; this is your CyVerse password. It is successful when you see a big ATMOSPHERE banner written in your terminal window: you are now logged in to your instance from your laptop.
On Windows, if you did not type the password correctly, or just followed the lesson and pressed enter instead of the password, you might get a port 22 connection timeout error. To solve this, just delete your instance on Atmosphere and create a new instance with a new IP address. You don't need to generate keys again; the key is already saved and stays the same. Just change the IP address after your userid in the terminal and try signing in again. A reminder for Windows users: the password is your CyVerse password.
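A minimal sketch of the key-generation and login steps described above; the username and IP address are placeholders you must replace with your own:
```
# on your laptop: generate an RSA key pair (press enter to accept the defaults)
ssh-keygen

# print the public key so you can paste it into Atmosphere's SSH Configuration
cat ~/.ssh/id_rsa.pub

# log in to your instance (myuser and 128.196.0.0 are placeholders)
ssh myuser@128.196.0.0
```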
## Using the command line
We follow this lesson on our terminal.
We type any command and press enter to execute it.
* `ls` list items in the directory
* `ls -lh` gives a human-readable long list of the items with their sizes, the type of each item (directory, text file, etc.), and whether we can read or write them
* `man ls` opens the manual of the typed command after man. Type q to quit manual page.
Open `man ls` and try three of its `-` options.
* `ls -a` list all items, even the hidden ones (with . in front of the file names)
* `ls -l` use a long listing format
* `ls -d` list directories themselves, not their contents
* `ls -r` reverse order of sorting items while listing
* `ls -S` sort by file sizes, largest first
* `ls -t` sort by time, newest first
* `ls -ltu ` this is -u with -lt: sort by, and show, access time; with -l: show access time and sort by name; otherwise: sort by access time, newest first
* `wc` word count
* `wc -l` count the number of lines
* `wc -c` count the bytes
* `wc -m` count the characters
We can use pipes `|` to push one command's results into another command.
* `ls | wc -l` counts the number of items in the current directory
* `pwd` prints the path for the current directory
* `cd` change directory
* `cd` followed by the path of the directory you want to go to takes you to that directory
* `ctrl + c` kills any command that did not work
the **'tab'** button on your keyboard is very handy. It can be used to autocomplete the file names/directory names which are below/inside of the current directory.
* cd followed by a space and two dots (`cd ..`) takes you to the directory above the current one (only one level up)
* `cd ~` takes you the home directory from anywhere you are
* cd followed by enter (just `cd`) also takes you to the home directory
* `cd` followed by full path can be used to jump between directories
* `ctrl+shift+C` can be used to copy in the terminal
The absolute path is the path you get by using pwd. It is the complete address of the directory starting from the root directory.
The command line is case sensitive, so a command typed in the wrong case will not work.
Always be careful while deleting anything, because deleted files do not go to a trash folder: they are deleted forever.
* `mkdir` makes a new directory
* `touch filename` makes a new file.
e.g.: `touch simple.txt`
creates a text file named simple.txt
* `rmdir directoryname` deletes the directory when it is empty
* `rm` means remove, but on its own it can delete only files; used on a directory, you will get an error that it is a directory and cannot be removed
* `rm -rf directoryname` force deletes the mentioned directory
* `*` is a wild card which means all. So if you type `rm -rf *` it will delete everything in the current directory
* `nano document.txt` nano is a text editor that can be used to write inside the text file we just created, or any named text file. Write something, then exit by pressing `ctrl+x`; it will ask if you want to save changes, choose yes
* `cat document.txt` spit out the contents of the file in the terminal window
* `cat document.txt | head` #by using head in a pipe it only shows the head of the file, not all of it. Imagine you have a really big text file, say a huge book: if you cat it, it will spit words onto the window forever, so head comes in very handy here
* `cat document.txt | head -n 3` #only lists first three lines of the document
* `cat document.txt | tail -n 3` #spits up last 3 lines of the document.txt
* `rm document.txt` #deletes the mentioned text file
* `curl -O [weblink]` downloads the file into the current directory. The name stands for 'client URL'
* `mv` means move; it can be used to move files from one directory to another. If you use it within the same directory, it moves the file onto a new name in place, so it can be used to rename files, e.g. `mv oldname newname`
* `zcat` #reads any zipped files
* `zcat name of the file | grep "^>"` #goes to the file and only spits out lines that begin with >
* `zcat [name of the file] | grep "^>" | head` only first few lines from the file having > in the beginning will be written out
* `grep` searches for PATTERNS in each FILE. PATTERNS is one or more patterns separated by newline characters, and grep prints each line that matches a pattern
* `grep -c` the -c option counts how often the pattern occurs
* `zcat name of the file | grep "^>" | wc -l` counts the number of lines with > in the start
* `zcat [name of the file] | tr "_" "." | grep "^>" | head` read the file, pipe to tr, which translates _ to ., then grep the lines starting with > and show only the head on the window
* `tr` can be used to translate one character to another throughout a file, such as replacing , with . or replacing , with \t to change commas to tabs
* `zcat [file name] | grep -v "^>" | wc -m` grep -v inverts the match, selecting anything other than the pattern that follows, here `^>`; so this command counts the characters in everything except the lines that start with >
* `zcat [file name] | grep -v "^>" | head` to check it is doing what we want, i.e. excluding the header lines from the FASTA file
* `less` is an alternative to cat and is great for glancing at/through a file
* `ctrl+l` clears all the contents on the terminal window
* `ctrl+c` kills the running function or command
You can use the **history** command to view the commands you used before, e.g. to search for the commands containing your username:
* `history | grep "username"`
In order to do other jobs while you are running a program, you can use **screen** or **tmux**. They give you another working space to execute other jobs.
* `tmux` launches a "terminal multiplexer" that intercepts HUP (hangup) signals which might otherwise interrupt work when you close your laptop or lose internet
* `tmux list-sessions` (or `tmux ls` in super-duper shorthand) to view if you have any tmux sessions running on the server
* `tmux attach -t 0` (or `tmux a -t 0`)
to reconnect a tmux session (in this case, the session name is 0)
tmux cheat sheet: https://tmuxcheatsheet.com
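A typical tmux workflow sketch, assuming a long-running job such as the Trinity run later in this workshop (ctrl-b is tmux's default prefix key):
```
tmux                  # start a new session
# ...launch your long-running job here...
# detach with ctrl-b then d; the job keeps running on the server
tmux ls               # after reconnecting, list your sessions
tmux attach -t 0      # reattach to session 0
```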
* `screen`
screen cheat sheet: https://gist.github.com/jctosta/af918e1618682638aa82
**sed** is the command for replacing or substituting the string. e.g: to replace the 'TRINITY' to 'NEMA' in the file:
* `cat <filename> | sed 's/TRINITY/NEMA/g' > <new file name>` _<> is a place holder for your required filenames_
* `cat [file name] | grep "^>" | awk '{OFS=" "}{print $2}' | sed 's/len\=//g' | sort -rn | head` we cat the file and grab only the header lines, then use awk (treating columns as separated by spaces) to print only column 2, which holds our transcript lengths. We then use sed to replace the text len= with nothing, leaving only the transcript lengths, and sort numerically in reverse, so the head shows the lengths of our longest transcripts in decreasing order
More information on sed, tr and other commands: https://astrobiomike.github.io/bash/six_commands#tr
## Bioconda
A useful link: Dr. Tillett's notes on conda: https://github.com/rltillett/conda_notes
Lesson link: https://unr-omics.readthedocs.io/en/latest/bioconda-config.html
Conda installations make life much, much easier: no need to install dependencies separately or to move tools into your path so the operating system can find them.
Bioconda is a channel for the conda package manager (specializing in bioinformatics software).
after installing conda (it's already installed on this instance), we need to let the instance know the path where conda lives:
`echo export PATH=$PATH:/opt/miniconda3/bin >> ~/.bashrc`
* `>` sign is used for redirecting the output to a file:
* `>>` appends to a file
* `>` overwrites the file
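A tiny demonstration of the difference; demo.txt is just a throwaway example file:
```
echo "first line" > demo.txt     # creates (or overwrites) demo.txt
echo "second line" >> demo.txt   # appends a second line
cat demo.txt                     # shows both lines
echo "third line" > demo.txt     # overwrites: only this line remains
```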
run the **source** command to execute the contents of ~/.bashrc:
`source ~/.bashrc`
adding channels:
```
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
```
you can use the **conda search** command to search for packages and display the information. e.g. search for a specific package named 'sourmash':
`conda search sourmash`
to install a package, you can use the **conda install** command, e.g.:
`conda install -y checkm-genome`
to list installed packages, use:
`conda list`
"Environments" are multiple different collections of installed software. Let's create a new environment named pony by using **conda create** command:
`conda create -n pony`
to activate/deactivate the environment, use:
```
source activate pony
source deactivate pony
```
to list the environments, type:
`conda env list`
you can also save this list (which software you have installed) for this particular environment by typing:
`conda list --export > packages.txt`
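Putting these commands together, a sketch of installing a tool inside its own environment (salmon is just an example bioconda package here):
```
conda create -y -n mapping salmon   # create an env named "mapping" containing salmon
source activate mapping             # switch into the environment
salmon --version                    # the tool is on your PATH only while activated
source deactivate                   # switch back out
```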
## RNA-Seq Workflow
* Biological samples/Library preparation
-- Technical replicates are not necessary
-- Biological replicates are more important
* Sequence reads
* Quality control
* Map to the reference genome
* Count the reads
* Statistical analysis to identify differentially expressed genes
## Short read quality and trimming
Now, let's log in to your Atmosphere computer. Make sure you've added conda to your PATH and applied the change with the source command.
create the directories and subdirectories:
`cd ~/`
`mkdir -p work work/data`
`cd work/data`
the -p option of the mkdir command creates any intermediate directories for a new directory if they do not already exist.
download a subset of the data by using the curl command:
`curl -O http://where.is.my/FILENAME.zip`
unzip the file by using the unzip command
`unzip [FILENAME.zip]`
#define your $PROJECT variable:
`export PROJECT=~/work`
then you can check your data by typing:
`ls $PROJECT/data`
use the **less** command to view your fastq files. More on the FASTQ format:
https://en.wikipedia.org/wiki/FASTQ_format
print out the number of files in your PROJECT location by using the **printf** command (`set -u` makes bash treat unset variables like $PROJECT as errors, so the command fails loudly if the variable is not defined; `set +u` turns that check back off):
`set -u`
`printf "\nMy raw data is in $PROJECT/data/, and consists of $(ls -1 ${PROJECT}/data/*.fastq | wc -l) files\n\n"`
`set +u`
link your data into your working directory by using the **ln** command with the **-s** option, which avoids having to make a copy of the files that would take up storage space, e.g.:
`ln -s ../data/*.fastq .`
the dot at the end `.` means right here (current directory)
#run **FastQC** program on the files that end with .fastq:
`fastqc *.fastq`
#the **scp** (secure copy) command lets you copy files from the remote server to your personal laptop
The first argument after scp is your username@[atmosphere IP address]: followed by the full path of the files. The second argument is the path where the files should be placed on your own computer.
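A sketch of pulling the FastQC reports back to your laptop; the username, IP address, and paths are placeholders to replace with your own:
```
# run this on your LAPTOP, not on the instance
scp myuser@128.196.0.0:~/work/data/*_fastqc.html ~/Desktop/
```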
### Trim the sample:
Create a directory named trim, and navigate to that directory.
```
cd ..
mkdir trim
cd trim
```
Link the fastq data right here (.)
`ln -s ../data/*.fastq .`
Create a FASTA (.fa) file containing the adapter sequence information:
`cat /opt/miniconda3/share/trimmomatic*/adapters/* > combined.fa`
Use the **for** loop to run the trimmomatic program:
The basic concept of a for loop is to do the same thing for each _thing_ in a _list of things_ (a concrete sketch follows the block below):
```
for thing in [a list of things]
do
trimmomatic
done
```
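As a concrete sketch of what that loop might look like here, assuming the linked *.fastq files and the combined.fa adapter file from above (the quality and length settings are illustrative, not necessarily the lesson's exact values):
```
for filename in *.fastq
do
  # strip the extension so the output can be named after the input
  base=$(basename $filename .fastq)
  trimmomatic SE ${base}.fastq ${base}.qc.fq \
    ILLUMINACLIP:combined.fa:2:30:10 \
    SLIDINGWINDOW:4:2 MINLEN:25
done
```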
Here is the info for trimmomatic:
http://www.usadellab.org/cms/?page=trimmomatic
## GitHub: How not to lose your entire analysis!
GitHub allows version control of all your analysis: if you make a change and later want to go back to an older script, saving versions with git lets you choose among and work with the different versions without losing the older steps.
First of all we need a github account. Here is the link for github: https://github.com
Link for git commands cheat sheet: https://services.github.com/on-demand/downloads/github-git-cheat-sheet.pdf
Git is a program that can initialize and track your repositories.
* `git init` initializes an empty repository for git
* `ls -lah` shows the hidden .git repository that was created
* `git config --global user.name "your human name goes here"` **tells git who you are**
* `git config --global user.email "your email goes here"` **tells git where it can find you**
* `git status`
* `git add` followed by a directory or file name stages it for the next commit
* `git commit -m "Trimmed and quality control files for UNR workshop"` commits all staged changes with a message inside the "". Commit messages are required, so use `-m` or git will force you to write one (it will even launch the tricky `vi` editor to make you write one, so use `-m`!)
* `git log` shows each commit's unique identifier, who made it, and what the commit message was
Now we go to github web account and go to create a new repository. You can name it anything or UNR-workshop. Then create this public repository.
* `git remote add origin [URL]` where the URL points at the repository we just created on the website
* `git push -u origin master` pushes all the files that you committed up to the UNR-workshop repository. You can go to the website and check that the files were uploaded. Good job!
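Putting the whole sequence together, a sketch assuming a hypothetical GitHub username myuser and a repository named UNR-workshop:
```
cd ~/work
git init
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git add trim/
git commit -m "Trimmed and quality control files for UNR workshop"
git remote add origin https://github.com/myuser/UNR-workshop.git
git push -u origin master
```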
## Day 2 Workshop
please put down your name below to mark your attendance
Sateesh Peri
Elias Ozolor
Ning Chang
Vanessa Gutierrez
Salome Manska
Andrew Hagemann
Richard Tillett
Chandra Sarkar
Mustafa Solmaz
Jessica Reimche
Marina MacLean
Chanchanok Sudta
Erica Shebs
Jess Danger
Edgar Torres
Lauryn Eggleston
Hayden McSwiggin
Kyle Wang
Lana Sheta
Jennifer Schoener
## De novo transcriptome assembly
How do you assemble a transcriptome? When you build a transcriptome, you should generally sample all possible variation (e.g. different life stages, different tissues) in your organism. There are two ways to do transcriptome assembly: splice-aware alignment to a reference genome, and de novo transcriptome assembly.
Trinity is an assembler. Trinity works in k-mer space (k=25). Different assemblers start with different k-mer sizes; how do you decide which k-mer size to use?
Trinity has four stages:
* Jellyfish
Extracts and counts k-mers from the reads
* Inchworm
Greedily assembles the k-mers into linear contigs
* Chrysalis
Clusters related contigs and builds a de Bruijn graph for each cluster
* Butterfly
Traces reads through the graphs to report full-length transcripts and isoforms
Let's try to assemble a transcriptome now. We will follow the lesson notes from here on: https://unr-omics.readthedocs.io/en/latest/transcriptome-assembly.html
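For orientation, a hedged sketch of what a Trinity run on trimmed single-end reads can look like (the file name and the memory/CPU values are placeholders; the lesson notes have the exact command):
```
time Trinity --seqType fq \
  --max_memory 16G --CPU 4 \
  --single 0Hour_ATCACG_L002_R1_001.qc.fq \
  --output trinity_out
```
Wrapping the command in `time` reports how long the run took (see the Day 2 notes below).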
**Contig Nx values**: the length N such that x% of the total assembled bases are contained in contigs of length N or longer (rank the contigs by length and walk down the list until you have covered x% of the assembly)
**Annotation**
https://angus.readthedocs.io/en/2018/dammit_annotation.html
**Evaluation**
https://dibsi-rnaseq.readthedocs.io/en/latest/evaluation.html
## Read Quantification
Once we have the aligned reads, the next step is to count them per transcript.
Indexing is the step that extracts the information from your transcripts (or genome) into a structure that is fast to search.
Kallisto-Sleuth tutorial: https://sateeshperi.shinyapps.io/kallisto-sleuth/
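The Day 2 notes below mention a salmon for loop, so here is a hedged sketch of indexing the assembly and quantifying one sample with salmon (file names are placeholders):
```
# build an index from the Trinity assembly
salmon index -t nema-transcriptome-assembly.fa -i nema_index

# quantify one single-end sample against the index
salmon quant -i nema_index -l A \
  -r 0Hour_ATCACG_L002_R1_001.qc.fq \
  -o 0Hour_quant
```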
## Differential expression analysis with DESeq2
Before doing DE analysis, the read counts should first be normalized. After normalization, unsupervised clustering analysis should be performed for quality control.
Let's run RStudio as a server interface in your browser. Print the link with:
`echo http://$(hostname):8787/`
What is gene dispersion? (answered by Michael Love, author of DESeq, and Gordon Smyth, author of EdgeR) https://support.bioconductor.org/p/75260/
### Working in Rstudio:
Here is the link for R and Rstudio:
https://datacarpentry.org/R-ecology-lesson/00-before-we-start.html
Here is the link for DESeq2 in Rstudio:
https://unr-omics.readthedocs.io/en/latest/DE.html#working-in-rstudio
* `library()` load a package
* `setwd()` set the working directory
* `<-` or `=` to assign a value
* `?` In R, you can always use ? to get the help of the functions.
* `list.files()` to print out a vector of the names of files
* `file.path()` a way to build the path to a file
* `read.csv()` a function that reads the data in (creating a dataframe)
* `colnames()` to view the column names of the file
* `head()` to view the first parts of the contents
* `dim()` to view the dimensions of an object or a dataframe
* `tximport()` import transcript-level abundances and counts
* `DESeqDataSetFromTximport()` build a DESeq2 dataset from tximport output
* `DESeq()` run the full DESeq2 pipeline (normalization, dispersion estimation, testing)
* `plotDispEsts()` plot the dispersion
* `rlog()` regularized log-transformed values
* `plotPCA()` to separate the samples based on their grouping/treatments (see the sketch after this list)
PCA (Principal Components Analysis) is another way to visualize sample-to-sample distances. It's part of the quality control of your RNA-Seq data.
* `results()` extract the results from a DESeq analysis
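A hedged end-to-end sketch of how these functions fit together, assuming salmon quantification directories, a samples.csv table with a condition column, and a tx2gene.csv transcript-to-gene map (all file and column names here are placeholders):
```
library(tximport)
library(DESeq2)

samples <- read.csv("samples.csv")                    # one row per sample
files <- file.path("quant", samples$sample, "quant.sf")
tx2gene <- read.csv("tx2gene.csv")                    # transcript -> gene map

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples,
                                design = ~ condition)
dds <- DESeq(dds)                     # normalization, dispersion, testing
plotDispEsts(dds)                     # QC: dispersion estimates
rld <- rlog(dds)                      # regularized log transform
plotPCA(rld, intgroup = "condition")  # QC: sample clustering
res <- results(dds)                   # table of differential expression results
head(res)
```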
### Other links on R for further learning:
https://dss.princeton.edu/training/RStudio101.pdf
https://www.rstudio.com/online-learning/
http://web.cs.ucla.edu/~gulzar/rstudio/basic-tutorial.html
https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
http://datacarpentry.org/semester-biology/lectures/
### Useful link for DESeq2
https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf
Fancy volcano plots: https://bioconductor.org/packages/release/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html
https://www.datacamp.com/courses/rna-seq-differential-expression-analysis
Jessica Reimche
===
How to decide how many CPUs and how much RAM? (It likely depends on your data and on the storage and hard drive space required by the tools you will run on your data)
Chandra Sarkar
===
cyverse is FREE!!! Yeah!
#document/text editor nano::
$ nano document_name
#reads the first 10 lines(default) of the document::
$ cat document.txt | head
#reads the first 3 lines of the document::
$ cat document.txt | head -n 3
#reads the last 3 lines of the document::
$ cat document.txt | tail -n 3
#curl = client URL; downloads from (any) URL
#zcat = reads any zipped document
#visualizing the header lines::
$ zcat fhet.tr.fna.gz | grep "^>" | head
$ zcat fhet.tr.fna.gz | grep "^>" | wc -l
#translating a character into another::
$ zcat fhet.tr.fna.gz | tr "_" "." | grep "^>" | head
$ zcat fhet.tr.fna.gz | grep -v "^>" | wc -m
#for double checking the previous command::
$ zcat fhet.tr.fna.gz | grep -v "^>" | head
#one redirect symbol (>) = it's going to replace the contents of the entire file
#two redirect symbols (>>) = it will not replace the contents of a file but will instead append
#set -u makes bash treat unset variables as errors, so a script terminates at the first use of an undefined variable; set +u turns the check back off
#IMP = for viewing the files after secure copy as html files::
firefox ~/Desktop/nema_fastqc/0Hour_ATCACG_L002_R1_001_fastqc.html
#IMP = every time you are kicked out of the ssh connection, you need to define your variables again, OR you can initially modify the bash profile and declare them there.
DAY 2 - - - - - - -
For de novo assembly, it is better to use a smaller number of organisms, so that during assembly the assembler does not confuse original bases with polymorphic bases.
Trinity is just one assembler. Most people use multiple assemblers and combine the results to derive the final assembly.
#declaring variable after being logged out of workspace
export PROJECT=~/work
#why use the command time with Trinity?
It reports how long the program took to run, which gives you an idea of what to expect when you run the program again with a different set of data (of a different size).
#to save your work from getting broken if you lose connection
$ tmux list-sessions
$ tmux attach -t 0 ###because session number 0
$ exit ### exit
$ tmux
#The no. of transcripts will ALWAYS be different. Always use the latest version and use CONDA.
#if the parent directory does not exist, use -p with mkdir
$ mkdir -p test1/test2
#The index command extracts a file that is easier to search through and does not slow down the computer; because everything is pulled into memory, the index file speeds up the whole mapping process.
#in the for loop for salmon it is not necessary to extract the basename and do the other steps, BUT it is used as a step check that each file is accessed; it's basically QC.
Stephanie Otto
===
What other types of data can you work with in CyVerse? For example, can you process neuroimaging (MRI) data using CyVerse if it's command line based?
Anthony Harrington
===
zcat fhet.tr.fna.gz | tr "_" "." | grep "^>" | head
#convert to tab delim?
tr "," "\t"
#to get fasta size
zcat fhet.tr.fna.gz | grep -v "^>" | wc -m
cat nema-transcriptome-assembly.fa | sed 's/TRINITY/NEMA/g'
# Resources
Shell novice guide: https://unr-dcg19.slack.com/archives/CDZ9692LA/p1547588745016300
https://www.tldp.org/LDP/Bash-Beginners-Guide/html/
https://linuxconfig.org/bash-scripting-tutorial-for-beginners
6 unix commands worth knowing: https://astrobiomike.github.io/bash/six_commands#tr
Datacamp RNAseq analysis pdf's :
https://unr-dcg19.slack.com/files/UDYGWD2RE/FFEPGL885/datacamp_rnaseq1.pdf
https://unr-dcg19.slack.com/files/UDYGWD2RE/FFDV6R8TB/datacamp_rnaseq2.pdf
https://unr-dcg19.slack.com/files/UDYGWD2RE/FFDR9SMA4/datacamp_rnaseq3.pdf
https://unr-dcg19.slack.com/files/UDYGWD2RE/FFDR9USHW/datacamp_rnaseq4.pdf
Datacamp course on RNAseq DE analysis: https://www.datacamp.com/courses/rna-seq-differential-expression-analysis