# Sequence-a-Genome 2022
Jason - williams@cshl.edu
Anna - feitzin@cshl.edu
## 2022 Species - Chamaecrista glandulosa var. mirabilis
- Biorxiv [paper](https://www.biorxiv.org/content/10.1101/2022.03.04.483036v1.full)
- NatureServe [link](https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.149103/Chamaecrista_glandulosa_var_mirabilis)
- Genome size: Unknown (less than 700Mb); [paper](https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-019-4152-0)
### Learning Resources
- CyVerse [link](https://learning.cyverse.org)
- Genomics data carpentry: https://datacarpentry.org/lessons/#genomics-workshop
- [Shell lesson](https://datacarpentry.org/shell-genomics/)
**General Coding**
- CodeCademy: [link](https://www.codecademy.com/)
- Hour of code (also in languages other than English): [link](https://code.org/learn)
**Software installations**
Be sure you have permission to install software
- Try Ubuntu: [link](https://tutorials.ubuntu.com/tutorial/try-ubuntu-before-you-install#0)
- Python: [link](https://www.python.org/downloads/)
- Jupyter: [link](https://jupyter.org/)
- Wing IDE: [link](https://wingware.com/)
- Atom text editor: [link](https://atom.io/)
**Bioinformatics**
- Learn bioinformatics in 100 hours: [link](https://www.biostarhandbook.com/edu/course/1/)
- Rosalind bioinformatics: [link](http://rosalind.info/about/)
- Bioinformatics coursera: [link](https://www.coursera.org/learn/bioinformatics)
- Bioinformatics careers: [link](https://www.iscb.org/bioinformatics-resources-for-high-schools/careers-in-bioinformatics)
**Help**
- General software help: [link](https://stackoverflow.com/)
- Bioinformatics-specific software help: [link](https://www.biostars.org/)
### DNAi Videos
- Sequencing project animation [link](https://youtu.be/-gVh3z6MwdU)
- Beginnings of the Human Genome Project at the Cold Spring Harbor Laboratory, James Watson [link](https://dnalc.cshl.edu/view/15445-Beginnings-of-the-Human-Genome-Project-at-the-Cold-Spring-Harbor-Laboratory-James-Watson.html)
- Importance of genetic maps, Mary-Claire King [link](https://dnalc.cshl.edu/view/15128-Importance-of-genetic-maps-Mary-Claire-King.html)
- Compiling the data from the Human Genome Project, Jim Kent [link](https://dnalc.cshl.edu/view/15305-Compiling-the-data-from-the-Human-Genome-Project-Jim-Kent.html)
- Using computers to predict how genes within the human genome, Craig Venter [link](https://dnalc.cshl.edu/view/15358-Using-computers-to-predict-how-genes-within-the-human-genome-Craig-Venter.html)
- Finding genes in the human genome, Ewan Birney [link](https://dnalc.cshl.edu/view/15291-Finding-genes-in-the-human-genome-Ewan-Birney.html)
- The public Human Genome Project's DNA donors, Eric Lander [link](https://dnalc.cshl.edu/view/15327-The-public-Human-Genome-Project-s-DNA-donors-Eric-Lander.html)
- The first draft of the human genome, Ari Patrinos [link](https://dnalc.cshl.edu/view/15343-The-first-draft-of-the-human-genome-Ari-Patrinos.html)
- Relating a gene to a sequence of amino acids, Sydney Brenner [link](https://dnalc.cshl.edu/view/15279-Relating-a-gene-to-a-sequence-of-amino-acids-Sydney-Brenner.html)
---
### Other Important Links
- Human genome at NCBI: [link](https://www.ncbi.nlm.nih.gov/genome/guide/human/)
- Markdown cheatsheet: [link](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
- AllofUs: https://allofus.nih.gov/
### Laboratory
- Plant DNA extraction: [link](https://www.promega.com/products/nucleic-acid-extraction/genomic-dna/high-molecular-weight-dna-extraction-kit/?catNum=A2920#protocols)
- Microbial swab DNA extraction: [link](https://jasonjwilliamsny.github.io/stars-2022/documentation/microbiome_dna_isolation/)
### Software
**Software Utilities**
- PuTTY for Windows: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe
- Install Docker on Ubuntu: https://docs.docker.com/engine/install/ubuntu/
- Miniconda: https://docs.conda.io/en/latest/miniconda.html#linux-installers
- Bioconda: https://bioconda.github.io/user/install.html
- Samtools manual: http://www.htslib.org/doc/
- IGV: https://software.broadinstitute.org/software/igv/
- Filezilla (client): https://filezilla-project.org/
## Accounts
|Usernames|
|--------|
|burton|
|frankenberger|
|feitzinger|
|galvin|
|gilman|
|lauto|
|maloney|
|patel|
|thole|
|wong|
|williams|
### Platforms
- CyVerse Atmosphere: https://atmo.cyverse.org
---
## Markdown tutorial
This is regular text
*This is a list*
- Item one
- Item two
- Item three
This is code (Python):

    print("Hello world")

This is a URL
- [Google homepage](http://www.google.com)
HTML works too
<iframe width="256" height="144" src="https://www.youtube.com/embed/E9-Rm5AoZGw" allowfullscreen></iframe>
---
## Linux Commands
The format of a command generally looks like:

    command -(flag) (argument)

Example:

    ls -l my_dir

`-l` is a flag, and `my_dir` is an argument.
You may use multiple flags and/or arguments depending on the command.
| Command | Function |
| -------- | -------- |
| whoami | Returns the current user |
| pwd | "Print Working Directory" <br/> Returns current folder |
| ls | Lists files in the current directory<br/> Returns extra information with -l or -F flags<br/> Defaults to current directory, path can be provided as argument|
| cd | "Change directory" <br/> Changes current folder |
| mkdir | Creates a directory <br/> Requires a name as an argument <br/> Flag -p also creates any missing parent directories in the given path |
| mv | Moves or renames a file <br/> Requires the file to move and a destination (a folder or a new name) as arguments <br/> Moving a file into a directory that contains a file of the same name will overwrite it |
| cp | Copies a file <br/> Requires the file to copy and a destination (a folder or a new name) as arguments <br/> Can copy a directory and its contents with the -r flag |
| rm | Deletes (removes) a file or directory <br/> Removes a directory and its contents recursively with the -r flag|
| head | Returns the beginning of a file <br/> -n flag specifies the number of lines |
| tail | Returns the end of a file |
| cat | Returns entire file |
| grep | Searches for a pattern in a file and returns each line that matches <br/> Requires a search pattern and a file to search as arguments <br/> Example: *grep CGCATC SRR097977.fastq* - Searches for "CGCATC" in the file SRR097977.fastq <br/> Flag -A(number) also returns (number) lines after each matching line <br/> Flag -B(number) also returns (number) lines before each matching line <br/> In a FASTQ file each read takes four lines: <br/> (1) @metadata <br/> (2) sequence <br/> (3) +metadata <br/> (4) information about the quality <br/> so *-B1 -A2* around a matching sequence line returns the whole record |
| wc | "Word count" <br/> Returns the number of lines, words, and bytes in a file <br/> Returns only the number of lines with -l flag |
| echo | Prints any arguments passed as an output. <br/> Example: *echo \*.fastq.gz* returns all fastq.gz files that would be given if used with another command |
| sudo | "Super user do" <br/> Runs a command with administrator privileges, used to access restricted files or operations <br/> *sudo su* : super user do switch user (opens a root shell) |
| exit | Exits the current shell or session <br/> Example: use *exit* to leave the root shell started by *sudo su* |
| wget | "Web get" Gets a file from the web |
| gzip | Compresses or expands files <br/> Flag -d will decompress the file |
| man | Returns the "manual" of potential arguments and flags for a command |
| Special Commands | Function |
| -------- | -------- |
| **\.** | Refers to the current directory |
| **\.\.** | Refers to the parent directory (one level up) |
| \* | "wildcard", allows you to search with any unknown sequence, for example "\*.fastq" returns .fastq files with any name |
| \| | "pipe", gives output of command before the "\|" symbol as input to the command after the "\|" symbol |
| && | Runs multiple commands in one line; the second command runs only if the first one succeeds <br/> Example: *cd /home/user && ls* moves user to a certain directory and lists the files of that directory |
| \> | Redirects output of command before the "\>" symbol into a file provided after the "\>" symbol, overwriting that file if it already exists |
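
The `grep` entry above describes a FASTQ record as four lines (metadata, sequence, a "+" line, and quality). As a minimal sketch of the same kind of search in Python, the snippet below groups lines into four-line records and keeps the records whose sequence contains a pattern. The function names and the two toy reads are made up for illustration and are not taken from SRR097977.fastq.

```python
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, plus line, quality)
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group an iterable of text lines into four-line FASTQ records."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def search_reads(lines: Iterable[str], pattern: str) -> List[Record]:
    """Return whole records whose sequence line contains the pattern."""
    return [rec for rec in read_fastq(lines) if pattern in rec[1]]


# Toy data: two hypothetical reads
example = [
    "@read1", "AACGCATCGG", "+", "IIIIIIIIII",
    "@read2", "TTTTGGGGAA", "+", "IIIIIIIIII",
]

hits = search_reads(example, "CGCATC")
for header, sequence, plus, quality in hits:
    print(header, sequence)
```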
---
## Genomics Vocabulary
| Term | Definition |
| -------- | -------- |
| Genome | All of the genetic information of an organism that is contained in nucleotide sequences of DNA. |
| Shell | A program that interprets the commands you type at the command line (for example, bash). |
| Variable Regions | Regions of the 16S ribosomal DNA that act as a unique "signature" to identify a microbe. |
| Genome Browser | A tool which allows a user to visualize features of a genome |
| SNP | A single nucleotide polymorphism, a variation of a single nucleotide in the genome |
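| N50 | The contig length at which contigs of that length or longer contain at least half of the total assembled bases; a length-weighted measure of contiguity, not the median contig size |
| Contig | A piece of a genome; an individual segment of a genome which has been assembled |
| .fastq | A text file format that stores sequencing reads together with a quality score for each base; each read takes four lines |
| Metadata | Data about the data |
| Phred Quality Score | A logarithmic score (Q = -10 × log10 of the error probability) which estimates how likely it is that a nucleotide was identified correctly; Q20 means a 1 in 100 chance of error |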
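
Two of the terms above, N50 and the Phred quality score, are easier to follow with numbers. The sketch below is a minimal illustration in Python: the contig lengths are invented, the function names are hypothetical, and the quality character offset of 33 is an assumption (the common "Phred+33" encoding). It also shows that N50 generally differs from the median contig length.

```python
import statistics
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Contig length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # not reached for a non-empty list


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score Q into the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


def quality_char_to_phred(char: str, offset: int = 33) -> int:
    """Decode one FASTQ quality character, assuming a Phred+33 encoding."""
    return ord(char) - offset


lengths = [80, 70, 50, 40, 30, 20]  # invented contig lengths
print("N50:", n50(lengths))
print("median:", statistics.median(lengths))
print("Q20 error probability:", phred_to_error_probability(20))
print("'I' decodes to Q =", quality_char_to_phred("I"))
```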
---
## Atmosphere Server
Cloud computing for Sequence-a-Genome: [link](http://128.196.142.16:8000/hub/login)
---
## PuTTY Download
Link to [PuTTY Download](https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe)
#### Servers
| username | addresses |
|---------------|----------------|
| burton | 44.210.149.221 |
| deng | 44.200.187.104 |
| frankenberger | 44.192.45.0 |
| feitzinger | 35.172.218.32 |
| galvin | 34.204.193.152 |
| gilman | 34.204.172.246 |
| lauto | 3.239.48.245 |
| maloney | 3.239.4.230 |
| patel | 3.236.106.74 |
| thole | 3.231.228.99 |
| wong | 18.232.59.122 |
| williams | 100.24.126.80 / 34.236.38.133 |
---
## Daily Notes
### Monday
*Tasks*
- [x] - Intro to genomics
- [x] - Introduction to Linux
- [x] - DNA extraction from Chamaecrista
- [x] - Microbial sampling swab
### Promega plant DNA extraction
- **Group 1: Anjali and Richard,** Concentration: 31.8 ng/μl
- **Group 2: Alex and Henry,** Concentration: 16.9 ng/μl
- **Group 3: Kathryn and Eileen,** Concentration: 26.6 ng/μl
- **Group 4: Jake and Sophia,** Concentration: 37.7 ng/μl
- **Group 5: Daniel and Evan,** Concentration: 59.0 ng/μl
----
### Tuesday
*Tasks*
- [x] - DNA extraction from microbial swabs
- [x] - Microbial 16S PCR
- [x] - DNA QC of plant DNA
- [x] - Intro to Linux/command line continued
- [x] - 16S library prep and sequencing
- [x] - Genome assembly talk part I
### Microbe swabs
1. Kathryn, bathroom faucet
2. Eileen, rabbit
3. Daniel, door knob
4. Evan, AC filter
5. Alex, door handle
6. Henry, tree
7. Anjali, desk
8. Richard, Under sink
9. Jake, Toilet bowl
10. Sophia, bathroom sink
11. Anna, laundry room wall
12. Christopher (STARS)
### Household microbiome
See for example [link](https://www.epa.gov/indoor-air-quality-iaq/indoor-microbiome)
### Nanopore PCR and sequencing
- Gel photo (to be posted)
----
### Wednesday
*Tasks*
- [x] - Review Linux commands and more on Linux
- [x] - Setup of analysis computers on AWS
- [x] - Ligation library prep
- [x] - Genome assembly talk part II
### Computers for analysis
**Warning**: You must be on the OC-CSHL Wifi network. We can connect you.
#### Commands for setting up conda
    # Optional: simplify the prompt
    export PS1='$ '
    # From the terminal, change to the /opt directory
    cd /opt
    # Change to the root user
    sudo su
    # Download the Anaconda installer
    wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
    # Run the installer:
    #   say yes to the license
    #   use /opt/anaconda3 as the install directory
    #   say yes to running conda init
    bash Anaconda3-2022.05-Linux-x86_64.sh
    # Link Anaconda programs and libraries into system directories
    ln -s /opt/anaconda3/pkgs/*/bin/* /bin
    ln -s /opt/anaconda3/pkgs/*/lib/* /usr/lib

**Configure Bioconda**

Exit the root shell and start a new one (this reloads the shell configuration written by `conda init`):

    exit
    sudo su

Adding the channels below makes installations smoother, and `-c bioconda` and similar options won't be needed.

    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda config --set channel_priority strict

**Configure and Run Jupyter**

Set up a password for the server:

    jupyter notebook password

Install the bash kernel:

    pip install bash_kernel
    python -m bash_kernel.install

Start a tmux session so that the server keeps running in the background:

    tmux new-session -n jupyter

Run the command to start the Jupyter server:

    jupyter lab --no-browser --allow-root --ip=0.0.0.0 --port=8000 --NotebookApp.token='' --notebook-dir='/home/ubuntu'

#### Command to install FastQC using conda

    conda create --name fastqc fastqc==0.11.9 -y

#### Command to retrieve test data

    wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/quickstarts/trimmomatic/00_input/SRR1761506_R1_001.fastq.gz

### Thursday
*Tasks*
- [x] - Loaded the plant library
- [x] - Jupyter tool
- [x] - Genome assembly talk part II
**Sample nanopore data from duckweed**
Download [link](https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/spolyrhiza_08_2021/spolyrhiza_08252021/reads_for_assembly/guppy_gpu_fastq_files/concat_fastq/1000_spolyrhiza_reads.fastq.gz)
**Download command**
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/spolyrhiza_08_2021/spolyrhiza_08252021/reads_for_assembly/guppy_gpu_fastq_files/concat_fastq/1000_spolyrhiza_reads.fastq.gz
**Import the fastp notebook**
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/fastp_qc/ReadQC-with-fastp.ipynb
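**Fastp test command**

    # Create the output directory first in case it does not already exist
    mkdir -p data/output/fastp_analysis
    # Keep a space before each trailing backslash so the options stay separate
    fastp --in1 1000_spolyrhiza_reads.fastq.gz \
      --out1 data/output/fastp_analysis/1000_spolyrhiza_reads_filtered.fastq.gz \
      --length_required 1000 \
      --report_title "1000 Reads Sample Reads>= 1000bp" \
      --html data/output/fastp_analysis/1000_spolyrhiza_reads_filtered_example.html
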
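
The `--length_required 1000` option above asks fastp to keep only reads that are at least 1000 bp long. The sketch below shows what such a length filter does, written in plain Python. It is not how fastp is implemented, and fastp also applies other default filters that are ignored here. The function names, the toy reads, and the temporary file names are all made up for illustration.

```python
import gzip
import os
import tempfile
from typing import Iterator, Tuple

# (header, sequence, plus line, quality)
Record = Tuple[str, str, str, str]


def fastq_records(path: str) -> Iterator[Record]:
    """Yield four-line records from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            lines = [handle.readline().rstrip("\n") for _ in range(4)]
            if not lines[0]:  # an empty header line means end of file
                break
            yield (lines[0], lines[1], lines[2], lines[3])


def filter_by_length(in_path: str, out_path: str, min_length: int = 1000) -> Tuple[int, int]:
    """Write reads with at least min_length bases to out_path; return (kept, total)."""
    kept = 0
    total = 0
    with gzip.open(out_path, "wt") as out:
        for header, seq, plus, qual in fastq_records(in_path):
            total += 1
            if len(seq) >= min_length:
                kept += 1
                out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
    return kept, total


# Toy input: one read of 1200 bases and one of 500 bases (hypothetical data)
workdir = tempfile.mkdtemp()
reads_in = os.path.join(workdir, "toy_reads.fastq.gz")
reads_out = os.path.join(workdir, "toy_reads_filtered.fastq.gz")
with gzip.open(reads_in, "wt") as handle:
    handle.write("@long_read\n" + "A" * 1200 + "\n+\n" + "I" * 1200 + "\n")
    handle.write("@short_read\n" + "A" * 500 + "\n+\n" + "I" * 500 + "\n")

kept, total = filter_by_length(reads_in, reads_out, min_length=1000)
print(f"kept {kept} of {total} reads")
```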
**Assembly Notebooks**
- Flye:
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/genome_assembly/genome_assembly_flye.ipynb
- Redbean:
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/genome_assembly/genome_assembly_redbean.ipynb
**Full-size spolyrhiza dataset**

The following download command works:

    wget https://www.dropbox.com/s/4izz00ssbiijfoy/spolyrhiza_reads_filtered.fastq.gz?dl=0

**Rename file**

    mv spolyrhiza_reads_filtered.fastq.gz\?dl\=0 spolyrhiza_reads_filtered.fastq.gz
    # Assemble the reads with Flye (single backslash, preceded by a space, at each line end)
    flye --nano-raw spolyrhiza_reads_filtered.fastq.gz \
      --out-dir data/output/flye_example_assembly \
      --genome-size 70m \
      --threads 8 \
      --iterations 2

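
The Flye command above passes an estimated genome size of `70m` (70 million bases). One quick sanity check before assembling is the expected coverage, which is the total number of sequenced bases divided by the genome size. The sketch below is only an illustration: `parse_genome_size` is a hypothetical helper that mimics the `k`/`m`/`g` suffix notation, and the read lengths are invented rather than taken from the class data.

```python
from typing import Iterable


def parse_genome_size(text: str) -> int:
    """Turn a size such as '70m' or '2.5k' into a number of bases (hypothetical helper)."""
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def expected_coverage(read_lengths: Iterable[int], genome_size: int) -> float:
    """Average number of times each base of the genome is covered by reads."""
    return sum(read_lengths) / genome_size


genome_size = parse_genome_size("70m")
read_lengths = [10_000] * 210_000  # invented: 210,000 reads of 10 kb each
coverage = expected_coverage(read_lengths, genome_size)
print(f"expected coverage: {coverage:.1f}x")
```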
### Friday
*Tasks*
- [x] - Review the example polyrhiza assembly
- [x] - Assemble the plant sequence
**Download QUAST notebook**
Assembly QC with QUAST
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/assembly_qc/assembly_qc_quast.ipynb
**Download reference for polyrhiza**
polyrhiza reference
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/spolyrhiza_08_2021/spolyrhiza_08252021/reads_for_assembly/spolyrhiza_reference_genome/Spirodela_polyrhiza_strain_7498.faa
**Download reference annotation**
polyrhiza annotation
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/spolyrhiza_08_2021/spolyrhiza_08252021/reads_for_assembly/spolyrhiza_reference_genome/Spirodela_polyrhiza_strain_7498.gff
**Download Friday AM - Chamaecrista reads**
wget https://www.dropbox.com/s/fkw892z9mrp4qg0/friday_am_chamaecrista_reads.zip
**Install QUAST**

    conda create -y --name quast quast==5.0.2

**Running QUAST**

    quast -o quast_example_QC \
      --no-sv \
      --no-snps \
      -r Spirodela_polyrhiza_strain_7498.faa \
      --features Spirodela_polyrhiza_strain_7498.gff \
      --threads 8 \
      --labels "Flye example" \
      data/output/flye_example_assembly/00-assembly/draft_assembly.fasta

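
QUAST summarizes an assembly FASTA file with statistics such as the number of contigs and the total length. To make those numbers less of a black box, the sketch below computes a few of them directly in Python. It is not QUAST's implementation, the function names are hypothetical, and the tiny FASTA text is invented; pointing `read_fasta` at an opened assembly file would be the real use.

```python
import io
from typing import Dict, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Read FASTA text into a dictionary of {contig name: sequence}."""
    sequences: Dict[str, str] = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Use the first word after ">" as the contig name
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def assembly_summary(sequences: Dict[str, str]) -> Dict[str, float]:
    """Count contigs, total bases, the largest contig, and GC percentage."""
    lengths = [len(seq) for seq in sequences.values()]
    total = sum(lengths)
    gc = sum(seq.count("G") + seq.count("C") for seq in sequences.values())
    return {
        "contigs": len(lengths),
        "total_length": total,
        "largest_contig": max(lengths, default=0),
        "gc_percent": 100 * gc / total if total else 0.0,
    }


# Invented two-contig assembly, for illustration only
toy_fasta = io.StringIO(">contig_1 demo\nACGTACGT\nGGCC\n>contig_2\nATAT\n")
summary = assembly_summary(read_fasta(toy_fasta))
for key, value in summary.items():
    print(key, value)
```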
# Analysis of Chamaecrista
## Unzip and organize files
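    mkdir -p chamaecristae/fastp
    unzip friday_am_chamaecrista_reads.zip
    # Gzip files can be concatenated as-is with cat; zcat would write
    # uncompressed data into a file whose name ends in .gz
    cat friday_am_reads/fastq_pass/*.fastq.gz > chamaecristae/chamaecristae_combined.fastq.gz
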
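
The step above combines many `.fastq.gz` files into one. A useful property of the gzip format is that compressed files can be joined end to end and the result is still a valid gzip file that decompresses to the joined contents. The short Python sketch below demonstrates this with two invented reads held in memory; it makes no claim about the actual class files.

```python
import gzip

# Two invented FASTQ records, each compressed separately
part_one = b"@read1\nACGT\n+\nIIII\n"
part_two = b"@read2\nTTGG\n+\nIIII\n"

# Joining the compressed bytes is what `cat a.gz b.gz > combined.gz` does
combined = gzip.compress(part_one) + gzip.compress(part_two)

# Decompressing the joined data returns both records in order
restored = gzip.decompress(combined)
print(restored.decode())
print("matches the joined originals:", restored == part_one + part_two)
```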
## Run Fastp
    # Combined file (pre-filtered), saved into chamaecristae/ where the Redbean step expects it
    wget -P chamaecristae https://www.dropbox.com/s/ixz9tg7lujhkoyf/combined_friday_chamaecristsa_filtered.fastq.gz
    # Filter the combined reads (keep a space before each trailing backslash)
    fastp --in1 chamaecristae/chamaecristae_combined.fastq.gz \
      --out1 chamaecristae/fastp/chamaecristae_combined_filtered.fastq.gz \
      --length_required 1000 \
      --report_title "Chamaecristae Sample Reads>= 1000bp" \
      --html chamaecristae/fastp/chamaecristae_combined_filtered.html

## Run Redbean
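    mkdir -p chamaecristae/redbean
    # -o is an output file prefix; with the prefix chosen here the files are
    # written as chamaecristae/redbean/chamaecristae.*
    wtdbg2 -o chamaecristae/redbean/chamaecristae \
      -i chamaecristae/combined_friday_chamaecristsa_filtered.fastq.gz \
      -x ont \
      -t 8 \
      -g 70m
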
## Tasks
Genome size estimation by k-mer counting with Jellyfish
# Continuing or following up?
Evan Thole: tholeevan306@gmail.com
Eileen Deng: eileen_deng@ryecountryday.org
Kathryn Maloney: kathrynannemaloney@gmail.com
Anjali Patel: anjalijpatel05@gmail.com
Alex Frankenberger: zander.frankenberger@gmail.com
Richard Wong: iamrichard2015@gmail.com
Daniel Galvin Gusmano: galving@cshl.edu
Henry: henryburton@hunterschools.org
---