---
tags: minion_flu
---

# Rotation 3 Project Notes 3/16/2020

This is more or less what I accomplished during the "r3" part of my work in Sam's lab.

## Minion Flu Project Notes!

[toc]

### About the genome

Influenza A has a negative-sense, single-stranded RNA genome made up of 8 segments. Segments are often displayed 5' to 3', because they are transcribed to mRNA. Each segment has a highly conserved "cap" on each end: a U13 on the 5' end, and a U12 on the 3' end. These are not only conserved among _segments_ but also conserved among _all strains_ of influenza A.

![](https://i.imgur.com/FUBAZh7.png =250x)

Each segment amplicon looks something like this going through the pore:

**forward (5') barcode -> UMI -> U13 -> uniqDNADNADNA -> U12 -> reverse (3') barcode**

### File preparation

`cat *.fastq > cat_fastq`

### About the barcodes

For our purposes, I think?

+ the adapter is part of the barcode? or is being removed with it
+ we can also use barcode and primer interchangeably

So we will first demultiplex by 3' barcode, then by 5' barcode.
There are 4 barcode sets:

| Barcodes | Sample      | Plate |
| -------- | ----------- | ----- |
| F1/R1    | H3N2-NYU-14 | B 11  |
| F1/R2    | H3N2-NYU-19 | A 11  |
| F2/R1    | vTC1EGc3    | B 12  |
| F2/R2    | PR8         | A 12  |

The primers are (5'-3'):

| Primer | Sequence                             | Name |
| ------ | ------------------------------------ | ---- |
| F1     | ATCAGTTATGGTGGTCCTTACC AGCRAAAGCAGG  | B 11 |
| F2     | GACATACGACGGTCGCAAGTTG AGCRAAAGCAGG  | A 11 |
| R1     | TGGTCGAGTGAAGTGTTAGAGT AGTAGAAACAAGG | B 12 |
| R2     | CAGCGGACGACGGTCAAGGTTA AGTAGAAACAAGG | A 12 |

These primers include the U12 and U13 regions:

| 5'  | 3'  | Sequence      |
| --- | --- | ------------- |
| BC  | U12 | AGCRAAAGCAGG  |
| U13 | BC  | AGTAGAAACAAGG |

So in the cutadapt action, we are going to use these:

| Barcode | Sequence               |
| ------- | ---------------------- |
| F1      | ATCAGTTATGGTGGTCCTTACC |
| F2      | GACATACGACGGTCGCAAGTTG |
| R1      | TGGTCGAGTGAAGTGTTAGAGT |
| R2      | CAGCGGACGACGGTCAAGGTTA |

Also, we need to use the `-O 22` flag to mandate a 22 bp match, with little to no error tolerance. I also think we need the `--rc` (`--revcomp`) flag, which additionally checks the reverse complement of each read for the provided primer, because I think this is a blend of sense and antisense cDNA.
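The reverse-complement idea can be sanity-checked by hand. Below is a minimal shell sketch, assuming a POSIX shell with `awk` and `tr` available; the `revcomp` helper is a hypothetical name of my own, not part of cutadapt. It prints each barcode next to its reverse complement, which is the sequence that should appear in a read that went through the pore in the opposite orientation.

```shell
#!/bin/sh
# revcomp: hypothetical helper, prints the reverse complement of one sequence.
# Handles A/C/G/T plus the IUPAC codes R and Y (the U12 primer contains an R).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTRYacgtry' 'TGCAYRtgcayr'
}

# Barcodes copied from the table above (F1, F2, R1, R2).
for bc in \
    ATCAGTTATGGTGGTCCTTACC \
    GACATACGACGGTCGCAAGTTG \
    TGGTCGAGTGAAGTGTTAGAGT \
    CAGCGGACGACGGTCAAGGTTA
do
    printf '%s\t%s\n' "$bc" "$(revcomp "$bc")"
done
```

Grepping a handful of raw reads for both columns is a quick way to see whether the data really is a mix of orientations before committing to `--rc`.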
### Trim and Demultiplex: the example

This is all done using a subset of the data.

`cutadapt -a uni12_F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG -a uni12_F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG -o trimmed-{name}.fasta INPUT.fasta`

Note that the `-g` flag is for 5' adapters and the `-a` flag is for 3' adapters.

```
cutadapt \
  -a F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG \
  -a F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG \
  --untrimmed-output \
  untrimmed_subset99_F.fasta \
  -e 0.3 -o {name}.fasta \
  subset_99.fasta > F_subset99_report.txt
```

(same code as above, on one line)

```
cutadapt -a F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG -a F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG --untrimmed-output untrimmed_subset99_F.fasta -e 0.3 -o {name}.fasta subset_99.fasta > F_subset99_report.txt
```

This flag can be used to examine the matched regions before actually trimming them:

```
--action=lowercase \
```

So now the trimmed reads are sorted by F (forward, U12) barcode. There is a basket for untrimmed reads, but we won't move forward with those because the indices are combinatorial and we won't be able to resolve identity from one barcode.

Now, for F1.fasta, if R1 matches, the sample ID is H3N2-NYU-14. If R2 matches, the sample ID is H3N2-NYU-19:

```
cutadapt \
  -g H3N2-NYU-14=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG \
  -g H3N2-NYU-19=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG \
  --untrimmed-output \
  untrimmed_subset99_F1Runk.fasta \
  -e 0.3 -o {name}.fasta \
  F1.fasta > F1R_subset99_report.txt
```

(same code as above, on one line)

```
cutadapt -g H3N2-NYU-14=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g H3N2-NYU-19=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output untrimmed_subset99_F1Runk.fasta -e 0.3 -o {name}.fasta F1.fasta > F1R_subset99_report.txt
```

And for F2.fasta, if R1 matches, the sample ID is vTC1EGc3.
If R2 matches, the sample ID is PR8:

```
cutadapt \
  -g vTC1EGc3=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG \
  -g PR8=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG \
  --untrimmed-output \
  untrimmed_subset99_F2Runk.fasta \
  -e 0.3 -o {name}.fasta \
  F2.fasta > F2R_subset99_report.txt
```

(same code as above, on one line)

```
cutadapt -g vTC1EGc3=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g PR8=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output untrimmed_subset99_F2Runk.fasta -e 0.3 -o {name}.fasta F2.fasta > F2R_subset99_report.txt
```

### Trim and Demultiplex: the real run

Concatenate the unzipped fastq files:

`cat *.fastq > all_reads.fastq`

Install seqtk and cutadapt:

`conda install -c bioconda seqtk`

`conda install cutadapt -c bioconda`

Fasta-fy all_reads.fastq:

`seqtk seq -A all_reads.fastq > all_reads.fasta`

Sort out the forward primers:

```
mkdir no_match; cutadapt -a F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG -a F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG --untrimmed-output no_match/no_match_F.fasta -e 0.3 -o {name}.fasta all_reads.fasta > report_F_all.txt
```

Sort out the reverse primers:

```
cutadapt -g H3N2-NYU-14=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g H3N2-NYU-19=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output no_match/no_match_F1_R.fasta -e 0.3 -o {name}.fasta F1.fasta > report_F1.txt
```

```
cutadapt -g vTC1EGc3=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g PR8=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output no_match/no_match_F2_R.fasta -e 0.3 -o {name}.fasta F2.fasta > report_F2.txt
```

## Snakemake

Moved to a draft document for now... https://hackmd.io/NPtm_dVuRz6jg3hfnMX-fQ

----------------

### How we want the files organized: inputs & outputs

We want the user to create a directory, and in the directory `analysis/` we want them to put the

+ Snakefile,
+ and the MinION output zip file of fastq files.
This directory should include the directories

+ `trimmed_fastas/`
    + 1 directory per sample
    + `.unassigned sequences/`
        + in this will be a branched disaster
+ `cutadapt_reports/`
    + 1 report per sample
+ `minion_source_data`
    + `minion_seqs.zip`
    + `unzipped_fastqs/`
        + `files`
+ `all_reads_untrimmed`
    + `all_reads.fastq`
    + `all_reads.fasta`

### Putting it in snakey-makey

Snakemake is The Next Big Thing<sup>TM</sup>. It keeps a record of how your scripts are used and what their input dependencies are. Snakemake also allows you to run multiple steps in sequence, parallelising where possible. It also automatically detects if something changes and then reprocesses data if needed.

<a href="https://imgflip.com/i/3nevby"><img src="https://i.imgflip.com/3nevby.jpg" title="made at imgflip.com"/></a>

First, install snakemake with conda:

`conda install -y -c conda-forge -c bioconda snakemake-minimal`

The whole workflow goes in one big "Snakefile". I think I want to write this Snakefile with really generic input file names, and I'll just change the names of the input files accordingly.

```
rule unzip_and_cat:
    input:
        "minion_seqs.zip"
    output:
        "minion_source_data/minion_seqs.zip",
        directory("minion_source_data/unzipped_fastqs"),
        "minion_source_data/combined_fastq/all_reads.fastq"
    shell:
        """
        mv {input} minion_source_data/
        # unzip takes the destination directory with -d; -o means overwrite
        unzip -o minion_source_data/minion_seqs.zip -d minion_source_data/unzipped_fastqs
        cat minion_source_data/unzipped_fastqs/*.fastq > minion_source_data/combined_fastq/all_reads.fastq
        """
```

```
# draft: the outputs and shell command are still placeholders
rule zip_and_cat_fastq:
    input:
        "minion_seqs.zip"
    output:
        "fastq",
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:
        '''
        fastqc -o fastqc_raw {input}
        '''
```

Notes:

+ use a wildcard to make the cat_fastq file
+ let's make fastq generated by
+ we want to have a reference genome to

## Comparing to the Illumina reads

From the GitHub "coding resources" bash file:

* They start with a .fasta file
* use bowtie with a reference: or build a reference?
    * `bowtie2-build PR8_Mt_Sinai_NYC.fasta PR8_ref`
* minimap2 to map the _long reads_ against a reference.
    * has a wrapper installation:
    * `minimap2 -ax map-ont PR8_Mt_Sinai_NYC.fasta forward_PR8_RNA_B.fasta > forward_PR8_RNA_B.sam`
    * outputs to SAM
        * SAM is a format. rad rad.
        * SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

Breaking down the minimap2 command:

* `-a`: output the alignments in SAM format, with CIGAR strings
* `-x map-ont`: use the preset for Oxford Nanopore reads
* `PR8_Mt_Sinai_NYC.fasta`: the reference
* `forward_PR8_RNA_B.fasta`: the long reads to map
* `> forward_PR8_RNA_B.sam`: redirect the SAM output to a file

### A Cite to see

* https://angus.readthedocs.io/en/2019/snakemake_for_automation.html#specifying-software-required-for-a-rule
* BAR CODES: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005075
* ONT seq: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1462-9
* minion quality study: https://www.sciencedirect.com/science/article/pii/S2214753515000224
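
### Checking the SAM output

Since minimap2 writes SAM, a quick way to see how many reads actually mapped is to look at the FLAG column (column 2), where bit 0x4 means "unmapped". This is a minimal sketch, assuming a POSIX shell and `awk`; `count_mapped` is a hypothetical helper name of my own, and it skips secondary (0x100) and supplementary (0x800) records so each read is counted once.

```shell
#!/bin/sh
# count_mapped: hypothetical helper, counts mapped vs unmapped reads in a SAM file.
count_mapped() {
    awk -F '\t' '
        /^@/ { next }                                  # skip header lines
        int($2 / 256) % 2 == 1 { next }                # skip secondary alignments (0x100)
        int($2 / 2048) % 2 == 1 { next }               # skip supplementary alignments (0x800)
        int($2 / 4) % 2 == 1 { unmapped++; next }      # FLAG bit 0x4 set: read is unmapped
        { mapped++ }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}

# Usage, with the file name from the minimap2 command in these notes:
# count_mapped forward_PR8_RNA_B.sam
```

The integer arithmetic stands in for bitwise tests so the sketch does not depend on gawk-only functions.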
