---
tags: minion_flu
---
# Rotation 3 Project Notes 3/16/2020
This is more or less what I accomplished during the "r3" part of my work in Sam's lab.
## Minion Flu Project Notes!
[toc]
### About the genome
Influenza A has a negative-sense, single-stranded RNA genome made of 8 segments. Segments are often displayed 5' to 3', because they are transcribed into mRNA. Each segment has a highly conserved "cap" on each end: a U13 on the 5' end, and a U12 on the 3' end. These are not only conserved among _segments_ but also conserved among _all strains_ of influenza A.

Each segment amplicon looks something like this going through the pore:
**forward( 5') barcode -> UMI ->U13 -> uniqDNADNADNA -> U12 -> reverse( 3') barcode**
### File preparation
`cat *.fastq > cat_fastq`
### About the barcodes
For our purposes, I think:
+ the adapter is part of the barcode? Or is being removed with it.
+ we can also use "barcode" and "primer" interchangeably.

So we will first demultiplex by the 3' barcode, then by the 5' barcode.
There are 4 barcode sets:
| Barcodes | Sample | Plate |
| -------- | ----------- | ----- |
| F1/R1 | H3N2-NYU-14 | B 11 |
| F1/R2 | H3N2-NYU-19 | A 11 |
| F2/R1 | vTC1EGc3 | B 12 |
| F2/R2 | PR8 | A 12 |
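
Because a sample is only identified by the pair of barcodes, the table above can be read as a lookup keyed on (forward, reverse). A minimal Python sketch of that lookup; the names `SAMPLES` and `sample_for` are made up for illustration and are not part of any tool used here:

```python
# Barcode pair -> sample, copied from the table above.
SAMPLES = {
    ("F1", "R1"): "H3N2-NYU-14",
    ("F1", "R2"): "H3N2-NYU-19",
    ("F2", "R1"): "vTC1EGc3",
    ("F2", "R2"): "PR8",
}

def sample_for(forward, reverse):
    """Return the sample for a barcode pair, or None if the pair is incomplete or unknown."""
    return SAMPLES.get((forward, reverse))

print(sample_for("F2", "R2"))
print(sample_for("F1", None))  # only one barcode found: cannot be resolved
```

A read with only one recognized barcode maps to `None`, which is the reason single-barcode reads cannot be assigned to a sample.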
The Primers are (5'-3')
| Primer | Sequence | Name |
| --------| ---------------------------------- | ---- |
| F1 | ATCAGTTATGGTGGTCCTTACC AGCRAAAGCAGG | B 11 |
| F2 | GACATACGACGGTCGCAAGTTG AGCRAAAGCAGG | A 11 |
| R1 | TGGTCGAGTGAAGTGTTAGAGT AGTAGAAACAAGG| B 12 |
| R2 | CAGCGGACGACGGTCAAGGTTA AGTAGAAACAAGG| A 12 |
These primers include the U12 and U13 regions:
| 5' | 3' | Conserved sequence |
| --- | --- | ------------- |
| BC | U12 | AGCRAAAGCAGG |
| U13 | BC | AGTAGAAACAAGG |
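
The `R` in AGCRAAAGCAGG is the IUPAC code for a purine (A or G), so that primer stands for two concrete sequences. A small Python sketch of how such a degenerate primer can be expanded into a regular expression; `IUPAC` and `primer_to_regex` are illustrative names, and only a subset of the IUPAC codes is included:

```python
import re

# Subset of the IUPAC nucleotide codes; R is the only degenerate one in these primers.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def primer_to_regex(primer):
    """Turn a primer containing IUPAC codes into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))

uni12 = primer_to_regex("AGCRAAAGCAGG")
print(bool(uni12.fullmatch("AGCAAAAGCAGG")))  # R read as A
print(bool(uni12.fullmatch("AGCGAAAGCAGG")))  # R read as G
print(bool(uni12.fullmatch("AGCTAAAGCAGG")))  # T is not a purine
```

This is exact matching only; it does not model the error-tolerant alignment that cutadapt performs.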
So in the cutadapt action, we are going to use these:
| Barcode | Sequence |
| --------| ----------------------|
| F1 | ATCAGTTATGGTGGTCCTTACC|
| F2 | GACATACGACGGTCGCAAGTTG|
| R1 | TGGTCGAGTGAAGTGTTAGAGT|
| R2 | CAGCGGACGACGGTCAAGGTTA|
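Also, we need to use the `-O 22` flag to mandate an overlap of at least 22 bp between read and barcode. The error tolerance is set separately with `-e`, and should be little to none.
Also, I think we need the `--rc` flag (short for `--revcomp` in recent cutadapt versions), which also checks the reverse complement of each read for the provided primer, because I think this is a blend of sense & antisense cDNA.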
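To see what a reverse-complement search is looking for, here is a minimal Python sketch that reverse-complements a barcode. The helper name `reverse_complement` is made up for illustration, and the translation table only covers the bases that appear in these primers (A, C, G, T, plus the IUPAC codes R and Y):

```python
# Complement table, including the IUPAC codes R (A/G) and Y (C/T).
COMPLEMENT = str.maketrans("ACGTRY", "TGCAYR")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

f1_barcode = "ATCAGTTATGGTGGTCCTTACC"
print(reverse_complement(f1_barcode))
print(reverse_complement("AGCRAAAGCAGG"))
```

A read sequenced from the opposite strand would carry these reverse-complemented sequences instead of the primers as written in the tables.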
### Trim and Demultiplex: the example
This is all done using a subset of the data.
`cutadapt -a uni12_F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG -a uni12_F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG -o trimmed-{name}.fasta INPUT.fasta`
Note that the `-g` flag is for 5' adapters and the `-a` flag is for 3' adapters.
```
cutadapt \
-a F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG \
-a F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG \
--untrimmed-output \
untrimmed_subset99_F.fasta \
-e 0.3 -o {name}.fasta \
subset_99.fasta > F_subset99_report.txt
```
(same code as above)
```
cutadapt -a F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG -a F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG --untrimmed-output untrimmed_F.fasta -e 0.3 -o {name}.fasta subset_99.fasta > F_subset99_report.txt
```
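For a sense of how permissive `-e 0.3` is, here is a small arithmetic sketch in Python. It assumes that cutadapt allows floor(error rate × matched length) errors, which is how its documentation describes the maximum error rate; that rule has not been checked against the cutadapt version used here, and `max_errors` is a made-up helper name:

```python
import math

def max_errors(matched_length, error_rate):
    """Allowed errors for a match of the given length (assumed rule: round down)."""
    return math.floor(matched_length * error_rate)

print(max_errors(34, 0.3))  # full 34 nt forward primer (22 nt barcode + 12 nt conserved region)
print(max_errors(22, 0.3))  # 22 nt barcode alone
print(max_errors(22, 0.1))  # 22 nt barcode at cutadapt's default error rate of 0.1
```

Under that assumption, `-e 0.3` tolerates 10 errors across a full-length forward primer, which is a lot for telling apart two 22 nt barcodes.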
This flag can be used to examine the regions that would be trimmed before actually trimming them:
```
--action=lowercase \
```
So now, the trimmed reads are sorted by F (forward, U12) barcode. There is a basket for untrimmed reads, but we won't move forward with those because the indices are combinatorial and we won't be able to resolve identity from 1 barcode.
Now, for F1.fasta, if R1 matches, the sample ID is H3N2-NYU-14. If R2 matches, the sample ID is H3N2-NYU-19:
```
cutadapt \
-g H3N2-NYU-14=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG \
-g H3N2-NYU-19=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG \
--untrimmed-output \
untrimmed_subset99_F1Runk.fasta \
-e 0.3 -o {name}.fasta \
F1.fasta > F1R_subset99_report.txt
```
(same code as above)
```
cutadapt -g H3N2-NYU-14=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g H3N2-NYU-19=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output untrimmed_subset99_F1Runk.fasta -e 0.3 -o {name}.fasta F1.fasta > F1R_subset99_report.txt
```
And for F2.fasta, if R1 matches, the sample ID is vTC1EGc3. If R2 matches, the sample ID is PR8:
```
cutadapt \
-g vTC1EGc3=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG \
-g PR8=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG \
--untrimmed-output \
untrimmed_subset99_F2Runk.fasta \
-e 0.3 -o {name}.fasta \
F2.fasta > F2R_subset99_report.txt
```
(same code as above)
```
cutadapt -g vTC1EGc3=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g PR8=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output untrimmed_subset99_F2Runk.fasta -e 0.3 -o {name}.fasta F2.fasta > F2R_subset99_report.txt
```
### Trim and Demultiplex: the real
Concatenate the unzipped fastq files:
`cat *.fastq > all_reads.fastq`
install seqtk and cutadapt
`conda install -c bioconda seqtk`
`conda install cutadapt -c bioconda`
fasta-fy all_reads.fastq:
`seqtk seq -A all_reads.fastq > all_reads.fasta`
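What this conversion does can be sketched in a few lines of Python: keep the header and sequence of every record, and drop the separator and quality lines. This is only an illustration, not a replacement for seqtk. It assumes strict 4-line FASTQ records (no wrapped sequences), and `fastq_to_fasta` is a made-up name:

```python
import io

def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write header and sequence of each 4-line FASTQ record as FASTA; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, discarded
        fastq_handle.readline()  # quality string, discarded
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count

# Two made-up reads, just to exercise the function.
fastq = io.StringIO("@read1\nACGT\n+\n!!!!\n@read2\nGGCC\n+\n####\n")
fasta = io.StringIO()
print(fastq_to_fasta(fastq, fasta))
print(fasta.getvalue())
```

The quality scores are lost in this step, so the original `all_reads.fastq` should be kept if quality filtering is needed later.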
sort out the forward primers:
```
mkdir no_match;
cutadapt -a F1=ATCAGTTATGGTGGTCCTTACCAGCRAAAGCAGG -a F2=GACATACGACGGTCGCAAGTTGAGCRAAAGCAGG --untrimmed-output no_match/no_match_F.fasta -e 0.3 -o {name}.fasta all_reads.fasta > report_F_all.txt
```
sort out the reverse primers:
```
cutadapt -g H3N2-NYU-14=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g H3N2-NYU-19=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output no_match/no_match_F1_R.fasta -e 0.3 -o {name}.fasta F1.fasta > report_F1.txt
```
```
cutadapt -g vTC1EGc3=TGGTCGAGTGAAGTGTTAGAGTAGTAGAAACAAGG -g PR8=CAGCGGACGACGGTCAAGGTTAAGTAGAAACAAGG --untrimmed-output no_match/no_match_F2_R.fasta -e 0.3 -o {name}.fasta F2.fasta > report_F2.txt
```
## Snakemake moved to draft document for now...
https://hackmd.io/NPtm_dVuRz6jg3hfnMX-fQ
----------------
### How we want the files organized : inputs & outputs
We want the user to create a directory, and in the directory `analysis/` we want them to put:
+ the Snakefile,
+ and the zipped MinION output file of fastq files.

This directory should include these directories:
+ `trimmed_fastas/`
    + 1 directory per sample
    + `.unassigned sequences/`
        + in this will be a branched disaster
+ `cutadapt_reports/`
    + 1 report per sample
+ `minion_source_data`
    + `minion_seqs.zip`
    + `unzipped_fastqs/`
        + `files`
    + `all_reads_untrimmed`
        + `all_reads.fastq`
        + `all_reads.fasta`
### Putting it in snakey-makey
Snakemake is The Next Big Thing<sup>TM</sup>. It keeps a record of how your scripts are used and what their input dependencies are. Snakemake also allows you to run multiple steps in sequence, parallelising where possible. It also automatically detects if something changes and then reprocesses data if needed.
<a href="https://imgflip.com/i/3nevby"><img src="https://i.imgflip.com/3nevby.jpg" title="made at imgflip.com"/></a>
First, install snakemake with conda:
`conda install -y -c conda-forge -c bioconda snakemake-minimal`
The whole file goes in one big "Snakefile".
I think I want to write this Snakefile with really generic input file names, and I'll just change the names of the input files accordingly.
```
rule unzip_and_cat:
    input:
        "minion_seqs.zip"
    output:
        "minion_source_data/combined_fastq/all_reads.fastq"
    shell:
        """
        mkdir -p minion_source_data/unzipped_fastqs
        # copy rather than move, so the declared input still exists afterwards
        cp {input} minion_source_data/
        # -d sets the extraction directory; -o overwrites without prompting
        unzip -o {input} -d minion_source_data/unzipped_fastqs
        cat minion_source_data/unzipped_fastqs/*.fastq > {output}
        """
```
```
rule zip_and_cat_fastq:
    input: "minion_seqs.zip"
    output:
        "fastq",
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
        fastqc -o fastqc_raw {input}
        '''
```
Notes:
use wild card to make cat_fastq file
lets make
fastq generated by
we want to have a reference genome to
## Comparing to the Illumina reads:
From GitHub "coding resources" bash file:
* They start with a .fasta file
* use bowtie with a reference: or build a reference?
* `bowtie2-build PR8_Mt_Sinai_NYC.fasta PR8_ref`
* minimap2 to map the _long reads_ against a reference.
* has a wrapper installation:
* `minimap2 -ax map-ont PR8_Mt_Sinai_NYC.fasta forward_PR8_RNA_B.fasta > forward_PR8_RNA_B.sam`
* outputs to SAM
* SAM is a format. rad rad.
* SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
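Breaking down the `minimap2` command:
* `-a`: output alignments in SAM format, with CIGAR strings
* `-x map-ont`: preset for mapping Oxford Nanopore reads
* `PR8_Mt_Sinai_NYC.fasta`: the reference
* `forward_PR8_RNA_B.fasta`: the reads, with the output redirected to `forward_PR8_RNA_B.sam`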
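
Once there is a SAM file, a quick first check is how many reads mapped to each reference sequence. A minimal Python sketch, relying only on the standard SAM columns (column 2 is the bitwise FLAG, column 3 the reference name). The function name `count_mapped`, the read names, and the segment names in the example are made up for illustration:

```python
from collections import Counter

def count_mapped(sam_lines):
    """Count primary mapped alignments per reference name from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, reference = int(fields[1]), fields[2]
        if flag & 0x4:  # bit 0x4 set: read is unmapped
            continue
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800) records
            continue
        counts[reference] += 1
    return counts

# Made-up records: two mapped reads, one unmapped read, one secondary alignment.
example = [
    "@SQ\tSN:PB2\tLN:2341",
    "read1\t0\tPB2\t1\t60\t4M\t*\t0\t0\tACGT\t*",
    "read2\t16\tHA\t5\t60\t4M\t*\t0\t0\tACGT\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "read2\t256\tPB2\t9\t0\t4M\t*\t0\t0\tACGT\t*",
]
print(count_mapped(example))
```

For a real file the lines would come from `open("forward_PR8_RNA_B.sam")`; dedicated tools such as samtools do this more thoroughly.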
### a Cite to see
https://angus.readthedocs.io/en/2019/snakemake_for_automation.html#specifying-software-required-for-a-rule
BAR CODES: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005075
ONT seq: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1462-9
minion quality study: https://www.sciencedirect.com/science/article/pii/S2214753515000224