TnSeq data analysis

# TnSeq data analysis ## Protocol Full protocol available [here](https://sfvideo.blob.core.windows.net/sitefinity/docs/default-source/protocol/xgen-dna-library-prep-ez-kit-and-xgen-dna-library-prep-ez-uni-kits-protocol.pdf?sfvrsn=57b1e007_13). Dan's Benchling available [here](https://benchling.com/daniel_deatherage/f/lib_SsaupUkW-flora/etr_o0j8gqil-0001-initial-tnseq-sequencing/edit). ### DNA fragmentation ![image](https://hackmd.io/_uploads/SyRMnBt4p.png) Fragmentation mastermix: ![image](https://hackmd.io/_uploads/SyxYhHK4T.png) Fragmentation conditions: ![image](https://hackmd.io/_uploads/S1j5hHtEa.png) ### Adapter ligation #### DS adapter aneal Mix the following in a PCR tube: - 15µl of Adapter_forward (100pmol/µl) - 15µl of MI_Reverse (100pmol/µl) Use thermocycler with following conditions to anneal the 2 primers together: - 97C 2minutes - 97C 1minute - Repeat previous step for 72 total cycles at -1C per cycle - 25C 5 minutes After this step, adapters should NEVER be warmed above 25C. #### Ligation mix ![image](https://hackmd.io/_uploads/B1cE6rtVT.png) Dan note: reaction volume remain after aliquot = 3x extra reactions not 1x .... all samples have equal line in pcr tubes .... suggests ~50µl extra volume, of 1 of reagents? all pipett with p200 ... moving forward anyway Add 5 µL dsAdapter (Y adapter) for samples 1-5. Add 5 µL of adapter W5 (stubby adapter) for samples 6-7. #### Ligation conditions ![image](https://hackmd.io/_uploads/ryT0pHFEp.png) #### Ligation clean-up 36µl beads/ sample, elute 20µl. ### PCR #### PCR mastermix ![image](https://hackmd.io/_uploads/ryaX0BKV6.png) Add 2.5µL of Rx to each tube (cf PCR primers table). ![image](https://hackmd.io/_uploads/HkqNCBFE6.png) #### PCR primers ![image](https://hackmd.io/_uploads/B1t80BKET.png) TN-FL 150-3a-corrected sequence: ![image](https://hackmd.io/_uploads/BJ_jjHKVa.png) #### PCR conditions ![image](https://hackmd.io/_uploads/H1T2AHKEa.png) #### PCR clean-up Dan's note: spri ratio 0.5x therefore 25µl beads .... 22 elute (20µl sample after br qbit) ## Data processing ### Demultiplexing Dan demultiplexed the data by hand (automated demultiplexing not possible due to the different samples that were sequenced along with these). ### Filtering Used fastp to filter and pair reads. Output: both merged and unmerged. ``` #!/bin/bash SAMPLES="A1_S1 A2_S9 A3_S17 A4_S25 A5_S33 A6_S41" for S in $SAMPLES; do fastp --thread 16 --verbose \ --unpaired1 /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/unpaired/"$S""_R1_unpaired.fastq.gz" \ --unpaired2 /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/unpaired/"$S""_R2_unpaired.fastq.gz" \ --failed_out /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/failed_out/"$S""_failed_out.fastq.gz" \ --in1 /home/gaudillf/thesis_project/TnSeq_IS150/data/tn-seq-reads/"$S""_L001_R1_001.fastq" \ --in2 /home/gaudillf/thesis_project/TnSeq_IS150/data/tn-seq-reads/"$S""_L001_R2_001.fastq" \ --out1 /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/passed/"$S""_R1_fastp_ouput.fastq.gz" \ --out2 /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/passed/"$S""_R2_fastp_ouput.fastq.gz" done ``` ### Junction mapping Used breseq for a first round on the merged reads from all samples to predict potential junctions. ``` #!/bin/bash breseq -r /home/gaudillf/thesis_project/TnSeq_IS150/data/MG1655.gbk \ /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end/merged/merged_reads_all_samples.fastq \ -p -t \ --junction-only-reference /home/gaudillf/thesis_project/TnSeq_IS150/data/pFDXTH1.fasta \ --junction-indel-split-length 10000 --require-match-fraction 0.8 \ --junction-minimum-side-match 12 --skip-RA-MC-prediction --junction-alignment-pair-limit 0 \ --junction-minimum-candidate-pos-hash-score 1 --junction-candidate-length-factor 0 \ --junction-maximum-candidates 0 --junction-minimum-pos-hash-score 1 \ --polymorphism-frequency-cutoff 0 \ -o /home/gaudillf/thesis_project/TnSeq_IS150/02_breseq_predict_junctions \ -n breseq_predict_candidate_junctions ``` Used the output as user evidence to run breseq on individual samples (paired, non-merged reads). ``` #!/bin/bash SAMPLES="A1_S1 A2_S9 A3_S17 A4_S25 A5_S33 A6_S41" for S in $SAMPLES; do breseq -r /home/gaudillf/thesis_project/TnSeq_IS150/data/MG1655.gbk \ /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/passed/"$S""_R1_fastp_ouput.fastq.gz" \ /home/gaudillf/thesis_project/TnSeq_IS150/01_fastp_filtering_paired_end_unmerged/passed/"$S""_R2_fastp_ouput.fastq.gz" \ -p -t \ --junction-only-reference /home/gaudillf/thesis_project/TnSeq_IS150/data/pFDXTH1.fasta \ --junction-indel-split-length 10000 --require-match-fraction 0.8 \ --junction-minimum-side-match 12 --skip-RA-MC-prediction --junction-alignment-pair-limit 0 \ --junction-minimum-candidate-pos-hash-score 1 --junction-candidate-length-factor 0 \ --junction-maximum-candidates 0 --junction-minimum-pos-hash-score 1 \ --polymorphism-frequency-cutoff 0 \ -o /home/gaudillf/thesis_project/TnSeq_IS150/03_breseq_user_evidence_non_merged/"$S" \ -n breseq_predict_candidate_junctions_user_evidence \ --user-evidence-gd /home/gaudillf/thesis_project/TnSeq_IS150/02_breseq_predict_junctions/output/output.gd done ``` ## Genome and plasmid coverage ### Genome coverage Genome coverage is nearly identical for sample 1 (first image) and for sample 5, the negative control (second image). ![U00096.overview](https://hackmd.io/_uploads/Syz31cRNp.png) ![U00096.overview](https://hackmd.io/_uploads/rktNkq0N6.png) Zoom in on one of the coverage peaks (sample 1 top, negative control bottom): ![peak](https://hackmd.io/_uploads/BJl_IuYE6.png) ![peak](https://hackmd.io/_uploads/Bk0OI_YNT.png) Not all reads carry the part that matches the plasmid sequence (top sample 1, bottom negative control): ![image](https://hackmd.io/_uploads/BkyPvdtEp.png) ![image](https://hackmd.io/_uploads/HyJBDdtEp.png) NB: the zoom is from earlier analysis with the wrong MG1655.gbk. I replaced the total coverage plot with the ones obtained using the right .gbk (U000096) but I didn't redo the zoom nor the plasmid coverage/junctions because the only thing that changes is were the peaks are located). ### Plasmid coverage Around ~150 bp from the plasmid are covered in sample 1 (first image), although not were we would expect (see second image: plasmid sequence at end of the modified IS element). ![image](https://hackmd.io/_uploads/BJlMZLK4T.png) ![image](https://hackmd.io/_uploads/SJmd-It46.png) Read alignment to the plasmid: ![image](https://hackmd.io/_uploads/r13WmIYNa.png) A few bp of the plasmid are found in the negative control, but given the small size of the match, it's probably not the plasmid. ![image](https://hackmd.io/_uploads/rkHQWLY4T.png) Read alignment to the plasmid: not actual matches ![image](https://hackmd.io/_uploads/rJq77UF4p.png) ## Junction analysis Junctions are predicted both for sample 1 (top) and for the negative control (bottom). (This only shows the head of the table.) ![image](https://hackmd.io/_uploads/ByCJM8F4T.png) ![image](https://hackmd.io/_uploads/HJBeMUYN6.png) The read mapping to the junctions is similar in both, with only a few bp mapped to the actual plasmid, and not were we would expect (last image: plasmid sequence at the end of the modified IS element): ![image](https://hackmd.io/_uploads/HJ_Hf8KVa.png) ![image](https://hackmd.io/_uploads/SkSIG8F4a.png) ![image](https://hackmd.io/_uploads/SJmd-It46.png) ## Blastn against TnSeq primer I blasted the unfiltered reads against the sequence of the primer: matches are found in the A1_R2 reads, but again, not the part I would expect: the IRR is almost not covered. ![image](https://hackmd.io/_uploads/ryoArUKVT.png) When specifically blasting the reads against the IRR part, no hits are returned. Same results with filtered paired reads. NB: the sequence on the TnSeq tube is not complete (it stops at 58 bp, one pb after the end of the universal adapter part). I'm assuming it is just because IDT does not put the full sequence of long oligos on the tube, but is it possible that there was a mistake in the primer order and that the IRR specific part is not actually in it? I would need the order details to check. ## G-tails and quality issues Graphs for R1 and R2 from first sample (after default fastp processing, filtering for paired reads but no merging of reads): R2 has terrible quality, and there is a G-tail for both R1 and R2 (approximately 1/3 of reads if I'm calculating correctly). Jeff said the quality issue could be because of a bad batch of iSeq cartridges, they ran into issues before. (But that still would not explain why the reads don't align where we expect them to.) ![image](https://hackmd.io/_uploads/H1TVbKyLp.png) ![image](https://hackmd.io/_uploads/BkfBZFJIT.png) ![image](https://hackmd.io/_uploads/SyPSWKk8a.png) ![image](https://hackmd.io/_uploads/S1oSWKJLa.png) I'm going to test a few things with fastp. See detailed manual [here](https://github.com/OpenGene/fastp). - poly-G tail trimming - base correction for PE data - detect adapters instead of trimming by overlap analysis > for PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it. - quality of reads: minimum of 60% of bases above 30 threshold When doing that, fastp does not detect any adapter for R2. It detects the following for R1: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC Output: ``` Read1 before filtering: total reads: 82069 total bases: 12392419 Q20 bases: 12120520(97.8059%) Q30 bases: 11547171(93.1793%) Read2 before filtering: total reads: 82069 total bases: 12392419 Q20 bases: 8780521(70.854%) Q30 bases: 6709665(54.1433%) Read1 after filtering: total reads: 22584 total bases: 2789435 Q20 bases: 2765133(99.1288%) Q30 bases: 2654432(95.1602%) Read2 after filtering: total reads: 22584 total bases: 2869172 Q20 bases: 2380185(82.9572%) Q30 bases: 1948703(67.9187%) Filtering result: reads passed filter: 45168 reads failed due to low quality: 102098 reads failed due to too many N: 0 reads failed due to too short: 16872 reads with adapter trimmed: 59106 bases trimmed due to adapters: 3150935 reads corrected by overlap analysis: 32839 bases corrected by overlap analysis: 168261 Duplication rate: 2.84151% Insert size peak (evaluated by paired-end reads): 0 ``` Cleaned up the data the remove the G-tails and the poorest quality reads and re-ran breseq (junction prediction on all samples and then junction prediction with user evidence on individual samples). I no longer have reads that are not mapped. The junctions problem persists. I kept that fastp run and the following breseq runs and blasts as main outputs in the folder, and put whatever I did earlier in an 'old' folder to avoid having too much of a mess in there. Percentage of reads kept for each sample: | Sample | % reads kept | Number of reads kept | Detected R1 Adapter | Detected R2 Adapter | | -------- | -------- | -------- | -------- | -------- | | A1 | 27 | 45K | GATCGGAAGAGCACACGTCTGAACTCCAGTCAC | None | | A2 | 30 | 24K | AGATCGGAAGAGCACACGTCTGAACTCCAGTCA | None | | A3 | 13 | 19K | GATCGGAAGAGCACACGTCTGAACTCCAGTCAC | None | | A4 | 9 | 9K | GATCGGAAGAGCACACGTCTGAACTCCAGTCAC | None | | A5 (control) | 20 | 38K | GATCGGAAGAGCACACGTCTGAACTCCAGTCAC | None | Example of sample 1 for quality of reads after filtering (R1/R2): ![image](https://hackmd.io/_uploads/SJ_DvTkIT.png) ![image](https://hackmd.io/_uploads/HkrFvp1Ip.png) Example of sample 1 for base content: ![image](https://hackmd.io/_uploads/Skw2v6kIa.png) ![image](https://hackmd.io/_uploads/HkbovTJUa.png)