---
title: Metagenomics/AMR
description: Information and notes on my studies and research of deep learning techniques in the context of antimicrobial resistance in metagenomics.
---

# AMR Studies

## Efforts to reproduce AMR++ v1

- still some differences when running my pipeline steering script.
  - Nele and Judith are keeping intermediate files from some Galaxy runs
  - need to use those to figure out where the difference comes from.
- annotations are different, but are only missing the "new" column with RequiresSNP info.

## AMR++ Additional Implementations

- DeepARG
  - should take removed host reads as input
- metaxa2 workflow (after kraken run)
- kraken2
  - figure out how to make kraken faster:
    - maybe rather than:
      - `kraken2 --memory-map S1 ; kraken2 --memory-map S2`
    - do
      - `kraken2 S1 S2`
    - such that kraken loads the entire db into memory once and runs all samples in one process/call.

## Meeting w/ Judith, Nele, Steffen

- ToDo
  - [ ] retrain DeepARG with MEGARes v3.0 database
  - [ ] AMR++ v1 vs v3 validation
    - sequence data located:
      - on synology `/data_syn/sequenzdaten/`
      - on server `/home/admin/Seqs/`
  - start/continue common structure

## IOW / Christiane Hassenrück / OTC Genomic

- Christiane is asking whether I could potentially get access to their data and test our tool on it.
- links:
  - https://www.oceantechnologycampus.com/
  - https://www.io-warnemuende.de/otcg-home-de.html
  - https://github.com/tseemann/abricate (Christiane has only used abricate so far)
- not a fan of analysis with short reads as the only input; would rather analyze after assembly, i.e. contigs as input.
- does the FLI cooperate with the DSMZ in Braunschweig?
- can we (Fabian, I and the FLI) tie into established structures?

## Reference Dump

- PhD thesis https://vtechworks.lib.vt.edu/items/4bd67860-7df9-41c5-893f-0b0e284a5eb1
- blasting:
  - https://doi.org/10.1186/1471-2105-10-421
  - https://doi.org/10.1016/S0022-2836(05)80360-2
- Follow-up articles from DeepARG secondary authors:
  - DeepMRG: https://doi.org/10.1101/2023.11.14.566903
  - ARGEM: https://doi.org/10.3389/fgene.2023.1219297
  - HT-ARGfinder: https://doi.org/10.3389/fenvs.2022.901917
- PhD thesis https://vtechworks.lib.vt.edu/handle/10919/106511
- BRENDA enzyme db https://www.brenda-enzymes.org/enzyme.php?ecno=3.5.2.6
- article https://towardsdatascience.com/building-machine-learning-models-for-predicting-antibiotic-resistance-7640046a91b6

## Pipelines/Workflows

### funcscan (Pipeline)

- https://github.com/nf-core/funcscan
- includes AMRFinderPlus, ABRicate, DeepARG, RGI, fARGene

### ResistoXplorer (Pipeline)

- https://www.resistoxplorer.no/ResistoXplorer/faces/docs/FaqView.xhtml#input2
- uses both AMR++ v2 and DeepARG?

### AMRFinderPlus (Workflow)

- https://github.com/ncbi/amr

### AMR++ (Workflow)

- https://www.meglab.org/amrplusplus/
- version 3 https://github.com/Microbial-Ecology-Group/AMRplusplus/tree/master
- version 1 (credentials user:admin@galaxy.org pw:admin)
  - Galaxy shed [link](https://toolshed.g2.bx.psu.edu/repository?repository_id=f249d27395ea9e5b&changeset_revision=c9fbf44a96f7)
  - github repo [link](https://github.com/cdeanj/galaxytools/tree/master/workflows/amrplusplus)

<details>
<summary>AMR++ version comparison</summary>

- taking test input from v3:
  - reads: `data/raw/*_R{1,2}.fastq.gz`
  - host: `data/host/chr21.fasta.gz`
  - host_index: `data/host/chr21.fasta.gz`
  - amr_index: `data/amr/megares_database_v3.00.fasta`
  - amr: `data/amr/megares_database_v3.00.fasta`
  - annotation: `data/amr/megares_annotation_v3.00.csv`
- v3 default:
  - trimmomatic: v3 (v1)
    - leading = 3 (3 - not applied)
    - trailing = 3 (3 - not applied)
    - slidingwindow = 4:15 (4:20)
    - minlen = 36 (20 - not applied)
  - resistome
    - threshold = 80 (1)
  - rarefaction
    - min = 5
    - max = 100
    - threshold = 80
    - skip = 5
    - samples = 1
- datasets in S3 Galaxy queue
  - 1126 -> ch21.fasta
  - 1124 -> S3_test_R1.fastq
  - 1125 -> S3_test_R2.fastq
  - 1127 -> megares_annotation_v3.00.csv
  - 1128 -> megares_database_v3.00.fasta
  - 1244 -> R1.paired.fastq
  - 1245 -> R2.paired.fastq
  - step 10 Map with BWA
    - 1248 -> alignment_sorted.bam
  - step 11 Filter by flag
    - in v1 `samtools view ... -f 0x0004 -f 0x0008 -f 0x0001 ...`
    - in v3 `samtools view ... -f 12 ...` equivalent to `samtools view ... -f 0x4 -f 0x8 ...`
    - 1249 -> alignment_sorted_filtered.
  - step 12 Sort again
    - 1250 -> alignment_sorted_filtered_sorted.bam
  - step 13

#### Conclusion

Two major differences between v1 and v3:

1. v1 only uses the sliding window (`4:20`), while v3 applies more trimming options and a lower sliding-window quality threshold (`4:15`).
2. v1 uses a different alignment method, `bwa aln` instead of `bwa mem`.

Recreation of v1 in v3 nextflow framework: [stalbrec/AMRplusplus](https://github.com/Microbial-Ecology-Group/AMRplusplus/compare/master...stalbrec:AMRplusplus:amrpp_v1_recreation)

</details>

### DeepARG

<details>
<summary>Workflow steps</summary>

1. Trimmomatic
   - trim & QC on paired-end input
   - input: forward and reverse reads, e.g. `F.fq.gz` and `R.fq.gz`
   - output:
     - `F.fq.gz.paired` surviving pairs
     - `R.fq.gz.unpaired` surviving Rs
     - `F.fq.gz.unpaired` surviving Fs
     - rest discarded
   - e.g. 25k read pairs
     - 24246 out paired, passing in both R and F
     - 519 out passing in F only
     - 216 out passing in R only
     - (29 failing in both R and F)
2. merge `F.paired` and `R.paired`
   - vsearch
     - input `R.paired` and `F.paired`
     - out `F.unmerged`, `R.unmerged` and `merged` $\rightarrow$ summed to `reads.clean`
3. DeepARG Inference
   - Diamond $\rightarrow$ database with 12279 features (gene sequences)
     - features are aligned to short reads
     - outputs `.tsv` file $\rightarrow$ hits with metrics

     ```
     X = [
               feature bit-scores ->
       reads   . . .
         |
         v     . . .
     ]
     ```
   - DeepARG predict
     - score $\geq 0.9 \rightarrow$ single hit
     - score $<0.9 \rightarrow$ first two leading hits
     - read-id; predicted class; prediction score (probability)
   - output files
     - mapping $\rightarrow$ Diamond best hit info + prediction

   :::danger
   The best hit might be the wrong one, since the sorting is done with bit-score as the primary sort key, but in Diamond the hits are sorted in descending alphabetical order (I think!) $\rightarrow$ check this!!
   :::
4. Quantification of results
   - pair the best hit from the alignment that matches the predicted class
5. Normalization
   1. `bowtie2` gg13 alignment to reads
   2. Take $N_\mathrm{16s}$ from reads and normalize ARG counts using $N_\mathrm{16s}/L_\mathrm{16s}$, where $L_\mathrm{16s} = 1432$ (**why?**)
   3. ARG counts are normalized: $N_\mathrm{sub-type} = (N_\mathrm{sub-type, raw}/L_\mathrm{sub-type~gene})/(N_\mathrm{16s}/L_\mathrm{16s})$

</details>

### hAMRonization

- https://github.com/pha4ge/hAMRonization
- harmonizes output format of various AMR detection tools
  - ABRicate https://github.com/tseemann/abricate
  - ARIBA https://github.com/sanger-pathogens/ariba
  - c-SSTAR https://github.com/chrisgulvik/c-SSTAR
  - fargene https://github.com/fannyhb/fargene
  - GROOT https://github.com/will-rowe/groot
  - kmerresistance https://bitbucket.org/genomicepidemiology/kmerresistance/src/master/
  - mykrobe https://github.com/Mykrobe-tools/mykrobe
  - PointFinder (now in ResFinder) (https://doi.org/10.1093/jac/dkx217, https://bitbucket.org/genomicepidemiology/pointfinder, https://bitbucket.org/genomicepidemiology/resfinder)
  - ResFams https://github.com/dantaslab/resfams
  - ResFinder https://bitbucket.org/genomicepidemiology/resfinder http://genepi.food.dtu.dk/resfinder
  - The Resistance Gene Identifier https://github.com/arpcard/rgi
  - sraX https://github.com/lgpdevtools/sraX
  - SRST2 https://github.com/katholt/srst2
  - staramr https://github.com/phac-nml/staramr
  - TBProfiler https://tbdr.lshtm.ac.uk/ https://github.com/jodyphelan/TBProfiler

### Other Tools/collections

- https://github.com/topics/antimicrobial-resistance-genes
- https://github.com/topics/antimicrobial-resistance
- https://nf-co.re/deepvariant/1.0
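
Coming back to the DeepARG normalization step (step 5 of the workflow steps above): the formula there can be written out as a small function. This is only a minimal sketch of the arithmetic as noted in these notes, not DeepARG's actual code. The function name `normalize_arg_count`, its argument names, and the example numbers are all made up for illustration. Only the default $L_\mathrm{16s} = 1432$ comes from the notes.

```python
def normalize_arg_count(n_subtype_raw, l_subtype_gene, n_16s, l_16s=1432):
    """Length-normalized ARG count relative to the length-normalized 16S count.

    Implements N_subtype = (N_subtype_raw / L_subtype_gene) / (N_16s / L_16s).
    All names are hypothetical; l_16s defaults to the 1432 quoted in the notes.
    """
    if l_subtype_gene <= 0 or l_16s <= 0:
        raise ValueError("gene lengths must be positive")
    if n_16s <= 0:
        raise ValueError("need at least one 16S read to normalize")
    arg_coverage = n_subtype_raw / l_subtype_gene  # ARG reads per base of the ARG
    ref_coverage = n_16s / l_16s                   # 16S reads per base of the 16S gene
    return arg_coverage / ref_coverage


if __name__ == "__main__":
    # Made-up numbers: 120 reads on an 861 bp ARG, 5000 reads on 16S.
    print(normalize_arg_count(120, 861, 5000))
```

Dividing both counts by their gene lengths first means that a long ARG does not look more abundant than a short one just because it collects more reads.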