PIG Microbiome project

# Looking for AMRs in pig gut microbiota

What we currently have:
- Annie (~800 SRA datasets from pig gut)
- Shya (24 samples from own project, with treatments)
- David (32 samples from own project, with treatments)

All of these datasets are whole-metagenome sequencing, sequenced at various depths. We would like to investigate whether some AMR (antimicrobial resistance) genes are more prevalent in some samples than in others.

I (Annie) am mainly interested in which AMR genes are present in the microbiomes from SRA: are some genes more common than others, and are there differences by sample location, pig strain, etc.?

Shya and David are interested in which AMR genes are present in their samples, and whether some genes occur more often in specific treatment groups than in others. The reasoning is that resistance to an antibiotic or heavy metal may go up when an animal is treated with it, as microbes in the animal's gut gain resistance.

## Issues with current approach

As with everything, this approach has issues. The main one is that we lose a lot of data through assembly and binning, so we risk missing AMRs that are actually present because they did not end up in an assembly or bin. **Here's a [paper](https://www.biorxiv.org/content/10.1101/2023.12.13.571436v1) about how metagenome assemblies break around AMRs**.

An alternative would be to run AMRfinder on the contigs instead of the bins. That way we retain more of the assembly, including contigs that don't end up in bins. Ideally we'd use the raw reads so that we don't lose any data, but I don't think there is currently a solid (i.e. published) approach for finding AMRs in raw read data.

## Data processing so far:

We had raw reads and ran these through atlas. Now there are MAG (metagenome-assembled genome) bins. The atlas folder also contains the contigs (assembled sequences pre-binning) and the clean reads.

Let's start by using the MAGs and looking for AMRs there, simply because this data is readily available (as practice). After discussing with Titus, it would probably be good to look for AMRs in all contigs that were assembled. We can use the predicted protein files from the MAGs, and later use prokka or prodigal to predict our own proteins on all assembled contigs. We can then use a read mapper such as bowtie2 to map the reads to all contigs/MAGs and see which ones are more abundant in certain samples than in others.

## How to run AMRfinder

I decided to take the easy approach and copy all faa files to a new folder. You guys don't need to do that, since all of your MAG annotation files are already in one folder (atlas/genomes/annotations/genes).

There are a few things you'll need, most importantly a conda env yml file for your amrfinder:
```
# Make a directory where you want your AMR results and go into it
mkdir AMRfinder
cd AMRfinder

# Now export the conda environment
# I'm doing this for my own environment as an example, please change env names as needed:
# Load your env:
mamba activate amrfinderplus

# Create a yaml file from the loaded environment
conda env export > amrfinder.env.yml

# Now in the Snakefile, there is a line that says:
#     conda: "amrfinder.env.yml"
```

Copy my snakefile from [github](https://github.com/AnneliektH/2023-swine-sra) and use it. Make sure to change the file paths for both the database and the location of the files. I'd suggest changing the number of threads per job to 2 or 4, and then running Snakemake with 28 to 32 cores (with 4 threads per job, that is 7 or 8 jobs at a time).

Run Snakemake as follows. Don't forget the --use-conda:
```
snakemake --use-conda --rerun-triggers mtime -c 32 --rerun-incomplete -k
```

## AMRfinder + database

Program to detect AMRs: [AMRfinderplus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6811410/). AMRfinder can be used on both protein and nucleotide sequences to look for AMR genes.
When given protein sequences as input, it uses both blastp and HMM search to identify genes. When given DNA sequences, it uses blastx to look for genes.

## Relative abundance after running AMRfinder

- Filter MAGs based on completeness & contamination
```
# Concatenate the per-sample checkm2 reports into one table
csvtk concat L1802*/binning/vamb/checkm2_report.tsv -t > ../../annie/david_binqual.tsv

# Filter on genome quality (run from the folder that holds david_binqual.tsv)
csvtk filter david_binqual.tsv -f "Completeness_General>=50" -t > david_binqual.comp.tsv

# Filter on contamination
csvtk filter david_binqual.comp.tsv -f "Contamination<=10" -t > david_binqual.comp.con.tsv

# Cut only the MAG names to a txt file; tail drops the header row so it is not read as a MAG name
csvtk cut david_binqual.comp.con.tsv -t -f 1 | tail -n +2 > davidMAGs.txt

# Copy the passing MAGs to a new folder (adjust the destination path as needed)
cd ALL_NUCLEOTIDE_FASTA
for f in $(cat ../../annie/davidMAGs.txt); do cp "$f.fasta" /medhigh/; done
```

- Do taxonomy on the bins (high/med qual)
```
mamba activate gtdbtk
# Note: the GTDB-Tk output flag is --out_dir (with an underscore)
gtdbtk classify_wf --cpus 16 \
    --genome_dir ./genomes/path/to/genomes/ \
    --extension fasta \
    --out_dir ./genomes/taxonomy/
```
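
The MAG quality filter above runs two csvtk passes and writes two intermediate files. As a cross-check, here is a minimal sketch that applies both thresholds in a single pass with awk. The function name `filter_mags` is hypothetical. It assumes a tab-separated report whose header has the `Completeness_General` and `Contamination` columns used above, with the MAG name in the first column. The defaults of 50 and 10 are the thresholds from the csvtk commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# filter_mags REPORT [MIN_COMPLETENESS] [MAX_CONTAMINATION]
# Prints the first column (assumed to be the MAG name) of every row that
# passes both thresholds. Columns are located by header name, so their
# order in the report does not matter. The header row is not printed.
filter_mags() {
    local report="$1" min_comp="${2:-50}" max_cont="${3:-10}"
    awk -F'\t' -v min_comp="$min_comp" -v max_cont="$max_cont" '
        NR == 1 {
            # Find the two quality columns in the header row
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness_General") comp = i
                if ($i == "Contamination") cont = i
            }
            if (!comp || !cont) {
                print "missing Completeness_General or Contamination column" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Keep rows with completeness >= min and contamination <= max
        ($comp + 0) >= min_comp && ($cont + 0) <= max_cont { print $1 }
    ' "$report"
}

# Example (file names as in the csvtk commands above):
#   filter_mags david_binqual.tsv > davidMAGs.txt
```

Comparing its output with the csvtk-derived list (for example with `diff`) is a quick way to confirm both routes select the same MAGs.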