Análisis taxonómico general Shotgun

# Análisis taxonómico general Shotgun ###### tags: `WGS` `estudiantes` El flowchart que he usado para asignación taxonómica y funcional, es este: 1. Ver QC y quitar adaptadores… (MetaWRAP usa Read_QC y Trimgallore) 2. Filtrar seq. humanas. Con BMTagger Descargar: hg38 index. - En METAWrap 3. Asignación taxonómica - Metaphlan3 :custard: :::warning Esto es una guia, lo primero es revisar actulizaciones de programas en la web de cada uno y usar siempre el más actulizado. [Metaphlan 4](https://huttenhower.sph.harvard.edu/metaphlan/) [Metawrap](https://github.com/bxlab/metaWRAP) ::: ## MetaWRAP :taco: ```bash=s conda create -y -n metawrap-env phyton=2.7 conda install -y -c ursky metwrap-mg ``` Sacar data de c/carpeta y juntar en una. Luego descomprimir ```bash=s gunzip *.gz ``` :::info Cambiar nombre/ quitar toda la :shit: del nombre De D1807_ESFP200009043-1a_HCYJCDSXY_L2_1.fq a D1807_1.fq ::: ```bash=v mkdir RAW_READS mv *.fq /home/laura/Laura/ECCMID/RAW_READS ls RAW_READS mkdir READ_QC ``` Limpiar y filtrar reads del host... (Paciencia :hourglass_flowing_sand:) ```bash=v conda activate metawrap-env for F in RAW_READS/*_1.fq; do R=${F%_*}_2.fq BASE=${F##*/} SAMPLE=${BASE%_*} metawrap read_qc -1 $F -2 $R -t 6 -o READ_QC/$SAMPLE & done ``` Mover las secuencias limpias y sin human-host ```bash= mkdir CLEAN_READS for i in READ_QC/*; do b=${i#*/} mv ${i}/final_pure_reads_1.fastq CLEAN_READS/${b}_1.fastq mv ${i}/final_pure_reads_2.fastq CLEAN_READS/${b}_2.fastq done ``` ## Metaphlan :custard: ``` bash conda create --name mpa -c bioconda python=3.7 metaphlan ``` :::info [MetaPhlAn](https://sci-hub.st/https://doi.org/10.1038/nmeth.2066) estimates the relative abundance of microbial cells by mapping reads against a reduced set of clade-specific marker sequences that are computationally preselected from coding sequences that unequivocally identify specific microbial clades at the species level or higher taxonomic levels and cover all of the main functional categories. ::: >The MetaPhlAn classifier compares each metagenomic read from a sample to this marker catalog to identify high-confidence matches. ##### Descargar e instalar Chocophlan. :eyes: (Debería instalarse si descargas metaphlan de Conda, pero OJO con las actulizaciones!) > MetaPhlAn relies on ~1.1M unique clade-specific marker genes (the latest marker information file mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2 can be found [here](https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAlyQITZuUCtBUJxpxhIroIa/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1) identified from ~100,000 reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic) ``` bash=v metaphlan --install --index mpa_v30_CHOCOPhlAn_201901 --bowtie2db <database folder> ``` #### Usar metaphlan. Metaphlan no toma en cuenta si son paired-end reads... :::warning Because spurious reads are unlikely to have significant matches with a marker sequence, no preprocessing of metagenomic DNA (such as error detection, assembly or gene annotation) is required. The classifier normalizes the total number of reads in each clade by the nucleotide length of its markers and provides the relative abundance of each taxonomic unit, taking into account any markers specific to subclades. Microbial reads belonging to clades with no available sequenced genomes are reported as an ‘unclassified’ subclade of the closest ancestor for which there is available sequence data. ::: - Script para ejecutar metaphlan. ```perl=v #!/usr/bin/perl #dar de argumento solo los _1 foreach $ar (@ARGV) { @field = split (/_/, $ar); $name = $ar; $name =~ s/_1.fastq//; $r1 = $ar; $r2 = $ar; $r2 =~ s/1.fastq/2.fastq/; system("metaphlan $r1,$r2 --bowtie2out $name.bowtie2.bz2 --nproc 6 --input_type fastq -o profiled_$name.txt"); } ``` - Ejecutar comando ```bash=v conda activate mpa perl metaphlan.pl *.fastq ``` - Unir resultados: ```bash=v #!bash python utils/merge_metaphlan_tables.py profiled_*.txt > output/merged_abundance_table.txt ``` # Análisis en R :computer: Para pasar los datos de Metaphlan a phyloseq usamos la función metaphlantophyloseq: ```r=1 metaphlanToPhyloseq <- function( tax, metadat=NULL, simplenames=TRUE, roundtointeger=FALSE, split="|"){ ## tax is a matrix or data.frame with the table of taxonomic abundances, rows are taxa, columns are samples ## metadat is an optional data.frame of specimen metadata, rows are samples, columns are variables ## if simplenames=TRUE, use only the most detailed level of taxa names in the final object ## if roundtointeger=TRUE, values will be rounded to the nearest integer xnames = rownames(tax) shortnames = gsub(paste0(".+\\", split), "", xnames) if(simplenames){ rownames(tax) = shortnames } if(roundtointeger){ tax = round(tax * 1e4) } x2 = strsplit(xnames, split=split, fixed=TRUE) taxmat = matrix(NA, ncol=max(sapply(x2, length)), nrow=length(x2)) colnames(taxmat) = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain")[1:ncol(taxmat)] rownames(taxmat) = rownames(tax) for (i in 1:nrow(taxmat)){ taxmat[i, 1:length(x2[[i]])] <- x2[[i]] } taxmat = gsub("[a-z]__", "", taxmat) taxmat = phyloseq::tax_table(taxmat) otutab = phyloseq::otu_table(tax, taxa_are_rows=TRUE) if(is.null(metadat)){ res = phyloseq::phyloseq(taxmat, otutab) }else{ res = phyloseq::phyloseq(taxmat, otutab, phyloseq::sample_data(metadat)) } return(res) } ``` ## Convertir en Objeto phyloseq * Metadata: Añadir metadata ojo con los Sample_names :eyes: ```r=1 DATA_plus <- read_csv("Data/Objetos/data_plus.csv") DATAmfa.cluster <- read_csv("Data/Objetos/DATAmfa.cluster.csv") DATA <- DATAmfa.cluster %>% select(ID,Cluster) names(DATA) <- c("Sample_ID", "Cluster") DATA <- DATA %>% left_join(DATA_plus) DATA$Cluster <- factor(DATA$Cluster, levels = c("1", "2"), labels=c("1","2")) row.names(DATA) <- DATA$Sample_ID sample <- sample_data(DATA) sample_names(sample) <- DATA$Sample_ID ``` * Taxonomía: Leemos la tabla de abundancia "merged_abundance_table.txt". Quitamos la primera línea y arreglamso SampleID para que coincida. > Revisar que no haya errores con los números. ```r=1 Abundance <- read_csv("Data/Metagenomica/merged_abundance_table.txt") Abundance <- column_to_rownames(Abundance, var="clade_name") Abundance$NCBI_tax_id <- NULL ``` * Cargar script Function metaphlantophyloseq ```r=1 phyloseqin= metaphlanToPhyloseq(Abundance, metadat = sample) phyloseqin ``` * Añadir árbol filogenético para poder calcular distancia Unifrac ```r=1 #Arbol random_tree=rtree(ntaxa(phyloseqin), rooted = T, tip.label = taxa_names(phyloseqin)) #Objeto phyloseq con arbol physeq.tree <- merge_phyloseq(phyloseqin, random_tree) ```