Pangenomics integral pipeline renewed

# NCBI Dataset and Dataformat programs to download metadata of genomes https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/ ## Installation First, create a conda environment: `conda create -n ncbi_datasets` Then, activate your new environment: `conda activate ncbi_datasets` Finally, install the datasets conda package: `conda install -c conda-forge ncbi-datasets-cli` ## Download genomes using GCA accesions **For just one genome** `datasets download genome accession GCF_XXXXXXXXX.XX --filename genomes_dataset.zip` **For multiple genomes** Crear un archivo de texto con los accession numbers **(accessions.txt)**: ``` GCF_XXXXXXXXX.X1 GCF_XXXXXXXXX.X2 GCF_XXXXXXXXX.X3 GCF_XXXXXXXXX.X4 GCF_XXXXXXXXX.X5 ``` Correr el comando: ```datasets download genome accession --inputfile accessions.txt --filename genomes.zip``` ## dataset: Print a data report containing genome metadata Utilizar el siguiente comando agregando los accession GCA o GCF. Para un solo genoma: `datasets summary genome accession GCA_XXXXXXXX --as-json-lines` Para un multiples genoma: `datasets summary genome accession GCA_XXXXXXXX GCA_YYYYYYYY GCA_ZZZZZZZZZZ --as-json-lines` **Sin embargo, hay que crear el archivo de salida utilizando el siguiente, porque, de lo contrario los metadatos solo se mostraran en la terminal:** `datasets summary genome accession GCA_XXXXXXXX GCA_YYYYYYYY GCA_ZZZZZZZZZZ --as-json-lines > assembly_data_report.jsonl` ## dataformat: convert metadata from JSON Lines format to other formats. Esto va a crear un .json file, por lo que necesitaremos del programa dataformat para convertirlo a formato excel: ``` dataformat excel genome --inputfile assembly_data_report.jsonl --outputfile NOMBRE.xlsx dataformat excel genome --package NOMBRE.zip --outputfile NOMBRE.xlsx ``` Lo que se puede hacer y es mejor, es que el archivo que se descargo del NCBI junto con los genomas .fna, el archivo assembly_data_report.jsonl sea utilizado para esto. Por lo que, no se tendria que usar dataset para obtener los metadatos, si no solo convertir el jsonl que se descargo inicialmente. # ncbi-genome-download https://github.com/kblin/ncbi-genome-download Download fasta genomes directly ## Installation `pip install ncbi-genome-download` ## Download fasta genomes using accesions Crear un archivo de texto con los accession numbers **(accessions.txt)**: ``` GCF_XXXXXXXXX.X1 GCF_XXXXXXXXX.X2 GCF_XXXXXXXXX.X3 GCF_XXXXXXXXX.X4 GCF_XXXXXXXXX.X5 ``` Corre el comando: `ncbi-genome-download bacteria -A accessions.txt --formats fasta --section genbank --output-folder genomes_fna -v ` # CheckM2 **Rapid assessment of genome bin quality using machine learning.** https://github.com/chklovski/CheckM2 --- ## **Installation** The easiest way to install is using Conda in a new environment: `conda create -n checkm2 -c bioconda -c conda-forge checkm2` However, conda can be very slow when processing requirements for the environment. A much faster and better way to install CheckM2 is to install using mamba and creating a new environment: `mamba create -n checkm2 -c bioconda -c conda-forge checkm2` --- ## **Bin quality prediction** The main use of CheckM2 is to predict the completeness and contamination of metagenome-assembled genomes (MAGs) and single-amplified genomes (SAGs), although it can also be applied to isolate genomes. You can give it a folder with FASTA files using --input and direct its output with --output-directory: ``` checkm2 predict --threads 30 --input <folder_with_bins> --output-directory <output_folder> ``` --- ## Quality prediction for multiple genomes (Script bash) ``` #!/bin/bash # Directorios de entrada y salida INPUT_DIR="ruta/a/los/archivos/fna" OUTPUT_DIR="ruta/a/donde/se/generaran/los/archivos/checkm2" # Recorre cada archivo .fna en el directorio for FILE in "$INPUT_DIR"/*.fna; do # Extrae el nombre del archivo sin la extensión BASENAME=$(basename "$FILE" .fna) # Ejecuta CheckM2 y guarda los resultados en un subdirectorio específico checkm2 predict --threads 20 --input "$FILE" --output-directory "$OUTPUT_DIR/${BASENAME}_checkm2" & # Controla el número de procesos en paralelo (ajusta según tu capacidad) while [ "$(jobs -r | wc -l)" -ge 10 ]; do sleep 5 done done # Espera a que terminen todos los procesos wait echo "CheckM2 ha finalizado en todos los archivos." ``` **Merge quality_report.tsv files** This guide demonstrates how to merge all the quality_report.tsv files located in different folders within the directory ***/ruta/hacia/el/directorio/que/contiene/las/carpetas/generadas/por/checkm2*** into a single file called **merged_quality_report.tsv**. The final file will include the header (the first line) only once, avoiding any duplicates. Execute the following script directly in the terminal: ``` first=1 for file in */quality_report.tsv; do if [ $first -eq 1 ]; then cat "$file" > merged_quality_report.tsv first=0 else tail -n +2 "$file" >> merged_quality_report.tsv fi done ``` **Explanation:** For the first file, the entire content (including the header) is copied into the output file. For subsequent files, tail -n +2 skips the first line (the header) and appends the remaining lines. **Filtration of genome dataset** A common standard for "near complete" assembly is **>90% completeness** and **<5% contamination**. For this go to the merge tsv and discard los genomas segun el standard o segun las necesidades de cada taxon. Crear un directorio donde esten todos los genomas fna que se utilizaron: `mkdir fna_filtered | cp ./fna_coded/*.fna fna_filtered` Contar el numero de genomas pre-filtracion: `ls | wc -l` Eliminar los genomas que no cumplan con el standard establecido: `rm genome1.fna genome1.fna genome1.fna genome1.fna genome1.fna ` Contar el numero de genomas pos-filtracion: `ls | wc -l` # Prokka ## **Installation** The easiest way to install is using Bioconda in a new environment: `conda install -c conda-forge -c bioconda -c defaults prokka` ## **Anotacion de multiples genomas (Script sh)** Crear un nuevo archivo usando nano: `nano prokka_multiple_anot.sh` Copiar y pegar el script: ``` #!/bin/bash outputdir="PROKKA" fnadir="/directorio/de/los/fna/filtrados" extension=".fna" cpus=20 # Create output directory if it doesn't exist if [ ! -d "$outputdir" ]; then mkdir "$outputdir" fi allfna=("${fnadir}"/*"${extension}") for file in "${allfna[@]}" do bname=$(basename "$file" "${extension}") prokka --cpus "$cpus" --kingdom Bacteria --locustag "$bname" \ --addgenes --outdir "${outputdir}/${bname}" --prefix "$bname" "$file" done ``` Cambiar los permisos para que sea ejecutable: `chmod +777 prokka_multiple_anot.sh` Correr el script: `./prokka_multiple_anot.sh` Si todo esta bien se observara algo asi en el bash: ![image](https://hackmd.io/_uploads/BJseQfp5kl.png) **Output files** ![image](https://hackmd.io/_uploads/S1uqWMpckl.png) # Roary https://github.com/sanger-pathogens/Roary ## Installation ## Analysis To run the analysis using gff files from PROKKA annotation: `roary -f ./roary_output -e -r -n -v -p 7 *.gff` ## Plots **Script para generar curva de rarefaccion ** ``` library(ggplot2) # Archivos de datos archivo_conserved = "number_of_conserved_genes.Rtab" archivo_total = "number_of_genes_in_pan_genome.Rtab" # Leer datos y calcular medias conserved = colMeans(read.table(archivo_conserved)) total = colMeans(read.table(archivo_total)) # Crear un dataframe combinado datos = data.frame( Genes = c(conserved, total), Genomas = c(1:length(conserved), 1:length(total)), Tipo = c(rep("Genes Conservados", length(conserved)), rep("Genes Totales", length(total))) ) # Crear la gráfica p <- ggplot(data = datos, aes(x = Genomas, y = Genes, color = Tipo)) + geom_line(size = 1.2) + theme_classic() + xlab("Number of genomes") + ylab("Number of genomes") + theme_bw(base_size = 16) + theme(legend.position = "right") # Guardar la gráfica en formato PNG ggsave(filename = "pangenoma_unico.png", plot = p, width = 10, height = 10) # Guardar la gráfica en formato SVG ggsave(filename = "pangenoma_unico.svg", plot = p, width = 10, height = 10) ``` **Script para generar curva de rarefaccion con multiples especies o genomas** ``` library(ggplot2) # Lista de especies y sus archivos correspondientes especies = c("all", "G1", "G2") archivos_conserved = c("number_of_conserved_genes_ALL.Rtab", "number_of_conserved_genes_G1.Rtab", "number_of_conserved_genes_G2.Rtab") archivos_total = c("number_of_genes_in_pan_genome_ALL.Rtab", "number_of_genes_in_pan_genome_G1.Rtab", "number_of_genes_in_pan_genome_G2.Rtab") # Leer datos y combinarlos en un dataframe datos_combinados = data.frame() for (i in 1:length(especies)) { conserved = colMeans(read.table(archivos_conserved[i])) total = colMeans(read.table(archivos_total[i])) datos_especie = data.frame( Genes = c(conserved, total), Genomas = c(1:length(conserved), 1:length(total)), Tipo = c(rep("Genes Conservados", length(conserved)), rep("Genes Totales", length(total))), Especie = rep(especies[i], 2 * length(conserved)) ) datos_combinados = rbind(datos_combinados, datos_especie) } # Crear la gráfica p <- ggplot(data = datos_combinados, aes(x = Genomas, y = Genes, color = Tipo)) + geom_line(aes(linetype = Especie)) + scale_linetype_manual(values = c("solid", "dotted", "dashed")) + theme_classic() + xlab("Número de Genomas") + ylab("Número de Genes") + theme_bw(base_size = 16) + theme(legend.position = "right") # Guardar la gráfica en formato PNG ggsave(filename = "comparacion_pangenomas.png", plot = p, width = 10, height = 10) # Guardar la gráfica en formato SVG ggsave(filename = "comparacion_pangenomas.svg", plot = p, width = 10, height = 10) ``` ### Script para generara graficos: modificacion del py ``` #!/usr/bin/env python # Copyright (C) <2015> EMBL-European Bioinformatics Institute # Modified for enhanced aesthetics and SVG output by Gemini. # Este script es una versión modificada de roary_plots.py para generar # gráficos de mayor calidad visual, listos para una publicación. # Para usarlo, ejecuta en la terminal: # python roary_plots_custom.py [tu_arbol.newick] [tu_matriz.csv] __author__ = "Marco Galardini & Gemini" __version__ = '1.0.0' def get_options(): """Configura y parsea los argumentos de la línea de comandos.""" import argparse description = "Create enhanced plots from Roary outputs." parser = argparse.ArgumentParser(description=description, prog='roary_plots_custom.py') parser.add_argument('tree', action='store', help='Newick Tree file (e.g., accessory_binary_genes.fa.newick)') parser.add_argument('spreadsheet', action='store', help='Roary gene presence/absence spreadsheet (e.g., gene_presence_absence.csv)') parser.add_argument('--labels', action='store_true', default=False, help='Add node labels to the tree (up to 10 chars)') # El formato por defecto ahora es SVG parser.add_argument('--format', choices=('png', 'tiff', 'pdf', 'svg'), default='svg', help='Output format [Default: svg]') parser.add_argument('-N', '--skipped-columns', action='store', type=int, default=14, help='First N columns of Roary\'s output to exclude [Default: 14]') parser.add_argument('--version', action='version', version='%(prog)s ' + __version__) return parser.parse_args() if __name__ == "__main__": options = get_options() import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt import seaborn as sns import pandas as pd import numpy as np from Bio import Phylo # Importamos una función para crear mapas de color personalizados from matplotlib.colors import ListedColormap # --- Configuración de Estilo Global --- # Usamos el estilo 'ticks' para un look limpio sin cuadrícula de fondo. sns.set_theme(style="ticks") # Aseguramos que todo el texto sea negro por defecto para máxima legibilidad. plt.rc('text', color='black') plt.rc('axes', labelcolor='black', edgecolor='black') plt.rc('xtick', color='black') plt.rc('ytick', color='black') plt.rc('figure', dpi=300) # --- Carga y Preparación de Datos --- print("-> Loading tree and Roary spreadsheet...") t = Phylo.read(options.tree, 'newick') roary = pd.read_csv(options.spreadsheet, low_memory=False) roary.set_index('Gene', inplace=True) roary.drop(list(roary.columns[:options.skipped_columns-1]), axis=1, inplace=True) roary.replace('.{2,100}', 1, regex=True, inplace=True) roary.replace(np.nan, 0, regex=True, inplace=True) print("-> Data loaded successfully.") # --- Cálculo de Categorías del Pangenoma --- num_genomes = roary.shape[1] core_threshold = num_genomes * 0.99 soft_core_threshold = num_genomes * 0.95 shell_threshold = num_genomes * 0.15 gene_sums = roary.sum(axis=1) # --- Gráfico 1: Histograma de Frecuencia del Pangenoma --- print("-> Generating pangenome frequency plot...") freq_df = pd.DataFrame({'frequency': gene_sums}) def assign_category_hist(freq): if freq >= core_threshold: return 'Core genes' elif freq >= soft_core_threshold: return 'Soft-core genes' elif freq >= shell_threshold: return 'Shell genes' else: return 'Cloud genes' freq_df['Category'] = freq_df['frequency'].apply(assign_category_hist) # Definimos la paleta de colores "Pastel1" que se usará en los gráficos de composición comp_palette = sns.color_palette("Pastel1", 4) color_map = { 'Cloud genes': comp_palette[0], 'Shell genes': comp_palette[1], 'Soft-core genes': comp_palette[2], 'Core genes': comp_palette[3] } plt.figure(figsize=(10, 6)) sns.histplot(data=freq_df, x='frequency', hue='Category', multiple="stack", bins=num_genomes, palette=color_map, edgecolor="white", linewidth=0.5, hue_order=['Cloud genes', 'Shell genes', 'Soft-core genes', 'Core genes']) plt.xlim(-0.5, num_genomes + 0.5) # Ajuste para centrar las barras plt.title('Pangenome Gene Frequency', fontsize=16, fontweight='bold') plt.xlabel('Number of Genomes', fontsize=12, fontweight='bold') plt.ylabel('Number of Genes', fontsize=12, fontweight='bold') sns.despine() # Elimina las líneas del recuadro superior y derecho plt.tight_layout() plt.savefig(f'pangenome_frequency.{options.format}') plt.clf() # --- Gráfico 2: Árbol Filogenético y Matriz --- print("-> Generating tree and presence/absence matrix plot...") roary_sorted = roary.loc[roary.sum(axis=1).sort_values(ascending=False).index] roary_sorted = roary_sorted[[x.name for x in t.get_terminals()]] fig = plt.figure(figsize=(17, 10)) gs = fig.add_gridspec(1, 40, wspace=0) ax_tree = fig.add_subplot(gs[0, 0:10]) with plt.rc_context({'lines.linewidth': 0.5}): Phylo.draw(t, axes=ax_tree, show_confidence=Fal ``` # Panaroo ## Installation Although Panaroo supports the latest version of python, currently Bioconda does not support versions >= 3.10 for all packages. An older version of python can be installed by creating an environment as ``` conda create -n panaroo python=3.9 conda activate panaroo ``` Panaroo can then be installed by running the following within a conda environment `conda install -c conda-forge -c bioconda -c defaults 'panaroo>=1.3'` ## Analysis ### Input format Panaroo now supports multiple input formats. To use non-standard **GFF3** files you must profile the input file as a list in a text file (one per line). Separate **GFF** and **FASTA** files can be provided per isolate by providing each file delimited by a space or a tab. Genbank file formats are also supported with extensions `'.gbk', '.gb' or '.gbff'`. These must compliant with Genbank/ENA/DDJB. This can be forced in Prokka by specifying the `--compliance` parameter. **NOTE:** Some annotations file such as those from RefSeq include annotations that break some of the assumptions of Panaroo. These include gene annotations of unusual length or that with premature stop codons (pseudogenes). By default Panaroo will throw an error if it encounters these annotations. You can automatically filter out such annotations by calling panaroo with the `--remove-invalid-genes` flag. ### Clean mode: strict By default Panaroo runs in its strictest (most conservative) mode. We have found that for most use cases this removes potential sources of contamination and error whilst retaining the majority of genes researchers are interested in. `panaroo -i *.gff -o results --clean-mode strict` ### Clean mode: sensitive Very rare plasmids are difficult to distinguish from contamination. Thus, if you are interested in retaining such plasmids at the expense of added contamination we recommend running panaroo using its most sensitive mode. `panaroo -i *.gff -o results --clean-mode sensitive` ### Exportar grafico para # Bacta https://github.com/oschwengers/bakta ### Installation ``` conda create -n bakta conda activate bakta conda install -c conda-forge -c bioconda bakta ``` # Anvio ## Installation **FASTA reformat** **PROGRAM: *anvi-script-reformat-fasta*** Asegúrate de tener un archivo llamado **genomes.txt** que contenga los nombres de los genomas que deseas procesar. Puedes crear este archivo de la siguiente manera: `ls *.fna | sed 's/.fna$//' > genomes.txt` Esto generará un archivo **genomes.txt** que contiene los nombres de los archivos FASTA sin la extensión ".fna". Crear un archivo .sh con el nombre que se desee usando nano para el script: `nano reformat_fasta.sh` Copiar y pegar el script en el archivo nano: ``` #Script for g in $(cat genomes.txt); do echo echo "Working on $g ..." echo # Obtener el nombre base del archivo base_name=$(basename "$g" .fna) # Obtener el nombre sin extensión del archivo file_name=$(basename "$g") name_without_ext="${file_name%.*}" # Reformatear el archivo fasta anvi-script-reformat-fasta "${base_name}.fna" \ --min-len 1000 \ --simplify-names \ --prefix "$name_without_ext" \ --seq-type NT \ -o "${base_name}_r.fna" done ``` Control X, Y, para guardar y salvar el archivo. Cambiar el archivo a executable: `chmod +777 reformat_fasta.sh` Ejecutar el script: `./reformat_fasta.sh` **Create contigs databases** **PROGRAM: *Anvio-gen-contigs-database*** The first thing we need to do is to generate an anvi’o contigs database for each one of them. Run the following script in directly in the terminal: ``` for i in `ls *fna | awk 'BEGIN{FS=".fna"}{print $1}'` do anvi-gen-contigs-database -f $i.fna -o $i.db -T 6 anvi-run-hmms -c $i.db -T 6 done ``` At the end of this, for each fna file you should have a file with the same name that ends with ‘**.db**’. The next step is to define all the contigs databases of interest, and give them a name to introduce them to anvi’o. Let’s call this file ‘**external-genomes.txt**’: **Contigs database annotation and HMMs SEARCHING?** Para anotar los db **PENDIENTE** ``` echo -e "name\tcontigs_db_path" > external-genomes.txt ls *.db | sed 's/\.db$//' | awk '{print $0 "\t" $0 ".db"}' >> external-genomes.txt ``` **Contigs database KEGG, COGs and tRNAs** Para anotar los db **PENDIENTE** ``` for g in *.db do anvi-run-kegg-kofams -c $g --num-threads 4 anvi-run-ncbi-cogs -c $g --num-threads 4 anvi-scan-trnas -c $g --num-threads 4 done ``` Antes de crear el -GENOMES.db file se debe **verificar que los `db files` hayan sido anotados correctamente y que realmente contengan las anotaciones funcionales como KEGGs, COGs, HMMs y RNAs**. Si se quiere verificar genoma por genoma utilizar el siguiente: `anvi-db-info genomaX.db` Se obtendra un output como el siguiente: ``` DB Info (no touch) =============================================== Database Path ................................: genomaX.db description ..................................: [Not found, but it's OK] db_type ......................................: contigs (variant: unknown) version ......................................: 21 DB Info (no touch also) =============================================== project_name .................................: genomaX contigs_db_hash ..............................: hash2900ccfd split_length .................................: 20000 kmer_size ....................................: 4 num_contigs ..................................: 39 total_length .................................: 4007300 num_splits ...................................: 198 gene_level_taxonomy_source ...................: None genes_are_called .............................: 1 external_gene_calls ..........................: 0 external_gene_amino_acid_seqs ................: 0 skip_predict_frame ...........................: 0 splits_consider_gene_calls ...................: 1 scg_taxonomy_was_run .........................: 0 scg_taxonomy_database_version ................: None trna_taxonomy_was_run ........................: 0 trna_taxonomy_database_version ...............: None creation_date ................................: 1736701872.34964 modules_db_hash ..............................: a2b5bde358bb gene_function_sources ........................: KEGG_BRITE,COG20_PATHWAY,Transfer_RNAs,COG20_FUNCTION,KEGG_Module,COG20_CATEGORY,KOfam,KEGG_Class * Please remember that it is never a good idea to change these values. But in some cases it may be absolutely necessary to update something here, and a programmer may ask you to run this program and do it. But even then, you should be extremely careful. AVAILABLE GENE CALLERS =============================================== * 'prodigal' (3,661 gene calls) * 'Transfer_RNAs' (58 gene calls) * 'Ribosomal_RNA_23S' (2 gene calls) * 'Ribosomal_RNA_16S' (1 gene calls) AVAILABLE FUNCTIONAL ANNOTATION SOURCES =============================================== * COG20_CATEGORY (3,243 annotations) * COG20_FUNCTION (3,243 annotations) * COG20_PATHWAY (828 annotations) * KEGG_BRITE (2,744 annotations) * KEGG_Class (600 annotations) * KEGG_Module (600 annotations) * KOfam (2,746 annotations) * Transfer_RNAs (58 annotations) AVAILABLE HMM SOURCES =============================================== * 'Archaea_76' (76 models with 35 hits) * 'Bacteria_71' (71 models with 74 hits) * 'Protista_83' (83 models with 6 hits) * 'Ribosomal_RNA_12S' (1 model with 0 hits) * 'Ribosomal_RNA_16S' (3 models with 1 hit) * 'Ribosomal_RNA_18S' (1 model with 0 hits) * 'Ribosomal_RNA_23S' (2 models with 2 hits) * 'Ribosomal_RNA_28S' (1 model with 0 hits) * 'Ribosomal_RNA_5S' (5 models with 0 hits) * 'Transfer_RNAs' (61 models with 58 hits) ``` **Para obtener un archivo multiple con la informacion de las anotaciones si hay o no anotacion tanto de KEGGs como de COGs:** ``` import os import subprocess def tiene_anotacion(linea, tipo): if tipo == "COG": return any(cog in linea for cog in [ "COG20_CATEGORY", "COG20_FUNCTION", "COG20_PATHWAY" ]) if tipo == "KEGG": return any(kegg in linea for kegg in [ "KEGG_BRITE", "KEGG_Class", "KEGG_Module", "KOfam" ]) return False with open("anotaciones_funcionales.tsv", "w") as salida: salida.write("archivo\tCOG\tKEGG\n") # encabezado for archivo in os.listdir(): if archivo.endswith(".db"): proceso = subprocess.run(["anvi-db-info", archivo], capture_output=True, text=True) salida_texto = proceso.stdout + proceso.stderr tiene_cog = tiene_kegg = False for linea in salida_texto.splitlines(): if "*" in linea: if tiene_anotacion(linea, "COG"): tiene_cog = True if tiene_anotacion(linea, "KEGG"): tiene_kegg = True cog_str = "sí" if tiene_cog else "no" kegg_str = "sí" if tiene_kegg else "no" salida.write(f"{archivo}\t{cog_str}\t{kegg_str}\n") print("✅ Archivo 'anotaciones_funcionales.tsv' generado con éxito.") ``` Correr este script de python en la terminal directamente y en el directorio que contiene los `db files`. El output `anotaciones_funcionales.tsv` debe ser algo asi: ``` archivo COG KEGG Aquamarina_558.db no no Aquamarina_DS40M3.db no no Neptunia_CECT5815.db no no Neptunia_DSM15720.db no no Populi_PA5.db no no Alkaliphila_2B2203Z.db sí sí Alkaliphila_BAA953.db sí sí Alkaliphila_DSM16354.db sí sí Alkaliphila_LS44.db sí sí Alkaliphila_WN018.db sí sí Alkaliphila_X3.db sí sí ``` Si representa que si hay anotacion para ese tipo de funcion y no para cuando no existe anotacion en el db para esa funcion. Para asegurarse que estos resultados estan correctos **verificar siempre de forma aleatorio y representativa** esto de forma manual. **anvi-gen-genomes-storage** Create genome storage db: anvi-gen-genomes-storage -e external-genomes.txt -o Met_genus-GENOMES.db **anvi-pan-genome** Start pangenome analysis: anvi-pan-genome -g Met_genus-GENOMES.db --project-name Methanobrevibacter_genus --num-threads 12 **anvi-display-pan** Display the pangenome: anvi-display-pan -g Met_genus-GENOMES.db -p Met_genus-PAN.db Extract bacteria 71 genes for phylogenomics # Anvio metabolism prediction `anvi-estimate-metabolism -e external-genomes.txt -O Enterococcus --matrix-format` `anvi-matrix-to-newick Enterococcus-completeness-MATRIX.txt` ### dry run to get the profile db: ``` anvi-interactive -d Enterococcus-completeness-MATRIX.txt -p Enterococcus_metabolism_PROFILE.db --manual-mode --dry-run ``` ### run for reals: ``` anvi-interactive --manual-mode -d Enterococcus-completeness-MATRIX.txt -t Enterococcus-completeness-MATRIX.txt.newick -p Enterococcus_metabolism_PROFILE.db --title "Enterococcus Metabolism Heatmap" ``` ### dry run to get the profile db: ``` anvi-interactive -d Enterococcus-completeness-MATRIX.txt \ -p Enterococcus_metabolism_PROFILE.db \ --manual-mode \ --dry-run ``` ### run for reals: ``` anvi-interactive --manual-mode \ -d Enterococcus-completeness-MATRIX.txt \ -t Enterococcus-completeness-MATRIX.txt.newick \ -p Enterococcus_metabolism_PROFILE.db \ --title "Enterococcus Metabolism Heatmap" ``` ### learn where the MODULES.db is: ``` export ANVIO_MODULES_DB=$(python -c "import anvio; import os; print(os.path.join(os.path.dirname(anvio.__file__), 'data/misc/KEGG/MODULES.db'))") ``` ### start an empty file: ``` echo -e "module\tclass\tcategory\tsubcategory\tname" > modules_info.txt ``` ### get module classes: ``` sqlite3 $ANVIO_MODULES_DB "select module, data_value from modules where data_name='CLASS'" | sed 's/; /|/g' | tr '|' '\t' >> module_class.txt ``` ### get module names: ``` sqlite3 $ANVIO_MODULES_DB "select module, data_value from modules where data_name='NAME'" | tr '|' '\t' > module_names.txt ``` **COMENTARIO:** en estos casos lo que ha cambiado es que no se llama kegg_modules si no modules, como se observó al revisar los archivos con la ayuda del comando: `sqlite3 $ANVIO_MODULES_DB ".tables"` en este comando nos dimos cuenta que en realidad era la tabla modules y lo comprobamos: ``` sqlite3 $ANVIO_MODULES_DB SELECT * FROM modules; ``` ### Unir todo `paste module_class.txt module_names.txt >> modules_info.txt` Se debe eliminar manualmente la segunda columna de modulos porque si no anvio no deja correrlo, ya que hay 5 columnas con nombres y 6 filas. **VERIFICAR PENDIENTE PARA EXPLICAR MEJOR.** A partir de aqui, el archivo que se anexara en el `PROFILE.db` va a depender del interes. Dado el caso sean todos, se utiliza el archivo creado `modules_info.txt` pero for prefiltered modules of interest hay que hacer un proceso de filtrado de los datos antes. #### Para extraer los modulos y poder posteriormente solo crear un archivo con los modulos de interes crear un archivo extract_modules_info.py con el script: Copiar los modulos que estan en el archivo de stepwise o pathwise completeness, dependiendo de cual enfoque se utilice y utilizarlos para obtener la informacion de los modulos a partir del archivo `modules_info.txt` que contiene toda la info de todos los modulos del keggs. Esto se hace de tal forma, al hacer la importacion solo aparezca en el filograma los modulos de interes, en este caso, los que se estan extrayendo, pero posterior a su filtrado, y no como cuando se importa directamente el `modules_info.txt` en el cual aparece en el filograma todos los modulos, incluyendo lo de interes y no intres, completos e incompletos. Crear un archivo python con el siguiente script: ``` def cargar_modulos(archivo_modulos): """Carga los módulos de un archivo de texto.""" with open(archivo_modulos, "r") as f: return {line.strip() for line in f if line.strip() and not line.startswith("module")} def extraer_info_modulos(archivo_info, modulos_interes): """Extrae la información de los módulos del archivo modules_info.txt""" info_modulos = {} with open(archivo_info, "r") as f: for line in f: parts = line.strip().split("\t") # Separar por tabulaciones if parts[0] in modulos_interes: info_modulos[parts[0]] = { "class": parts[1], "category": parts[2], "subcategory": parts[3], "name": parts[4] } return info_modulos # Archivos de entrada archivo_modulos = "lista_modulos.txt" # Reemplaza con la ruta de tu archivo de módulos archivo_info = "modules_info.txt" # Archivo con la información de los módulos # Cargar lista de módulos de interés modulos_interes = cargar_modulos(archivo_modulos) # Extraer información resultados = extraer_info_modulos(archivo_info, modulos_interes) # Guardar resultados en un archivo with open("modulos_extraidos.txt", "w") as f_out: f_out.write("module\tclass\tcategory\tsubcategory\tname\n") for modulo, info in resultados.items(): f_out.write(f"{modulo}\t{info['class']}\t{info['category']}\t{info['subcategory']}\t{info['name']}\n") print("Extracción completada. Revisa el archivo 'modulos_extraidos.txt'.") ``` Por otra parte, se debe crear un archivo `lista_modulos.txt` que contenga los modulos. En el siguiente formato: ``` M00001 M00002 M00003 M00004 M00005 M00006 M00007 M00008 M00009 M00010 M00011 ``` Revisar el documento y dejar solos los modulos de interes. Posteriormente, correr todo el analisis pero solo con los de interes. El archivo que debe ser modificado, es el que se importara, ya que el **generado por el script solo es para obtener los nombres de los modulos que estan presentes en este** y poder buscar, filtrar y dejar los de interes. ### Importar metabolism Independiente de las dos variantes, si es especifico o general, se sigue el siguiente proceso pero teniendo en cuenta que archivo es, si es `modules_info.txt` para general: ``` anvi-import-misc-data specific_modules.txt -p Enterococcus_metabolism_PROFILE.db -t items ``` O `modulos_extraidos.txt` para modulos de interes: ``` anvi-import-misc-data modulos_extraidos.txt -p Enterococcus_metabolism_PROFILE.db -t items ``` Finalmente, para la visualizacion del filograma del metabolismo: ``` anvi-interactive --manual-mode -d Enterococcus-completeness-MATRIX.txt -t Enterococcus-completeness-MATRIX.txt.newick -p Enterococcus_metabolism_PROFILE.db --title "Enterococcus Metabolism Heatmap" --manual ``` # ABRicate Mass screening of contigs for antimicrobial resistance or virulence genes. It comes bundled with multiple databases: NCBI, CARD, ARG-ANNOT, Resfinder, MEGARES, EcOH, PlasmidFinder, Ecoli_VF and VFDB. https://github.com/tseemann/abricate ## Installation ``` conda install -c conda-forge -c bioconda -c defaults abricate abricate --check abricate --list ``` Run the analysis for one genome: `abricate ARCHIVO.fna` También se puede correr en varios: `abricate *.fna` ABRicate tiene diferentes bases de datos que pueden ser utilizadas. Se pueve verificar las bases de datos con el comando: `abricate --list` Se vera un resultado como el siguiente: ``` DATABASE SEQUENCES DBTYPE DATE card 2631 nucl 2023-Nov-4 ecoli_vf 2701 nucl 2023-Nov-4 plasmidfinder 460 nucl 2023-Nov-4 ecoh 597 nucl 2023-Nov-4 ncbi 5386 nucl 2023-Nov-4 resfinder 3077 nucl 2023-Nov-4 vfdb 2597 nucl 2023-Nov-4 megares 6635 nucl 2023-Nov-4 argannot 2223 nucl 2023-Nov-4 ``` Para correr el analisis con una base de datos en especifico y generar un output en csv: `abricate *.fna --db DBname --csv > card.csv` Por ejemplo para correrla con la DB card: `abricate *.fna --db card --csv > card.csv` # AMRFinder ## Installation ### **Set up bioconda as the default channel** ``` conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge ``` ### **Create a conda enviroment and install all dependencies:** ``` conda create -n amrfinder conda activate amrfinder conda create -y -c conda-forge -c bioconda -n amrfinder --strict-channel-priority ncbi-amrfinderplus ``` ## Running the analysis ### **To run the analysis in a single genome:** `amrfinder -n GENOME.fna` **To run the analysis in multiple fna files create a amrfinder_multiple.sh using nano and the following script: ``` #!/bin/bash # Vacía o crea el archivo de salida > amrfinder_results.txt # Recorre todos los archivos .fna en el directorio actual for file in *.fna; do echo "Procesando $file" >> amrfinder_results.txt amrfinder -n "$file" >> amrfinder_results.txt echo "" >> amrfinder_results.txt # Línea en blanco para separar resultados done ``` Cambiar los permisos para que sea ejecutable: `chmod 777 amrfinder_multiple_anot.sh` Correr el script: `./amrfinder_multiple_anot.sh` # Gene extraction using PROKKA outputs **To extract genes from **gbk** files use the following script:** ``` from Bio import SeqIO import os # Lista de genes de interés genes_of_interest = [ "vioA", "vioB", "vioC", "vioD" ] # Ruta de los archivos .gbk input_folder = "/Users/abraham/Documents/Pangenomics/Violacein/Chromobacterium/gbk" output_folder = os.path.join(input_folder, "OUTPUT_DIRECTORY_NAME") # Crear la carpeta de salida si no existe if not os.path.exists(output_folder): os.makedirs(output_folder) # Detectar automáticamente archivos .gbk en la carpeta de entrada gbk_files = [f for f in os.listdir(input_folder) if f.endswith(".gbk")] if not gbk_files: print("No se encontraron archivos .gbk en la carpeta especificada.") exit() print(f"Se encontraron {len(gbk_files)} archivos .gbk.") # Función para extraer genes de interés y sus bloques completos def extract_genes(input_folder, gbk_files, genes_of_interest, output_folder): for gbk_file in gbk_files: file_path = os.path.join(input_folder, gbk_file) output_file = os.path.join(output_folder, f"genes_{gbk_file}") # Crear un archivo por genoma with open(output_file, "w") as out_handle: for record in SeqIO.parse(file_path, "genbank"): new_record = record[:] new_record.features = [] for feature in record.features: if feature.type == "gene" and "gene" in feature.qualifiers: gene_name = feature.qualifiers["gene"][0] if gene_name in genes_of_interest: # Agregar la característica 'gene' new_record.features.append(feature) # Buscar y agregar características relacionadas (CDS, mRNA, etc.) for related_feature in record.features: if ( related_feature.location == feature.location and related_feature.type in {"CDS", "mRNA"} ): new_record.features.append(related_feature) # Escribir el nuevo registro si tiene características if new_record.features: SeqIO.write(new_record, out_handle, "genbank") print(f"Genes extraídos de {gbk_file} y guardados en {output_file}") # Llamar a la función para extraer los genes extract_genes(input_folder, gbk_files, genes_of_interest, output_folder) print("Proceso completado. Todos los genes extraídos fueron guardados en:", output_folder) ``` ### Create a table with presence/absence from gbk files Puedes automatizar la construcción de tu tabla a partir de archivos GBK usando Python y Biopython. En este ejemplo se usa un script que extrae la presencia de los genes vioA, vioB, vioC, vioD y vioE de múltiples archivos GenBank y genera una tabla en formato CSV. Sin embargo, los nombres de los genes pueden modificar segun sea la necesidad. Asegúrate de tener Biopython instalado: `pip install biopython` Usa el siguiente script en Python: ``` import os import csv from Bio import SeqIO # Directorio donde están los archivos GBK gbk_dir = "ruta/a/tus/archivos" # Lista de genes a buscar vio_genes = ["vioA", "vioB", "vioC", "vioD", "vioE"] # Inicializar lista para la tabla table_data = [["Genome"] + vio_genes] # Recorrer cada archivo GBK en el directorio for filename in os.listdir(gbk_dir): if filename.endswith(".gbk") or filename.endswith(".gb"): genome_name = filename.replace(".gbk", "").replace(".gb", "") # Nombre base del genoma gene_presence = {gene: "No" for gene in vio_genes} # Inicializar como 'No' gbk_path = os.path.join(gbk_dir, filename) with open(gbk_path, "r") as gbk_file: for record in SeqIO.parse(gbk_file, "genbank"): for feature in record.features: if feature.type == "gene" and "gene" in feature.qualifiers: gene_name = feature.qualifiers["gene"][0] if gene_name in vio_genes: gene_presence[gene_name] = "Sí" # Marcar como encontrado # Agregar fila con la información table_data.append([genome_name] + [gene_presence[gene] for gene in vio_genes]) # Guardar la tabla en un archivo CSV output_csv = "vio_genes_table.csv" with open(output_csv, "w", newline="") as csvfile: writer = csv.writer(csvfile) writer.writerows(table_data) print(f"Tabla guardada en {output_csv}") ``` **To extract genes from **gff** files use the following script:** ``` import os import pandas as pd # Lista de genes de interés genes_of_interest = {"vioA", "vioB", "vioC", "vioD"} # Directorios de entrada y salida input_folder = "/Users/abraham/Documents/Pangenomics/Violacein/Chromobacterium/gff" output_folder = os.path.join(input_folder, "genes_extracted") # Crear carpeta de salida si no existe if not os.path.exists(output_folder): os.makedirs(output_folder) # Detectar archivos GFF en la carpeta de entrada gff_files = [f for f in os.listdir(input_folder) if f.endswith(".gff")] if not gff_files: print("No se encontraron archivos .gff en la carpeta especificada.") exit() print(f"Se encontraron {len(gff_files)} archivos .gff.") # Función para extraer genes de interés def extract_genes_from_gff(input_folder, gff_files, genes_of_interest, output_folder): for gff_file in gff_files: file_path = os.path.join(input_folder, gff_file) output_file = os.path.join(output_folder, f"genes_{gff_file}") # Leer el archivo GFF en un DataFrame, ignorando comentarios df = pd.read_csv(file_path, sep="\t", comment="#", header=None, names=[ "seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes" ]) # Filtrar solo las líneas que corresponden a genes y contienen nuestros genes de interés df_filtered = df[df["type"] == "gene"].copy() # Extraer el nombre del gen desde la columna "attributes" df_filtered["gene_name"] = df_filtered["attributes"].str.extract(r"Name=([^;]+)") # Filtrar los genes de interés df_filtered = df_filtered[df_filtered["gene_name"].isin(genes_of_interest)] # Guardar los genes extraídos en un nuevo archivo GFF if not df_filtered.empty: df_filtered.to_csv(output_file, sep="\t", index=False, header=False) print(f"Genes extraídos de {gff_file} y guardados en {output_file}") else: print(f"No se encontraron genes de interés en {gff_file}") # Ejecutar la función para extraer genes extract_genes_from_gff(input_folder, gff_files, genes_of_interest, output_folder) print("Proceso completado. Todos los genes extraídos fueron guardados en:", output_folder) ``` ### Create a table with presence/absence from gff files Puedes automatizar la construcción de tu tabla a partir de archivos GBK usando Python y Biopython. En este ejemplo se usa un script que extrae la presencia de los genes vioA, vioB, vioC, vioD y vioE de múltiples archivos GenBank y genera una tabla en formato CSV. Sin embargo, los nombres de los genes pueden modificar segun sea la necesidad. Asegúrate de tener Biopython instalado: `pip install biopython` Usa el siguiente script en Python: ``` import os import csv import pandas as pd # Directorio donde están los archivos GFF gff_dir = "/Users/abraham/Documents/Pangenomics/Violacein/Chromobacterium/gff" # Lista de genes a buscar vio_genes = ["vioA", "vioB", "vioC", "vioD", "vioE"] # Inicializar lista para la tabla table_data = [["Genome"] + vio_genes] # Recorrer cada archivo GFF en el directorio for filename in os.listdir(gff_dir): if filename.endswith(".gff"): genome_name = filename.replace(".gff", "") # Nombre base del genoma gene_presence = {gene: "No" for gene in vio_genes} # Inicializar como 'No' gff_path = os.path.join(gff_dir, filename) # Cargar el GFF como DataFrame (ignorando comentarios) df = pd.read_csv(gff_path, sep="\t", comment="#", header=None, names=[ "seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes" ]) # Filtrar solo las líneas de genes df_genes = df[df["type"] == "gene"].copy() # Extraer el nombre del gen desde la columna "attributes" df_genes["gene_name"] = df_genes["attributes"].str.extract(r"Name=([^;]+)") # Marcar los genes presentes for gene in vio_genes: if gene in df_genes["gene_name"].values: gene_presence[gene] = "Sí" # Agregar fila con la información table_data.append([genome_name] + [gene_presence[gene] for gene in vio_genes]) # Guardar la tabla en un archivo CSV output_csv = os.path.join(gff_dir, "vio_genes_presence_absence.csv") with open(output_csv, "w", newline="") as csvfile: writer = csv.writer(csvfile) writer.writerows(table_data) print(f"Tabla guardada en {output_csv}") ``` # Phylogenomics --- # MAFFT: Multiple alignment program for amino acid or nucleotide sequences **Installation** sudo apt install mafft - **--op # :** Gap opening penalty, default: 1.53 - **--ep # :** Offset (works like gap extension penalty), default: 0.0 - **--maxiterate # :** Maximum number of iterative refinement, default: 0 - **--clustalout :** Output: clustal format, default: fasta - **--reorder :** Outorder: aligned, default: input order - **--quiet :** Do not report progress - **--thread # :** Number of threads (if unsure, --thread -1) - **--dash :** Add structural information (Rozewicki et al, submitted) **If unsure which option to use:** % `mafft --auto in > out` **To run standard aligment:** `mafft --reorder --adjustdirection --auto input.fasta > output.fas` - **--reorder**: Outorder: aligned, default: input order - **--adjustdirection:** organize the direction of the sequences based on the first sequence - **--auto:** selection of auto model for the aligment - **in > out:** input.fas and output.fas --- --- # IQ-Tree: Efficient and versatile phylogenomic software by maximum likelihood **Installation** **Create a enviroment for IQ-Tree using conda:** `conda create -n iqtree` **Activate the enviroment:** `conda activate iqtree` **Install the packages using bioconda:** `conda install -c bioconda iqtree` **General ML tree analysis** Combine ModelFinder, tree search, SH-aLRT test and ultrafast bootstrap with 1000 replicates: `iqtree -s example.phy -B 1000 -alrt 1000` **Partitioned analysis with mixed data** http://www.iqtree.org/doc/Advanced-Tutorial To analyze different genes aligned in different files to create a phylogenomic tree using ML **Script:** ``` #nexus begin sets; charset part1 = rps16_alig.nex; charset part2 = ITS_alig.nex; charset part3 = rbcl_alig.nex; end;* ``` Note: each **part*** represent each of the aligned files that are going to be included in the ML tree analysis. It is **important** that in case .nex file is used, the *match=.* needs to be removed because it will not run the script as it wilk detect the *match=.* as an error # Species/genus delimitation standard ## Pyani: Average Nucleotide Identity https://github.com/widdowquinn/pyani ### Installation conda create --name pyani_env python=3.8 -y conda activate pyani_env conda install mummer blast legacy-blast -y conda install pyani Para correr el analisis: `average_nucleotide_identity.py -i . -o ANIb_output -m ANIb -g -v --write_excel --gformat svg` INFORMACION: https://gensoft.pasteur.fr/docs/pyani/0.2.10/ **PARA RETOMAR UN ANALISIS DE PYANI QUE QUEDO EN LA GENERACION DE LOS ARCHIVOS .BLAST**. Correr este comando en la carpeta fna donde se genero la carpeta ANIb_output y donde dentro de esa estan los otros archivos: `average_nucleotide_identity.py -i . -o ANIb_output -m ANIb -v --skip_blastn --noclobber --nocompress --force --write_excel` ``` INFO: Carrying out ANIb analysis INFO: Running ANIb INFO: Writing BLAST output to ANIb_output/blastn_output WARNING: Skipping BLASTN runs (as instructed)! INFO: Processing pairwise ANIb BLAST output. INFO: Writing ANIb results to ANIb_output INFO: ANIb_alignment_lengths INFO: ANIb_percentage_identity INFO: ANIb_alignment_coverage INFO: ANIb_similarity_errors INFO: ANIb_hadamard INFO: Done: Tue May 14 11:42:36 2024. INFO: Time taken: 890.59s ``` Debe verse al final algo como lo anterior Para hacer un clustering basado en el umbral del 0,95: Crear un archivo .py con el siguiente script: ``` import pandas as pd import networkx as nx def check_file(file_path): """Verifica si el archivo se puede abrir correctamente y muestra las primeras filas.""" try: # Intentamos leer el archivo y mostramos las primeras filas ani_matrix = pd.read_csv(file_path, sep='\t', index_col=0) print("Archivo cargado correctamente.") print(ani_matrix.head()) # Muestra las primeras filas para verificar return ani_matrix except Exception as e: print(f"Error al leer el archivo: {e}") return None def cluster_species(ani_matrix, threshold=0.95): G = nx.Graph() # Añadir nodos for genome in ani_matrix.index: G.add_node(genome) # Añadir aristas según el umbral de ANI for i, genome1 in enumerate(ani_matrix.index): for j, genome2 in enumerate(ani_matrix.columns): if i < j: # Evitar comparar duplicados y la diagonal ani_value = ani_matrix.loc[genome1, genome2] if ani_value >= threshold: G.add_edge(genome1, genome2) # Encontrar componentes conectados (grupos de especies) species_clusters = list(nx.connected_components(G)) return species_clusters def save_results(species_clusters, output_file): """Guarda los resultados en un archivo de texto o CSV.""" with open(output_file, 'w') as f: f.write(f"Se encontraron {len(species_clusters)} especies:\n") for i, cluster in enumerate(species_clusters, 1): f.write(f"Especie {i}:\n") for genome in sorted(cluster): f.write(f" {genome}\n") f.write("\n") def main(file_path, output_file): ani_matrix = check_file(file_path) if ani_matrix is not None: species_clusters = cluster_species(ani_matrix) save_results(species_clusters, output_file) print(f"Los resultados se guardaron en {output_file}") # Uso: # Asegúrate de pasar la ruta completa del archivo a 'main' y la ruta de salida donde deseas guardar el archivo main('ruta/a/tu/ANIb_percentage_identity.tab', 'ruta/al/output/resultados_ani_clusters.txt') ``` Solo se debe modificar la ruta al archivo.tab y la donde se quiere que se genere el output ## CompareM: Amino Acid Identity (AAI) Workflow *DID NOT WORK* https://github.com/donovan-h-parks/CompareM ### Installation ``` conda create -n comparem conda activate comparem conda install -c bioconda comparem ``` ### Analysis The most common task performed with CompareM is the calculation of pairwise amino acid identity (AAI) values between a set of genomes. This can be performed using the **`aai_wf`** command: `comparem aai_wf <input_files> <output_dir>` The `<input_file>` argument indicates the set of genomes to compare and can either i) a text file where each line indicating the location of a genome, or ii) a directory containing all genomes to be compared. The genomic nucleotide sequences of genomes must be in FASTA format. The `<output_dir>` indicates the desired directory for all output files. A typical use of this command would be: `comparem --cpus 20 aai_wf my_genomes aai_output` Where the directory `my_genomes` contains a set of genomes in FASTA format, the results are to be written to a directory called `aai_output`, and 20 processors should be used to calculate the results. A number of optional arguments can also be specified. This includes the **sequence similarity parameters** used to define reciprocal best hits between genomes(i.e., homologs). By default the e-value (`--evalue`), percent sequence identity (`--per_identity`), and percent alignment length (`--per_aln_len`) parameters are set to **1e-5**, **30%**, and **70%**. **When specifying a directory of genomes to process, CompareM only processes files with a fna extension.** This can be changes with the `--file_ext` argument. In addition, if genomes are already represented by amino acid protein sequences (as opposed to genomic nucleotide sequences), this must be specified with the `--proteins` flag. Otherwise, genes will be identified de novo using the Prodigal gene caller. The time to compute all pairwise AAI values can be substantially reduced by using multiple processors as specified with the `--cpus` argument. Other arguments are for specialized uses and are discussed in the User's Guide. Pairwise AAI statistics are provided in the output file **./<output_dir>/aai/aai_summary.tsv**. This file consists of 8 columns indicating: * Identifier of the first genome * Number of genes in the first genome * Identifier of the second genome * Number of genes in the second genome * Number of orthologous genes identified between the two genomes * The mean amino acid identity (AAI) of orthologous genes * The standard deviation of the AAI across orthologous genes * The orthologous fraction (OF) between the two genomes defined as the number of orthologous genes divided the minimum number of genes in either genome * Other output files produced by this command are described below. Debido al **error que muestra se deben hacer algunas modificaciones (https://hackmd.io/L2llRUU_SrWfI4OYN-uozQ?view)** pero como es para linux (servidor) se debe hacer de la siguiente forma: **Localiza tu instalación de CompareM** Activa el ambiente de conda de comparem: `conda activate comparem` En tu ambiente donde corre comparem, lanza: ``` python3 - <<'PY' import comparem, os print(os.path.dirname(comparem.__file__)) PY ``` Eso te imprimirá la ubicacion de la carpeta de comparem: `/home/aguerra/.conda/envs/comparem/lib/python3.13/site-packages/comparem` **Edita el archivo similarity_search.py** `cd /home/aguerra/.conda/envs/comparem/lib/python3.13/site-packages/comparem` Abre `similarity_search.py` en tu editor favorito y busca las líneas que contienen algo así: `nano similarity_search.py` **Las lineas que se deben buscar:** ``` if platform.system() == 'Darwin': os.system("LC_ALL=C sed -i '' 's/~/\t/g' %s" % input_hit_table) os.system("LC_ALL=C gsort --parallel=8 -o %s -k1,1 -k3,3 %s" % (input_hit_table, input_hit_table)) else: os.system("LC_ALL=C sed -i 's/~/\t/g' %s" % input_hit_table) os.system("LC_ALL=C sort --parallel=8 -o %s -k1,1 -k3,3 %s" % (input_hit_table, input_hit_table)) ``` Remplazar todo eso por lo siguiente: ``` def _some_method(self, input_hit_table, output_hit_table): # <-- el nombre depende del método original self.logger.info('Sorting table with hits (be patient!).') # Fix simplificado para Linux: usar GNU sed y sort os.system(f"LC_ALL=C sed -i 's/~/\t/g' {input_hit_table}") os.system(f"LC_ALL=C sort --parallel=8 -o {input_hit_table} -k1,1 -k3,3 {input_hit_table}") # Mover el fichero ya procesado a la salida os.system('mv %s %s' % (input_hit_table, output_hit_table)) ``` ## EzAAI: High Throughput Prokaryotic Average Amino acid Identity Calculator http://leb.snu.ac.kr/ezaai ### Installation ``` conda create -n EzAAI conda activate EzAAI conda install -y -c bioconda ezaai ezaai -h ``` ### Analysis #### EzAAI extract - Extract profile DB from sequences Creá un archivo de script llamado `create_AAI_db.sh`: `nano create_AAI_db.sh` Pegá este contenido dentro del archivo (ajustando las rutas si es necesario): ``` #!/bin/bash mkdir -p db # crea la carpeta db si no existe for file in /ruta/completa/a/los/fna/*.fna; do base=$(basename "$file" .fna) echo "Procesando $base..." ezaai extract -i "$file" -o db/"$base".db -l "$base" done ``` Reemplazá **/ruta/completa/a/los/fna/** por la ruta real donde están tus archivos .fna. Guardá y cerrá el archivo (en nano, presionás Ctrl+O para guardar y Ctrl+X para salir). Dale permisos de ejecución: `chmod +x create_AAI_db.sh` Ejecutá el script: `./create_AAI_db.sh` EzAAI will automatically produce a CDS profile DB of the genome with Prodigal with following prompt. ``` EzAAI |: Running prodigal on genome fasta/Cn.fasta... EzAAI |: Converting given CDS file into profile database... (fasta/Cn.fasta.faa -> db/Cn.db) EzAAI |: Task finished. ``` #### EzAAI calculate - calculate AAI value from profile DBs **Run EzAAI calculate module** Simply enter the following one-liner to perform all-by-all pairwise AAI calculation on the extracted profiles from above. `ezaai calculate -i db/ -j db/ -o out/aai.tsv` The pipeline will automatically detect .db files from the directory and calculate AAI values across the entire set of pairs using MMSeqs2. ``` EzAAI |: Calculating AAI... [Task 1/36] EzAAI |: Calculating AAI... [Task 2/36] ... EzAAI |: Calculating AAI... [Task 36/36] EzAAI |: Task finished. ``` **Check results** Run following to peek the contents of the result file. `head -7 out/aai.tsv` You migth see something like the following: ``` ID1 ID2 Label1 Label2 AAI 786958951 786958951 Leucobacter muris Leucobacter muris 100.000000 786958951 199206886 Leucobacter muris Clavibacter nebraskensis 61.609688 786958951 334056981 Leucobacter muris Microbacterium hominis 61.465138 786958951 204122518 Leucobacter muris Microbacterium aurum 61.842079 786958951 1073644442 Leucobacter muris Clavibacter insidiosus 61.453637 786958951 727401181 Leucobacter muris Leucobacter chironomi 88.755422 ``` You can see the result in glance of which the pair from same genus reports relatively high AAI value than the others. #### EzAAI cluster - hierarchical clustering of taxa with AAI values Run following to perform hierarchical clustering on the matrix provided from the previous step. `ezaai cluster -i out/aai.tsv -o out/sample.nwk` You migth see something like the following: ``` EzAAI |: AAI matrix identified. Running hierarchical clustering with UPGMA method... EzAAI |: Task finished. Resulting file is in a Newick format, which you can either look at it as a text, ``` `cat out/sample.nwk` You migth see something like the following: ``` (((Microbacterium hominis:9.325725,Microbacterium aurum:9.325725):9.500001, (Clavibacter nebraskensis:2.373246,Clavibacter insidiosus:2.373246):16.452480):0.335034, (Leucobacter muris:5.622289,Leucobacter chironomi:5.622289):13.538471); ``` or as a tree visualized with different external programs such as MEGA. ![image](https://hackmd.io/_uploads/B1V25TAA1l.png) ## POC https://github.com/2015qyliang/POCP?tab=readme-ov-file ### Installation ## core-proteome average amino acid identity (pAAI) https://github.com/flass/cpAAI_Rhizobiaceae?tab=readme-ov-file#cpaai_rhizobiaceae # POCP-VS-AAI: https://github.com/RiescoR/POCP-VS-AAI Evaluation of POCP VS AAI: This is a R script to parse all the results from EzAAI and Bio-PI POCP matrix into summary tables, analyze them and create some plots. # PPanGGOLiN: Depicting microbial species diversity via a Partitioned PanGenome Graph Of Linked Neighbors https://github.com/labgem/PPanGGOLiN ## Installation Install PPanGGOLiN into a new conda environment `conda create -n ppanggolin -c defaults -c conda-forge -c bioconda ppanggolin` Check PPanGGOLiN install ``` conda activate ppanggolin ppanggolin --version ``` ## Analysis: to run a complete pangenome analysis A complete pangenomic analysis with PPanGGOLiN can be performed using the all subcommand. This workflow runs a series of PPanGGOLiN commands to generate a partitioned pangenome graph with predicted RGPs (Regions of Genomic Plasticity), spots of insertion and modules. Execute the following command to run the all workflow: `ppanggolin all --fasta GENOMES_FASTA_LIST` By default, it uses parameters that we have found to be generally the best for working with species pangenomes. For further customization, you can adjust some parameters directly on the command line. Alternatively, you can use a configuration file to fine-tune the parameters of each subcommand used by the workflow (see here for more details). If you want to generate a rarefaction curve use: `ppanggolin all --fasta GENOME_LIST.txt -c 6 --rarefaction` The meaning of the added parameters: ``` --rarefaction Use to compute the rarefaction curves (WARNING: can be time consuming) --c CPU, --cpu CPU Number of available cpus ``` ### Input files The file GENOMES_FASTA_LIST is a tsv-separated file with the following organization: * The first column contains a unique genome name (without space) * The second column contains the path to the associated FASTA file * Circular contig identifiers are indicated in the following columns * Each line represents a genome Example of tsv GENOMES_FASTA_LIST: ``` alkanivorans_IITR_71 /Volumes/TOSHIBA EXT/Violacein/Chromobacterium/fna_coded/alkanivorans_IITR_71.fna alkanivorans_S00115 /Volumes/TOSHIBA EXT/Violacein/Chromobacterium/fna_coded/alkanivorans_S00115.fna alkanivorans_S00172 /Volumes/TOSHIBA EXT/Violacein/Chromobacterium/fna_coded/alkanivorans_S00172.fna alkanivorans_S00296 /Volumes/TOSHIBA EXT/Violacein/Chromobacterium/fna_coded/alkanivorans_S00296.fna alticapitis_MWU14_2602 /Volumes/TOSHIBA EXT/Violacein/Chromobacterium/fna_coded/alticapitis_MWU14_2602.fna amazonense_001855565_1 /Volumes/TOSHIBA EXT/Violacein/Chromobacterium/fna_coded/amazonense_001855565_1.fna ``` ### Results files Upon executing the all command, multiple output files and graphics are generated (more information here). Most notably, it writes an HDF-5 file (pangenome.h5). This file can be used as input to any of the subcommands to rerun parts of the analysis with different parameters, write and draw different representations of the pangenome, or perform additional analyses with PPanGGOLiN. ## Gephi setup to analyze pangenome graph https://gephi.org/ Steps to see the graph in gephi 1) Go to File/Open/and select the file pangenomeGraph.gexf. 2) Click OK in the window that appears. 3) Scroll out with your mouse. 4) Go to the Layout section on the left and in the selection bar choose ForceAtlas2. 5) In the Tuning section change the Scaling value to 4000 and check the Stronger Gravity box. 6) Click on the Run button and then click it again to stop. # pyGenomeViz https://github.com/moshi4/pyGenomeViz ## Installation `conda install -c conda-forge pygenomeviz` Additional installation of progressiveMauve is required. `conda install -c conda-forge -c bioconda pygenomeviz progressivemauve` **EN MACOS NO ESTA FUNCIONANDO, UTILIZAR EL DEL SERVIDOR O UN LINUX** ## Usage `pgv-pmauve seq1.gbk seq2.gbk seq3.gbk seq4.gbk -o pmauve_example` ***Options*** ``` General Options: --seq_files IN [IN ...] Input genome sequence files (Genbank or Fasta format) -o OUT, --outdir OUT Output directory --refid Reference genome index (Default: 0) --format [ ...] Output image format ('png'[*]|'jpg'|'svg'|'pdf'|`html`[*]) --reuse Reuse previous result if available -v, --version Print version information -h, --help Show this help message and exit Figure Appearence options: --fig_width Figure width (Default: 15) --fig_track_height Figure track height (Default: 1.0) --feature_track_ratio Feature track ratio (Default: 1.0) --link_track_ratio Link track ratio (Default: 5.0) --tick_track_ratio Tick track ratio (Default: 1.0) --track_labelsize Track label size (Default: 20) --tick_labelsize Tick label size (Default: 15) --normal_link_color Normal link color (Default: 'grey') --inverted_link_color Inverted link color (Default: 'tomato') --align_type Figure tracks align type ('left'|'center'[*]|'right') --tick_style Tick style ('bar'|'axis'|None[*]) --plotstyle Block box plot style ('box'[*]|'bigbox') --cmap Block box colormap (Default: 'hsv') --curve Plot curved style link (Default: OFF) --dpi Figure DPI (Default: 300) [*] marker means the default value. ``` Example 1 Download example dataset, four E.coli genbank files: `pgv-download-dataset -n escherichia_coli` Run CLI workflow: `pgv-pmauve --seq_files NC_000913.gbk NC_002695.gbk NC_011751.gbk NC_011750.gbk -o pmauve_example1 --tick_style bar --block_cmap Pastel2` ![image](https://hackmd.io/_uploads/ry3oaTrEle.png) # Panstripe https://github.com/gtonkinhill/panstripe ## Installation Instalar el gestor de paquetes de Bioconductor (BiocManager). Este comando instala BiocManager, que es la herramienta oficial para descargar paquetes de Bioconductor. Es seguro ejecutarlo incluso si ya lo tienes. ``` if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") ``` Usar BiocManager para instalar ggtree. Ahora que tienes el instalador correcto, úsalo para instalar el paquete que falta. `BiocManager::install("ggtree")` Para instalar ahora si el panstripe: ``` install.packages("remotes") remotes::install_github("gtonkinhill/panstripe") ``` If you would like to also build the vignette with your installation run: `remotes::install_github("gtonkinhill/panstripe", build_vignettes = TRUE)` ## Analysis Panstripe takes as input a phylogeny and gene presence/absence matrix in Rtab format as is produced by Panaroo, Roary, PIRATE and other prokaryotic pangenome tools. Como tenemos un dataset particular, los comandos van a cambiar porque no estaremos usando el dataset del programa si no uno particular. ### Siempre cargar las librerias necesarias: ``` library(panstripe) library(ape) library(patchwork) set.seed(1234) ``` ### Siempre definir las rutas a tus archivos ``` my_tree_path <- "/path/to/newick/file/tree.newick" my_rtab_path <- "/path/to/Rtab/file/gene_presence_absence.Rtab" ``` ### Load files ``` pa <- read_rtab(my_rtab_path) tree <- read.tree(my_tree_path) ``` Apartir de aqui algunas cosas cambian dependiendo del dataset. En algunas ocasiones surge un error y hay que recurrrir a otros modelos para correr el analis. En la forma regular: ### Run panstripe (Option A = regular) ``` fit <- panstripe(pa, tree) fit$summary ``` Deberias de ver un **output una table** asi como esta: ``` #> # A tibble: 6 × 7 #> term estimate std.error statistic p.value `bootstrap CI …` `bootstrap CI …` #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 istip 1.49 0.282 5.29 8.03e- 7 0.978 1.99 #> 2 core 2.97 0.247 12.0 9.69e-21 2.54 3.54 #> 3 depth 0.0572 0.0619 0.923 3.58e- 1 -0.0859 0.162 #> 4 isti… -1.93 0.431 -4.48 2.10e- 5 -2.82 -1.18 #> 5 p 1.25 NA NA NA 1.11 1.36 #> 6 phi 2.58 NA NA NA 1.92 3.28 ``` Si con esto se obtiene un **mensaje de error** como el siguiente: ``` Error in value[[3L]](cond) : Panstripe model fit failed! This can sometime be caused by unusual branch lengths. Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets In addition: There were 50 or more warnings (use warnings() to see the first 50) Error in value[[3L]](cond) : Panstripe model fit failed! This can sometime be caused by unusual branch lengths. Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: algorithm did not converge 3: step size truncated due to divergence Error in value[[3L]](cond) : Panstripe model fit failed! This can sometime be caused by unusual branch lengths. Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: algorithm did not converge 3: step size truncated due to divergence Error in value[[3L]](cond) : Panstripe model fit failed! This can sometime be caused by unusual branch lengths. Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: algorithm did not converge 3: step size truncated due to divergence Error in value[[3L]](cond) : Panstripe model fit failed! This can sometime be caused by unusual branch lengths. Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: algorithm did not converge 3: step size truncated due to divergence Error in value[[3L]](cond) : Panstripe model fit failed! This can sometime be caused by unusual branch lengths. Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: algorithm did not converge 3: step size truncated due to divergence Error in statistic(data, i[r, ], ...) : Model fitting failed to converge in bootstrap replicate! ``` Proceder con la opcion b que utiliza la familia quasipoisson. ### Run panstripe (Option B = quasipoisson family) ``` fit_gaussian <- panstripe(pa, tree, family = "gaussian") fit$summary ``` En el dataset que se utilizo para ejemplificar este tipo de casos, se obtuvo un output como el siguiente: ``` # A tibble: 6 × 7 term estimate std.error statistic p.value `bootstrap CI 2.5%` `bootstrap CI 97.5%` <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 istip 2.52 0.268 9.39 9.42e-19 2.03 2.91 2 core 81554. 22512. 3.62 3.37e- 4 53970 132420 3 depth 2417. 4326. 0.559 5.77e- 1 -5688 10188 4 istip:core -79323. 22557. -3.52 4.97e- 4 -129180 -51864 5 p NA NA NA NA NA NA 6 phi NA NA NA NA NA NA ``` Notese de los valores de phi, core y tip ya que de esos valores dependeran los resultados de algunas graficas. Independiemente de la opcion A o B los siguientes plots deben correrse de la misma forma. ## Plot results ### Pangenome params `plot_pangenome_params(fit)` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/rkBk196Vlx.png) ##### Option B (Quasipoisson family) ![image](https://hackmd.io/_uploads/rJ5Nkq6Vgx.png) Note: Los dos output provienen de dos datasets diferentes, por lo que no son comparables. Pero si muestran dos casos a los cuales nos podriamos estar enfrentando. #### How to interpret params figures * **Core**: indicates whether the branch lengths in the phylogeny are associated with gene gain and loss. * **Depth:** indicates whether the rate of gene gain and loss changes significantly with the depth of a branch. * **Istip:** indicates associations with genes observed on the tips of the phylogeny. * **p:** the inferred index parameter of the underlying Tweedie distribution. * **Phi:** the inferred dispersion parameter. A significant p-value for the tip term indicates that there is a different rate of gene exchange at the tips of the phylogeny compared with the internal branches. This is usually driven by annotation errors or highly mobile elements that do not persist long enough to be observed in multiple genomes. A significant p-value for the core term indicates that there is a significant association between the core genome branch length and the number of gene exchange events. The depth term is less interesting but indicates when there is a difference in our ability to detect older gene exchange events. #### Script para refinar pangenome params ``` # Cargar los paquetes necesarios library(ggplot2) library(dplyr) library(gridExtra) library(grid) # 1. Extraer y limpiar datos del objeto 'fit' summary_data <- fit$summary summary_data_clean <- summary_data %>% rename( lower_ci = 6, upper_ci = 7 ) %>% filter(term %in% c("core", "istip")) # 2. Crear el gráfico para el parámetro 'core' plot_core <- summary_data_clean %>% filter(term == "core") %>% ggplot(aes(x = term, y = estimate)) + geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), width = 0.25, linewidth = 0.8) + geom_point(size = 4, shape = 21, fill = "dodgerblue") + labs( title = NULL, x = NULL, y = "Gene count estimate" ) + theme_bw(base_size = 14) + theme( axis.title.y = element_text(face = "bold") ) # 3. Crear el gráfico para el parámetro 'tip' plot_tip <- summary_data_clean %>% filter(term == "istip") %>% ggplot(aes(x = term, y = estimate)) + geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), width = 0.25, linewidth = 0.8) + geom_point(size = 4, shape = 21, fill = "coral") + labs( title = NULL, x = NULL, y = NULL ) + theme_bw(base_size = 14) # 4. Organizar los gráficos y añadir el eje X compartido grid.arrange( plot_core, plot_tip, ncol = 2, # Añade el texto en la parte inferior, centrado entre ambos gráficos bottom = textGrob("Parameter", gp = gpar(fontface = "bold", fontsize = 14)) ) ``` ### Pangenome cumulative plot We also suggest plotting the cumulative number of gene gain and loss events against the core branch length. As Panstripe models each branch individually and accounts for the depth of the branch it is not possible to add the Panstripe model fit to this plot. `plot_pangenome_cumulative(fit)` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/SyixhhpVgl.png) ##### Option B (Quasipoisson family) ![M_Smithii_cumulative](https://hackmd.io/_uploads/BJMb32aVxl.png) #### Script para refinar pangenome params ``` # Cargar los paquetes necesarios library(ggplot2) library(ggtext) # --- Define la ruta de la carpeta de salida --- output_directory <- "/path/to/output/directory/" # --- Crea la carpeta si no existe --- if (!dir.exists(output_directory)) { dir.create(output_directory, recursive = TRUE) } # 1. ASIGNAR EL GRÁFICO A UNA VARIABLE my_cumulative_plot <- plot_pangenome_cumulative(fit) + theme_bw(base_size = 14) + # --- ETIQUETAS (LABELS) --- labs( title = NULL, y = "**cumulative gene gain & loss events**", x = "**cumulative core branch distance**", shape = "**branch**", color = "**branch**" ) + # --- GEOMETRÍA DE PUNTOS --- geom_point(aes(shape = branch, color = branch), size = 2.5) + # --- ESCALAS MANUALES (CONTROL TOTAL) --- scale_shape_manual(name = "**branch**", values = c("internal" = 16, "terminal" = 17)) + scale_color_manual(name = "**branch**", values = c("internal" = "#e78ac3", "terminal" = "#7fb1d3")) + # --- TEMA Y ESTILO --- theme( axis.title.y = element_markdown(), axis.title.x = element_markdown(), legend.title = element_markdown(), legend.position = "right" ) + # Unimos las leyendas de color y forma en una sola guides(color = guide_legend(), shape = guide_legend()) # 2. GUARDAR LA VARIABLE COMO UN ARCHIVO SVG EN LA RUTA ESPECIFICADA ggsave( filename = file.path(output_directory, "pangenome_cumulative_plot.svg"), plot = my_cumulative_plot, width = 8, height = 6 ) ``` These two plots can easily be combined using the patchwork R package as: `plot_pangenome_params(fit) + plot_pangenome_cumulative(fit) + plot_layout(nrow = 1)` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/BJRw3npNxl.png) ##### Option B (Quasipoisson family) ![M_Smithii_params_vs_cumulative](https://hackmd.io/_uploads/S1FY3264eg.png) ### Pangenome residuals of regression A plot of the residuals of the regression can also be generated using `plot_residuals(fit)` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/Sk81p3pEee.png) ##### Option B (Quasipoisson family) ![M_smithii_residual](https://hackmd.io/_uploads/SJP03h6Neg.png) #### Script para refinar pangenome residuals of regression ``` # Cargar los paquetes necesarios library(ggplot2) library(ggtext) # --- Define la ruta de la carpeta de salida --- output_directory <- "path/to/the/output/directory" # --- Crea la carpeta si no existe --- if (!dir.exists(output_directory)) { dir.create(output_directory, recursive = TRUE) } # 1. ASIGNAR EL GRÁFICO DE RESIDUOS A UNA VARIABLE my_residuals_plot <- plot_residuals(fit) + # --- GEOMETRÍA DE PUNTOS --- # Se actualiza el color al código hexadecimal que pediste geom_point(shape = 16, colour = "#99df8a", alpha = 0.5) + theme_bw(base_size = 14) + # --- ETIQUETAS (LABELS) --- labs( title = NULL, y = "**Randomized quantile residuals**", x = "**Core distance**" ) + # --- TEMA Y ESTILO --- theme( axis.title.y = element_markdown(), axis.title.x = element_markdown() ) # 2. GUARDAR LA VARIABLE COMO UN ARCHIVO SVG EN LA RUTA ESPECIFICADA ggsave( filename = file.path(output_directory, "residuals_plot.svg"), plot = my_residuals_plot, width = 7, height = 6 ) # 3. MOSTRAR EL GRÁFICO EN R print(my_residuals_plot) ``` ### Tree with presence/absence A plot of the phylogeny and the corresponding gene presence/absence matrix can be made using the plot_tree_pa function. This function also takes an optional vector of gene names to include in the plot. ``` # look at only those genes that vary variable_genes <- colnames(pa)[apply(pa, 2, sd) > 0] plot_tree_pa(tree = tree, pa = pa, genes = variable_genes, label_genes = FALSE, cols = "black") ``` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/r1tLp3T4ge.png) ##### Option B (Quasipoisson family) ![M_smithii_presence_abs](https://hackmd.io/_uploads/Bkqr6hTNle.png) ### Inferred ancestral states The plot_gain_loss function allows for the visualisation of the fitted gene gain and loss events on the given phylogeny. The enrichment for events at the tips of a tree is often driven by a combination of highly mobile elements and annotation errors. `plot_gain_loss(fit)` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/B11oahpNll.png) ##### Option B (Quasipoisson family) ![M_smithii_gain_loss_plot](https://hackmd.io/_uploads/SJuia3aVel.png) #### Script para refinar pangenome gene gain/loss tree plot ``` # Cargar los paquetes necesarios, si no estan instalados proceder con la instalación library(ggplot2) library(ggtree) library(dplyr) library(rlang) # Para .data # ============================================================================== # 1. DEFINICIÓN DE LA NUEVA FUNCIÓN PERSONALIZADA # ============================================================================== # Hemos creado una nueva función para no modificar la original del paquete. # Añadimos argumentos para controlar el grosor del árbol y el tamaño de las etiquetas. my_plot_gain_loss <- function(fit, tree_linewidth = 0.5, # Controla el grosor de las ramas tip_label = TRUE, tip_label_size = 2.5) { # Controla el tamaño de las etiquetas # (Esta parte de preparación de datos es la misma que la original) gt <- dplyr::full_join( ggtree::fortify(fit$tree), data.frame(node = fit$tree$edge[, 2], trait = fit$data$acc), by = 'node' ) # --- Construcción del Gráfico con MODIFICACIONES --- gg <- ggtree::ggtree(gt, ggplot2::aes(color = .data$trait), linewidth = tree_linewidth) + # <-- Usamos nuestro argumento para líneas más finas ggplot2::labs(colour = 'Total genes\ngained & lost') + ggplot2::scale_color_binned(type = 'viridis') + ggtree::theme_tree2() if (tip_label) { # --- MODIFICACIONES CLAVE AQUÍ --- gg <- gg + ggtree::geom_tiplab( align = FALSE, # <-- Se pone en FALSE para eliminar las líneas del medio size = tip_label_size, # <-- Usamos nuestro argumento para etiquetas más pequeñas colour = 'black' ) } return(gg) } # ============================================================================== # 2. USO DE LA NUEVA FUNCIÓN PARA CREAR Y GUARDAR EL GRÁFICO # ============================================================================== # --- Define la ruta de la carpeta de salida --- output_directory <- "/path/to/output/directory/" # --- Crea la carpeta si no existe --- if (!dir.exists(output_directory)) { dir.create(output_directory, recursive = TRUE) } # Ahora llamamos a nuestra nueva función con los valores que queramos final_plot <- my_plot_gain_loss( fit, tree_linewidth = 0.4, # Ramas más finas tip_label_size = 2 # Etiquetas mucho más pequeñas ) + # Podemos añadir más personalizaciones aquí theme(legend.position = "right") # Guardamos el gráfico mejorado ggsave( filename = file.path(output_directory, "gain_loss_refined_plot.svg"), plot = final_plot, width = 8, height = 10 ) # Mostramos el gráfico final en R print(final_plot) ``` ### tSNE The tSNE dimension reduction technique can be used to investigate evidence for clusters within the pangenome. `plot_tsne(pa)` Un output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/Sk0rR3T4gg.png) ##### Option B (Quasipoisson family) ![M_smithii_tSNE](https://hackmd.io/_uploads/BkqNAnaNxx.png) #### Script para refinar tSNE ``` # Cargar los paquetes necesarios library(ggplot2) library(ggtext) # --- Define la ruta de la carpeta de salida --- output_directory <- "/path/to/output/directory/" # --- Crea la carpeta si no existe --- if (!dir.exists(output_directory)) { dir.create(output_directory, recursive = TRUE) } # --- 1. Fijar la semilla aleatoria para reproducibilidad --- # Esto garantiza que el resultado del t-SNE sea siempre el mismo. set.seed(123) # --- 2. Ejecutar t-SNE una vez y obtener solo los datos --- tsne_data <- plot_tsne(pa, plot = FALSE) # --- 3. Construir el gráfico personalizado con las coordenadas fijas --- final_tsne_plot <- ggplot(tsne_data, aes(x = dim1, y = dim2)) + # Geometría de los puntos (sólidos y de color) geom_point(shape = 16, colour = "#1f77b4", alpha = 0.8, size = 2.5) + # Tema y etiquetas theme_bw(base_size = 14) + labs( title = NULL, x = "**dim1**", y = "**dim2**" ) + theme( axis.title.x = element_markdown(), axis.title.y = element_markdown() ) # --- 4. Guardar el gráfico final como SVG --- ggsave( filename = file.path(output_directory, "tsne_plot_final.svg"), plot = final_tsne_plot, width = 7, height = 7 ) # --- 5. Mostrar el gráfico en R --- print(final_tsne_plot) ``` ## Run the analysis for multiple pangenomes ### Siempre cargar las librerias necesarias: ``` library(panstripe) library(ape) library(patchwork) set.seed(1234) ``` ### Siempre definir las rutas a tus archivos ``` base_path <- "/path/to/directory/with/files" ``` **Rutas para pangenome1** ``` pangenome1_tree_path <- file.path(base_path, "tree_p1.newick") pangenome1_rtab_path <- file.path(base_path, "gene_presence_absence_p1.Rtab") ``` **Rutas para pangenome2** ``` pangenome2_tree_path <- file.path(base_path, "tree_p2.newick") pangenome2_rtab_path <- file.path(base_path, "gene_presence_absence_p2.Rtab") ``` ### Cargar los datos del pangenoma1 ``` pa_p1 <- read_rtab(pangenome1_rtab_path) tree_p1 <- read.tree(pangenome1_tree_path) ``` ### Correr panstripe en pangenoma1. Usamos family='quasipoisson' para mayor estabilidad. `fit_p1 <- panstripe(pa_p1, tree_p1, family = 'quasipoisson')` ### Cargar los datos del pangenoma2 ``` pa_p2 <- read_rtab(pangenome2_rtab_path) tree_p2 <- read.tree(pangenome2_tree_path) ``` ### Correr panstripe en pangenoma1. Usamos family='quasipoisson' para mayor estabilidad. `fit_p2 <- panstripe(pa_p2, tree_p2, family = 'quasipoisson')` ### Correr la función de comparación `result_comparacion <- compare_pangenomes(fit_p1, fit_p2)` ### Imprimir el resumen de la comparación print(result_comparacion$summary) ### Para graficar params vs cumulative Cargar la librería patchwork si no está cargada `library(patchwork)` Utilizar el siguiente comando para generar los dos gráficos juntos: ``` plot_pangenome_params(list(P1 = fit_p1, P2 = fit_p2), legend = FALSE) + plot_pangenome_cumulative(list(P1 = fit_p1, P2 = fit_p2)) + plot_layout(nrow = 1) ``` El output se deberia ver como los siguientes: ##### Option A (Regular; tomado del github) ![image](https://hackmd.io/_uploads/HycE-6pNex.png) ##### Option B (Quasipoisson family) ![Rplot04](https://hackmd.io/_uploads/HJ50WTaNgx.png) # EggNOG-mapper https://github.com/eggnogdb/eggnog-mapper ## Website analysis Debido al gran espacio (80GB) que ocupa el programa se puede correr el analisis en la pagina directamente: http://eggnog-mapper.embl.de/ Upload the `pan_genome_reference.fa` generado por roary o panaroo y seleccion la opcion ***CDS*** para que se corra correctamente el analisis. Ademas, ingresa un correo donde se envia el link del analisis: ![image](https://hackmd.io/_uploads/HkLMa-vSll.png) Modificar los parametros segun sea necesario y luego click en **Submit**: ![image](https://hackmd.io/_uploads/HyySpWDSee.png) Al click en **Submit** se debe ver algo asi en la pagina: ![image](https://hackmd.io/_uploads/HJRt6Zvrxx.png) Revisar el correo y entrar al link dando click en ***Click to manage your job***: ![image](https://hackmd.io/_uploads/HJ436Wwree.png) Al entrar al link darle click en ***Star job*** para comenzar el analisis: ![image](https://hackmd.io/_uploads/r1KgAbDBxx.png) Se recibira un correo donde el analisis ha finalizado: ![image](https://hackmd.io/_uploads/rJCNC-wSgl.png) entrar al link dando click en ***Click to manage your job*** y en ***Explore annotation*** para ver las anotaciones online: ![image](https://hackmd.io/_uploads/HJN20WDBlg.png) Ademas, click en el formato en el cual se requiere el archivo, asi como los archivos necesarios. Para el caso de los analisis subsequentes necesitaremos el `.emapper.annotations` file en el ejemplo llamado de `MM_rqbj1453.emapper.annotations.tsv` y en formato csv. ## Analisis de categorias COGs general de todo el pangenoma based on EggNOG-mapper output Este script de Python realiza un análisis de frecuencia de las categorías funcionales COG a partir de un archivo de resultados de eggNOG-mapper. Su propósito es cuantificar y visualizar la distribución de funciones en un conjunto de genes anotados. ### Entradas (Inputs): Archivo de anotaciones de eggNOG: Un archivo de salida estándar de eggnog-mapper con formato TSV (ej. `MM_rqbj1453.emapper.annotations.ts`v). El script asume que este archivo se encuentra en el mismo directorio. ### Explicacion del analisis: * **Carga de Datos:** Lee el archivo de anotaciones, identificando automáticamente la fila del encabezado para ignorar los metadatos. * **Extracción y Limpieza:** Extrae la columna COG_category, elimina las entradas sin anotación y procesa las categorías múltiples (ej. "C,G") para quedarse únicamente con la primera letra. * **Cuantificación:** Realiza un conteo de la frecuencia absoluta de cada categoría COG. * **Normalización:** Calcula la frecuencia relativa de cada categoría como un porcentaje del total de anotaciones válidas. * **Visualización:** Genera una gráfica de barras que representa la frecuencia relativa de cada categoría COG. ### Salidas (Outputs): **Tabla de datos (`tabla_frecuencia_cog.csv`):** Un archivo CSV con tres columnas: Categoria_COG, Conteo y Frecuencia_Porcentaje. **Gráfica de resultados (`grafica_frecuencia_cog.png`):** Una imagen en alta resolución (300 dpi) del gráfico de barras, ideal para reportes y publicaciones. ## Horizontal barplot de la frecuencia relativa de COGs based on EggNOG-MAPPER tsv (.emapper.annotations.tsv) #### Script `plot_pangenome_cogs.py`: Importante, verificar que el **EGGNOG_FILE** en el script conincida con el nombre del archivo **`.emapper.annotations.tsv`**. Correr el script despues de haber creado el archivo usando python `plot_pangenome_cogs.py`: #### Script: ``` import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # --- CONFIGURACIÓN --- # El nombre exacto de tu archivo de resultados de eggNOG. # Asegúrate de que este archivo esté en la misma carpeta que el script. EGGNOG_FILE = 'MM_rqbj1453.emapper.annotations.tsv' # Nombres de los archivos que se van a crear. OUTPUT_PLOT_FILE = 'grafica_frecuencia_cog.png' OUTPUT_DATA_FILE = 'tabla_frecuencia_cog.csv' # --- SCRIPT PRINCIPAL --- print(f"Paso 1: Cargando el archivo de anotaciones '{EGGNOG_FILE}'...") try: # Lógica robusta para encontrar la fila del encabezado (la que empieza con '#query'). header_row = 0 with open(EGGNOG_FILE, 'r') as f: for i, line in enumerate(f): if line.startswith('#query'): header_row = i break # Cargar el archivo usando el número de fila correcto para el encabezado. df = pd.read_csv(EGGNOG_FILE, sep='\t', header=header_row) except FileNotFoundError: print(f"\nERROR: No se encontró el archivo '{EGGNOG_FILE}'.") print("Asegúrate de que el script y el archivo de datos estén en la misma carpeta.") exit() print("Paso 2: Extrayendo y limpiando las categorías COG...") # Verificar si la columna 'COG_category' existe. if 'COG_category' not in df.columns: print("\nERROR: No se encontró la columna 'COG_category' en tu archivo.") print("Verifica el formato del archivo de eggNOG.") exit() # Eliminar filas que no tienen una categoría COG asignada. cog_categories = df['COG_category'].dropna() # Procesar categorías: Ignorar las que son '-' y tomar solo la primera letra si hay varias (ej. "C,G"). cleaned_cogs = [str(cog)[0] for cog in cog_categories if str(cog) != '-'] if not cleaned_cogs: print("\nERROR: No se encontraron anotaciones COG válidas en el archivo.") exit() print(f"Se encontraron {len(cleaned_cogs)} anotaciones COG válidas.") print("Paso 3: Contando frecuencias y calculando porcentajes...") # Contar la frecuencia de cada categoría COG. cog_counts = pd.Series(cleaned_cogs).value_counts().reset_index() cog_counts.columns = ['Categoria_COG', 'Conteo'] # Calcular la frecuencia relativa en porcentaje. total_annotations = cog_counts['Conteo'].sum() cog_counts['Frecuencia_Porcentaje'] = (cog_counts['Conteo'] / total_annotations) * 100 # Guardar la tabla de datos en un archivo CSV. cog_counts.to_csv(OUTPUT_DATA_FILE, index=False) print(f"Tabla de datos guardada como '{OUTPUT_DATA_FILE}'") print("Paso 4: Generando la gráfica...") # --- VISUALIZACIÓN --- plt.figure(figsize=(12, 8)) splot = sns.barplot(data=cog_counts, x='Categoria_COG', y='Frecuencia_Porcentaje', palette='viridis', order=sorted(cog_counts['Categoria_COG'])) # Añadir los porcentajes encima de cada barra. for p in splot.patches: splot.annotate(format(p.get_height(), '.1f') + '%', (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 9), textcoords = 'offset points') plt.title('Frecuencia Relativa de Categorías COG', fontsize=16) plt.xlabel('Categoría COG', fontsize=12) plt.ylabel('Frecuencia Relativa (%)', fontsize=12) plt.ylim(0, max(cog_counts['Frecuencia_Porcentaje']) * 1.1) # Ajustar límite del eje Y plt.tight_layout() # Guardar la gráfica en un archivo. plt.savefig(OUTPUT_PLOT_FILE, dpi=300) print(f"¡Éxito! Gráfica guardada como '{OUTPUT_PLOT_FILE}'") plt.show() ``` Se debe de generar un output svg y png como el siguiente: ![grafica_frecuencia_cog](https://hackmd.io/_uploads/H1ZchYvSel.png) Realizar las modificaciones especificas para las necesidades de los datos. Se puede modificar la paleta, entre otros detalles visuales de la figura. # eggnogCOGextractor https://github.com/vsmicrogenomics/eggnogCOGextractor/tree/main ## Instalation Solo es descargar el archivo `eggnogCOGextractor.py` para correr el programa ## Analysis Crear una carpeta llama input y otra output. En la input debe estar el`MM_rqbj1453.emapper.annotations.tsv`, sin embargo, es importante que este no tenga la extension tsv si no que termine en **`.emapper.annotations`** Para correr el analisis correr el siguiente comando: `python eggnogCOGextractor.py ` Se generara un archivo tsv en el directorio output/ de la siguiente forma: ![image](https://hackmd.io/_uploads/SkkuYPwHeg.png) **Este archivo es importante para hacer algunos analisis como barplot**, entre otros. ## Vertical barplot de la frecuencia relativa de COGs based on eggnogCOGextractor output tsv (.emapper.cog.tsv) #### Script: ``` import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # --- CONFIGURACIÓN --- INPUT_FILE = '/Users/abraham/Downloads/eggNOG_mapper_analysis_CLAS/eggnogCOGextractor/output/MM_rqbj1453.emapper.cog.tsv' # Nombres para los archivos de salida OUTPUT_PLOT_PNG = 'cog_bar_chart_final.png' OUTPUT_PLOT_SVG = 'cog_bar_chart_final.svg' # --- SCRIPT PRINCIPAL --- print(f"Paso 1: Cargando datos desde '{INPUT_FILE}'...") try: df = pd.read_csv(INPUT_FILE, sep='\t') except FileNotFoundError: print(f"\nERROR: No se encontró el archivo '{INPUT_FILE}'.") exit() except pd.errors.EmptyDataError: print(f"\nERROR: El archivo '{INPUT_FILE}' está vacío.") exit() df = df[df['COGCount'] > 0] df = df.sort_values('COGCount', ascending=False) print("Paso 2: Generando la gráfica de barras...") # Crear la figura (fig) y los ejes (ax) fig, ax = plt.subplots(figsize=(12, 8)) # Crear la gráfica de barras horizontales sns.barplot(data=df, x='COGCount', y='COGDescription', hue='COGDescription', palette='tab20', orient='h', legend=False, ax=ax) # Añadir el número exacto al final de cada barra. for index, value in enumerate(df['COGCount']): ax.text(value, index, f' {value}', va='center', ha='left') # Añadir la letra de la categoría (COGID) en el lado derecho. ax.spines['right'].set_visible(False) ax.spines['top'].set_visible(False) ax2 = ax.twinx() ax2.spines['right'].set_visible(False) ax2.set_ylim(ax.get_ylim()) ax2.set_yticks(ax.get_yticks()) ax2.set_yticklabels(df['COGID'], fontsize=10, weight='bold') ax.set_title('Número de Genes por Categoría Funcional COG', fontsize=16, weight='bold') ax.set_xlabel('Número de Genes', fontsize=12) ax.set_ylabel('Categoría COG', fontsize=12) plt.tight_layout() # --- MODIFICACIÓN FINAL: Guardar en ambos formatos --- plt.savefig(OUTPUT_PLOT_PNG, dpi=300, bbox_inches='tight') plt.savefig(OUTPUT_PLOT_SVG, bbox_inches='tight') # SVG es un vector, no necesita dpi print(f"¡Éxito! Gráficas guardadas como '{OUTPUT_PLOT_PNG}' y '{OUTPUT_PLOT_SVG}'") plt.show() ``` Se debe de generar un output svg y png como el siguiente: ![cog_bar_chart_final](https://hackmd.io/_uploads/HyhmhFwrll.png) Realizar las modificaciones especificas para las necesidades de los datos. Se puede modificar la paleta, entre otros detalles visuales de la figura. ## Vertical barplot de los genes especificos de algun COGs category (Ej: V and Q) based on eggnogCOGextractor output tsv (.emapper.cog.tsv) #### Script ``` import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # --- CONFIGURATION --- EGGNOG_ANNOTATIONS_FILE = 'MM_rqbj1453.emapper.annotations.tsv' # Color map for categories of interest COLOR_MAP = { 'Q': '#2eabb8ff', # Color for 'Secondary metabolites' 'V': '#d1d197ff' # Color for 'Defense mechanisms' } CATEGORIES_TO_PLOT = list(COLOR_MAP.keys()) # --- PLOTTING FUNCTION --- def generate_detailed_plot(df, cog_letter, plot_color): """ Filters the dataframe by a COG category and generates a bar plot using a specific color. """ print(f"\n--- Processing Category: {cog_letter} with color {plot_color} ---") category_df = df[df['Cleaned_COG'] == cog_letter].copy() if category_df.empty: print(f"No genes found for category '{cog_letter}'.") return plot_data = category_df.loc[category_df.groupby('Description')['score'].idxmax()] plot_data = plot_data.sort_values('score', ascending=False).head(25) print(f"Found {len(plot_data)} unique functions to plot.") plt.figure(figsize=(12, 10)) sns.barplot(data=plot_data, x='score', y='Description', color=plot_color, orient='h') # --- Plot labels in English --- plt.title(f'Detailed Functions for COG Category: "{cog_letter}"', fontsize=16, weight='bold') plt.xlabel('Annotation Score (Bit-score)', fontsize=12) plt.ylabel('Function Description', fontsize=12) plt.tight_layout() output_filename_svg = f'detailed_plot_category_{cog_letter}.svg' plt.savefig(output_filename_svg, bbox_inches='tight') print(f"Plot saved as '{output_filename_svg}'") plt.close() # --- MAIN SCRIPT --- print(f"Step 1: Loading data from '{EGGNOG_ANNOTATIONS_FILE}'...") try: header_row = 0 with open(EGGNOG_ANNOTATIONS_FILE, 'r') as f: for i, line in enumerate(f): if line.startswith('#query'): header_row = i break main_df = pd.read_csv(EGGNOG_ANNOTATIONS_FILE, sep='\t', header=header_row) except FileNotFoundError: print(f"\nERROR: File not found '{EGGNOG_ANNOTATIONS_FILE}'.") exit() print("Step 2: Cleaning COG categories...") main_df.dropna(subset=['COG_category'], inplace=True) main_df['Cleaned_COG'] = main_df['COG_category'].apply(lambda x: str(x)[0] if str(x) != '-' else None) # Generate a plot for each category of interest for category, color in COLOR_MAP.items(): generate_detailed_plot(main_df, category, color) print("\nAnalysis complete.") ``` Se debe de generar un output svg y png como el siguiente: ![final_cohesive_plot](https://hackmd.io/_uploads/rkAFgRnBee.png) # KEGGaNOG: visualize KEGG Pathways using output results from EggNOG-mapper https://github.com/iliapopov17/KEGGaNOG ## Installation ``` # Linux / WSL / macOS conda create -n kegganog pip -y conda activate kegganog pip install kegganog ``` ## Analysis ### Single mode plot for KEGGs completeness: `KEGGaNOG -i MM_rqbj1453.emapper.annotations.tsv -o output_dir` By the key -i the input (eggNOG-mapper output file) must be provided By the key -o the path to output directory (it will be created if it does not exist) must be provided ### Barplot for KEGGs completeness: **PENDIENTE MODIFICAR EL SCRIPT PARA QUE GENERE EL SVG. ASI COMO LA PALETA** Cambiar la ubicacion del archivo **`SAMPLE_pathways.tsv`** generado por el programa. ``` import kegganog as kgn import pandas as pd # Cargar los datos df = pd.read_csv("/Users/abraham/Downloads/eggNOG_mapper_analysis_CLAS/grouped_clas/SAMPLE_pathways.tsv", sep="\t") # Preparar el objeto de la gráfica kgnbar = kgn.barplot(df, figsize = (8, 10), sort_order="descending", yticks_fontsize = 8, cmap = "Blues", ) # ---- CORRECCIÓN FINAL ---- # 1. Generar la figura y guardarla en una variable 'fig' fig = kgnbar.plotfig() # 2. Usar el método savefig() de la figura para guardarla en un archivo fig.savefig("KEGG_barplot.png", # Nombre del archivo de salida dpi=300, # Para alta resolución bbox_inches='tight') # Asegura que las etiquetas no se corten print("¡Éxito! Gráfica guardada como KEGG_barplot.png") ``` Para analisis subsequentes explorar: https://github.com/iliapopov17/KEGGaNOG?tab=readme-ov-file https://github.com/raymondkiu/eggnog-mapper_COGextraction/tree/master https://github.com/lillycummins/IdentifierToGroupName https://docs.omicsbox.biobam.com/latest/Functional-Annotation-with-EggNOG-Mapper/