Flujo de trabajo RNA-seq

# **Flujo de trabajo RNA-Seq** **Análisis transcriptómico de la almeja mano de león (*Nodipecten subnodosus*) bajo condiciones térmicas oscilatorias en dos localidades de la península de Baja California** ![image](https://hackmd.io/_uploads/rJyoFWJO0.png) > [name=Giselle Moreno López] --- --- ## **I. Preprocesamiento de los datos** Lo primero que se realiza es una evaluación de la calidad de las lecturas. Para ello, se utiliza el software *FastQC*. Se puede utilizar a *MultiQC* para ver las métricas de todas las lecturas, sin embargo, es necesario ver una a una por si *MultiQC* sesga los gráficos. **1. Primera evaluación en FastQC** > Archivos de entrada: > - [ ] Lecturas crudas en formato fq, se recomienda que estén comprimidas (generalmente tienen un nombre largo, como N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_1.fq.gz) ``` #!/bin/bash #SBATCH -p cicese #SBATCH --job-name=fastqc #SBATCH --output=fastqc-%j.log #SBATCH --error=fastqc-%j.err #SBATCH -N 1 #SBATCH --ntasks-per-node=8 #SBATCH -t 6-00:00:00 export PATH=$PATH:/LUSTRE/apps/bioinformatica/FastQC_v0.11.7/:$PATH module load gcc-7.2.0 fastqc /LUSTRE/bioinformatica_data/genomica_funcional/Paulina/mano_leon/experimental/Giselle/trimmed/ph30/*.fq.gz -t 8 -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/trimmomatic/trimm2/multiq ``` > > Archivos de salida: > - [ ] Archivos .html con las métricas de cada archivo > - [ ] Archivos .zip con las imágenes y métricas > - [ ] Archivos .log y .err con información de la tarea enviada **2.** **Visualización de los datos en MultiQC** *MultiQC* es un visualizador de gráficos que nos permite ver múltiples lecturas a la vez. Nos da un panorama general de como se ven las lecturas. > Archivos de entrada: > - [ ] Archivos .zip con las imágenes y métricas ``` #!/bin/bash #SBATCH -p cicese #SBATCH --job-name=multiqc #SBATCH --output=mul-%j.log #SBATCH --error=mul-%j.err #SBATCH -N 1 #SBATCH --mem=100GB #SBATCH --ntasks-per-node=24 #SBATCH -t 06-00:00:00 export PATH=/LUSTRE/apps/Anaconda/2023/miniconda3/bin:$PATH source activate qiime2-2023.2 multiqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq/*zip-o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq --data-format json --export ``` > Archivos de salida: > - [ ] Archivo .html con todas las métricas de las lecturas > - [ ] Carpeta multiqc_data con la información de cada parámetro (calidad, sobrerepresentación, %GC, etc.) en formato JSON > - [ ] Carpeta multiqc_plots con todos los gráficos en pdf, png y svg > - [ ] Archivos .log y .err con información de la tarea enviada **3. Recorte de las lecturas en Trimmomatic** Una vez se deciden cuáles serán los parámetros, se realiza el recorte de los índices y adaptadores y se eliminan lecturas de baja calidad (Phred score <30) en *Trimmomatic*. En mi caso, agregué todos los adaptadores para que se eliminarán de mis lecturas. > Archivos de entrada: > - [ ] Lecturas crudas en formato .fq.gz ``` #!/bin/bash ######################################################### #TRIMOMMATIC ## Directivas #SBATCH --job-name=trimming #SBATCH --output=trimmomatic-%j.log #SBATCH --error=trimmomatic-%j.err #SBATCH -N 2 #SBATCH --mem=100GB #SBATCH --ntasks-per-node=24 #SBATCH -t 6-00:00:00 #SBATCH -p cicese # Ruta a TRIMMOMATIC TRIMMOMATIC=/LUSTRE/apps/bioinformatica/Trimmomatic-0.39 #Ruta al archivo con los adaptadores trueseq="$TRIMMOMATIC/adapters/TruSeq3-PE-2.fa" trueseq="$TRIMMOMATIC/adapters/TruSeq3-PE.fa" trueseq="$TRIMMOMATIC/adapters/TruSeq2-PE.fa" trueseq="$TRIMMOMATIC/adapters/NexteraPE-PE.fa" trueseq="$TRIMMOMATIC/adapters/TruSeq2-SE.fa" trueseq="$TRIMMOMATIC/adapters/TruSeq3-SE.fa" cd ${SLURM_SUBMIT_DIR} java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_1_CKDL230014573-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_1_CKDL230014573-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_1.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_2.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_3_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_3_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_3.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_4_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_4_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_4.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_5_CKDL230014574-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_5_CKDL230014574-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_5.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_6_CKDL230014580-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_6_CKDL230014580-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_6.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_7_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_7_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_7.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_8_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_8_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_8.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_9_CKDL230014575-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_9_CKDL230014575-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_9.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_10_2_CKDL230020994-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_10_2_CKDL230020994-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_10.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_11_1_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_11_1_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_11.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_12_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_12_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_12.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36 exit 0 ``` > Archivos de salida: > - [ ] Archivo .fq.gz con un nuevo nombre > - [ ] Archivos .log y .err con información de la tarea enviada Una vez que se realiza el recorte de las lecturas, podemos realizar de nuevo la evaluación de la calidad en FastQC y MultiQC. **FastQC** > Archivos de entrada: > - [ ] Lecturas preprocesadas en formato .fq.gz ``` #!/bin/bash #SBATCH -p cicese #SBATCH --job-name=fastqc #SBATCH --output=fastqc-%j.log #SBATCH --error=fastqc-%j.err #SBATCH -N 2 #SBATCH --ntasks-per-node=24 #SBATCH -t 6-00:00:00 export PATH=$PATH:/LUSTRE/apps/bioinformatica/FastQC_v0.11.7/:$PATH module load gcc-7.2.0 fastqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/*.fq.gz -t 8 -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq ``` > Archivos de salida: > - [ ] Archivos .html con las métricas de cada archivo > - [ ] Archivos .zip con las imágenes y métricas > - [ ] Archivos .log y .err con información de la tarea enviada **MultiQC** > Archivos de entrada: > - [ ] Archivos .zip con las imágenes y métricas ``` #!/bin/bash #SBATCH -p cicese #SBATCH --job-name=multifqc #SBATCH --output=mul-%j.log #SBATCH --error=mul-%j.err #SBATCH -N 4 #SBATCH --mem=100GB #SBATCH --ntasks-per-node=24 #SBATCH -t 06-00:00:00 export PATH=/LUSTRE/apps/Anaconda/2023/miniconda3/bin:$PATH source activate qiime2-2023.2 multiqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq/*zip -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq --data-format json --export ``` > Archivos de salida: > - [ ] Archivo .html con todas las métricas de las lecturas > - [ ] Carpeta multiqc_data con la información de cada parámetro (calidad, sobrerepresentación, %GC, etc.) en formato JSON > - [ ] Carpeta multiqc_plots con todos los gráficos en pdf, png y svg > - [ ] Archivos .log y .err con información de la tarea enviada --- ## **II. Ensamble *de novo*** Ahora que sabemos que nuestras lecturas están libres de adaptadores, índices y de secuencias de baja calidad, procederemos a realizar el ensamble de novo. Utilizaremos el software Trinity. Trinity es un paquete de programas cuya metodología permite la reconstrucción de transcriptomas basados en datos de RNA-Seq y reporta transcritos ensamblados con sus posibles isoformas. El ensamble de Trinity está basado principalmente en tres módulos: Inchworm, Chrysalis y Butterfly. **1. Ensamblado *de novo*** Antes de realizar el ensamble, es necesario que todas las lecturas sean descomprimidas. Se puede utilizar el siguiente comando: ``` gunzip *.gz ``` Una vez que se descomprimieron todas las lecturas, se deben comprimir las 1P y 2P en archivos diferentes, usando el siguiente comando: ``` cat *1P.fq | gzip > R1.fastq.gz cat *2P.fq | gzip > R2.fastq.gz ``` Una vez hecho esto, podemos iniciar. > Archivos de entrada: > - [ ] Lecturas en formato .fastq.gz ``` #!/bin/bash # Directivas #SBATCH --job-name=Tri-denovo # nombre del trabajo que aparecera #SBATCH --output=Tri-%j.log # salida estandar num de trabajo #SBATCH --error=Tri-%j.err # salida estandar de error #SBATCH --nodes 2 # número de nodos #SBATCH --mem=100GB # RAM 1GB por cada millon de reads #SBATCH --ntasks-per-node=24 # número de tareas por nodo #SBATCH -t 6-00:00:00 # número maximo #SBATCH -p cicese #cola #export some soft export PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/ module load conda-2023 module load trinityrnaseq-v2.15.1 #my vars R1="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamblev3/trimm2/R1.fastq.gz" R2="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamblev3/trimm2/R2.fastq.gz" OUTDIR="Trinity_ensamble" LIBT="RF" Trinity --no_version_check --seqType fq --max_memory 100G \ --no_bowtie \ --CPU 24 --left $R1 --right $R2 \ --normalize_reads \ --output $OUTDIR #--full_cleanup" # see the time spent START=`date +%s` # Run trinity ###$TRINITY $PARM END=`date +%s` RUNTIME=$(($END-$START)) echo "run time -> $RUNTIME" ``` > Archivos de salida: > - [ ] Archivo Trinity_ensamble.Trinity.fasta > - [ ] Archivo Trinity_ensamble.Trinity.fasta.gene_trans_map > - [ ] Carpeta Trinity_ensamble con la información de cada módulo > - [ ] Archivos .log y .err con información de la tarea enviada **Evaluación rápida del ensamble** Mediante Trinity, se puede hacer un análisis rápido de las métricas del ensamble (promedio del contig, número de genes e isoformas, N50, etc.). > Archivos de entrada: > - [ ] Archivo Trinity_ensamble.Trinity.fasta ``` #!/bin/bash # Directivas #SBATCH --job-name=Trin-est # nombre del trabajo que aparecera #SBATCH --output=stats-%j.log # salida estandar num de trabajo #SBATCH --error=stats-%j.err # salida estandar de error #SBATCH --nodes 2 # número de nodos #SBATCH --mem=100GB # #SBATCH --ntasks-per-node=24 # número de tareas por nodo #SBATCH -t 6-00:00:00 # número maximo #SBATCH -p cicese #cola #SBATCH --exclusive cd $SLURM_SUBMIT_DIR # export PATH=$PATH:/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/ cd /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble_bueno/ensamblev3 time perl /LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util/TrinityStats.pl Trinity_ensamble.Trinity.fasta exit 0 ``` > Archivos de salida: > - [ ] Archivos .log y .err con información de la tarea enviada **2. Evaluación del ensamble** Transrate es un software que evalua la calidad del ensamble > Archivos de entrada: > - [ ] Archivo Trinity_ensamble.Trinity.fasta > - [ ] Archivos R1.fastq.gz y R2.fastq.gz ``` #!/bin/bash #SBATCH -p cicese #SBATCH --job-name=tst-rvm #SBATCH --output=ruby-%j.log #SBATCH --error=ruby-%j.err #SBATCH -N 2 #SBATCH --mem=100GB #SBATCH --ntasks-per-node=24 #SBATCH -t 06-00:00:00 ##se deben concatenar todas las trimm_p forward en un archivo y las reverse en otro ## correr este script dentro de la carpeta TRASNRATE TRANSRATE=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/bin ENSAMBLE="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/Trinity_ensamble.Trinity.fasta" R1="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/R1.fastq.gz" R2="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/R2.fastq.gz" cd $SLURM_SUBMIT_DIR export PATH=/LUSTRE/apps/bioinformatica/.local/bin:$PATH export PATH=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/bin:$PATH export LD_LIBRARY_PATH=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/lib:$LD_LIBRARY_PATH $TRANSRATE/transrate \ --assembly $ENSAMBLE \ --left $R1 \ --right $R2 \ --threads 8 \ --output gcontigs ``` > Archivos de salida: > - [ ] Carpeta gcontigs > > - [ ] Carpeta Trinity_ensamble.Trinity con los conteos buenos y malos de los transcritos > > - [ ] Archivo assemblies.csv > - [ ] Archivos .log y .err con información de la tarea enviada --- ## **III. Alineamiento y cuantificación** Una vez que nuestro ensamble tiene buenas métricas, podemos hacer la alineación, que consiste en alinear las secuencias cortas de nuestras lecturas al transcriptoma ensamblado, además de contar el número de secuencias cortas que se alinean a un determinado contig. Mediante RSEM y Bowtie2 incluidos en Trinity podemos realizar este paso. **1. Alineamiento** RSEM alinea las lecturas al transcriptoma de referencia utilizando Bowtie2. Antes de realizar el alineamiento, es necesario generar una matriz que contenga las lecturas y tratamientos, como en el siguiente ejemplo: ``` Basal_BLA Basal_BLA_1 N_1_1P.fq N_1_2P.fq Basal_BLA Basal_BLA_2 N_2_1P.fq N_2_2P.fq Basal_BLA Basal_BLA_3 N_3_1P.fq N_3_2P.fq Basal_LOL Basal_LOL_1 N_4_1P.fq N_4_1P.fq Basal_LOL Basal_LOL_2 N_5_1P.fq N_5_2P.fq Basal_LOL Basal_LOL_3 N_6_1P.fq N_6_2P.fq Osc_reg_BLA Osc_reg_BLA_1 N_7_1P.fq N_7_2P.fq Osc_reg_BLA Osc_reg_BLA_2 N_8_1P.fq N_8_2P.fq Osc_reg_BLA Osc_reg_BLA_3 N_9_1P.fq N_9_2P.fq Osc_reg_LOL Osc_reg_LOL_1 N_10_1P.fq N_10_2P.fq Osc_reg_LOL Osc_reg_LOL_2 N_11_1P.fq N_11_2P.fq Osc_reg_LOL Osc_reg_LOL_3 N_12_1P.fq N_12_2P.fq ``` > Archivos de entrada: > - [ ] Archivo samples.txt > - [ ] Archivo good.Trinity_ensamble.Trinity.fasta proveniente de Transrate ``` #!/bin/bash # Directivas #SBATCH --job-name=Trin-cuantif #SBATCH --output=cuantif-%j.log #SBATCH --error=cuantif-%j.err #SBATCH --nodes 4 #SBATCH --mem=100GB #SBATCH --ntasks-per-node=24 #SBATCH -t 6-00:00:00 #SBATCH --exclusive #export some soft export PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/RSEM/bin/ module load conda-2023 module load trinityrnaseq-v2.15.1 UTIL=/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util SAMPLES=/LUSTRE/bioinformatica_data/genomica_funcional/Gis/cuantificacion cd $SLURM_SUBMIT_DIR $UTIL/align_and_estimate_abundance.pl \ --seqType fq \ --samples_file samples.txt \ --transcripts good.Trinity_ensamble.Trinity.fasta \ --est_method RSEM \ --aln_method bowtie2 \ --trinity_mode \ --prep_reference \ --output_dir RSEM_outdir \ --thread_count=24 exit 0 ``` > Archivos de salida: > - [ ] Carpetas de cada condición con los resultados de RSEM (conteos de genes e isoformas) > - [ ] Archivos .log y .err con información de la tarea enviada Una vez que se realizó el alineamiento y conteo de las lecturas, se procede a realizar la matriz con todos los conteos mediante el siguiente comando (sin slurm): ``` ls */*isoforms.results > isoforms ``` > Aquí no es necesario tener algún archivo de entrada, el script por sí solo buscará los archivos de conteo en las carpetas de cada condición o tratamiento. Posteriormente, se utiliza el siguiente script: ``` #!/bin/bash # Directivas #SBATCH --job-name=abundance_mat #SBATCH --output=abundance_mat-%j.log #SBATCH --error=abundance_mat-%j.err #SBATCH --nodes 2 #SBATCH --mem=100GB #SBATCH --ntasks-per-node=24 #SBATCH -t 6-00:00:00 #SBATCH --exclusive #export some soft export PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/ export PATH=$PATH:/LUSTRE/apps/bioinformatica/RSEM/bin/ module load conda-2023 module load trinityrnaseq-v2.15.1 UTIL=/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util TRANS=/LUSTRE/bioinformatica_data/genomica_funcional/Gis/exp_dif/ensamble_pg cd $SLURM_SUBMIT_DIR $UTIL/abundance_estimates_to_matrix.pl \ --est_method RSEM \ --out_prefix count \ --quant_files $TRANS/isoforms \ --gene_trans_map $TRANS/good.Trinity_ensamble.Trinity.fasta.gene_trans_map \ --name_sample_by_basedir exit 0 ``` > Archivos de salida: > - [ ] Matriz de conteos de genes > - [ ] Matriz de conteos de isoformas (es la que utilizaremos para el análisis DEseq2) > - [ ] Archivos .log y .err con información de la tarea enviada --- ## **IV. Análisis de expresión diferencial génica** Cuando tenemos las matrices de conteo de los genes e isoformas, podemos iniciar nuestro análisis de expresión diferencial génica (DGE). Podemos utilizar DEseq2 o EdgeR en RStudio. ### **DGE en DEseq2** El análisis de expresión diferencial con DESeq2 implica varios pasos, donde DESeq2 modelará los recuentos brutos, utilizando factores de normalización (factores de tamaño) para tener en cuenta las diferencias en la profundidad de la biblioteca. Luego, estimará las dispersiones genéticas y reducirá estas estimaciones para generar estimaciones de dispersión más precisas para modelar los recuentos. Finalmente, DESeq2 se ajustará al modelo binomial negativo y realizará pruebas de hipótesis utilizando la prueba de Wald o prueba de ratio de verosimilitud. Es importante que los valores de recuento estén sin normalizar, ya que permiten evaluar correctamente la precisión de la medición. El modelo DESeq2 corrige internamente el tamaño de la biblioteca. > **Conceptos importantes** > * log2FoldChange: > * baseMean (expresión media): > * valor de p ajustado: > * FDR (tasa de descubrimiento falso): > **Manual de DEseq2**: [https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow](https://) ### DEseq2 Abriremos RStudio, de preferencia debemos de mantener un espacio de trabajo determinado (una carpeta llamada DEseq o DGE) donde tengamos nuestros scripts, matrices y los gráficos que resulten del análisis. Ejemplo de matriz de conteos: ![image](https://hackmd.io/_uploads/SJSunoMrT.png) > Archivos de entrada: > - [ ] Script para RStudio (.R) > - [ ] Matriz de conteos de isoformas (.tsv) El siguiente script se ejecuta en RStudio, dando clic directamente en el archivo .R ``` # DIFFERENTIAL EXPRESSION ANALYSIS # Script modificado de Ricardo Gomez-Reyes # Limpiar la memoria de la sesion de R ==== rm(list = ls()) if(!is.null(dev.list())) dev.off() if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2") # Configuramos # options(stringsAsFactors = FALSE) # Cargamos nuestras librerias de trabajo ==== # sacar herramientas de nuestra caja de herramientas if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("vsn") #install.packages("tidyverse") library(tidyverse) library(DESeq2) library(vsn) .cran_packages <- c('tidyverse', 'DESeq2') .inst <- .cran_packages %in% installed.packages() if(any(!.inst)) { install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org') } sapply(c(.cran_packages), require, character.only = TRUE) ############################################################### ############################################################## ####### Establecemos directorio (escritorio) de trabajo ==== path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2" count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T) # Cargamos archivos en R ==== #library(readr) count <- read_tsv(count_f) colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T)) rownames(colData) <- colData$Library_ID rownames=TRUE #colData <- colData[-1] #remover la primera columna # Modificamos caracteres ==== colNames <- gsub("_L001.isoforms.results", "", names(count)) #colNames <- sapply(strsplit(colNames, "_"), `[`, 1)# colnames(count) <- c("transcript_id", colNames[-1]) query.ids <- count$transcript_id library(dplyr) count <- count %>% select_if(is.double) %>% as(., "matrix") rownames(count) <- query.ids # 1) Filter data by removing low-abundance genes ---- by_count <- 1; by_freq <- 2 keep <- rowSums(count > by_count) >= by_freq sum(keep) # N transcripts nrow(count <- count[keep,]) count <- round(count) # redondea los valores # 2) Format metadata and count matrix ---- row.names(colData) <- colData$Library_ID #para colocar los nombres de Library_ID en rownames row.names=TRUE colData <- colData %>% arrange(match(Library_ID, colnames(count))) #all(colnames(count) %in% rownames(colData)) #colData <- dplyr::rename(colData, samples = Library_ID) #rownames(colData) <- colData$Library_ID any(colnames(count) == colData$Library_ID) # sanity check #colData <- mutate_if(colData, is.character, as.factor) colData <- colData %>% tibble::rownames_to_column(var = "ROW_ID") %>% mutate(across(-ROW_ID, as.factor)) %>% column_to_rownames(var = "ROW_ID") # Usamos objeto colData para elaborar el diseno experimental f_col <- "Condition" names(colData)[names(colData) %in% f_col] <- "Condition" # Si es posible, especificar el factor "control" como nivel de referencia: colData <- colData %>% mutate(Condition = relevel(Condition, ref = "Ctrl")) #usa el factor Basal como nivel de referencia en las comparaciones estadisticas # PREVALENCE ---- apply(count, 1, function(x) sum(x > 0)) %>% table() #calcular elementos por cada fila >0 prevelancedf = apply(count, 1, function(x) sum(x > 0)) # calcula el n?mero de elementos mayores que cero en cada fila de la matriz count y los almacena en un vector llamado prevalence #librerias library(dplyr) library(ggplot2) data.frame(Prevalence = prevelancedf, TotalAbundance = rowSums(count)) %>% # mean_se as_tibble(rownames = "Name") %>% arrange(desc(TotalAbundance)) -> prevelancedf prevelancedf %>% dplyr::count(Prevalence) %>% ggplot(aes(Prevalence, n)) + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) + scale_y_continuous("Number of transcript", labels = scales::comma) # BOXPLOT ==== qprobs <- function(x) { x <- x[x > 1] quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)) } apply(log2(count+1), 2, qprobs) %>% t() %>% as_tibble(rownames = 'Library_ID') %>% left_join(colData) -> probs_df probs_df %>% ggplot(., aes(x = Library_ID, ymin = `5%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) + geom_errorbar(width = 0.3, position = position_dodge(0.6)) + geom_boxplot(aes(fill = Condition), width = 0.5, stat = 'identity', position = position_dodge(0.6)) + labs(y = expression(log[2]~ 'Abundance'), x = '') + theme_classic(base_family = "GillSans", base_size = 18) + theme( legend.position = 'top', panel.border = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line.x = element_blank()) -> ptop ptop <- ptop + facet_grid(~ Condition, scales = 'free') + theme( strip.background = element_rect(fill = 'grey', color = 'white')) ptop # BOTTOM ==== apply(count, 2, function(x) sum(x > 0)) -> Total_genes # How singletones are per sample? ---- apply(count, 2, function(x) sum(x > 0)) -> filtered_genes cbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genes names(n_genes) <- c('Library_ID','Raw', 'Filt') n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genes n_genes %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>% ggplot() + geom_col(aes(x = Library_ID, y = Raw, fill = Condition)) + labs(x = 'Samples', y = 'Total Transcript') + scale_y_continuous(labels = scales::comma) + theme_minimal(base_family = "GillSans", base_size = 12) + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10)) -> pbottom pbottom <- pbottom + facet_grid(~ Condition, scales = 'free') + theme( strip.background = element_blank(), strip.text = element_blank()) pbottom ###aqui obtuve tres plots library(patchwork) # para unir las dos graficas ptop / pbottom + patchwork::plot_layout(heights = c(1,1.2)) # DESEQ2 ===== ddsFullCountTable <- DESeqDataSetFromMatrix( countData = count, colData = colData, design = ~ Condition + Site+ Condition:Site) ##para modelar la influencia de m?ltiples factores en los datos de exp g?nica, si no unicamente dejar ~ Condition # (Above): transform or not the count matrix? # In order to test for differential expression, we operate on raw counts and use discrete distributions ... # 3) Run DESeq in the following functions order ---- # Negative Binomial GLM # # For the analysis we need to estimate the effective library size to normalize for. # Imagine, a gene has the same number of counts in two samples. But the library size # (total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. # You can estimate the size factors from the count data using the function estimateSizeFactors(): dds <- estimateSizeFactors(ddsFullCountTable) # sizeFactors(dds) # Next we need to estimate for each condition a function that allows to predict the # dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples: dds <- estimateDispersions(dds) #boxplot(log2(counts(ddsFullCountTable)+0.5)) #boxplot(log2(counts(dds, normalized=TRUE)+0.5)) # Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes: dds <- nbinomWaldTest(dds, maxit = 200) # (Above) use Wald test when contrasting a unique pair (ex. control vs cancer) res <- results(dds) # d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE) # d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE) # # p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) + # geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) + # scale_y_log10() + # stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) + # labs(y = "Gene count (Log10)", x = "Treatment") + # theme_bw(base_family = "GillSans", base_size = 20) # # # p2 <- rbind(d1,d2) %>% # group_by(Design) %>% # tally(count) %>% # ggplot(aes(x=Design, y=n)) + # geom_col(width = 0.4) + # labs(y = "Library Size", x = "Treatment") + # theme_bw(base_family = "GillSans", base_size = 20) # # p2 /p1 # normal vs gamma-poisson distribution # library(bayestestR) # # prior <- distribution_normal(nrow(m), mean = 5) # posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution # # # p <- bayesfactor_parameters(posterior, prior, # direction = "two-sided", null = 0, verbose = FALSE) # # # plot(p) # centrality <- point_estimate(posterior) # # plot(centrality) # Note: also run as a single batch dds <- DESeq(ddsFullCountTable) res <- results(dds) #4) (OPTIONAL) Test transformation for downstream analyses---- #However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex: vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10 ntr <- DESeq2::normTransform(dds) DESeq2::plotPCA(ntr, intgroup = "Condition") DESeq2::plotPCA(vst, intgroup = "Condition") if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("vsn") raw_df <- vsn::meanSdPlot(assay(dds), plot = F) vst_df <- vsn::meanSdPlot(assay(vst), plot = F) ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F) # rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"), # data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"), # data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>% # ggplot(aes(px, py, color = col)) + # labs(x = "Ranks", y = "sd", color = "") + # geom_line(orientation = NA, position = position_identity(), size = 2) + # theme_bw(base_family = "GillSans", base_size = 20) + # theme(legend.position = "top") # 5) (OPTIONAL) Correlation heatmap ---- sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs') sample_dist = dist(t(assay(vst)), method='euclidean') hc_samples = hclust(sample_dist, method='complete') hc_order <- hc_samples$labels[hc_samples$order] heatmap(sample_cor, col = cm.colors(1)) sample_cor %>% as_tibble(rownames = 'Library_ID') %>% pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_long library(ggplot2) library(ggh4x) sample_cor_long %>% ggplot(aes(x = Library_ID, y = name, fill = cor)) + geom_tile(color = 'white', linewidth = 0.2) + # geom_raster() + # geom_text(aes(label = cor), color = 'white') + scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) + scale_x_discrete(position = 'top') + ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') + ggh4x::scale_y_dendrogram(hclust = hc_samples) + theme_classic(base_size = 12, base_family = "GillSans") + labs(x = '', y = '') + theme(axis.text.x = element_text(angle = 50, hjust = -0.15, vjust = 1)) # -> pheat # 6) PCA ===== library(DESeq2) ncol(data <- log2(count+1)) ncol(data <- assay(vst)) PCA = prcomp(t(data), center = FALSE, scale. = FALSE) percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1]) PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2]) library(ggforce) PCAdf %>% mutate(Library_ID = rownames(.)) %>% left_join(colData) %>% ggplot(., aes(PC1, PC2)) + # coord_fixed(ratio = sd_ratio) + geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) + geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) + ggforce::geom_mark_ellipse(aes(group = as.factor(Condition)), fill = 'grey', con.colour = 'grey') + geom_point(size = 7, alpha = 0.7, aes(color = Condition)) + geom_text( family = "GillSans", mapping = aes(label = Site), size = 2.5) + labs(caption = '', color = "") + xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) + ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) + theme_minimal(base_family = "GillSans", base_size = 20) + theme(plot.title = element_text(hjust = 0.5), legend.position = 'top') # EXPLORATORY DATA ==== summary(res) out of 127719 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 143, 0.11% LFC < 0 (down) : 226, 0.18% outliers [1] : 648, 0.51% low counts [2] : 61847, 48% (mean count < 4) hist(res$pvalue) res <- res %>% as_tibble(rownames = "transcript") # ========================= Pruebas HeatMap ======================================== library(DESeq2) library(ggplot2) library(ComplexHeatmap) #library(org.Hs.eg.db) #library("org.Hs.eg.db") paquete de Bioconductor de anotacion de genes del genoma de Homo Sapiens significant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) | (log2FoldChange < 2 & padj < 0.05)) sigs.df <- as.data.frame(significant) row.names(sigs.df) <- sigs.df$transcript #para colocar los nombres de Transcript en rownamesrow.names=TRUE #sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL") ################################################################################# ###Silencie las siguientes dos lineas porque filtraban mis datos, reduciendolos a 5 #sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ] #sig <- sigs.df[, -1] ################################################################################ mat <- counts(dds, normalized =T)[rownames(sigs.df), ] mat.z <- t(apply(mat, 1, scale)) colnames(mat.z) <- rownames(colData) Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript) ``` Posteriormente, es necesario obtener un gráfico de volcano, el cual se obtiene de la siguiente manera (mismo script): > Archivos de salida: > Gráficos: > - [ ] Conteos crudos en un boxplot y gráfico de barras > - [ ] Heatmaps de correlación de Pearson > - [ ] Heatmap de los genes transcritos en cada condición > - [ ] Volcano plot > - [ ] En la consola, el número y porcentaje de transcritos (up, down y outliers) ##### Asimismo, se puede realizar el análisis DGE por contrastes. En mi caso, lo realicé para Basales, Osc, LOL y BLA. Sin embargo, para fines prácticos, colocaré solo el script de Basales y de BLA. El siguiente script se corre en RStudio. > Archivos de entrada: > - [ ] Script R Basales.R > - [ ] Archivo good.count.isoform.counts.matrix > - [ ] Archivo metadata_Ns.tsv > - [ ] Trinotate_report.tsv ``` #BASALES # DIFFERENTIAL EXPRESSION ANALYSIS # Script modificado de Ricardo Gomez-Reyes # Limpiar la memoria de la sesion de R ==== rm(list = ls()) if(!is.null(dev.list())) dev.off() #if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("DESeq2") library(DESeq2) # Configuramos # options(stringsAsFactors = FALSE) # Cargamos nuestras librerias de trabajo ==== # sacar herramientas de nuestra caja de herramientas #if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("vsn") library(vsn) #install.packages("tidyverse") library(tidyverse) library(DESeq2) library(vsn) .cran_packages <- c('tidyverse', 'DESeq2') .inst <- .cran_packages %in% installed.packages() if(any(!.inst)) { install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org') } sapply(c(.cran_packages), require, character.only = TRUE) ############################################################### ############################################################## ####### Establecemos directorio (escritorio) de trabajo ==== path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2" count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T) # Cargamos archivos en R ==== #library(readr) count <- read_tsv(count_f) colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T)) rownames(colData) <- colData$Library_ID rownames=TRUE #colData <- colData[-1] #remover la primera columna # Modificamos caracteres ==== colNames <- gsub("_L001.isoforms.results", "", names(count)) #colNames <- sapply(strsplit(colNames, "_"), `[`, 1)# colnames(count) <- c("transcript_id", colNames[-1]) query.ids <- count$transcript_id library(dplyr) count <- count %>% select_if(is.double) %>% as(., "matrix") rownames(count) <- query.ids # 1) Filter data by removing low-abundance genes ---- by_count <- 1; by_freq <- 2 keep <- rowSums(count > by_count) >= by_freq sum(keep) # N transcripts nrow(count <- count[keep,]) count <- round(count) # redondea los valores # 2) Format metadata and count matrix ---- row.names(colData) <- colData$Library_ID #para colocar los nombres de Library_ID en rownames row.names=TRUE colData <- colData %>% arrange(match(Library_ID, colnames(count))) #all(colnames(count) %in% rownames(colData)) #colData <- dplyr::rename(colData, samples = Library_ID) #rownames(colData) <- colData$Library_ID any(colnames(count) == colData$Library_ID) # sanity check #colData <- mutate_if(colData, is.character, as.factor) colData <- colData %>% tibble::rownames_to_column(var = "ROW_ID") %>% mutate(across(-ROW_ID, as.factor)) %>% column_to_rownames(var = "ROW_ID") # Usamos objeto colData para elaborar el diseno experimental colData=colData %>% filter(Condition=="Bas") #filtro de condicion(Basal) por sitio f_col <- "Site" names(colData)[names(colData) %in% f_col] <- "Site" # Si es posible, especificar el factor "control" como nivel de referencia: colData <- colData %>% mutate(Condition= relevel(Condition, ref = "LOL")) #usa el factor Basal como nivel de referencia en las comparaciones estadisticas keep=colnames(count) %in% colData$Library_ID count=count[,keep ] # PREVALENCE ---- apply(count, 1, function(x) sum(x > 0)) %>% table() #calcular elementos por cada fila >0 prevelancedf = apply(count, 1, function(x) sum(x > 0)) # calcula el n?mero de elementos mayores que cero en cada fila de la matriz count y los almacena en un vector llamado prevalence #librerias library(dplyr) library(ggplot2) data.frame(Prevalence = prevelancedf, TotalAbundance = rowSums(count)) %>% # mean_se as_tibble(rownames = "Name") %>% arrange(desc(TotalAbundance)) -> prevelancedf prevelancedf %>% dplyr::count(Prevalence) %>% ggplot(aes(Prevalence, n)) + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) + scale_y_continuous("Number of transcript", labels = scales::comma) # BOXPLOT ==== qprobs <- function(x) { x <- x[x > 1] quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)) } apply(log2(count+1), 2, qprobs) %>% t() %>% as_tibble(rownames = 'Library_ID') %>% left_join(colData) -> probs_df probs_df %>% ggplot(., aes(x = Library_ID, ymin = `5%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) + geom_errorbar(width = 0.3, position = position_dodge(0.6)) + geom_boxplot(aes(fill = Site), width = 0.5, stat = 'identity', position = position_dodge(0.6)) + labs(y = expression(log[2]~ 'Abundance'), x = '') + theme_classic(base_family = "GillSans", base_size = 18) + theme( legend.position = 'top', panel.border = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line.x = element_blank()) -> ptop ptop <- ptop + facet_grid(~ Site, scales = 'free') + theme( strip.background = element_rect(fill = 'grey', color = 'white')) ptop # BOTTOM ==== apply(count, 2, function(x) sum(x > 0)) -> Total_genes # How singletones are per sample? ---- apply(count, 2, function(x) sum(x > 0)) -> filtered_genes cbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genes names(n_genes) <- c('Library_ID','Raw', 'Filt') n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genes n_genes %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>% ggplot() + geom_col(aes(x = Library_ID, y = Raw, fill = Site)) + labs(x = 'Samples', y = 'Total Transcript') + scale_y_continuous(labels = scales::comma) + theme_minimal(base_family = "GillSans", base_size = 12) + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10)) -> pbottom pbottom <- pbottom + facet_grid(~ Site, scales = 'free') + theme( strip.background = element_blank(), strip.text = element_blank()) pbottom ######## # aqui obtuve tres plots: # el boxplot, grafico de barras (arriba) y el que tiene a ambos (abajo) ######## library(patchwork) # para unir las dos graficas ptop / pbottom + patchwork::plot_layout(heights = c(1,1.2)) # DESEQ2 ===== ddsFullCountTable <- DESeqDataSetFromMatrix( countData = count, colData = colData, design = ~ Site) ##para modelar la influencia de m?ltiples factores en los datos de exp g?nica, si no unicamente dejar ~ Condition #####OJOOOOO, AQUI cambie #Site# por condition # (Above): transform or not the count matrix? # In order to test for differential expression, we operate on raw counts and use discrete distributions ... # 3) Run DESeq in the following functions order ---- # Negative Binomial GLM # # For the analysis we need to estimate the effective library size to normalize for. # Imagine, a gene has the same number of counts in two samples. But the library size(total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. # You can estimate the size factors from the count data using the function estimateSizeFactors(): dds <- estimateSizeFactors(ddsFullCountTable) # sizeFactors(dds) # Next we need to estimate for each condition a function that allows to predict the dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples: dds <- estimateDispersions(dds) ############## ##aqui empieza a estimar las dispersiones geneticas y reduce estas estimaciones para generar estimaciones de dispersión más precisas para modelar los recuentos ############## #boxplot(log2(counts(ddsFullCountTable)+0.5)) #boxplot(log2(counts(dds, normalized=TRUE)+0.5)) # Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes: dds <- nbinomWaldTest(dds, maxit = 200) # (Above) use Wald test when contrasting a unique pair (ex. control vs cancer) res <- results(dds) # d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE) # d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE) # # p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) + # geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) + # scale_y_log10() + # stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) + # labs(y = "Gene count (Log10)", x = "Treatment") + # theme_bw(base_family = "GillSans", base_size = 20) # # # p2 <- rbind(d1,d2) %>% # group_by(Design) %>% # tally(count) %>% # ggplot(aes(x=Design, y=n)) + # geom_col(width = 0.4) + # labs(y = "Library Size", x = "Treatment") + # theme_bw(base_family = "GillSans", base_size = 20) # # p2 /p1 # normal vs gamma-poisson distribution # library(bayestestR) # # prior <- distribution_normal(nrow(m), mean = 5) # posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution # # # p <- bayesfactor_parameters(posterior, prior, # direction = "two-sided", null = 0, verbose = FALSE) # # # plot(p) # centrality <- point_estimate(posterior) # # plot(centrality) # Note: also run as a single batch dds <- DESeq(ddsFullCountTable) res <- results(dds) #4) (OPTIONAL) Test transformation for downstream analyses---- #However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex: vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10 ntr <- DESeq2::normTransform(dds) DESeq2::plotPCA(ntr, intgroup = "Site") DESeq2::plotPCA(vst, intgroup = "Site") # if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("vsn") library(vsn) raw_df <- vsn::meanSdPlot(assay(dds), plot = F) vst_df <- vsn::meanSdPlot(assay(vst), plot = F) ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F) # rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"), # data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"), # data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>% # ggplot(aes(px, py, color = col)) + # labs(x = "Ranks", y = "sd", color = "") + # geom_line(orientation = NA, position = position_identity(), size = 2) + # theme_bw(base_family = "GillSans", base_size = 20) + # theme(legend.position = "top") # 5) (OPTIONAL) Correlation heatmap ---- sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs') sample_dist = dist(t(assay(vst)), method='euclidean') hc_samples = hclust(sample_dist, method='complete') hc_order <- hc_samples$labels[hc_samples$order] heatmap(sample_cor, col = cm.colors(1)) sample_cor %>% as_tibble(rownames = 'Library_ID') %>% pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_long library(ggplot2) library(ggh4x) sample_cor_long %>% ggplot(aes(x = Library_ID, y = name, fill = cor)) + geom_tile(color = 'white', linewidth = 0.2) + # geom_raster() + # geom_text(aes(label = cor), color = 'white') + scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) + scale_x_discrete(position = 'top') + ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') + ggh4x::scale_y_dendrogram(hclust = hc_samples) + theme_classic(base_size = 12, base_family = "GillSans") + labs(x = '', y = '') + theme(axis.text.x = element_text(angle = 50, hjust = -0.15, vjust = 1)) # -> pheat # 6) PCA ===== library(DESeq2) ncol(data <- log2(count+1)) ncol(data <- assay(vst)) PCA = prcomp(t(data), center = FALSE, scale. = FALSE) percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1]) PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2]) library(ggforce) PCAdf %>% mutate(Library_ID = rownames(.)) %>% left_join(colData) %>% ggplot(., aes(PC1, PC2)) + # coord_fixed(ratio = sd_ratio) + geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) + geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) + ggforce::geom_mark_ellipse(aes(group = as.factor(Condition)), fill = 'grey', con.colour = 'grey') + geom_point(size = 7, alpha = 0.7, aes(color = Condition)) + geom_text( family = "GillSans", mapping = aes(label = Site), size = 2.5) + labs(caption = '', color = "") + xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) + ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) + theme_minimal(base_family = "GillSans", base_size = 20) + theme(plot.title = element_text(hjust = 0.5), legend.position = 'top') # EXPLORATORY DATA ==== summary(res) ######################################################################### ########### BASALES: Por sitio, BLA x LOL ######################################## # out of 126956 with nonzero total read count #adjusted p-value < 0.1 #LFC > 0 (up) : 189, 0.15% #LFC < 0 (down) : 258, 0.2% #outliers [1] : 491, 0.39% #low counts [2] : 44291, 35% ######################################################################## ######################################################################## hist(res$pvalue) res <- res %>% as_tibble(rownames = "transcript") # ========================= Pruebas HeatMap ======================================== library(DESeq2) library(ggplot2) library(ComplexHeatmap) #library(org.Hs.eg.db) #library("org.Hs.eg.db") paquete de Bioconductor de anotacion de genes del genoma de Homo Sapiens significant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) | (log2FoldChange < 2 & padj < 0.05)) sigs.df <- as.data.frame(significant) row.names(sigs.df) <- sigs.df$transcript #para colocar los nombres de Transcript en rownamesrow.names=TRUE #sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL") ################################################################################# ###Silencie las siguientes dos lineas porque filtraban mis datos, reduciendolos a 5 #sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ] #sig <- sigs.df[, -1] ################################################################################ mat <- counts(dds, normalized =T)[rownames(sigs.df), ] mat.z <- t(apply(mat, 1, scale)) colnames(mat.z) <- rownames(colData) Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript) #write.table(mat.z, file = "transcritos_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE) write.csv(sigs.df, file = "DEG_Basales.csv") ######## # # # :) # # # ######## #=============== Seguimiento con Script de Ricardo ========================= sigfc <- "Sign and FC";pv <- "Sign";fc <- "FC" colors_fc <- c("red2", "#4169E1", "forestgreen", "grey30") colors_fc <- structure(colors_fc, names = c(sigfc, pv, fc, "NS")) res$cc <- 'NS' res[which(abs(res$log2FoldChange) > 2), 'cc'] <- fc res[which(abs(res$padj) <= 0.05), 'cc'] <- pv res[which(res$padj <= 0.05 & abs(res$log2FoldChange) > 2), 'cc'] <- sigfc p <- res %>% ggplot(aes(y = -log10(pvalue), x = log2FoldChange, color = cc)) + geom_point() p <- p + scale_color_manual(name = "", values = colors_fc) + labs(x= expression(Log[2] ~ "Fold Change"), y = expression(-Log[10] ~ "P")) + theme_bw(base_family = "GillSans", base_size = 18) + theme(legend.position = "top") + geom_abline(slope = 0, intercept = -log10(0.05), linetype="dashed", alpha=0.5) + geom_vline(xintercept = 0, linetype="dashed", alpha=0.5) p # POSITIVE LFC == UP EXPRESSED IN CANCER # NEGATIVE LFC == UP EXPRESSED IN CONTROL res %>% mutate(g = sign(log2FoldChange)) %>% dplyr::count(cc, g) %>% mutate(g = ifelse(g == "1", "Osc", "Ctrl")) %>% filter(cc != 'NS') %>% group_by(g) %>% mutate(pct = n / sum(n)) %>% ggplot() + geom_col(aes(x = g, y = pct, fill = cc), width = 0.5) + scale_fill_manual(name = "", values = colors_fc[-4]) + theme_bw(base_family = "GillSans", base_size = 18) + labs(y = '% Transcripts', x = '') + theme(legend.position = 'top') + coord_flip() # Annotation ==== query.ids <- res %>% filter(cc == "Sign and FC") %>% pull(transcript) query.ids <- res$transcript url <- "https://raw.githubusercontent.com/RJEGR/Small-RNASeq-data-analysis/master/FUNCTIONS.R" source(url) annot_path <- paste0(path, "/anotacion/") annot_f <- list.files(path = annot_path, pattern = "Trinotate_report.tsv", full.names = T) annot <- read_tsv(annot_f, na = ".") %>% filter(transcript_id %in% query.ids) blastp_df <- split_blast(annot, hit = "BLASTP") %>% group_by(transcript) %>% arrange(identity) %>% sample_n(1) %>% ungroup() blastp_df %>% dplyr::count(genus, sort = T) tex_dat <- res %>% left_join(blastp_df) %>% drop_na(name) p + ggrepel::geom_text_repel(data = tex_dat, aes(label = name), min.segment.length = 0) #############aqui salio un volcanoplot########### ``` > Archivos de salida: > - [ ] transcritos_DE.txt > - [ ] DEG_Basales.csv > - [ ] Archivo metadata_Ns.tsv En el caso de que se requiera obtener las GO de los DGE up-down, se añade como última parte del script lo siguiente: ``` ##############ENRIQUECIMIENTO FUNCIONAL##################### ########### # Functional enrichment ==== .bioc_packages <- c('biomaRt','GOSemSim') # DESeq2' .inst <- .bioc_packages %in% installed.packages() #if(any(!.inst)) { # if (!require("BiocManager", quietly = TRUE)) #install.packages("BiocManager") BiocManager::install(.bioc_packages[!.inst], ask = F) #} sapply(c(.bioc_packages), require, character.only = TRUE) # Local functions url <- "https://raw.githubusercontent.com/RJEGR/Cancer_sete_T_assembly/main/functions.R" source(url) go_df <- split_gene_ontology(annot, hit = "BLASTP") table(go_df$ontology) gene2GO <- split(go_df$go, go_df$transcript) ########yo agregué esta líneas para que agarrara a GOSemSim & org if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GOSemSim") #install.packages("GOSemSim") library(GOSemSim) if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("org.Hs.eg.db") library(org.Hs.eg.db) ############################ aqui sigue el script hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="BP") #hsGO <- read_rds(paste0(annot_path, '/Trinotate_GO_annotations.txt')) library(readr) # Assuming the file is a TSV file (tab-separated values)/read_rds es formato binario hsGO <- read_tsv(paste0(annot_path, '/Trinotate_GO_annotations.txt')) res_up <- res %>% filter(log2FoldChange >=2 & padj <0.05) #filtro para los sobre-expresados res <- res %>% drop_na(padj) res_down <- res %>% filter(log2FoldChange <2 & padj <0.05) #filtro para los sub-expresados res_up %>% pull(padj) -> query.p res_up %>% pull(transcript) -> query.names #topGO es requerida, por lo que lo llamare por Biocmanager BiocManager::install("topGO") allRes_up <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20) allRes_up %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) #AQUI me dio un grafico de enriquecimiento funcional con los procesos UP res_down %>% pull(padj) -> query.p res_down %>% pull(transcript) -> query.names allRes_down <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20) #null para devolver todos allRes_down %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) #write.table(res_down, file = "down_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE) #write.csv(sigs.df, file= "matriz.csv", row.names= TRUE ) ``` Si se realiza este paso, se generan los archivos: > - [ ] down_DE.txt y up_DE.txt > - [ ] matriz.csv #### Script de BLA ``` ##BLA # DIFFERENTIAL EXPRESSION ANALYSIS # Script modificado de Ricardo Gomez-Reyes # Limpiar la memoria de la sesion de R ==== rm(list = ls()) if(!is.null(dev.list())) dev.off() #if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("DESeq2") library(DESeq2) # Configuramos # options(stringsAsFactors = FALSE) # Cargamos nuestras librerias de trabajo ==== # sacar herramientas de nuestra caja de herramientas #if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("vsn") library(vsn) #install.packages("tidyverse") library(tidyverse) library(DESeq2) library(vsn) .cran_packages <- c('tidyverse', 'DESeq2') .inst <- .cran_packages %in% installed.packages() if(any(!.inst)) { install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org') } sapply(c(.cran_packages), require, character.only = TRUE) ############################################################### ############################################################## ####### Establecemos directorio (escritorio) de trabajo ==== path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2" count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T) # Cargamos archivos en R ==== #library(readr) count <- read_tsv(count_f) colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T)) rownames(colData) <- colData$Library_ID rownames=TRUE #colData <- colData[-1] #remover la primera columna # Modificamos caracteres ==== colNames <- gsub("_L001.isoforms.results", "", names(count)) #colNames <- sapply(strsplit(colNames, "_"), `[`, 1)# colnames(count) <- c("transcript_id", colNames[-1]) query.ids <- count$transcript_id library(dplyr) count <- count %>% select_if(is.double) %>% as(., "matrix") rownames(count) <- query.ids # 1) Filter data by removing low-abundance genes ---- by_count <- 1; by_freq <- 2 keep <- rowSums(count > by_count) >= by_freq sum(keep) # N transcripts nrow(count <- count[keep,]) count <- round(count) # redondea los valores # 2) Format metadata and count matrix ---- row.names(colData) <- colData$Library_ID #para colocar los nombres de Library_ID en rownames row.names=TRUE colData <- colData %>% arrange(match(Library_ID, colnames(count))) #all(colnames(count) %in% rownames(colData)) #colData <- dplyr::rename(colData, samples = Library_ID) #rownames(colData) <- colData$Library_ID any(colnames(count) == colData$Library_ID) # sanity check #colData <- mutate_if(colData, is.character, as.factor) colData <- colData %>% tibble::rownames_to_column(var = "ROW_ID") %>% mutate(across(-ROW_ID, as.factor)) %>% column_to_rownames(var = "ROW_ID") # Usamos objeto colData para elaborar el diseno experimental colData=colData %>% filter(Site=="BLA") #filtro de condicion(Basal) por sitio f_col <- "Condition" names(colData)[names(colData) %in% f_col] <- "Condition" # Si es posible, especificar el factor "control" como nivel de referencia: colData <- colData %>% mutate(Condition= relevel(Condition, ref = "Bas")) #usa el factor Basal como nivel de referencia en las comparaciones estadisticas keep=colnames(count) %in% colData$Library_ID count=count[,keep ] # PREVALENCE ---- apply(count, 1, function(x) sum(x > 0)) %>% table() #calcular elementos por cada fila >0 prevelancedf = apply(count, 1, function(x) sum(x > 0)) # calcula el n?mero de elementos mayores que cero en cada fila de la matriz count y los almacena en un vector llamado prevalence #librerias library(dplyr) library(ggplot2) data.frame(Prevalence = prevelancedf, TotalAbundance = rowSums(count)) %>% # mean_se as_tibble(rownames = "Name") %>% arrange(desc(TotalAbundance)) -> prevelancedf prevelancedf %>% dplyr::count(Prevalence) %>% ggplot(aes(Prevalence, n)) + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) + scale_y_continuous("Number of transcript", labels = scales::comma) # BOXPLOT ==== qprobs <- function(x) { x <- x[x > 1] quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)) } apply(log2(count+1), 2, qprobs) %>% t() %>% as_tibble(rownames = 'Library_ID') %>% left_join(colData) -> probs_df probs_df %>% ggplot(., aes(x = Library_ID, ymin = `5%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) + geom_errorbar(width = 0.3, position = position_dodge(0.6)) + geom_boxplot(aes(fill = Site), width = 0.5, stat = 'identity', position = position_dodge(0.6)) + labs(y = expression(log[2]~ 'Abundance'), x = '') + theme_classic(base_family = "GillSans", base_size = 18) + theme( legend.position = 'top', panel.border = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line.x = element_blank()) -> ptop ptop <- ptop + facet_grid(~ Condition, scales = 'free') + theme( strip.background = element_rect(fill = 'grey', color = 'white')) ptop # BOTTOM ==== apply(count, 2, function(x) sum(x > 0)) -> Total_genes # How singletones are per sample? ---- apply(count, 2, function(x) sum(x > 0)) -> filtered_genes cbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genes names(n_genes) <- c('Library_ID','Raw', 'Filt') n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genes n_genes %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>% ggplot() + geom_col(aes(x = Library_ID, y = Raw, fill = Site)) + labs(x = 'Samples', y = 'Total Transcript') + scale_y_continuous(labels = scales::comma) + theme_minimal(base_family = "GillSans", base_size = 12) + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10)) -> pbottom pbottom <- pbottom + facet_grid(~ Site, scales = 'free') + theme( strip.background = element_blank(), strip.text = element_blank()) pbottom ######## # aqui obtuve tres plots: # el boxplot, grafico de barras (arriba) y el que tiene a ambos (abajo) ######## library(patchwork) # para unir las dos graficas ptop / pbottom + patchwork::plot_layout(heights = c(1,1.2)) # DESEQ2 ===== ddsFullCountTable <- DESeqDataSetFromMatrix( countData = count, colData = colData, design = ~ Condition) ##para modelar la influencia de m?ltiples factores en los datos de exp g?nica, si no unicamente dejar ~ Condition #####OJOOOOO, AQUI cambie #Site# por condition # (Above): transform or not the count matrix? # In order to test for differential expression, we operate on raw counts and use discrete distributions ... # 3) Run DESeq in the following functions order ---- # Negative Binomial GLM # # For the analysis we need to estimate the effective library size to normalize for. # Imagine, a gene has the same number of counts in two samples. But the library size(total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. # You can estimate the size factors from the count data using the function estimateSizeFactors(): dds <- estimateSizeFactors(ddsFullCountTable) # sizeFactors(dds) # Next we need to estimate for each condition a function that allows to predict the dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples: dds <- estimateDispersions(dds) ############## ##aqui empieza a estimar las dispersiones geneticas y reduce estas estimaciones para generar estimaciones de dispersión más precisas para modelar los recuentos ############## #boxplot(log2(counts(ddsFullCountTable)+0.5)) #boxplot(log2(counts(dds, normalized=TRUE)+0.5)) # Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes: dds <- nbinomWaldTest(dds, maxit = 200) # (Above) use Wald test when contrasting a unique pair (ex. control vs cancer) res <- results(dds) # d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE) # d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE) # # p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) + # geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) + # scale_y_log10() + # stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) + # labs(y = "Gene count (Log10)", x = "Treatment") + # theme_bw(base_family = "GillSans", base_size = 20) # # # p2 <- rbind(d1,d2) %>% # group_by(Design) %>% # tally(count) %>% # ggplot(aes(x=Design, y=n)) + # geom_col(width = 0.4) + # labs(y = "Library Size", x = "Treatment") + # theme_bw(base_family = "GillSans", base_size = 20) # # p2 /p1 # normal vs gamma-poisson distribution # library(bayestestR) # # prior <- distribution_normal(nrow(m), mean = 5) # posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution # # # p <- bayesfactor_parameters(posterior, prior, # direction = "two-sided", null = 0, verbose = FALSE) # # # plot(p) # centrality <- point_estimate(posterior) # # plot(centrality) # Note: also run as a single batch dds <- DESeq(ddsFullCountTable) res <- results(dds) #4) (OPTIONAL) Test transformation for downstream analyses---- #However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex: vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10 ntr <- DESeq2::normTransform(dds) DESeq2::plotPCA(ntr, intgroup = "Site") DESeq2::plotPCA(vst, intgroup = "Site") # if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("vsn") library(vsn) raw_df <- vsn::meanSdPlot(assay(dds), plot = F) vst_df <- vsn::meanSdPlot(assay(vst), plot = F) ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F) # rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"), # data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"), # data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>% # ggplot(aes(px, py, color = col)) + # labs(x = "Ranks", y = "sd", color = "") + # geom_line(orientation = NA, position = position_identity(), size = 2) + # theme_bw(base_family = "GillSans", base_size = 20) + # theme(legend.position = "top") # 5) (OPTIONAL) Correlation heatmap ---- sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs') sample_dist = dist(t(assay(vst)), method='euclidean') hc_samples = hclust(sample_dist, method='complete') hc_order <- hc_samples$labels[hc_samples$order] heatmap(sample_cor, col = cm.colors(1)) sample_cor %>% as_tibble(rownames = 'Library_ID') %>% pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_long library(ggplot2) library(ggh4x) sample_cor_long %>% ggplot(aes(x = Library_ID, y = name, fill = cor)) + geom_tile(color = 'white', linewidth = 0.2) + # geom_raster() + # geom_text(aes(label = cor), color = 'white') + scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) + scale_x_discrete(position = 'top') + ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') + ggh4x::scale_y_dendrogram(hclust = hc_samples) + theme_classic(base_size = 12, base_family = "GillSans") + labs(x = '', y = '') + theme(axis.text.x = element_text(angle = 50, hjust = -0.15, vjust = 1)) # -> pheat # 6) PCA ===== library(DESeq2) ncol(data <- log2(count+1)) ncol(data <- assay(vst)) PCA = prcomp(t(data), center = FALSE, scale. = FALSE) percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1]) PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2]) library(ggforce) PCAdf %>% mutate(Library_ID = rownames(.)) %>% left_join(colData) %>% ggplot(., aes(PC1, PC2)) + # coord_fixed(ratio = sd_ratio) + geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) + geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) + ggforce::geom_mark_ellipse(aes(group = as.factor(Site)), fill = 'grey', con.colour = 'grey') + geom_point(size = 7, alpha = 0.7, aes(color = Condition)) + geom_text( family = "GillSans", mapping = aes(label = Site), size = 2.5) + labs(caption = '', color = "") + xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) + ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) + theme_minimal(base_family = "GillSans", base_size = 20) + theme(plot.title = element_text(hjust = 0.5), legend.position = 'top') # EXPLORATORY DATA ==== summary(res) ######################################################################### ########### Osc: Por sitio, BLA x LOL ######################################## # out of 126387 with nonzero total read count #adjusted p-value < 0.1 #LFC > 0 (up) : 775, 0.61% #LFC < 0 (down) : 504, 0.4% #outliers [1] : 520, 0.41% #low counts [2] : 46531, 37% ######################################################################## ######################################################################## hist(res$pvalue) res <- res %>% as_tibble(rownames = "transcript") # ========================= Pruebas HeatMap ======================================== library(DESeq2) library(ggplot2) library(ComplexHeatmap) #library(org.Hs.eg.db) #library("org.Hs.eg.db") paquete de Bioconductor de anotacion de genes del genoma de Homo Sapiens significant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) | (log2FoldChange < 2 & padj < 0.05)) sigs.df <- as.data.frame(significant) row.names(sigs.df) <- sigs.df$transcript #para colocar los nombres de Transcript en rownamesrow.names=TRUE #sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL") ################################################################################# ###Silencie las siguientes dos lineas porque filtraban mis datos, reduciendolos a 5 #sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ] #sig <- sigs.df[, -1] ################################################################################ mat <- counts(dds, normalized =T)[rownames(sigs.df), ] mat.z <- t(apply(mat, 1, scale)) colnames(mat.z) <- rownames(colData) Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript) #write.table(mat.z, file = "transcritos_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE) write.csv(sigs.df, file = "DEG_BLA.csv") ######## # # # :) # # # ######## #=============== Seguimiento con Script de Ricardo ========================= sigfc <- "Sign and FC";pv <- "Sign";fc <- "FC" colors_fc <- c("red2", "#4169E1", "forestgreen", "grey30") colors_fc <- structure(colors_fc, names = c(sigfc, pv, fc, "NS")) res$cc <- 'NS' res[which(abs(res$log2FoldChange) > 2), 'cc'] <- fc res[which(abs(res$padj) <= 0.05), 'cc'] <- pv res[which(res$padj <= 0.05 & abs(res$log2FoldChange) > 2), 'cc'] <- sigfc p <- res %>% ggplot(aes(y = -log10(pvalue), x = log2FoldChange, color = cc)) + geom_point() p <- p + scale_color_manual(name = "", values = colors_fc) + labs(x= expression(Log[2] ~ "Fold Change"), y = expression(-Log[10] ~ "P")) + theme_bw(base_family = "GillSans", base_size = 18) + theme(legend.position = "top") + geom_abline(slope = 0, intercept = -log10(0.05), linetype="dashed", alpha=0.5) + geom_vline(xintercept = 0, linetype="dashed", alpha=0.5) p # POSITIVE LFC == UP EXPRESSED IN CANCER # NEGATIVE LFC == UP EXPRESSED IN CONTROL res %>% mutate(g = sign(log2FoldChange)) %>% dplyr::count(cc, g) %>% mutate(g = ifelse(g == "1", "Osc", "Ctrl")) %>% filter(cc != 'NS') %>% group_by(g) %>% mutate(pct = n / sum(n)) %>% ggplot() + geom_col(aes(x = g, y = pct, fill = cc), width = 0.5) + scale_fill_manual(name = "", values = colors_fc[-4]) + theme_bw(base_family = "GillSans", base_size = 18) + labs(y = '% Transcripts', x = '') + theme(legend.position = 'top') + coord_flip() # Annotation ==== query.ids <- res %>% filter(cc == "Sign and FC") %>% pull(transcript) query.ids <- res$transcript url <- "https://raw.githubusercontent.com/RJEGR/Small-RNASeq-data-analysis/master/FUNCTIONS.R" source(url) annot_path <- paste0(path, "/anotacion/") annot_f <- list.files(path = annot_path, pattern = "Trinotate_report.tsv", full.names = T) annot <- read_tsv(annot_f, na = ".") %>% filter(transcript_id %in% query.ids) blastp_df <- split_blast(annot, hit = "BLASTP") %>% group_by(transcript) %>% arrange(identity) %>% sample_n(1) %>% ungroup() blastp_df %>% dplyr::count(genus, sort = T) tex_dat <- res %>% left_join(blastp_df) %>% drop_na(name) p + ggrepel::geom_text_repel(data = tex_dat, aes(label = name), min.segment.length = 0) #############aqui salio un volcanoplot########### ##############ENRIQUECIMIENTO FUNCIONAL##################### ########### # Functional enrichment ==== .bioc_packages <- c('biomaRt','GOSemSim') # DESeq2' .inst <- .bioc_packages %in% installed.packages() #if(any(!.inst)) { # if (!require("BiocManager", quietly = TRUE)) #install.packages("BiocManager") BiocManager::install(.bioc_packages[!.inst], ask = F) #} sapply(c(.bioc_packages), require, character.only = TRUE) # Local functions url <- "https://raw.githubusercontent.com/RJEGR/Cancer_sete_T_assembly/main/functions.R" source(url) go_df <- split_gene_ontology(annot, hit = "BLASTP") table(go_df$ontology) gene2GO <- split(go_df$go, go_df$transcript) ########yo agregué esta líneas para que agarrara a GOSemSim & org if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GOSemSim") #install.packages("GOSemSim") library(GOSemSim) if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("org.Hs.eg.db") library(org.Hs.eg.db) ############################ aqui sigue el script hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="BP") #hsGO <- read_rds(paste0(annot_path, '/Trinotate_GO_annotations.txt')) library(readr) # Assuming the file is a TSV file (tab-separated values)/read_rds es formato binario hsGO <- read_tsv(paste0(annot_path, '/Trinotate_GO_annotations.txt')) res_up <- res %>% filter(log2FoldChange >=2 & padj <0.05) #filtro para los sobre-expresados res <- res %>% drop_na(padj) res_down <- res %>% filter(log2FoldChange <2 & padj <0.05) #filtro para los sub-expresados res_up %>% pull(padj) -> query.p res_up %>% pull(transcript) -> query.names #topGO es requerida, por lo que lo llamare por Biocmanager BiocManager::install("topGO") allRes_up <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20) allRes_up %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) #AQUI me dio un grafico de enriquecimiento funcional con los procesos UP res_down %>% pull(padj) -> query.p res_down %>% pull(transcript) -> query.names allRes_down <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20) #null para devolver todos allRes_down %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) #write.table(res_down, file = "down_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE) #write.csv(sigs.df, file= "matriz.csv", row.names= TRUE ) ``` ### Diagrama de Venn Una vez que se tienen los contrastes, se puede realizar un diagrama de Venn para ver cuales y cuantos son los genes que se comparten entre los contrastes. El siguientre script se trabaja en RStudio. > Archivos de entrada: > - [ ] Script R Basales.R > - [ ] Archivos DGE de cada contraste, DEG_Osc.csv ``` ####----VENN Diagram-----### #----Annotation-----# library(dplyr) # Set the working directory to the folder containing the CSV file setwd("D:/OneDrive/Documentos/Tesis_Giselle/DEseq2") # Read the CSV file into a data frame df1 <- read.csv("DEG_Basales.csv") ##Bas_BLA vs Bas_LOL* df2 <- read.csv("DEG_Osc.csv") ##Osc_BLA vs Osc_LOL* df3 <- read.csv("DEG_BLA.csv") ##Bas_BLA* vs Osc_BLA df4 <- read.csv("DEG_LOL.csv") ##Bas_LOL* vs Osc_LOL #* es el grupo ctrl # ##call de packages #install.packages("VennDiagram") library(VennDiagram) ###Extract unique values library(VennDiagram) # Create character vectors for each category transcript_Bas <- unique(df1$transcript) transcript_Osc <- unique(df2$transcript) transcript_BLA <- unique(df3$transcript) transcript_LOL <- unique(df4$transcript) # Create a list of character vectors venn_list <- list( Bas = transcript_Bas, Osc = transcript_Osc, BLA = transcript_BLA, LOL = transcript_LOL ) # Define colors for each category category_colors <- c("red", "blue", "green", "purple") # Create the Venn diagram venn.plot <- venn.diagram( x = venn_list, category.names = c("Bas (BLA vs LOL*)", "Osc (BLA vs LOL*)", "BLA (Bas* vs Osc)", "LOL (Bas* vs Osc)"), filename = NULL, output = TRUE, fill = category_colors ) # Display the Venn diagram grid.draw(venn.plot) # Extraer los tres genes compartidos entre los contrastes genes_compartidos <- Reduce(intersect, venn_list) # Mostrar los genes compartidos print(genes_compartidos) ####[1] "TRINITY_DN13726_c0_g1_i15", es el gen Ral GTPase-activating protein subunit beta # "TRINITY_DN2478_c1_g1_i14" Src kinase-associated phosphoprotein 2-A #[3] "TRINITY_DN3091_c0_g1_i17" gen sin identificar # Calcular la intersección entre las categorías "Osc" y "Bas" interseccion_osc_bas <- intersect(venn_list$Osc, venn_list$Bas) # Crear una tabla con la intersección y sus conteos tabla_interseccion <- data.frame( Gen = interseccion_osc_bas, Count_Osc = sum(venn_list$Osc %in% interseccion_osc_bas), Count_Bas = sum(venn_list$Bas %in% interseccion_osc_bas) ) # Mostrar la tabla print(tabla_interseccion) ``` > Archivos de salida: > - [ ] Diagrama de venn (guardar manualmente) --- Debido a que mi bitácora supera el número de caracteres de HackMD, a partir de la Anotación funcional esta disponible aquí: https://hackmd.io/@3Hs8iYAxRfK3NnYlJdTdyQ/BJmmIAfuC