# **RNA-Seq Workflow**
**Transcriptomic analysis of the lion's paw scallop (*Nodipecten subnodosus*) under oscillating thermal conditions at two localities of the Baja California peninsula**

> [name=Giselle Moreno López]
---
## **I. Data preprocessing**
The first step is to assess the quality of the reads. For this, we use the software *FastQC*. *MultiQC* can then be used to view the metrics of all libraries at once; even so, it is worth inspecting the reports one by one, since MultiQC's aggregated plots can hide problems in individual files.
**1. First quality check in FastQC**
> Input files:
> - [ ] Raw reads in fq format, preferably compressed (the names are usually long, e.g. N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_1.fq.gz)
```
#!/bin/bash
#SBATCH -p cicese
#SBATCH --job-name=fastqc
#SBATCH --output=fastqc-%j.log
#SBATCH --error=fastqc-%j.err
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH -t 6-00:00:00
export PATH=$PATH:/LUSTRE/apps/bioinformatica/FastQC_v0.11.7/
module load gcc-7.2.0
fastqc /LUSTRE/bioinformatica_data/genomica_funcional/Paulina/mano_leon/experimental/Giselle/trimmed/ph30/*.fq.gz -t 8 -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/trimmomatic/trimm2/multiq
```
>
> Output files:
> - [ ] .html files with the metrics of each library
> - [ ] .zip files with the images and metrics
> - [ ] .log and .err files with information about the submitted job
**2. Visualization of the data in MultiQC**
*MultiQC* aggregates the plots of many libraries into a single report, giving a general overview of how the reads look.
> Input files:
> - [ ] .zip files with the images and metrics
```
#!/bin/bash
#SBATCH -p cicese
#SBATCH --job-name=multiqc
#SBATCH --output=mul-%j.log
#SBATCH --error=mul-%j.err
#SBATCH -N 1
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=24
#SBATCH -t 06-00:00:00
export PATH=/LUSTRE/apps/Anaconda/2023/miniconda3/bin:$PATH
source activate qiime2-2023.2
multiqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq/*zip -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq --data-format json --export
```
> Output files:
> - [ ] .html file with the metrics of all libraries
> - [ ] multiqc_data folder with the information for each metric (quality, overrepresented sequences, %GC, etc.) in JSON format
> - [ ] multiqc_plots folder with all the plots in pdf, png and svg
> - [ ] .log and .err files with information about the submitted job
**3. Read trimming in Trimmomatic**
Once the parameters have been chosen, *Trimmomatic* is used to trim indices and adapters and to discard low-quality reads (Phred score < 30). In my case, I included all the adapter files so that every adapter would be removed from my reads.
> Input files:
> - [ ] Raw reads in .fq.gz format
```
#!/bin/bash
#########################################################
# TRIMMOMATIC
## Directives
#SBATCH --job-name=trimming
#SBATCH --output=trimmomatic-%j.log
#SBATCH --error=trimmomatic-%j.err
#SBATCH -N 2
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=24
#SBATCH -t 6-00:00:00
#SBATCH -p cicese
# Path to TRIMMOMATIC
TRIMMOMATIC=/LUSTRE/apps/bioinformatica/Trimmomatic-0.39
# Path to the adapter file. Repeated assignments to the same variable would
# keep only the last file, so all adapter files are concatenated into a
# single FASTA that ILLUMINACLIP can use:
cat $TRIMMOMATIC/adapters/TruSeq3-PE-2.fa \
    $TRIMMOMATIC/adapters/TruSeq3-PE.fa \
    $TRIMMOMATIC/adapters/TruSeq2-PE.fa \
    $TRIMMOMATIC/adapters/NexteraPE-PE.fa \
    $TRIMMOMATIC/adapters/TruSeq2-SE.fa \
    $TRIMMOMATIC/adapters/TruSeq3-SE.fa > all_adapters.fa
trueseq="all_adapters.fa"
cd ${SLURM_SUBMIT_DIR}
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_1_CKDL230014573-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_1_CKDL230014573-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_1.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_2.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_3_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_3_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_3.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_4_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_4_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_4.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_5_CKDL230014574-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_5_CKDL230014574-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_5.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_6_CKDL230014580-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_6_CKDL230014580-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_6.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_7_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_7_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_7.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_8_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_8_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_8.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_9_CKDL230014575-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_9_CKDL230014575-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_9.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_10_2_CKDL230020994-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_10_2_CKDL230020994-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_10.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_11_1_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_11_1_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_11.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_12_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_12_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_12.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36
exit 0
```
> Output files:
> - [ ] .fq.gz files with the new base name (Trimmomatic's -baseout produces paired 1P/2P and unpaired 1U/2U files for each sample)
> - [ ] .log and .err files with information about the submitted job
Once the reads have been trimmed, we can re-run the quality assessment in FastQC and MultiQC.
**FastQC**
> Input files:
> - [ ] Preprocessed reads in .fq.gz format
```
#!/bin/bash
#SBATCH -p cicese
#SBATCH --job-name=fastqc
#SBATCH --output=fastqc-%j.log
#SBATCH --error=fastqc-%j.err
#SBATCH -N 2
#SBATCH --ntasks-per-node=24
#SBATCH -t 6-00:00:00
export PATH=$PATH:/LUSTRE/apps/bioinformatica/FastQC_v0.11.7/
module load gcc-7.2.0
fastqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/*.fq.gz -t 8 -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq
```
> Output files:
> - [ ] .html files with the metrics of each library
> - [ ] .zip files with the images and metrics
> - [ ] .log and .err files with information about the submitted job
**MultiQC**
> Input files:
> - [ ] .zip files with the images and metrics
```
#!/bin/bash
#SBATCH -p cicese
#SBATCH --job-name=multiqc
#SBATCH --output=mul-%j.log
#SBATCH --error=mul-%j.err
#SBATCH -N 4
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=24
#SBATCH -t 06-00:00:00
export PATH=/LUSTRE/apps/Anaconda/2023/miniconda3/bin:$PATH
source activate qiime2-2023.2
multiqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq/*zip -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq --data-format json --export
```
> Output files:
> - [ ] .html file with the metrics of all libraries
> - [ ] multiqc_data folder with the information for each metric (quality, overrepresented sequences, %GC, etc.) in JSON format
> - [ ] multiqc_plots folder with all the plots in pdf, png and svg
> - [ ] .log and .err files with information about the submitted job
---
## **II. *De novo* assembly**
Now that we know our reads are free of adapters, indices and low-quality sequence, we can proceed with the *de novo* assembly, using the software Trinity.
Trinity is a package of programs whose methodology reconstructs transcriptomes from RNA-Seq data, reporting assembled transcripts together with their possible isoforms. The Trinity assembly is based on three main modules: Inchworm, Chrysalis and Butterfly.
**1. *De novo* assembly**
Before assembling, all the reads must be decompressed. The following command can be used:
```
gunzip *.gz
```
Once all the reads are decompressed, concatenate the 1P (forward paired) files into one compressed file and the 2P (reverse paired) files into another, using the following commands:
```
cat *1P.fq | gzip > R1.fastq.gz
cat *2P.fq | gzip > R2.fastq.gz
```
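Before launching Trinity, it is worth confirming that R1 and R2 still contain the same number of reads, since Trinity expects properly paired input. A minimal sketch of that check; the toy files created here are for illustration only, in practice point the commands at the real R1.fastq.gz and R2.fastq.gz:

```shell
# Toy paired files, for illustration only; use the real R1/R2 in practice.
printf '@read1\nACGT\n+\nIIII\n' | gzip > R1.fastq.gz
printf '@read1\nTGCA\n+\nIIII\n' | gzip > R2.fastq.gz
# A FASTQ record spans 4 lines, so number of reads = lines / 4
r1=$(( $(gzip -dc R1.fastq.gz | wc -l) / 4 ))
r2=$(( $(gzip -dc R2.fastq.gz | wc -l) / 4 ))
echo "R1=$r1 R2=$r2"
# Warn early if the files are no longer paired
[ "$r1" -eq "$r2" ] && echo "paired OK" || echo "WARNING: read counts differ" >&2
```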
Once this is done, we can start.
> Input files:
> - [ ] Reads in .fastq.gz format
```
#!/bin/bash
# Directives
#SBATCH --job-name=Tri-denovo # job name
#SBATCH --output=Tri-%j.log # standard output, with job number
#SBATCH --error=Tri-%j.err # standard error
#SBATCH --nodes 2 # number of nodes
#SBATCH --mem=100GB # RAM, ~1 GB per million reads
#SBATCH --ntasks-per-node=24 # tasks per node
#SBATCH -t 6-00:00:00 # maximum run time
#SBATCH -p cicese # queue
# export some software
export PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/
module load conda-2023
module load trinityrnaseq-v2.15.1
# my vars
R1="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamblev3/trimm2/R1.fastq.gz"
R2="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamblev3/trimm2/R2.fastq.gz"
OUTDIR="Trinity_ensamble"
LIBT="RF" # strand-specific library type (note: not passed to Trinity in this run)
# record the start time so the elapsed time brackets the Trinity run
START=`date +%s`
# Run Trinity
Trinity --no_version_check --seqType fq --max_memory 100G \
--no_bowtie \
--CPU 24 --left $R1 --right $R2 \
--normalize_reads \
--output $OUTDIR
#--full_cleanup
END=`date +%s`
RUNTIME=$(($END-$START))
echo "run time -> $RUNTIME s"
```
> Output files:
> - [ ] Trinity_ensamble.Trinity.fasta file
> - [ ] Trinity_ensamble.Trinity.fasta.gene_trans_map file
> - [ ] Trinity_ensamble folder with the information from each module
> - [ ] .log and .err files with information about the submitted job
**Quick assembly evaluation**
Trinity includes a script for a quick report of the assembly metrics (average contig length, number of genes and isoforms, N50, etc.).
> Input files:
> - [ ] Trinity_ensamble.Trinity.fasta file
```
#!/bin/bash
# Directives
#SBATCH --job-name=Trin-est # job name
#SBATCH --output=stats-%j.log # standard output, with job number
#SBATCH --error=stats-%j.err # standard error
#SBATCH --nodes 2 # number of nodes
#SBATCH --mem=100GB #
#SBATCH --ntasks-per-node=24 # tasks per node
#SBATCH -t 6-00:00:00 # maximum run time
#SBATCH -p cicese # queue
#SBATCH --exclusive
cd $SLURM_SUBMIT_DIR
#
export PATH=$PATH:/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/
cd /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble_bueno/ensamblev3
time perl /LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util/TrinityStats.pl Trinity_ensamble.Trinity.fasta
exit 0
```
> Output files:
> - [ ] .log and .err files with information about the submitted job (the assembly metrics are printed to the .log file)
**2. Assembly evaluation**
Transrate is a tool that evaluates the quality of the assembly.
> Input files:
> - [ ] Trinity_ensamble.Trinity.fasta file
> - [ ] R1.fastq.gz and R2.fastq.gz files
```
#!/bin/bash
#SBATCH -p cicese
#SBATCH --job-name=tst-rvm
#SBATCH --output=ruby-%j.log
#SBATCH --error=ruby-%j.err
#SBATCH -N 2
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=24
#SBATCH -t 06-00:00:00
## all trimmed paired forward reads must be concatenated into one file and the reverse reads into another
## run this script inside the TRANSRATE folder
TRANSRATE=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/bin
ENSAMBLE="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/Trinity_ensamble.Trinity.fasta"
R1="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/R1.fastq.gz"
R2="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/R2.fastq.gz"
cd $SLURM_SUBMIT_DIR
export PATH=/LUSTRE/apps/bioinformatica/.local/bin:$PATH
export PATH=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/bin:$PATH
export LD_LIBRARY_PATH=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/lib:$LD_LIBRARY_PATH
$TRANSRATE/transrate \
--assembly $ENSAMBLE \
--left $R1 \
--right $R2 \
--threads 8 \
--output gcontigs
```
> Output files:
> - [ ] gcontigs folder
> > - [ ] Trinity_ensamble.Trinity folder with the good and bad transcript counts
> > - [ ] assemblies.csv file
> - [ ] .log and .err files with information about the submitted job
---
## **III. Alignment and quantification**
Once our assembly has good metrics, we can perform the alignment, which consists of aligning the short reads to the assembled transcriptome and counting how many reads align to each contig. This step can be done with RSEM and Bowtie2, both included in Trinity.
**1. Alignment**
RSEM aligns the reads to the reference transcriptome using Bowtie2.
Before aligning, a samples file listing the reads and treatments must be created, as in the following example:
```
Basal_BLA Basal_BLA_1 N_1_1P.fq N_1_2P.fq
Basal_BLA Basal_BLA_2 N_2_1P.fq N_2_2P.fq
Basal_BLA Basal_BLA_3 N_3_1P.fq N_3_2P.fq
Basal_LOL Basal_LOL_1 N_4_1P.fq N_4_2P.fq
Basal_LOL Basal_LOL_2 N_5_1P.fq N_5_2P.fq
Basal_LOL Basal_LOL_3 N_6_1P.fq N_6_2P.fq
Osc_reg_BLA Osc_reg_BLA_1 N_7_1P.fq N_7_2P.fq
Osc_reg_BLA Osc_reg_BLA_2 N_8_1P.fq N_8_2P.fq
Osc_reg_BLA Osc_reg_BLA_3 N_9_1P.fq N_9_2P.fq
Osc_reg_LOL Osc_reg_LOL_1 N_10_1P.fq N_10_2P.fq
Osc_reg_LOL Osc_reg_LOL_2 N_11_1P.fq N_11_2P.fq
Osc_reg_LOL Osc_reg_LOL_3 N_12_1P.fq N_12_2P.fq
```
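`align_and_estimate_abundance.pl` expects the samples file to be tab-delimited with four columns (condition, replicate, left reads, right reads); a space-delimited file is a common source of errors. A quick sketch of a format check, run here on a one-line toy samples.txt for illustration:

```shell
# Toy samples.txt, for illustration only; check your real file instead.
printf 'Basal_BLA\tBasal_BLA_1\tN_1_1P.fq\tN_1_2P.fq\n' > samples.txt
# Report any line that does not have exactly 4 tab-separated fields
awk -F'\t' 'NF != 4 {print "line " NR " has " NF " fields"; bad = 1} END {exit bad}' samples.txt \
  && echo "samples.txt OK"
```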
> Input files:
> - [ ] samples.txt file
> - [ ] good.Trinity_ensamble.Trinity.fasta file produced by Transrate
```
#!/bin/bash
# Directives
#SBATCH --job-name=Trin-cuantif
#SBATCH --output=cuantif-%j.log
#SBATCH --error=cuantif-%j.err
#SBATCH --nodes 4
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=24
#SBATCH -t 6-00:00:00
#SBATCH --exclusive
# export some software
export PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/RSEM/bin/
module load conda-2023
module load trinityrnaseq-v2.15.1
UTIL=/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util
SAMPLES=/LUSTRE/bioinformatica_data/genomica_funcional/Gis/cuantificacion
cd $SLURM_SUBMIT_DIR
$UTIL/align_and_estimate_abundance.pl \
--seqType fq \
--samples_file samples.txt \
--transcripts good.Trinity_ensamble.Trinity.fasta \
--est_method RSEM \
--aln_method bowtie2 \
--trinity_mode \
--prep_reference \
--output_dir RSEM_outdir \
--thread_count=24
exit 0
```
> Output files:
> - [ ] One folder per condition with the RSEM results (gene and isoform counts)
> - [ ] .log and .err files with information about the submitted job
Once the reads have been aligned and counted, build the matrix with all the counts using the following command (without Slurm):
```
ls */*isoforms.results > isoforms
```
> No input file is needed here; the command itself collects the count files from the folder of each condition or treatment.
Then, the following script is used:
```
#!/bin/bash
# Directives
#SBATCH --job-name=abundance_mat
#SBATCH --output=abundance_mat-%j.log
#SBATCH --error=abundance_mat-%j.err
#SBATCH --nodes 2
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=24
#SBATCH -t 6-00:00:00
#SBATCH --exclusive
# export some software
export PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/
export PATH=$PATH:/LUSTRE/apps/bioinformatica/RSEM/bin/
module load conda-2023
module load trinityrnaseq-v2.15.1
UTIL=/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util
TRANS=/LUSTRE/bioinformatica_data/genomica_funcional/Gis/exp_dif/ensamble_pg
cd $SLURM_SUBMIT_DIR
$UTIL/abundance_estimates_to_matrix.pl \
--est_method RSEM \
--out_prefix count \
--quant_files $TRANS/isoforms \
--gene_trans_map $TRANS/good.Trinity_ensamble.Trinity.fasta.gene_trans_map \
--name_sample_by_basedir
exit 0
```
> Output files:
> - [ ] Gene count matrix
> - [ ] Isoform count matrix (the one we will use for the DESeq2 analysis)
> - [ ] .log and .err files with information about the submitted job
---
## **IV. Differential gene expression analysis**
Once we have the gene and isoform count matrices, we can start the differential gene expression (DGE) analysis, using DESeq2 or edgeR in RStudio.
### **DGE in DESeq2**
The differential expression analysis with DESeq2 involves several steps. DESeq2 models the raw counts, using normalization factors (size factors) to account for differences in library depth. It then estimates gene-wise dispersions and shrinks these estimates to generate more accurate dispersion estimates for modeling the counts. Finally, DESeq2 fits a negative binomial model and performs hypothesis testing using the Wald test or the likelihood-ratio test.
It is important that the count values are un-normalized, since the model needs raw counts to correctly assess measurement precision; DESeq2 corrects for library size internally.
> **Key concepts**
> * log2FoldChange: log2 ratio of expression between the contrasted conditions; e.g. a value of 2 means 4x the expression of the reference level, and negative values indicate lower expression.
> * baseMean (mean expression): mean of the normalized counts of a gene across all samples.
> * adjusted p-value (padj): p-value corrected for multiple testing (Benjamini-Hochberg by default in DESeq2).
> * FDR (false discovery rate): the expected proportion of false positives among the genes called differentially expressed.
> **DESeq2 manual**: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow
### DESeq2
Open RStudio. It is best to keep a dedicated working directory (a folder named DEseq or DGE) that holds the scripts, the matrices and the plots produced by the analysis.
Example of a count matrix:
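As an illustration (the values below are invented), the isoform count matrix produced by `abundance_estimates_to_matrix.pl` has transcript IDs in the first column and one column of raw counts per library:

```
transcript_id            Basal_BLA_1  Basal_BLA_2  Basal_BLA_3  Osc_reg_LOL_1
TRINITY_DN1000_c0_g1_i1          523          498          561            610
TRINITY_DN1001_c0_g2_i1            0           12            7              3
TRINITY_DN1002_c1_g1_i2           87           91          102            140
```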

> Input files:
> - [ ] Script for RStudio (.R)
> - [ ] Isoform count matrix (.tsv)
The following script is run in RStudio by opening the .R file directly.
```
# DIFFERENTIAL EXPRESSION ANALYSIS
# Script modified from Ricardo Gomez-Reyes
# Clear the R session memory ====
rm(list = ls())
if(!is.null(dev.list())) dev.off()
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
# Setup
# options(stringsAsFactors = FALSE)
# Load the libraries we will work with ====
# i.e., take the tools out of our toolbox
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("vsn")
#install.packages("tidyverse")
library(tidyverse)
library(DESeq2)
library(vsn)
.cran_packages <- c('tidyverse', 'DESeq2')
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org')
}
sapply(c(.cran_packages), require, character.only = TRUE)
###############################################################
##############################################################
####### Set the working directory ====
path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2"
count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T)
# Load the files into R ====
#library(readr)
count <- read_tsv(count_f)
colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T))
rownames(colData) <- colData$Library_ID
#colData <- colData[-1] # remove the first column
# Fix the column names ====
colNames <- gsub("_L001.isoforms.results", "", names(count))
#colNames <- sapply(strsplit(colNames, "_"), `[`, 1)#
colnames(count) <- c("transcript_id", colNames[-1])
query.ids <- count$transcript_id
library(dplyr)
count <- count %>% select_if(is.double) %>% as(., "matrix")
rownames(count) <- query.ids
# 1) Filter data by removing low-abundance genes ----
by_count <- 1; by_freq <- 2
keep <- rowSums(count > by_count) >= by_freq
sum(keep) # N transcripts
nrow(count <- count[keep,])
count <- round(count) # round the values
# 2) Format metadata and count matrix ----
row.names(colData) <- colData$Library_ID # put the Library_ID names in the rownames
colData <- colData %>% arrange(match(Library_ID, colnames(count)))
#all(colnames(count) %in% rownames(colData))
#colData <- dplyr::rename(colData, samples = Library_ID)
#rownames(colData) <- colData$Library_ID
all(colnames(count) == colData$Library_ID) # sanity check: every sample must match
#colData <- mutate_if(colData, is.character, as.factor)
colData <- colData %>%
tibble::rownames_to_column(var = "ROW_ID") %>%
mutate(across(-ROW_ID, as.factor)) %>%
column_to_rownames(var = "ROW_ID")
# Use the colData object to build the experimental design
f_col <- "Condition"
names(colData)[names(colData) %in% f_col] <- "Condition"
# If possible, specify the control group as the reference level:
colData <- colData %>% mutate(Condition = relevel(Condition, ref = "Ctrl")) # use the "Ctrl" level as the reference in the statistical comparisons
# PREVALENCE ----
apply(count, 1, function(x) sum(x > 0)) %>% table() # count the elements > 0 in each row
prevelancedf = apply(count, 1, function(x) sum(x > 0)) # number of elements greater than zero in each row of the count matrix, stored in a vector
# libraries
library(dplyr)
library(ggplot2)
data.frame(Prevalence = prevelancedf,
TotalAbundance = rowSums(count)) %>% # mean_se
as_tibble(rownames = "Name") %>%
arrange(desc(TotalAbundance)) -> prevelancedf
prevelancedf %>%
dplyr::count(Prevalence) %>%
ggplot(aes(Prevalence, n)) + geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12) +
scale_y_continuous("Number of transcript", labels = scales::comma)
# BOXPLOT ====
qprobs <- function(x) {
x <- x[x > 1]
quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
}
apply(log2(count+1), 2, qprobs) %>%
t() %>%
as_tibble(rownames = 'Library_ID') %>%
left_join(colData) -> probs_df
probs_df %>%
ggplot(.,
aes(x = Library_ID, ymin = `5%`, lower = `25%`,
middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) +
geom_errorbar(width = 0.3, position = position_dodge(0.6)) +
geom_boxplot(aes(fill = Condition), width = 0.5, stat = 'identity',
position = position_dodge(0.6)) +
labs(y = expression(log[2]~ 'Abundance'), x = '') +
theme_classic(base_family = "GillSans", base_size = 18) +
theme(
legend.position = 'top',
panel.border = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank()) -> ptop
ptop <- ptop + facet_grid(~ Condition, scales = 'free') +
theme(
strip.background = element_rect(fill = 'grey', color = 'white'))
ptop
# BOTTOM ====
apply(count, 2, function(x) sum(x > 0)) -> Total_genes
# How many singletons are there per sample? ----
apply(count, 2, function(x) sum(x > 0)) -> filtered_genes
cbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genes
names(n_genes) <- c('Library_ID','Raw', 'Filt')
n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genes
n_genes %>%
left_join(colData) %>%
mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>%
ggplot() +
geom_col(aes(x = Library_ID, y = Raw, fill = Condition)) +
labs(x = 'Samples', y = 'Total Transcript') +
scale_y_continuous(labels = scales::comma) +
theme_minimal(base_family = "GillSans", base_size = 12) +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1, size = 10)) -> pbottom
pbottom <- pbottom + facet_grid(~ Condition, scales = 'free') +
theme(
strip.background = element_blank(),
strip.text = element_blank())
pbottom
### at this point we have three plots
library(patchwork) # to combine the two plots
ptop / pbottom + patchwork::plot_layout(heights = c(1,1.2))
# DESEQ2 =====
ddsFullCountTable <- DESeqDataSetFromMatrix(
countData = count,
colData = colData,
design = ~ Condition + Site + Condition:Site) ## to model the influence of multiple factors on gene expression; otherwise use just ~ Condition
# (Above): transform or not the count matrix?
# In order to test for differential expression, we operate on raw counts and use discrete distributions ...
# 3) Run DESeq in the following functions order ----
# Negative Binomial GLM
# # For the analysis we need to estimate the effective library size to normalize for.
# Imagine, a gene has the same number of counts in two samples. But the library size
# (total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample.
# You can estimate the size factors from the count data using the function estimateSizeFactors():
dds <- estimateSizeFactors(ddsFullCountTable)
# sizeFactors(dds)
# Next we need to estimate for each condition a function that allows to predict the
# dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples:
dds <- estimateDispersions(dds)
#boxplot(log2(counts(ddsFullCountTable)+0.5))
#boxplot(log2(counts(dds, normalized=TRUE)+0.5))
# Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes:
dds <- nbinomWaldTest(dds, maxit = 200)
# (Above) use Wald test when contrasting a unique pair (ex. control vs cancer)
res <- results(dds)
# d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE)
# d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE)
#
# p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) +
# geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) +
# scale_y_log10() +
# stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) +
# labs(y = "Gene count (Log10)", x = "Treatment") +
# theme_bw(base_family = "GillSans", base_size = 20)
#
#
# p2 <- rbind(d1,d2) %>%
# group_by(Design) %>%
# tally(count) %>%
# ggplot(aes(x=Design, y=n)) +
# geom_col(width = 0.4) +
# labs(y = "Library Size", x = "Treatment") +
# theme_bw(base_family = "GillSans", base_size = 20)
#
# p2 /p1
# normal vs gamma-poisson distribution
# library(bayestestR)
#
# prior <- distribution_normal(nrow(m), mean = 5)
# posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution
# #
# p <- bayesfactor_parameters(posterior, prior,
# direction = "two-sided", null = 0, verbose = FALSE)
# #
# plot(p)
# centrality <- point_estimate(posterior)
#
# plot(centrality)
# Note: also run as a single batch
dds <- DESeq(ddsFullCountTable)
res <- results(dds)
#4) (OPTIONAL) Test transformation for downstream analyses----
#However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex:
vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10
ntr <- DESeq2::normTransform(dds)
DESeq2::plotPCA(ntr, intgroup = "Condition")
DESeq2::plotPCA(vst, intgroup = "Condition")
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("vsn", quietly = TRUE)) BiocManager::install("vsn")
raw_df <- vsn::meanSdPlot(assay(dds), plot = F)
vst_df <- vsn::meanSdPlot(assay(vst), plot = F)
ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F)
# rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"),
# data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"),
# data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>%
# ggplot(aes(px, py, color = col)) +
# labs(x = "Ranks", y = "sd", color = "") +
# geom_line(orientation = NA, position = position_identity(), size = 2) +
# theme_bw(base_family = "GillSans", base_size = 20) +
# theme(legend.position = "top")
# 5) (OPTIONAL) Correlation heatmap ----
sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs')
sample_dist = dist(t(assay(vst)), method='euclidean')
hc_samples = hclust(sample_dist, method='complete')
hc_order <- hc_samples$labels[hc_samples$order]
heatmap(sample_cor, col = cm.colors(12)) # cm.colors(1) would give a single flat color
sample_cor %>%
as_tibble(rownames = 'Library_ID') %>%
pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>%
left_join(colData) %>%
mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_long
library(ggplot2)
library(ggh4x)
sample_cor_long %>%
ggplot(aes(x = Library_ID, y = name, fill = cor)) +
geom_tile(color = 'white', linewidth = 0.2) +
# geom_raster() +
# geom_text(aes(label = cor), color = 'white') +
scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) +
scale_x_discrete(position = 'top') +
ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') +
ggh4x::scale_y_dendrogram(hclust = hc_samples) +
theme_classic(base_size = 12, base_family = "GillSans") +
labs(x = '', y = '') +
theme(axis.text.x = element_text(angle = 50,
hjust = -0.15, vjust = 1)) # -> pheat
# 6) PCA =====
library(DESeq2)
# data <- log2(count + 1) # alternative input: log2 counts (superseded by the vst matrix below)
ncol(data <- assay(vst))
PCA = prcomp(t(data), center = FALSE, scale. = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
sd_ratio <- sqrt(percentVar[2] / percentVar[1])
PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2])
library(ggforce)
PCAdf %>%
mutate(Library_ID = rownames(.)) %>%
left_join(colData) %>%
ggplot(., aes(PC1, PC2)) +
# coord_fixed(ratio = sd_ratio) +
geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) +
geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) +
ggforce::geom_mark_ellipse(aes(group = as.factor(Condition)),
fill = 'grey', con.colour = 'grey') +
geom_point(size = 7, alpha = 0.7, aes(color = Condition)) +
geom_text( family = "GillSans",
mapping = aes(label = Site), size = 2.5) +
labs(caption = '', color = "") +
xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
theme_minimal(base_family = "GillSans", base_size = 20) +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'top')
# EXPLORATORY DATA ====
summary(res)
# out of 127719 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up)     : 143, 0.11%
# LFC < 0 (down)   : 226, 0.18%
# outliers [1]     : 648, 0.51%
# low counts [2]   : 61847, 48%
# (mean count < 4)
hist(res$pvalue)
res <- res %>% as_tibble(rownames = "transcript")
# ========================= HeatMap tests ========================================
library(DESeq2)
library(ggplot2)
library(ComplexHeatmap)
#library(org.Hs.eg.db) # Bioconductor gene-annotation package for the Homo sapiens genome
significant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) |
                                (log2FoldChange < -2 & padj < 0.05)) # note: < -2 selects the down-regulated side
sigs.df <- as.data.frame(significant)
row.names(sigs.df) <- sigs.df$transcript # place the transcript IDs in the rownames
#sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL")
#################################################################################
### The next two lines are commented out because they filtered my data down to only 5 transcripts
#sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ]
#sig <- sigs.df[, -1]
################################################################################
mat <- counts(dds, normalized =T)[rownames(sigs.df), ]
mat.z <- t(apply(mat, 1, scale))
colnames(mat.z) <- rownames(colData)
Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript)
```
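In the heatmap above, `t(apply(mat, 1, scale))` converts each transcript's normalized counts into z-scores across samples (per-row mean 0, sd 1), so the colors show relative rather than absolute expression. A minimal Python sketch of that per-row scaling with made-up counts (the actual pipeline is in R):

```python
import math

def row_zscore(row):
    """Scale one row to z-scores: (x - mean) / sd, using the sample
    standard deviation (n - 1 denominator), matching R's scale()."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))
    return [(x - mean) / sd for x in row]

# Hypothetical normalized counts of one transcript across 4 libraries
counts = [10.0, 20.0, 30.0, 40.0]
z = row_zscore(counts)
print([round(v, 3) for v in z])  # symmetric around 0, mean exactly 0
```

After this transformation every row is on the same scale, which is what lets `Heatmap()` cluster transcripts by expression *pattern* instead of by magnitude.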
Next, a volcano plot is obtained as follows (same script):
> Output files:
> Plots:
> - [ ] Raw counts as a boxplot and a bar plot
> - [ ] Pearson correlation heatmaps
> - [ ] Heatmap of the transcripts expressed in each condition
> - [ ] Volcano plot
> - [ ] In the console, the number and percentage of transcripts (up, down, and outliers)
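The console summary shown earlier (`summary(res)`) counts transcripts whose adjusted p-value falls below the threshold, split by fold-change sign. A small Python sketch with hypothetical rows (the tuple fields mirror DESeq2's `log2FoldChange` and `padj` columns):

```python
# Hypothetical results rows: (log2FoldChange, padj); None = padj not assigned
results = [
    (2.5, 0.01), (-3.1, 0.04), (1.2, 0.5),
    (-0.4, 0.8), (4.0, 0.09), (0.9, None),
]

alpha = 0.1  # the threshold DESeq2's summary() reports by default
tested = [(lfc, p) for lfc, p in results if p is not None]
up = sum(1 for lfc, p in tested if p < alpha and lfc > 0)
down = sum(1 for lfc, p in tested if p < alpha and lfc < 0)

print(f"LFC > 0 (up)   : {up}, {100 * up / len(results):.2f}%")
print(f"LFC < 0 (down) : {down}, {100 * down / len(results):.2f}%")
```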
##### The DGE analysis can also be run by contrasts. In my case I ran it for Basales, Osc, LOL, and BLA; for practical purposes, only the Basales and BLA scripts are shown here.
The following script is run in RStudio.
> Input files:
> - [ ] R script Basales.R
> - [ ] File good.count.isoform.counts.matrix
> - [ ] File metadata_Ns.tsv
> - [ ] Trinotate_report.tsv
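The script below depends on the metadata rows being in the same order as the count-matrix columns (the `arrange(match(...))` step followed by a sanity check). A Python sketch of that alignment, using hypothetical library IDs:

```python
# Hypothetical count-matrix column order and metadata rows
count_cols = ["N2_BLA_Bas", "N5_LOL_Bas", "N1_BLA_Osc"]
metadata = {
    "N1_BLA_Osc": {"Site": "BLA", "Condition": "Osc"},
    "N2_BLA_Bas": {"Site": "BLA", "Condition": "Bas"},
    "N5_LOL_Bas": {"Site": "LOL", "Condition": "Bas"},
}

# Reorder metadata to match the count columns (arrange(match(...)) in R)
ordered = [(lib, metadata[lib]) for lib in count_cols]

# Sanity check: every column has metadata, in the same order
assert all(lib in metadata for lib in count_cols)
assert [lib for lib, _ in ordered] == count_cols
print("columns and metadata aligned")
```

If this check fails, DESeq2 silently pairs counts with the wrong sample annotations, so it is worth verifying before building the dataset object.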
```
# BASALES
# DIFFERENTIAL EXPRESSION ANALYSIS
# Script modified from Ricardo Gomez-Reyes
# Clear the R session ====
rm(list = ls())
if(!is.null(dev.list())) dev.off()
# Configuration
# options(stringsAsFactors = FALSE)
# Load the working libraries ====
# (one-time installs, commented out)
#if (!require("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("DESeq2")
#BiocManager::install("vsn")
#install.packages("tidyverse")
library(DESeq2)
library(vsn)
library(tidyverse)
.cran_packages <- c('tidyverse', 'DESeq2')
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
  install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org')
}
sapply(.cran_packages, require, character.only = TRUE)
###############################################################
##############################################################
####### Set the working directory ====
path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2"
count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T)
# Load the files into R ====
#library(readr)
count <- read_tsv(count_f)
colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T))
rownames(colData) <- colData$Library_ID
#colData <- colData[-1] #remover la primera columna
# Clean up the column names ====
colNames <- gsub("_L001.isoforms.results", "", names(count))
#colNames <- sapply(strsplit(colNames, "_"), `[`, 1)#
colnames(count) <- c("transcript_id", colNames[-1])
query.ids <- count$transcript_id
library(dplyr)
count <- count %>% select_if(is.double) %>% as(., "matrix")
rownames(count) <- query.ids
# 1) Filter data by removing low-abundance genes ----
by_count <- 1; by_freq <- 2
keep <- rowSums(count > by_count) >= by_freq
sum(keep) # N transcripts
nrow(count <- count[keep,])
count <- round(count) # round estimated counts to integers (DESeq2 expects raw integer counts)
# 2) Format metadata and count matrix ----
row.names(colData) <- colData$Library_ID # place Library_ID in the rownames
colData <- colData %>% arrange(match(Library_ID, colnames(count)))
#all(colnames(count) %in% rownames(colData))
#colData <- dplyr::rename(colData, samples = Library_ID)
#rownames(colData) <- colData$Library_ID
all(colnames(count) == colData$Library_ID) # sanity check: must be TRUE
#colData <- mutate_if(colData, is.character, as.factor)
colData <- colData %>%
tibble::rownames_to_column(var = "ROW_ID") %>%
mutate(across(-ROW_ID, as.factor)) %>%
column_to_rownames(var = "ROW_ID")
# Use the colData object to build the experimental design
colData = colData %>% filter(Condition == "Bas") # keep only basal samples; the contrast is between sites
# If possible, set the "control" group as the reference level.
# The design factor here is Site, so the reference level is set on Site (LOL):
colData <- colData %>% mutate(Site = relevel(Site, ref = "LOL"))
keep=colnames(count) %in% colData$Library_ID
count=count[,keep ]
# PREVALENCE ----
apply(count, 1, function(x) sum(x > 0)) %>% table() # tabulate transcripts by prevalence (samples with count > 0)
prevelancedf = apply(count, 1, function(x) sum(x > 0)) # per-transcript number of samples with nonzero counts
# libraries
library(dplyr)
library(ggplot2)
data.frame(Prevalence = prevelancedf,
TotalAbundance = rowSums(count)) %>% # mean_se
as_tibble(rownames = "Name") %>%
arrange(desc(TotalAbundance)) -> prevelancedf
prevelancedf %>%
dplyr::count(Prevalence) %>%
ggplot(aes(Prevalence, n)) + geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12) +
scale_y_continuous("Number of transcript", labels = scales::comma)
# BOXPLOT ====
qprobs <- function(x) {
x <- x[x > 1]
quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
}
apply(log2(count+1), 2, qprobs) %>%
t() %>%
as_tibble(rownames = 'Library_ID') %>%
left_join(colData) -> probs_df
probs_df %>%
ggplot(.,
aes(x = Library_ID, ymin = `5%`, lower = `25%`,
middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) +
geom_errorbar(width = 0.3, position = position_dodge(0.6)) +
geom_boxplot(aes(fill = Site), width = 0.5, stat = 'identity',
position = position_dodge(0.6)) +
labs(y = expression(log[2]~ 'Abundance'), x = '') +
theme_classic(base_family = "GillSans", base_size = 18) +
theme(
legend.position = 'top',
panel.border = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank()) -> ptop
ptop <- ptop + facet_grid(~ Site, scales = 'free') +
theme(
strip.background = element_rect(fill = 'grey', color = 'white'))
ptop
# BOTTOM ====
apply(count, 2, function(x) sum(x > 0)) -> Total_genes
# How many singletons are there per sample? ----
apply(count, 2, function(x) sum(x > 0)) -> filtered_genes
cbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genes
names(n_genes) <- c('Library_ID','Raw', 'Filt')
n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genes
n_genes %>%
left_join(colData) %>%
mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>%
ggplot() +
geom_col(aes(x = Library_ID, y = Raw, fill = Site)) +
labs(x = 'Samples', y = 'Total Transcript') +
scale_y_continuous(labels = scales::comma) +
theme_minimal(base_family = "GillSans", base_size = 12) +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1, size = 10)) -> pbottom
pbottom <- pbottom + facet_grid(~ Site, scales = 'free') +
theme(
strip.background = element_blank(),
strip.text = element_blank())
pbottom
########
# Three plots are produced here:
# the boxplot, the bar plot (above), and the combined figure (below)
########
library(patchwork) # to combine the two plots
ptop / pbottom + patchwork::plot_layout(heights = c(1,1.2))
# DESEQ2 =====
ddsFullCountTable <- DESeqDataSetFromMatrix(
  countData = count,
  colData = colData,
  design = ~ Site) # models the effect of Site on gene expression; use ~ Condition when contrasting conditions instead
# (Above): transform or not the count matrix?
# In order to test for differential expression, we operate on raw counts and use discrete distributions ...
# 3) Run DESeq in the following functions order ----
# Negative Binomial GLM
# # For the analysis we need to estimate the effective library size to normalize for.
# Imagine, a gene has the same number of counts in two samples. But the library size(total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample.
# You can estimate the size factors from the count data using the function estimateSizeFactors():
dds <- estimateSizeFactors(ddsFullCountTable)
# sizeFactors(dds)
# Next we need to estimate for each condition a function that allows to predict the dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples:
dds <- estimateDispersions(dds)
##############
## Here DESeq2 estimates gene-wise dispersions and shrinks them to obtain
## more accurate dispersion estimates for modeling the counts
##############
#boxplot(log2(counts(ddsFullCountTable)+0.5))
#boxplot(log2(counts(dds, normalized=TRUE)+0.5))
# Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes:
dds <- nbinomWaldTest(dds, maxit = 200)
# (Above) use Wald test when contrasting a unique pair (ex. control vs cancer)
res <- results(dds)
# d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE)
# d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE)
#
# p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) +
# geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) +
# scale_y_log10() +
# stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) +
# labs(y = "Gene count (Log10)", x = "Treatment") +
# theme_bw(base_family = "GillSans", base_size = 20)
#
#
# p2 <- rbind(d1,d2) %>%
# group_by(Design) %>%
# tally(count) %>%
# ggplot(aes(x=Design, y=n)) +
# geom_col(width = 0.4) +
# labs(y = "Library Size", x = "Treatment") +
# theme_bw(base_family = "GillSans", base_size = 20)
#
# p2 /p1
# normal vs gamma-poisson distribution
# library(bayestestR)
#
# prior <- distribution_normal(nrow(m), mean = 5)
# posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution
# #
# p <- bayesfactor_parameters(posterior, prior,
# direction = "two-sided", null = 0, verbose = FALSE)
# #
# plot(p)
# centrality <- point_estimate(posterior)
#
# plot(centrality)
# Note: also run as a single batch
dds <- DESeq(ddsFullCountTable)
res <- results(dds)
#4) (OPTIONAL) Test transformation for downstream analyses----
#However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex:
vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10
ntr <- DESeq2::normTransform(dds)
DESeq2::plotPCA(ntr, intgroup = "Site")
DESeq2::plotPCA(vst, intgroup = "Site")
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("vsn")
library(vsn)
raw_df <- vsn::meanSdPlot(assay(dds), plot = F)
vst_df <- vsn::meanSdPlot(assay(vst), plot = F)
ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F)
# rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"),
# data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"),
# data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>%
# ggplot(aes(px, py, color = col)) +
# labs(x = "Ranks", y = "sd", color = "") +
# geom_line(orientation = NA, position = position_identity(), size = 2) +
# theme_bw(base_family = "GillSans", base_size = 20) +
# theme(legend.position = "top")
# 5) (OPTIONAL) Correlation heatmap ----
sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs')
sample_dist = dist(t(assay(vst)), method='euclidean')
hc_samples = hclust(sample_dist, method='complete')
hc_order <- hc_samples$labels[hc_samples$order]
heatmap(sample_cor, col = cm.colors(12)) # cm.colors(1) would give a single flat color
sample_cor %>%
as_tibble(rownames = 'Library_ID') %>%
pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>%
left_join(colData) %>%
mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_long
library(ggplot2)
library(ggh4x)
sample_cor_long %>%
ggplot(aes(x = Library_ID, y = name, fill = cor)) +
geom_tile(color = 'white', linewidth = 0.2) +
# geom_raster() +
# geom_text(aes(label = cor), color = 'white') +
scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) +
scale_x_discrete(position = 'top') +
ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') +
ggh4x::scale_y_dendrogram(hclust = hc_samples) +
theme_classic(base_size = 12, base_family = "GillSans") +
labs(x = '', y = '') +
theme(axis.text.x = element_text(angle = 50,
hjust = -0.15, vjust = 1)) # -> pheat
# 6) PCA =====
library(DESeq2)
# data <- log2(count + 1) # alternative input: log2 counts (superseded by the vst matrix below)
ncol(data <- assay(vst))
PCA = prcomp(t(data), center = FALSE, scale. = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
sd_ratio <- sqrt(percentVar[2] / percentVar[1])
PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2])
library(ggforce)
PCAdf %>%
mutate(Library_ID = rownames(.)) %>%
left_join(colData) %>%
ggplot(., aes(PC1, PC2)) +
# coord_fixed(ratio = sd_ratio) +
geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) +
geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) +
ggforce::geom_mark_ellipse(aes(group = as.factor(Condition)),
fill = 'grey', con.colour = 'grey') +
geom_point(size = 7, alpha = 0.7, aes(color = Condition)) +
geom_text( family = "GillSans",
mapping = aes(label = Site), size = 2.5) +
labs(caption = '', color = "") +
xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
theme_minimal(base_family = "GillSans", base_size = 20) +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'top')
# EXPLORATORY DATA ====
summary(res)
#########################################################################
########### BASALES: by site, BLA vs LOL ########################################
# out of 126956 with nonzero total read count
#adjusted p-value < 0.1
#LFC > 0 (up) : 189, 0.15%
#LFC < 0 (down) : 258, 0.2%
#outliers [1] : 491, 0.39%
#low counts [2] : 44291, 35%
########################################################################
########################################################################
hist(res$pvalue)
res <- res %>% as_tibble(rownames = "transcript")
# ========================= HeatMap tests ========================================
library(DESeq2)
library(ggplot2)
library(ComplexHeatmap)
#library(org.Hs.eg.db) # Bioconductor gene-annotation package for the Homo sapiens genome
significant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) |
                                (log2FoldChange < -2 & padj < 0.05)) # note: < -2 selects the down-regulated side
sigs.df <- as.data.frame(significant)
row.names(sigs.df) <- sigs.df$transcript # place the transcript IDs in the rownames
#sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL")
#################################################################################
### The next two lines are commented out because they filtered my data down to only 5 transcripts
#sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ]
#sig <- sigs.df[, -1]
################################################################################
mat <- counts(dds, normalized =T)[rownames(sigs.df), ]
mat.z <- t(apply(mat, 1, scale))
colnames(mat.z) <- rownames(colData)
Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript)
#write.table(mat.z, file = "transcritos_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)
write.csv(sigs.df, file = "DEG_Basales.csv")
########
# #
# :) #
# #
########
#=============== Continuation of Ricardo's script =========================
sigfc <- "Sign and FC"; pv <- "Sign"; fc <- "FC"
colors_fc <- c("red2", "#4169E1", "forestgreen", "grey30")
colors_fc <- structure(colors_fc, names = c(sigfc, pv, fc, "NS"))
res$cc <- 'NS'
res[which(abs(res$log2FoldChange) > 2), 'cc'] <- fc
res[which(res$padj <= 0.05), 'cc'] <- pv
res[which(res$padj <= 0.05 & abs(res$log2FoldChange) > 2), 'cc'] <- sigfc
p <- res %>%
ggplot(aes(y = -log10(pvalue), x = log2FoldChange, color = cc)) +
geom_point()
p <- p +
scale_color_manual(name = "", values = colors_fc) +
labs(x= expression(Log[2] ~ "Fold Change"), y = expression(-Log[10] ~ "P")) +
theme_bw(base_family = "GillSans", base_size = 18) +
theme(legend.position = "top") +
geom_abline(slope = 0, intercept = -log10(0.05), linetype="dashed", alpha=0.5) +
geom_vline(xintercept = 0, linetype="dashed", alpha=0.5)
p
# POSITIVE LFC == UP-EXPRESSED IN THE OSCILLATING (Osc) GROUP
# NEGATIVE LFC == UP-EXPRESSED IN THE CONTROL (basal) GROUP
res %>% mutate(g = sign(log2FoldChange)) %>%
dplyr::count(cc, g) %>%
mutate(g = ifelse(g == "1", "Osc", "Ctrl")) %>%
filter(cc != 'NS') %>%
group_by(g) %>%
mutate(pct = n / sum(n)) %>%
ggplot() +
geom_col(aes(x = g, y = pct, fill = cc), width = 0.5) +
scale_fill_manual(name = "", values = colors_fc[-4]) +
theme_bw(base_family = "GillSans", base_size = 18) +
labs(y = '% Transcripts', x = '') +
theme(legend.position = 'top') + coord_flip()
# Annotation ====
query.ids <- res %>% filter(cc == "Sign and FC") %>% pull(transcript)
query.ids <- res$transcript # NOTE: overrides the line above, so ALL transcripts are annotated
url <- "https://raw.githubusercontent.com/RJEGR/Small-RNASeq-data-analysis/master/FUNCTIONS.R"
source(url)
annot_path <- paste0(path, "/anotacion/")
annot_f <- list.files(path = annot_path, pattern = "Trinotate_report.tsv", full.names = T)
annot <- read_tsv(annot_f, na = ".") %>% filter(transcript_id %in% query.ids)
blastp_df <- split_blast(annot, hit = "BLASTP") %>%
group_by(transcript) %>%
arrange(identity) %>%
sample_n(1) %>%
ungroup()
blastp_df %>% dplyr::count(genus, sort = T)
tex_dat <- res %>% left_join(blastp_df) %>% drop_na(name)
p + ggrepel::geom_text_repel(data = tex_dat, aes(label = name), min.segment.length = 0)
############# this produces the volcano plot ###########
```
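A note on the normalization used above: `estimateSizeFactors()` implements DESeq2's median-of-ratios method, where each sample's size factor is the median, over genes, of the ratio between its count and that gene's geometric mean across samples. A pure-Python sketch with a toy matrix (sample 2 sequenced at exactly twice the depth of sample 1):

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of per-gene rows, one count per sample.
    Returns DESeq2-style median-of-ratios size factors."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if all(c > 0 for c in row):  # genes containing a zero are skipped
            geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
            for j, c in enumerate(row):
                ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]

# Toy matrix: sample 2 is sample 1 sequenced at twice the depth
counts = [[10, 20], [100, 200], [55, 110]]
sf = size_factors(counts)
print([round(f, 4) for f in sf])
```

For this toy matrix the factors come out near 0.71 and 1.41, i.e. dividing each column by its factor equalizes the sequencing depth without letting a few highly expressed genes dominate (the median makes the estimate robust).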
> Output files:
> - [ ] transcritos_DE.txt (only if the commented write.table line is enabled)
> - [ ] DEG_Basales.csv
If the GO terms of the up- and down-regulated DEGs are required, append the following as the last part of the script:
```
############## FUNCTIONAL ENRICHMENT #####################
###########
# Functional enrichment ====
.bioc_packages <- c('biomaRt','GOSemSim') # DESeq2'
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
  if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(.bioc_packages[!.inst], ask = F)
}
sapply(.bioc_packages, require, character.only = TRUE)
# Local functions
url <- "https://raw.githubusercontent.com/RJEGR/Cancer_sete_T_assembly/main/functions.R"
source(url)
go_df <- split_gene_ontology(annot, hit = "BLASTP")
table(go_df$ontology)
gene2GO <- split(go_df$go, go_df$transcript)
######## I added these lines so that GOSemSim & org.Hs.eg.db are loaded
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GOSemSim")
#install.packages("GOSemSim")
library(GOSemSim)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
############################ the script continues here
#hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="BP") # commented: it is overwritten by the read_tsv() below
#hsGO <- read_rds(paste0(annot_path, '/Trinotate_GO_annotations.txt'))
library(readr)
# The file is a TSV (tab-separated values); read_rds is for the binary format
hsGO <- read_tsv(paste0(annot_path, '/Trinotate_GO_annotations.txt'))
res <- res %>% drop_na(padj)
res_up <- res %>% filter(log2FoldChange >= 2 & padj < 0.05) # filter for the up-regulated transcripts
res_down <- res %>% filter(log2FoldChange <= -2 & padj < 0.05) # filter for the down-regulated transcripts (note the negative threshold)
res_up %>% pull(padj) -> query.p
res_up %>% pull(transcript) -> query.names
# topGO is required, so install it via BiocManager if missing
if (!requireNamespace("topGO", quietly = TRUE)) BiocManager::install("topGO")
allRes_up <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)
allRes_up %>% as_tibble() %>%
arrange(Annotated) %>%
mutate(Term = factor(Term, levels= unique(Term))) %>%
ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks
coord_flip() +
geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12)
# HERE this produced a functional-enrichment plot of the UP processes
res_down %>% pull(padj) -> query.p
res_down %>% pull(transcript) -> query.names
allRes_down <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)
# set Nodes = NULL to return all terms
allRes_down %>% as_tibble() %>%
arrange(Annotated) %>%
mutate(Term = factor(Term, levels= unique(Term))) %>%
ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks
coord_flip() +
geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12)
#write.table(res_down, file = "down_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)
#write.csv(sigs.df, file= "matriz.csv", row.names= TRUE )
```
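Conceptually, `GOenrichment()` (built on topGO) asks, for each GO term, whether the DEG list contains more transcripts annotated with that term than expected by chance; the core test is hypergeometric. A pure-Python sketch with hypothetical numbers (100-transcript universe, 10 carrying the term, 20 DEGs, 5 of them annotated, versus an expectation of 20 × 10/100 = 2):

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): probability of drawing at least k annotated transcripts
    when sampling n DEGs from a universe of N transcripts, K of which
    carry the GO term."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

p = hypergeom_upper_tail(k=5, K=10, n=20, N=100)
print(f"enrichment p-value: {p:.4f}")  # small: 5 observed vs 2 expected
```

topGO additionally decorrelates parent and child terms on the GO graph (its `elim`/`weight` algorithms), so its p-values will not match this raw test exactly; the sketch only shows the underlying question being asked.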
If this step is run (with the commented write lines enabled), the following files are generated:
> - [ ] down_DE.txt and up_DE.txt
> - [ ] matriz.csv
#### BLA script
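In the script below, `relevel(Condition, ref = "Bas")` makes the basal group the reference level, so a positive log2 fold change means higher expression under the oscillating treatment. A quick Python sketch of how the reference choice sets the sign, with made-up normalized mean counts:

```python
import math

def log2_fold_change(mean_treatment, mean_reference):
    """LFC of a gene relative to the chosen reference level."""
    return math.log2(mean_treatment / mean_reference)

bas_mean, osc_mean = 5.0, 20.0  # hypothetical normalized means

# With "Bas" as reference (as in the script): Osc vs Bas
print(log2_fold_change(osc_mean, bas_mean))   # 2.0 -> up in Osc

# Flipping the reference flips the sign of every LFC
print(log2_fold_change(bas_mean, osc_mean))   # -2.0
```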
```
## BLA
# DIFFERENTIAL EXPRESSION ANALYSIS
# Script modified from Ricardo Gomez-Reyes
# Clear the R session ====
rm(list = ls())
if(!is.null(dev.list())) dev.off()
# Configuration
# options(stringsAsFactors = FALSE)
# Load the working libraries ====
# (one-time installs, commented out)
#if (!require("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("DESeq2")
#BiocManager::install("vsn")
#install.packages("tidyverse")
library(DESeq2)
library(vsn)
library(tidyverse)
.cran_packages <- c('tidyverse', 'DESeq2')
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
  install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org')
}
sapply(.cran_packages, require, character.only = TRUE)
###############################################################
##############################################################
####### Set the working directory ====
path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2"
count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T)
# Load the files into R ====
#library(readr)
count <- read_tsv(count_f)
colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T))
rownames(colData) <- colData$Library_ID
#colData <- colData[-1] #remover la primera columna
# Clean up the column names ====
colNames <- gsub("_L001.isoforms.results", "", names(count))
#colNames <- sapply(strsplit(colNames, "_"), `[`, 1)#
colnames(count) <- c("transcript_id", colNames[-1])
query.ids <- count$transcript_id
library(dplyr)
count <- count %>% select_if(is.double) %>% as(., "matrix")
rownames(count) <- query.ids
# 1) Filter data by removing low-abundance genes ----
by_count <- 1; by_freq <- 2
keep <- rowSums(count > by_count) >= by_freq
sum(keep) # N transcripts
nrow(count <- count[keep,])
count <- round(count) # round estimated counts to integers (DESeq2 expects raw integer counts)
# 2) Format metadata and count matrix ----
row.names(colData) <- colData$Library_ID # place Library_ID in the rownames
colData <- colData %>% arrange(match(Library_ID, colnames(count)))
#all(colnames(count) %in% rownames(colData))
#colData <- dplyr::rename(colData, samples = Library_ID)
#rownames(colData) <- colData$Library_ID
all(colnames(count) == colData$Library_ID) # sanity check: must be TRUE
#colData <- mutate_if(colData, is.character, as.factor)
colData <- colData %>%
tibble::rownames_to_column(var = "ROW_ID") %>%
mutate(across(-ROW_ID, as.factor)) %>%
column_to_rownames(var = "ROW_ID")
# Use the colData object to build the experimental design
colData = colData %>% filter(Site == "BLA") # keep only BLA samples; the contrast is between conditions
# If possible, set the "control" group as the reference level.
# "Bas" (basal) is used as the reference level for the statistical comparisons:
colData <- colData %>% mutate(Condition = relevel(Condition, ref = "Bas"))
keep=colnames(count) %in% colData$Library_ID
count=count[,keep ]
# PREVALENCE ----
apply(count, 1, function(x) sum(x > 0)) %>% table() # tabulate transcripts by prevalence (samples with count > 0)
prevelancedf = apply(count, 1, function(x) sum(x > 0)) # per-transcript number of samples with nonzero counts
# libraries
library(dplyr)
library(ggplot2)
data.frame(Prevalence = prevelancedf,
TotalAbundance = rowSums(count)) %>% # mean_se
as_tibble(rownames = "Name") %>%
arrange(desc(TotalAbundance)) -> prevelancedf
prevelancedf %>%
dplyr::count(Prevalence) %>%
ggplot(aes(Prevalence, n)) + geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12) +
scale_y_continuous("Number of transcript", labels = scales::comma)
# BOXPLOT ====
qprobs <- function(x) {
x <- x[x > 1]
quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
}
apply(log2(count+1), 2, qprobs) %>%
t() %>%
as_tibble(rownames = 'Library_ID') %>%
left_join(colData) -> probs_df
probs_df %>%
ggplot(.,
aes(x = Library_ID, ymin = `5%`, lower = `25%`,
middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) +
geom_errorbar(width = 0.3, position = position_dodge(0.6)) +
geom_boxplot(aes(fill = Site), width = 0.5, stat = 'identity',
position = position_dodge(0.6)) +
labs(y = expression(log[2]~ 'Abundance'), x = '') +
theme_classic(base_family = "GillSans", base_size = 18) +
theme(
legend.position = 'top',
panel.border = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank()) -> ptop
ptop <- ptop + facet_grid(~ Condition, scales = 'free') +
theme(
strip.background = element_rect(fill = 'grey', color = 'white'))
ptop
# BOTTOM ====
apply(count, 2, function(x) sum(x > 0)) -> Total_genes
# How many singletons are there per sample? ----
apply(count, 2, function(x) sum(x > 0)) -> filtered_genes
cbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genes
names(n_genes) <- c('Library_ID','Raw', 'Filt')
n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genes
n_genes %>%
left_join(colData) %>%
mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>%
ggplot() +
geom_col(aes(x = Library_ID, y = Raw, fill = Site)) +
labs(x = 'Samples', y = 'Total Transcript') +
scale_y_continuous(labels = scales::comma) +
theme_minimal(base_family = "GillSans", base_size = 12) +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1, size = 10)) -> pbottom
pbottom <- pbottom + facet_grid(~ Site, scales = 'free') +
theme(
strip.background = element_blank(),
strip.text = element_blank())
pbottom
########
# Three plots are produced here:
# the boxplot, the bar plot (above), and the combined figure (below)
########
library(patchwork) # to combine the two plots
ptop / pbottom + patchwork::plot_layout(heights = c(1,1.2))
# DESEQ2 =====
ddsFullCountTable <- DESeqDataSetFromMatrix(
  countData = count,
  colData = colData,
  design = ~ Condition) # models the effect of Condition on gene expression; add more factors to the formula if needed
# (Above): transform or not the count matrix?
# In order to test for differential expression, we operate on raw counts and use discrete distributions ...
# 3) Run DESeq in the following functions order ----
# Negative Binomial GLM
# # For the analysis we need to estimate the effective library size to normalize for.
# Imagine, a gene has the same number of counts in two samples. But the library size(total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample.
# You can estimate the size factors from the count data using the function estimateSizeFactors():
dds <- estimateSizeFactors(ddsFullCountTable)
# sizeFactors(dds)
# Next, for each condition we need to estimate a function that predicts the dispersion (= variance across samples). The core assumption of this method is that the mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples:
dds <- estimateDispersions(dds)
##############
## here DESeq2 estimates gene-wise dispersions and shrinks those estimates to produce more accurate dispersion values for modeling the counts
##############
#boxplot(log2(counts(ddsFullCountTable)+0.5))
#boxplot(log2(counts(dds, normalized=TRUE)+0.5))
# Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes:
dds <- nbinomWaldTest(dds, maxit = 200)
# (Above) use Wald test when contrasting a unique pair (ex. control vs cancer)
res <- results(dds)
# d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE)
# d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE)
#
# p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) +
# geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) +
# scale_y_log10() +
# stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) +
# labs(y = "Gene count (Log10)", x = "Treatment") +
# theme_bw(base_family = "GillSans", base_size = 20)
#
#
# p2 <- rbind(d1,d2) %>%
# group_by(Design) %>%
# tally(count) %>%
# ggplot(aes(x=Design, y=n)) +
# geom_col(width = 0.4) +
# labs(y = "Library Size", x = "Treatment") +
# theme_bw(base_family = "GillSans", base_size = 20)
#
# p2 /p1
# normal vs gamma-poisson distribution
# library(bayestestR)
#
# prior <- distribution_normal(nrow(m), mean = 5)
# posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution
# #
# p <- bayesfactor_parameters(posterior, prior,
# direction = "two-sided", null = 0, verbose = FALSE)
# #
# plot(p)
# centrality <- point_estimate(posterior)
#
# plot(centrality)
# Note: also run as a single batch
dds <- DESeq(ddsFullCountTable)
res <- results(dds)
#4) (OPTIONAL) Test transformation for downstream analyses----
#However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex:
vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10
ntr <- DESeq2::normTransform(dds)
DESeq2::plotPCA(ntr, intgroup = "Site")
DESeq2::plotPCA(vst, intgroup = "Site")
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("vsn")
library(vsn)
raw_df <- vsn::meanSdPlot(assay(dds), plot = F)
vst_df <- vsn::meanSdPlot(assay(vst), plot = F)
ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F)
# rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"),
# data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"),
# data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>%
# ggplot(aes(px, py, color = col)) +
# labs(x = "Ranks", y = "sd", color = "") +
# geom_line(orientation = NA, position = position_identity(), size = 2) +
# theme_bw(base_family = "GillSans", base_size = 20) +
# theme(legend.position = "top")
# 5) (OPTIONAL) Correlation heatmap ----
sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs')
sample_dist = dist(t(assay(vst)), method='euclidean')
hc_samples = hclust(sample_dist, method='complete')
hc_order <- hc_samples$labels[hc_samples$order]
heatmap(sample_cor, col = cm.colors(12)) # cm.colors(1) would give a single color; use a 12-color ramp
sample_cor %>%
as_tibble(rownames = 'Library_ID') %>%
pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>%
left_join(colData) %>%
mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_long
library(ggplot2)
library(ggh4x)
sample_cor_long %>%
ggplot(aes(x = Library_ID, y = name, fill = cor)) +
geom_tile(color = 'white', linewidth = 0.2) +
# geom_raster() +
# geom_text(aes(label = cor), color = 'white') +
scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) +
scale_x_discrete(position = 'top') +
ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') +
ggh4x::scale_y_dendrogram(hclust = hc_samples) +
theme_classic(base_size = 12, base_family = "GillSans") +
labs(x = '', y = '') +
theme(axis.text.x = element_text(angle = 50,
hjust = -0.15, vjust = 1)) # -> pheat
# 6) PCA =====
library(DESeq2)
ncol(data <- log2(count+1)) # simple log2 transform (overwritten by the next line)
ncol(data <- assay(vst))    # use the variance-stabilized matrix for the PCA
PCA = prcomp(t(data), center = FALSE, scale. = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
sd_ratio <- sqrt(percentVar[2] / percentVar[1])
PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2])
library(ggforce)
PCAdf %>%
mutate(Library_ID = rownames(.)) %>%
left_join(colData) %>%
ggplot(., aes(PC1, PC2)) +
# coord_fixed(ratio = sd_ratio) +
geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) +
geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) +
ggforce::geom_mark_ellipse(aes(group = as.factor(Site)),
fill = 'grey', con.colour = 'grey') +
geom_point(size = 7, alpha = 0.7, aes(color = Condition)) +
geom_text( family = "GillSans",
mapping = aes(label = Site), size = 2.5) +
labs(caption = '', color = "") +
xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
theme_minimal(base_family = "GillSans", base_size = 20) +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'top')
# EXPLORATORY DATA ====
summary(res)
#########################################################################
########### Osc: by site, BLA vs LOL ########################################
# out of 126387 with nonzero total read count
#adjusted p-value < 0.1
#LFC > 0 (up) : 775, 0.61%
#LFC < 0 (down) : 504, 0.4%
#outliers [1] : 520, 0.41%
#low counts [2] : 46531, 37%
########################################################################
########################################################################
hist(res$pvalue)
res <- res %>% as_tibble(rownames = "transcript")
# ========================= HeatMap tests ========================================
library(DESeq2)
library(ggplot2)
library(ComplexHeatmap)
#library(org.Hs.eg.db)
#library("org.Hs.eg.db") # Bioconductor gene-annotation package for the Homo sapiens genome
significant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) |
                              (log2FoldChange < -2 & padj < 0.05)) # note -2: with < 2 the filter kept every padj < 0.05 transcript regardless of fold change
sigs.df <- as.data.frame(significant)
row.names(sigs.df) <- sigs.df$transcript # put the transcript IDs in the rownames
#sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL")
#################################################################################
### I commented out the next two lines because they filtered my data down to 5 transcripts
#sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ]
#sig <- sigs.df[, -1]
################################################################################
mat <- counts(dds, normalized =T)[rownames(sigs.df), ]
mat.z <- t(apply(mat, 1, scale))
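# A minimal illustration (my addition): row-wise scale() converts each
# transcript's counts to z-scores (mean 0, sd 1), so the heatmap compares
# expression patterns rather than absolute abundances:
toy_counts <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 2, byrow = TRUE)
t(apply(toy_counts, 1, scale))  # both rows become -1, 0, 1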
colnames(mat.z) <- rownames(colData)
Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript)
#write.table(mat.z, file = "transcritos_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)
write.csv(sigs.df, file = "DEG_BLA.csv")
########
# #
# :) #
# #
########
#=============== Continuing with Ricardo's script =========================
sigfc <- "Sign and FC";pv <- "Sign";fc <- "FC"
colors_fc <- c("red2", "#4169E1", "forestgreen", "grey30")
colors_fc <- structure(colors_fc, names = c(sigfc, pv, fc, "NS"))
res$cc <- 'NS'
res[which(abs(res$log2FoldChange) > 2), 'cc'] <- fc
res[which(res$padj <= 0.05), 'cc'] <- pv # padj is non-negative, no abs() needed
res[which(res$padj <= 0.05 & abs(res$log2FoldChange) > 2), 'cc'] <- sigfc
p <- res %>%
ggplot(aes(y = -log10(pvalue), x = log2FoldChange, color = cc)) +
geom_point()
p <- p +
scale_color_manual(name = "", values = colors_fc) +
labs(x= expression(Log[2] ~ "Fold Change"), y = expression(-Log[10] ~ "P")) +
theme_bw(base_family = "GillSans", base_size = 18) +
theme(legend.position = "top") +
geom_abline(slope = 0, intercept = -log10(0.05), linetype="dashed", alpha=0.5) +
geom_vline(xintercept = 0, linetype="dashed", alpha=0.5)
p
# POSITIVE LFC == UP-EXPRESSED IN Osc
# NEGATIVE LFC == UP-EXPRESSED IN Ctrl
res %>% mutate(g = sign(log2FoldChange)) %>%
dplyr::count(cc, g) %>%
mutate(g = ifelse(g == "1", "Osc", "Ctrl")) %>%
filter(cc != 'NS') %>%
group_by(g) %>%
mutate(pct = n / sum(n)) %>%
ggplot() +
geom_col(aes(x = g, y = pct, fill = cc), width = 0.5) +
scale_fill_manual(name = "", values = colors_fc[-4]) +
theme_bw(base_family = "GillSans", base_size = 18) +
labs(y = '% Transcripts', x = '') +
theme(legend.position = 'top') + coord_flip()
# Annotation ====
query.ids <- res %>% filter(cc == "Sign and FC") %>% pull(transcript)
query.ids <- res$transcript # overrides the line above: annotate ALL transcripts, not only the significant ones
url <- "https://raw.githubusercontent.com/RJEGR/Small-RNASeq-data-analysis/master/FUNCTIONS.R"
source(url)
annot_path <- paste0(path, "/anotacion/")
annot_f <- list.files(path = annot_path, pattern = "Trinotate_report.tsv", full.names = T)
annot <- read_tsv(annot_f, na = ".") %>% filter(transcript_id %in% query.ids)
blastp_df <- split_blast(annot, hit = "BLASTP") %>%
  group_by(transcript) %>%
  arrange(desc(identity)) %>%
  dplyr::slice(1) %>% # keep the best-identity hit per transcript (arrange() + sample_n(1) picked one at random)
  ungroup()
blastp_df %>% dplyr::count(genus, sort = T)
tex_dat <- res %>% left_join(blastp_df) %>% drop_na(name)
p + ggrepel::geom_text_repel(data = tex_dat, aes(label = name), min.segment.length = 0)
############# this produced a volcano plot ###########
############## FUNCTIONAL ENRICHMENT #####################
###########
# Functional enrichment ====
.bioc_packages <- c('biomaRt','GOSemSim') # DESeq2
.inst <- .bioc_packages %in% installed.packages()
if (any(!.inst)) {
  if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  BiocManager::install(.bioc_packages[!.inst], ask = FALSE)
}
sapply(.bioc_packages, require, character.only = TRUE)
# Local functions
url <- "https://raw.githubusercontent.com/RJEGR/Cancer_sete_T_assembly/main/functions.R"
source(url)
go_df <- split_gene_ontology(annot, hit = "BLASTP")
table(go_df$ontology)
gene2GO <- split(go_df$go, go_df$transcript)
######## I added these lines so the script picks up GOSemSim & org.Hs.eg.db
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GOSemSim")
#install.packages("GOSemSim")
library(GOSemSim)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
############################ the script continues here
hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="BP")
#hsGO <- read_rds(paste0(annot_path, '/Trinotate_GO_annotations.txt'))
library(readr)
# Assuming the file is a TSV file (tab-separated values); read_rds expects a binary format
hsGO <- read_tsv(paste0(annot_path, '/Trinotate_GO_annotations.txt'))
res_up <- res %>% filter(log2FoldChange >= 2 & padj < 0.05) # filter for up-regulated transcripts
res <- res %>% drop_na(padj)
res_down <- res %>% filter(log2FoldChange <= -2 & padj < 0.05) # filter for down-regulated transcripts (was < 2, which also kept unchanged transcripts)
res_up %>% pull(padj) -> query.p
res_up %>% pull(transcript) -> query.names
# topGO is required, so I install it via BiocManager
BiocManager::install("topGO")
allRes_up <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)
allRes_up %>% as_tibble() %>%
arrange(Annotated) %>%
mutate(Term = factor(Term, levels= unique(Term))) %>%
ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks
coord_flip() +
geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12)
# HERE I got a functional-enrichment plot of the UP-regulated processes
res_down %>% pull(padj) -> query.p
res_down %>% pull(transcript) -> query.names
allRes_down <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)
# Nodes = NULL returns all of them
allRes_down %>% as_tibble() %>%
arrange(Annotated) %>%
mutate(Term = factor(Term, levels= unique(Term))) %>%
ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks
coord_flip() +
geom_col() +
theme_minimal(base_family = "GillSans", base_size = 12)
#write.table(res_down, file = "down_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)
#write.csv(sigs.df, file= "matriz.csv", row.names= TRUE )
```
### Venn Diagram
Once the contrasts are available, a Venn diagram can be drawn to see which genes, and how many, are shared among the contrasts.
The following script is run in RStudio.
> Input files:
> - [ ] R script Basales.R
> - [ ] DEG files for each contrast, e.g. DEG_Osc.csv
```
####----VENN Diagram-----###
#---- Load DEG tables ----#
library(dplyr)
# Set the working directory to the folder containing the CSV file
setwd("D:/OneDrive/Documentos/Tesis_Giselle/DEseq2")
# Read the CSV file into a data frame
df1 <- read.csv("DEG_Basales.csv") ##Bas_BLA vs Bas_LOL*
df2 <- read.csv("DEG_Osc.csv") ##Osc_BLA vs Osc_LOL*
df3 <- read.csv("DEG_BLA.csv") ##Bas_BLA* vs Osc_BLA
df4 <- read.csv("DEG_LOL.csv") ##Bas_LOL* vs Osc_LOL
#* marks the control group
#
## Load packages
#install.packages("VennDiagram")
library(VennDiagram)
### Extract unique values
# Create character vectors for each category
transcript_Bas <- unique(df1$transcript)
transcript_Osc <- unique(df2$transcript)
transcript_BLA <- unique(df3$transcript)
transcript_LOL <- unique(df4$transcript)
# Create a list of character vectors
venn_list <- list(
Bas = transcript_Bas,
Osc = transcript_Osc,
BLA = transcript_BLA,
LOL = transcript_LOL
)
# Define colors for each category
category_colors <- c("red", "blue", "green", "purple")
# Create the Venn diagram
venn.plot <- venn.diagram(
x = venn_list,
category.names = c("Bas (BLA vs LOL*)", "Osc (BLA vs LOL*)", "BLA (Bas* vs Osc)", "LOL (Bas* vs Osc)"),
filename = NULL,
output = TRUE,
fill = category_colors
)
# Display the Venn diagram
grid.draw(venn.plot)
# Extract the three genes shared across all contrasts
genes_compartidos <- Reduce(intersect, venn_list)
# Show the shared genes
print(genes_compartidos)
#### [1] "TRINITY_DN13726_c0_g1_i15" is the gene Ral GTPase-activating protein subunit beta
####     "TRINITY_DN2478_c1_g1_i14"  Src kinase-associated phosphoprotein 2-A
#### [3] "TRINITY_DN3091_c0_g1_i17"  unidentified gene
# Compute the intersection between the "Osc" and "Bas" categories
interseccion_osc_bas <- intersect(venn_list$Osc, venn_list$Bas)
# Build a table with the intersection and its counts
tabla_interseccion <- data.frame(
Gen = interseccion_osc_bas,
Count_Osc = sum(venn_list$Osc %in% interseccion_osc_bas),
Count_Bas = sum(venn_list$Bas %in% interseccion_osc_bas)
)
# Show the table
print(tabla_interseccion)
```
> Output files:
> - [ ] Venn diagram (save manually)
---
Because my notebook exceeds HackMD's character limit, everything from the functional annotation onward is available here: https://hackmd.io/@3Hs8iYAxRfK3NnYlJdTdyQ/BJmmIAfuC