# NF-core ChIP-seq
## Overview
Using tutorial data with ChIP-seq nf-core workflow.
### Goal
Run with subset data and determine outputs. Ideally we'd like heatmaps of ChIP-seq data.
### Required
- [x] Find subset of test data
- [x] Students to download their own files
- [x] Peak Heatmaps (they're created with deeptools using bigwig files)
- Can also use [ChIPseeker R package](https://bioconductor.org/packages/release/bioc/vignettes/ChIPseeker/inst/doc/ChIPseeker.html), but I think this uses bed files.
### Steps
1. Create list of SRA ID's --> `sra_list.txt`
2. Create download SRA script --> `download_sra.slurm`
3. Download genome
4. Create config file --> `params.config`
5. Download blacklisted regions
6. Create `samplesheet.csv`
7. Create nf-core script --> `chipseq.slurm`
8. Download output
9. <font color=red>Create a heatmap in R</font>
## Download the Data
### About the Data
The Full Dataset consists of FoxA1 (transcription factor) and EZH2 (histone,mark) IP experiments from Franco et al. 2015 ([GEO: GSE59530](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59530), [PMID: 25752574](https://pubmed.ncbi.nlm.nih.gov/25752574/)) and Popovic et al. 2014 ([GEO: GSE57632](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57632), [PMID: 25188243](https://pubmed.ncbi.nlm.nih.gov/25188243/)), respectively
### Minimal Dataset SRA ID List (sra_ids.txt)
The minimal dataset seems to stem from a different experiment and is also used in the atacseq workflow.
```
SRR1822153
SRR1822154
SRR1822157
SRR1822158
SRR5204809
SRR5204810
```
### Full Dataset SRA ID List (sra_ids_full.txt)
```
SRR1635435
SRR1635436
SRR1635437
SRR1635438
SRR1635459
SRR1635460
SRR1635461
SRR1635462
SRR1285070
SRR1285071
SRR1285072
SRR1285073
SRR1285074
SRR1285075
SRR1285076
SRR1285077
```
### sra-toolkit to Download Files
In order to download files based on their SRA ids, we'd need to use NCBI's sra-toolkit [https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump].
Thankfully, Greene has this as a module already.
```
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=4:00:00
#SBATCH --mem=10MB
#SBATCH --job-name=download_sra
#SBATCH --output="%x.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@nyu.edu
#SBATCH --array 1-16 # use 6 for min dataset
module purge
# load sra tools
module load sra-tools/3.1.0
# list of sra ids
sra_ids="sra_ids.txt"
# pulling sra id from each line
sra_id=$(awk NR==${SLURM_ARRAY_TASK_ID} ${sra_ids})
echo "Processing SRA ID: $sra_id"
# downloads the sra
prefetch -v $sra_id
# splits sra to fastq_1 & fastq_2
fastq-dump --split-files ${sra_id}/${sra_id}.sra --outdir fastqs
# remove SRA folders
rm -rf ${sra_id}
# need to gzip fastq files to run with chipseq
gzip fastqs/${sra_id}_*.fastq
```
## Genome
The hg19 is downloaded from ucsc.edu as this was what was initially used with the test data.
I tried using GRCh38, but this did not provide any mapped reads.
I also tried to use the GRCh37 genome from [ENSEMBL](https://ftp.ensembl.org/pub/grch37/release-111/fasta/), but the blacklist BED files were not compatible with this genome. I *could* modify the blacklisted BED regions, but I just wanted to see this workflow function.
### Download Genome
```
mkdir genome
cd genome
wget -L https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
wget - https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/genes/hg19.refGene.gtf.gz
```
## Sample Sheets
### Minimal Dataset Samplesheet
```
sample,fastq_1,fastq_2,antibody,control
SPT5_T0_REP1,/scratch/kk4764/chipseq/fastqs/SRR1822153_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822153_2.fastq.gz,SPT5,SPT5_INPUT_REP1
SPT5_T0_REP2,/scratch/kk4764/chipseq/fastqs/SRR1822154_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822154_2.fastq.gz,SPT5,SPT5_INPUT_REP2
SPT5_T15_REP1,/scratch/kk4764/chipseq/fastqs/SRR1822157_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822157_2.fastq.gz,SPT5,SPT5_INPUT_REP1
SPT5_T15_REP2,/scratch/kk4764/chipseq/fastqs/SRR1822158_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822158_2.fastq.gz,SPT5,SPT5_INPUT_REP2
SPT5_INPUT_REP1,/scratch/kk4764/chipseq/fastqs/SRR5204809_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR5204809_2.fastq.gz,,
SPT5_INPUT_REP2,/scratch/kk4764/chipseq/fastqs/SRR5204810_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR5204810_2.fastq.gz,,
```
### Full Dataset Samplesheet
```
sample,fastq_1,fastq_2,antibody,control
FOXA1_IP_VEH_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635459_1.fastq.gz,,FOXA1,INPUT_VEH_REP1
FOXA1_IP_VEH_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635460_1.fastq.gz,,FOXA1,INPUT_VEH_REP2
FOXA1_IP_E2_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635461_1.fastq.gz,,FOXA1,INPUT_E2_REP1
FOXA1_IP_E2_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635462_1.fastq.gz,,FOXA1,INPUT_E2_REP2
INPUT_VEH_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635435_1.fastq.gz,,,
INPUT_VEH_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635436_1.fastq.gz,,,
INPUT_E2_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635437_1.fastq.gz,,,
INPUT_E2_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635438_1.fastq.gz,,,
EZH2_IP_NTKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285070_1.fastq.gz,,EZH2,INPUT_NTKO_REP1
EZH2_IP_NTKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285071_1.fastq.gz,,EZH2,INPUT_NTKO_REP2
EZH2_IP_TKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285072_1.fastq.gz,,EZH2,INPUT_TKO_REP1
EZH2_IP_TKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285073_1.fastq.gz,,EZH2,INPUT_TKO_REP2
INPUT_NTKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285074_1.fastq.gz,,,
INPUT_NTKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285075_1.fastq.gz,,,
INPUT_TKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285076_1.fastq.gz,,,
INPUT_TKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285077_1.fastq.gz,,,
```
## Config File (params.config)
```
params {
config_profile_description = 'NYU NGS Analysis - CHiP-Seq'
// limit resources
//max_memory = 40.GB
//max_cpus = 2
max_time = 1.d
// main options
genome= "hg19"
blacklist= "/home/kk4764/.nextflow/assets/nf-core/chipseq/assets/blacklists/v2.0/hg19-blacklist.v2.bed"
read_length= 50
}
```
## NF Command
```
module load nextflow/23.04.1
nextflow run nf-core/chipseq \
-profile nyu_hpc \
--input samplesheet.csv \
--outdir res \
--fasta /scratch/kk4764/chipseq/genome/hg19.fa.gz \
--gtf /scratch/kk4764/chipseq/genome/hg19.refGene.gtf.gz \
-c params.config
```
# Results
It works.
Minimal dataset run at 2.5 hours, but broke at multiqc.
The full dataset took 5.5 hours, but this did not have any limits to the resources. Results from the full dataset were similar to what was tested from v1.2.0 of nf-core/chipseq
## Minimal Dataset

## Full Dataset


## NF-core Chipseq Results v1.2.0
This was generated using the full dataset
https://nf-co.re/chipseq/1.2.0/results/chipseq/results-048fd6854fcc85b355c61dfc2e21da0bcc6399ea
For comparison.

---
# Notes
## Mar 16, 2024
All of these are using the [Minimal Dataset](#Minimal-Dataset-SRA-ID-List-sra_idstxt) to test.
### :red_circle: Run: 44225518
apparently we need to specify `--read_length` & `--macs_gsize`
```
ERROR ~ Both '--read_length' and '--macs_gsize' not specified! Please specify either to infer MACS2 genome size for peak calling.
```
### :red_circle: Run: 44225553
Was able to run, but the filepaths need to be specific
### :red_circle: Run: 44225559
```
ERROR: Please check samplesheet header -> group,replicate,fastq_1,fastq_2,antibody,control != sample,fastq_1,fastq_2,antibody,control
```
### :red_circle: Run: 44225591
```
ERROR: Please check samplesheet -> FastQ file does not have extension '.fastq.gz' or '.fq.gz'!
Line: 'SPT5_T0_1,/scratch/kk4764/chipseq/fastqs/SRR1822153_1.fastq,/scratch/kk4764/chipseq/fastqs/SRR1822153_2.fastq,SPT5,SPT5_INPUT'
```
### :red_circle: Run: 44227270
```
ERROR: Please check samplesheet -> Control identifier has to match does a provided sample identifier!
Control: 'SPT5_INPUT'
```
### :red_circle: Run: 44227696
```
Command error:
WARNING: Skipping mount /opt/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
Chromosome "chr1" undefined in Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.sizes
```
## March 17, 2024
After adjusting many small things with the sample sheets and pathways ...
### :red_circle: Run: 44238240
There are stagings of files that are taking time.
After 2.5 hrs...
```
NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS (SPT5_T0_1)
```
```
Command error:
[E::idx_find_and_load] Could not retrieve index file for 'SPT5_T0_1.mLb.clN.name.sorted.bam'
Traceback (most recent call last):
File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 167, in <module>
bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE, BAMOut=args.BAM_OUTPUT_FILE, onlyFRPairs=args.ONLY_FR_PAIRS)
File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 82, in bampe_rm_orphan
currRead = next(iter)
File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__
StopIteration
```
#### command.sh
```
(base) [kk4764@log-1 chipseq]$ cat /scratch/kk4764/chipseq/work/f1/0dd2e3b81221fa755284e0c8659bd9/.command.sh
#!/bin/bash -euo pipefail
samtools sort -n -@ 6 -o SPT5_T0_1.mLb.clN.name.sorted.bam -T SPT5_T0_1.mLb.clN.name.sorted SPT5_T0_1.mLb.flT.sorted.bam
bampe_rm_orphan.py SPT5_T0_1.mLb.clN.name.sorted.bam SPT5_T0_1.mLb.clN.bam --only_fr_pairs
cat <<-END_VERSIONS > versions.yml
"NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS":
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
```
Looks like the files are not being indexed properly (no `.bai` files being made)
### :red_circle: Run: 44281685
The samplesheet was modified (incase there could have been clerical mistakes) and the script was re-run... The same error occurred.
A similar issue on sarek slack might suggest this is related to poor symlink creation?
After checking the work dir, **the bai files exist!** --> **SYMLINK PROBLEM**
Maybe I should just cancel remove orphans?
```
ERROR ~ Error executing process > 'NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS (SPT5_T15_REP1)'
Caused by:
Process `NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS (SPT5_T15_REP1)` terminated with an error exit status (1)
Command executed:
samtools sort -n -@ 6 -o SPT5_T15_REP1.mLb.clN.name.sorted.bam -T SPT5_T15_REP1.mLb.clN.name.sorted SPT5_T15_REP1.mLb.flT.sorted.bam
bampe_rm_orphan.py SPT5_T15_REP1.mLb.clN.name.sorted.bam SPT5_T15_REP1.mLb.clN.bam --only_fr_pairs
cat <<-END_VERSIONS > versions.yml
"NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS":
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
[E::idx_find_and_load] Could not retrieve index file for 'SPT5_T15_REP1.mLb.clN.name.sorted.bam'
Traceback (most recent call last):
File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 167, in <module>
bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE, BAMOut=args.BAM_OUTPUT_FILE, onlyFRPairs=args.ONLY_FR_PAIRS)
File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 82, in bampe_rm_orphan
currRead = next(iter)
File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__
StopIteration
```
### :red_circle: Run: 44292328 (Full Dataset)
Running full dataset to see if the symlink issue may still be a problem
```
ERROR ~ Error executing process > 'NFCORE_CHIPSEQ:CHIPSEQ:PHANTOMPEAKQUALTOOLS (FOXA1_IP_VEH_REP2)'
Caused by:
Process `NFCORE_CHIPSEQ:CHIPSEQ:PHANTOMPEAKQUALTOOLS (FOXA1_IP_VEH_REP2)` terminated with an error exit status (1)
Command executed:
RUN_SPP=`which run_spp.R`
Rscript -e "library(caTools); source(\"$RUN_SPP\")" -c="FOXA1_IP_VEH_REP2.mLb.clN.sorted.bam" -savp="FOXA1_IP_VEH_REP2.spp.pdf" -savd="FOXA1_IP_VEH_REP2.spp.Rdata" -out="FOXA1_IP_VEH_REP2.spp.out" -p=6
cat <<-END_VERSIONS > versions.yml
"NFCORE_CHIPSEQ:CHIPSEQ:PHANTOMPEAKQUALTOOLS":
phantompeakqualtools: 1.2.2
END_VERSIONS
Command exit status:
1
Command output:
################
ChIP data: FOXA1_IP_VEH_REP2.mLb.clN.sorted.bam
Control data: NA
strandshift(min): -500
strandshift(step): 5
strandshift(max) 1500
user-defined peak shift NA
exclusion(min): 10
exclusion(max): NaN
num parallel nodes: 6
FDR threshold: 0.01
NumPeaks Threshold: NA
Output Directory: .
narrowPeak output file name: NA
regionPeak output file name: NA
Rdata filename: FOXA1_IP_VEH_REP2.spp.Rdata
plot pdf filename: FOXA1_IP_VEH_REP2.spp.pdf
result filename: FOXA1_IP_VEH_REP2.spp.out
Overwrite files?: FALSE
Reading ChIP tagAlign/BAM file FOXA1_IP_VEH_REP2.mLb.clN.sorted.bam
opened /state/partition1/job-44302130/RtmpLP9ruZ/FOXA1_IP_VEH_REP2.mLb.clN.sorted.tagAlign297cadfb11
done. read 0 fragments
Command error:
WARNING: Skipping mount /opt/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
Loading required package: Rcpp
Error in read.table(bam2align.filename, nrows = 500) :
no lines available in input
Calls: source ... withVisible -> eval -> eval -> read.align -> read.table
Execution halted
```
Full dataset runs without symlink errors, but there's an error when running the Rscript. Looks like there's zero read frags. This issue occurs in [run_spp.R](https://github.com/kundajelab/phantompeakqualtools/blob/master/run_spp.R) in phantompeakqualtools
1. Check the file itself
<font color='red'>Most files were empty</font>
2. Is the genome correct?
<font color='red'>I was using GRCh38. Checking the test, seems like GRCh37 was used (hg19). This *shouldn't* be the issue since GRCh38 is more up-to-date...</font>
### :red_circle: Run: 44297612 (Minimal Dataset)
Removing `cleanup = True` from `params.config` to see if this would address the symlink issue...? Probably not, the bai files exist...
```
Command error:
[E::idx_find_and_load] Could not retrieve index file for 'SPT5_T0_REP1.mLb.clN.name.sorted.bam'
Traceback (most recent call last):
File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 167, in <module>
bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE, BAMOut=args.BAM_OUTPUT_FILE, onlyFRPairs=args.ONLY_FR_PAIRS)
File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 82, in bampe_rm_orphan
currRead = next(iter)
File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__
StopIteration
```
## March 18, 2024
Changed the reference genome to GRCh37. Looking to see if this resolves issues seen in both the Full and Minimal dataset.
### :grey_question: Run: 44354409 (Minimal Dataset)
*ALMOST* succeeded
```
ERROR ~ Error executing process > 'NFCORE_CHIPSEQ:CHIPSEQ:MULTIQC (1)'
Caused by:
Process `NFCORE_CHIPSEQ:CHIPSEQ:MULTIQC (1)` terminated with an error exit status (1)
Command executed:
multiqc \
-f \
\
\
.
cat <<-END_VERSIONS > versions.yml
"NFCORE_CHIPSEQ:CHIPSEQ:MULTIQC":
multiqc: $( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
Command exit status:
1
Command output:
(empty)
```
### :heavy_check_mark: Run: 44354478 (Full Dataset)
```
Duration : 5h 59m 32s
CPU hours : 341.0
Succeeded : 525
```