NF-core ChIP-seq

# NF-core ChIP-seq ## Overview Using tutorial data with ChIP-seq nf-core workflow. ### Goal Run with subset data and determine outputs. Ideally we'd like heatmaps of ChIP-seq data. ### Required - [x] Find subset of test data - [x] Students to download their own files - [x] Peak Heatmaps (they're created with deeptools using bigwig files) - Can also use [ChIPseeker R package](https://bioconductor.org/packages/release/bioc/vignettes/ChIPseeker/inst/doc/ChIPseeker.html), but I think this uses bed files. ### Steps 1. Create list of SRA ID's --> `sra_list.txt` 2. Create download SRA script --> `download_sra.slurm` 3. Download genome 4. Create config file --> `params.config` 5. Download blacklisted regions 6. Create `samplesheet.csv` 7. Create nf-core script --> `chipseq.slurm` 8. Download output 9. Create a heatmap in R ## Download the Data ### About the Data The Full Dataset consists of FoxA1 (transcription factor) and EZH2 (histone,mark) IP experiments from Franco et al. 2015 ([GEO: GSE59530](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59530), [PMID: 25752574](https://pubmed.ncbi.nlm.nih.gov/25752574/)) and Popovic et al. 2014 ([GEO: GSE57632](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57632), [PMID: 25188243](https://pubmed.ncbi.nlm.nih.gov/25188243/)), respectively ### Minimal Dataset SRA ID List (sra_ids.txt) The minimal dataset seems to stem from a different experiment and is also used in the atacseq workflow. ``` SRR1822153 SRR1822154 SRR1822157 SRR1822158 SRR5204809 SRR5204810 ``` ### Full Dataset SRA ID List (sra_ids_full.txt) ``` SRR1635435 SRR1635436 SRR1635437 SRR1635438 SRR1635459 SRR1635460 SRR1635461 SRR1635462 SRR1285070 SRR1285071 SRR1285072 SRR1285073 SRR1285074 SRR1285075 SRR1285076 SRR1285077 ``` ### sra-toolkit to Download Files In order to download files based on their SRA ids, we'd need to use NCBI's sra-toolkit [https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump]. Thankfully, Greene has this as a module already. ``` #!/bin/bash # #SBATCH --nodes=1 #SBATCH --tasks-per-node=1 #SBATCH --cpus-per-task=1 #SBATCH --time=4:00:00 #SBATCH --mem=10MB #SBATCH --job-name=download_sra #SBATCH --output="%x.out" #SBATCH --mail-type=ALL #SBATCH --mail-user=username@nyu.edu #SBATCH --array 1-16 # use 6 for min dataset module purge # load sra tools module load sra-tools/3.1.0 # list of sra ids sra_ids="sra_ids.txt" # pulling sra id from each line sra_id=$(awk NR==${SLURM_ARRAY_TASK_ID} ${sra_ids}) echo "Processing SRA ID: $sra_id" # downloads the sra prefetch -v $sra_id # splits sra to fastq_1 & fastq_2 fastq-dump --split-files ${sra_id}/${sra_id}.sra --outdir fastqs # remove SRA folders rm -rf ${sra_id} # need to gzip fastq files to run with chipseq gzip fastqs/${sra_id}_*.fastq ``` ## Genome The hg19 is downloaded from ucsc.edu as this was what was initially used with the test data. I tried using GRCh38, but this did not provide any mapped reads. I also tried to use the GRCh37 genome from [ENSEMBL](https://ftp.ensembl.org/pub/grch37/release-111/fasta/), but the blacklist BED files were not compatible with this genome. I *could* modify the blacklisted BED regions, but I just wanted to see this workflow function. ### Download Genome ``` mkdir genome cd genome wget -L https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz wget - https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/genes/hg19.refGene.gtf.gz ``` ## Sample Sheets ### Minimal Dataset Samplesheet ``` sample,fastq_1,fastq_2,antibody,control SPT5_T0_REP1,/scratch/kk4764/chipseq/fastqs/SRR1822153_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822153_2.fastq.gz,SPT5,SPT5_INPUT_REP1 SPT5_T0_REP2,/scratch/kk4764/chipseq/fastqs/SRR1822154_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822154_2.fastq.gz,SPT5,SPT5_INPUT_REP2 SPT5_T15_REP1,/scratch/kk4764/chipseq/fastqs/SRR1822157_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822157_2.fastq.gz,SPT5,SPT5_INPUT_REP1 SPT5_T15_REP2,/scratch/kk4764/chipseq/fastqs/SRR1822158_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR1822158_2.fastq.gz,SPT5,SPT5_INPUT_REP2 SPT5_INPUT_REP1,/scratch/kk4764/chipseq/fastqs/SRR5204809_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR5204809_2.fastq.gz,, SPT5_INPUT_REP2,/scratch/kk4764/chipseq/fastqs/SRR5204810_1.fastq.gz,/scratch/kk4764/chipseq/fastqs/SRR5204810_2.fastq.gz,, ``` ### Full Dataset Samplesheet ``` sample,fastq_1,fastq_2,antibody,control FOXA1_IP_VEH_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635459_1.fastq.gz,,FOXA1,INPUT_VEH_REP1 FOXA1_IP_VEH_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635460_1.fastq.gz,,FOXA1,INPUT_VEH_REP2 FOXA1_IP_E2_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635461_1.fastq.gz,,FOXA1,INPUT_E2_REP1 FOXA1_IP_E2_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635462_1.fastq.gz,,FOXA1,INPUT_E2_REP2 INPUT_VEH_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635435_1.fastq.gz,,, INPUT_VEH_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635436_1.fastq.gz,,, INPUT_E2_REP1,/scratch/kk4764/chipseq/fastqs/SRR1635437_1.fastq.gz,,, INPUT_E2_REP2,/scratch/kk4764/chipseq/fastqs/SRR1635438_1.fastq.gz,,, EZH2_IP_NTKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285070_1.fastq.gz,,EZH2,INPUT_NTKO_REP1 EZH2_IP_NTKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285071_1.fastq.gz,,EZH2,INPUT_NTKO_REP2 EZH2_IP_TKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285072_1.fastq.gz,,EZH2,INPUT_TKO_REP1 EZH2_IP_TKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285073_1.fastq.gz,,EZH2,INPUT_TKO_REP2 INPUT_NTKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285074_1.fastq.gz,,, INPUT_NTKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285075_1.fastq.gz,,, INPUT_TKO_REP1,/scratch/kk4764/chipseq/fastqs/SRR1285076_1.fastq.gz,,, INPUT_TKO_REP2,/scratch/kk4764/chipseq/fastqs/SRR1285077_1.fastq.gz,,, ``` ## Config File (params.config) ``` params { config_profile_description = 'NYU NGS Analysis - CHiP-Seq' // limit resources //max_memory = 40.GB //max_cpus = 2 max_time = 1.d // main options genome= "hg19" blacklist= "/home/kk4764/.nextflow/assets/nf-core/chipseq/assets/blacklists/v2.0/hg19-blacklist.v2.bed" read_length= 50 } ``` ## NF Command ``` module load nextflow/23.04.1 nextflow run nf-core/chipseq \ -profile nyu_hpc \ --input samplesheet.csv \ --outdir res \ --fasta /scratch/kk4764/chipseq/genome/hg19.fa.gz \ --gtf /scratch/kk4764/chipseq/genome/hg19.refGene.gtf.gz \ -c params.config ``` # Results It works. Minimal dataset run at 2.5 hours, but broke at multiqc. The full dataset took 5.5 hours, but this did not have any limits to the resources. Results from the full dataset were similar to what was tested from v1.2.0 of nf-core/chipseq ## Minimal Dataset ![image](https://hackmd.io/_uploads/Skh-HiwA6.png) ## Full Dataset ![image](https://hackmd.io/_uploads/ByK4NsPA6.png) ![image](https://hackmd.io/_uploads/BJHxwsPCT.png) ## NF-core Chipseq Results v1.2.0 This was generated using the full dataset https://nf-co.re/chipseq/1.2.0/results/chipseq/results-048fd6854fcc85b355c61dfc2e21da0bcc6399ea For comparison. ![image](https://hackmd.io/_uploads/S1Ph4jDAa.png) --- # Notes ## Mar 16, 2024 All of these are using the [Minimal Dataset](#Minimal-Dataset-SRA-ID-List-sra_idstxt) to test. ### :red_circle: Run: 44225518 apparently we need to specify `--read_length` & `--macs_gsize` ``` ERROR ~ Both '--read_length' and '--macs_gsize' not specified! Please specify either to infer MACS2 genome size for peak calling. ``` ### :red_circle: Run: 44225553 Was able to run, but the filepaths need to be specific ### :red_circle: Run: 44225559 ``` ERROR: Please check samplesheet header -> group,replicate,fastq_1,fastq_2,antibody,control != sample,fastq_1,fastq_2,antibody,control ``` ### :red_circle: Run: 44225591 ``` ERROR: Please check samplesheet -> FastQ file does not have extension '.fastq.gz' or '.fq.gz'! Line: 'SPT5_T0_1,/scratch/kk4764/chipseq/fastqs/SRR1822153_1.fastq,/scratch/kk4764/chipseq/fastqs/SRR1822153_2.fastq,SPT5,SPT5_INPUT' ``` ### :red_circle: Run: 44227270 ``` ERROR: Please check samplesheet -> Control identifier has to match does a provided sample identifier! Control: 'SPT5_INPUT' ``` ### :red_circle: Run: 44227696 ``` Command error: WARNING: Skipping mount /opt/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container Chromosome "chr1" undefined in Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.sizes ``` ## March 17, 2024 After adjusting many small things with the sample sheets and pathways ... ### :red_circle: Run: 44238240 There are stagings of files that are taking time. After 2.5 hrs... ``` NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS (SPT5_T0_1) ``` ``` Command error: [E::idx_find_and_load] Could not retrieve index file for 'SPT5_T0_1.mLb.clN.name.sorted.bam' Traceback (most recent call last): File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 167, in <module> bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE, BAMOut=args.BAM_OUTPUT_FILE, onlyFRPairs=args.ONLY_FR_PAIRS) File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 82, in bampe_rm_orphan currRead = next(iter) File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__ StopIteration ``` #### command.sh ``` (base) [kk4764@log-1 chipseq]$ cat /scratch/kk4764/chipseq/work/f1/0dd2e3b81221fa755284e0c8659bd9/.command.sh #!/bin/bash -euo pipefail samtools sort -n -@ 6 -o SPT5_T0_1.mLb.clN.name.sorted.bam -T SPT5_T0_1.mLb.clN.name.sorted SPT5_T0_1.mLb.flT.sorted.bam bampe_rm_orphan.py SPT5_T0_1.mLb.clN.name.sorted.bam SPT5_T0_1.mLb.clN.bam --only_fr_pairs cat <<-END_VERSIONS > versions.yml "NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS": samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//') END_VERSIONS ``` Looks like the files are not being indexed properly (no `.bai` files being made) ### :red_circle: Run: 44281685 The samplesheet was modified (incase there could have been clerical mistakes) and the script was re-run... The same error occurred. A similar issue on sarek slack might suggest this is related to poor symlink creation? After checking the work dir, **the bai files exist!** --> **SYMLINK PROBLEM** Maybe I should just cancel remove orphans? ``` ERROR ~ Error executing process > 'NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS (SPT5_T15_REP1)' Caused by: Process `NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS (SPT5_T15_REP1)` terminated with an error exit status (1) Command executed: samtools sort -n -@ 6 -o SPT5_T15_REP1.mLb.clN.name.sorted.bam -T SPT5_T15_REP1.mLb.clN.name.sorted SPT5_T15_REP1.mLb.flT.sorted.bam bampe_rm_orphan.py SPT5_T15_REP1.mLb.clN.name.sorted.bam SPT5_T15_REP1.mLb.clN.bam --only_fr_pairs cat <<-END_VERSIONS > versions.yml "NFCORE_CHIPSEQ:CHIPSEQ:FILTER_BAM_BAMTOOLS:BAM_REMOVE_ORPHANS": samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//') END_VERSIONS Command exit status: 1 Command output: (empty) Command error: [E::idx_find_and_load] Could not retrieve index file for 'SPT5_T15_REP1.mLb.clN.name.sorted.bam' Traceback (most recent call last): File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 167, in <module> bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE, BAMOut=args.BAM_OUTPUT_FILE, onlyFRPairs=args.ONLY_FR_PAIRS) File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 82, in bampe_rm_orphan currRead = next(iter) File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__ StopIteration ``` ### :red_circle: Run: 44292328 (Full Dataset) Running full dataset to see if the symlink issue may still be a problem ``` ERROR ~ Error executing process > 'NFCORE_CHIPSEQ:CHIPSEQ:PHANTOMPEAKQUALTOOLS (FOXA1_IP_VEH_REP2)' Caused by: Process `NFCORE_CHIPSEQ:CHIPSEQ:PHANTOMPEAKQUALTOOLS (FOXA1_IP_VEH_REP2)` terminated with an error exit status (1) Command executed: RUN_SPP=`which run_spp.R` Rscript -e "library(caTools); source(\"$RUN_SPP\")" -c="FOXA1_IP_VEH_REP2.mLb.clN.sorted.bam" -savp="FOXA1_IP_VEH_REP2.spp.pdf" -savd="FOXA1_IP_VEH_REP2.spp.Rdata" -out="FOXA1_IP_VEH_REP2.spp.out" -p=6 cat <<-END_VERSIONS > versions.yml "NFCORE_CHIPSEQ:CHIPSEQ:PHANTOMPEAKQUALTOOLS": phantompeakqualtools: 1.2.2 END_VERSIONS Command exit status: 1 Command output: ################ ChIP data: FOXA1_IP_VEH_REP2.mLb.clN.sorted.bam Control data: NA strandshift(min): -500 strandshift(step): 5 strandshift(max) 1500 user-defined peak shift NA exclusion(min): 10 exclusion(max): NaN num parallel nodes: 6 FDR threshold: 0.01 NumPeaks Threshold: NA Output Directory: . narrowPeak output file name: NA regionPeak output file name: NA Rdata filename: FOXA1_IP_VEH_REP2.spp.Rdata plot pdf filename: FOXA1_IP_VEH_REP2.spp.pdf result filename: FOXA1_IP_VEH_REP2.spp.out Overwrite files?: FALSE Reading ChIP tagAlign/BAM file FOXA1_IP_VEH_REP2.mLb.clN.sorted.bam opened /state/partition1/job-44302130/RtmpLP9ruZ/FOXA1_IP_VEH_REP2.mLb.clN.sorted.tagAlign297cadfb11 done. read 0 fragments Command error: WARNING: Skipping mount /opt/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container Loading required package: Rcpp Error in read.table(bam2align.filename, nrows = 500) : no lines available in input Calls: source ... withVisible -> eval -> eval -> read.align -> read.table Execution halted ``` Full dataset runs without symlink errors, but there's an error when running the Rscript. Looks like there's zero read frags. This issue occurs in [run_spp.R](https://github.com/kundajelab/phantompeakqualtools/blob/master/run_spp.R) in phantompeakqualtools 1. Check the file itself Most files were empty 2. Is the genome correct? I was using GRCh38. Checking the test, seems like GRCh37 was used (hg19). This *shouldn't* be the issue since GRCh38 is more up-to-date... ### :red_circle: Run: 44297612 (Minimal Dataset) Removing `cleanup = True` from `params.config` to see if this would address the symlink issue...? Probably not, the bai files exist... ``` Command error: [E::idx_find_and_load] Could not retrieve index file for 'SPT5_T0_REP1.mLb.clN.name.sorted.bam' Traceback (most recent call last): File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 167, in <module> bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE, BAMOut=args.BAM_OUTPUT_FILE, onlyFRPairs=args.ONLY_FR_PAIRS) File "/home/kk4764/.nextflow/assets/nf-core/chipseq/bin/bampe_rm_orphan.py", line 82, in bampe_rm_orphan currRead = next(iter) File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__ StopIteration ``` ## March 18, 2024 Changed the reference genome to GRCh37. Looking to see if this resolves issues seen in both the Full and Minimal dataset. ### :grey_question: Run: 44354409 (Minimal Dataset) *ALMOST* succeeded ``` ERROR ~ Error executing process > 'NFCORE_CHIPSEQ:CHIPSEQ:MULTIQC (1)' Caused by: Process `NFCORE_CHIPSEQ:CHIPSEQ:MULTIQC (1)` terminated with an error exit status (1) Command executed: multiqc \ -f \ \ \ . cat <<-END_VERSIONS > versions.yml "NFCORE_CHIPSEQ:CHIPSEQ:MULTIQC": multiqc: $( multiqc --version | sed -e "s/multiqc, version //g" ) END_VERSIONS Command exit status: 1 Command output: (empty) ``` ### :heavy_check_mark: Run: 44354478 (Full Dataset) ``` Duration : 5h 59m 32s CPU hours : 341.0 Succeeded : 525 ```