# ENCODE ATAC-Seq Pipeline Manual

## Platform

Use **MobaXterm** to access the **Gekko cluster**.

> ‘Session’ → ‘SSH’ → ‘host’ (`gekko.hpc.ntu.edu.sg`) → ‘username’ (`yourusername`)

Related folders:

> `/home/stian007/apps/` : atac-seq-pipeline
> `/home/stian007/ncbi/` : SRA data
> `/scratch/stian007/output_*/` : input JSON file and PBS script for each cell type
> `/scratch/stian007/fastq_gz_file` : fastq.gz files

> Where to find the raw ATAC-Seq reads?
> https://www.ebi.ac.uk/ena/browser/view/SRR3356451?show=reads

### .json file

![](https://i.imgur.com/rz2Fbfc.png)

> *One file for each experiment* (`*Tcell.json`, `*GM50k.json`, `*GM500.json`).
> (A rough sketch of such a file appears in the ‘Submit PBS job scripts’ section below.)
>
> ==The FASTQ keys must follow the format `atac.fastqs_repX_R1` / `atac.fastqs_repX_R2`; extra tokens such as `_Day1T` are not allowed.==
> e.g. `"atac.fastqs_rep11_R1"` :heavy_check_mark: `"atac.fastqs_Day1T_rep11_R1"` :x:

### .pbs file

![](https://i.imgur.com/toCNwU4.png)

> One file for each cell type; they can run at the same time.
> (A sketch of such a script also appears in the ‘Submit PBS job scripts’ section below.)
>
> Where is **`atac.wdl`**? It sits at the top level of the cloned `atac-seq-pipeline` repository (here `/home/stian007/apps/atac-seq-pipeline/atac.wdl`).

---

## Dataset preprocessing

### 1. Install sratoolkit

```bash
$ eval "$(/usr/local/anaconda3-2020/bin/conda shell.bash hook)"
$ conda create --name toolkit
$ conda activate toolkit
$ conda install -c bioconda sra-tools
```

### 2. Download the SRA dataset using sratoolkit

> **Dataset**: We used previously published ATAC-seq libraries from the **`GM12878`** human lymphoblastoid cell line and purified **`CD4+ T cells`** (GEO: GSE47753) [:link:](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi)
> Run selector: [:link:](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP024293&o=acc_s%3Aa) → ‘Selected’ → ‘Download Accession List’

```bash
$ prefetch --option-file SRA_Data/SRR_Acc_List.txt
# or prefetch a single run: prefetch SRR6232298
```

> The downloaded files are in `/home/stian007/ncbi/sra`.

### 3. Split paired-end SRA data & convert to fastq.gz files

```bash
$ for i in ./sra/*.sra; do fastq-dump --split-files --gzip "$i"; done
```

or

```bash
$ fastq-dump --split-files --gzip sra/SRR891272.sra
```

---

## Install and Activate the Pipeline

```bash
$ module load anaconda2020/python3
$ conda deactivate
$ mkdir apps
$ cd apps
$ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
$ cd atac-seq-pipeline
$ bash scripts/install_conda_env.sh
# wait for completion: `=== All done successfully ===`
$ conda activate
$ conda activate encode-atac-seq-pipeline
```

To just activate the pipeline later:

```bash
$ module purge
$ module load anaconda2020/python3
$ conda activate encode-atac-seq-pipeline
```

### Install Caper

#### Caper

**Caper** (Cromwell Assisted Pipeline ExecutoR) is a wrapper Python package for Cromwell. **Cromwell** is an open-source workflow management system for bioinformatics. Once both are installed, Caper can work completely offline with local data files.

```bash
$ pip install caper
```

> Use `pip`, not `conda`, to install the package!
>
> Remember to replace
> `~/.conda/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cromwell_backend.py`
> with the new one.

### Initialize Caper

:exclamation: Choose a backend and ==initialize Caper==. This creates a default Caper configuration file `~/.caper/default.conf`, which holds only the required parameters for the chosen backend.

```bash
$ caper init pbs    # caper init [BACKEND]
```

> `pbs`: HPC with a PBS cluster engine.
>
> Then edit `~/.caper/default.conf`.

```bash
$ cat ~/.caper/default.conf    # check the existing default file
```

---

## Submit PBS job scripts

> Remember to activate the pipeline environment first.
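Before submitting, it is worth double-checking the input JSON described in the `.json file` section. Below is a minimal sketch (written as a heredoc so it can be pasted into a shell) of what such a file might contain; the paths, replicate numbers, and all field names other than the `atac.fastqs_repX_R1`/`_R2` keys are illustrative placeholders, so verify them against the pipeline's input documentation rather than copying them verbatim.

```bash
# Minimal sketch of an input JSON; every path, the output file name, and the
# non-fastqs fields are placeholders for this project, not the actual files used.
cat > /scratch/stian007/output_GM500/INPUTS_GM500.json << 'EOF'
{
    "atac.title": "GM500",
    "atac.pipeline_type": "atac",
    "atac.genome_tsv": "/path/to/hg38.tsv",
    "atac.paired_end": true,
    "atac.fastqs_rep1_R1": ["/scratch/stian007/fastq_gz_file/SRR891272_1.fastq.gz"],
    "atac.fastqs_rep1_R2": ["/scratch/stian007/fastq_gz_file/SRR891272_2.fastq.gz"]
}
EOF
```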
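The `.pbs` script being submitted (one per cell type) is only shown as a screenshot above, so here is a rough sketch of what it might look like, assuming it wraps a `caper run` call and uses the `atac.wdl` from the cloned repository; the job name, resource requests, queue, and file paths are placeholders to adapt to your own account:

```bash
#!/bin/bash
#PBS -N atac_GM500                     # job name (placeholder)
#PBS -l select=1:ncpus=8:mem=32gb      # resources (placeholder; adjust to the cluster's limits)
#PBS -l walltime=48:00:00              # walltime (placeholder)
#PBS -q q32                            # queue (placeholder; use the queue assigned to you)

cd "$PBS_O_WORKDIR"

# Activate the pipeline environment (same steps as in the activation block above).
module purge
module load anaconda2020/python3
eval "$(/usr/local/anaconda3-2020/bin/conda shell.bash hook)"
conda activate encode-atac-seq-pipeline

# Run the workflow through Caper: caper run [WDL] -i [INPUT_JSON]
caper run /home/stian007/apps/atac-seq-pipeline/atac.wdl \
    -i /scratch/stian007/output_GM500/INPUTS_GM500.json
```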
Submit one job per cell type:

```bash
$ qsub runINPUTS_GM500.pbs
```

Check the status of the jobs:

```bash
$ qstat -u $USER
```

---

## Organize Outputs

> When a run has finished, use `croo` to organize the outputs.

```bash
$ conda activate encode-atac-seq-pipeline
$ croo ./atac/af4ea210-2953-4936-bcf2-b4d80fd616c3/metadata.json --out-dir /scratch/stian007/organized_outputs_T --method link
```

> `--out-dir`: Output directory/bucket (local or remote).
> **IMPORTANT**: It is recommended to organize outputs on the same storage type as the original Cromwell outputs. For example, for original Cromwell outputs on `gs://`, organize outputs on `gs://`: `croo ... --out-dir gs://some/where/organized`.
>
> `--method`: How to localize files into the output directory/bucket. `link` means soft-linking and works for a local directory only.

## Export the Organized Outputs

```bash
# jupyter terminal
$ rsync -rLt -h --info=progress2 --stats --no-inc-recursive --dry-run -e ssh STIAN007@gekko.hpc.ntu.edu.sg:/scratch/stian007/organized_outputs_50K ./depo/pipeline_outputs/
```

> `-r`: recurse into directories
> `-L`: transform symlinks into the files/directories they point to
> `-t`: preserve modification times
> `-h`: output numbers in a human-readable format
> `--stats`: print transfer statistics, e.g. number of files, files transferred, total bytes
> `--no-inc-recursive`: disable incremental recursion
> `--dry-run`: run a trial without making any changes

:slightly_smiling_face:
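The command above is only a trial because of `--dry-run`. Once the listed files look right, rerun it without that flag to perform the actual transfer:

```bash
# jupyter terminal: same command as above, minus --dry-run, to actually copy the outputs
rsync -rLt -h --info=progress2 --stats --no-inc-recursive -e ssh STIAN007@gekko.hpc.ntu.edu.sg:/scratch/stian007/organized_outputs_50K ./depo/pipeline_outputs/
```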