# ENCODE ATAC-Seq Pipeline Manual
## Platform
Using **MobaXterm** to access the **Gekko** cluster.
>'Session' → 'SSH' → 'Host' (`gekko.hpc.ntu.edu.sg`) → 'Username' (`yourusername`)
Related folders:
>`/home/stian007/apps/`: atac-seq-pipeline
>`/home/stian007/ncbi/`: SRA data
>`/scratch/stian007/output_*/`: input JSON file and PBS script for each cell type
>`/scratch/stian007/fastq_gz_file`: fastq.gz files
>Where to find the raw ATAC-seq data?
>https://www.ebi.ac.uk/ena/browser/view/SRR3356451?show=reads
### .json file

>One input JSON file for each experiment (`Tcell.json`, `GM50k.json`, `GM500.json`); a minimal sketch follows below.
>
>==The FASTQ keys in each line must follow the pattern `atac.fastqs_repX_R1` / `atac.fastqs_repX_R2`; extra tokens such as `_Day1T` are not allowed!==
e.g. "atac.fastqs_rep11_R1" :heavy_check_mark:
"atac.fastqs_Day1T_rep11_R1" :x:
### .pbs file

>One `.pbs` job script for each cell type; they can all be submitted and run at the same time (a minimal sketch is given after this note).
>Where is **`atac.wdl`**? (`home` or `scratch`?)
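A minimal sketch of such a PBS script, assuming `caper run` is used to launch the workflow; the job name, resource requests, and file paths below are placeholders, not the actual cluster settings:
```bashscript=1
#!/bin/bash
#PBS -N atac_GM500
#PBS -l walltime=48:00:00
#PBS -l select=1:ncpus=8:mem=32gb
cd $PBS_O_WORKDIR

# activate the pipeline environment, then launch the workflow
eval "$(/usr/local/anaconda3-2020/bin/conda shell.bash hook)"
conda activate encode-atac-seq-pipeline
caper run /home/stian007/apps/atac-seq-pipeline/atac.wdl -i INPUTS_GM500.json
```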
---
## Dataset preprocessing
### 1. Install sratoolkit
```bashscript=1
$ eval "$(/usr/local/anaconda3-2020/bin/conda shell.bash hook)"
$ conda create --name toolkit
$ conda activate toolkit
$ conda install -c bioconda sra-tools
```
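To confirm the installation worked, the tools can be asked for their versions:
```bashscript=1
$ prefetch --version
$ fastq-dump --version
```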
### 2. Download SRA dataset using sratoolkit
>**Dataset**: We used previously published ATAC-seq libraries from the **`GM12878`** human lymphoblastoid cell line and purified **CD4+ T cells** (GEO: GSE47753)[:link:](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi)
>Run selector: [:link:](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP024293&o=acc_s%3Aa)
>Select the runs, then click 'Accession List' to download `SRR_Acc_List.txt`.
```bashscript=1
$ prefetch --option-file SRA_Data/SRR_Acc_List.txt
# or prefetch a single run: prefetch SRR6232298
```
> The downloaded files are in `/home/stian007/ncbi/sra`.
### 3. Split paired-end SRA data & Convert to fastq.gz files
```bashscript=1
$ for i in ./sra/*.sra; do
    fastq-dump --split-files --gzip "$i"
done
```
or
```bashscript=1
$ fastq-dump --split-files --gzip sra/SRR891272.sra
```
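Afterwards, it's worth sanity-checking the outputs; a quick sketch, assuming the `_1`/`_2` files were written to the current working directory (the location is an assumption):
```bashscript=1
# check that each run produced both mates and that the gzip files are intact
for run in sra/*.sra; do
    base=$(basename "$run" .sra)
    gzip -t "${base}_1.fastq.gz" "${base}_2.fastq.gz" && echo "${base}: OK"
done
```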
---
## Install and Activate Pipeline
```bashscript=1
$ module load anaconda2020/python3
$ conda deactivate
$ mkdir apps
$ cd apps
$ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
$ cd atac-seq-pipeline
$ bash scripts/install_conda_env.sh
# wait for completion: `=== All done successfully ===`
$ conda activate
$ conda activate encode-atac-seq-pipeline
```
To just activate the pipeline environment (after installation):
```bashscript=1
$ module purge
$ module load anaconda2020/python3
$ conda activate encode-atac-seq-pipeline
```
### Install Caper
#### Caper
**Caper** (Cromwell Assisted Pipeline ExecutoR) is a wrapper Python package for Cromwell. **Cromwell** is an open-source workflow management system for bioinformatics. Once they are installed, Caper can work completely offline with local data files.
```bashscript=1
pip install caper
```
>Use `pip`, not `conda`, to install the package!
>
>Remember to replace
>`~/.conda/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cromwell_backend.py` with the new one.
### Initialize Caper :exclamation:
Choose a backend and ==initialize Caper==. This creates a default Caper configuration file, `~/.caper/default.conf`, which contains only the required parameters for the chosen backend.
```bashscript=1
caper init pbs
# usage: caper init [BACKEND]
```
>`pbs`: HPC with a PBS cluster engine.
>
Then edit `~/.caper/default.conf`.
```bashscript=1
cat ~/.caper/default.conf
# check the existing default file
```
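For reference, a PBS backend configuration ends up looking roughly like the sketch below; the exact keys depend on the Caper version, and the queue name here is an assumption:
```bashscript=1
# contents of ~/.caper/default.conf (sketch; queue name is an assumption)
backend=pbs
pbs-queue=q32
```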
---
## Submit PBS job scripts
>Remember to activate the pipeline environment first.
```bashscript=1
$ qsub runINPUTS_GM500.pbs
```
Check the status of jobs
```bashscript=1
$ qstat -u $USER
```
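Since there is one script per cell type and they can run simultaneously, all of them can be submitted in one go; a sketch, assuming the scripts keep the `runINPUTS_*.pbs` naming shown above:
```bashscript=1
# submit every cell type's job script at once
for pbs in /scratch/stian007/output_*/runINPUTS_*.pbs; do
    qsub "$pbs"
done
```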
---
## Organize Outputs
>When the run has finished, use `croo` to organize and export the outputs.
```bashscript=1
$ conda activate encode-atac-seq-pipeline
croo ./atac/af4ea210-2953-4936-bcf2-b4d80fd616c3/metadata.json --out-dir /scratch/stian007/organized_outputs_T --method link
```
>`--out-dir`: Output directory/bucket (local or remote).
>**IMPORTANT**: We recommend organizing outputs on the same storage type as the original Cromwell outputs. For example, organize outputs on `gs://` for original Cromwell outputs on `gs://`: `croo ... --out-dir gs://some/where/organized`.
>`--method`: Method used to localize files in the output directory/bucket. `link` means soft-linking, and it works for local directories only.
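The hash directory in the `metadata.json` path differs for every run; one way to locate it, assuming the Cromwell output root is `./atac/` as in the command above:
```bashscript=1
# locate metadata.json files from finished runs
find ./atac/ -name metadata.json
```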
## Export the Organized Outputs
```bashscript=1
# run from the local machine (e.g. a Jupyter terminal), not on the cluster
rsync -rLt -h --info=progress2 --stats --no-inc-recursive --dry-run -e ssh STIAN007@gekko.hpc.ntu.edu.sg:/scratch/stian007/organized_outputs_50K ./depo/pipeline_outputs/
```
>`-r`: recurse into directories; `-L`: transform symlinks into the files they point to (needed here because `croo --method link` produced soft links); `-t`: preserve modification times; `-h`: human-readable output
>`--stats`: print transfer statistics, i.e. number of files, files transferred, total bytes
>`--no-inc-recursive`: disable incremental recursion
>`--dry-run`: run a trial without making any changes; remove it to perform the actual transfer

:slightly_smiling_face: