# PEAS Manual
## 1. ATAC-Seq dataset pre-processing
## 2. Feature Extraction
(1) First, activate the `peas` environment.
(2) Download `hg38.fa`:
```script=1
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
```
(3) Run feature extraction:
```bashscript=1
PEASFeatureExtraction.sh <arg0> <arg1> <arg2> <arg3> <arg4> <arg5> <arg6> <arg7> <arg8> <arg9> <arg10>
```
Input file: `.nodup.bam`, the filtered and deduplicated BAM. This is the final BAM from pre-processing.
| arg | What | Example |
| -------- | -------- | -------- |
| `<arg0>` | Path to the directory containing the input file | /home/stian/outerspace/PEAS_trying/depo/pipeline_outputs/organized_outputs_U266/align/rep1 |
| `<arg1>` | Input file prefix | SRR12098207_1.trim.srt.nodup.no_chrM_MT |
| `<arg2>` | Output path | /home/stian/outerspace/output/U266_rep1 |
| `<arg3>` | Path to the FASTA file | /home/stian/outerspace/PEAS_trying/depo/example_data/hg38.fa |
| `<arg4>` | Reference genome for HOMER (it must also be installed and configured in HOMER) | hg38 |
| `<arg5>` | Path to the BED file of error-prone genome regions to remove (provided in the PEAS GitHub repository) | /home/stian/outerspace/PEAS_trying/depo/example_data/filter_hg38.bed |
| `<arg6>` | Path to the motifs file (provided in the PEAS GitHub repository) | /home/stian/outerspace/PEAS_trying/depo/example_data/humantop_Nov2016_HOMER.motifs |
| `<arg7>` | Path to the conservation BED file (provided in the PEAS GitHub repository) | /home/stian/outerspace/PEAS_trying/depo/example_data/phastCons46wayPlacental.bed |
| `<arg8>` | Path to the CTCF motifs file (provided in the PEAS GitHub repository) | /home/stian/outerspace/PEAS_trying/depo/example_data/CTCF.motifs |
| `<arg9>` | PEAS installation path | /opt/datasci_apps/PEAS_v1.2.1 |
| ==`<arg10>`== | Path to the chromosomes list file (chromosomes 1 to 22) | /home/stian/outerspace/find_bug/chromosomes_hg38.txt |
> `<arg10>` is used to fix the "insertsizethresh" bug.

Example:
```bashscript=1
PEASFeatureExtraction.sh /home/stian/outerspace/PEAS_trying/depo/pipeline_outputs/organized_outputs_U266/align/rep1 SRR12098207_1.trim.srt.nodup.no_chrM_MT /home/stian/outerspace/find_bug/U266_rep1 /home/stian/outerspace/PEAS_trying/depo/example_data/hg38.fa hg38 /home/stian/outerspace/PEAS_trying/depo/example_data/filter_hg38.bed /home/stian/outerspace/PEAS_trying/depo/example_data/humantop_Nov2016_HOMER.motifs /home/stian/outerspace/PEAS_trying/depo/example_data/phastCons46wayPlacental.bed /home/stian/outerspace/PEAS_trying/depo/example_data/CTCF.motifs /opt/datasci_apps/PEAS_v1.2.1 /home/stian/outerspace/find_bug/chromosomes_hg38.txt
```
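
Eleven positional arguments are easy to misorder. The sketch below names each argument and assembles the command in the order given in the table above. The helper `build_feature_extraction_cmd` and the key names in `ARG_ORDER` are hypothetical (they are not part of PEAS), and the sketch assumes `PEASFeatureExtraction.sh` is on your `PATH`.

```python
import shlex

# Positional order of <arg0>..<arg10>, following the table above.
# The key names are illustrative only; PEAS itself takes bare positionals.
ARG_ORDER = [
    "input_dir",         # <arg0>
    "input_prefix",      # <arg1>
    "output_dir",        # <arg2>
    "fasta",             # <arg3>
    "homer_genome",      # <arg4>
    "filter_bed",        # <arg5>
    "motifs",            # <arg6>
    "conservation_bed",  # <arg7>
    "ctcf_motifs",       # <arg8>
    "peas_dir",          # <arg9>
    "chromosomes",       # <arg10>
]


def build_feature_extraction_cmd(args, script="PEASFeatureExtraction.sh"):
    """Return the argv list for the script (hypothetical helper)."""
    missing = [name for name in ARG_ORDER if name not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return [script] + [str(args[name]) for name in ARG_ORDER]


data = "/home/stian/outerspace/PEAS_trying/depo/example_data"
cmd = build_feature_extraction_cmd({
    "input_dir": "/home/stian/outerspace/PEAS_trying/depo/pipeline_outputs/organized_outputs_U266/align/rep1",
    "input_prefix": "SRR12098207_1.trim.srt.nodup.no_chrM_MT",
    "output_dir": "/home/stian/outerspace/find_bug/U266_rep1",
    "fasta": f"{data}/hg38.fa",
    "homer_genome": "hg38",
    "filter_bed": f"{data}/filter_hg38.bed",
    "motifs": f"{data}/humantop_Nov2016_HOMER.motifs",
    "conservation_bed": f"{data}/phastCons46wayPlacental.bed",
    "ctcf_motifs": f"{data}/CTCF.motifs",
    "peas_dir": "/opt/datasci_apps/PEAS_v1.2.1",
    "chromosomes": "/home/stian/outerspace/find_bug/chromosomes_hg38.txt",
})

# Print the command for inspection; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```
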
Finally, use the resulting `*.features.txt` file as the training feature dataset (27 columns).
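
A quick sanity check on the column count can catch a failed or truncated extraction early. This is a minimal sketch, assuming the feature file is tab-separated (the manual only states `sep='\t'` for the delimited files in section 3); `column_counts` and `check_feature_file` are hypothetical helper names.

```python
import csv


def column_counts(path, sep="\t"):
    """Return the set of distinct column counts across all non-empty rows."""
    with open(path, newline="") as handle:
        return {len(row) for row in csv.reader(handle, delimiter=sep) if row}


def check_feature_file(path, expected=27, sep="\t"):
    """Raise ValueError unless every row has exactly `expected` columns."""
    counts = column_counts(path, sep)
    if counts != {expected}:
        raise ValueError(
            f"{path}: expected {expected} columns, found {sorted(counts)}"
        )
    return True

# Usage (the path is a placeholder):
# check_feature_file("/path/to/sample.features.txt", expected=27)
```

The same check with `expected=31` applies to the annotated dataset produced in section 3.
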
## 3. Feature data annotation
Combine the ChromHMM state segmentation file with the feature data file to get the final training feature dataset (31 columns).
(1) Generate *labels.csv* (4 columns):
```bashscript=1
PEASTools.sh annotate <arg0> <arg1> <arg2>
```
| arg | What | Example |
| -------- | -------- | -------- |
| `<arg0>` | Path to the delimited feature file | ./delimited_features/v2_MM1S_rep1_delimited.csv |
| `<arg1>` | ChromHMM annotation file | ./Feature_State_toCombine/15labels/MM1S_15labels.csv |
| `<arg2>` | Output path | ./TrainingData_PEAS_labels/15labels/v2_MM1S_rep1_PEAS_labels.csv |
> Delimited feature files keep their header row; ChromHMM annotation files have no header row. The separator is a tab (`sep='\t'`).
```pythonscript=1
# `features` is a placeholder name for the feature DataFrame; the header row is kept
features.to_csv("./v2_MM1S_rep1_delimited.csv", index=False, sep='\t')
```
Example:
```bashscript=1
PEASTools.sh annotate ./delimited_features/v2_MM1S_rep1_delimited.csv ./Feature_State_toCombine/15labels/MM1S_15labels.csv ./TrainingData_PEAS_labels/15labels/v2_MM1S_rep1_PEAS_labels.csv
```
(2) Combine *labels.csv* and *features.csv*:
```pythonscript=1
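import pandas as pd
# `feature` and `labels` are DataFrames loaded from the delimited feature file and labels.csv
result = pd.merge(feature, labels)  # joins on all column names shared by both frames
result.to_csv("./v2_MM1S_rep1.txt", index=False, sep="\t")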
```
This gives the final feature dataset (31 columns).
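
Because `pd.merge` is called without `on=`, it joins on every column name the two frames share and, by default, keeps only rows present in both (`how="inner"`). The toy example below illustrates that behavior. The column names (`chrom`, `start`, `end`, `signal`, `state`) and the values are hypothetical stand-ins, not the actual PEAS column layout.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (hypothetical columns).
feature = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 300],
    "end": [200, 650, 420],
    "signal": [3.2, 1.1, 7.8],
})
labels = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [100, 300],
    "end": [200, 420],
    "state": ["1_TssA", "7_Enh"],
})

# No `on=`: the join keys are the shared columns chrom, start and end.
result = pd.merge(feature, labels)

# Rows without a matching label are silently dropped by the inner join,
# so it is worth reporting how many were lost.
dropped = len(feature) - len(result)
print(result.columns.tolist())
print(f"{dropped} peak(s) had no label and were dropped")
```

If the real files share no column names at all, `pd.merge` raises an error instead of joining, so checking the headers of both files first is a reasonable precaution.
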
## 4. Training Model
```bashscript=1
python ./PEASTrainer.py -o ./depo/Models -n model_15labels -p "solver='adam',beta_1=0.999,beta_2=0.9999,epsilon=0.0000000001,activation='relu',hidden_layer_sizes=(25,)" -f ./depo/Models/features.txt -c ./depo/Models/classes.txt -l ./depo/Models/labelencoder.txt ./depo/Models/featurefile_List.csv
```
```bashscript=1
PEASTrainer.py [-h] [-o OUT] [-n NAME] [-p PARAMSTRING] [-f FEATURES] [-c CLASSES] [-l LABELENCODER] [-r RANDOMSTATE] featurefiles
```
| arg | What |
| -------- | -------- |
| `featurefiles` (positional) | File listing the file paths of all features used to train the model |
| `-h`, `--help` | Show the help message and exit |
| `-o OUT` | Directory in which to save the output files |
| `-n NAME` | Name of the model |
| `-p PARAMSTRING` | String containing the parameters for the model |
| `-f FEATURES` | Feature index file specifying which columns to include in the feature matrix |
| `-c CLASSES` | File containing class label transformations into integer representations |
| `-l LABELENCODER` | File containing feature label transformations into integer representations |
| `-r RANDOMSTATE` | Integer for setting the random number generator seed |
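
The `-p` string is a comma-separated list of `keyword=value` pairs, and a typo in it only surfaces once training starts. The sketch below parses such a string into a dictionary so it can be inspected beforehand. How `PEASTrainer.py` itself parses the string is not documented here, so `parse_param_string` is a hypothetical helper for validation only; the keyword names in the example appear to match scikit-learn's `MLPClassifier` arguments.

```python
import ast


def parse_param_string(paramstring):
    """Parse "k1=v1,k2=v2" into a dict, accepting only Python literals."""
    # Wrap the string in a dummy call so the ast module can read the
    # keyword arguments without executing anything.
    call = ast.parse(f"_({paramstring})", mode="eval").body
    if not isinstance(call, ast.Call) or call.args:
        raise ValueError("expected only keyword=value pairs")
    # literal_eval rejects anything that is not a plain literal
    # (numbers, strings, tuples, ...).
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


params = parse_param_string(
    "solver='adam',beta_1=0.999,beta_2=0.9999,epsilon=0.0000000001,"
    "activation='relu',hidden_layer_sizes=(25,)"
)
for name, value in params.items():
    print(f"{name} = {value!r}")
```
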
## 5. Prediction
```bashscript=1
python ./PEASPredictor.py -o ./depo/Predict_Result/model15_states -p model15_states_KMS11_rep1 -f ./depo/Models/features.txt -c ./depo/Models/classes.txt -l ./depo/Models/labelencoder.txt -e -a ./depo/Predict_Result/model15_states/KMS11_rep1_result_features ./depo/Models/model_15labels.pkl ./depo/Predict_Result/model15_states/predict_KMS11_rep1.csv
```
```bashscript=1
PEASPredictor.py [-h] [-o OUT] [-p PREFIX] [-f FEATURES] [-c CLASSES] [-l LABELENCODER] [-e] [-a ANNOTATIONDEST] modelfile featurefiles
```
| arg | What |
| -------- | -------- |
| `modelfile` (positional) | File path of the saved model (`*.pkl`) |
| `featurefiles` (positional) | File listing the file paths of all feature files |
| `-h`, `--help` | Show the help message and exit |
| `-o OUT` | Directory in which to save the output files |
| `-p PREFIX` | Prefix for saved files |
| `-f FEATURES` | Feature index file specifying which columns to include in the feature matrix |
| `-c CLASSES` | File containing class label transformations into integer representations. Also used for filtering peaks based on classes |
| `-l LABELENCODER` | File containing feature label transformations into integer representations |
| `-e` | Whether or not to compare predictions with the provided class labels |
| `-a ANNOTATIONDEST` | Annotation file path prefix; when given, the feature file is written out with an appended column of predictions |

Input files for prediction (test):
(1) Feature data files (==no header==):
```pythonscript=1
import pandas as pd
# Re-save the feature file without its header row
a = pd.read_csv('./depo/Training_Data/Features_Data/Final_Features/15labels/GM500_rep3.txt', sep='\t')
a.to_csv("./depo/Predict_Result/model15_31columns/GM500_rep3_new.txt", index=False, sep='\t', header=False)
```
(2) The `featurefiles` list: a file with one line per feature file, giving a sample name followed by the file path.
e.g.
```pythonscript=1
v2_MM1S_rep1 /home/stian/outerspace/PEAS_trying/depo/Predict_Result/model15_31columns/v2_MM1S_rep1_new.txt
```
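
With several samples, the listing file can be generated rather than typed by hand. This is a minimal sketch: `write_feature_list` is a hypothetical helper, and the tab separator is an assumption, since the example above only shows the name and path separated by whitespace.

```python
import csv


def write_feature_list(entries, list_path, sep="\t"):
    """Write one "name<sep>path" line per (name, path) pair."""
    with open(list_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=sep, lineterminator="\n")
        for name, path in entries:
            writer.writerow([name, path])

# Usage (the output file name is a placeholder):
# write_feature_list(
#     [("v2_MM1S_rep1",
#       "/home/stian/outerspace/PEAS_trying/depo/Predict_Result/"
#       "model15_31columns/v2_MM1S_rep1_new.txt")],
#     "predict_list.csv",
# )
```
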