
# PEAS Manual

## 1. ATAC-Seq dataset pre-processing

## 2. Feature Extraction

(1) First, activate the `peas` env.

(2) Download hg38.fa:

```bash
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
```

(3) Feature extraction:

```bash
PEASFeatureExtraction.sh <arg0> <arg1> <arg2> <arg3> <arg4> <arg5> <arg6> <arg7> <arg8> <arg9> <arg10>
```

Input file: `.nodup.bam`, the filtered/deduplicated BAM. This is the final BAM.

| arg | What | Example |
| -------- | -------- | -------- |
| `<arg0>` | Path to the directory containing the input file | /home/stian/outerspace/PEAS_trying/depo/pipeline_outputs/organized_outputs_U266/align/rep1 |
| `<arg1>` | Input file prefix | SRR12098207_1.trim.srt.nodup.no_chrM_MT |
| `<arg2>` | Output path | /home/stian/outerspace/output/U266_rep1 |
| `<arg3>` | Path to the FASTA file | /home/stian/outerspace/PEAS_trying/depo/example_data/hg38.fa |
| `<arg4>` | Reference genome to use for HOMER (it must also be installed/configured in HOMER) | hg38 |
| `<arg5>` | Path to the BED file containing error-prone regions of the genome to remove. This file is provided in the PEAS GitHub repository | /home/stian/outerspace/PEAS_trying/depo/example_data/filter_hg38.bed |
| `<arg6>` | Motifs file path (GitHub) | /home/stian/outerspace/PEAS_trying/depo/example_data/humantop_Nov2016_HOMER.motifs |
| `<arg7>` | Conservation BED file path (GitHub) | /home/stian/outerspace/PEAS_trying/depo/example_data/phastCons46wayPlacental.bed |
| `<arg8>` | CTCF motifs file path (GitHub) | /home/stian/outerspace/PEAS_trying/depo/example_data/CTCF.motifs |
| `<arg9>` | PEAS path | /opt/datasci_apps/PEAS_v1.2.1 |
| ==`<arg10>`== | Chromosome list file (chromosomes 1~22) | /home/stian/outerspace/find_bug/chromosomes_hg38.txt |

> `<arg10>` is used to fix the "insertsizethresh" bug:
> ![](https://i.imgur.com/asO3sM5.png)

---

e.g.
```bash
PEASFeatureExtraction.sh \
    /home/stian/outerspace/PEAS_trying/depo/pipeline_outputs/organized_outputs_U266/align/rep1 \
    SRR12098207_1.trim.srt.nodup.no_chrM_MT \
    /home/stian/outerspace/find_bug/U266_rep1 \
    /home/stian/outerspace/PEAS_trying/depo/example_data/hg38.fa \
    hg38 \
    /home/stian/outerspace/PEAS_trying/depo/example_data/filter_hg38.bed \
    /home/stian/outerspace/PEAS_trying/depo/example_data/humantop_Nov2016_HOMER.motifs \
    /home/stian/outerspace/PEAS_trying/depo/example_data/phastCons46wayPlacental.bed \
    /home/stian/outerspace/PEAS_trying/depo/example_data/CTCF.motifs \
    /opt/datasci_apps/PEAS_v1.2.1 \
    /home/stian/outerspace/find_bug/chromosomes_hg38.txt
```

Finally, we use '*.features.txt*' as the training feature dataset (27 columns).

## 3. Feature data annotation

Combine the `ChromHMM states segmentation` file and the `feature data` file to get the final training feature dataset (31 columns).

(1) Create *labels.csv* (4 columns):

```bash
PEASTools.sh annotate <arg0> <arg1> <arg2>
```

| arg | What | Example |
| -------- | -------- | -------- |
| `<arg0>` | Path to the delimited feature file | ./delimited_features/v2_MM1S_rep1_delimited.csv |
| `<arg1>` | ChromHMM annotation file | ./Feature_State_toCombine/15labels/MM1S_15labels.csv |
| `<arg2>` | Output path | ./TrainingData_PEAS_labels/15labels/v2_MM1S_rep1_PEAS_labels.csv |

> `Delimited files` keep the column index (header); `ChromHMM annotation files` don't have a column index (header). Use sep='\t'.

```python
# `feature` is assumed to be the pandas DataFrame holding the extracted features
feature.to_csv("./v2_MM1S_rep1_delimited.csv", index=False, sep='\t')  # header is kept (no header=False)
```

---

e.g.

```bash
PEASTools.sh annotate ./delimited_features/v2_MM1S_rep1_delimited.csv ./Feature_State_toCombine/15labels/MM1S_15labels.csv ./TrainingData_PEAS_labels/15labels/v2_MM1S_rep1_PEAS_labels.csv
```

(2) Combine *labels.csv* and *features.csv*:

```python
import pandas as pd
# `feature` and `labels` are DataFrames; with no `on=`, merge joins on their shared columns
result = pd.merge(feature, labels)
result.to_csv("./v2_MM1S_rep1.txt", index=False, sep="\t")
```

Finally, we get the feature dataset (31 columns).

## 4. Training Model

```bash
python ./PEASTrainer.py -o ./depo/Models -n model_15labels \
    -p "solver='adam',beta_1=0.999,beta_2=0.9999,epsilon=0.0000000001,activation='relu',hidden_layer_sizes=(25,)" \
    -f ./depo/Models/features.txt -c ./depo/Models/classes.txt -l ./depo/Models/labelencoder.txt \
    ./depo/Models/featurefile_List.csv
```

```bash
PEASTrainer.py [-h] [-o OUT] [-n NAME] [-p PARAMSTRING] [-f FEATURES] [-c CLASSES] [-l LABELENCODER] [-r RANDOMSTATE] featurefiles
```

> positional arguments:
> `featurefiles`: File listing the file paths of all features to train the model.
>
> optional arguments:
> `-h, --help`: show this help message and exit
> `-o OUT`: The selected directory saving outputfiles.
> `-n NAME`: Name of the Model.
> `-p PARAMSTRING`: String containing the parameters for the model.
> `-f FEATURES`: Feature index file specifying which columns to include in the feature matrix.
> `-c CLASSES`: File containing class label transformations into integer representations.
> `-l LABELENCODER`: File containing feature label transformations into integer representations.
> `-r RANDOMSTATE`: Integer for setting the random number generator seed

## 5. Prediction

```bash
python ./PEASPredictor.py -o ./depo/Predict_Result/model15_states -p model15_states_KMS11_rep1 \
    -f ./depo/Models/features.txt -c ./depo/Models/classes.txt -l ./depo/Models/labelencoder.txt \
    -e -a ./depo/Predict_Result/model15_states/KMS11_rep1_result_features \
    ./depo/Models/model_15labels.pkl ./depo/Predict_Result/model15_states/predict_KMS11_rep1.csv
```

```bash
PEASPredictor.py [-h] [-o OUT] [-p PREFIX] [-f FEATURES] [-c CLASSES] [-l LABELENCODER] [-e] [-a ANNOTATIONDEST] modelfile featurefiles
```

> positional arguments:
> `modelfile`: The filepath for the saved model (`*.pkl`).
> `featurefiles`: File listing the file paths of all features to train the model.
>
> optional arguments:
> `-h, --help`: show this help message and exit
> `-o OUT`: The selected directory saving outputfiles.
> `-p PREFIX`: Prefix for saving files.
> `-f FEATURES`: Feature index file specifying which columns to include in the feature matrix.
> `-c CLASSES`: File containing class label transformations into integer representations. Also used for filtering peaks based on classes.
> `-l LABELENCODER`: File containing feature label transformations into integer representations.
> `-e`: Whether or not to compare predictions with provided class labels.
> `-a ANNOTATIONDEST`: Include an annotation file path prefix to append the feature file with a column for predictions.

Input files for prediction (test):

(1) Feature data files (==no header==):

```python
import pandas as pd
a = pd.read_csv('./depo/Training_Data/Features_Data/Final_Features/15labels/GM500_rep3.txt', sep='\t')
# Write the same table again without the header row
a.to_csv("./depo/Predict_Result/model15_31columns/GM500_rep3_new.txt", index=False, sep='\t', header=False)
```

(2) A file listing the file paths of all feature files, e.g.

```
v2_MM1S_rep1 /home/stian/outerspace/PEAS_trying/depo/Predict_Result/model15_31columns/v2_MM1S_rep1_new.txt
```
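The two prediction inputs described above (a header-less feature file and a file that lists feature files by name and path) can also be prepared in one step. The following is only a minimal sketch using the Python standard library instead of pandas. The helper names `strip_header` and `write_file_list`, the demo file names, and the demo table are hypothetical. The separator between name and path in the list file is not stated in this manual, so the tab used here is an assumption that should be checked against what `PEASPredictor.py` expects.

```python
import tempfile
from pathlib import Path


def strip_header(src: Path, dst: Path) -> int:
    """Copy a feature file to dst without its first (header) line.

    Returns the number of data lines written.
    """
    rows = 0
    with src.open() as fin, dst.open("w") as fout:
        next(fin, None)  # skip the header line, if the file has one
        for line in fin:
            fout.write(line)
            rows += 1
    return rows


def write_file_list(entries: dict, list_path: Path, sep: str = "\t") -> None:
    """Write one '<name><sep><path>' line per feature file.

    ASSUMPTION: the separator is not documented in this manual; tab is a guess.
    """
    with list_path.open("w") as fout:
        for name, path in entries.items():
            fout.write(f"{name}{sep}{path}\n")


# Demo on a tiny made-up table (not real PEAS output)
with tempfile.TemporaryDirectory() as tmp_dir:
    work = Path(tmp_dir)
    src = work / "demo.txt"
    src.write_text("chr\tstart\tend\nchr1\t100\t200\nchr2\t300\t400\n")

    dst = work / "demo_new.txt"
    n_rows = strip_header(src, dst)

    list_path = work / "predict_demo.csv"
    write_file_list({"demo": dst}, list_path)

    headerless_text = dst.read_text()
    list_text = list_path.read_text()

print(n_rows, "data rows written")
```

For a real run, `src` would point at a feature file such as the `GM500_rep3.txt` example above, and the resulting list file would be passed to `PEASPredictor.py` as the `featurefiles` argument.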