ASE and TSR Pipeline

# ASE and TSR Pipeline ## Tumor Subclonal Reconstruction (TSR) with PyClone-VI #### Please note: all files must be updated with proper paths depending on your own personal input data. --- ## Using ASCAT/Mutect: *Requires successful run of ASCAT and Mutect. Need folder for all of your tumor samples with:* #### ASCAT files: "...ASCAT_final_output.tsv" (for copy numbers) "...ASCAT.output.second.RDA" (for tumor purity values) #### Mutect files: "...tumoronly_filtered.vcf.gz" (for chromosome, position, reference allele counts, alternate allele counts) ### Data Cleaning (ASCAT/Mutect) PyClone-VI requires a specified input. For ASCAT and Mutect, the following file cleans input data: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/R-Scripts/mutect+ascat_Data_Cleaning_PycloneVI.R ``` The following LSF-File will run this R-script: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/ascat+mutectCleaner-lsf/ascat+mutectCleaner.lsf ``` --- ## Using DepMap Files: Requires the following **DepMap information files**: 1. "...OmicsAbsoluteCNSegmentsModelID.csv" (for copy number information) 2. "...FY24Q4_OmicsSignatures.csv" (for signatures) 3. "...OmicsSomaticMutations.csv" (for mutation data) 4. *optional* "Model.csv" - or equivalennt (for metadata purposes) ### Data Cleaning (for DepMap files) The following file cleans input data from DepMap: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/R-Scripts/DepMap_Data_Cleaning_PycloneVI.R ``` This is the corresponding LSF script to run this R-script on HPC: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/depmapCleaner-lsf/DepMapCleaner.lsf ``` --- ## Using TCGA Files: Requiress the following **TCGA input files**: 1. Copy number calls 2. Mutation calls ### Data Cleaning (for TCGA files) The following paths lead to specified cleaners for TCGA data. These R-scripts are built to only analyze specific cancer types, but will work on other inputs as well, *simply change the file input path within the scripts*: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/R-Scripts/TCGA_Data_Cleaning_COADREAD.R /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/R-Scripts/TCGA_Data_Cleaning_LAML.R ``` Here are the corresponding LSF scripts to run these R-Scripts on HPC: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/TCGA_cleaner-lsf ``` Look within the LSF-Scripts folder to find other examples of LSFs to submit scripts. --- ## Using PDX files Requires the following **PDX input files**: 1. vep3 results (...maf) - used to reference and alternate allele counts 2. Facets file (...hisens.rds) - used to determine copy numbers ### Data Cleaning/Running PyClone-VI (for PDX files) In our run of PDX files, we did structure things slightly different than in TCGA, DepMap, and ASCAT/Mutect. Since there were more samples, we ran a bash script located at which created a script to run the cleaner as well as Pyclone-VI: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/PDX_cleaning_pyclone-lsfs/wes_script_maker.sh ``` In the bash script above, this is the R-file utilized to clean the data: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/PDX_cleaning_pyclone-lsfs/CGC_Data_Cleaner.R ``` This is an example of one of the LSF scripts that the bash script generates: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/PDX_cleaning_pyclone-lsfs/cgc_00013.lsf ``` ## Outputs: Regardless of input data, once the cleaning R-scripts have been ran, the output format should look like this: ![Screenshot 2025-08-06 at 12.56.44 PM](https://hackmd.io/_uploads/SkQFwzb_ex.png) PyClone-VI requires this input to be ran (mutation_ID does not have to look exactly like that) --- ## Running Pyclone-VI Refer to GitHub for more detailed explanation: [Link](https://github.com/Roth-Lab/pyclone-vi) #### Set up Conda environment built for PyClone-VI: MD Anderson utilizes miniforge3: ``` module load miniforge3 *enter miniforge base* ``` Create PyClone conda: ```conda create -c conda-forge -n pyclone-vi --file requirements.txt --yes``` Install PyClone-VI: ```pip install git+https://github.com/Roth-Lab/pyclone-vi.git``` Enter conda environment: ```conda activate pyclone-vi``` #### Run Pyclone commands Fit your data! (input file is cleaned.tsv from R script) (refer to Pyclone-VI github for information on inputs to fit function (i.e. -o, -c, etc.)) Run each tumor sample individually (PyClone will eliminate any mutations that are not included in all samples if mutliple samples are passed) ``` pyclone-vi fit -i /input/path.tsv -o /desiredoutput/path.h5 -c 50 -g 100 -d beta-binomial -r 10 ``` Write final PyClone-VI file: ``` pyclone-vi write-results-file -i /input/path.h5 -o /desiredoutput/path.tsv ``` This .tsv output file will include: 1. mutation_id - Mutation identifier as used in the input file. 2. sample_id - Unique identifier for the sample as used in the input file. 3. cluster_id - Most probable cluster or clone the mutation was assigned to. 4. cellular_prevalence - Proportion of malignant cells with the mutation in the sample. This is also called cancer cell fraction (CCF) in the literature. 5. cellular_prevalence_std - Standard error of the cellular_prevalence estimate. 7. cluster_assignment_prob - Posterior probability the mutation is assigned to the cluster. This can be used as a confidence score to remove mutations with low probability of belonging to a cluster. For example scripts on automating the process of PyClone-VI submission, we can take a look at the following scripts: TCGA: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/TCGA_LAML_pyclone-lsf ``` DepMap: ``` /rsrch8/home/tdccct/rt/jereilly/TSR_PycloneVI/LSF-Scripts/depmapPyclone-lsf ``` PDX: *see above section on PDX, explains process* --- ## Allele-Specific Expression (ASE) with GATK ASEReadCounter ## Overview **What is Allele Specific Expression Analysis?** A technique utilized to determine the relative abundance of different alleles of a gene within a single individual. - Examines whether both copies of a gene are expressed equally or if there is bias toward one allele - Identifies *cis-regulatory variation*, where genetic differences in a gene's regulatory region influence its expression levels. ## Steps in the Pipeline ### 1. Heterozygous Site Identification *What does it do?* Identifies positions in the genome where an individual has 2 different alleles (one from each parents). - called from DNA or RNA-seq data *Why is it important?* Allele Specific Expression (ASE) can only be measured at heterozygous loci - because you need to distinguish maternal vs. paternal alleles ### 2. Phasing (Determining Haplotypes) *What does it do?* - Determines which alleles are inherited together (on the same chromosome) - Groups heterozygous sites into haplotype blocks *Why is it important?* - When multiple heterozygous sites fall within a gene, phasing allows you to say (for example): "All of these allele 1s came form the same parent - so this is haplotype 1" - Enables gene-level Allele Specific Expression, not just single-site ASE - Distinguishes mapping biases (where true ASE is not distinguished from technical artifacts) ### 3. ASE Quantification *What does it do?* - Counts how many RNA-seq reads support the reference allele vs. the alternate allele at heterozygous sites - Optionally, aggregates across multiple sites wihtin a gene using phased data *Why is it important?* - Allows you to measure imbalances in expression between two alleles - Normal, diploid expression is often 50/50, so deviations (identified here) may signal regulatory mutations, copy number changes, or epigenetic effects ## Some downloadable inputs to ASE: Use this google cloud bucket for downloads: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1&inv=1&invt=Ab4xYg GATK will require: 1. STAR genome directory. Use the following LSF file to generate this directory used as reference for STAR: ``` /rsrch8/home/tdccct/rt/jereilly/ASE_GATK/build_star_index.lsf ``` 3. Reference FASTA database: references_hg38_v0_Homo_sapiens_assembly38.fasta 2. Single nucleotide polymorphism database: Homo_sapiens_assembly38.dbsnp138.vcf.gz 3. Variant calling reference database: Mills_and_1000G_gold_standard.indels.hg38.vcf.gz *These references are for **hg38**, download hg19 if that is your required reference!* phASER will require: 1. Reference FASTA database: references_hg38_v0_Homo_sapiens_assembly38.fasta (same as GATK) 2. BED file for Gene Annotation data: gencode.v26.GRCh38.genes.chr.bed # Allele Specific Expression with GATK ASEReadCounter We ran GATK ASEReadCounter on our PDX cohort. For ASEReadCounter, we constructed a bash script to generate LSFs that run the entire GATK pipeline from RNAseq data. These are the steps ran: 1. Beginning w/ RNAseq fastqs, STAR aligner is ran 2. MarkDuplicates 3. MergeBAMAlignment 4. SplitNCigarReads 5. BaseRecalibrator 6. ApplyBSQR 7. HaplotypeCaller 8. VariantFiltration 9. SelectVariants 10. ASEReadCounter The bash script to construct LSFs with these steps is located here: ``` /rsrch8/home/tdccct/rt/jereilly/ASE_GATK/ase_pipeline_multi_sample.sh ``` An example script to be generated by this file is held here: ``` /rsrch8/home/tdccct/rt/jereilly/ASE_GATK/rnaseq_00001.lsf ``` This pipeline is completely compatible with raw fastq files from bulk RNAseq data. *to run phASER afterward, you will need to alter the script not to delete intermediate files after they are generated, we implemented this to save memory, but phASER requires some of these intermediate results. See next section to determine which files are required* # Allele Specific Expression with phASER **Important to note: Our phASER steps require output from the GATK pipeline, run ASEReadCounter pipeline first!** We also ran phASER for ASE analysis. This pipeline takes the following outputs from GATK (will need to not be removed in GATK pipeline): 1. GATK recalibrated BAM file (.recal.bam suffix) 2. GATK normalized VCF file (.normalized.vcf.gz suffix) With those outputs, we created a bash script to generate LSFs to run phASER Gene AE with the following steps: 1. picard ReplaceSAMHeader (for compatibility with phASER) 2. phASER snp-level ASE 3. phASER gene-level ASE The bash script to generate these LSFs can be located here: ``` /rsrch8/home/tdccct/rt/jereilly/ASE_phASER/phaser_pipeline.sh ``` Submitting the generated scripts should successfully run phASER ASE with the required inputs being met. ## Conclusion The above files should allow you to run both Tumor Subclonal Reconstruction with PyClone-VI and Allele Specific Expression Analysis with GATK's ASEReadCounter and phASER's GENE AE algorithm. All results will be contained in folders prepared for downstream analysis. ### PLEASE NOTE: YOU WILL NEED TO CHANGE THE SCRIPTS FROM WHAT IS ATTACHED ABOVE. File paths will need to be updated, as well as locations of inputs/outputs. Sample Locations (just for reference, idk): These are the PDX WES files (aggregated): ``` /rsrch8/home/iacs/jereilly/pycloneVI/wes_pipe/sample_aggregation/all_samples ``` Metadata for the above file to find cgc_run ID: ``` /rsrch8/home/iacs/jereilly/pycloneVI/wes_pipe/sample_aggregation/run_sample_metadata.tsv ``` This folder has some of the results of running pyclone and the cleaning scripts as well (on TCGA, DepMap, etc.): ``` /rsrch8/home/iacs/jereilly/pycloneVI ``` DepMap Data: ``` /rsrch8/home/iacs/jereilly/DepMapData ``` TCGA Data: ``` /rsrch8/home/iacs/jereilly/TCGA_Data ``` RNAseq samples from PDX: ``` /rsrch8/home/tdccct/rt/jereilly/pdx_omics/raw_files/rnaseq/samples ``` That should be all the important data :) -- James