Biosurfer Analysis

# Biosurfer Analysis ## Analysis accompanying the manuscript *"Biosurfer: Connecting genomic, transcriptomic, and proteomic information layers to track mechanisms of protein isoform variation"* This repository contains steps to run Biosurfer to reproduces the results, summary plots, and figures for the Biosurfer manuscript ([bioRxiv]()). > Note: Here please be aware that the input files downloaded and the databases created by Biosurfer will reside within the Biosurfer directory. These can be large (~3 GB) for the full analysis contained here. **Contents** 1. [Download Biosurfer analysis repository](#download-biosurfer-analysis) 2. [Download and install Biosurfer package](#download-and-install-biosurfer) 3. [Download input data](#download-input-data) 4. [Run Biosurfer modules](#run-biosurfer-modules) a. [Load database](#load-database) b. [Run hybrid alignment](#run-hybrid-alignment) c. [Visualize protein isoforms](#visualize-protein-isoforms) 5. [Global characterization of altered protein regions in the human annotation (GENCODE)](#post-processing) a. [Altered protein regions across the human proteome](#genome-wide-summary) b. [Analysis of alternative splicing events that alter the N-terminus of proteins](#n-term) c. [Characterization of splicing patterns underlying internal protein region differences](#internal-region) d. [Analyzing splicing patterns for C-terminal alterations](#c-term) <a id="download-biosurfer-analysis"></a> ## 1. Download Biosurfer analysis repository You can use the latest version from the source code. ``` git clone https://github.com/sheynkman-lab/biosurfer_analysis cd biosurfer_analysis ``` <a id="download-and-install-biosurfer"></a> ## 2. Download and install Biosurfer package #### Create the conda environment for Biosurfer via terminal ``` conda create --name biosurfer-install --channel conda-forge python=3 pip ``` #### Activate the conda environment: ``` conda activate biosurfer-install conda install --channel conda-forge graph-tool ``` #### Clone Biosurfer repository ``` git clone https://github.com/sheynkman-lab/biosurfer.git ``` > Note: The Biosurfer package will be downloaded within the `biosurfer-analysis` directory. #### Run setup The editable installation of Biosurfer package looks for the `setup.py` within biosurfer directory and installs the Biosurfer package within the conda env. ``` pip install --editable biosurfer ``` > Note: if you get a `importlib.metadata.PackageNotFoundError` error, please deactivate and then activate the conda env again --- <a id="download-input-data"></a> ## 3. Download input data The below script would download the following data from [Zenodo](https://zenodo.org/record/7297008): 1. **GENCODE toy**: * Description: Test dataset generated from GENCODE v38 * Use: This dataset can be used to test the functionality and modules of Biosurfer. * Size: 4.2 MB 2. **GENCODE v42**: * Description: It contains the basic gene annotation on the primary assembly sequence regions * Use: Used for the analyses conducted in the manuscript * Size: 1.29 GB 3. **WTC11**: * Description: WTC11 is long-read RNA-seq data from a human induced pluripotent stem cells (iPSC) ([Kreitzer et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3708511/)) * Use: Used for the analyses conducted in the manuscript. * Size: 644 MB ``` for source in gencode_toy gencode_v42 wtc11 do bash "./scripts/download_$source.sh" done ``` > Note: Any GENCODE version can be used with the appropriate GTF, transcript FASTA, and translation FASTA files. --- <a id="run-biosurfer-modules"></a> ## 4. Run Biosurfer modules For more information on the modules, refer to Biosurfer package repo ([here](https://github.com/sheynkman-lab/biosurfer#usage)) <a id="load-database"></a> ### a. Load database Running the load database module creates a SQLite database file in the `biosurfer/databases/` directory. #### **GENCODE toy** ``` biosurfer load_db \ --source=GENCODE \ --gtf A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.gtf \ --tx_fasta A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.transcripts.fa \ --tl_fasta A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.translations.fa \ -d gencode_toy ``` #### **GENCODE v42** ``` biosurfer load_db \ --source=GENCODE \ --gtf A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf \ --tx_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa \ --tl_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa \ -d gencode_v42 ``` #### **WTC11** Load the GENCODE v42 GTF annotations first to set the reference isoforms for WTC11 PacBio data ``` biosurfer load_db \ --source=GENCODE \ --gtf A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf \ --tx_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa \ --tl_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa \ -d wtc11 ``` Load the WTC11 PacBio data ``` biosurfer load_db \ --source=PacBio \ --gtf A_wtc11/biosurfer_wtc11_data/wtc11_with_cds.gtf \ --tx_fasta A_wtc11/biosurfer_wtc11_data/wtc11_corrected.fasta \ --tl_fasta A_wtc11/biosurfer_wtc11_data/wtc11_orf_refined.fasta \ --sqanti A_wtc11/biosurfer_wtc11_data/wtc11_classification.txt \ -d wtc11 ``` --- <a id="run-hybrid-alignment"></a> ### b. Run hybrid alignment > Note: Running this step could take some time(~30 mins) depending on the size of the input data. #### **GENCODE toy** ``` mkdir B_hybrid_aln_results_toy biosurfer hybrid_alignment \ -d gencode_toy \ -o B_hybrid_aln_results_toy \ --gencode ``` #### **GENCODE v42** ``` mkdir B_hybrid_aln_gencode_v42 biosurfer hybrid_alignment \ -d gencode_v42 \ -o B_hybrid_aln_gencode_v42 \ --gencode ``` #### **WTC11** ``` mkdir B_hybrid_aln_wtc11 biosurfer hybrid_alignment \ -d wtc11 \ -o B_hybrid_aln_wtc11 ``` --- <a id="visualize-protein-isoforms"></a> ### c. Visualize protein isoforms The script below invokes the plotting module for the *CRYBG2* gene and outputs a PNG file. Users can alter the below script to view protein isoforms of any gene they desire. ``` bash ./scripts/isoform_plotting.sh ``` --- <a id="post-processing"></a> ## 5. Global characterization of altered protein regions in the human annotation (GENCODE) The following steps reproduces the results for GENCODE v42. #### Install required libraries ``` pip install ipykernel xlsxwriter openpyxl plotly ``` <a id="genome-wide-summary"></a> ### a. Altered protein regions across the human proteome Genome-wide analysis of protein isoforms in the GENCODE annotation/WTC11 ``` python3 ./scripts/genome_wide_summary.py ``` <a id="n-term"></a> ### b. Analysis of alternative splicing events that alter the N-terminus of proteins ``` python3 ./scripts/n_termini_summary.py ``` <a id="internal-region"></a> ### c. Characterization of splicing patterns underlying internal protein region differences ``` python3 ./scripts/internal_summary.py ``` <a id="c-term"></a> ### d. Analyzing splicing patterns for C-terminal alterations ``` python3 ./scripts/c_termini_summary.py ``` To reproduce the results for for WTC11: in [`plot_config.py`](https://github.com/sheynkman-lab/biosurfer_analysis/blob/main/scripts/plot_config.py) comment [`line 76`](https://github.com/sheynkman-lab/biosurfer_analysis/blob/84a32406ee70e0fea686a19be5b54599c21e5189/scripts/plot_config.py#L76) and uncomment [`line 78`](https://github.com/sheynkman-lab/biosurfer_analysis/blob/84a32406ee70e0fea686a19be5b54599c21e5189/scripts/plot_config.py#L78)