CELLECT S-LDSC pipeline

# Steps to run the CELLECT S-LDSC pipeline

## Step 1: Install packages and environments

### Step 1A: Clone CELLECT GitHub repo

> Comment: The `--recurse-submodules` flag is needed to clone the git submodule 'ldsc' (pascaltimshel/ldsc), which is a modified version of the original ldsc repository. Cloning the repo might take a few minutes because the CELLECT data files (> 1-3 GB) will be downloaded. To skip downloading the data files, use `GIT_LFS_SKIP_SMUDGE=1 git clone --recurse-submodules https://github.com/perslab/CELLECT.git` instead.

```
git clone --recurse-submodules https://github.com/perslab/CELLECT.git
# Use Git LFS to fetch the data files (run inside the cloned repo)
cd CELLECT
git lfs fetch
git lfs checkout
```

### Step 1B: Create a conda env for CELLECT

> Note: Create and install conda envs in an interactive session (`qrsh`) with more than 4G of memory.

```
conda create --prefix /broad/mcl/members_dir/mmurali/envs/cellect_env
conda activate /broad/mcl/members_dir/mmurali/envs/cellect_env
```

### Step 1C: Install snakemake via conda

```
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
# To use snakemake, activate its environment
conda activate /broad/mcl/members_dir/mmurali/envs/cellect_env/envs/snakemake
```

### Step 1D: Clone CELLEX GitHub repo

```
git clone https://github.com/perslab/CELLEX.git --branch develop --single-branch
cd CELLEX
```

### Step 1E: Create a conda env for CELLEX

> Note: Create and install conda envs in an interactive session (`qrsh`) with more than 4G of memory.

```
conda create --prefix /broad/mcl/members_dir/mmurali/envs/cellex_env
conda activate /broad/mcl/members_dir/mmurali/envs/cellex_env
```

### Step 1F: Install from source using pip

```sh
# Alternatively, install the latest version from PyPI
pip install cellex
# Error: AttributeError: module 'numpy' has no attribute 'float'.
# Fix for the numpy error: pin the Python and NumPy versions
conda install python=3.11
python -m pip install numpy==1.23.3
# Check in python3 that the imports work
python3 -c "import pandas as pd; import numpy as np; import cellex"
```

---

## Step 2: Follow the CELLECT LDSC Tutorial ([link](https://github.com/perslab/CELLECT/wiki/CELLECT-LDSC-Tutorial))

### Step 2A: Create conda environment for munging

```
cd ~/CELLECT
conda env create -f ldsc/environment_munge_ldsc.yml --prefix /broad/mcl/members_dir/mmurali/envs/munge_ldsc
conda activate /broad/mcl/members_dir/mmurali/envs/munge_ldsc
```

### Step 2B: Download GWAS sumstats

> BMI GWAS from [Yengo (HMG, 2018)](https://academic.oup.com/hmg/article/27/20/3641/5067845?login=true) and Educational Attainment GWAS from [Lee (Nat. Gen., 2018)](https://www.nature.com/articles/s41588-018-0147-3).

```
wget https://portals.broadinstitute.org/collaboration/giant/images/c/c8/Meta-analysis_Locke_et_al%2BUKBiobank_2018_UPDATED.txt.gz -P example/
wget https://www.dropbox.com/s/ho58e9jmytmpaf8/GWAS_EA_excl23andMe.txt -P example/
```

### Step 2C: Munge the GWAS sumstats

> Note: Re-download `w_hm3.snplist`, because the file does not get downloaded with the repo due to the size limit.

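
If the size limit in the note comes from Git LFS (an assumption; Step 1A fetches the data files through Git LFS), a file that was not downloaded is left in place as a small text pointer whose first line is `version https://git-lfs.github.com/spec/v1`. The sketch below checks whether `data/ldsc/w_hm3.snplist` is such a pointer before munging. The helper names `is_lfs_pointer` and `describe_snplist` are hypothetical and not part of CELLECT.

```python
from pathlib import Path

# First line of every Git LFS pointer file (per the Git LFS pointer spec).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def describe_snplist(path):
    """Classify the file as 'missing', 'lfs-pointer', or 'present'.

    Hypothetical helper: 'present' only means the file is not an LFS
    pointer; it does not validate the SNP list contents.
    """
    path = Path(path)
    if not path.exists():
        return "missing"
    if is_lfs_pointer(path):
        return "lfs-pointer"
    return "present"


# Path taken from the munge commands in this step; run from the CELLECT directory.
print(describe_snplist("data/ldsc/w_hm3.snplist"))
```

If this prints `missing` or `lfs-pointer`, re-download the file before running `mtag_munge.py`.
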
```
python ldsc/mtag_munge.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--n-value 766345 \
--keep-pval \
--p PVAL \
--out example/EA3_Lee2018

python ldsc/mtag_munge.py \
--sumstats example/Meta-analysis_Locke_et_al+UKBiobank_2018_UPDATED.txt.gz \
--a1 Tested_Allele \
--a2 Other_Allele \
--merge-alleles data/ldsc/w_hm3.snplist \
--keep-pval \
--p PVAL \
--out example/BMI_Yengo2018
```

### Step 2D: Generate cell-type specificity input using CELLEX

> If you get an error with numpy, run the following: `python -m pip install numpy==1.22.4`

```
# In the shell: deactivate munge_ldsc
conda deactivate
# Activate cellex_env
conda activate /broad/mcl/members_dir/mmurali/envs/cellex_env

# Then, in a python3 session:
import numpy as np
import pandas as pd
import cellex

# Load input data and metadata
mousebrain_sc_rnaseq_data = pd.read_csv("abc.csv", index_col=0)
celltype_labels = pd.read_csv("xyz.csv", index_col=0)

# Create ESObject and compute ESmu
eso = cellex.ESObject(data=mousebrain_sc_rnaseq_data, annotation=celltype_labels, verbose=True)
eso.compute(verbose=True)

# View Expression Specificity scores and save them
eso.results["esmu"]
eso.results["esmu"].to_csv("mousebrain-test.csv.gz")
```

### Step 2E: Run CELLECT-LDSC

```
# Create and activate the snakemake env
conda create --prefix /broad/mcl/members_dir/mmurali/envs/snakemake
conda activate /broad/mcl/members_dir/mmurali/envs/snakemake
conda config --add channels bioconda
conda config --add channels conda-forge
conda install snakemake

# Within the CELLECT directory
snakemake --use-conda -s cellect-ldsc.snakefile --configfile config.yml --cores 20 -j 50 --conda-frontend conda
```

Update: This ran but did not produce any output.

## Citations:

1) Finucane (Nature Genetics, 2015): Partitioning heritability by functional annotation using genome-wide association summary statistics
2) Timshel (eLife, 2020): Genetic mapping of etiologic brain cell types for obesity