# GA4GH API Demo on mahuika / ga-vl01

### htsget: PYTHON 3

```bash
## Log in to GA workshop machine
ssh ga-vl01
```

```bash
ml Python
python3 -m venv ~/venv/ga4gh
source ~/venv/ga4gh/bin/activate
pip install htsget

# copy and paste python code into htsgetTest.py (see below)
python htsgetTest.py
```

Code for `htsgetTest.py`:

```python
import htsget

url = "http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878"
with open("NA12878_2.bam", "wb") as output:
    htsget.get(url, output, reference_name="2", start=1000, end=20000)
```

### htsget: PYTHON 2.7.5

After logging in to ga-vl01:

```bash
# Install pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

# Install virtualenv
python -m pip install --user virtualenv

# Generate virtual environment for Python v2
python -m virtualenv ~/venv/ga4gh_p2

# Activate the Python2 environment
source ~/venv/ga4gh_p2/bin/activate

# Install htsget
pip install htsget

# Run code to retrieve BAM file
python htsgetTest.py
```

### htsget (cmd line) vcf file

```bash
module load BCFtools/1.9-GCC-7.4.0
module load ga4gh

# Subsetting a region from the vcf
htsget http://htsnexus.rnd.dnanex.us/v1/variants/1000genomes/20130502_autosomes --reference-name=12 --start=112204691 --end=112247789 -O my_vcf.vcf.gz

# Subsetting a region and extracting a sample from the vcf
# (also possible to omit '-O /dev/stdout')
htsget http://htsnexus.rnd.dnanex.us/v1/variants/1000genomes/20130502_autosomes --reference-name=12 --start=112204691 --end=112247789 -O /dev/stdout | bcftools view -s HG00096 -o HG00096_range.vcf.gz -O z
```

#### (cmd line) using SnpSift to extract specific fields

```shell=
htsget http://htsnexus.rnd.dnanex.us/v1/variants/1000genomes/20130502_autosomes \
  --reference-name=1 --start=112204 --end=1122477 -O /dev/stdout | \
  bgzip -d | \
  SnpSift extractFields - CHROM POS ID AF | head
```

The result:

```shell=
CHROM  POS     ID           AF
1      104912  rs555881142  1.99681E-4
1      104935  rs575831356  1.99681E-4
1      104971  rs541628624  1.99681E-4
1      105005  rs561611323  1.99681E-4
1      106027  rs566022480  0.00858626
1      106606  rs541390744  1.99681E-4
1      106820  rs564304387  1.99681E-4
1      106845  rs533395091  1.99681E-4
1      108030  rs550285297  0.0317492
```

#### (cmd line) Filter for certain SNPs?

```sh
# create a list of SNPs to filter in a txt file
tee -a my_rs.txt <<EOF
rs538998844
rs5745113
rs255292
EOF

# grab region, uncompress and pass to SnpSift filter
htsget http://htsnexus.rnd.dnanex.us/v1/variants/1000genomes/20130502_autosomes \
  --reference-name=5 --start=172082000 --end=172968480 -O /dev/stdout | \
  bgzip -d | \
  SnpSift filter --set my_rs.txt "ID in SET[0]" > filtered.vcf
```

... and if you want to combine the above:

```shell=
htsget http://htsnexus.rnd.dnanex.us/v1/variants/1000genomes/20130502_autosomes \
  --reference-name=5 --start=172082000 --end=172968480 -O /dev/stdout | \
  bgzip -d | \
  SnpSift filter --set my_rs.txt "ID in SET[0]" | \
  SnpSift extractFields - CHROM POS ID AF
```

result:

```shell=
CHROM  POS        ID           AF
5      172571422  rs538998844  5.99042E-4
5      172573652  rs5745113    0.00698882
5      172580866  rs255292     0.365815
```

#### htsget vcf file in python

```python
import htsget
import vcf  # PyVCF

url = "http://htsnexus.rnd.dnanex.us/v1/variants/1000genomes/20130502_autosomes"
with open("test.vcf.gz", "wb") as output:
    htsget.get(url, output, reference_name="12", start=112204691, end=112247789)

# read the retrieved region back in and print each record
vcfdata = vcf.Reader(filename="test.vcf.gz")
for record in vcfdata:
    print(record)
```

### Notes on R example (from Joep's slides)

On NeSI I used:

```
module load R/4.0.1-gimkl-2020a
```

to get R 4.0.1. A matching Bioconductor module isn't available, though.
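Since there is no matching Bioconductor module, the packages below are installed into a personal R library. If you want to control where that library lives, one option (the path here is only an example) is to point `R_LIBS_USER` at a writable directory before starting R:

```bash
# example only -- any writable directory will do
mkdir -p ~/R/ga4gh-bioc
export R_LIBS_USER=~/R/ga4gh-bioc
```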
```R
## Install needed Bioconductor packages:
install.packages('BiocManager')
BiocManager::install("GenomicRanges")
BiocManager::install("VariantAnnotation")
BiocManager::install("remotes")
BiocManager::install("Bioconductor/Rhtsget")
```

Load packages:

```R
library(GenomicRanges)
library(VariantAnnotation)
library(Rhtsget)
```

R script from slides (first part):

```R
gr <- GenomicRanges::GRanges("chr12:111766922-111817529")
sample_id <- "platinum/NA12878"
url <- "https://htsnexus.rnd.dnanex.us/v1"
fl <- tempfile(fileext=".bam")
bam <- htsget_reads(gr, sample_id, url, destination = fl)
```

On NeSI, the script dies with:

```
Error in curl::curl_fetch_memory(url, handle = handle) :
  OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to htsnexus.rnd.dnanex.us:443
```

On my laptop, it dies with:

```
Error in curl::curl_fetch_memory(url, handle = handle) :
  SSL certificate problem: certificate has expired
```

On my laptop I can get around this via (needs the `httr` package):

```R
bam <- httr::with_config(
  config = httr::config(ssl_verifypeer = FALSE),
  htsget_reads(gr, sample_id, url, destination = fl)
)
```

but this "fix" doesn't work on NeSI (I still get the same error as before).

The rest of the script runs fine on my laptop (with a similar "fix" as above for the `htsget_variants` call):

```R
gr <- GenomicRanges::GRanges("12:112204691-112247789")
sample_id <- "1000genomes/20130502_autosomes"
url <- "https://htsnexus.rnd.dnanex.us/v1"
fl <- tempfile(fileext=".vcf")

## Ignore expired SSL certificate "fix":
vcf <- httr::with_config(
  config = httr::config(ssl_verifypeer = FALSE),
  htsget_variants(gr, sample_id, url, destination = fl)
)

VariantAnnotation::readVcf(vcf)
```

## Section 1 (Joep will run this demo: port config for a single user)

### Running `ga4gh-example-data`

#### Terminal Session 1

1. Log into the corresponding VM via `ssh -L 8083:localhost:8083 mahuika`

2. Copy the `ga4gh-example-1` directory to `$HOME`, then download and extract the `ga4gh-example-data` set (which includes the associated `registry.db` file):

```
$ cp -r /nesi/nobackup/nesi02659/GA4GH/ga4gh-example-1 . && cd ga4gh-example-1
$ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data_4.6.tar && tar -xvf ga4gh-example-data_4.6.tar
```

3. Launch the Apache server:

```
$ export REGISTRY_PATH=~/ga4gh-example-1/registry
$ ./run_ga4gh
[Mon Aug 03 16:44:48.587634 2020] [mpm_prefork:notice] [pid 9509] AH00163: Apache/2.4.38 (Debian) PHP/7.4.8 mod_wsgi/4.6.5 Python/2.7 configured -- resuming normal operations
[Mon Aug 03 16:44:48.587766 2020] [core:notice] [pid 9509] AH00094: Command line: 'apache2 -D FOREGROUND'
```

4. We now have a server running in the foreground. When it receives requests, it prints log entries to the terminal. A summary of the server's configuration and data is available in HTML format at http://localhost:8083/ga4gh, which can be viewed in a web browser. Leave the server running and open another terminal to complete the rest of the demo.

#### Terminal Session 2

1. Open another terminal session and log into mahuika (please leave the first terminal running): `ssh mahuika`

2. Launch the ga4gh Singularity shell:

```
$ export REGISTRY_PATH=~/ga4gh-example-1/registry
$ cd ga4gh-example-1/ && ./run_ga4gh_shell
Singularity>
```

3. To try out the server, we must send some requests to it using the GA4GH protocol. One way in which we can do this is to manually create the JSON requests and send these to the server using cURL:

```
Singularity> curl -s -N --data '{}' --header 'Content-Type: application/json' http://localhost:8083/ga4gh/datasets/search | jq .
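# '{}' is an empty SearchDatasetsRequest (no search filters); jq pretty-prints the reply.
# If you only want particular fields, a jq filter can be added -- e.g. (assuming the
# response layout shown below) to list just the dataset names:
Singularity> curl -s -N --data '{}' --header 'Content-Type: application/json' http://localhost:8083/ga4gh/datasets/search | jq '.datasets[].name'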
```

In this example, we used the `search_datasets` method to ask the server for all the Datasets on the server. It responded by sending back some JSON, which we piped into the `jq` JSON processor to make it easier to read. We get the following result:

```
{
  "nextPageToken": "",
  "datasets": [
    {
      "info": {},
      "description": "Sample data from 1000 Genomes phase 3",
      "id": "WyIxa2ctcDMtc3Vic2V0Il0",
      "name": "1kg-p3-subset"
    }
  ]
}
```

4. In this example we sent a SearchDatasetsRequest object to the server and received a SearchDatasetsResponse object in return. This response object contained one Dataset object, held in the `datasets` array. This approach to interacting with the server is tedious and error prone, as we have to hand-craft the request objects. It is also quite inconvenient, as we may have to request many pages of objects to get all the objects that satisfy our search criteria. To simplify interacting with the server and to abstract away the low-level network details, we provide a client application. To try this out, we send the equivalent command using the client (from within the Singularity shell):

```
Singularity> ga4gh_client datasets-search http://localhost:8083/ga4gh
WyIxa2ctcDMtc3Vic2V0Il0  1kg-p3-subset
```

5. The output of this command is a summary of the Datasets that are present on the server. We can also get the output in JSON form such that each object is written on one line:

```
Singularity> ga4gh_client datasets-search -O json http://localhost:8083/ga4gh
```

```
{"info": {}, "description": "Sample data from 1000 Genomes phase 3", "id": "WyIxa2ctcDMtc3Vic2V0Il0", "name": "1kg-p3-subset"}
```

6. This format is quite useful for larger queries, and can be piped into `jq` to extract fields of interest, pretty print, and so on.

   * We can perform similar queries for variant data using the `search_variants` API call. First, we find the IDs of the VariantSets on the server using the `search_variant_sets` method:

```
Singularity> ga4gh_client variantsets-search http://localhost:8083/ga4gh
WyIxa2ctcDMtc3Vic2V0IiwidnMiLCJtdm5jYWxsIl0  mvncall
```

7. This tells us that we have one VariantSet on the server, with ID `WyIxa2ctcDMtc3Vic2V0IiwidnMiLCJtdm5jYWxsIl0` and name `mvncall`. We can then search for variants overlapping a given interval in a VariantSet as follows:

```
Singularity> ga4gh_client variants-search http://localhost:8083/ga4gh --referenceName=1 --start=45000 --end=50000
```

   * Although this should have generated a summary of the data in free-text form, a conflict in the pre-assigned relative paths in the `registry.db` file provided with the example dataset will cause an error (regenerating the registry using the repo CLI resolves this).

## Section 2

### Host the 1000 Genomes VCF

The GA4GH reference server uses a registry of files and URLs to populate its data repository. In this tutorial we will use the repo administrator command-line interface to create a registry similar to that used by 1kgenomes.ga4gh.org. Your system should have samtools installed, and at least 30 GB of free space to host the VCF and reference sets.

#### Repo administrator CLI

The CLI has methods for adding and removing Feature Sets, Read Group Sets, Variant Sets, etc. Before we can begin adding files, we must first initialize an empty registry database. The directory that this database is in should be readable and writable by the current user, as well as by the user running the server.
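The available subcommands can be listed with the CLI's built-in help (the exact set depends on the server version provided by the `ga4gh` module):

```
$ ml ga4gh
$ ga4gh_repo --help
$ ga4gh_repo add-dataset --help   # help for an individual subcommand
```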
Copy the source directory to `$HOME` or your current working directory:

```
$ cp -r /nesi/nobackup/nesi02659/GA4GH/ga4gh-example-2 ~
```

Initialise a new registry DB at `registry/registry.db`:

```
$ ml ga4gh && ga4gh_repo init registry/registry.db
```

**verify**

The `verify` command is used to check the integrity of the data in a repository. The command checks each container object in turn and ensures that it can read data from it. Read errors can occur for any number of reasons (for example, a VCF file may have been moved to another location since it was added to the registry), and the verify command allows an administrator to check that all is well in their repository.

```
$ ga4gh_repo verify registry/registry.db
```

Now add a dataset and provide a helpful description using the `--description` flag:

```
$ ga4gh_repo add-dataset registry/registry.db 1kgenomes --description "Variants from the 1000 Genomes project and GENCODE genes annotations"
```

#### Add a Reference Set

```
$ ga4gh_repo add-referenceset registry/registry.db registry/hs37d5.fa.gz \
  --description "NCBI37 assembly of the human genome" \
  --species '{"termId": "NCBI:9606", "term": "Homo sapiens"}' \
  --name NCBI37 \
  --sourceUri ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
```

#### Add an ontology

Ontologies provide a source for parsing variant annotations, as well as organizing feature types into ontology terms. A sequence ontology instance must be added to the repository to translate ontology term names in sequence and variant annotations to IDs. Sequence ontology definitions can be downloaded from the Sequence Ontology site.

```
$ ga4gh_repo add-ontology registry/registry.db registry/so-xp.obo -n so-xp
```

#### Add the 1000 Genomes VCFs

The 1000 Genomes "release" VCFs are publicly available on the EBI FTP server and would normally be downloaded with wget into a directory named `release`. Downloading the files takes about 5 minutes, so use the locally stored copy in `/nesi/nobackup/nesi02659/GA4GH/release/`:

```
$ ga4gh_repo add-variantset registry/registry.db 1kgenomes /nesi/nobackup/nesi02659/GA4GH/release/ --name phase3-release --referenceSetName NCBI37
```

#### Add a BAM as a Read Group Set

Read Group Sets are the logical equivalent of BAM files within the server. We will add a BAM hosted in the 1000 Genomes S3 bucket. We first download the index and then add it to the registry:

```
ga4gh_repo add-readgroupset registry/registry.db 1kgenomes \
  -I HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
  --referenceSetName NCBI37 \
  http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
```

Send some requests to the server using the GA4GH protocol:

```
$ ml purge && ml Singularity/3.5.2
$ singularity shell ga4gh_nesi.img
```

```
Singularity> curl --data '{}' --header 'Content-Type: application/json' http://localhost:8083/ga4gh/datasets/search | jq .
```
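If everything worked, the response should now include the `1kgenomes` dataset added above. The read group sets registered under that dataset can then be listed with a similar request (a sketch using the standard GA4GH `readgroupsets/search` route; replace the placeholder with the dataset `id` returned by the previous call):

```
# substitute the dataset id from the datasets/search response
Singularity> curl -s --data '{"datasetId": "<dataset id from the previous response>"}' \
    --header 'Content-Type: application/json' \
    http://localhost:8083/ga4gh/readgroupsets/search | jq .
```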