# JGI-NCBI Workshop - September 1, 2022

The Department of Energy Joint Genome Institute ([DOE JGI](https://jgi.doe.gov)) and the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) collaborated on a workshop to train the community in cloud computing and petabyte-scale sequence search. This effort was supported by the DOE JGI, NCBI, and Berkeley Lab IT. The workshop took place as part of the [DOE JGI's User Meeting](https://usermeeting.jgi.doe.gov/).

# Workshop Goals

There is an overwhelming amount of data in SRA - close to 18 petabytes! There are many different questions we could ask of these data if we had the right fast, efficient tools. The goal of the workshop was to introduce participants to Elastic-BLAST, sourmash, and mastiff by running these tools in a cloud computing environment. This workshop complements the talks given by [Tom Madden](https://youtu.be/num3BLnyj14) and [Luiz Irber](https://youtu.be/oWJVEkyHal0) at the 2022 JGI User Meeting. Thirteen participants completed the workshop activity. The workshop organizers will make the protocol available as a standalone activity for the community.

## Lessons learned

There is value in developing standalone resources that lower the barrier to using new computing resources like the cloud. Our team can build additional training materials and post them on a single site for easy access and reuse.

Associating the workshop with another event is a good way to get participants to sign up. Materials need to be sent to participants ahead of time. We had all the infrastructure in place - accounts, the proper image, and the protocol for participants to follow. Some folks who had signed up for the workshop did not show up, for a few reasons: the Zoom coordinates were not shared early enough, some attendees had been exposed to COVID, and the building was locked.

When considering a focus for the next event, we should do a round or two of user research to determine what would be of most interest or use to the scientific community. Some outreach is still required to attract participants with other areas of expertise (e.g. math, computer science, statistics).

# Workshop Activity

During the workshop the participants used the following instructions as a guide to using AWS to analyze data for 2,631 draft genomes (TOBG-GENOMES.tar.gz) generated from the Tara Oceans microbial metagenomic data sets (requires a link to the study).

## Creating the proper image in AWS

This [repository](https://github.com/lbnl-science-it/jgi-user-meeting-aws-ec2) provides the instructions and details for setting up your own EC2 instance so you can work through these exercises. The cost to run through the protocol is ~$50 per user.

## Computing and Storage

For this workshop we've set up an EC2 instance in AWS for each participant. Each user's instance has 200 GB of storage in `/home/ubuntu` and 237 GB of SSD in `/home/ubuntu/nvme_ssd`.

You will also set up an S3 bucket for the outputs from your Elastic-BLAST runs. These buckets can persist after the computing instances are shut down.

## Setting up the environment

Set up your environment with the following command:

`source .elb-venv/bin/activate`

Check your `$PATH` and environment variables.
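To confirm everything is in place, a quick check like the one below can help (a sketch, not part of the original protocol; the exact variables defined on your workshop instance may differ):

```bash
# Quick environment check (illustrative; adjust to your own instance setup).
echo $PATH            # should include the .elb-venv/bin directory
which elastic-blast   # typically resolves inside the virtual environment
env | grep S3         # shows S3-related variables such as S3URL, if defined
```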
The S3 bucket (storage) we will use for this workshop is stored in the `$S3URL` environment variable:

`echo $S3URL`

You also need to create your own bucket, using the AWS command line tools (already installed), to store the output from Elastic-BLAST. Replace `USERNAME` with your username.

`aws s3 mb s3://elasticblast-USERNAME`

`MYURL=s3://elasticblast-USERNAME`

`echo $MYURL`

Check the version of Elastic-BLAST to verify it is installed properly:

`elastic-blast --version`

It should return 0.2.7.

## Downloading the data

Before we can run anything, we need to download some data and make it accessible to our environment. We are going to work with data from the Tara Oceans microbial metagenomic data sets (metagenome-assembled genomes, or MAGs). These datasets can all be found on Figshare, a great way to get a persistent URL to data for reuse.

The following commands copy the data from Figshare to your local drive `/home/ubuntu` and then extract the data into a usable form.

`wget -O TOBG-GENOMES.tar.gz https://figshare.com/ndownloader/files/8849371`

`tar xf TOBG-GENOMES.tar.gz TOBG_NP-110.fna`

## Where do these MAGs show up in SRA?

The next step is to execute Mastiff -- this software looks for matches from the Tara Oceans dataset against all SRA metagenomes. Mastiff is a tool in the sourmash suite developed in the lab of C. Titus Brown at UC Davis. More information and tools can be found here - https://github.com/sourmash-bio/

`curl -o mastiff -L https://github.com/sourmash-bio/mastiff/releases/latest/download/mastiff-client-x86_64-unknown-linux-musl`

`chmod +x mastiff`

`./mastiff -o matches.csv TOBG_NP-110.fna`

Wasn't that fast?? If you want to check out the content of the matches:

`cat matches.csv | more`

The results are a set of accessions and the corresponding containment scores. For example:

`SRR5506584,0.9992119779353822,TOBG_NP-110.fna`

`SRR6056557,0.04018912529550828,TOBG_NP-110.fna`

The first hit has 99.9% containment and the second has only 4% containment. Depending on what you want to do, filtering the results may be sensible: the more accessions you feed into Elastic-BLAST, the longer it will take to return results. For a quicker experiment, you can filter out hits with less than 99% containment:

`cat matches.csv | sed 1d | awk -F , '{ if ($2 > 0.99) print $1 }' > extract.csv`

## Validating the results

Sourmash was used to build an index for SRA, and Mastiff lets you query that index very quickly and returns accessions. We'd like to check the validity of these results with the more time- and resource-consuming Elastic-BLAST software. By running Elastic-BLAST on a subset of the results, we're able to explore the data much faster... which we hope enables you to think of new ways to interrogate the data!

### Create a BLAST Database

We need a database to use with Elastic-BLAST. Create a directory to store the data:

`mkdir -p data && cd data`

`cp $HOME/TOBG_NP-110.fna .`

Create a database from the Tara Oceans MAGs:

`makeblastdb -in TOBG_NP-110.fna -input_type fasta -dbtype nucl -parse_seqids`

We also need to create a metadata file for Elastic-BLAST:

`create-blastdb-metadata.py --db TOBG_NP-110.fna --dbtype nucl --pretty`

To check out the content of this file:

`cat TOBG_NP-110.fna-nucl-metadata.json`

For information on why Elastic-BLAST needs this information, please see the [NCBI tutorial](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/tutorials/create-blastdb-metadata.html#tutorial-create-blastdb-metadata).
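As an optional sanity check (not part of the original protocol), `blastdbcmd`, which ships in the same BLAST+ package as `makeblastdb`, can summarize the database you just built:

```bash
# Optional: print a summary of the newly built BLAST database
# (number of sequences, total length, creation date). Run from the data/ directory.
blastdbcmd -db TOBG_NP-110.fna -info
```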
Elastic-BLAST needs our data to be accessible in an S3 bucket, so we'll copy the data over with the AWS command line tools (don't forget to change the path to your bucket!):

`aws s3 cp TOBG_NP-110.fna-nucl-metadata.json s3://elasticblast-USERNAME/custom_blastdb/`

`for f in TOBG_NP-110.fna.n* ; do aws s3 cp $f s3://elasticblast-USERNAME/custom_blastdb/; done`

## Downloading SRA data for validation

Now that we have a set of accessions from the matches, we're going to download the corresponding data from the SRA instance in the cloud using a tool called `fasterq-dump`.

Create a directory for the SRA data, for example:

`mkdir sra-cache`

We found that the fastest way to get the SRA data into the S3 bucket associated with the EC2 instance was to transfer it to the node's local storage first and then to the S3 bucket.

Copy the data to the node's storage (change `/path_to_sra-cache` to the directory you created above):

`aws s3 cp s3://sra-pub-run-odp/sra/$srr_id/$srr_id --no-sign-request /path_to_sra-cache`

Copy the data to the S3 bucket (change the paths again):

`fasterq-dump -e 32 --fasta-unsorted --skip-technical -t /path_to_sra-cache /path_to_sra-cache/$srr_id -Z | aws s3 cp - s3://elasticblast-USERNAME/queries/${srr_id}.fa --only-show-errors`

### This is a time-consuming process!!

Priyanka copied all the queries to a public S3 bucket for the workshop. You can access them here:

`s3://elasticblast-pghosh2/queries/`

Substitute that path into the commands below. Do NOT use this as your output path or the jobs will fail.

## Running Elastic-BLAST

This next instruction relies on GNU parallel to run more than one search at a time.

First, check that the parameters are set properly:

`parallel -t --tag -q -a extract.csv elastic-blast submit --dry-run --query s3://elasticblast-USERNAME/queries/{}.fa --db s3://elasticblast-USERNAME/custom_blastdb/TOBG_NP-110.fna --program blastn --num-nodes 8 --batch-len 1000000000 --results s3://elasticblast-USERNAME/results/results_parallel/output_{} -- -task megablast -word_size 28 -evalue 0.00001 -max_target_seqs 10000000 -perc_identity 97 -outfmt "6 std qlen slen qcovs" -mt_mode 1`

Do you see anything in the commands that doesn't look quite right? After fixing the error, submit the searches, running at most two at a time:

`parallel -t --jobs 2 --tag -q -a extract.csv elastic-blast submit --query s3://elasticblast-USERNAME/queries/{}.fa --db s3://elasticblast-USERNAME/custom_blastdb/TOBG_NP-110.fna --program blastn --num-nodes 8 --batch-len 1000000000 --results s3://elasticblast-USERNAME/results/results_parallel/output_{} -- -task megablast -word_size 28 -evalue 0.00001 -max_target_seqs 10000000 -perc_identity 97 -outfmt "6 std qlen slen qcovs" -mt_mode 1`

Monitor the progress:

`elastic-blast status --results $MYURL/results/results_parallel/output_SRR5506583`

Once you see a status of "your ElasticBLAST search succeeded", you can download the search results from your S3 bucket to your local computer:

`export YOUR_RESULTS_BUCKET=$MYURL/results/results_parallel/output_SRR5506583`

`aws s3 cp ${YOUR_RESULTS_BUCKET}/ . --exclude "*" --include "*.out.gz" --recursive`

To get a summary of your Elastic-BLAST output, run the following:

`elastic-blast run-summary --results s3://elasticblast-USERNAME/results/results_parallel/output_SRR5506583`
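If you submitted several accessions with GNU parallel, a small loop like the sketch below (not part of the original protocol; replace `USERNAME` with your own bucket name) checks the status of each search in turn:

```bash
# Illustrative loop: poll the status of every Elastic-BLAST search
# submitted from extract.csv. Replace USERNAME with your bucket name.
while read -r srr_id; do
  elastic-blast status --results s3://elasticblast-USERNAME/results/results_parallel/output_${srr_id}
done < extract.csv
```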
## Running in Serial

You can also run Elastic-BLAST in serial, which may be preferable if you only have one or two queries to run.

`parallel -t --tag -q elastic-blast submit --query s3://elasticblast-USERNAME/queries/{}.fa --db s3://elasticblast-USERNAME/custom_blastdb/TOBG_NP-110.fna --program blastn --num-nodes 8 --batch-len 1000000000 --results s3://elasticblast-USERNAME/results/results_parallel/output_{} -- -task megablast -word_size 28 -evalue 0.00001 -max_target_seqs 10000000 -perc_identity 97 -outfmt "6 std qlen slen qcovs" -mt_mode 1 ::: SRR5506583`

## Cleaning up your cloud instance

You can delete results from your S3 bucket with the following command. Modify it to include the proper path to the output you'd like to delete.

`elastic-blast delete --results s3://elasticblast-USERNAME/results/results_parallel/output_SRR5506583`

For more information on running Elastic-BLAST on the command line, check out [this resource](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/tutorials/tutorial-cli.html#tutorial-cli).
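Finally, if you also want to remove the custom database and query files you copied into your bucket earlier, an optional extra cleanup step (not part of the original protocol) with the standard AWS CLI looks like this:

```bash
# Optional cleanup: remove the custom BLAST database and query files copied to S3
# earlier. Double-check the paths before running; deletion is not reversible.
aws s3 rm s3://elasticblast-USERNAME/custom_blastdb/ --recursive
aws s3 rm s3://elasticblast-USERNAME/queries/ --recursive
```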