# JGI-NCBI Workshop - September 1, 2022
The Department of Energy Joint Genome Institute ([DOE JGI](https://jgi.doe.gov)) and the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) collaborated on a workshop to train the community in cloud computing and petabyte-scale sequence search. This effort was supported by the DOE JGI, NCBI, and Berkeley Lab IT. The workshop took place as part of the [DOE JGI's User Meeting](https://usermeeting.jgi.doe.gov/).
# Workshop Goals
There is an overwhelming amount of data in SRA - close to 18 petabytes! There are many different questions we could ask of these data if we had fast and efficient tools to do so.
The goal of the workshop was to introduce participants to Elastic-BLAST, sourmash, and mastiff by running these tools in a cloud computing environment. The workshop complements the talks given by [Tom Madden](https://youtu.be/num3BLnyj14) and [Luiz Irber](https://youtu.be/oWJVEkyHal0) at the 2022 JGI User Meeting.
Thirteen participants completed the workshop activity. The workshop organizers will make the protocol available as a standalone activity for the community.
## Lessons learned
There is value in developing standalone resources that help lower the barrier to using new computing resources like the cloud. Our team can build additional training materials and post them on a site for easy access and reuse.
Associating the workshop with a larger event is a good way to get participants to sign up.
Materials need to be sent to participants ahead of time. We had all infrastructure in place - accounts, the proper image, and the protocol for participants to follow. Some folks who had signed up for the workshop did not show up for a few reasons: the Zoom coordinates were not shared early enough, some attendees had been exposed to COVID, and the building was locked.
When considering a focus for the next event, we should do a round or two of user research to determine what might be of most interest or use to the scientific community. Some outreach is still required to attract participants with other areas of expertise (e.g. math, computer science, statistics).
# Workshop Activity
During the workshop, participants used the following instructions as a guide to using AWS to analyze data for 2,631 draft genomes (TOBG-GENOMES.tar.gz) generated from the Tara Oceans microbial metagenomic data sets (requires a link to the study).
## Creating the proper image in AWS
This [repository](https://github.com/lbnl-science-it/jgi-user-meeting-aws-ec2) provides the instructions and details for setting up your own EC2 instance so you can work through these exercises. The cost to run through the protocol is ~$50 per user.
## Computing and Storage
For this workshop we've set up an EC2 instance in AWS for each participant.
Each user's instance has 200 GB of storage in `/home/ubuntu` and an additional 237 GB of SSD in `/home/ubuntu/nvme_ssd`.
You will also set up an S3 bucket for the outputs from your Elastic-BLAST runs. These buckets persist even after the computing instances are shut down.
## Setting up the environment
You will set up your environment with the following command:
`source .elb-venv/bin/activate`
Check your `$PATH` and environment variables. The S3 bucket (storage) we will use for this workshop is stored in the `$S3URL` variable:
`echo $S3URL`
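For example, one quick way to confirm the Elastic-BLAST executable is on your `$PATH` (an optional check, not part of the original protocol):
`which elastic-blast`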
You also need to create your own bucket, using the AWS command line tools (already installed), to store the output from Elastic-BLAST.
Replace `USERNAME` with your username.
`aws s3 mb s3://elasticblast-USERNAME`
`MYURL=s3://elasticblast-USERNAME`
`echo $MYURL`
Check the version of Elastic-BLAST to verify it is installed properly:
`elastic-blast --version`
This should return `0.2.7`.
## Downloading the data
Before we can run anything, we need to download some data and make it accessible to our environment. We are going to work with data from the Tara Oceans microbial metagenomic data sets (metagenome-assembled genomes, or MAGs). These datasets can all be found on Figshare, a great way to get a persistent URL to data for reuse. The following commands copy the archive from Figshare to your local drive `/home/ubuntu` and then extract a single MAG (TOBG_NP-110.fna) to work with.
`wget -O TOBG-GENOMES.tar.gz https://figshare.com/ndownloader/files/8849371`
`tar xf TOBG-GENOMES.tar.gz TOBG_NP-110.fna`
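If you are curious about the rest of the archive, you can list its contents before extracting (an optional check):
`tar tzf TOBG-GENOMES.tar.gz | head`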
## Where do these MAGs show up in SRA?
The next step is to execute Mastiff -- this software will look for matches from the Tara Oceans dataset against all SRA metagenomes. Mastiff is a tool in the sourmash suite developed in the lab of C. Titus Brown at UC Davis. More information and tools can be found here - https://github.com/sourmash-bio/
`curl -o mastiff -L https://github.com/sourmash-bio/mastiff/releases/latest/download/mastiff-client-x86_64-unknown-linux-musl`
`chmod +x mastiff`
`./mastiff -o matches.csv TOBG_NP-110.fna`
Wasn't that fast??
If you want to check out the content of the matches:
`cat matches.csv | more`
The results are a set of accessions and the corresponding containment scores. For example,
`SRR5506584,0.9992119779353822,TOBG_NP-110.fna`
`SRR6056557,0.04018912529550828,TOBG_NP-110.fna`
The first hit has 99.9% containment and the second example has only 4% containment. Depending on what you want to do, filtering the results may be sensible.
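To get a quick sense of the range of containment scores, you can sort the matches by the second column (a small convenience sketch, assuming `matches.csv` has a header row as in the filtering step below):

```bash
# skip the header, sort by containment (descending), and show the top hits
tail -n +2 matches.csv | sort -t, -k2,2 -gr | head
```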
The more accessions you feed into Elastic-BLAST, the longer it will take to return results. For a quicker experiment, you can filter out hits with less than 99% containment.
`sed 1d matches.csv | awk -F , '{ if ($2 > 0.99) print $1 }' > extract.csv`
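To see how many accessions passed the filter (a quick sanity check):
`wc -l extract.csv`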
## Validating the results
Sourmash was used to build an index for SRA. Mastiff lets you query that index very quickly and returns accessions. We'd like to check the validity of these results with the more time and resource consuming Elastic-BLAST software. By running Elastic-BLAST on a subset of the results, we're able to explore the data much faster... which we hope enables you to think of new ways to interrogate the data!
### Create a BLAST Database
We need a database to use with Elastic-BLAST.
Create a directory to store the data:
`mkdir -p data && cd data`
`cp $HOME/TOBG_NP-110.fna .`
Create a database from the Tara Oceans MAGs:
`makeblastdb -in TOBG_NP-110.fna -input_type fasta -dbtype nucl -parse_seqids`
We also need to create a metadata file for Elastic-BLAST:
`create-blastdb-metadata.py --db TOBG_NP-110.fna --dbtype nucl --pretty`
To check out the content of this file:
`cat TOBG_NP-110.fna-nucl-metadata.json`
For information on why Elastic-BLAST needs this information, please see the [NCBI tutorial](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/tutorials/create-blastdb-metadata.html#tutorial-create-blastdb-metadata).
Elastic-BLAST needs our data to be accessible in an S3 bucket, so we'll copy the data over with the AWS command line tools (don't forget to change the path to your bucket!):
`aws s3 cp TOBG_NP-110.fna-nucl-metadata.json s3://elasticblast-USERNAME/custom_blastdb/`
`for f in TOBG_NP-110.fna.n* ; do aws s3 cp $f s3://elasticblast-USERNAME/custom_blastdb/; done`
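To confirm the database files made it into your bucket (an optional check):
`aws s3 ls s3://elasticblast-USERNAME/custom_blastdb/`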
## Downloading SRA data for validation
Now that we have a set of accessions from the matches, we're going to download those data from the SRA instance in the cloud using a tool called `fasterq-dump`.
Create a directory for the SRA data, for example:
`mkdir sra-cache`
We found the fastest way to get the SRA data into the S3 bucket associated with the EC2 instance was to transfer locally to the node and then to the S3 bucket.
Copying data to the node's storage (change the /path_to_sra-cache to the directory you created above):
`aws s3 cp s3://sra-pub-run-odp/sra/$srr_id/$srr_id --no-sign-request /path_to_sra-cache`
Copying the data to the S3 bucket (change the paths again):
`fasterq-dump -e 32 --fasta-unsorted --skip-technical -t /path_to_sra-cache /path_to_sra-cache/$srr_id -Z | aws s3 cp - s3://elasticblast-USERNAME/queries/${srr_id}.fa --only-show-errors`
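Putting the two steps together, here is a minimal sketch that loops over the accessions in `extract.csv`; it assumes the `sra-cache` directory created above and your own bucket name in place of `USERNAME`:

```bash
# For each accession: download the run from the SRA Open Data Program bucket,
# convert it to FASTA with fasterq-dump, and stream the result to your S3 bucket.
while read -r srr_id; do
    aws s3 cp "s3://sra-pub-run-odp/sra/${srr_id}/${srr_id}" --no-sign-request sra-cache/
    fasterq-dump -e 32 --fasta-unsorted --skip-technical -t sra-cache "sra-cache/${srr_id}" -Z \
        | aws s3 cp - "s3://elasticblast-USERNAME/queries/${srr_id}.fa" --only-show-errors
done < extract.csv
```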
### This is a time-consuming process!!
Priyanka copied all the queries to a public S3 bucket for the workshop. You can access that here:
`s3://elasticblast-pghosh2/queries/`
Substitute that path into the commands below. Do NOT use this as your output path or the jobs will fail.
## Running Elastic-BLAST
This next instruction relies on GNU parallel to run more than one search at a time.
First, check to be sure that the parameters are set properly:
`parallel -t --tag -q -a extract.csv elastic-blast submit --dry-run --query s3://elasticblast-USERNAME/queries/{}.fa --db s3://elasticblast-USERNAME/custom_blastdb/TOBG_NP-110.fna --program blastn --num-nodes 8 --batch-len 1000000000 --results s3://elasticblast-USERNAME/results/results_parallel/output_{} -- -task megablast -word_size 28 -evalue 0.00001 -max_target_seqs 10000000 -perc_identity 97 -outfmt "6 std qlen slen qcovs" -mt_mode 1`
Do you see anything in the commands that doesn't look quite right?
After fixing the error, submit the jobs, running at most 2 Elastic-BLAST searches at a time:
`parallel -t --jobs 2 --tag -q -a extract.csv elastic-blast submit --query s3://elasticblast-USERNAME/queries/{}.fa --db s3://elasticblast-USERNAME/custom_blastdb/TOBG_NP-110.fna --program blastn --num-nodes 8 --batch-len 1000000000 --results s3://elasticblast-USERNAME/results/results_parallel/output_{} -- -task megablast -word_size 28 -evalue 0.00001 -max_target_seqs 10000000 -perc_identity 97 -outfmt "6 std qlen slen qcovs" -mt_mode 1`
Monitor the progress
`elastic-blast status --results $MYURL/results/results_parallel/output_SRR5506583`
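If you'd like to keep an eye on it, one option is to re-run the status check periodically (a small convenience, assuming the same results path):
`watch -n 60 elastic-blast status --results $MYURL/results/results_parallel/output_SRR5506583`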
Once you see a status of "your ElasticBLAST search succeeded", you can download the search results from your S3 bucket to your local computer.
`export YOUR_RESULTS_BUCKET=$MYURL/results/results_parallel/output_SRR5506583`
`aws s3 cp ${YOUR_RESULTS_BUCKET}/ . --exclude "*" --include "*.out.gz" --recursive`
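The results are in tabular BLAST format (`-outfmt "6 std qlen slen qcovs"` from the submit command). To take a quick look at what came back (the exact file names will vary):
`zcat *.out.gz | head`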
To get a summary of your Elastic-BLAST output, run the following:
`elastic-blast run-summary --results s3://elasticblast-USERNAME/results/results_parallel/output_SRR5506583`
## Running in Serial
You can also run Elastic-BLAST in serial, which may be preferable if you only have one or two queries to run.
`parallel -t --tag -q elastic-blast submit --query s3://elasticblast-USERNAME/queries/{}.fa --db s3://elasticblast-USERNAME/custom_blastdb/TOBG_NP-110.fna --program blastn --num-nodes 8 --batch-len 1000000000 --results s3://elasticblast-USERNAME/results/results_parallel/output_{} -- -task megablast -word_size 28 -evalue 0.00001 -max_target_seqs 10000000 -perc_identity 97 -outfmt "6 std qlen slen qcovs" -mt_mode 1 ::: SRR5506583`
## Cleaning up your cloud instance
You can delete results from your S3 bucket with the following command. Modify it to include the path to the output you'd like to delete.
`elastic-blast delete --results s3://elasticblast-USERNAME/results/results_parallel/output_SRR5506583`
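If you are completely done with the workshop and no longer need anything in your bucket, you can also remove the bucket itself with the AWS CLI (this deletes all of its contents, so use with care):
`aws s3 rb s3://elasticblast-USERNAME --force`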
For more information on running Elastic-BLAST on the command line, check out [this resource](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/tutorials/tutorial-cli.html#tutorial-cli).