# Integrating custom Kraken2 databases into Epi2Me

## Introduction

The Oxford Nanopore EPI2ME metagenomics workflow uses Kraken2 for taxonomic classification of reads. By default, it offers standard databases (e.g. whole RefSeq or 16S/18S/ITS targeted loci). This guide walks through integrating custom reference databases. Below is a step-by-step plan covering database preparation, building with kraken2-build, and configuring EPI2ME to recognize and use the custom databases. We also include relevant Kraken2/EPI2ME settings and troubleshooting tips to ensure the classification is accurate with minimal errors.

Information for building this database was sourced from [How to build and use databases to run wf-metagenomics and wf-16s offline](https://epi2me.nanoporetech.com/how-to-meta-offline/#:~:text=In%20EPI2ME%20you%20can%20set,section) on ONT's EPI2ME documentation pages.

## 1. Preparing Reference FASTA Files for Kraken2

> Before building the databases, ensure your reference sequences are formatted correctly for Kraken2:

### Compile and format reference FASTA:

Gather your reference sequences in FASTA format (multi-FASTA files are fine). Each sequence header must contain a taxonomy identifier that Kraken2 can use. Kraken2 requires either a valid NCBI accession number or an explicit taxonomic ID in the header. Since these are custom datasets, it’s easiest to embed the taxon ID in each FASTA header using the `kraken:taxid|<taxID>` syntax. For example:

```
>MySpecies_COI|kraken:taxid|12345
ATGCGT... (COI gene sequence)
```

* This header assigns the sequence to taxon ID 12345 in the taxonomy. Using `kraken:taxid|XXX` in the sequence ID explicitly sets its taxon to XXX.
* Make sure to replace 12345 with the actual NCBI taxonomic ID for that species or genus. Do this for every sequence in your reference FASTA files.

When downloading FASTA files, it is important to use a consistent file-naming scheme; `Genus_species_Gene.fasta` is suggested. If you look through your downloaded files and notice that not every FASTA file has the appropriate file extension, you can fix this in the terminal:

```
#!/bin/bash

# Directory containing the fasta files
DIR="fasta_Database"

# Loop through all files in the directory
for file in "$DIR"/*; do
  # Check if the file does not already have the .fasta extension
  if [[ ! "$file" =~ \.fasta$ ]]; then
    # Rename the file to add the .fasta extension
    mv "$file" "$file.fasta"
    echo "Renamed: $file → $file.fasta"
  fi
done

echo "All files now have the .fasta extension."
```

Following this, you should confirm that every FASTA file carries the correct taxon ID for Kraken2. It is recommended that, when downloading the FASTA files, you create a TSV file with a column for `Species_Name` and another for `TaxID` *(the NCBI taxonomic ID)*. For this example the file `taxid_lookup.tsv` looks like the following:

```
Species_Name	TaxID
Salpa_thompsoni	569448
Salpa_maxima	942555
Salpa_fusiformis	942554
Salpa_aspera	942553
```

The following script reads this file to look up the correct taxonomic ID for each species while rewriting the FASTA headers into the form Kraken2 can use:

```
#!/bin/bash

# Define paths
TAXID_LOOKUP="taxid_lookup.tsv"
FASTA_DIR="fasta_Database"   # Change this to the correct directory

# Check if taxid_lookup.tsv exists
if [ ! -f "$TAXID_LOOKUP" ]; then
  echo "ERROR: $TAXID_LOOKUP not found!"
  exit 1
fi

# Convert taxid_lookup.tsv into an associative array
declare -A TAXID_MAP
while IFS=$'\t' read -r species taxid; do
  TAXID_MAP["$species"]="$taxid"
done < <(tail -n +2 "$TAXID_LOOKUP")   # Skip header row

# Process each FASTA file
for file in "$FASTA_DIR"/*.fasta; do
  base_name=$(basename "$file")

  # Extract species name (assume filename format: Genus_species_marker.fasta)
  species=$(echo "$base_name" | sed -E 's/_COI|_18S|\.fasta//g')

  # Look up TaxID
  taxid="${TAXID_MAP["$species"]}"

  # If no TaxID is found, log a warning and skip the file
  if [ -z "$taxid" ]; then
    echo "WARNING: No taxid found for $species"
    continue
  fi

  # Fix headers in the FASTA file: prepend Genus_species|kraken:taxid|<taxid>
  # and keep the original header text as the description
  awk -v taxid="$taxid" -v species="$species" '
  /^>/ {
    print ">" species "|kraken:taxid|" taxid " " substr($0, 2)
    next
  }
  { print }
  ' "$file" > "${file%.fasta}_fixed.fasta"

  # Replace the original file with the fixed one
  mv "${file%.fasta}_fixed.fasta" "$file"
done

echo "FASTA headers have been updated with TaxIDs."
```

Following that, we can validate how the header rewrite went:

```
#!/bin/bash

FASTA_DIR="fasta_Database"   # Update this path

# Loop through all FASTA files and check every header line
for file in "$FASTA_DIR"/*.fasta; do
  while read -r line; do
    if [[ $line == ">"* ]]; then
      # Check that the kraken taxid tag is present in the header
      if [[ ! $line =~ \|kraken:taxid\|[0-9]+ ]]; then
        echo "ERROR: Invalid Kraken2 header in $file: $line"
      fi
    fi
  done < "$file"
done

echo "Header validation complete."
```

This script helps ensure that `|kraken:taxid|<taxID>` is in the right location in each header.

* **No spaces in sequence IDs:** When adding the kraken:taxid tag, ensure the `>` line has no spaces before the taxid tag. The portion before any whitespace is considered the sequence ID. For example, use `>Seq1|kraken:taxid|12345 Description` (the taxid tag is part of the identifier) rather than putting the description first. This guarantees Kraken2 reads the ID and taxid correctly.
* **Verify taxon IDs:** Ensure the IDs you use correspond to entries in the NCBI taxonomy. If your sequences are from known species, you can find their taxon IDs via NCBI (for example, the species Homo sapiens has taxid 9606). If a sequence’s organism isn’t in NCBI’s taxonomy, you have two options: (a) assign it to the nearest higher taxon that is in NCBI (e.g. genus or family), or (b) add a new entry to the taxonomy dump manually. Option (a) is simpler and usually sufficient. Option (b) involves editing names.dmp and nodes.dmp in the taxonomy files to create a custom node (a sketch of this is shown after the taxonomy download step below).

## 2. Install Kraken2

Step 1: Ensure Kraken2 is installed. If it is not installed, install it via Homebrew:

```
brew install brewsci/bio/kraken2
```

Verify the installation:

```
kraken2 --version
```

## 3. Building Kraken2 Databases with kraken2-build

### Create a database directory & download taxonomy:

Decide on a name/path for your new database (e.g., COI_DB). Run the Kraken2 build command to download the NCBI taxonomy into that folder. In the command line, you can enter the following after ensuring you are in the correct directory.

Set up the database folder:

```
mkdir -p "/Users/.../kraken2_dbs/COI_DB" # update this to actual file path
```

Download the NCBI taxonomy:

```
kraken2-build --download-taxonomy --skip-maps --db "/Users/.../kraken2_dbs/COI_DB" # update this to actual file path
```

* `--skip-maps`: This avoids downloading gigabytes of accession mapping files that you won’t need for custom sequences.
* After this step, the database folder will have a `taxonomy/` subdirectory with the NCBI taxonomy data.
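(Optional) If any of your sequences needed option (b) from Section 1 (an organism missing from the NCBI taxonomy), this is the point to append a custom node to the freshly downloaded dump files, before the database is built. The snippet below is a minimal sketch only: `9000001` is an arbitrary unused taxid, and the parent taxid and species name are placeholders you must replace. The column layout follows the standard NCBI `nodes.dmp`/`names.dmp` format (fields separated by `\t|\t`, lines terminated by `\t|`); Kraken2 mainly uses the taxid, parent taxid, rank, and scientific-name fields, and the remaining flag columns here are filler values.

```
#!/bin/bash
# Sketch of option (b): append a custom taxonomy node before building the DB.
DB="/Users/.../kraken2_dbs/COI_DB"   # update this to actual file path
NEW_TAXID=9000001                    # arbitrary unused taxid (placeholder)
PARENT_TAXID="<PARENT_TAXID>"        # e.g. the genus-level NCBI taxid (placeholder)
NAME="Myspecies novelum"             # hypothetical species name (placeholder)

# nodes.dmp: tax_id | parent tax_id | rank | ... (trailing flag columns are filler)
printf '%s\t|\t%s\t|\tspecies\t|\t\t|\t0\t|\t1\t|\t1\t|\t1\t|\t0\t|\t1\t|\t1\t|\t0\t|\t\t|\n' \
  "$NEW_TAXID" "$PARENT_TAXID" >> "$DB/taxonomy/nodes.dmp"

# names.dmp: tax_id | name | unique name | name class |
printf '%s\t|\t%s\t|\t\t|\tscientific name\t|\n' \
  "$NEW_TAXID" "$NAME" >> "$DB/taxonomy/names.dmp"
```

Remember to use the same custom taxid (`9000001` in this sketch) in the `kraken:taxid|` tag of the corresponding FASTA headers.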
Verify that the files downloaded:

```
ls "/Users/.../kraken2_dbs/COI_DB/taxonomy" # update this to actual file path
```

You should see files like `names.dmp` and `nodes.dmp`.

## 4. Concatenate Your FASTA Files

It is simplest to give kraken2-build a single combined FASTA file, so merge all of your reference files:

```
cat "/Users/.../fasta_Database"/*.fasta > "/Users/.../kraken2_dbs/COI_combined.fasta"
```

There should now be a file in the directory containing the concatenated FASTA sequences. Check that it exists:

```
ls -lh "/Users/.../kraken2_dbs/"
```

## 5. Add FASTA Sequences to Kraken2

### Add your reference sequences to the database:

Now add the formatted FASTA file to the Kraken2 database library:

```
kraken2-build --add-to-library "/Users/.../kraken2_dbs/COI_combined.fasta" --db "/Users/.../kraken2_dbs/COI_DB"
```

You will receive a message like:

```
Masking low-complexity regions of new file... done.
Added "/Users/.../kraken2_dbs/COI_combined.fasta" to library (/Users/.../kraken2_dbs/COI_DB)
```

* This copies your COI reference sequences into the database’s library for indexing.
* You should see the files appear under `COI_DB/library/` after this step.
* Each sequence’s taxonomic assignment will be parsed from the headers at this stage. (If a sequence header’s taxid isn’t found in the taxonomy data, Kraken2 will warn or error – ensure all taxids exist in the taxonomy dump to avoid this.)

## 6. Build the Kraken2 Database

### Build the Kraken2 k-mer index:

Once the taxonomy is in place and the sequences are added, build the database:

```
kraken2-build --build --db "/Users/.../kraken2_dbs/COI_DB"
```

If the build is successful, you should see the following files in the database folder: `taxo.k2d`, `hash.k2d`, `opts.k2d`, and `seqid2taxid.map`.

* This step constructs the Kraken2 index (a hash table of k-mers).
* Each build may take some time depending on the number of sequences and their lengths, but targeted databases such as COI or 18S are usually much smaller than whole-genome databases and should build relatively quickly.

### (Optional) Clean up and inspect:

Kraken2 allows you to remove the raw library files after building to save space using `kraken2-build --clean --db <DBNAME>`. **This is optional** – you may keep the FASTA library in place for record-keeping.

It’s also a good idea to verify the contents of your new databases. You can run `kraken2-inspect --db COI_DB > COI_db_report.txt` to get a summary of how many sequences and k-mers are associated with each taxon in your database. This inspection can confirm that your sequences were indexed under the expected taxonomic IDs.

> For example, you should see entries for the species you included, each with the count of k-mers. If something looks off (e.g., all sequences lumped under taxid 0 or an unexpected taxon), it may indicate a formatting issue in the FASTA headers that should be fixed before proceeding.

By following the above steps, you will have two Kraken2-compatible databases: one for COI and one for 18S. Each is essentially a folder containing a `library/` (with your sequences), a `taxonomy/` (with NCBI taxonomy files), and the built index files. You can now integrate these into the EPI2ME workflow; before doing so, a quick command-line spot check of the new database is sketched below.
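Running a handful of reads through the new database directly from the terminal can catch header or taxonomy problems before you touch EPI2ME. This is a minimal sketch: the paths and `test_reads.fastq` are placeholders, while `--db`, `--use-names`, `--report`, and `--output` are standard Kraken2 options.

```
# Classify a small test file against the new database (paths are placeholders).
kraken2 --db "/Users/.../kraken2_dbs/COI_DB" \
        --use-names \
        --report COI_test_report.txt \
        --output COI_test_output.txt \
        test_reads.fastq

# The report lists, per taxon, the percentage and number of reads assigned;
# reads drawn from your reference taxa should appear under the expected species.
head COI_test_report.txt
```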
## 7. Integrating Custom Databases into the EPI2ME Workflow

With the Kraken2 databases built, the next step is to configure the EPI2ME metagenomics workflow to use them instead of a default database. Since we are using the EPI2ME Desktop application (with cloud or offline mode), we will proceed as though we are doing this via the EPI2ME GUI, although you could also do this via command-line parameters if running the workflow with Nextflow offline (a sketch of such a command is given at the end of the next section).

### Using the EPI2ME Desktop application (GUI):

In the EPI2ME Labs Desktop interface, go to “Reference Options” (this is available when setting up the wf-metagenomics run). Instead of choosing one of the default databases, select the option to use a custom or local database. The interface allows you to set the path to your database folder here. Choose the folder of your custom Kraken2 database (e.g., the COI_DB directory you created). You will also need to specify the taxonomy data – in many cases, pointing to the same database folder is enough if it contains the `taxonomy/` subfolder. (If the interface provides a separate field for “taxonomy dump”, browse to the taxonomy subdirectory inside your DB folder or to the directory where you stored the NCBI taxonomy.) EPI2ME will store these settings and use your files for classification.

> Tip: Ensure the path is accessible from your environment. If you built the DB on a WSL or Linux VM, you may need to copy it to a Windows-accessible location or mount it so the EPI2ME app can read it.

![custom_db](https://hackmd.io/_uploads/B1OkdI0iyg.png)

Once configured, proceed to run the workflow – the pipeline will load your custom Kraken2 database for classification instead of downloading a preset one.

## 8. Kraken2 and EPI2ME Settings for Optimal Accuracy

To optimize classification accuracy for Nanopore reads with Kraken2, consider the following settings and parameters. These were sourced from the [wf-metagenomics documentation](https://epi2me.nanoporetech.com/workflows/wf-metagenomics/#:~:text=kraken2_confidence%20number%20Kraken2%20Confidence%20score,0).

### Confidence score threshold:

Kraken2 can optionally use a confidence score to decide whether to assign a read to a taxon or leave it unclassified. By default, EPI2ME’s Kraken2 workflow uses a confidence threshold of 0.0 (no minimum), meaning even a single k-mer match could classify a read (Kraken2 picks the lowest common ancestor of the matches by default). Nanopore reads have a higher error rate, so using a mild confidence filter can reduce spurious hits. You can adjust the `kraken2_confidence` parameter in EPI2ME. For example, setting `--kraken2_confidence 0.1` (10%) or 0.2 will require that at least that fraction of k-mers support the assignment. This helps ensure a read is only classified if there’s sufficient evidence, improving accuracy at the expense of labeling more reads as “unclassified.” In the EPI2ME Desktop UI, there may be an advanced option to set the confidence; if not, you can set it on the Nextflow command line (see the sketch below). Start with 0.1 and evaluate the results – you can increase it if you still see questionable classifications, or lower it if too many reads are unclassified.
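For reference, a command-line run of wf-metagenomics with the custom database might look like the sketch below. Treat it as an assumption-laden example: `--kraken2_confidence` and `--kraken2_memory_mapping` come from the wf-metagenomics documentation cited above, but the database and taxonomy parameter names shown here (`--database`, `--taxonomy`), as well as `--fastq` and `--out_dir`, should be verified against `nextflow run epi2me-labs/wf-metagenomics --help` or the offline how-to guide linked in the introduction. All paths are placeholders.

```
# Hedged sketch of an offline wf-metagenomics run with a custom Kraken2 database.
# --database and --taxonomy are assumed parameter names -- confirm them with:
#   nextflow run epi2me-labs/wf-metagenomics --help
nextflow run epi2me-labs/wf-metagenomics \
    --fastq /path/to/fastq_pass \
    --database "/Users/.../kraken2_dbs/COI_DB" \
    --taxonomy "/Users/.../kraken2_dbs/COI_DB/taxonomy" \
    --kraken2_confidence 0.1 \
    --kraken2_memory_mapping \
    --out_dir COI_custom_run
```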
### K-mer length and minimizer settings:

By default, Kraken2 builds nucleotide databases with k=35 (k-mer length) and ℓ=31 (minimizer length). These defaults work well for most cases and were used when you built the database (unless you specified otherwise). A k-mer of 35 is short enough to tolerate some Nanopore errors and long enough to be specific. In most cases you do not need to change this. However, if you suspect that read errors are causing Kraken2 to miss classifications (e.g., if your reads are very error-prone or shorter than 35 bp in some regions), you could rebuild the database with a smaller k-mer (Kraken2 allows this via the `--kmer-len` option during the build); for example, k=31 could be used in theory. But be cautious: reducing the k-mer size can increase false positives because shorter k-mers are more common. Generally, stick to k=35 and use confidence scoring to handle errors – this approach is recommended by Kraken2 (false-positive k-mer hits are expected to be <1% and can be controlled by confidence scoring).

### Quality filtering:

Kraken2 has a `--minimum-base-quality` option for read classification. By default it is 0 (no quality filter). If your Nanopore reads have low-quality segments, you might set this to, say, 7 or 10, which tells Kraken2 to ignore k-mers containing bases with quality below that threshold. This can sometimes improve accuracy by skipping k-mers that are likely to be errors.

### Memory mapping for Kraken2:

Classification of reads with Kraken2 requires loading the database into memory by default. If you have limited RAM on your system, a large database can cause performance issues. Kraken2 offers a `--memory-mapping` mode to load the database from disk on the fly, reducing RAM usage (at some cost to speed). In the EPI2ME workflow, you can enable this via `--kraken2_memory_mapping`.

> The EPI2ME documentation suggests using this option if you are concerned about high memory use. Since your custom COI/18S databases are likely not extremely large, you may not need it, but it’s good to know. If you do find memory to be a bottleneck (e.g., the process is crashing due to RAM), turn on memory mapping in the workflow’s advanced options.

### Bracken integration (optional):

Kraken2 reports can be further refined with Bracken (a tool to re-estimate abundances, often used for short reads). If EPI2ME’s workflow supports Bracken and you want to use it, you will need to build a Bracken database for your custom DB as well. This involves running `bracken-build` on the Kraken2 DB (and possibly generating a k-mer distribution file). For full-length reads, Bracken is usually not necessary, but if the pipeline has it enabled for consistency, follow the EPI2ME guide for creating the Bracken file. Make sure to match the read length parameter (for Nanopore COI/18S, use the approximate read length, e.g., `-l 1500` for full-length 18S). Then provide that to EPI2ME (the workflow might auto-detect the Bracken files if they are in the same folder). This step is only needed if you want to refine the Kraken2 results – otherwise, Kraken2 alone can suffice for classification.

If proceeding with the Bracken build:

#### Install Bracken:

```
brew install bracken
```

#### Build the Bracken database:

Bracken refines Kraken2 results by estimating species abundance using k-mer distributions. Run `bracken-build` against the Kraken2 database to create a Bracken database for COI_DB; a hedged example command is sketched below.
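The sketch below uses the documented `bracken-build` options `-d`, `-t`, `-k`, and `-l`; the database path is a placeholder. The read length of 100 here matches the `database100mers.*` files named in the example output that follows, but per the note above you would normally set `-l` closer to your real read length (e.g. 1500 for full-length 18S).

```
# Hedged example: build the Bracken k-mer distribution for the custom database.
# -d  path to the Kraken2 database (placeholder path)
# -t  number of threads
# -k  k-mer length used when the Kraken2 database was built (default 35)
# -l  read length for the distribution; 100 matches the database100mers output
#     shown below -- use a value near your actual read length in practice
bracken-build -d "/Users/.../kraken2_dbs/COI_DB" -t 4 -k 35 -l 100
```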
You should receive a list of sequences converted and a message like the following:

```
PROGRAM START TIME: 03-12-2025 00:59:32
...139 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 03-12-2025 00:59:33
Finished creating database100mers.kraken and database100mers.kmer_distrib [in DB folder]
*NOTE: to create read distribution files for multiple read lengths, rerun this script specifying the same database but a different read length
Bracken build complete.
```

Kraken2 classifies reads by matching k-mers (subsequences of length k) from your sequences against a pre-built database of known sequences. However, Kraken2 often reports classifications at higher taxonomic levels (like family or genus) because there might not be enough unique k-mers to confidently assign reads at the species level. Bracken uses Kraken2's output and recalculates the abundance at lower taxonomic levels by considering the distribution of k-mers across species. It reassigns reads to the most likely species based on the observed k-mer distribution, improving species-level identification.
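If you end up running Bracken yourself rather than through EPI2ME, the re-estimation step is a single command over a Kraken2 report. This is a minimal sketch that assumes the report name from the earlier spot check; `-d`, `-i`, `-o`, `-r`, and `-l` are the documented Bracken options.

```
# Hedged example: re-estimate species-level abundances from a Kraken2 report.
# -d  Kraken2/Bracken database folder (placeholder path)
# -i  Kraken2 report produced during classification
# -o  Bracken output table
# -r  read length used for bracken-build (must match an existing *mers.kmer_distrib)
# -l  taxonomic level to re-estimate at (S = species)
bracken -d "/Users/.../kraken2_dbs/COI_DB" \
        -i COI_test_report.txt \
        -o COI_test_bracken.txt \
        -r 100 \
        -l S
```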