# 16S metagenomic raw data processing
Use **FRCI (Far-red cyanobacteria identification)** to identify far-red-light-utilizing cyanobacteria in your target environments, from your own samples or from literature sources. This workflow provides a pipeline for processing raw sequencing data from **Illumina amplicon sequencing (single-end and paired-end) and PacBio sequencing**. Hope you enjoy the analysis process!!
**Before You Begin:**
* This workflow requires [QIIME2](https://docs.qiime2.org/2023.9/tutorials/qiime2-for-experienced-microbiome-researchers/) to be installed within a conda environment, along with [parallel](https://anaconda.org/conda-forge/parallel) and [seqtk](https://github.com/lh3/seqtk).
* For taxonomy classification, if you are using QIIME2 for the first time, please download the [silva](https://www.arb-silva.de/download/arb-files/) database and import it into QIIME2 format (a minimal import sketch follows).
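If you build the SILVA artifacts yourself, a minimal import sketch looks like this (the filenames `silva_seqs.fasta` and `silva_taxonomy.tsv` are placeholders for the files you prepare from the SILVA download):
```linux
qiime tools import --type 'FeatureData[Sequence]' \
--input-path silva_seqs.fasta \
--output-path silva_seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path silva_taxonomy.tsv \
--output-path silva_tax.qza
```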
**For public data searching:**
* Make sure your paper provides this information: 16S rRNA gene sequencing, an accession number, and the primer tag (or check the references if you can’t find it). You can also use Table 1 in FRCI for practice.👍
* If you have any questions about the parameters in a command or want to learn more, please check out the QIIME2 website.
**For your own samples:**
* You can skip Step 1 and start from the **paired-end sequencing** or **PacBio sequencing data** section.
* If you are sure the parameters are correct but the quality is still problematic, please contact your sequencing provider.
**FRCI:**
[FRCI article](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13871) and [FRCI database](https://doi.org/10.5281/zenodo.7870551)
**Overview of QIIME 2**

## Download accession number file
* Find the 16S metagenomic raw data in the article
Ex: an SRA run number or BioProject number, etc.
* Go to NCBI and download the accession list file: `SraAccList.csv`
* Create a folder on your Linux machine and upload `SraAccList.csv` to it
Ex: `mkdir Stromatolite_1`
## Step 0: Build environment
* Create a [tmux](https://github.com/tmux/tmux/wiki/Installing) session so your ongoing work isn’t interrupted: ```tmux```
* Don’t forget your tmux session number. If you close the Linux window, you can reopen the session with ```tmux attach -t number```, and use ```tmux ls``` to list sessions. When your work is done, use ```exit``` to kill the tmux session.
* Activate the QIIME2 environment: ```conda activate qiime2-2021.11```
## Step 1: Download raw reads from NCBI SRA
* Use `prefetch` and `fastq-dump` to download the raw reads:
```linux
# prefetch fetches the .sra files; fastq-dump converts them to fastq in the fastq/ folder
parallel -j 1 prefetch {} ::: $(cat SraAccList.csv)
parallel -j 1 fastq-dump --skip-technical -F --split-files -O fastq {} ::: $(cat SraAccList.csv)
```
* Check the ```fastq/``` folder after the download: ```ls fastq```
If you have **only one file per SRA number**, your data are single-end; please go to the **single-end section** (the optional check below shows how to tell at a glance).
* Don’t waste time on a paper if its SRA data take more than a day to download.😊 Or open another tmux session while you wait.
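If you’re not sure which you have, a minimal check (assuming `SraAccList.csv` holds one accession per line) counts the fastq files per accession:
```linux
# 1 file per accession = single-end; 2 files = paired-end
for acc in $(cat SraAccList.csv); do
  echo -e "$acc\t$(ls fastq/${acc}*.fastq 2>/dev/null | wc -l) file(s)"
done
```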
# Paired-end sequence
## Step 2: Create a manifest
The **manifest file** requires three pieces of information: “sample-id”, “absolute-filepath”, and sequence “direction”.
* No matter what method you use, just make sure your manifest file looks like the following:
```linux
sample-id,absolute-filepath,direction
BM1,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM1_R1.fastq.gz,forward
BM1,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM1_R2.fastq.gz,reverse
BM2,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM2_R1.fastq.gz,forward
BM2,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM2_R2.fastq.gz,reverse
BM3,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM3_R1.fastq.gz,forward
BM3,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM3_R2.fastq.gz,reverse
```
Here I will demonstrate how to edit a manifest file using R.
* First, use ```ls``` to list the absolute filepaths of all files in the fastq folder, then write them into the manifest file.
```linux
pwd
# get your ".fastq*" location
ls /your_location/fastq/*.fastq* -d > manifest
# ex: ls /home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/*.fastq* -d
```
* Edit the columns:
**“sample-id”**: Create your own sample IDs or use sequential numbers as in this example. Save the result as a CSV file.
**"direction"**: A paired-end sample has two fastq files: forward reads and reverse reads.
```R
R
# in R: build the three-column manifest
manifest <- read.table('manifest', header = FALSE)
colnames(manifest) <- 'absolute-filepath'
n_samples <- 3  # replace 3 with your actual number of samples
manifest$"sample-id" <- rep(1:n_samples, each = 2)  # two rows (R1/R2) per sample
manifest$"direction" <- c('forward', 'reverse')     # recycled down the rows
manifest <- manifest[, c("sample-id", "absolute-filepath", "direction")]
write.csv(manifest, "manifest", row.names = FALSE, quote = FALSE)
quit()
```
## Step 3: Import data from manifest
* Importing data for use in QIIME 2.
```linux
mkdir reads_qza
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest \
--output-path reads_qza/paired_end.qza \
--input-format PairedEndFastqManifestPhred33
```
* Got an error while importing data? First make sure there’s nothing wrong with the manifest. If it’s all good, try one of the alternative input formats below; note that the V2 formats expect a different, tab-separated manifest layout (see the example below). (Or just drop this paper if you have other choices.👍)
```
PairedEndFastqManifestPhred33V2
PairedEndFastqManifestPhred64V2
```
## Step 4: Remove amplicon primers
* Run Cutadapt with the forward and reverse primer tags taken from your paper:
```linux
time qiime cutadapt trim-paired \
--i-demultiplexed-sequences reads_qza/paired_end.qza \
--p-cores 40 \
--p-front-f FORWARD_PRIMER \
--p-front-r REVERSE_PRIMER \
--p-discard-untrimmed \
--o-trimmed-sequences reads_qza/primer-trimmed.qza \
--verbose \
&> primer_trimming.log
```
* Check ```primer_trimming.log``` and ensure that **Total written (filtered)** is sufficient.
For example, >90% is good, while <60% is not; in the latter case, review your primer tag or consider using a different one. A quick way to pull these numbers from the log is shown below.
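Cutadapt prints a `Total written (filtered):` line for each sample, so you can list all the pass rates at once:
```linux
grep "Total written (filtered):" primer_trimming.log
```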
* Common primer pairs are listed below. Use these options in the above command for the forward and reverse sequences:
**V3V4:**
```linux
--p-front-f CCTACGGGNGGCWGCAG \
--p-front-r GACTACHVGGGTATCTAATCC \
```
**V5V6:**
```linux
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r CCGYCAATTYMTTTRAGTTT \
```
**V6V8:**
```linux
--p-front-f TYAATYGGANTCAACRCC \
--p-front-r CRGTGWGTRCAAGGRGCA \
```
## Step 5: Examine the quality of the data
* Create a ```.qzv``` file in the ```reads_qza/``` directory and download it to your computer.
```linux
qiime demux summarize \
--i-data reads_qza/primer-trimmed.qza \
--o-visualization reads_qza/primer-trimmed.qzv
```
* Head over to [qiime2 view](https://view.qiime2.org/) and upload your ```.qzv``` file to check the sequence quality. You can find this figure under **Interactive Quality Plot**.


* Truncate reads at the position where the mean quality score drops below 25. Feel free to try different cutoffs, but to save time, pick positions where the quality is still slightly above 25. In this example, set the cutoffs to F: 230, R: 230.
## Step 6: Denoise sample sequences: DADA2
* Fill in ```--p-trunc-len-f``` and ```--p-trunc-len-r``` with the cutoff values determined from the **Interactive Quality Plot**, then run DADA2 for sequence denoising (a filled-in example follows the command).
```linux
time qiime dada2 denoise-paired \
--i-demultiplexed-seqs reads_qza/primer-trimmed.qza \
--p-trunc-len-f FCutoff \
--p-trunc-len-r RCutoff \
--p-n-threads 40 \
--output-dir dada2_output
```
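For example, with the cutoffs chosen in Step 5 (F: 230, R: 230), the filled-in command would read:
```linux
time qiime dada2 denoise-paired \
--i-demultiplexed-seqs reads_qza/primer-trimmed.qza \
--p-trunc-len-f 230 \
--p-trunc-len-r 230 \
--p-n-threads 40 \
--output-dir dada2_output
```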
## Step 7: Check stats output
* Generate a report for the denoised data, allowing you to review the read counts for each sample after filtering, denoising, merging, and removing chimeras.
```linux
qiime tools export \
--input-path dada2_output/denoising_stats.qza \
--output-path dada2_output
```
* Check the ```dada2_output/stats.tsv``` file to ensure each sample has over 10,000 reads left after all the filtering (a quick way to view the table is shown below). For public data, if a paper can’t clear this bar, it’s not worth the report👌; for your own samples, you should definitely make sure each one exceeds 10,000 reads. Once you've completed this step, you're ready to move on to the **Taxonomy Classification** stage.
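An optional one-liner to view the stats as an aligned table in the terminal:
```linux
column -t -s$'\t' dada2_output/stats.tsv | less -S
```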
# Single-end sequence
## Step 2: Create a manifest
The **manifest file** requires three pieces of information: “sample-id”, “absolute-filepath”, and sequence “direction”.
* No matter what method you use, just make sure your manifest file looks like the following:
```linux
sample-id,absolute-filepath,direction
1,/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929474_1.fastq,forward
2,/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929475_1.fastq,forward
3,/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929476_1.fastq,forward
```
Here I will demonstrate how to edit a manifest file using R.
* First, use ```ls``` to list the absolute filepaths of all files in the fastq folder, then write them into the manifest file.
```linux
pwd
# get your ".fastq*" location
ls /your_location/fastq/*.fastq* -d > manifest
# ex: ls /home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/*.fastq* -d
```
* Edit the columns:
**sample-id**: Create your own sample IDs or use sequential numbers as in this example. Save the result as a CSV file.
**direction**: A single-end sample has one fastq file.
```r
R
# in R: build the three-column manifest
manifest <- read.table('manifest', header = FALSE)
colnames(manifest) <- 'absolute-filepath'
n_samples <- 9  # replace 9 with your actual number of samples
manifest$"sample-id" <- 1:n_samples
manifest$"direction" <- 'forward'  # single-end: one forward file per sample
manifest <- manifest[, c("sample-id", "absolute-filepath", "direction")]
write.csv(manifest, "manifest", row.names = FALSE, quote = FALSE)
quit()
```
## Step 3: Import data from manifest
* Importing data for use in QIIME 2.
```linux
mkdir reads_qza
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path manifest \
--output-path reads_qza/SingleEnd.qza \
--input-format SingleEndFastqManifestPhred33
```
* Got an error while importing data? First make sure there’s nothing wrong with the manifest. If it’s all good, try one of the alternative input formats below; note that the V2 formats expect a different, tab-separated manifest layout (see the example below). (Or just drop this paper if you have other choices.👍)
```
SingleEndFastqManifestPhred33V2
SingleEndFastqManifestPhred64V2
```
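For reference, the single-end V2 formats expect a **tab-separated** manifest with just two header columns, for example:
```linux
# columns are separated by tabs
sample-id	absolute-filepath
1	/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929474_1.fastq
```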
## Step 4: Remove amplicon primers
* Run Cutadapt with the forward and reverse primer tags taken from your paper:
```linux
qiime cutadapt trim-single \
--i-demultiplexed-sequences reads_qza/SingleEnd.qza \
--p-cores 40 \
--p-front FORWARD_PRIMER \
--p-front REVERSE_PRIMER \
--p-discard-untrimmed \
--o-trimmed-sequences reads_qza/primer-trimmed.qza \
--verbose \
&> primer_trimming.log
```
* Check the `primer_trimming.log` and ensure that **Total written (filtered)** is sufficient.
For example, >90% is good, while <60% is not, in which case you should review your primer tag or consider using a different one.
## Step 5: Examine the quality of the data
* Create a `.qzv` file in the `reads_qza/` directory and download it to your computer.
```linux
qiime demux summarize \
--i-data reads_qza/primer-trimmed.qza \
--o-visualization reads_qza/primer-trimmed.qzv
```
* Head over to [qiime2 view](https://view.qiime2.org/) and upload your ```.qzv``` file to check the sequence quality. You can find this figure under **Interactive Quality Plot**.

* Truncate reads at the position where the mean quality score drops below 25. Feel free to try a different cutoff, but to save time, pick a position where the quality is still slightly above 25. In this example, set the cutoff to 80.
## Step 6: Denoise sample sequences: DADA2
* Fill in ```--p-trunc-len``` with the cutoff value determined from the **Interactive Quality Plot** (here, 80); ```--p-trim-left``` sets the start site and can stay at 0. Then run DADA2 for sequence denoising.
```linux
qiime dada2 denoise-single \
--i-demultiplexed-seqs reads_qza/primer-trimmed.qza \
--p-trim-left 0 \
--p-trunc-len Cutoff \
--p-n-threads 40 \
--output-dir dada2_output
```
## Step 7: Check stats output
* Generate a report for the denoised data, allowing you to review the read counts for each sample after filtering, denoising, and chimera removal.
```linux
qiime tools export \
--input-path dada2_output/denoising_stats.qza \
--output-path dada2_output
```
* Check the ```dada2_output/stats.tsv``` file to ensure each sample has over 10,000 reads left after all the filtering. (If these are your own samples, make sure each one exceeds 10,000 reads.) Once you've completed this step, you're ready to move on to the **Taxonomy Classification** stage.
# PacBio-Based Full-Length 16S rRNA Sequencing
Unlike Illumina sequencing, PacBio Circular Consensus Sequencing (CCS) data does not generate reads in a consistent forward or reverse orientation. This can be challenging, since many bioinformatics tools are designed to handle only one orientation. To address this, we follow the [microbiome_helper](https://github.com/LangilleLab/microbiome_helper/wiki/PacBio-CCS-Amplicon-SOP-v1-%28qiime2%29) approach. Additionally, the quality scores of PacBio CCS data are nearly 100, which is outside QIIME2's expected range; therefore, this workflow recommends checking sequence quality with FastQC instead (see the example below).
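For example, a minimal FastQC run over the raw CCS reads (a sketch, assuming `fastqc` is installed and your raw files sit in `rawdata/` as in Step 2):
```linux
mkdir fastqc_out
fastqc -t 4 rawdata/*.fastq* -o fastqc_out
```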
## Step 2: Resolving orientation issues
First, make a copy of the raw files, generate their reverse complement, and concatenate them.
```linux
mkdir rawdata
# put your PacBio raw data (*.fastq.gz) in this folder, then decompress
gzip -d rawdata/*.gz
mkdir raw_data_rc/
mkdir raw_data_cat/
# reverse-complement every file, then concatenate each original with its reverse complement
parallel -j 40 'seqtk seq -r {} > raw_data_rc/{/.}_rc.fastq' ::: rawdata/*.fastq
parallel -j 40 --link 'cat {1} {2} > raw_data_cat/{1/.}_cat.fastq' ::: rawdata/*.fastq ::: raw_data_rc/*_rc.fastq
```
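An optional sanity check: each `*_cat.fastq` should contain exactly twice as many lines as its source file.
```linux
wc -l rawdata/*.fastq raw_data_cat/*_cat.fastq
```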
## Step 3: Remove amplicon primers
By specifying the correct 5'-3' orientation, we filter the concatenated files so that only reads in that orientation are kept, ensuring every retained read has a consistent direction.
**Bacteria universal primer set:**
27F (5′-AGRGTTYGATYMTGGCTCAG-3')
1492R (5'-AAGTCGTAACAAGGTARCY-3')
```linux
mkdir trimmed_reads/
parallel -j 40 'cutadapt -g AGRGTTYGATYMTGGCTCAG...AAGTCGTAACAAGGTARCY \
--discard-untrimmed --no-indels -j 1 -m 1200 -M 1800 \
-o trimmed_reads/{/.}_trim.fastq {}' \
::: raw_data_cat/*_cat.fastq
```
In its log, this step reports a per-adapter table with numerical columns **length**, **count**, **expect**, **max.err**, and **error counts**, separated by tabs.
## Step 4: Verify read counts after filtering
After trimming primers with `cutadapt`, let's check the read counts. In `fastq` format, each read consists of 4 lines, so the total number of reads can be calculated using:
```linux
echo -e "Sample\tRaw_Reads\tTrimmed_Reads\tPercentage" > read_counts.tsv
paste <(for f in rawdata/*.fastq; do expr $(wc -l < "$f") / 4; done) \
<(for f in trimmed_reads/*.fastq; do expr $(wc -l < "$f") / 4; done) \
| awk 'BEGIN {print "Sample\tRaw_Reads\tTrimmed_Reads\tPercentage"}
{if ($1 > 0) per = ($2 / $1) * 100; else per = "NA";
print NR "\t" $1 "\t" $2 "\t" per "%"}' >> read_counts.tsv
```
After generating `read_counts.tsv`, check the number of reads that passed the filtering process. This file contains a summary of raw and trimmed read counts for each sample, along with the percentage of reads retained after trimming.
| Sample | Raw_Reads | Trimmed_Reads | Percentage |
|:------:|:---------:|:-------------:|:----------:|
| 1 | 177242 | 145221 | 81.9337% |
| 2 | 166646 | 145409 | 87.2562% |
| 3 | 155264 | 136864 | 88.1492% |
Steps 2 and 3 generate intermediate files in the folders `raw_data_rc` and `raw_data_cat`. After completing this step and confirming there are no major issues, these folders can be deleted to save space.
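For example:
```linux
# remove the intermediate reverse-complement and concatenated files
rm -r raw_data_rc/ raw_data_cat/
```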
## Step 5: Create a manifest
Then, create a manifest by following the commands in [Single-end sequence Step 2](https://hackmd.io/HGRIDcE5SNaFfQ4HX-_Wrg#Step-2-Create-a-manifest12). Alternatively, run the command below, download the resulting file, and edit it in Excel.
```linux
pwd
# get your ".fastq*" location
ls /your_location/trimmed_reads/*fastq -d > manifest
```
No matter what method you use, just make sure your manifest file looks like the following:
```linux
sample-id,absolute-filepath,direction
1,/your_location/trimmed_reads/SRR1929474_1_cat_trim.fastq,forward
2,/your_location/trimmed_reads/SRR1929475_1_cat_trim.fastq,forward
3,/your_location/trimmed_reads/SRR1929476_1_cat_trim.fastq,forward
```
## Step 6: Import data from manifest
Importing data for use in QIIME 2.
```linux
mkdir reads_qza
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path manifest \
--output-path reads_qza/trimmed_reads.qza \
--input-format SingleEndFastqManifestPhred33
```
## Step 7: DADA2 denoising
Run the DADA2 workflow to correct reads and generate amplicon sequence variants (ASVs). No initial quality filtering or truncation is applied, as CCS reads already have high accuracy (99.9% consensus).
```linux
qiime dada2 denoise-single \
--i-demultiplexed-seqs reads_qza/trimmed_reads.qza \
--p-trunc-len 0 \
--p-max-ee 3 \
--p-n-threads 40 \
--p-n-reads-learn 100000 \
--output-dir dada2_output
```
## Step 8: Check stats output
Generate a report for the denoised data to review the read counts for each sample.
```linux
qiime tools export --input-path dada2_output/denoising_stats.qza --output-path dada2_output
mv dada2_output/stats.tsv dada2_output/dada2_stats.tsv
```
Notably, the input column in this table represents reads after primer trimming with cutadapt.
| Sample-ID | Input | Filtered | % Passed Filter | Denoised | Non-Chimeric | % Non-Chimeric |
|:---------:|:------:|:--------:|:---------------:|:--------:|:------------:|:--------------:|
| HYbig | 145221 | 133155 | 91.69% | 131735 | 126978 | 87.44% |
| HYrock | 145409 | 133049 | 91.5% | 130938 | 129534 | 89.08% |
| PKX1 | 136864 | 124831 | 91.21% | 120618 | 119467 | 87.29% |
# Taxonomy Classification
Welcome to this stage!! You're almost there!!🪇
## FRCI Database
* FRLDB holds 62 strains, including FaRLiP cyanobacteria and Chl *d*-containing strains. A 97% identity threshold effectively distinguishes far-red light (FRL) from visible light (VL) cyanobacteria of the same genus.
```linux
# FRCI: classify representative ASVs against FRLDB at 97% identity
time qiime feature-classifier classify-consensus-vsearch \
--i-query dada2_output/representative_sequences.qza \
--i-reference-reads /home/lab617/Genomics/FRLDB/220921_FRLDB_01.qza \
--i-reference-taxonomy /home/lab617/Genomics/FRLDB/220921_FRLDB_TX.qza \
--p-maxaccepts 1 \
--p-perc-identity 0.97 \
--p-query-cov 1 \
--p-threads 40 \
--verbose \
--output-dir FRL_taxa_97
# export the classification, fix the header for biom, then attach taxonomy to the feature table
qiime tools export --input-path FRL_taxa_97/classification.qza --output-path FRL_taxa_97
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' FRL_taxa_97/taxonomy.tsv
qiime tools export --input-path dada2_output/table.qza --output-path dada2_output_silva_exported
biom add-metadata -i dada2_output_silva_exported/feature-table.biom -o dada2_output_silva_exported/feature-table_w_tax.biom --observation-metadata-fp FRL_taxa_97/taxonomy.tsv --sc-separated taxonomy
biom convert -i dada2_output_silva_exported/feature-table_w_tax.biom -o dada2_output_silva_exported/FRL_taxa_97.txt --to-tsv --header-key taxonomy
```
* When the verbose output reports `Matching query sequences:`, you have completed FRCI!🐋 Please copy that result line into your report.🐻👍

* The far-red light cyanobacteria taxonomy classification is stored in `dada2_output_silva_exported/FRL_taxa_97.txt`.
| OTU ID | Sample1 | Sample2 | taxonomy |
| ---------- | ------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| 7a7df92c4f | 44142 | 13 | K_Bacteria; P_Cyanobacteria; O_Synechococcales; F_Leptolyngbyaceae; G_Leptolyngbya; S_cf. Leptolyngbya sp. CCNUM2 |
| 8958f347d | 13893 | 4 | Unassigned |
| 7c51777f79 | 10791 | 4 | K_Bacteria; P_Cyanobacteria; O_Chroococcidiopsidales; F_Chroococcidiopsidaceae; G_Chroococcidiopsis; S_Chroococcidiopsis sp. CCNUC1 |
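A quick way to preview only the classified rows in this table (`grep -v` simply drops the `Unassigned` features):
```linux
grep -v "Unassigned" dada2_output_silva_exported/FRL_taxa_97.txt | head
```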
## CyanoDB
* For convenience, we separated the cyanobacteria phylum **(excluding chloroplasts)** from the SILVA database so that the cyanobacterial portion of each sample can be analyzed separately.
```linux
#cyano
time qiime feature-classifier classify-consensus-vsearch \
--i-query dada2_output/representative_sequences.qza \
--i-reference-reads /home/lab617/Genomics/databases/Silva138_OnlyCyano.qza \
--i-reference-taxonomy /home/lab617/Genomics/databases/Silva138_Cyano_TX.qza \
--p-maxaccepts 1 \
--p-perc-identity 0.97 \
--p-query-cov 1 \
--p-threads 40 \
--verbose \
--output-dir Cyano_taxa_97
qiime tools export --input-path Cyano_taxa_97/classification.qza --output-path Cyano_taxa_97
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' Cyano_taxa_97/taxonomy.tsv
qiime tools export --input-path dada2_output/table.qza --output-path dada2_output_silva_exported
biom add-metadata -i dada2_output_silva_exported/feature-table.biom -o dada2_output_silva_exported/feature-table_w_tax.biom --observation-metadata-fp Cyano_taxa_97/taxonomy.tsv --sc-separated taxonomy
biom convert -i dada2_output_silva_exported/feature-table_w_tax.biom -o dada2_output_silva_exported/Cyano_taxa_97.txt --to-tsv --header-key taxonomy
```
## SilvaDB
* The SILVA database contains extensive bacterial taxonomy and sequence data and is widely used for bacterial metagenomic analysis. This step may take a long time, since it classifies the whole community rather than just the cyanobacteria.
* ***Note***: When working with PacBio data, the default taxonomy classification may be very slow. To improve speed, add the following options to your command:
``--p-maxaccepts 1 --p-strand "plus" \``
```linux
#silva
time qiime feature-classifier classify-consensus-vsearch \
--i-query dada2_output/representative_sequences.qza \
--i-reference-reads /home/lab617/Genomics/databases/silva-138-99-seqs.qza \
--i-reference-taxonomy /home/lab617/Genomics/databases/silva-138-99-tax.qza \
--p-threads 40 \
--verbose \
--output-dir silva_taxa
qiime tools export --input-path silva_taxa/classification.qza --output-path silva_taxa
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' silva_taxa/taxonomy.tsv
qiime tools export --input-path dada2_output/table.qza --output-path dada2_output_silva_exported
biom add-metadata -i dada2_output_silva_exported/feature-table.biom -o dada2_output_silva_exported/feature-table_w_tax.biom --observation-metadata-fp silva_taxa/taxonomy.tsv --sc-separated taxonomy
biom convert -i dada2_output_silva_exported/feature-table_w_tax.biom -o dada2_output_silva_exported/silvatax.txt --to-tsv --header-key taxonomy
```
**Done! Make some happy Cyanobacteria noise!!**
