# 16S metagenomic raw data processing
Use **FRCI (Far-red cyanobacteria identification)** to identify far-red-light-utilizing cyanobacteria in your target environments, from your own samples or from literature sources. This workflow provides a pipeline for processing raw sequencing data from **Illumina amplicon sequencing (single-end and paired-end) and PacBio sequencing**. Hope you enjoy the analysis process!!
**Before You Begin:**
* This workflow requires [QIIME2](https://docs.qiime2.org/2023.9/tutorials/qiime2-for-experienced-microbiome-researchers/) to be installed within a conda environment, along with [parallel](https://anaconda.org/conda-forge/parallel) and [seqtk](https://github.com/lh3/seqtk).
* For taxonomy classification, if you are using QIIME2 for the first time, please download the [silva](https://www.arb-silva.de/download/arb-files/) database and import it into QIIME2 format (a minimal import sketch follows).
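If you build the SILVA artifacts yourself, a minimal import sketch looks like this (the filenames `silva_seqs.fasta` and `silva_taxonomy.tsv` are placeholders for the files you prepare from the SILVA download):
```linux
qiime tools import --type 'FeatureData[Sequence]' \
--input-path silva_seqs.fasta \
--output-path silva_seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path silva_taxonomy.tsv \
--output-path silva_tax.qza
```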
**For public data searching:**
* Make sure your paper provides this information: 16S rRNA gene sequencing, an accession number, and the primer tag (or check the references if you can’t find it). You can also use Table 1 in FRCI for practice.👍
* If you have any questions about the parameters in a command or want to learn more, please check out the QIIME2 website.
**For your own samples:**
* You can skip Step 1 and start from the **paired-end sequencing** or **PacBio sequencing data** section.
* If you are sure the parameters are correct but the quality is still problematic, please contact your sequencing provider.
**FRCI:**
[FRCI article](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13871) and [FRCI database](https://doi.org/10.5281/zenodo.7870551)
**Overview of QIIME 2**

## Download accession number file
* Find the 16S metagenomic raw data in the article
Ex: an SRA run number or BioProject number, etc.
* Go to NCBI and download the accession list file: `SraAccList.csv`
* Create a folder on your Linux machine and upload `SraAccList.csv` to it
Ex: `mkdir Stromatolite_1`
## Step 0: Build environment
* Create a [tmux](https://github.com/tmux/tmux/wiki/Installing) session so your ongoing work isn’t interrupted: ```tmux```
* Don’t forget your tmux session number. If you close the Linux window, you can reopen the session with ```tmux attach -t number```, and use ```tmux ls``` to list sessions. When your work is done, use ```exit``` to kill the tmux session.
* Activate the QIIME2 environment: ```conda activate qiime2-2021.11```
## Step 1: Download raw reads from NCBI SRA
* Use `prefetch` and `fastq-dump` to download the raw reads:
```linux
# prefetch fetches the .sra files; fastq-dump converts them to fastq in the fastq/ folder
parallel -j 1 prefetch {} ::: $(cat SraAccList.csv)
parallel -j 1 fastq-dump --skip-technical -F --split-files -O fastq {} ::: $(cat SraAccList.csv)
```
* Check the ```fastq/``` folder after the download: ```ls fastq```
If you have **only one file per SRA number**, your data are single-end; please go to the **single-end section** (the optional check below shows how to tell at a glance).
* Don’t waste time on a paper if its SRA data take more than a day to download.😊 Or open another tmux session while you wait.
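If you’re not sure which you have, a minimal check (assuming `SraAccList.csv` holds one accession per line) counts the fastq files per accession:
```linux
# 1 file per accession = single-end; 2 files = paired-end
for acc in $(cat SraAccList.csv); do
  echo -e "$acc\t$(ls fastq/${acc}*.fastq 2>/dev/null | wc -l) file(s)"
done
```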
# Paired-end sequence
## Step 2: Create a manifest
The **manifest file** requires three pieces of information: “sample-id”, “absolute-filepath”, and sequence “direction”.
* No matter what method you use, just make sure your manifest file looks like the following:
```linux
sample-id,absolute-filepath,direction
BM1,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM1_R1.fastq.gz,forward
BM1,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM1_R2.fastq.gz,reverse
BM2,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM2_R1.fastq.gz,forward
BM2,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM2_R2.fastq.gz,reverse
BM3,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM3_R1.fastq.gz,forward
BM3,/home/lab617/YY/samplingNGS/BeiMen_230215_Mangrove/BM3_R2.fastq.gz,reverse
```
Here I will demonstrate how to edit a manifest file using R.
* First, use ```ls``` to list the absolute filepaths of all files in the fastq folder, then write them into the manifest file.
```linux
pwd
# get your ".fastq*" location
ls /your_location/fastq/*.fastq* -d > manifest
# ex: ls /home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/*.fastq* -d
```
* Edit the columns:
**“sample-id”**: Create your own sample IDs or use sequential numbers as in this example. Save the result as a CSV file.
**"direction"**: A paired-end sample has two fastq files: forward reads and reverse reads.
```R
R
# in R: build the three-column manifest
manifest <- read.table('manifest', header = FALSE)
colnames(manifest) <- 'absolute-filepath'
n_samples <- 3  # replace 3 with your actual number of samples
manifest$"sample-id" <- rep(1:n_samples, each = 2)  # two rows (R1/R2) per sample
manifest$"direction" <- c('forward', 'reverse')     # recycled down the rows
manifest <- manifest[, c("sample-id", "absolute-filepath", "direction")]
write.csv(manifest, "manifest", row.names = FALSE, quote = FALSE)
quit()
```
## Step 3: Import data from manifest
* Importing data for use in QIIME 2.
```linux
mkdir reads_qza
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest \
--output-path reads_qza/paired_end.qza \
--input-format PairedEndFastqManifestPhred33
```
* Got an error while importing data? First make sure there’s nothing wrong with the manifest. If it’s all good, try one of the alternative input formats below; note that the V2 formats expect a different, tab-separated manifest layout (see the example below). (Or just drop this paper if you have other choices.👍)
```
PairedEndFastqManifestPhred33V2
PairedEndFastqManifestPhred64V2
```
## Step 4: Remove amplicon primers
* Run Cutadapt with the forward and reverse primer tags taken from your paper:
```linux
time qiime cutadapt trim-paired \
--i-demultiplexed-sequences reads_qza/paired_end.qza \
--p-cores 40 \
--p-front-f FORWARD_PRIMER \
--p-front-r REVERSE_PRIMER \
--p-discard-untrimmed \
--o-trimmed-sequences reads_qza/primer-trimmed.qza \
--verbose \
&> primer_trimming.log
```
* Check ```primer_trimming.log``` and ensure that **Total written (filtered)** is sufficient.
For example, >90% is good, while <60% is not; in the latter case, review your primer tag or consider using a different one. A quick way to pull these numbers from the log is shown below.
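Cutadapt prints a `Total written (filtered):` line for each sample, so you can list all the pass rates at once:
```linux
grep "Total written (filtered):" primer_trimming.log
```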
* Common primer pairs are listed below. Use these options in the above command for the forward and reverse sequences:
**V3V4:**
```linux
--p-front-f CCTACGGGNGGCWGCAG \
--p-front-r GACTACHVGGGTATCTAATCC \
```
**V5V6:**
```linux
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r CCGYCAATTYMTTTRAGTTT \
```
**V6V8:**
```linux
--p-front-f TYAATYGGANTCAACRCC \
--p-front-r CRGTGWGTRCAAGGRGCA \
```
## Step 5: Examine the quality of the data
* Create a ```.qzv``` file in the ```reads_qza/``` directory and download it to your computer.
```linux
qiime demux summarize \
--i-data reads_qza/primer-trimmed.qza \
--o-visualization reads_qza/primer-trimmed.qzv
```
* Head over to [qiime2 view](https://view.qiime2.org/) and upload your ```.qzv``` file to check the sequence quality. You can find this figure under **Interactive Quality Plot**.


* Truncate reads at the position where the mean quality score drops below 25. Feel free to try different cutoffs, but to save time, pick positions where the quality is still slightly above 25. In this example, set the cutoffs to F: 230, R: 230.
## Step 6: Denoise sample sequences: DADA2
* Fill in ```--p-trunc-len-f``` and ```--p-trunc-len-r``` with the cutoff values determined from the **Interactive Quality Plot**, then run DADA2 for sequence denoising (a filled-in example follows the command).
```linux
time qiime dada2 denoise-paired \
--i-demultiplexed-seqs reads_qza/primer-trimmed.qza \
--p-trunc-len-f FCutoff \
--p-trunc-len-r RCutoff \
--p-n-threads 40 \
--output-dir dada2_output
```
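For example, with the cutoffs chosen in Step 5 (F: 230, R: 230), the filled-in command would read:
```linux
time qiime dada2 denoise-paired \
--i-demultiplexed-seqs reads_qza/primer-trimmed.qza \
--p-trunc-len-f 230 \
--p-trunc-len-r 230 \
--p-n-threads 40 \
--output-dir dada2_output
```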
## Step 7: Check stats output
* Generate a report for the denoised data, allowing you to review the read counts for each sample after filtering, denoising, merging, and removing chimeras.
```linux
qiime tools export \
--input-path dada2_output/denoising_stats.qza \
--output-path dada2_output
```
* Check the ```dada2_output/stats.tsv``` file to ensure each sample has over 10,000 reads left after all the filtering (a quick way to view the table is shown below). For public data, if a paper can’t clear this bar, it’s not worth the report👌; for your own samples, you should definitely make sure each one exceeds 10,000 reads. Once you've completed this step, you're ready to move on to the **Taxonomy Classification** stage.
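An optional one-liner to view the stats as an aligned table in the terminal:
```linux
column -t -s$'\t' dada2_output/stats.tsv | less -S
```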
# Single-end sequence
## Step 2: Create a manifest
The **manifest file** requires three pieces of information: “sample-id”, “absolute-filepath”, and sequence “direction”.
* No matter what method you use, just make sure your manifest file looks like the following:
```linux
sample-id,absolute-filepath,direction
1,/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929474_1.fastq,forward
2,/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929475_1.fastq,forward
3,/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929476_1.fastq,forward
```
Here I will demonstrate how to edit a manifest file using R.
* First, use ```ls``` to list the absolute filepaths of all files in the fastq folder, then write them into the manifest file.
```linux
pwd
# get your ".fastq*" location
ls /your_location/fastq/*.fastq* -d > manifest
# ex: ls /home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/*.fastq* -d
```
* Edit the columns:
**sample-id**: Create your own sample IDs or use sequential numbers as in this example. Save the result as a CSV file.
**direction**: A single-end sample has one fastq file.
```r
R
# in R: build the three-column manifest
manifest <- read.table('manifest', header = FALSE)
colnames(manifest) <- 'absolute-filepath'
n_samples <- 9  # replace 9 with your actual number of samples
manifest$"sample-id" <- 1:n_samples
manifest$"direction" <- 'forward'  # single-end: one forward file per sample
manifest <- manifest[, c("sample-id", "absolute-filepath", "direction")]
write.csv(manifest, "manifest", row.names = FALSE, quote = FALSE)
quit()
```
## Step 3: Import data from manifest
* Importing data for use in QIIME 2.
```linux
mkdir reads_qza
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path manifest \
--output-path reads_qza/SingleEnd.qza \
--input-format SingleEndFastqManifestPhred33
```
* Got an error while importing data? First make sure there’s nothing wrong with the manifest. If it’s all good, try one of the alternative input formats below; note that the V2 formats expect a different, tab-separated manifest layout (see the example below). (Or just drop this paper if you have other choices.👍)
```
SingleEndFastqManifestPhred33V2
SingleEndFastqManifestPhred64V2
```
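For reference, the single-end V2 formats expect a **tab-separated** manifest with just two header columns, for example:
```linux
# columns are separated by tabs
sample-id	absolute-filepath
1	/home/lab617/YY/2023NGS_Classreport/Stromatolite_1/fastq/SRR1929474_1.fastq
```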
## Step 4: Remove amplicon primers
* Run Cutadapt with the forward and reverse primer tags taken from your paper:
```linux
qiime cutadapt trim-single \
--i-demultiplexed-sequences reads_qza/SingleEnd.qza \
--p-cores 40 \
--p-front FORWARD_PRIMER \
--p-front REVERSE_PRIMER \
--p-discard-untrimmed \
--o-trimmed-sequences reads_qza/primer-trimmed.qza \
--verbose \
&> primer_trimming.log
```
* Check the `primer_trimming.log` and ensure that **Total written (filtered)** is sufficient.
For example, >90% is good, while <60% is not, in which case you should review your primer tag or consider using a different one.
## Step 5: Examine the quality of the data
* Create a `.qzv` file in the `reads_qza/` directory and download it to your computer.
```linux
qiime demux summarize \
--i-data reads_qza/primer-trimmed.qza \
--o-visualization reads_qza/primer-trimmed.qzv
```
* Head over to [qiime2 view](https://view.qiime2.org/) and upload your ```.qzv``` file to check the sequence quality. You can find this figure under **Interactive Quality Plot**.

* Truncate reads at the position where the mean quality score drops below 25. Feel free to try a different cutoff, but to save time, pick a position where the quality is still slightly above 25. In this example, set the cutoff to 80.
## Step 6: Denoise sample sequences: DADA2
* Fill in ```--p-trunc-len``` with the cutoff value determined from the **Interactive Quality Plot** (here, 80); ```--p-trim-left``` sets the start site and can stay at 0. Then run DADA2 for sequence denoising.
```linux
qiime dada2 denoise-single \
--i-demultiplexed-seqs reads_qza/primer-trimmed.qza \
--p-trim-left 0 \
--p-trunc-len Cutoff \
--p-n-threads 40 \
--output-dir dada2_output
```
## Step 7: Check stats output
* Generate a report for the denoised data, allowing you to review the read counts for each sample after filtering, denoising, and chimera removal.
```linux
qiime tools export \
--input-path dada2_output/denoising_stats.qza \
--output-path dada2_output
```
* Check the ```dada2_output/stats.tsv``` file to ensure each sample has over 10,000 reads left after all the filtering. (If these are your own samples, make sure each one exceeds 10,000 reads.) Once you've completed this step, you're ready to move on to the **Taxonomy Classification** stage.
# PacBio-Based Full-Length 16S rRNA Sequencing
Unlike Illumina sequencing, PacBio Circular Consensus Sequencing (CCS) data does not generate reads in a consistent forward or reverse orientation. This can be challenging, since many bioinformatics tools are designed to handle only one orientation. To address this, we follow the [microbiome_helper](https://github.com/LangilleLab/microbiome_helper/wiki/PacBio-CCS-Amplicon-SOP-v1-%28qiime2%29) approach. Additionally, the quality scores of PacBio CCS data are nearly 100, which is outside QIIME2's expected range; therefore, this workflow recommends checking sequence quality with FastQC instead (see the example below).
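For example, a minimal FastQC run over the raw CCS reads (a sketch, assuming `fastqc` is installed and your raw files sit in `rawdata/` as in Step 2):
```linux
mkdir fastqc_out
fastqc -t 4 rawdata/*.fastq* -o fastqc_out
```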
## Step 2: Resolving orientation issues
First, make a copy of the raw files, generate their reverse complement, and concatenate them.
```linux
mkdir rawdata
# put your PacBio raw data (*.fastq.gz) in this folder, then decompress
gzip -d rawdata/*.gz
mkdir raw_data_rc/
mkdir raw_data_cat/
# reverse-complement every file, then concatenate each original with its reverse complement
parallel -j 40 'seqtk seq -r {} > raw_data_rc/{/.}_rc.fastq' ::: rawdata/*.fastq
parallel -j 40 --link 'cat {1} {2} > raw_data_cat/{1/.}_cat.fastq' ::: rawdata/*.fastq ::: raw_data_rc/*_rc.fastq
```
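An optional sanity check: each `*_cat.fastq` should contain exactly twice as many lines as its source file.
```linux
wc -l rawdata/*.fastq raw_data_cat/*_cat.fastq
```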
## Step 3: Remove amplicon primers
By specifying the correct 5'-3' orientation, we filter the concatenated files so that only reads in that orientation are kept, ensuring every retained read has a consistent direction.
**Bacteria universal primer set:**
27F (5′-AGRGTTYGATYMTGGCTCAG-3')
1492R (5'-AAGTCGTAACAAGGTARCY-3')
```linux
mkdir trimmed_reads/
parallel -j 40 'cutadapt -g AGRGTTYGATYMTGGCTCAG...AAGTCGTAACAAGGTARCY \
--discard-untrimmed --no-indels -j 1 -m 1200 -M 1800 \
-o trimmed_reads/{/.}_trim.fastq {}' \
::: raw_data_cat/*_cat.fastq
```
In its log, this step reports a per-adapter table with numerical columns **length**, **count**, **expect**, **max.err**, and **error counts**, separated by tabs.
## Step 4: Verify read counts after filtering
After trimming primers with `cutadapt`, let's check the read counts. In `fastq` format, each read consists of 4 lines, so the total number of reads can be calculated using:
```linux
echo -e "Sample\tRaw_Reads\tTrimmed_Reads\tPercentage" > read_counts.tsv
paste <(for f in rawdata/*.fastq; do expr $(wc -l < "$f") / 4; done) \
<(for f in trimmed_reads/*.fastq; do expr $(wc -l < "$f") / 4; done) \
| awk 'BEGIN {print "Sample\tRaw_Reads\tTrimmed_Reads\tPercentage"}
{if ($1 > 0) per = ($2 / $1) * 100; else per = "NA";
print NR "\t" $1 "\t" $2 "\t" per "%"}' >> read_counts.tsv
```
After generating `read_counts.tsv`, check the number of reads that passed the filtering process. This file contains a summary of raw and trimmed read counts for each sample, along with the percentage of reads retained after trimming.
| Sample | Raw_Reads | Trimmed_Reads | Percentage |
|:------:|:---------:|:-------------:|:----------:|
| 1 | 177242 | 145221 | 81.9337% |
| 2 | 166646 | 145409 | 87.2562% |
| 3 | 155264 | 136864 | 88.1492% |
Steps 2 and 3 generate intermediate files in the folders `raw_data_rc` and `raw_data_cat`. After completing this step and confirming there are no major issues, these folders can be deleted to save space.
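For example:
```linux
# remove the intermediate reverse-complement and concatenated files
rm -r raw_data_rc/ raw_data_cat/
```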
## Step 5: Create a manifest
Then, create a manifest by following the commands in [Single-end sequence Step 2](https://hackmd.io/HGRIDcE5SNaFfQ4HX-_Wrg#Step-2-Create-a-manifest12). Alternatively, run the command below, download the resulting file, and edit it in Excel.
```linux
pwd
# get your ".fastq*" location
ls /your_location/trimmed_reads/*fastq -d > manifest
```
No matter what method you use, just make sure your manifest file looks like the following:
```linux
sample-id,absolute-filepath,direction
1,/your_location/trimmed_reads/SRR1929474_1_cat_trim.fastq,forward
2,/your_location/trimmed_reads/SRR1929475_1_cat_trim.fastq,forward
3,/your_location/trimmed_reads/SRR1929476_1_cat_trim.fastq,forward
```
## Step 6: Import data from manifest
Importing data for use in QIIME 2.
```linux
mkdir reads_qza
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path manifest \
--output-path reads_qza/trimmed_reads.qza \
--input-format SingleEndFastqManifestPhred33
```
## Step 7: DADA2 denoising
Run the DADA2 workflow to correct reads and generate amplicon sequence variants (ASVs). No initial quality filtering or truncation is applied, as CCS reads already have high accuracy (99.9% consensus).
```linux
qiime dada2 denoise-single \
--i-demultiplexed-seqs reads_qza/trimmed_reads.qza \
--p-trunc-len 0 \
--p-max-ee 3 \
--p-n-threads 40 \
--p-n-reads-learn 100000 \
--output-dir dada2_output
```
## Step 8: Check stats output
Generate a report for the denoised data to review the read counts for each sample.
```linux
qiime tools export --input-path dada2_output/denoising_stats.qza --output-path dada2_output
mv dada2_output/stats.tsv dada2_output/dada2_stats.tsv
```
Notably, the input column in this table represents reads after primer trimming with cutadapt.
| Sample-ID | Input | Filtered | % Passed Filter | Denoised | Non-Chimeric | % Non-Chimeric |
|:---------:|:------:|:--------:|:---------------:|:--------:|:------------:|:--------------:|
| HYbig | 145221 | 133155 | 91.69% | 131735 | 126978 | 87.44% |
| HYrock | 145409 | 133049 | 91.5% | 130938 | 129534 | 89.08% |
| PKX1 | 136864 | 124831 | 91.21% | 120618 | 119467 | 87.29% |
# Taxonomy Classification
Welcome to this stage!! You're almost there!!🪇
## FRCI Database
* FRLDB holds 62 strains, including FaRLiP cyanobacteria and Chl *d*-containing strains. A 97% identity threshold effectively distinguishes far-red light (FRL) from visible light (VL) cyanobacteria of the same genus.
```linux
# FRCI: classify representative ASVs against FRLDB at 97% identity
time qiime feature-classifier classify-consensus-vsearch \
--i-query dada2_output/representative_sequences.qza \
--i-reference-reads /home/lab617/Genomics/FRLDB/220921_FRLDB_01.qza \
--i-reference-taxonomy /home/lab617/Genomics/FRLDB/220921_FRLDB_TX.qza \
--p-maxaccepts 1 \
--p-perc-identity 0.97 \
--p-query-cov 1 \
--p-threads 40 \
--verbose \
--output-dir FRL_taxa_97
# export the classification, fix the header for biom, then attach taxonomy to the feature table
qiime tools export --input-path FRL_taxa_97/classification.qza --output-path FRL_taxa_97
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' FRL_taxa_97/taxonomy.tsv
qiime tools export --input-path dada2_output/table.qza --output-path dada2_output_silva_exported
biom add-metadata -i dada2_output_silva_exported/feature-table.biom -o dada2_output_silva_exported/feature-table_w_tax.biom --observation-metadata-fp FRL_taxa_97/taxonomy.tsv --sc-separated taxonomy
biom convert -i dada2_output_silva_exported/feature-table_w_tax.biom -o dada2_output_silva_exported/FRL_taxa_97.txt --to-tsv --header-key taxonomy
```
* When the verbose output reports `Matching query sequences:`, you have completed FRCI!🐋 Please copy that result line into your report.🐻👍

* The far-red light cyanobacteria taxonomy classification is stored in `dada2_output_silva_exported/FRL_taxa_97.txt`.
| OTU ID | Sample1 | Sample2 | taxonomy |
| ---------- | ------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| 7a7df92c4f | 44142 | 13 | K_Bacteria; P_Cyanobacteria; O_Synechococcales; F_Leptolyngbyaceae; G_Leptolyngbya; S_cf. Leptolyngbya sp. CCNUM2 |
| 8958f347d | 13893 | 4 | Unassigned |
| 7c51777f79 | 10791 | 4 | K_Bacteria; P_Cyanobacteria; O_Chroococcidiopsidales; F_Chroococcidiopsidaceae; G_Chroococcidiopsis; S_Chroococcidiopsis sp. CCNUC1 |
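A quick way to preview only the classified rows in this table (`grep -v` simply drops the `Unassigned` features):
```linux
grep -v "Unassigned" dada2_output_silva_exported/FRL_taxa_97.txt | head
```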
## CyanoDB
* For convenience, we separated the cyanobacteria phylum **(excluding chloroplasts)** from the SILVA database so that the cyanobacterial portion of each sample can be analyzed separately.
```linux
#cyano
time qiime feature-classifier classify-consensus-vsearch \
--i-query dada2_output/representative_sequences.qza \
--i-reference-reads /home/lab617/Genomics/databases/Silva138_OnlyCyano.qza \
--i-reference-taxonomy /home/lab617/Genomics/databases/Silva138_Cyano_TX.qza \
--p-maxaccepts 1 \
--p-perc-identity 0.97 \
--p-query-cov 1 \
--p-threads 40 \
--verbose \
--output-dir Cyano_taxa_97
qiime tools export --input-path Cyano_taxa_97/classification.qza --output-path Cyano_taxa_97
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' Cyano_taxa_97/taxonomy.tsv
qiime tools export --input-path dada2_output/table.qza --output-path dada2_output_silva_exported
biom add-metadata -i dada2_output_silva_exported/feature-table.biom -o dada2_output_silva_exported/feature-table_w_tax.biom --observation-metadata-fp Cyano_taxa_97/taxonomy.tsv --sc-separated taxonomy
biom convert -i dada2_output_silva_exported/feature-table_w_tax.biom -o dada2_output_silva_exported/Cyano_taxa_97.txt --to-tsv --header-key taxonomy
```
## SilvaDB
* The SILVA database contains extensive bacterial taxonomy and sequence data and is widely used for bacterial metagenomic analysis. This step may take a long time, since it classifies the whole community rather than just the cyanobacteria.
* ***Note***: When working with PacBio data, the default taxonomy classification may be very slow. To improve speed, add the following options to your command:
``--p-maxaccepts 1 --p-strand "plus" \``
```linux
#silva
time qiime feature-classifier classify-consensus-vsearch \
--i-query dada2_output/representative_sequences.qza \
--i-reference-reads /home/lab617/Genomics/databases/silva-138-99-seqs.qza \
--i-reference-taxonomy /home/lab617/Genomics/databases/silva-138-99-tax.qza \
--p-threads 40 \
--verbose \
--output-dir silva_taxa
qiime tools export --input-path silva_taxa/classification.qza --output-path silva_taxa
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' silva_taxa/taxonomy.tsv
qiime tools export --input-path dada2_output/table.qza --output-path dada2_output_silva_exported
biom add-metadata -i dada2_output_silva_exported/feature-table.biom -o dada2_output_silva_exported/feature-table_w_tax.biom --observation-metadata-fp silva_taxa/taxonomy.tsv --sc-separated taxonomy
biom convert -i dada2_output_silva_exported/feature-table_w_tax.biom -o dada2_output_silva_exported/silvatax.txt --to-tsv --header-key taxonomy
```
**Done! Make some happy Cyanobacteria noise!!**
