---
# System prepended metadata

title: Analyzing 16S data from brine-dominated hydrothermal vent samples

---

# Analyzing 16S data from brine-dominated hydrothermal vent samples

I have 96 folders containing Illumina NextSeq2000 paired-end reads in fastq files of filtered diffuse-flow fluids, swabbed chimneys, and ground-up chimneys from a brine-dominated hydrothermal vent field, as well as several full-process negative controls from the three different extraction kits, onto my institution's HPC (Poseidon, WHOI) in the folder "16S_raw_nov12". I want to create ASVs, visualize the community, remove contaminants using decontam (in DADA2), and compare the populations between my different sample types.

Approach: 
1. Use Sarah Hu’s QIIME 2 HPC setup to:
Install QIIME 2 on Poseidon.
Import and run `qiime dada2 denoise-paired`.
Export table.qza, rep-seqs.qza, and metadata to R (qiime2R).
2. In R:
Build a phyloseq object.
Run decontam directly (with full-process negative controls negs).
Analyses and make figures.

This code is modified from Sarah Hu's github: https://github.com/shu251/qiime2-2024/tree/main?tab=readme-ov-file and Sabrina Elkassas's modifications: https://github.com/corporeal-snow-albatross-5/Qiime2_Mariana_forearc_16S_Amplicons

Primer set: 515F (Parada et al., 2016), 926R (Quince et al., 2011)

## Downloading sequences from base space
```
# Set up the command-line interface. 
mkdir $HOME/bin

# Linux download link
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs

# Add executing privileges
chmod u+x $HOME/bin/bs

# authorize
bs auth

# Navigate to find your project number in base space or type: 
bs list projects -F Name -F Id

#then type this to get just your fastQ files: 
bs download project -i 467718649 -o /vortexfs1/omics/huber/selkassas/Mariana_amplicon_sequences_2 --extension fastq.gz

# or all files: (this is what I did)
bs download project -i 476556398 -o /vortexfs1/omics/huber/afulford/16S_raw_nov12
```

## Using QIIME2 for pre-processing
### Install Qiime2 on Poseidon using a singularity container

```
#enter scavenger 
srun -p scavenger --time=04:00:00 --ntasks-per-node 1 --mem=5gb --pty bash

#install using a singularity container
module load singularity/3.7
singularity pull \
  docker://quay.io/qiime2/amplicon:2025.10

#test it to see if it works
singularity exec amplicon_2025.10.sif qiime --help

#docker image file should be around 2 gb. Check:
ls -lh amplicon_2025.10.sif
```

### Git clone Sarah Hu's Qiime2 repo to get the code for the manifest file
`git clone git@github.com:shu251/qiime2-2024.git`

I had to download the GitHub repo to my local computer and scp it into the directory where my sequences are stored. 

```
scp -r /Users/averyfulford/Documents/qiime2-2024-main avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12
```

### Edit Sarah's create-manifest R script
I need to make a few changes. First, I need to change the setwd() file path and file output path. Second, I need to change the R script because each of my forward and reverse fastq files are nested inside individual folders under raw/. Because of that, FASTQ contains a path + filename, so I need to extract SAMPLEID from just the basename rather than the whole path. 

Here is my edited create-manifest.R script:

```
# Read in sample IDs and list of raw fastq files to generate a manifest file ready for qiime2 import

# Change working directory to sequences (your raw directory)
setwd("/proj/omics/huber/afulford/16S_nov12/raw")

# Required libraries
library(tidyverse)
library(stringr)

# Import all FASTQ files recursively
paths <- as.data.frame(
  list.files(
    pattern   = "\\.fastq\\.gz$",
    full.names = TRUE,
    recursive  = TRUE
  )
)

colnames(paths)[1] <- "FASTQ"  # column with full (relative) paths to fastqs

# Extract filename and sample ID
paths_run <- paths %>%
  mutate(
    FNAME    = basename(FASTQ),  # just the filename, no directory
    # Extract sample ID from filename:
    # e.g. 7081_5Day_R1_1_Jul02_S8_L001_R1_001.fastq.gz
    # -> SAMPLEID = "7081_5Day_R1_1_Jul02_S8"
    SAMPLEID = str_extract(FNAME, "^.+(?=_L001_R[12]_001\\.fastq\\.gz)")
  )

# Build full absolute path to each FASTQ
paths_run$FULL_PATH <- normalizePath(paths_run$FASTQ)

# Forward reads (R1)
forward <- paths_run %>%
  filter(str_detect(FNAME, "R1_001\\.fastq\\.gz")) %>%
  select(SAMPLEID, `forward-absolute-filepath` = FULL_PATH)

# Reverse reads (R2)
reverse <- paths_run %>%
  filter(str_detect(FNAME, "R2_001\\.fastq\\.gz")) %>%
  select(SAMPLEID, `reverse-absolute-filepath` = FULL_PATH)

# Join forward and reverse on SAMPLEID
manifest <- forward %>%
  inner_join(reverse, by = "SAMPLEID") %>%   # only keep samples that have both R1 and R2
  select(
    `sample-id` = SAMPLEID,
    `forward-absolute-filepath`,
    `reverse-absolute-filepath`
    # -> SAMPLEID = "7081_5Day_R1_1_Jul02_S8"
    SAMPLEID = str_extract(FNAME, "^.+(?=_L001_R[12]_001\\.fastq\\.gz)")
  )

# Build full absolute path to each FASTQ
paths_run$FULL_PATH <- normalizePath(paths_run$FASTQ)

# Forward reads (R1)
forward <- paths_run %>%
  filter(str_detect(FNAME, "R1_001\\.fastq\\.gz")) %>%
  select(SAMPLEID, `forward-absolute-filepath` = FULL_PATH)

# Reverse reads (R2)
reverse <- paths_run %>%
  filter(str_detect(FNAME, "R2_001\\.fastq\\.gz")) %>%
  select(SAMPLEID, `reverse-absolute-filepath` = FULL_PATH)

# Join forward and reverse on SAMPLEID
manifest <- forward %>%
  inner_join(reverse, by = "SAMPLEID") %>%   # only keep samples that have both R1 and R2
  select(
    `sample-id` = SAMPLEID,
    `forward-absolute-filepath`,
    `reverse-absolute-filepath`
  ) %>%
  arrange(`sample-id`)

# Write manifest file (one level up from raw/)
write.table(
  manifest,
  file      = "/proj/omics/huber/afulford/16S_nov12/manifest.tsv",
  quote     = FALSE,
  col.names = TRUE,
  row.names = FALSE,
  sep       = "\t"
)

cat("Wrote manifest for", nrow(manifest), "samples\n")
```
### Run create-manifest.R

Make an R environment: 
```
conda create -n r_env -c conda-forge r-base r-essentials -y
conda activate r_env
R --version
```

Run the script using `Rscript create-manifest-ahf.R`.

I then manually went into the manifest.tsv files to delete the last line, which had Undetermined reads (those that were missing a recognizable barcode).

### Now I will create a slurm script to import my sequences as a Qiime2 artifact

In essence, a QIIME 2 artifact is a file containing any type of data along with metadata. Since QIIME 2 works with artifacts instead of data files (e.g. FASTA files), you must create a QIIME 2 artifact by importing data. Data produced by QIIME 2 also exist as QIIME 2 artifacts. A QIIME 2 artifact typically has the .qza file extension when stored in a file.

The benefit of having "artifacts" is that they enable QIIME 2 to track how the data came to be in addition to the data itself - you can trace back to all previous analyses that were run to produce the artifact, including the input data used at each step. It also allows the data to be not constrained by file type. 

Here, I'm importing my files as an artifact and then running demux summarize not to demultiplex, but to generate interactive quality plots, check sample depth, and guide trimming/truncation for DADA2.

Create a slurm script to import sequences. 
`nano qiime2_import.sh`

Copy into the document:
```
#!/bin/bash
#SBATCH --partition=compute                          # Queue selection
#SBATCH --job-name=qiime2_import                     # Job name
#SBATCH --mail-type=ALL                              # Mail events (BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=avery.fulford@whoi.edu           # Where to send mail
#SBATCH --ntasks=1                                   # Run a single task
#SBATCH --cpus-per-task=4                            # Number of CPU cores per task
#SBATCH --mem=10gb                                   # Job memory request
#SBATCH --time=24:00:00			       # Time limit hrs:min:sec
#SBATCH --output=qiime2_import.log                   # Job log name

export OMP_NUM_THREADS=4

# Load Singularity module
module load singularity/3.7

CONTAINER=/proj/omics/huber/afulford/amplicon_2025.10.sif
WORKDIR=/proj/omics/huber/afulford/16S_nov12
MANIFEST=manifest.tsv
DEMUX_QZA=demux_brinedom.qza
DEMUX_QZV=demux_brinedom.qzv

# 1) Import: run qiime inside the container, in WORKDIR, for manifest-based paired-end FASTQ
singularity exec --bind ${WORKDIR}:${WORKDIR} ${CONTAINER} \
  bash -c "cd ${WORKDIR} && \
    qiime tools import \
      --type 'SampleData[PairedEndSequencesWithQuality]' \
      --input-path ${MANIFEST} \
      --output-path ${DEMUX_QZA} \
      --input-format PairedEndFastqManifestPhred33V2"

# 2) Summarize demux (visualization of imported data including quality plots)
singularity exec --bind ${WORKDIR}:${WORKDIR} ${CONTAINER} \
  bash -c "cd ${WORKDIR} && \
    qiime demux summarize \
      --i-data ${DEMUX_QZA} \
      --o-visualization ${DEMUX_QZV}"
```

Run it: `sbatch qiime2_import.sh`
Check on it: `squeue -u avery.fulford`

### Trim primers and adapter sequences if present

Create slurm script: `nano qiime2_trim.sh`

Copy into slurm script: 
```
#!/bin/bash
#SBATCH --partition=compute                          # Queue selection
#SBATCH --job-name=qiime2_primer_trim                # Job name
#SBATCH --mail-type=ALL                              # Mail events (BEGIN, END, FAIL)
#SBATCH --mail-user=avery.fulford@whoi.edu           # Where to send mail
#SBATCH --ntasks=1                                   # Run a single task
#SBATCH --cpus-per-task=4                            # Number of CPU cores per task
#SBATCH --mem=10gb                                   # Job memory request
#SBATCH --time=24:00:00                              # Time limit hrs:min:sec
#SBATCH --output=qiime2_primer_trim.log              # Job log name
export OMP_NUM_THREADS=4

module load singularity/3.7

CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/proj/omics/huber/afulford/16S_nov12

IN_QZA=demux_16S_nov12.qza
OUT_QZA=demux_16S_nov12_trimmed.qza
OUT_QZV=demux_16S_nov12_trimmed.qzv

# 515F / 926R primers (degenerate)
F_PRIMER=GTGYCAGCMGCCGCGGTAA
R_PRIMER=CCGYCAATTYMTTTRAGTTT

# Nextera adapter (if present)
ADAPTER=CTGTCTCTTATACACATCT

# 1) Trim primers and adapters with cutadapt
singularity exec --bind ${WORKDIR}:${WORKDIR} ${CONTAINER} \
  bash -c "cd ${WORKDIR} && \
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences ${IN_QZA} \
      --p-cores ${SLURM_CPUS_PER_TASK} \
      --p-front-f ${F_PRIMER} \
      --p-front-r ${R_PRIMER} \
      --p-adapter-f ${ADAPTER} \
      --p-adapter-r ${ADAPTER} \
      --p-error-rate 0.1 \
      --p-overlap 3 \
      --p-match-adapter-wildcards \
      --o-trimmed-sequences ${OUT_QZA}"

# 2) Summarize trimmed reads to inspect quality after primer/adapter removal
singularity exec --bind ${WORKDIR}:${WORKDIR} ${CONTAINER} \
  bash -c "cd ${WORKDIR} && \
    qiime demux summarize \
      --i-data ${OUT_QZA} \
      --o-visualization ${OUT_QZV}"

```

Run slurm script: `sbatch qiime2_trim.sh`

#### Why use an error rate of 0.1 and a overlap length of 3? 

The primers I'm using are degenerate, so in a real sequencing run, the sequenced primer may differ from the written primer by several bp and there could be sequencing errors as well. An error rate of 0.1 allows ~10% of mismatches, which is good for degenerate primers and Illumina sequencing error profiles. An error rate of 0.1 is the default recommended by QIIME2 cutadapt docs and amplicon community workflows for degenerate primer sets. 

Cutadapt uses `--overlap` to determine how many bases need to match (imperfectly) for a primer hit to be considered valid. You want a low value because primers may appear slightly shifted, especially with small indels, unclear read boundaries, or truncated or partially degraded primer sequences. Apparently, even a tiny 3-base seed match is enough for cutadapt to align the primer region!

### View trimmed files on Qiime2View and decide where to truncate sequences

Qiime2View: https://view.qiime2.org/
Forum for truncation info:
https://forum.qiime2.org/t/deciding-where-to-trim-in-dada2-on-paired-end-reads/18762

Download files locally to view:
`scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/demux_brinedom_trimmed.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research`

This run is really good, both reads are mostly Q40 across the whole cycle range.

![Screenshot 2025-11-18 at 7.25.56 PM](https://hackmd.io/_uploads/ByzjCK5eWe.png)

Forward reads: basically Q≈38–40 all the way across; no obvious tail-drop.
Reverse reads: good through maybe ~150–180 bp, then the quality distribution starts to spread and the lower quantiles slide down into the low 20s and below.

I have a ~515F–926R amplicon (~400–412 bp), so for merging I want:
`truncLenF + truncLenR – ~410 ≥ 20–30 bp`

For the forward reads, lower whiskers dip <Q25 at base position 272. For the reverse reads, the median drops below Q30 at position 279, while the whiskers dip <Q25 at position 174, but things only really degrade around 193. I'm going to truncate at 270 for the forward reads and 190 for the reverse reads, which leaves a good 50 bp of overlap. 

I'm going with: 
```
--p-trunc-len-f 270 \
--p-trunc-len-r 190
```

## Run DADA2 to call ASVs!
Writing slurm script `nano qiime2_dada2.sh`

Copy into document:
```
#!/bin/bash
#SBATCH --partition=compute                   # Queue selection
#SBATCH --job-name=qiime2_dada2               # Job name
#SBATCH --mail-type=ALL                       # Mail events (BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=avery.fulford@whoi.edu    # Where to send mail
#SBATCH --ntasks=1                            # Run a single task
#SBATCH --cpus-per-task=4                     # Number of CPU cores per task
#SBATCH --mem=25gb                            # Job memory request
#SBATCH --time=12:00:00                       # Time limit hrs:min:sec
#SBATCH --output=qiime2_dada2.log             # Job log name
export OMP_NUM_THREADS=4

module load singularity/3.7

CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/proj/omics/huber/afulford/16S_nov12
TMPDIR_HOST=${WORKDIR}/tmp_qiime_dada2

mkdir -p "${TMPDIR_HOST}"

# Inputs / outputs (make them absolute)
IN_QZA=${WORKDIR}/demux_brinedom_trimmed.qza

TABLE_QZA=${WORKDIR}/asv-table_brinedom.qza
REPSEQ_QZA=${WORKDIR}/asvs-ref-seqs_brinedom.qza
STATS_QZA=${WORKDIR}/dada2-stats_brinedom.qza
BTSTATS_QZA=${WORKDIR}/dada2-basetransitions_brinedom.qza
STATS_QZV=${WORKDIR}/dada2-stats_brinedom.qzv

EXPORT_DIR=${WORKDIR}/exported_brinedom
ASV_TSV=${WORKDIR}/samples-asv-table_brinedom.tsv

mkdir -p "${TMPDIR_HOST}"

############################
# 1) DADA2 denoise-paired  #
############################
singularity exec \
  --bind $WORKDIR:$WORKDIR,$TMPDIR_HOST:$TMPDIR_HOST \
  $CONTAINER \
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $IN_QZA \
    --p-trunc-len-f 270 \
    --p-trunc-len-r 190 \
    --p-trim-left-f 0 \
    --p-trim-left-r 0 \
    --p-max-ee-f 2 \
    --p-max-ee-r 2 \
    --p-min-overlap 12 \
    --p-pooling-method independent \
    --p-n-reads-learn 1000000 \
    --p-n-threads $SLURM_CPUS_PER_TASK \
    --p-chimera-method consensus \
    --o-table $TABLE_QZA \
    --o-representative-sequences $REPSEQ_QZA \
    --o-denoising-stats $STATS_QZA \
    --o-base-transition-stats $BTSTATS_QZA

########################################
# 2) EXPORT BIOM TABLE
########################################
singularity exec \
  --bind $WORKDIR:$WORKDIR \
  $CONTAINER \
  qiime tools export \
    --input-path $TABLE_QZA \
    --output-path $WORKDIR/exported_brinedom

########################################
# 3) CONVERT TO TSV
########################################
singularity exec \
  --bind $WORKDIR:$WORKDIR \
  $CONTAINER \
  biom convert \
    -i $WORKDIR/exported_brinedom/feature-table.biom \
    -o $ASV_TSV \
    --to-tsv

########################################
# 4) TABULATE DADA2 STATS
########################################
singularity exec \
  --bind $WORKDIR:$WORKDIR \
  $CONTAINER \
  qiime metadata tabulate \
    --m-input-file $STATS_QZA \
    --o-visualization $STATS_QZV
```

Explanation of output files:

`samples-asv-table.tsv`: ASV table that is tab-delimited. It should show your samples as columns, ASVs (or FeatureIDs) as rows, and numbers will represent the number of sequences assigned to the respective ASV and sample.

`paired-end-input-asvs-ref-seqs.qza`: Reference sequences for the output ASVs. Use this below to assign taxonomy.

`paired-end-input-dada2-stats.qzv`: Output visualization of the DADA2 stats. Import this to the QIIME2 viewer to learn more about how your sequences did.

## Assign taxonomy
### Download SILVA 16S rRNA db for use in QIIME2 using the Rescript plugin
Made file `qiime2_silva_format.sh`:

```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=qiime2_silva_format
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=100gb
#SBATCH --time=24:00:00
#SBATCH --output=qiime2_silva_format.log

set -euo pipefail
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load singularity/3.7

# Paths (Avery)
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR="/vortexfs1/omics/huber/afulford/16S_nov12/"

TMPDIR_HOST=$WORKDIR/tmp_rescript
MPLTMP=/tmp/mplconfig
mkdir -p "$TMPDIR_HOST" "$MPLTMP"

BIND="--bind $WORKDIR:/data,$TMPDIR_HOST:$TMPDIR_HOST"
ENVVARS="--env TMPDIR=$TMPDIR_HOST --env MPLCONFIGDIR=$MPLTMP"

# Outputs (match Sarah/Sabrina’s VSEARCH script names)
SILVA_RNA=/data/silva-138.2-ssu-nr99-rna-seqs_926R.qza
SILVA_TAX=/data/silva-138.2-ssu-nr99-tax_926R.qza
SILVA_DNA=/data/silva-138.2-ssu-nr99-seqs_926R.qza
SILVA_CLEAN=/data/silva-138.2-ssu-nr99-seqs-cleaned_926R.qza
SILVA_LENFILT=/data/silva-138.2-ssu-nr99-seqs-filt_926R.qza
SILVA_DISCARD=/data/silva-138.2-ssu-nr99-seqs-discard_926R.qza

# 1) Get SILVA 138.2 SSU NR99 + taxonomy
############################################
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime rescript get-silva-data \
  --p-version 138.2 \
  --p-target SSURef_NR99 \
  --p-include-species-labels \
  --o-silva-sequences $SILVA_RNA \
  --o-silva-taxonomy $SILVA_TAX

# 2) Reverse-transcribe (RNA -> DNA)
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime rescript reverse-transcribe \
  --i-rna-sequences $SILVA_RNA \
  --o-dna-sequences $SILVA_DNA

# 3) Cull low-quality sequences
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime rescript cull-seqs \
  --i-sequences $SILVA_DNA \
  --o-clean-sequences $SILVA_CLEAN

# 4) Length filter by domain (VSEARCH reference)
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime rescript filter-seqs-length-by-taxon \
  --i-sequences $SILVA_CLEAN \
  --i-taxonomy $SILVA_TAX \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs $SILVA_LENFILT \
  --o-discarded-seqs $SILVA_DISCARD

# 5) Dereplicate sequences + unify taxonomy (recommended)
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime rescript dereplicate \
  --i-sequences /data/silva-138.2-ssu-nr99-seqs-filt_926R.qza \
  --i-taxa /data/silva-138.2-ssu-nr99-tax_926R.qza \
  --p-rank-handles silva \
  --p-mode uniq \
  --o-dereplicated-sequences /data/silva-138.2-ssu-nr99-seqs-derep_926R.qza \
  --o-dereplicated-taxa /data/silva-138.2-ssu-nr99-tax-derep_926R.qza
```

I ran all of this, but the last step to dereplicate sequences and unify taxonomy failed. The purpose of this is to remove duplicate 16S sequences from the database and, when there are duplicates that do not agree, to choose just one for downstream purposes. It is recommended because it makes the database you're working with smaller, but I don't actually need it for VSEARCH. So, I'm going to go ahead and move on without this step. 

To check that the code above the derep step successfully ran, I looked at the size of the files with `ls -lh silva-138.2-ssu-nr99-*_926R.qza:`

```
-rw-rw-r-- 1 avery.fulford domain users 193M Nov 23 18:50 silva-138.2-ssu-nr99-rna-seqs_926R.qza
-rw-rw-r-- 1 avery.fulford domain users 174M Nov 23 19:04 silva-138.2-ssu-nr99-seqs_926R.qza
-rw-rw-r-- 1 avery.fulford domain users 164M Nov 23 21:23 silva-138.2-ssu-nr99-seqs-cleaned_926R.qza
-rw-rw-r-- 1 avery.fulford domain users 1.4M Nov 23 21:40 silva-138.2-ssu-nr99-seqs-discard_926R.qza
-rw-rw-r-- 1 avery.fulford domain users 162M Nov 23 21:40 silva-138.2-ssu-nr99-seqs-filt_926R.qza
-rw-rw-r-- 1 avery.fulford domain users 7.4M Nov 23 18:50 silva-138.2-ssu-nr99-tax_926R.qza
```
The files are huge. Great!

### Classify ASVs with VSEARCH
Now I'm going to use the Silva database I constructed above to classify ASVs. Following Sarah Hu's protocol, I'm going to use a 90% identity match and ensure that 75% of the assignments much match the top hit of the reference to be called a consensus assignment.

Unfortunately, I keep getting a time out message and I can't boost the amount of time above 24 hours per admin constraints. I'm going to try again to dereplicate sequences, unify taxonomy, and extract only the region between my primers to make the Silva database smaller and therefore easier to search against in VSEARCH. 

Making a script called `qiime2_derep_extract.sh`:

```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=qiime2_derep_extract
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=40gb
#SBATCH --time=24:00:00
#SBATCH --output=qiime2_derep_classify.log

set -euo pipefail
module load singularity/3.7
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12
TMP=$WORKDIR/tmp_rescript; MPL=/tmp/mplconfig
mkdir -p "$TMP" "$MPL"

# Dereplicate (explicit rank list for RESCRIPt)
singularity exec --cleanenv --env TMPDIR=$TMP --env MPLCONFIGDIR=$MPL \
  --bind $WORKDIR:/data,$TMP:$TMP $CONTAINER \
  qiime rescript dereplicate \
    --i-sequences /data/silva-138.2-ssu-nr99-seqs-filt_926R.qza \
    --i-taxa /data/silva-138.2-ssu-nr99-tax_926R.qza \
    --p-rank-handles domain kingdom phylum class order family genus species \
    --p-mode uniq \
    --o-dereplicated-sequences /data/silva-138.2-ssu-nr99-seqs-derep_926R.qza \
    --o-dereplicated-taxa /data/silva-138.2-ssu-nr99-tax-derep_926R.qza

# Extract 515F–926R region (fastest for VSEARCH)
F=GTGYCAGCMGCCGCGGTAA
R=CCGYCAATTYMTTTRAGTTT
singularity exec --cleanenv --env TMPDIR=$TMP --env MPLCONFIGDIR=$MPL \
  --bind $WORKDIR:/data,$TMP:$TMP $CONTAINER \
  qiime feature-classifier extract-reads \
    --i-sequences /data/silva-138.2-ssu-nr99-seqs-derep_926R.qza \
    --p-f-primer $F --p-r-primer $R \
    --p-min-length 200 --p-max-length 500 \
    --p-n-jobs 16 \
    --o-reads /data/silva-138.2-ssu-nr99-seqs-derep-515F-926R.qza
```

Now attempting again to classify using `qiime2_classify.sh`:
```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=qiime2_classify
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=40gb
#SBATCH --time=72:00:00
#SBATCH --output=qiime2_classify.log

set -euo pipefail
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load singularity/3.7

CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12
TMPDIR_HOST=$WORKDIR/tmp_vsearch
MPLTMP=/tmp/mplconfig
mkdir -p "$TMPDIR_HOST" "$MPLTMP" "$WORKDIR/classification_926R_export"

BIND="--bind $WORKDIR:/data,$TMPDIR_HOST:$TMPDIR_HOST"
ENVVARS="--env TMPDIR=$TMPDIR_HOST --env MPLCONFIGDIR=$MPLTMP"

# DADA2 outputs
REPSEQ=/data/asvs-ref-seqs_brinedom.qza
TABLE=/data/asv-table_brinedom.qza

# Reference from qiime2_silva_format
REF_READS=/data/silva-138.2-ssu-nr99-seqs-derep-515F-926R.qza
REF_TAX=/data/silva-138.2-ssu-nr99-tax-derep_926R.qza

# Outputs
CLASS=/data/classification.qza
SEARCH=/data/search_results.qza
CLASS_QZV=/data/classification.qzv
BARS=/data/taxonomy-barplot.qzv

# 1) VSEARCH consensus classification
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime feature-classifier classify-consensus-vsearch \
  --i-query $REPSEQ \
  --i-reference-reads $REF_READS \
  --i-reference-taxonomy $REF_TAX \
  --o-classification $CLASS \
  --o-search-results $SEARCH \
  --p-threads $SLURM_CPUS_PER_TASK \
  --p-maxaccepts 10 \
  --p-perc-identity 0.90 \
  --p-min-consensus 0.75 \
  --p-query-cov 0.90

# 2) TSV export of taxonomy
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime tools export \
  --input-path $CLASS \
  --output-path /data/classification_926R_export

# 3) Taxonomy table viz
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime metadata tabulate \
  --m-input-file $CLASS \
  --o-visualization $CLASS_QZV

# 4) Barplots
singularity exec --cleanenv $ENVVARS $BIND $CONTAINER qiime taxa barplot \
  --i-table $TABLE \
  --i-taxonomy $CLASS \
  --o-visualization $BARS

echo "VSEARCH done. See:"
echo "  $CLASS      (taxonomy .qza)"
echo "  $CLASS_QZV  (taxonomy .qzv)"
echo "  $BARS       (barplots)"
echo "  /data/classification_926R_export (TSV export)"
```

### Next, create metadata file and add DADA2 stats to it, then use that to visualize taxonomy before contaminant removal.
Downloading data files:
```
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/*.qzv /Users/averyfulford/Desktop/MIT_WHOI4/
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/classification_926R_export/taxonomy.tsv /Users/averyfulford/Desktop/MIT_WHOI4/
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/exported_brinedom/feature-table.biom /Users/averyfulford/Desktop/MIT_WHOI4/
```

Here is the most up-to-date information about the Qiime2 metadata format: https://use.qiime2.org/en/latest/references/metadata.html

I merged data about my samples, the read information from the sequencing center, and `dada2-stats_brinedom.qzv` to create my metadata file using Excel. Pushing this file to the HPC: 
```
scp /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/metadata.tsv avery.fulford@poseidon.whoi.edu://vortexfs1/omics/huber/afulford/16S_nov12
```

Next, I'm going to add this metadata file to the `taxonomy-barplot.qzv` taxonomy visualization file: 

```
module load singularity/3.7

# First, validate the file:
singularity exec --cleanenv \
  --bind /vortexfs1/omics/huber/afulford/16S_nov12:/data \
  /vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif \
  qiime metadata tabulate \
  --m-input-file /data/metadata.tsv \
  --o-visualization /data/metadata.qzv

# Then, can run directly on the login node: 
singularity exec --cleanenv \
  --bind /vortexfs1/omics/huber/afulford/16S_nov12:/data \
  /vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif \
  qiime taxa barplot \
    --i-table /data/asv-table_brinedom.qza \
    --i-taxonomy /data/classification.qza \
    --m-metadata-file /data/metadata.tsv \
    --o-visualization /data/taxonomy-barplot_metadata.qzv
```

Download this back onto personal computer:
```
scp avery.fulford@poseidon.whoi.edu://vortexfs1/omics/huber/afulford/16S_nov12/taxonomy-barplot_metadata.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/
```

And visualize on Qiime2 viewer: https://view.qiime2.org

## Remove contaminants using decontam
### Downloading appropriate files
For downstream steps, view the `classification.qzv` on Qiime2 View and download it as a .tsv. The file `classification.qzv` has the ASV IDs matched with taxonomy. 

I'm going to do the next part in R. Here is code to export files to my local computer:
```
module load singularity/3.7
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12
cd $WORKDIR

# Export the ASV table (BIOM)
singularity exec --cleanenv --bind $WORKDIR:/data $CONTAINER \
  qiime tools export \
    --input-path /data/asv-table_brinedom.qza \
    --output-path /data/export_decontam

# Convert BIOM → TSV (OTU table; features as rows)
singularity exec --cleanenv --bind $WORKDIR:/data $CONTAINER \
  biom convert -i /data/export_decontam/feature-table.biom \
               -o /data/export_decontam/feature-table.tsv \
               --to-tsv

# Export taxonomy to help summarize contaminants by rank
singularity exec --cleanenv --bind $WORKDIR:/data $CONTAINER \
  qiime tools export \
    --input-path /data/classification.qza \
    --output-path /data/export_decontam/taxonomy
    
scp avery.fulford@poseidon.whoi.edu://vortexfs1/omics/huber/afulford/16S_nov12/export_decontam/*.tsv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/
```

### Considerations for running decontam
#### Prevalence- vs frequency-based method
I have chosen to use decontam's prevalance-based method rather than frequency-based or even a combination of the two. This is because I have several full-process negative controls that I took with the goal of using them to remove contaminants. I do have the DNA concentrations that I submitted to the sequencing center, but as several of these are dilutions of the samples and each extracted sample comes from a different amount of initial sample, I have chosen to focus on the full-process negative controls. 

Further, the prevalence-based method relies on two assumptions: (1) "sequences from contaminating taxa are likely to have higher prevalence in control samples than in true samples" (Davis et al., 2018) and (2) any contamination arises from external sources (e.g., during sampling or kit extraction) rather than cross-contamination between samples. The first assumption is good for my data, and the second assumption is appropriate as well, especially because I performed the DNA extractions of my negative controls on a separate day from the DNA extractions of my samples. 

By contrast, the frequency-based method relies on the same second assumption, but instead of the first assumption, it assumes that there is much more true sample DNA ("S") than contaminant DNA ("C") (or, in the manuscript, "S>>C") and that "sequences from contaminating taxa are likely to have frequencies that inversely correlate with sample DNA concentration" (Davis et al., 2018). I am concerned first that my contaminant DNA is fairly plentiful relative to true sample DNA. In addition, I have looked at the read counts for the negative controls versus the true samples, and the number of reads for the negative controls is often not at the low end relative to the true samples, but right in the middle of the distribution. This makes me concerned that my data does not abide by this assumption necessary for the frequency-based method. 

Indeed, the authors note that "Frequency-based contaminant identification is not recommended for extremely low-biomass samples (C~S or C > S) because the simple approximations we are making for the dependence of contaminant frequency on total DNA concentration break down when contaminants comprise a large fraction of sequencing reads." Whereas, "Prevalence-based contaminant identification remains valid in very low biomass environments where a majority of MGS sequences might derive from contaminants rather than true inhabitants of the sampled environment (i.e. C ~ S or C > S). Even in the low-biomass regime, it is still expected that non-contaminants will appear in a larger fraction of true samples than in negative control samples" (Davis et al., 2018).

Plots of library size against index for four sample types (code to create these plots is in the R code below): 

All sample types together:
![library_size](https://hackmd.io/_uploads/rkctj9iZZe.png)

![library_size_chim_coextract](https://hackmd.io/_uploads/BJlU29iZ-x.png)
![library_size_chim_lowbio](https://hackmd.io/_uploads/SkeLn5s-Zl.png)
![library_size_filter](https://hackmd.io/_uploads/ByeI29i-Zx.png)
![library_size_swab](https://hackmd.io/_uploads/rJx8nciWZx.png)

Seeing some negatives with “mid-range” read counts does not violate prevalence-based assumptions. Prevalence mode uses presence/absence frequency across sample classes, not read depth.

#### Four sample types and associated negative controls

I have a somewhat complicated set of samples paired with negative controls! I need to run decontam on four different sample sets, each of which corresponds to a different set of sample controls. My sample sets are swabs, filters, chimneys extracted using Qiagen's RNA Coextraction kit, and chimneys extracted using Qiagen's Low Biomass DNA extraction kit. 

In my `metadata.tsv` file, these are differentiated under the column "Sample type", as either Swab, Filter, or Chimney, and within Chimney, samples extracted using different kits are indicated by the column "Extraction kit" with either "QiaLowBio" or "QiaCoextract". To make the computation easier, I have marked these four sample categories under the column header "Category". Under "Sample type", I have negatives, marked as "Kit negative", which are negatives for three different extraction kits indicated in "Extraction kit". Again, for computational ease, I have added a column marked "Sample_or_negative" where I have marked each type.

Here are the different negative controls I have: 
* Kit negative controls:
    * Six Masterpure kit negative controls of the same type of swab I used for the swabbed chimney samples. Half of these went through a bead-based purification protocol just as the samples did, and the other half did not.
    * Six Masterpure kit negative controls of the exact filter set that I took on the cruise, but didn't go down on Jason. Half of these went through a bead-based purification protocol just as the samples did, and the other half did not.
    * Four Qiagen RNeasy RNA/DNA coextraction kit negative controls with no chimney or sediment material extracted. These did not go through a bead-based purification protocol because the samples themselves did not. 
    * Two Qiagen Low Biomass coextraction kit negative controls with no chimney or sediment material extracted. These did not go through a bead-based purification protocol because the samples themselves did not. 
* Full-process negative controls:
    * Two filter sets of background deep seawater taken while on off-axis transects - I am planning to use this as a negative control for not just the filter sets, but also the chimney and swab samples, as these samples were sitting in bioboxes with this deep seawater on the way to the surface. 
    * One chimney extract that I had accidentally dropped on the scale (ethanol-cleaned, but not recently) while extracting. This chimney was a 5-day-old chimney that had grown on top of the nozzle of an osmosampler in the five days between setting up and retrieving the sampler. When I first attempted to extract DNA from this sample, I got an amount of DNA that was below the detection limit for Qubit, but after I dropped the second half of the sample on the scale while extracting DNA, I ended up with a very low DNA concentration of 0.0656 ng/uL in 39 uL of water, or just 2.5584 ng of DNA total. 

It is a difficult decision whether to keep that chimney extract full-process negative control in the mix. If I can assume that this DNA is only lab contaminant, and that there was no (or undetectable) microbial colonization of the very hot, thin-walled chimney within 5 days, I think this would be a strong full-process negative control for the chimney samples. While there might still be overlap between this sample and others, I do think it meets the requirement for the prevalence-based decontam method of "in negative controls (S~0), the likelihood of detecting any given contaminant sequence feature will be higher than in true samples (S > 0)" (Davis et al., 2018). The good thing is that I have two swab samples from the same 5-day-old chimney that are true samples, so even if there is overlap between in the ASVs between the contaminated sample and the true samples, the true samples will force decontam to keep the non-contaminants in the mix. I did check whether the taxonomy of this sample is somewhat similar to the two true swab samples, and I can see some overlap but differences as well. For the time being, I'm going to proceed with using this as a negative control.

### Running decontam in R
In writing this code, I drew from the introduction to decontam (https://benjjneb.github.io/decontam/vignettes/decontam_intro.html), the oral contamination vignette (https://benjjneb.github.io/DecontamManuscript/Analyses/oral_contamination.html), and the tutorial Ben gave in the 2025 STAMPS course (https://github.com/mblstamps/stamps2025/blob/main/day7/contamination/decontam_lab_bile.R).

First, get all the data into a phyloseq object.

```
#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("decontam")
#BiocManager::install("phyloseq")

library(dplyr)
library(tidyverse)
library(phyloseq)
library(decontam)
library(readr)
library(ggplot2)

# file paths
workdir <- "."  # current folder
otu_tsv <- file.path(workdir, "feature-table.tsv")
meta_tsv <- file.path(workdir, "metadata.tsv")
tax_tsv <- file.path(workdir, "taxonomy.tsv")

## load count data from feature-table.tsv
otu_raw <- read_tsv(otu_tsv, show_col_types = FALSE)
stopifnot(names(otu_raw)[1] %in% c("Feature ID", "#OTU ID")) # Sanity check
colnames(otu_raw)[1] <- "Feature" # Rename "#OTU ID"

# convert into matrix of counts (features as rows, samples as columns)
otu_df <- otu_raw %>%
  select(-any_of("taxonomy")) %>%   # drop trailing taxonomy col if present
  column_to_rownames("Feature") %>%
  as.data.frame()
otu_mat <- as.matrix(otu_df)

## load metadata from metadata.tsv
meta_df <- read_tsv(meta_tsv, show_col_types = FALSE)

# Standardize the sample ID column
meta_df <- meta_df %>% rename(SampleID = "sample-id")
# Remove spaces from column headers
meta_df <- meta_df %>% rename_with(~ gsub(" ", "_", .x))

# Align samples across table and metadata
common_samples <- intersect(colnames(otu_mat), meta_df$SampleID)
otu_mat <- otu_mat[, common_samples, drop = FALSE]
# remove categorical/numerical designator row needed for Qiime2
#and use arrange to reorder meta_aln to the column order of otu_mat for phyloseq
meta_aln <- meta_df %>% 
  filter(SampleID %in% common_samples) %>% 
  arrange(factor(SampleID, levels = colnames(otu_mat))) %>%
  column_to_rownames("SampleID")

## load taxonomy data from taxonomy.tsv
tax_df <- read_tsv(tax_tsv, show_col_types = FALSE)

# Clean column names and make sure first column matches ASV IDs in ps
colnames(tax_df)[1] <- "FeatureID"
tax_df <- tax_df %>%
  separate_wider_delim(cols = "Taxon", delim = ";",
                       names = paste0("Rank", seq_len(7)),
                       too_few = "align_start",
                       too_many = "drop") %>%
  mutate(across(starts_with("Rank"), ~ str_trim(str_remove(.x, "^D_\\d+__"))))

# Convert to a tax_table object (matrix, ASVs as rownames)
tax_mat <- as.matrix(
  tax_df %>%
    column_to_rownames("FeatureID") %>%
    select(starts_with("Rank"))
)
tax <- tax_table(tax_mat)

## build phyloseq
OTU <- otu_table(otu_mat, taxa_are_rows = TRUE)
SAM <- sample_data(as.data.frame(meta_aln))
TAX <- tax_table(tax_mat)
ps  <- phyloseq(OTU, SAM, TAX)
```

Next, subset the data into appropriate categories.
```
# Subset the phyloseq object by "Category" column in metadata file
ps_chim_lowbio <- subset_samples(ps, Category %in% 
                                   c("Chimney_QiaLowBio", "All", "All_chimney"))
ps_chim_coextract <- subset_samples(ps, Category %in% 
                                c("Chimney_QiaCoextract", "All", "All_chimney"))
ps_filter <- subset_samples(ps, Category %in% c("Filter", "All"))
ps_swab <- subset_samples(ps, Category %in% c("Swab", "All"))

ps_chim_lowbio_kitnegs <- subset_samples(ps, Category == "Chimney_QiaLowBio")
ps_chim_coextract_kitnegs <- subset_samples(ps, Category 
                                            == "Chimney_QiaCoextract")

ps_filter_kitnegs <- subset_samples(ps, Category == "Filter")
ps_swab_kitnegs <- subset_samples(ps, Category == "Swab")

#For swabs and filters, separate by bead cleanup negatives
ps_filter_kitnegs_b <- subset_samples(
  ps_filter_kitnegs,
  Sample_or_negative != "Negative" |
    (Sample_or_negative == "Negative" &
       Bead == "Bead"))
ps_filter_kitnegs_nb <- subset_samples(
  ps_filter_kitnegs,
  Sample_or_negative != "Negative" |
    (Sample_or_negative == "Negative" &
       Bead == "NoBead"))

ps_swab_kitnegs_b <- subset_samples(
  ps_swab_kitnegs,
  Sample_or_negative != "Negative" |
    (Sample_or_negative == "Negative" &
       Bead == "Bead"))
ps_swab_kitnegs_nb <- subset_samples(
  ps_swab_kitnegs,
  Sample_or_negative != "Negative" |
    (Sample_or_negative == "Negative" &
       Bead == "NoBead"))

# Inspect library sizes for each subset
meta_aln$input_num <- suppressWarnings(as.numeric(meta_aln$input))
meta_aln <- meta_aln[order(meta_aln$input_num),]
meta_aln$Index <- seq(nrow(meta_aln))
ggplot(filter(meta_aln, Category == "Chimney_QiaLowBio"), 
       aes(x=Index, y=as.numeric(input_num), color=Sample_or_negative)) + 
  labs(title = "Qiagen Low Biomass Kit Chimney samples",
       x = "Sample Index",
       y = "Read count") +
  geom_point()

# Instead of removing singletons, remove ASVs with no prevalence in sample subset
prune_nonzero <- function(psx) {
  psx <- prune_samples(sample_sums(psx) > 0, psx)
  psx <- prune_taxa(taxa_sums(psx) > 0, psx)
  psx
}
ps_chim_lowbio_kitnegs    <- prune_nonzero(ps_chim_lowbio_kitnegs)
ps_chim_coextract_kitnegs <- prune_nonzero(ps_chim_coextract_kitnegs)
ps_filter_kitnegs_b       <- prune_nonzero(ps_filter_kitnegs_b)
ps_filter_kitnegs_nb      <- prune_nonzero(ps_filter_kitnegs_nb)
ps_swab_kitnegs_b         <- prune_nonzero(ps_swab_kitnegs_b)
ps_swab_kitnegs_nb        <- prune_nonzero(ps_swab_kitnegs_nb)

#compare unfiltered data to each sample type's subset
ps                    #present: 110375 taxa in 96 samples
ps_chim_lowbio        #present: 24064 taxa in 36 samples
ps_chim_coextract     #present: 12165 taxa in 20 samples
ps_filter             #present: 80131 taxa in 28 samples
ps_swab               #present: 9764 taxa in 17 samples

ps                          #present: 110375 taxa in 96 samples
ps_chim_lowbio_kitnegs        #present: 22899 taxa in 33 samples
ps_chim_coextract_kitnegs     #present: 10808 taxa in 17 samples
ps_filter_kitnegs_b             #present: 79251 taxa in 23 samples
ps_filter_kitnegs_nb             #present: 79103 taxa in 23 samples
ps_swab_kitnegs_b              #present: 8012 taxa in 12 samples
ps_swab_kitnegs_nb              #present: 7918 taxa in 12 samples

```

Now call decontam:
```
# Separate by "Sample" or "Negative" in column "Sample_or_negative"
# The syntax below gives TRUE if a negative control
sample_data(ps_chim_lowbio_kitnegs)$is.neg <- 
  sample_data(ps_chim_lowbio_kitnegs)$Sample_or_negative == 'Negative'
sample_data(ps_chim_coextract_kitnegs)$is.neg <- 
  sample_data(ps_chim_coextract_kitnegs)$Sample_or_negative == 'Negative'

sample_data(ps_filter_kitnegs_b)$is.neg <- 
  sample_data(ps_filter_kitnegs_b)$Sample_or_negative == 'Negative'
sample_data(ps_filter_kitnegs_nb)$is.neg <- 
  sample_data(ps_filter_kitnegs_nb)$Sample_or_negative == 'Negative'

sample_data(ps_swab_kitnegs_b)$is.neg <- 
  sample_data(ps_swab_kitnegs_b)$Sample_or_negative == 'Negative'
sample_data(ps_swab_kitnegs_nb)$is.neg <- 
  sample_data(ps_swab_kitnegs_nb)$Sample_or_negative == 'Negative'

# Call decontam for each sample category
contam_lowbio_kitnegs <- isContaminant(ps_chim_lowbio_kitnegs, neg="is.neg", 
                                       method="prevalence")
contam_coextract_kitnegs <- isContaminant(ps_chim_coextract_kitnegs, 
                                          neg="is.neg", method="prevalence")

contam_filter_kitnegs_b <- isContaminant(ps_filter_kitnegs_b, neg="is.neg", 
                                         method="prevalence")
contam_filter_kitnegs_nb <- isContaminant(ps_filter_kitnegs_nb, neg="is.neg", 
                                          method="prevalence")

contam_swab_kitnegs_b <- isContaminant(ps_swab_kitnegs_b, neg="is.neg", 
                                       method="prevalence")
contam_swab_kitnegs_nb <- isContaminant(ps_swab_kitnegs_nb, neg="is.neg", 
                                        method="prevalence")

# Print the number of contaminants identified (under TRUE in table) for each 
#sample category with the default threshold for contaminant 0.1 (less aggressive)
table(contam_lowbio_kitnegs$contaminant) # identified 32
table(contam_coextract_kitnegs$contaminant) # identified 90
table(contam_filter_kitnegs_b$contaminant) # identified 17
table(contam_filter_kitnegs_nb$contaminant) # identified 15
table(contam_swab_kitnegs_b$contaminant) # identified 13
table(contam_swab_kitnegs_nb$contaminant) # identified 15

#threshold=0.5 will identify as contaminants all sequences that are more 
#prevalent in negative controls than in positive samples.
#Trying using this more aggressive classification threshold
contam_lowbio_kitnegs_50 <- isContaminant(ps_chim_lowbio_kitnegs, neg="is.neg", 
                                          method="prevalence", threshold=0.5)
contam_coextract_kitnegs_50 <- isContaminant(ps_chim_coextract_kitnegs, 
                                             neg="is.neg", 
                                             method="prevalence", threshold=0.5)

contam_filter_kitnegs_b_50 <- isContaminant(ps_filter_kitnegs_b, neg="is.neg", 
                                            method="prevalence", threshold=0.5)
contam_filter_kitnegs_nb_50 <- isContaminant(ps_filter_kitnegs_nb, neg="is.neg", 
                                             method="prevalence", threshold=0.5)

contam_swab_kitnegs_b_50 <- isContaminant(ps_swab_kitnegs_b, neg="is.neg", 
                                          method="prevalence", threshold=0.5)
contam_swab_kitnegs_nb_50 <- isContaminant(ps_swab_kitnegs_nb, neg="is.neg", 
                                           method="prevalence", threshold=0.5)

#Print number of contaminants identified (under TRUE in table)
table(contam_lowbio_kitnegs_50$contaminant) # identified 43
table(contam_coextract_kitnegs_50$contaminant) # identified 174
table(contam_filter_kitnegs_b_50$contaminant) # identified 165
table(contam_filter_kitnegs_nb_50$contaminant) # identified 193
table(contam_swab_kitnegs_b_50$contaminant) # identified 51
table(contam_swab_kitnegs_nb_50$contaminant) # identified 45
```

Now add annotations to the decontam dataframes for downstream analyses.
```
### I will now add annotations to the contam data.frames
#Choose which dataframe to annotate
ps_subset <- ps_filter_kitnegs_b
contam    <- contam_filter_kitnegs_b_50

# Recreate the "st" object so colnames are ASV IDs (like Ben's)
st <- as(otu_table(ps_subset), "matrix")
if (taxa_are_rows(ps_subset)) st <- t(st)  # put sequences/ASVs as columns

# First make a map between short ASV names and the full sequences
sq <- colnames(st)
names(sq) <- paste0("ASV", seq_along(sq))

# Sanity check that the sequence table and the contam data.frames
#are arranged in the same order
if(!identical(colnames(st), rownames(contam))) stop(
  "Mismatch between st and contam.")

# Build a 'tax_df_full' data.frame compatible with Ben's column names
tax_df_full <- as.data.frame(tax_table(ps))  # full taxonomy from global ps
# Subset/align to the taxa present in 'st' (same order as colnames(st))
tax_df_full <- tax_df_full[ colnames(st), , drop = FALSE ]

# Add new columns to the contam data.frame with taxonomic information
contam$Phylum <- tax_df_full$Rank2
contam$Genus  <- tax_df_full$Rank6
contam$ASV    <- names(sq)
rownames(contam) <- NULL  # To avoid printing long sequences/IDs

# Add a column with binned prevalences using "cut" into intervals
contam$Prevalence_Binned <- cut(contam$prev, c(0, 5, 10, 30, 50, 9999), 
                                labels=c("0-5", "6-10", "11-30", "31-50", "50+"))
# View
head(contam)
```

Now I have to decide where to set the threshold for classification of ASVs as contaminant vs non-contaminant, which I will do separately for each sample category.
```
# The decontam manuscript emphasizes inspection of the score assigned by 
#decontam ($p) to identify an appropriate classification threshold. 
# Ideally, there will be an evident high-score (non-contaminant) mode, 
#and a low-score (contaminant) mode.
#    Inspect the histogram of the scores to look for this bimodality, 
#and help choose a classification threshold.
histo <- ggplot(data=contam, aes(x=p)) + 
  labs(title = 'Chimney Low Biomass Kit', 
       x = 'decontam-prevalence Score', y='Number of ASVs') + 
  geom_histogram(binwidth=0.02)
histo

# In the decontam manuscript (https://doi.org/10.1186/s40168-018-0605-2) the 
# distribution of scores was also investigated after stratification by 
#prevalence, i.e. taxa present in more vs fewer samples.
histo + facet_wrap(~Prevalence_Binned)

# Inspecting sequences at the extremes of the score distribution can help guide
#   choices about contaminant classification.
# Let's look more closely at the 10 highest scores 
#(the 10 ASVs that look the most like non-contaminants according to decontam)
i.top10 <- order(contam$p, decreasing=TRUE)[1:10]
contam[i.top10,]
# Now look at the the 10 lowest scores 
#(the 10 ASVs that look the most like contaminants according to decontam)
i.bottom10 <- order(contam$p, decreasing=FALSE)[1:10]
contam[i.bottom10,]
```

Putting putative contaminants from all sample categories into one table to parse manually, and compare putative contaminants flagged between different sample categories:
```
## Extract contaminant ASVs + taxonomy per category and threshold
# Build a taxonomy lookup with convenient columns
tax_lookup <- as.data.frame(tax_table(ps)) %>%
  tibble::rownames_to_column("ASV")

#Build function to pull TRUE contaminant ASVs and join stats + taxonomy
pull_contam_tbl <- function(contam_df, category, threshold_label, which_negs) {
  # get character vector of the contaminant ASV IDs
  asv_true <- rownames(contam_df)[which(contam_df$contaminant)]
  # get contam stats like p and prev from contam dataframe as well
  contam_stats <- contam_df[asv_true, c("p","prev"), drop = FALSE] %>%
    #convert row names (ASV IDs) into an explicit column called ASV
    tibble::rownames_to_column("ASV")
  # add taxonomy and metadata
  contam_stats %>%
    left_join(tax_lookup, by = "ASV") %>%
    mutate(
      Category  = category,
      Threshold = threshold_label,
      Which_Negs = which_negs
    ) %>%
    # Order a bit: IDs, stats, key ranks first
    relocate(ASV, Category, Threshold, p, prev)
}

# Build table of all contaminants
contam_tables <- list(
  pull_contam_tbl(contam_lowbio_kitnegs_50,   "Chimney_QiaLowBio",   "0.5", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_coextract_kitnegs_50,"Chimney_QiaCoextract","0.5", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_filter_kitnegs_b_50,   "Filter Beads",       "0.5", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_filter_kitnegs_nb_50,   "Filter No Beads",   "0.5", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_swab_kitnegs_b_50,     "Swab Beads",       "0.5", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_swab_kitnegs_nb_50,     "Swab No Beads",    "0.5", 
                  "kitnegs_only")
)

# One combined data frame of contaminants with taxonomy
contam_all_kitnegs <- dplyr::bind_rows(contam_tables)
# write to disk (one combined file)
readr::write_tsv(contam_all_kitnegs, "contaminant_asvs_with_taxonomy_kitnegs.tsv")

# compute the number of contaminant taxa in the no beads tables that are 
#NOT also in the beads tables
asv_only_kit <- setdiff(
  rownames(contam_filter_kitnegs_nb_50)[contam_filter_kitnegs_nb_50$contaminant],
  rownames(contam_filter_kitnegs_b_50)[contam_filter_kitnegs_b_50$contaminant]
)
length(asv_only_kit)
# Taxonomy for those ASVs
as.data.frame(tax_table(ps)) %>%
  tibble::as_tibble(rownames = "ASV") %>%
  dplyr::filter(ASV %in% asv_only_kit) %>%
  dplyr::select(ASV, dplyr::any_of(c("Rank2","Rank3","Rank4"))) %>%
  print(n = Inf)
```

Plots (bins are the number of samples with that ASV):
![decontam_chim_lowbio_histo](https://hackmd.io/_uploads/Hyn0Lyhb-l.png)
![decontam_chim_lowbio_histo_binned](https://hackmd.io/_uploads/SykyPJ3ZZg.png)
![decontam_chim_coextract_histo](https://hackmd.io/_uploads/ByQywk3WZx.png)
![decontam_chim_coextract_histo_binned](https://hackmd.io/_uploads/SkDyDyhZ-g.png)
![decontam_swabs_histo](https://hackmd.io/_uploads/ByXgDJ3ZZg.png)
![decontam_swabs_histo_binned](https://hackmd.io/_uploads/BJIew12Z-e.png)
![decontam_filters_histo](https://hackmd.io/_uploads/HJoxwkh-be.png)
![decontam_filters_histo_binned](https://hackmd.io/_uploads/HykbvJn-Zx.png)

### Negatives: running decontam with only kit negatives, not all (including full-process) negatives for comparison

Upon talking to Julie, I have run the same decontam analysis but for only the kit negatives - no full-process negative controls. Here is how the results changed:

For the 0.1 (default) threshold, using only the kit negatives led to the identification of the same number of contaminants (32) for the low biomass kit chimneys, 90 instead of 102 contaminants for the coextraction kit chimneys, 44 instead of 38 contaminants for filters, and 33 instead of 20 contaminants for swabs. It's really interesting that at this less strict cutoff, there were actually more contaminants found for most categories using only kit negatives. How much overlap is there between the two approaches? The number of ASVs that are present in the "kitnegs_only" result that are NOT also in the "all" negatives result are  for the low biomass kit (including 5 Acidobacteriae), 25 for the coextraction kit (of which 21 have no identifed taxonomy), 13 for filters (including 4 Methanosarcinia and 1 Cyanobacteria), and 14 for swabs.

For the 0.5 (more aggressive) threshold, using only the kit negatives led to the identification of 43 instead of 214 contaminants for the low biomass kit chimneys, 174 instead of 218 contaminants for the coextraction kit chimneys, 252 instead of 388 contaminants for filters, and 81 instead of 141 contaminants for swabs. How much overlap is there between the two approaches? The number of ASVs that are present in the "kitnegs_only" result that are NOT also in the "all" negatives result are 0 for the low biomass kit, 46 for the coextraction kit, 33 for filters, and 1 for swabs. These ASVs do include some taxa of interest; for instance, ASVs marked as contaminants for the coextraction kit using only the kit negatives but not "all" negatives included several Methanobacteriota, Thermodesulfobacteriota, and Campylobacteria. Of  the 33 ASVs marked as contaminants for the filters found with only kit negatives, 6 were either Methanosarcinia or Methanomicrobia.

This suggests greater bimodality using only the kit negatives, but fewer total contaminants flagged. 

I'm also going to separate out my filter and swabs negative controls that went through bead-based DNA extract cleanup (like the samples did) versus the ones that didn't to avoid false positives from a mismatched negative workflow.

Here are the decontam histograms made using only the kit negatives:
![chim_lowbio_kitnegs_histo](https://hackmd.io/_uploads/B1Scw2xM-g.png)
![chim_lowbio_kitnegs_histo_binned](https://hackmd.io/_uploads/SkKqw2lMbl.png)
![chim_coextract_kitnegs_histo](https://hackmd.io/_uploads/S1n9D2lGbg.png)
![chim_coextract_kitnegs_histo_binned](https://hackmd.io/_uploads/HyAcwhxfWl.png)
![filter_kitnegs_bead_histo](https://hackmd.io/_uploads/By_oP3lfbl.png)
![filter_kitnegs_bead_histo_binned](https://hackmd.io/_uploads/rkRoP2lMZe.png)
![filter_kitnegs_nobeads_histo](https://hackmd.io/_uploads/Hy5nP3eGWl.png)
![filter_kitnegs_nobeads_histo_binned](https://hackmd.io/_uploads/r1jhDnlMbl.png)
![swab_kitnegs_bead_histo](https://hackmd.io/_uploads/B1RhwneM-x.png)
![swab_kitnegs_bead_histo_binned](https://hackmd.io/_uploads/HygaP2lfWg.png)
![swab_kitnegs_nobeads_histo](https://hackmd.io/_uploads/BJMaD2efbe.png)
![swab_kitnegs_nobeads_histo_binned](https://hackmd.io/_uploads/SyHpP2eMZg.png)

### Setting the contaminant threshold

Questions to guide where to set the threshold:
* Is the histogram consistent with decontam's hypothesized bimodal distribution?
* After accounting for prevalence, i.e. taxa present in fewer samples vs a greater number of samples, does this inform the bimodal distribution and therefore classification threshold?
* Do the top and bottom 10 p scores - as in, the 10 ASVs most and least likely (respectively) to be non-contaminants - say anything about the classification threshold?

Remember that 0.1 is the default threshold, whereas a threshold of 0.5 is a more aggressive classification threshold which will identify as contaminants all sequences that are more 
prevalent in negative controls than in positive samples.

I think appropriate thresholds are:
* chimney low biomass kit: 0.25
* chimney coextraction kit: 0.25
* swabs, with beads: 0.24 - because more likely vent taxa than non at 0.25
* swabs, no beads: 0.24
* filters, with beads: 0.22
* filters, no beads: 0.22

I chose these cutoffs by looking at the histograms and manually looking at the taxa flagged as contaminants at different p.prev values.

R code to run decontam using these thresholds, export final contaminant table, and remove identified contaminants from phyloseq for downstream analyses:
```
#export final contaminant table
# Build table of final flagged contaminants
contam_tables_final <- list(
  pull_contam_tbl(contam_lowbio_kitnegs_25,   "Chimney_QiaLowBio",   "0.25", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_coextract_kitnegs_25,"Chimney_QiaCoextract","0.25", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_filter_kitnegs_b_22,   "Filter Beads",       "0.22", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_filter_kitnegs_nb_22,   "Filter No Beads",   "0.2", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_swab_kitnegs_b_24,     "Swab Beads",       "0.24", 
                  "kitnegs_only"),
  pull_contam_tbl(contam_swab_kitnegs_nb_24,     "Swab No Beads",    "0.24", 
                  "kitnegs_only")
)

# One combined data frame of contaminants with taxonomy
contam_final <- dplyr::bind_rows(contam_tables_final)
# write to disk (one combined file)
readr::write_tsv(contam_final, "contaminant_asvs_final.tsv")

###remove flagged contaminants from phyloseq
##Collect contaminant ASV IDs (unique, non-NA)
contam_ids <- contam_final$ASV %>%
  unique() %>%
  setdiff(NA_character_) # in total: 353 contaminant ASVs

# Sanity: how many of these are in phyloseq object?
matched <- intersect(contam_ids, taxa_names(ps))
message("Contaminant ASVs in ps: ", length(matched), " / ", length(contam_ids))

## Remove contaminants from ps and tidy zeros
ps_clean <- prune_taxa(!(taxa_names(ps) %in% matched), ps)

## Quick before/after check
message("Before: ", ntaxa(ps),  " taxa across ", nsamples(ps),  " samples.")
message("After: ", ntaxa(ps_clean), " taxa across ", nsamples(ps_clean), " samples")

saveRDS(ps_clean, file = "ps_clean_decontam.rds")
# Later, in another script:
# ps_clean <- readRDS("ps_clean_decontam.rds")

# also export components as TSVs
# OTU / ASV table (features as rows)
otu_out <- as(otu_table(ps_clean), "matrix")
if (!taxa_are_rows(ps_clean)) otu_out <- t(otu_out)
otu_out <- cbind(FeatureID = rownames(otu_out), as.data.frame(otu_out))
readr::write_tsv(otu_out, "feature-table.clean.tsv")

# Taxonomy
tax_out <- as.data.frame(tax_table(ps_clean)) %>%
  tibble::rownames_to_column("FeatureID")
readr::write_tsv(tax_out, "taxonomy.clean.tsv")

# Metadata
meta_out <- as(sample_data(ps_clean), "data.frame") %>%
  tibble::rownames_to_column("sample-id")
readr::write_tsv(meta_out, "metadata.clean.tsv")

```

## Statistical analysis and visualization of ASVs
Questions I'd like to get at with my visualization and data analysis:
* Now that I have removed contaminant taxa, what do the taxonomy barplots look like for all samples?
* What are the key taxa present in my hydrothermal samples, and what is their core microbiome?
* Relative to the background seawater, which taxa are more abundant in my filtered diffuse-flow fluid samples, swabbed chimneys, and powdered chimney samples?
* What is the difference in community composition between the exterior, middle, and interior of hydrothermal vent chimneys?
* How different are the chimney microbiomes between the North vs South vent field?

### Making taxonomy barplot through Qiime2
First, I'll visualize my taxonomy bar plots again. I need to start by getting my tsv files from R back into the qiime2 format.

Get exported TSV components back onto Poseidon HPC so that I can recreate a Qiime2 object:
```
scp /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/*.clean.tsv avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/
```
Modify taxonomy	file to	collapse ranks into a single “Taxon” column for	compatibility:
```
awk 'BEGIN{FS="\t";OFS="\t"} NR==1{print "Feature ID","Taxon"} NR>1{tax=$2; for(i=3;i<=NF;i++){tax=tax"; " $i} print $1,tax}' taxonomy.clean.tsv > taxonomy.qiime.tsv
```

Script to get files into a Qiime2 artifact and create a barplot:
```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=import_postdecontam
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10gb
#SBATCH --time=24:00:00
#SBATCH --output=import_postdecontam.log

set -euo pipefail
export OMP_NUM_THREADS=4

module load singularity/3.7

# ---- Paths ----
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam
export MPLCONFIGDIR=/tmp/mplconfig && mkdir -p $MPLCONFIGDIR

# Convenience alias so it never forgets the container
run() { singularity exec --bind "${WORKDIR}:/data" "${CONTAINER}" "$@"; }

# ---- 1. Import taxonomy ----
# Assumes the taxonomy file has a header like:
# Feature ID    Taxon    Confidence
run qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format TSVTaxonomyFormat \
  --input-path /data/taxonomy.qiime.tsv \
  --output-path /data/taxonomy.clean.qza
  
# ---- 2. Convert feature table TSV -> BIOM (HDF5) ----
# Assumes the TSV has "#OTU ID" as the first column header - all good!
run biom convert \
  -i /data/feature-table.clean.tsv \
  -o /data/feature-table.clean.biom \
  --table-type="OTU table" \
  --to-hdf5

# ---- 3. Import feature table BIOM -> QZA ----
run qiime tools import \
  --type 'FeatureTable[Frequency]' \
  --input-path /data/feature-table.clean.biom \
  --input-format BIOMV210Format \
  --output-path /data/feature-table.clean.qza

# ---- 4. Import metadata ----
# Metadata stays as a TSV file (QIIME uses it directly, no import needed),
# but validate its formatting here:
run qiime metadata tabulate \
  --m-input-file /data/metadata.clean.tsv \
  --o-visualization /data/metadata.clean.qzv

# ---- 5. Generate taxonomy barplot ----
run qiime taxa barplot \
  --i-table /data/feature-table.clean.qza \
  --i-taxonomy /data/taxonomy.clean.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --o-visualization /data/taxa-barplot.clean.qzv
```

Download file to local computer: 
```
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/taxa-barplot.clean.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/
```

### I want to avoid rarefying
Rarefaction (subsampling every sample to an equal number of reads) equalizes depth for distance calculations, but it:
* Throws away valid data (reads below the cutoff)
* Adds random noise when resampling
* Can remove many vent samples that already have low counts
So I want to keep all reads and instead use methods that account for unequal sequencing depth, for instance DivNet and Breakaway.

Instead of QIIME’s rarefied indices, I'll use some of Amy Willis's tools:
* DivNet estimates Shannon and evenness, modeling read sampling error
* Breakaway estimates true richness (observed + unobserved species)
Both are implemented in R (breakaway, DivNet packages) and use the unrarefied count table. These methods also output confidence intervals.

### Use Qiime2 to create a clustered heatmap based on sample type, chimney layer, and North vs South vent field
Right now, SampleType gives Chimney, Filter, Swab, etc., but it doesn't include which layer of the chimney (Interior, Middle, Exterior). First I need to add a column to my metadata file to reflect this, "SampleTypeLayer".

Next, slurm script to create heatmap:
```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=taxa_diversity_class
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=25gb
#SBATCH --time=24:00:00
#SBATCH --output=taxa_diversity_class.log

set -euo pipefail
export OMP_NUM_THREADS=4

module load singularity/3.7

# Paths
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR="/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam"
export MPLCONFIGDIR=/tmp/mplconfig && mkdir -p "$MPLCONFIGDIR"

# Convenience wrapper so every call runs inside the container
run() { singularity exec --bind "${WORKDIR}:/data" "${CONTAINER}" "$@"; }

# 0) Prune all negatives by metadata:
#    Keep samples where Sample_type is NOT 'Possible contam' or 'Kit negative'
run qiime feature-table filter-samples \
  --i-table /data/feature-table.clean.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --p-where "Sample_type NOT IN ('Possible contam','Kit negative')" \
  --o-filtered-table /data/feature-table.noneg.qza

# Optional: only create heatmap for chimney and swab samples
run qiime feature-table filter-samples \
  --i-table /data/feature-table.noneg.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --p-where "Sample_type IN ('Chimney','Swab')" \
  --o-filtered-table /data/feature-table.noneg.qza

# 1) Prepare table for heatmap: keep only features with Class annotations
run qiime taxa filter-table \
  --i-table /data/feature-table.noneg.qza \
  --i-taxonomy /data/taxonomy.clean.qza \
  --p-include "c__" \
  --o-filtered-table /data/table-with-class.qza

# 2) Collapse by taxonomy to Class (QIIME level 3)
run qiime taxa collapse \
  --i-table /data/table-with-class.qza \
  --i-taxonomy /data/taxonomy.clean.qza \
  --p-level 3 \
  --o-collapsed-table /data/class-taxonomy-table.qza

# 3) Normalize to relative abundance (avoid depth bias in heatmap)
run qiime feature-table relative-frequency \
  --i-table /data/class-taxonomy-table.qza \
  --o-relative-frequency-table /data/class-relfreq-table.qza

# 4) Class-level heatmap (cluster features) based on metadata column SampleTypeLayer
#    Can replace SampleTypeLayer with Sample_type, Chimney_position, N_or_S, or whatever is interesting

run qiime feature-table heatmap \
  --i-table /data/class-relfreq-table.qza \
  --m-sample-metadata-file /data/metadata.clean.tsv \
  --m-sample-metadata-column "SampleTypeLayer" \
  --p-cluster samples \
  --o-visualization /data/heatmap_CLASS.qzv
  
# Class-level heatmap (cluster features) based on metadata column N_or_S
run qiime feature-table heatmap \
  --i-table /data/class-relfreq-table.qza \
  --m-sample-metadata-file /data/metadata.clean.tsv \
  --m-sample-metadata-column "N_or_S" \
  --p-cluster samples \
  --o-visualization /data/heatmap_CLASS_NorS.qzv

# 5) Read-depth summary (for reporting only; does NOT rarefy)
run qiime feature-table summarize \
  --i-table /data/feature-table.noneg.qza \
  --o-visualization /data/table-summary.clean.qzv \
  --m-sample-metadata-file /data/metadata.clean.tsv

```
Code to get `heatmap_CLASS.qzv` on local computer:
```
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/heatmap_CLASS.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/

scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/heatmap_CLASS_NorS.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/
```

![heatmap_CLASS](https://hackmd.io/_uploads/Hkl_OkIfZg.png)

### Identify core microbiome of different sample types
Are there “universal” vent-associated taxa shared across chimney and fluid samples? How about within the chimney samples, swab samples, and fluid samples individually? 

The core microbiome is the set of ASVs (or taxa at any chosen rank) that are consistently present across a defined group of samples.
Two parameters define it:
* Prevalence: fraction of samples where a taxon is present (≥ 80%)
* Detection threshold: minimum relative abundance (> 0.1%)

These chosen parameter values mean “present at > 0.1 % relative abundance in at least 80 % of samples.”
This avoids labeling taxa as “core” if they occur only as trace contaminants.

I will compute the core microbiome for:
1. All hydrothermal samples (everything except Sample_type == "Background SW")
2. Each Sample_type separately (Chimney, Swab, Filter)

Slurm script:
```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=core_microbiome
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=25gb
#SBATCH --time=24:00:00
#SBATCH --output=core_microbiome.log

set -euo pipefail
export OMP_NUM_THREADS=4

module load singularity/3.7

# Paths
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR="/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam"
export MPLCONFIGDIR=/tmp/mplconfig && mkdir -p "$MPLCONFIGDIR"

# Convenience wrapper so every call runs inside the container
run() { singularity exec --bind "${WORKDIR}:/data" "${CONTAINER}" "$@"; }

# Filter data for only hydrothermal samples
run qiime feature-table filter-samples \
  --i-table /data/feature-table.clean.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --p-where "Sample_type != 'Background SW'" \
  --o-filtered-table /data/hydrothermal-only.qza

# Create core feature viz using Qiime2 for only hydrothermal samples
run qiime feature-table core-features \
  --i-table /data/hydrothermal-only.qza \
  --p-min-fraction 0.8 \
  --o-visualization /data/core-hydrothermal.qzv

# Iterative: for each sample type, filter data for only those samples and create core feature viz
for t in Chimney Swab Filter; do
  run qiime feature-table filter-samples \
    --i-table /data/hydrothermal-only.qza \
    --m-metadata-file /data/metadata.clean.tsv \
    --p-where "Sample_type='$t'" \
    --o-filtered-table /data/${t,,}-only.qza

  run qiime feature-table core-features \
    --i-table /data/${t,,}-only.qza \
    --p-min-fraction 0.8 \
    --o-visualization /data/core-${t,,}.qzv
done
```

Download files to local computer:
```
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/core-*.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/core_microbiome
```

There was only one ASV that was in 80% of all my hydrothermal-associated samples: d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter

There was none for all chimney samples, 9 for swabs, and 25 for filters.

### Create NMDS plot of ASV prevalence and abundance with colored overlays to see what factors drive clustering
I'm going to do this in R using phyloseq and vegan.

```
library(vegan)
library(phyloseq)
library(ggplot2)
library(tidyverse)

# load pre-decontam dataset as phyloseq object
ps <- readRDS("ps.rds")
# load post-decontam dataset as phyloseq object
ps_clean <- readRDS("ps_clean_decontam.rds")

# remove negatives from phyloseq object
ps_clean_nonegs <- subset_samples(
  ps_clean,
  Sample_type != "Kit negative" & Sample_type != "Possible contam")

#subset for chimneys and swabs only
ps_clean_chim <- subset_samples(ps_clean, Sample_type %in% c("Chimney", "Swab"))

# transform phyloseq to relative abundance to avoid bias by sequencing depth
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x))
ps_clean_nonegs_rel <- transform_sample_counts(ps_clean_nonegs, function(x) x / sum(x))
ps_clean_chim_rel <- transform_sample_counts(ps_clean_chim, function(x) x / sum(x))

# Compute Bray–Curtis distances and run NMDS
# should see a stress value printed. < 0.2 is acceptable, < 0.1 is very good
ord_nmds <- ordinate(ps_rel, method = "NMDS", distance = "bray")
# stress value of 0.2103611
ord_nmds_clean <- ordinate(ps_clean_nonegs_rel, method = "NMDS", distance = "bray")
# stress value of 0.220147
ord_nmds_chim <- ordinate(ps_clean_chim_rel, method = "NMDS", distance = "bray")
# stress value of 0.1864533

# Plot and color by “SampleTypeLayer” column
plot_ordination(ps_rel, ord_nmds, color = "SampleTypeLayer") +
  geom_point(size = 4, alpha = 0.8) +
  theme_minimal(base_size = 14) +
  labs(title = "NMDS of Hydrothermal Vent ASV Communities: Pre Decontam",
       color = "Sample Type and Chimney Layer")

plot_ordination(ps_clean_nonegs_rel, ord_nmds_clean, color = "SampleTypeLayer") +
  geom_point(size = 4, alpha = 0.8) +
  theme_minimal(base_size = 14) +
  labs(title = "NMDS of Hydrothermal Vent ASV Communities: Post Decontam",
       color = "Sample Type and Chimney Layer")


# Plot and color by “N_or_S” column
plot_ordination(ps_rel, ord_nmds, color = "N_or_S") +
  geom_point(size = 4, alpha = 0.8) +
  theme_minimal(base_size = 14) +
  labs(title = "NMDS of Hydrothermal Vent ASV Communities: Pre Decontam",
       color = "North vs South Cleft")

plot_ordination(ps_clean_nonegs_rel, ord_nmds_clean, color = "N_or_S") +
  geom_point(size = 4, alpha = 0.8) +
  theme_minimal(base_size = 14) +
  labs(title = "NMDS of Hydrothermal Vent ASV Communities: Post Decontam",
       color = "North vs South Cleft")

# Color by North vs South, use different symbols for SampleTypeLayer
# Define 7 custom shapes (these are distinct and easy to read)
custom_shapes <- c(16, 17, 15, 18, 3, 8, 7) 
# (circle, triangle, square, diamond, plus, star, inverted triangle)

plot_ordination(ps_clean_nonegs_rel, ord_nmds_clean,
                color = "N_or_S", shape = "SampleTypeLayer") +
  geom_point(size = 4, alpha = 0.8) +
  scale_shape_manual(values = custom_shapes) +
  theme_minimal(base_size = 14) +
  labs(
    title = "NMDS of Hydrothermal Vent ASV Communities: Post Decontam",
    color = "Vent Field (North vs South)",
    shape = "Sample Type / Chimney Layer"
  ) +
  theme(
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot only chimneys by layer and N_or_S
plot_ordination(ps_clean_chim_rel, ord_nmds_chim,
                color = "Location_name", shape = "SampleTypeLayer") +
  geom_point(size = 4, alpha = 0.8) +
  theme_minimal(base_size = 14) +
  labs(
    title = "NMDS of Hydrothermal Vent Chimney ASV Communities",
    color = "Vent",
    shape = "Chimney Layer"
  ) +
  theme(
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )
```

Hmm, these are high stress values, which are only borderline usable. This is probably because NMDS is sensitive to a large number of zeroes (common in ASV data), high sample heterogeneity, and low overlap of ASVs among samples. This does somewhat describe my dataset. 
![nmds_pre_decontam](https://hackmd.io/_uploads/ByftM58z-e.png)
![nmds_post_decontam_layers_nonegs](https://hackmd.io/_uploads/BJGqG5LGbx.png)
![nmds_chim_swab_only](https://hackmd.io/_uploads/r1HhzcUGWe.png)
![nmds_chim_swab_location](https://hackmd.io/_uploads/SkOhzcLf-e.png)

Some interpretations:
* Great that negative controls seem to cluster separately before decontam contaminant removal
* After decontam, there seems to be clustering by filters and swabs within sample type, as well as between vent fields
* Separating out only the crushed chimney and swab samples shows better clustering of North vs South vent fields
* With only the chimney and swab samples, they seem to cluster more by individual vent rather than by chimney layer, indicating that the individual vent geochemistry is a very important driver of the microbial community

### Differential abundance of hydrothermal taxa relative to background seawater
Relative to the background seawater, which taxa are more abundant in my filtered diffuse-flow fluid samples, swabbed chimneys, and powdered chimney samples?

I'm planning to use QIIME2 ANCOM BC-2. This has two values you can tweak: `--p-significance-threshold` and `--p-effect-size-threshold`.
* `--p-significance-threshold` sets the adjusted p-value (q-value) cutoff below which a taxon is considered statistically significant. I used the default in QIIME 2: 0.05, meaning taxa with FDR-corrected p < 0.05 are treated as differentially abundant. This is the standard balance between sensitivity and false-discovery control in exploratory microbial ecology.
* `--p-effect-size-threshold` sets the minimum absolute log-fold-change (LFC) a taxon must have to be shown on the barplot. Default in QIIME 2 is 0.0 (no filtering) - ChatGPT suggested I use 0.5, so roughly a 1.6-fold change in relative abundance (since log₂(1.6) ≈ 0.68). This removes small but statistically significant shifts (e.g., 1–2 % changes) that aren’t biologically meaningful.

After I get my output, I can adjust -- if too many taxa are filtered out, I can lower the effect-size cutoff. If too many appear, I can tighten either threshold.

Slurm script to use ANCOM-BC2 through QIIME2 to visualize differential abundance in (1) "Filter" vs "Background SW" in metadata column "Sample_type", (2) "Swab" vs "Background SW" in metadata column "Sample_type", (3) "Chimney" vs "Filter" in metadata column "Sample_type", and (4) "North" vs "South" vent samples in metadata column "N_or_S":

```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=qiime2_ancombc2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --qos=unlim
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16gb
#SBATCH --time=72:00:00
#SBATCH --output=qiime2_ancombc2.log

set -euo pipefail
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-4}

module load singularity/3.7

# Paths
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam
TABLE=/data/feature-table.noneg.qza    # non-negative table created earlier
METADATA=/data/metadata.clean.tsv

# outputs
OUTDIR=/data/ancombc2_out

run() { singularity exec --bind "${WORKDIR}:/data" "${CONTAINER}" "$@"; }

# Helper: make a 2-level subset table for a given metadata column + levels
# Usage: subset_two_levels COLUMN "LevelA" "LevelB" OUTTABLE
subset_two_levels () {
  local col="$1"; local levA="$2"; local levB="$3"; local out="$4"
  run qiime feature-table filter-samples \
    --i-table "${TABLE}" \
    --m-metadata-file "${METADATA}" \
    --p-where "${col} IN ('${levA}','${levB}')" \
    --o-filtered-table "${out}"
}

# 1) Filter vs Background SW  (Sample_type)
#    Reference = Background SW → +LFC = enriched in Filter
#subset_two_levels "Sample_type" "Filter" "Background SW" "${OUTDIR}/tbl_Filter_BG.qza"

#run qiime composition ancombc2 \
#  --i-table "${OUTDIR}/tbl_Filter_BG.qza" \
#  --m-metadata-file "${METADATA}" \
#  --p-fixed-effects-formula "Sample_type" \
#  --p-reference-levels "Sample_type::Background SW" \
#  --o-ancombc2-output "${OUTDIR}/abc2_Filter_vs_BG.qza"

run qiime composition ancombc2-visualizer \
  --i-data "${OUTDIR}/abc2_Filter_vs_BG.qza" \
  --i-taxonomy /data/taxonomy.clean.qza \
  --o-visualization "${OUTDIR}/abc2_Filter_vs_BG.qzv"

# 1b) Swab vs Background SW  (Sample_type)
#     Reference = Background SW → +LFC = enriched in Swab
subset_two_levels "Sample_type" "Swab" "Background SW" "${OUTDIR}/tbl_Swab_BG.qza"

run qiime composition ancombc2 \
  --i-table "${OUTDIR}/tbl_Swab_BG.qza" \
  --m-metadata-file "${METADATA}" \
  --p-fixed-effects-formula "Sample_type" \
  --p-reference-levels "Sample_type::Background SW" \
  --o-ancombc2-output "${OUTDIR}/abc2_Swab_vs_BG.qza"

run qiime composition ancombc2-visualizer \
  --i-data "${OUTDIR}/abc2_Swab_vs_BG.qza" \
  --i-taxonomy /data/taxonomy.clean.qza \
  --o-visualization "${OUTDIR}/abc2_Swab_vs_BG.qzv"

# 2) Chimney vs Filter  (Sample_type)
#    Reference = Filter → +LFC = enriched in Chimney
subset_two_levels "Sample_type" "Chimney" "Filter" "${OUTDIR}/tbl_Chimney_Filter.qza"

run qiime composition ancombc2 \
  --i-table "${OUTDIR}/tbl_Chimney_Filter.qza" \
  --m-metadata-file "${METADATA}" \
  --p-fixed-effects-formula "Sample_type" \
  --p-reference-levels "Sample_type::Filter" \
  --o-ancombc2-output "${OUTDIR}/abc2_Chimney_vs_Filter.qza"

run qiime composition ancombc2-visualizer \
  --i-data "${OUTDIR}/abc2_Chimney_vs_Filter.qza" \
  --i-taxonomy /data/taxonomy.clean.qza \
  --o-visualization "${OUTDIR}/abc2_Chimney_vs_Filter.qzv"


# 3) South vs North  (N_or_S)
#    Reference = North → +LFC = enriched in South
subset_two_levels "N_or_S" "South" "North" "${OUTDIR}/tbl_South_North.qza"

run qiime composition ancombc2 \
  --i-table "${OUTDIR}/tbl_South_North.qza" \
  --m-metadata-file "${METADATA}" \
  --p-fixed-effects-formula "N_or_S" \
  --p-reference-levels "N_or_S::North" \
  --o-ancombc2-output "${OUTDIR}/abc2_South_vs_North.qza"

run qiime composition ancombc2-visualizer \
  --i-data "${OUTDIR}/abc2_South_vs_North.qza" \
  --i-taxonomy /data/taxonomy.clean.qza \
  --o-visualization "${OUTDIR}/abc2_South_vs_North.qzv"

echo "Done. Visualizations in ${WORKDIR}/ancombc2_out/"
```
Downloading these onto local computer:
```
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/ancombc2_out/*.qzv /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/ancom-bc2
```
Now, I need to export these files into a flat tsv so that I can visualize them on my own. 
```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=ancombc2_export_tables
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4gb
#SBATCH --time=02:00:00
#SBATCH --output=ancombc2_export_tables.log

set -euo pipefail
module load singularity/3.7

# Paths
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam

OUTDIR=/data/ancombc2_out
EXPORT_DIR=${OUTDIR}/export

# convenience wrapper to run commands inside the container with /data bound
run() { singularity exec --bind "${WORKDIR}:/data" "${CONTAINER}" "$@"; }

# list of contrast tags to export (matches existing filenames)
TAGS=(
  "Filter_vs_BG"
  "Swab_vs_BG"
  "Chimney_vs_Filter"
  "South_vs_North"
)

for tag in "${TAGS[@]}"; do
  IN_ART="${OUTDIR}/abc2_${tag}.qza"
  OUT_DIR="${EXPORT_DIR}/${tag}"

  echo " - ${tag}"
  run qiime tools export \
    --input-path "${IN_ART}" \
    --output-path "${OUT_DIR}"
done
```

Download to local computer:
```
scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/ancombc2_out/export/Chimney_vs_Filter/* /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/ancom-bc2/Chimney_vs_Filter

scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/ancombc2_out/export/Filter_vs_BG/* /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/ancom-bc2/Filter_vs_BG

scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/ancombc2_out/export/Swab_vs_BG/* /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/ancom-bc2/Swab_vs_BG

scp avery.fulford@poseidon.whoi.edu:/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam/ancombc2_out/export/South_vs_North/* /Users/averyfulford/Desktop/MIT_WHOI4/research/16S\ analysis/ancom-bc2/South_vs_North
```

Now I'm going to write code in R to visualize these.

### Community Composition and Beta Diversity Patterns
Approach: compare richness and diversity indices between chimney exteriors vs interiors, or diffuse-flow fluids vs chimneys, using breakaway and divnet (breakaway focuses on richness while DivNet focuses on Shannon, Simpson, and other alpha diversities as well as some beta diversity indices).

## Select samples for metagenomic sequencing
Guiding principle for selecting samples
Scientific interests:
* Identifying metabolic gene pathways across gradients (e.g., inner, middle, outer chimney samples)
* Recovering MAGs (higher DNA concentration samples, lower evenness)
* For filters, samples that had less seawater intrusion (dissimilar to background seawater samples)

Technical considerations:
* Samples not dominated by contaminants (dissimilar to respective kit negatives)
* more than 1 ng/uL DNA

### Identifying samples dissimilar to seawater samples
* Used Bray-Curtis distances of most similar and mean of two background seawater samples to identify the samples most similar to BGSW
* Bray-Curtis distance matrix ranges from 0 (identical) to 1 (completely different)
* Identified five outlier samples that are < 0.95. All are filters!
![Screenshot 2026-01-10 at 2.53.21 PM](https://hackmd.io/_uploads/r1l807eB-x.png)

### Identifying samples dissimilar to respective kit negatives
* Used Bray-Curtis distances to most similar and mean of kit negatives within each sample category
* Identified four samples that are <0.95. Again, all are filters
![Screenshot 2026-01-10 at 3.00.56 PM](https://hackmd.io/_uploads/SJHqxVlBWx.png)

### Identifying samples with fewer contaminants identified using decontam
* Calculated the samples with the highest number of contaminant ASVs and proportion of contaminant ASVs relative to # of reads
* Out of 408 total contaminants identified using decontam …
![Screenshot 2026-01-10 at 3.02.10 PM](https://hackmd.io/_uploads/BJTcgEeSWe.png)

### Identifying low evenness samples for recovering MAGs
* Calculated Pielou’s Evenness: how equally abundant species are in community, with 0 = very uneven and 1 = perfectly even
* Pielou’s Evenness = Shannon diversity / ln(richness)
* Top 10 species with highest evenness (might exclude): 
![Screenshot 2026-01-10 at 3.02.18 PM](https://hackmd.io/_uploads/rJfseVlSbx.png)

A helpful heuristic diagram for comparing alpha diversity metrics (made by Amy Willis):
![Screenshot 2026-01-08 at 2.11.43 PM](https://hackmd.io/_uploads/B1yIaXgSWx.png)

### The full code in R
```
library(vegan)
library(phyloseq)
library(ggplot2)
library(tidyverse)

# load pre-decontam dataset as phyloseq object
ps <- readRDS("ps.rds")
# load post-decontam dataset as phyloseq object
ps_decontam <- readRDS("ps_clean_decontam.rds")

# load the ASV IDs and taxonomy for the contaminants identified using decontam
contaminant_asvs <- read_tsv("contaminant_asvs_final.tsv", show_col_types = FALSE)

# transform phyloseq to relative abundance to avoid bias by sequencing depth
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x))
ps_decontam_rel <- transform_sample_counts(ps_decontam, function(x) x / sum(x))

# ---
# compute distances to Background SW and kit negatives by Category
# ---

# Get Bray–Curtis distance matrix on ps_rel (includes BG SW + kit negs)
# Using pre-decontam data because this is what would be sequenced for metaG
bc_mat <- as.matrix(phyloseq::distance(ps_rel, method = "bray"))
# Bray-Curtis matrix ranges from 0 (identical) to 1 (completely different)

# Sample metadata as data.frame
meta <- data.frame(sample_data(ps_rel))
meta$SampleID <- rownames(meta)

# Identify background seawater and kit negative samples
bg_sw_ids <- meta %>%
  filter(Sample_type == "Background SW") %>%
  pull(SampleID)

kit_neg_ids <- meta %>%
  filter(Sample_type == "Kit negative") %>%
  pull(SampleID)

# Initialize columns for distances
meta <- meta %>%
  mutate(
    dist_min_to_bgSW   = NA_real_,
    dist_mean_to_bgSW  = NA_real_,
    dist_min_to_kitneg = NA_real_,
    dist_mean_to_kitneg = NA_real_
  )

# --- distances to Background SW
# get all sample IDs that are not BGSW
non_bg_ids <- setdiff(rownames(bc_mat), bg_sw_ids)

# for each sample ID that is not BGSW in the distance matrix, pull out b-c dissimilarity
for (sid in non_bg_ids) {
  # create a vector of distances from sample sid to each Background SW sample
  dists <- bc_mat[sid, bg_sw_ids]
  # store in meta how close this sample is to the most similar BGSW sample
  meta$dist_min_to_bgSW[meta$SampleID == sid]  <- min(dists, na.rm = TRUE)
  # store in meta the mean of how close this sample to both BGSW samples
  meta$dist_mean_to_bgSW[meta$SampleID == sid] <- mean(dists, na.rm = TRUE)
}
# interpretation: high min value → none of the BGs look close → good for metagenome
# high mean value: high separation from both BGSW samples
# we want high min and mean

# print the SampleID from meta whose dist_min_to_bgSW is <.95
meta %>%
  filter(dist_min_to_bgSW < 0.95) %>%
  select(dist_min_to_bgSW, dist_mean_to_bgSW)

# -------- distances to kit negatives (within same Category) --------
# We only care about these Categories:
category <- c("Swab", "Filter", "Chimney_QiaLowBio", 
                            "Chimney_QiaCoextract")

for (cat in category) {
  # get IDs for kit negatives for this Category
  kit_ids_cat <- meta %>%
    filter(Sample_type == "Kit negative", Category == cat) %>%
    pull(SampleID)
  
  # get IDs for non-kit-negative samples in this Category
  sample_ids_cat <- meta %>%
    filter(Sample_type != "Kit negative", Category == cat) %>%
    pull(SampleID)
  
  # Only keep those that actually appear in bc_mat
  sample_ids_cat <- intersect(sample_ids_cat, rownames(bc_mat))
  
  # for sample ID in each category
  for (sid in sample_ids_cat) {
    # create a vector of distances from sample sid to each kit negative
    dists <- bc_mat[sid, kit_ids_cat]
    # store in meta how close this sample is to the most similar kit negative
    meta$dist_min_to_kitneg[meta$SampleID == sid]   <- min(dists, na.rm = TRUE)
    # store in meta the mean of how close this sample to all kit negatives
    meta$dist_mean_to_kitneg[meta$SampleID == sid]  <- mean(dists, na.rm = TRUE)
  }
}
# interpretation: high min value → none of the negatives look close → good for metagenome
# high mean value: high separation from all negatives
# we want high min and mean

# print the SampleID from meta whose dist_min_to_kitneg is <.95
meta %>%
  filter(dist_min_to_kitneg < 0.95) %>%
  select(dist_min_to_kitneg, dist_mean_to_kitneg)

# ---
# contaminant ASV counts & proportions per sample
# ---

# Use relative-abundance OTU table for ps_rel
otu_rel <- as(otu_table(ps_rel), "matrix")
# rows = ASVs, cols = samples
# values are relative abundances per ASV

# formatting for contam IDs
contam_ids <- contaminant_asvs$ASV

# get only the ASVs that correspond to the identified contaminants
contam_rel <- otu_rel[rownames(otu_rel) %in% contam_ids, , drop = FALSE]
# sanity check – should be > 0
length(contam_ids)

# compute number of contaminant ASVs present in each sample (count > 0)
# turns matrix into T/F values if >0, sums each column (column = 2)
n_contam_asvs <- apply(contam_rel > 0, 2, sum)
# compute fraction of the sample composed of contaminants
# adds up relative abundance of all contaminant ASVs in that sample
prop_contam_abund <- colSums(contam_rel)

# Ensures the outputs are associated with sample names
sample_ids <- colnames(otu_rel)

# puts stats in tibble
contam_stats <- tibble(
  SampleID          = sample_ids,
  n_contam_asvs     = n_contam_asvs[sample_ids], #number of contaminants present
  prop_contam_abund = prop_contam_abund[sample_ids] #fraction of sample that are contam
)

# join contaminant stats into meta
meta <- meta %>%
  left_join(contam_stats, by = "SampleID")

# fix issue with duplicate columns
meta <- meta %>%
  # drop the duplicate .x columns
  select(-n_contam_asvs.x, -prop_contam_abund.x) %>%
  # rename .y columns to cleaner names
     rename(
         n_contam_asvs     = n_contam_asvs.y,
         prop_contam_abund = prop_contam_abund.y
     )
# sanity check - should have just one column
names(meta)

# Top 10 samples by NUMBER of contaminant ASVs
meta %>%
  filter(Sample_or_negative == "Sample") %>%
  arrange(desc(n_contam_asvs), desc(prop_contam_abund)) %>%
  select(SampleID, n_contam_asvs, prop_contam_abund) %>%
  slice_head(n = 10)
# Top 10 samples by PROPORTION of contaminant relative abundance
meta %>%
  filter(Sample_or_negative == "Sample") %>%
  arrange(desc(prop_contam_abund)) %>%
  select(SampleID, n_contam_asvs, prop_contam_abund) %>%
  slice_head(n = 10)

# ---
# compute Pielou's evenness, species richness, and Shannon diversity per sample
# ---

# Since ASVs are rows, need to transpose for vegan functions
if (taxa_are_rows(ps_rel)) {
  otu_rel_t <- t(otu_rel)
} else {
  otu_rel_t <- otu_rel
}

# Compute species (ASV) richness (S)
richness <- specnumber(otu_rel_t)   # count of nonzero ASVs
# Compute Shannon diversity (H)
shannon <- diversity(otu_rel_t, index = "shannon")
# Compute Pielou's Evenness: J = H / ln(S)
evenness <- shannon / log(richness)

# Build tibble
evenness_df <- tibble(
  SampleID      = names(shannon),
  Shannon       = shannon,
  Richness      = richness,
  Evenness      = evenness
)

# Add to meta
meta <- meta %>%
  left_join(evenness_df, by = "SampleID")
# sanity check
names(meta)

# pull out top 10 samples with highest evenness (good for MAG recovery)
meta %>%
  filter(Sample_or_negative == "Sample") %>%
  arrange(Evenness, Shannon) %>%        # lowest evenness first
  select(SampleID, Evenness, Shannon, Richness) %>%
  slice_head(n = 10)
# pull out top 10 samples with lowest evenness (bad for MAG recovery)
meta %>%
  filter(Sample_or_negative == "Sample") %>%
  arrange(Evenness, Shannon) %>%        # lowest evenness first
  select(SampleID, Evenness, Shannon, Richness) %>%
  slice_tail(n = 10)
```

### Make more taxonomy bar plots in make-barplots.sh
```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --job-name=make-barplots
#SBATCH --mail-type=ALL
#SBATCH --mail-user=avery.fulford@whoi.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=6:00:00
#SBATCH --output=make-barplots.log

set -euo pipefail
export OMP_NUM_THREADS=4

module load singularity/3.7

# ---- Paths ----
CONTAINER=/vortexfs1/omics/huber/afulford/16S_nov12/amplicon_2025.10.sif
WORKDIR=/vortexfs1/omics/huber/afulford/16S_nov12/post-decontam

# Matplotlib config dir (avoid permission issues)
export MPLCONFIGDIR=/tmp/mplconfig

# Convenience alias
run() { singularity exec --bind "${WORKDIR}:/data" "${CONTAINER}" "$@"; }

# Barplot for all post-contam samples, but no negatives
run qiime taxa barplot \
  --i-table /data/feature-table.noneg.qza \
  --i-taxonomy /data/taxonomy.clean.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --o-visualization /data/barplots/barplot-postdecontam-nonegs.qzv

# Separate out only post-decontam filters and background seawater samples
run qiime feature-table filter-samples \
  --i-table /data/feature-table.noneg.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --p-where "Sample_type IN ('Filter','Background SW')" \
  --o-filtered-table /data/fluids-only.qza

# Barplot for only post-decontam filters
run qiime taxa barplot \
  --i-table /data/fluids-only.qza \
  --i-taxonomy /data/taxonomy.clean.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --o-visualization /data/barplots/barplot-postdecontam-fluids.qzv

# Separate out only post-decontam chimneys and swabs
run qiime feature-table filter-samples \
  --i-table /data/feature-table.noneg.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --p-where "Sample_type IN ('Chimney','Swab')" \
  --o-filtered-table /data/sulfides-only.qza

# Barplot for only post-decontam chimneys and swabs
run qiime taxa barplot \
  --i-table /data/sulfides-only.qza \
  --i-taxonomy /data/taxonomy.clean.qza \
  --m-metadata-file /data/metadata.clean.tsv \
  --o-visualization /data/barplots/barplot-postdecontam-sulfides.qzv
```