# 16S rRNA sequencing gene datasets for CRC data ###### tags: `16S rRNA` `COST Action` `ML` ### Used datasets | Dataset | 16S rRNA Region | Control (n) | Adenoma (n) | CRC (n) | Available metadata | | ----------------------------------------------------------------- | --- | ----------- | ----------- | ------- | ----------------------------------------------------------------- | | [Baxter](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4823848/) | V4 | 172 | 198 | 120 | Gender, age, weight, height, BMI, country, race | | [Zackular](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4221363/) | V4 | 30 | 30 | 30 | Gender, age, weight, height, BMI, country, race, FOBT, medication | | [Zeller](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4299606/) | V4 | 50 | 25 (small adenoma) + 13 (large adenoma) | 41 | Gender, age, BMI, country, FOBT | | **TOTAL** | V4 | 252 | 266 | 191 | *All of the above* | # Data processing & sharing All datasets were processed using [qiime2](https://docs.qiime2.org/2021.11/) pipeline with [DADA2](https://benjjneb.github.io/dada2/) for Sequence quality control and feature table construction, and [SILVA](https://www.arb-silva.de/) database for taxonomic assignment, then a phyloseq object was constructed. ## Download qiime2 artifacts [Here](https://drive.google.com/file/d/1ME5o9vIZ1opihPtkVFlc0gk--0lSTkjq/view?usp=share_link) ## Download processed data :open_file_folder: - Abundance table at genus level is [here.](https://drive.google.com/file/d/1w5sFdGfuwug4cQuEqgHcjKlfLbNF6iVn/view?usp=share_link) :warning: <font color = 'gray'>Sample counts with NO filtering.</font> - Clean metadata is [here](https://drive.google.com/file/d/1yvMj_t-XHAa3tIM-Qd59kPCAhBlgKRQh/view?usp=sharing). :memo: <font color = 'gray'> Countries: CA - Canada. USA - United States of America. FRA - France. </font> - Phyloseq object is [here](https://drive.google.com/file/d/1OZBXD1XB_5N4PbL8At1aksnG0Wns0F65/view?usp=sharing). ## Steps for processing data Qiime version: 2021.4 :eyes: ### ASV table construction ```bash=s conda activate qiime2-2021.4 qiime tools import \ --type 'SampleData[PairedEndSequencesWithQuality]' \ --input-path manifiest.csv \ --output-path demux.qza \ --input-format PairedEndFastqManifestPhred33 \ qiime demux summarize \ --i-data demux.qza \ --o-visualization demux.qzv qiime dada2 denoise-paired \ --i-demultiplexed-seqs demux.qza \ --p-n-threads 6 \ --p-trim-left-f 22 \ --p-trim-left-r 22 \ --p-trunc-len-f 240 \ --p-trunc-len-r 240 \ --o-table table.qza \ --o-representative-sequences rep-seqs.qza \ --o-denoising-stats denoising-stats.qza ``` ### Taxonomic assigmnet :warning: Download SILVA database classifier used with Qiime version 2021.4 ```bash=s wget https://data.qiime2.org/2021.4/common/silva-138-99-nb-classifier.qza ``` Taxonomic assigmnet ```bash=s qiime feature-classifier classify-sklearn \ --i-classifier silva-138-99-nb-classifier.qza \ --i-reads rep-seqs.qza \ --o-classification taxonomy.qza qiime metadata tabulate \ --m-input-file taxonomy.qza \ --o-visualization taxonomy.qzv qiime taxa barplot \ --i-table table.qza \ --i-taxonomy taxonomy.qza \ --m-metadata-file list.tsv \ --o-visualization taxa-bar-plots.qzv ``` ### Phyloseq ```R=1 library(tidyverse) library(qiime2R) library(phyloseq) library(microbiome) library(viridis) #ASV table#### ASV<-read_qza("DATA/table.qza") #Taxonomy table taxonomyq <-read_qza("DATA/taxonomy.qza") #Tax table transformations taxtableq <- taxonomyq$data %>% as.tibble() %>% separate(Taxon, sep=";", c("Kingdom","Phylum","Class","Order","Family","Genus","Species")) taxtableq$Kingdom <- gsub("d__", "",taxtableq$Kingdom) taxtableq$Phylum <- gsub("p__", "",taxtableq$Phylum) taxtableq$Class <- gsub("c__", "",taxtableq$Class) taxtableq$Order <- gsub("o__", "",taxtableq$Order) taxtableq$Family <- gsub("f__", "",taxtableq$Family) taxtableq$Genus <- gsub("g__", "",taxtableq$Genus) taxtableq$Species <- gsub("s__", "",taxtableq$Species) #Clean taxa: Remove Kingdom Unassigned & Eukaryota taxtableq <- taxtableq[grep("Eukaryota|Unassigned", taxtableq$Kingdom, invert=T),] #Tree tree<-read_qza("DATA/rooted-tree.qza") #Metadata metadata<-read_csv("DATA/metadata.csv") #Construct phyloseq object ASVs <- otu_table(ASV$data, taxa_are_rows = T) tree <- phy_tree(tree$data) TAXq <- tax_table(as.data.frame(taxtableq) %>% select_("-Confidence") %>% column_to_rownames("Feature.ID") %>% as.matrix()) #moving the taxonomy to the way phyloseq wants it sample_metadata <- sample_data(metadata %>% as.data.frame()) sample_names(sample_metadata) <- paste(metadata$SampleID) physeq<- merge_phyloseq(ASVs, tree, TAXq,sample_metadata) #Remove samples with no diagnosis - Some extra samples were inlcuded by mistake- physeq <- subset_samples(physeq, Diagnosis!="NA") #Agglomerate at genus level physeq.gen <- tax_glom(physeq, taxrank="Genus") #Write data for sharing data <- psmelt(physeq.gen) ASV_tab <- data %>% select(SampleID,Abundance,Genus) genus <- pivot_wider(ASV_tab, id_cols=SampleID, names_from=Genus, values_from=Abundance, values_fn = mean) write_csv(genus,"genus.csv") #Save phyloseq object for sharing saveRDS(physeq.gen, "physeq.RDS") ``` ### Metadata Complete metadata found for the three studies is [here](https://docs.google.com/spreadsheets/d/1iJ2JeL-FQrs4Aj9dGJuNusGvHf3-_rWs/edit?usp=sharing&ouid=104693150197955528399&rtpof=true&sd=true). :memo: <font color = 'gray'> In "Clean metadata" is only the metadata that is common for the three studies. </font> --- #### Other datasets not included :::spoiler Pending for response | Dataset | 16S rRNA region | Control | Adenoma | CRC | Status | | ------- | --------------- | ------- | ------- | --- | --- | | [Ahn](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866154/) | V3-V4 | 148 | 0 | 62 | Accesion trough dbGaP datasets... Very difficult no access yet | | [Flemer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5529966/) | V3-V4 | 37 | 0 | 43| There is a problem with ENA, but luckily it will be solved soon :crossed_fingers: | | [Hale](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225053/) | V3-V5 | 473 | 214 | 17 | Can't find sequencing data SRP070783 :question: | | [Wang](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3260502/)| V3 | 56 | 0 | 46 | Can't find metadata | :::