Handling statistics properly in my large data set analyses is always a difficult topic for me. Here's a collection of papers and discussions on the best ways to approach it. This page is a work in progress, assembled from pasted notes.
Found two Stack Exchange discussions on how to handle p-value corrections with genomic datasets:
The Benjamini-Yekutieli (BY) method was designed to handle correlated test results: it controls the FDR under arbitrary dependence between tests. The trade-off is that it is more conservative than the Benjamini-Hochberg (BH) procedure, which assumes independence or positive dependence.
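To make the BH/BY difference concrete, here is a minimal pure-Python sketch of both step-up adjustments (BY is just BH with an extra harmonic-sum penalty c(m)); function names and the toy p-values are mine:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def by_adjust(pvals):
    """Benjamini-Yekutieli: BH scaled by c(m) = sum_{i=1}^{m} 1/i, capped at 1."""
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1))
    return [min(1.0, p * c_m) for p in bh_adjust(pvals)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("BH:", [round(p, 4) for p in bh_adjust(pvals)])
print("BY:", [round(p, 4) for p in by_adjust(pvals)])
```

Every BY-adjusted value is at least its BH counterpart, which is the "more conservative" behavior described above.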
[…] I would suggest relying on FDR-based or maxT tests as implemented in the multtest R package (see mt.maxT), but the definitive guide to resampling strategies for genomic applications is Multiple Testing Procedures with Applications to Genomics, by Dudoit & van der Laan (Springer, 2008). See also Andrea Foulkes's book on genetics with R, which is reviewed in the JSS. She has great material on multiple testing procedures.
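For intuition about what a maxT resampling adjustment does, here is an illustrative sketch of a Westfall-Young-style maxT permutation procedure for a two-group comparison across many features. This is my own sketch in Python, not the actual code or API of multtest's mt.maxT; names like `maxT_adjusted_pvalues` and `n_perm` are assumptions:

```python
import numpy as np

def maxT_adjusted_pvalues(X, labels, n_perm=2000, seed=0):
    """X: (n_samples, n_features) matrix; labels: binary group membership.
    Returns maxT-adjusted p-values based on absolute t-like statistics."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)

    def abs_stats(lab):
        a, b = X[lab], X[~lab]
        # Welch-style t statistic per feature (no small-sample correction).
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        return np.abs((a.mean(axis=0) - b.mean(axis=0)) / se)

    t_obs = abs_stats(labels)
    # Null distribution of the MAXIMUM statistic across all features,
    # which is what gives maxT its strong family-wise error control.
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        max_null[i] = abs_stats(rng.permutation(labels)).max()
    # Adjusted p: how often the permutation max reaches each observed stat.
    return (1 + (max_null[:, None] >= t_obs[None, :]).sum(axis=0)) / (n_perm + 1)

# Toy usage: 40 samples, 50 features, one planted group effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 50))
X[:20, 0] += 3.0  # real effect in the first feature only
p = maxT_adjusted_pvalues(X, [True] * 20 + [False] * 20, n_perm=500)
print(p[0], p[1:].min())
```

Because the permutations preserve the correlation structure across features, this kind of adjustment adapts to dependence rather than assuming independence, which is why the answer above recommends it for genomic data.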
An interesting paper showing that ANCOM and ALDEx2 are the most robust differential abundance analysis (DAA) methods.
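A minimal sketch of the compositional idea underlying ALDEx2-style DAA: sequencing counts are relative, so tests are run on centered log-ratio (CLR) transformed values rather than raw counts. The pseudocount and function name here are illustrative choices of mine, not ALDEx2's actual API:

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Row-wise centered log-ratio transform of a sample-by-taxon count matrix."""
    logx = np.log(counts + pseudocount)
    # Subtracting each row's mean log puts samples on a common relative scale.
    return logx - logx.mean(axis=1, keepdims=True)

counts = np.array([[10, 0, 90], [20, 5, 75]])
print(clr(counts))
```

Each CLR-transformed row sums to zero by construction, which is what makes downstream per-taxon tests comparable across samples with different sequencing depths.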
This one is interesting because it states:
Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we design ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery.