sourmash submission review

pyopensci review responses === Hi @L, @e and @s, Thank you so much for your reviews and your enthusiasm for sourmash! We have put a lot of effort into maintenance over the past few years, and it's fantastic to get some external feedback for improvement. @ctb, @luiz, and I (the sourmash maintainers) have addressed each of your comments via the issues posted above. As those address each point in-line, I will summarize here and provide text and explanations where needed. We have cut a new release with these changes (v4.8.6), and plan to release version v4.9.0 when review is accepted and we have a new DOI. ## JOSS manuscript fixes - ensure all citations have DOI and update citation for recently published preprint (https://github.com/sourmash-bio/sourmash/pull/2964) ## README updates ### Badges added: - `repostatus`, `pyver` python versions - platform support (added windows) ### Comparison to similar packages: Our goal with sourmash is to maintain a swiss-army knife of k-mer based sequence analysis tools for lightweight sequence analysis. There are a myriad of software tools that focus on one or a few sourmash functions, such as pairwise sequence comparisons (e.g. Mash) or taxonomic classification (Kraken, sylph, etc), albeit with different approaches. To avoid a laundry list of comparisons, we have instead added the following text to the README, which we hope helps distinguish sourmash: > sourmash is a k-mer analysis multitool, and we aim to provide stable, robust programmatic and command-line APIs for a variety of sequence comparisons. Some of our special sauce includes: > - `FracMinHash` sketching, which enables accurate comparisons (including ANI) between data sets of different sizes > - `sourmash gather`, a combinatorial k-mer approach for more accurate metagenomic profiling > Please see the [sourmash publications](https://sourmash.readthedocs.io/en/latest/publications.html#sourmash-fundamentals) for details. ## Documentation Improvements ### k-mer and scaling recommendations - In the FAQ, we have added adjusted information on k-mer size and added a section for `scaled` selection to the FAQ. These can be seen here: https://sourmash.readthedocs.io/en/latest/faq.html#what-k-mer-size-s-should-i-use-with-sourmash - The FAQ also links to the relevant section here: https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-k-mer-size-s-should-i-use - fix download link in LCA tutorial (https://github.com/sourmash-bio/sourmash/pull/2920/files) ### Installation: - check tutorial directs users to install the right version of `miniforge` for their operating system (installing the linux version on mac will not be usable). - Remove `bioconda` requirement, as `sourmash-minimal` installs from `conda-forge`. ## User Experience Improvements - Added `--labels-to` and `--labels-from` to allow user customization of plot labels with `sourmash plot` - Added details to `--threshold` argument in `LCA classify` - > help="minimum number of hashes needed for a taxonomic classification (default: 5)") ## Code format standardization - fixed unused imports in `benchmarks/benchmarks.py` - refactored `benchmarks.py` to label relevant constants used for testing (e.g. MinHash k-mer size and the number of times to repeat each function for benchmarking) - lint all files via `ruff` and set up a linting pre-commit hook to check lint for all PR's. Set up `tox -e fix_lint` for automatic fixes (https://github.com/sourmash-bio/sourmash/pull/2427) - `nix flake and envrc` - update for arm-based macs (https://github.com/sourmash-bio/sourmash/pull/2975) ## Duplicated code in `test_sourmash.py` While we absolutely agree that helper functions improve maintainability, in our tests, the duplicated code is a result of a specific testing philosophy, which is that tests should be simple, "flat" (not call a lot of test-specific functions), have low code complexity (minimize `if` and `for` statements), and otherwise not be viewed as primary code to be maintained and refactored regularly. This has led to a lot of repeated code in the tests (which currently number 2753!), but often these tests must load or modify test data in a unique manner in order to trigger edge cases and check bugfixes we've completed. So we want to avoid editing the tests and potentially reducing branch coverage and their coverage of edge cases. For the future, we would appreciate any suggestions on large-scale ways to discover and/or fix instances of widely-shared code blocks and update some of the legacy items (e.g. decorators being used rather than our updated text fixtures). --- --- --- # submission issue with reviews: https://github.com/pyOpenSci/software-submission/issues/129 ## To do: - add badges for python versions supported - make sure JOSS paper references are visible / available / linked - modify miniforge installation instructions to make sure `mamba` is immediately callable after miniforge installation?? - Adjust plot example: include arguments to allow users to adjust the legends of both matrix and dendrogram plots in the function mentioned in the example `sourmash plot --pdf --labels ecoli_cmp` - `benchmarks/benchmarks.py` - `os`, `Path` are unused? - "Magic numbers: There are some magic numbers in the code (e.g., 500, 21, 3000, 10000, 1000, 20). Is it possible to define these numbers as constants with meaningful names or add the comments to make the code more self-explanatory?"" - Correct LCA database download link in LCA tutorial [#Building your own LCA database](https://sourmash.readthedocs.io/en/latest/tutorials-lca.html). The correct link should be `curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz` - Regarding `test/test_sourmash.py` : Refactoring duplicate code into helper functions to avoid redundancy and improve maintainability. - docs: add recommendations for the selection of ksize and scaled. - docs: What are the consequences of using different scales? Do you recommend using the same scale to generate the signatures for genomes or databases? - --> add page explaining FracMinHash a bit better? - Is it possible to add explanations and default values to --threshold in the help message of `sourmash lca classify -h` # Review Comments ## Review #1 > @bluegenes Outstanding work with `sourmash`! Your commitment to creating a package that's both easily maintainable and well-documented truly shines through. The code is impressively organized, accompanied by clear comments explaining each section, making it easy to comprehend the purpose of each file and function. Furthermore, the documentation for `high-level files and directories` is exceptionally clear — I intend to adopt this clarity in my own work. I've noted a few minor comments for your consideration. > > * 1. `mamba` was not callable after installing `miniforge3` on my Mac OS. **We have modified installation instructions for `miniforge` to ensure that `mamba` is immediately callable.** > * 2. Is it possible to include arguments to allow users to adjust the legends of both matrix and dendrogram plots in the function mentioned in the example: `sourmash plot --pdf --lables ecoli_cmp` ? > * 3. Regarding `benchmarks/benchmarks.py` : > > * 1. Unused libraries: `os` & `Path` > * 2. Magic numbers: There are some magic numbers in the code (e.g., 500, 21, 3000, 10000, 1000, 20). Is it possible to define these numbers as constants with meaningful names or add the comments to make the code more self-explanatory? > * 4. The download link doesn't work in the section [#Building your own LCA database](https://sourmash.readthedocs.io/en/latest/tutorials-lca.html). The correct link should be `curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz` > * 4. Regarding `test/test_sourmash.py` : Refactoring duplicate code into helper functions to avoid redundancy and improve maintainability. For instance: > > > > > ``` > def load_signatures(testsigs, ksize=21, moltype='dna'): > """Load signatures from files.""" > sigs = [] > for fn in testsigs: > sigs.append(sourmash.load_one_signature(fn, ksize=ksize, select_moltype=moltype)) > return sigs > > def test_compare_serial(runtmp): > """Test compare command serially.""" > c = runtmp > > testsigs = utils.get_test_data('genome-s1*.sig') > testsigs = glob.glob(testsigs) > > c.run_sourmash('compare', '-o', 'cmp', '-k', '21', '--dna', *testsigs) > > cmp_outfile = c.output('cmp') > assert os.path.exists(cmp_outfile) > cmp_out = numpy.load(cmp_outfile) > > sigs = load_signatures(testsigs) > > cmp_calc = numpy.zeros([len(sigs), len(sigs)]) > for i, si in enumerate(sigs): > for j, sj in enumerate(sigs): > cmp_calc[i][j] = si.similarity(sj) > > sigs = load_signatures(testsigs) > > assert (cmp_out == cmp_calc).all() > ``` > > * 5. I may have overlooked it in your documentation. It would be helpful if you could provide some recommendations for the selection of `k-mers` and `scales`. > * 6. What are the consequences of using different scales? Do you recommend using the same scale to generate the signatures for genomes or databases? > > ``` > sourmash gather ecoli-genome.sig inter/genbank-k31.lca.json.gz > WARNING: final scaled was 10000, vs query scaled of 1000 > ``` > > * 7. Is it possible to add explanations and default values to `--threshold` in the help message of `sourmash lca classify -h` ## Review #2 Review Comments @bluegenes This is one of the best maintained packages I've seen. The development environment is easy to setup and work in, each commit has meaningful messages attached, and the documentation is extensive and up to date. I plan on showing this repository to my own team concerning how projects should look when maintained well. I ran into a few hiccups along the way however The nix flake and envrc needs to be updated to make sure the entire environment is available with a simple call to direnv allow. There are unused imports in benchmarks/benchmarks.py Not all tests ran out of the box in the dev environment via the nix flake but were easily made to work and seem to work in the GitHub actions. Flake8 revealed a lot of errors, it might not be the worst idea to run code through a linter and fix everything that can be automatically fixed, this may help with the unused imports as well. These are none critical but it helps.