# Draft blog post: I found a sizeable bug in our published software. Now what?
(Author: Titus Brown)
tags: replicability, repeatability, reproducibility, bugs, workflows
A few weeks ago, as part of [an effort to reduce memory consumption in our spacegraphcats software](https://github.com/spacegraphcats/spacegraphcats/pull/310), I found [a bug](https://github.com/spacegraphcats/spacegraphcats/issues/299). This bug has [serious consequences](https://github.com/spacegraphcats/spacegraphcats/pull/310#issuecomment-712912472) for at least some results, and [will necessitate rerunning several thousands of hours of compute](https://github.com/spacegraphcats/spacegraphcats/pull/310#issuecomment-712925887) for a current project in the lab.
Note that our first paper [on using spacegraphcats](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02066-4) came out recently - before this bug was discovered...
## The backstory
The bug itself isn't very interesting.
Briefly, spacegraphcats makes use of some super awesome software called [BCALM](https://pubmed.ncbi.nlm.nih.gov/27307618/) to generate an assembly graph from input DNA sequences.
Since BCALM is multithreaded and doesn't output the graph components in a consistent, reproducible order, we post-process the graph to make it consistent, so that we don't have to worry about variability in our results downstream.
During this post-processing, we by default collapse certain "uninteresting" nodes in this graph.
The bug is this: the if/else statement in the code that handles this option will drop a reasonably substantial chunk of the graph *if* this graph post-processing is on. It's a simple "whoops, iterate over *this* list of keys not *that* list of keys" bug.
Now, I grok this code. I wrote some of it. I am responsible for testing and evaluating this portion of spacegraphcats. I knew there was dropped sequence content, and it made sense to me because *of course* we're going to drop data when we merge nodes.
(No. Duh. The nodes are *merged*, and their content is *merged*. We shouldn't be dropping sequence *content*. Bad Titus.)
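To make the bug pattern concrete, here's a minimal, hypothetical sketch in Python - this is *not* the actual spacegraphcats code, and the function names, node IDs, and sequences are made up - showing how iterating over the pre-merge node IDs instead of the post-merge ones silently drops the merged nodes' sequence content:

```python
# Hypothetical sketch of the bug *pattern*, not the real spacegraphcats code.

def collapse(node_seqs, merge_map):
    """Collapse nodes: merge_map maps an old node ID to its (possibly new)
    merged node ID. Merged nodes keep the concatenated sequence content."""
    merged = {}
    for old_id, seq in node_seqs.items():
        new_id = merge_map.get(old_id, old_id)
        merged[new_id] = merged.get(new_id, "") + seq
    return merged

def output_buggy(node_seqs, merged):
    # BUG: iterate over the *pre-merge* IDs; merged nodes got fresh IDs,
    # so their concatenated sequence content never gets written out.
    return {nid: merged[nid] for nid in node_seqs if nid in merged}

def output_fixed(merged):
    # FIX: iterate over the *post-merge* IDs, so nothing is dropped.
    return dict(merged)

node_seqs = {1: "ACGT", 2: "TTAA", 3: "GGCC"}
merge_map = {2: 100, 3: 100}        # nodes 2 and 3 collapse into new node 100
merged = collapse(node_seqs, merge_map)

print(output_buggy(node_seqs, merged))  # {1: 'ACGT'} -- node 100 is lost
print(output_fixed(merged))             # {1: 'ACGT', 100: 'TTAAGGCC'}
```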
Then, recently, I was playing around with spacegraphcats to see if I could do some cool new stuff, and I was feeding reference genomes in, and large portions of the reference genome were simply *missing*. And I thought, huh? Why? And I dug, and I dug.
And I realized our code was broken.
And then I evaluated it on some real data and saw that [we were losing 20-40% of the sequence content](https://pubmed.ncbi.nlm.nih.gov/27307618/).
And then I freaked out, and started worrying about [our published paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02066-4) and whether we would need to retract it, or update it.
And then I thought some more and realized that we were probably only going to _gain_ sensitivity and that improved sensitivity would not be a bad outcome.
And then I realized I should probably just rerun the paper and take a look!
## Re-running our paper pipeline.
Almost all of the key results in our paper were generated by an automated snakemake pipeline, the first part of which is available [here](https://github.com/spacegraphcats/2018-paper-spacegraphcats/tree/master/pipeline-base). This was my first real snakemake pipeline, and it's a bit ugly, but it worked quite nicely at the time!
Since it was completely automated, I decided to rerun the portion of the pipeline responsible for figures 2 and 3 in the paper, which represent the intermediate results that should be most immediately dependent on the behavior I fixed.
But I didn't want to mess up the paper repo, so I copied it over to a new repo where I could edit the pipeline in isolation:
https://github.com/ctb/2020-rerun-hu-s1
and then I re-ran it successfully. Yay w00t!
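For flavor, here's roughly what re-running only part of a snakemake pipeline looks like from a fresh clone. This is a hedged sketch, not the actual invocation for the repo above, and the target paths are placeholders I made up for illustration:

```python
# Hypothetical sketch: re-run only the outputs needed for specific figures.
# The target paths are placeholders, not the real outputs of the
# 2018-paper-spacegraphcats pipeline.
import subprocess

targets = [
    "outputs/fig2_results.csv",   # placeholder target for figure 2
    "outputs/fig3_results.csv",   # placeholder target for figure 3
]

# Asking snakemake for just these targets re-runs only the rules that
# produce them, plus their upstream dependencies.
subprocess.run(["snakemake", "--use-conda", "-j", "8", *targets], check=True)
```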
## The upshot: nothing significant changed, and we're not sure why.
After digging into the results, it looks like our key results on real data [don't change much](https://github.com/spacegraphcats/spacegraphcats/pull/310#issuecomment-719551496).
But... **I'm not really sure why.** It seems clear that better sensitivity should *improve our results*. The fact that it doesn't is puzzling!
My guess is that it's actually a result of the *biology* in play - our results in that paper are based on studying core components of microbial genomes, which might not be affected by the buggy part of the software. If we were looking at the accessory elements in the genomes, we might see much larger changes. Conveniently, this is extremely relevant to some new work that Taylor Reiter is doing, so we might be able to evaluate it there. (Although there's a question as to why we would take the time to do that, since then we'd be studying the software bug and not the biology!)
So, all is well, it seems. But this all raises some interesting questions...
## Some interesting questions
### Is repeatability useful?
It was neat to be able to rerun the analysis with so little effort!
But while it was trivial for me (as the author of the pipeline), and probably would have been simple for e.g. Taylor, who is a co-author and a snakemake/spacegraphcats expert, it is not clear that it would have been as easy for other people.
It is also not clear that others would have been able to interpret the results. Frankly, it was hard for *me* to interpret the results, because I was expecting differences - I spent *a lot* of time making sure the files were the right ones!
To expand on this, **my** time investment was:
* ~1-2 hours - setting up the run and fixing minor stuff
* ~1 week - waiting for the compute to finish
* ~3-4 hours - reproducing graphs and trying to understand what was going on
My prediction is that it would have taken someone else 2-3 days to dig into the graphs and really understand them, on top of all the time spent struggling with getting things to run. That's a big investment!
I could have made some of this easier, prospectively, by writing more and better documentation. But... papers are big and complicated, and it's hard to know in advance which parts will matter. And it's hard to know if there's any point to doing it, which reduces my motivation.
Moreover, **why would anyone want to repeat our paper**?! I honestly don't know what the use case would be for that!
My mental model here is that *most* repeaters would want to either (a) apply the same pipeline to new data to do new science - bio researchers! - or (b) apply their own pipeline to "our" data so they can compare their results to ours - methods researchers! Admittedly, this is based on my own experience in bioinformatics, where I usually pick up other people's software for (a) or (b) but never actually re-run their analysis.
### Is it worth it to revise old papers with new understanding?
### What's the right way to engage with the journal around these questions?
### If the results had been dramatically *better*, would the journal be interested?
---
## Concluding thoughts
It would be lovely if most papers were set up to allow easy re-running of computational workflows!
I think the main benefits would accrue to the authors, though, and I don't see that being a big enough incentive on its own; I'm also not sure why journals or reviewers would push this issue without better incentives.
And it's a fair bit of work on top of all the other work. Research is already hard, and just getting workflows or pipelines to run is plenty difficult in the first place!
Several people - Fernando Perez in particular - have made the point that what you really want to do is bake good repeatability practice into projects from the beginning. We did that for this project (and try to do it for most projects), and I think it paid off.
Which, honestly, means this kind of thing needs to be baked into teaching and training at the undergraduate and graduate levels.
But I'm not 100% sure we know what that training and teaching needs to be, or how to teach it effectively. And I think most faculty aren't familiar with the kinds of software and approaches that make this easy.
And I'm also not sure that we actually know what the right approach is (although I can point to lots of wrong or broken approaches, and have a lot of personal experience with them :).
I think this is where we need to really evolve our practice, and communicate it, and work collectively as a field towards better practice.
To that end, check out our workflows paper [Streamlining data-intensive biology with workflow systems](https://academic.oup.com/gigascience/article/10/1/giaa140/6092773?login=true).
--titus