# Software

This document describes the motivation and function of the two software deliverables of my thesis project.

Publication-paired notebooks
: The scientific article I write will be accompanied by several notebooks illustrating the methods employed and the results achieved. While the static article will be understandable on its own, individual notebooks will be directly referenced throughout the text. The goal is to enhance the clarity, transmissibility, and reproducibility of the methods and results described in the paper. The notebooks will be provided as a supplement and packaged so as to ensure accessibility and portability across computing platforms.

Exploratory HC3 analysis app
: A web application specifically geared to exploratory analysis of the data in our publication. At this time, it is difficult to determine the final form of this application, as I will have to adjust it for user-friendliness and generality before publication. The graphical user interface will be written in Vue.js, and it will communicate with a backend GraphQL server running over a relational database (Postgres). It will have three major pieces of functionality: (1) browsing various custom in-browser visualizations (related to the published results) across the whole of the HC3 dataset, and possibly other datasets; (2) executing parameterized analysis pipelines (probably based on Metaflow); (3) dynamically rendering parameterized Jupyter notebooks against user-specified parameter sets.

    Due to the complexity of the underlying software stack, this application will be unmaintainable going forward. Therefore, it will *not* attempt to support a wide range of use cases. Instead, it will (1) serve as an open-source, forkable proof-of-concept prototype app for exploratory data analysis of large multi-session datasets; (2) act as a publication supplement, allowing users to explore a superset of the analysis results presented in the paper.

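
The third piece of app functionality, rendering a notebook against a user-specified parameter set, can be illustrated with a minimal sketch. This is not the app's implementation, which is still undecided. It assumes the convention used by tools such as papermill, in which a cell tagged `parameters` declares defaults and a new cell with overriding assignments is inserted directly after it. The helper names (`make_notebook`, `code_cell`, `inject_parameters`) and the parameter names in the usage example are hypothetical, and the sketch uses only the Python standard library.

```python
import copy


def make_notebook(cells):
    """Build a minimal nbformat-4 notebook as a plain dict (hypothetical helper)."""
    return {"nbformat": 4, "nbformat_minor": 4, "metadata": {}, "cells": cells}


def code_cell(source, tags=()):
    """Build a code cell; tags live in the cell metadata, as Jupyter stores them."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": list(tags)},
        "outputs": [],
        "source": source,
    }


def inject_parameters(notebook, params):
    """Return a copy of `notebook` with `params` assigned in a new cell.

    The new cell is placed right after the first cell tagged "parameters",
    so its assignments override the defaults declared there. If no cell
    carries that tag, the new cell is placed first.
    """
    nb = copy.deepcopy(notebook)  # leave the template untouched
    # Assumption: every value has a repr() that is valid Python source
    # (numbers, strings, lists, dicts of those, and so on).
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    injected = code_cell(source, tags=["injected-parameters"])

    index = 0
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            index = i + 1
            break
    nb["cells"].insert(index, injected)
    return nb


if __name__ == "__main__":
    # Hypothetical template: the parameter names are illustrative only.
    template = make_notebook([
        code_cell("session_id = 'default'\nbin_width_ms = 10", tags=["parameters"]),
        code_cell("print(session_id, bin_width_ms)"),
    ])
    rendered = inject_parameters(
        template, {"session_id": "session_042", "bin_width_ms": 25}
    )
    print(rendered["cells"][1]["source"])
```

The rendered dict can be written out with `json.dump` as an `.ipynb` file and then executed by whatever notebook runner the backend ends up using. Keeping the template unmodified means one template can be rendered against many parameter sets.
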
## Publication-paired notebooks

The published article will include several supplementary Jupyter notebooks. The notebooks are intended as a direct extension of the core text: they will illustrate the methods described and show how the results in the text were generated.

### Motivation

As data and computation have grown more abundant, data analysis and statistical modeling have become larger parts of the scientific enterprise. The growing scale and complexity of analytical techniques have enabled researchers to ask deeper questions of ever larger datasets. But they also mean that a typical published analysis is more difficult to understand, more likely to contain invalid results derived from software bugs or analytical errors, and more expensive to replicate.

At the same time, the expanding history and scale of the scientific literature means that there is more published work to sift through than ever before. The "publish-or-perish" economy of academia can incentivize researchers to publish findings without sufficient investment in software testing, result validation, and clarity of communication. Further, peer reviewers often lack the time, expertise, and incentive to perform quality control at a high level. In recent years, some researchers have started to study these problems and provide evidence that there is a "replication crisis" in science.

Scientific publishing has made some progress in alleviating these issues. Today's scientific publications often make supplemental material available, possibly including code or data, which may assist researchers seeking to replicate the results or adapt the methods of a publication. However, this material is often poorly documented and difficult to work with. It is not given the same attention as the core publication artifact: a static, paginated document. This format, largely unchanged over centuries, was shaped by the constraints of the printing press.
Its present persistence is a function of institutional inertia. Its digitized form, the PDF file, does not exploit the interactive and visual potential of computation.

Thus, a *de novo* digital format would have many advantages. Insofar as a scientific finding relies on data analysis, a well-designed digital publication format would enable greater **clarity of communication**, **transmissibility of methods**, and **reproducibility of results**.

Of course, changing any kind of institutional standard is notoriously difficult and uncertain. Agreement on the design of the standard would require buy-in from multiple publishers, each of which would need to develop new infrastructure to support it. And thousands of researchers in diverse fields would have to change their work habits and learn new technical skills. The scientific world lacks the cohesion to implement this kind of directed, top-down change.

Instead, useful change is likely to emerge from the bottom up. Individual research teams can experiment with novel supplemental publication artifacts. Hopefully, the most viable formats will gain currency over time. Publication-quality computational "notebooks" are one experimental format.

### An ideal format

What would an improved scientific publication format look like? It would have:

- Key concepts illustrated with rich, interactive visualizations. Where possible, models and algorithms would be represented the same way.
- The **data** used as input to the published analyses, both **available** and **documented**.
- The **code** used to produce the published results, available, formatted in a standard way, and readily executable on a typical PC without hassle.

These goals are realizable today.

clarity
: As with all technical writing, an important part of scientific writing is *clarity*. Research papers often describe complex ideas.
A computational medium offers the possibility of including illustrative, interactive visualizations that may aid in the presentation of data and the communication of mathematical ideas.

transmissibility
: The methods employed in a publication are transmissible to the extent that a reader can easily adapt them to a new problem or dataset. The traditional way of transmitting a technique is laborious: the reader has to digest the text describing the method and duplicate the setup. Success depends on the clarity and thoroughness of the published description, as well as on the duplicator's ability to interpret that description correctly. Thus, adapting a published technique to a new problem is frequently expensive and error-prone. Computational methods, however, can be published in executable form; that is, an artifact implementing the method can be copied and distributed at virtually zero cost.

reproducibility
: Reproducibility is a pillar of science. Yet reproducing already-published work is expensive, error-prone, and not incentivized. Consequently, the scientific literature includes many false claims due to experimenter error, bias (e.g. p-hacking), and fraud. The enhanced clarity and transmissibility of a computational format will lower the cost of reproducing research.

### Software difficulties

Personal computing has now been around for decades. Almost all scientific publications today include data analysis and visualizations prepared with software. Some level of code literacy is increasingly a prerequisite for a STEM graduate degree. Yet many papers are still published without any form of executable artifact. Even when such an artifact is included, it is often difficult to use: it may be poorly documented, have obscure dependencies, or contain bugs or inconsistencies that surprise new users.

Why haven't the benefits of a computational medium been more fully realized by scientific publishers?
To date, the main barrier has been the lack of easy-to-use infrastructure for interactivity and portability.

(1) Rich software tools (applications, libraries) are prohibitively expensive to create, document, and maintain. Their development is not incentivized by academia and requires a labor pool unavailable to most researchers. The successful software products that saturate daily life are backed by full-time customer support and software engineering teams. In contrast, the typical academic research application is developed by a patchwork of graduate students and postdocs, then only sporadically maintained by new personnel. Funding may dry up in a few years.

(2) Simpler software tools (scripts) fail to realize many of the benefits that make software attractive in the first place. Scripts are often rigid and opaque. The user interface (the command line) is limited. A script may have undocumented dependencies on obscure libraries. Visualizations generated by a script often lack a clear and obvious relationship to the code that generates them.

To sum up: a computational medium in scientific publishing promises large benefits in the clarity, transmissibility, and reproducibility of science. But in most research contexts, it is prohibitively expensive to develop and maintain applications or libraries sophisticated enough to realize these benefits. The most common software artifacts released with publications are scripts, which are cheap to develop but do not exploit the interactive and visual potential of personal computers.

### Solution: Containers and notebooks

Two technologies have emerged over the last several years that can serve as the basis for a new publication product. Computational notebooks (e.g. Jupyter) and containers (e.g. Docker) provide key infrastructure that can allow researchers to write paper supplements in a computational format. These technologies are already established and are being actively developed.
They are general-purpose enough to accommodate a huge range of use cases.