# Software
This document describes the motivation and function of the two software
deliverables of my thesis project.
Publication-paired notebooks
: The scientific article I write will be accompanied by several notebooks
illustrating the methods employed and results achieved. While the static
article will be understandable on its own, different notebooks will be
directly referenced throughout the text. The goal is to enhance the clarity,
transmissibility, and reproducibility of the methods and results described in
the paper. The notebooks will be provided as a supplement and packaged in a
form that ensures accessibility and portability across computing platforms.
Exploratory HC3 analysis app
: A web application specifically geared to exploratory analysis of the data in
our publication. At this time, it is difficult to determine the final form of
this application, as I will have to adjust it for user-friendliness and
generality before publication. The graphical user interface will be written
in Vue.js, and it will communicate with a backend GraphQL server
running over a relational database (Postgres). It will have three major
pieces of functionality: (1) browsing custom in-browser visualizations
(related to the published results) across the whole HC3 dataset (and possibly
other datasets); (2) executing parameterized analysis pipelines
(probably based on Metaflow); (3) dynamically rendering parameterized Jupyter
notebooks against user-specified parameter sets. Due to the complexity of the
underlying software stack, this application will be unmaintainable going
forward. Therefore, it will *not* attempt to support a wide range of use
cases. Instead, it will (1) serve as an open-source, forkable
proof-of-concept prototype app for exploratory data analysis of large
multi-session datasets; (2) act as a publication supplement, allowing users
to explore a superset of the analysis results presented in the paper.
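
Functionality (3) above, rendering a parameterized notebook against a user-specified parameter set, can be sketched with the standard library alone. This is a minimal illustration, not the app's implementation. It assumes the notebook is an nbformat-4 JSON document and borrows the cell-tagging convention used by tools such as papermill, which this document does not name. The function `inject_parameters`, the parameter names `session` and `bin_ms`, and their values are all hypothetical.

```python
import copy


def inject_parameters(notebook: dict, params: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with `params` injected.

    Assumption: defaults live in a cell tagged "parameters"; the overrides
    go in a new cell placed right after it, so they win when the notebook runs.
    """
    nb = copy.deepcopy(notebook)  # leave the template untouched
    # One assignment per parameter; repr() is adequate for simple literals.
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": source,
    }
    # Insert after the first cell tagged "parameters", else at the top.
    position = 0
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    nb["cells"].insert(position, injected)
    return nb


# Hypothetical two-cell template notebook with default parameter values.
template = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "execution_count": None,
         "metadata": {"tags": ["parameters"]}, "outputs": [],
         "source": "session = 'default'\nbin_ms = 10"},
        {"cell_type": "code", "execution_count": None,
         "metadata": {}, "outputs": [],
         "source": "print(session, bin_ms)"},
    ],
}

rendered = inject_parameters(template, {"session": "session-a", "bin_ms": 25})
print(rendered["cells"][1]["source"])
```

The returned dict could then be written to disk and executed by whatever notebook runner the backend uses. Keeping the injection step a pure function of the template and the parameter set makes each rendered notebook reproducible from those two inputs.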
## Publication-paired notebooks
The published article will include several supplementary Jupyter notebooks. The
notebooks are intended to be a direct extension of the core text. They will
illustrate the methods described and the generation of the results in the text.
### Motivation
As data and computation have grown more abundant, data analysis and
statistical modeling have become larger parts of the scientific enterprise.
The growing scale and complexity of analytical techniques have enabled
researchers to ask deeper questions of ever larger datasets. But they also mean
that a typical published analysis is more difficult to understand, more likely
to contain invalid results derived from software bugs or analytical errors, and
more expensive to replicate. At the same time, the expanding
history and scale of the scientific literature mean that there is more
published work to sift through than ever before. The "publish-or-perish"
economy of academia can incentivize researchers to publish findings without
sufficient investment in software testing, result validation, and clarity of
communication. Further, peer reviewers often lack the time, expertise, and
incentive to perform quality control at a high level. In recent years, some
researchers have started to study these problems and provide evidence that
there is a "replication crisis" in science.
Scientific publishing has made some progress in alleviating these issues.
Today's scientific publications often make available supplemental material,
possibly including code or data, which may assist researchers seeking to
replicate the results or adapt the methods of a publication. However, this
material is often poorly documented and difficult to work with. It is not given
the same attention as the core publication artifact: a static, paginated
document. This format, largely unchanged over centuries, was shaped by the
constraints of the printing press. Its present persistence is a function of
institutional inertia. Its digitized form, the PDF file, does not exploit the
interactive and visual potential of computation. Thus, a *de novo* digital
format would have many advantages. Insofar as a scientific finding relies on
data analysis, a well-designed digital publication format would enable greater
**clarity of communication**, **transmissibility of methods**, and
**reproducibility of results**.
Of course, changing any kind of institutional standard is notoriously difficult
and uncertain. Agreement on the design of the standard would require buy-in
from multiple publishers, each of which would need to develop new
infrastructure to support the standard. And thousands of researchers in diverse
fields would have to change their work habits and learn new technical skills.
The scientific world lacks the cohesion to implement this kind of directed,
top-down change. Instead, useful change is likely to emerge from the bottom up.
Individual research teams can experiment with novel supplemental publication
artifacts. Hopefully, the most viable formats will gain currency over time.
Publication-quality computational "notebooks" are one experimental format.
### An ideal format
What would an improved scientific publication format look like? It would have:
- Key concepts illustrated with rich, interactive visualizations, including
models and algorithms where possible.
- The **data** used as input to the published analyses, both
**available** and **documented**.
- The **code** used to produce the published results, available,
formatted in a standard way, and readily executable on a typical PC without
hassle.
These goals are realizable today.
Clarity
: As with all technical writing, an important part of scientific writing is
*clarity*. Research papers often describe complex ideas. A computational medium
offers the possibility of including illustrative, interactive visualizations
that may aid in the presentation of data and the communication of
mathematical ideas.
Transmissibility
: The methods employed in a publication are transmissible to the extent that
a reader can easily adapt them to a new problem or dataset. The traditional
way of transmitting a technique is laborious: the reader has to digest the
text describing the method and duplicate the setup. This depends on the
clarity and thoroughness of the published description, as well as the
reader's ability to correctly interpret the description. Thus, adapting a
published technique to a new problem is frequently expensive and error-prone.
Computational methods, however, can be published in executable form; that
is, an artifact implementing the method can be copied and distributed at
virtually zero cost.
Reproducibility
: Reproducibility is a pillar of science, yet reproducing already-published work is
expensive, error-prone, and not incentivized. Consequently, the scientific
literature includes many false claims due to experimenter error, bias (e.g.
p-hacking), and fraud. The enhanced clarity and transmissibility of a
computational format will lower the cost of research reproduction.
### Software difficulties
Personal computing has now been around for decades. Almost all scientific
publications today include data analysis and visualizations prepared with
software. Some level of code literacy is increasingly a prerequisite for a STEM
graduate degree.
Yet many papers are still published without any form of executable artifact.
Even when such an artifact is included, it is often difficult to use: it may
be poorly documented, have obscure dependencies, or contain bugs or
inconsistencies that surprise new users. Why haven't the benefits of a
computational medium been more fully realized by scientific publishers? To
date, the main barrier has been the lack of easy-to-use infrastructure for
interactivity and portability.
(1) Rich software tools (applications, libraries) are prohibitively expensive
to create, document, and maintain. Development is not incentivized by academia
and requires a labor pool unavailable to most researchers. The successful
software products that saturate daily life are backed by full-time customer
support and software engineering teams. In contrast, the typical academic
research application is developed by a patchwork of graduate students and
postdocs, then only sporadically maintained by new personnel. Funding may dry
up in a few years.
(2) Simpler software tools (scripts) fail to realize many of the benefits that
make software attractive in the first place. Scripts are often rigid and
opaque. The user interface (the command line) is limited. A script may have
undocumented dependencies on obscure libraries. Visualizations generated by a
script often lack a clear and obvious relationship to the code that generates
them.
To sum up: a computational medium in scientific publishing promises large
potential benefits in clarity, transmissibility, and reproducibility of
science. But in most research contexts, it is prohibitively expensive to
develop and maintain applications or libraries sophisticated enough to realize
these benefits. The most common software artifacts released with publications
are scripts, which are cheap to develop but do not exploit the interactive and
visual potential of personal computers.
### Solution: Containers and notebooks
Two technologies have emerged over the last several years that can serve as the
basis for a new publication product. Computational notebooks (e.g. Jupyter) and
containers (e.g. Docker) provide key infrastructure that can allow researchers
to write paper supplements in a computational format. These technologies are
already established and are being actively developed. They are general-purpose
enough to accommodate a huge range of use cases.