# Data Analysis Programming

###### tags: `Work` `PIG` `Thoughts` `Data` `Coaching`

After working together on the SCSP Dashboard, Alex (AR) requested that I give him a 'lecture' about the process I use to set up analysis project code, and the principles that drive that process. This seems like a useful thing to articulate, both as Alex and I figure out how we're going to collaborate on analysis projects like OLES Demand Modelling^[Placeholder], and as the Blueprint crew builds analysis capacity and seeks common practices.

## Resources

Unless otherwise stated, all resources in this list are available free of charge.

### Lineage

#### Wickham, [The Tidyverse Style Guide](https://style.tidyverse.org/files.html#names)

This guide offers practical advice on how to write and organize code in R, along with a solid backbone of principles rooted in experience working on analysis projects. Highly recommended reading on its own; most of the things I do follow from the values articulated in this guide.

#### Wickham, [R Packages](https://r-pkgs.org/)

This book answers in detail the questions _what is an R package? Why would I want to make one? How can I build my own?_ Packages are one of the best features of R, and one of its key advantages over other statistical suites (e.g. STATA, SPSS, whose customizability is limited and inaccessible) and other programming languages (e.g. python, java, C++, whose analogous libraries demand regular care and feeding, and sometimes painful, confusing setup). They are easy to make, easy to install, and safe to mess around with. Many of the recommendations I make here are driven by my experiences developing packages.

### Recommended Reading

#### Wickham, [R for Data Science](https://r4ds.had.co.nz/)

#### Wickham, [Advanced R](https://adv-r.hadley.nz/)

#### Boehmke, [Hands-On Machine Learning with R](https://bradleyboehmke.github.io/HOML/)

#### Bryan, [Happy Git with R](https://happygitwithr.com/)

## Principles

### Reproduce

#### What is the big idea?

The big idea around reproducibility is shifting the focus from building a product or a set of products (dashboard, report, whatever) to building the process that creates those products. The code base for each analysis project should be able to run from start to finish without demanding that the R user do anything interactive. Further, input data files should be treated as functionally *immutable*: they should not be manipulated manually in a graphical spreadsheet program.

#### Why is it important?

It is vitally important that your analysis work be reproducible. A reproducible process can be rapidly and responsively adjusted to correct an error in the output. Corrections made to the process will persist in future iterations. Collaborators can step into a reproducible process, run through your code, and examine results as they are being generated. The outputs from reproducible work are higher quality; reproducible projects are easier to work on; and reproducible code lends itself better to reuse.

Especially at Blueprint, it can often feel like there isn't enough time to do things "the right way", and so considerations like this get left by the wayside. In my experience, building reproducibility into analysis work takes a modest time investment up front, and often saves days of effort and weeks of calendar time over the course of a project. Once you start to reap the benefits of working reproducibly, you will instinctively tend towards it.

#### What does it mean for data analysis at Blueprint?

The best way to ensure that the process is reproducible is to regularly restart your R session, fire up your project in an empty environment, and run it from the top. Wherever it breaks, demands interaction, or runs out of order, figure out what's wrong and fix it with code. For this to work, you'll first have to configure your R / RStudio settings to never save your workspace. Treat frustrating or difficult gaps between your present toolkit and the toolkit you need to overcome a given reproducibility challenge as a delightful learning opportunity. Embrace the part of you that gets satisfaction from elegantly solving problems.
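As a minimal sketch of what this can look like in practice (the script and folder names below are hypothetical placeholders, not a prescription): one way to enforce the blank slate is `usethis::use_blank_slate()`, and a single top-level script can express the whole start-to-finish run.

```r
# One-time setup: configure RStudio to start every session from a blank
# slate (never save or restore .RData).
# install.packages("usethis")
usethis::use_blank_slate()

# A top-level run script (e.g. run.R) that reproduces the project end to
# end. The file names below are hypothetical placeholders.
source("R/01-import.R") # read raw inputs from data-raw/, never edit them
source("R/02-clean.R")  # derive analysis-ready tables, write to data/
source("R/03-model.R")  # fit models
source("R/04-report.R") # render outputs: tables, figures, report
```

If the whole project runs cleanly from that one script in a fresh session, it is reproducible in the sense above.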
### Reuse

#### What is the big idea?

You will be a more powerful and more efficient R user the less you think of R as a program for statistical analysis, like STATA or SPSS, and the more you think of it as a programming _language_, like python or javascript. One of R's best features, as a language, is that it makes reusing code -- both within and across projects -- straightforward, with functions and packages, respectively. To make the most of the medium you're working in, you should always be on the lookout for opportunities to build and refine functions and packages to support your work.

#### Why is it important?

Within an individual project, building a reusable codebase makes it easier and faster to correct an error or to extend the analysis, whether the necessary change is a result of external feedback or personal epiphany. It also makes it easier to organize your work and to make the code itself comprehensible to would-be collaborators.

Building generalizable solutions to reuse across projects hasn't been a priority for Blueprint, but as our team grows larger, it's something that we should lean into more.

#### What does it mean for data analysis at Blueprint?

For us, as a capacity domain, it should mean looking for and investing in opportunities to build reusable tools, processes, code snippets, whatever; to document them; and to share them.

### Comprehend

#### What is the big idea?

The big idea here is to consider the _human reader_ of your code as you're writing it, whether that is a colleague looking to collaborate on the project, or your future self, whose working memory has long since emptied of any relevant context.

Often, a person's first instinct in this direction is to bulk up on comments. I'm not big on comments, because they represent a truth that is parallel to the one represented in the code: just because a comment says something is happening doesn't mean it actually is, or that it's happening correctly. A reader of a script commented in the typical style must therefore constantly validate the commentary against their own interpretation of the code. This tends either to add confusion or to discourage close, mechanistic reading of the code itself, creating space for errors that persist through review.

Rather than relying on comments to make your code human readable, use creative and elegant object names, clearly scoped functions, and consistent code style to make clear what's going on and why.

#### Why is it important?

It's easier to work on code if the code makes clear at every step what it's doing, for you and your colleagues. It's also easier to identify and work with chunks of code that you want to reuse and package.

#### What does it mean for data analysis at Blueprint?

This means naming things nicely, using functions to structure your work, and applying consistent styling, per the [Tidyverse Style Guide](https://style.tidyverse.org/index.html). I would recommend some minor deviations from the guide, but they are best introduced once a learner is already using the tidyverse style by habit.
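To make this concrete, here is a contrived sketch (the `clients` data frame and all of its columns are invented for illustration) contrasting a comment-reliant chunk with the same logic expressed through good names and a clearly scoped function:

```r
library(dplyr)
library(tibble)

# A toy data frame standing in for real client records; every column here
# is invented for illustration.
clients <- tibble(
  intake_year = c(2022, 2023, 2023, 2023),
  region      = c("East", "East", "West", "West"),
  wait_days   = c(30, 12, 45, NA)
)

# Comment-reliant version: opaque names force the comments to carry the
# meaning, and nothing guarantees the comments stay true.
# keep 2023 intakes, then average wait by region
x2 <- clients %>%
  filter(intake_year == 2023) %>%
  group_by(region) %>%
  summarise(m = mean(wait_days, na.rm = TRUE))

# Self-documenting version: a well-named, clearly scoped function makes
# those comments redundant, and works for any year.
summarise_mean_wait_by_region <- function(clients, year) {
  clients %>%
    filter(intake_year == year) %>%
    group_by(region) %>%
    summarise(mean_wait_days = mean(wait_days, na.rm = TRUE))
}

wait_by_region_2023 <- summarise_mean_wait_by_region(clients, year = 2023)
```

The second version also illustrates the reuse principle: once an operation has a name and a clear scope, it's a short step to sharing it across scripts, or across projects in a package.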
### Collaborate

#### What is the big idea?

Collaboration is both a product and a driver of the above. Comprehensible, reproducible analysis projects can be stepped into by colleagues, who will feel at home building more on them.

#### Why is it important?

#### What does it mean for data analysis at Blueprint?

## Practices

### Scoping the problem

### Naming things

### Organizing scripts

### Organizing data files

### Functions!

### Building packages

## Processes

WIP

Corporate processes that exist to facilitate the interface between the data analysis executed to support a project and the other aspects of that project's output.

### Resourcing Analysis Work

### Initializing an Analysis Project

### Generating Outputs

### Iterating Content

### Using Version Control