CDRC Carbon Footprint Calculator RSE Project 2022

---
title: CDRC Carbon Footprint Calculator RSE Project 2022
tags: rse-projects
---

# RSE Project Hackpad - CDRC Carbon Footprint Calculator RSE Project 2022

## Project Summary

## Key details

- Lead RSE: Alex Coleman
- Lead Contact: Emily Ennis
- Start date: 24th October 2022
- Number of RSE days: 8.5

## Key links

- Project GitHub repository: https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator (private repo 🔒)
- Project documentation: None

## Narrative overview

This project involved debugging, fixing and improving an existing Shiny application deployed as an Azure App Service App. The tool, developed as part of a LIDA Data Scientist Enrichment project, is a large (~2700 lines) Shiny app that uses fuzzy string matching to provide nutritional and environmental impact values for a supplied input file. This core functionality is supplemented with a file conversion tool for Saffron files (a data standard used by the major project partner Leeds City Council).

Initial work focused on identifying why a number of input files caused the application to crash for users. The crash occurred with a very small data file (i.e. only a single row) and was caused by functions that attempted to identify missing values for nutritional or environmental data after a first pass at named ingredients in the data file. For a small data file whose rows were all matched in the first pass, the function identifying missing values returned an empty dataframe, and the series of functions then applied to that empty dataframe raised the error. To resolve this, a data validation step was added that checks whether the dataframe is empty and, if so, uses the original data instead.

Two issues were addressed around the Saffron file conversion function. This is a feature of the application that converts a Saffron file into the format required by the calculator.
In the first instance, this conversion step did not run within the deployed version of the application. This led to fatal application errors that prevented users from performing file conversions. The root cause was a missing dependency for a package used in the Saffron file conversion step. Adding this package to the repository's Dockerfile resolved the problem on subsequent builds of the application. The second issue was users uploading their Saffron file directly to the calculator, which led to an error. An additional data validation step was therefore developed that checks for the columns needed to calculate nutritional and environmental data. If these columns are not present, the app stops execution and shows the user a popup that explains the problem and encourages them to check whether they have converted their Saffron file.

To assist with debugging and support in the future, a number of additional changes were made within the project and project repository. First, logging was implemented throughout the app to provide informative messages at runtime. This involved incorporating and modifying an existing R package, `rlog`, into the Shiny app. This was done predominantly because Shiny Server obfuscates any output on the `stdout` channel, which is used by a number of logging levels in the `rlog` package. The `rlog` code was therefore incorporated into the project code as a module, with changes made to ensure it always writes logging information to `stderr`. This also ensures that logs can be viewed from Azure via the Log Stream, making it easier for developers to monitor runtime logging.
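
The idea of routing every log level to `stderr` can be illustrated with a minimal base R sketch. This is not the project's modified `rlog` module: the helper names `format_log` and `log_message`, the level names and the message format are all assumptions made for illustration.

```r
# Hypothetical helpers, NOT the project's modified `rlog` code.
# Every level is written to stderr so that Shiny Server does not hide it.
log_levels <- c("DEBUG", "INFO", "WARN", "ERROR")

# Build a single log line; errors if `level` is not a known level.
format_log <- function(level, msg, time = Sys.time()) {
  level <- match.arg(level, log_levels)
  sprintf("%s [%s] %s", format(time, "%Y-%m-%d %H:%M:%S"), level, msg)
}

# Write the line to stderr when `level` is at or above `threshold`.
log_message <- function(level, msg, threshold = "INFO") {
  level <- match.arg(level, log_levels)
  threshold <- match.arg(threshold, log_levels)
  line <- format_log(level, msg)
  if (match(level, log_levels) >= match(threshold, log_levels)) {
    cat(line, "\n", sep = "", file = stderr())
  }
  invisible(line)
}

log_message("INFO", "File upload received")
log_message("ERROR", "Required columns missing from input file")
```

Because `cat()` is pointed at `stderr()` for all levels, nothing depends on `stdout` being visible in the deployed environment.
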
Second, debugging Shiny often requires the RStudio IDE, so a devcontainer was added to the project that uses the existing Dockerfile specification and installs an instance of RStudio Server, allowing developers to debug the application within RStudio without having to worry about configuration. Third, the project had no clear way to reproduce the project setup outside of a Docker container, so steps for recreating the app environment on a local machine were included. Finally, the project lacked documentation, so documentation (in the roxygen2 style) has been added for a number of functions, the app data flow has been mapped out via [Miro](https://miro.com/app/board/uXjVPHMSZ-A=/), and basic repository documentation such as a README.md has been added.

Some users reported the Calculator disconnecting almost immediately on reaching the website. This appeared to affect only a subset of users in Leeds City Council who were all using legacy web browsers (a requirement). Steps were therefore taken to adjust the Shiny Server configuration to disable more modern web communication protocols between the browsers and the server. This appeared to resolve the issue and allow users on legacy browsers to connect to the app.

Overall, a number of steps have been taken to improve the stability and user experience of the application. During this process a number of other issues were identified that should form the basis of further improvements to the app. Other aspects of this work package aimed to improve documentation and tooling around the project, to improve the future developer/contributor experience.
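
The two data validation steps described in the narrative above (falling back to the original data when an intermediate dataframe is empty, and checking an upload for required columns) can be sketched in base R as follows. The function names and the column names `ingredient` and `weight` are hypothetical, since this write-up does not list the columns the calculator actually requires.

```r
# Hypothetical column names; the real required columns are not given here.
required_columns <- c("ingredient", "weight")

# Return the names of any required columns absent from an uploaded data frame.
missing_columns <- function(df, required = required_columns) {
  setdiff(required, names(df))
}

# Use the original data when the subset to process turns out to be empty,
# so that downstream functions never receive a zero-row data frame.
data_or_fallback <- function(subset_df, original_df) {
  if (is.null(subset_df) || nrow(subset_df) == 0) original_df else subset_df
}

upload <- data.frame(ingredient = "carrot", weight = 100)
unmatched <- upload[upload$weight > 1000, , drop = FALSE]  # zero rows

missing_columns(upload)              # character(0): upload is usable
nrow(data_or_fallback(unmatched, upload))
```

Inside a Shiny server function, a non-empty result from `missing_columns()` could be reported with `shiny::showModal(shiny::modalDialog(...))` before halting the calculation, which is one plausible way to produce the popup described above.
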
## Changes made

Brief overview of changes made and relevant PRs:

- Adding validation step within `FSA_matches` function [#4](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/pull/4)
- Adding `snakecase` dependency to resolve Saffron file conversion crashes [#11](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/pull/11)
- Adding documentation and logging throughout the project [#16](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/pull/16)
- Configuring shiny-server.conf to disable top 5 web communication protocols [#19](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/pull/19)
- Adding input data validation step and user message [#20](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/pull/20)
- Adding steps to remove empty rows along with user validation of this [#22](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/pull/22)

## Next steps

- **Refactoring the code base**

  The code base is very large for a Shiny app, which makes it difficult to read and complex to debug. A crucial piece of work to ensure the sustainability of this project is to refactor the code base into smaller, more modular chunks. Shiny supports this through [Shiny modules](https://shiny.rstudio.com/articles/modules.html), although this will take substantial time and effort given the existing code base and its high level of coupling between functions. Alongside refactoring, tests should be written for the functions in use to check correctness and provide transparency about expected outcomes. Below is a brief outline of steps to approach this, in increasing complexity:

  1. Initially separate the UI code and server code into two separate files
  2. Within the server code, remove all non-Shiny functions and add them to script files within the R directory, grouped by functionality (you do not need to explicitly source them as Shiny does this by default)
  3. For these initial non-Shiny functions, write tests to accompany them using [`testthat`](https://testthat.r-lib.org/)
  4. Remove all commented-out code
  5. Group together Shiny server code into related sections
  6. Implement Shiny modules for the Shiny server code, with associated tests

- **Reviewing the fuzzy matching implementation**

  At present the implementation of fuzzy matching by the `best_match` function does not work as desired and in some cases leads to [erroneous results](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/issues/12) that are not surfaced to the user explicitly. These issues raise concerns about the correctness of results generated by the app, which should be surfaced to the user.

- **App data validation**

  The app uses a number of data sources to match ingredients to datasets with nutritional and environmental data. In some data files there are [incidences of duplication, spelling errors and missing data](https://github.com/Leeds-CDRC/Carbon-Footprint-Calculator/issues/14). This raises concerns about the correctness of results generated by the application and also introduces the potential for errors when matching operations occur. Steps should be taken to check the input data to ensure its veracity, and documentation should be added to clarify the data provenance of the results generated by the app.

**Overview**

Overall, I would strongly encourage working through the above steps before engaging in any further work that extends the functionality of the app. At present the calculator is a large, complex monolith of code, which makes the flow of the app highly coupled and difficult to debug. Resolving this initial rigidity and complexity is critical to ensure the app can work sustainably and deliver its intended use case. Furthermore, making the suggested changes will:

1. ensure the app is more flexible for future changes
2. make the app more robust for handling bugs that arise
3. reduce complexity and make it easier for others to contribute
4. provide users with confidence in the validity of results.

Whilst it is often more interesting and more desirable to add features to academic research apps, doing so here would add further complexity and coupling, making the application harder to maintain in the long run.
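
As an illustration of steps 2 and 3 of the refactoring outline above, the sketch below shows a small non-Shiny function of the kind that could live in a script under the R directory, together with `testthat` tests for it. The function `drop_empty_rows` and its behaviour are hypothetical and are not taken from the project code; it is only meant to show the shape of a testable, Shiny-free helper.

```r
library(testthat)

# Hypothetical helper (e.g. R/cleaning.R): drop rows where every column
# is NA or an empty/whitespace-only string.
drop_empty_rows <- function(df) {
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  all_blank <- Reduce(`&`, lapply(df, is_blank))
  df[!all_blank, , drop = FALSE]
}

# Hypothetical tests (e.g. tests/testthat/test-cleaning.R).
test_that("rows that are entirely blank are removed", {
  df <- data.frame(
    ingredient = c("carrot", "", NA),
    weight = c(100, NA, NA),
    stringsAsFactors = FALSE
  )
  result <- drop_empty_rows(df)
  expect_equal(nrow(result), 1)
  expect_equal(result$ingredient, "carrot")
})

test_that("partially filled rows are kept", {
  df <- data.frame(
    ingredient = c("carrot", ""),
    weight = c(NA, 50),
    stringsAsFactors = FALSE
  )
  expect_equal(nrow(drop_empty_rows(df)), 2)
})
```

Keeping helpers like this free of Shiny reactivity means they can be tested without launching the app, which is the main reason for extracting them before introducing modules.
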