# Areas where DS needs DE help the most

## Table of contents

* [Data quality](#data-quality)
  * [Daily DQ job](#daily-dq-job)
  * [Utility bill DQ](#utility-bill-dq)
* [Projects](#projects)
* [Trust in data science products](#trust-in-data-science-products)

## Data Quality

### Daily DQ job

#### Materials

* [Confluence: Pipeline](https://aquicore.atlassian.net/wiki/spaces/APD/pages/843841550/Data+Quality+-+anomaly+detector)
* [Confluence: Dashboard](https://aquicore.atlassian.net/wiki/spaces/APD/pages/892829837/Data+Quality+-+dashboard)
* [Redash dashboard](https://app.redash.io/aquicore/dashboard/dq---overview---prod?p_Data%20Source=2&p_Detectors=1&p_Timeframe=30)

#### Summary of current approach

* A [Databricks coordinator job](https://dbc-e4ce7ba6-2672.cloud.databricks.com/?o=2093562134185003#job/331193) calls an [Initial Processing notebook](https://dbc-e4ce7ba6-2672.cloud.databricks.com/?o=2093562134185003#notebook/292562793750149/command/292562793750150)
* The notebook runs the custom [DataQualityService](https://github.com/aquicore/aq-data-quality/blob/integration/src/aqdq/anomaly_pipeline/services/data_quality_service.py) class on the last 24 hours of 15-minute data for all eligible electricity and water meters
  * "Eligible" = the account is active, the building is not archived, and the meter doesn't have DQ set to "off" in D&E
* Both raw (S3 packets) and platform data are processed. Raw data is processed by creating an Athena partition over the last two days of S3 data; platform data is pulled from Influx
* The data is run through various detectors, and initial results are saved as one JSON file per building to the [intermediate results](https://s3.console.aws.amazon.com/s3/buckets/aq-dataquality?region=us-east-1&prefix=dq-results/intermediate/&showversions=false) directory in S3
* The job then calls a [second Databricks notebook](https://dbc-e4ce7ba6-2672.cloud.databricks.com/?o=2093562134185003#notebook/1262131105237568/command/1262131105237569) that performs secondary processing (a simplified sketch follows this list)
  * e.g. flatlines below a certain duration are converted to "not an issue" values
  * Each building's JSON is pulled from the intermediate S3 directory, processed, and then sent to the [final results](https://s3.console.aws.amazon.com/s3/buckets/aq-dataquality?prefix=dq-results/final/&showversions=false) S3 directory
* AWS Glue crawls the final results directory every day and creates a schema for AWS Athena
* Redash queries Athena to create summary tables for users
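To make the intermediate-to-final hop concrete, here is a minimal sketch of that secondary-processing step. The JSON shape (an `issues` list with `type` and `duration_hours` fields), the `MIN_FLATLINE_HOURS` threshold, and the helper name are all illustrative; the real logic lives in the `aq-data-quality` repo.

```python
import json

import boto3

BUCKET = "aq-dataquality"                       # bucket from the links above
INTERMEDIATE_PREFIX = "dq-results/intermediate/"
FINAL_PREFIX = "dq-results/final/"
MIN_FLATLINE_HOURS = 4                          # illustrative threshold

s3 = boto3.client("s3")


def downgrade_short_flatlines(building_results: dict) -> dict:
    """Convert flatlines shorter than the threshold to 'not an issue'."""
    for issue in building_results.get("issues", []):
        if issue["type"] == "flatline" and issue["duration_hours"] < MIN_FLATLINE_HOURS:
            issue["type"] = "not_an_issue"
    return building_results


# Pull each building's intermediate JSON, post-process it, and write it
# to the final results prefix that Glue crawls daily.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=INTERMEDIATE_PREFIX):
    for obj in page.get("Contents", []):
        raw = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        processed = downgrade_short_flatlines(json.loads(raw))
        final_key = obj["Key"].replace(INTERMEDIATE_PREFIX, FINAL_PREFIX, 1)
        s3.put_object(Bucket=BUCKET, Key=final_key, Body=json.dumps(processed))
```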
#### Limitations of current approach

* Redash
  * It is separate from the platform, and users typically don't like needing to go to multiple places
  * It is not interactive: users cannot mark issues as complete or otherwise communicate back to the platform
  * Clicking around too much easily causes Redash to freeze up or fail to display plots, reducing trust in it
* The dashboard's focus is "top-down" rather than horizontal
  * Jason only uses the Offline Devices dashboard because it's effectively a to-do list
  * The current Redash dashboard is more of an executive summary
    * It's good for providing a percent DQ score, or identifying the number of positive spikes vs. missing data
    * But it's hard to take action, since the data is inherently grouped *by account*
  * We tried getting EMs to use the dashboard for their accounts, but there was no adoption
    * Potentially due to the lack of interactivity, reliability issues with updating, and the dashboard freezing
* The dashboard is a security apocalypse waiting to happen
  * Anyone using the dashboard can modify a query, causing a table to break for all users
  * Nothing is version controlled
    * There's also no way to implement version control without copying and pasting code into separate files, then backing up *those* files
  * An internal user sharing their screen can *very* easily reveal sensitive information about other accounts by accident

#### Where DE support would help

* Store DQ results in Redshift tables
  * This would make it far easier for DS to query DQ results
    * Formatting JSON files for Athena makes them essentially unreadable for us, and querying Athena is slow and costly
  * It would also make it far easier to modify the DQ job
    * Adding/removing a column, or changing the dtype of an existing column, can cause the Glue schema or Athena table creation to fail with no indication of which file caused the issue
* Improve the DQ dashboard UX by making the dashboard's data pull reliable
* Switch from Redash to a different BI tool
  * Have infrastructure as code, allowing version control of any changes to dashboard tables
  * If the BI tool is interactive (e.g. users can update Redshift tables with actions marking certain DQ issues as safe to ignore), engagement will likely be higher
  * Create visualizations with much more precision and customization than Redash allows

### Utility bill DQ

#### Current approach

* Databricks jobs (one per tab in the final spreadsheet) query utility bill data from the Monolith via Postgres
* The jobs save CSVs to a public S3 bucket
* The CSVs are ingested into separate tabs in a Google Sheet

#### Limitations of current approach

* Google Sheets is separate from the platform
* Security concern: the S3 bucket must be public for Google Sheets to import the data
  * Anyone can access the data if they guess the URL

#### Where DE support would help

* Replace Google Sheets with a BI tool that requires an Aquicore login
  * This addresses the security concern of non-AQ users accessing our customers' data
* Reference one Redshift table with Utility Bill DQ results
  * We currently imitate this by having a _Google Sheets template_ that serves as a "source of truth"; users can download it to get the latest report
  * But we really shouldn't be dealing with copies of data
* Make it easier to access utility bills from the sheet
  * We currently use URLs that take the user to the relevant webpage in the platform. Maybe this is a broader software engineering epic, but we should also think about the Utility Bill data model we want (which I think would fall more under data engineering)

## Projects

#### Current approach

* For morning overshoot peak demand projects, we use Databricks jobs that scan a month of data and identify all morning overshoots for that month
* We then create a row in the AFDD database if certain criteria are met: at least 3 morning overshoots, or at least one overshoot in the top 15% of daily peaks for the month (see the sketch after this list)
* A separate Databricks job then pulls morning overshoot summaries from the AFDD DB and writes them to the Projects DB
* Morning overshoot projects for the same building are either:
  * Overwritten with fresh data if the building already has a project but it is still in `DRAFT` (i.e. a BSE hasn't looked at it yet)
  * Ignored if the project *has* already been looked at by a BSE and is in a different status (e.g. `COMPLETED`, `IN PROGRESS`, `WON'T DO`, etc.)
* These morning overshoots appear in the Projects Center
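A rough sketch of that qualification check, assuming a hypothetical `peak_kw` column for each detected overshoot and reading "top 15% of daily peaks" as at or above the 85th percentile of the month's daily peaks:

```python
import numpy as np
import pandas as pd


def qualifies_for_afdd_row(overshoots: pd.DataFrame, daily_peaks: pd.Series) -> bool:
    """Decide whether a building's month of data warrants an AFDD row.

    overshoots: one row per detected morning overshoot, with a (hypothetical)
    'peak_kw' column. daily_peaks: the peak demand for each day of the month.
    """
    # Criterion 1: at least 3 morning overshoots in the month
    if len(overshoots) >= 3:
        return True
    # Criterion 2: at least one overshoot in the top 15% of daily peaks,
    # read here as at or above the 85th percentile
    threshold = np.percentile(daily_peaks, 85)
    return bool((overshoots["peak_kw"] >= threshold).any())
```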
#### Where DE support would help

* Maybe this is less DE than business in general, but there is low adoption of these projects among BSEs
* This may be more a reflection of the next section on trust in DS products

## Trust in data science products

#### Current approach

* Confluence pages, meetings with teams, answering support requests

#### Ways DE support would help

* I think it'd be good to have an area in the platform where internal users (or maybe even customers too) could play with each of our models
  * You could specify the inputs to the model and see what output you get, plus some intuitive visualizations
* I know how to do the above in Flask + JavaScript. It seems like the rest of Product (and Nate in particular) want one consistent set of tools/code, though, so anything in the platform should stay in Java (I guess?)
* I think it'd be worth pushing a bit to see if we could create a sandbox area, though, at least starting in Flask and rewriting in Java if needed (a minimal sketch is below)
* Given this is a data science product, I want to dig my heels in a bit: this should stay in Python and be easy for current and future data scientists at AQ to edit, rather than requiring a Java developer's help
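The sandbox could start as small as a Flask endpoint that accepts model inputs as JSON and returns the model's output for a JS front end to visualize. `score_demand_series` below is a stand-in for a real model entry point, and the route and payload shape are purely illustrative:

```python
from statistics import mean

from flask import Flask, jsonify, request

app = Flask(__name__)


def score_demand_series(demand_kw: list[float]) -> dict:
    """Stand-in for a real model entry point; returns toy summary stats."""
    return {"peak_kw": max(demand_kw), "mean_kw": mean(demand_kw)}


@app.route("/sandbox/demand", methods=["POST"])
def demand_sandbox():
    # Users POST model inputs as JSON and get the model's output back;
    # a JS front end could render intuitive visualizations on top of this.
    payload = request.get_json(force=True)
    return jsonify(score_demand_series(payload["demand_kw"]))


if __name__ == "__main__":
    app.run(debug=True)
```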