# Areas where DS needs DE help the most
## Table of contents
* [Data quality](#data-quality)
* [Daily DQ job](#daily-dq-job)
* [Utility bill DQ](#utility-bill-dq)
* [Projects](#projects)
* [Trust in data science products](#trust-in-data-science-products)
## Data Quality
### Daily DQ job
#### Materials
* [Confluence: Pipeline](https://aquicore.atlassian.net/wiki/spaces/APD/pages/843841550/Data+Quality+-+anomaly+detector)
* [Confluence: Dashboard](https://aquicore.atlassian.net/wiki/spaces/APD/pages/892829837/Data+Quality+-+dashboard)
* [Redash dashboard](https://app.redash.io/aquicore/dashboard/dq---overview---prod?p_Data%20Source=2&p_Detectors=1&p_Timeframe=30)
#### Summary of current approach
* [Databricks coordinator job](https://dbc-e4ce7ba6-2672.cloud.databricks.com/?o=2093562134185003#job/331193) calls an [Initial Processing notebook](https://dbc-e4ce7ba6-2672.cloud.databricks.com/?o=2093562134185003#notebook/292562793750149/command/292562793750150)
* The notebook runs the custom [DataQualityService](https://github.com/aquicore/aq-data-quality/blob/integration/src/aqdq/anomaly_pipeline/services/data_quality_service.py) class on the last 24h of 15-min data for all eligible electricity and water meters
* "Eligible" = account is active, building not archived, meter doesn't have DQ set to "off" in D&E
* Both raw (S3 packets) and platform data are processed: raw data by creating an Athena partition over the last two days of S3 data, and platform data by querying Influx
* The data is run through various detectors, and initial results are saved as one JSON per building to the [intermediate results](https://s3.console.aws.amazon.com/s3/buckets/aq-dataquality?region=us-east-1&prefix=dq-results/intermediate/&showversions=false) directory in S3
* The job then calls a [second Databricks notebook](https://dbc-e4ce7ba6-2672.cloud.databricks.com/?o=2093562134185003#notebook/1262131105237568/command/1262131105237569) that performs secondary processing
* e.g. flatlines shorter than a certain duration are downgraded to "not an issue" values
* Each building's JSON is pulled from the intermediate S3 directory, processed, and then written to the [final results](https://s3.console.aws.amazon.com/s3/buckets/aq-dataquality?prefix=dq-results/final/&showversions=false) S3 directory (this two-step flow is sketched after this list)
* An AWS Glue crawler traverses the final results directory every day and creates a schema for AWS Athena
* Redash queries Athena to create summary tables for users
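To make the two notebooks' flow concrete, here's a minimal sketch in Python. The `run_detectors` stub, the JSON shape, and the flatline-cutoff handling are assumptions for illustration; only the bucket name and key prefixes come from the links above.

```python
import json
from datetime import timedelta

import boto3

s3 = boto3.client("s3")
BUCKET = "aq-dataquality"  # from the S3 links above; the JSON shape below is assumed


def run_detectors(meters: list) -> dict:
    # Hypothetical stand-in for DataQualityService: each detector emits issue
    # records like {"type": "flatline", "duration_minutes": 45, "status": "open"}
    return {"issues": []}


def initial_processing(building_id: str, meters: list) -> None:
    """First notebook: run detectors over the last 24h of 15-min data."""
    results = run_detectors(meters)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"dq-results/intermediate/{building_id}.json",
        Body=json.dumps(results),
    )


def secondary_processing(building_id: str, min_flatline: timedelta) -> None:
    """Second notebook: downgrade issues that don't meet severity thresholds."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"dq-results/intermediate/{building_id}.json")
    results = json.loads(obj["Body"].read())
    minutes = min_flatline.total_seconds() / 60
    for issue in results["issues"]:
        # e.g. flatlines shorter than the cutoff become "not an issue"
        if issue["type"] == "flatline" and issue["duration_minutes"] < minutes:
            issue["status"] = "not_an_issue"
    s3.put_object(
        Bucket=BUCKET,
        Key=f"dq-results/final/{building_id}.json",
        Body=json.dumps(results),
    )
```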
#### Limitations of current approach
* Redash
* Separate from the platform. Users typically don't like needing to go to multiple places
* Not interactive - users cannot mark issues as complete or otherwise communicate back to the platform
* Clicking around too much easily causes Redash to freeze up or fail to display plots, reducing trust in it
* The dashboard's focus is "top-down" (executive summary) rather than horizontal (a working to-do list for the people fixing issues)
* Jason only uses the Offline Devices dashboard because it's effectively a to-do list
* The current Redash dashboard is more of an executive summary.
* It's good for providing a percent DQ score, or identifying the number of positive spikes vs. missing data
* But it's hard to take action since the data is inherently grouped *by account*
* We tried getting EMs to use the dashboard for their accounts, but there was no adoption
* Potentially due to the lack of interactivity, unreliable updates, and the dashboard freezing
* Dashboard is a security apocalypse waiting to happen
* Anyone using the dashboard can modify a query, causing a table to break for all users
* Nothing is version controlled
* There's also no way to implement version control without copy-pasting code into separate files, then backing up *those* files
* An internal user sharing their screen can *very* easily reveal sensitive information about other accounts by accident
#### Where DE support would help
* Store DQ results in Redshift tables (a hypothetical table shape is sketched after this list)
* Make it infinitely easier for DS to query DQ results
* Formatting JSON files for Athena makes them essentially unreadable for us, and querying Athena is slow and costly
* Make it infinitely easier to modify the DQ job
* Adding/removing a column, or changing the dtype of an existing column, can cause the Glue schema or Athena table creation to fail with no indication of which file caused the issue
* Improve DQ dashboard UX by making the dashboard's data pull reliable
* Switch from Redash to a different BI tool
* Have infrastructure as code, allowing for version control of any changes to dashboard tables
* If the BI tool is interactive (e.g. users can update Redshift tables with actions telling the pipeline to ignore certain DQ issues), engagement will likely be higher
* Create visualizations with much more precision and customization than Redash
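To make the Redshift ask concrete, here's one hypothetical shape for the results table. Every name and type here is a guess meant to show intent, not a spec:

```python
# Hypothetical DDL for a Redshift-backed DQ results table, one row per
# (meter, detector, day). Plain SQL would replace Glue crawls over nested
# JSON, and an interactive BI tool could update issue_status directly.
DQ_RESULTS_DDL = """
CREATE TABLE IF NOT EXISTS dq.daily_results (
    run_date         DATE         NOT NULL,
    account_id       VARCHAR(64)  NOT NULL,
    building_id      VARCHAR(64)  NOT NULL,
    meter_id         VARCHAR(64)  NOT NULL,
    detector         VARCHAR(64)  NOT NULL,  -- e.g. flatline, positive_spike
    issue_status     VARCHAR(32)  NOT NULL,  -- e.g. open / ignored / not_an_issue
    duration_minutes INTEGER,
    details          SUPER                   -- raw detector payload, if we keep it
)
DISTKEY (building_id)
SORTKEY (run_date);
"""
```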
### Utility bill DQ
#### Current approach
* Databricks jobs (one per tab in the final spreadsheet) query utility bill data from the Monolith via Postgres
* Save CSVs to a public S3 bucket
* Ingest the CSVs into separate tabs in a Google Sheet (one such job is sketched after this list)
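Roughly, each per-tab job looks like the sketch below. The connection string, query, bucket, and key are illustrative placeholders, not the real values:

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Illustrative placeholders: the real job has its own connection details and
# one query per spreadsheet tab
engine = create_engine("postgresql://user:password@monolith-db:5432/aquicore")
BILLS_QUERY = "SELECT building_id, period_start, period_end, total_cost FROM utility_bills"

df = pd.read_sql(BILLS_QUERY, engine)

# The bucket must be public so Google Sheets' IMPORTDATA() can fetch the CSV,
# which is exactly the security concern noted below
boto3.client("s3").put_object(
    Bucket="aq-utility-bill-dq",   # hypothetical bucket name
    Key="bills/overview.csv",      # one CSV per tab
    Body=df.to_csv(index=False).encode("utf-8"),
)
```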
#### Limitations of current approach
* Google Sheets is separate from the platform
* Security concerns: S3 bucket must be public for Google Sheets to import data
* Anyone can access the data if they guess the URL
#### Where DE support would help
* Replace Google Sheets with BI tool that requires Aquicore login
* Address the security concern of non-AQ users accessing our customers' data
* Reference one Redshift table with Utility Bill DQ results
* We currently imitate this by having a _Google Sheets template_ that serves as a "source of truth"; users can download a copy to get the latest report
* But we really shouldn't be dealing with copies of data
* Make it easier to access utility bills from the sheet
* We currently use URLs that take the user to the relevant page in the platform. Maybe this is a broader software engineering epic, but we should also think about the Utility Bill data model we want (which I think falls more under data engineering)
## Projects
### Current approach
* For morning overshoot peak demand projects, we use Databricks jobs that scan a month of data and identify all morning overshoots for that month
* We then create a row in the AFDD database if certain criteria are met: at least 3 morning overshoots, or at least one overshoot in the top 15% of daily peaks for the month (sketched after this list)
* We then have a separate Databricks job that pulls in morning overshoot summaries from the AFDD DB and then writes to the Projects DB
* Morning overshoot projects for the same building are either:
* Overwritten with fresh data if the building already has a project that is still in `DRAFT` (i.e. a BSE hasn't looked at it yet)
* Ignored if the project *has* already been looked at by a BSE and is in a different status (e.g. `COMPLETED`, `IN PROGRESS`, `WON'T DO`, etc.)
* These morning overshoots appear in the Projects Center
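Condensed, the project-creation criteria and the status guard look something like this; the record shape and field names are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Overshoot:
    # Hypothetical record shape for one detected morning overshoot
    day: str
    peak_kw: float


def qualifies_for_project(overshoots: List[Overshoot], daily_peaks_kw: List[float]) -> bool:
    """An AFDD row is created if either criterion from the list above holds."""
    if len(overshoots) >= 3:
        return True
    if not overshoots or not daily_peaks_kw:
        return False
    # "Top 15% of daily peaks" read here as at/above the 85th-percentile peak
    cutoff = sorted(daily_peaks_kw)[int(0.85 * len(daily_peaks_kw))]
    return any(o.peak_kw >= cutoff for o in overshoots)


def should_overwrite_project(existing_status: Optional[str]) -> bool:
    """Only projects still in DRAFT (untouched by a BSE) get overwritten."""
    return existing_status is None or existing_status == "DRAFT"
```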
### Where DE support would help
* Maybe this is less a DE problem than a business one, but adoption of these projects among BSEs is low
* This may be more a reflection of the next section on trust in DS products
## Trust in data science products
### Current approach
* Confluence pages, meetings with teams, answering support requests
### Where DE support would help
* I think it'd be good to have an area in the platform where internal users (or maybe even customers too) could play with each of our models
* You could specify the inputs to the model, see what output you get, and view some intuitive visualizations
* I know how to do the above in Flask + JavaScript. It seems like the rest of Product (and Nate in particular) want one consistent set of tools/code, though, so anything in the platform should stay in Java (I guess?)
* I think it'd be worth pushing a bit and seeing if we could create a sandbox area, though, at least starting in Flask and then rewriting in Java if needed (a toy example follows this list)
* Given this is a data science product, I want to dig my heels in a bit that this should stay in Python and be easy to edit for current and future data scientists at AQ, rather than needing a Java developer to help
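For what it's worth, the Flask version of a single sandbox endpoint is tiny. The route, payload shape, and toy overshoot logic below are all hypothetical, just to show the scale of the ask:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/sandbox/morning-overshoot", methods=["POST"])
def morning_overshoot_demo():
    # Hypothetical endpoint: accepts model inputs and returns the model's
    # output so internal users can build intuition by tweaking parameters
    payload = request.get_json()
    demand_kw = payload["demand_kw"]            # list of 15-min demand readings
    threshold = payload.get("threshold", 0.85)  # tunable, to show sensitivity
    peak = max(demand_kw)
    overshoots = [x for x in demand_kw if x >= threshold * peak]
    return jsonify({"peak_kw": peak, "n_overshoot_intervals": len(overshoots)})


if __name__ == "__main__":
    app.run(debug=True)  # sandbox only; not how we'd deploy to the platform
```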