# General baseline model requirements
This page describes what's needed to make the general baseline model available as an API.
## Table of contents
* [/get_data](#get_data-endpoint)
* [/train](#train-endpoint)
* [/predict](#predict-endpoint)
## Resources
* [Sample Docker image for Flask app](https://github.com/docker/awesome-compose/tree/master/flask)
---
## `/get_data` endpoint
### Description
Query Influx for kW and temperature data for a specified building and date range, then add `hour`, `is_weekend`, and `is_holiday` fields.
### API contract
* URL format: `/get_data/building_id={}&start_date={}&end_date={}`
* Inputs:
* Building ID
* Start date (including hour and timezone)
* End date (including hour and timezone)
* Returns:
* JSON of hourly datetime, kW, temperature, `hour`, `is_weekend`, and `is_holiday` data.
* HTTP method: `GET`
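
A minimal sketch of how this contract could map onto a Flask route, shown here with standard `?key=value` query parameters. The parameter parsing and the `get_building_data` helper are illustrative assumptions, not existing code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def get_building_data(building_id, start_date, end_date):
    """Hypothetical helper implementing the pseudocode below
    (Influx queries plus hour/is_weekend/is_holiday columns)."""
    raise NotImplementedError

@app.route("/get_data", methods=["GET"])
def get_data():
    building_id = int(request.args["building_id"])
    start_date = request.args["start_date"]  # e.g. "2023-01-01T00:00:00-05:00"
    end_date = request.args["end_date"]

    df = get_building_data(building_id, start_date, end_date)
    # One record per hour, matching the fields listed above
    return jsonify(df.to_dict(orient="records"))
```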
### Required non-standard library Python packages
#### Essential
* `dateutil`
* `pandas`
* `pytz`
#### Nice-to-have AQ libraries
* `aqds-utils`
* `TimeConverter` for easier datetime formatting
* `aqds`
* `Building`, `DataService`, `RawDataParamsBuilder`, `Queryable`, `Interval`, `Measure`, `Unit`, `TemperatureParamsBuilder` for easier Influx querying
### Pseudocode
1. Query Influx for hourly kW data for the specified building and time range
* Convert timestamps to timezone-specific datetime
* Remove unnecessary columns
2. Query Influx for hourly temperature data for the specified building and time range
* Convert timestamps to timezone-specific datetime
* Convert Celsius to Fahrenheit
* Remove unnecessary columns
3. Merge dataframes on datetime
4. Add `hour` and `is_weekend` columns by checking datetime column
* NOTE: for the weekend flag, maybe reference Redshift instead
5. Add `is_holiday` column using a function from `pandas.tseries.holiday`
* NOTE: for the holiday flag, maybe reference Redshift instead
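
A sketch of steps 3-5 in pandas. The column names and the choice of `USFederalHolidayCalendar` (one of the calendars in `pandas.tseries.holiday`) are assumptions:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def add_calendar_features(kw_df: pd.DataFrame, temp_df: pd.DataFrame) -> pd.DataFrame:
    # 3. Merge the kW and temperature frames on their shared datetime column
    df = kw_df.merge(temp_df, on="datetime", how="inner")

    # 4. Hour of day and weekend flag straight from the datetime column
    df["hour"] = df["datetime"].dt.hour
    df["is_weekend"] = df["datetime"].dt.dayofweek >= 5  # Saturday=5, Sunday=6

    # 5. Holiday flag; the calendar is timezone-naive, so compare on naive dates
    #    (assumes the datetime column is timezone-aware, per the pseudocode above)
    local_dates = df["datetime"].dt.tz_localize(None).dt.normalize()
    cal = USFederalHolidayCalendar()
    holidays = cal.holidays(start=local_dates.min(), end=local_dates.max())
    df["is_holiday"] = local_dates.isin(holidays)
    return df
```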
### Error handling
We assume building ID, start date, and end date are correct when our endpoint is triggered. This means:
* Building ID is an integer
* Start date is earlier than end date
* Dates can be parsed into `dt.datetime` objects
We handle the following errors on the Python side:
* The Influx query for kW data:
* Fails
* All kW values are 0
* All kW values are null
* The Influx query for temperature data:
* Fails
* All values are null
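
One way to express these checks in code. The column names and the `ValueError` convention are assumptions, and a failed query is represented here as an empty result; in practice it might instead surface as an exception from the Influx client:

```python
import pandas as pd

def validate_get_data_results(kw_df: pd.DataFrame, temp_df: pd.DataFrame) -> None:
    """Raise ValueError when either Influx result is unusable."""
    if kw_df.empty:
        raise ValueError("Influx query for kW data returned no rows")
    if kw_df["kW"].isna().all():
        raise ValueError("All kW values are null")
    if (kw_df["kW"].fillna(0) == 0).all():
        raise ValueError("All kW values are 0")

    if temp_df.empty:
        raise ValueError("Influx query for temperature data returned no rows")
    if temp_df["degF"].isna().all():
        raise ValueError("All temperature values are null")
```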
---
## `/train` endpoint
### Description
Train a model on a provided dataframe and save it to S3 as a `.pkl` file.
### API contract
* URL format: `/train/building_id={}`
* Inputs:
* JSON with fields for:
* `kW`
* `hour`
* `degF` (temperature)
* `is_weekend`
* `is_holiday`
* Returns:
* Nothing
* HTTP method: `POST`
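
A sketch of the corresponding Flask route. The path converter mirrors the URL format above; the records-style JSON payload is an assumption, and `train_and_save` is a hypothetical helper (sketched under Pseudocode below):

```python
import pandas as pd
from flask import Flask, request

app = Flask(__name__)

def train_and_save(df: pd.DataFrame, building_id: int) -> None:
    """Hypothetical helper; see the Pseudocode sketch below."""
    raise NotImplementedError

@app.route("/train/building_id=<int:building_id>", methods=["POST"])
def train(building_id: int):
    # Assumes the POSTed JSON is a list of hourly records with the
    # kW, hour, degF, is_weekend, and is_holiday fields listed above
    df = pd.DataFrame(request.get_json())
    train_and_save(df, building_id)
    return "", 204  # the contract returns nothing
```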
### Required non-standard library Python packages
#### Essential
* `boto3`
* `pandas`
* `sklearn`
#### Nice-to-have AQ libraries
* `aqds-utils`
* `S3Bucket` class for easier read/write to S3
* `TimeConverter` for easier datetime formatting
### Pseudocode
1. Split dataframe into features and target
2. Train a random forest regressor on features and target
3. Save random forest regressor to S3 as a `{building_id}.pkl` file
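
A sketch of the three steps, assuming the POSTed JSON has already been read into a dataframe. The bucket name, feature columns, and hyperparameters are placeholders:

```python
import pickle

import boto3
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["hour", "degF", "is_weekend", "is_holiday"]
TARGET = "kW"

def train_and_save(df: pd.DataFrame, building_id: int, bucket: str = "baseline-models") -> None:
    # 1. Split the dataframe into features and target
    X, y = df[FEATURES], df[TARGET]

    # 2. Train a random forest regressor
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)

    # 3. Pickle the fitted model and upload it to S3 as {building_id}.pkl
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=f"{building_id}.pkl", Body=pickle.dumps(model))
```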
### Error handling
If the provided dataframe is missing necessary columns, we return an error.
We do not check the quality of the data being sent to the model, though that could be added.
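
The missing-column check could be as simple as the following; `REQUIRED_COLUMNS` mirrors the fields in the API contract above:

```python
REQUIRED_COLUMNS = {"kW", "hour", "degF", "is_weekend", "is_holiday"}

def check_training_columns(df) -> None:
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Training data is missing columns: {sorted(missing)}")
```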
---
## `/predict` endpoint
### Description
Use an already-trained model to generate predicted kW for a provided building and date range.
### API contract
* URL format: `/predict/building_id={}&start_date={}&end_date={}`
* Inputs:
* Building ID
* Start date (including hour and timezone)
* End date (including hour and timezone)
* Returns:
* Predicted kW as JSON with fields for:
* `datetime`
* `kW_actual`
* `kW_mean_pred`
* `kW_upper_pred`
* `kW_lower_pred`
* HTTP method: `GET`
### Required non-standard library Python packages
#### Essential
* `boto3`
* `dateutil`
* `pandas`
* `pytz`
#### Nice-to-have AQ libraries
* `aqds-utils`
* `TimeConverter` for easier datetime formatting
* `aqds`
* `Building`, `DataService`, `RawDataParamsBuilder`, `Queryable`, `Interval`, `Measure`, `Unit`, `TemperatureParamsBuilder` for easier Influx querying
### Pseudocode
1. Call `/get_data` endpoint with provided building ID, start date, and end date
2. Load random forest regressor from S3
3. Use regressor to generate predictions on data from `/get_data` endpoint
4. Generate a time series of the standard deviation of the forest's per-tree predictions for each hour
5. Return JSON of datetime, actual kW, mean predicted kW, prediction upper bound, and prediction lower bound
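
A sketch of steps 2-5, assuming the model was pickled to S3 by `/train` as shown earlier and that `df` is the output of `/get_data`. The bucket name and the one-standard-deviation band are assumptions:

```python
import pickle

import boto3
import numpy as np
import pandas as pd

FEATURES = ["hour", "degF", "is_weekend", "is_holiday"]

def predict_kw(df: pd.DataFrame, building_id: int, bucket: str = "baseline-models") -> pd.DataFrame:
    # 2. Load the pickled random forest from S3
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=f"{building_id}.pkl")
    model = pickle.loads(obj["Body"].read())

    # 3. Mean prediction across the whole forest
    X = df[FEATURES]
    mean_pred = model.predict(X)

    # 4. Hourly spread: predict with each tree and take the standard deviation
    per_tree = np.stack([tree.predict(X.to_numpy()) for tree in model.estimators_])
    std = per_tree.std(axis=0)

    # 5. Assemble the response fields
    return pd.DataFrame({
        "datetime": df["datetime"],
        "kW_actual": df["kW"],
        "kW_mean_pred": mean_pred,
        "kW_upper_pred": mean_pred + std,
        "kW_lower_pred": mean_pred - std,
    })
```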
### Error handling
* Error in dates
* Dates aren't in valid format (i.e. can't be parsed into `dt.datetime` objects)
* Start date > end date
* S3 pull fails
* No model for building ID in S3
* Service is down
* The Influx query for kW data:
* Fails
* All kW values are 0
* All kW values are null
* The Influx query for temperature data:
* Fails
* All values are null
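
A sketch of how the S3 failure modes could be surfaced from the Flask handler; the bucket name, status codes, and error payloads are placeholders:

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError
from flask import jsonify

def load_model_bytes(building_id: int, bucket: str = "baseline-models"):
    """Return (model_bytes, None) on success or (None, error_response) on failure."""
    s3 = boto3.client("s3")
    try:
        obj = s3.get_object(Bucket=bucket, Key=f"{building_id}.pkl")
        return obj["Body"].read(), None
    except s3.exceptions.NoSuchKey:
        # No trained model for this building ID
        return None, (jsonify({"error": f"no trained model for building {building_id}"}), 404)
    except (ClientError, BotoCoreError):
        # S3 unreachable or request rejected
        return None, (jsonify({"error": "could not reach S3"}), 503)
```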