# Alert Data Model Problems

## Problems

We're working around awkward parts of the alert service API while implementing our initial alerting use cases. Users signing up for "Courier Group" emails should get emails for 1) :courier-missing-file / SLA CG missing files, and 2) :flow-failure / CG customer-attributed flow failures.

1. The UI assumes there's only one audience and one subscription for "Courier Group" events.
1. The UI configuration corresponds to both an audience and a subscription in the backend. This makes it complex to trigger the correct backend change when a user interacts with the UI. We have an [upsert-alert-config](https://github.com/amperity/app/blob/6bc9f1349cb576fba85e254c010e31290816c5d3/service/owl/owl-api/src/amperity/owl_api/handler/alert.clj#L191-L212) OWL API method that creates / updates both an audience and a subscription rather than a single backend resource.
1. To enable generic failure alerting, upsert-alert-config would become even more complex because it would modify multiple subscriptions: one for each event type (:courier-missing-file, :flow-failure w/ ::error/attribution = :customer).
1. Clearing the UI configuration does not fully delete backend subscriptions and audiences.
1. REPL migrations are required. Suppose users are subscribed to "Courier Group" alerts. If we want a new event type to trigger alerts for those subscribers, someone has to run a REPL migration to add subscriptions for that event type. This is error-prone and becomes more complicated with multiple subscriptions.

## What data does the UI give us?
Here's a rough sketch of what a user can configure in the UI:

```clojure
{::tenant/id "foo"
 :configured-alerts
 [{:workflow-type :courier-groups
   :emails #{"a@foo.com" "b@foo.com"}}
  {:workflow-type :campaigns
   :emails #{"c@foo.com"}}]}
```

This would correspond to:

* "a@foo.com" and "b@foo.com" receiving alerts for
    * SLA CG missing files
    * Customer-attributed CG workflow failures
* "c@foo.com" receiving alerts for
    * Customer-attributed campaign send task failures

## Why isn't this just the data model we use?

We have upcoming use cases we know we want to build that don't fit that data model. These include:

* Ad-hoc workflow alerts
* Non-workflow alerts
* Non-email channels
* Separated failure, update, success alerts

Additionally, the UI configuration is focused around workflow types. The events we currently submit don't correspond 1:1 with workflow types (e.g. :courier-missing-file, :campaign-send-failure, :flow-failure).

## Principles

1. Events should describe something that happened in the system.
    * ::event/type should describe what occurred and be easily understandable to eng + support
2. Events should be decoupled from who is alerted on those events and how.
    * Workflow service should not have to know how people are subscribing to submitted events. It should just generate events and trust alerts are properly generated downstream.
3. Alert audiences should be decoupled from alert subscriptions.
    * Incoming event filters (subscriptions) don't need to be coupled 1:1 with who gets the events (audiences). Adding multiple audiences in the UI doesn't have to create duplicate subscriptions for each audience.
4. UI configuration should directly map to a deterministic set of resources in the backend.
    * Adding subscriptions + audiences in the backend should not impact what is shown in the UI unless that is explicitly desired.
5. Adding new alerts for a given UI configuration should be easy on the frontend.
6. Adding new alerts for a given UI configuration should be easy on the backend.

## Solution 1) Add ::subscription/type

1. Introduce ::subscription/type. This will replace ::event/type in determining whether an event matches a subscription.

Principles: :x: :heavy_check_mark: :grey_question:

1. :grey_question: -- Events would no longer represent something that happened in the system.
    1. Events would be self-descriptive as we would keep the ::event/type.
    2. There might be multiple alert events for a single system event. For example, having "courier group" AND "all flow failure" alert configurations in the UI would require workflow-service to submit 2 events for a flow failure: one with ::subscription/type :courier-group and another with ::subscription/type :flow-failure.
2. :x: -- Events would be coupled with how users are alerted on those events.
3. :x: -- Alert subscriptions would still be keyed on a single audience.
4. :x: -- The UI would not know how to differentiate between a UI-generated subscription and a backend subscription.
5. :heavy_check_mark: -- Frontend UI would only have to generate one subscription + audience pair per audience channel in the UI configuration.
6. :grey_question: -- REPL migration scripts would be easy since subscriptions are simple. But new alert events would have to be generated in every backend system that we want to alert on for a given use case.

Other Notes:

* We lose fine-grained filtering on events of different ::event/type. If you wanted a UI configuration to subscribe users to customer-attributed task failures, but also all workflow failures, the ::event/filters would conflict for those incoming events.

## Solution 2) Add alert groups

1. Introduce alert groups. Alert groups are collections of subscriptions and audiences that indicate the subscriptions and audiences should be configured together.
    1. Introduce ::group/key. This will allow grouping any audience or subscription together when displaying configuration in the UI.
       The group key will not be exposed on audience or subscription objects.
    2. Introduce group methods
        1. `alert.api/get-group` -- Fetch the audiences and subscriptions with a given ::group/key
        2. `alert.api/upsert-group!` -- Compare the existing audience + subscription configuration in the backend with the desired configuration presented by the frontend. Make changes to audiences + subscriptions so the backend matches the desired UI configuration.
            1. Alternatively, this could be a "changes" call that provides the desired diff. Ideally though, I think the diff should be computed in the backend to keep diffing logic out of the UI.
        3. `alert.api/delete-group!` -- If a UI alert configuration is cleared, this call fully deletes the existing audiences and subscriptions.

Principles: :x: :heavy_check_mark: :grey_question:

1. :heavy_check_mark: -- Events will stay as they are now.
2. :heavy_check_mark: -- Events will not know anything about subscriptions or audiences.
3. :x: -- Alert subscriptions would still be keyed on a single audience.
4. :heavy_check_mark: -- Backend configuration can be fetched directly from the UI when provided a deterministic UI key.
5. :grey_question: -- The UI configuration will need to generate a request shape for the upsert-group call that contains the audiences as well as subscriptions for each event type. However, the complexity should be mitigated by the upsert call.
6. :grey_question: -- Systems submitting events will not need to be updated for new use cases. Adding event types to a given group key should be easy: if that group exists, use the admin REPL to create new subscriptions with the proper subscription shape.

Other Notes:

* Though subscriptions and audiences would still be coupled, we could retrofit this in the future if we wanted to decouple them. To do this, we could
* If we want the flexibility to add multiple groups for a given UI configuration section (e.g.
  separate email lists for different courier group filters), we would need to add proper group ids and change the UI to expect a list call based on the ::group/key

## Solution 3) Add alert groups + decouple audiences + subscriptions

1. See Solution 2.
2. Remove ::audience/id from subscriptions.
3. When generating alerts after an event is submitted, don't use the audience ids from the subscriptions, but rather any audience with a shared ::group/key.

Principles: :x: :heavy_check_mark: :grey_question:

1. :heavy_check_mark: See Solution 2.
2. :heavy_check_mark: See Solution 2.
3. :heavy_check_mark: A new subscription will not be required for each desired event type multiplied by
4. :heavy_check_mark: See Solution 2.
5. :grey_question: See Solution 2.
6. :grey_question: See Solution 2.
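
To make step 3 of Solution 3 more concrete, here is a minimal sketch of how alerts could be routed by group key instead of by ::audience/id. Everything in it is an assumption for illustration: the record shapes, the `matches?` and `audiences-for-event` helpers, and the choice to store the group key directly on each record are hypothetical and are not the alert service's actual API. It uses plain qualified keywords (`:group/key`) rather than aliased ones (`::group/key`) so it can run standalone in a REPL.

```clojure
;; Hypothetical in-memory shapes; the real alert service records may differ.
;; Subscriptions no longer carry an audience id, only a group key.
(def subscriptions
  [{:subscription/id "sub-1"
    :group/key       :courier-groups
    :event/type      :courier-missing-file}
   {:subscription/id "sub-2"
    :group/key       :courier-groups
    :event/type      :flow-failure
    :event/filters   {:error/attribution :customer}}])

(def audiences
  [{:audience/id "aud-1"
    :group/key   :courier-groups
    :emails      #{"a@foo.com" "b@foo.com"}}
   {:audience/id "aud-2"
    :group/key   :campaigns
    :emails      #{"c@foo.com"}}])

(defn matches?
  "True when the event has the subscription's event type and every
  filter entry on the subscription equals the same key on the event.
  A subscription without filters matches on event type alone."
  [subscription event]
  (and (= (:event/type subscription) (:event/type event))
       (every? (fn [[k v]] (= v (get event k)))
               (:event/filters subscription))))

(defn audiences-for-event
  "Returns the audiences that share a group key with at least one
  subscription matching the event."
  [subscriptions audiences event]
  (let [group-keys (into #{}
                         (comp (filter #(matches? % event))
                               (map :group/key))
                         subscriptions)]
    (filterv #(contains? group-keys (:group/key %)) audiences)))
```

For example, a customer-attributed flow failure reaches only the courier group audience:

```clojure
(mapv :audience/id
      (audiences-for-event subscriptions audiences
                           {:event/type        :flow-failure
                            :error/attribution :customer}))
;; => ["aud-1"]
```

In this sketch, adding a second audience to the :courier-groups group requires no new subscriptions, since audiences are resolved through the group key at alert time.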