---
tags: CDEvents
---

# Incident Events

Status: discussion

Contributors:
- Andrea Frittoli [@afrittoli](https://github.com/afrittoli)
- Emil Bäckmark [@e-backmark-ericsson](https://github.com/e-backmark-ericsson)
- Andrew Meyer [@menehune23](https://github.com/menehune23)
- Erik Sternerson [@erkist](https://github.com/erkist)

## Summary

One of the main goals of CDEvents is to increase observability in continuous delivery pipelines. Today, CDEvents defines events for several areas of the software delivery pipeline, from Software Configuration Management (SCM), through Continuous Integration (CI), to Continuous Deployment (CD). The software lifecycle does not stop at deployment, though: it continues in production environments. Incident events are meant to fill this gap.

## Motivation

[DORA][dora] metrics are a widespread set of DevOps metrics. The research of the DORA institute shows that high-performing organisations are elite performers in four metrics for software delivery performance. These metrics span the software lifecycle, from coding to managing production environments. The four metrics are:

- Lead time for changes
- Deployment frequency
- Change failure rate
- Time to restore service

CDEvents release v0.1 supports the first two DORA metrics today. The aim of this proposal is to extend CDEvents so that data can be collected for the entire set of DORA metrics.

## Goals

- Introduce subjects and predicates to model the events required to calculate "change failure rate" and "time to restore service"
- Extend the existing [DORA metrics POC][cdevents-dora] to include "change failure rate" and "time to restore service"
- Extend interoperability through events to the management of production environments.
  Define a standard way to model an incident that can be adopted across tools

## Non-goals

- Produce tools to calculate the DORA metrics from CDEvents

## Use Cases

- Produce DORA metrics
- Extend observability of CD pipelines to production environments
- Interoperability in the incident management space that allows DevOps engineers to build automation reusable across tools

## Requirements

1. New incidents can be reported by multiple sources
2. Existing incidents can be reported as resolved by multiple sources
3. It must be possible to associate incidents with specific environments
4. It must be possible to associate incidents with specific services
5. It must be possible to associate incidents with specific software versions

## Terminology

- **Incident**: a measurable disruption of a service level indicator (SLI) in a production environment

## Related Work

### CDEvents GitHub issue

The topic of incident events has been previously discussed in a [GitHub issue][incident-events-issue] and at the CDEvents WG. The notes from the working group are included in the [GitHub issue][incident-events-issue]. The current proposal is mostly based on this material.

### Keptn "Problem" events

Keptn defines events associated with what Keptn calls [problems][keptn-problem], with the following mandatory attributes:

```json
"ProblemEventData": {
  "required": [
    "ProblemID",
    "ProblemTitle",
    "ProblemDetails",
    "PID",
    "labels"
  ],
```

### Eiffel "Issue" events

Eiffel defines events associated with "issues". Issues can be "[defined][eiffel-issue-defined]" and "[verified][eiffel-issue-verified]". Attributes involved in these events include:

- "type" with legal values of e.g. "BUG", "IMPROVEMENT" and "REQUIREMENT"
- "tracker", stating the name of an external issue tracker
- "id", stating an id of the issue in the external issue tracker
- "uri", stating a uri to reference the issue through the external issue tracker

Eiffel "issues" are probably not 1-1 mapped to CDEvents "incidents".
Issues can be used to report on incidents, but they are not the incidents themselves. Issues can also declare items that are not incidents, such as improvements or requirements.

### OSS EU 2022 Presentation

Erik and Andrea presented [CDEvents + DORA][cdevents-dora-presentation] at the Open Source Summit EU 2022. Incident events were not available at the time, but we included a few comments in the slides about what they might look like.

## Proposal

Introduce a new subject, called **incident**, with mandatory **service** references. All extra data would be handled via `customData` in the very first iteration. Introduce new predicates associated with it, **reported** and **resolved**.

Environment and artifacts are part of the service event data model. If the service id in the incident event is the same as that from the service event, and if the source of the incident event has access to previous events or subject details, the service reference would be enough. The data model should optionally include both the environment and artifact references in the incident events. The [Connecting Events](/-Or6hobHSLWVj4duAWX7nA) discussion may help clarify this further.
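
As an illustration of the rules above (mandatory service reference, optional environment and artifact references, extra data in `customData`), here is a minimal Python sketch. The helper `build_incident_event` and its parameters are hypothetical and not part of any CDEvents SDK. The `dev.cdevents.incident.resolved` type string is an assumption derived from the predicate name, and the generated context id is a placeholder format.

```python
from datetime import datetime, timezone
import uuid


def build_incident_event(predicate, incident_id, source, service,
                         environment=None, artifact=None, custom_data=None):
    """Assemble an incident event as a plain dict (hypothetical helper)."""
    # Only the two predicates named in this proposal are accepted.
    if predicate not in ("reported", "resolved"):
        raise ValueError(f"unsupported predicate: {predicate}")
    # The service reference is the only mandatory piece of subject content.
    if not service or "id" not in service:
        raise ValueError("an incident event requires a service reference with an id")

    content = {"service": service}
    # Environment and artifact references are optional.
    if environment is not None:
        content["environment"] = environment
    if artifact is not None:
        content["artifact"] = artifact

    event = {
        "context": {
            "version": "0.2.0-draft",
            "id": str(uuid.uuid4()),  # assumption: any unique id is acceptable
            "source": source,
            "type": f"dev.cdevents.incident.{predicate}",  # assumed naming
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
        "subject": {"id": incident_id, "type": "incident", "content": content},
    }
    # All extra data goes into customData in the first iteration.
    if custom_data is not None:
        event["customData"] = custom_data
    return event
```

Calling the helper with only a service reference yields the minimal form of the event; passing `environment`, `artifact` or `custom_data` adds the optional parts.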

### Spec Changes

Example of a minimal *incident reported* event:

```json
{
  "context": {
    "version": "0.2.0-draft",
    "id": "A234-1234-1234",
    "source": "/prod/prometheus/123",
    "type": "dev.cdevents.incident.reported",
    "timestamp": "2023-01-18T09:38:00Z"
  },
  "subject": {
    "id": "E3B0C7AB-AA25-4215-9826-11F8F9A4AF89",
    "type": "incident",
    "content": {
      "service": {
        "id": "service/my-app",
        "source": "/k8s-prod/namespace/"
      }
    }
  }
}
```

Example of an *incident reported* event with optional fields:

```json
{
  "context": {
    "version": "0.2.0-draft",
    "id": "A234-1234-1234",
    "source": "/prod/prometheus/123",
    "type": "dev.cdevents.incident.reported",
    "timestamp": "2023-01-18T09:38:00Z"
  },
  "subject": {
    "id": "E3B0C7AB-AA25-4215-9826-11F8F9A4AF89",
    "type": "incident",
    "content": {
      "service": {
        "id": "service/my-app",
        "source": "/k8s-prod/namespace/"
      },
      "environment": {
        "id": "/cloud-x/region-y/k8s-prod/namespace/"
      },
      "artifact": {
        "id": "purl-url"
      }
    }
  },
  "customData": {
    "metric": {
      "name": "responseTime",
      "threshold": "10ms",
      "value": "100ms"
    }
  }
}
```

### SDK Changes

SDKs will produce the new events based on inputs from the SDK users. No other specific change to the SDK is expected.

### Tool Adoption

When defining the data model, we need to consider which tools and sources will generate the events, and where they will gather the data to populate them. For interoperability to work, data such as environments, services and artifacts must be described consistently across tools. This is especially important for incident events, since multiple events may be generated by different tools about the same overall incident, and only associated with a common root cause at a later stage.
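
To show why consistent identifiers matter, the following Python sketch pairs *reported* and *resolved* events by subject id and derives a "time to restore" duration per incident. It is an illustration only, not a proposed tool. It assumes that every source uses the same subject id for the same incident, that the resolved event type is `dev.cdevents.incident.resolved`, and that timestamps use the UTC format shown in the examples. The function names and the choice to keep the earliest report and the latest resolution are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse a UTC timestamp such as 2023-01-18T09:38:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def time_to_restore(events):
    """Return {incident id: timedelta} for incidents that have both events."""
    reported, resolved = {}, {}
    for event in events:
        event_type = event["context"]["type"]
        incident_id = event["subject"]["id"]
        timestamp = parse_timestamp(event["context"]["timestamp"])
        if event_type == "dev.cdevents.incident.reported":
            # Several tools may report the same incident: keep the earliest.
            if incident_id not in reported or timestamp < reported[incident_id]:
                reported[incident_id] = timestamp
        elif event_type == "dev.cdevents.incident.resolved":  # assumed type
            # Assumption: the latest resolution marks the end of the incident.
            if incident_id not in resolved or timestamp > resolved[incident_id]:
                resolved[incident_id] = timestamp
    # Incidents that are still open (no resolved event) are left out.
    return {i: resolved[i] - reported[i] for i in reported if i in resolved}
```

If two tools describe the same incident with different subject ids, this pairing breaks down, which is the interoperability concern raised above.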

## References

- [Define new incident events for the DORA metrics][incident-events-issue]
- [Keptn problem events][keptn-problem]

[incident-events-issue]: https://github.com/cdevents/spec/issues/59
[dora]: https://www.devops-research.com/research.html
[cdevents-dora]: https://github.com/afrittoli/cdevents-metrics-poc
[keptn-problem]: https://github.com/keptn/spec/blob/0.2.4/cloudevents.md#problem
[cdevents-dora-presentation]: https://github.com/cdevents/presentations/tree/main/2022-09-16-osseu-devops-metrics-through-cdevents
[eiffel-issue-defined]: https://github.com/eiffel-community/eiffel/blob/master/eiffel-vocabulary/EiffelIssueDefinedEvent.md
[eiffel-issue-verified]: https://github.com/eiffel-community/eiffel/blob/master/eiffel-vocabulary/EiffelIssueVerifiedEvent.md