# AppStudio Controller Metrics Proposal

**JIRA**: https://issues.redhat.com/browse/GITOPSRVCE-273

**Author**: _Panagiotis Georgiadis_ _(<panos@redhat.com>)_

-----

Review comments:

| Reviewer | Date | Comments |
| -------- | -------- | -------- |
| Text | Text | Text |

-----

## Part 1: General AppStudio Controller maintenance

This part proposes alerts for the AppStudio controller, modeled on those of similar OpenShift Operators. They cover the health and status of the controller itself.

#### AppStudioVersionMismatch

* use-case: After an update of the AppStudio operator, all controllers should be running the same version
* description: 'There are `{{ $value }}` different semantic versions of AppStudio components running.'
* summary: 'Different semantic versions of AppStudio components running.'
* Fire after: '15m'
* Use PromQL count-by and labels.

#### AppStudioPodCrashLooping

* use-case: the AppStudio controller pod has to be healthy
* description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").'
* summary: 'Pod is crash looping.'
* Fire after: '15m'
* Use PromQL `max_over_time`

#### AppStudioPodNotReady

* use-case: the AppStudio controller pod has to be healthy
* description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.'
* summary: 'Pod has been in a non-ready state for more than 15 minutes.'
* Fire after: '15m'
* Use PromQL 'sum by' and 'max by' over `kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}`

#### AppStudioDeploymentReplicasMismatch

* use-case: availability of the service
* description: 'Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes.'
* summary: 'Deployment has not matched the expected number of replicas.'
* Fire after: '15m'
* Use PromQL: compare `kube_deployment_spec_replicas` with `kube_deployment_status_replicas_available`; 'changes' should be '0'

#### AppStudioContainerWaiting

* use-case: the AppStudio controller pod has to be healthy
* description: 'pod/{{ $labels.pod }} in namespace {{ $labels.namespace }} on container {{ $labels.container}} has been in waiting state for longer than 1 hour.'
* summary: 'Pod container waiting longer than 1 hour'
* Fire after: '1h'
* Use PromQL 'sum by'

> NOTE 1: If there are any ConfigMaps related to the AppStudio controller, similar alerts could be deployed for them as well.

> NOTE 2: If there are certificates that need to be renewed, we would need to be alerted 1 week before (`AppStudioCertExpirationWarningSeconds: 7 * 24 * 3600,`) and 1 day before (`AppStudioCertExpirationCriticalSeconds: 1 * 24 * 3600,`) they expire.

> NOTE 3: I didn't see any Persistent Storage in the AppStudio manifests. If any is added, we will need alerts for it as well. Examples: `AppStudioPersistentVolumeErrors`, `AppStudioPersistentVolumeFillingUp` (with 'volumeFullPredictionSampleTime': '6h')

## Part 2: Metrics & alerts for the Custom Resources

The controller has 6 CRs:

### 1. AppStudio Application CR

##### Alert: AppStudioAppNotReady

* [ApplicationCR Status](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/application_types.go#L70)
* description: 'appstudioapplication in namespace {{ $labels.namespace }} on cluster {{ $labels.node}} has been unready for more than 15 minutes'
* summary: 'AppStudio Application is not ready'
* Fire after: '15m'
* Use PromQL: `{condition="Ready",status="true"} == 0`

##### Metrics:

Useful for PromQL queries for the [ApplicationCR devfile](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/application_types.go#L73):

* Age since the AppStudio Application CR has been created.
  * Use `.metadata.creationTimestamp`
* Number of all AppStudio Application CRs
* Number of AppStudio Application CRs that have condition "Ready"
* Number of AppStudio Application CRs that have condition "true"
* Number of AppStudio Application CRs with conditions in an undesired state
* Time (in seconds) it took from creation to becoming ready
* Number of AppStudio CRs with the exact same GitOpsRepository
* Number of AppStudio CRs without a DevFile

### 2. AppStudio Component CR

##### Alert: AppStudioComponentNotReady

* [Is it unhealthy?](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/component_types.go#L132)
* description: 'AppStudio component in namespace {{ $labels.namespace }} on cluster {{ $labels.node}} has been unready for more than 15 minutes'
* summary: 'AppStudio Component is not ready'
* Fire after: '15m'
* Use PromQL: `{condition="Ready",status="true"} == 0`

##### Metrics:

Useful for PromQL queries for the Component:

* Age since the AppStudio Component CR has been created.
  * Use `.metadata.creationTimestamp`
* Number of all AppStudio Component CRs
* Number of AppStudio Component CRs that have condition "Ready"
* Number of AppStudio Component CRs that have condition "true"
* Number of AppStudio Component CRs with conditions in an undesired state
* Time (in seconds) it took from creation to becoming ready
* Number of AppStudio Component CRs without (empty) `ComponentStatus.GitOps` status.
* Number of AppStudio Component CRs without (empty) `ComponentStatus.ContainerImage`.
* Number of AppStudio Component CRs without (empty) `ComponentStatus.Devfile`.
* Number of AppStudio Component CRs without (empty) `ComponentStatus.Webhook`.
* [Empty Webhook?](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/component_types.go#L135)
* [Empty Container Image Link?](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/component_types.go#L138)
* [Is devfile empty?](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/component_types.go#L141)
* [Is repository link empty?](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/component_types.go#L150)
* [Is branch empty?](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/component_types.go#L153)
* [Is context missing?](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/component_types.go#L156)
* [Is ResourceGeneration skipped?](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/component_types.go#L159)
* [Is commitID empty?](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/component_types.go#L162)

### 3. AppStudio Environment CR

There is no status implemented in the code yet (see [here](https://github.com/redhat-appstudio/application-api/blob/03f73a06d9787ed4884b22a9291a76e9a9523fb6/api/v1alpha1/environment_types.go#L115)), so the metrics evaluation for this CR needs to be revisited.

##### Metrics

* Number of total AppStudio Environment CRs

### 4. AppStudio PromotionRun CR

##### Alert: AppStudioPromRunHighErrorRate

* labels: `severity = "warning"`
* summary: "High error rate for promotion runs detected"
* description: "The error rate for promotion runs has exceeded the threshold. This may indicate a problem with the reliability of the promotion process, and requires further investigation."

For example, this alert will trigger if the error rate for promotion runs (i.e. the ratio of promotion runs with errors to total promotion runs) exceeds 10% for at least 1 minute. When the alert is triggered, it will send a notification with a summary and description of the issue, as well as a label indicating that the severity of the issue is "warning".

##### Alert: NonHealthyPromotionRuns

[This error message](https://github.com/redhat-appstudio/managed-gitops/blob/main/appstudio-controller/controllers/appstudio.redhat.com/promotionrun_controller.go#L291) suggests we could also set up an alert for it. This alert will trigger if the percentage of promotion runs with a non-healthy status exceeds 10% for at least 1 minute.

```alert
ALERT NonHealthyPromotionRuns
  IF (count(kube_promotion_run{status != "Synced/Healthy"}) / count(kube_promotion_run)) > 0.1
  FOR 1m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "High percentage of promotion runs with non-healthy status detected",
    description = "The percentage of promotion runs with a non-healthy status has exceeded the threshold. This may indicate a problem with the promotion process, and requires further investigation."
  }
```

##### Alert: MismatchedActiveBindings

To set up an alert based on the activebinding metric (see below), you can define an alerting rule in your Prometheus rules. Here is an example of an alert that triggers when the number of promotion runs with mismatched active bindings exceeds a certain threshold:

```alert
ALERT MismatchedActiveBindings
  IF count(appstudio_promotion_run{activebinding != spec_activebinding}) > 0
  FOR 1m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Promotion runs with mismatched active bindings detected",
    description = "There are one or more promotion runs with mismatched active bindings. This may indicate a problem with the promotion process."
  }
```

This alert will trigger if the number of promotion runs with mismatched active bindings exceeds 0 for at least 1 minute.
When the alert is triggered, it will send a notification with a summary and description of the issue, as well as a label indicating that the severity of the issue is "warning".

##### Alert: TotalFailedPromRuns

High number of failed promotion runs: You can set up an alert to trigger when the number of failed promotion runs exceeds a certain threshold, as this may indicate a problem with the promotion process.

##### Alert: TooManyFailedPromRuns

Low success rate for promotion runs: You can set up an alert to trigger when the success rate for promotion runs falls below a certain threshold, as this may indicate a problem with the promotion process.

##### Alert: TooLongPromRuns

Long duration for promotion runs: You can set up an alert to trigger when the duration of promotion runs exceeds a certain threshold, as this may indicate a performance issue with the promotion process.

##### Alert: TooManyDeletePendingPromRuns

High number of pending finalizers: You can set up an alert to trigger when the number of promotion runs with pending finalizers exceeds a certain threshold, as this may indicate a problem with the finalization process.

##### Metrics

* `promotionRun.Status.State`: This field indicates whether the promotion run is currently [active](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/promotionrun_types.go#L82), [failed](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/promotionrun_types.go#L84), or [waiting](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/promotionrun_types.go#L119). You could expose this as a gauge metric with a value of 1 for active promotion runs, 0 for inactive ones, and 2 for pending ones.
* `promotionRun.Status.CompletionResult`: This field indicates whether a promotion run that has completed all work (meaning the State field is 'Complete') was [successful](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/promotionrun_types.go#L91) or [failed](https://github.com/redhat-appstudio/application-api/blob/main/api/v1alpha1/promotionrun_types.go#L92). You could expose this as a gauge metric with a value of 1 for successful promotion runs and 0 for failed ones.
* `promotionRun.Status.EnvironmentStatus[].Status`: This field indicates whether the environment is Success, In Progress, or Failed. Use gauge values 0, 1, 2 for these.
* `promotionRun.Status.ActiveBindings`: A useful metric here would be the number of promotion runs with mismatched active bindings, that is, runs whose active binding in the spec differs from the active binding in the status. This could indicate a problem with the promotion process, as the active binding in the spec should be reflected in the status once the promotion is complete. You could set up an alert to trigger when this number exceeds a certain threshold. See [the code](https://github.com/redhat-appstudio/managed-gitops/blob/main/appstudio-controller/controllers/appstudio.redhat.com/promotionrun_controller.go#L158) that already performs this comparison. An example of a query to track the number of promotion runs with mismatched active bindings: `count(appstudio_promotion_run{activebinding != spec_activebinding})`. The intent is to count the `appstudio_promotion_run` objects whose activebinding in the status differs from the spec_activebinding in the spec. Note that a PromQL label matcher compares a label against a literal string, not against another label, so the expression is shorthand: the controller would have to do the comparison itself and export the result (for example as a 0/1 gauge) for the query to be expressible.
* In the same way that there is a `Status.StartTime`, I would propose also having a `Status.CompletionTime`: these fields would contain timestamps for when the promotion run started and completed (if applicable). You could use them to expose a histogram metric that tracks the duration of promotion runs. This idea came to mind after noticing that we have certain expectations, such as a [10 minutes](https://github.com/redhat-appstudio/managed-gitops/blob/main/appstudio-controller/controllers/appstudio.redhat.com/promotionrun_controller.go#L47) duration.
* `promotionrun_create_total`: A counter metric that tracks the total number of PromotionRun objects that have been created.
* `promotionrun_update_total`: A counter metric that tracks the total number of PromotionRun objects that have been updated.
* `promotionrun_delete_total`: A counter metric that tracks the total number of PromotionRun objects that have been deleted.
* `promotionrun_duration_seconds`: A histogram metric that tracks the duration of PromotionRun objects in seconds.
* `promotionrun_promotion_status_total`: A counter metric that tracks the number of PromotionRun objects with each possible promotion status (e.g. "Success", "Failure", etc.).
* `promotionrun_resource_count`: A gauge metric that tracks the number of resources (e.g. pods, deployments, etc.) that are associated with a promotion run.

### 5. AppStudio Snapshot CR

##### Alerts:

I couldn't think of any useful ones.

##### Metrics:

* Number of snapshots: You could track the total number of Snapshot objects in the cluster to monitor the overall usage of snapshots.
* `Spec.Application`: Number of snapshots with a given application: You could track the number of Snapshot objects with a specific application specified in the Application field to monitor the usage of snapshots for a particular application, e.g. PromQL: `count(appstudio_snapshot{application="my-app"})`
* `Spec.Components`: Number of snapshots with a given component: You could track the number of Snapshot objects with a specific component specified in the Components field to monitor the usage of snapshots for a particular component, e.g. PromQL: `count(appstudio_snapshot{components="my-component"})`
* `Spec.Components.ContainerImage`: Number of snapshots with a given container image: You could track the number of Snapshot objects with a specific container image specified in the ContainerImage field to monitor the usage of snapshots with a particular container image.

### 6. AppStudio SnapshotEnvironmentBinding CR

##### Alerts:

I couldn't think of any useful ones.

##### Metrics:

* Number of SnapshotEnvironmentBindings: Tracks the overall usage of SnapshotEnvironmentBindings in your cluster. PromQL query for the total number of SnapshotEnvironmentBindings: `count(appstudio_snapshotenvironmentbinding)`
* Number of SnapshotEnvironmentBindings per application: Tracks the usage of SnapshotEnvironmentBindings for a particular application. PromQL query for a specific application: `count(appstudio_snapshotenvironmentbinding{application="my-app"})`
* Number of SnapshotEnvironmentBindings per environment: Tracks the usage of SnapshotEnvironmentBindings for a particular environment. PromQL query for a specific environment: `count(appstudio_snapshotenvironmentbinding{environment="my-environment"})`
* Number of SnapshotEnvironmentBindings per snapshot: Tracks the usage of SnapshotEnvironmentBindings for a particular snapshot. PromQL query for a specific snapshot: `count(appstudio_snapshotenvironmentbinding{snapshot="my-snapshot"})`
* Number of components per SnapshotEnvironmentBinding: Tracks the average number of components per SnapshotEnvironmentBinding. PromQL query: `avg(appstudio_snapshotenvironmentbinding_components) by (appstudio_snapshotenvironmentbinding)`
* Replicas per component: Tracks the average number of replicas per component. PromQL query: `avg(appstudio_snapshotenvironmentbinding_configuration_replicas) by (appstudio_snapshotenvironmentbinding_components)`
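
The `count(...)` queries in this section only work if the controller exports one series per SnapshotEnvironmentBinding, carrying the application, environment and snapshot as labels. The sketch below illustrates that series shape in the Prometheus text exposition format. It is a minimal illustration under assumptions, not the controller's implementation: the `Binding` class, the `render` and `escape` helpers, and the `name` and `component` labels are hypothetical, and only the metric names are taken from the queries above. A real controller would more likely register these metrics through a Prometheus client library than format lines by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Binding:
    """Hypothetical, trimmed-down view of a SnapshotEnvironmentBinding."""
    name: str
    application: str
    environment: str
    snapshot: str
    # Component name -> configured replicas (illustrative data only).
    replicas: dict = field(default_factory=dict)


def escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(bindings):
    """Return exposition-format lines for a list of bindings.

    Each binding yields one constant-1 series (so count(...) counts
    bindings), one component-count gauge, and one replica gauge per
    component.
    """
    lines = []
    for b in bindings:
        labels = (
            f'name="{escape(b.name)}",application="{escape(b.application)}",'
            f'environment="{escape(b.environment)}",snapshot="{escape(b.snapshot)}"'
        )
        lines.append(f"appstudio_snapshotenvironmentbinding{{{labels}}} 1")
        lines.append(
            "appstudio_snapshotenvironmentbinding_components"
            f'{{name="{escape(b.name)}"}} {len(b.replicas)}'
        )
        for component, count in sorted(b.replicas.items()):
            lines.append(
                "appstudio_snapshotenvironmentbinding_configuration_replicas"
                f'{{name="{escape(b.name)}",component="{escape(component)}"}} {count}'
            )
    return lines


if __name__ == "__main__":
    demo = [
        Binding("b1", "my-app", "staging", "my-snapshot", {"frontend": 2, "backend": 3}),
        Binding("b2", "my-app", "prod", "my-snapshot", {"frontend": 4}),
    ]
    print("\n".join(render(demo)))
```

With series shaped like this, `count(appstudio_snapshotenvironmentbinding{application="my-app"})` would count the bindings of one application, and the averages would be taken over the component-count and replica gauges, grouped by whichever labels the controller ends up exposing.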