# Troubleshooting Guide

This document contains a collection of guidance for troubleshooting issues with ACK controllers.

## Deploying ACK controllers

- How ACK controllers are deployed (Deployment + Service + ServiceAccount + CRDs + ClusterRole/Role and ClusterRoleBindings)
- Two ways of deploying controllers (raw manifests + kustomize, or Helm)
- Our recommendation is to use IRSA and Helm
- Emphasize that controllers can be installed on *any* Kubernetes cluster, not just EKS ones

## How to diagnose issues with ACK

Example commands for each of the checks below are collected at the end of this document.

- First, determine whether the issue is actually related to the ACK controller.
- Next, check that the Kubernetes RBAC permissions are properly set up. You can ask the customer to run `kubectl auth can-i` to check this.
- Next, check that the IAM role for the controller's ServiceAccount has the necessary permissions to communicate with the AWS API in question. Our installation instructions typically reference an IAM managed policy in the source controller's `config/iam/recommended-iam-policy` file. Ask the customer whether they have attached that policy to the IAM role associated with their controller's ServiceAccount.
- If a user is having problems with a specific resource, the first thing to ask is for the user to run `kubectl describe <kind>/<resource_id>` and take a look at the `Status.Conditions` collection. If there was a failure to communicate with the AWS service API, there will typically be a `Condition` of type `ConditionTypeTerminal` with a reason/message explaining what the problem was.
- The next thing to check is the controller's logs. The controller may be restarted with the `log.level=DEBUG` configuration option in order to enable much more verbose log records, including trace-like output that shows the state transitions a custom resource goes through during reconciliation.
- Check that you are running a Prometheus server and that the controllers are publishing metrics to it. These Prometheus metrics can be a useful mechanism for you and the customer to diagnose issues.

## Common issues

- There may be an issue with the Kubernetes RBAC roles, the IAM role associated with the controller (IRSA), or the IAM roles associated with an individual AWS account when using CARM.
- There may be an issue with the Prometheus metrics server. Make sure that the metrics server is set up properly (use the default port, etc.) and is publishing metrics properly.
- There may be an issue with namespace annotations when deploying across multiple accounts (see the annotation example at the end of this document).
- There may be an issue with CR annotations if the user is trying to deploy a resource in a different region (see the annotation example at the end of this document).
- China-based customers may have issues pulling the official ACK container images, which are hosted in a public ECR repository.
- For China-based customers, the AWS services themselves may not support all the feature flags exposed by the custom resource.
- When upgrading a controller with `helm upgrade --install`, older custom resource definitions (CRDs) need to be deleted (likely with `--force`) before installing newer CRDs. See [the warning in the Helm docs][crd-helm] about this particular problem.

[crd-helm]: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
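
## Example diagnostic commands

The sketches below use the S3 controller as a stand-in; the release names, namespaces, resource kinds, role names, and account IDs are illustrative, so substitute the values for the controller actually being debugged.

To verify the Kubernetes RBAC setup, the customer can impersonate the controller's ServiceAccount with `kubectl auth can-i`:

```bash
# Ask the API server whether the controller's ServiceAccount may manage its
# custom resources. The group/kind and ServiceAccount names are placeholders.
kubectl auth can-i create buckets.s3.services.k8s.aws \
  --as=system:serviceaccount:ack-system:ack-s3-controller

# The same check for the customer's own user works without --as:
kubectl auth can-i create buckets.s3.services.k8s.aws
```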
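To confirm the IRSA wiring, read the role ARN annotation off the ServiceAccount, then list the policies attached to that role; the ServiceAccount and role names here are assumptions:

```bash
# The eks.amazonaws.com/role-arn annotation ties the ServiceAccount
# to an IAM role.
kubectl -n ack-system get serviceaccount ack-s3-controller \
  -o jsonpath="{.metadata.annotations['eks\.amazonaws\.com/role-arn']}"

# Check that the recommended managed policy is actually attached to that role.
aws iam list-attached-role-policies --role-name ack-s3-controller-role
```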
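To inspect a misbehaving resource's conditions, `kubectl describe` gives the full picture, or a jsonpath query can pull out just the `Status.Conditions` collection; `bucket/my-bucket` is a placeholder kind and name:

```bash
# Full human-readable dump, including events and conditions:
kubectl describe bucket/my-bucket

# Just the conditions; look for the Terminal condition described above
# and read its reason/message fields.
kubectl get bucket/my-bucket -o jsonpath='{.status.conditions}'
```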
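To turn on verbose logging, the Helm release can be upgraded in place. This sketch assumes the chart exposes a `log.level` value and is pulled from the public ECR registry; check the chart's `values.yaml` for the exact key and available levels:

```bash
# Bump the log level and let the Deployment roll the controller pod.
# Chart URL, version, and release name are illustrative.
helm upgrade --install ack-s3-controller \
  oci://public.ecr.aws/aws-controllers-k8s/s3-chart \
  --version 1.0.0 \
  --namespace ack-system \
  --set log.level=debug

# Then follow the much more verbose reconciliation logs:
kubectl -n ack-system logs deployment/ack-s3-controller -f
```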
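To check that metrics are actually being served, port-forward to the controller and scrape the endpoint by hand. The port below is an assumption based on a common default; verify it against the controller's Deployment spec:

```bash
# Forward the (assumed) metrics port locally...
kubectl -n ack-system port-forward deployment/ack-s3-controller 8080:8080 &

# ...and confirm Prometheus-format metrics come back.
curl -s http://localhost:8080/metrics | head -n 20
```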
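For cross-account (CARM) problems, check the namespace's owner-account-id annotation; the namespace name and account ID are placeholders:

```bash
# Read the annotation CARM uses to route resources to an AWS account.
kubectl get namespace production \
  -o jsonpath="{.metadata.annotations['services\.k8s\.aws/owner-account-id']}"

# Set or correct it (111122223333 is a placeholder account ID).
kubectl annotate namespace production \
  services.k8s.aws/owner-account-id=111122223333 --overwrite
```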
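For cross-region problems, the same pattern applies to the region annotation on the custom resource itself; again, the kind and name are placeholders:

```bash
# Read the region annotation on the custom resource.
kubectl get bucket/my-bucket \
  -o jsonpath="{.metadata.annotations['services\.k8s\.aws/region']}"

# Point the resource at a different region than the controller's default.
kubectl annotate bucket/my-bucket \
  services.k8s.aws/region=us-west-2 --overwrite
```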
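Because Helm installs CRDs only on first install and never upgrades or deletes them (the caveat in the linked Helm docs), the CRDs have to be refreshed by hand during an upgrade. A sketch, with a placeholder CRD name, chart version, and chart directory layout:

```bash
# WARNING: deleting a CRD also deletes every custom resource of that kind,
# so export any resources you need to keep first.
kubectl get crd | grep services.k8s.aws
kubectl delete crd buckets.s3.services.k8s.aws

# Pull the new chart and apply its bundled CRDs directly, since
# `helm upgrade` will not touch them.
helm pull oci://public.ecr.aws/aws-controllers-k8s/s3-chart \
  --version 1.0.0 --untar
kubectl apply -f s3-chart/crds/

# Finally, upgrade the release itself.
helm upgrade --install ack-s3-controller ./s3-chart --namespace ack-system
```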