
App Mesh on Amazon ECS - Part 2

This guide is a continuation of Part 1.

Up until this point, the applications in our mesh have been communicating with unencrypted traffic. Sure, we've configured a self-referencing security group and assigned it only to the ECS tasks that implement our virtual nodes and virtual gateways within the color-mesh service mesh, but we have the option to go further and encrypt the traffic moving between virtual resources using TLS.

5. TLS

In this section, we'll explore how to further secure our mesh's network traffic, focusing on the original 2 virtual nodes we launched - red and blue. We'll create a Private Certificate Authority (PCA) in AWS Certificate Manager (ACM), use it to provision SSL certificates, and install these certificates in both the root CA and our Envoy proxy sidecars to encrypt the traffic moving between ECS tasks. We'll begin with one-way TLS - not too dissimilar to the TLS mechanisms you may already be familiar with (e.g. HTTPS) - and provide an example of the full pipeline incorporating ACM PCA. Then we'll move on to implementing mutual TLS, in which both client and server tasks provide certificates to one another.

Throughout this section, pay close attention to the procedures we use to configure TLS, specifically the components involved in the implementation and the difference in configuration complexity between 1-way TLS and 2-way mutual TLS (mTLS).

Let's start with one-way TLS.

1-Way TLS: Server to Client

Consider the following 2 entities representing our virtual nodes in communication within the service mesh.

  • The client is the virtual node that initiates the unencrypted request to the server.
  • The server is the virtual node that receives the request from the client and presents its SSL certificate.

App Mesh supports multiple avenues for retrieving the certificate material in one-way TLS in ECS for both the server to send to the client, and for the client to validate the server's certificate. As previously mentioned, we'll explore an implementation of TLS on App Mesh using certificates provisioned from ACM PCA.

Note: The SPIFFE Runtime Environment (SPIRE) implementation of the Envoy Secret Discovery Service (SDS) API is not supported on ECS or Fargate as of November 2023. However, this may change in the future.

Configuring the Private Certificate Authority

This section was in-part adapted from the App Mesh walkthrough on TLS configuration.

Before we can generate certificates, we must first create a private certificate authority which we can use, because within-mesh TLS encryption only supports private certificates.

Disclaimer: ACM PCA charges a hefty fee for managing general-purpose private certificate authorities, $400 a month at the time of writing. You are not charged for the first 30 days of the first private CA you create, but you are charged immediately for subsequent CAs. Follow the steps in this section with care if you are operating in a personal or individual AWS account.
// 05a-create-pca.json
{
    "CertificateAuthorityConfiguration": {
        "KeyAlgorithm": "RSA_2048",
        "SigningAlgorithm": "SHA256WITHRSA",
        "Subject": {
            "CommonName": "colors.local"
        }
    },
    "CertificateAuthorityType": "ROOT"
}
export ROOT_CA_ARN=$(aws acm-pca create-certificate-authority --cli-input-json file://05a-create-pca.json | jq -r ".CertificateAuthorityArn")

Note the following:

  • We'll be making great use of shell variables in this section when we create the PCA.
  • These commands will be making use of the jq command-line utility to extract values from the JSON responses returned by the AWS CLI. jq's -r flag strips the double quotation marks from the result and provides us with just the actual value we've extracted.
  • We created a ROOT certificate authority, because this CA does not have material signed by a higher authority - it will be the CA issuing all subsequent certificates for our environment.
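To illustrate the difference the -r flag makes, here's a quick demo against a stub JSON payload (the ARN below is a made-up example standing in for a real CLI response, not an actual resource):

```shell
# Stub payload standing in for a real create-certificate-authority response.
PAYLOAD='{"CertificateAuthorityArn":"arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/example"}'

# Without -r, jq prints the value as a JSON string, quotes included.
echo "$PAYLOAD" | jq ".CertificateAuthorityArn"

# With -r, jq prints the raw value - exactly what we want in a shell variable.
ARN=$(echo "$PAYLOAD" | jq -r ".CertificateAuthorityArn")
echo "$ARN"
```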

Let's have a look at what we just created.

aws acm-pca describe-certificate-authority --certificate-authority-arn $ROOT_CA_ARN

At this point in time, our private CA will be in the PENDING_CERTIFICATE state because the CA is missing a root certificate. We need to satisfy the certificate signing request (CSR) for our root CA by issuing a certificate using our CA (which will result in a self-signed root certificate) and import the certificate back into PCA as the CA certificate.

We'll start by fetching the CSR for our CA.

export ROOT_CA_CSR=$(aws acm-pca get-certificate-authority-csr --certificate-authority-arn $ROOT_CA_ARN | jq -r ".Csr") && echo "$ROOT_CA_CSR"

With version 2 of the AWS CLI, the CSR must be Base64 encoded before we can pass it to ACM PCA.

export ROOT_CA_CSR_ENCODED=$(echo "$ROOT_CA_CSR" | base64)
// 05b-sign-csr-through-issuing.json
{
    "SigningAlgorithm": "SHA256WITHRSA",
    "TemplateArn": "arn:aws:acm-pca:::template/RootCACertificate/V1",
    "Validity": {
        "Value": 2,
        "Type": "YEARS"
    }
}
export ROOT_CA_CERT_ARN=$(aws acm-pca issue-certificate --certificate-authority-arn $ROOT_CA_ARN --csr "$ROOT_CA_CSR_ENCODED" --cli-input-json file://05b-sign-csr-through-issuing.json | jq -r ".CertificateArn")

Two things of note:

  • The certificate's validity dictates how long the certificate can be used before you need to issue and import a new root CA certificate for the private CA. Here, this is simply an arbitrary setting for us to prevent the root certificate from expiring.
  • For more information on the RootCACertificate certificate template, refer to the PCA documentation.

The certificate we've just issued will be the root CA's certificate, which will form the root of the certificate chain for all subsequent certificates we sign using this CA. We now need to fetch this certificate, encode it in Base64 (again, because we're using version 2 of the AWS CLI), and import it as the CA certificate.

export ROOT_CA_CERT=$(aws acm-pca get-certificate --certificate-arn $ROOT_CA_CERT_ARN --certificate-authority-arn $ROOT_CA_ARN | jq -r ".Certificate")
export ROOT_CA_CERT_ENCODED=$(echo "$ROOT_CA_CERT" | base64)
aws acm-pca import-certificate-authority-certificate --certificate-authority-arn $ROOT_CA_ARN --certificate "$ROOT_CA_CERT_ENCODED"

Now, we can check the state of our private CA - the status should've transitioned to ACTIVE.

aws acm-pca describe-certificate-authority --certificate-authority-arn $ROOT_CA_ARN

Requesting and Specifying the Certificate

Wonderful - we now have a private CA that we can use to generate certificates for the virtual nodes within our mesh! Let's request that a certificate be generated.

export COLOR_CERT_ARN=$(aws acm request-certificate --domain-name "*.colors.local" --certificate-authority-arn $ROOT_CA_ARN | jq -r ".CertificateArn")

Give the certificate a few seconds to be generated, then we can update the red virtual node to use this certificate to terminate TLS traffic on its port 80 listener. Note that we saved the certificate ARN to an environment variable, but because of the layout and complexity of this file, it's much easier to just insert the value directly.

// 05c-update-red.json
{
    "virtualNodeName": "red",
    "meshName": "color-mesh",
    "spec": {
        "listeners": [
            {
                "portMapping": { "port": 80, "protocol": "http" },
                "tls": {
                    "certificate": {
                        "acm": { "certificateArn": "<arn-of-your-acm-certificate>" }
                    },
                    "mode": "STRICT"
                }
            }
        ],
        "serviceDiscovery": {
            "awsCloudMap": { "namespaceName": "colors.local", "serviceName": "red" }
        },
        "logging": {
            "accessLog": {
                "file": {
                    "format": {
                        "text": "[%START_TIME%] %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL% %RESPONSE_CODE%\n"
                    },
                    "path": "/dev/stdout"
                }
            }
        },
        "backends": [
            { "virtualService": { "virtualServiceName": "router.colors.local" } }
        ]
    }
}
aws appmesh update-virtual-node --cli-input-json file://05c-update-red.json

Every other configuration value for the red virtual node remains the same - all we're doing is adding TLS termination on the listener.

At this point in time, we need to update our ECS task role to add 2 permissions: acm:ExportCertificate for the virtual node's listener so that Envoy can fetch the certificate from ACM, and acm-pca:GetCertificateAuthorityCertificate, the latter of which is not used now, but which will be required in the next section.
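As a rough sketch, the statement added to the task role might look like the following (the resource ARNs are placeholders for your certificate and CA; scope them down, or use "*" while experimenting):

```json
{
    "Effect": "Allow",
    "Action": [
        "acm:ExportCertificate",
        "acm-pca:GetCertificateAuthorityCertificate"
    ],
    "Resource": [
        "<arn-of-your-acm-certificate>",
        "<arn-of-your-pca>"
    ]
}
```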

Now, we can Exec into the main application container (not the Envoy sidecar) for both the red and blue nodes to run the following commands to test the SSL configuration.

  1. First, run curl localhost:9901/stats | grep connected on both nodes to check that Envoy is still connected to the App Mesh control plane.
  2. Run curl localhost:9901/stats | grep ssl to check the SSL metrics for each node. Observe the following, and bear in mind we haven't made any request between the nodes at this point in time:
    • red will recognize the Envoy listener in the same task on port 15000 as this port corresponds to Envoy; despite port 15000 not being open on either ECS task, this is the port to which traffic is proxied before leaving the task.
    • blue will recognize red's downstream presence as an Egress target in the mesh, and will present statistics on port 80.
  3. Make a request from blue to red - curl red.colors.local (assuming blue still has red's virtual service as a backend from part 2 of this tutorial). Observe that the curl request from blue to red is currently succeeding.

At present, blue is not currently validating the TLS certificate provided by red. The blue node will still encrypt traffic using red's certificate, but blue will not check the authenticity of this certificate because we haven't configured this in blue's client policy. Let's observe which SSL metrics have incremented as a result of this request.

  • Both virtual nodes are engaging in an SSL handshake, and the .ssl.handshake metric will be set to the same value for both nodes after a successful SSL handshake has been negotiated.
    • This handshake does not happen on every request from blue to red; rather, subsequent requests within a short window will use the same SSL material from the previous handshake.
  • The ssl.no_certificate metric refers to successful SSL negotiations where a certificate was not provided by the client (a client certificate would make this mutual TLS). Currently, the certificate is only being provided by the server, which means this metric should increment in lockstep with the handshake count.
  • The .ssl.ciphers.<cipher-id> metric tracks the number of times that a particular cipher was used to encrypt the secret key during a TLS handshake.

We can change blue's behavior (which is currently using the certificate without validating it) by adding the ability for blue to validate red's certificate, thus improving our security posture.

// 05d-update-blue.json
{
    "virtualNodeName": "blue",
    "meshName": "color-mesh",
    "spec": {
        "listeners": [
            { "portMapping": { "port": 80, "protocol": "http" } }
        ],
        "serviceDiscovery": {
            "awsCloudMap": { "namespaceName": "colors.local", "serviceName": "blue" }
        },
        "logging": {
            "accessLog": {
                "file": {
                    "format": {
                        "json": [
                            { "key": "start_time", "value": "%START_TIME%" },
                            { "key": "method", "value": "%REQ(:METHOD)%" },
                            { "key": "request_path", "value": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%" },
                            { "key": "protocol", "value": "%PROTOCOL%" },
                            { "key": "response_code", "value": "%RESPONSE_CODE%" }
                        ]
                    },
                    "path": "/dev/stdout"
                }
            }
        },
        "backends": [
            { "virtualService": { "virtualServiceName": "red.colors.local" } }
        ],
        "backendDefaults": {
            "clientPolicy": {
                "tls": {
                    "enforce": true,
                    "validation": {
                        "trust": {
                            "acm": { "certificateAuthorityArns": [ "<arn-of-your-pca>" ] }
                        }
                    }
                }
            }
        }
    }
}
aws appmesh update-virtual-node --cli-input-json file://05d-update-blue.json

Once again - observe that the only thing we've changed in blue's configuration is to pass the ARN of the private CA we created in the previous subsection. At this point, return to the Exec shells we have open for our red and blue nodes, and curl red.colors.local from the blue node. Once again, run curl localhost:9901/stats | grep ssl, and note the following statistics.

  • The .ssl.handshake metric will increment if you waited a sufficient amount of time since the last request from blue to red.
  • The ssl.no_certificate metric is still incrementing alongside the .ssl.handshake metric.
  • The .ssl.ciphers.<cipher-id> metric is still incrementing as well.

Mutual TLS

At this point in time, we've configured TLS encryption on the traffic between the blue client node and the red server node. Envoy negotiates TLS on our application's behalf, and while the traffic moving from Envoy to the application container is unencrypted, this is generally considered to be acceptable because the unencrypted traffic is local to the task.

However, we can also configure mutual TLS, which lets the server verify the authenticity of the client it's communicating with by requiring that the client also provide a certificate for validation. Mutual TLS, when properly implemented, requires that the client, in its initial request to the server, provide its own certificate, which the server validates against a local certificate trust chain. In the following sections, we'll add the infrastructure necessary to implement mutual TLS. These steps will reuse the certificate we provisioned for the server, but you can just as easily request another certificate from our private CA for the client and use that instead - just remember to update the COLOR_CERT_ARN environment variable with the ARN of the new certificate in the subsequent sections.

Methods for Providing the Certificate Material

Unlike with our server certificate, Envoy doesn't support fetching key material from ACM when configuring the client Envoy proxy to provide its own certificate in its request to the server. Only a local file path or SDS secret parameter is supported, and because SPIRE's SDS implementation isn't available in ECS or Fargate (at the time of writing, November 2023), we need to configure the certificate material on the Envoy sidecar itself. There are 2 ways we can load the certificate onto Envoy.

  1. Copy the certificate files into Envoy by building a new container image. This method has the benefit of the certificates being ready to go when the Envoy proxy spins up, and the image will inherit the default entrypoint command. However, the image cannot be stored publicly (because there is private key material baked into it), and when the certificate is to be rotated, the image must be rebuilt and pushed to the repository before the ECS task can be updated.
  2. Store the certificate files in SSM and reference them in ECS as environment variables. This method will still require building a new Envoy image and updating the image's command to store the certificate material from the environment into files in local storage, but has 2 main benefits: we're not hard-coding the certificate material in the image (and thus our security posture is improved here), and when the certificate is to be rotated, we just need to update the relevant SSM parameters before cycling the tasks to fetch the new values.

As you may have guessed, we'll be exploring option #2, since it represents a more secure and streamlined approach to specifying the client certificate and private key material.

Fetching and Storing the Material

Download and unpack the certificate, certificate chain, and the private key into separate files. These steps will make use of the jq command-line utility to parse the output. We need to create an encoded passphrase file to encrypt the private key material in transit - you can use any alphanumeric passphrase here.

echo -n "<your-passphrase>" > passphrase.txt && echo -n $(openssl base64 -in passphrase.txt) > passphrase-base64.txt
aws acm export-certificate --certificate-arn $COLOR_CERT_ARN --passphrase file://passphrase-base64.txt > certificate-info.json
cat certificate-info.json | jq -r ".Certificate" > cert.pem
cat certificate-info.json | jq -r ".CertificateChain" > cert-chain.pem
cat certificate-info.json | jq -r ".PrivateKey" > private-key-encrypted.pem
Note: If you are using source control to store the files in this tutorial, now is a good time to exclude these sensitive files from commits. You can add 3 patterns to your ignore file: certificate-info.json, passphrase*, and *.pem.

We also need to decrypt the private key material by specifying the passphrase we used when fetching the certificate material. The command below will make use of the openssl utility to do this.

openssl rsa -in private-key-encrypted.pem -out private-key.pem
// You will be asked for the passphrase here - it must match the value of "<your-passphrase>" above.
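Before moving on, it's worth sanity-checking that the decrypted key actually pairs with the certificate. A common technique is to compare the public-key moduli of the two files; the snippet below demonstrates this on a throwaway self-signed pair (substitute cert.pem and private-key.pem from the export above when running it for real):

```shell
# Generate a throwaway key + self-signed certificate to demonstrate on.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout demo-key.pem -out demo-cert.pem \
  -subj "/CN=demo.colors.local" -days 1 2>/dev/null

# A certificate and its private key belong together only if their moduli match.
CERT_MOD=$(openssl x509 -noout -modulus -in demo-cert.pem | openssl md5)
KEY_MOD=$(openssl rsa -noout -modulus -in demo-key.pem | openssl md5)

if [ "$CERT_MOD" = "$KEY_MOD" ]; then
  echo "key matches certificate"
else
  echo "MISMATCH - this key does not belong to this certificate"
fi
```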

Now, head over to SSM Parameter Store and store the following 6 parameters in SSM as text entries, which we will update the ECS task definitions to reference. The first 3 parameters contain the file contents.

  • /acm/cert - This will store the value within the cert.pem file.
  • /acm/cert-chain - This will store the value within the cert-chain.pem file.
  • /acm/private-key - This will store the value within the private-key.pem file.

Here's an example command to load the contents of cert.pem into the /acm/cert parameter, using a SecureString datatype in SSM. You must execute this command in the same directory as you used for the certificate commands above.

aws ssm put-parameter --name "/acm/cert" --value "file://cert.pem" --type SecureString

The next 3 parameters will store the paths to the relevant files on the Envoy container. The files won't yet exist at image build time; rather, we'll be creating the files at runtime. Here are the values I'll be using for each, but feel free to substitute your own.

  • /acm/cert-path - /keys/cert.pem
  • /acm/cert-chain-path - /keys/cert-chain.pem
  • /acm/private-key-path - /keys/private-key.pem

Here's an example command to load the file paths into SSM using a String datatype.

aws ssm put-parameter --name "/acm/cert-path" --value "/keys/cert.pem" --type String
Note: Git Bash users on Windows may encounter an issue that will affect you in 2 ways:

  • You may be prevented from creating the parameter due to a ValidationException, with the following reason: Parameter name must be a fully qualified name.
  • When creating the 3 path parameters, you may notice that the parameter value stored in SSM inherits the full path of your Git installation as a prefix to the /keys/<file>.pem value, instead of just the value itself.
These errors occur because of a known issue in the way Git Bash handles command-line arguments starting with a slash /, which arises from the path conversion Git Bash performs to provide an experience as close to native Bash as possible. To work around this issue temporarily (and avoid breaking other slash-based command-line options), simply prefix the aws ssm put-parameter commands with MSYS_NO_PATHCONV=1 to disable path conversion for only that command. This is what the full command will look like, as an example, using the command above.
MSYS_NO_PATHCONV=1 aws ssm put-parameter --name "/acm/cert-path" --value "/keys/cert.pem" --type String

Once these 6 parameters have been configured, we can discuss how these parameters should be passed to Envoy.

Passing the Material to Envoy

Previously, we were using the Envoy image that was released by App Mesh. This version does not already know how to use any certificate material we pass to it, nor will it have the certificate material we need by default. We need to write the certificate, certificate chain, and private key material into the image - for this, we create a very simple Dockerfile that will work in conjunction with the SSM-parameter environment variables we pass to it.

# Dockerfile
FROM public.ecr.aws/appmesh/aws-appmesh-envoy:v1.27.0.0-prod as envoy
RUN mkdir /keys && chmod 777 /keys
CMD ["sh", "-c", "echo \"$CERT\" > $CERT_PATH && echo \"$CERT_CHAIN\" > $CERT_CHAIN_PATH && echo \"$PRIVATE_KEY\" > $PRIVATE_KEY_PATH && /usr/bin/agent"]

Let's break down this Dockerfile.

  • We will be using the public App Mesh distribution of the Envoy image as our base image.
  • The default user inside this image will not have permission to write to top-level folders - thus, after we create the /keys folder (which you should substitute with your own folder if you opted to select a different path for the certificate and private key material), we need to update the folder's permission modes to allow our user to write to this folder.
  • We are given the CMD this image has by default in the App Mesh documentation - /usr/bin/agent. Because we can't specify the file initialization commands at build time (because the environment variables are only passed at runtime), we prefix the container's executable with the echo statements so that the files are only written when this container is run.
  • We need to escape the double quotes in the echo commands to preserve newlines when writing these values to file. If we don't use double quotes here, the newlines will turn into spaces in the resulting file, which we don't want to happen.
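The last point is easy to verify locally - the difference between the quoted and unquoted expansion is exactly the newline handling:

```shell
# A stand-in for PEM material: two lines of text.
PEM='-----BEGIN DEMO-----
-----END DEMO-----'

echo $PEM   > unquoted.txt   # word splitting collapses the newline to a space
echo "$PEM" > quoted.txt     # quoting preserves the newline

wc -l < unquoted.txt   # 1 line
wc -l < quoted.txt     # 2 lines
```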

Remember - this guide was built on version 1.27.0.0 of the Envoy image - feel free to substitute the most up-to-date version you are using.

Build and push this image to a remote repository. We will then need to create a new revision of our ECS task definition for both of the red and blue virtual nodes to accomplish 2 things:

1. Reference our aforementioned custom Envoy image.

This is easily done by changing the image entry for our Envoy container to the name of your custom image.

2. Pass the environment variables from SSM to the container.

This is done by adding the secrets task definition parameter to the Envoy container specification (not the application container!) with a list of environment variables and the ARN of their corresponding parameters from SSM:

...
"secrets": [
    { "name": "CERT", "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/cert" },
    { "name": "CERT_PATH", "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/cert-path" },
    { "name": "CERT_CHAIN", "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/cert-chain" },
    { "name": "CERT_CHAIN_PATH", "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/cert-chain-path" },
    { "name": "PRIVATE_KEY", "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/private-key" },
    { "name": "PRIVATE_KEY_PATH", "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/private-key-path" }
],
...

We also need to configure our ECS task execution role with permissions to SSM to allow ECS to successfully query the aforementioned parameters and initialize them inside the Envoy container. This is easily done by attaching the AmazonSSMReadOnlyAccess AWS-managed policy to the task execution role.
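If you'd rather not grant broad read access, a scoped inline policy on the task execution role works too. Here's a sketch, assuming the /acm/* parameter naming from earlier (the region and account ID are placeholders):

```json
{
    "Effect": "Allow",
    "Action": [ "ssm:GetParameters" ],
    "Resource": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/acm/*"
}
```

If your SecureString parameters are encrypted with a customer managed KMS key rather than the default aws/ssm key, the execution role will also need kms:Decrypt on that key.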

Once that's done, update the red-svc and blue-svc ECS services to force a new deployment onto the next task definition version you've just deployed.

Configuring App Mesh for Mutual TLS

At this point in time, Envoy's operation has not changed - we're still using one-way TLS, with the server's certificate and the client-side validation material both sourced from ACM, and the local files we initialized through our custom Envoy image have not yet been put to use in configuring mutual TLS. Now, we need to update the virtual node settings for red and blue:

  • red will require a client TLS certificate in any requests made to its listener, and will validate the incoming client certificate using the local certificate chain located at the path defined by the CERT_CHAIN_PATH environment variable.
  • blue will provide a certificate during its initial request to red, using the local certificate and private key files located at the paths defined by their corresponding environment variables.

We'll apply these configurations in steps to demonstrate what happens when only one side of the mutual TLS configuration has been configured. Typically you'd configure the client certificate before configuring server validation, but we'll reverse the order here to demonstrate what happens in a failed client certificate verification scenario.

To start, let's update red to require a TLS certificate from the requesting client while operating in STRICT mode. Copy the contents of 05c to a new file, and under the listeners -> tls field, add the following specification.

// 05g-update-red-mtls.json
...
    "spec": {
        "listeners": [
            {
                "portMapping": { "port": 80, "protocol": "http" },
                "tls": {
                    "certificate": {
                        "acm": { "certificateArn": "<your-certificate-arn>" }
                    },
                    "mode": "STRICT",
                    "validation": {
                        "trust": {
                            "file": { "certificateChain": "/keys/cert-chain.pem" }
                        }
                    }
                }
            }
        ],
        "serviceDiscovery": {
...
aws appmesh update-virtual-node --cli-input-json file://05g-update-red-mtls.json

Before we update blue, let's attempt to curl red.colors.local from blue to see what's happened. Wait until the .server_ssl_socket_factory.ssl_context_update_by_sds metric has incremented on red's statistics, indicating that red's Envoy sidecar has received the updated SSL specification from the App Mesh control plane, before running curl from blue. If blue succeeds in curling and the .ssl.handshake metric has not increased, blue will still be using the previous SSL handshake material. Give it a few minutes and try again.

You'll know that red has received the updated SSL context when you attempt to run the curl command and encounter an error similar to the following:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: TLS error: 268436496:SSL routines:OPENSSL_internal:SSLV3_ALERT_HANDSHAKE_FAILURE

When this happens, check the .ssl.connection_error statistic on the blue node - this metric increments each time you make a request to red that fails due to SSL issues.

Then, go to the red node's shell and check the .ssl.fail_verify_no_cert metric. This metric tracks the SSL connection failures that occur when the red node's listener expects the incoming client request to provide a certificate, but the client does not provide one. These connections fail because the red listener is operating in STRICT TLS mode - in a production scenario, to avoid downtime while configuring the blue client, you'd switch red to PERMISSIVE mode to allow non-SSL connections for as long as SSL is not yet possible.
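For reference, that fallback is a one-field change in the listener's tls block of the red virtual node specification (the surrounding fields stay as shown in the earlier file):

```json
"tls": {
    "certificate": { ... },
    "mode": "PERMISSIVE",
    ...
}
```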

Now, let's update blue to provide a client certificate on the initial request to its authorized backends. To do this, we provide the certificate entry under the clientPolicy -> tls field - copy the contents of 05d to a new file and add the following.

// 05h-update-blue-mtls.json
...
        "backendDefaults": {
            "clientPolicy": {
                "tls": {
                    "enforce": true,
                    "validation": {
                        "trust": {
                            "acm": { "certificateAuthorityArns": [ "<your-certificate-authority-arn>" ] }
                        }
                    },
                    "certificate": {
                        "file": {
                            "certificateChain": "/keys/cert.pem",
                            "privateKey": "/keys/private-key.pem"
                        }
                    }
                }
            }
        }
    }
}
aws appmesh update-virtual-node --cli-input-json file://05h-update-blue-mtls.json

Note: This node may take a while to update. You'll know when Envoy's received the new specification when the .client_ssl_socket_factory.ssl_context_update_by_sds metric increments.

Don't be misled by the certificateChain parameter name when we configure the blue virtual node. It may seem as though this field demands a full certificate chain, but the file we point it at (cert.pem) contains only the leaf certificate - exporting from ACM separates the certificate from its chain, and in TLS handshakes, the client tends to send only its own certificate rather than the whole chain.

Once ready, curl the red node again from blue. Pay attention to the following statistics:

  • .ssl.handshake should increment on both virtual nodes if the curl was successful.
  • On red, note that .ssl.no_certificate has not incremented - this is because SSL was successfully negotiated with a client certificate.

Envoy aims to make the TLS implementation invisible to the downstream application, which is why these metrics are the only in-task indication of TLS negotiation status.

Congratulations! You've successfully configured red and blue to negotiate TLS both 1-way and mutually, and made your way through a highly complex chapter of this guide - you should feel proud of yourself for making it this far. While we've only demonstrated TLS configuration between the red and blue nodes, if you'd like to experiment further with TLS in App Mesh and deepen your understanding of the configuration, here are some additional things you may wish to consider:

  • Negotiating TLS between the virtual gateway we configured in section 4, and the blue node at the gateway route's root path /.
  • Terminating TLS at the load balancer. This isn't strictly an App-Mesh-specific task, but it does improve our gateway's security posture.

Now, let's move onto a discussion on observability.

6. Observability with OpenTelemetry

The Envoy proxy exposes a lot of data and customisability for end-user monitoring solutions. We've already seen some of the potential of the Envoy administrative interface when we had a look at the SSL metrics during mutual TLS configuration, and during the brief aside at the end of section 2, but there's so much more that Envoy allows us to do in terms of observability without needing to configure anything at the application level.

There are 3 core types of observability signals - logs, metrics, and traces - commonly referred to as the Three Pillars of Observability. We've already configured logging to CloudWatch Logs in our task definition using the awslogs driver. In this section, we're going to build an implementation-agnostic observability pipeline using the AWS Distro for OpenTelemetry (ADOT) to ingest metrics and traces from our Envoy sidecars in the color-mesh service mesh. ADOT allows us to mix-and-match different types of signal sources and destinations, and drop in existing observability solutions from 3rd-party backends which you may already be using.

We'll be touching briefly on some of the core ADOT components in this section. For an in-depth understanding of ADOT's potential, check out the following guide on Observability with OpenTelemetry on Amazon ECS.

Metrics - Prometheus and CloudWatch Logs

Let's start by initializing the config.yaml file which we will use to customize the components in our ADOT sidecar. The skeleton we'll be using looks like this.

# config.yaml
receivers:
  # Accept data from an upstream source
exporters:
  # Send data to a downstream backend
extensions:
  # Relevant to the operation of the collector
service:
  extensions:
    # Enable any extensions we configured above
  pipelines:
    # Sort our receivers and exporters into signal pipelines

Remember how we were running curl localhost:9901/stats from our application container to get statistics relevant to the Envoy sidecar in our color application tasks from port 9901? If we append /prometheus to the curl address (making localhost:9901/stats/prometheus), we can retrieve those statistics in a Prometheus-compatible format - give it a try! Let's declare a prometheus receiver to scrape this endpoint. This is a drop-in replacement for existing Prometheus sidecars and will be configured as such - notice that we're using the same syntax.

# config.yaml
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "envoy-prometheus"
          static_configs:
            - targets: [ 0.0.0.0:9901 ]
          metrics_path: /stats/prometheus
exporters:
...

You can name the job_name anything you'd like - I've opted to name it based on the endpoint and type of data we're scraping. The scraped target is over the same network interface as this container, and the metrics_path dictates the path on this target to be scraped.

Scraping the endpoint isn't enough, however. We need to then send the data to a backend. Let's try sending the Envoy metric signals to CloudWatch Metrics by configuring the following exporter.

```yaml
# config.yaml
...
exporters:
  awsemf:
    namespace: "ECS/Envoy"
    log_group_name: "envoy-proxy-metrics"
    dimension_rollup_option: NoDimensionRollup
extensions:
  ...
```

(Side note: NoDimensionRollup indicates that CloudWatch should interpret incoming dimension combinations as-is. This is to prevent metric value duplication, which results in an unnecessarily higher rate of ingestion.)

The Embedded Metric Format (EMF) specification is located here if you'd like to learn more about why CloudWatch uses EMF.
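For illustration, here's roughly the shape of an EMF record that the awsemf exporter writes to the envoy-proxy-metrics log group. The top-level field names come from the EMF specification, but the metric name, label, and values below are made up for this sketch:

```json
{
    "_aws": {
        "Timestamp": 1700000000000,
        "CloudWatchMetrics": [
            {
                "Namespace": "ECS/Envoy",
                "Dimensions": [ [ "envoy_cluster_name" ] ],
                "Metrics": [ { "Name": "envoy_cluster_upstream_cx_total", "Unit": "Count" } ]
            }
        ]
    },
    "envoy_cluster_name": "cds_egress_example",
    "envoy_cluster_upstream_cx_total": 42
}
```

CloudWatch parses each log event in this format into metric datapoints, keyed by the dimension sets declared in the `Dimensions` array - which is why the dimension rollup behaviour we configured above matters.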

While we're at it, let's add the collector's health check extension too.

```yaml
# config.yaml
...
extensions:
  health_check:
service:
  ...
```

At this point, we've declared and configured our components, but we haven't enabled them in the collector. Let's enable them by configuring a pipeline with our 2 components, and also enable the health check extension.

```yaml
# config.yaml
...
service:
  extensions: [ health_check ]
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ awsemf ]
```

We now need to pass this configuration file to ADOT. Let's do that by creating a new image using ADOT as a base, and pushing this image to a remote repository.

```dockerfile
# Dockerfile
FROM public.ecr.aws/aws-observability/aws-otel-collector
COPY config.yaml /etc/ecs/config.yaml
CMD [ "--config=/etc/ecs/config.yaml" ]
```

```shell
docker build -t <registry>/adot-envoy . && docker push <registry>/adot-envoy
```

The image's new command will tell it to load the configuration file we just copied into the image.

Let's start by instrumenting the blue virtual node's ECS task with our ADOT container. We'll instrument one virtual node at a time to explain some important points, especially around tracing in the next section. We need to incorporate this image into our task definition - specifically, here's what we need to add.

  • We need to update our task's CPU and memory budget to provide the new container with some resources to run.
```json
...
    "cpu": "512",
    "memory": "1 GB",
...
```
  • We also need to incorporate another container definition for ADOT. This will go in the containerDefinitions section.
```json
...
"containerDefinitions": [
    ...,
    {
        "name": "adot-collector",
        "image": "<your-registry>/adot-envoy",
        "cpu": 256,
        "memory": 512,
        "essential": true,
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "color-mesh",
                "awslogs-region": "<your-region>",
                "awslogs-stream-prefix": "blue-adot-collector"
            }
        },
        "healthCheck": {
            "command": [ "/healthcheck" ],
            "interval": 5,
            "timeout": 6,
            "retries": 5,
            "startPeriod": 1
        }
    }
],
...
```
  • To ensure that our ADOT collector is running and ready to route observability signals before our other tasks start running, we'll also configure a startup dependency for the Envoy container to prevent it from starting until ADOT is healthy.
```json
...
"dependsOn": [
    {
        "containerName": "adot-collector",
        "condition": "HEALTHY"
    }
],
...
```

All in all, here's what the task definition for our blue virtual node would look like after we make the necessary changes to incorporate the ADOT sidecar:

```json
// 06a-blue-td-with-adot.json
{
    "family": "blue_node",
    "taskRoleArn": "<your-task-role-arn>",
    "executionRoleArn": "<your-task-execution-role-arn>",
    "networkMode": "awsvpc",
    "containerDefinitions": [
        {
            "name": "blue-app",
            "image": "forsakenidol/colorapp",
            "cpu": 128,
            "memory": 128,
            "portMappings": [
                { "containerPort": 80, "protocol": "tcp", "name": "color-port" }
            ],
            "essential": true,
            "environment": [
                { "name": "COLOR", "value": "blue" }
            ],
            "dependsOn": [
                { "containerName": "envoy-sidecar", "condition": "HEALTHY" }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "color-mesh",
                    "awslogs-region": "<your-region>",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "blue-app-container"
                }
            },
            "healthCheck": {
                "command": [ "CMD-SHELL", "curl -f http://localhost/health/ || exit 1" ],
                "interval": 5,
                "timeout": 5,
                "retries": 3
            }
        },
        {
            "name": "envoy-sidecar",
            "image": "<your-envoy-image>",
            "cpu": 128,
            "memory": 384,
            "essential": true,
            "environment": [
                { "name": "APPMESH_RESOURCE_ARN", "value": "<virtual-node-arn>" }
            ],
            "secrets": [ <list-of-tls-secrets> ],
            "dependsOn": [
                { "containerName": "adot-collector", "condition": "HEALTHY" }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "color-mesh",
                    "awslogs-region": "<your-region>",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "blue-envoy-sidecar"
                }
            },
            "startTimeout": 30,
            "healthCheck": {
                "command": [ "CMD-SHELL", "curl -s http://localhost:9901/server_info | grep state | grep -q LIVE" ],
                "interval": 5,
                "retries": 3,
                "startPeriod": 10,
                "timeout": 2
            },
            "user": "1337"
        },
        {
            "name": "adot-collector",
            "image": "<your-registry>/adot-envoy",
            "cpu": 256,
            "memory": 512,
            "essential": true,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "color-mesh",
                    "awslogs-region": "<your-region>",
                    "awslogs-stream-prefix": "blue-adot-collector"
                }
            },
            "healthCheck": {
                "command": [ "/healthcheck" ],
                "interval": 5,
                "timeout": 6,
                "retries": 5,
                "startPeriod": 1
            }
        }
    ],
    "requiresCompatibilities": [ "FARGATE" ],
    "cpu": "512",
    "memory": "1 GB",
    "proxyConfiguration": {
        "type": "APPMESH",
        "containerName": "envoy-sidecar",
        "properties": [
            { "name": "IgnoredUID", "value": "1337" },
            { "name": "AppPorts", "value": "80" },
            { "name": "ProxyIngressPort", "value": "15000" },
            { "name": "ProxyEgressPort", "value": "15001" },
            { "name": "EgressIgnoredIPs", "value": "169.254.170.2,169.254.169.254" }
        ]
    }
}
```

```shell
aws ecs register-task-definition --cli-input-json file://06a-blue-td-with-adot.json
```

There's one more thing we need to configure - ADOT will require IAM permissions to write metrics to CloudWatch. We can provide those permissions by attaching the CloudWatchAgentServerPolicy AWS-managed policy to the ECS task role.
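If you're attaching the policy from the command line, a minimal sketch looks like the following - the role name here is a placeholder for whatever your task role is actually called:

```shell
aws iam attach-role-policy \
    --role-name <your-task-role-name> \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```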

Once the task definition has been configured, let's go ahead and create a new deployment of blue-svc.

```json
// 06-update-blue.json
{
    "cluster": "<cluster-name>",
    "service": "blue-svc",
    "taskDefinition": "blue_node",
    "forceNewDeployment": true
}
```

```shell
aws ecs update-service --cli-input-json file://06-update-blue.json
```

Once the service's new task is up and running, let's check the metrics that were produced. Navigate to the CloudWatch console, and from the left navigation panel, select Metrics -> All metrics -> Custom namespaces -> ECS/Envoy, which corresponds to the metric namespace we defined in the awsemf exporter.

(Image: the ECS/Envoy custom namespace and its dimension combinations in the CloudWatch Metrics console.)

If everything is configured correctly and the collector is able to send metrics to CloudWatch through the awsemf exporter, you should see a number of dimension combinations in the ECS/Envoy metric namespace. Each of these dimension combinations corresponds to a set of labels on the metrics scraped by Prometheus. You can confirm the existence of these labels by Exec-ing into the blue task's main application container, running curl localhost:9901/stats/prometheus, and grepping the output for any of the labels in the dimension combinations available to you.

For example, let's say one of the dimension combinations available to you is:

OTelLib, envoy_cluster_name, envoy_ssl_curve

(Side note: the OTelLib field is populated with the receiver from which the metric was received. Here, this is our prometheus receiver.)

You can run the following commands to look at the corresponding metrics that were scraped from Envoy's Prometheus endpoint.

```shell
curl localhost:9901/stats/prometheus | grep envoy_cluster_name
curl localhost:9901/stats/prometheus | grep envoy_ssl_curve
```

Be aware that some labels are present on multiple dimension combinations; one example above is envoy_cluster_name, meaning that searching on this label will surface a large number of metrics. Other, more specific labels will have fewer metrics associated with them. Experiment with various dimension combinations and see how the metrics were exported to CloudWatch from Envoy's Prometheus-emitting metrics endpoint.
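If you'd like to see exactly how a Prometheus sample line decomposes into the metric name, labels, and value that the pipeline turns into CloudWatch metrics and dimensions, here's a small standalone Python sketch - it is not part of the pipeline, and the sample line and its value are made up:

```python
# Standalone sketch: split one labeled line of Prometheus exposition-format
# text (the shape emitted by `curl localhost:9901/stats/prometheus`) into
# its metric name, label dictionary, and value. Label keys are what appear
# as dimension names in the ECS/Envoy CloudWatch namespace.
import re

def parse_prom_line(line: str):
    """Parse a labeled Prometheus sample into (name, labels, value)."""
    m = re.match(r'^(\w+)\{(.*)\}\s+(\S+)$', line)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    # Each key="value" pair inside the braces becomes one label.
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, value

# Made-up sample in the shape Envoy emits:
sample = 'envoy_cluster_upstream_cx_total{envoy_cluster_name="cds_egress"} 42'
name, labels, value = parse_prom_line(sample)
print(name, labels, value)
```

This only handles labeled samples - lines without braces (and comment lines starting with `#`) would need their own branch, which Envoy's real output contains plenty of.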

It's trivial to then extend this pipeline to include Amazon Managed Prometheus if you prefer to query the metrics using Prometheus-compatible APIs. Consider updating the config.yaml file to include the prometheusremotewrite exporter if you'd like to try this out - remember, a single pipeline can export to multiple backends at the same time!
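As a sketch of what that extension might look like - the workspace endpoint is a placeholder, and exact component options can vary between collector versions, so check the ADOT documentation for yours - you could add a prometheusremotewrite exporter alongside awsemf and sign requests with the sigv4auth extension:

```yaml
# config.yaml - hypothetical extension for Amazon Managed Prometheus
exporters:
  awsemf:
    ...
  prometheusremotewrite:
    endpoint: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
    auth:
      authenticator: sigv4auth
extensions:
  health_check:
  sigv4auth:
    region: <region>
service:
  extensions: [ health_check, sigv4auth ]
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ awsemf, prometheusremotewrite ]
```

Note that the same metrics pipeline now fans out to both backends, and the sigv4auth extension must be enabled under service.extensions just like health_check.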

Traces - AWS X-Ray

We've got a metrics pipeline to send Envoy statistics to CloudWatch Metrics! Now, let's incorporate a traces pipeline to send Envoy tracing data to AWS X-Ray. Envoy, once configured, will also emit trace data for traffic reaching the application's listener port without us having to configure anything at the application level.

Let's go back to the blue virtual node and instrument it for X-Ray traces. First, we need to actually configure Envoy, which exposes a number of different tracing variables for us to choose from. We want to send our trace data to X-Ray, so let's tell Envoy to generate tracing data in the X-Ray format by specifying the relevant environment variable in our task definition.

```json
// 06b-blue-td-with-adot-traces.json
...
"essential": true,
"environment": [
    { "name": "APPMESH_RESOURCE_ARN", "value": "<blue-virtual-node-arn>" },
    { "name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1" }
],
"secrets": [
...
```

```shell
aws ecs register-task-definition --cli-input-json file://06b-blue-td-with-adot-traces.json
```

Remember to add X-Ray permissions to the task role to allow ADOT to push trace data to X-Ray. The collector requires the same permissions as the X-Ray Daemon, since this is also a drop-in replacement for that component - we can satisfy this requirement by adding the AWSXRayDaemonWriteAccess AWS-managed policy to the task role.
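As before, the attachment can be scripted - the role name below is a placeholder for your own task role:

```shell
aws iam attach-role-policy \
    --role-name <your-task-role-name> \
    --policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess
```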

We then need to return to our ADOT config.yaml file and declare a handful of supporting components to ingest the trace data and send it off to the X-Ray service. Luckily for us, it's super easy to configure these components. We'll start with the receiver

```yaml
# config.yaml
receivers:
  prometheus:
    ...
  awsxray: # Zero further configuration - we will inherit the receiver's defaults.
```

then the exporter

```yaml
exporters:
  awsemf:
    ...
  awsxray:
    region: <your-task-region>
```

then declare the trace pipeline as follows.

```yaml
service:
  extensions: [ health_check ]
  pipelines:
    ...
    traces:
      receivers: [ awsxray ]
      exporters: [ awsxray ]
```

That's the entire config.yaml file completed! Build and push the ADOT collector image again, then update the blue service to use the new task definition with the tracing environment variable.

```shell
aws ecs update-service --cli-input-json file://06-update-blue.json
```

Exploring Envoy's Tracing Capability

Once the blue service has been updated for tracing, we can explore how Envoy tracing responds to requests made to the blue service, as well as requests from blue to connected backends (which, if you've been following part 1 of this guide, only constitutes the red.colors.local virtual service). There are 2 ways we can test blue's tracing capabilities.

  1. curl or otherwise make HTTP requests to the network load balancer in front of our virtual gateway. Remember that the root path is configured to serve traffic from blue.colors.local.
  2. Exec into blue's application container, and run curl red.colors.local.
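If you haven't used ECS Exec before, option 2 looks something like the following - the cluster name and task ID are placeholders for your own values, and this assumes ECS Exec is already enabled on the service:

```shell
aws ecs execute-command \
    --cluster <cluster-name> \
    --task <task-id> \
    --container blue-app \
    --interactive \
    --command "curl red.colors.local"
```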

Both of the above commands involve blue's virtual node in some capacity: #1 hits blue's listener directly, while #2 sends traffic out of the node without passing through its listener. There are 2 ways you can view the X-Ray service map of these traces in AWS:

  • In the CloudWatch console, under X-Ray traces -> Service map in the left sidebar.
  • In the X-Ray console, under Service map.

Here's what the map should currently look like in CloudWatch's service map view.

(Image: X-Ray service map showing a Client node connected to the blue virtual node.)

You may notice something missing from the service map - while the Client -> blue traffic path is here, where is the red node? Why is the traffic from blue running curl red.colors.local not being captured by Envoy's tracing?

The reason for this is twofold:

  1. When a particular virtual node is instrumented for tracing, Envoy only captures requests to that node which involve its listener. The sidecar does not capture requests made from the node that do not pass through that node's listener in any capacity - those must be handled by tracing instrumentation / configuration in the downstream node / resource that the request is being sent to.
  2. We haven't configured tracing on the red virtual node, meaning that red isn't generating traces for incoming requests to its listener.

Let's add tracing capability to red so that we can see that side of the curl request. The following 3 components are required in the task definition for red:

  1. The ENABLE_ENVOY_XRAY_TRACING environment variable for Envoy must be set to 1.
  2. Incorporate the ADOT sidecar using the same container definition we specified for blue. (Remember to change the log stream prefix for red's ADOT collector!)
  3. Define a container startup dependency so that Envoy only starts after the ADOT collector becomes healthy.

Register the new task definition and cycle the red service onto it. Once the new task comes up, Exec into blue's application container and run curl red.colors.local again - notice in the X-Ray service map that we now have a node for red, because we've configured Envoy to trace the incoming request to red's virtual node listener.

(Image: X-Ray service map now including the red node.)

You can follow a similar procedure to instrument the green and yellow virtual nodes for tracing - once configured, try running curl router.colors.local from red, as well as querying the /router path on the load balancer sitting in front of the virtual gateway, which also hits the virtual router fronting both of these virtual nodes. The service map should now have all 4 of the color nodes present.

(Image: X-Ray service map with all 4 color nodes present.)

(Side Note: You may be wondering why there is a Client node connected to red, when red doesn't have any external gateway routing as with the other 3 nodes. This is because the client here represents the source of the requests we make to the virtual router from within red. These requests are not propagations from upstream nodes; rather, Envoy has interpreted a request initiating from red with no upstream request source as having originated from outside the virtual node.)

Note that your service map may not have the nodes in the same spots, depending on how X-Ray has drawn the map - the key thing to be aware of is the connections between the nodes, as these should be the same regardless of where the nodes are actually located in the map.

Congratulations! You've reached the end of this guide on App Mesh with Amazon ECS. I hope you enjoyed this guide, and the information that you've learnt over the course of following this tutorial will serve you well as you embark on your App Mesh journey.

Thanks for reading!

Written by ForsakenIdol.