This guide is a continuation of Part 1.
Up until this point, we've been dealing with applications communicating in our mesh with unencrypted traffic. Sure, we've configured a self-referencing security group and assigned it only to the ECS tasks that implement our virtual nodes and virtual gateways within the `color-mesh` service mesh, but we also have the option to encrypt the traffic moving between virtual resources using TLS.
In this section, we'll explore how to further secure our mesh's network traffic by focusing on the original 2 virtual nodes we launched - `red` and `blue`. We'll create a Private Certificate Authority (PCA) in AWS Certificate Manager (ACM), use it to provision SSL certificates, and configure our Envoy proxy sidecars with the root CA certificate and the issued certificates to encrypt the traffic moving between ECS tasks. We'll begin with one-way TLS, not too dissimilar to the TLS mechanisms you may already be familiar with (e.g. HTTPS), and provide an example of the full pipeline incorporating ACM PCA; then we'll move on to implementing mutual TLS, which involves both client and server tasks providing certificates to one another.
Throughout this section, pay close attention to the procedures we use to configure TLS, specifically the components involved in the implementation and the difference in configuration complexity between 1-way TLS and 2-way mutual TLS (mTLS).
Let's start with one-way TLS.
Consider the following 2 entities representing our virtual nodes in communication within the service mesh.
App Mesh on ECS supports multiple avenues for retrieving the certificate material used in one-way TLS - both for the certificate the server sends to the client, and for the material the client uses to validate the server's certificate. As previously mentioned, we'll explore an implementation of TLS on App Mesh using certificates provisioned from ACM PCA.
This section was in part adapted from the App Mesh walkthrough on TLS configuration.
Before we can generate certificates, we must first create a private certificate authority, because within-mesh TLS encryption only supports private certificates.
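Creating the CA looks roughly like the following - the key algorithm, signing algorithm, and `colors.local` common name are my own choices, so substitute whatever suits your environment:

```bash
# Create a ROOT private CA and capture its ARN with jq.
export ROOT_CA_ARN=$(aws acm-pca create-certificate-authority \
  --certificate-authority-type "ROOT" \
  --certificate-authority-configuration '{
    "KeyAlgorithm": "RSA_2048",
    "SigningAlgorithm": "SHA256WITHRSA",
    "Subject": { "CommonName": "colors.local" }
  }' | jq -r '.CertificateAuthorityArn')
```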
Note the following:

- We use the `jq` command-line utility to extract values from the JSON responses provided by the AWS CLI. `jq`'s `-r` flag strips the double quotation marks from the result and provides us with just the actual value we've extracted.
- We specify a `ROOT` certificate authority, because this CA does not have material signed by a higher authority - it will be the CA issuing all subsequent certificates for our environment.

Let's have a look at what we just created.
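A describe call along these lines will show the CA's details, including its current status:

```bash
aws acm-pca describe-certificate-authority \
  --certificate-authority-arn $ROOT_CA_ARN
```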
At this point in time, our private CA will be in the `PENDING_CERTIFICATE` state because the CA is missing a root certificate. We need to satisfy the certificate signing request (CSR) for our root CA by issuing a certificate using our CA (which will result in a self-signed root certificate) and importing that certificate back into PCA as the CA certificate.
We'll start by fetching the CSR for our CA.
With version 2 of the AWS CLI, the CSR must be Base64 encoded before we can pass it to ACM PCA.
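Roughly, fetching the CSR and issuing the root certificate against it looks like the following sketch - the 10-year validity and the exact Base64 handling are my own assumptions, so adjust them for your platform and CLI configuration:

```bash
# Fetch the CSR for the root CA and Base64-encode it for the v2 CLI.
aws acm-pca get-certificate-authority-csr \
  --certificate-authority-arn $ROOT_CA_ARN \
  --query Csr --output text > ca.csr
base64 ca.csr > ca.csr.b64

# Issue a self-signed root certificate against the CSR using the root CA template.
export ROOT_CERT_ARN=$(aws acm-pca issue-certificate \
  --certificate-authority-arn $ROOT_CA_ARN \
  --csr file://ca.csr.b64 \
  --signing-algorithm SHA256WITHRSA \
  --template-arn arn:aws:acm-pca:::template/RootCACertificate/V1 \
  --validity Value=10,Type=YEARS | jq -r '.CertificateArn')
```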
Of note: for more information on the `RootCACertificate` certificate template, refer to the PCA documentation.

The certificate we've just issued will be the root CA's certificate, which will form the root of the certificate chain for all subsequent certificates we sign using this CA. We now need to fetch this certificate, encode it in Base64 (again, because we're using version 2 of the AWS CLI), and import it as the CA certificate.
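A sketch of those two calls:

```bash
# Retrieve the issued root certificate and Base64-encode it for the v2 CLI...
aws acm-pca get-certificate \
  --certificate-authority-arn $ROOT_CA_ARN \
  --certificate-arn $ROOT_CERT_ARN \
  --query Certificate --output text > root-ca.pem
base64 root-ca.pem > root-ca.pem.b64

# ...then import it back into PCA as the CA's own certificate.
aws acm-pca import-certificate-authority-certificate \
  --certificate-authority-arn $ROOT_CA_ARN \
  --certificate file://root-ca.pem.b64
```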
Now, we can check the state of our private CA - the status should've transitioned to `ACTIVE`.
Wonderful - we now have a private CA that we can use to generate certificates for the virtual nodes within our mesh! Let's request that a certificate be generated.
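A sketch of that request - the domain name is a placeholder of my own (a wildcard so the same certificate can be reused later in this section), so substitute a name that matches your mesh's service discovery setup:

```bash
export COLOR_CERT_ARN=$(aws acm request-certificate \
  --domain-name "*.colors.local" \
  --certificate-authority-arn $ROOT_CA_ARN \
  --query CertificateArn --output text)
```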
Give the certificate a few seconds to be generated, then we can update the `red` virtual node to use this certificate to terminate TLS traffic on its port 80 listener. Note that we saved the certificate ARN to an environment variable, but because of the layout and complexity of this file, it's much easier to just insert the value here. Every other configuration value for the `red` virtual node remains the same - all we're doing is adding TLS termination on the listener.
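For reference, the listener portion of `red`'s updated spec ends up looking something like this (with the real certificate ARN pasted in place of the placeholder):

```json
"listeners": [
  {
    "portMapping": { "port": 80, "protocol": "http" },
    "tls": {
      "mode": "STRICT",
      "certificate": {
        "acm": { "certificateArn": "<COLOR_CERT_ARN value goes here>" }
      }
    }
  }
]
```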
At this point in time, we need to update our ECS task role to add 2 permissions: `acm:ExportCertificate`, so that Envoy can fetch the virtual node listener's certificate from ACM, and `acm-pca:GetCertificateAuthorityCertificate` - the latter is not used now, but will be required in the next section.
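A sketch of the corresponding IAM policy statement - you can scope the resources down further than the wildcard used here:

```json
{
  "Effect": "Allow",
  "Action": [
    "acm:ExportCertificate",
    "acm-pca:GetCertificateAuthorityCertificate"
  ],
  "Resource": "*"
}
```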
Now, we can Exec into the main application container (not the Envoy sidecar) for both the `red` and `blue` nodes to run the following commands to test the SSL configuration.
1. Run `curl localhost:9901/stats | grep connected` on both nodes to check that Envoy is still connected to the App Mesh control plane.
2. Run `curl localhost:9901/stats | grep ssl` to check the SSL metrics for each node. Observe the following, and bear in mind we haven't made any requests between the nodes at this point in time:
   - `red` will recognize the Envoy listener in the same task on port 15000, as this port corresponds to Envoy; despite port 15000 not being open on either ECS task, this is the port to which traffic is proxied before leaving the task.
   - `blue` will recognize `red`'s downstream presence as an Egress target in the mesh, and will present statistics on port 80.
3. Make a request from `blue` to `red` - `curl red.colors.local` (assuming `blue` still has `red`'s virtual service as a backend from part 2 of this tutorial). Observe that the `curl` request from `blue` to `red` is currently succeeding.

At present, `blue` is not validating the TLS certificate provided by `red`. The `blue` node will still encrypt traffic using `red`'s certificate, but it will not check the authenticity of this certificate because we haven't configured this in `blue`'s client policy. Let's observe which SSL metrics have incremented as a result of this request.
- The `.ssl.handshake` metric will be set to the same value for both nodes after a successful SSL handshake has been negotiated.
- A new handshake is not negotiated for every request from `blue` to `red`; rather, subsequent requests within a short window will use the same SSL material from the previous handshake.
- The `ssl.no_certificate` metric refers to successful SSL negotiations where a certificate was not provided by the client (which would constitute mutual TLS). Currently, the certificate is only being provided by the server, which means that this metric should be incrementing at the same time as the `handshake` count.
- The `.ssl.ciphers.<cipher-id>` metric tracks the number of times that a particular cipher was used to encrypt the secret key during a TLS handshake.

We can change `blue`'s behavior (which is currently using the certificate without validating it) by adding the ability for `blue` to validate `red`'s certificate, thus improving our security posture.
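In the spec, this amounts to adding a client policy validation context that trusts our private CA. A sketch of the relevant fragment, shown here under `backendDefaults` (attaching it to the specific backend works just as well):

```json
"backendDefaults": {
  "clientPolicy": {
    "tls": {
      "validation": {
        "trust": {
          "acm": {
            "certificateAuthorityArns": [ "<ROOT_CA_ARN value goes here>" ]
          }
        }
      }
    }
  }
}
```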
Once again - observe that the only thing we've changed in `blue`'s configuration is to pass the ARN of the private CA we created in the previous subsection. At this point, return to the Exec shells we have open for our `red` and `blue` nodes, and `curl red.colors.local` from the `blue` node. Once again, run `curl localhost:9901/stats | grep ssl`, and note the following statistics.
- The `.ssl.handshake` metric will increment if you waited a sufficient amount of time since the last request from `blue` to `red`.
- The `.ssl.no_certificate` metric is still incrementing alongside the `.ssl.handshake` metric.
- The `.ssl.ciphers.<cipher-id>` metric is still incrementing as well.

At this point in time, we've configured TLS encryption on the traffic between the `blue` client node and the `red` server node. Envoy negotiates TLS on our application's behalf, and while the traffic from Envoy to the application container is unencrypted, this is generally considered to be acceptable because the unencrypted traffic is local to the task.
However, we can also configure mutual TLS, in which the server verifies the authenticity of the client it's communicating with by requiring that the client also provide a certificate for validation. Mutual TLS, when properly implemented, requires that the client, in its initial request to the server, provide its own certificate, which the server validates using a local certificate trust chain. In the following sections, we'll add the infrastructure necessary to implement mutual TLS. These steps will reuse the certificate we provisioned for the server, but you can just as easily request another certificate from our private CA for the client and use that instead - just remember to update the `COLOR_CERT_ARN` environment variable with the ARN of the new certificate in the subsequent steps.
Unlike with our server certificate, Envoy doesn't support fetching key material from ACM when configuring the client Envoy proxy to provide its own certificate in its request to the server. Only a local file path or an SDS secret is supported, and because SPIRE's SDS implementation isn't available in ECS or Fargate (at the time of writing, November 2023), we need to configure the certificate material on the Envoy sidecar itself. There are 2 ways we can load the certificate onto Envoy: we could bake the certificate and private key files directly into a custom Envoy image at build time, or we could store the material in SSM Parameter Store and write it into the container when it starts. As you may have guessed, we'll be exploring option #2, since it represents a more secure and streamlined approach to specifying the client certificate and private key material.
Download and unpack the certificate, certificate chain, and the private key into separate files. These steps will make use of the `jq` command-line utility to parse the output. We need to create an encoded passphrase file to encrypt the private key material in transit - you can use any alphanumeric passphrase here.
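A sketch of these steps - the passphrase value and the intermediate file names are placeholders of my own choosing:

```bash
# Create a passphrase file and Base64-encode it (blob parameters in the v2 CLI
# expect Base64 by default).
echo -n "MySecretPassphrase123" > passphrase.txt
base64 passphrase.txt > passphrase.b64

# Export the certificate, chain, and (encrypted) private key from ACM...
aws acm export-certificate \
  --certificate-arn $COLOR_CERT_ARN \
  --passphrase file://passphrase.b64 > certificate-info.json

# ...and split the response into separate PEM files.
jq -r '.Certificate'      certificate-info.json > cert.pem
jq -r '.CertificateChain' certificate-info.json > cert-chain.pem
jq -r '.PrivateKey'       certificate-info.json > private-key-encrypted.pem
```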
These steps will produce several local files, including `certificate-info.json`, `passphrase*`, and the `*.pem` files.
We also need to decrypt the private key material by specifying the passphrase we used when fetching the certificate material. The command below will make use of the `openssl` utility to do this.
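Something like the following works, assuming the file names from the sketch above:

```bash
# Decrypt the exported private key using the same passphrase.
openssl rsa \
  -in private-key-encrypted.pem \
  -out private-key.pem \
  -passin file:passphrase.txt
```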
Now, head over to SSM Parameter Store and store the following 6 parameters in SSM as text entries, which we will update the ECS task definitions to reference. The first 3 parameters contain the file contents.

- `/acm/cert` - This will store the value within the `cert.pem` file.
- `/acm/cert-chain` - This will store the value within the `cert-chain.pem` file.
- `/acm/private-key` - This will store the value within the `private-key.pem` file.

Here's an example command to load the contents of `cert.pem` into the `/acm/cert` parameter, using a `SecureString` datatype in SSM. You must execute this command in the same directory as you used for the certificate commands above.
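A sketch of that command (standard-tier SSM parameters cap out at 4 KB, which is usually enough for a single PEM file, but check your values):

```bash
aws ssm put-parameter \
  --name "/acm/cert" \
  --value "$(cat cert.pem)" \
  --type SecureString
```

Repeat the same pattern for `/acm/cert-chain` (from `cert-chain.pem`) and `/acm/private-key` (from `private-key.pem`).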
The next 3 parameters will store the paths to the relevant files on the Envoy container. The files won't yet exist at image build time; rather, we'll be creating the files at runtime. Here are the values I'll be using for each, but feel free to substitute your own.
- `/acm/cert-path` - `/keys/cert.pem`
- `/acm/cert-chain-path` - `/keys/cert-chain.pem`
- `/acm/private-key-path` - `/keys/private-key.pem`

Here's an example command to load the file paths into SSM using a `String` datatype.
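For example:

```bash
aws ssm put-parameter --name "/acm/cert-path"        --value "/keys/cert.pem"        --type String
aws ssm put-parameter --name "/acm/cert-chain-path"  --value "/keys/cert-chain.pem"  --type String
aws ssm put-parameter --name "/acm/private-key-path" --value "/keys/private-key.pem" --type String
```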
Note: if you're running these commands from Git Bash on Windows, you may encounter a `ValidationException` with the following reason: `Parameter name must be a fully qualified name.` This happens because Git Bash infers a root path for the `/keys/<file>.pem` value instead of just inserting the value. It's a known issue with values beginning with `/`, which arises due to the methods employed by Git Bash to provide an experience as close to the native Bash experience as possible. To work around this issue temporarily (and avoid breaking other slash-based command-line options), simply prefix the `aws ssm put-parameter` commands with `MSYS_NO_PATHCONV=1` to disable root path inferencing for only that command. This is what the full command will look like, as an example, using the command above.
MSYS_NO_PATHCONV=1 aws ssm put-parameter --name "/acm/cert-path" --value "/keys/cert.pem" --type String
Once these 6 parameters have been configured, we can discuss how these parameters should be passed to Envoy.
Previously, we were using the Envoy image that was released by App Mesh. This version does not already know how to use any certificate material we pass to it, nor will it have the certificate material we need by default. We need to write the certificate, certificate chain, and private key material into the Envoy container at startup - for this, we create a very simple Dockerfile that will work in conjunction with the SSM-sourced environment variables we pass to it.
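Here's a minimal sketch of what such a Dockerfile can look like. The environment variable names (`CERT`, `CERT_CHAIN`, `PRIVATE_KEY`, and their `*_PATH` counterparts) are my own choices - they just need to match the names you wire up in the task definition's `secrets` section later on:

```dockerfile
FROM public.ecr.aws/appmesh/aws-appmesh-envoy:v1.27.0.0-prod

# Create the folder that will hold the certificate material, and open up its
# permission modes so the container user can write the files at runtime.
RUN mkdir -p /keys && chmod 777 /keys

# The certificate values and their destination paths are only injected at
# runtime (via the task definition), so the files are written when the
# container starts, just before handing off to the App Mesh agent.
CMD ["/bin/sh", "-c", "echo \"$CERT\" > \"$CERT_PATH\" && echo \"$CERT_CHAIN\" > \"$CERT_CHAIN_PATH\" && echo \"$PRIVATE_KEY\" > \"$PRIVATE_KEY_PATH\" && /usr/bin/agent"]
```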
Let's break down this Dockerfile.

- After creating the `/keys` folder (which you should substitute with your own folder if you opted to select a different path for the certificate and private key material), we need to update the folder's permission modes to allow our user to write to this folder.
- You can find the `CMD` this image has by default in the App Mesh documentation - `/usr/bin/agent`. Because we can't specify the file initialization commands at build time (the environment variables are only passed at runtime), we prefix the container's executable with the `echo` statements so that the files are only written when this container is run.
- We use double quotes around the variables in the `echo` commands to preserve newlines when writing these values to file. If we don't use double quotes here, the newlines will turn into spaces in the resulting file, which we don't want to happen.

Remember - this guide was built on version 1.27.0.0 of the Envoy image - feel free to substitute the most up-to-date version you are using.
Build and push this image to a remote repository. We will then need to create a new revision of our ECS task definition for both the `red` and `blue` virtual nodes to accomplish 2 things: use our new custom Envoy image, and pass the certificate material from SSM to the Envoy container.

The first is easily done by changing the `image` entry for our Envoy container to the name of your custom image.

The second is done by adding the `secrets` task definition parameter to the Envoy container specification (not the application container!) with a list of environment variables and the ARNs of their corresponding parameters from SSM:
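Assuming the environment variable names from the Dockerfile sketch above, the `secrets` block looks something like this (substitute your own region and account ID):

```json
"secrets": [
  { "name": "CERT",             "valueFrom": "arn:aws:ssm:<region>:<account-id>:parameter/acm/cert" },
  { "name": "CERT_CHAIN",       "valueFrom": "arn:aws:ssm:<region>:<account-id>:parameter/acm/cert-chain" },
  { "name": "PRIVATE_KEY",      "valueFrom": "arn:aws:ssm:<region>:<account-id>:parameter/acm/private-key" },
  { "name": "CERT_PATH",        "valueFrom": "arn:aws:ssm:<region>:<account-id>:parameter/acm/cert-path" },
  { "name": "CERT_CHAIN_PATH",  "valueFrom": "arn:aws:ssm:<region>:<account-id>:parameter/acm/cert-chain-path" },
  { "name": "PRIVATE_KEY_PATH", "valueFrom": "arn:aws:ssm:<region>:<account-id>:parameter/acm/private-key-path" }
]
```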
We also need to configure our ECS task execution role with permissions to SSM to allow ECS to successfully query the aforementioned parameters and initialize them inside the Envoy container. This is easily done by attaching the `AmazonSSMReadOnlyAccess` AWS-managed policy to the task execution role.
Once that's done, update the `red-svc` and `blue-svc` ECS services to force a new deployment onto the new task definition revision you've just created.
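Something like the following does the trick - the cluster and task definition family names here are placeholders for your own:

```bash
aws ecs update-service --cluster color-mesh-cluster --service red-svc \
  --task-definition red-task --force-new-deployment
aws ecs update-service --cluster color-mesh-cluster --service blue-svc \
  --task-definition blue-task --force-new-deployment
```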
At this point in time, Envoy's operation has not changed - we're still using one-way TLS, with the server's certificate and the client-side validation material both sourced from ACM, and the local files we initialized through our custom Envoy image have not yet been put to use in configuring mutual TLS. Now, we need to update the virtual node settings for `red` and `blue`:

- `red` will require a client TLS certificate in any requests made to its listener, and will validate the incoming client certificate using the local certificate chain located at the path defined by the `CERT_CHAIN_PATH` environment variable.
- `blue` will provide a certificate during its initial request to `red`, using the local certificate and private key files located at the paths defined by their corresponding environment variables.

We'll apply these configurations in steps to demonstrate what happens when only one side of the mutual TLS configuration has been configured. Typically you'd configure the client certificate before configuring server validation, but we'll reverse the order here to demonstrate what happens in a failed client certificate verification scenario.
To start, let's update `red` to require a TLS certificate from the requesting client while operating in `STRICT` mode. Copy the contents of `05c` to a new file, and under the `listeners -> tls` field, add the following specification.
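A sketch of the resulting `tls` block once the validation context is added - the `certificate` portion is unchanged from the one-way TLS configuration, and the chain path matches the value we stored in `/acm/cert-chain-path`:

```json
"tls": {
  "mode": "STRICT",
  "certificate": {
    "acm": { "certificateArn": "<COLOR_CERT_ARN value goes here>" }
  },
  "validation": {
    "trust": {
      "file": { "certificateChain": "/keys/cert-chain.pem" }
    }
  }
}
```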
Before we update `blue`, let's attempt to `curl red.colors.local` from `blue` to see what happens. Wait until the `.server_ssl_socket_factory.ssl_context_update_by_sds` metric has incremented on `red`'s statistics, indicating that `red`'s Envoy sidecar has received the updated SSL specification from the App Mesh control plane, before running `curl` from `blue`. If `blue` succeeds in `curl`ing and the `.ssl.handshake` metric has not increased, `blue` will still be using the previous SSL handshake material. Give it a few minutes and try again.

You'll know that `red` has received the updated SSL context when you attempt to run the `curl` command and encounter an error similar to the following:
When this happens, check the `.ssl.connection_error` statistic on the `blue` node - this metric will increase every time you make a request to `red` that fails due to SSL issues.

Then, go to the `red` node's shell and check the `.ssl.fail_verify_no_cert` metric. This metric tracks the SSL connection failures that occur when the `red` node's listener expects the incoming client request to provide a certificate, but the client does not provide one. These connections will fail because the `red` listener is operating in `STRICT` TLS mode - in a production scenario, to avoid downtime when configuring the `blue` client, you'd swap `red` to operate in `PERMISSIVE` mode to allow connections without SSL, if SSL is not possible for a certain amount of time.
Now, let's update `blue` to provide a client certificate on the initial request to its authorized backends. To do this, we provide the `certificate` entry under the `clientPolicy -> tls` field - copy the contents of `05d` to a new file and add the following.
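A sketch of the resulting `clientPolicy -> tls` block - the `certificate` entry points at the local files we wrote into the Envoy container, and the validation context is unchanged:

```json
"clientPolicy": {
  "tls": {
    "certificate": {
      "file": {
        "certificateChain": "/keys/cert.pem",
        "privateKey": "/keys/private-key.pem"
      }
    },
    "validation": {
      "trust": {
        "acm": {
          "certificateAuthorityArns": [ "<ROOT_CA_ARN value goes here>" ]
        }
      }
    }
  }
}
```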
Note: This node may take a while to update. You'll know Envoy has received the new specification when the `.client_ssl_socket_factory.ssl_context_update_by_sds` metric increments.

Don't feel misled by the `certificateChain` parameter name when we update the `blue` virtual node. It may seem like this field is asking for a full certificate chain, but a chain would only be valid here if it included the certificate we exported - and that's not the case when exporting a certificate from ACM, which separates the certificate from the certificate chain. In TLS handshakes, the client tends to send only its own certificate and not the whole chain, so we point this field at the certificate file.
Once ready, `curl` the `red` node again from `blue`. Pay attention to the following statistics:

- `.ssl.handshake` should increment on both virtual nodes if the `curl` was successful.
- On `red`, note that `.ssl.no_certificate` has not incremented - this is because SSL was successfully negotiated with a client certificate.

Envoy aims to make the TLS implementation invisible to the downstream application, which is why these metrics are the only visibility into the status of TLS negotiation from within the task.
Congratulations! You've successfully configured `red` and `blue` to negotiate TLS both 1-way and mutually, and made your way through a highly complex chapter of this guide - you should feel proud of yourself for making it this far. While we've only demonstrated TLS configuration between the `red` and `blue` nodes, if you'd like to experiment with TLS in App Mesh and further your understanding of the configuration, here are some additional things you may wish to consider:

- Configuring TLS on the traffic the virtual gateway serves to the `blue` node at the gateway route's root path `/`.

Now, let's move onto a discussion on observability.
The Envoy proxy exposes a lot of data and customisability for end-user monitoring solutions. We've already seen some of the potential of the Envoy administrative interface when we had a look at the SSL metrics during mutual TLS configuration, and during the brief aside at the end of section 2, but there's so much more that Envoy allows us to do in terms of observability without needing to configure anything at the application level.
There are 3 core types of observability signals - logs, metrics, and traces - commonly referred to as the Three Pillars of Observability. We've already configured logging to CloudWatch Logs in our task definition using the `awslogs` driver. In this section, we're going to build an implementation-agnostic observability pipeline using the AWS Distro for OpenTelemetry (ADOT) to ingest metrics and traces from our Envoy sidecars in the `color-mesh` service mesh. ADOT allows us to mix-and-match different types of signal sources and destinations, and drop in existing observability solutions from 3rd-party backends which you may already be using.
We'll be touching briefly on some of the core ADOT components in this section. For an in-depth understanding of ADOT's potential, check out the following guide on Observability with OpenTelemetry on Amazon ECS.
Let's start by initializing the `config.yaml` file which we will use to customize the components in our ADOT sidecar. The skeleton we'll be using looks like this.
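Something along these lines:

```yaml
receivers:

processors:

exporters:

extensions:

service:
  pipelines:
```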
Remember how we were running `curl localhost:9901/stats` from our application container to get statistics relevant to the Envoy sidecar in our `color` application tasks from port 9901? If we append `/prometheus` to the `curl` address (making `localhost:9901/stats/prometheus`), we can retrieve those statistics in a Prometheus-compatible format - give it a try! Let's declare a `prometheus` receiver to scrape this endpoint. This is a drop-in replacement for existing Prometheus sidecars and will be configured as such - notice that we're using the same syntax.
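A sketch of the receiver - the job name and scrape interval here are my own choices:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "envoy-stats-prometheus"
          scrape_interval: 20s
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ["localhost:9901"]
```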
You can name the `job_name` anything you'd like - I've opted to name it based on the endpoint and type of data we're scraping. The scraped `target` is over the same network interface as this container, and the `metrics_path` dictates the path on this `target` to be scraped.
Scraping the endpoint isn't enough, however - we then need to send the data to a backend. Let's try sending the Envoy metric signals to CloudWatch Metrics by configuring the following exporter.
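Something like the following - the namespace matches what we'll browse to in CloudWatch later, while the log group name is a placeholder of my own:

```yaml
exporters:
  awsemf:
    namespace: "ECS/Envoy"
    log_group_name: "/metrics/envoy-emf"
    dimension_rollup_option: "NoDimensionRollup"
```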
(Side note: `NoDimensionRollup` indicates that CloudWatch should interpret incoming dimension combinations as-is. This is to prevent metric value duplication, which results in an unnecessarily higher rate of ingestion.)
The Embedded Metric Format (EMF) specification is located here if you'd like to learn more about why CloudWatch uses EMF.
While we're at it, let's add the collector's health check extension too.
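The declaration is minimal:

```yaml
extensions:
  health_check:
```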
At this point, we've declared and configured our components, but we haven't enabled them in the collector. Let's enable them by configuring a pipeline with our 2 components, and also enable the health check extension.
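A sketch of the `service` section tying it all together:

```yaml
service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [awsemf]
```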
We now need to pass this configuration file to ADOT. Let's do that by creating a new image using ADOT as a base, and pushing this image to a remote repository.
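A minimal sketch of that image - the path the configuration is copied to is my own choice; it just needs to match the `--config` flag passed to the collector:

```dockerfile
FROM public.ecr.aws/aws-observability/aws-otel-collector:latest

# Bake our collector configuration into the image...
COPY config.yaml /etc/ecs/otel-config.yaml

# ...and point the collector at it when the container starts.
CMD ["--config=/etc/ecs/otel-config.yaml"]
```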
The image's new command will tell it to load the configuration file we just copied into the image.
Let's start by instrumenting the `blue` virtual node's ECS task with our ADOT container. We'll instrument one virtual node at a time to explain some important points, especially around tracing in the next section. We need to incorporate this image into our task definition - specifically, we need to add the ADOT collector container (built from our custom image) as a new entry in the `containerDefinitions` section. All in all, here's what the task definition for our `blue` virtual node would look like after we make the necessary changes to incorporate the ADOT sidecar:
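The full task definition isn't reproduced here, but the new container entry is a sketch along these lines - the container name, image URI, log group, and stream prefix are all placeholders for your own values:

```json
{
  "name": "adot-collector",
  "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/custom-adot:latest",
  "essential": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/color-mesh",
      "awslogs-region": "<region>",
      "awslogs-stream-prefix": "blue-adot"
    }
  }
}
```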
There's one more thing we need to configure - ADOT will require IAM permissions to write metrics to CloudWatch. We can provide those permissions by attaching the `CloudWatchAgentServerPolicy` AWS-managed policy to the ECS task role.
Once the task definition has been configured, let's go ahead and create a new deployment of `blue-svc`.
Once the service's new task is up and running, let's check the metrics that were produced. Navigate to the CloudWatch console, and from the left navigation panel, select `Metrics -> All metrics -> Custom namespaces -> ECS/Envoy`, which corresponds to the metric namespace we defined in the `awsemf` exporter.
If the exporter was configured correctly and OTEL, through the `awsemf` exporter, is able to send metrics to CloudWatch, you should see a number of dimension combinations in the `ECS/Envoy` metric namespace. Each of these metric combinations corresponds to an existing label on the metrics scraped by Prometheus. You can confirm the existence of these labels by `Exec`ing into the `blue` task's main application container, running `curl localhost:9901/stats/prometheus`, and `grep`ping the output for any of the labels in the dimension combinations available to you.
For example, let's say one of the dimension combinations available to you includes the `OTelLib` and `envoy_cluster_name` dimensions. (Side note: the `OTelLib` field is populated with the receiver from which the metric was received - here, this is our `prometheusreceiver`.)
You can run the following commands to look at the corresponding metrics that were scraped from Envoy's Prometheus endpoint.
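For example, any of the dimension names can be grepped for as a Prometheus label:

```bash
curl -s localhost:9901/stats/prometheus | grep 'envoy_cluster_name='
```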
Be aware that some labels are present on multiple dimension combinations; one example above is `envoy_cluster_name`, meaning that there will be a lot of metrics presented when searching on this label. Other labels at higher levels of specificity will have fewer metrics associated with them. Experiment with various dimension combinations and see how the metrics were exported to CloudWatch from Envoy's Prometheus-emitting metrics endpoint.
It's trivial to then extend this pipeline to include Amazon Managed Prometheus if you prefer to query the metrics using Prometheus-compatible APIs. Consider updating the `config.yaml` file to include the `prometheusremotewrite` exporter if you'd like to try this out - remember, a single pipeline can export to multiple backends at the same time!
We've got a metrics pipeline to send Envoy statistics to CloudWatch Metrics! Now, let's incorporate a traces pipeline to send Envoy tracing data to AWS X-Ray. Envoy, once configured, will also emit trace data for traffic reaching the application's listener port without us having to configure anything at the application level.
Let's go back to the `blue` virtual node and instrument it for X-Ray traces. First, we need to actually configure Envoy, which exposes a number of different tracing variables for us to choose from. We want to send our trace data to X-Ray, so let's tell Envoy to generate tracing data in the X-Ray format by specifying the relevant environment variable in our task definition.
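In the Envoy container's definition, that's an entry along these lines (the variable itself comes up again when we instrument `red` later on):

```json
"environment": [
  { "name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1" }
]
```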
Remember to add X-Ray permissions to the task role to allow ADOT to push trace data to X-Ray. The collector requires the same permissions as the X-Ray Daemon, since this is also a drop-in replacement for that component - we can satisfy this requirement by adding the `AWSXRayDaemonWriteAccess` AWS-managed policy to the task role.
We then need to return to our ADOT `config.yaml` file and declare a handful of supporting components to ingest the trace data and send it off to the X-Ray service. Luckily for us, it's super easy to configure these components: we declare a receiver and an exporter, then wire them together in a trace pipeline, as sketched below.
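A sketch of all three additions (merged into the existing receivers, exporters, and service sections of the file) - the receiver listens on the standard X-Ray daemon UDP port, which is where Envoy sends its segments:

```yaml
receivers:
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp

exporters:
  awsxray:

service:
  pipelines:
    traces:
      receivers: [awsxray]
      exporters: [awsxray]
```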
That's the entire `config.yaml` file completed! Build and push the ADOT collector image again, then update the `blue` service to use the new task definition with the tracing environment variable.
Once the `blue` service has been updated for tracing, we can explore how Envoy tracing responds to requests made to the `blue` service, as well as requests from `blue` to connected backends (which, if you've been following part 1 of this guide, only constitutes the `red.colors.local` virtual service). There are 2 ways we can test `blue`'s tracing capabilities.
1. `curl` or otherwise make HTTP requests to the network load balancer in front of our virtual gateway. Remember that the root path is configured to serve traffic from `blue.colors.local`.
2. `Exec` into `blue`'s application container, and run `curl red.colors.local`.

Both of the above commands incorporate `blue`'s virtual node in some capacity; #1 is hitting the listener directly, while #2 is serving traffic out of the node, off the listener. There are 2 ways you can view the X-Ray service map of these traces in AWS:

1. From the CloudWatch console, select `X-Ray traces -> Service map` in the left sidebar.
2. From the X-Ray console, select `Service map`.

Here's what the map should currently look like in CloudWatch's service map view.
You may notice something missing from the service map - while the `Client -> blue` traffic path is here, where is the red node? Why is the traffic from `blue` running `curl red.colors.local` not being captured by Envoy's tracing?
The reason for this is twofold:

- We haven't enabled X-Ray tracing on the `red` virtual node, meaning that `red` isn't generating traces for incoming requests to its listener.
- `red`'s task doesn't yet have an ADOT collector sidecar to receive the trace data and forward it to X-Ray.

Let's add tracing capability to `red` so that we can see that side of the `curl` request. The following 3 components are required in the task definition for `red`:
- The `ENABLE_ENVOY_XRAY_TRACING` environment variable for Envoy must be set to `1`.
- An ADOT collector sidecar container, configured in the same way as the one we added to `blue`.
- A log configuration for the new collector container. (Remember to change the log stream prefix for `red`'s ADOT collector!)

Launch the task definition and cycle the `red` service onto the new version of the task definition. Once the new task comes up, `Exec` into `blue`'s application container and run `curl red.colors.local` again - notice in the X-Ray service map that we now have a node for `red`, because we've configured Envoy to trace the incoming request to `red`'s virtual node listener.
You can follow a similar procedure to instrument the `green` and `yellow` virtual nodes for tracing - once configured, try running `curl router.colors.local` from `red`, as well as querying the `/router` path on the load balancer sitting in front of the virtual gateway, which also hits the virtual router fronting both of these virtual nodes. The service map should now have all 4 of the color nodes present.
(Side Note: You may be wondering why there is a `Client` node connected to `red`, when `red` doesn't have any external gateway routing as with the other 3 nodes. This is because the client here represents the source of the requests we make to the virtual router from within `red`. These requests are not propagations from upstream nodes; rather, Envoy has interpreted a request initiating from `red` with no upstream request source as having originated from outside the virtual node.)
Note that your service map may not have the nodes in the same spots, depending on how X-Ray has drawn the map - the key thing to be aware of is the connections between the nodes, as these should be the same regardless of where the nodes are actually located in the map.
Congratulations! You've reached the end of this guide on App Mesh with Amazon ECS. I hope you enjoyed this guide, and the information that you've learnt over the course of following this tutorial will serve you well as you embark on your App Mesh journey.
Thanks for reading!
Written by ForsakenIdol.