# Observability - OpenTelemetry and Amazon ECS

[**OpenTelemetry**](https://opentelemetry.io/), commonly abbreviated to **OTEL**, is a project sponsored by the Cloud Native Computing Foundation (CNCF) and an extremely powerful solution for managing traces, metrics, and logs in a highly observable system, with a single collector that processes all of these signals and sends them through to the desired backends. The [**AWS Distro for OpenTelemetry**](https://aws.amazon.com/otel/) (**ADOT**) is a distribution of OTEL streamlined for use with AWS services, with support for several 3rd-party integrations such as Jaeger, Datadog, New Relic, Dynatrace, Splunk, and more. With its wide ecosystem of integrations, OpenTelemetry comes close to a one-size-fits-all solution for container application observability.

Despite all of this, however, the initial learning curve for OTEL can be very steep. Many of the resources for getting started with OTEL-based observability solutions jump straight from instrumentation documentation to a fully-fledged sample application (some examples [here](https://opentelemetry.io/docs/demo/) and [here](https://aws-otel.github.io/docs/setup/ecs)) which you don't get to build and understand step-by-step. While this can be useful for fast-tracking users to understand what an OTEL-instrumented application should look like, it doesn't allow you to dive deep into the power of OTEL, or develop an understanding of how to set up an observability solution by yourself.

In this guide, we'll start with an introduction to the key concepts behind observability, and a comparison between container infrastructure with and without OTEL. I'll then walk you through the entire process to build a simple 2-tier Flask application, instrument it with the AWS Distro for OpenTelemetry, configure the ADOT sidecar, and deploy the entire application on [**Amazon Elastic Container Service**](https://aws.amazon.com/ecs/) (ECS). You'll develop a thorough understanding of the entire process, from initial application development, to application instrumentation, to processing the observability signals emitted by your application in various backends across AWS and 3rd-party services. We'll be using open-source solutions wherever possible, and I'll focus on some of the more popular solutions that already exist on AWS, such as [**Amazon Managed Prometheus**](https://aws.amazon.com/prometheus/), [**X-Ray**](https://aws.amazon.com/xray/), and [**CloudWatch Logs**](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html), as well as demonstrating how to incorporate supported 3rd-party backends which you may already have in place.

Let's get started.

## What is Observability?

Before we can learn OpenTelemetry, we must first understand the concept of **observability**. The [Cloud Native Glossary](https://glossary.cncf.io/observability/) defines observability as a system property that determines the degree to which a system can generate actionable insights as system outputs. It allows users to understand a system's state from these outputs and take (corrective) action.

The actionable insights generated by a system can be placed into 3 different groups:

- **Logs** - Text records with metadata that relate to system execution.
- **Metrics** - Runtime measurements of measurable system values relating to a particular service.
- **Traces** - Records that track the flow of a request through the different components of an application.
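To make the "metadata" part of that logs definition concrete, here's a minimal sketch using Python's standard `logging` module - the same message carries a timestamp, a severity level, and a logger name alongside the text itself (the logger name and message are just illustrative):

```python
import logging

# Configure the root logger to include metadata in each record.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("payments")

# Emits something like:
# 2024-01-01 12:00:00,000 INFO payments - Processed order 42
logger.info("Processed order %s", 42)
```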
Most users will already have experience with **logs** - if you've ever developed an application and `print`ed something to the screen (perhaps to understand the logic in your application), then congratulations! You've just performed a simple instrumentation of your application to emit logs. Taking **Python** as an example, even the simplest `Hello World` application...

```python
# hello.py
print("Hello World!")
```

... is emitting a log - `Hello World!` In practice, logs are more than just printing - there are different log streams (e.g. `stdout` and `stderr`) and different ways to log (e.g. to a local file), as well as different formats for logs (e.g. plaintext or JSON), and it all comes down to the specific needs of your application.

**Metrics** are a little more complex. One of the most popular solutions for metric management is [**Prometheus**](https://prometheus.io/), which requires an endpoint on your application which serves metrics. Here, a number of variables are tracked internally which pertain to both the operation of the application and the interactions that end-users have with the application. Those variables are exposed in a specific format through the endpoint, which is then scraped by Prometheus, and each variable forms a **time series** consisting of a **metric name**, a series of **samples** containing values for that metric, and optional **labels** for the metric. In this way, a metric comprises a series of values that provide information about how a particular signal changes over time as the application executes.

Applications are also frequently composed of many different modules or components, and a single request can flow through multiple components of the application. **Traces** demonstrate the flow of these requests through each component, from the point where the request is received by the application, to each component required to serve the request. Traces will also frequently contain additional information about the request, such as the time it took for a specific component to process the request, whether it emitted an error, and whether it had to call further downstream components. Tracing backends will also process traces to generate a **service graph**, a visual representation of each component in the application, and the nature of the service graph will depend on the way the represented application was instrumented.

Logs, metrics, and traces form the **3 Pillars of Observability**, and a well-instrumented application should aim to emit all 3 types of observable signals. However, observability is more than just having access to these signals - it also encompasses the ability to use these signals to improve aspects of the source application, such as the end-user experience or meeting certain business requirements. To this end, various products exist that allow users to ingest these signals and display them in an easy-to-read format, allowing them to make decisions based on this data.

## Why should you instrument with OpenTelemetry?

There are a multitude of solutions currently available for each pillar of observability, and they all have different instrumentation steps and different components that need to be running. As a few examples:

- **Prometheus** has its own sidecar which needs to run to scrape your application's endpoint(s).
- **DataDog** has its own sidecar which needs to run to observe your application.
- **X-Ray**, when instrumenting with the X-Ray SDKs specifically, requires the **X-Ray Daemon** to run alongside the main application process.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1163728853831790644/no_adot_multiple_sidecars.png?ex=6540a1f3&is=652e2cf3&hm=54c06a55b2bd5c373339525a882fc0356d4a32a55d83a679ba521401447d10b3&" />
<br /><br />

All of these requirements result in multiple sidecars running alongside what is typically only a single application container, increasing management overhead. Furthermore, replacing a monitoring solution (e.g. switching between tracing backends) requires swapping the relevant sidecar, and in some cases, re-instrumenting the application with the updated SDK.

![adot_cannot_scale_easily](https://hackmd.io/_uploads/HJ2ZREmMp.png)

OpenTelemetry avoids these issues by providing the following benefits:

- A single sidecar which can capture **multiple types** of observability signals using a configuration file.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1163728854544818177/adot_one_sidecar.png?ex=6540a1f3&is=652e2cf3&hm=4244b879d660d886a4ecadbfd7c16765da05c1dec0c0c3437b1e524fe63f4c2d&" />
<br /><br />

- **Swap-In/Swap-Out** ease of migration between supported monitoring solutions.
- For backends that cannot be easily swapped, a standardized format for exchanging observability signals called the **OpenTelemetry Protocol** (OTLP), supported by many different backend solutions.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1163728854175711282/adot_swapping_backends.png?ex=6540a1f3&is=652e2cf3&hm=0f5596d06dea129ac54da51b75a85268baa6190959bff3c79efdd9b825424c7d&" />
<br /><br />

*(Note that we don't send logs through OTEL when we're using the AWS Distro for OpenTelemetry - we'll discuss why this is the case later in this guide.)*

OTEL cuts out a huge portion of the management overhead involved in working with many different types of observability signals, allowing you to focus on just 2 things: instrumenting your application, and configuring a single sidecar.

## Tutorial - Instrumenting with ADOT

Now that we have an understanding of OpenTelemetry, we can start building a sample application and instrumenting it with OpenTelemetry. We're going to develop a simple [**Flask**](https://flask.palletsprojects.com/en/3.0.x/) application in Python with 2 routes - a home route (`/`), and a `GET` route (`/get`) that calls a [**MySQL**](https://www.mysql.com/) database. We'll be performing the following tasks, each of which corresponds to a section header below:

1. **Coding** the main logic of the Flask application.
2. Adding support for a **MySQL database**.
3. **Dockerizing** the application and running it on **ECS**.
4. Configuring the ADOT sidecar for **metrics** and incorporating upstream AWS services.
5. Instrumenting the application for **traces** and updating the ADOT configuration to match.
6. Discussing **further steps** to improve your understanding of OTEL.

You'll need at least a basic understanding of the following:

- The MySQL client, or another database of your choice; we'll be using MySQL in this tutorial. If you want to test the application locally, you'll also need an instance of the database running on your development machine.
- Docker and the ECS service. You'll need an ECS cluster running in your AWS account, but you don't need any container instances, since we'll be launching all of our resources into [**Fargate**](https://aws.amazon.com/fargate/), which is serverless.
### Why don't we use ADOT for logging?

ADOT is a distribution of OTEL which expands upon OTEL's core functionality by adding custom receivers and exporters for various AWS services and supported 3rd-party observability tools. However, you may have noticed that we're not going to be using the AWS Distro for OpenTelemetry to route our logs. This is because while the **upstream** OpenTelemetry ecosystem supports a [wide range of components](https://github.com/open-telemetry/opentelemetry-collector-contrib), not all of these are supported by ADOT. Specifically, the [**logging solutions**](https://aws-otel.github.io/docs/components/misc-exporters#logging-exporter) ADOT provides are not suitable for our use (the Zap logger logs to the console, and the file exporter, surprise surprise, writes to a file), and when hosting on AWS container services, there are other simple, yet highly configurable, options for [ECS](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/firelens-example-taskdefs.html) and [EKS](https://docs.aws.amazon.com/eks/latest/userguide/fargate-logging.html) to log to a remote location off the actual container itself. We'll discuss how to configure logging on ECS specifically in step 3, when we write the task definition for our application.

## 1. Coding the Application

*Note: If you would like to skip coding sections #1 and #2 and go straight to Dockerizing this application, you can reference the sample code in [**this**](https://github.com/ForsakenIdol/Flask-SQL-Sample-App) repository, which also contains the required Dockerfile. If so, skip straight to section #3.*

These steps were tested with Python 3.11, but any version of Python 3.8+ should work. In your working directory, start by configuring a virtual environment.

```shell
python3 -m venv .venv
source .venv/bin/activate
```

If you're on Windows, the `activate` path may be slightly different - `.venv/Scripts/activate`. We use virtual environments to keep track of different framework versions across different projects - it's a bad idea to install your dependencies globally, because differing versions can cause conflicts.

Install **Flask** and the [**Python MySQL Connector**](https://dev.mysql.com/doc/connector-python/en/connector-python-installation.html).

```sh
pip install flask mysql-connector-python
```

This is a good point in time to `freeze` your requirements. If you're using source control, it's also a good idea to set up a [pre-commit hook](https://stackoverflow.com/questions/65695906/how-to-update-a-requirements-file-automatically-when-new-packages-are-installed) to make sure you're not submitting a `commit` with untracked requirements.

```sh
pip freeze > requirements.txt
```

Create a `templates` folder in the top level of your working directory to store the HTML template files our application will serve. Working with templating in Flask is very easy - let's start with a single `index.html` file in this directory and populate it with anything you'd like. My index page looks something like this:

```htmlmixed
<head>
    <title>ADOT Example 1</title>
</head>
<body>
    <h1>ADOT Example Page</h1>
    <p>Welcome to the ADOT example page.</p>
</body>
```

We have our first HTML file! Let's serve it using Flask. Back in the root directory, create an `app.py` file. The port on which the application runs is up to you, but I'll use port 80 to avoid having to specify the port number.
```python
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def home():
    print("Hit the root page.")
    return render_template("index.html")

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=80)
```

In this file, we declare an instance of a Flask application, bind a root path function (to serve the `index.html` file), and tell it to run when the file is executed with Python. At this point, you can go ahead and run `python app.py` and access `localhost` in a browser (if you specified another port, you'll need to access `localhost:<port>` instead) - you should see the following.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1163779348315246632/ADOT_example_homepage.png?ex=6540d0fa&is=652e5bfa&hm=ee0fd8b1eea6ddf21effcf8a09173a398ea12f7305d0340d5f38296b9209246f&" />

Wonderful - we have our first sample application! Now let's add a MySQL database connector.

## 2. Adding a Database

Our application is going to have a simple `/get` route to request data from a backend database. Let's use the MySQL connector we installed in part 1. At this point in time, your project's directory structure should look something like this:

```
root
├── .venv
├── templates
│   └── index.html
├── app.py
└── requirements.txt
```

In the `templates/` folder, let's define another simple template that uses templating variables - `get.html`.

```htmlmixed
<head>
    <title>ADOT Get</title>
</head>
<body>
    <h1>List of Entries</h1>
    <ul>
        {% for item in db_entries %}
        <li>{{ item }}</li>
        {% endfor %}
    </ul>
</body>
```

The `for` directive and `{{ item }}` entries are template code that we can substitute with actual values in our `app.py` file. Let's render this template using some filler variables for the time being, until we actually define a connector to our MySQL database. To do this, we'll add the `get` route to our `app.py` file. Place this under the root path, but before the `__name__` statement.

```python
@app.route("/get")
def get():
    print("Getting database entries.")
    return render_template("get.html", db_entries=[
        {"key_1": "value_1"},
        {"key_1": "value_2"},
        {"key_1": "value_3_final"}
    ])
```

Running the application at this point in time, you can test the `/get` path of the app to see the list rendered as a result of passing the `db_entries` variable to our template.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1163786703740338247/ADOT_get_generic.png?ex=6540d7d4&is=652e62d4&hm=8574ea804400330d518ed1daf4f50ea0f9db0d46f0373b4f8baa912af516b611&" />

Now, we're going to add some functionality to our app to allow us to integrate a MySQL database. You can use any database you'd like here that has sample data you want to integrate with your application - all you need is at least 1 table which contains data. Import the connector you previously installed with `from mysql import connector` at the top of the `app.py` file. We'll then define our connector's parameters using environment variables for security - here's an example set of commands you'd load into the terminal in which you're running the `app.py` file:

```bash=
# parameters.env
export DB_USER=root
export DB_PASSWORD=password123
export DB_HOST=localhost
export DB_DATABASE=world
export DB_TABLE_NAME=country
```

The database and table names are drawn from the sample `world` database that MySQL can initialize for you when MySQL Server is installed. Feel free to use your own database and table combination for these values, and remember to update the other fields as necessary to match your local installation if available.
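If you'd rather not export these variables by hand in every new terminal, a few lines of standard-library Python can load `parameters.env` before the connection code reads `os.environ`. This is just an optional sketch - the tutorial itself simply uses `source parameters.env`:

```python
# load_env.py - optional helper to load parameters.env into os.environ
import os

def load_env(path="parameters.env"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            # Tolerate the shell-style "export " prefix used in parameters.env.
            if line.startswith("export "):
                line = line[len("export "):]
            key, _, value = line.partition("=")
            # Don't clobber values that are already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

if __name__ == "__main__":
    load_env()
    print("DB_HOST =", os.environ.get("DB_HOST"))
```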
In the `app.py` file, start by importing the libraries we need to make a connection to the MySQL database.

```python=
from flask import Flask, render_template
from mysql import connector
import os
```

We'll use the variables we've previously exported to establish a connection to the database in our `app.py` file. After declaring `app = Flask(__name__)`, add the following code:

```python=
try:
    cnx = connector.connect(user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
                            host=os.environ["DB_HOST"], database=os.environ["DB_DATABASE"], port=3306)
    cursor = cnx.cursor()
    print("MySQL connection established.")
except Exception as e:
    print("MySQL database connection failed - proceeding without database connection.")
    print(e)
    cursor = None
```

The MySQL connector will attempt to connect to the database using the parameters we specified in our environment file, which you can load with `source parameters.env` in the terminal before you next run `python app.py`. If you have a local instance of MySQL and your parameters are correctly configured, you should see the log line `MySQL connection established.` in the output. If you don't, you can still run the file, but you'll hit the print statement in the `except` block instead.

There is an interesting caveat with the way the MySQL container is brought up. To put it simply, during the container startup process, MySQL spawns an initial process to perform database setup operations, before spinning down that instance and spinning up the actual database process that we can connect to. The initial, dummy instance is similar to our final database instance in that it will pass [**container health checks**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_healthcheck). Thus, if we configure a container startup order such that our Flask application starts **after** the MySQL container is healthy (i.e. it passes health checks), a health check could very well hit the first process and assume the container is healthy. To manage this possibility, let's add some retry logic to the MySQL connection code.

```python=
# ... Imports
from time import sleep

retry = 1
connected = False
max_retries = 10

while (retry <= max_retries and not connected):
    try:
        cnx = connector.connect(user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
                                host=os.environ["DB_HOST"], database=os.environ["DB_DATABASE"], port=3306)
        cursor = cnx.cursor()
        connected = True
    except Exception as e:
        print("MySQL connection failed: retry {} of {}. Retrying...".format(retry, max_retries))
        print(e)
        cursor = None
        retry += 1
        sleep(3)

if not connected:
    print("MySQL database connection failed - proceeding without database connection.")
else:
    print("MySQL connection established.")

# ... Paths
```

Currently, our `/get` route isn't using our MySQL database - it's loading sample data. We can modify the route to run a simple `GET` query against our table and display the result. To do this, we'll need to modify the route's function to run a query whenever the `/get` route is hit, and print out the result of the query. Here's what the `/get` path should look like once we add a reference to the database.
```python=
@app.route("/get")
def get():
    global cursor
    if cursor is not None:
        print("Getting database entries.")
        query = "SELECT * FROM {}".format(os.environ["DB_TABLE_NAME"])
        cursor.execute(query)
        return render_template("get.html", db_entries=[entry for entry in cursor])
    else:
        print("Serving /get without database entries.")
        return render_template("get.html", db_entries=[
            "No database connection.",
            "Proceeding without a database connection."
        ])
```

Notice that we check whether the cursor exists before running the query, because we added logic earlier to allow the application to proceed without a database connection. The cursor we're using to access the database has been defined in the global scope, and we need to reference it as such. Run the application now, and depending on the database you're using, you should see the sample data from your instance of the database.

One more thing we can add at this point is a metrics provider that we'll be able to integrate into an OpenTelemetry pipeline once we introduce the collector. For this, we'll use `prometheus-flask-exporter`. Install the package with `pip install prometheus-flask-exporter` (remember to `pip freeze` again!), import the package with `from prometheus_flask_exporter import PrometheusMetrics`, and add the following line before you start defining any application routes:

```python
flask_metrics = PrometheusMetrics(app, path="/mymetrics")
```

Metrics are exposed on the `/metrics` path by default, and the upstream components we'll be configuring later will also look for metrics on that path by default. However, there are settings for configuring custom scrape paths, and I'll demonstrate these later in this tutorial. For now, start the app and navigate to the `/mymetrics` path - this is the endpoint we'll be scraping later. Here's an example of what your metrics endpoint may look like at this stage - be aware that the more you interact with your application and the more paths you visit, the more metrics may be exposed here:

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1163804435898695742/ADOT_example_metrics.png?ex=6540e857&is=652e7357&hm=71e0ef51ba6f2ae9a3a95fde91e88809eb7091f17dfe126ca0691c2e937a361c&" />
<br /><br />

Congratulations! You have a sample application. Consider comparing the code you've written against the [sample code](https://github.com/ForsakenIdol/Flask-SQL-Sample-App) I've provided. Now, let's put it into a container and run it on ECS before we start to instrument it.

## 3. Containerization

Our final application will be running on Amazon ECS, so we'd better create a simple development pipeline before we proceed further, by configuring a **Dockerfile** and a **task definition** for our application's containers. We'll start by creating the Dockerfile in the top level of our working directory.

```Dockerfile=
# Dockerfile
FROM python:3.11.5 AS base
WORKDIR /app
COPY app.py requirements.txt ./
COPY templates ./templates/
RUN pip install -r requirements.txt
RUN apt-get update && apt-get install -y default-mysql-client
CMD [ "python", "app.py" ]
```

In this Dockerfile, we:

1. Set our working directory in the Python image.
2. Copy our top-level files into the `/app` directory, and the templates into `/app/templates/`.
3. Install the required packages for the application to run (this is why it's so important to keep an up-to-date requirements file).
4. Install the MySQL client.
   This isn't strictly required for the application to run - however, if you encounter issues connecting to the database once the application has been deployed, it can be a useful utility when you're poking around inside the container. Feel free to omit it if you want to decrease the size of your final image.

5. Set our execution command in the image.

Now, we can build the image with...

```bash
docker build -t <image_name> .
```

... and run the image with this command.

```bash=
docker run -p <host_port>:80 \
    -h localhost \
    --env DB_USER \
    --env DB_PASSWORD \
    --env DB_HOST=host.docker.internal \
    --env DB_DATABASE \
    --env DB_TABLE_NAME \
    <image_name>
```

Here, we propagate our shell's environment variables into the container, and `host.docker.internal` is the hostname that allows a Docker container to communicate with processes running on the parent machine's `localhost` interface. We can use `<host_port>` to map our container's port 80 to another port on the host, if the host's port 80 is already being used by another process. If it's unused, you can set the host port to 80 as well. Now is also a good time to set up a remote public repository in a container registry of your choice (DockerHub, ECR, etc.) so you can `docker push` this image for reference by upstream components in ECS.

Before we introduce ADOT, let's also set up a custom MySQL image that we can pre-populate with sample data. To do this, I'll demonstrate using the sample `world` database that MySQL provides [**here**](https://dev.mysql.com/doc/index-other.html). Create a folder in the top-level directory called `mysql` and copy the `world.sql` script from the extracted `world-db` folder into that directory. We can then configure a simple Dockerfile for our image as follows.

```dockerfile
FROM mysql AS base
COPY world.sql /docker-entrypoint-initdb.d/
```

The image will inherit the default entrypoint command - all we're doing here is loading the SQL script to be executed on initialization. Build this image and push it to your remote image repository - we'll be referencing this in the next section below.

### ECS

Let's also set up a task definition to ensure our container can run on ECS. There are a few ways you can construct this; you could run `aws ecs register-task-definition --generate-cli-skeleton` to generate [**all the possible configuration values**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html), but most of these are not required. In this guide, we're going to start with a skeleton file and populate it step by step.

```json=
{
    "family": "flask_mysql_sample_app",
    "taskRoleArn": "",
    "executionRoleArn": "",
    "networkMode": "awsvpc",
    "containerDefinitions": [],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "256",
    "memory": "1 GB",
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    }
}
```

A few things to note right off the bat:

- When we register this task definition, it will be given the name `flask_mysql_sample_app`.
- Since we're running our task in **Fargate**, the only valid network mode is `awsvpc`. This will assign an **Elastic Network Interface** (ENI) to our task and give it the same networking properties as an EC2 instance.
- The CPU and memory combination is not arbitrary on Fargate. There are specific combinations of CPU and memory values for this task that must be satisfied.
  For more information, have a look at the [**ECS task definition parameters**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#task_size).

Our container startup order is important; even though we've configured database connection retry logic in our Flask application, the container will only retry the connection a set number of times. Once the connection is made, both containers are essential to our task. ECS lets us control startup order by configuring container dependencies - let's start by adding our MySQL container image. Add the following entry to the `containerDefinitions` section of the task definition.

```json=
{
    "name": "mysql-db",
    "image": "<name-of-your-mysql-image>",
    "cpu": 128,
    "memory": 512,
    "essential": true,
    "secrets": [
        {
            "name": "MYSQL_ROOT_PASSWORD",
            "valueFrom": ""
        }
    ],
    "dependsOn": [],
    "startTimeout": 60,
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "sample-app",
            "awslogs-region": "<your-region>",
            "awslogs-stream-prefix": "mysql-db",
            "awslogs-create-group": "true"
        }
    },
    "healthCheck": {
        "command": ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$MYSQL_ROOT_PASSWORD 2>&1 | grep alive || exit 1"],
        "interval": 5,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 30
    }
}
```

A few things to note:

- Replace `<name-of-your-mysql-image>` with the image you previously built for MySQL containing the sample data, and `<your-region>` with the AWS region you're currently using.
- We set `startTimeout: 60`, meaning that we're giving the MySQL container 60 seconds to start up before we give up on resolving dependencies for any future containers that need this container to become healthy before they can start.
- `mysqladmin` allows us to construct a health check that will print `mysqld is alive` if the main process is running and ready to accept SQL traffic. We can `grep` the command's output for the word `alive` when the container is ready. **This is the health check that can present problems for dependency resolution if it returns a healthy response on the init process in this container**, which is why we configured database retry logic in our Flask application.
- The `logConfiguration` uses the `awslogs` driver, which pushes our logs to CloudWatch. This represents the first pillar of observability: **logs**. The log configuration makes this very easy to set up; if you want to experiment with other logging drivers, there is a comprehensive list of supported drivers [**here**](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_LogConfiguration.html).
- Notice the `secrets` key - because we're dealing with passwords, <span style="color: red; font-weight: bold;">you should never pass sensitive information to a container in plaintext</span>. Use either **Secrets Manager** or **Systems Manager Parameter Store** (SSM) to pass these values. Here's an example of what an SSM parameter might look like in this context:

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1164150826495913984/Example_ssm_parameter.png?ex=65422af1&is=652fb5f1&hm=dd9ca5222ddc606704f410ec93a6b805a60e516f27a662337d6db39bb1f4e866&" />
<br /><br />

In this scenario, here's how you'd populate the `secrets` entry in the container definition above.

```json
"secrets": [
    {
        "name": "MYSQL_ROOT_PASSWORD",
        "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/FLASK_MYSQL_DB_PASSWORD"
    }
]
```
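If you'd rather create that parameter from code instead of the console, here's a minimal, optional sketch using boto3 - an assumption on my part, since the tutorial only needs the parameter to exist, however you create it. The parameter name matches the ARN used above:

```python
import getpass

import boto3  # assumes boto3 is installed and your AWS credentials are configured

ssm = boto3.client("ssm", region_name="<your-region>")

# Prompt for the password so it never ends up in your shell history or source code.
ssm.put_parameter(
    Name="FLASK_MYSQL_DB_PASSWORD",
    Description="Root password for the sample Flask + MySQL task",
    Value=getpass.getpass("MySQL root password: "),
    Type="SecureString",
    Overwrite=True,
)
```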
Great! We've got our first MySQL container. Now, let's set up the second container and configure a startup dependency. Add the following container definition to the task definition:

```json
{
    "name": "flask-app",
    "image": "<name-of-your-flask-image>",
    "cpu": 128,
    "memory": 256,
    "portMappings": [
        {
            "containerPort": 80,
            "hostPort": 80,
            "protocol": "tcp",
            "name": "app-port"
        }
    ],
    "essential": true,
    "environment": [
        { "name": "DB_USER", "value": "root" },
        { "name": "DB_HOST", "value": "localhost" },
        { "name": "DB_DATABASE", "value": "world" },
        { "name": "DB_TABLE_NAME", "value": "country" }
    ],
    "secrets": [
        {
            "name": "DB_PASSWORD",
            "valueFrom": "arn:aws:ssm:<your-region>:<your-account-id>:parameter/FLASK_MYSQL_DB_PASSWORD"
        }
    ],
    "dependsOn": [
        {
            "containerName": "mysql-db",
            "condition": "HEALTHY"
        }
    ],
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "sample-app",
            "awslogs-region": "<your-region>",
            "awslogs-stream-prefix": "flask-app",
            "awslogs-create-group": "true"
        }
    },
    "healthCheck": {
        "command": [ "CMD-SHELL", "curl -f http://localhost/ || exit 1" ],
        "interval": 10,
        "startPeriod": 10
    }
},
... (MySQL container definition below)
```

There are 2 more aspects of this task definition we need to configure. The first is our [**task execution role ARN**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html) in the `executionRoleArn` field. Because our task pulls secrets from the SSM parameter store, we need to provide the task with permissions to do so while it is initializing. We need to attach 3 policies to the role:

- The Amazon-managed `AmazonECSTaskExecutionRolePolicy` to provide permissions to create the log streams for our containers. This policy is also required if you've stored your container images in ECR.
- The Amazon-managed `AmazonSSMReadOnlyAccess` policy to fetch our password parameter from SSM.
- Either an inline policy for creating CloudWatch log groups, or you can create the log group for the containers before launching the task.

The second role is our `taskRoleArn`. While our task's containers don't interact with AWS services (yet - this will change in subsequent sections), we have installed the MySQL client in our Flask application container image for troubleshooting purposes. If you want to configure **ECS Exec** permissions for this task, this role will require the permissions set out [here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html#ecs-exec-required-iam-permissions).

Once both roles have been configured, you can go ahead and register the task definition with `aws ecs register-task-definition --cli-input-json file://task-definition.json`; provided the **AWS CLI** has been configured with the correct region and credentials, you should see the task definition appear in the ECS console under the name `flask_mysql_sample_app`.

At this point, you can run an instance of the task definition as a task. Bear in mind that if your VPC does not have a [PrivateLink endpoint for SSM](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html) configured, you'll need to attach a security group that allows traffic on port 443 (as well as port 80 for the actual application) to reach the task - otherwise, you only need to open port 80.
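With the task running, it's worth generating a little traffic against the application so that our observability backends have something to show once we wire them up in the next sections. Here's a small, optional sketch using only the standard library - replace the placeholder address with your task's public IP or DNS name:

```python
import time
import urllib.request

# Hypothetical address - substitute your running task's public IP or DNS name.
BASE_URL = "http://<your-task-public-ip>"

# Hit the two routes a few times to generate request metrics (and, later, traces).
for _ in range(20):
    for path in ("/", "/get"):
        try:
            with urllib.request.urlopen(BASE_URL + path, timeout=5) as response:
                print(path, response.status)
        except Exception as exc:
            print(path, "request failed:", exc)
    time.sleep(1)
```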
## 4. ADOT Instrumentation - Metrics

Now that we have a working application on ECS (hurray!), we can actually begin to dive into instrumenting it with OpenTelemetry. As mentioned previously, the AWS Distro for OpenTelemetry (ADOT) is not commonly used for logging on ECS, since there are a host of powerful logging drivers that ECS already provides. Instead, we'll be using ADOT to send **metrics** and **traces** to various observability backends.

OpenTelemetry, and by extension ADOT, is configured using a YAML file. Create a new directory in our application's workspace - let's call it `adot` - and add a `config.yaml` file to it. The skeleton of the configuration looks like this:

```yaml=
receivers:

processors:

exporters:

extensions:

service:
```

- **Receivers** accept or query for input data from upstream sources.
- **Processors** perform operations on the input data from a receiver.
- **Exporters** send the output data (either directly from a receiver or after processing) to one or more backends.
- **Extensions** are features that relate directly to the operation of the collector and do not operate on data pipelines.
- The **service** section is where the components in the 4 categories above are enabled, by declaring them either in specific data pipelines or as extensions.

These components are defined in the [**OpenTelemetry documentation**](https://opentelemetry.io/docs/collector/configuration/) for the OTEL collector.

<div style="background-color: #ffadad; padding: 1rem 2rem; margin-bottom: 1.5rem;">
<b>Note</b>: ADOT does not support all of the receivers and exporters in the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib">contribution repository<a/>. If you attempt to use an unsupported component, ADOT will fail when attempting to unmarshal the configuration file, and will print out a list of supported components for that particular version of the collector. We will only be using ADOT-supported components in this tutorial as of version 0.33 of the collector, and you can find a guide for most of them <a href="https://aws-otel.github.io/docs/introduction">here</a> under the <b>ADOT Collector Components</b> section.
</div>

We'll start by configuring the components for metrics, since we exposed a Prometheus-compatible endpoint to scrape in our application (`/mymetrics`). The configuration is as simple as a drag-and-drop for existing Prometheus scrape configurations, as per the implementation [here](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter). This configuration will go under the `receivers` section:

```yaml=
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "flask-prometheus"
          static_configs:
            - targets: [ 0.0.0.0:80 ]
          metrics_path: /mymetrics
```

Let's break down what's going on here.

- The `scrape_configs` section allows us to provide a list of endpoints, reachable by the collector, that Prometheus should scrape. Here, we define a single target - our application on port 80.
- The `metrics_path` defaults to `/metrics`, if such a path is available on the scrape target. In our case it is not, because we changed this path to `/mymetrics`, so we must declare the new path explicitly.
- The `job_name` is an arbitrary name we can give to this scraped endpoint.
- The `scrape_interval` tells the Prometheus receiver how long to wait between scrapes, and the `scrape_timeout` is how long to wait for a scrape to return data. Unlike some of the other receivers, Prometheus does not wait for data to come to it - rather, it proactively seeks out data by scraping the endpoints we define in this configuration.
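If you want to see exactly what the receiver will be scraping, you can fetch the endpoint yourself while the app runs locally - a quick sketch (it assumes the app from section 2 is running on localhost port 80; adjust the URL if you chose a different port):

```python
import urllib.request

# Fetch the raw Prometheus exposition text that the collector will scrape.
with urllib.request.urlopen("http://localhost/mymetrics") as response:
    body = response.read().decode()

# Print only the Flask-specific series (prefixed "flask_" by prometheus-flask-exporter)
# to keep the output short.
for line in body.splitlines():
    if line.startswith("flask_"):
        print(line)
```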
Let's omit `processors` from this configuration (we'll export the metrics directly to our backend) and have a look at configuring an `exporter` to write the incoming data to an **Amazon Managed Prometheus** (AMP) remote write endpoint. To do this, let's first create a Prometheus workspace in AMP...

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1164482032307552286/AMP_workspace.png?ex=65435f67&is=6530ea67&hm=2f10b67a5a080171f3875c76cbc00f76adba80c8f23a334eee4ca7329f96ab44&" />
<br /><br />

... then, we'll configure a [**prometheusremotewrite**](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter#prometheus-remote-write-exporter) exporter with our workspace's `your_remote_write_endpoint`.

```yaml=
exporters:
  prometheusremotewrite:
    endpoint: "your_remote_write_endpoint"
    auth:
      authenticator: sigv4auth
```

The remote write endpoint should have the following syntax:

```
https://aps-workspaces.<my_region_id>.amazonaws.com/workspaces/<my_workspace_name>/api/v1/remote_write
```

In terms of extensions, we'll also need to configure the `SigV4` authenticator, since we'll be using it to authenticate to AMP using AWS credentials. While we're at it, let's also declare the `health_check` extension.

```yaml=
extensions:
  health_check:
  # Configuration for Sigv4 available here - most fields are optional:
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/sigv4authextension/README.md
  sigv4auth:
    region: <my_region_id>
    service: aps
```

Now we've declared all the components, but OTEL doesn't yet know how to use them because we haven't activated them. Let's declare the extensions and set up a `pipeline` for our metrics under the `service` section.

```yaml=
service:
  extensions: [ health_check, sigv4auth ]
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ prometheusremotewrite ]
```

*Once again, notice the lack of the* `processors` *header - we're sending metric signals directly from the receiver to the exporter.*

At this point in time, your `config.yaml` file should look something like this:

```yaml=
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "flask-prometheus"
          static_configs:
            - targets: [ 0.0.0.0:80 ]
          metrics_path: /mymetrics

exporters:
  prometheusremotewrite:
    endpoint: "your_remote_write_endpoint"
    auth:
      authenticator: sigv4auth

extensions:
  health_check:
  sigv4auth:
    region: <my_region_id>
    service: aps

service:
  extensions: [ health_check, sigv4auth ]
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ prometheusremotewrite ]
```

There are a few ways we can feed this configuration file to the ADOT collector. You could mount it at task initialization time and pass it to the container as a parameter, but I like to follow the same methodology I used for mounting the sample data to MySQL, by creating a custom image with this file added. This is the method recommended in the [Amazon Managed Prometheus documentation](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-OpenTelemetry-ECS.html) for loading a configuration, and it's remarkably simple - the Dockerfile consists of only 3 lines:

```Dockerfile=
FROM public.ecr.aws/aws-observability/aws-otel-collector:latest
COPY config.yaml /etc/ecs/config.yaml
CMD ["--config=/etc/ecs/config.yaml"]
```

Build the image and push it to a remote Docker repository. We can then incorporate a new container into our task definition with this image.
Here's what the new container definition looks like:

```json=
{
    "name": "adot-collector",
    "image": "<my_adot_collector_image>",
    "cpu": 256,
    "memory": 128,
    "essential": true,
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "sample-app",
            "awslogs-region": "<your-region>",
            "awslogs-stream-prefix": "adot-collector"
        }
    },
    "healthCheck": {
        "command": [ "/healthcheck" ],
        "interval": 5,
        "timeout": 6,
        "retries": 5,
        "startPeriod": 1
    }
}
```

But wait - there are now a few additional things we need to configure.

- Our previous **resource limits** were for 2 containers, and don't account for the new ADOT collector sidecar. Let's increase the CPU value at the task level to 512 units to give our new ADOT collector sidecar some resource budget.
- We can configure a **startup dependency** so that the ADOT collector becomes healthy before the MySQL container starts up. This allows the collector to be up and running before any of the other containers are started, so it can start scraping metrics right away - simply add a startup dependency as follows to the MySQL container:

```json
"dependsOn": [
    {
        "containerName": "adot-collector",
        "condition": "HEALTHY"
    }
]
```

- ADOT requires **IAM permissions** to write to an AMP workspace. We can provide these permissions by adding the `AmazonPrometheusRemoteWriteAccess` AWS-managed policy to our task role. Note that this policy provides access to all AMP workspaces - if you have multiple workspaces in the region you're working in, you can copy this policy and scope it down to the specific workspaces you want to give the task permission to write into - this is called the [**principle of least privilege**](https://www.cloudflare.com/learning/access-management/principle-of-least-privilege/).

Once you've configured this, go ahead and register the new task definition and run an instance of the task. ADOT will write metrics to your Prometheus workspace endpoint, and you can query metrics from this workspace using a tool of your choice, e.g. [`awscurl`](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-compatible-APIs.html).

<div style="background-color: #ffadad; padding: 2rem 2rem 1rem; margin-bottom: 1rem;"><b>Note</b>: You may see the following logs for the ADOT container:
<br/><br/>
<pre><code>warn internal/transaction.go:123 Failed to scrape Prometheus endpoint</code></pre>
This occurs because ADOT is the first container to be started, and will attempt to scrape the <code>/mymetrics</code> endpoint despite the main container having not started yet. This error <b>should</b> self-resolve; if it does not, check the status and logs of the <code>flask-app</code> container.
</div>

### Aside: Configuring Amazon Managed Grafana

Congratulations! You've configured the ADOT collector to receive and export metrics to a backend. However, Prometheus is also commonly integrated with [**Grafana**](https://grafana.com/), another open-source dashboard monitoring tool. AWS provides a managed distribution of Grafana known as [**Amazon Managed Grafana**](https://aws.amazon.com/grafana/) (AMG), and in this subsection, we're going to have a look at how to configure AMG and add our AMP workspace as a data source, to play around with the tool and create some charts.

<div style="background-color: #88f7b3; padding: 2rem 2rem; margin-bottom: 1.5rem;">
<b>Note</b>: This section is entirely optional. We've already demonstrated ADOT's ability to send metrics to a Prometheus workspace; this is just an extension of that workspace to create a fully-fledged metrics platform.
If you're already familiar with Grafana, or just want to focus on OpenTelemetry, skip to the next subsection, where we discuss the CloudWatch metrics pipeline.
</div>

To start off, navigate to the **Amazon Grafana** service and create a new workspace with the following settings:

- These steps were tested with Grafana version 9.4.
- Use the **IAM Identity Center** as the authentication method for this workspace. If you don't already have Identity Center configured, create a user here, and Grafana will launch Identity Center in a new window for you to continue configuring the service.
- Leave the permission type on **Service managed**, and don't put the workspace in a VPC - we're only storing sample data here. (*For a production or sensitive Grafana workspace, you'd definitely want to use the network security features a VPC provides, but for demonstration purposes, this isn't necessary.*)
- We don't need Grafana alerting, and our **Network access control** can be left on **Open access** for the aforementioned reasons.
- Add **Amazon Managed Service for Prometheus** as a data source to this workspace.

Review all the settings you've selected, then create the workspace. Give the workspace some time to finish creation.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1164837580211245117/01_create_grafana_workspace.png?ex=6544aa88&is=65323588&hm=ac7714277bce47da2cd3dbf2ea289f1fc797f4912fb29b6ef2c41a530262ac8f&" />
<br/><br/>

After you configure your Identity Center user (which may involve verifying your email address), scroll down on the **Authentication** tab and select **Assign new user or group**, adding the user you just configured. Grafana will, by default, add new users as **Viewers**, who do not immediately have permission to modify the workspace. You'll need to select the user you just added and use the **Action** dropdown to make them an admin. Return to the Grafana workspace view and click on the **Grafana workspace URL**, signing in with Identity Center.

<img src="https://cdn.discordapp.com/attachments/427068629486534667/1164538114082091020/02_grafana_dashboard.png?ex=654393a2&is=65311ea2&hm=d445cbc6dd53bae8d8560dd70148cc4a348de5d7d30646efb9ed2a0c974862e1&" />
<br/><br/>

(*If the image above is too small, right-click it and open it in a new tab.*)

Let's add our Amazon Managed Prometheus data source. Despite the **Add your first data source** button, this is most easily done back in the **Amazon Grafana** service - not in this view.

1. Return to the Amazon Grafana view of your workspace and go to the **Data sources** tab.
2. Next to the **Amazon Managed Service for Prometheus** data source, select **Configure in Grafana**. This will bring you back to the workspace URL.
3. Select the region and corresponding workspace, and click **Add 1 data source**.

**Congratulations!** You've just configured an Amazon Grafana workspace and added our AMP workspace as a data source. Go ahead and **Create your first dashboard** to experiment with the Grafana interface and have a look at the metrics being ingested into your workspace. Below is an example query you can run.
<image src="https://cdn.discordapp.com/attachments/427068629486534667/1164541010966237344/03_grafana_example_query.png?ex=65439655&is=65312155&hm=cf8588bbb049b8ac1dc3a20132382b1c102741f2bd57838e24857bfb49829ca9&" />
<br/><br/>

### CloudWatch Container Metrics

It is possible to declare multiple components or pipelines of the same type - see the design proposal [here](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/design.md) that led to this implementation. Subsequent declarations must be suffixed with a slash (`/`) and a name to differentiate them from the previous declaration of that component or pipeline, as in the following example, where we have 2 metric pipelines, `metrics` and `metrics/2`:

```yaml=
service:
  extensions: [ ]
  pipelines:
    metrics:
      receivers: [ ]
      processors: [ ]
      exporters: [ ]
    metrics/2:
      receivers: [ ]
      processors: [ ]
      exporters: [ ]
```

In this tutorial, we're going to build a secondary metrics pipeline to demonstrate one of the key metrics sources made available to us in ECS - [**Container Resource Metrics**](https://aws-otel.github.io/docs/components/ecs-metrics-receiver). These are made available through the `awsecscontainermetrics` receiver, which scrapes the [**ECS Task Metadata Endpoint**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4-fargate.html) to collect task-level and container-level metrics. We'll start by declaring the new receiver.

```yaml
receivers:
  ...
  awsecscontainermetrics:
    collection_interval: 10s
```

By default, this receiver scrapes the task metadata endpoint every 20 seconds; here, I've halved that interval to 10 seconds by specifying `collection_interval`. We can send these metrics to the backend of our choice - for the sake of demonstration, let's use the `awsemf` exporter to send these metrics to **CloudWatch**.

```yaml
exporters:
  ...
  awsemf:
    namespace: "ECS/ContainerMetrics/ADOT"
    log_group_name: "adot-ecs-container-metrics"
    region: <your-region>
```

You can name the `namespace` and log group anything you want - these values are arbitrary. Finally, we'll declare our second metrics pipeline:

```yaml=
service:
  extensions: [ health_check, sigv4auth ]
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ prometheusremotewrite ]
    metrics/2:
      receivers: [ awsecscontainermetrics ]
      exporters: [ awsemf ]
```

Rebuild the collector image and run an instance of our task so far - you should see the following new resources in the CloudWatch console:

- A new log group called `adot-ecs-container-metrics` (or whatever name you gave this log group), with logs in the [**Embedded Metric Format**](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html) (EMF).
- Container- and task-level metrics in the `ECS/ContainerMetrics/ADOT` namespace under the `Metrics` -> `All Metrics` -> `Browse` section.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1165538540768198676/06_cloudwatch_metrics_with_no_dimensions.png?ex=6547375a&is=6534c25a&hm=2d64c49a98518dac6a49bfec1970883e2cdb1facebf1443b77cab35fab6a8084&" />
<br/><br/>

One thing you may notice at this stage is that, despite being logically grouped into different categories (e.g. cluster name, task ID, container name), our metrics have no dimensions picked up by CloudWatch.
This can become problematic in environments where we have a number of different tasks running in separate clusters, all exporting metrics to CloudWatch using a similar method.

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1165548091231965254/07_cloudwatch_metrics_with_no_dimensions_top_level_view.png?ex=6547403f&is=6534cb3f&hm=8b1f27cdb36252490b338504e8fc222fd51e8e728609399b3dca7ee92cc09f4f&" />
<br/><br/>

This is happening because, while the scraped metrics come with a full set of valuable [**resource attributes**](https://aws-otel.github.io/docs/components/ecs-metrics-receiver#resource-attributes-and-metrics-labels), these attributes are not automatically converted to metric labels, and so CloudWatch can't see them. We can change this by setting the `resource_to_telemetry_conversion` parameter in the `awsemf` exporter as follows.

```yaml
exporters:
  ...
  awsemf:
    namespace: "ECS/ContainerMetrics/ADOT"
    log_group_name: "adot-ecs-container-metrics"
    region: <your-region>
    resource_to_telemetry_conversion:
      enabled: true
```

This will add a label for every attribute present on our metrics, but it results in a slight issue - CloudWatch has not only interpreted every label as a dimension (which is something we do want), but it has also created a [**dimension combination**](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Dimension:~:text=Dimension%20combinations) for **every dimension** in our metric data!

<image src="https://cdn.discordapp.com/attachments/427068629486534667/1165555143916408953/08_cloudwatch_metrics_too_many_dimensions.png?ex=654746d1&is=6534d1d1&hm=c37cf7a83e42b97c53805e3ac180a3c76d31e3a2ff274f7ab59e2272acdef265&" />
<br/><br/>

The exporter is interpreting every label as a separate dimension and will generate a standalone "combination" for each one, where each combination constitutes **a single dimension**. To remedy this, we first need to tell the `awsemf` exporter not to roll up every dimension to the root level. We do this using the `dimension_rollup_option` parameter. By default (without the `awsemf.nodimrollupdefault` feature gate enabled), this parameter is set to `ZeroAndSingleDimensionRollup`, which means that a "combination" has been created for every dimension. If there are `n` unique metrics in our input data and `m` possible dimensions, each of the `n` metrics has been duplicated **`m` times**, once for each of the `m` inferred combinations!

If we set this parameter to `NoDimensionRollup`, `awsemf` will still implicitly create dimensions from the labels we previously generated using the `resource_to_telemetry_conversion` parameter (have a look at the corresponding log stream to verify this), but CloudWatch will only create a single dimension combination which constitutes **all** of the possible dimensions in our input data. In our scenario, because there are 2 different classes of attributes that we've converted into labels (task labels and container labels), and no individual metric is ever going to have **both classes of labels** at the same time (a metric only ever pertains to a task **or** a container, never both), this results in the singular dimension combination having **no entries**. This brings us back to the situation where we had only metrics with no dimensions.

However, we can now use the `metric_declarations` parameter to tell CloudWatch to declare **custom dimension combinations**, based on the now-available labels.
Let's start by defining some metric declarations using dimensions relevant to the infrastructure around our task.

```yaml
exporters:
  ...
  awsemf:
    namespace: "ECS/ContainerMetrics/ADOT"
    log_group_name: "adot-ecs-container-metrics"
    region: <your-region>
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [aws.ecs.cluster.name], [aws.ecs.cluster.name, aws.ecs.task.family] ]
        metric_name_selectors: [ container.*, ecs.task.* ]
```

*(Tip - if you've been testing every configuration of our collector and the current CloudWatch metric namespace is getting too crowded, you can configure another namespace by changing the* `namespace` *parameter.)*

In this scenario, CloudWatch will no longer implicitly present dimension combinations for our metrics based on the available attributes, but will only present the combinations we explicitly define in the collector. In the configuration above, we've explicitly defined 2 different combinations based on the following dimensions:

- The **cluster** in which those metrics originated.
- The **cluster** and **task definition family** from which those metrics originated.

Entries in our `adot-ecs-container-metrics` log group which contain all of the dimensions in a combination will be grouped accordingly, and because every metric name in our log stream starts with either `container.` or `ecs.task.`, the selectors will target **all** the entries in our stream. Experiment with the [**available resource attributes**](https://aws-otel.github.io/docs/components/ecs-metrics-receiver#resource-attributes-and-metrics-labels) which we've converted to dimensions, and see what metrics you can define using various combinations of attributes. For example, here's another configuration for the `awsemf` exporter which defines separate dimension combinations for task-level and container-level measurements:

```yaml=
awsemf:
  namespace: "ECS/ContainerMetrics/ADOT"
  log_group_name: "adot-ecs-container-metrics"
  region: <your-region>
  resource_to_telemetry_conversion:
    enabled: true
  dimension_rollup_option: NoDimensionRollup
  metric_declarations:
    - dimensions: [ [aws.ecs.cluster.name, aws.ecs.task.family] ]
      metric_name_selectors: [ ecs.task.* ]
    - dimensions: [ [aws.ecs.cluster.name, aws.ecs.task.family, container.name] ]
      metric_name_selectors: [ container.* ]
```

If you only have a single ECS cluster, but want to differentiate the metrics based on the task they originated from, here's an example of a configuration which can achieve this:

```yaml=
metric_declarations:
  - dimensions: [ [aws.ecs.task.family, aws.ecs.task.id] ]
    metric_name_selectors: [ ecs.task.* ]
  - dimensions: [ [aws.ecs.task.family, aws.ecs.task.id, container.name] ]
    metric_name_selectors: [ container.* ]
```

## 5. ADOT Instrumentation - Traces

We've covered 2 pillars of observability: logs and metrics. Now, let's discuss **traces**. Tracing solutions are unique in that they are **language-specific** - that is to say, regardless of whether your application is compatible with [**automatic or manual instrumentation**](https://opentelemetry.io/docs/concepts/instrumentation/), there are different SDKs and methods for each programming language. In this tutorial, we're using Python, which has support for both automatic and manual instrumentation, but here, we'll be demonstrating **manual instrumentation**, to shed some light on what's happening at the code level.
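As a taste of what manual instrumentation means at its most granular, here's a standalone sketch of creating a span by hand with the OpenTelemetry Python API (it assumes the `opentelemetry-sdk` package, which we install in a moment, and prints the finished span to the console instead of exporting it anywhere - we'll wire up the real exporter and X-Ray-compatible configuration below):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console so we can see what a span contains.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# A span records a named, timed unit of work, plus any attributes we attach to it.
with tracer.start_as_current_span("fetch-countries") as span:
    span.set_attribute("db.table", "country")
```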
The OpenTelemetry SDK and the AWS X-Ray SDK for trace generation are not the same, and it's important we don't get the 2 of them mixed up. For our manual instrumentation, we'll be using the former, because it produces backend-agnostic trace documents in the OTLP format, meaning you can send these documents to a range of compatible backends (including X-Ray), whereas instrumentation with the X-Ray SDK produces a tightly-coupled solution.

Start by installing the relevant libraries. There are 2 sets of libraries we're interested in: those which are required for global configuration of OpenTelemetry...

```shell
pip install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-sdk-extension-aws
```

... and the libraries that are specific to the modules we're using in our application - for us, the `Flask` and `MySQL-connector` modules.

```shell
pip install opentelemetry-instrumentation-flask
```

*(Remember to update your requirements file!)*

The `mysql-connector` library comes with its own [**instrumentation specification**](https://dev.mysql.com/doc/connector-python/en/connector-python-opentelemetry.html), but this specification does not come bundled with the actual dependencies the library requires. These were installed with the global OTEL modules above and do not need to be installed again.

Both of these groups of libraries are initialized separately. Generally, we initialize the global OTEL libraries and apply our desired configuration before we move on to configuring module-specific tracing options. Here's an example of how that would look.

```python
# Other imports at the top of the file
from prometheus_flask_exporter import PrometheusMetrics
from time import sleep

# Global Imports for OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.extension.aws.trace import AwsXRayIdGenerator

# app = Flask(__name__)
# ...
```

Let's break down what each of these imports is used for.

- The tracing API is called `trace`. This import is generic and used in a few scenarios further down the line, and can be called to get information about and mutate the current trace being recorded.
- The `TracerProvider` is responsible for generating traces.
- The `OTLPSpanExporter` is responsible for sending those generated traces to the ADOT sidecar. This will be configured with the necessary endpoint later in the application.
- The `BatchSpanProcessor` simply forces batch processing of traces, as opposed to processing a single trace at a time.
- The `AwsXRayIdGenerator` is necessary because we're pushing trace data to X-Ray, which requires trace IDs in a [**specific format**](https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sendingdata.html).

Now that we've imported the required components, let's perform the necessary initialization.

```python
# Global OTEL Imports above

# Global initialization for OTEL traces
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.set_tracer_provider(TracerProvider(active_span_processor=span_processor, id_generator=AwsXRayIdGenerator()))

# app = Flask(__name__)
# ...
```

What have we just configured? We performed the following:

1. Initialized a span exporter and informed it of the endpoint to which to push traces.
2. Wrapped the span exporter in a processor that passes batches of spans to it at a time.
3. Configured the overall trace provider with the span processor and told it how to generate IDs for sending upstream. This ID generator can also be set as an environment variable (`OTEL_PYTHON_ID_GENERATOR=xray`) if desired.

The ADOT SDK also contains functionality for propagating the trace context downstream if our application makes calls to external services (this is available in the `opentelemetry-propagator-aws-xray` package, which we didn't install), but because we're only dealing with 2 containers in a single task, we can choose to omit this.

Now, we can begin to instrument the module-specific tracing libraries. There are 2 modules we need to configure tracing for - `Flask` and `MySQL-connector`. We'll start by importing the library-specific packages we installed earlier.

```python
# Library-Specific Imports for OTEL traces
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from mysql.connector.opentelemetry.instrumentation import MySQLInstrumentor
```

Then, instrumentation is as simple as declaring 2 lines of code after the Flask app initialization.

```python
app = Flask(__name__)

# Library-specific initialization for OTEL traces
FlaskInstrumentor().instrument_app(app)
MySQLInstrumentor().instrument()

retry = 1
connected = False
...
```

OpenTelemetry makes instrumentation (even manual instrumentation) very simple - because the `Flask` and `MySQL-connector` libraries have a wealth of support, most of the tracing logic is implemented behind the scenes. However, there is one other change we need to make. MySQL instrumentation wraps our `cursor` object, which was originally part of the `MySQLCursor` class, in a `TracedMySQLCursor` class, and to access the query results, we need to fetch them from the `._wrapped` property, instead of just calling the `cursor` directly. This means that the line in the `get` route where we render the template from the entries provided by the cursor...

```python
return render_template("get.html", db_entries=[entry for entry in cursor])
```

... needs to be changed very slightly.

```python
return render_template("get.html", db_entries=[entry for entry in cursor._wrapped])
```

Finally, for our Flask application, you need to rebuild the image and push it again to the remote repository.

We now also need to update our ADOT `config.yaml` file to declare an additional pipeline for traces. Let's start by adding a new receiver...

```yaml
receivers:
  ...
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
```

... as well as a new exporter.

```yaml
exporters:
  ...
  awsxray:
```

The [OTLP receiver](https://aws-otel.github.io/docs/getting-started/x-ray#configuring-the-otlp-receiver) accepts data in the **OpenTelemetry Protocol** (OTLP) specification, and because we instrumented our application using the OTEL SDK instead of the AWS X-Ray SDK, we will be sending our trace data in the OTLP format. 4317 is the default gRPC port OTEL uses to listen for incoming data, and here we use it to send trace data from our application to the ADOT sidecar.

We'll also need to configure another pipeline incorporating these 2 components.

```yaml
service:
  extensions: [ health_check, sigv4auth ]
  pipelines:
    ...
    traces:
      receivers: [ otlp ]
      exporters: [ awsxray ]
```

*(Once again, we're not using processors in our pipeline - we're opting to export incoming data directly to a backend without processing it in the collector.)*
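If you did want a processor in this pipeline, the collector's standard `batch` processor is a common starting point - it groups spans into batches before they reach the exporter, reducing the number of outgoing requests. Here's a hedged sketch of what that would look like; we won't use it in the rest of this tutorial.

```yaml
processors:
  # Batches spans before export; the default settings are usually fine to start with.
  batch:

service:
  extensions: [ health_check, sigv4auth ]
  pipelines:
    ...
    traces:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ awsxray ]
```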
Remember to rebuild the ADOT collector after modifying its configuration file to incorporate the new pipeline.

We now need to update our ECS task role to provide permissions for the ADOT collector to send traces to X-Ray. The collector requires the exact same permissions that the [X-Ray daemon](https://docs.aws.amazon.com/xray/latest/devguide/xray-daemon.html) would use, which can be satisfied by attaching the AWS-managed `AWSXRayDaemonWriteAccess` policy to the task role.

At this point, you can build the new collector and Flask application images and run an instance of the task. If everything is correctly configured, navigate to the **X-Ray** service, and notice that the service map should have 3 nodes: a client node, a central, larger node, and a `SELECT` node representing our backend.

**How does the service map know to put the SQL database in its own node?** When we instrumented the application, we first wrapped the Flask app in `FlaskInstrumentor` instrumentation, but then we also wrapped our database logic with a `MySQLInstrumentor` initialization. Hence, when an incoming request to our `/get` route is received:

1. The request hits **Flask** first, since Flask manages the routing for our application. Flask instrumentation initializes a **parent span** for this route.
2. Assuming the application launched with a successful database connection (the `Database SQL` node will not appear if this is not the case), the routing eventually reaches the `cursor.execute(query)` statement.
3. At this point in time, the MySQL instrumentation library initializes **its own child span**. Because the parent Flask trace is still active, the tracing library identifies that a **sub-call** to another resource is being made in this route, and that resource is represented by the MySQL trace that's just been initialized.

Therefore, a trace for the `/get` route will contain a **subsegment** representing the `SELECT` call we made to our MySQL database, which is interpreted by the service map as another node. You can also view the service map in **CloudWatch**, which is considered to be the new console for X-Ray - below is what the service map should look like at this point in time.

<img src="https://cdn.discordapp.com/attachments/427068629486534667/1164795477536014356/04_xray_trace_map_before_naming.png?ex=65448352&is=65320e52&hm=84f18ed5d6218ca5d5bdbd5ea4449f32999f711af46eaac72bc5c5106943b61e&" />
<br/><br/>

However, notice that the central node is currently named `unknown_service` - this is the node that represents our Flask application. `unknown_service` is the default name that OTEL gives to a service whose spans do not have a service name set. To name our service, we simply need to set the `OTEL_SERVICE_NAME` [**environment variable**](https://opentelemetry.io/docs/concepts/sdk-configuration/general-sdk-configuration/#otel_service_name) in our Flask app. This can be done in our task definition.

```json
...
      { "name": "DB_TABLE_NAME", "value": "country" },
      { "name": "OTEL_SERVICE_NAME", "value": "flask_app" }
    ],
    "secrets": [
...
```

After setting this, register the task definition again and run another instance of the task; you should see a new service map node with the `flask_app` name you've given to the application, instead of the default name `unknown_service`.

<img src="https://cdn.discordapp.com/attachments/427068629486534667/1164799358085115925/05_xray_trace_map_after_naming.png?ex=654486ef&is=653211ef&hm=7145c08bc21ae77d6c50075de86e47855711fe7a54fb63832d6a234c651415b8&" />
<br/><br/>
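*(One last practical note: if traces ever stop appearing in X-Ray, a quick way to check whether spans are reaching the collector at all is to temporarily add the `logging` exporter - renamed `debug` in newer collector versions - to the traces pipeline and watch the collector container's logs. A hedged sketch is below; remember to remove it once you're done debugging.)*

```yaml
exporters:
  ...
  awsxray:
  # Prints received telemetry to the collector's stdout, for debugging only.
  logging:

service:
  pipelines:
    ...
    traces:
      receivers: [ otlp ]
      exporters: [ awsxray, logging ]
```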
<img src="https://cdn.discordapp.com/attachments/427068629486534667/1164799358085115925/05_xray_trace_map_after_naming.png?ex=654486ef&is=653211ef&hm=7145c08bc21ae77d6c50075de86e47855711fe7a54fb63832d6a234c651415b8&" /> <br/><br/> ## 6. Where to next? Congratulations! You've just created a simple Flask application, set up a metrics endpoint, and configured the AWS Distro for OpenTelemetry to send metrics and traces to various backends in AWS. We've also discussed the various logging drivers available to you in ECS, and the reason why ADOT doesn't ship logging solutions with the collector. Hopefully, you've now developed a basic understanding of the 3 pillars of observability, and have the confidence necessary to start instrumenting your own applications. However, you may be wondering, **how do I expand my knowledge of OTEL from here?** There's a few things you can do, and we'll discuss what they are. ### Add another backend to your pipeline(s). A single pipeline can incorporate more than 1 receiver and exporter. As an example, try exporting your traces not just to X-Ray, but also to [**New Relic**](https://newrelic.com/). You'll need to set up an account with New Relic and take note of your API key. New Relic natively supports the OTLP protocol - use your API key to configure an [OTLP exporter](https://aws-otel.github.io/docs/components/otlp-exporter#new-relic)... ```yaml= exporters: prometheusremotewrite: endpoint: "<your-prometheus-remote-write-endpoint>" auth: authenticator: sigv4auth awsxray: # New Relic OTLP exporter configuration otlp: endpoint: otlp.nr-data.net:4317 headers: api-key: <YOUR_NEW_RELIC_LICENSE_KEY> ``` ... before specifying this exporter in your traces pipeline. ```yaml= service: pipelines: traces: receivers: [ otlp ] exporters: [ awsxray, otlp ] ``` Rebuild the collector and run the task definition again, and see your traces in the New Relic service map. Experiment with different backends and see what you can incorporate! ### Embrace a new kind of architecture. There are a few ways you can improve the architecture we launched in this tutorial. - **Separate the database out into its own tier.** We don't typically launch a database in the same task definition as the main application container, unless the database is only to hold ephemeral data, like a cache. Try launching the database in its own task definition, or incorporate [**Amazon RDS**](https://aws.amazon.com/rds/) for a managed database service. This allows our frontend application to scale independently of the database. - **Separate the ADOT collector into its own service.** The sidecar design pattern isn't the only way you can deploy OpenTelemetry. You can also deploy an **instrumentation service** with a single endpoint for your applications to send observability data to, allowing both the application and instrumentation tiers to scale independently of one another. [**Here**](https://aws.amazon.com/blogs/opensource/deployment-patterns-for-the-aws-distro-for-opentelemetry-collector-with-amazon-elastic-container-service/) is an AWS blog post discussing both the sidecar and service deployment patterns. **Thanks for reading!** *Written by [ForsakenIdol](https://forsakenidol.com/).*