# ECS Task Groups - Changing the Design Philosophy

*2025-05-15 update - Amazon ECS released the [**availability zone rebalancing feature**](https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-ecs-az-rebalancing-speeds-mean-time-recovery-event/) back in November 2024. This solves the AZ distribution issue after an AZ outage event, but it requires that you enable this feature on your ECS services. The issue this article describes now has official mitigations, but I will leave this up for historic reasons.*

*This article was originally posted on 2024-10-24.*

---

When you create an ECS service that runs on [**EC2 container instances**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-capacity.html), you can control where ECS places tasks on those instances. To do this, you must define a [**task placement strategy**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-strategies.html). There are 3 distinct placement strategies to choose from.

- `binpack`: Tasks are placed so as to consume as much of one instance's compute (specifically, `cpu` or `memory`) as possible before using the next. Once an instance is "full", in that it cannot accept any further tasks, the ECS scheduler moves on to the next container instance and starts filling that one up.
- `random`: Tasks are placed randomly.
- `spread`: Select a property of the container instances, and spread the tasks out evenly such that they cover as many values of that property as possible. Some examples of `spread` properties are the instances' **Availability Zone** or **Instance ID**.

## Background

In ECS, there exists the concept of a [**task group**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-groups.html). Task groups serve 2 main purposes.

- The `memberOf` [**task placement constraint**](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/constraint-examples.html) accepts the name of a task group (e.g. `group-1`) for task scheduling. When specified, this constraint tells the ECS service to only place an incoming task on container instances that are already running tasks that belong to `group-1`.
- When the `spread` task placement strategy is used, the strategy applies to tasks at the task group level. The task group therefore also determines which tasks will be spread **relative to each other**.

## The Problem

The default task group an ECS task belongs to currently depends on how that task was launched.

1. Tasks that are launched **standalone** use the **task definition family name** (for example, `family:my-task-definition`) as their task group name. However, a custom task group name can be specified during standalone task launch by using the `group` field of the `ecs:RunTask` API call - see the reference to the field [**here**](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-group), and the sketch after this list.
2. Tasks that are launched as part of a service, however, use the **service name** as the task group name. This **cannot** be changed.

Case #2 is where our problems arise: after disaster recovery or container instance maintenance, a service can fail to spread its tasks evenly across container instances.
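To make the distinction concrete, here is a minimal sketch of overriding the task group for a standalone task with the AWS CLI. The cluster and task definition names are placeholders; the `--group` flag maps to the `group` field of `ecs:RunTask`.

```sh
# Standalone launch: the task group can be overridden at launch time.
# Without --group, the group would default to "family:<family-name>".
aws ecs run-task \
  --cluster <cluster-name> \
  --task-definition <task-definition-family> \
  --group my-custom-group

# Service-managed tasks have no equivalent knob: their group is
# always "service:<service-name>".
```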
## An Example

The problem above is best understood through the following steps:

1. Create an ECS cluster with 2 container instances (CIs), each in a different Availability Zone (AZ).
2. Create an ECS service with 2 long-lived tasks. The default task placement strategy for an ECS service is to **spread the tasks across Availability Zones** as evenly as possible, so each task lands on a different CI. Use the service's default [**health thresholds**](https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/) of minimum: 100%, maximum: 200%.

![01 Initial Steady State](https://hackmd.io/_uploads/r1Hegvmekl.png)

3. Terminate and deregister one of the CIs from the cluster. This simulates AZ failure (and the subsequent user action to remove the affected CI from scheduling further tasks), but also simulates maintenance or recycling of that CI. The task that was running on that CI will be rescheduled onto the remaining CI.

![02 Zone B is Lost](https://hackmd.io/_uploads/ByCWgP7gye.png)

4. Spin up another CI in the AZ that the deregistered CI was running in. This returns our cluster's CI infrastructure to its original state, but the previously-rescheduled task will not migrate over to the newly-created CI, as this would constitute an unnecessary disruption. At present, one CI has 2 tasks, while the other CI has none.

![03 Zone B is Reinstated](https://hackmd.io/_uploads/Bkt7gPmxkg.png)

5. Force a new deployment on the ECS service. Observe that both of the new tasks are spun up **on the CI that previously was not running any tasks**. Once the tasks from the old deployment are stopped, both of the new tasks are running on **a single container instance**, and therefore in **a single Availability Zone**.

![04 New Deployment Triggered](https://hackmd.io/_uploads/rycVlPXx1x.png)

![05 New Deployment Complete](https://hackmd.io/_uploads/SkcVeDQlyl.png)

## Replicating the Example

Let's apply the steps above to a real cluster to see this in action. The steps here are for the `ap-southeast-2` region, but feel free to change this to your desired region.

*If you don't want to replicate this example, you can skip straight to the next section, where we discuss why this is happening.*

### Step 1 - Create the Infrastructure

Create the ECS cluster.

```sh
aws ecs create-cluster --cluster-name test-cluster-1
```

Get the recommended image ID for the ECS-optimized AMI.

```sh
IMAGE_ID=$(aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/recommended --region ap-southeast-2 | jq ".Parameters[0].Value" | jq -r ".|fromjson.image_id")
```

Container instances need to bootstrap using user data. Put the following in a file called `userdata.sh`.

```sh
#!/bin/bash
cat << 'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=test-cluster-1
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
EOF
```

Create skeletons for the `run-instances` command parameters - this will make launching the instances much easier. Create 2 files - `01-run-first-instance.json` and `02-run-second-instance.json` - with the only difference between these files being the `SubnetId` field. Select 2 subnets in different availability zones.

```json
{
    "InstanceType": "t3.medium",
    "Monitoring": {
        "Enabled": true
    },
    "SecurityGroupIds": [
        "<container-instance-sg>"
    ],
    "SubnetId": "<subnet-id>",
    "IamInstanceProfile": {
        "Arn": "<container-instance-iam-profile>"
    },
    "MetadataOptions": {
        "HttpTokens": "required"
    }
}
```

Let's launch these instances.

```sh
INSTANCE_ID_1=$(aws ec2 run-instances --cli-input-json file://01-run-first-instance.json --image-id $IMAGE_ID --user-data file://userdata.sh | jq -r ".Instances[0].InstanceId")

INSTANCE_ID_2=$(aws ec2 run-instances --cli-input-json file://02-run-second-instance.json --image-id $IMAGE_ID --user-data file://userdata.sh | jq -r ".Instances[0].InstanceId")
```
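Before moving on, it may help to confirm that both instances have bootstrapped and registered with the cluster. A minimal check, assuming the cluster name from above:

```sh
# Expect a count of 2 once both instances have joined the cluster
# (bootstrapping can take a minute or two).
aws ecs describe-clusters --clusters test-cluster-1 | jq ".clusters[0].registeredContainerInstancesCount"
```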
### Step 2 - Create the Service

Let's create a simple task definition, which we'll use to create a service that will run on the EC2 container instances we just launched. Create the `example-webserver-td.json` file, substituting your own account ID in the execution role ARN.

```json
{
    "family": "example-webserver",
    "containerDefinitions": [
        {
            "name": "sample-app",
            "image": "httpd:2.4",
            "portMappings": [
                {
                    "containerPort": 80,
                    "hostPort": 0,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "entryPoint": [
                "sh",
                "-c"
            ],
            "command": [
                "/bin/sh -c \"echo '<html> <head> <title>Amazon ECS Sample App</title> <style>body {margin-top: 40px; background-color: #333;} </style> </head><body> <div style=color:white;text-align:center> <h1>Amazon ECS Sample App</h1> <h2>Congratulations!</h2> <p>Your application is now running on a container in Amazon ECS.</p> </div></body></html>' > /usr/local/apache2/htdocs/index.html && httpd-foreground\""
            ]
        }
    ],
    "executionRoleArn": "arn:aws:iam::<account-id>:role/ecsTaskExecutionRole",
    "networkMode": "bridge",
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "256",
    "memory": "512",
    "runtimePlatform": {
        "operatingSystemFamily": "LINUX"
    }
}
```

Register the task definition.

```sh
TASK_DEFINITION_ARN=$(aws ecs register-task-definition --cli-input-json file://example-webserver-td.json | jq -r ".taskDefinition.taskDefinitionArn")
```

Create a service with 2 tasks from this task definition, using the `03-create-service.json` file below.

```json
{
    "cluster": "test-cluster-1",
    "serviceName": "test-service-1",
    "taskDefinition": "example-webserver",
    "desiredCount": 2,
    "launchType": "EC2"
}
```

```sh
SERVICE_NAME=$(aws ecs create-service --cli-input-json file://03-create-service.json | jq -r ".service.serviceName")
```

Observe that each task starts on a different container instance.

### Step 3 - Terminate one of the Instances

```sh
aws ec2 terminate-instances --instance-ids $INSTANCE_ID_2
```

Give the service a few minutes to stabilise and observe that the task previously running on the now-terminated instance has been rescheduled onto instance #1.

### Step 4 - Start another Instance

```sh
INSTANCE_ID_2=$(aws ec2 run-instances --cli-input-json file://02-run-second-instance.json --image-id $IMAGE_ID --user-data file://userdata.sh | jq -r ".Instances[0].InstanceId")
```

The second instance is empty.

### Step 5 - Force a new Service Deployment

```sh
aws ecs update-service --cluster test-cluster-1 --service $SERVICE_NAME --force-new-deployment
```

Observe that both of the new tasks are scheduled onto the second container instance. The first container instance (that was never disrupted) is empty.
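If you want to verify the placement from the CLI rather than the console, here is a minimal sketch, assuming `jq` and the shell variables from the earlier steps:

```sh
# List the service's running tasks, then show which container
# instance each one was placed on.
TASK_ARNS=$(aws ecs list-tasks --cluster test-cluster-1 --service-name $SERVICE_NAME | jq -r ".taskArns[]")

# Both tasks should report the same container instance ARN.
aws ecs describe-tasks --cluster test-cluster-1 --tasks $TASK_ARNS | jq -r ".tasks[].containerInstanceArn"
```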
## What is Happening?

Had we not performed step #3 - that is, had we never encountered a situation where one of the CIs had to be removed from the cluster - step #5 would have placed the new tasks correctly across both container instances. This is because the tasks from the old deployment would still have the correct spread, with one task in each AZ.

However, once we perform step #3, the problem arises: ECS spreads the tasks from the new deployment alongside the tasks from the old deployment, because both deployments share the same task group. ECS makes no distinction between the two deployments in the service, and is spreading incoming tasks against old tasks that **will be spun down** once the new tasks are running.

When a new deployment is triggered, ECS sees 2 tasks in 1 AZ (from the old deployment), and opts to spin up the 2 new tasks (from the new deployment) on the container instance running **in the other AZ** in order to best meet the `spread` placement strategy. However, as we previously saw, the final result of triggering a new deployment does not actually meet the `spread` placement requirement - both of the new tasks run on the **same** container instance.

## How can ECS fix this?

There are several ways that customers can fix this issue by themselves, but there is also a way that the ECS service can rework where the task group rolls up to with respect to ECS services. Let's discuss the latter first.

ECS can make task groups roll up to the **deployment level**, instead of the service level, to fix the `spread` placement strategy. This prevents tasks belonging to one service deployment from being spread against tasks from a different service deployment, because during the service's steady state, **only one deployment can exist at a time**.

The effects of the above recommendation also extend to the task group's other role, as a target of the `memberOf` placement constraint. If the service name is specified as the task group name in this constraint, ECS can default to the service's most recent deployment, and place incoming tasks on container instances that run service tasks from that particular deployment.

## How can customers fix this?

Customers can take any of the following actions to ensure that their tasks continue to be spread evenly after a service update.

- **Set the service health thresholds to `minimum: any value less than 100%`, and `maximum: 100%`** (see the sketch after this list). This tells the ECS service to spin down old tasks before launching new ones. Once a new task is spun up, the old task it replaces is already gone (as it's been stopped), and so the new task will not be spread against that task.
- **After deployment, scale the service out and back in.** As we previously observed during replication, the end result of the new deployment potentially has both tasks running on the same container instance, and therefore in the same availability zone. The task spread can be fixed by scaling the service out to twice its desired count (`4` in our example), then back down again (to `2` in our case). Scaling the service out allows the new tasks to be spread correctly while keeping all tasks in the same deployment, and when scaling in the service, [ECS selects tasks to terminate that maintain a balance across Availability Zones. Within an AZ, tasks are selected at random](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-strategies.html).
- **Instead of triggering a new deployment, create a new service.** This method adopts a blue-green approach, with traffic cutting over at either the load balancer level, or the DNS record level (if 2 different load balancers are created).
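Here is a minimal sketch of the first mitigation, applied to the service from the walkthrough. With a desired count of 2, `minimumHealthyPercent=50` allows ECS to stop one old task before starting its replacement, while `maximumPercent=100` prevents it from starting the replacement first (the exact minimum percentage is an example; any value below 100% works):

```sh
# Cap the deployment at 100% of the desired count and allow a
# temporary dip to 50%, so old tasks stop before new ones are placed.
aws ecs update-service \
  --cluster test-cluster-1 \
  --service $SERVICE_NAME \
  --deployment-configuration "minimumHealthyPercent=50,maximumPercent=100"
```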
Each of these mitigations comes with its own problems. Lowering the service health thresholds can cause a decrease in service availability, while the other 2 methods (scale out/in and blue/green deployments) require additional work either after the deployment is triggered, or during the actual deployment itself.

Essentially, in order to fortify their workloads against AZ failures or infrastructure maintenance, customers are required to take additional steps to correct an aspect of service design that would be better changed on the side of ECS.

## Conclusion

It is important for ECS not to spread one set of tasks against another set of tasks that are expected to be brought down after a deployment has successfully rolled out. Given any pair of tasks from the same ECS service, **the two tasks should not be compared to one another for scheduling purposes if they cannot run concurrently when the parent service is in a steady state**.

This issue arises because task groups roll up to the ECS service level instead of the ECS deployment level, which, as we've explored here, would make more logical sense. There are changes that AWS can make at the ECS service level, and there are also ways that customers can mitigate this issue themselves.

---