Configure ECS Task Failure Alerts | by Ross Rhodes | Sep, 2022


Set up execution failure notifications using Amazon EventBridge

Photo by Ales Krivec on Unsplash

A popular service in Amazon Web Services (AWS), the Elastic Container Service (ECS) allows us to run containerised applications in the cloud. At a high level, ECS consists of three primary resources:

  1. task definitions to run one or more containers
  2. services that execute one or more instances of task definitions
  3. clusters which group services and tasks together
ECS diagram by author

If we want to host an application within ECS, we may provide a service within a cluster — especially if we desire high availability. This is a particularly common pattern for ECS. However, we can take different approaches, especially when we want to run a task on an infrequent basis. In such cases, we may consider scheduled ECS tasks. Invoked by Amazon EventBridge rules, scheduled tasks are not managed by services.

ECS diagram by author

Following this scheduled task pattern, we lose out on many Amazon CloudWatch metrics available out of the box that are tailored specifically to ECS services. Furthermore, there’s no predefined metric for the count of task execution failures, so how do we configure alerts for such events?

Let’s first explain how we may configure alerts for tasks that fail within a service, then discuss why that approach is suboptimal and why it doesn’t apply to scheduled tasks. Finally, let’s propose a better solution with a working example using the AWS Cloud Development Kit (CDK), one which applies to all tasks, irrespective of whether they belong to a service.

When a task runs within a service, we use all three aforementioned resource types, since tasks and services must be grouped within a cluster. Default CloudWatch metrics for ECS rely upon two dimensions: ClusterName and ServiceName. Whenever we want to configure alerts on a service task, we use these dimensions to make sure alerts are tailored to the correct task instances.

Usually, a service will host one or more healthy tasks (although the desired count can be set to 0). To monitor this, we can derive the number of running tasks from the SampleCount statistic of the pre-existing CPUUtilization and MemoryUtilization metrics, since every running task reports one sample per period. If the number of running tasks falls short of expectations, that’s a good indication something is wrong, and we may set alarms to alert us in such cases. Keep in mind that during a deployment update, new tasks are provisioned before old ones are torn down, so we wouldn’t expect the total number of service tasks to drop below the usual number at any stage.
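As a sketch of this service-level approach, the alarm below watches the running-task count via SampleCount. This assumes CDK v2 TypeScript inside a Stack class; the cluster and service names, threshold, and periods are placeholders to adjust for your own service.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Each running task emits one CPUUtilization sample per period, so the
// SampleCount statistic approximates the number of running tasks.
const runningTasks = new cloudwatch.Metric({
  namespace: 'AWS/ECS',
  metricName: 'CPUUtilization',
  dimensionsMap: { ClusterName: 'my-cluster', ServiceName: 'my-service' },
  statistic: 'SampleCount',
  period: cdk.Duration.minutes(1),
});

// Alarm when fewer tasks are running than the service's desired count.
new cloudwatch.Alarm(this, 'RunningTaskCountAlarm', {
  metric: runningTasks,
  threshold: 2, // hypothetical desired count
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 3,
});
```

Note the evaluation periods here: the alarm only fires after three consecutive breaching minutes, which is exactly the delay problem discussed next.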

Whilst this is useful for services, we cannot apply the same solution to scheduled tasks. This is because scheduled tasks do not belong to a service, and by definition, they do not have a stable running count — we expect this to oscillate over time depending on how often we invoke these tasks and how long they run.

Furthermore, there are delays with this alerting strategy, since CloudWatch waits until the configured evaluation periods have passed before raising the alarm. What if we want to reduce the time between a failure occurring and an alert being raised?

Turning to Amazon EventBridge: not only does this service invoke scheduled tasks, it can also react to task state change events. When a container within a task terminates, ECS emits a task state change event. We can use EventBridge to detect these events and trigger downstream actions when an event matches a given pattern. For example, we may configure a new rule within EventBridge to detect when a container stops and reports a common failure exit code: 1, 137, 139, or 255.

An event pattern to filter for these criteria requires several details:

  • the ECS cluster Amazon Resource Name (ARN)
  • the task definition ARN
  • the reason the task stopped (Essential container in task exited)
  • the last task status matching STOPPED
  • one or more of the aforementioned common failure exit codes

With these details provided, an EventBridge pattern looks like the following:

EventBridge console screenshot by author
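In JSON form, an equivalent pattern might look like the following sketch. The account ID, region, and cluster and task definition ARNs are placeholders to substitute with your own.

```typescript
// An ECS task state change event pattern matching failed task executions.
// The ARNs below are hypothetical placeholders.
const eventPattern = {
  source: ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  detail: {
    clusterArn: ["arn:aws:ecs:eu-west-1:123456789012:cluster/my-cluster"],
    taskDefinitionArn: [
      "arn:aws:ecs:eu-west-1:123456789012:task-definition/my-task:1",
    ],
    lastStatus: ["STOPPED"],
    stoppedReason: ["Essential container in task exited"],
    containers: {
      exitCode: [1, 137, 139, 255],
    },
  },
};
```

EventBridge treats each array as an OR over acceptable values, so a single stopped-task event matches if any listed exit code appears on one of its containers.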

The newly configured rule may then take steps to notify us of task state changes using EventBridge targets. This can take many forms, like email notifications via the Simple Notification Service (SNS), or API destinations to send RESTful API requests.

Note that if this rule were to be applied to a task within a service, it would not detect tasks terminated during a deployment update. In such cases, the stopped reason says “Scaling activity initiated by deployment,” which does not match our event pattern. Therefore, it’s safe to apply this without fear of alerts on every task definition update.

Given the above suggestion, let’s demonstrate a working alert mechanism that we shall build and deploy with CDK. We’ll provision a scheduled ECS task with an EventBridge rule to notify us by email whenever a task execution fails with one of the common exit codes. The full solution is available on GitHub, but we’ll walk through each section here and explain in further detail how it all hangs together.

Architecture diagram by author

We start with an ECS cluster. This is required for all tasks, irrespective of whether they belong to a service. We’ll provision a Fargate task for this example, which requires us to set enableFargateCapacityProviders on the cluster.
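A minimal sketch of the cluster, assuming CDK v2 TypeScript inside a Stack class (construct IDs here are illustrative):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// A cluster is required even for scheduled tasks with no service.
const cluster = new ecs.Cluster(this, 'Cluster', {
  enableFargateCapacityProviders: true,
});
```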

Next, we define the task definition. Given we’re using the Fargate launch type, we require a task execution role. This role is assumed by ECS to pull container images and write logs on our behalf, and it receives the required permissions from the AWS-managed AmazonECSTaskExecutionRolePolicy.
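A sketch of the execution role and task definition, again assuming CDK v2 inside a Stack; the CPU and memory values are illustrative defaults:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as iam from 'aws-cdk-lib/aws-iam';

// Execution role assumed by ECS to pull images and write logs.
const executionRole = new iam.Role(this, 'ExecutionRole', {
  assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName(
      'service-role/AmazonECSTaskExecutionRolePolicy',
    ),
  ],
});

const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDefinition', {
  cpu: 256,
  memoryLimitMiB: 512,
  executionRole,
});
```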

Now we require a container to host our application within the task. Assuming we have a Dockerfile and application code already exists within the current directory (see the GitHub repository for an example), we provision a new Docker Image Asset under this directory, and pass it to the addContainer method. The log driver for AWS CloudWatch is optional here, but useful for reviewing any logs produced by the application.
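The container step might look like the sketch below, assuming the task definition from the previous step is held in a variable named taskDefinition and the Dockerfile sits one directory up from the CDK source:

```typescript
import * as path from 'path';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecrAssets from 'aws-cdk-lib/aws-ecr-assets';

// Builds the Dockerfile in the given directory into an image asset,
// pushed to ECR automatically on deployment.
const image = new ecrAssets.DockerImageAsset(this, 'Image', {
  directory: path.join(__dirname, '..'),
});

taskDefinition.addContainer('AppContainer', {
  image: ecs.ContainerImage.fromDockerImageAsset(image),
  // Optional, but useful for reviewing application logs in CloudWatch.
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'scheduled-task' }),
});
```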

With all the ECS infrastructure provisioned, let’s configure an EventBridge rule to invoke the task. CDK offers this functionality out of the box for us (❤ CDK). For the sake of easier testing, our task is configured to launch every minute using the latest Fargate platform version.
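The schedule rule can be sketched as follows, assuming the cluster and task definition from the earlier steps are in scope as cluster and taskDefinition:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';

// Launch the Fargate task every minute for easy testing.
new events.Rule(this, 'ScheduleRule', {
  schedule: events.Schedule.rate(cdk.Duration.minutes(1)),
  targets: [
    new targets.EcsTask({
      cluster,
      taskDefinition,
      platformVersion: ecs.FargatePlatformVersion.LATEST,
    }),
  ],
});
```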

Now we want to replicate the EventBridge task pattern shared earlier for the detection of execution failure events. This is a little more cumbersome — a comment is given to justify each of the listed failure exit codes.
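In CDK, that failure-detection rule might be sketched like this, again assuming cluster and taskDefinition variables from the earlier steps:

```typescript
import * as events from 'aws-cdk-lib/aws-events';

const taskFailureRule = new events.Rule(this, 'TaskFailureRule', {
  eventPattern: {
    source: ['aws.ecs'],
    detailType: ['ECS Task State Change'],
    detail: {
      clusterArn: [cluster.clusterArn],
      taskDefinitionArn: [taskDefinition.taskDefinitionArn],
      lastStatus: ['STOPPED'],
      stoppedReason: ['Essential container in task exited'],
      containers: {
        exitCode: [
          1,   // general application error
          137, // killed by SIGKILL, commonly out-of-memory
          139, // segmentation fault
          255, // entrypoint exited with a failure status
        ],
      },
    },
  },
});
```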

This leaves us to decide how best to be notified of task failures. For this example, let’s submit matching events to an SNS topic and set up an email subscription for notifications. The recipient address may be set using the ALERT_EMAIL_ADDRESS environment variable.
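A sketch of the notification wiring, assuming the failure-detection rule described above is held in a variable named taskFailureRule:

```typescript
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as targets from 'aws-cdk-lib/aws-events-targets';

// Matching failure events are published to SNS and forwarded by email.
const alertTopic = new sns.Topic(this, 'TaskFailureTopic');
alertTopic.addSubscription(
  new subscriptions.EmailSubscription(process.env.ALERT_EMAIL_ADDRESS!),
);
taskFailureRule.addTarget(new targets.SnsTopic(alertTopic));
```

Note that SNS email subscriptions must be confirmed: the recipient receives a confirmation email after deployment and won’t get alerts until they accept it.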

That brings our CDK solution to a close. After we deploy these resources to AWS, our scheduled task should start invoking every minute with immediate effect. Whenever the task fails during execution with a matching exit code, the configured email recipient receives a notification of the failure.

Note that this setup should not be left running indefinitely — otherwise, a task will spin up every minute and likely spam our email inboxes with failure notification messages. Remember to tear this down when it’s no longer required, or at least disable the EventBridge invocation rule.

Alerts for ECS tasks are something I’ve considered a couple of times during the past few years working with AWS. Since I recently revisited this issue, it felt like a worthwhile topic to share with you all.

I hope this serves as a useful article for anyone working with ECS. Do let me know if you follow similar patterns — or take different strategies altogether! It would be great to hear how others configure their task alerts.


