Action execution failures

Incident Report for Skeddly

Postmortem

The Problem

At 04:42 UTC on 2018-06-13, action executions started failing with an error:

The retrieved credentials have already expired: Now = 06/13/2018 04:45:05, Credentials expiration = 06/13/2018 04:42:30

This error message was coming from the AWS SDK that we use.

We noticed that the message was coming from a single EC2 instance only. Once that EC2 instance attempted to execute an action, the above error message would be seen and the action would fail.

However, the clock on the offending EC2 instance was correct.

Steps to Stop the Problem

In order to prevent the errors from continuing, the offending EC2 instance was terminated at 05:42. Indeed, this did stop the problem.

Digging Deeper

It's very hard to diagnose a terminated EC2 instance. However...

The program stack traces at the time of failure indicated the problem was happening when Skeddly attempted to retrieve AWS credentials from the EC2 instance metadata. The credentials were already expired when they were retrieved from the metadata.

Our worker EC2 instances use IAM Roles for EC2 Instances for AWS credentials. These credentials are used for a few purposes:

To "assume" to customer AWS IAM roles,
To upload logs to our S3 bucket, and
To upload and download intermediate reports during some of our report-generating actions.

Looking at the actions that failed, most were using IAM roles to allow Skeddly to access customer AWS accounts. So that would fit #1.

Our action execution logs are stored in S3. So parts of logs were unable to be uploaded from this EC2 instance. Bingo for #2.

The remaining actions that failed that were using IAM user access keys failed while attempting to upload or download an intermediate report from S3. That's #3.

AWS Support

After talking with AWS support, most likely the EC2 instance metadata was unreachable for some reason and the AWS SDK was caching the Skeddly credentials it received previously. Eventually those cached credentials expired.

Since the EC2 instance was terminated, we are unable to diagnose further, but they mentioned to running the following via PowerShell next time:

get-date;net time
ipconfig /all
route print

That may shed some light on the issue.

Hopefully we'll never need it.

Our SLA

We are applying our SLA to this incident. All action failures that occurred between 04:42 UTC and 05:40 UTC with an "unknown failure" will have the SLA applied.

Posted Jun 13, 2018 - 03:02 EDT

Resolved

The issue has been resolved. Action executions are executing normally again.

We are applying our SLA to affected action executions.

Posted Jun 13, 2018 - 02:31 EDT

Monitoring

After terminating the offending EC2 instance, the time problem has been resolved. We are monitoring to ensure the problem is fully resolved.

Posted Jun 13, 2018 - 02:07 EDT

Identified

We have identified a problem with the clock on one of Skeddly's action execution workers. The worker was terminated to stop the problem.

Posted Jun 13, 2018 - 01:45 EDT

Investigating

Action executions are ending prematurely. We are investigating.

Posted Jun 13, 2018 - 00:53 EDT

This incident affected: Action Infrastructure.