Errors saving action execution logs to S3
Incident Report for Skeddly
Postmortem

The Problem

At 2018-08-10 03:58:03 UTC, the clock jumped forward to 2018-12-14 on one of our worker EC2 instances.

This caused a number of problems:

  • The action executions that were running on that EC2 instance were tagged with a December timestamp,
  • AWS API calls were denied with a “Forbidden” and/or “RequestTimeTooSkewed” error because of the time difference.

5 minutes later, at 2018-08-10 04:04:57, the clock shifted again, backwards this time, to 2018-10-12.

  • API requests were still denied,
  • New actions that executed on that EC2 instance were now tagged with an October timestamp.

6 minutes later, at 2018-08-10 04:10:51, the clock shifted back to the correct time.

  • API requests were still being denied because AWS credentials were cached in-memory with a December expiration date,
  • Action execution log parts were not being saved to S3

The Resolution

When the problem was discovered, all services were immediately stopped, which meant that:

  • All action executions were paused,
  • New action executions were delayed starting.

At this point in time, it was unclear what exactly happened and whether the problem was contained to a single EC2 instance or not. It was also unknown to be a clock issue. It was thought to be an IAM role permission issue due to the fact that we were seeing increased “Forbidden” responses from S3.

A few EC2 instance workers were inspected. Authentication tests performed against S3 were successful.

At this point, we determined that the problem was a temporary IAM role issue, cached in memory, and by stopping all services, the problem would be resolved when services were restarted.

Services were restarted. Actions executed correctly, and execution logs were being correctly saved to S3.

Digging Deeper

Upon further examination of our logs in Loggly, we noticed:

  • All the problems were coming from a single EC2 instance,
  • Those logs were timestamped in the future.

Even though that single EC2 instance seemed to be working correctly at the time, it was terminated and a replacement was launched by Auto Scaling.

Some Action Executions Suffered

Actions executing during the 1.5 hour time period may show gaps in the execution logs and/or incomplete logs.

The “active executions” administrator view (internal tool), showed 715 action executions “stuck in the future”. All but 85 of them were able to be brought back to the current time. The remaining 85 had to be cancelled on our end.

All affected action executions should have our SLA automatically applied to them. If some were missed, please contact us.

The Cause

At this time, it’s unclear whether this is an NTP issue, an EC2 hardware issue, an OS issue, or an application issue.

We’re working with AWS support to help determine what exactly happened and how it can be avoided in the future.

Posted 11 months ago. Aug 10, 2018 - 11:32 EDT

Resolved
At 12:15am EST, one of our action worker instances was unable to authenticate against Amazon S3. This caused the worker to be unable to upload partial action execution logs to the S3 bucket in which logs are stored. Aside from this issue, action executions were proceeding correctly.

After diagnosing the issue, at 1:30am EST the EC2 instance was terminated and a replacement was launched by Auto Scaling.

Actions are executing normally, and execution logs are being stored in S3 correctly.

Update 2:13am EST:

We have also found that the clock on the problem EC2 instance jumped forward in time to December 12, 2018. This may have helped cause the authentication failures with S3. The cause for the clock shift has not been determined.

Some action executions were able to be brought back to the present, but unfortunately, this has forced us to cancel some action executions which are "stuck" in the future. Our SLA will be applied to all affected action executions.
Posted 11 months ago. Aug 10, 2018 - 01:49 EDT
Monitoring
The problem EC2 instance has been terminated and a replacement is being launched. We are monitoring the situation.

Action execution is continuing normally. Action execution logs are being saved to S3.
Posted 11 months ago. Aug 10, 2018 - 01:40 EDT
Identified
We have identified a problem with a single EC2 instance saving action execution logs to S3. Otherwise, actions have been executing correctly.
Posted 11 months ago. Aug 10, 2018 - 01:30 EDT
Investigating
We are investigating an issue saving action execution logs to S3.
Posted 11 months ago. Aug 10, 2018 - 01:00 EDT
This incident affected: Action Infrastructure.