At 2018-08-10 03:58:03 UTC, the clock jumped forward to 2018-12-14 on one of our worker EC2 instances.
This caused a number of problems, detailed below.
About 7 minutes later, at 2018-08-10 04:04:57, the clock shifted again, this time backwards, to 2018-10-12.
6 minutes later, at 2018-08-10 04:10:51, the clock shifted back to the correct time.
When the problem was discovered, all services were immediately stopped, which meant that no new action executions were started until services were brought back up.
At this point it was unclear exactly what had happened, or whether the problem was contained to a single EC2 instance. We did not yet know it was a clock issue; because we were seeing an increased number of “Forbidden” responses from S3, we initially suspected an IAM role permission issue.
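For context: S3 rejects requests whose signature timestamp is more than roughly 15 minutes away from AWS’s own clock, and it does so with an HTTP 403 response, which looks very much like a permissions failure. A minimal sketch of how the two cases can be told apart, assuming boto3 and a hypothetical bucket name:

```python
import boto3
from botocore.exceptions import ClientError

def classify_s3_failure(bucket):
    """Try a lightweight S3 call and report why it failed, if it did.

    Both a genuine permissions problem and a badly skewed clock surface
    as HTTP 403 ("Forbidden"), but the error code differs: AccessDenied
    for IAM issues, RequestTimeTooSkewed when the request timestamp is
    more than ~15 minutes away from AWS's clock.
    """
    s3 = boto3.client("s3")
    try:
        s3.list_objects_v2(Bucket=bucket, MaxKeys=1)
        return "ok"
    except ClientError as err:
        code = err.response["Error"]["Code"]
        if code == "RequestTimeTooSkewed":
            return "clock skew"   # the local clock is wrong, not IAM
        if code == "AccessDenied":
            return "permissions"  # genuine IAM / bucket policy issue
        raise

# Hypothetical bucket name, for illustration only:
# print(classify_s3_failure("example-execution-logs"))
```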
A few worker EC2 instances were inspected, and authentication tests against S3 were successful.
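The exact tests aren’t reproduced here; a rough sketch of that kind of check, assuming boto3 and a hypothetical bucket name, is simply to ask STS which role the instance’s credentials belong to and then touch S3 with them:

```python
import boto3

# Confirm the instance can obtain valid credentials from its IAM role...
sts = boto3.client("sts")
print("Authenticated as:", sts.get_caller_identity()["Arn"])

# ...and that those credentials are accepted by S3 (bucket name is hypothetical).
s3 = boto3.client("s3")
s3.head_bucket(Bucket="example-execution-logs")  # raises ClientError on failure
print("S3 access OK")
```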
At this point, we concluded that the problem had been a temporary IAM role issue cached in memory, and that it would be resolved once all services were stopped and restarted.
Services were restarted. Actions executed correctly, and execution logs were once again being saved to S3 as expected.
Upon further examination of our logs in Loggly, we noticed that the clock jumps described above were confined to a single EC2 instance.
Even though that single EC2 instance seemed to be working correctly at the time, it was terminated and a replacement was launched by Auto Scaling.
Actions that executed during the roughly 1.5-hour incident window may have gaps in their execution logs, or incomplete logs.
The “active executions” administrator view (an internal tool) showed 715 action executions “stuck in the future”. All but 85 of them were brought back to the current time; the remaining 85 had to be cancelled on our end.
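The administrator view itself is internal and not shown here, but the underlying check is straightforward: an execution is “stuck in the future” if its recorded start time is later than the current time. A rough sketch, assuming hypothetical execution records with a timezone-aware started_at field:

```python
from datetime import datetime, timedelta, timezone

def find_future_executions(executions, tolerance=timedelta(minutes=5)):
    """Return executions whose recorded start time lies in the future.

    `executions` is assumed to be an iterable of dicts with a
    timezone-aware `started_at` datetime. The small tolerance avoids
    flagging ordinary clock jitter between machines.
    """
    now = datetime.now(timezone.utc)
    return [e for e in executions if e["started_at"] > now + tolerance]

# Hypothetical usage, mirroring the manual clean-up described above:
# for execution in find_future_executions(load_active_executions()):
#     reset_to_now(execution)   # or cancel it if it cannot be recovered
```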
All affected action executions should have our SLA automatically applied to them. If any were missed, please contact us.
At this time, it’s unclear whether this is an NTP issue, an EC2 hardware issue, an OS issue, or an application issue.
We’re working with AWS support to help determine what exactly happened and how it can be avoided in the future.
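Independent of the root cause, one way to catch this class of problem earlier is to monitor clock offset directly and alert when it drifts. A minimal sketch, for illustration only, using the third-party ntplib package (not something described in this report):

```python
import ntplib  # third-party package: pip install ntplib

def clock_offset_seconds(server="pool.ntp.org"):
    """Return the offset between the local clock and an NTP server, in seconds."""
    response = ntplib.NTPClient().request(server, version=3)
    return response.offset

# Hypothetical alerting rule: warn if the clock is more than 30 seconds off.
if __name__ == "__main__":
    offset = clock_offset_seconds()
    if abs(offset) > 30:
        print(f"ALERT: local clock is off by {offset:.1f} seconds")
    else:
        print(f"Clock offset OK: {offset:.3f} seconds")
```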