At 04:42 UTC on 2018-06-13, action executions started failing with an error:
The retrieved credentials have already expired: Now = 06/13/2018 04:45:05, Credentials expiration = 06/13/2018 04:42:30
This error message was coming from the AWS SDK that we use.
We noticed that the message was coming from a single EC2 instance only. Once that EC2 instance attempted to execute an action, the above error message would be seen and the action would fail.
However, the clock on the offending EC2 instance was correct.
In order to prevent the errors from continuing, the offending EC2 instance was terminated at 05:42. Indeed, this did stop the problem.
It's very hard to diagnose a terminated EC2 instance. However...
The program stack traces at the time of failure indicated the problem was happening when Skeddly attempted to retrieve AWS credentials from the EC2 instance metadata. The credentials were already expired when they were retrieved from the metadata.
Our worker EC2 instances use IAM Roles for EC2 Instances for AWS credentials. These credentials are used for a few purposes:
Looking at the actions that failed, most were using IAM roles to allow Skeddly to access customer AWS accounts. So that would fit #1.
Our action execution logs are stored in S3. So parts of logs were unable to be uploaded from this EC2 instance. Bingo for #2.
The remaining actions that failed that were using IAM user access keys failed while attempting to upload or download an intermediate report from S3. That's #3.
After talking with AWS support, most likely the EC2 instance metadata was unreachable for some reason and the AWS SDK was caching the Skeddly credentials it received previously. Eventually those cached credentials expired.
Since the EC2 instance was terminated, we are unable to diagnose further, but they mentioned to running the following via PowerShell next time:
get-date;net time
ipconfig /all
route print
That may shed some light on the issue.
Hopefully we'll never need it.
We are applying our SLA to this incident. All action failures that occurred between 04:42 UTC and 05:40 UTC with an "unknown failure" will have the SLA applied.