Issues with front-end and action executions

Incident Report for Skeddly

Postmortem

Between 12:45 PM EST and 5:08 PM EST, Amazon S3 experienced a significant outage in us-east-1. API calls to S3 for reads and writes failed. As a result, many AWS services were affected.

Some other AWS services that were affected:

Elastic Load Balancing
Elastic Beanstalk
RDS
EBS snapshots
AMI images
Simple Email Service
Many others, totalling over 45 distinct AWS services.

We first encountered issues reading from and writing to S3. In addition, our internal API load balancers, which use Amazon Elastic Load Balancing, dropped HTTPS requests. Together, these issues affected our front-end UI.

The front-end issues were short-lived, and the UI recovered quickly.

However, the S3 issues continued to affect action executions. Logs from action executions are stored in Amazon S3. Because API calls to S3 failed or timed out, action executions were delayed. At its peak, the delay reached 91 minutes.
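One general way to keep a slow or failing log store from holding up the work that produces the logs is to give each log write a deadline and spool the log locally if the deadline passes. The following is a minimal sketch of that idea only. It does not describe Skeddly's implementation, and every name in it (save_log_with_deadline, upload, fallback, timeout_s) is hypothetical. The upload callable stands in for whatever performs the S3 write.

```python
import concurrent.futures
import time


def save_log_with_deadline(upload, payload, timeout_s, fallback):
    """Try upload(payload) with a deadline (hypothetical helper).

    If the upload raises or does not finish within timeout_s seconds,
    hand the payload to fallback (for example, a local spool) and return
    False so the caller can carry on. Returns True if the upload
    finished in time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(upload, payload)
    try:
        future.result(timeout=timeout_s)
        return True
    except Exception:
        # Covers both a timeout and an error raised by the upload itself.
        fallback(payload)
        return False
    finally:
        # Do not wait for a stuck upload; the worker thread finishes
        # (or fails) in the background.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    spool = []  # stand-in for a local retry queue

    def slow_upload(payload):
        time.sleep(0.5)  # simulates a storage call that hangs

    saved = save_log_with_deadline(slow_upload, "action log", 0.1, spool.append)
    print(saved, spool)
```

A real system would also need to retry the spooled logs once the store recovers, and to handle an upload that eventually succeeds after its log was already spooled. Both are left out of this sketch.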

The net result of this AWS outage is as follows:

Actions executed, but action executions were delayed during the affected time window. Our SLA was automatically applied as appropriate.
Actions that made use of affected AWS services in us-east-1 failed. For example, an action that created EBS snapshots or AMI images would have failed.
Actions that executed during the affected time window will not have full logs available because the logs could not be saved to S3.

For many, this outage, like all outages, is a learning opportunity. We will apply what we learned so that Skeddly weathers future issues and outages even better.

Posted Feb 28, 2017 - 18:18 EST

Resolved

AWS is reporting the S3 issues have been resolved.

http://status.aws.amazon.com/

Posted Feb 28, 2017 - 17:50 EST

Monitoring

We are noticing that writing logs to S3 has resumed and action execution delays have been eliminated.

AWS still lists S3 and other services as having issues on their status page.

http://status.aws.amazon.com/

We will continue to monitor the situation.

Posted Feb 28, 2017 - 17:06 EST

Update

AWS reports that the issues with S3 are partially resolved.

We are starting to see reductions in the delays of action executions as a result.

Posted Feb 28, 2017 - 16:03 EST

Update

Amazon has updated the status page with the services affected by the outage.

http://status.aws.amazon.com/

As of this writing, 24 AWS services are affected, including S3, EC2, RDS, Elastic Beanstalk, CloudWatch, ELB, and SES.

Posted Feb 28, 2017 - 14:53 EST

Update

The Skeddly front-end is functioning again. Action execution logs are stored in S3, so they cannot be retrieved at this time.

Actions are executing, but they are behind schedule.

Posted Feb 28, 2017 - 13:52 EST

Identified

Amazon Web Services has updated its status page with the current status:

http://status.aws.amazon.com/

Posted Feb 28, 2017 - 13:02 EST

Update

AWS is currently experiencing issues in us-east-1 with S3, EC2, and ELB.

Posted Feb 28, 2017 - 12:57 EST

Investigating

We are investigating issues accessing the Skeddly front-end and action executions.

Posted Feb 28, 2017 - 12:50 EST