The Long Tail of the AWS Outage


A sprawling Amazon Web Services cloud outage that began early Monday morning illustrated the fragile interdependencies of the internet as major communication, financial, health care, education, and government platforms around the world suffered disruptions. As the day wore on, AWS diagnosed and began working to correct the issue, which stemmed from the company’s critical US-EAST-1 region based in northern Virginia. But the cascade of impacts took time to fully resolve.

Researchers reflecting on the incident particularly highlighted the length of the outage, which started around 3 am ET on Monday, October 20. AWS said in status updates that by 6:01 pm ET on Monday “all AWS services returned to normal operations.” The outage directly stemmed from Amazon’s DynamoDB database application programming interfaces and, according to the company, “impacted” 141 other AWS services. Multiple network engineers and infrastructure specialists emphasized to WIRED that errors are understandable and inevitable for so-called “hyperscalers” like AWS, Microsoft Azure, and Google Cloud Platform, given their complexity and sheer size. But they noted, too, that this reality shouldn’t simply absolve cloud providers when they have prolonged downtime.

“The word hindsight is key. It’s easy to find out what went wrong after the fact, but the overall reliability of AWS shows how difficult it is to prevent every failure,” says Ira Winkler, chief information security officer of the reliability and cybersecurity firm CYE. “Ideally, this will be a lesson learned, and Amazon will implement more redundancies that would prevent a disaster like this from happening in the future—or at least prevent them staying down as long as they did.”

AWS did not respond to questions from WIRED about the long tail of the recovery for customers. An AWS spokesperson says the company plans to publish one of its “post-event summaries” about the incident.

“I don’t think this was just a ‘stuff happens’ outage. I would have expected a full remediation much faster,” says Jake Williams, vice president of research and development at Hunter Strategy. “To give them their due, cascading failures aren’t something that they get a lot of experience working with because they don’t have outages very often. So that’s to their credit. But it’s really easy to get into the mindset of giving these companies a pass, and we shouldn’t forget that they create this situation by actively trying to attract ever more customers to their infrastructure. Clients don’t control whether they are overextending themselves or what they may have going on financially.”

The incident was caused by a familiar culprit in web outages: "domain name system," or DNS, resolution issues. DNS is essentially the internet's phonebook, translating the domain names people and applications use into the numeric IP addresses that actually identify servers. Because nearly every request begins with one of these lookups, a resolution failure can keep requests from completing and content from loading at all.
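
To see why a failed lookup stalls everything downstream, consider the minimal Python sketch below. It is an illustration only, not AWS's actual resolution path, and the hostname in it is hypothetical: a client must first translate a name into an IP address, and if that step fails, the request never even starts.

    import socket

    # Hypothetical endpoint name, used here purely for illustration.
    hostname = "dynamodb.us-east-1.example.com"

    try:
        # Step 1: DNS resolution, the "phonebook" lookup that turns the
        # name into a numeric IP address the client can connect to.
        ip_address = socket.gethostbyname(hostname)
        print(f"{hostname} resolved to {ip_address}; the request can proceed.")
    except socket.gaierror as error:
        # If resolution fails, there is no address to connect to, so the
        # request fails before any data is ever sent to the service.
        print(f"DNS resolution failed for {hostname}: {error}")

Run against a name that will not resolve, the sketch hits the error branch, which is roughly the position countless clients were in on Monday while the DNS issues around the DynamoDB endpoints persisted.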

“Cloud computing is a marvel, but the heart of it is a never-ending list of complex services and dependencies that are always one configuration away from failure,” says Mark St. John, chief operating officer and cofounder of the systems security startup Neon Cyber.

Speaking generally about hyperscalers, St. John echoed Williams’ point that in exchange for the mature architecture and secure baseline of cloud platforms, customers cede control of their underlying digital infrastructure and the extent to which their cloud provider is or isn’t investing in resilience and contingency planning at a given time. “At a certain scale, operational validation for service providers can’t be a casualty of cost-cutting,” St. John says.

Thinking specifically about Monday’s outage, one senior network architect at a major tech company, who requested anonymity because they are not authorized to speak to the press, also emphasized that the time it took for AWS to diagnose and remediate the issues was notable.

“It’s extraordinary that they don’t have more failures,” the source said, “but in this case it was weird that what was basically a core service—DynamoDB and the DNS around that—took so long to detect and get to a root cause.”
