Amazon disclosed that a bug in its automation software was responsible for this week’s extensive AWS outage, which took down services ranging from Signal to smart beds for several hours.
In a detailed summary released on Thursday, AWS explained that a series of cascading failures led to the downtime affecting thousands of sites and applications utilizing its services.
AWS reported that customers faced issues connecting to DynamoDB, the database system where AWS clients store their data, “due to a potential flaw in the service’s automatic DNS [domain name system] management system.”
DynamoDB manages hundreds of thousands of DNS records. It relies on automation to monitor the system so that records are updated frequently, hardware failures are handled, and traffic is distributed efficiently.
According to AWS, the root cause was an empty DNS record for the US-East-1 datacenter region in Virginia. The record could not be repaired automatically, so resolving it required manual intervention.
AWS announced that it has globally disabled DynamoDB’s DNS Planner and DNS Executor automation while it remedies the issues that prompted the failure and implements additional safeguards.
This outage also affected various other AWS tools.
Platforms like Signal, Snapchat, Roblox, and Duolingo, along with banking sites and services such as Ring Doorbell, were among the 2,000 businesses impacted by the outage, according to Downdetector, which recorded more than 8.1 million user reports of problems worldwide.
Service was restored within hours, but the outage’s repercussions were widespread.
Customers of Eight Sleep, a company whose internet-connected smart beds offer temperature and tilt control, were unable to adjust their beds’ position or temperature during the outage because the phone app could not connect.
The company’s CEO, Matteo Franceschetti, issued an apology. On X, he said the company had rolled out a service update allowing users to control critical bed functions via Bluetooth during such outages.
Dr. Suelette Dreyfus, a lecturer in computing and information systems at the University of Melbourne, said the failure highlights the internet infrastructure’s dependence on single points of failure.
“It’s not solely AWS; while they are the largest cloud provider with around 30% of the market, the cloud essentially revolves around just three companies,” she explained.
“The Internet was originally designed to be resilient, allowing multiple routes to work around problems and attacks. However, we have diminished that resilience by relying heavily on a limited number of significant tech companies that not only provide data storage but also manage data services.”
Source: www.theguardian.com
