When Amazon Web Services (AWS) goes down, a large chunk of the internet goes down with it, because so many popular services run on its infrastructure. The most recent AWS outage, which did not spare Amazon’s own services, also took down popular games such as PUBG, League of Legends, and Valorant, leaving many online players stranded.
Now, Amazon has explained what caused the outage.
Within Amazon, the outage hit the tools its workers rely on, disrupting the apps that track packages and schedule delivery routes. Many deliveries were delayed as a result.
People also had problems loading Amazon’s website. Other Amazon products affected included the Alexa AI assistant, Amazon Music, Kindle, and security cameras.
Other big names affected by the outage included Roku, Coinbase, Tinder, Cash App, Disney Plus, Roomba, and Venmo.
Even after some of the affected services were restored, the internet still felt slow as the network rerouted requests.
One complication was that the tools Amazon uses to monitor its systems were themselves affected by the outage. Unable to access them, the operations team had to troubleshoot using other, less efficient means, which helps explain why the outage persisted for so long.
The outage also affected Amazon’s Support Contact Center, locking customers out of help services for hours.
According to Amazon, the outage was caused by an automated process that started in the Northern Virginia region around 10:30 AM ET.
Explaining what went wrong, Amazon wrote:
“At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”
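The feedback loop Amazon describes, in which failed requests trigger retries that add to the congestion, can be illustrated with a toy model. The sketch below is not Amazon's system: the function name `simulate_retry_load`, the capacity and load figures, and the `retry_factor` parameter are all hypothetical, chosen only to show how a short surge can leave lasting congestion when each failure produces more than one new attempt.

```python
def simulate_retry_load(ticks, capacity, base_load, surge, surge_ticks, retry_factor):
    """Return the offered load per tick in a toy congestion model.

    Assumptions (illustrative only): requests above `capacity` fail in a tick,
    and every failed request produces `retry_factor` new attempts on the next tick.
    """
    retries = 0.0
    history = []
    for tick in range(ticks):
        # Normal traffic, plus the temporary surge, plus retries from the last tick.
        offered = base_load + (surge if tick < surge_ticks else 0) + retries
        failed = max(0.0, offered - capacity)
        retries = failed * retry_factor
        history.append(offered)
    return history


# Aggressive retries: load keeps growing long after the two-tick surge ends.
aggressive = simulate_retry_load(10, capacity=100, base_load=80, surge=100,
                                 surge_ticks=2, retry_factor=1.5)

# Damped retries: load falls back to the normal level within a few ticks.
damped = simulate_retry_load(10, capacity=100, base_load=80, surge=100,
                             surge_ticks=2, retry_factor=0.5)

print("aggressive:", aggressive)
print("damped:    ", damped)
```

In this simplified model the only difference between the two runs is how many new attempts each failure generates, which is why clients that back off properly matter so much during a congestion event.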
Amazon is taking steps to prevent such outages in the future. The company revealed:
“We have taken several actions to prevent a recurrence of this event. We immediately disabled the scaling activities that triggered this event and will not resume them until we have deployed all remediations. Our systems are scaled adequately so that we do not need to resume these activities in the near-term. Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event. This code path has been in production for many years but the automated scaling activity triggered a previously unobserved behavior. We are developing a fix for this issue and expect to deploy this change over the next two weeks. We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue.”
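Amazon does not describe how its clients implement back-off, but a common general technique is capped exponential backoff with random jitter, where a client waits longer after each failure and randomizes the wait so that many clients do not retry in lockstep. The following is a minimal sketch of that general idea, not Amazon's code; `call_with_backoff` and its parameters are hypothetical names used for illustration.

```python
import random
import time


def call_with_backoff(operation, max_retries=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying on ConnectionError with jittered backoff.

    Hypothetical helper: the wait before retry n is a random value between
    0 and min(cap, base * 2**n) seconds ("full jitter").
    """
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # Give up after the final allowed retry.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))


# Usage example: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_operation():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("simulated network failure")
    return "ok"


waits = []  # Record the waits instead of actually sleeping.
result = call_with_backoff(flaky_operation, sleep=waits.append)
print(result, len(waits))
```

The cap keeps waits bounded, and the jitter spreads retries out over time, which is the property that lets an overloaded network drain rather than being hit by synchronized waves of reconnects.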
After explaining why it struggled to communicate with affected customers, Amazon outlined how it would upgrade its status and support tools:
“As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue. Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST. We have been working on several enhancements to our Support Services to ensure we can more reliably and quickly communicate with customers during operational issues. We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers.”