Case Study

AWS us-east-1 December 2021: $150M+ in customer losses

On the morning of 7 December 2021, an automated scaling activity in AWS's internal network triggered a cascading failure across us-east-1 services. Most major consumer apps and SaaS platforms hosted in the region were unavailable or severely degraded for approximately seven hours of acute impact, with downstream recovery tails extending past 30 hours for customers with implicit single-region dependencies. Aggregate customer losses were estimated above $150 million.

Timeline

What happened, hour by hour

Time (Eastern) | Event
07:30 ET | Automated scaling activity triggers internal-network instability
07:30 to 08:30 | Internal API failure cascades; public AWS APIs begin returning errors
08:30 to 09:00 | AWS Service Health Dashboard updated; many customer-facing services impacted
09:00 to 13:00 | Peak impact period; most us-east-1-hosted services unavailable or severely degraded
13:00 to 15:00 | Recovery begins; AWS engineers restore internal-network capacity
~15:00 ET | AWS declares broad service restoration; some downstream services continue to recover
Following 30+ hours | Customers with single-AZ dependencies inside us-east-1 see extended recovery tails

Timeline from AWS's official post-incident summary and contemporaneous status-page updates.

Affected Services

Selected major customers impacted

The list below is not exhaustive. Thousands of smaller services hosted in us-east-1 were also affected. The selection here illustrates the breadth of consumer-facing impact across streaming, finance, food delivery, IoT, transport, and messaging.

Company / service | Observed impact
Netflix | Streaming impaired across multiple regions
Disney+ | Login and playback failures
Robinhood | Trading platform issues during US market open
Coinbase | Trading and account-access issues
Slack | Messaging delays and partial outages
Ring | Doorbell and camera notifications stopped
Roomba (iRobot) | Cloud-controlled robots unresponsive
Tinder | App failures later in the day, during peak evening hours
Venmo | Payment processing issues
Amazon retail site | Some product pages and customer-account features impaired
DoorDash | Order processing degraded
McDonald's app | Mobile ordering down
United Airlines | Booking and check-in issues
Delta Air Lines | Some booking pathway issues

Root Cause

The internal-network scaling cascade

Per AWS's official summary, the trigger was an automated scaling activity at 07:30 ET on the internal network that hosts AWS's internal networking devices. The scaling activity triggered unexpected behaviour in the clients of the internal network, which began a connection-storm against it. The connection-storm consumed the network's remaining capacity, producing cascading failures across the internal services that the public AWS APIs depend on.
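A connection-storm of this kind is easier to picture with the standard client-side defence in view: capped exponential backoff with jitter, so that failing clients spread their reconnects out instead of retrying in lockstep. The sketch below is a generic illustration of that technique, not AWS's client code or AWS's fix; the function names, the 0.1-second base, and the 10-second cap are all assumptions chosen for the example.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=10.0, rng=None):
    """Yield one randomised wait (in seconds) per retry.

    Hypothetical helper: capped exponential backoff with full jitter.
    """
    rng = rng or random.Random()
    for attempt in range(retries):
        # The ceiling doubles on each retry but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: draw uniformly from [0, ceiling] so that many
        # clients failing at the same moment do not retry together.
        yield rng.uniform(0.0, ceiling)


def connect_with_backoff(connect, max_attempts=6, base=0.1, cap=10.0,
                         sleep=time.sleep, rng=None):
    """Call connect() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = list(backoff_delays(max_attempts - 1, base, cap, rng))
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(delays[attempt])
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting; a caller would normally leave them at their defaults. Backoff reduces the load that retrying clients add, but it does not by itself guarantee that a saturated network recovers.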

Because the AWS Service Health Dashboard itself partly depends on the same control-plane services, customers experienced an extended period during which they could see something was wrong but could not get reliable status information. This compounded the operational impact: customer engineers were debugging blind for the first hour or more.

Recovery required AWS engineers to restore the internal-network capacity carefully without re-triggering the connection-storm. The recovery process took approximately five hours from the start of active mitigation to broad service restoration, with a long tail of downstream impact as individual customer services worked back to nominal state.

Economic Impact

Estimating $150M+ in aggregate customer losses

AWS does not disclose customer impact figures. Industry analysts and trade-press estimates put aggregate customer losses above $150 million for the December 2021 incident, derived from per-hour cost benchmarks applied to the publicly named affected services plus reasonable assumptions about smaller customers. The $150M figure is conservative: it counts only the directly attributable revenue loss, not the downstream brand-damage or churn cost.

Two observations about how the cost was distributed. First, the cost was concentrated in a small number of large consumer-facing customers: Netflix, Disney+, Robinhood, and Coinbase together likely accounted for half or more of the estimated customer impact. Second, the AWS SLA credits paid out were small relative to customer losses, both because many customers' cumulative monthly downtime remained within the regional SLA threshold and because the AWS SLA returns 10 to 25% of the service fee, not a percentage of customer revenue impact.
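The gap between credits and losses is simple arithmetic. The sketch below uses the 10% and 25% credit rates quoted above together with two illustrative figures, a $100,000 monthly service fee and a $10 million business loss; both figures and the function names are assumptions for the example, not data from any specific customer.

```python
def sla_credit(monthly_fee, credit_rate):
    """Credit returned under the SLA, as a share of the monthly service fee."""
    return monthly_fee * credit_rate


def coverage_pct(credit, business_loss):
    """Share of the business loss that the credit offsets, in percent."""
    return 100.0 * credit / business_loss


monthly_fee = 100_000        # illustrative monthly spend on the affected service
business_loss = 10_000_000   # illustrative revenue loss from the outage

for rate in (0.10, 0.25):
    credit = sla_credit(monthly_fee, rate)
    share = coverage_pct(credit, business_loss)
    print(f"{rate:.0%} credit: ${credit:,.0f} covers {share:.2f}% of the loss")
```

Under these assumptions the credit is $10,000 to $25,000, which offsets between 0.10% and 0.25% of the loss. The credit scales with the service fee, while the loss scales with the revenue that depends on the service.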

For a framework explaining why SLA credits return so little, see our SLA credit asymmetry analysis. The us-east-1 case is a textbook example.

Architectural Lessons

Why "multi-region" is not the same as "regionally independent"

Many customers who believed they had multi-region resilience discovered, during the December 2021 incident, that their architectures had implicit single-region dependencies. Three patterns recurred. First, IAM and certain Route 53 features have control planes that are anchored in us-east-1, so a us-east-1 incident can affect identity and DNS operations even in other regions. Second, cross-region replication has its own control-plane dependencies that can fail open or fail closed in surprising ways. Third, many customer deployment pipelines, monitoring stacks, and operational tooling lived in us-east-1 because it was the original AWS region, so customers could not deploy fixes to their other regions during the incident.

The practical lesson is that true regional independence requires explicit testing through game days (controlled regional failover exercises) rather than just on-paper architecture diagrams. Customers who had recently run a us-east-1-loss game day generally recovered fastest. Customers who had multi-region architecture only on the deployment diagram discovered missing dependencies during the actual incident.
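Before a game day, a paper exercise can surface some of these hidden dependencies. The sketch below walks a dependency inventory and reports everything that would fail, directly or transitively, if one region were lost. The inventory format, the component names, and the assumption that every dependency is a hard one (any failed dependency takes its dependents down) are all hypothetical; a real inventory is rarely complete, which is why the exercise supplements a game day and does not replace one.

```python
from collections import deque


def impacted_by_region_loss(components, lost_region):
    """Return the set of components that fail if lost_region goes down.

    components maps a name to {"region": str, "depends_on": [names]}.
    Every dependency is treated as hard (a pessimistic assumption).
    """
    # Seed with everything hosted directly in the lost region.
    down = {name for name, c in components.items()
            if c["region"] == lost_region}

    # Build reverse edges: for each component, who depends on it.
    dependents = {}
    for name, c in components.items():
        for dep in c["depends_on"]:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the failed components.
    queue = deque(down)
    while queue:
        failed = queue.popleft()
        for name in dependents.get(failed, []):
            if name not in down:
                down.add(name)
                queue.append(name)
    return down


# Hypothetical inventory for a "multi-region" deployment.
inventory = {
    "identity-control-plane": {"region": "us-east-1", "depends_on": []},
    "deploy-pipeline": {"region": "us-east-1",
                        "depends_on": ["identity-control-plane"]},
    "web-east": {"region": "us-east-1", "depends_on": []},
    "web-west": {"region": "us-west-2", "depends_on": []},
    "deploy-to-west": {"region": "us-west-2",
                       "depends_on": ["deploy-pipeline"]},
}

down = impacted_by_region_loss(inventory, "us-east-1")
# Components outside the lost region that still fail are the surprises.
hidden = sorted(n for n in down if inventory[n]["region"] != "us-east-1")
print(hidden)
```

In this made-up inventory the west-coast web tier keeps serving, but the ability to deploy to it is lost because the pipeline lives in the failed region, which mirrors the third pattern described above.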

For the cost-benefit math on multi-region active-active versus single-region with strong backup, see our business case builder. The us-east-1 December 2021 incident is the most commonly cited reference point for the "why a single AWS region is not enough" argument, even though pure regional failures are still rare in absolute terms.

Recovery Tail

Why some customers were still recovering 30 hours later

AWS declared broad service restoration by approximately 15:00 ET on 7 December 2021. For many customers, the actual return to nominal service took much longer. The pattern was uneven. Customers with predominantly stateless workloads (read-mostly web services, content delivery) recovered quickly after AWS APIs returned. Customers with stateful workloads (databases, queues, event-streaming pipelines) took longer because they had to drain backlogs, reconcile inconsistent state, and unwind partial-failure conditions accumulated during the outage.
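A back-of-envelope model shows why draining a backlog can take longer than the outage that created it. The sketch below assumes that producers kept enqueuing work at their normal rate for the whole outage and that consumers resume at a fixed capacity afterwards; the rates and the function name are hypothetical and chosen only to make the arithmetic visible.

```python
def drain_hours(arrival_per_s, capacity_per_s, outage_hours):
    """Hours needed to clear the backlog built up during an outage.

    New work keeps arriving while the backlog drains, so only the spare
    capacity (capacity minus arrival rate) goes towards the backlog.
    """
    if capacity_per_s <= arrival_per_s:
        raise ValueError("backlog never drains: capacity must exceed arrivals")
    backlog = arrival_per_s * outage_hours * 3600
    spare_per_s = capacity_per_s - arrival_per_s
    return backlog / spare_per_s / 3600


# Hypothetical pipeline: 1,000 messages/s arriving, 1,500 messages/s capacity.
hours = drain_hours(arrival_per_s=1000, capacity_per_s=1500, outage_hours=7)
print(f"{hours:.1f} hours to drain")
```

With 50% headroom, a seven-hour outage leaves a backlog that takes 14 hours to clear under these assumptions, before any time spent reconciling inconsistent state.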

Some customers reported continuing partial impact past 30 hours after the initial incident. These were typically customers with deep single-AZ dependencies inside us-east-1 (services that ran in a single Availability Zone, depended on storage volumes in that AZ, and relied on operational tooling that also ran in that AZ). The long tail explains why some estimates of the incident's total cost run higher than the headline seven-hour duration would suggest: for these long-tail customers, the effective outage lasted 24 hours or more.

Frequently Asked

Common Questions

What caused the AWS us-east-1 outage of 7 December 2021?
Per AWS's official summary, an automated scaling activity on the internal network at 07:30 ET triggered an unexpected client behaviour that produced a connection-storm. The connection-storm consumed the internal network's remaining capacity, which produced cascading failures across the internal services that power the public AWS APIs. Recovery required carefully restoring internal-network capacity without re-triggering the storm.
How long did the AWS us-east-1 December 2021 outage last?
Approximately 7 hours of acute outage (07:30 to ~15:00 ET on 7 December 2021), with recovery tails extending past 30 hours for some customers. Many customers experienced extended partial-impact periods because their architectures had stateful components that took time to drain backlogs and reconcile inconsistent state.
How much did the outage cost in aggregate?
Industry estimates put aggregate customer losses above $150 million. AWS itself did not disclose a direct dollar figure. Service credits paid to affected customers were modest because in many cases the regional uptime SLA threshold was not breached when measured across the full month, and the AWS SLA returns 10 to 25% of the service fee, not a percentage of customer revenue impact.
Which major services were affected?
Netflix, Disney+, Robinhood, Coinbase, Slack, Ring, Roomba, Tinder, Venmo, McDonald's mobile ordering, DoorDash, and many more. The official Amazon retail site itself was partly affected. Thousands of smaller services hosted in us-east-1 were also impacted. The breadth reflected the historical concentration of new-service deployments in us-east-1 plus the control-plane anchoring of certain AWS services in that region.
Did SLA credits compensate the cost?
No, not meaningfully. The AWS SLA returns 10% of the monthly service fee when uptime falls below 99.99%, rising to 25% for deeper breaches. For a customer spending $100,000 per month on the affected service, that is a $10,000 to $25,000 credit. Against a $10 million business loss, the credit covers 0.1% to 0.25%. At any meaningful outage scale, the credit is a signal, not compensation.
What is the architectural lesson?
True regional independence requires explicit testing through game days, not just deploying the same stack to a second region. Many customers who believed they had multi-region resilience discovered implicit single-region dependencies during the actual incident: IAM and certain Route 53 features anchored in us-east-1, cross-region replication control planes, and deployment pipelines and monitoring stacks that lived in the affected region. Multi-region in architecture diagrams is not the same as multi-region in operation.

Updated 2026-04-27