Case Study
AWS us-east-1 December 2021: $150M+ in customer losses
On the morning of 7 December 2021, an automated scaling activity in AWS's internal network triggered a cascading failure across us-east-1 services. Most major consumer apps and SaaS platforms hosted in the region were unavailable or severely degraded for approximately seven hours of acute impact, with downstream recovery tails extending past 30 hours for customers with implicit single-region dependencies. Aggregate customer losses were estimated above $150 million.
Timeline
What happened, hour by hour
| Time (Eastern) | Event |
|---|---|
| 07:30 ET | Automated scaling activity triggers internal-network instability |
| 07:30 to 08:30 | Internal API failure cascades; public AWS APIs begin returning errors |
| 08:30 to 09:00 | AWS Service Health Dashboard updated; many customer-facing services impacted |
| 09:00 to 13:00 | Peak impact period; most us-east-1-hosted services unavailable or severely degraded |
| 13:00 to 15:00 | Recovery begins; AWS engineers restore internal-network capacity |
| ~15:00 ET | AWS declares broad service restoration; some downstream services continue to recover |
| Following 30+ hours | Customers with single-AZ dependencies inside us-east-1 see extended recovery tails |
Timeline from AWS's official post-incident summary and contemporaneous status-page updates.
Affected Services
Selected major customers impacted
The list below is not exhaustive. Thousands of smaller services hosted in us-east-1 were also affected. The selection here illustrates the breadth of consumer-facing impact across streaming, finance, food delivery, IoT, transport, and messaging.
| Company / service | Observed impact |
|---|---|
| Netflix | Streaming impaired across multiple regions |
| Disney+ | Login and playback failures |
| Robinhood | Trading platform issues during US market open |
| Coinbase | Trading and account-access issues |
| Slack | Messaging delays and partial outages |
| Ring | Doorbell and camera notifications stopped |
| Roomba (iRobot) | Cloud-controlled robots unresponsive |
| Tinder | App failures later in the day, during peak evening hours |
| Venmo | Payment processing issues |
| Amazon retail site | Some product pages and customer-account features impaired |
| DoorDash | Order processing degraded |
| McDonald's app | Mobile ordering down |
| United Airlines | Booking and check-in issues |
| Delta Air Lines | Some booking pathway issues |
Root Cause
The internal-network scaling cascade
Per AWS's official summary, the trigger was an automated scaling activity at 07:30 ET on AWS's internal network. The scaling activity triggered unexpected behaviour in clients of that network, which began a connection storm against it. The connection storm consumed the network's remaining capacity, producing cascading failures across the internal services that the public AWS APIs depend on.
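To illustrate why client retry behaviour matters in a connection storm, the sketch below shows capped exponential backoff with full jitter, a widely used way to spread reconnection attempts out over time. It is a generic illustration, not a description of AWS's internal clients or of the fix AWS applied; the function name `backoff_delays` and the `base` and `cap` values are assumptions chosen for the example.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return retry delays in seconds using capped exponential backoff with full jitter.

    base and cap are illustrative defaults, not values from any AWS client.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly between 0 and the ceiling so that
        # many clients do not retry at the same instant.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Hypothetical usage: delays for eight consecutive retries from one client.
delays = backoff_delays(8, rng=random.Random(0))
```

Clients that retry immediately and in lockstep add load exactly when the network has the least spare capacity; randomised, growing delays reduce that synchronised pressure.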
Because the AWS Service Health Dashboard itself partly depends on the same control-plane services, customers experienced an extended period during which they could see something was wrong but could not get reliable status information. This compounded the operational impact: customer engineers were debugging blind for the first hour or more.
Recovery required AWS engineers to restore the internal-network capacity carefully without re-triggering the connection-storm. The recovery process took approximately five hours from the start of active mitigation to broad service restoration, with a long tail of downstream impact as individual customer services worked back to nominal state.
Economic Impact
Estimating $150M+ in aggregate customer losses
AWS does not disclose customer impact figures. Industry-analyst and trade-press estimates put aggregate customer losses above $150 million for the December 2021 incident, derived from per-hour cost benchmarks applied to the publicly named affected services plus reasonable assumptions about smaller customers. The $150M figure is conservative: it counts only directly attributable revenue loss, not downstream brand damage or churn cost.
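The benchmark-based method described above amounts to multiplying an assumed per-hour cost by the outage duration for each service and summing the results. The sketch below shows that arithmetic. The service names and per-hour figures are hypothetical placeholders, not estimates for any real company, and the total is not meant to reproduce the $150M figure.

```python
def estimate_losses(hourly_cost_by_service, outage_hours):
    """Multiply each service's assumed per-hour cost by the outage duration."""
    return {
        name: hourly_cost * outage_hours
        for name, hourly_cost in hourly_cost_by_service.items()
    }


# Hypothetical per-hour downtime costs in USD (illustrative only).
hypothetical_hourly_costs = {
    "streaming_service": 2_000_000,
    "brokerage": 1_500_000,
    "food_delivery": 400_000,
}

per_service = estimate_losses(hypothetical_hourly_costs, outage_hours=7)
total = sum(per_service.values())
print(f"Estimated direct loss: ${total:,}")
```

An estimate built this way is only as good as its per-hour benchmarks, which is why published figures for the incident vary.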
Two observations about how the cost was distributed. First, the cost concentrated in a small number of large consumer-facing customers. Netflix, Disney+, Robinhood, and Coinbase together likely accounted for half or more of the estimated customer impact. Second, the AWS SLA credits paid out were small relative to customer losses, both because many customers' cumulative monthly downtime remained within the regional SLA threshold and because the AWS SLA returns 10 to 25% of the service fee, not a percentage of customer revenue impact.
For a framework on why SLA credits return so little, see our SLA credit asymmetry analysis. The us-east-1 case is a textbook example.
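The asymmetry is easy to see with a worked example. The sketch below applies the 10 to 25% credit range quoted above to a hypothetical customer. The monthly fee and the revenue loss are invented for illustration, and the function names are assumptions.

```python
def sla_credit(monthly_service_fee, credit_rate):
    """Credit is a fraction of the fee paid for the service, not of revenue lost."""
    if not 0.0 <= credit_rate <= 1.0:
        raise ValueError("credit_rate must be between 0 and 1")
    return monthly_service_fee * credit_rate


def recovery_ratio(credit, revenue_loss):
    """Share of the customer's revenue loss that the credit recovers."""
    return credit / revenue_loss


# Hypothetical customer: $200,000 monthly AWS fee, $3,000,000 lost revenue.
fee = 200_000
loss = 3_000_000

low_credit = sla_credit(fee, 0.10)
high_credit = sla_credit(fee, 0.25)
best_case_ratio = recovery_ratio(high_credit, loss)
print(f"Credit range: ${low_credit:,.0f} to ${high_credit:,.0f}")
print(f"Best-case recovery: {best_case_ratio:.1%} of revenue loss")
```

Even at the top of the credit range, this hypothetical customer recovers under 2% of its loss, because the credit scales with the fee rather than with the revenue at risk.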
Architectural Lessons
Why "multi-region" is not the same as "regionally independent"
Many customers who believed they had multi-region resilience discovered, during the December 2021 incident, that their architectures had implicit single-region dependencies. Three patterns recurred. First, IAM and certain Route 53 features have control planes that are anchored in us-east-1, so a us-east-1 incident can affect identity and DNS operations even in other regions. Second, cross-region replication has its own control-plane dependencies that can fail open or fail closed in surprising ways. Third, many customer deployment pipelines, monitoring stacks, and operational tooling lived in us-east-1 because it was the original AWS region, so customers could not deploy fixes to their other regions during the incident.
The practical lesson is that true regional independence requires explicit testing through game days (controlled regional failover exercises) rather than just on-paper architecture diagrams. Customers who had recently run a us-east-1-loss game day generally recovered fastest. Customers who had multi-region architecture only on the deployment diagram discovered missing dependencies during the actual incident.
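One way to prepare for a game day is to keep an explicit inventory of dependencies and flag everything that the failover path needs but that lives only in the region being lost. The sketch below is a minimal illustration of that check. The `Dependency` type, the inventory entries, and the region assignments are all hypothetical; a real inventory would come from the team's own asset and configuration data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    region: str
    needed_for_failover: bool  # must this work while the region is down?


def single_region_risks(dependencies, lost_region="us-east-1"):
    """Return names of dependencies that failover needs but that sit in the lost region."""
    return [
        dep.name
        for dep in dependencies
        if dep.region == lost_region and dep.needed_for_failover
    ]


# Hypothetical inventory for illustration only.
inventory = [
    Dependency("deploy-pipeline", "us-east-1", True),
    Dependency("metrics-dashboard", "us-east-1", True),
    Dependency("api-fleet-west", "us-west-2", True),
    Dependency("batch-reports", "us-east-1", False),
]

risks = single_region_risks(inventory)
print(risks)
```

A static check like this only finds dependencies that someone has written down; the game day itself is what surfaces the ones nobody knew about.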
For the cost-benefit math on multi-region active-active versus single-region with strong backup, see our business case builder. The us-east-1 December 2021 incident is the most commonly cited reference point for the "why a single AWS region is not enough" argument, even though pure regional failures are still rare in absolute terms.
Recovery Tail
Why some customers were still recovering 30 hours later
AWS declared broad service restoration by approximately 15:00 ET on 7 December 2021. For many customers, the actual return to nominal service took much longer. The pattern was uneven. Customers with predominantly stateless workloads (read-mostly web services, content delivery) recovered quickly after AWS APIs returned. Customers with stateful workloads (databases, queues, event-streaming pipelines) took longer because they had to drain backlogs, reconcile inconsistent state, and unwind partial-failure conditions accumulated during the outage.
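The backlog effect can be made concrete with simple arithmetic: work that arrives during the outage piles up, and after recovery it drains only at the rate by which processing capacity exceeds ongoing arrivals. The sketch below assumes constant arrival and processing rates, which real systems rarely have, and uses hypothetical numbers.

```python
def drain_hours(arrival_per_hour, capacity_per_hour, outage_hours):
    """Hours needed to clear the backlog built up during an outage.

    Assumes constant rates and that nothing was processed during the outage.
    """
    if capacity_per_hour <= arrival_per_hour:
        raise ValueError("capacity must exceed the arrival rate or the backlog never drains")
    backlog = arrival_per_hour * outage_hours
    spare_capacity = capacity_per_hour - arrival_per_hour
    return backlog / spare_capacity


# Hypothetical queue: 1,000,000 messages/hour arriving, 25% processing
# headroom, and a 7-hour outage.
hours = drain_hours(
    arrival_per_hour=1_000_000,
    capacity_per_hour=1_250_000,
    outage_hours=7,
)
print(f"Backlog clears after about {hours:.0f} hours")
```

With 25% headroom, a 7-hour outage takes 28 hours to work off in this model, which shows how a stateful pipeline can still be catching up more than a day after the provider declares restoration.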
Some customers reported continuing partial impact more than 30 hours after the initial incident. These were typically customers with deep single-AZ dependencies inside us-east-1: services that ran in a single Availability Zone, depended on storage volumes in that AZ, and relied on operational tooling that also ran in that AZ. The long tail explains why some estimates of the incident's total cost run higher than the headline 7-hour figure would suggest: for these long-tail customers, the effective outage lasted on the order of 24 hours or more.
Frequently Asked
Common Questions
What caused the AWS us-east-1 outage of 7 December 2021?
How long did the AWS us-east-1 December 2021 outage last?
How much did the outage cost in aggregate?
Which major services were affected?
Did SLA credits compensate the cost?
What is the architectural lesson?