
I tried opening Medium at 6:30 AM on Monday, October 20th. Nothing loaded. Figured it was my wifi. Switched to mobile data. Still nothing. Then I checked Twitter. Or tried to. Reddit? Down. Snapchat? Dead. That’s when it hit me: this wasn’t my connection. This was something bigger.



By mid-morning, Downdetector had logged over 13 million user reports across thousands of services. The culprit was Amazon Web Services, specifically the US-EAST-1 region in Northern Virginia. As someone who’s spent years architecting cloud infrastructure, watching this unfold was both fascinating and terrifying.

What Actually Happened: The Technical Timeline

At 3:11 AM ET (12:11 AM PDT), AWS reported increased error rates across multiple services. The root cause was DNS resolution failures for DynamoDB API endpoints in the us-east-1 region. Let me break down what that means in practice.

DynamoDB is AWS’s managed NoSQL database service. When your application needs to read or write data, it makes an API call to a DynamoDB endpoint like:

dynamodb.us-east-1.amazonaws.com

But here’s the critical part: before your application can talk to that endpoint, it needs to resolve the domain name to an IP address using DNS. When DNS resolution fails, it doesn’t matter if the database is working perfectly. Your application can’t find it.

The Failure Cascade:

Application Request
    |
    v
DNS Lookup (dynamodb.us-east-1.amazonaws.com)
    |
    X  <- FAILURE POINT
    |
Cannot Resolve IP Address
    |
    v
Connection Timeout
    |
    v
Application Error (500/503)
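The failure point in that diagram can be reproduced in a few lines. This is a minimal sketch using Python's standard socket module, not the AWS SDK; the function name and fallback behavior are illustrative. The `.invalid` TLD is reserved by RFC 2606 and never resolves, which simulates the outage from a client's point of view.

```python
import socket

def resolve_or_fallback(hostname, port=443):
    """Resolve a hostname, returning None instead of crashing on DNS failure."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return infos[0][4][0]  # first resolved IP address
    except socket.gaierror:
        # This is the X in the diagram: the service behind the name may be
        # perfectly healthy, but without an IP address we can never reach it
        return None
```

During the outage, applications hit this exact branch: `dynamodb.us-east-1.amazonaws.com` stopped resolving, so every request died before a single packet reached DynamoDB.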

AWS mitigated the DNS issue by 2:24 AM PDT, but the damage cascaded through dependent services. EC2 instance launches were throttled. Internal subsystems remained impaired. Full recovery wasn’t achieved until 3:01 PM PDT, over 12 hours after the initial incident.

Why US-EAST-1 Matters (And Why It’s Everywhere)

US-EAST-1 isn’t just another AWS region. It’s the original region, launched in 2006. Many AWS services have global dependencies on it. When I review architecture diagrams from companies, I consistently see this pattern:

# Typical multi-region setup (oversimplified)
import boto3
from botocore.exceptions import ClientError

PRIMARY_REGION = "us-east-1"  # Disaster waiting to happen
FAILOVER_REGION = "us-west-2"

dynamodb_client = boto3.client("dynamodb", region_name=PRIMARY_REGION)

def get_config():
    try:
        # Route53, IAM, CloudFront all have us-east-1 dependencies
        return dynamodb_client.get_item(
            TableName='app-config',
            Key={'env': {'S': 'production'}}
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            # Failover logic that doesn't help if DNS is broken: a DNS
            # failure surfaces as a connection error, not a ClientError,
            # so this branch never even runs
            return get_config_from_backup()
        raise

The problem is that even if you’re running in eu-west-1 or ap-southeast-1, many AWS control plane operations still route through us-east-1. This outage affected companies globally, including Canadian services like Wealthsimple, AI platforms like Perplexity, and crypto exchanges like Coinbase.

According to AWS’s official update, the DNS resolution issue specifically impacted DynamoDB endpoints, which then rippled through EC2, Lambda, and other services that depend on DynamoDB for configuration and state management.

The Ripple Effect: When Dependencies Fail

What made this outage particularly devastating was the dependency chain. Modern applications don’t just use one AWS service. They use dozens. Here’s a simplified architecture that mirrors what thousands of companies run:

User Request
    |
    v
CloudFront (CDN) -----> us-east-1 (config)
    |
    v
Application Load Balancer
    |
    v
ECS/Fargate Container
    |
    +---> DynamoDB (user data) <- DNS FAILURE
    +---> S3 (static assets)
    +---> Lambda (business logic) ---> DynamoDB <- DNS FAILURE
    +---> SQS (job queue) ---> DynamoDB <- DNS FAILURE

When DynamoDB’s DNS failed, it didn’t just break database queries. It broke:

  • Lambda functions that couldn’t retrieve configuration
  • ECS task definitions stored in DynamoDB
  • Application state management
  • Session handling
  • Queue processing

The outage affected major consumer services including Snapchat, McDonald’s app, Ring doorbell cameras, Roblox, and Fortnite. Financial services weren’t spared either. Trading apps like Robinhood went dark during market hours.

What This Reveals About Cloud Architecture

After analyzing numerous reports from affected companies, three architectural failures stand out:

1. Single Region Dependency

Most companies optimize for latency and cost, not resilience. Deploying across multiple regions is expensive and complex. The harsh truth is that true multi-region architecture requires:

import logging

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

logger = logging.getLogger(__name__)

class MultiRegionDynamoDB:
    def __init__(self):
        self.regions = ['us-east-1', 'us-west-2', 'eu-west-1']
        self.clients = {
            region: boto3.client('dynamodb', region_name=region)
            for region in self.regions
        }

    def get_item_with_fallback(self, table, key):
        for region in self.regions:
            try:
                response = self.clients[region].get_item(
                    TableName=table,
                    Key=key,
                    ConsistentRead=False  # Eventually consistent for cross-region
                )
                return response['Item']
            # EndpointConnectionError is what botocore raises when DNS
            # resolution or the TCP connection fails
            except (ClientError, EndpointConnectionError) as e:
                logger.warning(f"Failed to query {region}: {e}")
                continue
        raise RuntimeError("All regions failed")

This approach costs 3x more in data transfer and storage. Most startups and mid-size companies can’t justify it until after their first major outage.

2. DNS as a Single Point of Failure

The Al Jazeera analysis highlighted how fundamental DNS is to cloud operations. We’ve become so reliant on managed services that we forget DNS can fail. Better architectures implement:

  • Local DNS caching with extended TTLs
  • Hard-coded IP failovers for critical services
  • Health check endpoints that bypass DNS
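The first item can be sketched as a small TTL-based resolver cache that keeps serving the last known address when a fresh lookup fails. This is an illustrative pattern, not a library API; the resolver function is injected (for example, a wrapper around `socket.getaddrinfo`) so the behavior is testable.

```python
import time

class CachedResolver:
    """Cache DNS lookups and serve stale entries when resolution fails."""

    def __init__(self, resolve_fn, ttl_seconds=300):
        self.resolve_fn = resolve_fn  # e.g. wraps socket.getaddrinfo
        self.ttl = ttl_seconds
        self.cache = {}  # hostname -> (ip, expiry_timestamp)

    def resolve(self, hostname):
        entry = self.cache.get(hostname)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]  # fresh cache hit
        try:
            ip = self.resolve_fn(hostname)
            self.cache[hostname] = (ip, now + self.ttl)
            return ip
        except OSError:
            if entry:
                # Resolution failed: serve the stale address rather
                # than failing the request outright
                return entry[0]
            raise
```

The trade-off is deliberate: a stale IP that still answers beats a fresh NXDOMAIN that takes your whole application down.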

3. Insufficient Circuit Breakers

When DynamoDB failed, applications should have degraded gracefully. Instead, most crashed entirely. Here’s what production-grade resilience looks like:

// Circuit breaker pattern for DynamoDB calls
import (
    "context"

    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/sony/gobreaker"
)

type DynamoDBClient struct {
    client  *dynamodb.Client
    breaker *gobreaker.CircuitBreaker
}

func (d *DynamoDBClient) GetItem(ctx context.Context, input *dynamodb.GetItemInput) (*dynamodb.GetItemOutput, error) {
    result, err := d.breaker.Execute(func() (interface{}, error) {
        return d.client.GetItem(ctx, input)
    })

    if err != nil {
        // Breaker is open or the call failed: serve from a local
        // cache instead of cascading the error upstream
        return d.getCachedItem(input.Key)
    }

    return result.(*dynamodb.GetItemOutput), nil
}

The Cost of Centralization

AWS controls approximately 37% of the global cloud market, with a customer base of 4 million companies. When a platform this dominant goes down, the economic impact is staggering. Early estimates place total losses in the billions.

The uncomfortable reality is that cloud computing has created new systemic risks. Three companies (AWS, Azure, Google Cloud) power 60% of the internet. When one stumbles, millions of businesses face existential threats.

Lessons for Your Architecture

If you’re running on AWS today, here’s what you should implement immediately:

Short-term fixes:

  • Enable CloudWatch alarms for increased error rates across all critical services
  • Implement aggressive client-side caching with stale-while-revalidate patterns
  • Deploy circuit breakers on all external service calls
  • Test your monitoring during DNS failures (most monitoring tools also failed)
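The second bullet, stale-while-revalidate, works like this: always answer from cache immediately, and only kick off a background refresh when the entry has gone stale. Here's a minimal sketch; the class and parameter names are illustrative, and a production version would add request coalescing so a popular key doesn't spawn many refresh threads.

```python
import threading
import time

class SWRCache:
    """Serve cached values instantly; refresh stale entries off the hot path."""

    def __init__(self, fetch_fn, max_age=60):
        self.fetch_fn = fetch_fn
        self.max_age = max_age
        self.store = {}  # key -> (value, fetched_at)
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            entry = self.store.get(key)
        if entry is None:
            # Cold cache: fetch synchronously once
            value = self.fetch_fn(key)
            with self.lock:
                self.store[key] = (value, time.monotonic())
            return value
        value, fetched_at = entry
        if time.monotonic() - fetched_at > self.max_age:
            # Stale: return the old value now, revalidate in the background
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return value

    def _refresh(self, key):
        try:
            value = self.fetch_fn(key)
            with self.lock:
                self.store[key] = (value, time.monotonic())
        except Exception:
            pass  # keep serving the stale value if the refresh fails
```

Had this pattern wrapped the DynamoDB config reads on October 20th, users would have seen slightly stale data instead of error pages.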

Long-term resilience:

  • Design for multi-region from day one, even if you only deploy to one initially
  • Implement chaos engineering practices (deliberately fail services to test recovery)
  • Build static fallback pages that don’t require dynamic services
  • Document runbooks for manual failover procedures

The October 20th outage will be studied in engineering courses for years. Not because of what AWS did wrong, but because it exposed how fragile our distributed systems really are.

We’ve built incredible technology on foundations that can crumble when a single DNS resolver fails. That’s not AWS’s failure. That’s ours as an industry. We chose convenience over resilience, and on Monday morning, millions of people couldn’t check their bank accounts, order coffee, or play video games because of it.

The next time you architect a system, remember: the cloud isn’t magic. It’s just someone else’s computers, and sometimes those computers can’t even figure out their own addresses.
