It was 7:20 PM GMT on November 18, 2025. I was finishing up some configuration changes to ReadTheManual’s self-hosted infrastructure when I noticed something odd: my Cloudflare tunnel had gone dark. No big deal, I thought, probably a blip. Then I checked Twitter. Spotify was down. ChatGPT was offline. Even Trump’s Truth Social had gone silent.
My self-hosted blog, running on my own hardware in my own rack, was effectively offline. Not because my infrastructure failed, but because Cloudflare (the service I relied on to expose it safely to the internet) had taken a nosedive.
That’s the uncomfortable truth about modern infrastructure: even when you control the entire stack, you’re still dependent on someone else’s pipes. And in the span of three weeks between late October and mid-November 2025, those pipes burst three times in spectacular fashion.
Let me tell you what happened, why it matters, and what these outages teach us about career advancement in infrastructure engineering.
The Three Outages That Broke the Internet
AWS US-EAST-1: When DNS Became a Single Point of Failure (October 19-20, 2025)
What Happened: At 12:26 AM UTC on October 20, AWS’s flagship region (us-east-1 in Northern Virginia) began experiencing what would become a 15-hour service degradation affecting millions of users worldwide.
The Technical Details: A latent race condition in DynamoDB’s automated DNS management system created an incorrect empty DNS record for dynamodb.us-east-1.amazonaws.com. Here’s what made it particularly nasty:
DynamoDB’s DNS system has two components:
- DNS Planner: Calculates which load balancers should serve traffic with what weights
- DNS Enactor: Three redundant workers in different Availability Zones that apply those plans to Route 53
Under unusual contention, one Enactor fell behind while another raced ahead applying newer plans and garbage-collecting older generations. The slow Enactor then applied its stale plan, overwriting the regional endpoint’s newer record; moments later, the fast Enactor’s cleanup job deleted that stale plan as obsolete, atomically removing every IP address for the endpoint.
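The fix for this class of bug is to make plan application monotonic: a worker must never apply a plan older than the one already live. Here is a toy Python model of that guard (the class and generation numbers are my own illustration, not AWS's actual code):

```python
import threading

class Endpoint:
    """Toy DNS endpoint record protected by a generation check."""

    def __init__(self):
        self.lock = threading.Lock()
        self.generation = -1   # generation of the currently applied plan
        self.ips = []

    def apply_plan(self, generation, ips):
        """Apply a plan only if it is newer than the one already live.

        Without this check the endpoint is last-writer-wins, and a slow
        worker holding a stale plan can silently roll it backwards,
        which is essentially what happened to the DynamoDB endpoint.
        """
        with self.lock:
            if generation <= self.generation:
                return False   # stale plan: reject instead of applying
            self.generation = generation
            self.ips = list(ips)
            return True

endpoint = Endpoint()
endpoint.apply_plan(2, ["10.0.0.2"])          # fast Enactor, newer plan
stale = endpoint.apply_plan(1, ["10.0.0.1"])  # slow Enactor arrives late
print(stale, endpoint.ips)                    # prints: False ['10.0.0.2']
```

The lock makes the check-and-write atomic; the comparison makes it monotonic. Either alone is not enough.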
The Cascade: What started as a DNS issue triggered a complex failure chain:
- DynamoDB went dark (no DNS resolution = no service)
- EC2’s DropletWorkflow Manager depends on DynamoDB to renew leases on physical machines. When DynamoDB disappeared, those renewals stopped, causing EC2 to think healthy machines were dead.
- Network Load Balancer health checks failed, causing connectivity issues for Lambda, CloudWatch, and other services
- Metastable failure state: Even after AWS fixed the DNS issue at 2:24 AM, retry storms prevented the region from stabilizing for hours
Real-World Impact: Downdetector received 6.5 million reports. Services like Slack, Atlassian, Snapchat, McDonald’s mobile app, Ring doorbells, Roblox, and Fortnite all went offline. This wasn’t a minor blip; this was internet infrastructure failing at scale.
AWS’s Response: They disabled the DynamoDB DNS automation worldwide and are implementing velocity controls on Network Load Balancer AZ failovers and rate-limiting on EC2 data propagation systems.
Azure Front Door: A Configuration Change That Lasted Nine Hours (October 29-30, 2025)
What Happened: Starting at 15:41 UTC on October 29, customers using Azure Front Door (AFD) and Azure Content Delivery Network experienced connection timeouts and DNS resolution failures for nearly nine hours.
The Technical Details: A faulty control-plane configuration change introduced an invalid state that prevented AFD nodes from loading correctly. The real kicker? The misconfiguration bypassed normal safety checks due to a software defect in the deployment system.
This is Infrastructure 101 stuff: you have guardrails for a reason. When those guardrails fail, you get global outages.
The Cascade: AFD sits at the edge of Microsoft’s infrastructure, routing traffic for:
- Microsoft 365
- Xbox Live
- Azure Portal (yes, the admin interface went down)
- Azure Active Directory B2C (identity failures rippled across everything)
- Azure App Service, Databricks, Media Services, Static Web Apps
- Thousands of customer websites
- Consumer services (Starbucks, Dairy Queen reported disruptions)
This demonstrated what security experts call an “architectural anti-pattern”: a single bad push knocked over a global edge network because too many critical services had a centralized dependency.
Recovery Timeline:
- 17:26 UTC: Azure Portal failed away from AFD to restore admin access (imagine not being able to access your own admin panel during an outage)
- 17:30 UTC: Microsoft blocked all global configuration changes
- 17:40 UTC: Last-known-good configuration rollback initiated
- 18:45 UTC: Manual node recovery began
- 00:05 UTC: Customer impact finally mitigated (8.5 hours later)
The Aftermath: Microsoft implemented a week-long configuration freeze. Thousands of businesses couldn’t make critical updates to their production infrastructure until November 5. Think about that: an outage caused by a bad configuration change resulted in a week where nobody could make configuration changes. The cure became another problem.
Cloudflare: When Bot Management Brought Down the Internet (November 18, 2025)
What Happened: At 11:28 UTC on November 18, Cloudflare’s network began experiencing widespread failures that lasted six hours, affecting thousands of websites globally.
The Technical Details: This one was almost elegant in its simplicity. At 11:05 UTC, Cloudflare modified access controls on their ClickHouse database cluster. A query used by Bot Management returned duplicate column metadata:
```sql
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```
The query lacked a database filter, so after the permissions update, it returned metadata from both the default database and the underlying r0 database, effectively doubling the configuration file size.
The Bot Management module has a hard limit of 200 pre-allocated features for performance optimization. The bloated file exceeded this threshold, causing the system to panic with: `thread fl2_worker_thread panicked: called Result::unwrap() on an Err value`
Result? HTTP 5xx errors across Cloudflare’s entire global network.
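The general defensive pattern here (my sketch, not Cloudflare's actual implementation; the limit constant and function names are hypothetical) is to validate a freshly generated config against its hard limits before swapping it in, and keep serving the last-known-good file when validation fails:

```python
FEATURE_LIMIT = 200  # hypothetical hard cap, mirroring the pre-allocation limit

def load_features(candidate, current):
    """Validate a freshly generated feature config before activating it.

    Returns the config that should actually be served: the candidate if
    it passes validation, otherwise the last-known-good `current` config.
    """
    # Duplicate rows (e.g. the same column reported from two databases)
    # inflate the feature count, so deduplicate before checking the cap.
    deduped = list(dict.fromkeys(candidate))
    if len(deduped) > FEATURE_LIMIT:
        # Refuse the bad file instead of panicking the worker thread.
        return current

    return deduped

good = ["f%d" % i for i in range(150)]
bloated = good * 2 + ["f%d" % i for i in range(150, 400)]
print(load_features(bloated, good) == good)  # prints: True (bad file rejected)
```

Treating a generated config as untrusted input, the same way you would treat user input, turns a global panic into a stale-but-serving edge.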
Services Impacted:
- Core CDN and security services
- Turnstile (CAPTCHA alternative)
- Workers KV (edge storage)
- Dashboard (login failures due to Turnstile being down)
- Access (authentication failures)
- Email Security
Real-World Impact: Spotify, ChatGPT, Discord, Truth Social, and thousands of websites went offline. This was Cloudflare’s most significant outage since 2019.
Timeline:
- 11:05 UTC: Database change deployed
- 11:28 UTC: Errors first observed
- 11:35 UTC: Incident call created
- 14:24 UTC: Bad configuration propagation stopped
- 14:30 UTC: Core traffic largely restored
- 17:06 UTC: All systems fully operational (6 hours total)
CEO Matthew Prince issued a public apology: “On behalf of the entire team at Cloudflare, I would like to apologise for the pain we caused the Internet today.”
The Internet’s Original Sin: Forgetting Resilience by Design
Here’s the thing that gets me: the internet was designed from the start to survive catastrophic failure. The popular version of the story says “nuclear war”, and Paul Baran’s packet-switching research at RAND really was framed that way, even though ARPANET itself began as a resource-sharing project.
ARPANET, the precursor to the modern internet, was built on the principle of distributed redundancy. The original architecture assumed nodes would fail: equipment would break, connections would sever, entire data centers might disappear in a mushroom cloud. The protocol stack was designed to route around damage automatically.
TCP/IP doesn’t care if packets take different paths to reach their destination. DNS was designed with hierarchical redundancy. BGP routing was built to find alternate paths when networks went dark.
And yet here we are in 2025, where a database permissions change at one company can take down Spotify, ChatGPT, and thousands of other services simultaneously.
What changed?
The Hyperscaler Concentration Problem
The modern internet runs on approximately five companies:
- Amazon (AWS): 31% of cloud infrastructure market
- Microsoft (Azure): 20% of cloud infrastructure market
- Google (GCP): 12% of cloud infrastructure market
- Cloudflare: Handles roughly 20% of all internet traffic
- Equinix: Interconnection hub for major cloud providers
This concentration creates systemic risk. When AWS us-east-1 goes down, it’s not just Amazon’s problem, it’s the internet’s problem. That region hosts critical infrastructure for thousands of companies who can’t or won’t multi-region their applications.
Why us-east-1 specifically? Because it’s the oldest, cheapest, and most feature-complete AWS region. Many services launch there first. It’s also where AWS’s own internal services run, which means outages cascade through AWS’s control plane.
The Enterprise Reality: In my work at HiltDigital, I’ve seen companies with “multi-cloud strategies” that are really just “multi-cloud in name only.” They run production workloads in AWS, disaster recovery in Azure, and development environments in GCP. But when push comes to shove, their DNS is in Route 53, their CDN is Cloudflare, and their authentication is in Azure AD.
Every single one of those is a potential single point of failure.
The Homelab Paradox: Self-Hosted, But Not Independent
Here’s the uncomfortable realization I had on November 18: my self-hosted infrastructure isn’t really self-hosted.
ReadTheManual runs on physical hardware I own, in a rack I built, on a network I configured. I have:
- Redundant power supplies
- RAID storage with hot spares
- Automated backups to multiple locations
- Monitoring with alerting
- Documented runbooks for failure scenarios
But I still depend on:
- Cloudflare Tunnels to expose services without opening ports (went down November 18)
- Cloudflare DNS for domain resolution (also down November 18)
- My ISP’s connectivity (single point of failure)
- Upstream BGP routing (completely outside my control)
- Power from the grid (battery backup lasts 4 hours maximum)
Even if you go full hermit mode and host everything yourself, you’re still dependent on:
- Domain registrars (who can be compelled to seize domains)
- Certificate authorities (for HTTPS)
- Network transit providers
- DDoS mitigation services (unless you want to get knocked offline by script kiddies)
The Trade-Off: True independence requires significant investment. Running your own AS number, maintaining BGP peering relationships, owning your IP space: that’s enterprise-level complexity and cost. For most homelab builders and small businesses, it’s not economically viable.
So we make compromises. We use Cloudflare for security and performance. We use managed DNS for reliability. We use cloud storage for off-site backups.
And when Cloudflare has a bad day, so do we.
What Went Wrong: Common Patterns in All Three Outages
Looking at these three outages, several patterns emerge:
1. Configuration Changes Without Adequate Testing
- AWS: DNS automation deployed without accounting for race conditions under contention
- Azure: Control-plane change bypassed safety checks due to software defect
- Cloudflare: Database permissions change affected downstream query behavior
The Lesson: Canary deployments aren’t just for application code. Infrastructure changes need gradual rollouts with automated rollback capabilities.
2. Cascading Failures From Tight Coupling
All three outages demonstrated how interconnected modern infrastructure has become. DynamoDB DNS failure took down EC2. Azure Front Door failure affected identity services. Cloudflare’s Bot Management killed the entire edge network.
The Lesson: Dependency mapping isn’t optional. You need to understand what relies on what, and implement circuit breakers to prevent cascade failures.
3. Metastable Failure States
AWS couldn’t recover even after fixing the root cause because retry storms prevented stabilization. This is a known distributed systems problem that’s hard to test for.
The Lesson: Exponential backoff with jitter isn’t just a nice-to-have. Rate limiting needs to work in both directions: limiting requests to failing services AND limiting recovery traffic.
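The standard remedy is capped exponential backoff with "full jitter": each client sleeps a random amount up to an exponentially growing ceiling, so recovering clients spread out instead of hammering a just-healed service in synchronized waves. A minimal sketch (the function name and defaults are mine):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0):
    """Retry `fn` with capped exponential backoff and full jitter.

    Sleep duration is uniform in [0, min(cap, base * 2**attempt)], so a
    thousand clients retrying after an outage land at random offsets
    rather than in lockstep retry storms.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
```

Usage: wrap every call to an external dependency, and keep `max_attempts` low for user-facing paths so failures surface quickly instead of piling up queued retries.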
4. Inadequate Blast Radius Control
Cloudflare’s Bot Management affected their entire global network. Azure Front Door’s failure impacted thousands of services. AWS us-east-1 is so large that regional failures become global incidents.
The Lesson: Fault isolation domains need to be smaller. Bulkheads matter.
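A bulkhead can be as simple as a bounded semaphore per dependency: cap how many in-flight calls one service may consume, and shed load once the partition is full, so a slow dependency can't soak up every worker in the process. A hedged Python sketch (the class is my own illustration):

```python
import threading

class Bulkhead:
    """Cap concurrent calls into one dependency.

    Each dependency gets its own Bulkhead, so a hung downstream service
    exhausts only its own slots, never the whole thread pool.
    """

    def __init__(self, max_concurrent):
        self.slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        # Non-blocking acquire: if the partition is full, degrade
        # immediately instead of queueing behind a slow dependency.
        if not self.slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        finally:
            self.slots.release()
```

The key design choice is `blocking=False`: queueing callers behind a failing dependency is exactly how retry storms and thread-pool exhaustion start.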
What This Means for Your Homelab
If you’re running a homelab (and you should be if you want to understand infrastructure), here’s what you can learn:
1. Document Your Dependencies
Map every external service you rely on:
- DNS providers (primary and secondary)
- CDN/reverse proxy services
- Certificate authorities
- Monitoring services
- Backup destinations
- Authentication providers
For each dependency, ask: “What happens if this goes down?”
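That question becomes mechanical once the map exists. A hypothetical sketch of a homelab dependency map, with helpers that answer "what breaks if X dies?" and "what does everything share?" (the service names are invented examples):

```python
# Hypothetical homelab dependency map: service -> external services it needs.
DEPENDENCIES = {
    "blog":       ["cloudflare-tunnel", "cloudflare-dns", "isp"],
    "monitoring": ["isp"],
    "backups":    ["cloud-storage", "isp"],
}

def blast_radius(failed):
    """Return every service that stops working if `failed` goes down."""
    return sorted(s for s, deps in DEPENDENCIES.items() if failed in deps)

def shared_dependencies():
    """Dependencies that every service relies on: the scariest kind."""
    shared = set.intersection(*(set(d) for d in DEPENDENCIES.values()))
    return sorted(shared)

print(blast_radius("cloudflare-tunnel"))  # prints: ['blog']
print(shared_dependencies())              # prints: ['isp']
```

Even a dict this small makes the conversation concrete: the output tells you which single dependency takes down everything at once, and that's where mitigation money goes first.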
2. Implement Graceful Degradation
Your blog doesn’t need a CDN to function. It needs a CDN to handle traffic spikes and DDoS attacks. Can your origin handle normal traffic if Cloudflare goes down?
Your application doesn’t need OAuth to function. Can users still access read-only content if authentication fails?
3. Multi-Provider DNS
This is low-hanging fruit. Use two DNS providers with the same records:
- Cloudflare (free tier)
- Route 53 (pennies per month)
- Or any other combination
Most domain registrars let you specify multiple nameservers. Use them.
4. Understand Your Recovery Time Objectives
If Cloudflare Tunnels go down, how long until you can switch to a direct connection or VPN? If your primary DNS provider fails, how long until queries propagate to your secondary?
These aren’t theoretical questions. On November 18, people who had answers recovered faster.
5. Test Your Failure Modes
Chaos engineering isn’t just for Netflix. Intentionally break things in your homelab:
- Disable your primary DNS provider and see what happens
- Block access to your CDN and test origin performance
- Kill a disk in your RAID array
- Unplug a network cable
If you can’t test it, you don’t understand it.
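To make those experiments measurable, run a probe from outside your network while you break things; the durations it prints are your detection and recovery times. A minimal sketch using only the standard library (the URL and intervals are placeholders):

```python
import time
import urllib.request

def check(url, timeout=5):
    """Return True if `url` answers an HTTP request within `timeout`."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except Exception:
        return False

def watch(url, interval=30):
    """Poll `url` forever, printing state transitions with durations."""
    up, changed_at = check(url), time.monotonic()
    while True:
        time.sleep(interval)
        ok = check(url)
        if ok != up:
            elapsed = time.monotonic() - changed_at
            print(f"{url} {'UP' if ok else 'DOWN'} "
                  f"(previous state lasted {elapsed:.0f}s)")
            up, changed_at = ok, time.monotonic()
```

Start `watch("https://blog.example.com", interval=15)` from a VPS or a phone hotspot, then pull the plug on your tunnel. The number it prints when the site comes back is your real recovery time, not your hoped-for one.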
The Career Angle: Why This Knowledge Is Worth £60k-80k
These three outages represent exactly the kind of knowledge that separates junior sysadmins making £35k from Senior Site Reliability Engineers making £75k.
Interview Gold: “Tell Me About a Time You Dealt With a Vendor Outage”
This is a common question in infrastructure roles. Most candidates say something generic about “following the incident response plan” or “checking the vendor’s status page.”
But candidates who understand distributed systems failure modes? They talk about:
- Implementing circuit breakers to prevent cascade failures
- Setting up multi-provider DNS for failover
- Using health checks to automatically route around failed dependencies
- Building monitoring that distinguishes between your failures and vendor failures
- Designing systems that degrade gracefully when external dependencies fail
The homelab version: “When Cloudflare went down in November 2025, my self-hosted blog was affected because I relied on Cloudflare Tunnels. I set up an automated failover to a direct connection with proper firewall rules and DDoS mitigation. Now when my primary edge provider fails, traffic automatically fails over to my secondary path within 60 seconds.”
That’s the kind of answer that gets you hired.
Production Skills That Translate Directly
Understanding these outages teaches you:
1. Dependency Management: In enterprise SRE roles, you’ll be asked to map service dependencies and identify single points of failure. If you can articulate why AWS us-east-1’s DNS failure cascaded to EC2, you understand distributed systems architecture.
2. Chaos Engineering: Companies like Netflix, Amazon, and Google intentionally inject failures to test resilience. If you’ve intentionally broken your homelab to test failure modes, you have practical chaos engineering experience.
3. Incident Response: These outages demonstrate textbook incident response patterns: detection, triage, mitigation, resolution, and post-mortem. If you’ve written runbooks for your homelab incidents, you have transferable skills.
4. Trade-Off Analysis: Every architecture decision is a trade-off. Cloudflare Tunnels are convenient but create a dependency. Direct exposure is more resilient but requires careful security configuration. Understanding these trade-offs is fundamental to infrastructure engineering.
Specific Skills Worth Money
| Skill | Homelab Implementation | Enterprise Application | Salary Impact |
|---|---|---|---|
| Multi-region architecture | Run services in multiple cloud providers | Design applications that span AWS regions | £60k-75k (Cloud Architect) |
| Chaos engineering | Intentionally break homelab services | Implement GameDay exercises | £65k-80k (Senior SRE) |
| Incident response | Document homelab outages with post-mortems | Lead production incidents | £55k-70k (SRE/DevOps) |
| Observability | Implement Prometheus/Grafana/Loki | Design monitoring for microservices | £60k-75k (Observability Engineer) |
| Disaster recovery | Test backup restoration procedures | Design multi-region DR strategy | £70k-85k (Principal Engineer) |
Certification Relevance
These outages directly relate to:
- AWS Solutions Architect Associate: Multi-region design, Route 53 failover
- Azure Administrator/Architect: Azure Front Door, Azure DNS, traffic management
- Kubernetes certifications: Understanding failure domains, pod disruption budgets
- SRE certifications: Incident response, post-mortem culture, reliability patterns
What Enterprises Do Differently (And What You Can Steal)
In production environments, here’s how these outages would be handled differently:
1. Multi-Cloud DNS with Health Checks
Enterprise DNS strategies use multiple providers with active health checking:
- Primary: AWS Route 53 (global health checks)
- Secondary: Cloudflare DNS (automatically activated if Route 53 fails)
- Tertiary: On-premises DNS (for split-horizon scenarios)
Homelab version: Use Cloudflare and Route 53 with the same records. Configure your domain registrar to list both as nameservers.
2. CDN Failover Strategies
Large sites don’t rely on a single CDN. They use:
- Primary: Cloudflare (performance)
- Secondary: AWS CloudFront (reliability)
- Tertiary: Azure Front Door (vendor diversity)
Traffic management automatically fails over based on health checks.
Homelab version: Configure your application to serve directly from origin if CDN health checks fail. Use a simple nginx proxy with health check logic.
3. Circuit Breakers and Bulkheads
Services implement the circuit breaker pattern: when a dependency starts failing, stop calling it for a period and either return cached data or degrade gracefully.
Homelab version: If your authentication service is down, serve read-only cached content instead of showing error pages.
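Here is a minimal version of that pattern (my sketch; production circuit breaker libraries add richer half-open probing, metrics, and per-endpoint state):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive errors,
    stop calling the dependency for `reset_after` seconds and serve the
    fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # circuit open: don't even try
            self.opened_at = None      # cooldown over: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
                self.failures = 0
            return fallback()
        self.failures = 0
        return result

auth_breaker = CircuitBreaker()
page = auth_breaker.call(
    lambda: "personalized page",       # normal path via auth service
    lambda: "read-only cached page",   # degraded path when auth is down
)
```

The payoff is twofold: users get a degraded page instead of a 5xx, and the failing dependency gets breathing room to recover instead of a retry storm.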
4. Runbooks and Escalation Procedures
Every dependency has documented failure scenarios:
- What breaks when this service fails?
- How do we detect it?
- What’s the immediate mitigation?
- Who needs to be notified?
- What’s the long-term fix?
Homelab version: Create a simple markdown document for each service with these sections. When something breaks at 2 AM, you’ll thank yourself.
5. Blameless Post-Mortems
Notice how AWS, Azure, and Cloudflare all published detailed post-mortem reports? That’s not just good PR, it’s organizational learning.
Homelab version: When something breaks in your homelab, write it up. What happened? Why? What did you learn? How do you prevent it next time? These documents become your portfolio.
The Uncomfortable Truth About Cloud Resilience
Here’s what these outages really teach us: perfect reliability is impossible, and the pursuit of it creates new failure modes.
AWS built redundant DNS automation to prevent manual errors. The automation introduced a race condition.
Azure implemented safety checks on deployments. A software defect bypassed those checks.
Cloudflare optimized Bot Management with pre-allocated feature limits. A database permissions change bloated the configuration file past those limits.
Every solution creates new problems. Every abstraction layer introduces new failure modes.
This isn’t an argument against cloud providers or managed services. It’s a recognition that complexity has a cost, and that cost is measured in failure modes you haven’t imagined yet.
The SRE Perspective: Google’s Site Reliability Engineering book teaches that 100% reliability is the wrong goal. Every extra nine of uptime (99.9% to 99.99%) costs exponentially more and slows down innovation. The right question isn’t “how do we prevent all failures?” but “how do we fail gracefully and recover quickly?”
Practical Action Items: What You Should Do Right Now
Whether you’re running a homelab or managing production infrastructure, here’s what you can implement today:
For Homelab Builders:
- Set up secondary DNS (30 minutes)
  - Create Route 53 hosted zone
  - Mirror your Cloudflare DNS records
  - Add both nameservers to your domain registrar
- Document your dependencies (1 hour)
  - List every external service you use
  - Identify what breaks if each one fails
  - Prioritize mitigations based on impact
- Implement basic monitoring (2 hours)
  - External uptime monitoring (UptimeRobot free tier)
  - Log aggregation (Grafana Loki)
  - Set up alerts for critical services
- Create failover documentation (1 hour)
  - How to switch off Cloudflare Tunnels if needed
  - How to access services if DNS fails
  - Emergency contact information for vendors
- Test a failure scenario (30 minutes)
  - Disable Cloudflare and see what breaks
  - Document how long recovery takes
  - Identify improvements needed
For Infrastructure Professionals:
- Audit your architectural dependencies
  - Map service-to-service dependencies
  - Identify single points of failure
  - Calculate blast radius for each critical service
- Implement circuit breakers
  - Add timeout and retry logic to all external calls
  - Return cached/default data when dependencies fail
  - Monitor circuit breaker state in your observability platform
- Set up multi-provider strategies
  - Secondary DNS provider
  - CDN failover capabilities
  - Geographic diversity for critical services
- Practice incident response
  - Run GameDay exercises
  - Intentionally break non-critical services
  - Measure detection, response, and recovery times
- Write post-mortems
  - Document every production incident
  - Focus on systemic improvements, not blame
  - Share learnings across teams
The Future: Decentralization or Consolidation?
These three outages raise an important question: is the internet becoming more fragile or more resilient?
The Consolidation View: As cloud providers mature, they invest more in reliability. AWS’s 99.95% five-year uptime is genuinely impressive. These outages are rare exceptions, not the norm.
The Decentralization View: Concentration of internet traffic in five companies creates systemic risk. A truly resilient internet requires more providers, more diversity, and more redundancy.
I suspect the answer is “both.” The hyperscalers will continue to dominate because they offer capabilities that smaller providers can’t match. But the savvy builders (whether running homelabs or production infrastructure) will design for vendor diversity and graceful degradation.
The Edge Computing Shift: One interesting trend is the move toward edge computing. Cloudflare Workers, AWS Lambda@Edge, Azure CDN with Functions all push compute closer to users and reduce reliance on centralized regions. But as we saw on November 18, edge infrastructure can fail globally too.
Lessons Learned: The Six Principles of Resilient Infrastructure
After analyzing these three outages, here are the principles I’m applying to my own infrastructure:
1. Assume Everything Fails
Don’t ask “what if Cloudflare goes down?” Ask “when Cloudflare goes down, how long until I’m back online?”
2. Vendor Diversity Costs Money, But Failure Costs More
Multi-provider strategies are more expensive and complex. They’re also more resilient. Calculate the cost of downtime vs. the cost of redundancy.
3. Observability Isn’t Optional
You can’t fix what you can’t see. Invest in monitoring, logging, and tracing before you invest in redundancy.
4. Document Everything
Runbooks, architecture diagrams, dependency maps: these aren’t busywork. They’re how you survive incidents at 2 AM.
5. Test Your Failure Modes
If you haven’t intentionally broken it, you don’t know how it fails. Chaos engineering applies to homelabs too.
6. Design for Graceful Degradation
Perfect uptime is impossible. Graceful degradation is achievable. Serve cached content when databases fail. Return default values when APIs time out. Show read-only views when authentication breaks.
Final Thoughts: The Real Lesson
On November 18, my self-hosted blog went offline because Cloudflare had a bad day. On October 29, thousands of businesses couldn’t use the Azure Portal to manage their infrastructure. On October 20, AWS us-east-1’s DNS failure cascaded through the entire region for 15 hours.
These weren’t hypothetical scenarios or war stories from decades ago. These were three weeks in late 2025 when the internet’s fragility was on full display.
The real lesson isn’t “cloud providers are unreliable” or “you should self-host everything.” The lesson is that resilience is a design choice, not a vendor feature.
You can’t outsource your understanding of how systems fail. You can’t assume that because you’re paying AWS or Azure or Cloudflare that they’ll handle everything. You need to understand your dependencies, design for failure, and practice incident response.
This is exactly the kind of knowledge that separates homelabbers from infrastructure engineers. It’s the difference between someone who follows tutorials and someone who understands distributed systems.
And it’s worth £15k-25k in salary difference.
The Homelab Challenge: Pick one of your homelab services and intentionally break it. Document what happens. Figure out how to make it more resilient. Write up your findings.
That document is portfolio gold. That experience is interview gold. That knowledge is career gold.
Because the next time a hyperscaler fails (and there will be a next time), you’ll be the person in the room who understands why it happened, how to mitigate it, and what it teaches us about building resilient systems.
And that person gets the £75k SRE job.
Resources & Further Reading
Official Post-Mortems:
- Cloudflare Outage Analysis – November 18, 2025
- Azure Front Door Control-Plane Failure – InfoQ
- AWS DynamoDB DNS Race Condition – InfoQ
Enterprise References:
- Google SRE Book: Eliminating Toil
- AWS Well-Architected Framework: Reliability Pillar
- Azure Architecture Center: Resiliency Patterns
Dave runs ReadTheManual and builds infrastructure at HiltDigital. He’s been running production infrastructure for 15 years and homelabs for longer. When hyperscalers fail, his self-hosted blog usually goes down too—but he’s working on that.
Enterprise infrastructure implementations: HiltDigital
