Azure’s 10-Hour Outage Exposes a Big Cloud Myth

This Week in Cloud — September 18, 2025

Welcome back to The Cloud Cover, your essential guide to navigating the dynamic world of cloud for Solutions Architects, engineers, and IT leaders. This week, Azure reminded us that even the most sophisticated hyperscale clouds can be humbled by a control plane failure, while all the major players raced to build the future of "agentic AI." Let's dive in.

Azure's Cascading Failure Exposes Hyperscale Fragility

A nearly 10-hour service management disruption in Azure's critical East US 2 region on September 10th served as a stark reminder of the complexities lurking within hyperscale architecture. The incident, which ran from 09:03 to 18:50 UTC, prevented customers from starting, stopping, or modifying resources, with Virtual Machines being the hardest hit. The impact rippled through dependent services like Azure Kubernetes Service (AKS), Databricks, and Azure Backup, grinding operations to a halt for many.

The root cause wasn't a simple hardware issue but a perfect storm within Azure's core control plane. According to Microsoft's post-incident review, a performance degradation in the "Allocator" service was severely worsened by a recent software change that altered its throttling behavior. Under the immense load of the East US 2 region, provisioning requests began to retry aggressively, creating a feedback loop that overwhelmed the system.
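For architects asking what this means for their own code: clients that retry immediately and in lockstep are one way a slowdown becomes a storm. A common mitigation is capped exponential backoff with random jitter. The sketch below is illustrative only and is not Microsoft's implementation; `TransientError`, `call_with_retries`, and the default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, e.g. a throttled request."""


def call_with_retries(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call operation(), retrying on TransientError with capped, jittered backoff.

    Before retry number `attempt` (0-based) the wait is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients spread out instead of
    retrying in synchronized waves.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the failure, don't loop forever
            ceiling = min(cap, base * (2 ** attempt))
            sleep(rng.uniform(0, ceiling))
```

The bounded attempt count matters as much as the jitter: a client that gives up after a few tries stops adding load to a control plane that is already struggling.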

Most concerning was the cascading failure between Availability Zones. As the platform's automated resiliency logic tried to redirect traffic from the first impacted AZ to a second, the sudden influx of requests overloaded and toppled the second AZ as well. This event challenges the core assumption of an AZ as a fully independent failure domain and highlights a critical risk for architects building resilient systems. Adding to the operational challenges, Microsoft's first public notification on its status page didn't appear until about eight hours after the customer impact began, a significant communication delay the company itself acknowledged.
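The failover arithmetic is worth checking against your own deployments. The sketch below is a deliberately simplified model with made-up numbers (none come from Microsoft's review): it assumes a failed zone's load is split evenly across the surviving zones and asks which of them would then exceed capacity. The function names and the zone labels are hypothetical.

```python
def load_after_zone_loss(loads, failed):
    """Return per-zone load after `failed` goes down and its load is
    redistributed evenly across the remaining zones (a simplifying assumption)."""
    survivors = {zone: load for zone, load in loads.items() if zone != failed}
    if not survivors:
        raise ValueError("no surviving zones to absorb the load")
    extra = loads[failed] / len(survivors)
    return {zone: load + extra for zone, load in survivors.items()}


def overloaded_zones(loads, capacity, failed):
    """List surviving zones whose post-failover load exceeds their capacity."""
    after = load_after_zone_loss(loads, failed)
    return sorted(zone for zone, load in after.items() if load > capacity[zone])


# Hypothetical example: three zones, each with capacity 100 units.
capacity = {"az1": 100, "az2": 100, "az3": 100}
busy = {"az1": 70, "az2": 70, "az3": 70}    # 70% utilized everywhere
roomy = {"az1": 60, "az2": 60, "az3": 60}   # 60% utilized everywhere

print(overloaded_zones(busy, capacity, failed="az1"))   # ['az2', 'az3']
print(overloaded_zones(roomy, capacity, failed="az1"))  # []
```

In this toy model, three zones running at 70% cannot absorb the loss of one, while three zones at 60% can. Real failover is rarely this even, and retries inflate the redirected load further, so treat the result as an optimistic lower bound on the headroom you need.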

🔍 The Rundown

AWS

Enhanced Mac Development: Amazon launched new EC2 M4 and M4 Pro Mac instances, promising 15-20% better build performance for developers in the Apple ecosystem.

LocalStack VS Code Integration: In a pragmatic nod to how developers actually work, AWS added official support for LocalStack in the AWS Toolkit for VS Code, smoothing the path from local development to cloud deployment.

Azure

Confidential PostgreSQL Launch: Microsoft announced the general availability of Azure Confidential Computing for PostgreSQL, allowing data to be processed inside secure hardware-based enclaves.

September Security Updates: The monthly "Patch Tuesday" security cycle addressed more than 80 vulnerabilities, including two publicly disclosed zero-days affecting the Windows SMB Server and a JSON library bundled with Microsoft SQL Server.

AI Migration Tools: At its Migrate and Modernize Summit, Microsoft showcased new AI-assisted migration tools, including an integration between Azure Migrate and GitHub Copilot to accelerate modernization projects.

GCP

PayPal Partnership: Google and PayPal announced a multi-year infrastructure modernization partnership on GCP to co-develop new "agentic commerce" experiences using Google's AI.

UK Defence Contract: Google Cloud secured a major £400 million sovereign cloud contract for the UK Ministry of Defence, built on its "air-gapped" Google Distributed Cloud platform.

Conversational Commerce Launch: A new Vertex AI service, the Conversational Commerce agent, became generally available to help retailers transform search bars into natural, AI-guided shopping conversations.

Bandai Namco Gaming: The gaming giant launched its new title on GCP, using a suite of services including GKE with Agones and Cloud Spanner to manage its global, cross-platform experience.

OCI

OpenAI Mega Deal: OpenAI confirmed a five-year deal worth up to $300 billion to build its next-generation AI data centers on Oracle Cloud Infrastructure. It is a monumental win that validates OCI's focus on high-performance computing, though it also feeds fears of an AI-driven bubble.

UK Sovereign Expansion: Oracle announced it is expanding its UK government cloud offerings with new AI infrastructure as part of a $5 billion investment plan.

📈 Trending Now: Is Agentic AI Reaching Escape Velocity?

For months, we’ve been tracking the rise of agentic AI from a theoretical concept to an emerging cloud category. Finally, we are starting to see the landscape crystallize, as hyperscalers move past promising roadmaps and into the realm of hard metrics and major commercial bets. We're no longer just talking about what's possible; we're seeing what developers are building, and what companies are buying, right now.

Evidence for this inflection point came from across the board. AWS celebrated its open-source Strands Agents SDK rocketing past one million downloads in less than four months, a tangible sign of massive developer interest in building multi-agent AI systems. Meanwhile, Google Cloud and PayPal put a massive commercial stake in the ground, not only migrating infrastructure but co-developing "agentic commerce" experiences and advocating for new industry standards to govern AI-led transactions.

Even the strategic thinking has matured. Microsoft's "Agent Factory" concept provided a blueprint for the next crucial step: building secure, safe, and governed AI agents for the enterprise. The takeaway is that the conversation has fundamentally changed. The question is no longer if agents will be the next application layer, but which competing ecosystem—from developer frameworks to security blueprints and commercial solutions—will win the race to define it.

📅 Event Radar

Oct 9: AWS Summit Bogota | Bogota, Colombia | Registration still open
Oct 8-10: Forrester Tech & Innovation Summit EMEA | London + Virtual | Speakers list now available
Oct 28-29: Google Cloud Public Sector Summit | Washington DC | Register today!

👋 Until Next Week

That's a wrap for this week. The race to build AI-powered everything continues to accelerate, but as Azure's lengthy outage demonstrated, operational excellence and transparent communication remain the bedrock of cloud trust. While the promise of intelligent agents is exciting, their utility depends entirely on the resilience of the infrastructure they run on. We'll be watching to see how the providers balance the two.

Do you enjoy these emails? Your friends and colleagues might, too! Help us grow the cloud community by sharing the newsletter with others.