Emergency Runbook: What IT Should Do When a Major Cloud Provider Has a Widespread Outage
You’ve just discovered elevated 5xx errors, alerts are flooding Slack, and the provider status page shows a widespread incident. Your execs want answers, customers want uptime, and your on-call SRE is triaging. This playbook gives a compact, prioritized set of actions: triage steps, rapid mitigations, and ready-to-send customer comms templates, designed for technology professionals in 2026 who need to act fast and reduce blast radius when a cloud provider outage escalates into a company incident.
Why this matters now (2026 context)
Large provider incidents remain rare but high-impact. Late 2025 and early 2026 saw multiple high-profile disruptions across CDN, DNS and hyperscaler control planes, reinforcing that no single provider is immune to systemic failure. At the same time, SRE practices evolved: runbooks-as-code, chaos engineering, and multi-path failover are now standard mitigation strategies. This playbook assumes you operate in that environment and are ready to apply those tactics under time pressure.
Inverted pyramid: immediate priorities (first 0–15 minutes)
When the alert fires, act in this strict order — it prevents wasted effort and secures the most critical outcomes first.
- Declare an incident and open a channel: Create a dedicated incident channel (Slack/MS Teams) and invite the incident commander (IC), leads for infra, app, DB, network and customer ops.
- Assess scope quickly: Determine whether the incident affects control plane, data plane, specific regions, or multiple providers. Use provider status pages and external monitors.
- Inform stakeholders with a short, precise message: Post an initial customer comm (templates below) and an internal summary with priority: impacted services, suspected cause, and next update ETA.
- Prevent change churn: Put a temporary change freeze on unrelated deployments to avoid cascading failures.
- Capture evidence: Start an incident log (Timeline) and capture key metrics and logs for postmortem.
Quick checklist: first commands and checks
- Check provider status page(s): list region/zone incidents.
- Run curl -I and dig from multiple vantage points (internal and external): verify DNS resolution, HTTP responses and TLS handshake results.
- Verify control plane API errors (e.g., auth failures) and any quota or rate-limit spikes in provider dashboards.
- Check critical dashboards: 5xx rate, latency percentiles (P50/P95/P99), connection errors, and DB replication lag.
- Confirm whether your CDN/edge is affected (purge failures, edge 503s).
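The checks above can be scripted so on-call engineers run one command instead of recalling flags under stress. A minimal sketch, assuming curl and dig are installed; hostnames are placeholders:

```shell
# Sketch: probe an endpoint and classify the HTTP status code.
# Substitute your own hostnames; -m 5 keeps probes fast during an incident.

classify_status() {
  # Coarse classification of a numeric HTTP status code.
  case "$1" in
    2??) echo "healthy" ;;
    3??) echo "redirect" ;;
    4??) echo "client-error" ;;
    5??) echo "server-error" ;;
    *)   echo "unknown" ;;      # e.g. curl's 000 when the connection fails
  esac
}

probe() {
  host="$1"
  dig +short "$host"            # DNS resolution check
  code=$(curl -s -o /dev/null -w '%{http_code}' -m 5 "https://$host/")
  echo "$host: $code ($(classify_status "$code"))"
}

# Example (placeholder hostname):
# probe status.example.com
```

Run it from at least one internal and one external vantage point; diverging results usually point at DNS or edge, not origin.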
Diagnosis matrix: triage decision tree (0–30 minutes)
Use this decision tree to decide whether to fail over, mitigate, or wait for provider remediation.
- Is this a provider-wide control-plane outage?
- Yes → You may be unable to modify resources (spawn VMs, change LB rules, update DNS). Focus on static mitigations (cache, feature flags) and customer comms.
- No → You can perform active failovers and infrastructure changes in affected regions.
- Is data plane impacted but control plane is healthy?
- Yes → Consider automated region failover or switching to standby replicas.
- No → The issue could be application-level; roll back recent configs if necessary.
- Is DNS resolution failing or slow?
- Yes → Use configured secondary DNS providers or lower TTLs; trigger DNS failover if pre-configured.
- No → Probe deeper into network/BGP and CDN behaviour.
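The decision tree can be encoded directly in a runbook script so the recommended first action is unambiguous under stress. A sketch, taking the three yes/no answers from the questions above:

```shell
# Sketch: the triage decision tree as a function. Inputs are yes/no
# answers from the checks above; output is the recommended first action.

triage() {
  cp_down="$1"   # provider-wide control-plane outage? yes/no
  dp_down="$2"   # data plane impacted? yes/no
  dns_bad="$3"   # DNS resolution failing or slow? yes/no

  if [ "$cp_down" = "yes" ]; then
    echo "static-mitigations"   # cache, feature flags, customer comms
  elif [ "$dp_down" = "yes" ]; then
    echo "region-failover"      # standby replicas / automated region failover
  elif [ "$dns_bad" = "yes" ]; then
    echo "dns-failover"         # secondary DNS provider, lower TTLs
  else
    echo "app-rollback"         # likely application-level; roll back recent configs
  fi
}
```

The value is not the code itself but that the ordering of checks is written down and agreed before the incident.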
Rapid mitigation tactics (15–120 minutes)
Ordered by safety and speed. These are practical, provider-neutral techniques you can apply during a major outage.
1. Graceful degradation and circuit breakers
Turn off non-essential services (analytics, background jobs), limit API features, and engage circuit breakers to reduce load on failing subsystems. Use feature flags for quick toggles.
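Feature flags do not have to be fancy to work during an outage. A minimal sketch of file-based kill switches; the flag directory and flag names are hypothetical, and real deployments usually use a feature-flag service, but a shared filesystem or object store still works when the provider control plane is down:

```shell
# Sketch: file-based kill switches for graceful degradation.
# FLAG_DIR is a hypothetical path; a flag file's presence disables a feature.

FLAG_DIR="${FLAG_DIR:-/etc/app/flags}"

disable_feature() { mkdir -p "$FLAG_DIR" && touch "$FLAG_DIR/$1.off"; }
enable_feature()  { rm -f "$FLAG_DIR/$1.off"; }
feature_enabled() { [ ! -e "$FLAG_DIR/$1.off" ]; }

# During an incident: shed non-essential load first, e.g.
# disable_feature analytics
# disable_feature background-jobs
```

Application code checks feature_enabled on each request, so toggles take effect without a deploy.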
2. Cache everything possible
If origin connectivity is problematic, increase cache TTLs at CDN and browser levels, and serve stale content where safe. For APIs, return cached responses for low-risk endpoints.
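If your edge or reverse proxy runs nginx, serving stale content is largely a one-directive change. An illustrative fragment; the cache zone name, timings and upstream are examples, not recommendations:

```nginx
# Illustrative: serve stale cached responses when the origin errors out.
proxy_cache_path /var/cache/nginx keys_zone=edge:10m max_size=1g inactive=24h;

server {
    location / {
        proxy_cache edge;
        proxy_cache_valid 200 10m;
        # Keep serving cached copies while the origin is down or erroring:
        proxy_cache_use_stale error timeout updating
                              http_500 http_502 http_503 http_504;
        proxy_cache_background_update on;
        proxy_pass http://origin_backend;   # placeholder upstream
    }
}
```

Equivalent behaviour exists on most CDNs as stale-if-error / "serve stale" settings; enable it before the incident, not during.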
3. DNS & traffic steering
If your primary DNS/provider is affected and you have multi-DNS configured, initiate failover to an alternate authoritative DNS provider. If using Route53, Cloud DNS or comparable services, ensure your secondary has synchronized zone files and pre-warmed records.
Quick DNS tactics:
- Lower TTLs pre-incident as part of runbook practice. If TTLs are high, use anycast CDNs or edge caching as a bridge.
- Trigger pre-authorized DNS changes from a pre-approved automation pipeline (runbook-as-code) to avoid needing provider console access when control plane is slow.
4. Failover to warm/cold standby
If you maintain standby infrastructure in another provider or region, promote read-replicas and update routing to redirect traffic. Validate database consistency first for writes-heavy systems — use read-only failover if necessary to preserve data integrity.
5. Application-level routing changes
For Kubernetes clusters, you can patch Services/Ingress or update ExternalName targets to redirect to healthy clusters. Example quick command to redirect traffic at the ingress level:
kubectl patch ingress my-ingress -n prod --type merge -p '{"spec":{"rules":[{"host":"example.com","http":{"paths":[{"path":"/","pathType":"Prefix","backend":{"service":{"name":"maintenance-service","port":{"number":80}}}}]}}]}}'
6. Use alternative CDNs or edge providers
If a CDN provider is healthy while the origin provider is partially down, route traffic through the CDN and configure origin shielding and origin pull from secondary sources.
7. Fallback to static mode or maintenance pages
Serve a minimal static site or status page from globally distributed object storage (S3/Blob) or an edge platform; a simple status page with an honest ETA reduces customer frustration.
Technical play snippets: DNS failover and Route53 example
When you need to programmatically failover DNS, pre-authorized CLI scripts are lifesavers. Example JSON for AWS Route53 change-resource-record-sets (replace IDs and IPs):
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "203.0.113.42" }]
      }
    }
  ]
}
Apply with:
aws route53 change-resource-record-sets --hosted-zone-id Z123456 --change-batch file://change.json
Note: If provider control plane is impaired, these API calls may fail. Always have an out-of-band method to execute pre-approved changes (SSH bastion to management host, jumpbox with AWS CLI credentials, or runbook automation in a separate provider).
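A small pre-flight wrapper catches a malformed change batch before you spend a possibly rate-limited API call on it. A sketch, assuming python3 is available on the jumpbox; file names and the hosted zone ID are placeholders:

```shell
# Sketch: pre-flight validation of a Route53 change batch before applying it.

preflight() {
  f="$1"
  # Reject malformed JSON before spending an API call on it.
  python3 -m json.tool "$f" > /dev/null 2>&1 || {
    echo "invalid JSON: $f" >&2; return 1;
  }
  # Sanity-check that a change action is actually present.
  grep -q '"Action"' "$f" || {
    echo "no Action field: $f" >&2; return 1;
  }
  echo "preflight ok: $f"
}

# Usage (requires AWS credentials; zone ID is a placeholder):
# preflight change.json && \
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id Z123456 --change-batch file://change.json
```

Keep the wrapper in the same repo as the change templates so both are versioned and reviewed together.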
Customer communications: templates and cadence
Clear, honest and frequent updates reduce churn and support tickets. Use templates below; customize SLA and ETA fields.
Initial public status update (first message)
We’re aware of a widespread cloud provider incident affecting authentication and API traffic for customers in multiple regions. Our engineering team is actively triaging. Impact: login/API errors and delayed page loads. Next update: in 15 minutes. We will post detailed updates to our status page: https://status.example.com
Internal incident channel opener
[INCIDENT] Major cloud provider outage — creating incident. IC: @oncall. Initial assessment: elevated 5xx across services, control plane errors on provider console. Action plan: 1) Confirm scope, 2) Prevent change churn, 3) Execute mitigations (cache-first, DNS failover if available). Next update in 10 minutes. Timeline will be logged: /docs/incidents/2026-xx-xx.
Customer update template (periodic)
Update: We continue to see degraded performance due to a third-party cloud provider incident. Our teams are executing pre-approved mitigations (increasing cache TTLs, promoting standby services, and routing through alternate CDNs). Estimated next update: in 30 minutes. Impacted features: API write operations, dashboard logins. We prioritize customers on SLA tier: reach out to [support-email] for escalations.
Resolution & follow-up
Resolved: Service has returned to normal as of [time UTC]. We are monitoring for stability and will publish a detailed postmortem within 72 hours. If you experienced data loss or transaction failures, contact [support-escalations] and reference incident #INC-2026-XX.
Operational playbook: roles & responsibilities
Assign clear roles in your incident process to reduce confusion:
- Incident Commander (IC): Runs the incident, prioritizes actions, and approves public comms.
- Technical Lead(s): Infra, App, DB, Network — each responsible for diagnostics and mitigations in their domain.
- Customer Ops / Communications Lead: Crafts public updates and coordinates with Sales & Legal.
- Recorder: Maintains the timeline and evidence for postmortem.
Pre-incident readiness (what to build beforehand)
The best mitigation is preparation. These are non-negotiable items to include in your runbook before an incident occurs.
- Runbook-as-code: Store incident scripts and DNS change templates in a repo protected with MFA and approval gates.
- Secondary management paths: Keep a small set of credentials and pre-authorized automation in a different provider or an air-gapped management plane.
- Multi-DNS and multi-CDN: Pre-provision authoritative DNS with a standby provider and pre-sync zone files.
- Warm standbys and replication: Replicate critical data to a different provider/region and run DR drills quarterly.
- Chaos engineering: Regularly test partial and full-provider failover scenarios; document mean time to failover.
- Customer comm templates & SLA runbooks: Keep templates and escalation contact lists ready; practice messaging in tabletop exercises.
Observability and evidence collection
During the incident, collect evidence for both live decisions and postmortem analysis:
- Metrics snapshots for 30–120 min pre-incident and during incident (Prometheus, Datadog, New Relic).
- Control plane API logs and CloudTrail-like audit trails.
- Packet captures where appropriate (tcpdump) and DNS query logs.
- Timed screenshots of provider status pages and error responses.
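For the metrics snapshots, it helps to have the key queries written down in advance rather than composed mid-incident. A PromQL sketch, assuming a Prometheus setup; the metric and label names here are illustrative and vary by instrumentation:

```promql
# Ratio of 5xx responses over the last 5 minutes, per service
# (http_requests_total and its labels are illustrative names):
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))

# P99 latency from a histogram metric (requires the le label):
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
```

Equivalent saved queries exist in Datadog and New Relic; the point is to pin them to an incident dashboard beforehand.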
Legal, compliance and data integrity considerations
Major provider incidents may have regulatory impact. During mitigation remember to:
- Document any cross-border data movements if you route to different regions/providers.
- Preserve logs for compliance — do not delete logs to improve apparent availability.
- Coordinate with legal/security if there’s evidence of data corruption or breach.
Postmortem checklist & learning loop
After the service stabilizes, perform a blameless postmortem. Key sections to include:
- Incident timeline and decisions (with timestamps).
- Root cause analysis (provider vs. own system).
- Mitigations executed and their effectiveness.
- Action items with owners and firm deadlines (implement multi-DNS, add runbook scripts, lower TTLs).
- Customer impact and communication effectiveness review.
Recent trends that should shape your runbook (2026)
Several industry trends in late 2025 and early 2026 are relevant:
- Runbooks-as-code and IaC for incident actions: Automate safe, pre-approved actions to avoid manual errors under stress.
- Edge-first architectures: Increasing reliance on edge computing makes graceful degradation and caching more effective.
- Regulatory scrutiny: Incidents now often require faster public disclosures and postmortems for regulated sectors.
- Multi-cloud and hybrid strategies: More teams maintain a minimal footprint in alternate providers to reduce single-provider blast radius.
- Observability convergence: Unified telemetry across clouds simplifies triage during cross-provider incidents.
Example: condensed 30-minute incident play
- 0:00 – Declare incident, open channel, post initial status update.
- 0:00–0:05 – Run probes (curl/dig/traceroute) from 3 vantage points; capture metrics snapshot.
- 0:05–0:10 – IC decides: control plane down vs data plane. If CP down → issue change freeze; if DP down → plan failover.
- 0:10–0:20 – Execute safe mitigations: increase CDN TTL, enable static maintenance page, throttle background jobs.
- 0:20–0:30 – If valid, trigger DNS failover or route traffic to warm standby; send 30-min customer update.
Final notes and recommended runbook hygiene
Keep your runbook living: update it after every incident and run quarterly drills that involve stakeholders beyond engineering (support, legal, sales). Test failover paths end-to-end and ensure your execs know the expected RTO/RPO for each service tier.
Call to action
Incidents are inevitable; preparation separates recovery from catastrophe. Download our ready-to-edit emergency runbook template, including DNS failover scripts, incident timeline spreadsheet, and customer comm templates — tailored for 2026 multi-cloud realities. Visit pyramides.cloud/runbook-template to get the kit and schedule a free 30-minute runbook review with our SRE team.
Related Reading
- Network Observability for Cloud Outages: What To Monitor to Detect Provider Failures Faster
- How to Harden CDN Configurations to Avoid Cascading Failures
- CDN Transparency, Edge Performance, and Creative Delivery: Rewiring Media Ops for 2026
- Technical Brief: Caching Strategies for Estimating Platforms — Serverless Patterns for 2026
- The Evolution of Cloud-Native Hosting in 2026: Multi‑Cloud, Edge & On‑Device AI