Emergency Runbook: What IT Should Do When a Major Cloud Provider Has a Widespread Outage

Unknown
2026-02-15
10 min read

A concise, actionable emergency runbook for ops teams: triage steps, customer comms, and rapid mitigations during major cloud provider outages.

When a major cloud provider goes down: a concise emergency runbook for ops teams

You’ve just discovered elevated 5xx errors, alerts are flooding Slack, and the provider status page shows a widespread incident. Your execs want answers, customers want uptime, and your on-call SRE is triaging. This playbook gives a compact, prioritized set of actions: triage steps, rapid mitigations, and ready-to-send customer comms templates. It is written for technology professionals in 2026 who need to act fast and reduce blast radius when a cloud provider outage escalates into a company-wide incident.

Why this matters now (2026 context)

Large provider incidents remain rare but high-impact. Late 2025 and early 2026 saw multiple high-profile disruptions across CDN, DNS and hyperscaler control planes, reinforcing that no single provider is immune to systemic failure. At the same time, SRE practices evolved: runbooks-as-code, chaos engineering, and multi-path failover are now standard mitigation strategies. This playbook assumes you operate in that environment and are ready to apply those tactics under time pressure.

Inverted pyramid: immediate priorities (first 0–15 minutes)

When the alert fires, act in this strict order — it prevents wasted effort and secures the most critical outcomes first.

  1. Declare an incident and open a channel: Create a dedicated incident channel (Slack/MS Teams) and invite the incident commander (IC), leads for infra, app, DB, network and customer ops.
  2. Assess scope quickly: Determine whether the incident affects control plane, data plane, specific regions, or multiple providers. Use provider status pages and external monitors.
  3. Inform stakeholders with a short, precise message: Post an initial customer comm (templates below) and an internal summary with priority: impacted services, suspected cause, and next update ETA.
  4. Prevent change churn: Put a temporary change freeze on unrelated deployments to avoid cascading failures.
  5. Capture evidence: Start an incident log (Timeline) and capture key metrics and logs for postmortem.
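The incident log in step 5 can be as simple as an append-only file with UTC timestamps. A minimal sketch (the log path and sample entries are placeholders):

```shell
#!/bin/sh
# Append-only incident timeline: one UTC-stamped line per event.
# LOG path and the example entries below are placeholders.
LOG=${LOG:-incident-timeline.log}
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG"; }

log "Incident declared; IC assigned"
log "Change freeze issued for unrelated deployments"
tail -n 2 "$LOG"   # show the latest entries
```

Because every entry carries its own timestamp, the same file doubles as the postmortem timeline with no extra work.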

Quick checklist: first commands and checks

  • Check provider status page(s): list region/zone incidents.
  • Run curl -I and dig from multiple vantage points (internal and external): verify DNS, HTTP responses and TLS handshake results.
  • Verify control plane API errors (e.g., auth failures) and any quota or rate-limit spikes in provider dashboards.
  • Check critical dashboards: 5xx rate, latency percentiles (P50/P95/P99), connection errors, and DB replication lag.
  • Confirm whether your CDN/edge is affected (purge failures, edge 503s).
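The first probes above can be scripted as a loop run from each vantage point. The host list is a placeholder, and `classify_status` is an illustrative triage shorthand, not a standard tool:

```shell
#!/bin/sh
# First-pass probes: HTTP status plus DNS resolution per host.
# Replace the host list with your own critical endpoints.
classify_status() {        # rough verdict from an HTTP status code
  case $1 in
    2??|3??) echo OK ;;
    4??)     echo APP-LEVEL ;;                # reachable; likely your own config
    5??)     echo ORIGIN-OR-PROVIDER-ERROR ;;
    *)       echo CHECK-DNS-OR-NETWORK ;;     # 000, empty, or garbage
  esac
}

for host in example.com api.example.com; do
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "https://$host" 2>/dev/null)
  echo "$host http=${code:-000} verdict=$(classify_status "${code:-000}")"
  dig +short "$host" 2>/dev/null | head -n 3  # DNS answers from this vantage point
done
```

Running the same script internally and from an external host quickly shows whether the problem is provider-wide or specific to your network path.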

Diagnosis matrix: triage decision tree (0–30 minutes)

Use this decision tree to decide whether to fail over, mitigate, or wait for provider remediation.

  1. Is this a provider-wide control-plane outage?
    • Yes → You may be unable to modify resources (spawn VMs, change LB rules, update DNS). Focus on static mitigations (cache, feature flags) and customer comms.
    • No → You can perform active failovers and infrastructure changes in affected regions.
  2. Is data plane impacted but control plane is healthy?
    • Yes → Consider automated region failover or switching to standby replicas.
    • No → The issue could be application-level; roll back recent configs if necessary.
  3. Is DNS resolution failing or slow?
    • Yes → Use configured secondary DNS providers or lower TTLs; trigger DNS failover if pre-configured.
    • No → Probe deeper into network/BGP and CDN behavior.
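The tree above can be encoded as a tiny helper so the IC's first call is mechanical under stress. A sketch with hypothetical yes/no inputs for the three questions:

```shell
#!/bin/sh
# Decision tree from the triage matrix. Inputs are yes/no answers to:
# control plane down? data plane down? DNS failing?
triage() {
  cp_down=$1 dp_down=$2 dns_failing=$3
  if [ "$cp_down" = yes ]; then
    echo "static-mitigations"        # cache, feature flags, comms only
  elif [ "$dp_down" = yes ]; then
    echo "region-failover"           # promote standbys / fail over regions
  elif [ "$dns_failing" = yes ]; then
    echo "dns-failover"              # secondary DNS provider, lower TTLs
  else
    echo "investigate-app-layer"     # likely your own configs; consider rollback
  fi
}

triage yes no no    # prints static-mitigations
```

Wiring this into runbook automation (fed by your monitoring signals) removes one judgment call from the critical path.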

Rapid mitigation tactics (15–120 minutes)

Ordered by safety and speed. These are practical, provider-neutral techniques you can apply during a major outage.

1. Graceful degradation and circuit breakers

Turn off non-essential services (analytics, background jobs), limit API features, and engage circuit breakers to reduce load on failing subsystems. Use feature flags for quick toggles.
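If you don't run a feature-flag service, even a file-based kill switch works under pressure. A minimal sketch (the flag directory and flag names are hypothetical):

```shell
#!/bin/sh
# File-based kill switch: a feature is "on" unless a .off marker file exists.
# FLAG_DIR and the flag names are placeholders.
FLAG_DIR=${FLAG_DIR:-/tmp/incident-flags}
mkdir -p "$FLAG_DIR"

disable() { touch "$FLAG_DIR/$1.off"; }       # flip a feature off
enable()  { rm -f "$FLAG_DIR/$1.off"; }       # restore it after the incident
enabled() { [ ! -e "$FLAG_DIR/$1.off" ]; }    # guard before non-essential work

disable background-jobs
if enabled background-jobs; then echo "running jobs"; else echo "jobs paused"; fi
```

The same pattern (check a cheap, local switch before doing non-essential work) is what a real feature-flag SDK gives you, with auditability on top.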

2. Cache everything possible

If origin connectivity is problematic, increase cache TTLs at CDN and browser levels, and serve stale content where safe. For APIs, return cached responses for low-risk endpoints.
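In nginx, for example, "serve stale where safe" is a single directive on the proxy cache. The cache zone name and upstream below are placeholders for your own config:

```nginx
# Serve the last good cached response when the origin errors or times out.
# "api_cache" and the upstream host are placeholders.
location /api/ {
    proxy_pass https://origin.example.com;
    proxy_cache api_cache;
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_background_update on;   # refresh stale entries asynchronously
}
```

Most CDNs expose an equivalent "serve stale on error" setting; enabling it before an incident is cheap insurance.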

3. DNS & traffic steering

If your primary DNS/provider is affected and you have multi-DNS configured, initiate failover to an alternate authoritative DNS provider. If using Route53, Cloud DNS or comparable services, ensure your secondary has synchronized zone files and pre-warmed records.

Quick DNS tactics:

  • Lower TTLs pre-incident as part of runbook practice. If TTLs are high, use anycast CDNs or edge caching as a bridge.
  • Trigger pre-authorized DNS changes from a pre-approved automation pipeline (runbook-as-code) to avoid needing provider console access when control plane is slow.

4. Failover to warm/cold standby

If you maintain standby infrastructure in another provider or region, promote read-replicas and update routing to redirect traffic. Validate database consistency first for writes-heavy systems — use read-only failover if necessary to preserve data integrity.

5. Application-level routing changes

For Kubernetes clusters, you can patch Services/Ingress or update ExternalName targets to redirect traffic to healthy clusters. A quick ingress-level redirect using the current networking.k8s.io/v1 schema:

kubectl patch ingress my-ingress -n prod --type merge -p '{"spec":{"rules":[{"host":"example.com","http":{"paths":[{"path":"/","pathType":"Prefix","backend":{"service":{"name":"maintenance-service","port":{"number":80}}}}]}}]}}'

6. Use alternative CDNs or edge providers

If a CDN provider is healthy while the origin provider is partially down, route traffic through the CDN and configure origin shielding and origin pull from secondary sources.

7. Fallback to static mode or maintenance pages

Serve a minimal static site or status page from globally distributed object storage (S3/Blob) or an edge-hosted bucket; a simple, authoritative status message with an ETA reduces customer frustration.
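A sketch of publishing such a page. The bucket name is a placeholder, and the `aws s3 cp` step is illustrative and left commented out:

```shell
#!/bin/sh
# Generate a minimal maintenance page locally, then publish to object storage.
cat > maintenance.html <<'EOF'
<!doctype html>
<title>Service disruption</title>
<h1>We are mitigating a cloud provider incident</h1>
<p>Live updates: https://status.example.com</p>
EOF

grep -c 'status.example.com' maintenance.html    # sanity-check before publishing
# Placeholder bucket; short cache TTL so the page can be replaced quickly:
# aws s3 cp maintenance.html s3://example-status-site/index.html --cache-control "max-age=60"
```

Keeping this script and the page template in your runbook repo means the fallback is one pre-approved command away during an outage.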

Technical play snippets: DNS failover and Route53 example

When you need to programmatically failover DNS, pre-authorized CLI scripts are lifesavers. Example JSON for AWS Route53 change-resource-record-sets (replace IDs and IPs):

{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.42"}]
      }
    }
  ]
}

Apply with:

aws route53 change-resource-record-sets --hosted-zone-id Z123456 --change-batch file://change.json

Note: If provider control plane is impaired, these API calls may fail. Always have an out-of-band method to execute pre-approved changes (SSH bastion to management host, jumpbox with AWS CLI credentials, or runbook automation in a separate provider).

Customer communications: templates and cadence

Clear, honest and frequent updates reduce churn and support tickets. Use templates below; customize SLA and ETA fields.

Initial public status update (first message)

We’re aware of a widespread cloud provider incident affecting authentication and API traffic for customers in multiple regions. Our engineering team is actively triaging. Impact: login/API errors and delayed page loads. Next update: in 15 minutes. We will post detailed updates to our status page: https://status.example.com

Internal incident channel opener

[INCIDENT] Major cloud provider outage — creating incident. IC: @oncall. Initial assessment: elevated 5xx across services, control plane errors on provider console. Action plan: 1) Confirm scope, 2) Prevent change churn, 3) Execute mitigations (cache-first, DNS failover if available). Next update in 10 minutes. Timeline will be logged: /docs/incidents/2026-xx-xx.

Customer update template (periodic)

Update: We continue to see degraded performance due to a third-party cloud provider incident. Our teams are executing pre-approved mitigations (increasing cache TTLs, promoting standby services, and routing through alternate CDNs). Estimated next update: in 30 minutes. Impacted features: API write operations, dashboard logins. Escalations are prioritized by SLA tier; contact [support-email].

Resolution & follow-up

Resolved: Service has returned to normal as of [time UTC]. We are monitoring for stability and will publish a detailed postmortem within 72 hours. If you experienced data loss or transaction failures, contact [support-escalations] and reference incident #INC-2026-XX.

Operational playbook: roles & responsibilities

Assign clear roles in your incident process to reduce confusion:

  • Incident Commander (IC): Runs the incident, prioritizes actions, and approves public comms.
  • Technical Lead(s): Infra, App, DB, Network — each responsible for diagnostics and mitigations in their domain.
  • Customer Ops / Communications Lead: Crafts public updates and coordinates with Sales & Legal.
  • Recorder: Maintains the timeline and evidence for postmortem.

Pre-incident readiness (what to build beforehand)

The best mitigation is preparation. These are non-negotiable items to include in your runbook before an incident occurs.

  1. Runbook-as-code: Store incident scripts and DNS change templates in a repo protected with MFA and approval gates.
  2. Secondary management paths: Keep a small set of credentials and pre-authorized automation in a different provider or an air-gapped management plane.
  3. Multi-DNS and multi-CDN: Pre-provision authoritative DNS with a standby provider and pre-sync zone files.
  4. Warm standbys and replication: Replicate critical data to a different provider/region and run DR drills quarterly.
  5. Chaos engineering: Regularly test partial and full-provider failover scenarios; document mean time to failover.
  6. Customer comm templates & SLA runbooks: Keep templates and escalation contact lists ready; practice messaging in tabletop exercises.

Observability and evidence collection

During the incident, collect evidence for both live decisions and postmortem analysis: metrics snapshots, raw logs, provider status-page captures, and the running incident timeline.

Major provider incidents may have regulatory impact. During mitigation remember to:

  • Document any cross-border data movements if you route to different regions/providers.
  • Preserve logs for compliance — do not delete logs to improve apparent availability.
  • Coordinate with legal/security if there’s evidence of data corruption or breach.

Postmortem checklist & learning loop

After the service stabilizes, perform a blameless postmortem. Key sections to include:

  1. Incident timeline and decisions (with timestamps).
  2. Root cause analysis (provider vs. own system).
  3. Mitigations executed and their effectiveness.
  4. Action items with owners and firm deadlines (implement multi-DNS, add runbook scripts, lower TTLs).
  5. Customer impact and communication effectiveness review.

Several industry trends in late 2025 and early 2026 are relevant:

  • Runbooks-as-code and IaC for incident actions: Automate safe, pre-approved actions to avoid manual errors under stress.
  • Edge-first architectures: Increasing reliance on edge computing makes graceful degradation and caching more effective.
  • Regulatory scrutiny: Incidents now often require faster public disclosures and postmortems for regulated sectors.
  • Multi-cloud and hybrid strategies: More teams maintain a minimal footprint in alternate providers to reduce single-provider blast radius.
  • Observability convergence: Unified telemetry across clouds simplifies triage during cross-provider incidents.

Example: condensed 30-minute incident play

  1. 0:00 – Declare incident, open channel, post initial status update.
  2. 0:00–0:05 – Run probes (curl/dig/traceroute) from 3 vantage points; capture metrics snapshot.
  3. 0:05–0:10 – IC decides: control plane down vs data plane. If CP down → issue change freeze; if DP down → plan failover.
  4. 0:10–0:20 – Execute safe mitigations: increase CDN TTL, enable static maintenance page, throttle background jobs.
  5. 0:20–0:30 – If valid, trigger DNS failover or route traffic to warm standby; send 30-min customer update.

Keep your runbook living: update it after every incident and run quarterly drills that involve stakeholders beyond engineering (support, legal, sales). Test failover paths end-to-end and ensure your execs know the expected RTO/RPO for each service tier.

Call to action

Incidents are inevitable; preparation separates recovery from catastrophe. Download our ready-to-edit emergency runbook template, including DNS failover scripts, incident timeline spreadsheet, and customer comm templates — tailored for 2026 multi-cloud realities. Visit pyramides.cloud/runbook-template to get the kit and schedule a free 30-minute runbook review with our SRE team.
