Generate a runbook entry from a specific alert

intermediateClaude SonnetIT & SecuritySrerunbooksreon-callalertingreliability

Use case

Use this prompt every time you create a new alert that pages a human. The doc forces you to answer: when this fires at 3am, what does the responder do? Alerts without runbooks are tickets in disguise.

The prompt

You are an SRE writing a runbook entry for an alert that pages on-call. Be concrete. The reader is half-awake and trying to act, not learn.

Alert:
- Name: {{alert_name}}
- Definition (query / threshold / window): {{alert_definition}}
- What it really detects (in plain language): {{detection_intent}}
- Service: {{service}}
- Severity (page / ticket): {{severity}}
- Known historical causes: {{historical_causes}}
- Tooling: {{tooling}}

Produce a runbook entry with:

1. **TL;DR**: 1–2 sentences — what this alert means and the most likely action
2. **Confirm it's real**: 2–3 quick checks to rule out false positives (e.g., metric stale, deploy in progress, scheduled maintenance)
3. **Triage decision tree**: a flowchart-as-text — "if X, go to A; if Y, go to B" — covering the top 3–4 causes
4. **Common causes** with for each: symptoms, exact diagnostic commands, mitigation steps
5. **Mitigation actions** (commands, dashboard links, kill switches) — copy-pastable
6. **Escalation**: when and who to escalate to, what to share
7. **What this alert is NOT**: what it doesn't detect, so the responder doesn't get tunnel-vision
8. **Post-incident**: what to leave behind — ticket, postmortem trigger, dashboard annotation

Format: short headings, bullet points, code blocks for commands. Avoid prose paragraphs.

Variables

{{{{alert_name}}}}Replace with your {{alert name}}
{{{{alert_definition}}}}Replace with your {{alert definition}}
{{{{detection_intent}}}}Replace with your {{detection intent}}
{{{{service}}}}Replace with your {{service}}
{{{{severity}}}}Replace with your {{severity}}
{{{{historical_causes}}}}Replace with your {{historical causes}}
{{{{tooling}}}}Replace with your {{tooling}}

Example

Input

alert_name: orders-api high p99 latency
alert_definition: |
  Datadog metric: max:trace.express.request.duration{service:orders-api,env:prod}.as_count()
  p99 over 5min > 1500ms for 3 consecutive evaluation windows
detection_intent: Customer-facing latency exceeding our SLO budget; users feel order placement as slow or hung
service: orders-api (Node.js, EKS, depends on Postgres + Redis + Stripe)
severity: page (SEV2 default; SEV1 if also error rate > 2%)
historical_causes: 1) Postgres connection pool exhaustion under traffic spikes; 2) Redis evictions causing cache misses; 3) Stripe API slowness propagating; 4) noisy-neighbor pod (memory pressure)
tooling: Datadog APM + dashboards, kubectl, AWS RDS Performance Insights, PagerDuty

Output

# Runbook: orders-api high p99 latency

Owner: Orders team. Linked alert: `dd:orders_api_p99_latency`. Last validated: 2026-05-01.

## TL;DR
p99 latency on `orders-api` over 1.5s for 15 minutes. Most likely Postgres pool exhaustion or Redis cache misses. Start with the connection-pool check.

## Confirm it's real (60 seconds)

- Open `<dashboard:orders-api-overview>`. Is p99 actually elevated, or is the alert stale?
- Check `#deploys-orders` for a deploy in the last 30 minutes — if yes, latency may be cold-cache transient. Wait 5 min, recheck.
- Check `<dashboard:upstream-traffic>` — is request volume up >2x? Could be a load event, not a regression.
- Check Datadog `Events` for "maintenance" tags on dependencies.

If still elevated and not explained → continue.

## Triage decision tree

```
Is error rate > 2%?           → upgrade to SEV1, comms lead, see incident-response-playbook
Is Postgres CPU > 80%?         → A) Postgres saturation
Is Redis hit rate < 60%?       → B) Cache regression
Is Stripe latency elevated?    → C) Upstream Stripe
None of the above?             → D) Pod / node issue
```

## Common causes

### A) Postgres connection pool exhaustion
- **Symptoms**: orders-api logs `connection pool timeout`; Postgres `pg_stat_activity` near `max_connections`; DB CPU normal but waits high
- **Diagnose**:
  ```bash
  kubectl exec -n orders deploy/orders-api -- node -e "console.log(require('pg').pool.totalCount)"
  aws rds describe-db-clusters --db-cluster-identifier orders-prod --query 'DBClusters[0].DatabaseName'
  ```
  Open RDS Performance Insights → top SQL → look for unindexed query
- **Mitigate**:
  - Short-term: scale orders-api replicas down by 30% (paradoxically reduces total connections during a noisy-retry storm). `kubectl scale deploy/orders-api --replicas=N` with N = current * 0.7
  - If query is identifiable: add to query-killer Lambda allow-list or kill via `pg_terminate_backend(pid)`
  - If pool size is the cap: increase `PG_POOL_MAX` env var (requires redeploy)

### B) Redis cache regression
- **Symptoms**: hit rate drop on `<dashboard:redis-orders-cache>`; latency rises despite normal DB metrics
- **Diagnose**:
  ```bash
  redis-cli -h orders-redis.prod INFO stats | grep keyspace
  redis-cli -h orders-redis.prod INFO memory | grep evicted_keys
  ```
- **Mitigate**:
  - If `evicted_keys` is climbing: increase Redis memory tier (terraform `redis_node_size`)
  - If hit rate drop is post-deploy: roll back; cache key shape may have changed

### C) Upstream Stripe slowness
- **Symptoms**: Datadog APM trace waterfall shows Stripe span dominating; Stripe status page shows degraded
- **Diagnose**: `<dashboard:stripe-integration>` → Stripe API duration percentiles
- **Mitigate**:
  - Enable async order-confirmation flow: feature flag `orders.stripe_async=true` in LaunchDarkly. Customers get a "processing" UI instead of blocked checkout.
  - If Stripe is fully down: failover to Adyen routing — see `<runbook:payments-region-failover>`

### D) Pod / node issue (noisy neighbor or single-pod failure)
- **Symptoms**: latency p99 elevated only on a subset of pods; specific node CPU/mem pressure
- **Diagnose**:
  ```bash
  kubectl top pods -n orders --sort-by=memory
  kubectl describe node $(kubectl get pods -n orders -o wide | awk '/orders-api/ {print $7}' | sort -u)
  ```
- **Mitigate**: cordon/drain the affected node; let pods reschedule. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`

## Mitigation cheat sheet

| Action | Command |
|---|---|
| Scale replicas | `kubectl scale deploy/orders-api --replicas=N -n orders` |
| Force redeploy | `kubectl rollout restart deploy/orders-api -n orders` |
| Roll back | `kubectl rollout undo deploy/orders-api -n orders` |
| Async Stripe flag | LaunchDarkly: `orders.stripe_async=true` |
| Drain noisy node | `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data` |

## Escalation

- Page Postgres team if RDS Performance Insights shows query-storm not solved by app-side action
- Page Payments team if Stripe-routed; their on-call may already be engaged
- Escalate IC to manager if not mitigating after 30 minutes

Share when escalating: alert link, dashboard link, time of detection, mitigation already attempted, current customer-visible state.

## What this alert is NOT

- Not an availability alert — orders may still be succeeding, just slowly. Check error rate separately.
- Not a capacity-planning trigger — sustained breach over multiple days needs SLO/capacity review, not pages
- Doesn't catch correctness bugs (orders saved with wrong data) — those are tracked by reconciliation reports

## Post-incident

- If you mitigated without root cause, open a ticket in `orders-eng` to investigate
- Annotate the Datadog dashboard with what happened — future you will appreciate it
- Add an entry to the team's `runbook-validated` log: which step in this runbook actually worked
- If runbook was wrong or missing a case → file a PR against this doc before the end of your shift

Tips for best results

1Every alert that pages a human gets a runbook. No exceptions. If you can't write the runbook, the alert is too vague to page on.
2Use commands, not descriptions. 'Check the connection pool' is unhelpful at 3am; `kubectl exec ...` is.
3Update the runbook after every incident. The first version is always wrong somewhere; the third version is usable.
4Validate runbooks during game days, not during real incidents. If the command in the runbook needs fixing, fix it under low pressure.
5AI assistance is not a replacement for security review by qualified professionals. Have a senior SRE review escalation paths and any commands that touch production data before adoption.

Related prompts

Generate an incident response playbook for a service

advanced

Produce a service-specific incident response playbook covering severity, roles, comms, common failure modes, and recovery steps.

IT & Securityincident-responsesreplaybook

Write a blameless postmortem from incident notes

intermediate

Convert raw incident notes into a structured blameless postmortem with timeline, contributing factors, and tracked action items.

IT & Securitypostmortemsreincident-response

Write an SLO definition document for a service

advanced

Produce a complete SLO definition with SLIs, error budgets, alerting policy, and consequences when the budget is exhausted.

IT & Securityslosrereliability

Need help implementing this prompt in your workflow?

Book a call

You are an SRE writing a runbook entry for an alert that pages on-call. Be concrete. The reader is half-awake and trying to act, not learn. Alert: - Name: {{alert_name}} - Definition (query / threshold / window): {{alert_definition}} - What it really detects (in plain language): {{detection_intent}} - Service: {{service}} - Severity (page / ticket): {{severity}} - Known historical causes: {{historical_causes}} - Tooling: {{tooling}} Produce a runbook entry with: 1. **TL;DR**: 1–2 sentences — what this alert means and the most likely action 2. **Confirm it's real**: 2–3 quick checks to rule out false positives (e.g., metric stale, deploy in progress, scheduled maintenance) 3. **Triage decision tree**: a flowchart-as-text — "if X, go to A; if Y, go to B" — covering the top 3–4 causes 4. **Common causes** with for each: symptoms, exact diagnostic commands, mitigation steps 5. **Mitigation actions** (commands, dashboard links, kill switches) — copy-pastable 6. **Escalation**: when and who to escalate to, what to share 7. **What this alert is NOT**: what it doesn't detect, so the responder doesn't get tunnel-vision 8. **Post-incident**: what to leave behind — ticket, postmortem trigger, dashboard annotation Format: short headings, bullet points, code blocks for commands. Avoid prose paragraphs.

Variables

{{{{alert_name}}}}Replace with your {{alert name}}

{{{{alert_definition}}}}Replace with your {{alert definition}}

{{{{detection_intent}}}}Replace with your {{detection intent}}

{{{{service}}}}Replace with your {{service}}

{{{{severity}}}}Replace with your {{severity}}

{{{{historical_causes}}}}Replace with your {{historical causes}}

{{{{tooling}}}}Replace with your {{tooling}}

Example

Input

alert_name: orders-api high p99 latency
alert_definition: |
  Datadog metric: max:trace.express.request.duration{service:orders-api,env:prod}.as_count()
  p99 over 5min > 1500ms for 3 consecutive evaluation windows
detection_intent: Customer-facing latency exceeding our SLO budget; users feel order placement as slow or hung
service: orders-api (Node.js, EKS, depends on Postgres + Redis + Stripe)
severity: page (SEV2 default; SEV1 if also error rate > 2%)
historical_causes: 1) Postgres connection pool exhaustion under traffic spikes; 2) Redis evictions causing cache misses; 3) Stripe API slowness propagating; 4) noisy-neighbor pod (memory pressure)
tooling: Datadog APM + dashboards, kubectl, AWS RDS Performance Insights, PagerDuty

Output

# Runbook: orders-api high p99 latency

Owner: Orders team. Linked alert: `dd:orders_api_p99_latency`. Last validated: 2026-05-01.

## TL;DR
p99 latency on `orders-api` over 1.5s for 15 minutes. Most likely Postgres pool exhaustion or Redis cache misses. Start with the connection-pool check.

## Confirm it's real (60 seconds)

- Open `<dashboard:orders-api-overview>`. Is p99 actually elevated, or is the alert stale?
- Check `#deploys-orders` for a deploy in the last 30 minutes — if yes, latency may be cold-cache transient. Wait 5 min, recheck.
- Check `<dashboard:upstream-traffic>` — is request volume up >2x? Could be a load event, not a regression.
- Check Datadog `Events` for "maintenance" tags on dependencies.

If still elevated and not explained → continue.

## Triage decision tree

```
Is error rate > 2%?           → upgrade to SEV1, comms lead, see incident-response-playbook
Is Postgres CPU > 80%?         → A) Postgres saturation
Is Redis hit rate < 60%?       → B) Cache regression
Is Stripe latency elevated?    → C) Upstream Stripe
None of the above?             → D) Pod / node issue
```

## Common causes

### A) Postgres connection pool exhaustion
- **Symptoms**: orders-api logs `connection pool timeout`; Postgres `pg_stat_activity` near `max_connections`; DB CPU normal but waits high
- **Diagnose**:
  ```bash
  kubectl exec -n orders deploy/orders-api -- node -e "console.log(require('pg').pool.totalCount)"
  aws rds describe-db-clusters --db-cluster-identifier orders-prod --query 'DBClusters[0].DatabaseName'
  ```
  Open RDS Performance Insights → top SQL → look for unindexed query
- **Mitigate**:
  - Short-term: scale orders-api replicas down by 30% (paradoxically reduces total connections during a noisy-retry storm). `kubectl scale deploy/orders-api --replicas=N` with N = current * 0.7
  - If query is identifiable: add to query-killer Lambda allow-list or kill via `pg_terminate_backend(pid)`
  - If pool size is the cap: increase `PG_POOL_MAX` env var (requires redeploy)

### B) Redis cache regression
- **Symptoms**: hit rate drop on `<dashboard:redis-orders-cache>`; latency rises despite normal DB metrics
- **Diagnose**:
  ```bash
  redis-cli -h orders-redis.prod INFO stats | grep keyspace
  redis-cli -h orders-redis.prod INFO memory | grep evicted_keys
  ```
- **Mitigate**:
  - If `evicted_keys` is climbing: increase Redis memory tier (terraform `redis_node_size`)
  - If hit rate drop is post-deploy: roll back; cache key shape may have changed

### C) Upstream Stripe slowness
- **Symptoms**: Datadog APM trace waterfall shows Stripe span dominating; Stripe status page shows degraded
- **Diagnose**: `<dashboard:stripe-integration>` → Stripe API duration percentiles
- **Mitigate**:
  - Enable async order-confirmation flow: feature flag `orders.stripe_async=true` in LaunchDarkly. Customers get a "processing" UI instead of blocked checkout.
  - If Stripe is fully down: failover to Adyen routing — see `<runbook:payments-region-failover>`

### D) Pod / node issue (noisy neighbor or single-pod failure)
- **Symptoms**: latency p99 elevated only on a subset of pods; specific node CPU/mem pressure
- **Diagnose**:
  ```bash
  kubectl top pods -n orders --sort-by=memory
  kubectl describe node $(kubectl get pods -n orders -o wide | awk '/orders-api/ {print $7}' | sort -u)
  ```
- **Mitigate**: cordon/drain the affected node; let pods reschedule. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`

## Mitigation cheat sheet

| Action | Command |
|---|---|
| Scale replicas | `kubectl scale deploy/orders-api --replicas=N -n orders` |
| Force redeploy | `kubectl rollout restart deploy/orders-api -n orders` |
| Roll back | `kubectl rollout undo deploy/orders-api -n orders` |
| Async Stripe flag | LaunchDarkly: `orders.stripe_async=true` |
| Drain noisy node | `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data` |

## Escalation

- Page Postgres team if RDS Performance Insights shows query-storm not solved by app-side action
- Page Payments team if Stripe-routed; their on-call may already be engaged
- Escalate IC to manager if not mitigating after 30 minutes

Share when escalating: alert link, dashboard link, time of detection, mitigation already attempted, current customer-visible state.

## What this alert is NOT

- Not an availability alert — orders may still be succeeding, just slowly. Check error rate separately.
- Not a capacity-planning trigger — sustained breach over multiple days needs SLO/capacity review, not pages
- Doesn't catch correctness bugs (orders saved with wrong data) — those are tracked by reconciliation reports

## Post-incident

- If you mitigated without root cause, open a ticket in `orders-eng` to investigate
- Annotate the Datadog dashboard with what happened — future you will appreciate it
- Add an entry to the team's `runbook-validated` log: which step in this runbook actually worked
- If runbook was wrong or missing a case → file a PR against this doc before the end of your shift

Tips for best results

1Every alert that pages a human gets a runbook. No exceptions. If you can't write the runbook, the alert is too vague to page on.

2Use commands, not descriptions. 'Check the connection pool' is unhelpful at 3am; `kubectl exec ...` is.

3Update the runbook after every incident. The first version is always wrong somewhere; the third version is usable.

4Validate runbooks during game days, not during real incidents. If the command in the runbook needs fixing, fix it under low pressure.

5AI assistance is not a replacement for security review by qualified professionals. Have a senior SRE review escalation paths and any commands that touch production data before adoption.