Write a blameless postmortem from incident notes

intermediateClaude SonnetIT & SecuritySrepostmortemsreincident-responseblamelessreliability

Use case

Use this prompt the day after a SEV1 or SEV2 incident, when you have a Slack channel full of timestamped messages and need to turn it into a document the team and leadership can review. The output is honest, blameless, and actionable — not a CYA narrative.

The prompt

You are an experienced SRE writing a blameless postmortem. Convert the incident notes below into a postmortem document. Be honest about what went wrong, blameless about who did what, and specific about action items.

Incident:
- Title: {{title}}
- Severity: {{severity}}
- Start / detection / resolution timestamps: {{timestamps}}
- Service(s) impacted: {{services}}
- Customer impact (qualitative + quantitative): {{customer_impact}}

Raw notes / timeline source:
```
{{notes}}
```

Produce a postmortem with these sections:

1. **Summary** (3 sentences): what happened, customer impact, why it ended
2. **Impact**: numbers — duration, affected users, revenue, SLO budget consumed
3. **Timeline**: chronological, UTC, with detection/declaration/mitigation/resolution markers; weave in what was happening for the user
4. **Root cause**: technical sequence of events that produced the failure
5. **Contributing factors**: organizational, tooling, design factors — separate from "the proximate cause." Use 5 Whys-style framing without rehashing 5 Whys explicitly.
6. **What went well**: actually call out what worked — alerting, runbook, comms, fast mitigation
7. **What went poorly**: blameless. Critique systems and decisions, never people.
8. **Lessons**: 3–5 takeaways — generalizable, not just "we'll add a check"
9. **Action items**: each with owner, priority (P0/P1/P2), due date, tracking link placeholder. Include preventive AND detective controls.
10. **Open questions**: things still unknown

Tone: blameless but not sanitized. Auditors and engineers should both find it useful. No "individual error" framing — find the system that allowed the error.

Variables

{{{{title}}}}Replace with your {{title}}
{{{{severity}}}}Replace with your {{severity}}
{{{{timestamps}}}}Replace with your {{timestamps}}
{{{{services}}}}Replace with your {{services}}
{{{{customer_impact}}}}Replace with your {{customer impact}}
{{{{notes}}}}Replace with your {{notes}}

Example

Input

title: search-api outage from OpenSearch index corruption
severity: SEV1
timestamps: Started 2026-04-23 14:02 UTC, detected 14:11 UTC, resolved 15:47 UTC
services: search-api, web checkout (degraded)
customer_impact: ~1h45m. Estimated 31% of search queries returned empty results during the window. Approx 4,200 abandoned checkouts. No data loss.
notes: |
  14:01 — Maria deployed change to update OpenSearch mapping (added new analyzer for product titles). PR #4421.
  14:02 — Reindex job kicked off. Job log shows "Mapping conflict on field title.raw — recreating index"
  14:09 — First customer complaint in #cs-tickets ("search broken")
  14:11 — Datadog monitor "search zero-result rate > 15%" fired
  14:11 — Page to search-oncall (Devon)
  14:14 — Devon ack'd. Started investigation
  14:18 — Devon: "I see zero-result rate spiking, looking at OpenSearch"
  14:22 — Devon: "Index search-products-v17 is in red state"
  14:25 — Maria joined channel: "I deployed the mapping change at 14:01, could be related"
  14:28 — IC declared (Devon). Comms lead: Priya
  14:30 — Statuspage updated: "Search degraded"
  14:35 — Decision: roll back to previous index search-products-v16 (still on disk)
  14:42 — Rollback CLI errored: alias didn't switch because v16 was missing one document (had been deleted between v16 build and v17 build)
  14:55 — Devon force-promoted v16 with --ignore-missing flag (manual)
  15:02 — Search starting to recover for ~70% of queries
  15:18 — Remaining 30% traced to cache poisoning — Redis still serving zero-result responses cached during outage
  15:25 — Cache flushed
  15:40 — All metrics back to baseline
  15:47 — IC declared resolved. Statuspage cleared.
  Notes: rollback procedure was not in runbook — Devon improvised. Cache TTL on zero-result responses was 60min, never reviewed.

Output

# Postmortem: search-api outage from OpenSearch index corruption (PG-2026-04-23)

**Severity**: SEV1
**Date**: 2026-04-23
**Authors**: Devon (IC), Priya (Comms), Maria (deploy author). Reviewer: SRE lead.
**Status**: Draft for review

## 1. Summary

At 14:02 UTC on April 23, a deploy that updated an OpenSearch field analyzer triggered a mapping conflict that left the new search index in a red state, causing ~31% of search queries to return zero results for 1 hour 45 minutes. Recovery required a manual rollback to a prior index, complicated by a missing-document edge case that wasn't anticipated by the rollback tooling. A 60-minute cache TTL on zero-result responses extended user-visible impact beyond the index recovery.

## 2. Impact

- Duration of customer impact: 14:02–15:47 UTC (1h 45m)
- Search queries with zero results during window: ~31% (vs ~5% baseline)
- Abandoned checkouts directly attributable: ~4,200
- SLO budget consumed: ~85% of monthly availability budget for search-api
- No data loss; no security impact

## 3. Timeline (UTC)

- **14:01** — Maria deploys PR #4421 (analyzer change). Reindex job starts.
- **14:02** — *Incident start*. Reindex hits mapping conflict on `title.raw`; index goes red.
- **14:09** — First customer report in `#cs-tickets`.
- **14:11** — *Detection*. Datadog monitor "zero-result rate > 15%" fires; Devon paged.
- **14:14** — Devon acks page.
- **14:22** — Devon identifies red index `search-products-v17`.
- **14:25** — Maria joins channel, links the deploy.
- **14:28** — *Incident declared*. IC: Devon. Comms: Priya.
- **14:30** — Statuspage: "Search degraded".
- **14:35** — Decision to roll back to `search-products-v16`.
- **14:42** — Rollback fails — v16 missing a document deleted between v16 and v17 builds; alias-switch tooling refuses inconsistent state.
- **14:55** — *Mitigation*. Devon manually force-promotes v16 with `--ignore-missing`.
- **15:02** — Recovery starts; ~70% of queries succeed.
- **15:18** — Remaining 30% traced to Redis serving cached zero-result responses (60-min TTL).
- **15:25** — *Mitigation 2*. Redis cache flushed for search namespace.
- **15:40** — Metrics back to baseline.
- **15:47** — *Resolution declared*. Statuspage cleared.

## 4. Root cause

The deploy added an analyzer to the `title` field. The reindex job recreates the index when there's a mapping conflict. During the recreate, the new index entered red state because the analyzer change was incompatible with an existing custom analyzer dependency in the template — a configuration in `index_template_v3.json` that the deploy author didn't know existed. The cluster did not roll back the index automatically; it left v17 in red and the alias still pointing at v17.

## 5. Contributing factors

- **Index template ownership unclear**: `index_template_v3.json` was modified by a different team six months ago. There was no warning that an analyzer change in this PR would conflict with that template.
- **Reindex job has no pre-flight validation**: the job runs the change against prod; there is no staging or shadow-index dry run that would have caught the mapping conflict.
- **Rollback tooling assumes consistency**: the alias-switch tool refuses to switch aliases when the target index is missing documents present in the live one. This is correct as a default, but the tool has no documented escape hatch — Devon improvised the `--ignore-missing` flag from reading the source code.
- **Cache TTL on zero-result responses was never reviewed**: 60 minutes is twice the typical TTL for this service; it amplified the user-visible duration well past the actual fix.
- **Runbook gap**: the runbook covers "OpenSearch slow" but not "OpenSearch red". Devon was navigating without explicit guidance.

## 6. What went well

- Datadog monitor fired within 9 minutes of impact start — well under our 15-minute detection target.
- Comms cadence was on time and clear; CS team had talking points by 14:35.
- Maria proactively joined the incident channel and surfaced the deploy correlation, which cut investigation time.
- The decision to roll back was made within 25 minutes of detection, before debate consumed time.

## 7. What went poorly

- Pre-deploy validation didn't exist for index mapping changes — these are exactly the changes that benefit from staging.
- The rollback procedure required undocumented improvisation. Devon's read of the tool's source code added ~10 minutes to recovery.
- Cache TTL extended the incident by ~15 minutes after the underlying fix.
- Detection threshold (15% zero-result rate) is too lenient; baseline is 5–6% so a sustained >10% would have detected sooner.

## 8. Lessons

1. **Mapping changes are deploys, not configuration tweaks** — they need the same staging discipline as code.
2. **Rollback tools should fail loudly, but with a documented override path** — refusing inconsistent state is correct; making the override require source-reading is not.
3. **Cache TTL is part of incident duration** — every cache layer needs a documented flush procedure and is implicitly part of any related runbook.
4. **Index ownership is a real thing** — shared OpenSearch templates need explicit owners and a contract for changes.
5. **Detection thresholds drift from reality** — a 15% threshold made sense when baseline was 12%; we're at 5% now and need to recalibrate.

## 9. Action items

| # | Item | Owner | Priority | Due | Tracking |
|---|---|---|---|---|---|
| 1 | Add pre-flight mapping validation step in reindex job (run against shadow index, fail deploy on red) | Maria | P0 | 2026-05-07 | `<linear:SE-2104>` |
| 2 | Document `--ignore-missing` rollback procedure in search runbook | Devon | P0 | 2026-04-30 | `<linear:SE-2105>` |
| 3 | Reduce zero-result cache TTL from 60 min to 5 min; flush procedure documented | Maria | P1 | 2026-05-14 | `<linear:SE-2106>` |
| 4 | Recalibrate Datadog zero-result alert threshold to 10% (from 15%) and add a 25% page-immediately tier | Devon | P1 | 2026-05-07 | `<linear:SE-2107>` |
| 5 | Document ownership and change protocol for shared `index_template_*.json` files | Search team lead | P1 | 2026-05-21 | `<linear:SE-2108>` |
| 6 | Add tabletop exercise for "OpenSearch red index" scenario to next on-call drill | SRE | P2 | 2026-06-15 | `<linear:SRE-880>` |

## 10. Open questions

- Are there other shared OpenSearch templates with the same ownership ambiguity? Audit needed.
- Should we pursue a blue-green index strategy (deploy to v17 alongside v16, switch alias only after green) versus today's in-place re-create?
- Would a synthetic search-quality monitor (run a known query, expect non-zero results) have caught this faster than the rate-based alert? Worth piloting.

Tips for best results

1Blameless does not mean sanitized. Critique systems, decisions, and tools sharply. Never name an individual as the cause — they responded to the system you gave them.
2Action items without owners and dates are decoration. Every item gets a name and a date or it doesn't go in the doc.
3Lessons should generalize. 'We will add a validation step' is an action item. 'Configuration changes are deploys' is a lesson — it applies beyond this incident.
4Write the timeline in past tense, anchored in UTC. Time-zone confusion in postmortems is its own contributing factor at distributed companies.
5AI assistance is not a replacement for security review by qualified professionals. Postmortems involving security or data exposure should be reviewed by your security lead before publication.

Related prompts

Generate an incident response playbook for a service

advanced

Produce a service-specific incident response playbook covering severity, roles, comms, common failure modes, and recovery steps.

IT & Securityincident-responsesreplaybook

Generate a runbook entry from a specific alert

intermediate

Turn an alert definition into a complete runbook entry with diagnostics, mitigations, and escalation guidance for the on-call engineer.

IT & Securityrunbooksreon-call

Write an SLO definition document for a service

advanced

Produce a complete SLO definition with SLIs, error budgets, alerting policy, and consequences when the budget is exhausted.

IT & Securityslosrereliability

Need help implementing this prompt in your workflow?

Book a call

You are an experienced SRE writing a blameless postmortem. Convert the incident notes below into a postmortem document. Be honest about what went wrong, blameless about who did what, and specific about action items. Incident: - Title: {{title}} - Severity: {{severity}} - Start / detection / resolution timestamps: {{timestamps}} - Service(s) impacted: {{services}} - Customer impact (qualitative + quantitative): {{customer_impact}} Raw notes / timeline source: ``` {{notes}} ``` Produce a postmortem with these sections: 1. **Summary** (3 sentences): what happened, customer impact, why it ended 2. **Impact**: numbers — duration, affected users, revenue, SLO budget consumed 3. **Timeline**: chronological, UTC, with detection/declaration/mitigation/resolution markers; weave in what was happening for the user 4. **Root cause**: technical sequence of events that produced the failure 5. **Contributing factors**: organizational, tooling, design factors — separate from "the proximate cause." Use 5 Whys-style framing without rehashing 5 Whys explicitly. 6. **What went well**: actually call out what worked — alerting, runbook, comms, fast mitigation 7. **What went poorly**: blameless. Critique systems and decisions, never people. 8. **Lessons**: 3–5 takeaways — generalizable, not just "we'll add a check" 9. **Action items**: each with owner, priority (P0/P1/P2), due date, tracking link placeholder. Include preventive AND detective controls. 10. **Open questions**: things still unknown Tone: blameless but not sanitized. Auditors and engineers should both find it useful. No "individual error" framing — find the system that allowed the error.

Example

Input

title: search-api outage from OpenSearch index corruption
severity: SEV1
timestamps: Started 2026-04-23 14:02 UTC, detected 14:11 UTC, resolved 15:47 UTC
services: search-api, web checkout (degraded)
customer_impact: ~1h45m. Estimated 31% of search queries returned empty results during the window. Approx 4,200 abandoned checkouts. No data loss.
notes: |
  14:01 — Maria deployed change to update OpenSearch mapping (added new analyzer for product titles). PR #4421.
  14:02 — Reindex job kicked off. Job log shows "Mapping conflict on field title.raw — recreating index"
  14:09 — First customer complaint in #cs-tickets ("search broken")
  14:11 — Datadog monitor "search zero-result rate > 15%" fired
  14:11 — Page to search-oncall (Devon)
  14:14 — Devon ack'd. Started investigation
  14:18 — Devon: "I see zero-result rate spiking, looking at OpenSearch"
  14:22 — Devon: "Index search-products-v17 is in red state"
  14:25 — Maria joined channel: "I deployed the mapping change at 14:01, could be related"
  14:28 — IC declared (Devon). Comms lead: Priya
  14:30 — Statuspage updated: "Search degraded"
  14:35 — Decision: roll back to previous index search-products-v16 (still on disk)
  14:42 — Rollback CLI errored: alias didn't switch because v16 was missing one document (had been deleted between v16 build and v17 build)
  14:55 — Devon force-promoted v16 with --ignore-missing flag (manual)
  15:02 — Search starting to recover for ~70% of queries
  15:18 — Remaining 30% traced to cache poisoning — Redis still serving zero-result responses cached during outage
  15:25 — Cache flushed
  15:40 — All metrics back to baseline
  15:47 — IC declared resolved. Statuspage cleared.
  Notes: rollback procedure was not in runbook — Devon improvised. Cache TTL on zero-result responses was 60min, never reviewed.

Output

# Postmortem: search-api outage from OpenSearch index corruption (PG-2026-04-23)

**Severity**: SEV1
**Date**: 2026-04-23
**Authors**: Devon (IC), Priya (Comms), Maria (deploy author). Reviewer: SRE lead.
**Status**: Draft for review

## 1. Summary

At 14:02 UTC on April 23, a deploy that updated an OpenSearch field analyzer triggered a mapping conflict that left the new search index in a red state, causing ~31% of search queries to return zero results for 1 hour 45 minutes. Recovery required a manual rollback to a prior index, complicated by a missing-document edge case that wasn't anticipated by the rollback tooling. A 60-minute cache TTL on zero-result responses extended user-visible impact beyond the index recovery.

## 2. Impact

- Duration of customer impact: 14:02–15:47 UTC (1h 45m)
- Search queries with zero results during window: ~31% (vs ~5% baseline)
- Abandoned checkouts directly attributable: ~4,200
- SLO budget consumed: ~85% of monthly availability budget for search-api
- No data loss; no security impact

## 3. Timeline (UTC)

- **14:01** — Maria deploys PR #4421 (analyzer change). Reindex job starts.
- **14:02** — *Incident start*. Reindex hits mapping conflict on `title.raw`; index goes red.
- **14:09** — First customer report in `#cs-tickets`.
- **14:11** — *Detection*. Datadog monitor "zero-result rate > 15%" fires; Devon paged.
- **14:14** — Devon acks page.
- **14:22** — Devon identifies red index `search-products-v17`.
- **14:25** — Maria joins channel, links the deploy.
- **14:28** — *Incident declared*. IC: Devon. Comms: Priya.
- **14:30** — Statuspage: "Search degraded".
- **14:35** — Decision to roll back to `search-products-v16`.
- **14:42** — Rollback fails — v16 missing a document deleted between v16 and v17 builds; alias-switch tooling refuses inconsistent state.
- **14:55** — *Mitigation*. Devon manually force-promotes v16 with `--ignore-missing`.
- **15:02** — Recovery starts; ~70% of queries succeed.
- **15:18** — Remaining 30% traced to Redis serving cached zero-result responses (60-min TTL).
- **15:25** — *Mitigation 2*. Redis cache flushed for search namespace.
- **15:40** — Metrics back to baseline.
- **15:47** — *Resolution declared*. Statuspage cleared.

## 4. Root cause

The deploy added an analyzer to the `title` field. The reindex job recreates the index when there's a mapping conflict. During the recreate, the new index entered red state because the analyzer change was incompatible with an existing custom analyzer dependency in the template — a configuration in `index_template_v3.json` that the deploy author didn't know existed. The cluster did not roll back the index automatically; it left v17 in red and the alias still pointing at v17.

## 5. Contributing factors

- **Index template ownership unclear**: `index_template_v3.json` was modified by a different team six months ago. There was no warning that an analyzer change in this PR would conflict with that template.
- **Reindex job has no pre-flight validation**: the job runs the change against prod; there is no staging or shadow-index dry run that would have caught the mapping conflict.
- **Rollback tooling assumes consistency**: the alias-switch tool refuses to switch aliases when the target index is missing documents present in the live one. This is correct as a default, but the tool has no documented escape hatch — Devon improvised the `--ignore-missing` flag from reading the source code.
- **Cache TTL on zero-result responses was never reviewed**: 60 minutes is twice the typical TTL for this service; it amplified the user-visible duration well past the actual fix.
- **Runbook gap**: the runbook covers "OpenSearch slow" but not "OpenSearch red". Devon was navigating without explicit guidance.

## 6. What went well

- Datadog monitor fired within 9 minutes of impact start — well under our 15-minute detection target.
- Comms cadence was on time and clear; CS team had talking points by 14:35.
- Maria proactively joined the incident channel and surfaced the deploy correlation, which cut investigation time.
- The decision to roll back was made within 25 minutes of detection, before debate consumed time.

## 7. What went poorly

- Pre-deploy validation didn't exist for index mapping changes — these are exactly the changes that benefit from staging.
- The rollback procedure required undocumented improvisation. Devon's read of the tool's source code added ~10 minutes to recovery.
- Cache TTL extended the incident by ~15 minutes after the underlying fix.
- Detection threshold (15% zero-result rate) is too lenient; baseline is 5–6% so a sustained >10% would have detected sooner.

## 8. Lessons

1. **Mapping changes are deploys, not configuration tweaks** — they need the same staging discipline as code.
2. **Rollback tools should fail loudly, but with a documented override path** — refusing inconsistent state is correct; making the override require source-reading is not.
3. **Cache TTL is part of incident duration** — every cache layer needs a documented flush procedure and is implicitly part of any related runbook.
4. **Index ownership is a real thing** — shared OpenSearch templates need explicit owners and a contract for changes.
5. **Detection thresholds drift from reality** — a 15% threshold made sense when baseline was 12%; we're at 5% now and need to recalibrate.

## 9. Action items

| # | Item | Owner | Priority | Due | Tracking |
|---|---|---|---|---|---|
| 1 | Add pre-flight mapping validation step in reindex job (run against shadow index, fail deploy on red) | Maria | P0 | 2026-05-07 | `<linear:SE-2104>` |
| 2 | Document `--ignore-missing` rollback procedure in search runbook | Devon | P0 | 2026-04-30 | `<linear:SE-2105>` |
| 3 | Reduce zero-result cache TTL from 60 min to 5 min; flush procedure documented | Maria | P1 | 2026-05-14 | `<linear:SE-2106>` |
| 4 | Recalibrate Datadog zero-result alert threshold to 10% (from 15%) and add a 25% page-immediately tier | Devon | P1 | 2026-05-07 | `<linear:SE-2107>` |
| 5 | Document ownership and change protocol for shared `index_template_*.json` files | Search team lead | P1 | 2026-05-21 | `<linear:SE-2108>` |
| 6 | Add tabletop exercise for "OpenSearch red index" scenario to next on-call drill | SRE | P2 | 2026-06-15 | `<linear:SRE-880>` |

## 10. Open questions

- Are there other shared OpenSearch templates with the same ownership ambiguity? Audit needed.
- Should we pursue a blue-green index strategy (deploy to v17 alongside v16, switch alias only after green) versus today's in-place re-create?
- Would a synthetic search-quality monitor (run a known query, expect non-zero results) have caught this faster than the rate-based alert? Worth piloting.

Tips for best results

1Blameless does not mean sanitized. Critique systems, decisions, and tools sharply. Never name an individual as the cause — they responded to the system you gave them.

2Action items without owners and dates are decoration. Every item gets a name and a date or it doesn't go in the doc.

3Lessons should generalize. 'We will add a validation step' is an action item. 'Configuration changes are deploys' is a lesson — it applies beyond this incident.

4Write the timeline in past tense, anchored in UTC. Time-zone confusion in postmortems is its own contributing factor at distributed companies.

5AI assistance is not a replacement for security review by qualified professionals. Postmortems involving security or data exposure should be reviewed by your security lead before publication.