Feb 13, 2022 · 4 min read

How to Run a Post-Mortem That Actually Improves Things

Most post-mortems are theater. Here's how I run them so they actually prevent the next incident.

Incident Management

I've been in the room for post-mortems that changed how teams operate. I've also sat through post-mortems that were complete wastes of time — checkbox exercises that made everyone feel better but changed nothing.

The difference? How you run them.

The Point of a Post-Mortem

It's not to assign blame. It's not to document what happened for compliance. It's to answer one question: How do we make sure this specific thing never happens again?

That's it. Everything else is noise.

A Real Example

Let me walk through an actual incident I dealt with.

What happened: E-commerce platform went down during a flash sale. 2 hours of downtime. Rough estimate: $150K in lost revenue.

Timeline:

  • 14:30 — Traffic spike starts as sale goes live
  • 14:35 — Response times degrade
  • 14:42 — Database connections maxed out
  • 14:45 — Site goes unresponsive
  • 14:50 — On-call gets paged
  • 15:10 — Root cause identified (connection pool exhaustion)
  • 15:30 — Temporary fix deployed (increased pool size)
  • 16:45 — Full recovery confirmed
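When I walk through a timeline like this, I find it useful to turn the timestamps into intervals, because the gaps show where the time actually went. Here is a minimal Python sketch using only the timestamps listed above. The dictionary keys and the helper name `minutes_between` are labels I made up for illustration, and it assumes every event falls on the same day.

```python
from datetime import datetime

# Timestamps copied from the timeline; keys are illustrative labels.
TIMELINE = {
    "spike_starts": "14:30",
    "site_unresponsive": "14:45",
    "on_call_paged": "14:50",
    "root_cause_identified": "15:10",
    "temporary_fix_deployed": "15:30",
    "full_recovery": "16:45",
}


def minutes_between(start: str, end: str) -> int:
    """Minutes from one timeline event to another (same day assumed)."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


intervals = {
    "spike to page": minutes_between("spike_starts", "on_call_paged"),
    "outage to page": minutes_between("site_unresponsive", "on_call_paged"),
    "page to root cause": minutes_between("on_call_paged", "root_cause_identified"),
    "root cause to temporary fix": minutes_between("root_cause_identified", "temporary_fix_deployed"),
    "temporary fix to full recovery": minutes_between("temporary_fix_deployed", "full_recovery"),
    "outage to full recovery": minutes_between("site_unresponsive", "full_recovery"),
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes} min")
```

Laid out this way, the 20 minutes between the traffic spike and the page, and the 75 minutes between the temporary fix and confirmed recovery, stand out as the intervals worth asking questions about.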

What actually went wrong:

  1. No load testing done for the flash sale scenario
  2. Connection pool was sized for normal traffic, not a 10x spike
  3. Monitoring didn't alert on connection pool saturation (see "A Manifest for Better Logging")
  4. Runbook for "database overwhelmed" didn't exist

The Post-Mortem Format I Use

Skip the 20-page document. I use a simple template:

INCIDENT: [Name]
DATE: [When]
DURATION: [How long]
IMPACT: [What broke, who was affected]

TIMELINE:
[Bullet points of what happened when]

ROOT CAUSE:
[One sentence. Not symptoms — the actual cause.]

CONTRIBUTING FACTORS:
[What made it worse or slower to fix]

ACTION ITEMS:
[Specific, assigned, with deadlines]

That's it. One page.
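If your team would rather generate the page than copy and paste it, the template maps cleanly onto a small data structure. The sketch below is one possible shape, not a standard: the `PostMortem` class, its field names, and the incident name "Flash sale outage" are my own inventions for illustration, and the date is left as a placeholder because it isn't part of this write-up.

```python
from dataclasses import dataclass, field


def _bullets(items: list[str]) -> str:
    """Render a list as indented bullet lines, or a placeholder if empty."""
    return "\n".join(f"  - {item}" for item in items) if items else "  (none)"


@dataclass
class PostMortem:
    incident: str
    date: str
    duration: str
    impact: str
    root_cause: str  # one sentence: the cause, not the symptoms
    timeline: list[str] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text in the template's field order."""
        return "\n".join([
            f"INCIDENT: {self.incident}",
            f"DATE: {self.date}",
            f"DURATION: {self.duration}",
            f"IMPACT: {self.impact}",
            "",
            "TIMELINE:",
            _bullets(self.timeline),
            "",
            "ROOT CAUSE:",
            self.root_cause,
            "",
            "CONTRIBUTING FACTORS:",
            _bullets(self.contributing_factors),
            "",
            "ACTION ITEMS:",
            _bullets(self.action_items),
        ])


report = PostMortem(
    incident="Flash sale outage",  # hypothetical name
    date="[When]",  # placeholder: the date is not given here
    duration="2 hours",
    impact="Site unresponsive during the flash sale; roughly $150K in lost revenue",
    root_cause="The connection pool was sized for normal traffic and was exhausted by the sale spike.",
    timeline=["14:30 traffic spike starts", "14:45 site goes unresponsive", "16:45 full recovery confirmed"],
    contributing_factors=["No load test for the flash sale", "No alert on pool saturation", "No runbook"],
    action_items=["Add connection pool monitoring (Platform lead, 3 days)"],
)
print(report.render())
```

Making `root_cause` a single required string, rather than a list, is deliberate: it forces the one-sentence answer the template asks for.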

The Meeting

Keep it short. 30-45 minutes max.

Who's in the room:

  • People who were directly involved
  • The person who'll own the action items
  • One person to take notes

That's it. No executives unless they were hands-on-keyboard. No one "attending to stay informed."

How it runs:

  1. Walk through the timeline together (10 min)
  2. Identify root cause vs symptoms (10 min)
  3. Generate action items (15 min)
  4. Assign owners and deadlines (5 min)

No blame. No "who made the mistake." Systems fail, not people. If one person's error can take down production, you have a system problem.

Action Items That Actually Get Done

Most post-mortem action items never happen. They go into a backlog and die.

My rules:

  • Maximum 3 action items per incident
  • Each one has an owner (a person, not a team)
  • Each one has a deadline (within 2 weeks, or it won't happen)
  • Each one gets tracked in whatever system the team actually looks at

Example from that flash sale incident:

  1. Add connection pool monitoring — Owner: Platform lead — Due: 3 days
  2. Create flash sale load test scenario — Owner: QA lead — Due: 1 week
  3. Update runbook for DB saturation — Owner: On-call rotation lead — Due: 1 week

All three got done. Next flash sale? No issues.
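The rules above are mechanical enough that you can check them before the meeting ends. Here is a rough sketch of such a check. Everything in it (`ActionItem`, `check_action_items`, the `team_names` set) is hypothetical, and the "person, not a team" rule is only approximated by comparing the owner against a list of team names you would supply yourself.

```python
from dataclasses import dataclass

MAX_ITEMS = 3       # maximum 3 action items per incident
MAX_DUE_DAYS = 14   # deadline within 2 weeks


@dataclass
class ActionItem:
    description: str
    owner: str
    due_in_days: int


def check_action_items(
    items: list[ActionItem],
    team_names: frozenset[str] = frozenset(),  # lowercase team names, supplied by you
) -> list[str]:
    """Return a list of rule violations; an empty list means the items pass."""
    problems = []
    if len(items) > MAX_ITEMS:
        problems.append(f"{len(items)} action items; keep at most {MAX_ITEMS}")
    for item in items:
        owner = item.owner.strip()
        if not owner:
            problems.append(f"'{item.description}' has no owner")
        elif owner.lower() in team_names:
            problems.append(f"'{item.description}' is owned by a team, not a person")
        if item.due_in_days > MAX_DUE_DAYS:
            problems.append(f"'{item.description}' is due in more than {MAX_DUE_DAYS} days")
    return problems


flash_sale_items = [
    ActionItem("Add connection pool monitoring", "Platform lead", 3),
    ActionItem("Create flash sale load test scenario", "QA lead", 7),
    ActionItem("Update runbook for DB saturation", "On-call rotation lead", 7),
]
print(check_action_items(flash_sale_items))  # prints []
```

The check can't tell you whether an item is worth doing. It only catches the structural problems that make items quietly die in a backlog.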

Common Mistakes

The blame game: "John should have caught this in code review." Cool. Now John feels terrible and nothing improved. Focus on the system: why didn't tests catch it? Why didn't monitoring alert?

Too many action items: 15 action items = 0 action items. Pick the 2-3 that would have the biggest impact.

Vague actions: "Improve monitoring" is not an action item. "Add alert for connection pool > 80%" is an action item.
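To show how concrete that second version is, the whole alert condition fits in one function. This is only a sketch of the threshold logic: the function name is made up, the pool size of 50 is an arbitrary example, and where `in_use` and `pool_size` come from depends on your connection pool or database metrics, which I'm not assuming anything about.

```python
def pool_over_threshold(in_use: int, pool_size: int, threshold_pct: int = 80) -> bool:
    """True when more than threshold_pct percent of the pool is in use."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return in_use * 100 > threshold_pct * pool_size


# Hypothetical pool of 50 connections: 80% is 40 connections.
print(pool_over_threshold(40, 50))  # prints False (exactly 80%, not above it)
print(pool_over_threshold(41, 50))  # prints True
```

An action item you can express as a condition like this is one somebody can actually finish and verify.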

No follow-up: If you don't check whether actions got done, they won't. I add a calendar reminder for 2 weeks after every post-mortem.

When to Skip the Post-Mortem

Not every incident needs a formal review. If the fix was obvious and already applied, a quick Slack thread is fine.

Run a post-mortem when:

  • Multiple people were involved in the response
  • Impact was significant
  • Root cause wasn't immediately obvious
  • Same or similar incident happened before

The goal is learning, not bureaucracy. If you're running post-mortems for trivial incidents, people will start treating all post-mortems as trivial.
