How to Run a Post-Mortem That Actually Improves Things
Most post-mortems are theater. Here's how I run them so they actually prevent the next incident.
Most post-mortems are theater. Here's how I run them so they actually prevent the next incident.

I've been in the room for post-mortems that changed how teams operate. I've also sat through post-mortems that were complete wastes of time — checkbox exercises that made everyone feel better but changed nothing.
The difference? How you run them.
It's not to assign blame. It's not to document what happened for compliance. It's to answer one question: How do we make sure this specific thing never happens again?
That's it. Everything else is noise.
Let me walk through an actual incident I dealt with.
What happened: E-commerce platform went down during a flash sale. 2 hours of downtime. Rough estimate: $150K in lost revenue.
Timeline:
What actually went wrong:
Skip the 20-page document. I use a simple template:
INCIDENT: [Name]
DATE: [When]
DURATION: [How long]
IMPACT: [What broke, who was affected]
TIMELINE:
[Bullet points of what happened when]
ROOT CAUSE:
[One sentence. Not symptoms — the actual cause.]
CONTRIBUTING FACTORS:
[What made it worse or slower to fix]
ACTION ITEMS:
[Specific, assigned, with deadlines]
That's it. One page.
Keep it short. 30-45 minutes max.
Who's in the room:
That's it. No executives unless they were hands-on-keyboard. No one "attending to stay informed."
How it runs:
No blame. No "who made the mistake." Systems fail, not people. If one person's error can take down production, you have a system problem.
Most post-mortem action items never happen. They go into a backlog and die.
My rules:
Example from that flash sale incident:
All three got done. Next flash sale? No issues.
The blame game: "John should have caught this in code review." Cool. Now John feels terrible and nothing improved. Focus on the system: why didn't tests catch it? Why didn't monitoring alert?
Too many action items: 15 action items = 0 action items. Pick the 2-3 that would have the biggest impact.
Vague actions: "Improve monitoring" is not an action item. "Add alert for connection pool > 80%" is an action item.
No follow-up: If you don't check whether actions got done, they won't. I add a calendar reminder for 2 weeks after every post-mortem.
Not every incident needs a formal review. If the fix was obvious and already applied, a quick Slack thread is fine.
Post-mortem when:
The goal is learning, not bureaucracy. If you're running post-mortems for trivial incidents, people will start treating all post-mortems as trivial.