Mehmet Erturk | How to Run a Post-Mortem That Actually Improves Things

I've been in the room for post-mortems that changed how teams operate. I've also sat through post-mortems that were complete wastes of time — checkbox exercises that made everyone feel better but changed nothing.

The difference? How you run them.

The Point of a Post-Mortem

It's not to assign blame. It's not to document what happened for compliance. It's to answer one question: How do we make sure this specific thing never happens again?

That's it. Everything else is noise.

A Real Example

Let me walk through an actual incident I dealt with.

What happened: An e-commerce platform went down during a flash sale. The site was unresponsive for 2 hours (14:45 to 16:45 in the timeline below). Rough estimate: $150K in lost revenue.

Timeline:

14:30 — Traffic spike starts as sale goes live
14:35 — Response times degrade
14:42 — Database connections maxed out
14:45 — Site goes unresponsive
14:50 — On-call gets paged
15:10 — Root cause identified (connection pool exhaustion)
15:30 — Temporary fix deployed (increased pool size)
16:45 — Full recovery confirmed
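
The gaps between timeline entries are often where the useful questions come from, so it helps to compute them instead of eyeballing them. Below is a minimal sketch that does this for the timeline above. The helper names (minutes_between, find_time) are made up for illustration, and it assumes every entry falls on the same day.

```python
from datetime import datetime

# Timeline entries copied from the incident above (24-hour clock, same day).
TIMELINE = [
    ("14:30", "Traffic spike starts as sale goes live"),
    ("14:35", "Response times degrade"),
    ("14:42", "Database connections maxed out"),
    ("14:45", "Site goes unresponsive"),
    ("14:50", "On-call gets paged"),
    ("15:10", "Root cause identified"),
    ("15:30", "Temporary fix deployed"),
    ("16:45", "Full recovery confirmed"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def find_time(fragment: str) -> str:
    """Timestamp of the first timeline entry whose text contains fragment."""
    for stamp, text in TIMELINE:
        if fragment in text:
            return stamp
    raise KeyError(fragment)


first_symptom = find_time("Response times degrade")
outage_start = find_time("unresponsive")
paged = find_time("paged")
diagnosed = find_time("Root cause")
recovered = find_time("Full recovery")

print("First symptom to page:", minutes_between(first_symptom, paged), "min")
print("Page to diagnosis:", minutes_between(paged, diagnosed), "min")
print("Outage duration:", minutes_between(outage_start, recovered), "min")
```

For this incident the three numbers come out to 15, 20, and 120 minutes. The first one is the interesting one: response times were degrading for 15 minutes before anyone was paged.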

What actually went wrong:

No load testing was done for the flash sale scenario
The connection pool was sized for normal traffic, not a 10x spike
Monitoring didn't alert on connection pool saturation (see A Manifest for Better Logging)
No runbook existed for "database overwhelmed"

The Post-Mortem Format I Use

Skip the 20-page document. I use a simple template:

INCIDENT: [Name]
DATE: [When]
DURATION: [How long]
IMPACT: [What broke, who was affected]

TIMELINE:
[Bullet points of what happened when]

ROOT CAUSE:
[One sentence. Not symptoms — the actual cause.]

CONTRIBUTING FACTORS:
[What made it worse or slower to fix]

ACTION ITEMS:
[Specific, assigned, with deadlines]

That's it. One page.
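
If your team keeps post-mortems in a repo or wiki, the template can live as a small data structure so every write-up has the same sections. This is only a sketch: the PostMortem class, its field names, and the "- " bullet style are assumptions for illustration. The example values come from the flash sale incident, with the date left as a placeholder and the root cause sentence being my one-line paraphrase of the findings above.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """One-page post-mortem with the same sections as the template above."""
    name: str
    date: str
    duration: str
    impact: str
    root_cause: str  # one sentence: the cause, not the symptoms
    timeline: list[str] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the post-mortem as plain text in template order."""
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items)

        sections = [
            f"INCIDENT: {self.name}\nDATE: {self.date}\n"
            f"DURATION: {self.duration}\nIMPACT: {self.impact}",
            "TIMELINE:\n" + bullets(self.timeline),
            "ROOT CAUSE:\n" + self.root_cause,
            "CONTRIBUTING FACTORS:\n" + bullets(self.contributing_factors),
            "ACTION ITEMS:\n" + bullets(self.action_items),
        ]
        return "\n\n".join(sections)


flash_sale = PostMortem(
    name="Flash sale outage",
    date="[When]",  # placeholder: fill in the real date
    duration="2 hours",
    impact="Site unresponsive during the sale; roughly $150K in lost revenue",
    root_cause="The connection pool was sized for normal traffic and was "
               "exhausted by the flash sale spike.",
    timeline=[  # abbreviated; the full timeline has eight entries
        "14:30 Traffic spike starts as sale goes live",
        "14:45 Site goes unresponsive",
        "15:30 Temporary fix deployed (increased pool size)",
        "16:45 Full recovery confirmed",
    ],
    contributing_factors=[
        "No load testing for the flash sale scenario",
        "No alert on connection pool saturation",
        "No runbook for an overwhelmed database",
    ],
    action_items=[
        "Add connection pool monitoring (Platform lead, 3 days)",
        "Create flash sale load test scenario (QA lead, 1 week)",
        "Update runbook for DB saturation (On-call rotation lead, 1 week)",
    ],
)

print(flash_sale.render())
```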

The Meeting

Keep it short. 30-45 minutes max.

Who's in the room:

People who were directly involved
The person who'll own the action items
One person to take notes

That's it. No executives unless they were hands-on-keyboard. No one "attending to stay informed."

How it runs:

Walk through the timeline together (10 min)
Identify root cause vs symptoms (10 min)
Generate action items (15 min)
Assign owners and deadlines (5 min)

No blame. No "who made the mistake." Systems fail, not people. If one person's error can take down production, you have a system problem.

Action Items That Actually Get Done

Most post-mortem action items never happen. They go into a backlog and die.

My rules:

Maximum 3 action items per incident
Each one has an owner (a person, not a team)
Each one has a deadline (within 2 weeks, or it won't happen)
Each one gets tracked in whatever system the team actually looks at

Example from that flash sale incident:

Add connection pool monitoring — Owner: Platform lead — Due: 3 days
Create flash sale load test scenario — Owner: QA lead — Due: 1 week
Update runbook for DB saturation — Owner: On-call rotation lead — Due: 1 week

All three got done. Next flash sale? No issues.
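
The rules above are mechanical enough to check automatically, for example before a post-mortem gets marked as closed. Here is a rough sketch under some assumptions: the ActionItem class and check_action_items function are hypothetical names, and the meeting date is an arbitrary placeholder. Code can confirm that an owner field is filled in, but not that the owner is a person and not a team, so that part still needs a human.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_ITEMS = 3          # maximum action items per incident
MAX_DAYS_TO_DUE = 14   # deadlines must fall within 2 weeks


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date


def check_action_items(items: list[ActionItem], meeting_date: date) -> list[str]:
    """Return a list of rule violations; an empty list means all rules pass."""
    problems = []
    if len(items) > MAX_ITEMS:
        problems.append(f"{len(items)} action items; the maximum is {MAX_ITEMS}")
    latest_due = meeting_date + timedelta(days=MAX_DAYS_TO_DUE)
    for item in items:
        if not item.owner.strip():
            problems.append(f"'{item.description}' has no owner")
        if item.due > latest_due:
            problems.append(f"'{item.description}' is due more than 2 weeks out")
    return problems


# Placeholder meeting date, chosen only so the example runs.
meeting = date(2024, 1, 8)

flash_sale_items = [
    ActionItem("Add connection pool monitoring", "Platform lead",
               meeting + timedelta(days=3)),
    ActionItem("Create flash sale load test scenario", "QA lead",
               meeting + timedelta(days=7)),
    ActionItem("Update runbook for DB saturation", "On-call rotation lead",
               meeting + timedelta(days=7)),
]

print(check_action_items(flash_sale_items, meeting))
```

The three flash sale items pass, so this prints an empty list. Add a fourth item due in a month and you get two violations back.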

Common Mistakes

The blame game: "John should have caught this in code review." Cool. Now John feels terrible and nothing improved. Focus on the system: why didn't tests catch it? Why didn't monitoring alert?

Too many action items: 15 action items = 0 action items. Pick the 2-3 that would have the biggest impact.

Vague actions: "Improve monitoring" is not an action item. "Add alert for connection pool > 80%" is an action item.

No follow-up: If you don't check whether actions got done, they won't. I add a calendar reminder for 2 weeks after every post-mortem.
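
To show why the specific version of an action item is better: "Add alert for connection pool > 80%" translates almost directly into a check, while "Improve monitoring" does not. A minimal sketch, assuming you can read the number of in-use connections and the pool size from your database driver or metrics system (the function name and parameters here are hypothetical, not any particular library's API):

```python
def should_alert(in_use: int, pool_size: int, threshold_percent: int = 80) -> bool:
    """True when connection pool usage is strictly above the threshold.

    Uses integer arithmetic so that exactly 80% does not trip an "> 80%" alert
    because of floating-point rounding.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    if in_use < 0:
        raise ValueError("in_use cannot be negative")
    return in_use * 100 > pool_size * threshold_percent


print(should_alert(85, 100))  # above 80% of the pool
print(should_alert(80, 100))  # exactly 80%, not above it
```

In practice this condition would live in your alerting tool's rule language instead of application code, but the point stands: a good action item is one you could write as a condition.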

When to Skip the Post-Mortem

Not every incident needs a formal review. If the fix was obvious and already applied, a quick Slack thread is fine.

Run a post-mortem when:

Multiple people were involved in the response
Impact was significant
Root cause wasn't immediately obvious
Same or similar incident happened before
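
If you want the decision to be consistent across on-call engineers, the criteria above can be written down as a simple check. This sketch treats the list as "any one of these is enough", which is my reading and not something the list states outright. The Incident class and its fields are hypothetical, and what counts as "significant" impact is still a judgment call the team has to make.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    responders: int            # how many people worked on the response
    significant_impact: bool   # team's own judgment
    root_cause_obvious: bool   # was the cause clear right away?
    seen_before: bool          # same or similar incident happened before


def needs_postmortem(incident: Incident) -> bool:
    """True if any of the listed criteria applies (assumed any-of)."""
    return any([
        incident.responders > 1,
        incident.significant_impact,
        not incident.root_cause_obvious,
        incident.seen_before,
    ])


# Solo fix, small impact, obvious cause, first occurrence: a Slack thread is fine.
print(needs_postmortem(Incident(1, False, True, False)))

# Several responders and significant impact, as in the flash sale outage.
print(needs_postmortem(Incident(3, True, False, False)))
```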

The goal is learning, not bureaucracy. If you're running post-mortems for trivial incidents, people will start treating all post-mortems as trivial.