Dec 17, 2022 · 5 min read

A Manifesto for Better Logging

Most logging is useless noise. Here's how I think about logs that actually help when things break at 3am.

I've debugged production incidents where logs told me exactly what went wrong. I've also debugged incidents where thousands of log lines told me absolutely nothing useful.

The difference isn't volume. It's intent.

The Problem With Most Logs

Open any codebase. You'll find:

logger.info("Starting process")
logger.info("Process started")
logger.debug("Entering function X")
logger.info("Done")

Congratulations. You've documented that code runs. You've learned nothing about what it's doing or why it failed.

When something breaks at 3am, you don't need to know that processes started. You need to know:

  • What was the input?
  • What was the state when it failed?
  • What decision did the system make?
  • What external systems were involved?

Log for the Debug Session You'll Have Later

When writing a log statement, imagine yourself debugging at 3am with an angry customer on hold. What would you wish you had logged?

Bad:

logger.info("Processing order")

Good:

logger.info("Processing order", {
  order_id: order.id,
  customer_id: order.customer_id,
  total: order.total,
  items_count: order.items.length,
  payment_method: order.payment_method
})

When that order fails, you'll know exactly which order, for whom, and what was special about it.

The Levels Actually Mean Something

I've seen codebases where everything is INFO. Or where DEBUG is used for actual errors. The levels exist for a reason:

ERROR — Something broke. The operation failed. You might get paged.

  • Payment failed
  • Database connection lost
  • External API returned 500

WARN — Something's wrong but we handled it. Worth investigating.

  • Retry succeeded after failure
  • Fallback used
  • Rate limit approaching

INFO — Significant business events. The happy path.

  • Order created
  • User signed up
  • Workflow completed

DEBUG — Detailed technical info. Usually off in production.

  • Function entry/exit
  • Variable values
  • Cache hits/misses

If your production logs are 90% INFO, you're probably logging too much noise. If you have zero WARN, you're probably missing early warning signs.
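
Here's how the levels might map onto code, as a minimal sketch (fetchWithRetry and the inline logger are mine, purely for illustration; a real project would use an actual logging library):

// Sketch only: a retry flow where each level earns its place.
const logger = {
  debug: (msg: string, ctx?: object) => console.log(JSON.stringify({ level: "debug", msg, ...ctx })),
  warn: (msg: string, ctx?: object) => console.log(JSON.stringify({ level: "warn", msg, ...ctx })),
  error: (msg: string, ctx?: object) => console.error(JSON.stringify({ level: "error", msg, ...ctx })),
};

async function fetchWithRetry(url: string, maxAttempts = 3): Promise<Response> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      logger.debug("Attempting fetch", { url, attempt }); // DEBUG: technical detail
      const res = await fetch(url);
      if (attempt > 1) {
        // WARN: we recovered, but something upstream was wrong
        logger.warn("Fetch succeeded after retry", { url, attempt });
      }
      return res;
    } catch (err) {
      if (attempt === maxAttempts) {
        // ERROR: the operation actually failed; someone may get paged
        logger.error("Fetch failed after all retries", { url, attempts: attempt });
        throw err;
      }
    }
  }
  throw new Error("unreachable");
}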

Structure Over Strings

Unstructured logs are hell to query.

"User 12345 purchased item ABC for $99.99"

Good luck writing a query to find all purchases over $50.

Structured logs are queryable:

{
  "event": "purchase_completed",
  "user_id": "12345",
  "item_id": "ABC",
  "amount": 99.99,
  "currency": "USD"
}

Now you can: amount > 50 AND event = "purchase_completed"

Every log should be a structured event with consistent field names.
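
You don't need a framework to start. A structured logger is a few lines, as in this sketch (logEvent is my name for it; real libraries like pino or winston add levels, transports, and redaction):

type LogEvent = Record<string, unknown>;

function logEvent(event: string, fields: LogEvent = {}): void {
  // One JSON object per line: trivially parseable, trivially queryable.
  console.log(JSON.stringify({ event, timestamp: new Date().toISOString(), ...fields }));
}

logEvent("purchase_completed", {
  user_id: "12345",
  item_id: "ABC",
  amount: 99.99,
  currency: "USD",
});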

Correlation IDs Are Non-Negotiable

A single user request might hit 5 services and generate 50 log lines. How do you connect them?

A correlation ID. Also called a trace ID or a request ID.

Generate it at the edge. Pass it through every service. Include it in every log line.

{"correlation_id": "abc-123", "service": "api", "event": "request_received"}
{"correlation_id": "abc-123", "service": "payments", "event": "charge_initiated"}
{"correlation_id": "abc-123", "service": "notifications", "event": "email_queued"}

Now when something fails, you can trace the entire journey: correlation_id = "abc-123"
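
In practice this is a few lines of middleware. A sketch, assuming an Express app and x-correlation-id as the header convention (both my choices here; adjust to whatever your stack uses):

import { randomUUID } from "node:crypto";
import express from "express";

const app = express();

app.use((req, res, next) => {
  // Reuse the ID if an upstream service already set one; otherwise mint it at the edge.
  const correlationId = req.header("x-correlation-id") ?? randomUUID();
  res.locals.correlationId = correlationId;
  // Echo it back so clients and downstream services can propagate it.
  res.setHeader("x-correlation-id", correlationId);
  next();
});

app.get("/orders/:id", (req, res) => {
  console.log(JSON.stringify({
    correlation_id: res.locals.correlationId,
    service: "api",
    event: "request_received",
    path: req.path,
  }));
  res.json({ ok: true });
});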

Log Decisions, Not Just Actions

Most logs tell you what happened. Better logs tell you why it happened.

Just action:

logger.info("Routing to fallback service")

With decision context:

logger.info("Routing to fallback service", {
  reason: "primary_timeout",
  primary_latency_ms: 5200,
  timeout_threshold_ms: 5000,
  fallback_service: "backup-api-west"
})

When you're debugging why traffic went to the fallback, you'll know it was a timeout, not an error.
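
The easiest way to guarantee that context is to compute the decision and log it in the same place. A sketch (the threshold, service names, and inline logger are illustrative):

const logger = {
  info: (msg: string, ctx: object) => console.log(JSON.stringify({ msg, ...ctx })),
};

const TIMEOUT_THRESHOLD_MS = 5000;

function chooseService(primaryLatencyMs: number): string {
  if (primaryLatencyMs > TIMEOUT_THRESHOLD_MS) {
    // The decision inputs go straight into the log, so the "why" can't drift from the "what".
    logger.info("Routing to fallback service", {
      reason: "primary_timeout",
      primary_latency_ms: primaryLatencyMs,
      timeout_threshold_ms: TIMEOUT_THRESHOLD_MS,
      fallback_service: "backup-api-west",
    });
    return "backup-api-west";
  }
  return "primary-api";
}

chooseService(5200); // logs the fallback decision with its inputs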

What Not to Log

Sensitive data — No passwords, tokens, full credit card numbers, or PII without consent. This seems obvious, but I've seen all of it in production logs (a redaction sketch follows this list).

High-frequency noise — If something happens 10,000 times per second, sampling or aggregating is better than logging every instance.

Success without context — logger.info("Success") tells you nothing. Either add context or don't bother.

Someone else's logs — Don't log what downstream services already log. You'll double-count everything and confuse yourself.
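
For the sensitive-data rule, a redaction pass before serialization is cheap insurance. A sketch (the deny-list and redactFields are mine; many teams prefer allow-lists, which fail safe when new fields appear):

const SENSITIVE_FIELDS = new Set(["password", "token", "card_number", "ssn"]);

function redactFields(fields: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    clean[key] = SENSITIVE_FIELDS.has(key) ? "[REDACTED]" : value;
  }
  return clean;
}

console.log(JSON.stringify(redactFields({
  event: "login_attempt",
  user_id: "12345",
  password: "hunter2", // never reaches the log store
})));
// {"event":"login_attempt","user_id":"12345","password":"[REDACTED]"}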

My Logging Checklist

For any significant operation:

  1. Entry point — What started this? (request ID, user ID, operation type)
  2. Decision points — What path did we take and why?
  3. External calls — What did we ask, what did they respond? (latency, status)
  4. Outcome — Did it succeed? What was the result?
  5. Errors — What failed, what was the state, what's the impact?

Not every function needs all five. But any operation a user would care about should have most of them.
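
Put together, the checklist looks something like this for a single operation (a sketch; processOrder, chargePayment, and the threshold are all hypothetical):

const logger = {
  info: (event: string, f: object) => console.log(JSON.stringify({ level: "info", event, ...f })),
  error: (event: string, f: object) => console.error(JSON.stringify({ level: "error", event, ...f })),
};

async function chargePayment(orderId: string, gateway: string): Promise<void> {
  // Stub standing in for a real payment gateway call.
}

async function processOrder(order: { id: string; total: number }): Promise<void> {
  // 1. Entry point: what started this?
  logger.info("order_processing_started", { order_id: order.id, total: order.total });

  // 2. Decision point: which path did we take, and why?
  const gateway = order.total > 1000 ? "manual-review" : "auto-charge";
  logger.info("payment_route_chosen", { order_id: order.id, gateway, reason: "total_threshold" });

  // 3. External call: what did we ask, what came back, how long did it take?
  const start = Date.now();
  try {
    await chargePayment(order.id, gateway);
    logger.info("payment_gateway_responded", {
      order_id: order.id, gateway, latency_ms: Date.now() - start, status: "ok",
    });

    // 4. Outcome
    logger.info("order_processing_completed", { order_id: order.id });
  } catch (err) {
    // 5. Errors: what failed, what was the state, what's the impact?
    logger.error("order_processing_failed", {
      order_id: order.id,
      gateway,
      latency_ms: Date.now() - start,
      error: String(err),
      impact: "customer_not_charged",
    });
    throw err;
  }
}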

The Ops Perspective

Logs aren't just for debugging. They're for:

  • Alerting — ERROR rate spikes → page someone (see How to Run a Post-Mortem)
  • Dashboards — Business metrics from log events
  • Auditing — Who did what, when (especially for compliance)
  • Capacity planning — Traffic patterns, usage trends

If your logs can't answer "how many orders did we process yesterday?" and "why did user X get an error at 2pm?", they're not doing their job.
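
With JSON-lines logs, those answers don't even need a fancy platform. A sketch in plain TypeScript over parsed log records (the field names are the ones used throughout this post; the date is hardcoded for illustration):

type LogRecord = {
  event?: string;
  level?: string;
  user_id?: string;
  timestamp: string; // ISO 8601
};

// "How many orders did we process yesterday?"
const ordersYesterday = (logs: LogRecord[]): number =>
  logs.filter((l) => l.event === "order_processing_completed" && l.timestamp.startsWith("2022-12-16")).length;

// "Why did user X get an error at 2pm?"
const userErrorsAround2pm = (logs: LogRecord[]): LogRecord[] =>
  logs.filter((l) => l.level === "error" && l.user_id === "X" && l.timestamp.startsWith("2022-12-16T14"));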

Tools Matter Less Than Discipline

ELK, Datadog, CloudWatch, Splunk — the tool doesn't matter if your logs are garbage.

I've seen teams with expensive observability platforms and useless logs. I've seen teams with basic tooling and excellent debugging capability.

The discipline is:

  • Structure everything
  • Correlate across services
  • Log decisions with context
  • Log at appropriate levels
  • Review logs regularly (not just when things break)

Logging is infrastructure. Treat it like you'd treat your database schema — with intention and consistency. The investment pays off the first time you debug a production issue in 10 minutes instead of 10 hours.
