Sep 22, 2024 · 6 min read

Building AI That Actually Remembers

Most AI systems are goldfish. Here's how I think about memory for agents that need to learn and persist knowledge.


Ask ChatGPT a question. It answers. Close the tab. Come back tomorrow.

It has no idea who you are. Every conversation starts fresh.

For casual chat, that's fine. For AI that's supposed to work alongside you — learn your preferences, remember past decisions, build on previous work — it's a fundamental limitation.

I've been building memory systems for AI agents. The problem is harder than it looks.

The Naive Approach (And Why It Fails)

"Just store all the messages and include them in context!"

Problems:

  1. Context windows have limits (even big ones fill up)
  2. Token costs scale linearly with history (see The Cost Problem in AI)
  3. Relevance degrades (old messages aren't always useful)
  4. No structure (can't query "what do we know about customer X?")

"Just use a vector database!"

Problems:

  1. Semantic similarity isn't always relevance
  2. No relationships (facts exist in isolation)
  3. Hallucination without provenance (the AI "knows" things but can't say where it learned them)
  4. Retrieval is noisy (returns similar but not useful chunks)

Both approaches optimize for the easy case and fall apart in production.

Messages Are Not Memory

This is the key insight: messages are history, memory is knowledge.

When you talk to a colleague, they remember:

  • Facts about your project ("the deadline is March 15th")
  • Preferences ("you prefer TypeScript over JavaScript")
  • Past decisions ("we chose Postgres because...")
  • Experiences ("last time we tried X, it didn't work")

They don't remember the exact words of every conversation you've ever had.

The same should be true for AI. Store messages for history. But extract and curate knowledge separately.

Four Types of Memory

After working through this, I've landed on four distinct memory types:

1. Short-term Memory (Conversation State)

The current conversation. Recent messages. Rolling summary of older messages if the conversation is long.

Purpose: Keep the current thread coherent. Lifetime: Session. Storage: Ephemeral (Redis or in-memory).

2. Episodic Memory (Experiences)

What happened in past sessions. Summaries of previous conversations and workflows.

"Last week we discussed migrating to a new CRM. You decided to wait until Q2."

Purpose: Remember experiences without storing every word. Lifetime: Weeks to months. Storage: Summarized events, searchable.

3. Semantic Memory (Facts and Entities)

Structured knowledge. Things the agent knows with confidence.

  • Entities: "Acme Corp is a customer, contact is Jane, they use Enterprise plan"
  • Facts: "API rate limit is 1000 requests/minute"

Key requirement: provenance. Every fact should link back to where it came from. "I know this because of [conversation/document/observation]."

Purpose: Grounded knowledge retrieval. Lifetime: Long-term, with decay/refresh. Storage: Graph database with vector search.

4. Procedural Memory (How the Agent Operates)

Instructions, policies, preferences. The rules that govern behavior.

  • "Always check inventory before confirming orders"
  • "Respond formally to enterprise customers"
  • "Escalate refund requests over $500"

Purpose: Consistent behavior. Human-defined rules. Lifetime: Until changed by humans. Storage: Deterministic, pinned to agent identity.
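To make the split concrete, here's a minimal sketch of how the four types might be modeled as separate stores. The class and field names are my own illustration, not any particular framework's API.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative data model for the four memory types.
# Class and field names are assumptions, not a specific framework's API.

@dataclass
class ShortTermMemory:
    """Current conversation: session-scoped and ephemeral."""
    session_id: str
    recent_messages: list[str] = field(default_factory=list)
    rolling_summary: str = ""

@dataclass
class Episode:
    """Episodic memory: a summarized past session or workflow."""
    summary: str
    occurred_at: datetime
    source_session_id: str

@dataclass
class Fact:
    """Semantic memory: a claim the agent knows, always with provenance."""
    subject: str                       # e.g. "Acme Corp"
    claim: str                         # e.g. "uses the Enterprise plan"
    source: str                        # message/document/observation it came from
    learned_at: datetime
    confidence: float
    last_verified: datetime | None = None

@dataclass
class Procedure:
    """Procedural memory: a human-defined rule pinned to the agent."""
    rule: str                          # e.g. "Escalate refund requests over $500"
    set_by: str
    updated_at: datetime
```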

The Write Path (How Memory Gets Created)

Memory doesn't appear magically. You need pipelines.

  1. Store raw messages (append-only, for audit trail)
  2. Run extraction jobs:
    • Entity linking ("this message mentions [Customer X]")
    • Fact distillation ("extract the key claim from this exchange")
    • Episode summarization ("summarize this conversation")
  3. Send to review queue (more on this below)
  4. Apply to memory store (after review or auto-approval)

This is where most systems fail. They either don't extract (so memory is just raw messages) or they auto-apply everything (so memory fills with noise).
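As a rough sketch, the write path can be a single pipeline function. Everything here is hypothetical: `link_entities`, `distill_facts`, `maybe_summarize_episode`, `store`, and `review_queue` stand in for whatever LLM calls, databases, and queue you actually use. The structure is the point.

```python
def ingest_message(message, store, review_queue):
    """Hypothetical write path: raw message -> extraction -> review -> apply."""
    # 1. Append-only raw storage, kept for the audit trail
    store.append_raw(message)

    # 2. Extraction jobs (each typically an LLM call or entity-linking pass)
    entities = link_entities(message)              # "this mentions [Customer X]"
    candidates = distill_facts(message, entities)  # key claims, with provenance
    episode = maybe_summarize_episode(message)     # only at session boundaries
    if episode is not None:
        candidates.append(episode)

    # 3. Nothing goes straight into memory: everything enters the review queue
    for item in candidates:
        review_queue.submit(item)

    # 4. Approved items (human review or auto-approval rules) are applied
    for item in review_queue.approved():
        store.apply(item)
```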

The Read Path (How Memory Gets Retrieved)

At inference time, you need to assemble the right context:

Step 1: Inject procedural memory
  → Always include: instructions, policies, preferences

Step 2: Retrieve task-relevant semantic memory
  → Vector search + graph expansion from detected entities

Step 3: Add relevant episodic memory
  → "Remember when..." context

Step 4: Add short-term context
  → Recent messages from current session

Step 5: Build MemoryContext
  → Structured sections, with provenance links

The key is being selective. You can't dump everything into context. You need to retrieve what's relevant for this specific query.
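A minimal sketch of that assembly step, assuming each memory type lives in its own store. The retrieval helpers (`vector_search`, `graph_expand`, `detect_entities`, and friends) are placeholders for your own infrastructure:

```python
def build_memory_context(query, session, stores, k=5):
    """Hypothetical read path: assemble selective, provenance-linked context."""
    # Step 1: procedural memory is always included, retrieved deterministically
    rules = stores.procedural.for_agent(session.agent_id)

    # Step 2: task-relevant semantic memory, vector search plus graph expansion
    entities = detect_entities(query)
    facts = stores.semantic.vector_search(query, top_k=k)
    facts += stores.semantic.graph_expand(entities, depth=1)

    # Step 3: relevant episodic memory ("remember when...")
    episodes = stores.episodic.search(query, top_k=2)

    # Step 4: short-term context from the current session
    recent = session.recent_messages(limit=10)

    # Step 5: structured sections, each fact carrying its provenance link
    return {
        "instructions": [r.rule for r in rules],
        "facts": [(f.claim, f.source) for f in facts],
        "episodes": [e.summary for e in episodes],
        "conversation": recent,
    }
```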

The Provenance Problem

Here's something most people ignore: where did the AI learn this?

If an agent says "the customer prefers email contact" — how do you verify that? What if it's wrong?

Every fact in memory should have a provenance edge:

  • Source (which message/document/observation)
  • Timestamp (when was this learned)
  • Confidence (how certain are we)
  • Last verified (when was this confirmed)

When facts conflict, provenance lets you resolve them. When the AI is wrong, provenance lets you trace back to the error source.
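A rough sketch of what that conflict resolution can look like, reusing the illustrative `Fact` fields from earlier (recency of verification first, then extraction confidence, then a human):

```python
def resolve_conflict(fact_a, fact_b):
    """Hypothetical tie-break between contradictory facts, using provenance."""
    # Prefer the fact that was verified more recently
    if fact_a.last_verified and fact_b.last_verified:
        if fact_a.last_verified != fact_b.last_verified:
            return max(fact_a, fact_b, key=lambda f: f.last_verified)
    # Then the one extracted with higher confidence
    if fact_a.confidence != fact_b.confidence:
        return max(fact_a, fact_b, key=lambda f: f.confidence)
    # Otherwise neither wins automatically: flag both for human review
    return None
```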

Without provenance, memory is just hallucination you've written to disk.

Learning Mode: Guardrails for Auto-Update

Some memory should update automatically (the agent learns from experience). Some should require human review (the agent doesn't rewrite its own rules).

My approach:

Auto-apply (with limits):

  • Preferences ("user seems to prefer concise answers")
  • Knowledge notes (new information from conversations)
  • Episode summaries
  • Entity mentions

Require human approval:

  • Instructions (core behavior)
  • Policies (business rules)
  • Procedures (how to handle specific situations)

The agent can suggest changes to its own instructions. But a human has to approve them. Otherwise you get drift.
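A simple way to enforce that split is a gate in the write path: every extracted item carries a kind, and only the low-risk kinds skip human review, within a rate limit. The kind names and the `store`/`review_queue` objects below are illustrative.

```python
AUTO_APPLY = {"preference", "knowledge_note", "episode_summary", "entity_mention"}
REQUIRE_APPROVAL = {"instruction", "policy", "procedure"}

def route_memory_update(item, store, review_queue, max_auto_per_day=50):
    """Hypothetical guardrail: auto-apply low-risk memory, queue the rest."""
    if item.kind in REQUIRE_APPROVAL:
        # The agent can suggest rule changes, but a human has to approve them
        review_queue.submit(item)
    elif item.kind in AUTO_APPLY and store.auto_applied_today() < max_auto_per_day:
        # Learned from experience, applied within a daily limit
        store.apply(item)
    else:
        # Unknown kind or rate limit hit: fall back to review
        review_queue.submit(item)
```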

Selective Embeddings (Cost Control)

Embedding everything is expensive and noisy.

What I actually embed:

  • Facts (high ROI, semantic recall)
  • Episodes (sometimes, for "remember when" queries)

What I don't embed:

  • Raw messages (too noisy)
  • Instructions (retrieved deterministically, not by similarity)
  • Entities (looked up by ID, not similarity)

This cuts embedding costs significantly while keeping retrieval quality high.
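In code this can be as simple as a predicate in the write path. The kinds mirror the lists above; the `flagged_for_recall` field is a hypothetical marker for episodes worth embedding.

```python
ALWAYS_EMBED = {"fact"}        # high ROI: semantic recall over distilled claims
SOMETIMES_EMBED = {"episode"}  # only when "remember when" recall matters

def should_embed(item) -> bool:
    """Decide whether an item earns an embedding (and its cost)."""
    if item.kind in ALWAYS_EMBED:
        return True
    if item.kind in SOMETIMES_EMBED:
        return item.flagged_for_recall   # e.g. tied to a named project or decision
    # Raw messages (too noisy), instructions (retrieved deterministically),
    # and entities (looked up by ID) never get embeddings.
    return False
```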

The Graph Part of GraphRAG

Vector search alone isn't enough. You also need graph relationships.

Query: "What do we know about Project Alpha?"

Vector search might find:

  • A message mentioning Project Alpha
  • A fact about project timelines

Graph expansion adds:

  • The customer who owns Project Alpha
  • Other projects from that customer
  • Past issues with similar projects

The combination — vector for relevance, graph for connections — is more powerful than either alone.
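A sketch of that two-stage retrieval, assuming a vector index and a graph store with the usual primitives (both objects and their methods are placeholders, not a specific library):

```python
def retrieve_about(query, vector_index, graph, top_k=5, hops=1):
    """Hypothetical GraphRAG-style retrieval: vector hits, then graph neighbors."""
    # Stage 1: vector search for semantically similar memory items
    hits = vector_index.search(query, top_k=top_k)

    # Stage 2: expand from the entities attached to those hits,
    # e.g. Project Alpha -> its customer -> that customer's other projects
    related = []
    for hit in hits:
        for entity in hit.entities:
            related.extend(graph.neighbors(entity, depth=hops))

    # Vector for relevance, graph for connections; deduplicate by ID
    seen, results = set(), []
    for item in hits + related:
        if item.id not in seen:
            seen.add(item.id)
            results.append(item)
    return results
```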

What I Got Wrong Initially

Mistake 1: Storing everything as flat documents. No structure = can't query = can't use effectively.

Mistake 2: Auto-applying all extracted facts. Memory filled with noise. Quality matters more than quantity.

Mistake 3: Ignoring provenance. Months later, we couldn't debug why the agent "knew" wrong things.

Mistake 4: Same retrieval for all queries. Different query types need different retrieval strategies.


Memory is what separates a chatbot from a colleague. But building reliable memory systems is infrastructure work — extraction pipelines, quality control, provenance tracking, selective retrieval.

The LLM doesn't solve this. You have to build it.
