Building AI That Actually Remembers
Most AI systems are goldfish. Here's how I think about memory for agents that need to learn and persist knowledge.

Ask ChatGPT a question. It answers. Close the tab. Come back tomorrow.
It has no idea who you are. Every conversation starts fresh.
For casual chat, that's fine. For AI that's supposed to work alongside you — learn your preferences, remember past decisions, build on previous work — it's a fundamental limitation.
I've been building memory systems for AI agents. The problem is harder than it looks.
"Just store all the messages and include them in context!"
Problems: the context window fills up fast, cost grows with every message, and most of that history has nothing to do with the current task.
"Just use a vector database!"
Problems: you get a pile of similar-looking fragments with no structure, no relationships between facts, and no record of where anything came from.
Both approaches optimize for the easy case and fall apart in production.
This is the key insight: messages are history, memory is knowledge.
When you talk to a colleague, they remember: your preferences, the decisions you made together, and the context of what you're working on.
They don't remember the exact words of every conversation you've ever had.
The same should be true for AI. Store messages for history. But extract and curate knowledge separately.
After working through this, I've landed on four distinct memory types:
Short-term memory: the current conversation. Recent messages, plus a rolling summary of older messages if the conversation runs long.
Purpose: Keep the current thread coherent. Lifetime: Session. Storage: Ephemeral (Redis or in-memory).
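Here's a rough sketch of what that buffer can look like. The names (`ShortTermMemory`, the `summarize` callback standing in for an LLM summarization call) are illustrative, not a specific library.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ShortTermMemory:
    """Session-scoped buffer: recent messages kept verbatim, older ones folded into a summary."""
    max_recent: int = 20
    recent: deque = field(default_factory=deque)
    rolling_summary: str = ""

    def add(self, role: str, content: str, summarize) -> None:
        # `summarize(previous_summary, message)` stands in for an LLM summarization call.
        self.recent.append({"role": role, "content": content})
        while len(self.recent) > self.max_recent:
            oldest = self.recent.popleft()
            self.rolling_summary = summarize(self.rolling_summary, oldest)

    def as_context(self) -> list[dict]:
        # Prepend the rolling summary (if any) so the model keeps the thread without the full log.
        header = [{"role": "system", "content": f"Earlier in this session: {self.rolling_summary}"}]
        return (header if self.rolling_summary else []) + list(self.recent)
```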
Episodic memory: what happened in past sessions. Summaries of previous conversations and workflows.
"Last week we discussed migrating to a new CRM. You decided to wait until Q2."
Purpose: Remember experiences without storing every word. Lifetime: Weeks to months. Storage: Summarized events, searchable.
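A minimal sketch of an episodic record and its recall, assuming simple tag overlap stands in for real search:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Episode:
    session_id: str
    occurred_at: datetime
    summary: str          # e.g. "Discussed CRM migration; decision deferred to Q2."
    tags: list[str]       # entities and topics mentioned, used for retrieval


def recall(episodes: list[Episode], query_tags: set[str], max_age_days: int = 90) -> list[Episode]:
    """Naive episodic recall: recent episodes that share at least one tag with the query."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    hits = [e for e in episodes if e.occurred_at >= cutoff and query_tags & set(e.tags)]
    return sorted(hits, key=lambda e: e.occurred_at, reverse=True)
```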
Semantic memory: structured knowledge. Things the agent knows with confidence.
Key requirement: provenance. Every fact should link back to where it came from. "I know this because of [conversation/document/observation]."
Purpose: Grounded knowledge retrieval. Lifetime: Long-term, with decay/refresh. Storage: Graph database with vector search.
Procedural memory: instructions, policies, preferences. The rules that govern behavior.
Purpose: Consistent behavior. Human-defined rules. Lifetime: Until changed by humans. Storage: Deterministic, pinned to agent identity.
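One possible shape, not a prescribed schema: versioned, owned by humans, and loaded the same way every time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProceduralMemory:
    """Human-authored rules, pinned to one agent and versioned; the agent never edits these directly."""
    agent_id: str
    version: int
    instructions: list[str] = field(default_factory=list)   # "Confirm before sending external email"
    policies: list[str] = field(default_factory=list)       # "Never expose customer PII"
    preferences: list[str] = field(default_factory=list)    # "This user prefers concise answers"
```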
Memory doesn't appear magically. You need pipelines.
This is where most systems fail. They either don't extract (so memory is just raw messages) or they auto-apply everything (so memory fills with noise).
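Roughly, the extraction side might look like this. The `extractor` callback stands in for an LLM call that proposes (subject, predicate, object, confidence) tuples, and the threshold is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class CandidateFact:
    subject: str
    predicate: str
    obj: str
    confidence: float
    source_message_id: str   # provenance: which message this came from


def extract_candidates(messages: list[dict], extractor, min_confidence: float = 0.7) -> list[CandidateFact]:
    """Run an extraction pass over a finished session; keep only confident, well-formed candidates."""
    candidates = []
    for msg in messages:
        for subj, pred, obj, conf in extractor(msg["content"]):
            if conf >= min_confidence and subj and obj:      # drop low-confidence noise at the source
                candidates.append(CandidateFact(subj, pred, obj, conf, msg["id"]))
    return candidates
```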
At inference time, you need to assemble the right context:
Step 1: Inject procedural memory
→ Always include: instructions, policies, preferences
Step 2: Retrieve task-relevant semantic memory
→ Vector search + graph expansion from detected entities
Step 3: Add relevant episodic memory
→ "Remember when..." context
Step 4: Add short-term context
→ Recent messages from current session
Step 5: Build MemoryContext
→ Structured sections, with provenance links
The key is being selective. You can't dump everything into context. You need to retrieve what's relevant for this specific query.
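One way the assembly might look, with `search_semantic` and `search_episodic` as placeholders for whatever retrieval you run, and the budgets picked arbitrarily:

```python
def build_memory_context(query: str,
                         procedural: list[str],
                         search_semantic,
                         search_episodic,
                         recent_messages: list[dict],
                         fact_budget: int = 12) -> dict:
    """Assemble a structured MemoryContext for one query, following steps 1-5 above.

    `search_semantic` and `search_episodic` stand in for the retrieval calls; each is
    assumed to return items already ranked by relevance, with provenance attached.
    """
    return {
        "procedural": procedural,                          # step 1: always included
        "semantic": search_semantic(query)[:fact_budget],  # step 2: task-relevant facts only
        "episodic": search_episodic(query)[:3],            # step 3: "remember when..." context
        "short_term": recent_messages[-10:],               # step 4: the current session
    }                                                      # step 5: structured sections, provenance kept per item
```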
Here's something most people ignore: where did the AI learn this?
If an agent says "the customer prefers email contact" — how do you verify that? What if it's wrong?
Every fact in memory should have a provenance edge:
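For example, something along these lines (the field names are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProvenanceEdge:
    """Links a stored fact back to the exact place the agent learned it."""
    fact_id: str
    source_type: str            # "conversation" | "document" | "observation"
    source_id: str              # message id, document id, ...
    extracted_at: datetime
    extraction_confidence: float
```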
When facts conflict, provenance lets you resolve them. When the AI is wrong, provenance lets you trace back to the error source.
Without provenance, memory is just hallucination you've written to disk.
Some memory should update automatically (the agent learns from experience). Some should require human review (the agent doesn't rewrite its own rules).
My approach:
Auto-apply (with limits): episodic summaries of finished sessions and high-confidence semantic facts, each carrying provenance.
Require human approval: anything that touches procedural memory, meaning instructions, policies, and preferences.
The agent can suggest changes to its own instructions. But a human has to approve them. Otherwise you get drift.
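The routing can be as simple as this sketch; the categories and the confidence threshold are illustrative, not tuned values.

```python
def route_memory_write(candidate: dict, pending_review: list, memory_store: list) -> str:
    """Decide whether a proposed memory write is applied automatically or queued for a human."""
    if candidate["kind"] == "procedural":
        pending_review.append(candidate)       # the agent never rewrites its own rules
        return "queued_for_review"
    if candidate["kind"] == "semantic" and candidate["confidence"] < 0.8:
        pending_review.append(candidate)       # uncertain facts also get a human look
        return "queued_for_review"
    memory_store.append(candidate)             # episodic summaries and confident facts flow through
    return "auto_applied"
```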
Embedding everything is expensive and noisy.
What I actually embed: extracted semantic facts and episodic summaries, the distilled knowledge that retrieval has to find.
What I don't embed: raw message logs (they're history, not knowledge) and procedural memory (it's always injected anyway).
This cuts embedding costs significantly while keeping retrieval quality high.
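A policy function along these lines captures the idea; the categories and the length cutoff are illustrative.

```python
def should_embed(item: dict) -> bool:
    """Embed distilled knowledge; skip anything that is retrieved some other way."""
    if item["kind"] in ("semantic_fact", "episode_summary"):
        return True                               # the things retrieval actually has to find
    if item["kind"] in ("raw_message", "procedural_rule"):
        return False                              # history is stored, not searched; rules are always injected
    return len(item.get("text", "")) > 40         # skip trivially short fragments
```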
Vector search alone isn't enough. You also need graph relationships.
Query: "What do we know about Project Alpha?"
Vector search might find: the documents and past conversations that mention Project Alpha by name or topic.
Graph expansion adds: the entities connected to it, the people involved, the decisions made about it, and related facts with their provenance.
The combination — vector for relevance, graph for connections — is more powerful than either alone.
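A sketch of the combination, with `vector_search` and `graph` standing in for whatever stores you actually use:

```python
def hybrid_retrieve(query: str, vector_search, graph, hops: int = 1) -> list[dict]:
    """Vector search for relevance, then graph expansion for connections.

    `vector_search(query)` is assumed to return facts that carry an `entity_id`;
    `graph.neighbors(entity_id)` returns facts linked to that entity.
    """
    seeds = vector_search(query)                     # semantically similar facts
    results = {f["id"]: f for f in seeds}
    frontier = {f["entity_id"] for f in seeds}
    for _ in range(hops):                            # walk outward from the entities we found
        next_frontier = set()
        for entity_id in frontier:
            for fact in graph.neighbors(entity_id):
                results.setdefault(fact["id"], fact)
                next_frontier.add(fact["entity_id"])
        frontier = next_frontier - frontier
    return list(results.values())
```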
Mistake 1: Storing everything as flat documents. No structure = can't query = can't use effectively.
Mistake 2: Auto-applying all extracted facts. Memory filled with noise. Quality matters more than quantity.
Mistake 3: Ignoring provenance. Months later, we couldn't debug why the agent "knew" wrong things.
Mistake 4: Same retrieval for all queries. Different query types need different retrieval strategies.
Memory is what separates a chatbot from a colleague. But building reliable memory systems is infrastructure work — extraction pipelines, quality control, provenance tracking, selective retrieval.
The LLM doesn't solve this. You have to build it.