AI Agent Memory That Doesn
83% of the hallucinations we traced in production weren't the LLM making things up — they were the agent confidently acting on memories that used to be true.
That number stopped me cold when I pulled the error logs from Familiar, the autonomous agent system we've been running in production for the better part of a year. I'd gone in expecting to find the usual suspects: context window overflows, retrieval misses, prompt injection artifacts. What I found instead was a subtler and more dangerous failure mode — one that no benchmark covers and no paper I've read adequately warns you about. Our AI agent memory architecture was working exactly as designed, and that was precisely the problem.
The Scale of the Problem Nobody Talks About
The conversation around AI agent memory has matured considerably in the past 18 months. Anthropic's recent work on making Claude domain-specific — teaching it to reason like a chemist with specialized knowledge grounding — illustrates the direction the field is moving: agents that carry persistent, structured knowledge about a domain, not just raw retrieval. That's the right instinct. But it sidesteps the temporal problem entirely.
Here's the failure mode in plain terms: an agent remembers that User A prefers concise responses, always works in EST, and is the sole decision-maker on budget approvals. Six months later, User A has been promoted and moved to London, and three direct reports now handle approvals under $50K. The agent has zero awareness of any of this. Its memory is pristine, well-indexed, and completely wrong.
In our system, we measured the following across a 90-day window:
- 34% of retrieved memories were older than 60 days with no freshness signal attached
- Agents acted on stale preference data in 61% of sessions where the user had meaningfully changed their behavior
- Average confidence score on stale retrievals: 0.91 — nearly identical to fresh retrievals at 0.94
That last number is the one that should alarm you. The retrieval pipeline couldn't distinguish between a memory that was accurate yesterday and one that had been wrong for two months. Both looked equally confident to the agent. Both got acted on with equal conviction.
What We Actually Built (And What Broke First)
Familiar's memory system started as a fairly standard single-tier vector store. We were embedding user interactions, preferences, and task outcomes into pgvector and retrieving by cosine similarity at query time. Fast, cheap, and — for the first few weeks — apparently working.
The first serious break came around week six. An agent was helping a user plan a recurring meeting series. It retrieved a memory from onboarding that the user "prefers morning slots, 9-10am." What it didn't know: the user had started a new project with a team in Singapore. Every 9am slot had been blocked for three weeks. The agent booked into those windows five times before the user explicitly told it to stop — and even then, the correction went into working memory but the original preference stayed in long-term storage, unmodified, with a 0.89 similarity score that kept pulling it back into context.
We rebuilt. The two-tier architecture we landed on separates memory by decay profile, not by content type.
Tier 1 — Volatile Context Memory: Short-lived embeddings with explicit TTLs. Anything session-derived lives here. We set aggressive expirations: 24 hours for behavioral signals, 7 days for stated preferences, 14 days for task outcomes. Nothing in Tier 1 is treated as ground truth. It's evidence.
Tier 2 — Anchored Fact Memory: Long-lived structured records, not embeddings. These are explicit facts the user has confirmed or that have been corroborated across multiple sessions: role, timezone, organizational relationships, hard constraints. These get updated through a separate reconciliation process, not through passive retrieval.
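As a rough illustration of the two tiers, here is a minimal in-memory sketch. The TTL values come from the description above; the class names, the `TTL_BY_KIND` mapping, and the `live_tier1` helper are hypothetical and are not Familiar's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# TTLs as stated in the text; the kind labels are illustrative names.
TTL_BY_KIND = {
    "behavioral_signal": timedelta(hours=24),
    "stated_preference": timedelta(days=7),
    "task_outcome": timedelta(days=14),
}


@dataclass
class VolatileMemory:
    """Tier 1: session-derived evidence with an explicit expiry."""
    kind: str
    text: str
    created_at: datetime

    def is_expired(self, now: datetime) -> bool:
        return now - self.created_at >= TTL_BY_KIND[self.kind]


@dataclass
class AnchoredFact:
    """Tier 2: a confirmed, structured fact. Only a reconciliation
    step (not shown here) is expected to change it."""
    key: str
    value: str
    confirmed_at: datetime


def live_tier1(memories: List[VolatileMemory], now: datetime) -> List[VolatileMemory]:
    """Drop expired Tier 1 entries before they can reach the context window."""
    return [m for m in memories if not m.is_expired(now)]


if __name__ == "__main__":
    now = datetime(2026, 1, 10, tzinfo=timezone.utc)
    memories = [
        VolatileMemory("stated_preference", "prefers morning slots", now - timedelta(days=3)),
        VolatileMemory("behavioral_signal", "declined a 9am invite", now - timedelta(hours=30)),
    ]
    for m in live_tier1(memories, now):
        print(m.kind, "|", m.text)
```

The point of the split is that expiry is a property of the record's kind, so nothing in Tier 1 survives long enough to be mistaken for ground truth.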
The critical design decision — and the one that took us longest to get right — was what happens when Tier 1 and Tier 2 contradict each other. Our first instinct was to let the agent resolve the conflict. That was wrong. Agents are terrible at this. They tend to weight whichever memory is more semantically similar to the current query, which has nothing to do with which memory is more likely to be current.
We moved conflict resolution upstream, into the memory retrieval layer itself. When a Tier 1 signal contradicts a Tier 2 anchor, the system surfaces both explicitly in the context window with timestamps and a conflict flag, rather than silently picking one. The agent's job is to ask the user to resolve it, not to guess.
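A minimal sketch of that retrieval-layer behavior, assuming memories have already been reduced to key/value pairs. The `Observation` and `ContextEntry` types and the rendered line format are hypothetical; the only behavior taken from the text is that both sides are surfaced with timestamps and a conflict flag instead of one being silently chosen.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Observation:
    key: str
    value: str
    observed_at: datetime


@dataclass
class ContextEntry:
    key: str
    anchor: Optional[Observation]   # Tier 2 fact, if any
    signal: Optional[Observation]   # newest Tier 1 signal, if any
    conflict: bool


def build_context(anchors: List[Observation], signals: List[Observation]) -> List[ContextEntry]:
    """Pair each key's anchor with its newest signal and flag disagreements.
    No winner is picked here; that decision is left to the user."""
    newest: Dict[str, Observation] = {}
    for s in signals:
        if s.key not in newest or s.observed_at > newest[s.key].observed_at:
            newest[s.key] = s
    anchor_by_key = {a.key: a for a in anchors}
    entries = []
    for key in sorted(set(anchor_by_key) | set(newest)):
        a, s = anchor_by_key.get(key), newest.get(key)
        conflict = a is not None and s is not None and a.value != s.value
        entries.append(ContextEntry(key, a, s, conflict))
    return entries


def render(entry: ContextEntry) -> str:
    """Format one entry as a line for the agent's context window."""
    parts = []
    for label, obs in (("anchor", entry.anchor), ("signal", entry.signal)):
        if obs is not None:
            parts.append(f"{label}={obs.value!r} @ {obs.observed_at.date().isoformat()}")
    flag = "CONFLICT: ask the user" if entry.conflict else "ok"
    return f"[{flag}] {entry.key}: " + "; ".join(parts)


if __name__ == "__main__":
    anchors = [Observation("timezone", "America/New_York", datetime(2025, 11, 3, tzinfo=timezone.utc))]
    signals = [Observation("timezone", "Europe/London", datetime(2026, 1, 17, tzinfo=timezone.utc))]
    for entry in build_context(anchors, signals):
        print(render(entry))
```

Comparing values by plain equality only works once memories are normalized to a common representation; on free-text memories this check would miss most real conflicts.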
The Cryptography Problem Hidden Inside Private Agent Memory
Matthew Green's analysis of Apple's Siri private inference architecture raises a point that's directly relevant here, even though it's framed as a privacy argument: when you push memory off-device or into opaque inference infrastructure, you lose auditability. You can't inspect what the agent remembers. You can't correct it. You can't timestamp it.
This is a memory architecture problem disguised as a privacy problem. Green's core complaint is that "private inference" systems make it structurally impossible to verify what data is being used in what context. For agent memory, that's not just a privacy concern — it's an operational one. If you can't audit your agent's memory, you can't diagnose the 83% problem I described above. You're flying blind.
Our team made a deliberate choice to keep memory infrastructure fully inspectable. Every memory record in Familiar carries: creation timestamp, last-accessed timestamp, source session ID, confidence score at write time, and a human-readable provenance string. That last field — "User stated this explicitly during onboarding interview, 2025-11-03" versus "Inferred from three consecutive session behaviors, 2026-01-17" — turns out to be one of the most useful debugging signals we have.
When an agent misbehaves, the first thing I do is pull the memory context it was operating on and read the provenance strings. Nine times out of ten, the problem is obvious within 30 seconds.
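To make the record shape concrete, here is a small sketch carrying the five fields listed above, plus a helper that prints the kind of audit view described. The field names, the `audit_lines` helper, and the sample session IDs and confidence values are assumptions for illustration; only the two provenance strings are taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class MemoryRecord:
    text: str
    created_at: datetime
    last_accessed_at: datetime
    source_session_id: str
    write_confidence: float   # confidence score at write time
    provenance: str           # human-readable origin of the memory


def audit_lines(records: List[MemoryRecord]) -> List[str]:
    """One readable line per record, oldest first, for quick debugging."""
    ordered = sorted(records, key=lambda r: r.created_at)
    return [
        f"{r.created_at.date().isoformat()} conf={r.write_confidence:.2f} "
        f"session={r.source_session_id} | {r.text} | {r.provenance}"
        for r in ordered
    ]


if __name__ == "__main__":
    records = [
        MemoryRecord(
            text="avoids 9am meetings",
            created_at=datetime(2026, 1, 17, tzinfo=timezone.utc),
            last_accessed_at=datetime(2026, 1, 20, tzinfo=timezone.utc),
            source_session_id="sess-0042",   # hypothetical ID
            write_confidence=0.70,           # hypothetical value
            provenance="Inferred from three consecutive session behaviors, 2026-01-17",
        ),
        MemoryRecord(
            text="prefers morning slots, 9-10am",
            created_at=datetime(2025, 11, 3, tzinfo=timezone.utc),
            last_accessed_at=datetime(2026, 1, 20, tzinfo=timezone.utc),
            source_session_id="sess-0001",   # hypothetical ID
            write_confidence=0.95,           # hypothetical value
            provenance="User stated this explicitly during onboarding interview, 2025-11-03",
        ),
    ]
    print("\n".join(audit_lines(records)))
```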
What the Data Actually Shows About Memory Decay
After implementing the two-tier system and running it for 60 days, here's what changed:
| Metric | Single-Tier Baseline | Two-Tier System | Delta |
|---|---|---|---|
| Stale memory retrieval rate | 34% | 8% | -76% |
| Agent actions on stale data | 61% of affected sessions | 14% of affected sessions | -77% |
| User correction events per 100 sessions | 22 | 6 | -73% |
| Average session confidence (user-rated) | 3.1 / 5 | 4.3 / 5 | +39% |
| False-positive memory conflicts surfaced | N/A | 11% of conflict flags | N/A (no baseline; acceptable) |
That 11% false-positive rate on conflict flags is worth acknowledging. Occasionally the system surfaces a conflict that isn't actually a conflict — the user's preference didn't change, there's just noise in the Tier 1 signal. Users find this mildly annoying but not harmful. We've tuned the conflict threshold upward over time, and it's sitting around 7% now. I'd rather have a 7% false-positive rate on conflict surfacing than a 61% rate on acting on wrong data.
The user-rated confidence jump from 3.1 to 4.3 was the number that surprised me most. Users weren't aware of the architectural change. They just noticed that the agent stopped doing things that felt wrong.
The Normalization Problem Is Upstream of Everything
One thing this work surfaced that I didn't expect: memory quality is a normalization problem before it's a retrieval problem. The principle is the same one driving recent discussion around email data normalization for automation pipelines — reliability starts with clean, consistently structured data at ingestion, not at query time.
We were writing memories in whatever shape the LLM produced them. That meant time-of-day preferences stored as "mornings," "9am EST," "before noon," and "not afternoons": all meaning roughly the same thing, none of them queryable in a consistent way. When we went to detect conflicts or update records, we were doing fuzzy string matching on semantically equivalent but syntactically incompatible strings.
We added a normalization pass at write time. Every memory record goes through a structured extraction step before storage — time preferences become cron-style windows, organizational relationships become typed graph edges, communication preferences become enumerated values from a fixed schema. This added about 40ms of latency per write and reduced our storage footprint by roughly 30% through deduplication of equivalent memories.
More importantly, it made conflict detection tractable. You can't reliably detect that "mornings" and "9am EST" are the same preference and that "9am EST" and "not before 11am" are in conflict unless both are normalized to a common representation first.
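A simplified sketch of why normalization makes this tractable. It is not our production extraction step: the hand-written `PHRASE_TABLE` stands in for the structured extraction pass, plain hour ranges stand in for cron-style windows, timezones are ignored, and the hour boundaries chosen for "mornings" are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeWindow:
    start_hour: int   # inclusive, 0-23
    end_hour: int     # exclusive, 1-24
    allowed: bool     # True = preferred window, False = excluded window


# Hypothetical lookup standing in for the structured extraction step.
PHRASE_TABLE = {
    "mornings": TimeWindow(6, 12, True),
    "9am est": TimeWindow(9, 10, True),
    "before noon": TimeWindow(0, 12, True),
    "not afternoons": TimeWindow(12, 18, False),
    "not before 11am": TimeWindow(0, 11, False),
}


def normalize(phrase: str) -> TimeWindow:
    key = phrase.strip().lower()
    if key not in PHRASE_TABLE:
        raise ValueError(f"no normalization rule for {phrase!r}")
    return PHRASE_TABLE[key]


def overlaps(a: TimeWindow, b: TimeWindow) -> bool:
    return max(a.start_hour, b.start_hour) < min(a.end_hour, b.end_hour)


def contains(outer: TimeWindow, inner: TimeWindow) -> bool:
    return outer.start_hour <= inner.start_hour and inner.end_hour <= outer.end_hour


def in_conflict(a: TimeWindow, b: TimeWindow) -> bool:
    """Two preferred windows conflict if they never overlap; a preferred
    window conflicts with an exclusion that swallows it entirely."""
    if a.allowed and b.allowed:
        return not overlaps(a, b)
    if a.allowed != b.allowed:
        allowed, excluded = (a, b) if a.allowed else (b, a)
        return contains(excluded, allowed)
    return False   # two exclusions never contradict each other


if __name__ == "__main__":
    pairs = [("mornings", "9am EST"), ("9am EST", "not before 11am")]
    for left, right in pairs:
        print(left, "vs", right, "->", in_conflict(normalize(left), normalize(right)))
```

Once both sides are windows, conflict detection is interval arithmetic instead of fuzzy string matching.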
What I'd Redesign From Scratch
If I were starting Familiar's AI agent memory architecture over today, here's what I'd do differently:
1. Treat every memory as having a half-life at write time, not at read time. We added TTLs reactively. They should be assigned based on memory type the moment a record is created, with the decay curve baked into the confidence score rather than handled as a separate expiration mechanism.
2. Build the reconciliation process before the retrieval process. We built retrieval first because it was the obvious thing to build. Reconciliation — the process by which new observations update or invalidate old memories — came later and was bolted on. It should be the core of the system.
3. Never let the LLM resolve memory conflicts autonomously. This sounds counterintuitive. The whole point of an autonomous agent is that it resolves things autonomously. But memory conflicts are a special case where the cost of a wrong resolution compounds over time. Surface the conflict to the user. Take the latency hit. It's worth it.
4. Version memory records, don't overwrite them. When a user's preference changes, we used to update the record in place. Now we append a new version and mark the old one superseded. This gives us a full history of how the agent's understanding evolved, which is invaluable for debugging and — increasingly — for explaining agent behavior to users who want to understand why it's doing what it's doing.
5. Separate "what the user told us" from "what we inferred." These are different epistemic categories and should be stored, retrieved, and weighted differently. Conflating them is where most of our early failures originated.
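Items 1, 4, and 5 can be sketched together: a half-life assigned by source type at write time, exponential decay folded into the confidence score, and append-only versioning. This is an illustration of the idea, not what Familiar runs; the `HALF_LIFE` values, the type and function names, and the exponential decay curve are all assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Hypothetical half-lives: stated facts decay slower than inferred ones.
HALF_LIFE = {
    "stated": timedelta(days=90),
    "inferred": timedelta(days=14),
}


@dataclass(frozen=True)
class MemoryVersion:
    key: str
    value: str
    source: str                 # "stated" or "inferred"
    write_confidence: float
    created_at: datetime
    superseded_at: Optional[datetime] = None


def effective_confidence(m: MemoryVersion, now: datetime) -> float:
    """Write-time confidence halved once per elapsed half-life."""
    age = max(now - m.created_at, timedelta(0))
    return m.write_confidence * 0.5 ** (age / HALF_LIFE[m.source])


def append_version(history: List[MemoryVersion], new: MemoryVersion) -> List[MemoryVersion]:
    """Mark the current version of the key as superseded and append the new
    one. Old versions are kept, so the full history stays inspectable."""
    updated = [
        replace(v, superseded_at=new.created_at)
        if v.key == new.key and v.superseded_at is None
        else v
        for v in history
    ]
    return updated + [new]


def current(history: List[MemoryVersion], key: str) -> Optional[MemoryVersion]:
    for v in reversed(history):
        if v.key == key and v.superseded_at is None:
            return v
    return None


if __name__ == "__main__":
    t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
    v1 = MemoryVersion("meeting_slot", "09:00-10:00", "stated", 0.9, t0)
    v2 = MemoryVersion("meeting_slot", "11:00-12:00", "stated", 0.9, t0 + timedelta(days=30))
    history = append_version([v1], v2)
    print(current(history, "meeting_slot").value)
    print(round(effective_confidence(v1, t0 + timedelta(days=90)), 3))
```

With decay baked into the score, a stale memory no longer arrives at retrieval looking as confident as a fresh one.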
The Cost of Getting This Wrong
The AI agent memory architecture problem isn't academic. As agents move from assistants to autonomous actors — scheduling, purchasing, communicating on behalf of users — the cost of a confident wrong memory scales with the autonomy you've granted. An agent that incorrectly remembers your meeting preferences wastes 20 minutes. An agent that incorrectly remembers your approval authority or your organizational relationships can create real organizational and financial damage.
The field is moving fast toward more capable, more autonomous LLM-driven agents. Anthropic is teaching models to reason like domain experts. OpenAI and others are pushing hard on long-horizon task completion. None of that capability is safe without memory infrastructure that knows what it knows, knows when it learned it, and knows when to admit it might be wrong.
Stale memories aren't a minor inconvenience. In a sufficiently autonomous agent, they're a liability. Build your memory layer like that's true from day one — because by the time you have the data to prove it, you'll already be cleaning up the damage.