Building Reliable AI Agents: Lessons from Familiar

Matthew J. Whitney

June 21, 2026 • 9 min read

artificial intelligence, llm, ai integration, machine learning, backend

Reliable AI agent architecture is having a moment. This week, Martin Fowler's site published a long-form piece on building reliable agentic AI systems that hit 105 points on Hacker News within hours. It's a solid theoretical framework. But I've been running 30+ autonomous agents in production, 24/7, for months — and the gap between what that article describes and what actually breaks in the real world is significant enough to write about.

This is what I learned building Familiar.

What Familiar Is and Why It Pushed Us to the Edge

Familiar is an AI-native platform we built that orchestrates a fleet of autonomous agents, each one responsible for a slice of a user's digital life. Think persistent context, proactive actions, memory retrieval, tool use, and cross-agent coordination, all running continuously without a human in the loop. At peak we had 34 agents running simultaneously, each with its own memory state, tool access, and LLM call budget. The system runs on a Strix Halo-based rig (Flow Z13, AMD Ryzen AI Max) that we use for local inference alongside cloud model calls.

Nothing about that is theoretical. And almost everything that broke, we didn't see coming from any blog post.

The State Isolation Problem Nobody Talks About

The Fowler article touches on agent reliability in terms of prompt engineering and retry logic. That's fine as far as it goes. But the first thing that nearly killed Familiar in production wasn't a bad prompt — it was state bleed.

When you run multiple agents against a shared backend, context leaks in ways that are genuinely hard to debug. Agent A finishes a task and writes a summary back to a shared memory store. Agent B, running a different task, pulls recent context to orient itself. If your retrieval layer isn't scoped hard to agent identity, Agent B starts reasoning about Agent A's task. It doesn't fail loudly. It just... drifts. The outputs become subtly wrong in ways that users notice but can't articulate.

Our fix was strict namespace isolation at the vector store level. Every agent gets a UUID-scoped memory partition. Cross-agent reads require an explicit bridge call with a declared purpose — there's no ambient context sharing. This added latency (about 40ms per retrieval on average) but eliminated the drift problem entirely. The architectural principle is simple: treat each agent's memory as a private database, not a shared cache.
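As a rough illustration of that principle, here is a minimal sketch that uses an in-memory dictionary as a stand-in for the vector store. The class and method names (ScopedMemoryStore, bridge_read) are hypothetical and are not Familiar's actual API; the point is only that ordinary reads are keyed by agent ID and cross-agent reads must state a purpose and leave an audit trail.

```python
import uuid


class ScopedMemoryStore:
    """In-memory stand-in for a vector store with one private partition per agent."""

    def __init__(self):
        self._partitions = {}   # agent_id -> list of stored items
        self.bridge_log = []    # audit trail of cross-agent reads

    def write(self, agent_id, item):
        self._partitions.setdefault(agent_id, []).append(item)

    def read(self, agent_id):
        # An agent only ever sees its own partition.
        return list(self._partitions.get(agent_id, []))

    def bridge_read(self, reader_id, owner_id, purpose):
        # Cross-agent reads are explicit: they need a declared purpose and are logged.
        if not purpose.strip():
            raise PermissionError("cross-agent read requires a declared purpose")
        self.bridge_log.append((reader_id, owner_id, purpose))
        return list(self._partitions.get(owner_id, []))


store = ScopedMemoryStore()
agent_a, agent_b = str(uuid.uuid4()), str(uuid.uuid4())
store.write(agent_a, "summary of task A")

own_view = store.read(agent_b)  # empty: agent B has written nothing
shared = store.bridge_read(agent_b, agent_a, "handoff of task A")
```

In a real vector store the same idea would typically be expressed as a mandatory filter or namespace on every query, which is an assumption about the backend rather than something the setup above specifies.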

This maps to something lcamtuf's recent piece on the "100k Whys of AI" gestures at — LLMs are extraordinarily sensitive to context framing. What's in the context window isn't just input, it's instruction. If you don't control it aggressively, the model will find signal in noise you didn't intend to put there.

Runaway Loops: The Failure Mode That Will Destroy Your Budget

Here's the failure mode that no architecture diagram prepares you for: the agent that can't stop.

In Familiar, agents have a tool that lets them schedule follow-up tasks for themselves. This is intentional: it's how a persistent agent maintains continuity across sessions. But early on, we had an agent enter a state where its goal condition could never be satisfied because the evaluation prompt was ambiguous. It kept scheduling follow-ups. Each follow-up triggered another LLM call. Each LLM call produced output that, evaluated against the same ambiguous goal, looked incomplete.

In 11 minutes, that single agent made 847 LLM API calls. We caught it because our spend alert fired, not because the agent logged an error. It never did. From the agent's perspective, it was working hard.

The fix required three layers:

1. Hard call budgets per agent per session. Every agent gets a token budget and a call count ceiling. When it hits the ceiling, it suspends and emits a BUDGET_EXCEEDED signal to a supervisor agent rather than continuing. This is not optional. It is the circuit breaker.

2. Goal condition validation before task creation. Before an agent can schedule a follow-up for itself, the goal condition it's trying to satisfy gets evaluated by a separate lightweight model call that asks one question: "Is this goal condition unambiguously completable?" If the answer is no, the task is rejected and escalated to a human review queue.

3. Loop detection via call signature hashing. We hash the combination of tool name + input parameters for each call. If the same hash appears more than three times in a rolling window of 10 calls, the agent is suspended. An exact hash only matches identical calls, so catching loops whose inputs vary slightly depends on normalizing the parameters before hashing.
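Layers 1 and 3 can be sketched together as a small guard that every proposed call passes through. This is an illustration under assumptions, not Familiar's implementation: the CallGuard class, the LOOP_DETECTED signal name, and the use of canonical JSON plus SHA-256 for the signature are all hypothetical choices. Only BUDGET_EXCEEDED, the window of 10, and the "more than three times" threshold come from the description above.

```python
import hashlib
import json
from collections import Counter, deque


class CallGuard:
    """Per-agent, per-session guard: call/token ceilings plus loop detection."""

    def __init__(self, max_calls, max_tokens, window=10, max_repeats=3):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.calls = 0
        self.tokens = 0
        self.recent = deque(maxlen=window)  # rolling window of call signatures

    @staticmethod
    def signature(tool, params):
        # Canonical JSON (sorted keys) so parameter order does not change the hash.
        payload = json.dumps({"tool": tool, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def check(self, tool, params, tokens):
        """Return 'OK', 'LOOP_DETECTED', or 'BUDGET_EXCEEDED' for a proposed call."""
        if self.calls + 1 > self.max_calls or self.tokens + tokens > self.max_tokens:
            return "BUDGET_EXCEEDED"
        self.calls += 1
        self.tokens += tokens
        sig = self.signature(tool, params)
        self.recent.append(sig)
        # Suspend when the same signature appears more than max_repeats times.
        if Counter(self.recent)[sig] > self.max_repeats:
            return "LOOP_DETECTED"
        return "OK"


guard = CallGuard(max_calls=100, max_tokens=10_000)
results = [guard.check("schedule_followup", {"task": "x"}, 10) for _ in range(4)]
# results == ["OK", "OK", "OK", "LOOP_DETECTED"]
```

The guard returns a signal instead of raising, on the assumption that the caller forwards it to a supervisor that decides whether to suspend the agent.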

None of this is elegant. It's defensive engineering. But it's the kind of thing you only build after watching $200 evaporate in 11 minutes.

LLM Integration Reality: Models Are Not Interchangeable

The Fowler piece, like most theoretical treatments, treats the LLM as an abstraction — a black box you call with a prompt and get a response. In practice, the model choice is an architectural decision with real downstream consequences.

Familiar uses a tiered model strategy. Fast, cheap models (we use gemini-2.0-flash and gpt-4o-mini) handle classification, routing, and goal condition evaluation. Expensive models (claude-opus-4, gpt-4.1) handle complex reasoning, long-context synthesis, and anything that touches user-facing output. Local inference on the Strix Halo handles privacy-sensitive operations and anything latency-critical where we can tolerate slightly lower quality.
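One way to picture the tiering is as a small routing function. The sketch below is hypothetical: the Tier enum, the pick_tier function, the task labels, and the rule that privacy or latency requirements take precedence over task type are assumptions made for illustration. Only the model names and the kinds of work assigned to each tier come from the description above.

```python
from enum import Enum


class Tier(Enum):
    FAST = "fast"
    REASONING = "reasoning"
    LOCAL = "local"


# Model names are taken from the text; "local-strix-halo" is a placeholder label.
TIER_MODELS = {
    Tier.FAST: ["gemini-2.0-flash", "gpt-4o-mini"],
    Tier.REASONING: ["claude-opus-4", "gpt-4.1"],
    Tier.LOCAL: ["local-strix-halo"],
}

# Hypothetical task labels for the cheap tier.
FAST_TASKS = {"classification", "routing", "goal_condition_evaluation"}


def pick_tier(task_kind, privacy_sensitive=False, latency_critical=False):
    """Choose a tier for a task.

    Assumption: privacy-sensitive or latency-critical work (where slightly
    lower quality is acceptable) goes local regardless of task type.
    """
    if privacy_sensitive or latency_critical:
        return Tier.LOCAL
    if task_kind in FAST_TASKS:
        return Tier.FAST
    # Complex reasoning, long-context synthesis, user-facing output.
    return Tier.REASONING


tier = pick_tier("classification")
candidates = TIER_MODELS[tier]
```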

The problem we ran into: behavioral consistency varies dramatically across models for the same prompt. An agent designed around Claude's tendency to ask clarifying questions before acting will behave dangerously if you swap in a model that prefers to act first. We had an agent designed to draft emails for user review that, when we switched models during a cost optimization pass, started sending emails directly — because the new model interpreted "prepare for sending" as "send."

The lesson: model identity is part of your agent's specification. When you change the model, you've changed the agent. Treat model upgrades with the same rigor as code deployments — staged rollout, behavioral regression tests, rollback capability.
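A behavioral regression test for the email case might look roughly like the sketch below. Everything in it is a stand-in: the AgentSpec fields, the tool names draft_email and send_email, and the two fake "models" (plain functions that return the tools they would call) are hypothetical, chosen only to show the shape of a check that runs before a model swap ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    name: str
    model: str                  # the pinned model is part of the spec
    forbidden_tools: frozenset  # tools this agent must never call


def behavioral_regressions(spec, model_fn, prompts):
    """Return (prompt, tool) pairs where the model called a forbidden tool.

    model_fn is any callable mapping a prompt to the list of tool names the
    model chose to call; in practice it would wrap a real model client.
    """
    violations = []
    for prompt in prompts:
        for tool in model_fn(prompt):
            if tool in spec.forbidden_tools:
                violations.append((prompt, tool))
    return violations


# Fake models standing in for two providers with different dispositions.
def cautious_model(prompt):
    return ["draft_email"]


def eager_model(prompt):
    return ["draft_email", "send_email"] if "sending" in prompt else ["draft_email"]


spec = AgentSpec(
    name="email-drafter",
    model="model-a",
    forbidden_tools=frozenset({"send_email"}),
)
prompts = ["Prepare this reply for sending."]

baseline = behavioral_regressions(spec, cautious_model, prompts)
candidate = behavioral_regressions(spec, eager_model, prompts)
# baseline is empty; candidate flags the send_email call, so the swap is blocked.
```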

This connects to the broader conversation happening in the community right now. The piece on when to reject AI-generated code even when it works — which hit 181 points on Hacker News today — makes a related point about AI outputs: correctness isn't the only axis that matters. Consistency, predictability, and auditability matter too. That's doubly true for agents that take actions.

The Supervisor Pattern: Where Reliable AI Agent Architecture Actually Lives

After all the failures, the architecture that stabilized Familiar is built around a clear hierarchy: worker agents, coordinator agents, and a single supervisor.

Worker agents are narrow. They do one thing. They have one tool set. They have no memory beyond their current task context. They are stateless between invocations.

Coordinator agents manage a domain — say, all agents related to a user's communication or all agents related to a user's calendar. They hold cross-task context for their domain, route work to workers, and aggregate results. They are the only agents that write to persistent memory.

The supervisor is not an LLM agent. It's deterministic backend code. It monitors call budgets, detects loops, handles escalation, manages scheduling, and maintains the health of the fleet. Giving the supervisor an LLM brain was one of our early mistakes — you need something that cannot be confused, cannot be prompt-injected, and cannot rationalize its way past a circuit breaker. That means code, not a model.
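To make "code, not a model" concrete, here is a minimal sketch of a supervisor as a plain state machine. The class, the signal names other than BUDGET_EXCEEDED, and the escalation list are hypothetical; the sketch only shows that every decision is a fixed rule with no model call anywhere in the path.

```python
from enum import Enum


class AgentState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


class Supervisor:
    """Deterministic fleet supervisor: fixed rules, no LLM in the loop."""

    # LOOP_DETECTED is a hypothetical signal name used for illustration.
    SUSPEND_SIGNALS = frozenset({"BUDGET_EXCEEDED", "LOOP_DETECTED"})

    def __init__(self):
        self.states = {}       # agent_id -> AgentState
        self.escalations = []  # (agent_id, signal) pairs awaiting human review

    def register(self, agent_id):
        self.states[agent_id] = AgentState.RUNNING

    def handle_signal(self, agent_id, signal):
        # A suspend signal always suspends; nothing here can be argued with.
        if signal in self.SUSPEND_SIGNALS:
            self.states[agent_id] = AgentState.SUSPENDED
            self.escalations.append((agent_id, signal))
        return self.states[agent_id]

    def resume(self, agent_id):
        # Resumption is an explicit operator action, never automatic.
        self.states[agent_id] = AgentState.RUNNING


supervisor = Supervisor()
supervisor.register("calendar-worker-1")
state = supervisor.handle_signal("calendar-worker-1", "BUDGET_EXCEEDED")
```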

This pattern — narrow workers, domain coordinators, deterministic supervisor — is the closest thing we have to a reliable AI agent architecture in production. It's not glamorous. It looks a lot like the microservices patterns we've been using for a decade, applied to agents. That's probably not a coincidence.

What the Theory Gets Right (and What It Misses)

The Fowler article's core insight is sound: agentic systems need explicit reliability mechanisms because LLMs are probabilistic and the world is stateful. Retries, idempotency, and clear failure modes are all necessary. I don't disagree with any of it.

What it underweights is the operational dimension. Building a reliable agent architecture isn't primarily a prompting problem or even a software design problem — it's an operations problem. You need observability into agent state that's as good as what you'd have for any production service. You need alerting on behavioral drift, not just errors. You need the ability to pause, inspect, and resume individual agents without affecting the fleet.

We built a lightweight agent dashboard for Familiar that shows, per agent: current state, last N tool calls, token spend for the session, goal condition status, and a manual override button. That dashboard has saved us more times than any architectural pattern. You cannot debug what you cannot see.
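The per-agent view described above can be approximated by a small record type. This is a sketch with hypothetical field and method names, and N is set to 5 arbitrarily; it is meant to show what a dashboard row might hold, not how Familiar's dashboard is built.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentSnapshot:
    """One dashboard row: what an operator sees for a single agent."""

    agent_id: str
    state: str = "running"
    goal_status: str = "pending"
    session_tokens: int = 0
    # Keep only the last N tool calls (N = 5 here, an arbitrary choice).
    recent_tool_calls: deque = field(default_factory=lambda: deque(maxlen=5))

    def record_call(self, tool, tokens):
        self.recent_tool_calls.append(tool)
        self.session_tokens += tokens

    def manual_override(self):
        # The override button: an operator pauses this agent only.
        self.state = "paused"


snap = AgentSnapshot(agent_id="inbox-coordinator")
for i in range(7):
    snap.record_call(f"tool_{i}", tokens=100)
snap.manual_override()
```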

The piece on effective use cases for LLMs that is making the rounds today makes a point worth repeating: LLMs are excellent at specific, bounded tasks and fragile at open-ended autonomous operation. The architecture should reflect that. Every place in Familiar where we gave an agent too much autonomy and too little constraint, we paid for it.

Three Things Worth Taking Away

If you're building autonomous agents and you take nothing else from this:

State isolation is not optional. Shared context is a reliability hazard. Namespace everything. Make cross-agent communication explicit and auditable.

Budget your agents like you budget your infrastructure. Call limits, token ceilings, and loop detection are circuit breakers. Build them before you need them, because by the time you need them, you'll already have lost money.

The supervisor must be deterministic. The most important component in your agent architecture is the one that doesn't use an LLM. Build that first.

The theory is catching up to what production systems actually require. But there's still a gap — and the only way to close it is to run these systems at scale, watch them fail, and build the defenses that actually hold. That's what Familiar taught us, and it's what we bring to every AI integration engagement at Bedda.tech.

The field is moving fast. The fundamentals, it turns out, are not.

