bedda.tech

Familiar: 30+ AI Agents Running 24/7 Without Chaos

Matthew J. Whitney
11 min read
Tags: artificial intelligence, llm, ai integration, machine learning, backend

Here's the deal: every autonomous AI agent framework looks incredible in a demo. You give it a task, it reasons through steps, calls tools, produces output, and the audience claps. Then you try to run thirty of them simultaneously, around the clock, and the whole thing turns into a dumpster fire by Tuesday morning.

That's not a hypothetical. That's what happened to us before we built Familiar.

I want to talk about what actually keeps a production multi-agent system alive — not the theory, not the research paper version, but the unglamorous plumbing decisions that determine whether your agents are still running when you wake up at 2am.

What Familiar Actually Is (And Why We Built It)

Familiar is our internal orchestration layer for running persistent, autonomous AI agents. We didn't build it because we wanted to — we built it because nothing else handled our specific constraints: agents that need to maintain context across days, not just a single session; agents that interact with external systems that fail unpredictably; and agents that need to compose with each other without turning into a deadlock nightmare.

We're currently running 32 active agents across several workloads: content intelligence, backend monitoring, code review assistance, data pipeline validation, and a few things I can't talk about publicly yet. They run on a combination of Claude 3.7 Sonnet, GPT-4o, and a locally hosted Mistral instance on my Flow Z13 / Strix Halo rig for latency-sensitive tasks that can't tolerate a round trip to an API endpoint.

The architecture has three hard-won pillars: state isolation, runaway loop prevention, and output signal management. Let me break down each one.

State Isolation: The Problem Nobody Warns You About

Here's what most guides miss: agent state corruption is the silent killer of multi-agent systems. It doesn't crash loudly. It just quietly poisons your agent's behavior over time until it starts producing nonsense that looks plausible.

The naive approach is to give each agent a conversation history and let it grow. This works fine for an agent that runs for five minutes. For an agent that's been running for 72 hours, you end up with a context window stuffed with stale tool outputs, contradicted assumptions, and error messages from three days ago that the agent is still trying to compensate for.

What we do in Familiar is treat agent state as explicitly scoped memory tiers:

Working memory — the current task context, what the agent is doing right now. This gets cleared aggressively. When a task completes or fails, working memory is flushed.

Session memory — facts the agent has established during its current operational window (we define a window as roughly 8 hours of wall-clock time, regardless of actual activity). This is stored as a structured JSON document, not raw conversation history. The agent can read and write to it, but it's schema-validated on every write. An agent can't accidentally corrupt session memory with freeform text.

Long-term memory — distilled facts that have been validated and promoted. Promotion is not automatic. It requires either a human approval step or sign-off from a separate validation agent. This sounds heavy, but it's the only thing that prevented our content intelligence agent from permanently "learning" an incorrect fact about a client's product line.

The insight here is that LLMs don't have a good internal sense of what's stale. They'll confidently act on information from 60 conversation turns ago even when more recent information contradicts it. You have to enforce recency externally, at the architecture level.
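The tiering above can be sketched roughly as follows. This is a minimal illustration, not Familiar's actual code: the `AgentMemory` class, the keys in `SESSION_SCHEMA`, and the choice to clear session memory when the 8-hour window rolls over are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical schema: allowed session-memory keys and their required types.
SESSION_SCHEMA = {"client_name": str, "open_tickets": int, "last_deploy_ok": bool}

WINDOW_SECONDS = 8 * 60 * 60  # the roughly 8-hour operational window


class SchemaError(ValueError):
    """Raised when a session-memory write does not match the schema."""


@dataclass
class AgentMemory:
    window_started: float
    working: dict = field(default_factory=dict)  # current task only
    session: dict = field(default_factory=dict)  # structured facts for this window

    def finish_task(self):
        # Working memory is flushed whenever a task completes or fails.
        self.working.clear()

    def write_session(self, key, value, now):
        # Enforce recency first: an expired window starts with empty session memory.
        if now - self.window_started >= WINDOW_SECONDS:
            self.session.clear()
            self.window_started = now
        expected = SESSION_SCHEMA.get(key)
        if expected is None:
            raise SchemaError(f"unknown session key: {key!r}")
        # Exact type match, so a bool cannot slip into an int field.
        if type(value) is not expected:
            raise SchemaError(f"{key!r} must be {expected.__name__}")
        self.session[key] = value
```

The point of the sketch is that staleness is handled by the container, not by the model: the agent never gets the chance to write freeform text into session memory, and old facts disappear on a clock rather than on the agent's judgment.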

LLM Loop Prevention: Runaway Agents Are a Real Operational Risk

The scariest thing I've seen in production agentic systems isn't hallucination. It's an agent that gets into a reasoning loop and starts burning tokens at $0.30 per thousand while convincing itself it's making progress.

We had a monitoring agent that hit a tool timeout, concluded the tool was returning bad data, decided to validate the data through a secondary tool, which also timed out, decided that meant there was a systemic issue, and started spawning diagnostic sub-tasks. In 40 minutes it had consumed $23 in API costs and done exactly nothing useful.

Familiar now enforces three distinct loop-breaking mechanisms:

Step budgets per task. Every task gets a maximum number of reasoning steps before it must either complete, escalate, or abort. This is non-negotiable. The agent doesn't get to decide it needs more steps — that decision is made by a human or a separate supervisor agent.

Semantic drift detection. We embed the agent's stated goal at task creation, then embed its current reasoning summary every N steps. If the cosine similarity between current reasoning and the original goal drops below a threshold (we've tuned this to 0.72 for our workloads), the agent gets a hard interrupt and has to re-anchor to its original objective. This catches the subtle drift cases that step counts miss: an agent that's technically taking steps but has completely lost the plot.

Tool call deduplication with exponential backoff. If an agent calls the same tool with the same arguments twice within a rolling window, it gets a synthetic error response telling it the tool is temporarily unavailable. This sounds crude, but it kills 80% of the loops we were seeing in early testing. Loops almost always involve repeatedly calling the same failing tool.
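Here is a rough sketch of the three mechanisms side by side. It is illustrative only: the class and function names are made up for this example, the embedding vectors are assumed to come from whatever embedding model you already use, and the way backoff is applied (doubling the lockout each time the same call repeats) is my assumption about one reasonable reading, not a description of Familiar's internals.

```python
import json
import math


class StepBudgetExceeded(Exception):
    """Raised when a task has used all of its allotted reasoning steps."""


class StepBudget:
    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.used = 0

    def consume(self):
        # Called once per reasoning step; the agent has no way to raise the cap.
        if self.used >= self.max_steps:
            raise StepBudgetExceeded(f"budget of {self.max_steps} steps exhausted")
        self.used += 1


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def has_drifted(goal_vec, reasoning_vec, threshold=0.72):
    # True means: interrupt the agent and re-anchor it to the original goal.
    return cosine_similarity(goal_vec, reasoning_vec) < threshold


class ToolCallDeduper:
    def __init__(self, base_window=60.0):
        self.base_window = base_window
        self._seen = {}  # (tool, canonical args) -> (last call time, lockout window)

    def check(self, tool, args, now):
        """Return None if the call may proceed, else a synthetic error payload."""
        key = (tool, json.dumps(args, sort_keys=True))
        entry = self._seen.get(key)
        if entry is not None:
            last, window = entry
            if now - last < window:
                # Repeat inside the window: double the lockout and refuse the call.
                self._seen[key] = (now, window * 2)
                return {"error": "tool temporarily unavailable", "retry_after": window * 2}
        self._seen[key] = (now, self.base_window)
        return None
```

In an agent loop, `consume()` would run before every reasoning step, `has_drifted()` every N steps, and `check()` before every tool dispatch. All three sit outside the model, which is what makes them enforceable.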

The contract testing angle here is worth noting — we've been watching Strictland, a new contract testing library announced this week for message compatibility. The problem it's solving — ensuring that message producers and consumers agree on schema — maps almost directly onto the tool call interface problem in agentic systems. When an agent's tool call schema diverges from what the tool actually accepts, you get silent failures that turn into loops. We're evaluating whether Strictland's approach translates to validating agent-tool contracts at runtime.
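To make the agent-tool contract idea concrete, here is a small hand-rolled check that runs before a tool call is dispatched. It does not use Strictland or reflect its API; the `TOOL_CONTRACTS` table, the `fetch_metrics` tool, and the `validate_tool_call` function are hypothetical names invented for this sketch.

```python
# Hypothetical tool contracts: argument name -> required type.
TOOL_CONTRACTS = {
    "fetch_metrics": {"service": str, "minutes": int},
}


def validate_tool_call(tool, args):
    """Return a list of contract violations; an empty list means the call is safe to dispatch."""
    contract = TOOL_CONTRACTS.get(tool)
    if contract is None:
        return [f"unknown tool: {tool}"]
    problems = []
    for name, expected in contract.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif type(args[name]) is not expected:
            got = type(args[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    for name in args:
        if name not in contract:
            problems.append(f"unexpected argument: {name}")
    return problems
```

Returning the violations to the agent as an explicit, loud error is the useful part: the failure stops being silent, so the agent has something specific to correct instead of something vague to retry.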

Machine Learning Meets Operational Reality: Making Output Readable

Thirty-two agents running 24/7 produce an enormous amount of output. If you try to read all of it, you fail. If you ignore all of it, you miss the things that matter. The signal-to-noise problem in multi-agent systems is underappreciated.

Our approach is what we internally call "progressive disclosure" for agent output:

Level 0 — heartbeat. Every agent emits a structured heartbeat every 5 minutes: status (running/blocked/completed/error), current task description (one sentence), and step count. This is what hits our monitoring dashboard. No operator reads this — it feeds alerting rules.

Level 1 — task summaries. When a task completes, the agent produces a 2-3 sentence summary of what it did and what it found. This is the thing a human actually reads during a morning review. It's enforced by the task completion schema — agents can't emit a completion event without a summary that passes a length and format check.

Level 2 — full trace. Every reasoning step, every tool call, every intermediate output, stored in append-only logs. Nobody reads this unless something goes wrong. When it does go wrong, having the full trace is the difference between a 10-minute debugging session and a 3-hour one.

The key insight is that the agent itself generates Level 1 summaries as part of task completion; we don't post-process them separately. This means each summary is written while the agent still has full context, which makes it dramatically more useful than anything you could generate after the fact by summarizing the logs.
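A sketch of what the Level 0 and Level 1 shapes might look like is below. The field names, the 600-character cap, and the naive sentence splitter are assumptions for illustration; the only parts taken from the description above are the four status values, the one-sentence task description, the step count, and the 2-3 sentence summary requirement.

```python
import re
from dataclasses import dataclass

VALID_STATUSES = {"running", "blocked", "completed", "error"}


@dataclass(frozen=True)
class Heartbeat:
    agent_id: str
    status: str
    task: str        # one-sentence description of the current task
    step_count: int

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")


def count_sentences(text):
    # Naive split: whitespace that follows '.', '!' or '?' ends a sentence.
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])


def completion_event(task_id, summary, max_chars=600):
    """Build a completion event, refusing summaries that fail the format check."""
    n = count_sentences(summary)
    if not 2 <= n <= 3:
        raise ValueError(f"summary must be 2-3 sentences, got {n}")
    if len(summary) > max_chars:
        raise ValueError("summary too long")
    return {"type": "task_completed", "task_id": task_id, "summary": summary}
```

Because the check lives in the constructor for the completion event, there is no code path where a task finishes without a readable summary attached.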

AI Integration Architecture: The Supervisor Pattern

One thing we got wrong early: treating all 32 agents as peers. In practice, you need a supervision hierarchy, and it needs to be explicit.

Familiar organizes agents into clusters of 4-8, each with a designated supervisor agent. The supervisor doesn't do domain work — its only job is to monitor the agents in its cluster, detect anomalies, and escalate to the human operator layer when something is outside its ability to resolve.

This is not a novel idea (it maps closely to how LangGraph models multi-agent hierarchies), but the implementation details matter. Our supervisors use a different, cheaper model than the worker agents (GPT-4o mini for supervisors vs. Claude 3.7 Sonnet for complex workers) because supervisor tasks are pattern matching and anomaly detection, not reasoning-heavy work. This alone cut our API costs by about 18% while actually improving oversight quality, because the supervisor can run more frequently.

The supervisor also handles the state promotion decisions I mentioned earlier. When a worker agent wants to write to long-term memory, it submits the proposed write to its supervisor, which validates it against existing long-term memory for contradictions before approving. This is the closest thing we have to a "consensus" mechanism, and it's prevented several instances of conflicting facts making it into persistent storage.
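The promotion gate can be illustrated with a deliberately simple stand-in. This sketch assumes long-term memory is a flat key-value store and treats "same key, different value" as a contradiction; a real supervisor would presumably use a model to judge contradictions, and `review_promotion` is a hypothetical name for this example.

```python
def review_promotion(long_term, proposed):
    """Supervisor-side check: apply a proposed write only if nothing contradicts stored facts."""
    conflicts = {
        key: {"stored": long_term[key], "proposed": value}
        for key, value in proposed.items()
        if key in long_term and long_term[key] != value
    }
    if conflicts:
        # Reject the whole write so a partial update cannot sneak through.
        return {"approved": False, "conflicts": conflicts}
    long_term.update(proposed)
    return {"approved": True, "conflicts": {}}
```

Rejecting the entire write on any conflict is a conservative choice: it leaves the disagreement visible for escalation instead of silently keeping whichever fact arrived last.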

Backend Reliability: The Boring Stuff That Actually Matters

I'm going to be direct about something: the most impactful reliability improvements we made to Familiar had nothing to do with AI. They were boring backend engineering.

Idempotent task execution. Every task has a content-addressed ID derived from its inputs. If an agent crashes mid-task and restarts, it resumes from the last checkpoint rather than starting over. This required us to make all tool calls idempotent, which was painful but necessary.

Structured logging with correlation IDs. Every agent action, every tool call, every state transition carries a correlation ID that ties it back to the originating task. When something goes wrong — and things go wrong — you can reconstruct exactly what happened across all 32 agents without guessing.

Circuit breakers on external dependencies. If an external API starts failing, we don't want 32 agents all hammering it simultaneously. Familiar uses a shared circuit breaker registry so that when one agent detects a dependency is down, all agents that depend on it get notified immediately and switch to degraded-mode behavior rather than queuing up retry storms.
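For the idempotent execution item, a content-addressed ID plus checkpointing might look something like this. The `task_id`, `CheckpointStore`, and `run_task` names are invented for the sketch, the store is an in-memory stand-in for whatever durable storage you use, and the example only works if each step is itself safe to retry.

```python
import hashlib
import json


def task_id(task_type, inputs):
    # Canonical JSON (sorted keys, fixed separators) so equal inputs hash equally.
    canonical = json.dumps(
        {"type": task_type, "inputs": inputs}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CheckpointStore:
    """In-memory stand-in for a durable checkpoint table."""

    def __init__(self):
        self._completed = {}  # task id -> number of steps already finished

    def completed_steps(self, tid):
        return self._completed.get(tid, 0)

    def mark_done(self, tid, step_index):
        self._completed[tid] = step_index + 1


def run_task(tid, steps, store):
    """Run the remaining steps of a task and return how many were executed this time."""
    start = store.completed_steps(tid)
    for index in range(start, len(steps)):
        steps[index]()  # a crash here leaves the checkpoint at `index`
        store.mark_done(tid, index)
    return len(steps) - start
```

Because the ID is derived from the inputs, a restarted agent that is handed the same task computes the same ID and picks up at the first unfinished step instead of redoing everything.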

This is the part of the autonomous AI agent framework that conference talks never cover, because it's not exciting. But it's the reason Familiar has had 99.1% uptime over the past four months while similar systems I've seen at other companies are lucky to get through a week without manual intervention.
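The shared circuit breaker registry from this section is the easiest piece to sketch. The version below is an in-process approximation with assumed defaults (three consecutive failures to open, a 30-second cooldown); a real deployment with agents in separate processes would need the state in something shared, and `CircuitBreakerRegistry` is a hypothetical name.

```python
class CircuitBreakerRegistry:
    def __init__(self, failure_threshold=3, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._failures = {}   # dependency -> consecutive failure count
        self._opened_at = {}  # dependency -> time the breaker last opened

    def record_failure(self, dependency, now):
        self._failures[dependency] = self._failures.get(dependency, 0) + 1
        if self._failures[dependency] >= self.failure_threshold:
            # Open (or re-open) the breaker for every agent sharing this registry.
            self._opened_at[dependency] = now

    def record_success(self, dependency):
        # Any success closes the breaker and resets the failure count.
        self._failures.pop(dependency, None)
        self._opened_at.pop(dependency, None)

    def allow(self, dependency, now):
        opened = self._opened_at.get(dependency)
        if opened is None:
            return True
        # Once the cooldown has passed, let trial calls through again.
        return now - opened >= self.cooldown
```

Each agent asks `allow()` before touching a dependency and falls back to degraded-mode behavior when it returns `False`, so one agent's failed calls protect the other agents from piling on.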

Where This Is Going

DeepSeek's new Vision capabilities announced this week are interesting to us specifically because multimodal input changes the tool call surface for agents substantially. Several of our agents currently have to OCR screenshots as a preprocessing step before passing data to an LLM. Native vision support in a model we can potentially run locally would collapse that pipeline significantly and reduce one of our more common failure modes.

The broader trend I'm watching is the shift from "agents that complete tasks" to "agents that maintain ongoing relationships with systems." That's a fundamentally different reliability problem. A task-completing agent can fail and retry. An agent that's maintaining a long-running relationship with an external system — tracking state, managing expectations, handling exceptions — needs the kind of operational rigor we've been building into Familiar from the start.

The Honest Recommendation

If you're standing up a multi-agent system and you're not yet running it in production, build the supervision hierarchy first. Before you worry about which LLM to use, before you optimize your prompts, build the scaffolding that will tell you when things go wrong.

The autonomous AI agent framework that survives production is not the one with the best reasoning capabilities — it's the one where you can actually see what's happening, interrupt runaway processes, and recover from failures without manual archaeology through logs.

State isolation, loop prevention, and output signal management aren't features you add later. They're the foundation. Build them first, or you'll be rebuilding your entire system six weeks after launch when the demo-to-production gap finally catches up with you.

We learned that the hard way. You don't have to.
