
What Running 65 AI Agents in Production Taught Us This Week

BeddaTech

For the last 8 months, I have been building Familiar — a personal AI system that runs around 65 agents on a single server. Not a hosted service, not a managed platform: one Linux box with a SQLite database, a Node.js scheduler, and a lot of agents claiming and completing tasks on cron schedules.

This week was infrastructure week. Not the fun kind where you ship features — the kind where things break in interesting ways and you learn why.

The problem with AI agent outputs

Every agent that runs writes a result. The dashboard shows that result as a notification. Simple enough — except agents started returning things like "Did some work." or "I checked the task queue." instead of useful information.

The root cause: when you ask an LLM to summarize its own transcript, it pattern-matches on what a completion looks like, not on what actually happened. The agent wrote code, ran tests, pushed a commit — but the summary was "I have completed the assigned task."

The fix was structural. We added a hard output convention: every agent ends its run with a ==SUMMARY== block and a ==END== marker. The runner extracts that block directly — no LLM-summarizes-LLM step, no guesswork. If the marker is missing, the dashboard shows "(no agent summary)" as a loud failure signal rather than a silent one.
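
A minimal sketch of what that extraction could look like, not Familiar's actual runner. The function name `extractSummary` is hypothetical, and two details are assumptions: it reads the last ==SUMMARY== block in the transcript, and it treats an empty block the same as a missing one.

```typescript
// Marker strings and fallback text come from the convention described above.
const SUMMARY_MARKER = "==SUMMARY==";
const END_MARKER = "==END==";
const MISSING_SUMMARY = "(no agent summary)";

// Hypothetical helper: pull the text between the last ==SUMMARY== marker
// and the first ==END== marker that follows it.
function extractSummary(transcript: string): string {
  const start = transcript.lastIndexOf(SUMMARY_MARKER);
  if (start === -1) return MISSING_SUMMARY;

  const bodyStart = start + SUMMARY_MARKER.length;
  const end = transcript.indexOf(END_MARKER, bodyStart);
  if (end === -1) return MISSING_SUMMARY; // unterminated block counts as missing

  const body = transcript.slice(bodyStart, end).trim();
  // Assumption: an empty block is as useless as no block at all.
  return body.length > 0 ? body : MISSING_SUMMARY;
}
```

The point of the design is that the fallback is a fixed string the dashboard can show and a human can search for, rather than a plausible-sounding sentence generated after the fact.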

We rolled this out to 151 agent prompts in one day: 68 system prompts, 35 schedule prompts, 48 prompt files. Engineers, planners, scouts, product agents — all updated. Within two cron cycles, summaries were clean.

The pattern is obvious in hindsight: structured output beats summarization when you control both sides of the conversation.

Fleet health monitoring that actually worked

The summary fix was one piece. The deeper issue was how to monitor a 65-agent fleet.

This week we found that one engineering agent was consistently reporting "build clean, pushed" in its summaries while CI was actually failing. Root cause: the gh CLI was defaulting to the wrong GitHub personal access token (PAT), matt-krain instead of matt-bedda, getting a 404 on bedda-tech repos, and treating the 404 as a non-error. Six engineering prompts were patched with explicit PAT routing rules.
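
The actual fix here was at the prompt level, but the underlying rule (a failed lookup is not a passing build) can be sketched in code. Everything below is hypothetical: `CiLookup`, `classifyCi`, and `canReportClean` are illustrative names, not gh or GitHub API types, and the conclusion strings are examples only.

```typescript
// Hypothetical shape of a CI status lookup; not a real gh or GitHub API type.
interface CiLookup {
  httpStatus: number;   // status of the lookup request itself
  conclusion?: string;  // e.g. "success" or "failure" when the lookup worked
}

type CiVerdict = "passed" | "failed" | "unknown";

function classifyCi(lookup: CiLookup): CiVerdict {
  // A non-2xx lookup (such as a 404 from the wrong token) says nothing about the build.
  if (lookup.httpStatus < 200 || lookup.httpStatus >= 300) return "unknown";
  if (lookup.conclusion === "success") return "passed";
  if (lookup.conclusion === "failure") return "failed";
  return "unknown"; // still running, cancelled, or any value this sketch does not know
}

// Only an explicit pass may be reported as "build clean".
function canReportClean(lookup: CiLookup): boolean {
  return classifyCi(lookup) === "passed";
}
```

With a three-way verdict, "I could not see CI" becomes its own reportable state instead of collapsing into success.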

We also filed a deploy-failure monitor (task #1647): a poll every 15 minutes across the Vercel API, GitHub Actions, and email that fires a task and auto-triggers the relevant engineering agent the first time a failure is detected. It uses three signal sources because no single one is complete.
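
One way the "first failure detection" part could work is sketched below. This is an assumption-heavy illustration, not the monitor from task #1647: the `FailureSignal` shape, the `deployId` key used for de-duplication, and the `Poller` type are all hypothetical, and no real Vercel, GitHub, or mail API is called.

```typescript
type Source = "vercel" | "github-actions" | "email";

// Hypothetical signal shape; deployId is an assumed key shared across sources.
interface FailureSignal {
  source: Source;
  deployId: string;
  repo: string;
}

// Hypothetical poller: each source returns the failures it currently sees.
type Poller = () => Promise<FailureSignal[]>;

// Return only failures not seen before, recording them in `seen` so that
// the same deploy reported by a second source does not trigger twice.
function newFailures(seen: Set<string>, batches: FailureSignal[][]): FailureSignal[] {
  const fresh: FailureSignal[] = [];
  for (const batch of batches) {
    for (const signal of batch) {
      if (seen.has(signal.deployId)) continue;
      seen.add(signal.deployId);
      fresh.push(signal);
    }
  }
  return fresh;
}

// One polling pass: query every source in turn, then trigger once per new failure.
async function pollOnce(
  pollers: Poller[],
  seen: Set<string>,
  trigger: (signal: FailureSignal) => void,
): Promise<number> {
  const batches: FailureSignal[][] = [];
  for (const poll of pollers) {
    try {
      batches.push(await poll());
    } catch {
      // One unreachable source must not hide what the others report.
    }
  }
  const fresh = newFailures(seen, batches);
  fresh.forEach(trigger);
  return fresh.length;
}
```

A scheduler would call `pollOnce` every 15 minutes with one poller per source. Keeping the de-duplication in a plain function makes it easy to test without touching any network.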

Running a fleet is less about building agents and more about building the mechanisms that catch agents when they fail quietly.

The open-source question

Letta repositioned this week directly onto Familiar's core differentiator: "you own the memory, you choose the model." They have funding and a developer audience. The window to publish Familiar's runtime as a reference implementation is narrowing.

I am evaluating whether to extract and publish the core scheduler + task loop as a standalone OSS package. The pieces that would matter: the cron-as-messages bus, the claim/complete task workflow, the AGENTS.md institutional-knowledge pattern, and the exit-99 pre-hook skip for graceful no-ops. Not the full system, just the part that other people could build on.
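
To make two of those pieces concrete, here is a rough sketch of a claim/complete workflow and an exit-99 check. It is illustrative only: the in-memory `Map` stands in for the SQLite database, the class and function names are hypothetical, and reading exit code 0 as "run" and any other non-99 code as a failure is an assumption about how the pre-hook is interpreted.

```typescript
type TaskStatus = "open" | "claimed" | "done";

interface Task {
  id: number;
  status: TaskStatus;
  claimedBy?: string;
  result?: string;
}

// In-memory stand-in for the task table (the real system uses SQLite).
class TaskQueue {
  private tasks = new Map<number, Task>();

  add(id: number): void {
    this.tasks.set(id, { id, status: "open" });
  }

  // A claim succeeds only while the task is open, so two agents cannot both own it.
  claim(id: number, agent: string): boolean {
    const task = this.tasks.get(id);
    if (!task || task.status !== "open") return false;
    task.status = "claimed";
    task.claimedBy = agent;
    return true;
  }

  // Only the agent that claimed the task may complete it.
  complete(id: number, agent: string, result: string): boolean {
    const task = this.tasks.get(id);
    if (!task || task.status !== "claimed" || task.claimedBy !== agent) return false;
    task.status = "done";
    task.result = result;
    return true;
  }
}

const SKIP_EXIT_CODE = 99;
type PreHookDecision = "run" | "skip" | "fail";

// Assumed mapping: 0 runs the agent, 99 is a graceful no-op, anything else is an error.
function decideFromPreHook(exitCode: number): PreHookDecision {
  if (exitCode === 0) return "run";
  if (exitCode === SKIP_EXIT_CODE) return "skip";
  return "fail";
}
```

Reserving a dedicated exit code for "nothing to do" keeps a quiet cron cycle distinguishable from a crashed one.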

The AGENTS.md pattern in particular seems worth publishing as a standalone spec. The HN thread on AI cron failures this week named "cold-start identity loss" as an unsolved problem across 30+ frameworks. AGENTS.md is the answer we have been running in production for months.

If you are building in this space, I would be curious what is breaking for you.


Matt Whitney, Bedda Technologies — building Familiar, Nozio, Deft, Crowdia, and more in public
