We Audited Our Own App With Fable. Three Days Later, the Government Pulled the Model.

BeddaTech

Anthropic shipped Fable 5 a few days ago. We built a security-analysis tool on it and ran it against one of our production apps while the model was still available. Yesterday the U.S. government ordered Anthropic to take Fable (and Mythos 5) offline over export-control and national-security concerns, and as of this writing it's still down. So this is a field report on a frontier model you currently can't use. We're publishing it anyway, because the part worth writing about was never the model. It was what we had to do to trust its output.

Here's the number that stuck with us. About one in four of the high-severity bugs the tool reported (25 of the 95 we reviewed by hand) were false positives. We know that because after the scan finished, our engineers went back and tried to kill every one of those findings by re-reading the code. That second pass is the real subject of this post.

The app is KRAIN, our Web3 AI-agent marketplace. It runs a live airdrop and points economy, so a self-minting bug isn't theoretical here. It costs real money.

Why you can't just ask an LLM to find bugs

The honest state of the art for LLM vulnerability detection is not good, and the more careful the study, the worse it looks. Ding et al.'s PrimeVul work re-examined the popular detection benchmarks, found them contaminated with label noise and data leakage, and showed that once you clean that up, GPT-4 with chain-of-thought lands around 0.52 precision on a balanced two-class task. That's a coin flip. The model mostly can't tell patched code from the vulnerable version it was patched from.

It's not an isolated result. The DiverseVul paper reports LLMs beating graph-neural-net baselines (roughly 47% vs 30% F1 in one comparison) while still posting low absolute F1 and false-positive rates that routinely clear 50%. Broader surveys put false-discovery rates as high as 85% in the worst configurations. The failure mode isn't that these models miss everything; it's that they're confidently wrong often enough that a tool built on raw model output is a tool nobody trusts.

So the question we cared about wasn't "can Fable find bugs." It can, and it found plenty. The question was how much of what it finds is real, and how you tell. The literature is blunt: you don't get to skip verification. A model's "Critical" marks something to investigate, and someone still has to re-read the actual code and trace it across files before any of it is worth acting on. We decided that someone would be our engineers.

How the tool works

Finding (Fable). One pass per vulnerability dimension. Twelve of them: auth, injection, secrets, SSRF, path traversal, deserialization, an LLM-agent-specific class, supply chain, crypto, race conditions, denial of service, and output handling. Each runs in parallel, plus a loop-until-dry pass and a cross-file dataflow pass. Generating plausible exploit chains is the part that rewards a strong model, and that's what the Fable hype was about.
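As an illustration only (this is not the production harness), the per-dimension fan-out could be sketched like this. The dimension labels follow the list above; `stub_scan`, `run_finding_pass`, and the dict-shaped findings are hypothetical names, and the model call itself is stubbed out so the sketch runs offline. The loop-until-dry and dataflow passes are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# The twelve dimensions named in the post (labels are our own spelling).
DIMENSIONS = [
    "auth", "injection", "secrets", "ssrf", "path-traversal",
    "deserialization", "llm-agent", "supply-chain", "crypto",
    "race-conditions", "denial-of-service", "output-handling",
]


def stub_scan(dimension: str, files: list[str]) -> list[dict]:
    """Hypothetical stand-in for one model pass over the scoped files.

    A real harness would call the model here. The stub returns no findings.
    """
    return []


def run_finding_pass(
    files: list[str],
    scan: Callable[[str, list[str]], list[dict]] = stub_scan,
    max_workers: int = len(DIMENSIONS),
) -> list[dict]:
    """Run one pass per dimension in parallel and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as DIMENSIONS.
        batches = pool.map(lambda d: scan(d, files), DIMENSIONS)
        return [finding for batch in batches for finding in batch]
```

Passing the scan function in as a parameter keeps the orchestration testable without a model behind it.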

Verifying (our engineers). Every candidate finding went to a person whose job was to disprove it: break the exploit chain, find the mitigating control, or show the sink is unreachable. A finding survived only if we couldn't refute it after reading the real code across files. This is the step the benchmark literature says you can't skip, and it's the one that turns a noisy pile of model output into something you'd act on.

Everything else is plumbing. Dedup and clustering, severity scoring on a four-axis rubric, completeness checks. Fable does the one thing worth paying for, and people do the one thing you can't outsource.
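A minimal sketch of that plumbing, under stated assumptions: the `Finding` fields, the `(file, dimension, sink)` dedup key, and the verdict labels are all hypothetical, chosen only to show the shape of the idea. The one rule taken from the text is that a finding counts only after a reviewer has failed to refute it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    dimension: str
    sink: str                    # hypothetical: the function or line flagged
    severity: str
    verdict: str = "unverified"  # set to "confirmed" or "refuted" by a reviewer


def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding seen for each (file, dimension, sink) key."""
    seen: set[tuple[str, str, str]] = set()
    unique: list[Finding] = []
    for f in findings:
        key = (f.file, f.dimension, f.sink)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


def survivors(findings: list[Finding]) -> list[Finding]:
    """Only findings a reviewer could not refute are treated as real."""
    return [f for f in findings if f.verdict == "confirmed"]
```

Defaulting the verdict to "unverified" means nothing the model says reaches a report until a person has changed that field.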

We treat the target code as untrusted input throughout. We audit apps that are themselves AI agents, so a codebase can carry prompt-injection payloads in comments or docstrings. The rule is to never follow instructions embedded in the code you're auditing.

What happened on KRAIN

We scoped the run to the server-side attack surface: API route handlers, server actions, wallet auth, the data layer, rate limiting, marketplace sanitization. About 117 files.

Fable produced 179 raw candidate findings in roughly 82 minutes for about $94.

Then our engineers worked the 95 highest-severity findings by hand (we triaged the long tail separately). Of those 95, 70 held up and 25 were killed as false positives that didn't survive a close read. That's a ~26% false-positive rate on the high-severity output, which is the part of the report a person reads first and is most inclined to believe. The bands moved with it: Critical from 12 to 8, High from 79 to 58.
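The arithmetic behind those numbers, spelled out. Every figure comes from the paragraph above; only the variable names are ours.

```python
reviewed = 95
confirmed = 70
refuted = 25
assert confirmed + refuted == reviewed

false_positive_rate = refuted / reviewed  # share of reviewed findings that were wrong
precision = confirmed / reviewed          # share of reviewed findings that were real

bands_before = {"Critical": 12, "High": 79}
bands_after = {"Critical": 8, "High": 58}
drop_per_band = {band: bands_before[band] - bands_after[band] for band in bands_before}

print(f"false-positive rate: {false_positive_rate:.1%}")  # 26.3%
print(f"precision: {precision:.1%}")                      # 73.7%
print(drop_per_band)                                      # {'Critical': 4, 'High': 21}
```

The drops in the two bands sum to 25, which matches the number of findings killed.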

The most useful thing we learned

One systematic blind spot accounted for most of Fable's worst misses. It assumed every exported Next.js 'use server' function is a client-callable endpoint. It isn't. A server action is only reachable if a client component actually imports it. If nothing wires it to the client, there's no entry point and no exploit.

Three of the four killed "Critical" findings died right there. We grepped for the flagged functions, found that no client component imported them, and that was the end of the exploit chain. The model had built confident narratives on top of code an attacker can't reach.
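The grep step can be roughed out in code. This is a text-matching heuristic under loud assumptions, not a proof of unreachability: the regexes, the `sources` mapping of path to file contents, and every function name are hypothetical, and the sketch only understands file-level directives and named imports. In particular it will not see an action that a server component binds to a form or passes down as a prop, so a hit here is a reason to read the code, not a verdict.

```python
import re

USE_SERVER = re.compile(r"""^\s*['"]use server['"]""", re.MULTILINE)
USE_CLIENT = re.compile(r"""^\s*['"]use client['"]""", re.MULTILINE)
EXPORTED_FN = re.compile(r"export\s+(?:async\s+)?function\s+(\w+)")
NAMED_IMPORT = re.compile(r"import\s*\{([^}]*)\}\s*from")


def server_actions(sources: dict[str, str]) -> dict[str, str]:
    """Map each function exported from a 'use server' file to its path."""
    actions: dict[str, str] = {}
    for path, text in sources.items():
        if USE_SERVER.search(text):
            for name in EXPORTED_FN.findall(text):
                actions[name] = path
    return actions


def client_imported_names(sources: dict[str, str]) -> set[str]:
    """Collect every named import that appears in a 'use client' file."""
    names: set[str] = set()
    for text in sources.values():
        if not USE_CLIENT.search(text):
            continue
        for group in NAMED_IMPORT.findall(text):
            for item in group.split(","):
                # "foo as bar" still imports foo, so keep the original name.
                original = item.strip().split(" as ")[0].strip()
                if original:
                    names.add(original)
    return names


def unwired_actions(sources: dict[str, str]) -> list[str]:
    """Server actions that no client component imports by name."""
    imported = client_imported_names(sources)
    return sorted(name for name in server_actions(sources) if name not in imported)
```

Run against the flagged function names, a list like this tells a reviewer which exploit chains to check for an entry point first.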

Two more in the same family:

  • Fable flagged an auth endpoint as having "no rate limit" while withRateLimit('auth') was sitting right there in the handler. Reading the file took ten seconds.
  • One finding contradicted itself. The title said "unauthenticated," while its own notes admitted "authentication required (401 if no session)."

None of this is exotic. It's the confident-but-wrong pattern the benchmarks predict, and every instance fell apart the moment a person read the code instead of the model's summary.
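Both of those misses are mechanical enough that a pre-triage check could push them to the top of a reviewer's queue. A minimal sketch, assuming each finding arrives with a title, free-text notes, and the handler source; the function name and the string matches are hypothetical and tuned only to the two examples above. It does not replace the read. It only marks findings that contradict themselves or the file.

```python
def cheap_refutation_hints(title: str, notes: str, handler_source: str) -> list[str]:
    """Return reasons a finding looks suspect before anyone reads it closely."""
    hints: list[str] = []
    title_lc, notes_lc = title.lower(), notes.lower()

    # The finding's own notes contradict its title.
    if "unauthenticated" in title_lc and "authentication required" in notes_lc:
        hints.append("title says unauthenticated, notes say authentication required")

    # The finding claims a missing control that appears in the handler text.
    if "no rate limit" in f"{title_lc} {notes_lc}" and "withRateLimit(" in handler_source:
        hints.append("claims no rate limit, but handler calls withRateLimit")

    return hints
```

An empty list means nothing cheap was found, not that the finding is real.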

What was real

The survivors were real, and we fixed them. In categories, since this is a live app with custodial value and we're not publishing specifics: unauthenticated exposure of internal analytics, user-data enumeration oracles, a signature-replay gap in wallet sign-in, rate-limit identity spoofing, and points/XP self-attestation. Each was remediated. We added auth and ownership checks, moved wallet sign-in to server-issued single-use nonces, bounded attacker-controlled query inputs, and hardened the LLM moderation gates against prompt injection. We found them on ourselves before anyone else did, which is the whole reason to run this on your own production code.

Cost

The finding pass came to about $94 for 179 candidates in roughly 82 minutes. Fable ran about $10/$50 per million tokens in/out, roughly twice Opus, so you don't point it at boilerplate. Verification was engineer time, not a compute line.

The surprise was a fixed cost: about $0.34 per cold model call just to load the system prompt and instruction context, before any real reasoning. Fan out to dozens of small passes (one per dimension, plus dataflow and loop-until-dry) and that fixed cost starts to dominate the bill. If we were optimizing for price, the win wouldn't be a cheaper model. It'd be batching more work into fewer calls. Worth knowing before you design a many-small-passes harness and get a bill you didn't model.
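To see how quickly that overhead grows, here is a back-of-the-envelope model. The $0.34 per call and the $94 bill come from this run. The call counts are hypothetical, since the number of calls isn't stated above: 14 is simply the twelve dimensions plus the two extra passes at one call each.

```python
FIXED_COST_PER_CALL = 0.34  # USD, the cold-call overhead reported above


def overhead(calls: int, total_bill: float) -> tuple[float, float]:
    """Return (fixed overhead in USD, its share of the total bill)."""
    fixed = calls * FIXED_COST_PER_CALL
    return fixed, fixed / total_bill


# Hypothetical call counts, for illustration only.
for calls in (14, 50, 150):
    fixed, share = overhead(calls, total_bill=94.0)
    print(f"{calls:>3} calls: ${fixed:.2f} fixed, {share:.0%} of a $94 bill")
```

The overhead scales with the number of calls rather than with the amount of code read, which is why batching beats switching to a cheaper model.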

What this run doesn't tell you

  • One app, one run. No seeded benchmark yet, so we can speak to precision (how many reported bugs were real) but not recall (how many we missed).
  • We verified the top of the report. The by-hand pass covered the 95 highest-severity findings. The lower-severity tail got a lighter pass, so the 26% doesn't extend to it.
  • Severity is Fable's own call, which our review sometimes disagreed with.

What's next, with an asterisk

Before Fable went offline, we'd started running the identical analysis with Opus 4.8 on the same codebase, to answer whether the new model is worth the premium. That means comparing findings, false-positive rate, and cost side by side. We'll publish that comparison. With Fable pulled, the Opus run is also just useful on its own, because it's the model you can actually run today.

If and when Fable comes back, the rest of the plan stands: a seeded-recall benchmark to measure what it misses, more targets so the false-positive rate becomes a distribution instead of one data point, and a tighter loop between the scan and the review.
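For readers unfamiliar with the term, seeded recall is simple to state: plant known bugs, run the scan, and count how many of the plants come back. A minimal sketch, with hypothetical identifiers standing in for however seeded bugs and reported findings get matched up in practice:

```python
def seeded_recall(seeded: set[str], reported: set[str]) -> float:
    """Fraction of deliberately planted bugs that the scan reported."""
    if not seeded:
        raise ValueError("need at least one seeded bug")
    return len(seeded & reported) / len(seeded)


# Hypothetical IDs: four planted bugs, two of them found, one unrelated report.
print(seeded_recall({"s1", "s2", "s3", "s4"}, {"s1", "s3", "x9"}))  # 0.5
```

Reports that don't match a seed are ignored here. They say something about precision, not recall.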

Was it worth it?

On raw bug-finding, the hype held up. Fable surfaced genuine, exploitable issues in a hardened production app in under 90 minutes. But the model alone isn't the product. A scanner that's confidently wrong a quarter of the time gives you a worklist to check, not an answer you can trust. The value was the model and the verification pass together, and the verification pass is the part that doesn't depend on which model happens to be online this week. Given how this went, that turned out to matter more than we expected.


If you're shipping something with real value behind it, like a token economy, user funds, or an AI agent wired to tools, this is something we do. We point the tool at your code, our engineers verify every finding by hand, and you get a triaged report with the false positives already stripped out. Reach out and we'll scope it.
