bedda.tech logobedda.tech
← Back to blog

Building a $0 AI Agent Pipeline with llama.cpp + Vulkan

Matthew J. Whitney
10 min read
artificial intelligencellmai integrationmachine learning
---
title: 'Building a $0 AI Agent Pipeline with llama.cpp + Vulkan'
date: '2026-07-01'
description: 'How we built a $0/month local AI agent pipeline using llama.cpp and Vulkan — real benchmarks, model comparisons, and honest cost vs. quality tradeoffs.'
author: 'Matthew J. Whitney'
tags: ['artificial intelligence', 'llm', 'ai integration', 'machine learning']
category: 'ai_ml'
published: true
---

Here's the deal: our local AI agent pipeline now handles roughly 80% of the internal software engineering tasks that we were previously paying Claude and GPT-4 to do. Monthly API cost went from ~$400 to $0. The hardware already existed. And no, the quality isn't the same — but it's good enough, more often than you'd think, and the tradeoffs are real and specific enough that I can actually tell you when to use which.

This isn't a "local LLMs are the future" hype post. It's a breakdown of what we actually built, what broke, what surprised us, and what I'd do differently if I started over today.

---

## The Hardware Reality: Strix Halo Changes the Math

Everything I'm about to describe runs on an ASUS ROG Flow Z13 with the Strix Halo APU — the one with the 40 RDNA 3.5 compute units sharing a 128GB unified memory pool. That last part is the critical detail. Unified memory means the GPU and CPU share the same physical RAM, which means a 70B parameter model at Q4_K_M quantization (~40GB) fits entirely in VRAM without any CPU offloading. No layer splitting. No bottleneck on the PCIe bus.

Before this machine, I was running a Ryzen 9 desktop with a 3090 (24GB VRAM). That setup forced me to offload 20–30 layers of a 34B model to RAM, which tanked inference speed to ~4 tokens/second. Barely usable for interactive work. The Flow Z13 changed that completely. We're seeing 18–22 tokens/second on Qwen2.5-72B-Instruct at Q4_K_M through llama.cpp with Vulkan backend. That's the threshold where it stops feeling like waiting and starts feeling like thinking.

If you're evaluating this approach and you don't have unified memory with at least 64GB accessible to the GPU, the math is different. I'll flag where that matters.

---

## Why Vulkan, Not CUDA or Metal

When I first set this up, the obvious question was why Vulkan. The answer is simple: the Flow Z13 runs AMD graphics, CUDA doesn't exist here, and Metal is macOS-only. Vulkan is the cross-platform GPU compute layer that llama.cpp supports on AMD hardware.

What I didn't expect was how solid the [llama.cpp Vulkan backend](https://github.com/ggerganov/llama.cpp) has become. A year ago it was genuinely flaky — random hangs on large context windows, quantization kernels that were slower than CPU on certain ops. As of build b3400+, it's stable. I've run 8-hour agentic loops without a crash. The performance gap vs. CUDA on equivalent hardware has narrowed to the point where it's not a meaningful factor in model selection anymore.

One real gotcha: Vulkan requires you to set `GGML_VULKAN=1` at compile time and explicitly pass `-ngl 99` (number of GPU layers) when running inference. If you forget `-ngl`, llama.cpp silently falls back to CPU and you spend 20 minutes wondering why your 72B model is running at 1.2 tokens/second. I've done this more than once.

---

## The Agent Architecture: What We Actually Built

The pipeline itself is not complicated. I want to be direct about this because a lot of "AI agent" content makes it sound like there's magic involved. There isn't.

We use [LangChain](https://python.langchain.com/docs/integrations/llms/llamacpp/) with the llama.cpp Python bindings as the inference backend. The agent loop is a standard ReAct pattern: the model reasons about a task, selects a tool, gets a result, reasons again. Tools are thin Python wrappers around: bash execution, file read/write, git operations, and a vector store (ChromaDB) for retrieval against our internal codebase.

The model does the thinking. The tools do the doing. The loop runs until the model emits a stop token or hits a step limit.

What makes this a *production* setup rather than a demo is the scaffolding around the loop: structured logging of every tool call and response, automatic context window management (we truncate and summarize conversation history when approaching the model's limit), and a dead-man's-switch that kills the process if it hasn't emitted output in 90 seconds. That last one has saved me from runaway loops more times than I'd like to admit.

---

## Model Selection: The Honest Ranking

After months of testing, here's where I actually landed on models for different task types:

**Qwen2.5-72B-Instruct Q4_K_M** is our workhorse. Code generation, refactoring, explaining complex systems, writing technical documentation. It's the model I reach for when the task has real stakes. The instruction following is tight, it rarely hallucinates API signatures for well-known libraries, and it handles our 8K-token context windows without degrading.

**Qwen2.5-Coder-32B-Instruct Q5_K_M** is the specialist. Anything where the task is purely code — generating boilerplate, writing tests, doing mechanical refactors — this model is faster and often more accurate than the 72B generalist. The smaller size means lower latency, which matters in tight agentic loops where you're making 15–20 model calls per task.

**Mistral-7B-Instruct Q8_0** is the throwaway. Classification, routing, quick yes/no decisions, extracting structured data from text. It runs at ~80 tokens/second on this hardware. When the agent needs to decide "is this error message about a network timeout or a parsing failure," I don't need 72 billion parameters to answer that.

The tiered approach — big model for reasoning, medium model for code, small model for routing — is the single biggest performance optimization we made. It's also what most guides miss.

---

## What Most Guides Miss: Context Window Management Is the Real Problem

Everyone talks about model selection. Nobody talks about context window management, and it will absolutely destroy your agentic pipeline if you ignore it.

Here's what happens in practice: your agent starts a task, makes a few tool calls, accumulates results in the conversation history, and 10 steps in, the context is 90% tool output noise. The model's attention is now spread across 6,000 tokens of bash output and file contents, and its reasoning quality degrades noticeably. You get hallucinated tool calls, repeated actions, and eventually the model gets confused about what it's already tried.

Our solution is a summarization step that triggers at 60% context utilization. We call the small Mistral model to compress the conversation history into a structured summary: what was attempted, what succeeded, what failed, what the current state is. That summary replaces the raw history. The 72B model picks up from the summary with a clean context. It works surprisingly well and adds maybe 2–3 seconds of latency per summarization event.

The other thing guides miss: stop sequences matter enormously. Without carefully tuned stop sequences, models in agentic loops will sometimes continue generating after they've issued a tool call, essentially hallucinating the tool's response. We define explicit XML-style delimiters for tool calls and strictly parse only what appears between those tags. Anything the model generates outside them gets discarded.

---

## The Cost vs. Quality Tradeoff, Honestly

I want to be straight about where local falls short, because the "replace everything with local LLMs" framing is misleading.

Complex multi-step reasoning tasks — the kind where Claude 3.5 Sonnet genuinely shines — are still noticeably better with a frontier model. If I'm doing deep architectural analysis of an unfamiliar codebase, or reasoning about subtle concurrency bugs, or writing a technical proposal that needs to be genuinely persuasive, I still reach for Claude. The quality gap is real.

What I've found is that the gap is task-specific, not universal. Mechanical tasks, well-scoped tasks, tasks with clear success criteria — local models handle these at 90%+ the quality of frontier models. Open-ended reasoning, novel problem solving, tasks requiring broad world knowledge — the gap is larger, maybe 70–75% quality.

For our internal engineering work, roughly 80% of tasks fall into the "mechanical and well-scoped" bucket. That's why the economics work. We're not replacing frontier models for everything; we're replacing them for the high-volume, lower-stakes work that was burning through our API budget.

This conversation is happening in a broader context, too. The [Godot project recently announced](https://www.pcgamer.com/gaming-industry/open-source-game-engine-godot-will-no-longer-accept-ai-authored-code-contributions-we-cant-trust-heavy-users-of-ai-to-understand-their-code-enough-to-fix-it/) they'll no longer accept AI-authored code contributions, citing concerns that heavy AI users don't understand their own code well enough to fix it. I think that's the right call for an open-source project accepting contributions from unknown developers — but it also underscores what I consider the correct mental model for AI-assisted engineering: the human has to understand what the model produces. Local models, running on your own hardware, with your own tooling, actually reinforce this better than cloud APIs do. You're closer to the inference, you see the reasoning, and the friction of the setup means you're more likely to have thought carefully about when to use it.

There's also a data residency angle that matters for client work. When we're working on KRAIN or Crowdia infrastructure, there are contexts where sending code to a third-party API is a non-starter. Local inference solves that cleanly.

---

## What I'd Do Differently

Three things I'd change if I were starting this today:

**Start with the 32B coder model, not the 72B generalist.** I spent the first month optimizing the 72B model's performance before realizing the 32B coder model was faster and better for 60% of our actual use cases. The 72B earns its place, but it shouldn't be your first deployment.

**Build the summarization layer before you need it.** I bolted it on after the pipeline was already in production and it required rearchitecting the conversation management. Design for it from the start.

**Track model call counts and token usage per task from day one.** We didn't, and it took us two months to realize that one particular agent task was making 40+ model calls on average because the stop sequence logic was broken. Instrumentation is not optional.

---

## The Actual Recommendation

If you have access to hardware with 64GB+ of unified memory — a Mac with M-series silicon, a Strix Halo machine, or a system with a discrete GPU that has enough VRAM — build this stack. The $0/month cost is real, the quality is genuinely sufficient for most internal engineering tasks, and the data privacy benefits are worth something.

If you're on a 24GB discrete GPU without unified memory, build the tiered model approach and accept that you're running 13B–34B models for most tasks. It's still worth it. You'll be surprised how far a well-prompted 32B coder model gets you.

Don't build this if you're looking to replace frontier models for your highest-stakes, most complex reasoning work. That's not what this is. This is a cost optimization and a data privacy solution for the 80% of AI workload that doesn't require the best model in the world — it just requires a good enough model, running right now, for free.

That's a real thing. Build it.

Have Questions or Need Help?

Our team is ready to assist you with your project needs.

Contact Us