
Building Deft: On-Device AI Phone Agent for Android

Matthew J. Whitney

Building an on-device AI phone agent sounds like the clean, privacy-respecting future of mobile AI. No cloud round-trips. No user data leaving the device. No API bills. We built Deft to prove that future is real — and to find out exactly where it breaks.

This is a comparison between two fundamentally different architectures for an AI phone agent: cloud-backed LLM inference versus fully on-device inference with Gemma 4. I've shipped both. The tradeoffs are not what most people expect, and the decision matters more than almost any other architectural choice you'll make on a mobile AI project.

Here's what we found after running Deft on real Android hardware, with real UI trees, doing real actions.


The Two Architectures: What We're Actually Comparing

Cloud-backed agent: The phone captures context (screen state, user intent, accessibility tree), ships it to a hosted model — GPT-4o, Claude Sonnet, Gemini 1.5 Pro — gets structured JSON back, executes actions. Fast inference, massive context windows, zero model management on device.

On-device agent (Deft): Gemma 4 runs locally using Google's LiteRT runtime (formerly TensorFlow Lite), quantized to fit in device RAM, with inference happening entirely on-chip. No network call in the hot path. No data leaves the phone.

We used three open-source React Native libraries to wire this together: react-native-mlkit for on-device ML utilities, react-native-accessibility-info for UI tree traversal, and react-native-gemma for the Gemma inference bridge. The stack is real, it's in production, and it forced us to solve three problems that cloud architectures simply don't have.


Cloud-Backed Agents: The Honest Strengths

I want to be fair here because we used cloud inference on KRAIN before we built Deft, and it genuinely worked well for certain workloads.

Context windows are enormous. GPT-4o at 128K tokens means you can dump an entire accessibility tree, conversation history, and system prompt without sweating. On a complex screen — say, a settings panel with nested toggles, conditional sections, and dynamic content — the cloud model sees everything and reasons across it cleanly.

Inference latency is predictable. You're hitting a datacenter with dedicated GPU clusters. A well-structured prompt with a constrained output schema comes back in 800ms–1.2s on a good connection. That's fast enough for most agentic loops.

Model capability is simply higher. GPT-4o and Claude 3.7 Sonnet outperform any model that currently fits in 4–8GB of device RAM on complex multi-step reasoning tasks. That's not an opinion, that's benchmark reality.

Where cloud breaks down for a phone agent specifically:

  • Every action requires a network round-trip. In a subway tunnel, a basement, or rural coverage, your agent dies.
  • You're serializing and transmitting the full accessibility tree on every step. That's a privacy nightmare for enterprise deployments.
  • API costs at scale aren't zero. For a consumer app doing 50 agentic steps per session, you're looking at real money per user per day.
  • Latency compounds. A five-step task at 1s per round-trip plus network jitter is a 6–8 second experience. Users abandon it.
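The compounding in that last bullet is plain arithmetic, sketched here in TypeScript. The function and field names are illustrative, and the 200–600ms jitter range is an assumed figure picked to reproduce the 6–8 second estimate, not a measurement.

```typescript
// Rough model of how per-step latency adds up across a cloud-backed agent loop.
interface LoopLatencyInput {
  steps: number;       // agent steps in the task
  roundTripMs: number; // model round-trip per step
  jitterMinMs: number; // assumed best-case extra network delay per step
  jitterMaxMs: number; // assumed worst-case extra network delay per step
}

function estimateLoopLatencyMs(input: LoopLatencyInput): { minMs: number; maxMs: number } {
  const { steps, roundTripMs, jitterMinMs, jitterMaxMs } = input;
  return {
    minMs: steps * (roundTripMs + jitterMinMs),
    maxMs: steps * (roundTripMs + jitterMaxMs),
  };
}

// Five steps at 1s per round-trip, with 200–600ms of assumed jitter per step.
const fiveStepTask = estimateLoopLatencyMs({
  steps: 5,
  roundTripMs: 1000,
  jitterMinMs: 200,
  jitterMaxMs: 600,
});
// fiveStepTask is { minMs: 6000, maxMs: 8000 }
```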

On-Device with Gemma 4: What Actually Happened

Gemma 4 in its 4B parameter variant, quantized to INT4, runs on a Snapdragon 8 Gen 3 device at roughly 12–18 tokens per second using the NPU. On a MediaTek Dimensity 9300 device, we saw 8–11 tokens/sec. Those numbers sound fine until you're in an agentic loop generating 200-token action plans per step: at 12–18 tokens/sec that is roughly 11–17 seconds of generation per step, and 18–25 seconds at 8–11 tokens/sec.
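To put those throughput figures in per-step terms, here is a small TypeScript sketch. The helper name is made up for illustration, and it only counts decode time, so first-token latency comes on top.

```typescript
// Seconds spent decoding one action plan at a given throughput.
// Ignores prefill and first-token latency, so real steps take longer.
function decodeSeconds(outputTokens: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return outputTokens / tokensPerSecond;
}

const PLAN_TOKENS = 200; // action-plan size quoted above

// Snapdragon 8 Gen 3 range: 12–18 tokens/sec
const snapdragonBestS = decodeSeconds(PLAN_TOKENS, 18);  // about 11.1 s
const snapdragonWorstS = decodeSeconds(PLAN_TOKENS, 12); // about 16.7 s

// Dimensity 9300 range: 8–11 tokens/sec
const dimensityBestS = decodeSeconds(PLAN_TOKENS, 11);   // about 18.2 s
const dimensityWorstS = decodeSeconds(PLAN_TOKENS, 8);   // 25 s
```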

Latency on Constrained Hardware

The first thing we learned: first-token latency is your real enemy, not throughput.

Gemma 4 on LiteRT takes 1.8–2.4 seconds to emit the first token on cold inference (model already loaded, KV cache cold). That's the pause the user feels before anything happens. Cloud inference at 800ms feels snappier even though total generation time might be comparable.

We solved part of this with what we call speculative prefill: pre-loading the system prompt and static context into the KV cache during app initialization, before the user issues any command. In effect it is prefix caching done ahead of time. This dropped first-token latency to 600–900ms on the Snapdragon device. The model is already "thinking about" the phone's current state by the time the user speaks.
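The bookkeeping behind that idea can be sketched as follows. This is a minimal illustration, not Deft's code: `WarmPrefixPlanner` and its methods are hypothetical names, the token IDs are made up, and the actual KV cache would live inside the inference runtime rather than in this class. The sketch only shows which tokens still need prefill when a request arrives.

```typescript
// Length of the token prefix shared by the warmed cache and a new prompt.
function sharedPrefixLength(cached: readonly number[], prompt: readonly number[]): number {
  const limit = Math.min(cached.length, prompt.length);
  let i = 0;
  while (i < limit && cached[i] === prompt[i]) {
    i++;
  }
  return i;
}

// Hypothetical planner: remembers which tokens were prefilled at startup.
class WarmPrefixPlanner {
  private cachedTokens: number[] = [];

  // Call during app initialization with system prompt + static context tokens.
  warm(staticTokens: readonly number[]): void {
    this.cachedTokens = staticTokens.slice();
  }

  // Call per request: only `toPrefill` has to run before the first token.
  plan(promptTokens: readonly number[]): { reused: number; toPrefill: number[] } {
    const reused = sharedPrefixLength(this.cachedTokens, promptTokens);
    return { reused, toPrefill: promptTokens.slice(reused) };
  }
}

const planner = new WarmPrefixPlanner();
planner.warm([101, 102, 103, 104]);                        // static prefix, tokenized
const warmStep = planner.plan([101, 102, 103, 104, 7, 8]); // prefix + user request
// warmStep.reused is 4 and warmStep.toPrefill is [7, 8]
```

If the static context changes between warm-up and request time, the shared prefix shrinks and more tokens fall back to regular prefill, which is why only content that rarely changes belongs in the warmed portion.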

The tradeoff: this burns memory and locks the NPU during prefill. On devices with less than 8GB RAM, we had to reduce the static context size or risk OOM kills. We ship two model configs — a 4B INT4 for flagship devices and a 2B INT8 for mid-range — with runtime detection at startup.
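A startup check along these lines could drive that choice. This is a sketch under assumptions: the post only states the 8GB threshold and the two model configs, so the "flagship" rule (NPU present and at least 8GB RAM), the field names, and the static-context budgets are placeholders.

```typescript
interface DeviceProfile {
  totalRamGb: number; // as reported by the OS at startup
  hasNpu: boolean;    // whether a usable NPU delegate was found
}

interface ModelConfig {
  model: "4B-INT4" | "2B-INT8";
  staticContextTokens: number; // how much static context to prefill at init
}

// Placeholder budgets; the real values are not given in the post.
const FULL_STATIC_CONTEXT = 400;
const REDUCED_STATIC_CONTEXT = 200;

function pickModelConfig(device: DeviceProfile): ModelConfig {
  // Assumed definition of "flagship": an NPU and at least 8GB of RAM.
  const flagship = device.hasNpu && device.totalRamGb >= 8;
  return {
    model: flagship ? "4B-INT4" : "2B-INT8",
    // Below 8GB, shrink the prefilled context to lower the risk of OOM kills.
    staticContextTokens: device.totalRamGb >= 8 ? FULL_STATIC_CONTEXT : REDUCED_STATIC_CONTEXT,
  };
}
```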

The 500+ Node UI Tree Problem

This is the one that genuinely surprised us. Modern Android apps have deep, complex accessibility trees. Open Google Maps and ask Deft to navigate somewhere. The accessibility tree for that screen has 600–900 nodes depending on the state of the map, search suggestions, and overlays.

Gemma 4B at INT4 has an effective context window of roughly 8,192 tokens in our LiteRT build. A naively serialized accessibility tree of 700 nodes blows past that in a single prompt. You haven't even added the system prompt, conversation history, or the user's request yet.

Cloud agents don't have this problem. You send the whole tree, the model handles it.

For Deft, we built a semantic tree pruner that runs before inference. It does three things:

  1. Strips non-interactive nodes (static text, decorative images, layout containers with no actions)
  2. Collapses identical sibling groups (a list of 40 identical contact rows becomes a typed summary: [ContactList: 40 items, each with call/message actions])
  3. Ranks remaining nodes by spatial proximity to the last user interaction and truncates aggressively, keeping the top 60–80 interactive elements
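The three steps above can be sketched as a simple filter. This is an illustrative reconstruction, not Deft's pruner: the `UiNode` shape, the grouping rule (siblings with the same role and action set), the group threshold of 5, and the output text format are all assumptions.

```typescript
interface UiNode {
  role: string;      // e.g. "button", "row", "text", "container"
  label: string;
  actions: string[]; // empty for non-interactive nodes
  x: number;         // center of the node's bounds, in pixels
  y: number;
  children: UiNode[];
}

interface PrunedItem { text: string; x: number; y: number }

const nodeSignature = (n: UiNode): string =>
  n.role + ":" + n.actions.slice().sort().join("/");

function collectItems(node: UiNode, out: PrunedItem[], groupMin: number): void {
  // Step 1: only nodes that expose actions are emitted.
  if (node.actions.length > 0) {
    out.push({ text: `${node.role} "${node.label}" [${node.actions.join("/")}]`, x: node.x, y: node.y });
  }
  // Step 2: bucket siblings by role + action set.
  const buckets: Record<string, UiNode[]> = {};
  for (const child of node.children) {
    const key = nodeSignature(child);
    (buckets[key] = buckets[key] || []).push(child);
  }
  for (const key of Object.keys(buckets)) {
    const members = buckets[key] || [];
    const first = members[0];
    if (first !== undefined && first.actions.length > 0 && members.length >= groupMin) {
      // A large group of identical interactive siblings becomes one typed summary.
      out.push({
        text: `[${first.role} list: ${members.length} items, each with ${first.actions.join("/")} actions]`,
        x: first.x,
        y: first.y,
      });
    } else {
      for (const m of members) collectItems(m, out, groupMin);
    }
  }
}

function pruneTree(root: UiNode, lastTouch: { x: number; y: number }, maxItems = 80, groupMin = 5): string[] {
  const items: PrunedItem[] = [];
  collectItems(root, items, groupMin);
  // Step 3: rank by squared distance to the last interaction, keep the closest.
  const dist2 = (p: PrunedItem): number => {
    const dx = p.x - lastTouch.x;
    const dy = p.y - lastTouch.y;
    return dx * dx + dy * dy;
  };
  return items.sort((a, b) => dist2(a) - dist2(b)).slice(0, maxItems).map((p) => p.text);
}
```

One simplification to note: a collapsed group is ranked by the position of its first member, so a long list that starts far from the last touch can be truncated as a whole.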

After pruning, a 700-node tree typically compresses to 180–240 tokens. We keep conversation history to 3 turns. System prompt is ~400 tokens. We stay comfortably under 2,000 tokens per inference call, which means faster prefill and more headroom for the model's output.
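A budget check like the one described could look like the sketch below. The 4-characters-per-token estimate is a rough heuristic standing in for the real tokenizer, and `assemblePrompt` and its field names are hypothetical. It assumes the tree lines arrive ordered most relevant first, so trimming from the end drops the least relevant ones.

```typescript
// Rough heuristic (about 4 characters per token). This is an assumption,
// not Gemma's tokenizer; swap in a real token counter before relying on it.
const roughTokenCount = (text: string): number => Math.ceil(text.length / 4);

interface PromptParts {
  system: string;    // system prompt
  history: string[]; // conversation turns, oldest first
  tree: string[];    // pruned tree lines, most relevant first
  request: string;   // the user's current command
}

function assemblePrompt(
  parts: PromptParts,
  maxTokens = 2000,
  maxTurns = 3,
): { prompt: string; tokens: number } {
  // Keep only the most recent turns.
  const history = maxTurns > 0 ? parts.history.slice(-maxTurns) : [];
  const tree = parts.tree.slice();
  const render = (): string =>
    [parts.system].concat(history, tree, [parts.request]).join("\n");

  // Drop the least relevant tree lines until the prompt fits the budget.
  while (tree.length > 0 && roughTokenCount(render()) > maxTokens) {
    tree.pop();
  }
  const prompt = render();
  return { prompt, tokens: roughTokenCount(prompt) };
}
```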

The cost: the pruner occasionally drops context the model needs. We've had Deft fail to find a button that was technically present but ranked low by the pruner. We log these cases and iterate on the ranking heuristic. It's an ongoing calibration problem, not a solved one.

Safe Action Validation Before Irreversible Operations

This is the one I lose sleep over.

A cloud agent that calls DELETE /account or sends a message to the wrong contact is bad. An on-device AI phone agent that does the same thing with no server-side guardrails, no audit log, and no rollback is worse. The blast radius is entirely on the user's device, and there's no middleware to catch it.

We implemented a three-tier action classification system before Deft executes anything:

Tier 1 — Read-only: Scroll, navigate, read screen content. Execute immediately.

Tier 2 — Reversible write: Type text into a field, toggle a setting that can be toggled back, open an app. Execute with a 2-second undo window shown in a toast.

Tier 3 — Irreversible: Send a message, make a call, delete data, submit a form, make a purchase. Full confirmation dialog, explicit user tap required, action described in plain English before execution.

The classification runs as a deterministic rule layer — not the LLM — before every action. The model proposes an action as structured JSON. The rule layer classifies it. If it's Tier 3, the model's output never executes until the user confirms.
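A deterministic rule layer of this kind can be very small. The sketch below is illustrative only: the verb lists, the keyword pattern, and the decision to treat a plain tap as Tier 2 are assumptions, not Deft's actual rules. The one property worth copying is that anything unrecognized falls through to Tier 3.

```typescript
type ActionTier = 1 | 2 | 3;

interface ProposedAction {
  action: string;  // verb proposed by the model, e.g. "tap", "scroll"
  target?: string; // label of the UI element, if any
}

// Illustrative rule tables; a real deployment would need a much fuller list.
const READ_ONLY_VERBS = ["scroll", "navigate", "read"];
const REVERSIBLE_VERBS = ["type", "toggle", "open_app"];
const IRREVERSIBLE_TARGET = /\b(send|call|delete|submit|buy|pay|purchase)\b/i;

function classifyAction(a: ProposedAction): ActionTier {
  const verb = a.action.toLowerCase();
  if (READ_ONLY_VERBS.indexOf(verb) !== -1) return 1;
  if (REVERSIBLE_VERBS.indexOf(verb) !== -1) return 2;
  if (verb === "tap" && a.target !== undefined && !IRREVERSIBLE_TARGET.test(a.target)) {
    return 2; // tap on a target that carries no irreversible keyword
  }
  return 3; // irreversible keyword, missing target, or unknown verb: fail closed
}
```

Keyword matching on labels is crude (it misses icon-only buttons and localized text, for example), which is one reason to default to the strictest tier whenever the rules are unsure.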

This matters because Gemma 4B hallucinates action targets. It will occasionally generate { "action": "tap", "target": "Send" } when the user asked it to draft a message, not send one. The LLM's intent parsing isn't perfect, especially on ambiguous utterances. The deterministic safety layer is not optional.
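Because the model's JSON is untrusted input, the gate in front of execution can treat it that way. The following is a sketch with hypothetical names (`gateModelOutput`, `GateDecision`); the tier function is passed in so the gate stays independent of whatever rules produce the tier, and the 2000ms undo window mirrors the 2-second toast described above.

```typescript
type SafetyTier = 1 | 2 | 3;

interface ModelAction { action: string; target?: string }

type GateDecision =
  | { kind: "execute" }
  | { kind: "execute_with_undo"; undoWindowMs: number }
  | { kind: "needs_confirmation"; description: string }
  | { kind: "reject"; reason: string };

function gateModelOutput(raw: string, tierOf: (a: ModelAction) => SafetyTier): GateDecision {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return { kind: "reject", reason: "output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { kind: "reject", reason: "output is not a JSON object" };
  }
  const obj = parsed as { action?: unknown; target?: unknown };
  if (typeof obj.action !== "string") {
    return { kind: "reject", reason: "missing action field" };
  }
  const action: ModelAction = typeof obj.target === "string"
    ? { action: obj.action, target: obj.target }
    : { action: obj.action };

  const tier = tierOf(action);
  if (tier === 1) return { kind: "execute" };
  if (tier === 2) return { kind: "execute_with_undo", undoWindowMs: 2000 };
  // Tier 3: nothing runs until the user confirms a plain-English description.
  const targetText = action.target !== undefined ? ` "${action.target}"` : "";
  return {
    kind: "needs_confirmation",
    description: `About to ${action.action}${targetText}. Confirm to continue.`,
  };
}
```

With a tier function that marks a tap on "Send" as Tier 3, the hallucinated `{ "action": "tap", "target": "Send" }` comes back as `needs_confirmation` instead of executing.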


Direct Comparison: The Dimensions That Matter

| Dimension | Cloud-Backed Agent | On-Device (Deft / Gemma 4) |
| --- | --- | --- |
| First-token latency | 800ms–1.2s (good network) | 600–900ms with prefill (Snapdragon 8 Gen 3); 1.8–2.4s without |
| Offline capability | None | Full |
| Context window | 32K–128K tokens | ~8K effective tokens |
| Privacy | Data leaves device | Fully local |
| Complex reasoning | Stronger (GPT-4o, Claude) | Weaker (4B params) |
| API cost at scale | Real and compounding | Zero marginal cost |
| Action safety | Server-side possible | Must build yourself |
| Model updates | Transparent, instant | Requires app update |
| Hardware dependency | None | NPU/RAM constrained |

The Verdict: Use Cloud When, Use On-Device When

Use a cloud-backed agent when:

Your task complexity demands it. If you're building an agent that navigates multi-app workflows, reasons across long conversation histories, or handles highly ambiguous natural language, the capability gap between a 4B on-device model and GPT-4o is real and will cost you in task completion rate. We measured ~73% task completion on Deft vs ~91% on a cloud-backed equivalent across our benchmark task set. That 18-point gap matters for complex tasks.

Also: if your users are on mid-range hardware (6GB RAM, no dedicated NPU), on-device inference is too slow and too fragile to ship confidently today.

Use an on-device AI phone agent when:

Privacy is non-negotiable. Enterprise deployments where sensitive screen content cannot leave the device. Healthcare, finance, anything with compliance requirements. Deft's entire value proposition for our enterprise customers is that screen content and prompts never leave the device in a network packet.

Also when: you need offline reliability, you're cost-sensitive at scale, and your task set is constrained enough that a smaller model can handle it. Deft performs well on a defined task taxonomy — launching apps, sending pre-approved messages, reading notifications, adjusting settings. It's not a general-purpose agent. It's a reliable specialist.


What We'd Do Differently

The craftsmanship conversation happening in the dev community right now is directly relevant here. We over-engineered the tree pruner in v1 — it was clever code that was hard to debug and hard to tune. We rewrote it as a simple scored filter with explicit, readable rules. Boring code that works is better than clever code that almost works when an agent is about to send a message on someone's behalf.

I'd also invest earlier in the action safety layer. We bolted it on after our first internal demo where Deft called a contact during testing. It should have been the first thing we built.

The Google LiteRT documentation has improved significantly for Gemma 4 inference, and the Gemma model card is worth reading carefully for the quantization tradeoffs before you pick your deployment config. Don't assume INT4 is always faster than INT8 — on some NPU implementations, INT8 has better hardware support and wins on wall-clock time despite the larger model size.
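The only reliable way to settle INT4 versus INT8 is to time both on the target device. Here is a minimal harness sketch: `run` is a hypothetical hook that performs one full inference pass on a fixed prompt, and it is synchronous here for brevity even though a real inference call would almost certainly be asynchronous. The clock is injectable so the harness can be tested without hardware.

```typescript
interface BenchCandidate {
  name: string;
  run: () => void; // one full inference pass on a fixed prompt (hypothetical hook)
}

interface BenchResult { name: string; meanMs: number }

function rankByWallClock(
  candidates: BenchCandidate[],
  runs = 5,
  now: () => number = Date.now,
): BenchResult[] {
  if (runs < 1) {
    throw new RangeError("runs must be at least 1");
  }
  const results = candidates.map((c): BenchResult => {
    c.run(); // warm-up pass, not timed
    let totalMs = 0;
    for (let i = 0; i < runs; i++) {
      const start = now();
      c.run();
      totalMs += now() - start;
    }
    return { name: c.name, meanMs: totalMs / runs };
  });
  return results.sort((a, b) => a.meanMs - b.meanMs); // fastest first
}
```

Run it on each device class you plan to support, since the ranking can flip between NPU implementations.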

Deft is not finished. The context window problem doesn't fully go away — it just gets managed. Latency on mid-range hardware is still a UX liability. But the privacy story is real, the offline reliability is real, and the zero marginal inference cost is real. For the right deployment, those aren't nice-to-haves. They're the whole point.
