
96GB Local AI: MoE Models on AMD Strix Halo

Matthew J. Whitney
Tags: artificial intelligence, LLM, machine learning, AI integration

Local AI inference on MoE models just crossed a threshold I didn't expect to hit on a tablet. A few weeks ago I unlocked the BIOS memory ceiling on my ASUS ROG Flow Z13 — AMD Strix Halo APU, 96GB unified memory — ditched ROCm entirely, switched to Vulkan via llama.cpp, and ran Mixtral 8x7B, Qwen2-57B-A14B, and DeepSeek-V2-Lite back to back. The numbers surprised me. Not because they were fast by data center standards, but because they were fast enough to change how I think about where local AI goes from here.

This isn't a review. It's a prediction piece. What I saw on that Flow Z13 tells me something specific about the next 12 months of edge AI — and I want to make concrete, falsifiable calls while the window is still open.

The Inflection Point: What 96GB Unified Memory Actually Changes

Before the predictions, you need to understand why this hardware moment matters.

The Strix Halo's iGPU shares a unified memory pool with the CPU. Out of the box, the BIOS limits GPU-accessible memory to somewhere between 16GB and 24GB depending on firmware version. That ceiling is a firmware default, not a hardware limit. Once you push it up (I set mine to 96GB in the UEFI advanced settings), llama.cpp with the Vulkan backend sees the entire pool as VRAM. No PCIe bandwidth bottleneck. No offloading layers to CPU RAM with the associated penalty. The entire model lives in a single memory pool that the GPU addresses directly.
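For anyone reproducing the setup, a launch command along these lines is a reasonable starting point. This is a minimal sketch, not the exact invocation from the test runs: it assumes a llama.cpp build with the Vulkan backend enabled at CMake time (`-DGGML_VULKAN=ON`) and the `llama-cli` binary on the PATH. The helper name `build_llama_cmd` and the model path are placeholders.

```python
import shlex


def build_llama_cmd(model_path, prompt, n_gpu_layers=99, ctx_size=4096, n_predict=256):
    """Build an argv list for llama-cli that offloads all layers to the GPU.

    Assumption: an -ngl value larger than the model's layer count
    offloads every layer, so 99 is a common "everything" setting.
    """
    return [
        "llama-cli",
        "-m", model_path,           # path to the GGUF file (placeholder)
        "-ngl", str(n_gpu_layers),  # number of layers to place on the GPU
        "-c", str(ctx_size),        # context window in tokens
        "-n", str(n_predict),       # maximum tokens to generate
        "-p", prompt,
    ]


if __name__ == "__main__":
    cmd = build_llama_cmd("models/your-model-q4_k_m.gguf", "Hello")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list in code rather than as a shell string keeps prompts with spaces or quotes from being mangled.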

What that means practically: Mixtral 8x7B Q4_K_M (around 26GB) loads completely into GPU-accessible memory. Qwen2-57B-A14B in Q4 quantization sits at roughly 34GB and still fits. DeepSeek-V2-Lite at Q4 is around 10GB, trivially loaded.
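The fit check is simple arithmetic, sketched below. The model sizes come from the figures above; the KV cache allowance and the OS reserve are illustrative assumptions, not measurements, and `fits_in_pool` is a hypothetical helper.

```python
def fits_in_pool(model_gb, kv_cache_gb=4.0, os_reserve_gb=12.0, pool_gb=96.0):
    """Return (fits, headroom_gb) for a model loaded into a unified memory pool.

    kv_cache_gb and os_reserve_gb are assumed allowances; tune them for
    your context length and whatever else is running on the machine.
    """
    needed = model_gb + kv_cache_gb + os_reserve_gb
    headroom = pool_gb - needed
    return headroom >= 0, headroom


if __name__ == "__main__":
    models = {
        "Mixtral 8x7B Q4_K_M": 26.0,
        "Qwen2-57B-A14B Q4": 34.0,
    }
    for name, size_gb in models.items():
        ok, headroom = fits_in_pool(size_gb)
        print(f"{name}: fits={ok}, headroom={headroom:.0f} GB")
```

The same function makes the stock ceiling obvious: with `pool_gb=24.0`, neither model fits.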

My actual prompt processing and generation numbers using llama.cpp b3447 with Vulkan:

  • Mixtral 8x7B Q4_K_M: ~18 tokens/sec generation, ~420 tokens/sec prompt processing
  • Qwen2-57B-A14B Q4_K_M: ~11 tokens/sec generation, ~280 tokens/sec prompt processing
  • DeepSeek-V2-Lite Q4_K_M: ~28 tokens/sec generation, ~680 tokens/sec prompt processing
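To translate those two rates into wall-clock time, a rough estimate is prompt tokens divided by the prompt-processing rate, plus output tokens divided by the generation rate. The sketch below uses the numbers from the list above; the 2,000-token prompt and 500-token reply are an assumed workload, and the model ignores load time and sampling overhead.

```python
def response_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Estimate end-to-end seconds: prompt processing plus generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# (generation tokens/sec, prompt-processing tokens/sec) from the list above
MEASURED = {
    "Mixtral 8x7B": (18.0, 420.0),
    "Qwen2-57B-A14B": (11.0, 280.0),
    "DeepSeek-V2-Lite": (28.0, 680.0),
}

if __name__ == "__main__":
    for name, (tg, pp) in MEASURED.items():
        secs = response_seconds(2000, 500, pp_tps=pp, tg_tps=tg)
        print(f"{name}: ~{secs:.0f} s for a 2000-token prompt and 500-token reply")
```

For interactive use, generation speed dominates: on this workload most of the time is spent producing the reply, not reading the prompt.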

For context: 18 tokens/sec on Mixtral is faster than most people read. It's not A100 territory, but it runs offline, on battery, on a 1.2kg tablet. That's the inflection point.

ROCm, by the way, was a dead end on this hardware at the time of testing. Driver conflicts, missing kernel modules, and HIP compilation failures that I spent two days debugging before abandoning it. Vulkan just worked. That's its own signal about the state of AMD's open-source AI stack — but that's a separate article.

Prediction 1: Vulkan Becomes the Default Backend for Edge AI Inference by End of 2026

ROCm is AMD's answer to CUDA, and it's genuinely powerful on discrete RDNA3 and CDNA hardware in controlled environments. But on APUs and integrated graphics — the hardware that's actually proliferating — ROCm's installation complexity and driver fragility make it a non-starter for most practitioners.

Vulkan, particularly through llama.cpp's ggml-vulkan backend and the KomputeProject ecosystem, runs on virtually any GPU made in the last decade: AMD, Intel Arc, NVIDIA (without CUDA), and mobile silicon. The performance delta between Vulkan and ROCm on Strix Halo in my testing was less than 8% on generation tokens/sec — well within acceptable range for the massive gain in deployment simplicity.

My call: By Q4 2026, llama.cpp's Vulkan backend overtakes ROCm as the primary deployment target for local AI inference on AMD hardware outside of dedicated ML clusters. llama.cpp merge velocity on Vulkan improvements is already higher than ROCm-specific patches as of mid-2026.

What would prove me wrong: AMD ships a dramatically simplified ROCm installation path — something like a single-binary installer with automatic APU detection — and closes the performance gap on unified memory architectures. Possible, but their track record on developer experience suggests otherwise.

Prediction 2: MoE Architecture Dominates Local LLM Deployments Within 12 Months

This is the one I feel most confident about. Running MoE models locally changes the value proposition of local AI inference in a fundamental way.

Dense models like Llama 3.1 8B or Mistral 7B are fast on small hardware but capability-capped. To get GPT-4 class reasoning you're looking at 70B+ parameter dense models, which require 40GB+ even at aggressive quantization. Most consumer hardware can't touch them.

MoE architectures break this constraint. Mixtral 8x7B has about 47B total parameters (less than 8 × 7B = 56B, because the experts share the attention layers) but activates only ~13B per token. Qwen2-57B-A14B activates 14B of its 57B. DeepSeek-V2-Lite activates 2.4B of 15.7B. You get the reasoning capability of a much larger model at the inference cost of a smaller one. Critically, you also get it at the memory bandwidth cost of a smaller one, and memory bandwidth is the actual bottleneck on unified memory systems.
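The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: if every generated token requires reading all active weights once, tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below rests on two assumptions that are not from the benchmarks above: a theoretical peak of 256 GB/s for this memory configuration and roughly 4.5 bits per weight for a Q4-class quantization. It also ignores KV cache reads and routing overhead, so treat the results as upper bounds.

```python
def gb_read_per_token(active_params_b, bits_per_weight=4.5):
    """GB of weights read per generated token (params given in billions)."""
    return active_params_b * bits_per_weight / 8


def decode_ceiling_tps(active_params_b, bandwidth_gbs=256.0, bits_per_weight=4.5):
    """Upper bound on generation speed when decoding is bandwidth-bound."""
    return bandwidth_gbs / gb_read_per_token(active_params_b, bits_per_weight)


if __name__ == "__main__":
    # Qwen2-57B-A14B: 14B active parameters out of 57B total
    moe = decode_ceiling_tps(14.0)
    dense_equivalent = decode_ceiling_tps(57.0)  # hypothetical dense 57B model
    print(f"MoE ceiling:   {moe:.1f} tokens/sec")
    print(f"Dense ceiling: {dense_equivalent:.1f} tokens/sec")
    print(f"Ratio:         {moe / dense_equivalent:.1f}x")
```

The ratio between the two ceilings is just total parameters over active parameters, which is why sparsity matters more on bandwidth-limited hardware than raw compute does.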

On the Flow Z13, Qwen2-57B-A14B at 11 tokens/sec beats a dense 13B model in reasoning benchmarks while running at comparable speed. That's not a marginal win. That's a category shift.

My call: By mid-2027, more than 60% of locally-deployed open-source LLMs in production use cases will be MoE or sparse architectures. The dense 7B/13B paradigm gets relegated to ultra-constrained edge devices (sub-8GB RAM) while MoE takes everything above that threshold.

What would prove me wrong: A breakthrough in dense model efficiency — something like BitNet b1.58 at scale achieving comparable quality at dramatically lower memory footprint — could keep dense models competitive. I'm watching the 1-bit quantization research closely. It's real, but not production-ready at the capability levels we need yet.

Prediction 3: The "96GB Tier" Creates a New Class of AI-Native Edge Devices

The Flow Z13 is an anomaly today — a tablet with workstation-class memory bandwidth running models that would have required a $10,000 server two years ago. But Strix Halo isn't staying in one device.

AMD's roadmap points toward this unified memory architecture scaling across the product stack. Apple already proved the market exists with M-series Macs: the M3 Ultra, configurable with up to 512GB of unified memory, is a serious local AI inference machine. Intel's Lunar Lake and Meteor Lake chips are chasing the same architecture. Qualcomm's Snapdragon X Elite has 64GB configurations shipping in commercial laptops.

The 96GB unified memory tier is becoming a laptop and workstation standard, not an exotic configuration. When that happens, the whole story for local inference changes, because you can run a 72B-class model at Q4 in roughly 44GB, leaving 52GB for the OS, applications, and context window.

My call: By Q2 2027, at least three major OEMs ship AI-focused laptops with 96GB+ unified memory as a standard configuration tier, explicitly marketed around local LLM capability. The positioning will be "run any open-source model locally" — because at 96GB, that's essentially true.

What would prove me wrong: Memory pricing spikes or supply constraints keep 96GB configurations in the $3,000+ range, limiting adoption to enterprise and prosumer segments. Possible given current LPDDR5X supply dynamics, but the cost curve has been moving in the right direction.

Prediction 4: Offline-First AI Integration Becomes a Compliance Requirement in Regulated Industries

This is the prediction with the longest tail, but I think the signal is already there.

Every model I ran on the Flow Z13 processed data that never left the device. No API call. No telemetry. No token logging on someone else's server. For the work we do at Bedda.tech — particularly in fintech and healthcare-adjacent projects — that matters enormously. When I'm prototyping AI integration features for a client handling PII or financial data, running inference locally isn't a performance choice, it's a compliance choice.

The regulatory environment around AI data handling is tightening globally. The EU AI Act imposes data governance obligations on high-risk AI applications, and GDPR already constrains where personal data can be sent. HIPAA covered entities are increasingly cautious about sending patient-adjacent data to third-party model APIs. SOC 2 Type II auditors are starting to ask questions about LLM API usage that nobody had answers to 18 months ago.

Local AI inference on MoE models — specifically because MoE gives you capability parity with cloud models at a fraction of the infrastructure cost — is the technical answer to these compliance questions.

My call: By Q1 2027, at least two major compliance frameworks (HIPAA guidance, SOC 2 criteria, or equivalent) publish explicit provisions that create strong incentives for on-premise or device-local AI inference over cloud API usage for regulated data categories. Legal teams will start demanding it before the frameworks even finalize.

What would prove me wrong: Cloud providers successfully lobby for "AI data processing" carve-outs in privacy regulations, or achieve certification levels (FedRAMP High equivalent for AI APIs) that satisfy compliance requirements without requiring local deployment. Microsoft and Google are actively pursuing this path — it's a real counterforce.

Prediction 5: llama.cpp Forks or Successors Emerge Targeting Unified Memory Architectures Specifically

llama.cpp is remarkable software. Georgi Gerganov built something that runs on everything from a Raspberry Pi to a 96GB APU. But that universality comes with architectural compromises that are increasingly visible when you're pushing a Strix Halo.

The ggml tensor library underneath llama.cpp was designed for CPU-first computation with GPU offloading as a secondary concern. On unified memory architectures, that mental model is wrong — the GPU and CPU share the same physical memory, and the optimal execution strategy looks different from either pure CPU or pure discrete-GPU inference. Memory layout, kernel fusion opportunities, and scheduling heuristics all change.

I'm already seeing this in my benchmarks. The gap between theoretical peak bandwidth utilization and what llama.cpp/Vulkan actually achieves on Strix Halo suggests there's significant headroom being left on the table. Someone is going to build a runtime that treats unified memory as the primary architecture rather than an edge case.
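One way to put a number on that headroom is to convert a measured generation rate into the memory bandwidth it implies and compare that with the theoretical peak. The sketch below does this for the Qwen2-57B-A14B result (11 tokens/sec, 14B active parameters). The 256 GB/s peak and the 4.5 bits per weight are assumptions, and the estimate counts only weight reads, so real utilization is somewhat higher than it reports.

```python
def achieved_bandwidth_gbs(tokens_per_sec, active_params_b, bits_per_weight=4.5):
    """Bandwidth implied by a measured decode rate, counting weight reads only."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return tokens_per_sec * gb_per_token


def utilization(tokens_per_sec, active_params_b, peak_gbs=256.0, bits_per_weight=4.5):
    """Fraction of the assumed peak bandwidth that the measured rate accounts for."""
    return achieved_bandwidth_gbs(tokens_per_sec, active_params_b, bits_per_weight) / peak_gbs


if __name__ == "__main__":
    bw = achieved_bandwidth_gbs(11.0, 14.0)
    frac = utilization(11.0, 14.0)
    print(f"Implied bandwidth: {bw:.1f} GB/s ({frac:.0%} of assumed peak)")
```

Under these assumptions the measured rate accounts for roughly a third of peak bandwidth, which is consistent with the claim that a runtime tuned for unified memory has room to improve.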

My call: By end of 2026, at least one serious alternative inference runtime emerges — either a llama.cpp fork or a new project — with unified memory APU optimization as its primary design target, achieving 25-40% better throughput than stock llama.cpp on this hardware class.

What would prove me wrong: The llama.cpp core team prioritizes unified memory optimization in the main branch and closes the gap themselves. Given the contributor velocity I've seen, this is actually plausible — but organizational focus is hard to sustain on a single hardware target when you're also supporting dozens of others.

The Most Important Thing to Do Right Now

If you're building AI-integrated software in 2026, the single most important thing you can do is stop treating local inference as a hobbyist curiosity and start treating it as a production deployment target.

The hardware is here. The models are here. The performance is here. What's missing is the engineering practice — the same rigor around testing, reliability, and deployment that we apply to every other piece of production infrastructure.

At Bedda.tech, we're building local inference into client architectures the same way we'd build in any other compute dependency: with fallback strategies, latency budgets, and explicit capability contracts. The flow that works is: local MoE model for sensitive or offline-required inference, cloud API as fallback for burst capacity or capability gaps, with the routing layer making that decision transparently.
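A routing layer of that shape can be sketched in a few dozen lines. Everything below is illustrative: the `Backend` and `Request` types, the `route` function, and the numbers in the example are hypothetical and do not describe any production system. The one rule the sketch enforces is that sensitive requests never leave the device, even when the local model cannot serve them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Backend:
    name: str
    max_context: int            # capability contract: largest prompt accepted
    est_tokens_per_sec: float   # used for latency budgeting
    generate: Callable[[str], str]


@dataclass
class Request:
    prompt: str
    prompt_tokens: int
    max_output_tokens: int
    sensitive: bool             # regulated data that must stay on device
    latency_budget_s: float


def route(req: Request, local: Backend, cloud: Backend) -> Backend:
    """Local-first routing with cloud fallback for non-sensitive requests only."""
    local_capable = req.prompt_tokens <= local.max_context
    est_seconds = req.max_output_tokens / local.est_tokens_per_sec
    local_in_budget = est_seconds <= req.latency_budget_s

    if req.sensitive:
        if not local_capable:
            # Fail closed rather than silently sending regulated data out.
            raise ValueError("sensitive request exceeds local capability")
        return local
    if local_capable and local_in_budget:
        return local
    return cloud


if __name__ == "__main__":
    local = Backend("local-moe", 8192, 18.0, lambda p: f"[local] {p}")
    cloud = Backend("cloud-api", 128_000, 80.0, lambda p: f"[cloud] {p}")
    req = Request("Summarize this contract", 1200, 300, sensitive=True, latency_budget_s=5.0)
    chosen = route(req, local, cloud)
    print(chosen.name, "->", chosen.generate(req.prompt))
```

Failing closed on sensitive requests is a deliberate design choice: a slow or refused answer is recoverable, while regulated data sent to a third-party API is not.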

The 96GB Strix Halo isn't the end state — it's the proof of concept that tells you where the end state is going. A tablet running Mixtral at 18 tokens/sec offline means a workstation running it at 60+, and a small server cluster running it at 200+, all without a cloud dependency. That's a different architecture for AI integration than anything most teams are building today.

The teams that figure out local-first AI integration now will have a 12-month head start on everyone else when compliance requirements, latency demands, and cost pressures make it mandatory. That window is open right now. It won't stay open long.
