AMD Ryzen AI Max Shows Dramatic LLM Prompt Speed Gains with ROCm Update
A breakthrough in the ROCm 1188 build has closed the performance gap between llama.cpp's ROCm and Vulkan backends on AMD's Ryzen AI hardware, boosting prompt processing speeds by up to 132%. This leap could reshape the landscape of local AI inference on consumer-grade hardware.

A significant performance leap in AMD’s ROCm software stack has dramatically improved prompt processing speeds on Ryzen AI Max-powered devices, bringing them within striking distance of, and in some cases past, llama.cpp’s Vulkan backend. According to benchmarks published on Reddit’s r/LocalLLaMA community, the latest ROCm 1188 build (February 15) has increased prompt throughput by 50% to over 130% across multiple large language models, effectively erasing a long-standing performance deficit.
The improvement centers on the llamacpp-rocm project, an open-source adaptation of llama.cpp optimized for AMD’s ROCm platform. Historically, ROCm has lagged behind the Vulkan backend in prompt processing, the phase in which the model ingests and encodes the input text, while maintaining competitive token generation speeds. This imbalance made AMD hardware less attractive for latency-sensitive local AI applications. The latest update changes the picture: GPT-OSS-120B-MXFP4 jumped from 261 tokens per second (t/s) to 605 t/s, a 132% increase that actually overtakes the Vulkan baseline of 555 t/s.
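To make the reported figures concrete, the short sketch below reproduces the percentage arithmetic from the published numbers. The throughput values are the ones quoted above; the helper function itself is purely illustrative.

```python
# Illustrative only: reproduces the percentage math behind the reported
# prompt-processing throughput gains. Figures are the ones quoted in the article.
def pct_gain(before_tps: float, after_tps: float) -> float:
    """Percentage increase in tokens/second between two measurements."""
    return (after_tps / before_tps - 1.0) * 100.0

# GPT-OSS-120B-MXFP4: ROCm before vs. after the update, plus the Vulkan baseline.
rocm_before, rocm_after, vulkan = 261.0, 605.0, 555.0

print(f"ROCm gain:  {pct_gain(rocm_before, rocm_after):.0f}%")   # ~132%
print(f"vs. Vulkan: {pct_gain(vulkan, rocm_after):+.0f}%")       # ~+9% ahead
```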
Performance gains were not uniform. The Nemotron-3-Nano-30B-A3B-Q8_0 model saw a 98% surge, rising from 501 to 990 t/s, while Qwen3-Coder-Next-MXFP4-MOE improved by 77%. Notably, GLM-4.7-Flash-UD-Q4_K_XL, which had already performed comparably to Vulkan on earlier ROCm versions, saw only a marginal 7% gain, suggesting its kernels were already well optimized. Token generation speeds remained largely unchanged, indicating the optimization targeted prompt encoding specifically, likely through faster matrix-multiplication and attention kernels or kernel fusion in the compute-bound prefill path of the ROCm backend.
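For context, numbers like these are typically gathered with llama.cpp's bundled llama-bench tool, which reports prompt-processing (pp) and token-generation (tg) throughput separately. The sketch below shows one plausible way to drive it from Python; the binary paths, model file, and parameter values are placeholders, and the backend is determined by which build of llama-bench (ROCm or Vulkan) is invoked, so treat this as a rough outline rather than the benchmark authors' exact methodology.

```python
import subprocess

# Hypothetical paths: separate llama-bench builds, one compiled against ROCm/HIP
# and one against Vulkan. The model path is likewise a placeholder.
BENCH_BINARIES = {
    "rocm":   "./build-rocm/bin/llama-bench",
    "vulkan": "./build-vulkan/bin/llama-bench",
}
MODEL = "models/example-model.gguf"

for backend, binary in BENCH_BINARIES.items():
    # -p 512 measures prompt processing over a 512-token prompt (pp512),
    # -n 128 measures token generation over 128 tokens (tg128),
    # -ngl 99 offloads all layers to the GPU.
    cmd = [binary, "-m", MODEL, "-p", "512", "-n", "128", "-ngl", "99"]
    print(f"=== {backend} ===")
    result = subprocess.run(cmd, capture_output=True, text=True)
    # llama-bench prints a table with a t/s column for each pp/tg test.
    print(result.stdout)
```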
The benchmarks were conducted on AMD’s Ryzen AI Max 395 (codenamed Strix Halo), a high-end mobile processor integrating a 16-core Zen 5 CPU and a 40-compute-unit RDNA 3.5 GPU (Radeon 8060S) that share up to 128GB of unified LPDDR5X memory. This hardware is designed for on-device AI workloads, making the performance gains particularly relevant for developers and enterprise users seeking to deploy LLMs without relying on cloud infrastructure.
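This hardware profile also helps explain why token generation barely moved while prompt processing jumped: generation is usually limited by memory bandwidth, which better kernels cannot raise, whereas prefill is compute-bound and benefits directly from kernel improvements. The back-of-envelope estimate below illustrates the idea; the bus width, transfer rate, and per-token read size are rough public figures and assumptions, not values from the benchmark post.

```python
# Rough, assumption-laden estimate of the memory-bandwidth ceiling on token
# generation for a unified-memory APU like Strix Halo.
bus_width_bits = 256          # assumed LPDDR5X bus width
transfer_rate_mtps = 8000     # assumed LPDDR5X-8000 transfer rate (MT/s)

peak_bw_gbs = bus_width_bits / 8 * transfer_rate_mtps / 1000   # ~256 GB/s

# If generating one token requires streaming roughly `bytes_per_token_gb` of
# weights from memory, bandwidth alone bounds the generation rate.
bytes_per_token_gb = 2.7      # illustrative: ~5B active params at ~4-bit MoE
tg_upper_bound = peak_bw_gbs / bytes_per_token_gb

print(f"Peak bandwidth:     ~{peak_bw_gbs:.0f} GB/s")
print(f"Generation ceiling: ~{tg_upper_bound:.0f} tokens/s (bandwidth-bound)")
```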
Interactive benchmark charts, hosted on EvaluateAI.ai, confirm the trend across multiple builds between February 11 and 15. The data reveals a steep, consistent climb in prompt throughput with each ROCm revision, suggesting rapid iteration by the open-source community behind llamacpp-rocm. The project’s lead developer, who also operates EvaluateAI.ai, noted that future work will focus on output quality validation—ensuring speed gains do not compromise model accuracy or coherence.
This development carries substantial implications. For years, the AI inference market has been dominated by NVIDIA’s CUDA ecosystem. While AMD has made strides in GPU compute, local LLM deployment has remained a weak spot. The latest ROCm update signals that AMD’s open software stack is maturing rapidly, offering a viable, cost-effective alternative for developers seeking to avoid vendor lock-in or reduce cloud dependency. It also suggests that the open-source community, not just corporate R&D, is now driving critical innovation in AI infrastructure.
As AI models grow larger and more complex, efficient local inference becomes essential for privacy, latency, and regulatory compliance. With this breakthrough, AMD’s ecosystem may finally be poised to challenge NVIDIA’s dominance in the on-device AI space—not by matching raw compute, but by closing the software gap that once held it back.


