Breakthrough Optimization Boosts Qwen3-Next 80B MoE Performance 6x on Dual RTX 50-Series GPUs
A Reddit user has uncovered a critical configuration fix that dramatically improves inference speed for the Qwen3-Next 80B MoE model on dual RTX 5070 Ti and 5060 Ti GPUs, achieving 39 tokens per second — a sixfold increase. The solution hinges on two underdocumented llama.cpp flags that eliminate GPU communication bottlenecks.

A notable optimization has emerged from the local AI inference community, enabling the Qwen3-Next 80B MoE model to reach strong inference speeds on consumer-grade NVIDIA hardware. According to a detailed post on the r/LocalLLaMA subreddit, a user identified a pair of configuration flags that, combined, increased token generation throughput from a sluggish 6.5 tokens per second to 39 tokens per second, a sixfold improvement, on a dual-GPU setup consisting of an RTX 5070 Ti and an RTX 5060 Ti, each with 16GB of VRAM.
The model, Qwen3-Next-80B-A3B-Instruct-UD-IQ2_XXS.gguf, uses a highly sparse Mixture-of-Experts (MoE) architecture with 512 experts per MoE layer, of which 10 routed experts plus 1 shared expert are active per token. Prior to the fix, users experienced severe CPU bottlenecks, with the GPUs sitting largely idle during generation. Despite correctly offloading all 49 layers to the GPUs, the system was still consuming 34GB of system RAM and maxing out CPU cores, a sign that the model's expert weights were not being placed and partitioned across the two GPUs as intended.
The breakthrough came from two precise adjustments to the llama.cpp inference engine. First, --n-cpu-moe 0 ensured that no MoE expert weights were kept on the CPU, a fallback often recommended for VRAM-constrained setups but one that only adds overhead when the quantized model already fits in VRAM. Second, and more critically, the user replaced -sm row (row-split) tensor partitioning with -sm layer (layer-split). Row-splitting divides each weight matrix, including the expert weights, across both GPUs, forcing PCIe communication every time an expert is activated. With more than ten experts firing per token in every MoE layer, this created thousands of synchronization events per second and crippled throughput. A before-and-after comparison of the invocations appears below.
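For illustration, the difference comes down to those two flags. The slow command is not quoted in the post, so the first line below is a reconstruction of a typical row-split configuration; the second mirrors the posted fix:

Slow (reconstructed): llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-UD-IQ2_XXS.gguf -ngl 999 -c 4096 -sm row
Fast (as posted): llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-UD-IQ2_XXS.gguf -ngl 999 -c 4096 -sm layer --n-cpu-moe 0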
Layer-splitting, by contrast, assigns whole layers, together with their expert weights, to a single GPU. Each card computes its layers independently, and the only cross-GPU traffic is the small activation tensor handed off where the layer assignment passes from one card to the other, rather than partial results for every expert matrix. The result is sustained GPU utilization: both GPUs now run at 80-95% load during generation, CPU usage drops to negligible levels, and system RAM consumption falls sharply. At 26.2GB, the model fits comfortably within the combined 32GB of VRAM, leaving roughly 5-6GB for the KV cache and runtime buffers.
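One way to confirm that the layer split is behaving as described is to watch per-GPU load and memory while the model generates. The -ts proportions below are illustrative (an even split across the two 16GB cards) and are not taken from the original post:

llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-UD-IQ2_XXS.gguf -ngl 999 -c 4096 --n-cpu-moe 0 -sm layer -ts 16,16
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1

With layer splitting working, both GPUs should show high utilization and a roughly even share of the 26.2GB of weights, while CPU load and system RAM stay low.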
This solution underscores a critical but often overlooked principle in distributed AI inference: architectural alignment matters as much as hardware specs. While many online guides focus on increasing n_gpu_layers or upgrading CUDA versions, the real bottleneck was a software-level misconfiguration of tensor partitioning. The fix required no hardware upgrades, no model retraining, and no proprietary software — only deep understanding of how MoE models interact with multi-GPU inference engines.
The winning command line is:
llama-server.exe -m Qwen3-Next-80B-A3B-Instruct-UD-IQ2_XXS.gguf -ngl 999 -c 4096 --port 8081 --n-cpu-moe 0 -t 6 -fa auto -sm layer
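Once the server is up, a quick request against llama-server's OpenAI-compatible endpoint on the chosen port verifies the setup; the prompt and max_tokens below are arbitrary:

curl http://localhost:8081/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize mixture-of-experts routing in two sentences.\"}],\"max_tokens\":128}"

The per-request timing lines llama-server prints to its console are the easiest place to check whether generation speed lands near the reported 39 tokens per second.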
Notably, the user confirmed this works on llama.cpp build b8077 with CUDA 12.4, suggesting compatibility extends beyond bleeding-edge drivers. The implications are significant: users with dual RTX 40/50-series GPUs can now run 80B-class MoE models locally with near-real-time responsiveness, making high-end AI inference accessible without expensive multi-GPU server setups.
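Readers comparing against their own machines can check the local llama.cpp build number with the binary's --version flag and read the driver-reported CUDA version from nvidia-smi's header:

llama-server.exe --version
nvidia-smi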
As MoE architectures become standard in next-generation LLMs, this discovery provides a blueprint for optimizing distributed inference. Developers and hobbyists alike are encouraged to revisit tensor splitting strategies when deploying large MoE models — the solution may be simpler than expected.


