Qwen 3.5 35B MoE Achieves 44 TPS on RTX 5060 Ti, Redefining Local AI Performance

A benchmark shared on the r/LocalLLaMA subreddit reports that Alibaba’s Qwen 3.5 35B MoE (Mixture of Experts) large language model can reach 44.32 tokens per second (TPS) on an NVIDIA GeForce RTX 5060 Ti GPU, a throughput usually associated with higher-end hardware. The test, run by user /u/maho_Yun with the llama-bench utility and a 100,000-token prompt, suggests that long-context inference is becoming practical on consumer-grade GPUs.

According to the benchmark data, on NVIDIA’s CUDA backend the model handled prompt processing (pp100000) at 1,304.93 ± 4.10 tokens per second and text generation (tg720) at 44.32 ± 2.16 TPS. The Vulkan backend reached 41.35 TPS in generation, about 7% slower. That gap is only about 1.4 times the ±2.16 spread reported for the CUDA run, so the result points to a modest CUDA advantage on this card rather than a decisive one.
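To put those rates in wall-clock terms, the short sketch below works through the arithmetic using only the figures quoted above. It assumes the rates are steady over the whole run, which real workloads will not match exactly; the function and variable names are illustrative.

```python
# Figures quoted in the article, in tokens per second.
PP_TPS = 1304.93        # prompt processing (pp100000), CUDA
TG_CUDA_TPS = 44.32     # text generation (tg720), CUDA
TG_CUDA_SPREAD = 2.16   # reported +/- for the CUDA tg720 run
TG_VULKAN_TPS = 41.35   # text generation, Vulkan

PROMPT_TOKENS = 100_000
GEN_TOKENS = 720


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate."""
    return tokens / tokens_per_second


def relative_gain(a: float, b: float) -> float:
    """Fractional advantage of rate `a` over rate `b`."""
    return (a - b) / b


prefill_s = seconds_for(PROMPT_TOKENS, PP_TPS)
decode_s = seconds_for(GEN_TOKENS, TG_CUDA_TPS)
gain = relative_gain(TG_CUDA_TPS, TG_VULKAN_TPS)
gap_in_spreads = (TG_CUDA_TPS - TG_VULKAN_TPS) / TG_CUDA_SPREAD

print(f"Prefill of {PROMPT_TOKENS} tokens: {prefill_s:.1f} s")
print(f"Generation of {GEN_TOKENS} tokens: {decode_s:.1f} s")
print(f"CUDA over Vulkan: {gain:.1%}")
print(f"Gap relative to the CUDA spread: {gap_in_spreads:.2f}x")
```

In other words, ingesting the full 100,000-token prompt takes a little over a minute at the reported rate, which is the cost a user would pay before the first generated token appears.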

The system configuration included an AMD Ryzen 7 9700X CPU running at 5.55 GHz, 47.61 GiB of system memory, and a GPU identified as the NVIDIA GeForce RTX 5060 Ti with 15.59 GiB of dedicated VRAM, a figure consistent with the 16 GB variant of the card. That is more memory than the 12 GB RTX 4070 and the same capacity as the 16 GB RTX 4060 Ti. Notably, the system also listed a "GameViewer Virtual Display Adapter," suggesting the test may have been run over a remote session, which is worth keeping in mind when trying to reproduce the result.

Qwen 3.5 35B MoE, released earlier this year by Alibaba’s Tongyi Lab, uses a Mixture of Experts architecture that activates only a subset of its expert sub-networks for each token. This cuts the compute needed per token while preserving output quality, which makes the model well suited to resource-constrained hardware. Sustaining more than 40 TPS alongside a 100K-token prompt would be a notable result, since such context lengths matter for processing long documents, codebases, or multi-turn conversations without truncation. One caveat: llama-bench reports prompt processing and generation as separate tests, so the tg720 figure does not by itself show generation speed with the full 100K context loaded.
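The idea of activating only a subset of experts can be illustrated with a generic top-k routing sketch. This is not Qwen's actual implementation: the expert count, the value of k, the scalar "experts", and the function names are all assumptions chosen to keep the example small.

```python
import math
import random


def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Weighted sum over the selected experts only; the rest are never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Toy setup (hypothetical sizes): 8 experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_top_k(scores, TOP_K)
output = moe_forward(1.0, experts, scores, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS

print("selected experts:", [i for i, _ in selected])
print(f"fraction of experts evaluated: {active_fraction:.0%}")
```

In this toy configuration only a quarter of the experts run for a given token, which is the source of the compute savings; all expert weights still have to be stored somewhere, so memory requirements follow the total parameter count rather than the active one.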

For developers and AI enthusiasts, this benchmark points to a meaningful shift. Running 35B-class MoE models locally at high throughput has typically called for GPUs with 24 GB or more of VRAM, such as the RTX 4090, or data-center cards like the H100. If independent testing confirms the RTX 5060 Ti result, it would put this class of model within reach of individual users, researchers, and small businesses on a mid-range card.
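The VRAM pressure behind that 24 GB rule of thumb can be estimated with simple arithmetic. The sketch below assumes a flat 35 billion parameters (taken from the model name) and uniform bits per weight; the article does not state which quantization was used, and the estimate ignores the KV cache, activations, and runtime overhead, all of which add to the total.

```python
GIB = 1024 ** 3

TOTAL_PARAMS = 35e9        # assumption: "35B" read as 35 billion weights
REPORTED_VRAM_GIB = 15.59  # VRAM figure quoted in the article


def weight_size_gib(params: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone at a uniform bit width."""
    return params * bits_per_weight / 8 / GIB


# Hypothetical bit widths for illustration only.
for bits in (16, 8, 4, 3):
    size = weight_size_gib(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= REPORTED_VRAM_GIB else "does not fit"
    print(f"{bits:>2} bits/weight: {size:6.2f} GiB ({verdict} in {REPORTED_VRAM_GIB} GiB)")
```

Under these assumptions the weights alone come to roughly 16.3 GiB at 4 bits per weight, slightly more than the reported 15.59 GiB. The run therefore presumably used a smaller quantization or kept part of the model in system memory; the article does not say which.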

The RTX 5060 Ti itself is a shipping product: NVIDIA launched it in April 2025 in 8 GB and 16 GB variants, and the 15.59 GiB reported here matches the latter. What remains unverified is the benchmark result. If it holds up, it would strengthen NVIDIA's position in the local AI inference market with a mid-tier card, and competitors like AMD and Intel may face renewed pressure to accelerate their own AI inference optimizations.

As local AI becomes more central to privacy-conscious applications, from secure enterprise chatbots to offline medical diagnostics, benchmarks like this show how quickly the hardware floor is dropping. The Qwen 3.5 35B MoE model’s reported performance on a consumer-grade GPU suggests that "AI on your desktop" is no longer speculative. The remaining work lies in optimizing software stacks, quantization techniques, and driver-level support so that such performance is consistent and accessible across platforms.

Further testing is required to validate the result and confirm that it is reproducible. Even so, this single benchmark has already prompted discussion across AI communities about how far local, high-performance inference can go on mid-range hardware.

Sources: www.reddit.com
