Qwen 3.5 35B MoE Achieves 44 TPS on RTX 5060 Ti, Redefining Local AI Performance
A Reddit user has demonstrated that Alibaba’s Qwen 3.5 35B MoE model can generate text at over 44 tokens per second on the rumored RTX 5060 Ti, leveraging CUDA backend optimizations. The benchmark, conducted with a 100,000-token context window, suggests a major leap in affordable, high-performance local AI inference.

A benchmark shared on the r/LocalLLaMA subreddit reports that Alibaba’s Qwen 3.5 35B MoE (Mixture of Experts) large language model can reach 44.32 tokens per second (TPS) on a purported NVIDIA GeForce RTX 5060 Ti GPU, a performance level previously thought to require enterprise-grade hardware. The test, conducted by user /u/maho_Yun, used the llama-bench utility with a 100,000-token context, suggesting that long-context inference at usable speeds is now feasible on consumer-grade hardware.
According to the benchmark data, when running on NVIDIA’s CUDA backend, the model processed the 100,000-token prompt (pp100000) at 1,304.93 ± 4.10 tokens per second and generated text (tg720) at 44.32 ± 2.16 TPS. The Vulkan backend reached 41.35 TPS, which puts CUDA roughly 7% ahead. The gap is consistent with NVIDIA’s proprietary CUDA ecosystem retaining an edge for AI inference workloads, even on emerging consumer architectures, although it is not much larger than the ±2.16 TPS spread reported for the CUDA run.
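To put the quoted figures in perspective, the short Python sketch below works through the arithmetic. It uses only the numbers reported in the post; the token counts are read off the test labels pp100000 and tg720, and it assumes the Vulkan figure refers to text generation. Treating the two rates as a wall-clock estimate is also an assumption, since the benchmark measures each phase separately.

```python
# Figures quoted in the article (llama-bench output as reported in the post).
PP_TPS = 1304.93         # prompt processing on CUDA, tokens/s (pp100000)
TG_CUDA_TPS = 44.32      # text generation on CUDA, tokens/s (tg720)
TG_CUDA_STD = 2.16       # reported spread of the CUDA generation figure
TG_VULKAN_TPS = 41.35    # Vulkan figure (assumed here to be text generation)

PROMPT_TOKENS = 100_000  # assumption: taken from the test label pp100000
GEN_TOKENS = 720         # assumption: taken from the test label tg720


def relative_gain(fast: float, slow: float) -> float:
    """Fractional speedup of `fast` over `slow`."""
    return (fast - slow) / slow


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time needed to process `tokens` at a constant rate."""
    return tokens / tokens_per_second


gain = relative_gain(TG_CUDA_TPS, TG_VULKAN_TPS)
gap_in_spreads = (TG_CUDA_TPS - TG_VULKAN_TPS) / TG_CUDA_STD
prefill_s = seconds_for(PROMPT_TOKENS, PP_TPS)
decode_s = seconds_for(GEN_TOKENS, TG_CUDA_TPS)

print(f"CUDA vs Vulkan generation: {gain:.1%} faster")
print(f"Gap as a multiple of the reported CUDA spread: {gap_in_spreads:.2f}")
print(f"Ingesting {PROMPT_TOKENS:,} prompt tokens: {prefill_s:.1f} s")
print(f"Generating {GEN_TOKENS} tokens: {decode_s:.1f} s")
```

On these numbers the CUDA lead is about 7%, and reading a 100,000-token prompt would take a little over a minute before the first generated token appears.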
The system configuration included an AMD Ryzen 7 9700X CPU running at 5.55 GHz, 47.61 GiB of system memory, and a GPU identified as the NVIDIA GeForce RTX 5060 Ti with 15.59 GiB of dedicated VRAM. That figure, if accurate, is consistent with a nominal 16 GB card: the same capacity as the 16 GB variant of the RTX 4060 Ti and more than the 12 GB of the RTX 4070. Notably, the system also listed a "GameViewer Virtual Display Adapter," suggesting the test may have been conducted in a remote or virtualized environment, which could affect real-world reproducibility.
Qwen 3.5 35B MoE, released earlier this year by Alibaba’s Tongyi Lab, is designed to balance performance and efficiency through its Mixture of Experts architecture, in which a router activates only a subset of the model’s expert sub-networks for each token rather than running every parameter. This design reduces the computation per token while aiming to preserve output quality, which makes it well suited to resource-constrained devices. Sustaining over 40 TPS in a test built around a 100K-token context is a notable result; such context lengths matter for processing lengthy documents, codebases, or multi-turn conversations without truncation.
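The idea of activating only some experts can be illustrated with a toy top-k router. The sketch below is a generic illustration of Mixture of Experts routing, not Qwen’s implementation: the expert count, the value of k, the toy "experts," and the random router scores are all hypothetical, and a real model computes the scores with a learned layer.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts; renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, router_scores, k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    selected = route_top_k(router_scores, k)
    out = [0.0] * len(x)
    for idx, weight in selected:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in selected]


# Hypothetical toy setup: 8 experts, 2 active per token (not Qwen's real configuration).
NUM_EXPERTS, TOP_K = 8, 2
# Each toy expert just scales its input by a different constant.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

rng = random.Random(0)
token = [1.0, -0.5, 0.25]
# Stand-in for a learned router: random scores, one per expert.
scores = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]

output, used = moe_layer(token, experts, scores, TOP_K)
print(f"experts evaluated: {sorted(used)} ({len(used)} of {NUM_EXPERTS})")
```

Only the selected experts are evaluated for a given token, which is why the compute per token can be far lower than the total parameter count suggests, even though all experts still have to be stored.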
For developers and AI enthusiasts, this benchmark signals a potential paradigm shift. Until now, running 35B-class MoE models locally with high throughput has generally required GPUs with 24 GB or more of VRAM, such as the RTX 4090 or H100. The apparent performance of the RTX 5060 Ti, if confirmed by independent testing, could democratize access to enterprise-grade AI capabilities for individual users, researchers, and small businesses.
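A rough estimate of weight storage shows why VRAM capacity is the limiting factor for a model of this size. The sketch below is a back-of-the-envelope calculation under loud assumptions: it treats "35B" as exactly 35 billion parameters, uses a uniform number of bits per weight, and ignores the KV cache, activations, and runtime overhead. The article does not state which quantization the benchmark used, so the bit widths are illustrative only.

```python
GIB = 2**30  # bytes per GiB


def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / GIB


PARAMS = 35e9          # assumption: nominal "35B" taken literally
VRAM_GIB = 15.59       # dedicated VRAM reported in the benchmark

for bits in (16, 8, 4, 3):  # illustrative bit widths, not from the post
    size = weight_footprint_gib(PARAMS, bits)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{bits:>2} bits/weight: {size:6.2f} GiB of weights, "
          f"{verdict} in {VRAM_GIB} GiB")
```

Under these assumptions, weights at 4 bits each already slightly exceed the reported 15.59 GiB, so a run like this one would presumably depend on more aggressive quantization or on keeping part of the model in system memory. The post’s actual setup would need to be checked to say which.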
NVIDIA has not officially confirmed the existence or specifications of the RTX 5060 Ti as of this report. If the GPU is real and performs as shown, it may represent a strategic move by NVIDIA to extend its dominance in the local AI inference market with a mid-tier, high-efficiency product. Competitors like AMD and Intel may face renewed pressure to accelerate their own AI inference optimizations.
As local AI becomes increasingly central to privacy-conscious applications — from secure enterprise chatbots to offline medical diagnostics — benchmarks like this underscore the accelerating pace of innovation. The Qwen 3.5 35B MoE model’s performance on what may be a consumer-grade GPU suggests that the era of "AI on your desktop" is no longer speculative. The next frontier lies in optimizing software stacks, quantization techniques, and driver-level support to make such performance consistent and accessible across platforms.
Further testing is required to validate the hardware claims and ensure reproducibility. However, this single benchmark has already ignited discussions across AI communities — and may herald a new chapter in decentralized, high-performance artificial intelligence.


