Qwen3.5-397B Hits 20 Tokens/sec on RTX 4090 — New LLM Benchmark Record (2026)

A newly reported benchmark shows the Qwen3.5-397B large language model, quantized to Q4_K_M, reaching 20.00 tokens per second (tokens/sec) in text generation (TG) on a single NVIDIA GeForce RTX 4090 GPU. The result was achieved without a multi-GPU cluster, and it suggests that local inference for very large models is becoming a practical alternative to cloud-based AI deployment.

How Qwen3.5-397B Achieved 20 Tokens/sec on RTX 4090

The test was conducted by an anonymous AI researcher on a high-end workstation with an AMD EPYC 7532 32-core CPU, 256GB of DDR4 RAM, and a single RTX 4090 connected via PCIe 4.0 x16. The model, hosted on Hugging Face as part of the open-source Qwen series, was run with llama-bench using -ngl 999, which asks the backend to offload every layer to the GPU. Prompt processing reached 717.87 tokens/sec at an 8,192-token context length.
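A run along these lines could be set up roughly as follows. This is only a sketch: the model path is hypothetical (the article does not name the GGUF file), and the -p and -n values are assumptions chosen to match the reported 8,192-token prompt test. Only -ngl 999 and the three repetitions come from the article. The script builds and prints the command; it does not execute it.

```python
import shlex

# Hypothetical local path; the article does not name the GGUF file.
MODEL_PATH = "models/qwen3.5-397b-q4_k_m.gguf"


def build_llama_bench_cmd(model_path: str) -> list[str]:
    """Assemble a llama-bench argument list (nothing is executed here)."""
    return [
        "llama-bench",
        "-m", model_path,  # model file to benchmark
        "-ngl", "999",     # request GPU offload of all layers (from the article)
        "-p", "8192",      # prompt-processing length (assumption)
        "-n", "128",       # tokens to generate in the TG test (assumption)
        "-r", "3",         # repetitions; the article reports three runs
    ]


cmd = build_llama_bench_cmd(MODEL_PATH)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to subprocess.run later without shell quoting problems.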

Even under heavy memory pressure, with the context extended to 200,000 tokens, the model sustained 16.97 tokens/sec, about 85% of its 20.00 tokens/sec baseline. This resilience is attributed to architectural optimizations in the model and to 4-bit quantization, which reduces memory demands while largely preserving model quality.
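The size of that slowdown can be worked out directly from the two reported figures. The sketch below uses only the article's numbers; the 1,000-token reply length is an arbitrary illustration.

```python
BASELINE_TPS = 20.00   # reported text-generation speed
LONG_CTX_TPS = 16.97   # reported speed at a 200,000-token context

# Fraction of the baseline speed that survives at long context.
retained = LONG_CTX_TPS / BASELINE_TPS
slowdown_pct = (1.0 - retained) * 100.0

# Wall-clock time for an illustrative 1,000-token reply at each speed.
REPLY_TOKENS = 1000
baseline_seconds = REPLY_TOKENS / BASELINE_TPS
long_ctx_seconds = REPLY_TOKENS / LONG_CTX_TPS

print(f"retained: {retained:.3f}, slowdown: {slowdown_pct:.2f}%")
print(f"1,000 tokens: {baseline_seconds:.1f}s vs {long_ctx_seconds:.1f}s")
```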

The Role of Quantization and vLLM Optimization

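Q4_K_M is one of llama.cpp's k-quant formats, a block-wise 4-bit scheme that stores per-block scales and minimums and keeps selected tensors at higher precision. At roughly 4.8 bits per weight, it shrinks the 397-billion-parameter model from about 794GB at 16-bit precision to roughly 240GB. That is still far more than the RTX 4090's 24GB of VRAM, so most of the weights have to sit in the workstation's 256GB of system RAM, with offloading deciding what runs on the GPU. The benchmark also tested the ik_llama backend, which slightly improved text generation to 20.86 tokens/sec but reduced prompt processing speed, showing how much the choice of backend affects inference efficiency.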
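A back-of-the-envelope estimate of the weight footprint is parameter count times bits per weight. In the sketch below, the 4.85 bits-per-weight figure for Q4_K_M is an approximation (the exact value depends on the tensor mix), and the function name is illustrative. KV cache and activations are not counted.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 397e9       # parameter count from the model name
Q4_K_M_BPW = 4.85      # approximate effective bits per weight (assumption)
VRAM_GB = 24           # RTX 4090

fp16_gb = weight_footprint_gb(N_PARAMS, 16)
q4_gb = weight_footprint_gb(N_PARAMS, Q4_K_M_BPW)

# Share of the quantized weights that could fit in VRAM at the very most.
max_gpu_share = VRAM_GB / q4_gb

print(f"16-bit: {fp16_gb:.0f} GB, Q4_K_M: {q4_gb:.0f} GB")
print(f"at most {max_gpu_share:.0%} of the weights fit in {VRAM_GB} GB of VRAM")
```

Even a pure 4-bit encoding with no scale overhead would need 397e9 * 4 / 8 bytes, or about 198.5GB, so the quantized model cannot fit in tens of gigabytes.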

Long contexts also put pressure on KV-cache memory. vLLM's PagedAttention mechanism addresses this by storing the cache in fixed-size blocks, which limits fragmentation. llama-bench, however, is part of llama.cpp and does not run through vLLM, so the reported figures reflect llama.cpp's own memory management. Combined with NVIDIA's Ada Lovelace architecture and the RTX 4090's memory bandwidth, these software-level techniques help make high-throughput local LLM inference feasible.
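The core idea behind PagedAttention, allocating KV-cache memory in fixed-size blocks tracked by a per-sequence block table, can be illustrated with a toy allocator. This is a simplified sketch, not vLLM's implementation; the class and method names are invented for illustration, and no tensors are stored.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention (not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # physical block ids
        self.block_tables: dict[str, list[int]] = {}    # sequence -> blocks
        self.lengths: dict[str, int] = {}               # sequence -> tokens

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # The last block is full (or none exists): take a fresh one.
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], offset

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]
blocks_used = len(cache.block_tables["a"])
cache.free("a")
```

Because blocks need not be contiguous, a sequence wastes at most one partially filled block, and memory freed by one sequence is immediately reusable by another.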

Practical Implications for Local AI Deployment

This benchmark has profound implications for developers, researchers, and enterprises seeking to deploy LLMs locally:

Reduced Latency: Local inference eliminates cloud round-trip delays, critical for real-time applications like chatbots and voice assistants.
Lower Cost: Power consumption stayed at roughly 400W for the entire system, a fraction of the kilowatt-scale draw of multi-H100 servers.
Privacy & Compliance: Sensitive data never leaves the local environment, which simplifies compliance with GDPR and HIPAA.
Scalability: Organizations can now run enterprise-grade models on single-GPU workstations, reducing cloud dependency.
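The power figure can be turned into an energy cost per token. The sketch below assumes the ~400W system draw applies while generating at the reported 20.00 tokens/sec, which the article does not state explicitly.

```python
SYSTEM_POWER_W = 400.0   # reported whole-system draw (approximate)
TG_TOKENS_PER_S = 20.00  # reported text-generation speed

# Watts are joules per second, so dividing by tokens/sec gives joules/token.
joules_per_token = SYSTEM_POWER_W / TG_TOKENS_PER_S

# One kilowatt-hour is 3.6 million joules.
JOULES_PER_KWH = 3.6e6
tokens_per_kwh = JOULES_PER_KWH / joules_per_token

print(f"{joules_per_token:.0f} J/token, {tokens_per_kwh:,.0f} tokens per kWh")
```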

While Apple’s rumored M5 Ultra and ASRock’s NUC Ultra 300 series with Intel Panther Lake and Arc B390 graphics represent emerging alternatives, the RTX 4090 remains the most accessible platform for high-throughput local LLM inference in 2026.

Methodology and Benchmark Transparency

The benchmark followed standardized methodology from Papers With Code, using the llama-bench tool with consistent parameters across all tests. Model weights were verified via Hugging Face’s official Qwen3.5-397B repository. All tests were repeated three times with standard deviation under 1.2%, ensuring reliability.
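The repeatability check described here amounts to comparing the sample standard deviation with the mean. The three run values below are hypothetical placeholders, not figures from the benchmark logs; only the 1.2% threshold comes from the article.

```python
import statistics

# Hypothetical per-run results in tokens/sec; NOT taken from the benchmark.
hypothetical_runs = [19.85, 20.00, 20.15]

mean_tps = statistics.mean(hypothetical_runs)
stdev_tps = statistics.stdev(hypothetical_runs)  # sample standard deviation

# Relative standard deviation, as a percentage of the mean.
relative_stdev_pct = stdev_tps / mean_tps * 100.0

THRESHOLD_PCT = 1.2  # bound reported in the article
within_threshold = relative_stdev_pct < THRESHOLD_PCT

print(f"mean {mean_tps:.2f} t/s, stdev {relative_stdev_pct:.2f}% of mean")
```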

For full reproducibility, see the benchmark logs on Hugging Face and NVIDIA’s RTX 4090 technical specs.

Sources: Hugging Face Qwen3.5-397B • NVIDIA RTX 4090 Specs • Papers With Code - llama-bench • 4-bit Quantization Paper • vLLM GitHub
