Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context

In a landmark achievement for decentralized artificial intelligence, an anonymous contributor to the r/LocalLLaMA subreddit has demonstrated that the massive Qwen 3.5-122B language model can be run locally on consumer-grade hardware with unprecedented context length and stability. Using a dual-GPU configuration totaling 72GB of VRAM and the llama.cpp inference engine, the system achieves approximately 50–60 tokens per second while maintaining integrity across context windows of up to 90,000 tokens—a feat previously thought to require enterprise-grade data center infrastructure.

The model in question, Qwen3.5-122B-A10B-UD-Q4_K_XL, is a quantized variant of Alibaba’s Qwen 3.5 series, developed by Unsloth and distributed via Hugging Face in GGUF format. This version reduces the model’s memory footprint through 4-bit quantization while preserving reasoning capabilities, making it feasible to deploy on systems with limited VRAM. The contributor’s setup combines an NVIDIA RTX A6000 (48GB VRAM) with an RTX 3090 Ti (24GB VRAM), powered by a 24-core AMD Ryzen Threadripper 3960X and 64 GiB of DDR4 system memory.

Running within a Docker container based on the official ghcr.io/ggml-org/llama.cpp:server-cuda image (version b8148, compiled February 25th), the deployment leverages llama.cpp’s advanced tensor splitting and GPU offloading features. Key flags include --split-mode layer and --tensor-split 2,1, which distribute model layers unevenly across the two GPUs to optimize memory utilization and reduce latency. The -ngl 999 flag ensures nearly all layers are offloaded to the GPU, minimizing CPU bottlenecks. Additional optimizations such as --flash-attn on, --cache-type-k q8_0, and --cache-type-v q8_0 enhance attention efficiency and key-value cache performance, critical for long-context inference.

Performance metrics reveal a consistent output rate of 50–60 tokens per second under real-world usage conditions, including integration with OpenCode for code generation and web search tools. The contributor confirmed system stability through stress tests using OpenCode prompts that exceeded 90,000 tokens—approaching the theoretical 105,000-token context limit reported by the llama.cpp web interface. Notably, no memory errors, slowdowns, or hallucination spikes were observed during extended testing, suggesting the configuration is robust enough for production-grade local applications.

The use of --reasoning-format deepseek and --jinja enables structured reasoning outputs and template-based prompt formatting, improving reliability for complex tasks such as multi-step problem solving and document analysis. The model’s responsiveness under high load underscores the growing viability of open-source inference backends like llama.cpp in challenging the dominance of cloud-based LLM APIs.

This deployment is significant not only for its technical achievements but also for its implications in AI democratization. By proving that a 122-billion-parameter model can be run locally without specialized hardware or cloud subscriptions, this setup offers a blueprint for researchers, developers, and privacy-conscious organizations seeking to avoid data leakage and vendor lock-in. While formal benchmarking via llama-bench is pending, the results already challenge assumptions about the hardware requirements for large-scale local AI.

Community feedback has been enthusiastic, with users requesting tests on other quantization levels, multi-user concurrency, and energy efficiency metrics. The contributor has invited further collaboration, signaling a growing trend of grassroots innovation in the open LLM ecosystem. As quantization techniques and inference engines continue to evolve, this setup may become a standard reference for local deployment of next-generation models.

AI-Powered Content

Sources: www.reddit.com

Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context

Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context

summarize3-Point Summary

psychology_altWhy It Matters

Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context

AI Terms in This Article

recommendRelated Articles

Attention Residuals (2026): Moonshot AI's Breakthrough for Efficient Transformer Scaling

Amazon Nova 2 Lite Content Moderation (2026): How New Prompts Beat Larger AI Models

Cursor Composer 2 AI Model (2026 Review): Beats Claude Opus 4.6 with 86% Lower Cost & Superior Be...