Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context
An anonymous AI enthusiast has successfully deployed the massive Qwen 3.5-122B model on consumer-grade hardware, achieving stable inference at 90,000-token context lengths using a dual-GPU setup and llama.cpp. The configuration marks a significant milestone for decentralized large language model deployment.

Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context
summarize3-Point Summary
- 1An anonymous AI enthusiast has successfully deployed the massive Qwen 3.5-122B model on consumer-grade hardware, achieving stable inference at 90,000-token context lengths using a dual-GPU setup and llama.cpp. The configuration marks a significant milestone for decentralized large language model deployment.
- 2Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context In a landmark achievement for decentralized artificial intelligence, an anonymous contributor to the r/LocalLLaMA subreddit has demonstrated that the massive Qwen 3.5-122B language model can be run locally on consumer-grade hardware with unprecedented context length and stability.
- 3Using a dual-GPU configuration totaling 72GB of VRAM and the llama.cpp inference engine, the system achieves approximately 50–60 tokens per second while maintaining integrity across context windows of up to 90,000 tokens—a feat previously thought to require enterprise-grade data center infrastructure.
psychology_altWhy It Matters
- check_circleThis update has direct impact on the Yapay Zeka Modelleri topic cluster.
- check_circleThis topic remains relevant for short-term AI monitoring.
- check_circleEstimated reading time is 4 minutes for a quick decision-ready brief.
Breakthrough in Local AI: Qwen 3.5-122B Run on 72GB VRAM with 90K Context
In a landmark achievement for decentralized artificial intelligence, an anonymous contributor to the r/LocalLLaMA subreddit has demonstrated that the massive Qwen 3.5-122B language model can be run locally on consumer-grade hardware with unprecedented context length and stability. Using a dual-GPU configuration totaling 72GB of VRAM and the llama.cpp inference engine, the system achieves approximately 50–60 tokens per second while maintaining integrity across context windows of up to 90,000 tokens—a feat previously thought to require enterprise-grade data center infrastructure.
The model in question, Qwen3.5-122B-A10B-UD-Q4_K_XL, is a quantized variant of Alibaba’s Qwen 3.5 series, developed by Unsloth and distributed via Hugging Face in GGUF format. This version reduces the model’s memory footprint through 4-bit quantization while preserving reasoning capabilities, making it feasible to deploy on systems with limited VRAM. The contributor’s setup combines an NVIDIA RTX A6000 (48GB VRAM) with an RTX 3090 Ti (24GB VRAM), powered by a 24-core AMD Ryzen Threadripper 3960X and 64 GiB of DDR4 system memory.
Running within a Docker container based on the official ghcr.io/ggml-org/llama.cpp:server-cuda image (version b8148, compiled February 25th), the deployment leverages llama.cpp’s advanced tensor splitting and GPU offloading features. Key flags include --split-mode layer and --tensor-split 2,1, which distribute model layers unevenly across the two GPUs to optimize memory utilization and reduce latency. The -ngl 999 flag ensures nearly all layers are offloaded to the GPU, minimizing CPU bottlenecks. Additional optimizations such as --flash-attn on, --cache-type-k q8_0, and --cache-type-v q8_0 enhance attention efficiency and key-value cache performance, critical for long-context inference.
Performance metrics reveal a consistent output rate of 50–60 tokens per second under real-world usage conditions, including integration with OpenCode for code generation and web search tools. The contributor confirmed system stability through stress tests using OpenCode prompts that exceeded 90,000 tokens—approaching the theoretical 105,000-token context limit reported by the llama.cpp web interface. Notably, no memory errors, slowdowns, or hallucination spikes were observed during extended testing, suggesting the configuration is robust enough for production-grade local applications.
The use of --reasoning-format deepseek and --jinja enables structured reasoning outputs and template-based prompt formatting, improving reliability for complex tasks such as multi-step problem solving and document analysis. The model’s responsiveness under high load underscores the growing viability of open-source inference backends like llama.cpp in challenging the dominance of cloud-based LLM APIs.
This deployment is significant not only for its technical achievements but also for its implications in AI democratization. By proving that a 122-billion-parameter model can be run locally without specialized hardware or cloud subscriptions, this setup offers a blueprint for researchers, developers, and privacy-conscious organizations seeking to avoid data leakage and vendor lock-in. While formal benchmarking via llama-bench is pending, the results already challenge assumptions about the hardware requirements for large-scale local AI.
Community feedback has been enthusiastic, with users requesting tests on other quantization levels, multi-user concurrency, and energy efficiency metrics. The contributor has invited further collaboration, signaling a growing trend of grassroots innovation in the open LLM ecosystem. As quantization techniques and inference engines continue to evolve, this setup may become a standard reference for local deployment of next-generation models.


