Qwen3.5-35B-A3B Runs on Triple RTX 3090s, Sets New Benchmarks for Local AI Inference
A detailed test by a Reddit user reveals that the Qwen3.5-35B-A3B model achieves remarkable inference speeds on consumer-grade NVIDIA RTX 3090 hardware, challenging assumptions about local deployment of large language models. The findings, shared in the r/LocalLLaMA community, highlight both breakthrough performance and lingering stability issues.

In a landmark demonstration of local AI deployment, a user in the r/LocalLLaMA subreddit has successfully run the 35-billion-parameter Qwen3.5-35B-A3B model on a triple RTX 3090 setup, achieving unprecedented inference speeds for a model of its scale on consumer hardware. The test, conducted using the GGUF quantized format and the llama.cpp inference engine, shows that advanced large language models (LLMs) are no longer confined to cloud servers or enterprise-grade GPUs.
According to the Reddit post by user /u/jacek2023, the Qwen3.5-35B-A3B model, downloaded in Q8_0 quantization from Hugging Face, was loaded across three NVIDIA GeForce RTX 3090 GPUs, each with 24GB of VRAM and compute capability 8.6. The system’s CUDA backend initialized all three devices, allowing the model to be split across them. Performance metrics from llama-bench showed a prompt processing rate of 1,324 tokens per second (±2.17) for a 512-token prompt, and 93.2 tokens per second (±2.17) for 128-token generation, figures that appear competitive with many cloud-based API services.
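To put those two rates in practical terms, the short sketch below converts them into approximate wall-clock time for a single request. It is a back-of-the-envelope estimate, not part of the original test: the function name `estimate_seconds` is hypothetical, and it assumes the reported rates stay constant, which in practice they will not as context length grows.

```python
# Rough latency estimate from the llama-bench figures quoted in the article.
# Assumption: throughput is constant over the whole request, which real
# workloads will not match exactly (rates tend to drop with longer contexts).

PROMPT_RATE_TPS = 1324.0  # prompt processing, tokens/s (512-token prompt test)
GEN_RATE_TPS = 93.2       # token generation, tokens/s (128-token test)


def estimate_seconds(prompt_tokens, gen_tokens,
                     prompt_rate=PROMPT_RATE_TPS, gen_rate=GEN_RATE_TPS):
    """Return (prefill, decode, total) time in seconds for one request."""
    prefill = prompt_tokens / prompt_rate
    decode = gen_tokens / gen_rate
    return prefill, decode, prefill + decode


prefill, decode, total = estimate_seconds(512, 128)
print(f"prefill {prefill:.2f}s, decode {decode:.2f}s, total {total:.2f}s")
```

Under these assumptions, a 512-token prompt followed by a 128-token reply takes a little under two seconds, with generation rather than prompt processing accounting for most of the time.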
The model, identified in the logs as "qwen35moe ?B Q8_0" (likely a naming artifact), occupies 34.36 GiB of memory and contains approximately 34.66 billion parameters. The GPU layer setting (ngl = 99) asks llama.cpp to offload up to 99 layers; when that exceeds the model's layer count, every layer is placed on the GPUs, minimizing CPU bottlenecks. The "A3B" suffix, by Qwen's naming convention, indicates roughly 3 billion parameters active per token, which likely helps explain the high generation speed for a model of this total size. Together, these results suggest that the GGUF format, combined with recent optimizations in llama.cpp, has matured to the point where even complex MoE (Mixture of Experts) architectures can be deployed locally without significant performance degradation.
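The reported size and parameter count can be sanity-checked against the Q8_0 format with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, it assumes the standard Q8_0 block layout (32 8-bit weights plus one 16-bit scale), and it treats each card's 24GB as 24 GiB of usable memory, which slightly overstates what is actually available.

```python
# Sanity check: does 34.36 GiB for ~34.66B parameters match Q8_0?
# Assumptions: Q8_0 stores blocks of 32 int8 weights plus one fp16 scale,
# and each RTX 3090 contributes a nominal 24 GiB.

GIB = 1024 ** 3


def bits_per_weight(size_gib, n_params):
    """Average bits stored per parameter for a model file of the given size."""
    return size_gib * GIB * 8 / n_params


def q8_0_nominal_bpw(block_size=32, scale_bytes=2):
    """Nominal bits per weight for a Q8_0 block (1 byte per weight + scale)."""
    return (block_size + scale_bytes) * 8 / block_size


measured = bits_per_weight(34.36, 34.66e9)
nominal = q8_0_nominal_bpw()
headroom_gib = 3 * 24 - 34.36  # VRAM left for KV cache, buffers, etc.

print(f"measured {measured:.2f} bpw, nominal Q8_0 {nominal:.2f} bpw")
print(f"approximate VRAM headroom: {headroom_gib:.2f} GiB")
```

The measured figure lands very close to the nominal 8.5 bits per weight, which is consistent with a Q8_0 file, and the weights take up a bit less than half of the combined VRAM, leaving the rest for context and working buffers.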
Notably, the same user tested the smaller Qwen3.5-27B-Q8_0 model, downloaded from lmstudio-community’s Hugging Face repository, but encountered critical instability. While llama-server ran, at least briefly, llama-bench crashed during testing, highlighting inconsistent compatibility across model variants and tooling versions. This underscores a persistent challenge in the local LLM ecosystem: while quantized models are increasingly accessible, toolchain reliability remains uneven. That llama-bench crashed while llama-server partially worked suggests the fault may lie in the benchmarking utility rather than in the model itself, and points to the need for more robust testing frameworks.
These results have significant implications for developers, researchers, and privacy-conscious enterprises. The ability to run a 35B-parameter model locally on readily available hardware reduces reliance on proprietary cloud APIs, enhances data sovereignty, and lowers long-term operational costs. For edge AI applications — such as on-premise customer service bots, secure document analysis, or real-time translation systems — this performance threshold represents a tipping point.
However, while the raw speed is impressive, the report does not test latency consistency, memory fragmentation, or thermal stability under sustained load. The use of three high-power GPUs also raises questions about energy efficiency and scalability. Future work should explore whether similar performance can be achieved on fewer, more power-efficient GPUs such as the RTX 4090 or upcoming consumer-grade AI accelerators.
The broader community has responded with enthusiasm, with comments suggesting replication attempts on single RTX 4090 cards and interest in benchmarking against other open models such as Llama 3 and Mistral-7B. As open-source AI continues its rapid evolution, this test serves as a compelling proof of concept: powerful, locally hosted LLMs are no longer theoretical but operational.


