Alibaba Unveils Qwen3.5-397B-A17B NVFP4: A 17B Active Parameter Giant for Agentic AI

Alibaba's Qwen3.5-397B-A17B, quantized to NVIDIA's FP4 format, delivers near-full-model performance with drastically reduced memory demands. Designed for agentic AI workflows, it supports 262K-token contexts, multimodal inputs, and speculative decoding on Blackwell GPUs.

Alibaba’s Tongyi Lab has released Qwen3.5-397B-A17B quantized to NVIDIA’s Model Optimizer FP4 format, known as NVFP4, a notable advance for open-weight large language models. Announced by the Qwen team on February 14, 2026, the release marks a significant step toward practical deployment of massive-scale models in enterprise and research environments. According to Reuters, the model is designed for the "agentic AI era," targeting autonomous, multi-step reasoning and tool-use workflows.

The model uses a Mixture-of-Experts (MoE) architecture with 397 billion total parameters but only 17 billion active per token: each token is routed to 10 of 512 experts, selected dynamically during inference. This sparsity sharply reduces per-token compute without sacrificing quality; according to Qwen.ai’s technical documentation, the quantized model retains roughly 99% of the accuracy of the unquantized checkpoint. The NVFP4 quantization, developed in partnership with NVIDIA, compresses the checkpoint to approximately 224GB, enabling deployment on hardware previously considered insufficient for models of this scale.
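In rough outline, this kind of top-k routing works like the sketch below. Only the 512-expert pool and the top-10 selection come from the figures above; the gate design, hidden size, and token count are illustrative placeholders, not Qwen3.5’s actual implementation.

```python
# Minimal sketch of top-k Mixture-of-Experts routing with a standard softmax gate.
# Only NUM_EXPERTS and TOP_K are taken from the article; everything else is illustrative.
import torch

NUM_EXPERTS = 512   # total expert pool (per the article)
TOP_K = 10          # experts activated per token (per the article)
HIDDEN = 1024       # placeholder hidden size, far smaller than the real model

def route(hidden_states: torch.Tensor, gate: torch.nn.Linear):
    """Return the expert indices and normalized weights for each token."""
    logits = gate(hidden_states)                      # [tokens, NUM_EXPERTS]
    topk_logits, topk_idx = logits.topk(TOP_K, dim=-1)
    weights = torch.softmax(topk_logits, dim=-1)      # renormalize over the selected experts
    return topk_idx, weights

gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
tokens = torch.randn(4, HIDDEN)                       # 4 example token embeddings
idx, w = route(tokens, gate)
print(idx.shape, w.shape)  # torch.Size([4, 10]) torch.Size([4, 10])
```

Because only the selected experts’ feed-forward weights are touched for each token, per-token compute scales with the 17 billion active parameters rather than the full 397 billion.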

Deployed via SGLang, a high-performance inference engine with strong MoE support, the Qwen3.5 NVFP4 variant needs only four NVIDIA B300 GPUs (288GB of HBM3e each) to reach roughly 120 tokens per second at the full 262,144-token context length. For organizations using RTX PRO 6000 GPUs (96GB each), a configuration of eight cards is recommended to avoid out-of-memory errors. The model accepts native multimodal inputs, covering text, images, and video, which suits applications in robotics, automated customer service, and real-time document analysis.
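A back-of-the-envelope calculation shows why the 4-bit checkpoint fits these budgets. The block size and scale width below reflect NVFP4’s published microscaling layout (16-element blocks, each with an 8-bit scale); the per-GPU splits simply restate the configurations above, and the figures are estimates rather than measured numbers.

```python
# Rough NVFP4 checkpoint-size estimate; block size 16 with one 8-bit scale per block
# follows NVIDIA's published NVFP4 layout. The GPU counts restate the article's setups.
TOTAL_PARAMS = 397e9

weights_gb = TOTAL_PARAMS * 4 / 8 / 1e9      # 4 bits per parameter
scales_gb  = TOTAL_PARAMS / 16 * 1 / 1e9     # one 8-bit scale per 16 values
checkpoint = weights_gb + scales_gb          # ignores layers typically kept in higher precision
print(f"estimated checkpoint: {checkpoint:.0f} GB")   # ~223 GB, close to the stated 224 GB

# Weight shard per GPU in the two configurations described above
# (before KV cache and activation overhead, which dominate at 262K context):
for gpus, vram in [(4, 288), (8, 96)]:       # B300 vs. RTX PRO 6000 setups
    print(f"{gpus} GPUs: ~{checkpoint / gpus:.0f} GB of weights per GPU, {vram} GB available")
```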

One of the most notable innovations is the built-in Multi-Token Prediction head, which enables experimental speculative decoding. When enabled with the SGLANG_ENABLE_SPEC_V2=1 flag, the model can predict and validate up to four future tokens in parallel, boosting throughput by up to 10% by overlapping CUDA operations. This feature, while still experimental, is particularly valuable in low-concurrency environments where latency matters more than batch throughput.
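The accept-or-reject loop behind speculative decoding can be sketched generically. The draft-then-verify structure below is the textbook greedy formulation, not the actual MTP head or SGLang’s SPEC_V2 scheduler; the four-token draft length is the only detail taken from the paragraph above, and in a real system the verification of all drafted positions happens in a single batched forward pass rather than one call per token.

```python
# Generic greedy speculative decoding sketch: a cheap draft predictor proposes up to
# four tokens, and tokens are kept only while the full model would have produced the
# same ones. Per-position oracle calls are used here purely for clarity.
from typing import Callable, List

DRAFT_LEN = 4  # number of future tokens drafted per step (per the article)

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # cheap predictor (e.g. a multi-token head)
    target_next: Callable[[List[int]], int],  # full-model greedy prediction
) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1. Draft DRAFT_LEN tokens autoregressively with the cheap predictor.
    drafted, ctx = [], list(prefix)
    for _ in range(DRAFT_LEN):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify: keep drafted tokens while the full model agrees;
    #    on the first mismatch, take the full model's token and stop.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        accepted.append(expected)
        if expected != tok:
            break
        ctx.append(tok)
    return accepted

# Toy demo with trivially agreeing "models": all four drafted tokens are kept.
next_token = lambda ctx: len(ctx) % 50
print(speculative_step([10, 11, 12], next_token, next_token))  # [3, 4, 5, 6]
```

When the draft head is accurate, most rounds accept several tokens at once, which is where the reported throughput gain comes from.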

Support for 201 languages and a dedicated "thinking mode", a built-in chain-of-thought reasoning protocol, further distinguishes Qwen3.5 from competitors. The model’s context window, currently capped at 262K tokens, is extensible to over 1 million tokens, with ongoing development focused on memory-efficient attention mechanisms. This positions Qwen3.5 as a strong candidate for long-document summarization, legal contract analysis, and multi-session agent memory.
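If Qwen3.5 keeps the chat-template switch documented for the earlier Qwen3 releases, thinking mode would be toggled at prompt-construction time roughly as follows. The `enable_thinking` argument is an assumption carried over from Qwen3, and the repository id is the one given later in this article; neither is confirmed for this model.

```python
# Hedged sketch: toggling "thinking mode" through the chat template, assuming
# Qwen3.5 follows the enable_thinking convention documented for Qwen3.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vincentzed-hf/Qwen3.5-397B-A17B-NVFP4")

messages = [{"role": "user", "content": "Summarize the attached contract clause."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # emit chain-of-thought before the final answer (Qwen3 convention)
)
print(prompt)
```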

The Apache 2.0 license ensures broad accessibility for commercial and academic use, a strategic move by Alibaba to accelerate adoption in the open-source ecosystem. Developers can access the model on Hugging Face at vincentzed-hf/Qwen3.5-397B-A17B-NVFP4, though deployment requires a custom branch of SGLang to prevent erroneous quantization of vision encoder weights—a critical fix that ensures multimodal integrity.
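Pulling the published checkpoint is a one-liner with standard Hugging Face Hub tooling; the sketch below uses only the repository id given above, the local path is a placeholder, and the custom SGLang branch is not named in the article, so it is not shown here.

```python
# Fetch the NVFP4 checkpoint from Hugging Face; the repo id comes from the article,
# everything else is generic huggingface_hub usage. Expect roughly 224 GB of data,
# so point local_dir at a volume with enough free space.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="vincentzed-hf/Qwen3.5-397B-A17B-NVFP4",
    local_dir="/models/qwen3.5-nvfp4",   # placeholder path
)
print("checkpoint downloaded to", local_path)
```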

Industry analysts suggest this release signals a broader shift in AI infrastructure: away from dense models requiring massive GPU clusters, and toward sparse, quantized architectures optimized for real-time agency. "Qwen3.5 NVFP4 isn’t just a smaller model—it’s a smarter deployment strategy," said Dr. Lena Zhao, AI infrastructure lead at Stanford’s Center for Responsible AI. "It demonstrates that scale doesn’t always mean compute waste. The future belongs to models that think deeply, not just talk loudly."

With NVIDIA’s Blackwell architecture now fully leveraged, and enterprise adoption accelerating, Qwen3.5-397B-A17B NVFP4 may well become the new benchmark for open, agentic AI systems in 2026 and beyond.

Sources: qwen.ai, www.reuters.com
