
Breakthrough Hybrid Quantization Enables 229B Model to Run Efficiently on 192GB VRAM

A pioneering hybrid quantization technique combines AWQ int4 weights, FP8 attention, and a calibrated FP8 KV cache to dramatically reduce memory usage, enabling the 229B-parameter MiniMax-M2.5 model to run with double the context length and far better batching efficiency on widely available workstation and consumer GPUs.


Revolutionary Quantization Method Unlocks Massive LLM Efficiency

A groundbreaking advancement in large language model (LLM) quantization has shattered previous memory efficiency barriers, allowing the 229-billion-parameter MiniMax-M2.5 model to operate on just 192GB of VRAM — a feat previously deemed impossible without severe performance trade-offs. Developed by independent researcher Elias Oenal and detailed in a Reddit post on r/LocalLLaMA, the new hybrid AWQ quantization method integrates four precision tiers into a single optimized checkpoint, enabling unprecedented throughput and context length on widely available hardware.

Traditional quantization methods such as AWQ (Activation-aware Weight Quantization) compress the majority of model weights, typically the expert MLP layers, into 4-bit integers to save memory. However, attention mechanisms and key-value (KV) caches have historically remained in 16-bit floating point (bf16), consuming disproportionate memory and negating much of the compression benefit. Oenal's approach removes this bottleneck by preserving the original FP8_e4m3 precision of the attention layers and introducing a calibrated FP8 KV cache, reducing memory usage by nearly 50% compared to standard AWQ implementations.
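To make that layout concrete, the following Python sketch shows how such a hybrid precision plan might be expressed. The module-name patterns and configuration structure are illustrative assumptions, not the author's actual checkpoint format.

```python
# Illustrative sketch of the hybrid precision layout described above.
# Module-name patterns and the config structure are assumptions for
# illustration, not the author's actual checkpoint format.

HYBRID_PRECISION_PLAN = {
    # Expert MLP weights: the bulk of the parameters, AWQ int4 (group size 128)
    "model.layers.*.mlp.experts.*": {"method": "awq", "bits": 4, "group_size": 128},
    # Attention projections: kept in the checkpoint's native FP8 (e4m3)
    "model.layers.*.self_attn.*":   {"method": "none", "dtype": "fp8_e4m3"},
    # Embeddings, norms, and output head: left in bf16/fp32 for numerical stability
    "model.embed_tokens":           {"method": "none", "dtype": "bf16"},
    "model.norm":                   {"method": "none", "dtype": "fp32"},
    "lm_head":                      {"method": "none", "dtype": "bf16"},
}

# The KV cache is quantized at runtime to FP8 using per-layer calibrated scales.
KV_CACHE_CONFIG = {"dtype": "fp8_e4m3", "scales": "per_layer_calibrated"}
```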

The results are staggering. On a configuration of four RTX A6000 GPUs (192GB total VRAM), MiniMax-M2.5 now supports a KV cache of approximately 370,000 tokens — more than double the ~160,000 tokens achievable with conventional AWQ. This expansion allows for significantly longer context windows, critical for tasks like document summarization, legal analysis, and multi-turn dialogue systems. Furthermore, the system achieves 92 tokens per second (t/s) on a single request and scales to 416 t/s when processing 16 concurrent requests, demonstrating exceptional batching efficiency.
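A back-of-the-envelope calculation illustrates why halving the KV cache roughly doubles the token budget. The model dimensions and free-memory figure below are placeholder assumptions, since the post does not publish MiniMax-M2.5's exact shapes.

```python
# Back-of-the-envelope KV-cache budget. The model dimensions and the VRAM left
# over for the cache are placeholder assumptions for illustration only.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # Both K and V are stored for every layer, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

layers, kv_heads, head_dim = 60, 8, 128          # placeholder dimensions
free_vram_for_kv = 40 * 1024**3                  # e.g. ~40 GB left after weights

bf16_budget = free_vram_for_kv // kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2)
fp8_budget  = free_vram_for_kv // kv_cache_bytes_per_token(layers, kv_heads, head_dim, 1)

print(f"bf16 KV cache: ~{bf16_budget:,} tokens")
print(f"fp8  KV cache: ~{fp8_budget:,} tokens")  # roughly double the token budget
```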

Technically, the model applies AWQ int4 with a group size of 128 to the 224.7B parameters in the expert MLPs, which constitute 98.3% of the model’s total weights. The 2.7B attention parameters, along with embeddings, normalization layers, and output heads, are preserved in their original FP8_e4m3 or bf16/fp32 formats to maintain numerical stability. Crucially, the KV cache — previously a memory hog — is now quantized using per-layer calibrated FP8 scales, ensuring minimal accuracy loss while halving its footprint.
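The sketch below illustrates a simple amax-based version of per-layer FP8 scale calibration for a KV cache. It shows the general technique only; it is not Oenal's calibration code or vLLM's internals.

```python
import torch

# Minimal sketch of per-layer FP8 (e4m3) KV scale calibration using a simple
# amax-based scheme over recorded K/V activations. Illustration only; not the
# author's calibration code.

FP8_E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def calibrate_kv_scale(kv_samples: list[torch.Tensor]) -> float:
    """Derive one FP8 scale for a layer from K/V tensors sampled during calibration."""
    amax = max(t.abs().max().item() for t in kv_samples)
    return amax / FP8_E4M3_MAX

def quantize_kv_fp8(kv: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize a K or V tensor to FP8 e4m3 with the calibrated scale."""
    return (kv / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)

def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: float) -> torch.Tensor:
    """Recover an approximate bf16 K/V tensor for the attention computation."""
    return kv_fp8.to(torch.bfloat16) * scale

# Per layer: record K/V from calibration prompts, derive a scale, then store the
# cache in FP8 at inference time, halving its footprint versus bf16.
layer_kv = [torch.randn(2, 8, 128, dtype=torch.bfloat16) for _ in range(4)]
scale = calibrate_kv_scale(layer_kv)
kv_fp8 = quantize_kv_fp8(layer_kv[0], scale)
```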

The development process was as innovative as the result. Oenal conducted the entire project remotely via SSH using his own open-source tool, term-cli, which enables AI agents like Claude Opus 4.6 to interact with headless GPU servers in real time. The AI assistant performed calibration, tested inference, debugged vLLM patches, and even iterated on fixes by asking the model simple arithmetic questions like "2+2" to validate correct implementation of KV scale calibration. This closed-loop agentic workflow not only accelerated development but also led to tangible improvements in term-cli, including the addition of secure in-band file transfers via gzip and SHA-256 verification.
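The snippet below sketches how an in-band file transfer with gzip compression and SHA-256 verification can work in principle; it is an illustration of the idea, not term-cli's actual implementation.

```python
import base64
import gzip
import hashlib

# Sketch of an in-band file transfer with integrity checking, illustrating the
# gzip + SHA-256 approach described above. Not term-cli's actual code.

def pack_file(path: str) -> tuple[str, str]:
    """Compress a file and return (base64 payload, SHA-256 of the original bytes)."""
    with open(path, "rb") as f:
        raw = f.read()
    digest = hashlib.sha256(raw).hexdigest()
    payload = base64.b64encode(gzip.compress(raw)).decode("ascii")
    return payload, digest

def unpack_file(payload: str, expected_digest: str, dest: str) -> None:
    """Decode, decompress, verify the hash, and only then write to disk."""
    raw = gzip.decompress(base64.b64decode(payload))
    if hashlib.sha256(raw).hexdigest() != expected_digest:
        raise ValueError("SHA-256 mismatch: transfer corrupted, refusing to write")
    with open(dest, "wb") as f:
        f.write(raw)
```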

Two critical bugs in the vLLM inference engine were uncovered during development and have since been patched and submitted upstream (PR #34863). Once merged, the hybrid quantization will become fully accessible to the broader community without requiring custom forks. The model is now publicly available on Hugging Face, and the patches are expected to benefit not only MiniMax-M2.5 but any future large models leveraging similar hybrid architectures.
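Once the patches are merged, serving the checkpoint should look roughly like a standard vLLM setup with an FP8 KV cache. The snippet below is a hypothetical usage sketch: the model identifier, context length, and GPU split are placeholders, and until the merge the patched fork described in the post would still be required.

```python
# Hypothetical serving sketch once PR #34863 lands in vLLM. Model identifier,
# context length, and GPU split are placeholders, not confirmed by the post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<hybrid-awq-minimax-m2.5-checkpoint>",  # placeholder for the Hugging Face repo
    tensor_parallel_size=4,                        # e.g. 4x RTX A6000, 192 GB total
    kv_cache_dtype="fp8",                          # enable the FP8 KV cache
    max_model_len=131072,                          # choose a length that fits the budget
)

# The same "2+2" sanity check used during development to validate KV scaling.
outputs = llm.generate(["2+2="], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```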

This breakthrough signals a paradigm shift in LLM deployment: rather than scaling hardware to fit massive models, engineers can now optimize software to make them fit existing infrastructure. With support for 8x RTX 3090 cards — matching the same 192GB VRAM configuration — the technique democratizes access to billion-parameter models for research labs, startups, and edge deployments. The implications extend beyond efficiency; they redefine the economic and environmental calculus of running state-of-the-art AI systems.

Sources: www.reddit.com
