Qwen3.5-35B-A3B Quantization Showdown: KLD Scores Reveal Optimal Model Weights
A comprehensive analysis of Q4 quantization methods for the Qwen3.5-35B-A3B model reveals that AesSedai’s Q4_K_M variant achieves the highest faithfulness to the original BF16 model, while MXFP4 post-training quantization underperforms. The study provides data-driven guidance for developers prioritizing efficiency and accuracy.

An empirical comparison of Q4 quantization techniques for the Qwen3.5-35B-A3B large language model has identified which weight compression recipes stay closest to the original model, offering practical guidance for developers deploying models on resource-constrained hardware. The analysis, conducted by a community researcher under the username TitwitMuffbiscuit and published on r/LocalLLaMA, evaluates 17 quantization variants using Kullback-Leibler divergence (KLD) and perplexity (PPL) against a BF16 baseline, and finds large differences between variants that carry similar nominal labels.
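As a rough illustration of the two metrics named above, the following sketch computes mean KLD between a baseline and a quantized model, plus perplexity, from raw per-token logits. It is not the author's tooling: the function names and the tiny logit lists are hypothetical, and it assumes KLD is taken as KL(baseline || quantized) averaged over token positions, in nats.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mean_kld(base_logits, quant_logits):
    """Mean over token positions of KL(P_base || P_quant), in nats.

    Assumption: both inputs are lists of rows, one row of logits per position,
    produced by the baseline and the quantized model on the same tokens.
    """
    total = 0.0
    for base_row, quant_row in zip(base_logits, quant_logits):
        lp, lq = log_softmax(base_row), log_softmax(quant_row)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(base_logits)

def perplexity(logits, targets):
    """exp of the mean negative log-likelihood of the true next tokens."""
    nll = -sum(log_softmax(row)[t] for row, t in zip(logits, targets)) / len(targets)
    return math.exp(nll)

# Hypothetical toy data: 2 positions, vocabulary of 4 tokens.
base = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.5, 0.5, 0.5]]
quant = [[1.8, 1.1, 0.1, -0.9], [0.4, 0.6, 0.5, 0.5]]
drift = mean_kld(base, quant)  # small positive number; 0.0 only if distributions match
```

A KLD of zero means the quantized model assigns exactly the same next-token probabilities as the baseline, which is why lower values are read as higher faithfulness.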
At the top of the leaderboard is AesSedai’s Q4_K_M quantization, which achieved the lowest KLD score of 0.010214, indicating the least drift from the original model’s probability distribution. This variant keeps high-impact tensors, such as attention weights and shared experts, at higher precision (Q8_0) and treats the individual feed-forward network components differently from one another. In contrast, Unsloth’s UD-Q4_K_XL recipe, which applies MXFP4 to nearly all tensors including attention layers, recorded the worst KLD score at 0.052439, although it is also one of the smaller files at 18.34 GiB. This finding is consistent with a paper on quantized reasoning models hosted on OpenReview, which suggests that post-training quantization of reasoning models can introduce significant degradation, particularly when aggressive formats like MXFP4 are applied without quantization-aware training (QAT).
The study introduces an Efficiency Score, calculated as the Euclidean distance from an ideal model (zero size, zero KLD), to identify the best trade-offs between model size and fidelity; a lower score is better. AesSedai’s IQ4_XS variant ranked first in efficiency with a score of 0.327, offering a 16.4 GiB footprint with minimal information loss. Meanwhile, Ubergarm’s Q4_0 outperformed other Q4_0 implementations by a factor of 2.5 in KLD, demonstrating that implementation details matter more than nominal quantization labels. Bartowski’s Q4_K_S and Q4_K_L variants also performed well, suggesting that symmetric quantization with layer-wise calibration remains a robust strategy.
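The article does not say how size (in GiB) and KLD are put on a common scale before the distance is taken. The sketch below is one plausible reading, not the author's formula: it assumes each axis is min-max normalized across the candidate set, so the origin corresponds to the smallest file and the lowest KLD observed. The function name and the sample entries are hypothetical placeholders, not measurements from the study.

```python
import math

def efficiency_scores(models):
    """Return {name: score}, where lower means closer to the ideal corner.

    models maps a name to (size_gib, kld).
    Assumption: min-max normalization per axis, then Euclidean distance
    from (0, 0) in the normalized plane.
    """
    sizes = [size for size, _ in models.values()]
    klds = [kld for _, kld in models.values()]

    def normalize(x, lo, hi):
        # Guard against a degenerate axis where every candidate is identical.
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    return {
        name: math.hypot(
            normalize(size, min(sizes), max(sizes)),
            normalize(kld, min(klds), max(klds)),
        )
        for name, (size, kld) in models.items()
    }

# Hypothetical placeholder values, for illustration only.
candidates = {
    "quant_a": (16.0, 0.02),
    "quant_b": (18.0, 0.01),
    "quant_c": (20.0, 0.05),
}
scores = efficiency_scores(candidates)
best = min(scores, key=scores.get)  # candidate with the best size/fidelity trade-off
```

Under this normalization the score depends on which variants are in the comparison set, so scores from different studies would not be directly comparable.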
Notably, the MXFP4-based quantizations promoted by Unsloth and Noctrex consistently underperformed despite the format’s theoretical advantages for low-precision arithmetic. The OpenReview paper on quantized reasoning models corroborates this, noting that MXFP4’s benefits emerge only during QAT, not post-training. When applied retroactively to BF16 models, MXFP4 introduces non-uniform error propagation, particularly in MoE architectures like Qwen3.5, where routing decisions are sensitive to small weight perturbations.
The experimental setup was consistent across models: tests ran on an Intel i3-12100F with an RTX 3060, using ik_llama.cpp and llama.cpp on the wikitext2_test dataset with a 512-token context. All models were evaluated on identical hardware with identical inference parameters, so the results are directly comparable. The author cautions that while KLD is the most reliable metric for general faithfulness, task-specific performance (e.g., code generation or mathematical reasoning) may vary, and further benchmarks are needed.
For practitioners, the takeaway is clear: avoid defaulting to popular but poorly optimized quantizations like Q4_0 or MXFP4. Instead, prioritize AesSedai’s Q4_K_M for maximum accuracy or IQ4_XS for optimal efficiency. As quantization becomes standard for edge AI deployment, these granular findings underscore the need for data-driven selection, rather than anecdotal preference, in model compression.
For those seeking to replicate the analysis, the author provides command-line tools using llama-perplexity with the --kl-divergence-base flag to compare any quantized model against a BF16 baseline. The full dataset and plots are available via the Reddit post’s image links, offering a rare, transparent benchmark in the opaque world of LLM quantization.
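A minimal sketch of that two-step workflow, assuming llama.cpp's llama-perplexity binary is on the PATH: the first run saves the BF16 baseline's logits to a file, and the second run compares a quantized model against that file. The helper name and all file names are hypothetical, and the author's exact commands may differ; the script only builds and prints the commands rather than executing them.

```python
import shlex

def kld_commands(base_model, quant_model, text_file, logits_file, ctx=512):
    """Build the two llama-perplexity invocations as argument lists.

    Step 1 runs the baseline over the text file and writes its logits.
    Step 2 loads those logits and reports KL divergence for the quantized model.
    """
    save_base = [
        "llama-perplexity", "-m", base_model,
        "-f", text_file, "-c", str(ctx),
        "--kl-divergence-base", logits_file,
    ]
    compare = [
        "llama-perplexity", "-m", quant_model,
        "--kl-divergence-base", logits_file,
        "--kl-divergence",
    ]
    return save_base, compare

# Hypothetical file names; substitute real paths before running.
save_base, compare = kld_commands(
    "model-bf16.gguf", "model-q4_k_m.gguf", "wiki.test.raw", "base-logits.bin"
)
for cmd in (save_base, compare):
    print(shlex.join(cmd))
```

The commands can be pasted into a shell or passed to subprocess.run. The baseline logits file can be large, so it is worth checking free disk space before the first step.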


