
Small LLMs Reveal Surprising Tool-Calling Mastery on CPU — Benchmark Results

A new CPU-only benchmark tests 21 small language models on their ability to judge when to invoke tools, finding that ultra-compact models like Qwen3:0.6B and LFM2.5:1.2B can match or outperform much larger counterparts, all running on standard laptops without GPUs.

A new benchmark study has upended conventional wisdom about the relationship between model size and practical AI performance. In a rigorous, CPU-only evaluation of 21 small language models, researchers found that models under 1.5 billion parameters can match or exceed the tool-calling judgment of much larger systems — challenging the industry’s assumption that scale equals capability.

The study, conducted by independent researcher Mike Veerman and published on Reddit’s r/LocalLLaMA community, tested models on their ability to determine when to invoke external tools — such as weather queries, file searches, or calendar scheduling — rather than merely whether they could generate tool calls. The results, based on 756 inference calls across 12 carefully designed prompts, revealed that four models tied for first place with an Agent Score of 0.880: lfm2.5:1.2b, qwen3:0.6b, qwen3:4b, and phi4-mini:3.8b. Notably, the 600-million-parameter Qwen3:0.6b achieved this score with latency under 3.7 seconds, making it the most efficient top performer.
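
The benchmark code itself is published on GitHub rather than quoted in the post, but the lowercase model tags (qwen3:0.6b, lfm2.5:1.2b, phi4-mini:3.8b) are consistent with models served locally through a runtime such as Ollama. A minimal sketch of the core call-or-abstain probe might look like the following; the endpoint, tool schema, and prompts are illustrative assumptions, not Veerman's actual harness.

```python
# Minimal sketch of a call/no-call probe against a locally served model.
# Assumes an Ollama-style HTTP endpoint on localhost:11434; the prompt,
# tool schema, and model tag are illustrative, not the benchmark's own.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def should_model_call_tool(model: str, prompt: str) -> bool:
    """Return True if the model decided to invoke a tool for this prompt."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "tools": [WEATHER_TOOL],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    message = resp.json().get("message", {})
    return bool(message.get("tool_calls"))

# One prompt where a tool call is expected, one where restraint is expected.
print(should_model_call_tool("qwen3:0.6b", "What's the weather in Antwerp right now?"))
print(should_model_call_tool("qwen3:0.6b", "It's 14°C and sunny in Antwerp. Should I bring a jacket?"))
```

Repeating such a probe over 12 prompts and 21 models is consistent with three runs per prompt-model pair (12 × 21 × 3 = 756), matching the 756 inference calls reported above.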

One of the most striking findings was the dramatic score correction for lfm2.5:1.2b, a 1.2B state-space hybrid model. It was initially scored at 0.640 because it emits tool calls in a non-standard bracket notation, [get_weather(city="Antwerp")], rather than XML tags; a custom parser revealed that the model had been making correct decisions all along, and its score jumped to 0.880, tying it with the largest models in the test. This underscores a critical lesson: how you evaluate matters as much as what you evaluate. Five models required custom parsing due to non-standard output formats, and in some cases fixing the parser exposed flawed behavior: Gemma3:1b's score dropped from 0.600 to 0.550 once accurate parsing revealed excessive tool calls.
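
The study's parser is not reproduced in the post, but a fallback for bracket-notation output can be sketched with a couple of regular expressions. The format below is inferred solely from the single example quoted above, so treat it as an assumption rather than the benchmark's actual grammar.

```python
# Illustrative fallback parser for bracket-notation tool calls such as
# [get_weather(city="Antwerp")]; the exact grammar handled by the
# benchmark's custom parser is an assumption based on that one example.
import re

CALL_RE = re.compile(r"\[(\w+)\((.*?)\)\]")      # [name(args)]
ARG_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')    # key="value" pairs

def parse_bracket_calls(text: str) -> list[dict]:
    """Extract tool calls that an XML-tag-only parser would miss."""
    calls = []
    for name, raw_args in CALL_RE.findall(text):
        args = {key: value for key, value in ARG_RE.findall(raw_args)}
        calls.append({"name": name, "arguments": args})
    return calls

print(parse_bracket_calls('Sure. [get_weather(city="Antwerp")]'))
# -> [{'name': 'get_weather', 'arguments': {'city': 'Antwerp'}}]
```

As the Gemma3:1b case shows, such a parser also has to surface spurious calls, not just recover missed ones, or a formatting fix can silently inflate a score.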

The results also revealed a troubling "capability valley" in the Qwen3 family: the 1.7B model scored significantly lower (0.670) than both its smaller (0.6B) and larger (4B) siblings. This suggests that mid-sized models may be more prone to overconfidence, calling tools too aggressively without sufficient restraint. In contrast, top performers excelled not by calling tools more often, but by knowing when to abstain — especially on ambiguous prompts. Prompt P12, which asked whether to schedule a meeting when the weather was already stated, proved the hardest: only three models correctly declined to call a tool, highlighting the persistent challenge of context-aware restraint.
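
The Agent Score formula itself is not given in the post, but the behavior it rewards, abstaining on prompts like P12 as readily as calling on prompts that genuinely need a tool, can be illustrated with a plain decision-accuracy sketch. The function and example decisions below are hypothetical stand-ins, not the benchmark's actual metric.

```python
# Hypothetical judgment score over call/no-call decisions; the real Agent
# Score formula is not published in the post, so this is only illustrative.
def judgment_score(decisions: list[tuple[bool, bool]]) -> float:
    """decisions: (should_call, did_call) pairs, one per prompt."""
    correct = sum(1 for should_call, did_call in decisions if should_call == did_call)
    return correct / len(decisions)

# An over-eager model that always calls tools loses points on the
# restraint prompts (should_call=False), mirroring the "capability valley".
always_call = [(True, True), (True, True), (False, True), (False, True)]
restrained  = [(True, True), (True, True), (False, False), (False, True)]
print(judgment_score(always_call))  # 0.5
print(judgment_score(restrained))   # 0.75
```

Any metric of this shape penalizes over-eager callers on the no-call prompts, which is exactly where the mid-sized models lost ground.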

Performance was not correlated with parameter count. The 270M FunctionGemma model was the fastest (476ms), while the 4B Qwen3:4b was nearly 17 times slower than its 0.6B counterpart. The BitNet-2B-4T and SmolLM3:3B models demonstrated high action rates but poor restraint, leading to lower scores despite frequent correct calls. Meanwhile, the 3B Llama3.2 model scored just 0.660 despite its size, largely due to a complete failure in restraint — calling tools even when explicitly told not to.

These findings have profound implications for edge AI and on-device applications. The fact that multiple models under 4B parameters can achieve near-perfect tool-calling judgment on commodity laptops — without GPUs — opens new pathways for privacy-preserving, low-latency AI assistants. As Veerman notes in his GitHub repository, "Local tool-calling agents are not just feasible — they’re already competitive."

The full dataset, code, and raw inference logs are available on GitHub, inviting further replication and expansion. Veerman has already signaled interest in a Round 3, with community requests for additional models pouring in. For developers and researchers building constrained AI systems, this benchmark is no longer just a curiosity — it’s a roadmap.
