AI Agents Struggle with Unfamiliar Data Formats, New Research Reveals
A new large-scale study of AI agents finds that frontier models like Claude and GPT benefit from file-based context retrieval where open-source alternatives do not. The research, spanning nearly 10,000 experiments, also uncovered a significant 'grep tax': unfamiliar data formats can impose steep token-usage penalties as schemas scale. The findings offer concrete guidance for developers building the next generation of file-native AI systems.

By Investigative AI Journalist
February 10, 2026
In a comprehensive study that challenges conventional wisdom about AI system design, researchers have discovered that the choice of data format and model architecture significantly impacts how effectively artificial intelligence agents can navigate and manipulate complex file systems. The findings, drawn from 9,649 experiments across 11 different AI models, reveal a stark performance gap between commercial "frontier" models and their open-source counterparts when operating in file-native environments.
The Scale of the Investigation
According to the research paper "Structured Context Engineering for File-Native Agentic Systems" published on arXiv, the study represents one of the most extensive empirical investigations into how AI agents process structured data. Led by Damon McMillan of HxAI Australia, the research used SQL generation as a proxy for programmatic agent operations, testing models against database schemas ranging from 10 to an astonishing 10,000 tables.
The study evaluated four different data formats: the familiar YAML, Markdown, and JSON, alongside the more esoteric Token-Oriented Object Notation (TOON), which is specifically designed to represent structured data using as few tokens as possible. What researchers discovered contradicted several common assumptions in the field.
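To make the contrast concrete, here is a small set of column definitions rendered both ways. The TOON rendering is an illustrative sketch based on the format's public syntax (a tabular header that declares field names once, followed by bare comma-separated rows); it is not an excerpt from the study.

```text
# A table's columns in YAML: keys repeat on every entry
columns:
  - {name: id, type: int}
  - {name: customer_id, type: int}
  - {name: total, type: decimal}

# The same columns in TOON-style notation: field names appear once,
# then bare rows (illustrative rendering, not taken from the paper)
columns[3]{name,type}:
  id,int
  customer_id,int
  total,decimal
```

Row for row, TOON is clearly leaner, which is exactly why its poor showing at scale surprised researchers.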
The Frontier vs. Open-Source Divide
One of the most significant findings, according to the arXiv paper, concerns the performance gap between different classes of AI models. Frontier-tier models—specifically Claude (Opus 4.5), GPT (5.2), and Gemini (2.5 Pro)—demonstrated a clear advantage when using file-based context retrieval systems, showing an average accuracy improvement of 2.7%.
"Architecture choice is model-dependent," the paper states. "File-based context retrieval improves accuracy for frontier-tier models but shows mixed results for open source models."
In stark contrast, open-source models including DeepSeek V3.2, Kimi K2, and Llama 4 showed an aggregate performance decrease of 7.7% when using the same file-system approaches. This finding suggests that current open-weight models may not yet be optimized for the complex agentic loops required to navigate file systems effectively, potentially explaining why commercial models continue to dominate benchmarks like Terminal Bench 2.0.
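To see what "file-based context retrieval" means in practice, consider the difference between stuffing an entire schema into the prompt and letting the agent search schema files on disk. The sketch below is a minimal illustration; the function names, prompts, and grep-style tool are assumptions for exposition, not the paper's actual harness.

```python
# Minimal sketch of the two architectures the study compares. All names,
# prompts, and the grep-style tool are illustrative assumptions.
from pathlib import Path

def full_context_prompt(schema_file: str, question: str) -> str:
    """In-context baseline: the whole schema rides along in every prompt."""
    schema = Path(schema_file).read_text()
    return f"Schema:\n{schema}\n\nWrite SQL answering: {question}"

def grep_schema(schema_file: str, keyword: str, window: int = 3) -> str:
    """File-native retrieval: emulate a grep tool call, returning only the
    regions of the schema file that mention the keyword."""
    lines = Path(schema_file).read_text().splitlines()
    hits = [i for i, line in enumerate(lines) if keyword.lower() in line.lower()]
    regions = [
        "\n".join(lines[max(0, i - window): i + window + 1]) for i in hits
    ]
    return "\n---\n".join(regions) if regions else "(no matches)"
```

A capable model can drive the second loop well (search narrowly, read the hits, refine the query), while the study's numbers suggest open-source models often cannot, which is where the aggregate 7.7% drop shows up.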
The Surprising 'Grep Tax'
Perhaps the most counterintuitive discovery concerns data formats. While format choice showed no significant effect on aggregate accuracy across all models, researchers identified what they term the "grep tax"—a substantial efficiency penalty paid when models encounter unfamiliar data structures.
TOON, despite being designed specifically for token efficiency, backfired spectacularly at scale. According to the research, when processing a schema of 500 tables, TOON consumed 138% more tokens than YAML. This inefficiency exploded to 740% more tokens when the schema grew to 10,000 tables.
"The 'grep tax' emerged as schema size scaled," the paper explains. "Root cause: models lacked familiarity with TOON's syntax and could not construct effective refinement patterns."
This finding challenges the assumption that minimal token representation necessarily leads to efficiency. Instead, it suggests that model familiarity with a format's syntax and conventions may be more important than raw token count, particularly as task complexity increases.
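The mechanics behind the grep tax are easy to demonstrate. A model searching a YAML schema can anchor on a conventional key like `- name:` and locate every table with one narrow pattern; on a format whose conventions it does not know, the same instinct finds nothing, and the model falls back to reading large spans of the file. The snippet below is a contrived illustration (the TOON-style rendering is assumed, as above), not an excerpt from the study.

```python
import re

yaml_schema = """\
tables:
  - name: orders
    columns: [id, customer_id, total]
  - name: customers
    columns: [id, email]
"""

# Illustrative TOON-style rendering of the same two tables
toon_schema = """\
tables[2]{name,columns}:
  orders,id|customer_id|total
  customers,id|email
"""

# A refinement pattern a model might reach for: YAML table definitions
# are reliably anchored on "- name:", so one narrow search finds them all.
table_name = re.compile(r"-\s*name:\s*(\w+)")
print(table_name.findall(yaml_schema))  # ['orders', 'customers']

# The same pattern finds nothing in the unfamiliar format, so the model
# resorts to pulling back whole regions of the file: the "grep tax".
print(table_name.findall(toon_schema))  # []
```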
Implications for AI Infrastructure
These findings arrive at a critical moment in AI development, as organizations worldwide are investing heavily in what The New Stack describes as "AI-ready infrastructure for the agentic era." The research provides empirical evidence that infrastructure decisions—from model selection to data formatting conventions—have measurable impacts on system performance and efficiency.
The study's scale is particularly noteworthy: at nearly 10,000 individual experiments, it is among the largest practical evaluations of AI agent performance to date. Its methodology, using SQL generation as a proxy for broader programmatic operations, also provides a replicable framework for future investigations into agentic systems.
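The paper's exact harness is not reproduced in this article, but the proxy task is straightforward to sketch: give a model some schema context, ask for SQL, and score the result. The version below assumes execution-match scoring against a gold query and a generic call_model client; both details are assumptions, not specifics from the study.

```python
# Sketch of a single evaluation trial in the spirit of the methodology.
# call_model is any prompt-in, text-out client; execution-match scoring
# is an assumption about how trials are judged.
import sqlite3

def score_trial(db_path: str, schema_context: str, question: str,
                gold_sql: str, call_model) -> bool:
    prompt = f"{schema_context}\n\nReturn only a SQL query answering: {question}"
    candidate_sql = call_model(prompt)
    with sqlite3.connect(db_path) as conn:
        try:
            got = conn.execute(candidate_sql).fetchall()
        except sqlite3.Error:
            return False  # unexecutable SQL counts as a miss
        want = conn.execute(gold_sql).fetchall()
    return sorted(map(repr, got)) == sorted(map(repr, want))
```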
Practical Guidance for Developers
For developers building file-native AI systems, the research offers several concrete recommendations:
- Model selection matters profoundly: The 21-percentage-point accuracy gap between frontier and open-source models dwarfs any format or architecture effect, making model capability the dominant factor in system performance.
- Familiarity beats minimalism: When choosing data formats, established standards like YAML or JSON may outperform more token-efficient but unfamiliar alternatives, particularly for complex tasks.
- Architecture should match model type: File-based context retrieval systems benefit frontier models but may hinder open-source alternatives, suggesting that different architectural approaches may be needed for different model classes.
- Scale changes everything: Format efficiency characteristics that look negligible at small scale can become dominant cost factors as systems grow, so design decisions should anticipate that growth (a rough way to probe the static side of this trade-off is sketched after this list).
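One caveat worth separating out: the grep-tax figures measure tokens consumed while navigating, not the static size of the files, and a compact format can be smaller on disk yet still lose badly inside an agentic loop. The sketch below, which assumes tiktoken's cl100k_base tokenizer and reuses the illustrative renderings from earlier, measures only that static side.

```python
# Measure static token counts of two schema renderings as table count grows.
# Uses tiktoken's cl100k_base encoding as a stand-in; the study's actual
# tokenizers vary by model. The TOON-style rendering is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def yaml_table(i: int) -> str:
    return f"  - name: table_{i}\n    columns: [id, col_a, col_b]\n"

def toon_row(i: int) -> str:
    return f"  table_{i},id|col_a|col_b\n"

for n in (10, 500, 10_000):
    yaml_doc = "tables:\n" + "".join(yaml_table(i) for i in range(n))
    toon_doc = f"tables[{n}]{{name,columns}}:\n" + "".join(toon_row(i) for i in range(n))
    print(f"{n:>6} tables  YAML={len(enc.encode(yaml_doc)):>8}  "
          f"TOON-style={len(enc.encode(toon_doc)):>8}")
```

On static size the compact format wins; that is exactly the study's point, since the 138% and 740% penalties emerged from the navigation loop, not from the files themselves.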
Looking Forward
As AI agents become increasingly integrated into everyday computing environments, understanding how they interact with file systems and structured data will only grow in importance. This research provides a foundational empirical basis for making informed decisions about AI system architecture.
The findings also suggest several avenues for future investigation, including whether open-source models can be specifically trained to improve their file-system navigation capabilities, and whether hybrid approaches combining different formats might optimize both familiarity and efficiency.
For now, the message to developers is clear: when building file-native AI systems, model choice matters most, but format familiarity and appropriate architecture can make the difference between an efficient system and one that pays a heavy "grep tax" at scale.
Primary source: "Structured Context Engineering for File-Native Agentic Systems" by Damon McMillan, published on arXiv. Additional context from AI infrastructure analysis published by The New Stack.


