Developer-Generated Coding Agent Data Could Revolutionize Open AI — If We Collect It

Millions of coding agent sessions, rich with real-world problem-solving trajectories, are being silently deleted from developers' machines every day. A post on r/LocalLLaMA argues that this untapped data could power the first open-source AI coding model trained on actual human-AI collaboration.

Whenever a developer runs an AI coding assistant such as Claude Code or GitHub Copilot in agent mode, a trove of high-fidelity training data is generated and then automatically erased. According to a detailed post on r/LocalLLaMA, these tools log every interaction: the initial task prompt, the model's reasoning steps, tool calls, system responses, error messages, retries, and final outcomes. Together these form complete reinforcement learning (RL) trajectories. The data includes binary feedback signals such as exit codes and test pass/fail results, and the post presents it as an unusually authentic, large-scale record of human-AI collaboration in software development.

One developer, analyzing local storage on two machines, found over 775 agentic sessions totaling 41 million tokens. Extrapolating across even a fraction of the global developer population, estimated at over 27 million, suggests that hundreds of billions of tokens of high-value, real-world training data are being discarded. Yet no equivalent of the Pile dataset exists for agentic coding behavior. Meanwhile, the post contends, major AI labs such as Anthropic and OpenAI collect and use this kind of data internally to refine their proprietary models.
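
The extrapolation can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; treating every participating developer as if they produced the same 41 million tokens as the original poster is an assumption made here for illustration, and the poster may well be a heavier user than most. The function names are hypothetical.

```python
import math

# Figures quoted in the article.
SESSIONS = 775            # agentic sessions found on two machines
TOKENS = 41_000_000       # total tokens across those sessions
DEVELOPERS = 27_000_000   # estimated global developer population


def corpus_tokens(participants: int, tokens_per_dev: int = TOKENS) -> int:
    """Total tokens if each participant kept tokens_per_dev tokens (assumption)."""
    return participants * tokens_per_dev


def developers_needed(target_tokens: int, tokens_per_dev: int = TOKENS) -> int:
    """Smallest number of participants whose combined logs reach target_tokens."""
    return math.ceil(target_tokens / tokens_per_dev)


avg_tokens_per_session = TOKENS / SESSIONS
needed = developers_needed(100_000_000_000)  # 100 billion tokens

print(f"average tokens per session: {avg_tokens_per_session:,.0f}")
print(f"developers needed for 100B tokens: {needed:,} "
      f"({needed / DEVELOPERS:.4%} of the estimated population)")
print(f"tokens from 0.1% of developers: {corpus_tokens(DEVELOPERS // 1000):,}")
```

Under that assumption, a few thousand participants would already reach the "hundreds of billions" range, so the estimate depends far more on how typical the poster's usage is than on how many developers take part.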

The key insight is not merely the volume of data but its quality. Unlike web-scraped code snippets or Stack Overflow posts, agentic sessions capture causal reasoning, long-horizon planning, error recovery, and iterative refinement, precisely the skills current LLMs struggle with. Each step of a session can be read as a (state, action, reward, next state) tuple, the unit of experience that reinforcement learning methods train on, including reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF). Here, though, the environment itself provides the reward signal: Did the code compile? Did the tests pass? Did the tool execute as intended? This is supervision without manual labeling, which the post describes as the holy grail for training next-generation coding agents.
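
To make the tuple structure concrete, here is a minimal sketch of how a session might be turned into steps with an environment-derived reward. The `Step` class, the `(command, exit_code, output)` event shape, and the sample events are all hypothetical; real Claude Code or Codex logs use their own schemas, which this does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    state: str       # what the agent saw before acting
    action: str      # the tool call or command it issued
    reward: float    # derived from the environment, not from a human label
    next_state: str  # what the environment returned


def reward_from_exit_code(exit_code: int) -> float:
    """Binary reward: 1.0 when the command succeeded (exit code 0), else 0.0."""
    return 1.0 if exit_code == 0 else 0.0


def to_steps(task: str, events: list[tuple[str, int, str]]) -> list[Step]:
    """Chain (command, exit_code, output) events into consecutive steps."""
    steps = []
    state = task
    for command, exit_code, output in events:
        steps.append(Step(state, command, reward_from_exit_code(exit_code), output))
        state = output  # the output becomes the next step's starting state
    return steps


# Hypothetical session: a failing test run, an edit, then a passing run.
events = [
    ("pytest -q", 1, "1 failed: test_parse"),
    ("edit parser.py", 0, "file updated"),
    ("pytest -q", 0, "all tests passed"),
]
trajectory = to_steps("Fix the failing parser test", events)
for step in trajectory:
    print(step.action, step.reward)
```

The failed first run is kept rather than filtered out: error-and-retry sequences are exactly the error-recovery signal the paragraph above refers to.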

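According to the post, Claude Code and similar tools currently delete these logs after 30 days by default. Users can extend retention effectively indefinitely by setting a single key in a configuration file, as the original poster demonstrates: echo '{"cleanupPeriodDays": 36500}' > ~/.claude/settings.json. Note that the > redirection overwrites the file, so this command discards any settings already stored there; if the file exists, add the cleanupPeriodDays key to it by hand instead. The post says the same applies to Codex CLI sessions, which are stored under ~/.codex/sessions/.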
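
A safer variant of that one-liner merges the key into whatever the settings file already contains. The following is a sketch rather than an official procedure: the `set_retention` helper is hypothetical, and the `theme` key in the demo is a made-up stand-in for existing settings. The demo writes to a temporary file so that nothing under the home directory is touched.

```python
import json
import tempfile
from pathlib import Path


def set_retention(settings_path: Path, days: int) -> dict:
    """Set cleanupPeriodDays while keeping every other key in the file."""
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():  # tolerate an empty file
            settings = json.loads(text)
    settings["cleanupPeriodDays"] = days
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings


# Demo against a temporary file with one pre-existing (made-up) setting.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "settings.json"
    demo_path.write_text('{"theme": "dark"}', encoding="utf-8")
    result = set_retention(demo_path, 36500)
    print(result)
```

To apply it for real, one would call `set_retention(Path.home() / ".claude" / "settings.json", 36500)`, ideally after backing up the existing file.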

The proposed solution is a federated learning framework: developers opt in to train a small, local LoRA adapter on their own session data, then share only the encrypted, differentially private model weights, not the raw data, with a global aggregation server. The resulting model improves for everyone while preserving privacy and intellectual property. Alternatively, anonymized, consent-based aggregation could create a public dataset, akin to Common Crawl for code but with behavioral context.
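
The core loop of such a scheme can be sketched in a few lines. This is an illustration under strong simplifying assumptions, not the post's design: each adapter update is a flat list of floats, privacy is approximated by norm clipping plus Gaussian noise on the client, the server takes a plain average (federated averaging), and encryption or secure aggregation is omitted entirely. The noise level is not calibrated to any formal privacy budget, and all function names are hypothetical.

```python
import math
import random


def clip(update: list[float], max_norm: float) -> list[float]:
    """Scale an update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update: list[float], max_norm: float, noise_std: float,
              rng: random.Random) -> list[float]:
    """Client side: clip the update, then add Gaussian noise per coordinate."""
    return [x + rng.gauss(0.0, noise_std) for x in clip(update, max_norm)]


def federated_average(updates: list[list[float]]) -> list[float]:
    """Server side: coordinate-wise mean of the updates it received."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


# Two hypothetical clients with tiny two-parameter "adapters".
rng = random.Random(0)
local_updates = [[3.0, 4.0], [0.3, 0.4]]
shared = [privatize(u, max_norm=1.0, noise_std=0.01, rng=rng) for u in local_updates]
global_update = federated_average(shared)
print(global_update)
```

Clipping bounds how much any single developer can move the global model, which is what makes the added noise meaningful; the server only ever sees the clipped, noised vectors.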

Several open-source initiatives, including Hugging Face's Open Codebase and the EleutherAI community, have reportedly expressed interest in exploring such a dataset. If even 1% of active developers preserved their sessions for six months, the resulting corpus could eclipse any existing coding dataset in realism and utility. The infrastructure exists; the will to share is the missing component.

For developers curious about their own data footprint, the original poster recommends running:

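# Disk usage of stored sessions (errors are suppressed if a directory is missing)
du -sh ~/.codex/sessions/ 2>/dev/null
du -sh ~/.claude/projects/ 2>/dev/null
# Number of .jsonl session log files
find ~/.codex/sessions/ -type f -name "*.jsonl" 2>/dev/null | wc -l
find ~/.claude/projects/ -type f -name "*.jsonl" 2>/dev/null | wc -l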
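
For readers who prefer a script, or who are on a system without du and find, the same measurement can be sketched in Python. The `session_footprint` helper is hypothetical; it only reads file metadata and returns zeros when a directory does not exist.

```python
from pathlib import Path


def session_footprint(root: Path) -> tuple[int, int]:
    """Return (number of .jsonl files, total bytes) under root, or (0, 0) if missing."""
    if not root.is_dir():
        return (0, 0)
    files = [p for p in root.rglob("*.jsonl") if p.is_file()]
    return (len(files), sum(p.stat().st_size for p in files))


# The two directories named in the article.
for root in (Path.home() / ".codex" / "sessions", Path.home() / ".claude" / "projects"):
    count, size = session_footprint(root)
    print(f"{root}: {count} files, {size / 1_000_000:.1f} MB")
```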

The community response on Reddit has been overwhelmingly positive, with hundreds sharing their own storage metrics. The next step is building the infrastructure to turn this decentralized data goldmine into an open, collaborative model, one that belongs not to any corporation but to the developers who generated it.

AI-Powered Content
Sources: www.reddit.com