
LLaDA2.1 Breaks Speed Record at 892 TPS, Solves Key Diffusion LLM Flaw

A new diffusion language model, LLaDA2.1, has reportedly achieved a staggering 891.74 tokens per second, dramatically outpacing traditional autoregressive models. Its novel 'Draft and Edit' approach fixes a fundamental token permanence problem that has plagued previous diffusion-based text generation.


LLaDA2.1 Shatters Inference Speed Records While Introducing Token-Editing Breakthrough

By Investigative AI & Tech Desk | February 11, 2026

In a significant leap for generative AI efficiency, researchers from Ant Group and collaborating universities have unveiled LLaDA2.1, a diffusion-based language model that claims to achieve inference speeds of nearly 900 tokens per second (TPS) while solving a core architectural limitation of its predecessors. According to the technical report, this performance, coupled with a novel token-editing mechanism, could reshape the economics and capabilities of large-scale AI deployment.

Unprecedented Speed Meets a Paradigm Shift

The core benchmark result is striking: on the HumanEval+ coding benchmark, the 100-billion-parameter "flash" variant of LLaDA2.1, running in a speed-optimized "S Mode," processed text at 891.74 TPS. A smaller 16-billion-parameter model reportedly peaked at 1,586.93 TPS. For context, these figures are roughly an order of magnitude above the per-request decoding throughput of typical autoregressive models such as GPT-4 or Llama at comparable parameter counts, where tokens must be generated sequentially, one at a time. According to the research team's paper on arXiv, this demonstrates the "scaling potential of 100B-level block-diffusion models and their inherent parallelization."
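To make the throughput gap concrete, a back-of-the-envelope latency calculation helps. The 891.74 TPS figure is from the report; the 50 TPS autoregressive baseline is an illustrative assumption, not a measured number:

```python
# Wall-clock time for a 500-token completion at different decode rates.
# 891.74 TPS is the reported LLaDA2.1 figure; 50 TPS is an assumed,
# illustrative per-request rate for a 100B-class autoregressive model.
completion_tokens = 500
for name, tps in [("LLaDA2.1 flash (S Mode)", 891.74),
                  ("assumed autoregressive baseline", 50.0)]:
    print(f"{name}: {completion_tokens / tps:.2f} s")
# LLaDA2.1 flash (S Mode): 0.56 s
# assumed autoregressive baseline: 10.00 s
```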

The breakthrough is not merely about raw speed but about transcending a critical trade-off. "The delicate equilibrium between decoding speed and generation quality has remained an elusive frontier," the authors state in the LLaDA2.1 report. Their solution represents a "paradigm shift" by moving beyond the standard absorbing-state diffusion framework.

The "Permanent Token" Problem and the Draft-and-Edit Solution

Previous diffusion language models suffered from a fundamental flaw: under the standard absorbing-state framework, once a masked position was committed to a token during decoding, it stayed fixed for the remainder of generation. An early mistake would therefore propagate irreversibly through the rest of the sequence, limiting coherence and accuracy. LLaDA2.1 introduces a "Draft and Edit" methodology to break this constraint.

The model operates with two configurable probability thresholds governing different transitions. The first, Mask-to-Token (M2T), handles the initial generation of tokens from a masked state. The second, and more innovative, Token-to-Token (T2T) mechanism allows the model to retroactively edit and correct previously generated tokens based on newer contextual information. According to the technical documentation, this joint, configurable scheme is woven seamlessly into the decoding process, enabling the model to revise its output dynamically.
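The report does not include reference code, and the production decoder operates block-wise with more machinery than can be shown here. Still, a minimal Python sketch conveys how a two-threshold M2T/T2T scheme of this kind could work; all names, shapes, and threshold values below are illustrative assumptions, not the paper's implementation:

```python
import torch

@torch.no_grad()
def draft_and_edit_step(model, seq, mask_id, tau_m2t=0.90, tau_t2t=0.95):
    """One parallel denoising step with both transitions (illustrative).

    seq: (L,) token ids, with undecided positions holding mask_id.
    M2T: a masked position is committed once the model's top-1
         probability clears tau_m2t.
    T2T: an already-committed token is rewritten when the model now
         prefers a *different* token with probability above tau_t2t.
    """
    probs = model(seq.unsqueeze(0)).softmax(-1).squeeze(0)  # (L, V)
    conf, pred = probs.max(-1)               # top-1 confidence and token

    masked = seq == mask_id
    commit = masked & (conf >= tau_m2t)                  # Mask-to-Token
    edit = ~masked & (pred != seq) & (conf >= tau_t2t)   # Token-to-Token

    out = seq.clone()
    out[commit | edit] = pred[commit | edit]
    return out
```

In this framing, the two "personas" are simply different threshold presets: S Mode lowers the thresholds to commit and edit more tokens per step, which is also where the reported n-gram "stuttering" risk comes from, while Q Mode raises them and takes more, safer steps.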

This architectural innovation gives rise to two operational "personas": a Quality (Q) Mode that prioritizes accuracy and a Speed (S) Mode that maximizes throughput. The research acknowledges the trade-off, noting that aggressive threshold-lowering in S Mode can cause "stuttering" artifacts like n-gram repetitions, making Q Mode preferable for general chat.

Quantifiable Gains and Novel Training

The performance improvements are substantiated across multiple benchmarks. Compared to its predecessor, LLaDA2.1 shows gains on complex reasoning tasks: AIME 2025 improved from 60.00 to 63.33, ZebraLogic jumped from 82.30 to 88.90, and GPQA rose from 62.31 to 67.30. Enabling an advanced "Multi-Block Editing" (MBE) feature pushed scores even higher, with AIME 2025 reaching 70.00, albeit with a modest cost to throughput.

The training framework itself is novel. The team claims LLaDA2.1 introduces the first large-scale reinforcement learning framework for diffusion LLMs, built on an ELBO-based Block-level Policy Optimization. Because sequence-level likelihood is intractable to compute exactly for diffusion models, they employed Vectorized Likelihood Estimation to compute the bound in parallel. This was combined with a mixture of M2T and T2T training objectives and multi-turn forward data augmentation to make the correction mechanism robust.
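For readers unfamiliar with why an ELBO stands in for likelihood here: masked (absorbing-state) diffusion models admit a standard Monte Carlo bound in which one samples a mask ratio t, masks positions independently with probability t, and weights the cross-entropy on masked positions by 1/t. The sketch below vectorizes several such samples into a single batched forward pass. It illustrates the generic estimator under those assumptions, not the paper's exact Vectorized Likelihood Estimation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def elbo_nats_per_token(model, tokens, mask_id, n_samples=8):
    """Monte Carlo estimate of the masked-diffusion ELBO (nats/token).

    Generic absorbing-state estimator: draw t ~ U(0,1], mask each
    position independently with probability t, weight the masked-
    position cross-entropy by 1/t. Samples are stacked into one batch
    so a single forward pass covers all of them.
    """
    L = tokens.numel()
    x = tokens.unsqueeze(0).repeat(n_samples, 1)
    t = torch.rand(n_samples, 1).clamp(min=0.05)  # floor avoids huge 1/t
    masked = torch.rand(n_samples, L) < t
    x[masked] = mask_id

    logits = model(x)                             # (n_samples, L, V)
    nll = F.cross_entropy(
        logits.flatten(0, 1),
        tokens.unsqueeze(0).expand(n_samples, L).flatten(),
        reduction="none",
    ).view(n_samples, L)

    bound = (nll * masked.float() / t).sum(-1)    # per-sample bound (nats)
    return (bound.mean() / L).item()
```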

Infrastructure and Broader Context

Hitting these speeds required specialized infrastructure. The model is built on a customized version of SGLang with an Alpha MoE megakernel and per-block FP8 quantization. This focus on inference efficiency aligns with broader industry trends to reduce the massive computational cost of running frontier AI models.
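The megakernel itself is not public, but the arithmetic behind per-block FP8 quantization is straightforward: each fixed-size block of weights gets its own scale, so a single outlier only degrades the precision of its neighbors rather than the whole tensor. A minimal sketch using PyTorch's float8_e4m3fn dtype follows; the block size of 128 and the flattened layout are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_block(w: torch.Tensor, block: int = 128):
    """Quantize a weight tensor in fixed-size blocks (illustrative).

    Returns the FP8 payload (padded to a multiple of `block`) plus one
    fp32 scale per block for dequantization.
    """
    w = w.flatten()
    pad = (-w.numel()) % block
    w = F.pad(w, (0, pad))
    blocks = w.view(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).flatten()
```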

The development also intersects with other research into efficient diffusion model inference. A separate, related paper titled "Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models" explores methodologies for dynamically partitioning computation based on entropy analysis to optimize efficiency, highlighting the active research field LLaDA2.1 now leads.
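Swordsman's exact algorithm is not reproduced here, but the generic idea of entropy-driven partitioning can be sketched: extend the current block while the model's predictive entropy stays low (near-deterministic tokens are safe to decode in parallel) and cut a boundary when entropy spikes. The threshold and block cap below are illustrative values, not the paper's:

```python
import torch

def entropy_partition(logits, max_block=64, h_threshold=2.0):
    """Split a sequence into variable-size blocks by predictive entropy.

    logits: (L, V) per-position logits from a draft pass. A block ends
    whenever entropy exceeds h_threshold (in nats) or the block reaches
    max_block tokens.
    """
    logp = logits.log_softmax(-1)
    h = -(logp.exp() * logp).sum(-1)          # (L,) per-position entropy

    boundaries, start = [], 0
    for i in range(len(h)):
        if h[i] > h_threshold or (i - start + 1) >= max_block:
            boundaries.append((start, i + 1))
            start = i + 1
    if start < len(h):
        boundaries.append((start, len(h)))
    return boundaries                          # list of [start, end) blocks
```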

Implications and Open Questions

If the reported numbers are validated in independent testing and production environments, the implications for AI service providers are profound. Inference cost is a primary bottleneck in scaling AI applications, and a model that is both faster and capable of self-correction could dramatically lower operational expenses for coding assistants, search engines, and creative tools.

However, questions remain. The technical report notes that while code and math domains perform well in the fast S Mode, general chat is more problematic due to artifacts. The model's performance on long-form content generation and creative writing, where flow and consistency are paramount, is yet to be fully explored. Furthermore, the ecological impact of training such massive, specialized systems continues to be a subject of ethical scrutiny.

Nevertheless, LLaDA2.1 represents a bold step toward a new generation of non-autoregressive language models. By solving the permanent token problem and achieving record-breaking speeds, it challenges the dominance of sequential generation and points toward a future where AI can draft, reflect, and edit its thoughts in a manner that is both faster and more human-like.

Sources: This report synthesizes information from the LLaDA2.1 technical report "LLaDA2.1: Speeding Up Text Diffusion via Token Editing" (arXiv:2602.08676) and the related research overview "Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models" (arXiv:2602.04399). All performance figures and technical descriptions are derived from the primary research documentation.
