FlashKDA Open-Sourced: 2.5x Faster Kimi Delta Attention on H200 GPUs (2026)
Moonshot AI has open-sourced FlashKDA, a high-performance implementation of Kimi Delta Attention that delivers up to 2.5x faster inference on Hopper GPUs. Built with CUTLASS and optimized for variable-length batching, it integrates seamlessly into the flash-linear-attention ecosystem.

Moonshot AI has open-sourced FlashKDA, a high-performance implementation of Kimi Delta Attention built with NVIDIA’s CUTLASS library. It delivers up to 2.5x faster inference on H200 GPUs by replacing slower Triton kernels with fused CUDA kernels, which cuts latency for large-scale LLM deployments. The code is hosted on GitHub under an MIT license and plugs into the flash-linear-attention ecosystem via the chunk_kda API.
How FlashKDA Achieves 2.5x Speedup on H200 GPUs
FlashKDA’s breakthrough stems from three architectural innovations:
- Reduced Chunk Size (16 vs. 64): Keeps numerical ranges within bf16 precision, eliminating costly rescaling.
- Neumann-Series Inversion: Replaces expensive matrix decompositions with efficient 16×16 inversions.
- SM80-Class Tensor Core MMA: Uses MMA instructions introduced with Ampere (SM80), which Hopper GPUs also execute, rather than relying on Hopper-only features.
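To illustrate why a Neumann series suits small blocks, here is a minimal pure-Python sketch. It assumes the matrix to invert has the form I + N with N strictly lower triangular, as is typical in chunked delta-rule formulations; the article does not state FlashKDA's exact formulation, and the helper names are hypothetical. For a 16×16 block, N^16 = 0, so the series terminates and factors into four matrix products: (I + N)^-1 = (I - N)(I + N^2)(I + N^4)(I + N^8).

```python
# Illustrative sketch only, not FlashKDA's kernel code.
# Assumption: N is strictly lower triangular, hence nilpotent (N**n == 0 for n x n).

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_inverse(n_mat):
    """Return (I + N)^-1 as the product (I - N)(I + N^2)(I + N^4)..."""
    n = len(n_mat)
    eye = identity(n)
    result = add(eye, [[-x for x in row] for row in n_mat])  # I - N
    power = matmul(n_mat, n_mat)                             # N^2
    covered = 2  # result currently sums the series terms of degree 0..covered-1
    while covered < n:
        result = matmul(result, add(eye, power))
        power = matmul(power, power)
        covered *= 2
    return result

# Example: a deterministic 16x16 strictly lower-triangular block.
SIZE = 16
N = [[0.1 * ((7 * i + 3 * j) % 5 - 2) if j < i else 0.0 for j in range(SIZE)]
     for i in range(SIZE)]
INV = neumann_inverse(N)
```

Multiplying `add(identity(SIZE), N)` by `INV` recovers the identity up to floating-point error. The appeal of this form is that it uses only matrix multiplications, which map naturally onto tensor cores, with no pivoting or division.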
Benchmarks show a 2.51x speedup at 512 tokens and sustained 1.6x gains at 16K tokens, even with a batch size of 2.
Variable-Length Batching vs. Static Batching
Real-world LLM serving has to batch sequences of different lengths. FlashKDA outperforms the Triton kernels by 1.4x–1.5x across skewed, uniform, and random sequence-length distributions, which makes it well suited to dynamic inference workloads.
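A common way to express variable-length batches is to pack all sequences along one token dimension and pass cumulative offsets (often called cu_seqlens). The sketch below assumes that convention and the 16-token chunk size mentioned above. How FlashKDA handles short tail chunks internally is not described in this article, so the splitting rule here is an assumption.

```python
from itertools import accumulate

CHUNK_SIZE = 16  # chunk size reported for FlashKDA in this article

def cu_seqlens(lengths):
    """Cumulative start offsets of each sequence in the packed token dimension."""
    return [0, *accumulate(lengths)]

def chunk_spans(lengths, chunk_size=CHUNK_SIZE):
    """(start, end) token ranges per chunk; chunks never cross a sequence boundary."""
    offsets = cu_seqlens(lengths)
    spans = []
    for start, end in zip(offsets, offsets[1:]):
        for pos in range(start, end, chunk_size):
            spans.append((pos, min(pos + chunk_size, end)))
    return spans

# Three sequences of 5, 40, and 16 tokens packed into one 61-token batch.
offsets = cu_seqlens([5, 40, 16])
spans = chunk_spans([5, 40, 16])
```

With packing, no padding tokens are computed, which is where skewed length distributions benefit most compared with padding every sequence to the longest one.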
CUTLASS vs. Triton: Performance Benchmarks
Compared to Triton-based implementations, FlashKDA reduces global memory reads by over 60% by fusing the per-chunk steps, including state updates and attention weighting, into a single kernel. This fusion, enabled by CUTLASS’s low-level optimizations, minimizes HBM traffic and maximizes tensor core utilization.
Plug-and-Play Integration with flash-linear-attention
Deploying FlashKDA requires no architectural changes. After installing via pip install flashkda, call fla.ops.kda.chunk_kda() under torch.inference_mode(). The call dispatches automatically to the optimized CUDA kernel, so no code rewrite is needed.
With few dependencies (PyTorch 2.4+, CUDA 12.9, SM90+), FlashKDA is accessible to both research labs and production teams. It does not yet use TMA instructions, but its SM80-compatible MMA operations are meant to keep it portable across H200, GB300, and Blackwell (SM120) GPUs.
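The stated requirements can be checked before relying on the fused kernel. The following is a minimal sketch with hypothetical helper names, not part of FlashKDA or flash-linear-attention. It takes the versions as plain arguments so it runs without a GPU. The thresholds come from the requirements listed above and treat CUDA 12.9 as a minimum.

```python
import re

MIN_TORCH = (2, 4)       # PyTorch 2.4+
MIN_CUDA = (12, 9)       # CUDA 12.9, treated here as a minimum (assumption)
MIN_CAPABILITY = (9, 0)  # SM90+

def parse_version(text):
    """Turn '12.9' or '2.4.1+cu129' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_requirements(torch_version, cuda_version, capability):
    """Return True when all three stated minimums are satisfied."""
    if cuda_version is None:  # e.g. a CPU-only PyTorch build
        return False
    return (
        parse_version(torch_version)[:2] >= MIN_TORCH
        and parse_version(cuda_version)[:2] >= MIN_CUDA
        and tuple(capability) >= MIN_CAPABILITY
    )
```

In a live process the three arguments could come from `torch.__version__`, `torch.version.cuda`, and `torch.cuda.get_device_capability()`, with a fallback to the Triton path when the check fails.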
Industry momentum points the same way: NVIDIA’s work on Flash Attention v2 for SM120 follows the same fused-kernel philosophy as FlashKDA. Fused kernels of this kind look set to become the norm for efficient attention.
FlashKDA: high-performance Kimi Delta Attention kernels built on CUTLASS, now open for all to optimize, extend, and deploy. The code is available on GitHub.


