FlashKDA: Open-Source Kimi Delta Attention Kernels with 2.5x Speedup

FlashKDA Open-Sourced: 2.5x Faster Kimi Delta Attention on H200 GPUs (2026)

Moonshot AI has open-sourced FlashKDA, a high-performance implementation of Kimi Delta Attention that delivers up to 2.5x faster inference on Hopper GPUs. Built with CUTLASS and optimized for variable-length batching, it integrates seamlessly into the flash-linear-attention ecosystem.

Moonshot AI has open-sourced FlashKDA, a high-performance implementation of Kimi Delta Attention built with NVIDIA’s CUTLASS library. It runs up to 2.5x faster on H200 GPUs by replacing the slower Triton kernels with fused CUDA kernels, which cuts latency for large-scale LLM deployments. The code is hosted on GitHub under an MIT license and plugs into the flash-linear-attention ecosystem via the chunk_kda API.

How FlashKDA Achieves 2.5x Speedup on H200 GPUs

FlashKDA’s speedup comes from three design choices:

Reduced Chunk Size (16 vs. 64): Keeps numerical ranges within bf16 precision, eliminating costly rescaling.
Neumann-Series Inversion: Inverts each 16×16 chunk matrix with a truncated Neumann series instead of a more expensive matrix decomposition.
SM80 Tensor Core Optimization: Uses SM80-style tensor core MMA instructions rather than Hopper-only features such as TMA, which keeps the kernels portable across the supported SM90+ GPUs.
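
The Neumann-series idea can be illustrated in a few lines of plain Python. This is a minimal sketch, not FlashKDA's kernel: it assumes the 16×16 matrix has the form A = I + N with N strictly lower triangular (the usual shape in chunked delta-rule formulations, but an assumption here). For such a matrix N^16 = 0, so the series for the inverse terminates exactly, and the 16 terms can be grouped as (I + X)(I + X^2)(I + X^4)(I + X^8) with X = -N. All function names are illustrative.

```python
import random


def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mat_mul(a, b):
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def neumann_inverse_unit_lower(a):
    """Invert A = I + N, where N is strictly lower triangular.

    With X = -N, A^-1 = I + X + X^2 + ... + X^(n-1), because X^n = 0.
    The sum is evaluated as (I + X)(I + X^2)(I + X^4)..., squaring X each step.
    """
    n = len(a)
    eye = identity(n)
    # Entries on or above the diagonal are ignored (assumed unit diagonal).
    x = [[-a[i][j] if j < i else 0.0 for j in range(n)] for i in range(n)]
    inverse = mat_add(eye, x)      # covers the terms X^0 and X^1
    power = x
    terms_covered = 2
    while terms_covered < n:
        power = mat_mul(power, power)                  # X^2, X^4, X^8, ...
        inverse = mat_mul(inverse, mat_add(eye, power))
        terms_covered *= 2
    return inverse


def random_unit_lower(n, seed=0):
    """Random test matrix: ones on the diagonal, small values below it."""
    rng = random.Random(seed)
    return [[1.0 if i == j else (rng.uniform(-0.5, 0.5) if j < i else 0.0)
             for j in range(n)] for i in range(n)]


A = random_unit_lower(16)
A_inv = neumann_inverse_unit_lower(A)
product = mat_mul(A, A_inv)
max_err = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(16) for j in range(16))
print(f"max |A @ A_inv - I| = {max_err:.2e}")
```

For a 16×16 chunk the loop runs three times, so the whole inverse costs a handful of small matrix multiplies, which is the kind of work tensor cores handle well.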

Benchmarks show a 2.51x speedup at 512 tokens and a sustained 1.6x gain at 16K tokens, even at batch size 2.

Variable-Length Batching vs. Static Batching

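Real-world LLM serving mixes sequences of different lengths in the same batch, so variable-length batching matters more than static batching. FlashKDA outperforms Triton by 1.4x–1.5x across skewed, uniform, and random sequence-length distributions, which makes it a good fit for dynamic inference workloads.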
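
To make variable-length batching concrete, the sketch below packs sequences of different lengths into one flat buffer plus a list of cumulative offsets, instead of padding every sequence to the longest one. It assumes the cumulative-offset convention (often called cu_seqlens) that varlen attention kernels commonly use; the helper names are hypothetical and the article does not confirm how FlashKDA represents its batches.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative offsets: sequence i occupies flat[cu[i]:cu[i + 1]]."""
    return [0] + list(accumulate(seq_lens))


def pack(sequences):
    """Concatenate sequences without padding and record their boundaries."""
    flat = [token for seq in sequences for token in seq]
    return flat, build_cu_seqlens([len(seq) for seq in sequences])


def unpack(flat, cu_seqlens):
    """Recover the original sequences from the packed representation."""
    return [flat[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


def padding_waste(seq_lens):
    """Fraction of slots that static (padded) batching would spend on padding."""
    padded_slots = max(seq_lens) * len(seq_lens)
    return 1.0 - sum(seq_lens) / padded_slots


batch = [[11, 12, 13], [21], [31, 32]]
flat_tokens, cu_seqlens = pack(batch)
print(flat_tokens, cu_seqlens)
print(f"padding waste if padded instead: {padding_waste([3, 1, 2]):.0%}")
```

The more skewed the length distribution, the more slots padding wastes, which is why packed layouts are the usual choice for dynamic inference traffic.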

CUTLASS vs. Triton: Performance Benchmarks

Compared to Triton-based implementations, FlashKDA reduces global memory reads by over 60% by fusing gating, state updates, and attention weighting into a single kernel. This fusion, built on CUTLASS’s low-level primitives, keeps intermediate results out of HBM and raises tensor core utilization.

Plug-and-Play Integration with flash-linear-attention

Deploying FlashKDA requires no architectural changes. After installing via pip install flashkda, call fla.ops.kda.chunk_kda() under torch.inference_mode(). The call is dispatched automatically to the optimized CUDA kernel, so existing model code does not need to be rewritten.
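
The auto-dispatch behaviour can be pictured with a toy dispatcher. This is only a sketch of the pattern: every name in it is hypothetical, none of it is flash-linear-attention code, and the two "kernels" are stand-ins that compute the same result by different routes. The condition it encodes (take the fast path only in inference mode and only when the optimized backend is present) is an assumption drawn from the paragraph above.

```python
def reference_kernel(values):
    """Stand-in for the portable baseline path (hypothetical)."""
    return [2.0 * v for v in values]


def fused_kernel(values):
    """Stand-in for the optimized path; must match the reference numerically."""
    return [v + v for v in values]


def dispatch(values, *, inference_mode, fast_path_available):
    """Pick a backend without the caller changing its call site.

    Returns (backend_name, result) so the choice is visible to the caller.
    """
    if inference_mode and fast_path_available:
        return "fused", fused_kernel(values)
    return "reference", reference_kernel(values)


inputs = [0.5, 1.0, -3.0]
fast_name, fast_out = dispatch(inputs, inference_mode=True, fast_path_available=True)
slow_name, slow_out = dispatch(inputs, inference_mode=False, fast_path_available=True)
print(fast_name, fast_out)
print(slow_name, slow_out)
```

Because both paths return the same values, the caller's code stays identical whichever backend is chosen; only the speed differs.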

With minimal dependencies (PyTorch 2.4+, CUDA 12.9+, SM90+), FlashKDA is accessible to both research labs and production teams. It does not yet use TMA instructions; instead, its SM80-compatible MMA operations keep it portable across H200, GB300, and Blackwell (SM120) GPUs.
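
A pre-flight check against these requirements might look like the sketch below. It treats the versions listed above as minimums, which is an assumption, and the function names are illustrative. The check is written as a pure function over plain values so it runs anywhere; on a real machine the inputs would typically come from torch.__version__, torch.version.cuda, and torch.cuda.get_device_capability().

```python
import re

MIN_TORCH = (2, 4)        # PyTorch 2.4+
MIN_CUDA = (12, 9)        # CUDA 12.9, assumed here to be a minimum
MIN_CAPABILITY = (9, 0)   # SM90+


def parse_version(text):
    """Return the leading (major, minor) pair of a string like '2.4.1+cu129'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def unmet_requirements(torch_version, cuda_version, capability):
    """List every requirement the given environment fails; empty means OK."""
    problems = []
    if parse_version(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} is older than 2.4")
    if cuda_version is None:
        problems.append("PyTorch build has no CUDA support")
    elif parse_version(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.9")
    if tuple(capability) < MIN_CAPABILITY:
        problems.append(f"compute capability {tuple(capability)} is below SM90")
    return problems


print(unmet_requirements("2.4.1+cu129", "12.9", (9, 0)))
print(unmet_requirements("2.3.0", "12.4", (8, 0)))
```

Returning a list of problems rather than a single boolean lets a deployment script report every mismatch at once.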

Industry momentum points the same way: NVIDIA’s own Flash Attention v2 efforts on SM120 follow the same fused-kernel philosophy as FlashKDA. Fused kernels are becoming the standard route to efficient attention.

FlashKDA: high-performance Kimi Delta Attention kernels built on CUTLASS, now open for all to optimize, extend, and deploy. Get the code on GitHub.
