DFlash Speculative Decoding: 4.1x Faster AI on Apple Silicon

DFlash Speculative Decoding Delivers 4.1x Faster AI Inference on Apple Silicon M5 Max in 2026

DFlash speculative decoding has delivered a roughly 4.1x throughput gain on an Apple Silicon M5 Max running Qwen3.5-9B, a notable result for on-device AI inference. Developed by an independent researcher and now open-sourced, the technique uses Apple’s unified memory architecture and the MLX framework to accelerate large language model token generation without compromising accuracy. Unlike speculative decoding implementations that require forked libraries or custom hardware, DFlash runs entirely on stock MLX, which should make the results easier to reproduce on consumer-grade devices.

How DFlash Works Under the Hood: Lossless Parallel Token Generation

Unified Memory Optimization

DFlash exploits Apple Silicon’s unified memory architecture to eliminate data copying between CPU and GPU. This reduces latency and maximizes memory bandwidth utilization during parallel token generation. The system uses a lightweight draft model to predict 16 tokens simultaneously via block diffusion, while the full target model (Qwen3.5-9B) validates them in a single forward pass.
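The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding in general, not DFlash's code: `toy_target`, `toy_draft_block`, and every other name here are hypothetical stand-ins, and the single batched verification pass is emulated with a plain loop.

```python
from typing import Callable, List

# Hypothetical stand-ins: a "model" is just a function over token lists.
NextToken = Callable[[List[int]], int]              # greedy next token of the target
DraftBlock = Callable[[List[int], int], List[int]]  # proposes k tokens at once


def verify_block(prefix: List[int], draft: List[int], target_next: NextToken) -> List[int]:
    """Return the tokens to commit for one drafted block.

    A real system gets the target's prediction at every draft position from
    one forward pass; this loop emulates that result position by position.
    """
    committed: List[int] = []
    for tok in draft:
        expected = target_next(prefix + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: keep the target's token and stop
            return committed
        committed.append(tok)           # match: the drafted token is accepted
    committed.append(target_next(prefix + committed))  # whole block accepted: one bonus token
    return committed


def generate(prefix: List[int], draft_block: DraftBlock, target_next: NextToken,
             max_new: int, block_size: int = 16) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
    return out[: len(prefix) + max_new]


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 17


def toy_draft_block(seq: List[int], k: int) -> List[int]:
    """A draft that agrees with the target except at every fifth position."""
    block, cur = [], list(seq)
    for _ in range(k):
        tok = toy_target(cur)
        if len(cur) % 5 == 0:
            tok = (tok + 1) % 17        # deliberately wrong guess
        block.append(tok)
        cur.append(tok)
    return block


def greedy(prefix: List[int], target_next: NextToken, max_new: int) -> List[int]:
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out
```

Because a drafted token is kept only when it equals the target's own greedy choice, `generate` returns the same sequence as `greedy`, however often the draft is wrong. Only the number of verification steps changes.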

Lossless Token Prediction with Tape-Replay Rollback

Every drafted token is verified by the target model before it is committed, so the output matches what the target model would have generated on its own; this is what makes the method lossless. The key piece is a custom Metal kernel: the tape-replay rollback mechanism replays only the accepted tokens through the GatedDeltaNet recurrent states, avoiding costly full checkpoint saves. The author reports that this cuts memory overhead by 68% compared to conventional speculative decoding.
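The rollback idea can be illustrated with a toy recurrent layer. The sketch below is an assumption-heavy simplification, not the Metal kernel or the real GatedDeltaNet update: it uses a generic gated outer-product recurrence, and the names `apply_update`, `verify_with_tape`, and `rollback_by_replay` are hypothetical. The point is that only the small per-token inputs (the tape) are stored, rather than a copy of the full state after every drafted token.

```python
from typing import List, Tuple

Matrix = List[List[float]]
# One tape entry per drafted token: (gate, key vector, value vector).
TapeEntry = Tuple[float, List[float], List[float]]


def apply_update(state: Matrix, entry: TapeEntry) -> Matrix:
    """Toy gated recurrence: S <- gate * S + outer(key, value). Returns a new matrix."""
    gate, key, value = entry
    return [[gate * state[i][j] + key[i] * value[j] for j in range(len(value))]
            for i in range(len(key))]


def verify_with_tape(state: Matrix, entries: List[TapeEntry]) -> Tuple[Matrix, List[TapeEntry]]:
    """Run the whole drafted block speculatively, recording each update on a tape."""
    tape: List[TapeEntry] = []
    cur = state
    for entry in entries:
        tape.append(entry)
        cur = apply_update(cur, entry)
    return cur, tape


def rollback_by_replay(state_before: Matrix, tape: List[TapeEntry], n_accepted: int) -> Matrix:
    """Rebuild the state as if only the first n_accepted tokens had been processed."""
    cur = state_before
    for entry in tape[:n_accepted]:
        cur = apply_update(cur, entry)
    return cur


state0 = [[0.0, 0.0], [0.0, 0.0]]
block = [
    (0.9, [1.0, 0.0], [0.5, 0.5]),
    (0.8, [0.0, 1.0], [1.0, -1.0]),
    (0.7, [1.0, 1.0], [0.2, 0.3]),
    (0.6, [0.5, 0.5], [0.1, 0.9]),
]
speculative_state, tape = verify_with_tape(state0, block)
# Suppose verification accepted only the first two drafted tokens.
rolled_back = rollback_by_replay(state0, tape, n_accepted=2)
```

In this toy, each tape entry holds a scalar and two vectors, while a checkpoint would hold a full state matrix per token. That size gap is the intuition behind replaying instead of checkpointing.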

Stable bf16 Pathways and JIT-Optimized SDPA

Earlier versions suffered from numerical instability in the bf16 paths, which held the acceptance rate to 82%. After the precision issues were fixed, DFlash reached a consistent 89.4% token acceptance rate on Qwen3.5-9B. A JIT-optimized 2-pass SDPA kernel handles verification efficiently for contexts longer than 1024 tokens, which matters for real-world LLM applications.
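The article does not describe the kernel's internals. As a rough illustration of what "two-pass" scaled dot-product attention commonly means, the sketch below splits the keys into chunks, computes a partial softmax per chunk in the first pass, and merges the partial results with a rescaling step in the second. It is pure Python for a single query vector; the function names and chunk size are hypothetical, and it makes no claim about how the actual Metal kernel is organized.

```python
import math
from typing import List, Tuple

Vector = List[float]


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[float, float, Vector]:
    """Pass 1 for one chunk: running max, softmax denominator, unnormalized output."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # shifted by the chunk max for stability
    denom = sum(weights)
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, denom, acc


def two_pass_sdpa(q: Vector, keys: List[Vector], values: List[Vector], chunk: int = 4) -> Vector:
    # Pass 1: independent partial results per chunk (these could run in parallel).
    partials = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    # Pass 2: rescale every chunk to the global max, then combine and normalize.
    m_all = max(m for m, _, _ in partials)
    denom = sum(d * math.exp(m - m_all) for m, d, _ in partials)
    dim = len(values[0])
    return [sum(acc[i] * math.exp(m - m_all) for m, _, acc in partials) / denom
            for i in range(dim)]


def reference_sdpa(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-pass softmax attention over all keys, used as a check."""
    _, denom, acc = partial_attention(q, keys, values)
    return [a / denom for a in acc]


q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j) for j in range(3)] for i in range(10)]
two_pass = two_pass_sdpa(q, keys, values)
single_pass = reference_sdpa(q, keys, values)
```

The two results agree up to floating-point rounding. Splitting along the key dimension is useful for long contexts because each chunk can be processed independently before the cheap merge.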

Why Apple Silicon + MLX Is the Perfect Match for On-Device AI

Bandwidth Over Compute: The Apple Silicon Advantage

Benchmarked on an M5 Max with 64GB unified memory and MLX 0.31.1, DFlash increased token throughput from 30.96 to 127.07 tokens per second, a gain of about 4.10x. The gain is smaller on quantized models such as Qwen3.5-27B-4bit (1.90x speedup), which points to memory throughput, rather than compute, as the bottleneck on bandwidth-bound hardware like Apple Silicon. When the target model is already quantized, the bf16 draft model becomes the limiting factor.
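The arithmetic behind these figures is easy to check, and a back-of-the-envelope model shows why a quantized target gains less. The first function uses only the throughput numbers quoted above. The second is a deliberately crude assumption, not a measurement: it treats each step as purely bandwidth-bound, costing one read of the draft weights plus one read of the target weights, and all sizes and the tokens-per-step value passed to it are hypothetical.

```python
def reported_speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of the two measured throughputs."""
    return accelerated_tps / baseline_tps


def bandwidth_bound_speedup(tokens_per_step: float, draft_gb: float, target_gb: float) -> float:
    """Toy model (assumption): time is proportional to the bytes of weights read.

    Baseline: one target weight read per generated token.
    Speculative: one draft read plus one target read per step,
    which commits tokens_per_step tokens on average.
    """
    return tokens_per_step * target_gb / (draft_gb + target_gb)


print(f"{reported_speedup(30.96, 127.07):.2f}x")  # prints "4.10x"

# Hypothetical sizes: the same 2 GB bf16 draft paired with a large and a small target.
large_target = bandwidth_bound_speedup(tokens_per_step=6.0, draft_gb=2.0, target_gb=18.0)
small_target = bandwidth_bound_speedup(tokens_per_step=6.0, draft_gb=2.0, target_gb=4.5)
```

Under this model the draft's fixed cost is a larger share of each step when the target is small, so `small_target` comes out below `large_target`. That is consistent with the reduced speedup reported for the 4-bit model, though the real behavior depends on details the model ignores.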

Why Stock MLX Outperforms Custom Kernels

Surprisingly, custom Metal kernels for GEMV and SDPA operations performed worse than MLX’s optimized stock implementations. The real wins came from intelligent memory reuse and numerical precision fixes, not from raw compute tuning. This suggests that, for this workload at least, careful software design matters more on Apple Silicon than hand-tuned kernels.

Scalability, Open-Source Access, and the Future of Edge AI

The same DFlash approach delivered a 4.10x speedup on the smaller Qwen3.5-4B model, which indicates that the gains carry across model sizes. The technique works best on Qwen3.5’s hybrid GatedDeltaNet + attention architecture; it also speeds up pure attention models, but those do not benefit from tape-replay. With the code now available on GitHub, developers can replicate and extend the work. The roadmap includes optimizing DFlash for pure attention architectures and compressing draft models for lower-memory devices.

DFlash speculative decoding is no longer just a prototype: it is open source, benchmarked, and shown to work across more than one model size. As the technique matures, it could expand what consumer-grade hardware can do with local LLMs without discrete GPUs.
