MLX Reimplementation of Qwen3-ASR Delivers Breakthrough Speech Recognition on Apple Silicon
A ground-up reimplementation of Alibaba's Qwen3-ASR speech recognition model using Apple's MLX framework now enables native, high-performance audio transcription on M-series chips. With sub-second latency on short clips and a memory footprint of a few gigabytes, the open-source tool makes fully local AI transcription practical on everyday Mac hardware.

A notable advance in on-device artificial intelligence has emerged from the open-source community: a complete reimplementation of Alibaba's Qwen3-ASR automatic speech recognition (ASR) model using Apple's MLX framework. Published on GitHub as mlx-qwen3-asr, the project brings state-of-the-art multilingual speech-to-text to Apple's M-series chips natively, with no dependency on PyTorch or Hugging Face's transformers library.
According to the project's creator, who posts on Reddit as u/PrimaryAbility9, the MLX implementation is strikingly efficient. In benchmarks on an M4 Pro chip, a 2.5-second audio clip is transcribed in 0.46 seconds, a real-time factor (RTF) of roughly 0.18, and a 10-second clip in 0.83 seconds, an RTF of roughly 0.08. In other words, the system processes audio five to twelve times faster than it plays back, comfortably fast enough for interactive, fully local transcription on consumer hardware.
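For readers unfamiliar with the metric, RTF is simply processing time divided by audio duration, so lower is better and anything under 1.0 is faster than real time. The arithmetic behind the reported figures takes only a few lines of Python:

    def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
        # RTF = processing time / audio duration; below 1.0 means faster than real time.
        return wall_seconds / audio_seconds

    # Figures reported for the M4 Pro benchmarks:
    print(round(real_time_factor(2.5, 0.46), 2))   # 0.18 -> about 5.4x real time
    print(round(real_time_factor(10.0, 0.83), 2))  # 0.08 -> about 12x real time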
The project ships two model variants: a compact 0.6-billion-parameter version and a larger 1.7-billion-parameter one, both covering 52 languages. Notably, the implementation includes a native MLX forced aligner that produces word-level timestamps, an essential feature for subtitling, forensic analysis, and accessibility applications. Output formats include TXT, JSON, SRT, VTT, and TSV, making the results easy to feed into video editors, transcription services, and archival systems.
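To illustrate what word-level timestamps enable, the sketch below converts a list of (word, start, end) tuples, the kind of output a forced aligner produces, into SRT subtitle cues. It is a generic example in plain Python and does not reflect the package's actual API:

    def srt_timestamp(seconds: float) -> str:
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def words_to_srt(words: list[tuple[str, float, float]]) -> str:
        # One cue per word; a real subtitler would group words into phrases.
        cues = []
        for i, (word, start, end) in enumerate(words, 1):
            cues.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{word}\n")
        return "\n".join(cues)

    print(words_to_srt([("Hello", 0.00, 0.42), ("world", 0.45, 0.90)]))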
Memory efficiency is another strength. The 0.6B model requires only about 1.2 GB of RAM, while the 1.7B model consumes about 3.4 GB, well below the footprint of equivalent PyTorch deployments. That puts smooth operation within reach of a base MacBook Air or iPad Pro, opening the door to real-time transcription in mobile and field environments.
Quantization improves performance further. The 4-bit quantized version achieves a 4.7x speedup over the FP16 baseline, at the cost of only a modest increase in word error rate (WER): from 2.29% to 2.72% on the LibriSpeech test-clean dataset. On the multilingual-100 benchmark, the MLX implementation achieves a WER of 15.99%, edging out the official PyTorch version's 16.69%. These results challenge the assumption that leaving a model's reference framework must cost accuracy.
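WER, the metric quoted above, is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The textbook implementation below is for intuition only; published benchmarks normally rely on standardized tooling and text normalization:

    def wer(reference: str, hypothesis: str) -> float:
        # Word error rate: (substitutions + insertions + deletions) / reference words,
        # computed as a Levenshtein distance over word sequences.
        ref, hyp = reference.split(), hypothesis.split()
        d = list(range(len(hyp) + 1))  # edit distances for the previous row
        for i, r in enumerate(ref, 1):
            prev_diag, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                cur = min(
                    d[j] + 1,              # delete a reference word
                    d[j - 1] + 1,          # insert a hypothesis word
                    prev_diag + (r != h),  # substitute (or match for free)
                )
                prev_diag, d[j] = d[j], cur
        return d[-1] / max(len(ref), 1)

    print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167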
The software stack is remarkably lean: only four dependencies are required (MLX, NumPy, regex, and huggingface-hub). Crucially, the inference pipeline contains no PyTorch or transformers code at all, eliminating a large source of overhead and compatibility issues. The project also includes 393 automated tests, all validated against committed JSON artifacts, a level of regression coverage that is rare in experimental AI ports.
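Testing against committed artifacts is a golden-file pattern: run the model on a fixed input and compare the result with a JSON file checked into the repository. The pytest sketch below shows the shape of such a test; the paths and the transcribe import are hypothetical stand-ins, not the project's actual suite:

    import json
    from pathlib import Path

    def test_transcription_matches_golden_artifact():
        # Hypothetical import standing in for the package's inference entry point.
        from my_asr import transcribe

        expected = json.loads(Path("tests/artifacts/sample_01.json").read_text())
        actual = transcribe("tests/audio/sample_01.wav")
        # Exact comparison keeps the test deterministic and reproducible.
        assert actual["text"] == expected["text"]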
Experimental features such as streaming inference and speculative decoding are already implemented, pointing toward real-time applications in live captioning and voice assistants. The developer has also signaled that speaker diarization is coming soon, which would allow the system to distinguish between multiple speakers, a critical upgrade for meeting transcription and podcast analysis.
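Streaming inference generally means feeding audio to the model in short chunks and emitting partial text as each chunk arrives. The loop below sketches that control flow with hypothetical placeholders; the project's actual streaming interface is not documented here:

    def stream_transcript(audio_chunks, transcribe_chunk):
        # Carry the recognized text forward as context while chunks arrive.
        context = ""
        for chunk in audio_chunks:
            context = transcribe_chunk(chunk, context)
            yield context  # emit a partial transcript after every chunk

    # Toy stand-in so the loop runs end to end:
    chunks = ["hello", "world"]
    for partial in stream_transcript(chunks, lambda c, ctx: (ctx + " " + c).strip()):
        print(partial)  # "hello", then "hello world"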
The broader significance is clear: high-performance speech models no longer require cloud infrastructure or NVIDIA GPUs. With Apple's MLX framework, developers can build fast, private, and energy-efficient speech recognition tools that run entirely on consumer devices. For journalists, researchers, and accessibility advocates, that is a meaningful step toward decentralized, privacy-preserving AI.
Install the package via pip install mlx-qwen3-asr. The full source code, benchmarks, and documentation are available on GitHub.


