Izwi AI Launches Local Audio Engine with Speaker Diarization and Real-Time Streaming
Izwi, a new local audio inference engine, has rolled out major upgrades including speaker diarization, word-level forced alignment, and multi-format audio support—all optimized for on-device performance. The open-source tool aims to empower developers with privacy-first, low-latency speech processing.

Izwi AI Unveils Groundbreaking Local Audio Processing Capabilities
A new open-source audio inference engine named Izwi has made a significant leap forward in on-device speech processing, introducing a suite of advanced features designed to bring enterprise-grade transcription and audio analysis to local hardware. According to a post on Reddit’s r/artificial community, Izwi now supports speaker diarization, forced alignment, real-time streaming, and multi-format audio input—all while running efficiently without cloud dependency.
The engine’s new speaker diarization module leverages Sortformer models to automatically identify and separate up to four distinct voices in a single audio stream. This capability is particularly valuable for meeting recordings, interviews, and courtroom transcripts, where knowing who spoke when is critical. Unlike cloud-based alternatives that require data uploads and incur latency, Izwi performs this analysis entirely on-device, enhancing privacy and reducing operational costs.
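To show what consuming that kind of output can look like downstream, here is a minimal Rust sketch that merges consecutive segments from the same speaker into turns, a common post-processing step before rendering a "who spoke when" view. The SpeakerSegment type and its field names are illustrative assumptions, not Izwi's published API.

```rust
/// One labeled stretch of speech from a diarization pass.
/// The shape (speaker index plus start/end in milliseconds) is an assumption
/// about typical diarizer output, not Izwi's actual types.
#[derive(Debug, Clone)]
struct SpeakerSegment {
    speaker: u8, // 0..=3 for up to four detected voices
    start_ms: u64,
    end_ms: u64,
}

/// Merge consecutive segments from the same speaker into a single turn.
fn merge_turns(segments: &[SpeakerSegment]) -> Vec<SpeakerSegment> {
    let mut turns: Vec<SpeakerSegment> = Vec::new();
    for seg in segments {
        // Extend the previous turn if the same speaker keeps talking.
        let same_speaker = turns.last().map(|t| t.speaker) == Some(seg.speaker);
        if same_speaker {
            turns.last_mut().unwrap().end_ms = seg.end_ms;
        } else {
            turns.push(seg.clone());
        }
    }
    turns
}

fn main() {
    let raw = vec![
        SpeakerSegment { speaker: 0, start_ms: 0, end_ms: 1200 },
        SpeakerSegment { speaker: 0, start_ms: 1200, end_ms: 2500 },
        SpeakerSegment { speaker: 1, start_ms: 2500, end_ms: 4100 },
    ];
    for turn in merge_turns(&raw) {
        println!("speaker {} spoke from {} ms to {} ms", turn.speaker, turn.start_ms, turn.end_ms);
    }
}
```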
Complementing diarization is the integration of Qwen3-ForcedAligner, which aligns transcribed text with the spoken audio at the word level, producing a precise timestamp for every word. This is especially valuable for subtitle generation, accessibility tools, and forensic audio analysis. Because each spoken word is tied to a millisecond-level position in the audio file, developers can build applications that synchronize captions with video or highlight spoken keywords in real time, without relying on external APIs.
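To illustrate one of those uses, the sketch below turns word-level timestamps into SubRip (.srt) captions. The AlignedWord shape is an assumption about what a word-level aligner typically emits; it is not taken from Izwi's documentation.

```rust
/// A single aligned word: text plus start/end offsets in milliseconds.
/// This shape is an assumption, not Izwi's actual output type.
struct AlignedWord {
    text: String,
    start_ms: u64,
    end_ms: u64,
}

/// Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm).
fn srt_time(ms: u64) -> String {
    format!(
        "{:02}:{:02}:{:02},{:03}",
        ms / 3_600_000,
        (ms / 60_000) % 60,
        (ms / 1_000) % 60,
        ms % 1_000
    )
}

/// Group aligned words into caption blocks of at most `max_words`
/// and emit them as SubRip (.srt) text.
fn words_to_srt(words: &[AlignedWord], max_words: usize) -> String {
    let mut out = String::new();
    for (i, chunk) in words.chunks(max_words).enumerate() {
        let text: Vec<&str> = chunk.iter().map(|w| w.text.as_str()).collect();
        out.push_str(&format!(
            "{}\n{} --> {}\n{}\n\n",
            i + 1,
            srt_time(chunk[0].start_ms),
            srt_time(chunk[chunk.len() - 1].end_ms),
            text.join(" ")
        ));
    }
    out
}

fn main() {
    let words = vec![
        AlignedWord { text: "Welcome".into(), start_ms: 0, end_ms: 420 },
        AlignedWord { text: "to".into(), start_ms: 420, end_ms: 530 },
        AlignedWord { text: "the".into(), start_ms: 530, end_ms: 610 },
        AlignedWord { text: "meeting".into(), start_ms: 610, end_ms: 1100 },
    ];
    print!("{}", words_to_srt(&words, 2));
}
```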
Performance optimizations are equally impressive. Izwi employs parallel execution, batched automatic speech recognition (ASR), paged key-value (KV) caching, and Metal-specific hardware acceleration for Apple Silicon devices. These enhancements significantly reduce inference latency and memory overhead, allowing even low-power devices to handle complex audio tasks smoothly. The engine also supports native decoding of WAV, MP3, FLAC, and OGG formats via the Symphonia audio library, eliminating the need for preprocessing or format conversion.
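Izwi credits its format handling to Symphonia, a pure-Rust decoding library. As a rough illustration of what a format-agnostic decode path looks like, the following is a generic Symphonia probe-and-decode loop, not Izwi's internal code; it assumes the symphonia crate is available with the relevant format and codec features (such as MP3) enabled.

```rust
use std::fs::File;

use symphonia::core::codecs::{DecoderOptions, CODEC_TYPE_NULL};
use symphonia::core::formats::FormatOptions;
use symphonia::core::io::MediaSourceStream;
use symphonia::core::meta::MetadataOptions;
use symphonia::core::probe::Hint;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open any supported container; the probe below detects the format.
    let file = File::open("meeting.mp3")?;
    let mss = MediaSourceStream::new(Box::new(file), Default::default());

    // Probe the stream to identify WAV, MP3, FLAC, OGG, etc.
    let probed = symphonia::default::get_probe().format(
        &Hint::new(),
        mss,
        &FormatOptions::default(),
        &MetadataOptions::default(),
    )?;
    let mut format = probed.format;

    // Pick the first track with a decodable codec.
    let track = format
        .tracks()
        .iter()
        .find(|t| t.codec_params.codec != CODEC_TYPE_NULL)
        .ok_or("no supported audio track")?;
    let track_id = track.id;

    // Build a decoder for that track's codec parameters.
    let mut decoder =
        symphonia::default::get_codecs().make(&track.codec_params, &DecoderOptions::default())?;

    // Decode packet by packet, counting PCM frames; the loop ends at end of stream.
    let mut total_frames: u64 = 0;
    while let Ok(packet) = format.next_packet() {
        if packet.track_id() != track_id {
            continue;
        }
        let decoded = decoder.decode(&packet)?;
        total_frames += decoded.frames() as u64;
    }
    println!("decoded {total_frames} PCM frames");
    Ok(())
}
```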
Model diversity is another cornerstone of Izwi’s architecture. The platform supports multiple open-weight models across core audio functions: for ASR, it includes Qwen3-ASR (0.6B and 1.7B parameters), Parakeet TDT, and LFM2.5-Audio; for text-to-speech (TTS), it offers Qwen3-TTS and LFM2.5-Audio; and for conversational AI, it integrates Qwen3 (0.6B, 1.7B) and Gemma 3 (1B). This multi-model flexibility allows users to select the optimal balance between accuracy, speed, and resource usage depending on their deployment scenario.
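As a purely illustrative sketch of that trade-off, the snippet below encodes a toy selection policy in Rust. The enum variants mirror the ASR models named above, but the policy and any such configuration surface are assumptions rather than part of Izwi.

```rust
/// ASR back-ends reported for Izwi; the enum itself is illustrative only.
#[allow(dead_code)]
#[derive(Debug)]
enum AsrModel {
    Qwen3Asr0_6B, // smaller, faster, lower memory
    Qwen3Asr1_7B, // larger, typically more accurate
    ParakeetTdt,
    Lfm2_5Audio,
}

/// A toy policy: prefer the 0.6B Qwen3-ASR when memory is tight or latency
/// matters most, otherwise step up to the 1.7B variant for accuracy.
fn pick_asr_model(available_ram_gb: f32, realtime_required: bool) -> AsrModel {
    if realtime_required || available_ram_gb < 4.0 {
        AsrModel::Qwen3Asr0_6B
    } else {
        AsrModel::Qwen3Asr1_7B
    }
}

fn main() {
    println!("{:?}", pick_asr_model(3.0, true));
    println!("{:?}", pick_asr_model(16.0, false));
}
```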
Real-time streaming is now fully implemented across transcription, chat, and TTS workflows. Users receive incremental outputs while audio is still being processed, enabling live captioning, interactive voice assistants, and responsive audio interfaces without waiting for a file to finish. This is especially beneficial for teleconferencing apps, live transcription services, and real-time accessibility tools.
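Consuming incremental output follows a familiar pattern regardless of the engine behind it. The sketch below uses a standard Rust channel as a stand-in for a streaming transcription session: a worker thread emits partial hypotheses as they become available, and the consumer renders them immediately. The TranscriptEvent type is hypothetical and does not represent Izwi's streaming interface.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical incremental events from a streaming transcription session.
enum TranscriptEvent {
    Partial(String), // provisional text that may still be revised
    Final(String),   // text the engine has committed to
}

fn main() {
    let (tx, rx) = mpsc::channel::<TranscriptEvent>();

    // Stand-in for the engine: emits partial results as audio "arrives".
    thread::spawn(move || {
        for partial in ["hello", "hello wor", "hello world"] {
            tx.send(TranscriptEvent::Partial(partial.to_string())).unwrap();
            thread::sleep(Duration::from_millis(200));
        }
        tx.send(TranscriptEvent::Final("Hello, world.".to_string())).unwrap();
    });

    // Live-caption style consumer: print each update the moment it arrives.
    for event in rx {
        match event {
            TranscriptEvent::Partial(text) => println!("... {text}"),
            TranscriptEvent::Final(text) => println!("=> {text}"),
        }
    }
}
```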
Izwi is openly available on GitHub under the Agentem AI organization, with comprehensive documentation hosted at izwiai.com. The development team encourages community contributions and feedback, signaling a commitment to open innovation. With no proprietary lock-in and full local execution, Izwi represents a compelling alternative to commercial cloud-based speech platforms like Google Speech-to-Text or AWS Transcribe—particularly for organizations bound by data sovereignty laws or seeking to minimize operational expenses.
As AI shifts toward edge computing, tools like Izwi signal a broader trend: powerful, privacy-preserving audio intelligence no longer requires the cloud. For developers building next-generation voice applications—from smart home systems to legal tech and healthcare transcription—this update may well be a turning point.


