MichiAI: Full-Duplex Speech LLM Achieves 75ms Latency
A new 530-million-parameter Large Language Model (LLM), dubbed MichiAI, has demonstrated full-duplex speech with remarkably low latency. Developed with efficiency in mind, the model avoids the coherence degradation that plagues similar systems.

Developed by Ketsui Labs, MichiAI delivers full-duplex speech with a reported latency of approximately 75 milliseconds, addressing a significant challenge in real-time conversational AI: maintaining coherence without demanding excessive computational resources.
Information synthesized from a Reddit post by /u/kwazar90, arXiv papers 2601.22779 and 2509.02521v3, and a COEY blog post.
The development of MichiAI represents a notable step forward in the quest for natural and responsive voice-enabled AI systems. Unlike many existing full-duplex speech models that suffer from degraded coherence, MichiAI employs a novel architecture and training methodology to overcome these limitations. The core innovation lies in its efficient design, which prioritizes low compute for both training and inference, a crucial aspect for practical deployment.
Architectural Innovations for Efficiency and Coherence
MichiAI's architecture eschews traditional codebooks, opting instead for Rectified Flow Matching to predict continuous audio embeddings in a single forward pass. This contrasts with the multiple passes often required by discrete models. A key component is the 'Listen head,' which functions as a multimodal encoder, seamlessly integrating audio embeddings with text tokens. The inclusion of input text tokens, according to the developers, is a significant factor in retaining conversational coherence, a departure from models that rely solely on audio embeddings for the input stream.
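For illustration, the sketch below shows one way a listen head and a rectified-flow head could be wired around an LLM backbone in PyTorch. The module names, dimensions, and additive fusion are assumptions made for this example, not MichiAI's published implementation.

    import torch
    import torch.nn as nn

    class ListenHead(nn.Module):
        """Projects incoming audio embeddings into the LLM hidden space and
        fuses them with embedded text tokens (additive fusion is an assumption)."""
        def __init__(self, audio_dim: int = 512, hidden_dim: int = 960):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, hidden_dim)

        def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            # audio_emb: (batch, seq, audio_dim); text_emb: (batch, seq, hidden_dim)
            return self.audio_proj(audio_emb) + text_emb

    class RectifiedFlowHead(nn.Module):
        """Predicts the velocity of a rectified flow from a noisy audio embedding
        conditioned on the LLM hidden state, so a clean continuous embedding can be
        estimated along the straight path x_t = (1 - t) * noise + t * target."""
        def __init__(self, hidden_dim: int = 960, audio_dim: int = 512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(hidden_dim + audio_dim + 1, hidden_dim),
                nn.SiLU(),
                nn.Linear(hidden_dim, audio_dim),
            )

        def forward(self, h: torch.Tensor, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
            # h: LLM hidden state; x_t: noisy audio embedding; t: (batch, seq, 1) timestep
            return self.net(torch.cat([h, x_t, t], dim=-1))

With rectified flow matching, the training target is simply the straight-line velocity (target minus noise), which is what permits predicting continuous audio embeddings in very few steps rather than iterating over discrete codebook tokens.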
The model leverages the established capabilities of the SmolLM 360M LLM as its backbone, effectively recycling and adapting its pre-trained textual knowledge for speech-related tasks. This approach has allowed MichiAI to achieve fluent speech with a relatively modest dataset of only 5,000 hours of audio, demonstrating efficient knowledge transfer.
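A minimal sketch of recycling a pretrained text LLM as the speech backbone is shown below, assuming the publicly available HuggingFaceTB/SmolLM-360M checkpoint on Hugging Face; how MichiAI actually attaches and fine-tunes its heads is not detailed in the source.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the pretrained text backbone (assumed checkpoint name).
    backbone = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-360M")
    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-360M")

    # The transformer layers keep their pre-trained textual knowledge; speech-specific
    # modules such as the listen head and flow head (see sketch above) would be
    # trained on paired audio/text data on top of this backbone.
    print(backbone.config.hidden_size)  # hidden size of the backbone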
Efficient Training and Low Latency Performance
The training process for MichiAI was designed for efficiency: most of it ran on a single NVIDIA 4090 GPU, with the more memory-intensive stages using two A6000 GPUs. This resource-conscious approach underscores the project's goal of making advanced AI more accessible.
One of the techniques employed to maintain coherence during training involved mixing pure text samples into the dataset. This strategy, coupled with the efficient architecture, has resulted in a model that exhibits no visible degradation in language modeling performance, as indicated by its loss curves. During testing, MichiAI maintained the same reasoning capabilities as its base LLM backbone.
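One simple way to realize this kind of text mixing is sketched below; the sampling ratio and the sample format are assumptions for illustration, not the recipe MichiAI actually used.

    import random

    def mixed_stream(speech_samples, text_samples, text_prob=0.1, seed=0):
        """Yield speech training samples, occasionally inserting a pure-text
        sample so the backbone keeps seeing ordinary language-modeling data."""
        rng = random.Random(seed)
        text_idx = 0
        for speech in speech_samples:
            if text_samples and rng.random() < text_prob:
                yield {"modality": "text", "sample": text_samples[text_idx % len(text_samples)]}
                text_idx += 1
            yield {"modality": "speech", "sample": speech}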
The reported latency of approximately 75ms for Time-To-First-Acoustic (TTFA) on an unoptimized Python implementation running on a single 4090 GPU is particularly impressive. This low latency is critical for enabling truly natural, back-and-forth conversations, a hallmark of full-duplex communication.
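As a point of reference, TTFA can be measured by timing the gap between the end of the user's turn and the first emitted audio frame, as in the rough sketch below; generate_audio_frames is a hypothetical streaming generator, not MichiAI's actual API.

    import time

    def measure_ttfa_ms(generate_audio_frames, user_turn):
        """Milliseconds from the end of the user's turn to the first acoustic frame."""
        start = time.perf_counter()
        next(generate_audio_frames(user_turn))  # block until the first frame arrives
        return (time.perf_counter() - start) * 1000.0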
Broader Trends in Speech AI
MichiAI's development aligns with a broader trend in the AI research community towards creating more efficient and capable speech processing models. Recent research, such as that presented in arXiv paper 2601.22779 ('Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization'), also highlights efforts to optimize decoder-only LLMs for streaming speech recognition, focusing on reducing latency. Similarly, arXiv paper 2509.02521v3 ('FLM-Audio: Natural Monologues Improve Native Full-Duplex Chatbots via Dual Training') explores methods to enhance full-duplex chatbots using natural monologues and dual training paradigms.
Furthermore, the open-source release of full-duplex voice agents, like PersonaPlex-7B, as reported by COEY on January 31, 2026, indicates a growing momentum in making these advanced conversational AIs accessible to a wider audience. MichiAI's efficient design and impressive performance suggest it could be a significant contributor to this burgeoning field.
Future Implications
The success of MichiAI has far-reaching implications for the development of more intuitive and interactive AI assistants, real-time translation services, and enhanced human-computer interfaces. By achieving high performance with relatively low computational requirements, MichiAI could pave the way for wider adoption of sophisticated voice AI in various applications, from consumer electronics to enterprise solutions.
The project's open-source nature, hinted at by the accompanying GitHub link, will likely foster further community innovation and development in the realm of full-duplex speech LLMs.
Key Features of MichiAI:
- 530 million parameters
- Full-duplex speech capability
- ~75ms latency (TTFA) on a single 4090
- Efficient architecture using Rectified Flow Matching
- Multimodal encoder for audio and text integration
- Leverages pre-trained LLM backbone (SmolLM 360M)
- Low training compute requirements
- Maintains language model coherence
- Trained on 5,000 hours of audio


