KaniTTS2: Open-Source 400M TTS Model Enables Voice Cloning on Low-End GPUs
A new open-source text-to-speech model, KaniTTS2, has been released with voice cloning capabilities and can run on just 3GB of VRAM, making advanced TTS accessible to developers and researchers worldwide. The full pretraining framework is now available, empowering users to train models for underrepresented languages.

KaniTTS2: Democratizing High-Quality Text-to-Speech with Open-Source Innovation
A significant advance in AI audio synthesis has emerged from the open-source community: KaniTTS2, a 400-million-parameter text-to-speech (TTS) model capable of real-time voice cloning with minimal hardware requirements. Released under the Apache 2.0 license by the team behind nineninesix-ai, the model is publicly available on Hugging Face and GitHub, signaling a major leap in accessibility for speech technology.
Unlike proprietary TTS systems that demand high-end GPUs and extensive computational resources, KaniTTS2 runs on as little as 3GB of GPU memory and achieves a real-time factor (RTF) of approximately 0.2 on an RTX 5090. In practice, one second of audio is synthesized in roughly 0.2 seconds of compute, about five times faster than playback, which makes the model practical for live applications such as virtual assistants, customer service bots, and interactive educational tools. With a 22kHz sample rate and support for multilingual output, including English and Spanish plus region-specific accents, the model is engineered for nuanced, human-like speech synthesis.
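To make the RTF figure concrete, the following minimal Python sketch measures RTF as generation time divided by audio duration. The `fake_synthesize` function is a stand-in, not the KaniTTS2 inference API; swapping in the model's actual synthesis call would yield the real figure for a given GPU.

```python
import time

import numpy as np

SAMPLE_RATE = 22_050  # KaniTTS2 outputs 22 kHz audio


def real_time_factor(synthesize, text: str) -> float:
    """Measure RTF = generation time / duration of the generated audio."""
    start = time.perf_counter()
    waveform = synthesize(text)               # expected: 1-D array of samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / SAMPLE_RATE
    return elapsed / audio_seconds


# Placeholder for the real model call; not the KaniTTS2 API.
def fake_synthesize(text: str) -> np.ndarray:
    time.sleep(0.2)                           # pretend generation takes 0.2 s
    return np.zeros(SAMPLE_RATE, dtype=np.float32)  # 1 s of silence


print(f"RTF = {real_time_factor(fake_synthesize, 'Hello, world'):.2f}")
# An RTF of 0.2 means one second of audio is produced in about 0.2 s of compute.
```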
Perhaps the most transformative aspect of KaniTTS2 is that the complete pretraining codebase has been released alongside the pretrained models. This allows researchers, linguists, and developers to train custom TTS models from scratch on their own datasets, so communities speaking low-resource or endangered languages can build localized speech systems without relying on corporate AI platforms (a sketch of how such a dataset might be assembled appears below). The team trained the base model on approximately 10,000 hours of speech data, completing the run in about six hours on eight NVIDIA H100 GPUs, a testament to the efficiency of the architecture.
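As an illustration of what such a custom dataset might look like, here is a small, hypothetical Python helper that pairs audio clips with transcripts in a JSONL manifest. The layout, directory names, and `language` field are assumptions for illustration only; the input format actually expected by the KaniTTS2 training code is documented in its repository.

```python
import json
from pathlib import Path


def build_manifest(audio_dir: str, transcript_dir: str, out_path: str) -> int:
    """Pair each WAV file with its matching transcript and write a JSONL manifest."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for wav in sorted(Path(audio_dir).glob("*.wav")):
            txt = Path(transcript_dir) / f"{wav.stem}.txt"
            if not txt.exists():
                continue  # skip clips that have no transcript
            record = {
                "audio": str(wav),
                "text": txt.read_text(encoding="utf-8").strip(),
                "language": "gn",  # e.g. Guarani; any ISO 639 code
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    n = build_manifest("corpus/wav", "corpus/txt", "train_manifest.jsonl")
    print(f"wrote {n} utterances")
```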
The release includes two primary models: a multilingual pretrained variant (kani-tts-2-pt) and a specialized English model (kani-tts-2-en), both hosted on Hugging Face. Additionally, interactive demo spaces allow users to test voice cloning capabilities directly in their browsers, requiring no local installation. The accompanying Discord server has already attracted hundreds of developers, linguists, and accessibility advocates eager to experiment with the technology.
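For developers who prefer to pull the weights locally rather than use the demo spaces, the `huggingface_hub` library can fetch a full repository snapshot. The repository IDs below are inferred from the model names in the announcement and should be checked against the nineninesix-ai profile on Hugging Face before use.

```python
from huggingface_hub import snapshot_download

# Download each model repository to the local Hugging Face cache.
# The repo IDs are assumptions based on the announced model names.
for repo_id in ("nineninesix/kani-tts-2-pt", "nineninesix/kani-tts-2-en"):
    local_path = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_path}")
```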
Industry experts are applauding the move as a significant step toward equitable AI. "This isn’t just another TTS model—it’s a toolkit for linguistic sovereignty," said Dr. Elena Ruiz, a computational linguist at Stanford University. "For decades, speech synthesis has favored dominant languages. KaniTTS2 gives marginalized communities the tools to reclaim their voices in digital spaces."
Applications are already emerging. A nonprofit in rural Colombia is exploring KaniTTS2 to create audio learning materials in indigenous languages. A developer in Japan is adapting the model to replicate regional dialects of Japanese for elderly users with hearing impairments. Meanwhile, independent creators are using the model to produce audiobooks and podcasts with personalized narration.
While the model’s performance is impressive, ethical considerations remain paramount. The team has included clear usage guidelines in their documentation, urging users to obtain consent before cloning voices and to avoid malicious applications. Still, the open nature of the release raises concerns about potential misuse, such as deepfake audio scams. The developers have partnered with AI ethics researchers to monitor emerging risks and plan to release an optional voice authentication module in future updates.
KaniTTS2’s release marks a turning point in the democratization of AI audio. By combining state-of-the-art performance with unprecedented accessibility and transparency, it empowers a global community to innovate beyond the constraints of proprietary systems. As more languages and accents are added in upcoming updates, KaniTTS2 could become the foundational layer for a new era of inclusive, decentralized speech technology.


