Two undergraduate students, neither with a deep background in artificial intelligence, have built a new AI speech model that aims to rival offerings from some of the industry’s biggest names, including Google. Their goal: to democratize access to high-quality voice synthesis.
The rise of synthetic speech technology has sparked major interest among investors: startups focused on voice AI reportedly raised over $398 million in venture capital last year. ElevenLabs has established itself as a major force in the space, but competition is intensifying. Among the new contenders is Nari Labs, founded by Toby Kim and a co-founder who began studying speech AI only three months ago. Inspired by Google’s NotebookLM, they set out to build a model that offers users extensive control over generated audio.
The students used Google’s TPU Research Cloud program, which gave them free access to the company’s TPU AI chips for model training. The resulting model, called “Dia,” weighs in at 1.6 billion parameters and generates dialogue from user-supplied scripts, allowing variation in tone and speech patterns and even nonverbal cues such as laughter and coughing.
Parameters are the internal variables a model uses to make predictions; generally, models with more parameters perform better. Dia is publicly available through Hugging Face and GitHub and can run on most modern PCs with at least 10GB of VRAM. Users can start from a random voice, describe the voice they want for more tailored output, or clone a specific individual’s voice.
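For readers who want to experiment, below is a minimal sketch of running Dia locally. It follows the usage pattern in Nari Labs’ GitHub README; the package layout, the Dia.from_pretrained call, and the [S1]/[S2] speaker-tag format with parenthesized nonverbal cues are drawn from their published examples, but exact names and signatures may change between releases, so treat this as indicative rather than definitive.

```python
# Minimal sketch of running Dia locally, based on the usage example in
# Nari Labs' GitHub repository (nari-labs/dia). The package layout and
# method names are assumptions; check the repo for the current API.
import soundfile as sf

from dia.model import Dia

# Load the 1.6B-parameter checkpoint from Hugging Face
# (requires roughly 10GB of VRAM).
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Dia scripts alternate [S1]/[S2] speaker tags; parenthesized cues such
# as (laughs) or (coughs) produce nonverbal sounds in the output.
script = (
    "[S1] Dia generates dialogue straight from a script. "
    "[S2] Complete with laughter? (laughs) "
    "[S1] And coughs, apparently. (coughs)"
)

audio = model.generate(script)       # waveform as a NumPy array
sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1kHz audio
```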
In initial testing through its web demo, Dia performed well, producing convincing dialogue on virtually any topic, with quality on par with, if not better than, existing voice generation tools. But caution is warranted: Dia has few concrete safeguards against misuse. Creating misleading or deceptive recordings would be easy, and while Nari Labs discourages unethical use of its technology, the company disclaims responsibility for any abuse.
There are also unanswered questions about Dia’s training data. It remains unclear whether copyrighted material was used to develop the model, a concern industry observers have raised in the context of fair use in AI training. That legal question hangs over the entire sector, as stakeholders continue to debate whether training on protected content is lawful.
Looking ahead, Nari Labs’ plans extend beyond Dia: the company intends to add social features to its synthetic voice platform and to support additional languages in future model iterations. A technical report detailing how Dia works is also forthcoming.