Try a new open-source developer app for AI translation, dubbing, and lip-syncing

Text-to-speech, voice cloning, and visual dubbing are some of the hottest trends in AI at the moment. Used in tandem with AI transcription and translation, they make it possible to generate hyper-realistic voiceovers that closely match the speaker’s natural voice and speech patterns, including in entirely new languages.

Our partners at Sync Labs have just published an open-source repo for building an app that translates a video into another language with matching lip movements. Its backbone uses the Gladia API for speech-to-text and translation, ElevenLabs for text-to-speech and voice cloning, and Sync Labs for visual dubbing.

Following a quick intro to all of the tech elements of this fantastic project, we’ll explain how you can test them first-hand using this app, which will be available for public access in a week.

Speech-to-text and translation

Speech-to-text or automatic speech recognition (ASR) converts spoken words into text. The process involves preprocessing audio data to enhance quality, employing advanced speech recognition algorithms to correctly identify words, and integrating language modeling to predict word sequences. Post-processing may be applied to refine the transcribed text, resulting in an accurate written representation of the spoken content. For a more detailed breakdown of how it works, feel free to visit our introduction to speech-to-text.
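To make the post-processing step more concrete, here is a minimal sketch in Python. It assumes a raw, lowercase transcript and applies two illustrative refinements: dropping filler words and capitalizing sentence starts. The `postprocess` function and the `FILLERS` list are hypothetical names chosen for this example and are not part of the Gladia API; production systems typically rely on learned models for punctuation and formatting.

```python
import re

# Illustrative filler list (an assumption for this sketch, not exhaustive).
FILLERS = {"um", "uh", "erm"}


def postprocess(raw: str) -> str:
    """Remove filler words, collapse whitespace, and capitalize sentence starts."""
    # split() with no arguments also collapses repeated whitespace.
    words = [w for w in raw.split() if w.lower().strip(",.") not in FILLERS]
    text = " ".join(words)
    # Uppercase the first letter of the text and of each sentence after ., ! or ?
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


cleaned = postprocess("um hello  everyone. uh today we talk about   dubbing.")
print(cleaned)
```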

AI translation, also known as machine translation, employs AI/ML to automatically translate text or speech from one language to another. The process includes tokenizing the input, using natural language processing to understand context and grammar, and employing machine learning models (often neural networks, such as the multilingual Whisper ASR) to predict accurate translations.

At Gladia, we rely on a hybrid ASR architecture, powered by optimized Whisper and other state-of-the-art models, supporting 99 languages for transcription and translation. Integrated into Sync’s app, our API allows us to transcribe what’s being said and translate it in near real-time, with the resulting transcript fed into the rest of the structure.

Text-to-speech and voice cloning

Text-to-speech technology does the opposite of speech-to-text: it converts written text into spoken language. The system first analyzes the input text using natural language processing to understand its structure and semantics. Prosody modeling is then applied to add elements like intonation and rhythm, which make the synthesized speech sound natural and expressive. Finally, the synthesis engine generates speech from the analyzed text and the prosody model, producing a synthesized voice that faithfully represents the spoken version of the input text.

Voice cloning comes into play to make the output as close to the human voice as possible. To yield realistic results, the process starts with collecting a substantial dataset of the target voice. Relevant features like pitch and tone are extracted, and machine learning models, often deep neural networks, are trained to mimic the unique characteristics of the voice across a wide emotional spectrum, e.g. confident speech, happy exclamations, angry rants, and so on.
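As a rough illustration of what "extracting pitch" can mean, the sketch below estimates the fundamental frequency of a signal with a simple autocorrelation search, using only the Python standard library. This is a textbook technique shown on a synthetic tone; it is an assumption for illustration and not a description of how ElevenLabs or any specific voice-cloning system extracts features. The function name `estimate_pitch_hz` and its parameters are hypothetical.

```python
import math


def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency by finding the lag with the
    highest autocorrelation inside the plausible speech pitch range."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic 200 Hz tone standing in for a short voiced speech frame.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(2048)]
pitch = estimate_pitch_hz(tone, sample_rate)
print(f"Estimated pitch: {pitch:.1f} Hz")
```

Real speech is far less regular than a pure tone, so practical systems combine many such features, or learn them directly from audio, rather than relying on a single estimate.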

ElevenLabs is one of the leading tools for text-to-speech and voice cloning. Its proprietary deep-learning technology lets users choose from a library of high-fidelity male and female voices (or create new ones from scratch!), enabling seamless creation of custom videos, ebooks, and more in 29 languages.

Visual dubbing

Visual dubbing, or lip reanimation, is an AI technology that synchronizes translated or transcribed audio with realistic lip movements in video content.

By analyzing and replicating the original speaker's lip gestures, the system generates animated lip movements that align with the new audio. While the technology raises obvious concerns about deepfakes, it’s also a powerful tool for breaking down language barriers in video content, providing a high-fidelity alternative to traditional dubbing.

On a mission to break the language barriers in video content and reinvent dubbing, Sync Labs enables developers to seamlessly lip-sync a video to audio in near real-time using a single API.

How the translation app works

We invite you to dive into the X thread below for a detailed video tutorial and instructions. There's also this Medium tutorial available, and of course the link to the original repo to clone and launch the app yourself.
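At a high level, the three stages described in this article chain together as transcription and translation, then speech synthesis, then lip-syncing. The Python sketch below shows only that data flow. Everything in it is hypothetical: the `DubbingPipeline` class, its fields, and the `fake_*` stand-in functions are invented for illustration and do not reflect the actual code in the Sync Labs repo or the real Gladia, ElevenLabs, or Sync Labs client APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingPipeline:
    """Chains the three stages; each is injected as a plain callable so a
    real API client could be swapped in for the stand-ins below."""

    # (video_path, target_lang) -> translated transcript
    transcribe_and_translate: Callable[[str, str], str]
    # translated transcript -> synthesized audio in the cloned voice
    synthesize_speech: Callable[[str], bytes]
    # (video_path, new_audio) -> path of the lip-synced video
    lip_sync: Callable[[str, bytes], str]

    def run(self, video_path: str, target_lang: str) -> str:
        text = self.transcribe_and_translate(video_path, target_lang)
        audio = self.synthesize_speech(text)
        return self.lip_sync(video_path, audio)


# Stand-in stages (assumptions): they only mimic the shape of each step.
def fake_stt(video_path: str, target_lang: str) -> str:
    return f"[{target_lang}] transcript of {video_path}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_lip_sync(video_path: str, audio: bytes) -> str:
    return f"{video_path}.dubbed.mp4"


pipeline = DubbingPipeline(fake_stt, fake_tts, fake_lip_sync)
result = pipeline.run("talk.mp4", "fr")
print(result)
```

Keeping each stage behind a simple callable is one possible design choice: it makes it easy to test the flow without network access and to replace a provider without touching the other stages.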

Conclusion

Thanks to this amazing open-source project, we can see just how powerful speech-to-text, text-to-speech, voice cloning, and lip-synching technologies can be when used together. We hope you enjoy this incredible free tool by Sync Labs, powered by Gladia. If you’re building a voice app using our API and would like us to spread the word, do not hesitate to reach out here.
