Mastering AI transcription for social media captions: Mojo's success story with Gladia

Published on Dec 17, 2023

From Reels to ads to YouTube Shorts, vertical, bite-size video on social media is becoming one of the primary ways we interact with the world, for both leisure and business.

It has been estimated that in 2023 people worldwide watched, on average, 17 hours of online video per week (more than two hours a day), and there’s no reason to expect the trend to subside.

In this context, high-quality automatic captions are becoming a must-have feature for video marketing, given that most consumers expect to be able to watch video on the go without sound. Captions are also essential for making social applications accessible to all types of users.

Mojo, which specializes in social media content editing, is among our clients relying on speech-to-text to generate high-quality automatic captions and other voice-based editing features.

About Mojo

Mojo is a social video and content app, designed to make high-quality social media content creation easy. With its one-stop-shop app, Mojo users can access hundreds of templates, text effects, and high-quality animations to create stunning social content in a matter of minutes.

Founded in France in 2018, the company primarily targets small businesses looking for easy-to-use tools to showcase and promote their brand across social media channels with Reels, Stories, TikToks, posts, and more. In the last four years, the app has gained over 46M users, including several hundred thousand paying subscribers.

Challenge

Mojo’s team knew that delivering on its mission meant making video content editing easy, intuitive, and feature-rich. To stand out in a globalized, ultra-competitive market, the company focused on building a library of hundreds of animated templates and delivering advanced editing tools and effects, such as video trimming and background removal.

Audio transcription in particular was integrated into Mojo’s app in response to growing user demand for auto-captions, in line with the company’s ongoing shift from template-based to tool-based app usage.

Having tried several alternative speech-to-text APIs, Mojo’s team turned to Gladia in search of a better solution.

The main issue with the incumbent provider was the quality of its word timings, also known as word-level timestamps, which are critical for timing visual effects correctly.

In addition, the previous provider’s language support was not always reliable, to the detriment of client satisfaction and retention.

Objectives

To deploy a high-quality, scalable transcription and audio intelligence API to power the Mojo app, with the following specifications:

Accurate and fast batch transcription for auto-captions at a scalable cost.
Broad language support (especially for European languages) to serve the app’s global client base, including automatic detection of the spoken language as well as dialects and accents.
High-precision word-level timestamps: accurate start and end times for each word are an essential prerequisite for editing video and captions.
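To illustrate why per-word start and end times matter for captions, here is a minimal sketch of how word-level timestamps could be grouped into short on-screen caption cues. It is not Mojo's or Gladia's implementation: the `Word` and `Cue` types, the `build_cues` function, and the `max_words` and `max_gap` thresholds are all hypothetical names and values chosen for the example, and the field names are not taken from any API response schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its timing (hypothetical shape)."""
    text: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video


@dataclass
class Cue:
    """One caption shown on screen between start and end."""
    text: str
    start: float
    end: float


def _to_cue(words):
    # A cue spans from the first word's start to the last word's end.
    return Cue(" ".join(w.text for w in words), words[0].start, words[-1].end)


def build_cues(words, max_words=4, max_gap=0.6):
    """Group consecutive words into caption cues.

    A new cue starts when the current cue already holds max_words words,
    or when the silence before the next word is longer than max_gap seconds.
    Both thresholds are illustrative assumptions.
    """
    cues = []
    current = []
    for word in words:
        is_full = len(current) >= max_words
        long_gap = bool(current) and word.start - current[-1].end > max_gap
        if current and (is_full or long_gap):
            cues.append(_to_cue(current))
            current = []
        current.append(word)
    if current:
        cues.append(_to_cue(current))
    return cues


if __name__ == "__main__":
    sample = [
        Word("Make", 0.0, 0.3),
        Word("social", 0.35, 0.7),
        Word("content", 0.75, 1.2),
        Word("easy", 2.0, 2.4),
    ]
    for cue in build_cues(sample, max_words=3):
        print(f"{cue.start:.2f} -> {cue.end:.2f}  {cue.text}")
```

Because each cue inherits its boundaries directly from the first and last word it contains, any error in the word timings shows up as captions that appear early or linger too long, which is why timestamp precision was treated as a hard requirement.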

Solution

Enter Gladia. With Gladia, the Mojo team was able to enhance in-app user experience with the following features:

Auto-captions, supported in multiple languages, enhanced with a style selection tool.
Remove pauses, where time-stamped transcripts are used to automatically detect and remove silences from a video.

*Auto captions as seen inside the Mojo app*
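The "Remove pauses" idea described above can be sketched in a few lines: given the start and end time of every word, any gap between consecutive words that exceeds a threshold is treated as a silence to cut, and everything else is kept. This is a simplified illustration under stated assumptions, not the logic used inside the Mojo app; `keep_segments`, `min_pause`, and `padding` are hypothetical names, and the default values are arbitrary.

```python
def keep_segments(word_times, duration, min_pause=0.5, padding=0.1):
    """Return the (start, end) ranges of a video to keep, in seconds.

    word_times: list of (start, end) tuples, one per word, sorted by start.
    duration:   total length of the video in seconds.
    min_pause:  gaps between words at least this long are cut (assumed value).
    padding:    margin kept around each speech segment so words are not
                clipped; keep it at most min_pause / 2 so that neighbouring
                segments cannot overlap.
    """
    speech = []  # merged [start, end] runs of continuous speech
    for start, end in word_times:
        if speech and start - speech[-1][1] < min_pause:
            # Short gap: treat it as part of the same run of speech.
            speech[-1][1] = max(speech[-1][1], end)
        else:
            # First word, or a pause long enough to cut: start a new run.
            speech.append([start, end])
    # Add the safety margin and clamp to the bounds of the video.
    return [(max(0.0, s - padding), min(duration, e + padding)) for s, e in speech]


if __name__ == "__main__":
    times = [(1.0, 1.5), (1.6, 2.0), (4.0, 4.5)]
    print(keep_segments(times, duration=5.0, min_pause=0.5, padding=0.25))
    # [(0.75, 2.25), (3.75, 4.75)]
```

The returned ranges could then be handed to a video editing step that concatenates only those segments. The quality of the cut depends entirely on the word timings: a timestamp that ends too early would clip the last syllable of a word.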

To enable these features, Gladia relies on an advanced hybrid model architecture with generative AI components. Not only does our API detect words accurately based on acoustic properties of speech, but it also fills the gaps in the transcript based on a contextualized understanding of language, while deploying a range of techniques to eliminate hallucinations and accurately detect a variety of accents even in complex environments.

Impact

By working with the Gladia team to iterate and scale up to thousands of hours transcribed monthly, Mojo saw a noticeable impact on usage of the auto-captions feature, with users expressing their satisfaction with its new, spot-on quality.

The Mojo team is continuing to explore the possibilities that transcription brings to the Mojo products and services, and is now considering how they will leverage it in the future with upcoming features like keyword detection.

We're thrilled to be part of this amazing journey with them, and thank Mojo for putting their trust in us! As we move forward, we're excited to team up with more clients, tackle new challenges, and make speech AI more accessible to media companies worldwide.

About Gladia

Gladia provides a speech-to-text and audio intelligence API for building virtual meeting and note-taking apps, call center platforms, and media products. It delivers transcription, translation, and insights powered by best-in-class ASR, LLMs, and GenAI models.

Having read this case study, do you feel like Gladia could be the right fit for your business too?

Don't hesitate to contact our sales team to explore this in more detail, and follow us on X and LinkedIn.
