What is OpenAI Whisper?

Published on June 19, 2026
By Ani Ghazaryan

In September 2022, OpenAI quietly dropped something that changed the entire speech recognition industry: a model weight file on GitHub, free to download, free to run, free to modify. Within weeks, developers were running state-of-the-art transcription on their laptops. Within months, every speech-to-text vendor in the world was benchmarking against it.

That model was Whisper. And nearly four years later, it's still the reference point that every ASR conversation starts from, including this one.

But Whisper is also widely misunderstood. It's treated as a production-ready solution when it was released as a research artifact. It's praised for accuracy benchmarks that don't reflect real-world audio. And it's self-hosted at scale by teams who later discover the infrastructure costs more than just buying an API. This article is a clear-eyed look at what Whisper actually is, how it works, where it genuinely excels, and where it runs into walls.

TL;DR

  • OpenAI Whisper is an open-source, encoder-decoder transformer model trained on 680,000 hours of multilingual audio. It is the baseline for all modern speech recognition benchmarks and remains the leading open-source ASR model. 
  • Whisper Large-v3 achieves ~2.7% WER on clean benchmark audio; real-world performance on meetings, calls, and noisy audio lands at 8–12% WER, where production APIs now outperform it on most audio types.
  • Its structural limitations are no native real-time streaming, no built-in speaker diarization, hallucination on silent or low-signal audio, and a 25MB file cap on the managed API. 
  • For batch transcription, research, and self-hosted data-sovereign pipelines, Whisper remains strong. For production voice products that need streaming, diarization, noise resilience, or audio intelligence, teams consistently outgrow it.

Understanding OpenAI Whisper

OpenAI Whisper is an automatic speech recognition (ASR) system developed by Alec Radford and colleagues at OpenAI. It converts spoken audio into written text and can identify the language being spoken without any prior configuration. It also supports translation, converting non-English speech directly into English text.

Released under an MIT license, Whisper is free to download, modify, and self-host. OpenAI also offers a managed API (whisper-1 endpoint) at $0.006 per minute as of June 2026, which powers the OpenAI platform's audio transcription features.

What made Whisper distinctive is its training approach rather than its architecture. Most ASR systems before it were trained on carefully curated, high-quality labeled datasets, which produced models that performed well in controlled conditions and degraded badly in the wild. Whisper took the opposite approach: 680,000 hours of audio scraped from the internet, covering wildly varied recording environments, accents, languages, noise conditions, and domains. The model was trained with weak supervision, meaning many of the transcripts paired with that audio were imperfect. However, the sheer scale and diversity made the resulting model remarkably robust.

The speech recognition market OpenAI Whisper helped catalyze reached $8.49 billion in 2024 and is projected to reach $23.11 billion by 2030 at a 19.1% CAGR.

How Whisper works

Whisper is an encoder-decoder transformer, the same class of architecture underlying modern large language models. Understanding how it processes audio is useful because it directly explains both its strengths and its known failure modes.

Step 1: Audio chunking

Whisper splits incoming audio into 30-second segments and processes each independently. This 30-second window is a fundamental architectural constraint, not a tunable parameter. It's why Whisper cannot do true streaming. Each segment needs to be complete before the model can process it.
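To make the fixed window concrete, here is a minimal sketch of the chunking step. It assumes 16 kHz mono audio held as a plain list of floats; the constant names and the `split_into_windows` helper are illustrative, not part of any Whisper API. The reference implementation is more involved than this, so treat it as a simplified picture of the idea.

```python
SAMPLE_RATE = 16_000                          # Hz (assumed input rate)
CHUNK_SECONDS = 30                            # the fixed window described above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per window


def split_into_windows(samples):
    """Split a list of audio samples into fixed 30-second windows.

    The final window is zero-padded so every window has the same length.
    """
    windows = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        window = samples[start:start + CHUNK_SAMPLES]
        padding = CHUNK_SAMPLES - len(window)
        windows.append(window + [0.0] * padding)
    return windows


# 70 seconds of audio -> three windows, the last one mostly padding
audio = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(audio)
print(len(windows), len(windows[-1]))  # 3 480000
```

Note that the padded tail of the last window is pure silence, which is exactly the kind of input discussed later under hallucination.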

Step 2: Log-Mel spectrogram conversion

Each 30-second chunk is converted into a log-Mel spectrogram, which is a 2D representation of how audio frequencies change over time. Think of it as a visual "image" of sound represented as a matrix of numbers. Large-v3 increased the Mel frequency bins from 80 to 128, which contributed meaningfully to improved non-English language accuracy.
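A quick way to picture the matrix is to compute its shape. The sketch below assumes a 16 kHz sample rate and a hop of 160 samples (one frame every 10 ms), values taken from the reference implementation rather than from this article; `spectrogram_shape` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000   # Hz (assumed)
HOP_LENGTH = 160       # samples between frames, i.e. one frame per 10 ms (assumed)
CHUNK_SECONDS = 30


def spectrogram_shape(n_mels):
    """Return (mel bins, time frames) for one 30-second window."""
    n_frames = (SAMPLE_RATE * CHUNK_SECONDS) // HOP_LENGTH
    return (n_mels, n_frames)


print(spectrogram_shape(80))    # (80, 3000)  earlier models
print(spectrogram_shape(128))   # (128, 3000) Large-v3
```

Under these assumptions, the move from 80 to 128 bins adds frequency resolution without changing the number of time frames.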

Step 3: Encoder processing

The spectrogram is passed through stacked transformer encoder layers, which build a rich contextual representation of the audio: what was said, in what acoustic context, with what speaker characteristics.

Step 4: Decoder generation

The transformer decoder autoregressively generates the output transcript, token by token, conditioned on the encoder output and a set of special prompt tokens. Those tokens specify the task (transcription, translation, or language identification), which is how a single model handles all three without needing separate specialist models.
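The task-token mechanism can be sketched as a small prompt builder. The token strings follow the format described in the Whisper paper; the `build_decoder_prompt` function itself is hypothetical and only illustrates how the prefix changes per task.

```python
def build_decoder_prompt(language="en", task="transcribe", timestamps=True):
    """Build the special-token prefix the decoder is conditioned on.

    If language is None, only the start token is returned: the model then
    predicts the language token itself, which is language identification.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>"]
    if language is None:
        return tokens
    tokens += [f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


print(build_decoder_prompt())
print(build_decoder_prompt(language="fr", task="translate", timestamps=False))
```

Swapping one token in the prefix is all it takes to switch the same weights from transcription to translation.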

Step 5: Output

Unlike older ASR systems that output raw uppercase strings, Whisper was trained on unnormalized transcriptions. It produces punctuated, capitalized, human-readable text natively: no post-processing pipeline required. 

The Whisper model family

Whisper comes in five base sizes. The speed figures are relative to Large on equivalent hardware.

Whisper Model Comparison

| Model | Parameters | Relative speed | Hardware requirement |
|---|---|---|---|
| Tiny | 39M | ~32x | CPU, no GPU needed |
| Base | 74M | ~16x | CPU, no GPU needed |
| Small | 244M | ~6x | Light GPU or CPU |
| Medium | 769M | ~2x | Mid-range GPU |
| Large | 1,550M | 1x | ~10GB VRAM |
| Large-v2 | 1,550M | 1x | ~10GB VRAM |
| Large-v3 | 1,550M | 1x | ~10GB VRAM |
| Large-v3-turbo | 809M | ~8x | ~6GB VRAM |

For most self-hosted production workloads, Large-v3-turbo is the right answer. It's a pruned, fine-tuned version of Large-v3 that runs at roughly 8× the throughput with only a 0.3–0.7 percentage-point WER increase: roughly 3.0–3.4% WER on LibriSpeech clean versus the full model's 2.7%. Use Large-v3 only when accuracy is paramount and compute cost is secondary.

How Whisper evolved: version by version

Whisper (September 2022): The original release. Five model sizes, covering 96 languages. Established a new baseline for open-source ASR.

Large-v2 (December 2022): 2.5× more training iterations versus the original Large. Became the go-to accuracy reference for nearly a year. Still considered more stable than v3 for certain use cases where hallucination is a critical risk.

Large-v3 (November 2023, OpenAI Dev Day): Trained on 1 million hours of weakly labeled audio plus 4 million hours of audio pseudo-labeled by Large-v2. Increased Mel frequency bins from 80 to 128. Added Cantonese as a supported language. Delivered a 10–20% WER improvement over v2 on most languages, and introduced improved code-switching capability. However, it also intensified hallucination behavior in low-signal conditions relative to v2. 

Large-v3-turbo (October 2024): A pruned, fine-tuned version of Large-v3 with 809M parameters, obtained by cutting the decoder down to a handful of layers. ~8× faster at roughly equivalent accuracy. The practical production choice for self-hosted deployments.

GPT-4o-transcribe (March 2025): Not Whisper. OpenAI's API-only transcription model built on GPT-4o architecture. Reportedly achieves WER as low as 2.46% under favorable conditions, making it more accurate than Whisper Large-v3 on clean English. Not open-source. Cannot be self-hosted.

Accuracy: understanding the numbers 

Whisper Large-v3 achieves ~2.7% WER on LibriSpeech test-clean — a benchmark consisting of clean, read English audiobook audio recorded in professional studio conditions. That number is technically accurate and practically misleading for most real-world use cases.

On audio that actually resembles production environments, the picture changes:

Whisper Large-v3 WER by Audio Condition

| Audio condition | Whisper Large-v3 WER |
|---|---|
| LibriSpeech test-clean (studio read speech) | ~2.7% |
| LibriSpeech test-other (varied conditions) | ~5.2% |
| Real-world English (meetings, interviews, podcasts) | 8–12% |
| Noisy / telephony audio | 12%+ |
| Low-resource languages, dialects | Varies widely |

In 2026, the clean-English WER gap between top providers has compressed to near-statistical-noise: 2–3% across the best performers on LibriSpeech. The real differentiation between models now happens on Earnings22-style conversational audio, code-switching, low-resource languages, and noisy environments. That's where architectural and training decisions actually surface.
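Since every number in this section is a word error rate, it helps to see how one is computed: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. The sketch below is a minimal version that splits on whitespace and applies no text normalization; published benchmarks usually normalize casing and punctuation first, so their figures are not directly comparable to its output.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("your" -> "the") and one deletion ("number") over 5 words
print(word_error_rate("please confirm your account number",
                      "please confirm the account"))  # 0.4
```

A 2.7% WER therefore means roughly one wrong word in 37, while 12% means roughly one in eight.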

Where Whisper outperforms older systems:

Whisper's broad, noisy training data gives it meaningful advantages over ASR models trained exclusively on clean read speech. It handles accented speech better, manages domain-specific vocabulary better (because it was exposed to such wide coverage), and produces human-readable output natively without a post-processing layer.

Where it underperforms in production:

Conversational audio is where the gaps open: multi-speaker calls, code-switching, heavily accented voices in noisy environments. These require more than throwing a large model at clean benchmarks. They require specific architectural and training choices that vanilla Whisper simply wasn't built to make.

Whisper's real production limitations

This is the section most "What is Whisper" articles skip. If you're evaluating Whisper for any kind of real product, these matter.

Hallucination on low-signal audio

Whisper generates plausible-sounding text even when there is nothing to transcribe. Silence, low-energy background noise, audio beginning or ending abruptly — in these conditions, Whisper doesn't return empty output. It returns text that sounds like it could have been said but wasn't.

This is not an edge case. It is a documented artifact of the training data. The 680,000-hour dataset heavily includes YouTube auto-captions, which frequently contain filler phrases — "Thank you for watching," "Subscribe to the channel," "Stay tuned." Whisper learned to produce similar output when the audio signal is ambiguous or absent.

A peer-reviewed study presented at ACM FAccT documented this behavior formally. Reported hallucination rates range from 1% to 80% of segments depending on conditions — worst case being long silences, recordings that start or end in silence, and background noise that resembles speech. 

Research published in 2025 traced over 75% of non-speech hallucinations to three specific decoder self-attention heads. Targeted fine-tuning of those heads (the "Calm-Whisper" approach) reduced hallucination rates by approximately 84.5%, from 99.97% to 15.51% on UrbanSound8K, with less than 0.1% WER degradation. But this requires custom fine-tuning. It's not a drop-in fix, and vanilla Whisper retains the full hallucination behavior.

For a contact center processing thousands of calls per day, this translates directly into fabricated transcripts, downstream errors in sentiment analysis and entity extraction, and corrupted compliance logs.
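One common workaround is to avoid sending silent audio to the model at all. The sketch below is a crude energy gate, standing in for a real voice activity detector; the `should_transcribe` helper and the 0.01 threshold are hypothetical and would need tuning on real audio. It is not how Whisper or any vendor mitigates hallucination internally.

```python
import math


def rms(samples):
    """Root-mean-square energy of a list of samples in the range [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_transcribe(samples, threshold=0.01):
    """Skip segments whose energy is too low to plausibly contain speech."""
    return rms(samples) >= threshold


sample_rate = 16_000
silence = [0.0] * sample_rate
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
        for n in range(sample_rate)]

print(should_transcribe(silence), should_transcribe(tone))  # False True
```

An energy gate catches true silence but not background noise that resembles speech, which is why production pipelines tend to rely on trained VAD models instead.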

No native real-time streaming

Whisper is a batch model. Its 30-second chunk architecture introduces irreducible latency for conversational applications. The OpenAI managed API endpoint typically takes 1–3 seconds to return a result for a 10-second audio clip.

Community tools (WhisperLive, faster-whisper with VAD chunking) approximate near-real-time behavior, and Groq's LPU-hosted Whisper runs at dramatically higher throughput. But these are engineering workarounds, not solutions to the fundamental architecture. For voice agents or live captioning requiring sub-300ms latency, purpose-built streaming models are the right tool.

No built-in speaker diarization

Whisper does not identify who is speaking. For any multi-speaker context you need a separate diarization system (pyannote.audio, NeMo, Falcon) layered on top of Whisper's output. Community pipelines like WhisperX and whisper-diarization automate this combination, but each adds engineering complexity, an additional failure point, and extra latency.
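The glue between the two systems is usually an overlap match: each transcript segment gets the speaker whose diarization turn overlaps it the most. The sketch below uses hypothetical data shapes (dicts for segments, tuples for turns) and is only an illustration of the idea, not the actual logic inside WhisperX or whisper-diarization.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the best-overlapping speaker.

    segments: list of {"start", "end", "text"} dicts from the ASR model
    turns:    list of (start, end, speaker) tuples from the diarizer
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


segments = [
    {"start": 0.0, "end": 4.0, "text": "Hi, how can I help?"},
    {"start": 4.2, "end": 9.0, "text": "My card was declined."},
]
turns = [(0.0, 4.1, "agent"), (4.1, 9.5, "customer")]

for seg in assign_speakers(segments, turns):
    print(seg["speaker"], "-", seg["text"])
```

Segments that straddle a speaker change get a single label here, which is one of the failure points the paragraph above refers to.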

25MB file size cap on the API

The OpenAI managed API limits uploads to 25MB. For long recordings this requires client-side chunking, which adds engineering overhead and introduces potential transcription errors at chunk boundaries.
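A rough sketch of the client-side planning this implies: estimate how many seconds of audio fit under the cap at a given bitrate, then cut the recording into overlapping chunks so that words at a boundary appear in both neighbors. The byte interpretation of "25MB", the 5% safety margin, and the 2-second overlap are all assumptions for illustration.

```python
MAX_UPLOAD_BYTES = 25 * 1000 * 1000  # conservative reading of the 25MB cap (assumed)


def max_chunk_seconds(bitrate_kbps, safety=0.95):
    """Seconds of constant-bitrate audio that fit under the upload cap."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(MAX_UPLOAD_BYTES * safety / bytes_per_second)


def plan_chunks(total_seconds, chunk_seconds, overlap_seconds=2):
    """Return (start, end) pairs covering the recording with overlap."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    chunks = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap_seconds
    return chunks


limit = max_chunk_seconds(128)   # 128 kbps audio
print(limit)                     # 1484
print(plan_chunks(3600, limit))  # [(0, 1484), (1482, 2966), (2964, 3600)]
```

The overlapping regions still have to be de-duplicated when the transcripts are stitched back together, which is where most boundary errors creep in.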

No custom vocabulary on the managed API

The OpenAI API doesn't support custom vocabulary injection. For products with proprietary terms, product names, or industry-specific jargon, accuracy on those terms will be degraded. Self-hosted Whisper can be fine-tuned on labeled data, but that requires ML engineering resources most teams don't have standing by.

Self-hosting costs more than it looks

Whisper is free to download. Running it at production scale is not. A community infrastructure analysis found that self-hosting Whisper for a representative production workload cost approximately $163,680 per year in GPU infrastructure — excluding developer and admin overhead. A comparable managed API service cost $38,880 for the same workload.
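Put side by side, the two figures quoted above work out as follows. The only inputs are the two annual costs from the analysis; the variable names are illustrative, and engineering overhead is left out because the source figure excludes it.

```python
SELF_HOSTED_GPU_PER_YEAR = 163_680   # USD, GPU infrastructure only
MANAGED_API_PER_YEAR = 38_880        # USD, comparable managed API

difference = SELF_HOSTED_GPU_PER_YEAR - MANAGED_API_PER_YEAR
ratio = SELF_HOSTED_GPU_PER_YEAR / MANAGED_API_PER_YEAR

print(difference)                      # 124800
print(f"{ratio:.2f}x")                 # 4.21x
print(SELF_HOSTED_GPU_PER_YEAR // 12)  # 13640 per month
print(MANAGED_API_PER_YEAR // 12)      # 3240 per month
```

In other words, for that workload the self-hosted GPU bill alone is a little over four times the managed API cost, before any staffing is counted.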

Aircall moved off a self-hosted solution to Gladia and cut transcription time by 95%, freeing engineering capacity for product work rather than pipeline maintenance. The infrastructure savings from self-hosting Whisper often don't survive contact with the engineering sprint cost of keeping it running. Use Gladia's TCO Calculator to run the numbers for your specific workload.

The Whisper ecosystem: what the community built around it

Because Whisper is open-source and MIT-licensed, its limitations became a development surface. The community responded.

WhisperX adds precise word-level timestamps, speaker diarization via pyannote.audio, and intelligent audio chunking via voice activity detection (VAD) for better handling of long recordings. It's the most widely adopted extension for multi-speaker transcription use cases. 

faster-whisper is a CTranslate2-based reimplementation of Whisper that runs Large-v3 at approximately 4× the speed of the original implementation with meaningfully lower memory usage. The practical baseline for serious self-hosted deployments.

Whisper.cpp is a C/C++ port that runs on CPU and Apple Silicon without Python dependencies — relevant for edge, on-device, or resource-constrained environments.

WhisperLive is a near-real-time server wrapping faster-whisper with WebSocket support and optional diarization.

Groq-hosted Whisper runs Whisper Large-v3 on LPU hardware at approximately $0.02/hr — by far the cheapest hosted Whisper endpoint as of 2026, and significantly faster than standard cloud inference, though still not true streaming. 

Whisper vs. production STT APIs: where the gaps are in 2026

Whisper's release changed developer expectations for ASR. Every production API that followed either started from a Whisper checkpoint or was explicitly designed to address what Whisper couldn't do. Here's where the actual differences land:

Whisper vs STT Providers Comparison

| Capability | Whisper (self-hosted) | OpenAI Whisper API | Gladia Solaria-1 | Deepgram Nova-3 | AssemblyAI Universal-2 |
|---|---|---|---|---|---|
| Real-time streaming | No (workarounds only) | No | Yes (<103ms) | Yes (<300ms) | Yes |
| Speaker diarization | External tools required | Not included | Bundled | Yes | Yes |
| Hallucination mitigation | Fine-tune required | Not mitigated | Yes | Partial | 30% fewer vs. Whisper |
| Custom vocabulary | Fine-tune only | Not supported | Yes | Yes | Yes |
| File size limit | None (local) | 25MB | No limit | No limit | No limit |
| Language coverage | 99 | 99 | 100+ (42 exclusive) | 36+ | 99+ |
| Code-switching | Limited | Limited | Yes | No | Limited |
| Audio intelligence | None | None | Bundled | Add-on | Add-on |
| Pricing (approx.) | Infrastructure cost | $0.006/min | From $0.20/hr (async) | Usage-based | Usage-based |

On accuracy, Gladia's Solaria-1 delivers on average 29% lower WER and 3× lower diarization error rate (DER) than alternatives across 7 datasets and 74+ hours of conversational audio. 

The pattern is consistent. Whisper defined the accuracy floor; production APIs competed by solving the infrastructure and feature gaps it left open. Gladia's original architecture was built on an optimized, re-engineered Whisper core, which is why we have firsthand visibility into exactly where those gaps appear in real deployments, and what it takes to close them.

OpenAI Whisper vs. Gladia's Solaria models

Gladia's relationship with Whisper is longer than most. The original Gladia API was built on an optimized, re-engineered Whisper core. So when Gladia's engineers say Solaria was designed to close specific gaps, they're not speaking from theory. They spent years running Whisper at production scale, hitting its limits repeatedly, and eventually building a model family to solve what they couldn't patch.

Today, Gladia runs two Solaria models in parallel, each designed for a different job:

Solaria-1 is the breadth model. It covers 100+ languages including 42 not available on any other API-level competitor, handles code-switching natively, and delivers consistently on clean, formal, and multilingual audio. It benchmarks at an average 29% lower WER and 3× lower DER than alternatives on conversational speech across 7 datasets and 74+ hours of audio. It's the model you reach for when language coverage and multilingual consistency are the primary constraints.

Solaria-3, on the other hand, is the production audio model – #1 on real customer recordings in English, on business audio and conversational call center speech. It was purpose-built for the audio most enterprise voice workflows actually deal with: compressed telephony, noisy meetings, overlapping speakers. It trades some of Solaria-1's breadth for depth in core European languages and the results on real-world audio are significant. On Earnings22, the industry's benchmark for financial and business speech, Solaria-3 hits 6.4% WER – the only model under 7%, beating AssemblyAI, ElevenLabs Scribe v2, Speechmatics, and Deepgram Nova-3. Whisper Large-v3 lands around 11.3% on the same dataset, which is a gap that compounds severely on a 30-minute earnings call.
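To see how that gap compounds, here is a back-of-the-envelope estimate of word errors on a 30-minute call at the two Earnings22 WER figures quoted above. The speaking rate of 150 words per minute is an assumption for illustration, not a figure from the benchmark.

```python
WORDS_PER_MINUTE = 150  # assumed average speaking rate


def expected_errors(minutes, wer):
    """Expected number of word errors for a call of the given length."""
    return minutes * WORDS_PER_MINUTE * wer


whisper_errors = expected_errors(30, 0.113)   # Whisper Large-v3, ~11.3% WER
solaria_errors = expected_errors(30, 0.064)   # Solaria-3, 6.4% WER

print(f"Whisper Large-v3: ~{whisper_errors:.0f} word errors")
print(f"Solaria-3:        ~{solaria_errors:.0f} word errors")
print(f"Difference:       ~{whisper_errors - solaria_errors:.0f} words")
```

Under that assumption the call contains about 4,500 words, so the difference is on the order of a couple of hundred wrong words per call, any of which could be a ticker symbol or a revenue figure.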

What Whisper can't do that Solaria does natively

Accuracy on production audio is only part of the comparison. The structural gaps matter as much as the WER numbers:

Real-time streaming: Whisper has no native streaming endpoint. Solaria-1 delivers partial transcripts in under 103ms at best and 270ms on average, with final transcripts at approximately 698ms, which sits inside the sub-800ms budget that production voice AI typically targets. Whisper's batch architecture cannot match this without fundamental re-engineering.

Speaker diarization, bundled: Whisper requires external tools (pyannote.audio, NeMo, Falcon) to identify speakers. Gladia's Solaria-1 delivers 3× lower DER (Diarization Error Rate) than alternatives on conversational speech across a 74+ hour open benchmark, with diarization included in the base API call. 

Hallucination mitigation: Vanilla Whisper generates plausible text during silence and low-signal audio, a documented training artifact that causes real downstream damage in compliance and analytics pipelines. Gladia's pipeline includes validation and suppression steps that prevent fabricated output from reaching the transcript layer.

Code-switching: Whisper handles multilingual audio only in a limited sense: it processes one language at a time and doesn't gracefully handle speakers who shift between languages mid-sentence. Solaria-1 was built for code-switching natively, which matters for any global product where users don't stay in one language throughout a call.

Language coverage: Whisper supports 99 languages. Solaria-1 covers 100+, including 42 languages not available on any other API-level competitor. Solaria-3 currently focuses on 5 core European languages, with others in active rollout.

Audio intelligence, bundled: Whisper returns a transcript. Full stop. Gladia's API bundles diarization, sentiment analysis, entity extraction, translation, and summarization into the same call, at the same base rate. Other vendors, such as AssemblyAI and Deepgram, charge for these as add-ons; Gladia includes them by default.

How the two Solaria models are designed to work together

Gladia runs Solaria-1 and Solaria-3 in parallel, not as replacements for each other. Solaria-3 is the right model when:

  • Your audio is English or core European languages (EN, FR, DE, ES, IT)
  • You're processing contact center calls, sales recordings, or meeting transcripts
  • Noise resilience, telephony audio, and business vocabulary accuracy are the primary requirements
  • You need the highest possible accuracy on the audio that actually breaks other models

Solaria-1 is the right model when:

  • You need 100+ language coverage or support for languages outside the Solaria-3 launch set
  • Your audio is formal, clean read-speech, or institutional
  • Code-switching across many languages is a core requirement
  • You need full multilingual breadth for a global product serving diverse audio types
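The two lists above can be condensed into a small routing rule. The sketch below is purely illustrative: `pick_model` is a hypothetical helper, the returned strings are labels rather than confirmed API identifiers, and it only encodes the language and code-switching criteria, leaving audio type as a judgment call.

```python
# Core European languages listed for Solaria-3 above
SOLARIA_3_LANGUAGES = {"en", "fr", "de", "es", "it"}


def pick_model(languages, code_switching=False):
    """Choose a model label from the expected languages of the audio."""
    langs = {lang.lower() for lang in languages}
    if not langs:
        return "solaria-1"   # unknown language: fall back to the breadth model
    if code_switching:
        return "solaria-1"   # code-switching is listed as a Solaria-1 criterion
    if not langs <= SOLARIA_3_LANGUAGES:
        return "solaria-1"   # at least one language outside the Solaria-3 set
    return "solaria-3"


print(pick_model(["en"]))                             # solaria-3
print(pick_model(["en", "pt"]))                       # solaria-1
print(pick_model(["fr", "en"], code_switching=True))  # solaria-1
```

A real router would also weigh the audio source (telephony and meetings versus clean read speech), which the lists above treat as equally important.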

The bottom line

Whisper changed what the industry thought was possible with speech recognition. It proved that a single model, trained at scale on messy real-world audio, could outperform years of carefully engineered specialist systems, and it did it as open-source, MIT-licensed code that anyone could run on a laptop.

That contribution is permanent. But it's also not new, and the production requirements that teams bring to ASR in 2026 look very different from what Whisper was built to handle. Multi-speaker meeting intelligence, contact center compliance, real-time voice agents, global products switching languages mid-call — these are the workloads that pushed the industry past Whisper, not away from it.

The honest framing: Whisper is where the reference architecture began. For teams that have outgrown it, or need to skip the infrastructure overhead entirely, the question isn't whether to move on. It's which production stack closes the gaps without introducing new ones.

Gladia's Solaria models were built to answer that question, starting from the same Whisper foundation, and going further. Try Solaria models for free on your own audio. 

FAQs 

What is OpenAI Whisper?

OpenAI Whisper is an open-source automatic speech recognition model released by OpenAI in September 2022. It converts spoken audio into written text, supports 99 languages, can identify the language spoken without configuration, and translates non-English speech into English. It's trained on 680,000 hours of internet-sourced audio using weak supervision and is free to self-host under an MIT license.

Is OpenAI Whisper free?

The model weights are free and MIT-licensed: you can download and run Whisper at no licensing cost. The managed OpenAI API charges $0.006 per minute as of June 2026. Self-hosting eliminates per-minute costs but introduces GPU infrastructure, engineering, and maintenance costs that typically exceed API pricing at production scale. (Source: Gladia TCO analysis)

What's the most accurate Whisper model?

Whisper Large-v3 achieves the highest accuracy: approximately 2.7% WER on LibriSpeech test-clean. Large-v3-turbo (October 2024) is a pruned, fine-tuned version that runs at ~8× throughput with 0.3–0.7 percentage points higher WER, making it the better choice for most production workloads. GPT-4o-transcribe, OpenAI's March 2025 API-only successor, reportedly outperforms Large-v3 on clean English but is not open-source.

Does Whisper support real-time transcription?

No, not natively. Whisper processes audio in 30-second chunks and is fundamentally a batch model. The OpenAI API has no streaming endpoint on the whisper-1 path. Community tools like WhisperLive and faster-whisper with VAD chunking approximate near-real-time behavior, and Groq-hosted Whisper runs at higher throughput. For sub-300ms streaming required by voice agents or live captioning, purpose-built streaming models are the appropriate choice.

Why does Whisper hallucinate?

Whisper generates plausible-sounding text during silence or low-signal audio because its training data included YouTube auto-captions filled with filler phrases. The model learned to produce similar output when uncertain. A study at ACM FAccT 2024 documented hallucination rates from 1% to 80% of segments depending on conditions. Research published in May 2025 found that targeted fine-tuning of three decoder attention heads reduces hallucination rates by ~84.5%, but vanilla Whisper retains the full behavior without this intervention.

How does Whisper compare to Gladia's Solaria models?

On clean benchmarks, the gap is small. On real-world audio, it isn't. Solaria-3 hits 6.4% WER on Earnings22 versus Whisper Large-v3's ~11.3%, and unlike Whisper, it comes with native streaming, bundled diarization, and hallucination mitigation out of the box. No external tooling required.

When should I use Gladia instead of Whisper?

When your audio isn't clean and your pipeline can't afford to babysit it. Gladia is the right call for multi-speaker calls, noisy or telephony audio, real-time voice agents, and any workflow that needs diarization or audio intelligence without building a separate stack. If you need full on-premise control and have the ML engineering to support it, Whisper still makes sense. Otherwise, Gladia closes the gaps Whisper leaves open.

Can Whisper be fine-tuned?

Yes. Self-hosted Whisper can be fine-tuned on domain-specific audio datasets to improve accuracy on specialized vocabulary, accents, or languages. This requires labeled audio data, GPU infrastructure, and ML engineering. Fine-tuning is also the only way to add custom vocabulary to a self-hosted deployment. The managed API doesn't support vocabulary customization.

How much does it cost to self-host Whisper at scale?

More than most teams expect. A community infrastructure analysis found that a representative production Whisper workload costs approximately $163,680 per year in GPU infrastructure alone, compared to $38,880 for a comparable managed API. That gap is budget that could otherwise fund engineering work. Use Gladia's Whisper TCO Calculator to model your specific volume.
