
OpenAI Whisper vs Google STT vs Amazon Transcribe: the ASR rundown (2026 edition)

Published on Apr 15, 2026
By Ani Ghazaryan

Speech recognition has always been a crowded space. But in the last few years, the models have gotten faster, cheaper, and smarter. New architectures have entered the picture. And the baseline expectation for what "good enough" looks like has shifted dramatically.

If you're building a voice product in 2026 — an AI meeting assistant, a contact center solution, a voice agent, anything that turns spoken words into structured data — you're probably starting your evaluation with the three names that have dominated this category: OpenAI Whisper, Google Cloud Speech-to-Text, and Amazon Transcribe. They're the biggest providers in the space. The ones your infrastructure team probably already has an SSO login for.

But the landscape has moved. In this updated rundown, we walk through each provider as it actually stands today — covering accuracy, features, pricing, developer experience, and the real-world limitations that benchmark tables won't show you.

The state of ASR in 2026

Speech recognition as a category has split in two. On one side, basic transcription has become a commodity. WERs (word error rates) on clean, single-speaker English audio are now so low across the major providers that they've essentially stopped being a meaningful differentiator. If you're transcribing podcast episodes or pre-recorded interviews in a controlled environment, almost any modern API will do the job adequately. The race to the bottom on clean-audio WER is largely over.
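For reference, WER is the word-level edit distance between a reference transcript and the model's output, normalized by reference length: (substitutions + deletions + insertions) / reference words. A minimal sketch of the computation in Python (real evaluations also normalize casing and punctuation before scoring):

```python
# Minimal word error rate (WER) via Levenshtein distance over words.
# Sketch only: production evaluations normalize casing/punctuation first.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```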

On the other side, conversational audio remains genuinely hard, and the gap between providers here is wide. Multi-speaker calls, overlapping speech, accented voices in noisy environments, code-switching, speaker diarization – these problems haven't been solved by throwing more parameters at LibriSpeech benchmarks. They require different architectural choices, different training data, and different product decisions, and not every provider has made them.

This is the context that matters when evaluating OpenAI Whisper, Google STT, and Amazon Transcribe in 2026. Each has made meaningful moves. OpenAI shifted its transcription stack onto the GPT-4o architecture with gpt-4o-transcribe, launched a Realtime API now in general availability, and introduced a dedicated diarization endpoint. Google progressed through Chirp 2 to Chirp 3, adding native diarization, a built-in denoiser, and streaming — finally making its V2 API competitive for production use cases beyond batch transcription. Amazon has been the quietest of the three on model innovation, but has continued deepening its lead on domain-specific tooling: call analytics, medical transcription, and AWS-native compliance workflows.

The practical result is that the Big Three are more capable than they were, but capable in different ways, and still optimized around different assumptions about what audio looks like. Understanding those assumptions is the point of this guide. A few specialized providers — Gladia, Deepgram, AssemblyAI, ElevenLabs — have also made substantial moves and are increasingly competitive in specific use cases. Our best speech-to-text APIs guide for 2026 covers the full competitive landscape if you want a broader view.

Accuracy and speed: where things actually stand

The honest answer in 2026 is that WER numbers on public benchmarks are no longer a reliable buying signal on their own. The gap between the best and worst performers on LibriSpeech or FLEURS has compressed to the point where it rarely drives purchase decisions.

What drives decisions (or should) is performance on your audio: accented speech, multi-speaker conversations, noisy environments, code-switching, etc. These are the variables that separate providers in production.

That said, here's where the benchmarks roughly land:

| Provider | Model | Approx. WER (benchmark) | Notes |
|---|---|---|---|
| OpenAI | gpt-4o-transcribe | ~2.5% | Best-in-class on clean audio; strongest on underserved languages |
| OpenAI | Whisper Large V3 | ~15–16% | Open-source; still widely used for self-hosted deployments |
| Google | Chirp 2 / Chirp 3 | ~11.6% (Chirp 2) | Chirp 3 improves further; best language breadth |
| Amazon | Transcribe (standard) | ~14% | Stronger on domain-specific models (medical, call center) |

OpenAI (gpt-4o-transcribe) currently leads competitive accuracy tests, with one widely cited evaluation reporting WER as low as 2.46% under favorable conditions. The older Whisper Large V3 sits closer to 15–16% WER on challenging real-world audio — markedly worse than some older comparisons suggested. OpenAI has addressed Whisper's known weaknesses in non-English languages through the GPT-4o architecture; improvements are most dramatic in historically underserved languages where Whisper V3 struggled.

Google Chirp 2 / Chirp 3 has closed the gap considerably. Chirp 2 benchmarks at around 11.6% WER in comparable tests, a major improvement over the 16–20% figures that defined Google's legacy models. Chirp 3 improves further with better handling of noisy audio, thanks to the built-in denoiser. Google continues to lead the field in language breadth — 125+ languages with real coverage, not just listed support.

Amazon Transcribe lands around 14% WER in the same benchmark sets, competitive with the mid-tier but behind gpt-4o-transcribe on pure accuracy. Where Transcribe continues to punch above its weight is in specialized verticals: its medical transcription variant and call analytics modules perform significantly better than general-purpose benchmarks suggest, because they're trained on domain-specific data. For call center use cases specifically, the improvement is material.

The bottom line on accuracy: If pure transcription accuracy is your primary constraint, gpt-4o-transcribe is the current leader. But for most production use cases, the difference between providers on clean audio is marginal. The decisions that matter more are: does this provider handle speaker overlap? Does it do reliable diarization? What happens on accented speech that isn't well-represented in training data? Those are the tests you need to run with your own audio.

Beyond transcription: features that matter

The features that separate platforms for production use cases are the ones that come after the words appear on screen.

| Feature | OpenAI | Google Chirp 3 | Amazon Transcribe |
|---|---|---|---|
| Real-time streaming | ✅ Realtime API (GA Aug 2025) | ✅ Full support in Chirp 3 | ✅ Available |
| Speaker diarization | ✅ gpt-4o-transcribe-diarize (dedicated endpoint) | ✅ Native in Chirp 3 | ✅ Up to 10 speakers |
| PII redaction | ✅ | ✅ | ✅ Native |
| Custom vocabulary | ✅ Via prompting | ✅ Speech adaptation | ✅ Custom language models |
| Medical transcription | — | — | ✅ HIPAA-eligible, $0.075/min |
| Call analytics | — | — | ✅ Transcribe Call Analytics |
| Built-in audio denoiser | — | ✅ Chirp 3 | ✅ Noise reduction mode |
| Language support | 99 languages | 85+ languages (Chirp 3) | 100+ languages |
| Translation | ✅ English output | ✅ Chirp 2+ | — |
| Zero data retention | ✅ Optional | ✅ (processes in memory) | ✅ With KMS encryption |

Real-time streaming is now available across all three providers, but with meaningful differences in implementation quality. Google's Chirp 3 added streaming as a first-class feature. OpenAI's Realtime API supports voice-to-voice interaction beyond pure transcription. Amazon has offered streaming for some time. Latency and reliability vary — if you're building real-time meeting assistants or voice agents, our guide to real-time transcription architecture walks through the infrastructure decisions that actually matter, including how sub-300ms latency is achieved in production.

Speaker diarization — identifying who said what — remains one of the most technically demanding capabilities in conversational speech, and one of the most underestimated when evaluating ASR providers. Understanding how diarization error rate (DER) is measured and what benchmark results actually mean in real environments is essential before making a decision here. Google Chirp 3 includes diarization natively. OpenAI has introduced gpt-4o-transcribe-diarize as a dedicated endpoint, though it requires audio chunking for files longer than 30 seconds. Amazon Transcribe's 10-speaker limit can constrain meeting transcription use cases with larger groups.

PII redaction and compliance is table stakes for enterprise contact center and healthcare use cases. All three providers offer it. Amazon Transcribe includes content redaction natively and has HIPAA-eligible medical transcription (priced separately at $0.075/minute). Google and OpenAI both support data processing agreements under GDPR and CCPA frameworks.
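To illustrate what "native" redaction means on the Amazon side: it's a parameter on the transcription job itself rather than a separate post-processing step. A sketch with boto3, where the bucket, key, and job names are placeholders:

```python
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Redaction is configured on the job; PII is masked in the output
# transcript. Bucket/key/job names below are hypothetical placeholders.
transcribe.start_transcription_job(
    TranscriptionJobName="support-call-0142-redacted",
    Media={"MediaFileUri": "s3://example-bucket/calls/call-0142.wav"},
    LanguageCode="en-US",
    ContentRedaction={
        "RedactionType": "PII",
        "RedactionOutput": "redacted",  # or "redacted_and_unredacted"
    },
)
```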

Code-switching — handling audio where speakers switch languages mid-sentence — is where general-purpose models still struggle. This is particularly acute in contact center environments and global enterprise products where multilingual teams are the norm, not the exception.

Custom vocabulary and model adaptation is available across all three. Google's Chirp 3 supports speech adaptation for domain-specific vocabulary. Amazon allows custom language models for domain-specific use. OpenAI allows prompting to steer transcription toward specific terminology. For a practical walkthrough of how to configure accuracy tuning, code-switching, and language detection settings in a production integration, see our getting-started guide.
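As a concrete flavor of this, Amazon's custom vocabularies are standalone resources you create once and then reference per job. A hedged boto3 sketch, with the vocabulary name and phrases as illustrative placeholders:

```python
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Custom vocabularies are created once, then referenced by jobs.
# Name and phrases below are illustrative placeholders.
transcribe.create_vocabulary(
    VocabularyName="product-terms-v1",
    LanguageCode="en-US",
    Phrases=["Solaria", "Chirp", "gRPC", "Kubernetes"],
)

# In practice, poll get_vocabulary until VocabularyState is READY
# before starting jobs that reference it.
transcribe.start_transcription_job(
    TranscriptionJobName="demo-with-vocab",
    Media={"MediaFileUri": "s3://example-bucket/audio/demo.wav"},
    LanguageCode="en-US",
    Settings={"VocabularyName": "product-terms-v1"},
)
```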

Pricing

Pricing has remained relatively stable at the top level, but with more tiering at scale.

| Provider | Model | Price per minute | Notes |
|---|---|---|---|
| OpenAI | Whisper (hosted API) | $0.006 | Lowest commercial rate; 25MB file limit |
| OpenAI | gpt-4o-transcribe / mini | $0.006 / $0.003 | Updated since launch; mini recommended for most use cases |
| Google | Standard / Chirp (V2) | $0.016 | Chirp included at standard rate; +40% if data logging opt-out |
| Google | Enhanced models | $0.036 | V1 API only |
| Amazon | Tier 1 (standard) | $0.024 | Drops to $0.0078/min at 5M+ minutes/month |
| Amazon | Medical transcription | $0.075 | HIPAA-eligible; domain-specific models |

OpenAI remains the lowest commercial per-minute rate: $0.006/minute for the hosted Whisper API and gpt-4o-transcribe, and $0.003/minute for gpt-4o-mini-transcribe. If you're seriously weighing build vs. buy, our decision framework for open-source vs. API walks through the full TCO calculation.

Google Cloud Speech-to-Text charges $0.016/minute for standard models, with Chirp included at that rate in V2. Volume discounts are available at scale but require contacting Google Cloud sales for specifics. New accounts get $300 in free credits, and 60 minutes per month remain free regardless. Worth noting: GCP's supporting infrastructure costs — storage, egress, IAM, Functions — can meaningfully add to the effective rate for teams not already on Google Cloud.

Amazon Transcribe uses a tiered structure: $0.024/minute at Tier 1, dropping as low as $0.0078/minute at 5M+ minutes monthly. Medical transcription is $0.075/minute. If you're already on AWS, the integration with existing billing and infrastructure can offset the higher sticker price.
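To make those tiers concrete, here's back-of-the-envelope math at a hypothetical 100,000 minutes/month using the published rates above; real invoices add storage, egress, and per-feature surcharges on top:

```python
# Back-of-envelope monthly cost at published per-minute rates.
# Ignores storage/egress/feature surcharges, which raise effective rates.
minutes_per_month = 100_000  # hypothetical volume

rates = {
    "OpenAI gpt-4o-mini-transcribe": 0.003,
    "OpenAI Whisper / gpt-4o-transcribe": 0.006,
    "Google standard / Chirp (V2)": 0.016,
    "Amazon Transcribe Tier 1": 0.024,
    "Amazon Transcribe (5M+ min tier)": 0.0078,
}

for provider, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{provider:40s} ${rate * minutes_per_month:>10,.2f}/month")
```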

Developer experience

OpenAI's API remains the easiest onboarding in the category. Less than six lines of code for a basic transcription. Clear documentation. No upfront credit card requirement for exploration. The tradeoff: the API is less mature for enterprise workflows — no S3 integration, limited webhook support, and the 25MB file size limit on the hosted API creates friction for longer audio. For a detailed technical comparison of how the OpenAI transcription API performs against a purpose-built provider in real production scenarios, our Whisper API vs. Gladia breakdown is worth reading before you commit to an integration.
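For scale, that minimal path looks roughly like this with the OpenAI Python SDK (the file name is a placeholder; model per the hosted API discussed above):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The hosted API caps uploads at 25MB; longer audio must be chunked first.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```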

Google Cloud Speech-to-Text is powerful and well-documented, but the onboarding process remains genuinely complex for developers not already embedded in GCP. The V2 API with Chirp 3 is cleaner than legacy V1, but you're still navigating IAM permissions, GCS bucket configurations, and regional availability constraints before you get your first transcript. For teams already on GCP, this overhead is invisible; for everyone else, it's real.
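Once that plumbing is in place, the V2 call itself is compact. A sketch with the google-cloud-speech client, where the project ID is a placeholder and model naming should be checked against current regional availability:

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Chirp models are regional; the endpoint must match the recognizer location.
client = SpeechClient(
    client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp",  # model naming/availability varies by region and release
)

with open("meeting.wav", "rb") as f:
    audio_bytes = f.read()

# "your-project-id" is a placeholder.
response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        recognizer="projects/your-project-id/locations/us-central1/recognizers/_",
        config=config,
        content=audio_bytes,
    )
)

for result in response.results:
    print(result.alternatives[0].transcript)
```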

Amazon Transcribe follows the same pattern as Google: deeply capable, but the AWS onboarding experience is its own category of friction. The S3-upload → job-creation → poll-for-output pattern is fine if your data already lives on AWS. Starting from scratch, it's the slowest time-to-first-transcript of the three.
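End to end, that pattern looks something like the sketch below; bucket, key, and job names are placeholders:

```python
import time
import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe", region_name="us-east-1")

# 1. Upload: batch Transcribe jobs read from S3, not local files.
s3.upload_file("meeting.wav", "example-bucket", "audio/meeting.wav")

# 2. Create the job.
transcribe.start_transcription_job(
    TranscriptionJobName="meeting-2026-04-15",
    Media={"MediaFileUri": "s3://example-bucket/audio/meeting.wav"},
    LanguageCode="en-US",
)

# 3. Poll until the job completes or fails.
while True:
    job = transcribe.get_transcription_job(
        TranscriptionJobName="meeting-2026-04-15"
    )["TranscriptionJob"]
    if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

# 4. The transcript itself is yet another fetch, from the URI in the response.
print(job.get("Transcript", {}).get("TranscriptFileUri"))
```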

Privacy and security

All three providers have mature approaches to data security, and all three are appropriate for enterprise use cases with proper configuration.

Amazon Transcribe uses TLS for data in transit and AWS KMS for encryption at rest. It supports input media encryption and is HIPAA-eligible for medical use cases.

Google processes audio in memory without persistent storage by default — a meaningful privacy advantage for sensitive audio. Full GDPR, HIPAA, and SOC 2/3 compliance is supported through the GCP compliance framework. Data residency controls via V2's single-region configuration address sovereignty requirements.

OpenAI retains uploaded audio for up to 30 days by default for quality and safety monitoring, with a Zero Data Retention option available. Enterprise DPA agreements are available. For highly regulated industries, this data retention policy warrants scrutiny — it's not a blocker, but it's worth verifying the ZDR configuration is active before processing sensitive audio.

Which one should you actually use?

There's no universal answer, but some patterns can help you make an informed decision:

| Use case | Recommended starting point | Why |
|---|---|---|
| Highest accuracy on clean audio | OpenAI gpt-4o-transcribe | Current benchmark leader; strong multilingual improvements |
| Broadest language coverage + GCP already in stack | Google Chirp 3 | 85+ languages, streaming, diarization, built-in denoiser |
| Call center / contact center analytics | Amazon Transcribe Call Analytics | Domain-specific models built for this; deep AWS integration |
| Medical transcription | Amazon Transcribe Medical | HIPAA-eligible; specialized vocabulary |
| Lowest cost at high volume | Whisper (self-hosted) or Amazon Transcribe (Tier 3) | $0 marginal cost vs $0.0078/min at scale — but model the TCO carefully |
| Voice agents / real-time streaming | OpenAI Realtime API or Google Chirp 3 | Both now GA for streaming; test latency with your use case |
| Conversational / multi-speaker audio | Test all three on your own audio — or see how purpose-built providers compare | Benchmarks don't predict real-world conversational performance |

If you're already on GCP or need the broadest language coverage with enterprise-grade compliance infrastructure, Google's Chirp 3 is a significant step up from where Google's ASR stood two years ago. It's competitive on accuracy, has real streaming support now, and the diarization improvements are meaningful.

If you're on AWS, have call center or medical transcription use cases, or need to process audio at very high volume with predictable pricing, Amazon Transcribe's ecosystem integration and specialized models justify the evaluation time.

If your use case involves conversational speech — multi-speaker audio, overlapping speech, real-world noise, code-switching, or meeting intelligence — this is where it might be worth extending your search beyond the Big Three. Their general-purpose architectures weren't designed around the problems that conversational audio creates. It's worth looking at providers purpose-built for this. Our Solaria-1 model was specifically designed for multi-speaker, multilingual, real-world audio with native code-switching across 100+ languages.

The benchmark problem no one talks about

Published WER numbers are designed to look good, not to tell you what will happen in your production environment. Every major ASR provider curates the benchmark datasets they publish against. The audio is often clean, the speakers are often native, the recording conditions are controlled. Real-world audio doesn't look like that.

The only evaluation that matters for your use case is the one you run on audio that actually represents what your users will throw at the system: multi-speaker calls with background noise, accented speech that doesn't appear in training data, technical vocabulary, etc. We open-sourced our benchmarking methodology across 8 providers and 74 hours of real audio for exactly this reason, so you can see what performance looks like on production-representative data, and reproduce every result yourself.
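If you want to run that evaluation yourself, the open-source jiwer library is a common starting point. A sketch, assuming you've collected reference transcripts and each provider's output for the same clips (the data below is illustrative):

```python
# pip install jiwer
import jiwer

def normalize(text: str) -> str:
    # Strip casing/punctuation so formatting differences don't inflate WER.
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

# Illustrative placeholders: your ground truth and each provider's outputs
# for the same audio clips, in the same order.
references = ["reference transcript for clip one", "clip two reference"]
hypotheses = {
    "provider_a": ["reference transcript for clip one", "clip too reference"],
    "provider_b": ["a transcript for clip one", "clip two reference"],
}

for provider, outputs in hypotheses.items():
    score = jiwer.wer(
        [normalize(r) for r in references],
        [normalize(h) for h in outputs],
    )
    print(f"{provider}: WER {score:.1%}")
```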

Frequently asked questions

Is OpenAI Whisper still worth using in 2026? 

It depends on what you mean by "Whisper." The open-source model (Large V3 Turbo) remains a solid baseline for self-hosted deployments and sits at roughly 15–16% WER on challenging audio — respectable, but no longer state of the art. OpenAI's hosted API now runs gpt-4o-transcribe rather than classic Whisper, and that model is substantially more accurate. For teams considering self-hosting Whisper, it's worth modeling the true total cost of ownership before assuming it's the cheapest option: infrastructure alone can run to $163,680/year before developer overhead.
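For what self-hosting actually looks like in code, the open-source package is simple; the cost lives in the GPU underneath, not the API surface. A minimal sketch with the openai-whisper package:

```python
# pip install openai-whisper  (also requires ffmpeg and, realistically, a GPU)
import whisper

# large-v3 needs roughly 10GB of VRAM; smaller checkpoints trade accuracy for cost.
model = whisper.load_model("large-v3")

result = model.transcribe("meeting.mp3")
print(result["text"])

# Segment-level timestamps come back alongside the full text.
for segment in result["segments"]:
    print(f'[{segment["start"]:7.2f}s] {segment["text"]}')
```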

How does Google Chirp 3 compare to OpenAI's models? 

Chirp 2 benchmarks at around 11.6% WER — better than Whisper Large V3 on standard tests, though behind gpt-4o-transcribe. Chirp 3 improves further and adds a built-in denoiser, native diarization, and full streaming support. Google's main advantages are language breadth (125+ languages) and deep GCP ecosystem integration. Its main friction point is onboarding complexity for teams not already on Google Cloud.

What is the cheapest speech-to-text API in 2026? 

At published rates, OpenAI's Whisper API at $0.006/minute is the lowest commercial price. Self-hosted open-source Whisper has no per-minute cost but requires GPU infrastructure and engineering overhead that typically exceeds API costs at anything above moderate volume. Amazon Transcribe becomes competitive at high volumes, dropping to $0.0078/minute at 5M+ minutes monthly. Pricing for features like diarization, PII redaction, and streaming is often charged separately, so the effective rate is usually higher than any headline figure.

Does Amazon Transcribe support real-time transcription? 

Yes. Amazon Transcribe has supported streaming transcription for some time and integrates cleanly within the AWS ecosystem. For voice agent use cases requiring very low latency (sub-300ms), it's worth comparing against purpose-built real-time APIs — latency characteristics differ meaningfully between providers, and the architecture decisions around buffering, chunking, and partial transcript delivery matter as much as raw speed.
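For reference, a hedged sketch of that streaming flow with the open-source amazon-transcribe Python SDK; the input file and chunk size are placeholders standing in for live microphone or telephony frames:

```python
# pip install amazon-transcribe  (SDK for the Transcribe streaming API)
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent

class PrintFinalResults(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Partial results arrive continuously; print only finalized segments.
        for result in transcript_event.transcript.results:
            if not result.is_partial:
                print(result.alternatives[0].transcript)

async def main():
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16_000,
        media_encoding="pcm",
    )

    async def send_audio():
        # "meeting.pcm" is a placeholder: raw 16kHz 16-bit mono PCM audio.
        with open("meeting.pcm", "rb") as f:
            while chunk := f.read(1024 * 16):
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintFinalResults(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())

asyncio.run(main())
```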

How do I run a fair evaluation of ASR providers for my use case? 

Before committing engineering time to a full integration, it's worth grounding your evaluation in something more representative. At Gladia, we tested 8 providers across 74 hours of real production audio and open-sourced the entire methodology so every result can be independently reproduced. You can run your own test, measure WER on multi-speaker segments specifically, check diarization accuracy under real noise, and pay attention to the accents and languages your users actually speak.
