Speech-to-Text Benchmark Results — Word Error Rate (WER %, lower is better)

Methodology: benchmarks were run on the same framework as the Solaria-1 campaign. Solaria-3 and Solaria-1 were compared against 9 STT providers. Each provider was tested on identical audio files via production APIs with default settings. Real customer audio is Gladia's internal production dataset, annotated by humans. Soniox and the Pipecat STT Benchmark are excluded on some datasets pending data availability. Transcriptions were normalized with the OpenAI Whisper text normalizer before WER computation. Diarization Error Rate (DER) was measured on DIHARD III using standard protocols. The evaluation covers 74+ hours of audio across 8 evaluation datasets. Full open benchmark methodology: https://github.com/gladiaio/normalization

Columns: WER = Word Error Rate (%, lower is better) · Perfect = number of files with 0% WER · High WER = files with WER above 50% · RTFx = Real-Time Factor multiplier (higher = faster than real-time)
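
As an illustration of how the summary columns relate to per-file results, here is a minimal Python sketch. The `FileResult` type and `summarize` function are hypothetical names, not part of the published framework, and computing RTFx as total audio duration divided by total processing time is an assumption about how the per-file figures are aggregated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileResult:
    """Per-file outcome for one provider (hypothetical structure)."""
    wer: float                 # word error rate for this file, in percent
    audio_seconds: float       # duration of the audio file
    processing_seconds: float  # wall-clock time the provider took


def summarize(results: List[FileResult]) -> Dict[str, float]:
    """Derive the Perfect, High WER, and RTFx columns from per-file results."""
    perfect = sum(1 for r in results if r.wer == 0.0)   # files with 0% WER
    high_wer = sum(1 for r in results if r.wer > 50.0)  # files with WER above 50%
    total_audio = sum(r.audio_seconds for r in results)
    total_processing = sum(r.processing_seconds for r in results)
    # Assumption: RTFx is audio duration divided by processing time,
    # so values above 1 mean faster than real time.
    rtfx = total_audio / total_processing
    return {"perfect": perfect, "high_wer": high_wer, "rtfx": rtfx}


if __name__ == "__main__":
    demo = [
        FileResult(wer=0.0, audio_seconds=60.0, processing_seconds=2.0),
        FileResult(wer=62.5, audio_seconds=30.0, processing_seconds=1.0),
        FileResult(wer=12.0, audio_seconds=90.0, processing_seconds=3.0),
    ]
    print(summarize(demo))
```

With the three demo files, one file is perfect, one exceeds the 50% threshold, and 180 seconds of audio processed in 6 seconds gives an RTFx of 30.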

Real Customer Audio — English (Gladia Internal Dataset)

Provider | WER (%)
Solaria-3 | 9.6
ElevenLabs Scribe v2 | 9.9
AssemblyAI | 10.0
Deepgram Nova-3 | 10.7
Mistral Voxtral | 12.2
Solaria-1 | 12.9

Switchboard — Conversational Speech

Provider | WER (%) | Perfect transcriptions, High WER files, RTFx (digits run together in source)
Solaria-3 | 33.9 | 2730
Solaria-1 | 37.3 | 2730
AssemblyAI | 42.3 | 2841
Speechmatics | 46.0 | 3170
Mistral Voxtral | 48.1 | 2566
Deepgram Nova-3 | 49.8 | 2671
ElevenLabs Scribe v2 | 55.2 | 3160

Common Voice 24 — Clean Audio (Multilingual)

Provider | WER (%) | Perfect transcriptions, High WER files, RTFx (digits run together in source)
Speechmatics | 3.8 | 106722
AssemblyAI | 3.9 | 104110
ElevenLabs Scribe v2 | 3.9 | 105936
Mistral Voxtral | 5.1 | 98426
AssemblyAI Universal v2 | 5.2 | 96821
Solaria-3 | 6.9 |
Soniox v4 | 7.2 | 90271
Deepgram Nova-3 | 7.9 | 80802
Solaria-1 | 8.2 | 90421

VoxPopuli — Formal Discourse (European Parliament)

Provider | WER (%) | Perfect transcriptions, High WER files, RTFx (digits run together in source)
ElevenLabs Scribe v2 | 1.7 | 41805
AssemblyAI | 2.1 | 39602
Mistral Voxtral | 2.1 | 39405
Solaria-1 | 2.2 | 39303
AssemblyAI Universal v2 | 2.2 | 37703
Solaria-3 | 2.9 |
Speechmatics | 3.0 | 32606
Deepgram Nova-3 | 3.2 | 36307

Earnings22 Full — Financial Calls (Long-form, single file)

Provider | WER (%) | RTFx
ElevenLabs Scribe v2 | 9.4 | 35
Speechmatics | 10.0 | 17
AssemblyAI Universal v3 | 11.0 | 71
AssemblyAI Universal v2 | 11.1 | 82
Mistral STT | 11.6 | 135
Solaria-1 | 11.8 | 28
Deepgram v3 | 14.5 | 348

Earnings22 Cleaned AA — Financial Calls (Curated by Artificial Analysis)

Provider | WER (%) | RTFx
Solaria-3 | 6.4 |
AssemblyAI | 6.9 | 64
ElevenLabs Scribe v2 | 7.7 | 32
Speechmatics | 7.8 | 24
Mistral Voxtral | 7.9 | 57
Solaria-1 | 8.1 | 39
Deepgram Nova-3 | 12.0 | 234

Multilingual LibriSpeech — Audiobooks (5 languages, average)

Provider | WER (%) | Perfect transcriptions, High WER files, RTFx (digits run together in source)
ElevenLabs Scribe v2 | 3.7 | 565315
AssemblyAI | 4.7 | 50831
Soniox v4 | 5.6 | 37836
Solaria-1 | 5.9 | 36733
AssemblyAI Universal v2 | 6.2 | 36923
Deepgram Nova-3 | 7.5 | 27047
Solaria-3 | 8.0 |

Multilingual LibriSpeech — WER by Language (%)

Provider | German (DE) | Spanish (ES) | French (FR) | Italian (IT) | Portuguese (PT)
Solaria-1 | 5.0 | 4.0 | 4.8 | 9.9 | 5.3
AssemblyAI Universal v3 | 3.5 | 3.2 | 2.6 | 9.7 | 4.4
ElevenLabs Scribe v2 | 3.1 | 3.2 | 2.9 | 6.1 | 3.0
Soniox v4 | 5.4 | 4.4 | 5.0 | 8.8 | 4.3
AssemblyAI Universal v2 | 3.4 | 4.0 | 5.8 | 11.9 | 5.9
Deepgram v3 | 6.9 | 4.6 | 6.2 | 8.8 | 11.3

Pipecat STT Benchmark — Real-Time Streaming

Provider | WER (%) | Perfect transcriptions, High WER files, RTFx (digits run together in source)
AssemblyAI Universal v3 | 2.0 | 53102
ElevenLabs Scribe v2 | 2.2 | 51208
AssemblyAI Universal v2 | 2.5 | 49401
Mistral STT | 2.6 | 48505
Solaria-1 | 2.7 | 48204
Speechmatics | 2.7 | 47600
Soniox v4 | 2.9 | 48002
Deepgram v3 | 3.1 | 44908

Speaker Diarization Benchmark — DIHARD III — Diarization Error Rate (DER %, lower is better)

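Provider | Broadcast | Meeting | Web Video | Socio Field | Court | Clinical | Restaurant | Socio Lab | CTS | Maptask | Simple Avg | Weighted Avg
Solaria-1 | 9.4 | 29.9 | 44.4 | 12.3 | 3.9 | 13.3 | 41.3 | 5.5 | 7.7 | 4.5 | 17.2 | 16.6
NVIDIA NeMo Sortformer | 10.3 | 33.0 | 43.5 | 13.0 | 24.1 | 14.4 | 50.9 | 8.6 | 14.1 | 8.2 | 22.0 | 20.4
pyannoteAI Community-1 | 10.5 | 35.8 | 48.7 | 17.9 | 11.6 | 23.8 | 49.9 | 13.9 | 12.3 | 10.2 | 23.5 | 23.0
Speechmatics | 17.2 | 55.6 | 55.6 | 28.9 | 15.0 | 24.9 | 58.4 | 18.6 | 20.1 | 23.4 | 31.8 | 30.1
AWS Transcribe | 16.4 | 51.4 | 60.3 | 25.2 | 16.7 | 27.3 | 63.1 | 20.2 | 31.2 | 22.9 | 33.5 | 33.8
Soniox STT-async-preview-v1 | 24.8 | 58.3 | 57.5 | 30.1 | 39.3 | 35.1 | 67.4 | 28.0 | 29.2 | 27.6 | 39.7 | 37.8
ElevenLabs Scribe-v1 | 25.6 | 50.5 | 63.4 | 29.7 | 23.1 | 47.7 | 57.4 | 30.3 | 22.9 | 45.2 | 39.6 | 39.5
AssemblyAI Universal | 30.9 | 46.4 | 68.4 | 33.1 | 24.5 | 51.4 | 59.4 | 33.1 | 33.1 | 42.1 | 42.2 | 43.9
Deepgram v3 | 27.0 | 59.7 | 83.0 | 35.5 | 25.6 | 44.8 | 75.2 | 32.2 | 35.5 | 45.9 | 46.4 | 46.9

OpenAI GPT-4o Transcribe (only 11 of 12 values present in the source; column alignment not recoverable): 26.4, 57.8, 64.1, 28.8, 30.0, 40.8, 59.7, 26.5, 34.8, 41.0, 42.8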
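
To show how the Simple Avg column relates to the per-domain figures, the sketch below recomputes it for the Solaria-1 row as an unweighted mean over the ten DIHARD III domains. The helper names are hypothetical. The source does not state which weights the Weighted Avg column uses, so `weighted_average` takes them as an explicit argument and is not applied to the table here.

```python
from typing import Iterable

# DER (%) per DIHARD III domain for the Solaria-1 row of the table above.
solaria_1_der = {
    "Broadcast": 9.4,
    "Meeting": 29.9,
    "Web Video": 44.4,
    "Socio Field": 12.3,
    "Court": 3.9,
    "Clinical": 13.3,
    "Restaurant": 41.3,
    "Socio Lab": 5.5,
    "CTS": 7.7,
    "Maptask": 4.5,
}


def simple_average(values: Iterable[float]) -> float:
    """Unweighted mean: every domain counts equally."""
    values = list(values)
    return sum(values) / len(values)


def weighted_average(values: Iterable[float], weights: Iterable[float]) -> float:
    """Mean with per-domain weights (the weights are an assumption, not from the source)."""
    values, weights = list(values), list(weights)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


simple_avg = simple_average(solaria_1_der.values())
print(f"{simple_avg:.1f}")  # 17.2, matching the Simple Avg column
```

A weighted figure would typically use something like per-domain audio duration as the weight, but that choice is a guess and would need to be confirmed against the published framework.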

Open benchmark for speech-to-text

We evaluated Solaria-1 against 8 leading providers across 8 datasets and 74 hours of audio. The full methodology is open-sourced so results can be independently reproduced.

ALL RESULTS AT A GLANCE

WER comparison across datasets

Lower WER is better.

OPEN METHODOLOGY

How we benchmark

8 evaluation datasets · 74+ hours of audio · 9 STT providers compared

Each audio file is sent to every provider's production API with default settings. Solaria-3, Solaria-1, and every competitor are tested on identical files — no custom tuning or prompt engineering.

Transcription outputs are normalized using the OpenAI Whisper text normalizer before WER computation. Diarization Error Rate (DER) is measured on DIHARD III challenge datasets using standard protocols.
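
The normalize-then-score step can be sketched as follows. This is a simplified illustration, not the benchmark's code: `normalize` is a small stand-in (lowercasing, punctuation stripping, whitespace collapsing) rather than the OpenAI Whisper text normalizer, and `wer` is a plain word-level edit distance divided by the reference length.

```python
import re


def normalize(text: str) -> str:
    """Stand-in normalizer (NOT the Whisper normalizer): lowercase,
    drop punctuation except apostrophes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions)
    divided by the number of reference words, after normalization."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("Hello, world!", "hello world"))  # 0.0
    print(wer("a b c d", "a x c d"))            # 25.0
```

Normalizing both sides before scoring keeps differences in casing and punctuation from being counted as word errors, which is why the choice of normalizer affects the reported WER.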

The full framework is open-sourced on GitHub. Real customer audio is Gladia's internal production dataset, annotated by humans. Soniox and the Pipecat STT Benchmark are excluded on some datasets pending data availability.

Transparent benchmarks, open source

Full methodology, evaluation framework, and benchmark report. Reproduce every result independently.