Blog

Technical guides, customer stories, and product updates

Speech-To-Text

What is Word Error Rate (WER): How it’s calculated, and why it can mislead

Word Error Rate (WER) is a metric that evaluates the performance of automatic speech recognition (ASR) systems by measuring the accuracy of their speech-to-text output. Developers, scientists, and researchers use it to assess ASR performance: the lower the WER, the better the system. Tracking WER helps teams optimize ASR technology over time and compare speech-to-text models and providers for commercial use.
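As a rough illustration of how the metric is usually computed, WER is the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal Python version under stated assumptions: the function name `word_error_rate` is hypothetical, and the only normalization applied is lowercasing and whitespace splitting, which is far simpler than what a production evaluation would use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; punctuation and number formatting are left untouched.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can push WER above 1.0, and the result depends heavily on how both transcripts are normalized before comparison.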

Speech-To-Text

Text normalization in speech recognition explained

Speech recognition systems are good at turning audio into words. But the transcripts they produce aren’t always structured in ways that software can reliably work with.

Speech-To-Text

What is speaker diarization?

One of the major obstacles for speech-to-text AI has been identifying individual speakers in a multi-speaker audio stream before transcribing the speech. This is where speaker separation, also known as diarization, comes into play.

Speech-To-Text

Build a customer interview library with Gladia, Airtable & Make.com

TL;DR: Most product teams lose qualitative insights to scattered audio and transcripts that misattribute quotes. A reliable interview library needs accurate async diarization, automated routing, and a searchable database. Gladia's Solaria-1 sets the accuracy floor (29% lower WER, 3x lower DER on conversational speech), and Make.com routes its structured JSON into Airtable automatically, turning raw recordings into a searchable, theme-tagged customer content library.

Speech-To-Text

Build an automated sales call analyzer with Gladia and n8n

TL;DR: Off-the-shelf conversation intelligence platforms cost $1,200 to $2,400 per seat per year, while this n8n and Gladia pipeline scales at $0.20 to $0.61 per hour of audio with all features included. The async pipeline handles transcription, speaker diarization, and audio intelligence in a single API call, and the structured JSON output maps directly into HubSpot or Salesforce through n8n nodes. Gladia's Solaria-1 model covers 100+ languages, including 42 that no other API-level competitor supports, protecting CRM data quality for global sales teams.

Speech-To-Text

How to build a no-touch pipeline from sales calls to CRM

TL;DR: Manual CRM entry breaks sales intelligence pipelines because reps skip fields and misremember details, creating corrupted deal data that spreads into forecasts, coaching scores, and follow-up tasks. The bottleneck in fixing this isn't the CRM API or the LLM prompt; it's the transcription layer, since a high word error rate corrupts every entity Claude extracts downstream. This tutorial walks through a production-ready pipeline using Gladia's async STT for transcription, Claude for entity extraction, and n8n for orchestration, with most teams reaching production in under 24 hours. Gladia's Solaria-1 model delivers 29% lower WER on average than alternatives on conversational speech, directly protecting the accuracy of every deal record written to the CRM.

Speech-To-Text

Build a lead scoring pipeline from sales call recordings with Gladia and Claude

TL;DR: Accurate hot/warm/cold lead scoring needs speaker-attributed transcripts. Without diarization, Claude cannot separate prospect buying signals from the sales rep's talk track, so any score is unreliable. Gladia's async API (Solaria-1) returns speaker-labeled, LLM-ready JSON, with diarization, sentiment, and named entity recognition included in the base per-hour rate on Starter and Growth plans, each enabled explicitly in the request. On Growth and Enterprise plans, audio is never used for model training with no opt-out required, keeping the pipeline safe for sensitive sales calls under GDPR and SOC 2 Type II.

Speech-To-Text

How to flag low-confidence spans in AI meeting transcripts for reviewer QA

TL;DR: Transcription errors silently corrupt meeting summaries and CRM entries. Flag uncertainty with word-level confidence scores and pattern matching, then sync only the flagged spans to audio timestamps so reviewers verify the low-confidence parts instead of the whole transcript. Gladia's async API provides word-level confidence, pyannoteAI Precision-2 diarization, and native code-switching detection out of the box.
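A minimal sketch of the confidence-based half of that idea, assuming each transcribed word arrives as a dictionary with `word`, `start`, `end`, and `confidence` keys. That shape, the function name `flag_low_confidence_spans`, and the 0.6 threshold are all illustrative assumptions, not a documented API response format.

```python
from typing import Dict, List


def flag_low_confidence_spans(words: List[Dict], threshold: float = 0.6) -> List[Dict]:
    """Group consecutive low-confidence words into reviewable time spans.

    Assumption: each item in `words` looks like
    {"word": str, "start": float, "end": float, "confidence": float},
    with times in seconds. This shape is hypothetical.
    """
    spans: List[Dict] = []
    previous_flagged = False
    for w in words:
        flagged = w["confidence"] < threshold
        if flagged and previous_flagged:
            # Extend the current span so the reviewer hears one continuous clip.
            spans[-1]["end"] = w["end"]
            spans[-1]["text"] += " " + w["word"]
        elif flagged:
            spans.append({"start": w["start"], "end": w["end"], "text": w["word"]})
        previous_flagged = flagged
    return spans


if __name__ == "__main__":
    sample = [
        {"word": "renewal", "start": 0.0, "end": 0.5, "confidence": 0.95},
        {"word": "for", "start": 0.5, "end": 0.7, "confidence": 0.40},
        {"word": "acme", "start": 0.7, "end": 1.1, "confidence": 0.35},
        {"word": "is", "start": 1.1, "end": 1.2, "confidence": 0.90},
        {"word": "due", "start": 1.2, "end": 1.5, "confidence": 0.55},
    ]
    for span in flag_low_confidence_spans(sample):
        print(span)
```

Each returned span carries start and end timestamps, so a review tool could seek the audio player to just those segments instead of replaying the whole meeting.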

Speech-To-Text

Mastering real-time transcription: speed, accuracy, and Gladia's AI advantage

TL;DR: Most use cases like meeting assistants, post-call analytics, and note-taking tools don't need real-time transcription. Async delivers higher accuracy and better speaker attribution because the model processes the complete recording. Sub-300ms latency is a functional requirement only for voice agents, live captions, and live agent assist tools where immediate output is non-negotiable. Gladia's Solaria-1 delivers around 270ms average latency with 100+ language support and native code-switching for the use cases that do require it.