Vonage call transcription: adding real-time speech-to-text to Vonage

Published on June 26, 2026
by Ani Ghazaryan

TL;DR: Integrating our speech-to-text infrastructure with the Vonage Voice API replaces fragmented recording, transcription, and enrichment stacks with a single API. By routing Vonage WebSocket streams directly to our endpoint, contact centers achieve approximately 270ms real-time latency for live agent assistance, or use post-call batch processing for automated QA scoring. Streaming is the right choice for live supervision. Async is the right choice when speaker-attributed QA scoring and full call context matter more than latency.

A single mistranscribed product name or customer ID doesn't just ruin a call log. It silently corrupts your CRM and breaks your automated QA scoring, and by the time anyone notices, the damage is already downstream across every coaching scorecard, entity extraction result, and compliance record tied to that call.

This guide covers everything needed to connect our speech-to-text infrastructure to the Vonage Voice API. It includes real-time streaming for live agent assist and post-call batch processing for full-coverage QA. Connection patterns, audio formats, and code-level setup are all covered, along with the data governance requirements your compliance review will ask for.

This guide serves two audiences. If you use Vonage Business Cloud as your contact center platform, it shows you how to export recorded calls and submit them to our async API for post-call transcription and QA. If you're a developer building directly on the Vonage Voice API, it also covers how to stream live call audio to us via WebSocket while routing completed recordings through the async pipeline for full-accuracy post-call analysis. The integration patterns, code samples, and compliance requirements apply to both workflows.

Impact of automated transcription on agent ROI

Manual QA sampling reviews only a small fraction of calls, which means coaching decisions rest on a statistically thin slice of actual agent performance. Automated transcription, when the underlying WER (word error rate) is low enough to be trusted, raises that ceiling to 100% of calls. Contact center agent attrition and replacement costs are among the highest of any service industry, according to ICMI research. A live text stream gives supervisors the visibility to intervene before calls escalate, enabling earlier action that reduces burnout-driven attrition. The critical qualifier: automated QA is only as reliable as the source transcript.

Scaling compliance in Vonage Voice calls

Regulatory standards including PCI-DSS, HIPAA, and GDPR mandate documentation and auditability for customer interactions. Our compliance hub covers SOC 2 Type II, ISO 27001, HIPAA, and GDPR certification details for vendor review, while the AI transcription legal safety guide covers broader considerations specific to support call deployments. Our optional PII redaction (which must be explicitly enabled via the entity_types parameter in your API request payload) replaces sensitive entities like names, phone numbers, and card data with structured placeholders such as [NAME] and [PHONE_NUMBER].

Scaling QA with transcription data

Transcript accuracy is what makes automated QA scorecards defensible in executive reviews, particularly when tracking accuracy metrics by language and audio condition. Claap, a video collaboration platform processing high call volumes, reached 1-3% WER in production after switching to Gladia, with one hour of audio processed in under 60 seconds.

Integrating Gladia with the Vonage Voice API

The architecture connects Vonage's raw L16 PCM audio stream over WebSocket to Gladia's transcription endpoints, which return structured JSON with transcripts, speaker labels, and audio intelligence.

Third-party integration matrix:

Provider | Core model | Language support | Diarization | Pricing model
Gladia | Solaria-1 | 100+ languages, including 42 languages not covered by other APIs | pyannoteAI Precision-2, included at base rate | Per hour, all features included on Starter and Growth
Deepgram | Nova-3 | 50+ languages (per Deepgram's public product page) | Add-on, priced separately | Per minute base, add-ons inflate effective cost at scale
Microsoft Azure | Proprietary | 100+ languages | Pricing varies by configuration; see Azure documentation | Per minute, procurement lock-in
Symbl.ai | Custom | Limited | Custom pricing | Per minute, complex pricing structure

Vonage Voice API data pipeline setup

Vonage sends audio as raw 16-bit PCM at 16 kHz in 20ms frames of 640 bytes each, preceded by a JSON metadata message when the connection opens. This format maps directly to our required configuration of encoding: 'wav/pcm', sample_rate: 16000, and bit_depth: 16, with no transcoding needed. The Vonage developer documentation provides the full WebSocket streaming reference.
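The 640-byte figure follows from the audio parameters: 16,000 samples per second, 2 bytes per sample, 20ms per frame. A minimal sketch of that arithmetic, usable as a sanity check on incoming frames; the function names are illustrative and not part of either API:

```javascript
// Bytes per frame = sample rate (Hz) x bytes per sample x channels x frame duration (s).
// Illustrative helper, not part of the Vonage or Gladia SDKs.
function pcmFrameBytes(sampleRateHz, bitDepth, frameMs, channels = 1) {
  const bytesPerSample = bitDepth / 8;
  return (sampleRateHz * bytesPerSample * channels * frameMs) / 1000;
}

// 16 kHz, 16-bit, 20ms, mono: the configuration described in this guide.
const expectedFrameBytes = pcmFrameBytes(16000, 16, 20);

// Returns true when a binary message has the size expected for one 20ms frame.
function looksLikeAudioFrame(buffer) {
  return Buffer.isBuffer(buffer) && buffer.length === expectedFrameBytes;
}
```

A size check like this is one way to catch a sample-rate mismatch early, since a different rate produces a different frame size.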

The pipeline sequence is:

  1. Your application answers the call via the Vonage Voice API and starts a WebSocket stream.
  2. Your WebSocket server receives L16 PCM audio frames from Vonage.
  3. Your server forwards those frames to our real-time WebSocket endpoint.
  4. We return partial and final transcripts as JSON in near real time.

Supported audio formats for Vonage

We accept WAV, PCM, M4A, FLAC, and AAC for async processing, and raw PCM frames for real-time streaming. For Vonage specifically, the L16 PCM output is ready to forward to our real-time endpoint without conversion.

Mono vs. multi-track recording:

Recording type | Audio channels | Speaker separation | Best use case
Mono (single channel) | 1 | Mixed, model-based separation required | Basic transcription, single-speaker calls
Multi-track (up to 32 channels) | Up to 32 (one per participant) | Clean isolation per speaker | High-accuracy diarization, multi-party QA scoring

Vonage supports recording up to 32 call participants in separate tracks, which eliminates crosstalk before it reaches the transcription model. For BPO environments with simultaneous agent and customer speech, multi-track recording can improve diarization accuracy through our pyannoteAI Precision-2 model in batch workflows.

Streaming vs. batch transcription

The choice between real-time and async depends on your use case, not just latency preference.

  • Real-time streaming: Approximately 270ms final transcript latency. Ideal for live agent assist, supervisor monitoring, and real-time sentiment alerts.
  • Batch (async) transcription: Processes one hour of audio in under 60 seconds. Full context across the entire call enables higher accuracy, and pyannoteAI Precision-2 diarization is available in this mode. Ideal for post-call QA scoring, CRM sync, and compliance audit trails.

For most contact center architectures, a common pattern is to stream a live transcript for the supervisor UX, then replace it with the high-accuracy diarized async transcript once the call ends.

Setting up Vonage Voice API transcription

1. Configure Vonage WebSocket endpoint

Configure your Vonage Answer URL to return an NCCO that connects the call to a WebSocket endpoint your server controls. The content-type for L16 PCM audio at 16 kHz is audio/l16;rate=16000. The Vonage NCCO reference covers all available connection parameters.

[
  {
    "action": "connect",
    "endpoint": [
      {
        "type": "websocket",
        "uri": "wss://your-server.com/vonage-audio",
        "content-type": "audio/l16;rate=16000"
      }
    ]
  }
]
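As a rough sketch of how an Answer URL handler could serve this, assuming a plain Node.js HTTP server: the path `/webhooks/answer` and port 3000 are placeholders, and `buildNcco` and `createAnswerServer` are hypothetical helper names. Note that Vonage expects the NCCO as a JSON array of actions.

```javascript
const http = require("http");

// Builds the NCCO for a given WebSocket URI. An NCCO is a JSON array of actions.
function buildNcco(websocketUri) {
  return [
    {
      action: "connect",
      endpoint: [
        {
          type: "websocket",
          uri: websocketUri,
          "content-type": "audio/l16;rate=16000",
        },
      ],
    },
  ];
}

// Hypothetical answer webhook; the route is a placeholder for your own Answer URL path.
function createAnswerServer(websocketUri) {
  return http.createServer((req, res) => {
    if (req.url.startsWith("/webhooks/answer")) {
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify(buildNcco(websocketUri)));
    } else {
      res.writeHead(404);
      res.end();
    }
  });
}

// Usage: createAnswerServer("wss://your-server.com/vonage-audio").listen(3000);
```

Keeping the NCCO construction in its own function makes it easy to unit-test the payload without starting a server.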

2. Connect to Gladia for live transcription

Initialize our real-time client on your WebSocket server using your API key from the Gladia docs. Send the configuration message as the first JSON frame to specify audio format, sample rate, and any audio intelligence features.

// For mono (single-channel) audio
const gladiaConfig = {
  x_gladia_key: "YOUR_GLADIA_KEY",
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 1,
  language_behaviour: "automatic single language",
  code_switching: true
};
gladiaSocket.send(JSON.stringify(gladiaConfig));

// For multi-track (up to 32 channels) recording
const gladiaConfigMultiTrack = {
  x_gladia_key: "YOUR_GLADIA_KEY",
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 32,
  language_behaviour: "automatic single language",
  code_switching: true
};

Setting code_switching: true enables automatic detection of mid-conversation language changes across our full list of supported languages without requiring a session reset.

3. Stream audio for live transcription

Forward each L16 PCM frame from Vonage directly to our WebSocket. Your server acts as a relay between the two WebSocket connections, receiving Vonage's 20ms frames of 640 bytes each and forwarding them with no conversion.

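// Assumes the `ws` package (v8 or later), whose 'message' handler receives (data, isBinary).
vonageSocket.on('message', (audioFrame, isBinary) => {
  // Skip text messages such as Vonage's initial JSON metadata; forward audio only.
  if (!isBinary) return;
  if (gladiaSocket.readyState === WebSocket.OPEN) {
    gladiaSocket.send(JSON.stringify({
      frames: audioFrame.toString('base64')
    }));
  }
});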

4. Handling live speech-to-text streams

We return two transcript types: partial transcripts (lower latency, subject to revision) and final transcripts (confirmed output with word-level timestamps). For live agent assist, display partials to give supervisors the fastest possible view, then overwrite with finals for the permanent call record.

gladiaSocket.on('message', (data) => {
  const response = JSON.parse(data);
  if (response.type === 'transcript' && response.transcription) {
    const { transcript, words } = response.transcription;
    if (response.transcription.is_final) {
      appendToCallRecord(transcript, words);
    } else {
      updateLiveDisplay(transcript);
    }
  }
});

WebSocket connections drop on network interruptions and server restarts. Re-open the session using the original session URL (the session token remains valid), and enable TCP keep-alive to prevent idle disconnections. Handle connection timeouts with automatic reconnection logic. Growth and Enterprise tiers allow the concurrent session counts needed for production contact center volumes. The live transcription API reference covers all available response fields.
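One way to structure the reconnection logic is exponential backoff with a cap. The sketch below is illustrative only: `openSocket` is a hypothetical caller-supplied function that re-opens the existing session URL and returns a promise, and the delay values are arbitrary starting points rather than recommended settings.

```javascript
// Exponential backoff with a cap: 500ms, 1s, 2s, ... up to maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 10000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `openSocket` is a hypothetical function that resolves to an open socket
// for the original session URL, or rejects if the connection fails.
async function reconnectWithBackoff(openSocket, maxAttempts = 5, sleep = defaultSleep) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await openSocket();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Passing `sleep` as a parameter keeps the retry loop testable without real timers.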

Automating post-call data processing

Post-call batch transcription is our core strength for contact center workflows. One hour of audio processes in under 60 seconds, and the full-context analysis produces materially better accuracy, diarization, and named entity recognition than the real-time stream alone.

1. Vonage call data retention

After a call ends, retrieve the recorded audio file from Vonage's storage using the recording UUID from the call event payload. The Vonage Business Cloud support documentation covers the storage retrieval API. For VBC users, the "On Demand Call Recording" app is the prerequisite for accessing recorded files (current pricing in the Vonage App Center).

2. Submit audio to Gladia async API

If the audio is a local file, upload it first to the upload endpoint using multipart/form-data, which returns a URL. Then submit to the pre-recorded transcription endpoint with your configuration. Enable diarization, sentiment analysis, named entity recognition, and summarization in a single call.

curl --request POST \
  --url https://api.gladia.io/v2/pre-recorded \
  --header 'x-gladia-key: YOUR_GLADIA_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "audio_url": "https://your-vonage-recording-url.com/recording.wav",
    "diarization": true,
    "sentiment_analysis": true,
    "named_entity_recognition": true,
    "summarization": true
  }'

Use callbacks rather than polling to receive the result. Set a callback_url in your request and we will POST the completed transcript to your endpoint when processing finishes, avoiding unnecessary polling overhead on high call volumes.
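The same request can be assembled in Node.js. This is a minimal sketch that reuses only the fields from the curl example plus the callback_url mentioned above; `buildAsyncRequest` is an illustrative helper name, and the exact callback field should be confirmed against the current API reference.

```javascript
// Builds fetch() options for the pre-recorded endpoint.
// Field names mirror the curl example, plus callback_url as described in the text.
function buildAsyncRequest(apiKey, audioUrl, callbackUrl) {
  return {
    method: "POST",
    headers: {
      "x-gladia-key": apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      audio_url: audioUrl,
      diarization: true,
      sentiment_analysis: true,
      named_entity_recognition: true,
      summarization: true,
      callback_url: callbackUrl,
    }),
  };
}

// Usage (Node 18+, which ships a global fetch):
// await fetch("https://api.gladia.io/v2/pre-recorded",
//   buildAsyncRequest(process.env.GLADIA_KEY, recordingUrl, "https://your-server.com/gladia-callback"));
```

Separating payload construction from the network call lets you log or validate the request before it is sent.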

3. Importing transcript data for QA

Our async response includes per-utterance speaker labels (SPEAKER_00, SPEAKER_01, etc.) powered by pyannoteAI Precision-2, which map to agent and customer channels for per-agent QA scoring.
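A sketch of that mapping step, under stated assumptions: the utterance shape (`{ speaker, text }`) is simplified for illustration and may not match the actual response schema, and the role map is something your application supplies, since the speaker labels alone do not say which speaker is the agent.

```javascript
// Groups utterance text by role using a caller-supplied label-to-role map.
// The { speaker, text } shape is an assumption for illustration only.
function groupByRole(utterances, roleMap) {
  const byRole = {};
  for (const { speaker, text } of utterances) {
    const role = roleMap[speaker] || "unknown";
    if (!byRole[role]) byRole[role] = [];
    byRole[role].push(text);
  }
  return byRole;
}

// Example: the application has determined that SPEAKER_00 is the agent.
const exampleGrouped = groupByRole(
  [
    { speaker: "SPEAKER_00", text: "Thanks for calling, how can I help?" },
    { speaker: "SPEAKER_01", text: "I need to update my account." },
    { speaker: "SPEAKER_00", text: "Sure, can I have your account number?" },
  ],
  { SPEAKER_00: "agent", SPEAKER_01: "customer" }
);
```

Here `exampleGrouped.agent` holds the two agent utterances in order, which is the slice a per-agent QA rubric would score.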

4. Sync transcripts to CRM or QA tools

Push the structured output, including transcript text, speaker labels, named entities, sentiment scores (text-based, derived from the transcript rather than acoustic tone), and summaries, to your downstream CRM or QA platform via webhook.

Integration checklist

Use this checklist to verify your Vonage-to-Gladia integration before moving to production:

  • Vonage NCCO configured with WebSocket endpoint and content-type: audio/l16;rate=16000
  • Call recording enabled in Vonage application settings
  • Gladia WebSocket session initialized with encoding: 'wav/pcm', sample_rate: 16000, bit_depth: 16
  • code_switching: true set for any multilingual or BPO call flows
  • Reconnection logic implemented for WebSocket connection drops
  • Vonage recording UUID captured from call completed event for async retrieval
  • Async POST configured with diarization: true for speaker-labeled QA transcripts
  • Callback URL set in async request to avoid polling overhead
  • PII redaction preset (GDPR, HIPAA_SAFE_HARBOR, or PCI) explicitly enabled if required
  • Data training policy confirmed: Growth or Enterprise tier for regulated audio
  • Region selected to match your Vonage infrastructure geography (EU-west or US-west for optimal latency)
  • CRM or QA webhook endpoint tested with our structured JSON output

Scaling Vonage transcription for heavy call volumes

Optimizing speech recognition for accents

BPO environments introduce accent, dialect, and code-switching complexity that degrades QA accuracy when the transcription model can't handle it. Our Solaria-1 model delivers on average 29% lower WER on conversational speech and 3x lower DER compared to alternatives, benchmarked across 7 datasets and 74+ hours of audio with open, reproducible methodology. Our 100+ supported languages include 42 that no other API-level competitor covers, including Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi, which are directly relevant for BPO operations in Southeast Asia and South Asia.

For European contact-center deployments processing EN, FR, DE, ES, or IT audio in async workflows, Solaria-3 is our most accurate model for real-world business and noisy audio. For real-time streaming, Solaria-1 is the model to use; it also covers the full 100+ language breadth, including Southeast and South Asian BPO languages.

"Gladia deliver real time highly accurate transcription with minimal latency, even accross multiple languages and ascents, The API is straightforward and well documented, Making integration into our internal tools quick and easy." - Faes W. on G2

Data governance for Voice API calls

The data training policy differs by plan tier:

  • Starter plan: Customer audio can be used for model training by default.
  • Growth and Enterprise plans: Customer audio is never used for model training. No opt-out action required. This is the contractual default, verifiable in the Data Processing Agreement (DPA).
  • Enterprise plan: Adds zero data retention (ZDR), meaning audio is processed ephemerally and deleted immediately after transcription with no data stored at rest.

For regulated industries including financial services and healthcare, the Growth or Enterprise tier is the appropriate default. Our compliance hub provides the SOC 2 Type II, ISO 27001, HIPAA, and GDPR certification documentation your legal team needs to complete a vendor review.

Managing Vonage usage-based expenses

Vonage Business Cloud (VBC) recording costs:

  • On Demand Call Recording: required prerequisite for call transcription (check current pricing in the Vonage App Center).
  • AI Transcription for On-Demand Call Recordings: available as a separate add-on (check current pricing in the Vonage App Center).

Gladia pricing by plan:

Plan | Async | Real-time | Training policy
Starter | $0.61/hr | $0.75/hr | Data can be used for training
Growth | As low as $0.20/hr | As low as $0.25/hr | Never used for training
Enterprise | Custom | Custom | Never used for training, ZDR available

All Starter and Growth plans include diarization, translation, sentiment analysis, named entity recognition, summarization, and code-switching at the base rate, with no add-on fees.
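For budgeting, the list prices quoted above translate into a simple estimate. This sketch assumes billing is linear in audio hours and uses the "as low as" Growth rates, so Growth results are a lower bound; Enterprise is omitted because its pricing is custom. The function and constant names are illustrative.

```javascript
// List prices in cents per hour, to keep the arithmetic exact.
// Growth figures are the "as low as" rates, so treat them as a lower bound.
const RATES_CENTS_PER_HOUR = {
  starter: { async: 61, realtime: 75 },
  growth: { async: 20, realtime: 25 },
};

// Returns the estimated cost in USD for a number of audio hours.
function monthlyCostUsd(plan, mode, hours) {
  const planRates = RATES_CENTS_PER_HOUR[plan];
  const rate = planRates ? planRates[mode] : undefined;
  if (rate === undefined) {
    throw new Error(`No list price for ${plan}/${mode}`);
  }
  return (rate * hours) / 100;
}

// 1,000 hours of async audio per month:
const starterAsync = monthlyCostUsd("starter", "async", 1000); // 610
const growthAsync = monthlyCostUsd("growth", "async", 1000); // 200
```

These figures cover transcription only; Vonage recording and add-on costs are billed separately.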

Optimizing Vonage Voice API latency

To keep real-time transcript delivery within the approximately 270ms final latency budget, select the Gladia region closest to your Vonage infrastructure (EU or US).

Boosting coaching outcomes with automated call logs

Live agent coaching and sentiment alerts

We provide text-based sentiment analysis derived from the transcript, which is NLP inference on what was said rather than acoustic emotion detection from vocal characteristics. When caller intent signals or sentiment flags appear, supervisors can review the corresponding transcript segment and decide whether to intervene, rather than responding to a score without context.

Automated QA scoring at 100% coverage

Moving from 2% manual QA sampling to 100% automated scoring changes the coaching conversation from "here's a random call we pulled" to "here's every call scored against the same rubric." Our low WER in production, combined with named entity recognition that delivers a 39% reduction in entity errors versus leading competitors, ensures that automated scorecards reflect what agents actually said rather than what the model hallucinated.

Syncing Vonage transcripts to CRM

Structured output from the async pipeline, including speaker-attributed transcript segments, named entities (product names, account numbers, contact details), and the call summary, maps directly to CRM record fields. Agents spend zero time on manual call notes, and CRM records are populated automatically after call completion, supplying the structured data that scoring rubrics for sales and support QA run against.

Automated audit trails for voice data

Every call transcript includes word-level timestamps, speaker labels, and structured entity data, creating a searchable, compliant audit record that doesn't require a human QA reviewer to reconstruct what was said during a compliance disclosure. This is the foundation for passing regulatory audits without pulling QA headcount off active coaching work.

Start with 10 free hours and have your integration in production in less than a day. For multilingual BPO environments, test our API on your own accented and code-switching audio to verify WER under your specific call conditions before committing to a plan tier.

FAQs

How do you transcribe Vonage Business Cloud (VBC) calls?

Purchase the "On Demand Call Recording" app from the Vonage App Center (required prerequisite, pricing subject to change), then export the recorded audio files and submit them to our async API using a POST to the pre-recorded endpoint with your audio URL and desired enrichment features.

What is the Vonage Voice API speech-to-text limit?

The Vonage Voice API supports continuous real-time WebSocket streaming for extended calls, with final transcript latency of approximately 270ms. For async batch processing, we handle files up to 135 minutes and 1,000MB per submission.
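A pre-submission check against those async limits might look like the sketch below. It assumes 1,000MB means decimal megabytes (1 MB = 1,000,000 bytes), which the source does not specify, and the function name is illustrative.

```javascript
const MAX_DURATION_MINUTES = 135;
const MAX_SIZE_MB = 1000; // assumption: decimal megabytes (1 MB = 1,000,000 bytes)

// Returns a list of limit violations; an empty list means the file can be submitted as-is.
function checkAsyncLimits(durationSeconds, sizeBytes) {
  const problems = [];
  if (durationSeconds > MAX_DURATION_MINUTES * 60) {
    problems.push("duration exceeds 135 minutes");
  }
  if (sizeBytes > MAX_SIZE_MB * 1000 * 1000) {
    problems.push("size exceeds 1,000MB");
  }
  return problems;
}
```

Recordings that exceed either limit would need to be split before submission.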

How does Gladia handle multi-language Vonage calls?

Our Solaria-1 model detects mid-conversation language changes automatically across our full language list when code_switching: true is set, maintaining transcript continuity without a session reset even when bilingual speakers alternate languages mid-call.

Is speaker diarization available in real-time Vonage call transcription?

Speaker diarization powered by pyannoteAI Precision-2 is available in async (batch) workflows only. For real-time call monitoring, speaker attribution can be handled in post-processing once the call ends, using the high-accuracy async transcript as the permanent record.

Which Gladia plan stops customer audio from being used for model training?

Growth and Enterprise plans guarantee that customer audio is never used for model training, with no opt-out action required. On the Starter plan, audio can be used for training by default.

Key terms glossary

Word error rate (WER): The standard metric for speech recognition accuracy, calculated by comparing the transcribed text against a human-verified ground truth and counting substituted, deleted, and inserted words as a fraction of the reference word count.
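In its usual formulation, WER is the number of word substitutions, deletions, and insertions divided by the number of words in the reference. The sketch below computes it with a word-level edit distance; it splits on whitespace only and applies no text normalization (casing, punctuation, number formatting), which real benchmarks typically do, so it is not the methodology behind the figures in this guide.

```javascript
// WER = (substitutions + deletions + insertions) / reference word count,
// computed as word-level Levenshtein distance over whitespace-separated tokens.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] = edit distance between the ref words processed so far and the first j hyp words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of four reference words gives a WER of 0.25.
const exampleWer = wordErrorRate("account number four two", "account number for two");
```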

Diarization error rate (DER): The metric that measures speaker attribution accuracy, evaluating how successfully a system identifies who spoke and when.

Code-switching: The practice of alternating between two or more languages mid-conversation, which our Solaria-1 model detects and handles automatically in both real-time and async modes across our supported languages.

WebSocket: A communication protocol providing full-duplex channels over a single TCP connection. Vonage uses WebSockets to stream live call audio as L16 PCM at 16 kHz to external processing endpoints.

pyannoteAI Precision-2: The speaker diarization model that powers our async speaker attribution. It is available only in async (batch) workflows, where full recording context produces more accurate and stable speaker labels than any live approach.

Zero data retention (ZDR): An Enterprise-tier configuration where audio is processed ephemerally and deleted immediately after transcription, with no data stored at rest at any stage of the pipeline.
