A single mistranscribed product name or customer ID doesn't just ruin a call log. It silently corrupts your CRM and breaks your automated QA scoring, and by the time anyone notices, the damage is already downstream across every coaching scorecard, entity extraction result, and compliance record tied to that call.
This guide covers everything needed to connect our speech-to-text infrastructure to the Vonage Voice API. It includes real-time streaming for live agent assist and post-call batch processing for full-coverage QA. Connection patterns, audio formats, and code-level setup are all covered, along with the data governance requirements your compliance review will ask for.
This guide serves two audiences. If you use Vonage Business Cloud as your contact center platform, it shows you how to export recorded calls and submit them to our async API for post-call transcription and QA. If you're a developer building directly on the Vonage Voice API, it also covers how to stream live call audio to us via WebSocket while routing completed recordings through the async pipeline for full-accuracy post-call analysis. The integration patterns, code samples, and compliance requirements apply to both workflows.
Impact of automated transcription on agent ROI
Manual QA sampling reviews only a small fraction of calls, which means coaching decisions rest on a statistically thin slice of actual agent performance. Automated transcription, when the underlying WER (word error rate) is low enough to be trusted, raises that coverage to 100% of calls. Contact center agent attrition and replacement costs are among the highest of any service industry, according to ICMI research. A live text stream gives supervisors the visibility to intervene before calls escalate, and earlier intervention reduces burnout-driven attrition. The critical qualifier: automated QA is only as reliable as the source transcript.
Scaling compliance in Vonage Voice calls
Regulatory standards including PCI-DSS, HIPAA, and GDPR mandate documentation and auditability for customer interactions. Our compliance hub covers SOC 2 Type II, ISO 27001, HIPAA, and GDPR certification details for vendor review, while the AI transcription legal safety guide covers broader considerations specific to support call deployments. Our optional PII redaction (which must be explicitly enabled via the entity_types parameter in your API request payload) replaces sensitive entities like names, phone numbers, and card data with structured placeholders such as [NAME] and [PHONE_NUMBER].
Scaling QA with transcription data
Transcript accuracy is what makes automated QA scorecards defensible in executive reviews, particularly when tracking accuracy metrics by language and audio condition. Claap, a video collaboration platform processing high call volumes, reached 1-3% WER in production after switching to Gladia, with one hour of audio processed in under 60 seconds.
Integrating Gladia with the Vonage Voice API
The architecture connects Vonage's raw L16 PCM audio stream over WebSocket to Gladia's transcription endpoints, which return structured JSON with transcripts, speaker labels, and audio intelligence.
Third-party integration matrix:
| Provider | Core model | Language support | Diarization | Pricing model |
| --- | --- | --- | --- | --- |
| Gladia | Solaria-1 | 100+ languages, including 42 languages not covered by other APIs | pyannoteAI Precision-2, included at base rate | Per hour, all features included on Starter and Growth |
| Deepgram | Nova-3 | 50+ languages (per Deepgram's public product page) | Add-on, priced separately | Per minute base, add-ons inflate effective cost at scale |
| Microsoft Azure | Proprietary | 100+ languages | Pricing varies by configuration; see Azure documentation | Per minute, procurement lock-in |
| Symbl.ai | Custom | Limited | Custom pricing | Per minute, complex pricing structure |
Vonage Voice API data pipeline setup
Vonage sends audio as raw 16-bit PCM frames at 16 kHz, with 20ms frames of 640 bytes each, prepended with a JSON metadata frame on connection. This format maps directly to our required configuration of encoding: 'wav/pcm', sample_rate: 16000, and bit_depth: 16, with no transcoding needed. The Vonage developer documentation provides the full WebSocket streaming reference.
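The 640-byte figure follows directly from the sample rate, bit depth, and frame duration. As a quick sanity check, the arithmetic can be sketched as below; `pcmFrameBytes` is an illustrative helper, not part of either API.

```javascript
// Bytes in one PCM frame = samples per frame x bytes per sample x channels.
function pcmFrameBytes({ sampleRate, bitDepth, channels, frameMs }) {
  const samplesPerFrame = (sampleRate * frameMs) / 1000; // 16000 * 20 / 1000 = 320
  const bytesPerSample = bitDepth / 8;                   // 16 / 8 = 2
  return samplesPerFrame * bytesPerSample * channels;
}

// The format described above: 16 kHz, 16-bit, mono, 20 ms frames.
const vonageFrameBytes = pcmFrameBytes({
  sampleRate: 16000,
  bitDepth: 16,
  channels: 1,
  frameMs: 20,
});
console.log(vonageFrameBytes); // 640

// At 50 frames per second, one call leg produces 32,000 bytes of audio per second.
const framesPerSecond = 1000 / 20;
console.log(vonageFrameBytes * framesPerSecond); // 32000
```

If a frame arrives with a different size, that is usually a sign the NCCO content-type does not match the format your relay expects.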
The pipeline sequence is:
- Your application answers the call via the Vonage Voice API and starts a WebSocket stream.
- Your WebSocket server receives L16 PCM audio frames from Vonage.
- Your server forwards those frames to our real-time WebSocket endpoint.
- We return partial and final transcripts as JSON in near real time.
Supported audio formats for Vonage
We accept WAV, PCM, M4A, FLAC, and AAC for async processing, and raw PCM frames for real-time streaming. For Vonage specifically, the L16 PCM output is ready to forward to our real-time endpoint without conversion.
Mono vs. multi-track recording:
| Recording type | Audio channels | Speaker separation | Best use case |
| --- | --- | --- | --- |
| Mono (single channel) | 1 | Mixed, model-based separation required | Basic transcription, single-speaker calls |
| Multi-track (up to 32 channels) | Up to 32 (one per participant) | Clean isolation per speaker | High-accuracy diarization, multi-party QA scoring |
Vonage supports recording up to 32 call participants in separate tracks, which eliminates crosstalk before it reaches the transcription model. For BPO environments with simultaneous agent and customer speech, multi-track recording can improve diarization accuracy through our pyannoteAI Precision-2 model in batch workflows.
Streaming vs. batch transcription
The choice between real-time and async depends on your use case, not just latency preference.
- Real-time streaming: Approximately 270ms final transcript latency. Ideal for live agent assist, supervisor monitoring, and real-time sentiment alerts.
- Batch (async) transcription: Processes one hour of audio in under 60 seconds. Full context across the entire call enables higher accuracy, and pyannoteAI Precision-2 diarization is available in this mode. Ideal for post-call QA scoring, CRM sync, and compliance audit trails.
For most contact center architectures, a common pattern is to stream a live transcript for the supervisor UX, then replace it with the high-accuracy diarized async transcript once the call ends.
Setting up Vonage Voice API transcription
1. Configure Vonage WebSocket endpoint
Configure your Vonage Answer URL to return an NCCO that connects the call to a WebSocket endpoint your server controls. The content-type for L16 PCM audio at 16 kHz is audio/l16;rate=16000. The Vonage NCCO reference covers all available connection parameters.
[
  {
    "action": "connect",
    "endpoint": [
      {
        "type": "websocket",
        "uri": "wss://your-server.com/vonage-audio",
        "content-type": "audio/l16;rate=16000"
      }
    ]
  }
]
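In practice the NCCO is generated by the handler behind your Answer URL rather than stored as a static file. The sketch below shows one way to build it; `buildStreamingNcco` and the `call_id` header are illustrative names, not part of the Vonage API. Vonage expects the NCCO as a JSON array of actions, and the optional `headers` object on a WebSocket endpoint is echoed back in the first JSON message on the socket, which is a convenient way to tie the audio stream to a call in your own system.

```javascript
// Build the NCCO (an array of actions) that an Answer URL handler returns as JSON.
// `wsUri` and `callId` are placeholders supplied by your application.
function buildStreamingNcco(wsUri, callId) {
  if (!wsUri.startsWith('wss://') && !wsUri.startsWith('ws://')) {
    throw new Error('WebSocket URI must start with ws:// or wss://');
  }
  return [
    {
      action: 'connect',
      endpoint: [
        {
          type: 'websocket',
          uri: wsUri,
          'content-type': 'audio/l16;rate=16000',
          // Optional custom metadata (hypothetical field name); Vonage includes
          // it in the initial JSON message sent over the WebSocket.
          headers: { call_id: callId },
        },
      ],
    },
  ];
}

const ncco = buildStreamingNcco('wss://your-server.com/vonage-audio', 'call-123');
console.log(JSON.stringify(ncco, null, 2));
```

Your web framework's route handler would return this array as the JSON response body for the Answer URL request.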
2. Connect to Gladia for live transcription
Initialize our real-time client on your WebSocket server using your API key from the Gladia docs. Send the configuration message as the first JSON frame to specify audio format, sample rate, and any audio intelligence features.
// For mono (single-channel) audio
const gladiaConfig = {
  x_gladia_key: "YOUR_GLADIA_KEY",
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 1,
  language_behaviour: "automatic single language",
  code_switching: true
};
gladiaSocket.send(JSON.stringify(gladiaConfig));

// For multi-track (up to 32 channels) recording
const gladiaConfigMultiTrack = {
  x_gladia_key: "YOUR_GLADIA_KEY",
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 32,
  language_behaviour: "automatic single language",
  code_switching: true
};
Setting code_switching: true enables automatic detection of mid-conversation language changes across our full list of supported languages without requiring a session reset.
3. Stream audio for live transcription
Forward each L16 PCM frame from Vonage directly to our WebSocket. Your server acts as a relay between the two WebSocket connections, receiving Vonage's 20ms frames of 640 bytes each and forwarding them with no conversion.
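// Vonage's first message is a JSON metadata frame (text); only binary messages carry audio.
// The (data, isBinary) listener signature assumes the `ws` package, version 8 or later.
vonageSocket.on('message', (audioFrame, isBinary) => {
  if (!isBinary) return; // skip the metadata frame and any other text events
  if (gladiaSocket.readyState === WebSocket.OPEN) {
    gladiaSocket.send(JSON.stringify({
      frames: audioFrame.toString('base64')
    }));
  }
});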
4. Handling live speech-to-text streams
We return two transcript types: partial transcripts (lower latency, subject to revision) and final transcripts (confirmed output with word-level timestamps). For live agent assist, display partials to give supervisors the fastest possible view, then overwrite with finals for the permanent call record.
gladiaSocket.on('message', (data) => {
  const response = JSON.parse(data);
  if (response.type === 'transcript' && response.transcription) {
    const { transcript, words } = response.transcription;
    if (response.transcription.is_final) {
      // Confirmed output: store it in the permanent call record.
      appendToCallRecord(transcript, words);
    } else {
      // Partial output: show it now, expect it to be revised.
      updateLiveDisplay(transcript);
    }
  }
});
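The handler above calls `appendToCallRecord` and `updateLiveDisplay`, which this guide leaves to your application. A minimal in-memory sketch of the overwrite-partials, append-finals behavior might look like the following; `createCallTranscript` and its methods are hypothetical stand-ins, and a production version would write to your datastore and push updates to the supervisor UI.

```javascript
// Hypothetical in-memory stand-ins for the two helpers used in the handler above.
function createCallTranscript() {
  const finals = [];    // confirmed segments, kept for the permanent record
  let livePartial = ''; // latest unconfirmed text, overwritten on every update

  return {
    finals,
    updateLiveDisplay(transcript) {
      livePartial = transcript;
    },
    appendToCallRecord(transcript, words = []) {
      finals.push({ transcript, words });
      livePartial = ''; // the final supersedes whatever partial was showing
    },
    // Text a supervisor view would render: confirmed segments plus the current partial.
    render() {
      const confirmed = finals.map((segment) => segment.transcript);
      return (livePartial ? [...confirmed, livePartial] : confirmed).join(' ');
    },
  };
}

const call = createCallTranscript();
call.updateLiveDisplay('I would like to');
call.updateLiveDisplay('I would like to cancel');
call.appendToCallRecord('I would like to cancel my order.');
call.updateLiveDisplay('Can you');
console.log(call.render()); // "I would like to cancel my order. Can you"
```

Keeping partials out of `finals` means the permanent record never contains text that the model later revised.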
WebSocket connections drop on network interruptions and server restarts. Re-open the session using the original session URL (the session token remains valid), and enable TCP keep-alive to prevent idle disconnections. Handle connection timeouts with automatic reconnection logic. Growth and Enterprise tiers allow the concurrent session counts needed for production contact center volumes. The live transcription API reference covers all available response fields.
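One common way to implement the automatic reconnection mentioned above is exponential backoff with jitter, so that many relays do not all reconnect at the same instant after an outage. The sketch below is a generic pattern under assumed defaults (500 ms base, 30 s cap, 5 attempts), not a documented Gladia or Vonage requirement; `connect` is a hypothetical function that opens your WebSocket and returns a Promise.

```javascript
// Delay before reconnect attempt N (0-based): doubles each time, capped at maxMs.
// Jitter picks a value in the upper half of the window; `random` is injectable for tests.
function reconnectDelayMs(attempt, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.round(exponential / 2 + random() * (exponential / 2));
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `connect` until it resolves or the attempt budget is exhausted.
async function connectWithRetry(connect, { maxAttempts = 5, sleep = defaultSleep, ...backoff } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await connect(attempt);
    } catch (error) {
      lastError = error;
      // Wait before the next try, but not after the last one.
      if (attempt < maxAttempts - 1) {
        await sleep(reconnectDelayMs(attempt, backoff));
      }
    }
  }
  throw lastError;
}

// Usage (hypothetical): connectWithRetry(() => openGladiaSocket(sessionUrl));
```

While the socket is down, audio frames keep arriving from Vonage, so decide explicitly whether to buffer or drop them during the retry window.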
Automating post-call data processing
Post-call batch transcription is our core strength for contact center workflows. One hour of audio processes in under 60 seconds, and the full-context analysis produces materially better accuracy, diarization, and named entity recognition than the real-time stream alone.
1. Vonage call data retention
After a call ends, retrieve the recorded audio file from Vonage's storage using the recording UUID from the call event payload. The Vonage Business Cloud support documentation covers the storage retrieval API. For VBC users, the "On Demand Call Recording" app is the prerequisite for accessing recorded files (current pricing in the Vonage App Center).
2. Submit audio to Gladia async API
If the audio is a local file, upload it first to the upload endpoint using multipart/form-data, which returns a URL. Then submit to the pre-recorded transcription endpoint with your configuration. Enable diarization, sentiment analysis, named entity recognition, and summarization in a single call.
curl --request POST \
  --url https://api.gladia.io/v2/pre-recorded \
  --header 'x-gladia-key: YOUR_GLADIA_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "audio_url": "https://your-vonage-recording-url.com/recording.wav",
    "diarization": true,
    "sentiment_analysis": true,
    "named_entity_recognition": true,
    "summarization": true
  }'
Use callbacks rather than polling to receive the result. Set a callback_url in your request and we will POST the completed transcript to your endpoint when processing finishes, avoiding unnecessary polling overhead on high call volumes.
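From a Node.js service, the same request can be assembled programmatically with the callback included. The sketch below only builds the request; `buildAsyncTranscriptionRequest` is an illustrative helper, and the `callback_url` field name is taken from the paragraph above, so confirm the exact spelling against the current API reference before relying on it.

```javascript
// Build the URL and fetch() options for an async transcription request.
// Nothing is sent here; pass the result to fetch() in your own code.
function buildAsyncTranscriptionRequest({ apiKey, audioUrl, callbackUrl }) {
  const body = {
    audio_url: audioUrl,
    diarization: true,
    sentiment_analysis: true,
    named_entity_recognition: true,
    summarization: true,
  };
  // Field name as described in this guide (assumption: verify in the API reference).
  if (callbackUrl) {
    body.callback_url = callbackUrl;
  }
  return {
    url: 'https://api.gladia.io/v2/pre-recorded',
    options: {
      method: 'POST',
      headers: {
        'x-gladia-key': apiKey,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

const request = buildAsyncTranscriptionRequest({
  apiKey: 'YOUR_GLADIA_KEY',
  audioUrl: 'https://your-vonage-recording-url.com/recording.wav',
  callbackUrl: 'https://your-server.com/gladia-callback', // hypothetical endpoint
});
console.log(request.options.body);

// Usage: const response = await fetch(request.url, request.options);
```

Separating request construction from sending makes the payload easy to unit test and to log (with the key redacted) when a submission fails.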
3. Importing transcript data for QA
Our async response includes per-utterance speaker labels (SPEAKER_00, SPEAKER_01, etc.) powered by pyannoteAI Precision-2, which map to agent and customer channels for per-agent QA scoring.
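Mapping speaker labels to roles is a small post-processing step on your side. The sketch below assumes each utterance exposes a `speaker` label and a `text` field, which is an illustrative shape rather than the documented response schema; the role map itself has to come from your own logic, for example treating whoever speaks the scripted greeting as the agent.

```javascript
// Group utterance text by role, given a map from speaker label to role.
// The { speaker, text } shape is an assumption for illustration only.
function groupBySpeakerRole(utterances, roleBySpeaker) {
  const grouped = {};
  for (const { speaker, text } of utterances) {
    const role = roleBySpeaker[speaker] || 'unknown';
    if (!grouped[role]) {
      grouped[role] = [];
    }
    grouped[role].push(text);
  }
  return grouped;
}

const utterances = [
  { speaker: 'SPEAKER_00', text: 'Thanks for calling, how can I help?' },
  { speaker: 'SPEAKER_01', text: 'My order is late.' },
  { speaker: 'SPEAKER_00', text: 'Let me check that for you.' },
];

const byRole = groupBySpeakerRole(utterances, {
  SPEAKER_00: 'agent',
  SPEAKER_01: 'customer',
});
console.log(byRole.agent.length);    // 2
console.log(byRole.customer.length); // 1
```

Scoring only the `agent` entries keeps customer speech from counting toward script-adherence or compliance-disclosure checks.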
4. Sync transcripts to CRM or QA tools
Push the structured output, including transcript text, speaker labels, named entities, sentiment scores (text-based, derived from the transcript rather than acoustic tone), and summaries, to your downstream CRM or QA platform via webhook.
Integration checklist
Use this checklist to verify your Vonage-to-Gladia integration before moving to production:
- Vonage NCCO configured with WebSocket endpoint and content-type: audio/l16;rate=16000
- Call recording enabled in Vonage application settings
- WebSocket initialized with encoding: 'wav/pcm', sample_rate: 16000, bit_depth: 16 to connect to Gladia
- code_switching: true set for any multilingual or BPO call flows
- Reconnection logic implemented for WebSocket connection drops
- Vonage recording UUID captured from call completed event for async retrieval
- Async POST configured with diarization: true for speaker-labeled QA transcripts
- Callback URL set in async request to avoid polling overhead
- PII redaction preset (GDPR, HIPAA_SAFE_HARBOR, or PCI) explicitly enabled if required
- Data training policy confirmed: Growth or Enterprise tier for regulated audio
- Region selected to match your Vonage infrastructure geography (EU-west or US-west for optimal latency)
- CRM or QA webhook endpoint tested with our structured JSON output
Scaling Vonage transcription for heavy call volumes
Optimizing speech recognition for accents
BPO environments introduce accent, dialect, and code-switching complexity that degrades QA accuracy when the transcription model can't handle it. Our Solaria-1 model delivers on average 29% lower WER on conversational speech and 3x lower DER compared to alternatives, benchmarked across 7 datasets and 74+ hours of audio with open, reproducible methodology. Our 100+ supported languages include 42 that no other API-level competitor covers, including Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi, which are directly relevant for BPO operations in Southeast Asia and South Asia.
For European contact-center deployments processing EN, FR, DE, ES, or IT audio in async workflows, Solaria-3 is our most accurate model for real-world business and noisy audio. For real-time streaming, Solaria-1 is the model to use. It also covers the full 100+ language breadth, including Southeast and South Asian BPO languages.
"Gladia deliver real time highly accurate transcription with minimal latency, even accross multiple languages and ascents, The API is straightforward and well documented, Making integration into our internal tools quick and easy." - Faes W. on G2
Data governance for Voice API calls
The data training policy differs by plan tier:
- Starter plan: Customer audio can be used for model training by default.
- Growth and Enterprise plans: Customer audio is never used for model training. No opt-out action required. This is the contractual default, verifiable in the Data Processing Agreement (DPA).
- Enterprise plan: Adds zero data retention (ZDR), meaning audio is processed ephemerally and deleted immediately after transcription with no data stored at rest.
For regulated industries including financial services and healthcare, the Growth or Enterprise tier is the appropriate default. Our compliance hub provides the SOC 2 Type II, ISO 27001, HIPAA, and GDPR certification documentation your legal team needs to complete a vendor review.
Managing Vonage usage-based expenses
Vonage Business Cloud (VBC) recording costs:
- On Demand Call Recording: required prerequisite for call transcription (check current pricing in the Vonage App Center).
- AI Transcription for On-Demand Call Recordings: available as a separate add-on (check current pricing in the Vonage App Center).
Gladia plan pricing:
| Plan | Async | Real-time | Training policy |
| --- | --- | --- | --- |
| Starter | $0.61/hr | $0.75/hr | Data can be used for training |
| Growth | As low as $0.20/hr | As low as $0.25/hr | Never used for training |
| Enterprise | Custom | Custom | Never used for training, ZDR available |
All Starter and Growth plans include diarization, translation, sentiment analysis, named entity recognition, summarization, and code-switching at the base rate, with no add-on fees.
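To turn the per-hour rates into a budget figure, multiply by your monthly audio volume. The sketch below uses a hypothetical volume of 10,000 audio hours per month; the Growth figure uses the "as low as" rate from the table, so treat it as a lower bound rather than a quote.

```javascript
// Estimated cost in dollars, rounded to the nearest cent.
function estimateTranscriptionCost(audioHours, ratePerHour) {
  return Math.round(audioHours * ratePerHour * 100) / 100;
}

const monthlyHours = 10000; // hypothetical monthly call volume in audio hours

console.log(estimateTranscriptionCost(monthlyHours, 0.61)); // Starter async: 6100
console.log(estimateTranscriptionCost(monthlyHours, 0.75)); // Starter real-time: 7500
console.log(estimateTranscriptionCost(monthlyHours, 0.20)); // Growth async floor: 2000
```

If you run both pipelines on the same calls (live stream plus post-call async), add the two estimates together, since each hour of audio is processed twice.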
Optimizing Vonage Voice API latency
To keep real-time transcript delivery within the approximately 270ms final latency budget, select the Gladia region closest to your Vonage infrastructure (EU or US).
Boosting coaching outcomes with automated call logs
Live agent coaching and sentiment alerts
We provide text-based sentiment analysis derived from the transcript, which is NLP inference on what was said rather than acoustic emotion detection from vocal characteristics. When caller intent signals or sentiment flags appear, supervisors can review the corresponding transcript segment and decide whether to intervene, rather than responding to a score without context.
Automated QA scoring at 100% coverage
Moving from 2% manual QA sampling to 100% automated scoring changes the coaching conversation from "here's a random call we pulled" to "here's every call scored against the same rubric." Our low WER in production, combined with named entity recognition that produces 39% fewer entity errors than leading competitors, ensures that automated scorecards reflect what agents actually said rather than what the model hallucinated.
Syncing Vonage transcripts to CRM
Structured output from the async pipeline, including speaker-attributed transcript segments, named entities (product names, account numbers, contact details), and the call summary, maps directly to CRM record fields. Agents spend zero time on manual call notes, and CRM records are populated automatically after call completion, supplying the structured data that scoring rubrics for sales and support QA run against.
Automated audit trails for voice data
Every call transcript includes word-level timestamps, speaker labels, and structured entity data, creating a searchable, compliant audit record that doesn't require a human QA reviewer to reconstruct what was said during a compliance disclosure. This is the foundation for passing regulatory audits without pulling QA headcount off active coaching work.
Start with 10 free hours and have your integration in production in less than a day. For multilingual BPO environments, test our API on your own accented and code-switching audio to verify WER under your specific call conditions before committing to a plan tier.
FAQs
How do you transcribe Vonage Business Cloud (VBC) calls?
Purchase the "On Demand Call Recording" app from the Vonage App Center (required prerequisite, pricing subject to change), then export the recorded audio files and submit them to our async API using a POST to the pre-recorded endpoint with your audio URL and desired enrichment features.
What is the Vonage Voice API speech-to-text limit?
The Vonage Voice API supports continuous real-time WebSocket streaming for extended calls, with final transcript latency of approximately 270ms. For async batch processing, we handle files up to 135 minutes and 1,000MB per submission.
How does Gladia handle multi-language Vonage calls?
Our Solaria-1 model detects mid-conversation language changes automatically across our full language list when code_switching: true is set, maintaining transcript continuity without a session reset even when bilingual speakers alternate languages mid-call.
Is speaker diarization available in real-time Vonage call transcription?
Speaker diarization powered by pyannoteAI Precision-2 is available in async (batch) workflows only. For real-time call monitoring, speaker attribution can be handled in post-processing once the call ends, using the high-accuracy async transcript as the permanent record.
Which Gladia plan stops customer audio from being used for model training?
Growth and Enterprise plans guarantee that customer audio is never used for model training, with no opt-out action required. On the Starter plan, audio can be used for training by default.
Key terms glossary
Word error rate (WER): The standard metric for speech recognition accuracy, calculated by comparing the transcribed text against a human-verified ground truth. A full WER methodology explanation is available here.
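Concretely, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of words in the reference: WER = (S + D + I) / N. The sketch below computes it with a word-level edit distance. It lowercases and splits on whitespace only, which is a simplifying assumption; published benchmarks typically apply fuller text normalization (punctuation, numerals) before scoring.

```javascript
// Word error rate via word-level Levenshtein distance.
function wordErrorRate(reference, hypothesis) {
  const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error('Reference transcript must not be empty');
  }

  // previous[j] = edits needed to align the first i-1 reference words
  // with the first j hypothesis words.
  let previous = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i += 1) {
    const current = [i];
    for (let j = 1; j <= hyp.length; j += 1) {
      const substitution = previous[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = previous[j] + 1;     // reference word missing from hypothesis
      const insertion = current[j - 1] + 1; // extra word in hypothesis
      current.push(Math.min(substitution, deletion, insertion));
    }
    previous = current;
  }
  return previous[hyp.length] / ref.length;
}

// One substitution (sat -> sit) and one deletion (the) over 6 reference words.
const wer = wordErrorRate('the cat sat on the mat', 'the cat sit on mat');
console.log(wer.toFixed(3)); // "0.333"
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis contains many extra words.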
Diarization error rate (DER): The metric that measures speaker attribution accuracy, evaluating how successfully a system identifies who spoke and when.
Code-switching: The practice of alternating between two or more languages mid-conversation, which our Solaria-1 model detects and handles automatically in both real-time and async modes across our supported languages.
WebSocket: A communication protocol providing full-duplex channels over a single TCP connection. Vonage uses WebSockets to stream live call audio at L16 PCM, 16 kHz to external processing endpoints.
pyannoteAI Precision-2: The speaker diarization model that powers our async speaker attribution. Available only in async (batch) workflows, where full recording context produces more accurate and stable speaker labels than any live approach.
Zero data retention (ZDR): An Enterprise-tier configuration where audio is processed ephemerally and deleted immediately after transcription, with no data stored at rest at any stage of the pipeline.