Amazon Connect transcription: real-time speech-to-text for AWS contact centers

Published on June 26, 2026
by Ani Ghazaryan

TL;DR: Contact centers using Amazon Connect struggle with high transcription costs and poor multilingual accuracy when relying on native tools. Routing audio via Kinesis Video Streams or S3 to Solaria-1 eliminates the Lambda 15-minute timeout risk and removes per-feature add-on costs. On conversational speech, Solaria-1 delivers on average 29% lower WER than alternatives, benchmarked across 7 datasets and 74+ hours of audio.

The biggest bottleneck in your Amazon Connect agent assist pipeline isn't the LLM. It's the native transcription layer corrupting customer names, account numbers, and intent signals before they reach your CRM. A wrong account number in a financial services call doesn't just produce a bad transcript; it produces a bad CRM record, a flawed coaching scorecard, and a compliance exposure that surfaces in an audit rather than a QA review. Manual sampling of a small fraction of call volume can't catch this at scale, and automated QA is only as reliable as the transcription layer beneath it. This guide covers the architecture for routing Amazon Connect audio through our Solaria-1 model for production-grade accuracy, whether you're building live agent assist or automating post-call QA across a multilingual BPO operation.

Defining Amazon Connect transcription capabilities

Operations teams often conflate these three services, but their functional and pricing boundaries directly affect your cost-per-contact model.

| Service | Type | What it does | Pricing |
| --- | --- | --- | --- |
| Amazon Transcribe | ASR engine | Managed speech recognition for developers | Per-minute pricing (tiered by volume) |
| Contact Lens (Conversational Analytics) | Analytics feature | NLP layer for sentiment, themes, agent compliance inside Connect | Add-on pricing or bundled in unlimited AI tier |
| Transcribe Call Analytics | Enterprise API | Rich transcripts, redaction, call insights via pay-as-you-go API | Per-minute pricing (includes sentiment and categorization) |

The key distinction: Transcribe is the engine, Contact Lens is the analytics feature built on top of it inside the Connect console, and Transcribe Call Analytics is a separate API for teams that want programmatic access to those insights. Sentiment analysis, call categorization, and summarization all require the Call Analytics tier per AWS's published Transcribe pricing.

Core Amazon Connect transcription features

Contact Lens delivers real-time insights during calls and post-call analytics once After Contact Work (ACW) completes. Real-time features surface agent compliance flags and live sentiment to supervisors, while post-call analysis delivers full transcripts, PII redaction, and aggregated sentiment scores.

The native limitations are where operational complexity begins:

  • Language coverage gaps: For languages critical to South and Southeast Asian BPO operations, including Tagalog, Bengali, Punjabi, Tamil, and Urdu, AWS Transcribe accuracy often falls short of automated QA thresholds in production contact center environments. Where coverage exists, production WER on accented speech can exceed what automated QA workflows require, driving errors into downstream CRM and coaching records.
  • Unbundled feature costs: PII redaction and Call Analytics each carry separate per-minute charges on top of the base transcription rate, inflating cost-per-contact at scale.
  • Lambda timeout for long calls: AWS Lambda has a hard execution limit of 15 minutes. Any call-transcription workflow that runs its streaming loop inside a single Lambda invocation will produce an incomplete transcript for calls that run longer, silently corrupting CRM records.

Use cases for third-party STT

Native AWS tooling handles straightforward, English-dominant contact centers well. The gap opens under these conditions:

  • Accented and multilingual agents: BPO sites in the Philippines, India, or Latin America introduce accent and code-switching patterns that drive WER materially higher than clean-audio evaluations suggest.
  • High-volume cost sensitivity: At high monthly volumes, unbundled AWS pricing for a feature-equivalent stack (base transcription, PII redaction, sentiment via Call Analytics) can diverge significantly from our all-inclusive Growth plan.
  • Code-switching mid-conversation: Bilingual agents switching languages within a single call produce fragmented output in standard ASR engines. Our native code-switching detection handles this across all 100+ supported languages without a broken session.
  • Downstream CRM accuracy requirements: When transcription errors corrupt CRM entries, the damage is invisible until it surfaces in a coaching review or compliance audit.

Integrating Gladia with Amazon Connect

Integrating Gladia with Amazon Connect means replacing the ASR and analytics engine, not the telephony layer. Gladia sits as the transcription and enrichment layer between your Amazon Connect audio infrastructure and downstream systems (CRM, QA platform, workforce management), leaving the Connect contact flow, Kinesis Video Streams configuration, and S3 storage setup in place. The integration points are Lambda and IAM configuration, not telephony.

Because this integration handles call audio that may include regulated PII in financial services or healthcare environments, compliance posture matters before go-live. We hold SOC 2 Type II, ISO 27001, HIPAA, GDPR, and PCI certifications, documented in full on our compliance hub. On Growth and Enterprise plans, customer audio is never used for model training and no opt-out is required.

Technical data flow for Amazon Connect

The data flow differs depending on whether you're building real-time agent assist or post-call QA automation. Both paths originate from the same Amazon Connect infrastructure.

  • Real-time path (Kinesis Video Streams): When live media streaming is enabled in Connect, Kinesis Video Streams (KVS) captures dual-channel audio and Amazon Connect creates one KVS stream per active call automatically as your call volume scales. Your Lambda function reads audio fragments from KVS and streams them to our real-time endpoint. The resulting transcript routes to the CRM or agent assist interface via callback.
  • Post-call path (S3): Connect deposits call recordings into an S3 bucket on call completion. An S3 event notification on s3:ObjectCreated:* triggers a Lambda function that passes the recording URL to our async API. We return structured JSON with speaker labels, word-level timestamps, and text-based sentiment scores, which are then written to your CRM or QA platform.

Real-time vs. post-call data flows

| Dimension | Real-time (KVS + WebSocket) | Post-call (S3 + async API) |
| --- | --- | --- |
| Audio source | KVS (dual-channel) | S3 bucket (WAV, MP3, M4A, FLAC) |
| Latency | ~300ms final transcript | ~60 sec per hour of audio |
| Diarization | N/A: speaker attribution available in post-call async processing | pyannoteAI Precision-2, full dual-channel |
| Primary use case | Live agent assist | Automated QA, CRM population, coaching |
| Lambda timeout risk | High (15-min hard cap) | Low (S3 trigger, async processing) |

The post-call path delivers production-grade accuracy. Async processing produces full-context accuracy, robust speaker diarization, and complete multilingual handling. The real-time path is production-viable for agent assist at approximately 300ms latency, but displaying live transcript in the agent's Connect Contact Control Panel via this KVS-to-WebSocket path requires a custom display layer to surface output from the WebSocket stream.

How to enable live Amazon Connect transcription

1. Stream call audio to transcription

Enable live media streaming in the Amazon Connect console by navigating to: Data storage > Live media streaming > Edit > Enable live media streaming. Once enabled, add Start media streaming and Stop media streaming blocks to your contact flow and configure them to specify which audio channels to capture. This configuration applies to all calls routed through that flow, so test on a low-volume pilot flow before enabling it across your full operation.

2. Configure live audio data flows

KVS splits the call into two audio tracks: AUDIO_TO_CUSTOMER (agent audio) and AUDIO_FROM_CUSTOMER (caller audio). Your Lambda function reads both tracks from the KVS stream using the Kinesis Video Streams Parser Library to reconstruct dual-channel audio. This separation enables per-speaker sentiment scoring and accurate diarization in the async path. Your Lambda execution role will need appropriate IAM permissions for KVS access and S3 operations, detailed in the deployment checklist below.
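As a rough illustration of the dual-channel reconstruction described above, the sketch below interleaves two mono tracks into one stereo buffer. It assumes the fragments for each track have already been decoded to 16-bit little-endian PCM at the same sample rate; the function name `interleave_pcm16` and the channel order are illustrative choices, not part of any AWS or Gladia library.

```python
import struct


def interleave_pcm16(agent: bytes, customer: bytes) -> bytes:
    """Interleave two mono 16-bit little-endian PCM buffers into one stereo buffer.

    Channel 0 carries AUDIO_TO_CUSTOMER (agent audio) and channel 1 carries
    AUDIO_FROM_CUSTOMER (caller audio). The shorter buffer is padded with
    silence so the two channels stay sample-aligned.
    """
    if len(agent) % 2 or len(customer) % 2:
        raise ValueError("PCM16 buffers must contain whole 2-byte samples")

    n = max(len(agent), len(customer)) // 2      # samples per channel
    agent = agent.ljust(n * 2, b"\x00")          # pad with digital silence
    customer = customer.ljust(n * 2, b"\x00")

    left = struct.unpack(f"<{n}h", agent)
    right = struct.unpack(f"<{n}h", customer)

    # Stereo PCM alternates samples: L0, R0, L1, R1, ...
    stereo = [sample for pair in zip(left, right) for sample in pair]
    return struct.pack(f"<{2 * n}h", *stereo)
```

Keeping the agent on a fixed channel means downstream speaker attribution can rely on channel position rather than guessing from content.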

3. Set up real-time transcription

Your Lambda function initiates a real-time session with a POST request to https://api.gladia.io/v2/live, which returns a session token. Use that token to open the WebSocket connection at wss://api.gladia.io/v2/live?token={session_id}, then forward raw PCM fragments over that stream. Enable code-switching detection (the code_switching option) in the session initialization payload so Solaria-1 detects language changes mid-conversation without breaking the stream. You can watch a real-time transcription playground walkthrough to see the WebSocket output format before writing your Lambda handler, including the architectural patterns covered in the webinar replay.
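The session initialization step might look something like the sketch below. Only the endpoint URL and the code-switching option come from the text above; the other payload field names, the `x-gladia-key` header, and the 8000 Hz default are assumptions to verify against the current API reference. The sketch builds the request without sending it, so it can be inspected or unit-tested offline.

```python
import json
import urllib.request

GLADIA_LIVE_URL = "https://api.gladia.io/v2/live"  # endpoint named in the text


def build_live_session_payload(sample_rate: int = 8000, channels: int = 1) -> dict:
    """Build the session-initialization body.

    Field names here are assumptions for illustration; confirm them
    against the API reference before relying on them.
    """
    return {
        "encoding": "wav/pcm",
        "sample_rate": sample_rate,
        "bit_depth": 16,
        "channels": channels,
        # Empty language list = auto-detect; code_switching lets the
        # detected language change mid-stream.
        "language_config": {"languages": [], "code_switching": True},
    }


def build_live_session_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Wrap the payload in a POST request (header name is an assumption)."""
    return urllib.request.Request(
        GLADIA_LIVE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-gladia-key": api_key},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and opening the WebSocket with the returned token would follow; those steps are left out here because they depend on the response shape.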

4. Integrate transcripts into CRM and QA

Critical constraint: the 15-minute Lambda cap. AWS Lambda has a hard execution limit of 15 minutes. Any contact center handling calls longer than this will produce an incomplete transcript for the tail of the call if the transcription loop runs inside Lambda, silently corrupting the CRM record.

The recommended pattern to eliminate this risk is a long-running container (ECS Fargate task or EC2 instance) that holds the WebSocket connection for the call duration and writes transcript segments to an S3 buffer or directly to a CRM webhook as they arrive.
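One piece of that long-running worker is the buffering of transcript segments before they are written out. The sketch below is a minimal version under stated assumptions: `TranscriptBuffer` is a hypothetical helper, the segment shape is arbitrary, and `sink` stands in for whatever writes to S3 or your CRM webhook.

```python
from typing import Callable, List


class TranscriptBuffer:
    """Collect final transcript segments and flush them in batches to a sink."""

    def __init__(self, sink: Callable[[List[dict]], None], max_segments: int = 20):
        self._sink = sink              # e.g. a function that writes a batch to S3
        self._max = max_segments
        self._pending: List[dict] = []

    def add(self, segment: dict) -> None:
        """Queue one segment; flush automatically once the batch is full."""
        self._pending.append(segment)
        if len(self._pending) >= self._max:
            self.flush()

    def flush(self) -> None:
        """Send any queued segments to the sink. Safe to call when empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._sink(batch)
```

Calling `flush()` when the call ends (and on worker shutdown) is what keeps the tail of a long call from being lost.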

Automating post-call transcription for QA

Post-call processing is where Gladia's accuracy advantage over native AWS tooling produces the highest operational ROI. Full-context analysis of a completed recording yields better diarization accuracy (on average 3x lower DER compared to alternatives), more consistent multilingual handling, and structured outputs that integrate with QA scorecards and CRM fields.

1. Pull recorded calls from S3 storage

Configure Amazon Connect to deposit call recordings into a designated S3 bucket by enabling call recording in the contact flow and specifying the S3 output location. Set S3 event notifications on s3:ObjectCreated:* to trigger a Lambda function on each new recording. Your S3 bucket policy must grant the Lambda execution role s3:GetObject on the recordings prefix and s3:PutObject if writing enriched outputs back to S3.
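A minimal sketch of the Lambda side of that trigger is shown below. It only parses the S3 event notification and filters by prefix; the `recordings/` prefix and the returned dictionary shape are illustrative assumptions, and the actual submission to the transcription API is left as a comment.

```python
from urllib.parse import unquote_plus


def recordings_from_event(event: dict, prefix: str = "recordings/") -> list:
    """Return (bucket, key) pairs for newly created objects under the prefix."""
    found = []
    for record in event.get("Records", []):
        # Inside the record, the event name has no "s3:" prefix,
        # for example "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix):
            found.append((bucket, key))
    return found


def lambda_handler(event, context):
    targets = recordings_from_event(event)
    # Each recording would be handed to the async transcription API here.
    return {"queued": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

Decoding the key before use matters in practice: recording file names that contain spaces or special characters will otherwise fail the later `s3:GetObject` call.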

2. Route call audio to the async API

The Lambda function generates a pre-signed S3 URL for the recording and passes it to our async API endpoint. Enable diarization, sentiment analysis, and named entity recognition in a single API call. Set redact_pii to true explicitly for calls requiring PII redaction in line with the regulations covering your industry. Diarization is powered by pyannoteAI Precision-2.
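The two steps in that paragraph can be sketched as follows. `generate_presigned_url` is the standard boto3 S3 client method; the client is passed in so the sketch has no hard dependency on boto3. In the request body, `redact_pii` comes from the text above, while `audio_url`, `diarization`, `sentiment_analysis`, and `named_entity_recognition` are assumed field names to check against the API reference.

```python
def presign_recording(s3_client, bucket: str, key: str, expires_in: int = 3600) -> str:
    """Create a time-limited GET URL for one recording.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


def build_async_request(audio_url: str, redact_pii: bool = False) -> dict:
    """Build a single request body that enables all enrichments at once.

    Field names other than redact_pii are assumptions for illustration.
    """
    body = {
        "audio_url": audio_url,
        "diarization": True,
        "sentiment_analysis": True,
        "named_entity_recognition": True,
    }
    if redact_pii:
        # Set explicitly only for calls that require redaction.
        body["redact_pii"] = True
    return body
```

Keep the pre-signed URL lifetime long enough to cover queueing and download of the recording, but short enough that a leaked URL is of limited use.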

3. Feed interaction data to your CRM

Our async API returns structured JSON with per-segment speaker labels, word-level timestamps, text-based sentiment scores (positive, negative, neutral per utterance), and named entities. The speaker field maps directly to agent vs. customer attribution, letting you populate CRM fields like "customer sentiment trend" or "agent script adherence score" without building an additional NLP layer.
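To make the CRM mapping concrete, the sketch below rolls per-utterance sentiment up by speaker role. The utterance shape (a `speaker` index plus a `sentiment` label) and the `speaker_roles` mapping are hypothetical simplifications of the response, not the documented schema.

```python
from collections import Counter, defaultdict


def sentiment_by_speaker(utterances: list, speaker_roles: dict) -> dict:
    """Count sentiment labels per role (for example agent vs. customer).

    `speaker_roles` maps a speaker index from the transcript to a role name;
    speakers missing from the mapping are grouped under "unknown".
    """
    counts = defaultdict(Counter)
    for utterance in utterances:
        role = speaker_roles.get(utterance["speaker"], "unknown")
        counts[role][utterance["sentiment"]] += 1
    return {role: dict(counter) for role, counter in counts.items()}


# Example with made-up data:
sample = [
    {"speaker": 0, "text": "Thanks for calling.", "sentiment": "neutral"},
    {"speaker": 1, "text": "My card was charged twice.", "sentiment": "negative"},
    {"speaker": 0, "text": "I can fix that right away.", "sentiment": "positive"},
    {"speaker": 1, "text": "This keeps happening.", "sentiment": "negative"},
]
summary = sentiment_by_speaker(sample, {0: "agent", 1: "customer"})
```

A field such as "customer sentiment trend" could then be derived from the customer counts, or from the ordered sequence of customer labels if you need direction over time.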

Use cases for external STT in Amazon Connect

Real-time agent coaching support

Live transcripts fed to an agent assist interface can surface relevant knowledge base articles and scripted responses as the customer speaks. Solaria-1's ~300ms final transcript latency on the real-time WebSocket path keeps the response window tight when combined with a low-latency LLM layer (typically 200-400ms for retrieval-augmented generation).

Scaling QA coverage with transcript data

Most contact centers still rely on manual QA sampling because automated scoring requires accurate source data. With a high-accuracy transcript covering 100% of calls, QA automation shifts from a bottleneck to a coverage multiplier. Full-coverage QA is typically the highest-ROI starting point for contact center automation. Aircall processes 1M+ calls per week through Gladia and cut transcription time by 95%, which directly expands the volume that QA automation can score without adding headcount.

Maintaining multilingual QA accuracy

For BPO operations in Southeast Asia, South Asia, or Latin America, the transcription layer determines whether your QA framework measures agent performance or measures ASR failure rate. Our automatic language detection handles accented speech consistently across languages critical to offshore contact centers, including Tagalog, Bengali, Punjabi, Tamil, Urdu, and Marathi, plus 42 languages unsupported by any other API-level STT competitor. On conversational speech benchmarked across 7 datasets and 74+ hours of audio, Solaria-1 delivers on average 29% lower WER compared to alternatives, with open, reproducible methodology.

Managing transcription access controls

Transcript JSON stored in S3 or written to a CRM database should be protected with role-based access control aligned to your QA and compliance structure. Restrict s3:GetObject on the transcripts prefix to QA platform service accounts and compliance officers, and not to general agent-facing roles. For CRM platforms like Salesforce, map transcript fields to object-level permission sets so that agents view call summaries without accessing raw transcripts containing PII fields. Document your retention schedule explicitly, since our Growth and Enterprise plans support custom data retention policies including zero retention.
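The S3 restriction described above can be expressed as an identity-based IAM policy attached only to the QA and compliance roles. The sketch below generates such a policy document; the bucket name, the `transcripts/` prefix, and the statement ID are placeholders.

```python
import json


def transcripts_read_policy(bucket: str, prefix: str = "transcripts/") -> dict:
    """Build an IAM policy document allowing read access to one prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTranscriptsOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Object ARNs under the prefix; nothing else in the bucket.
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


policy_json = json.dumps(transcripts_read_policy("example-connect-data"), indent=2)
```

Because IAM denies by default, agent-facing roles that never receive this policy cannot read raw transcripts, provided no broader bucket policy grants them access.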

Native vs. specialized STT for AWS workflows

Accuracy with accented and multilingual audio

AWS Transcribe is a general-purpose ASR engine. Accented speech, regional dialects, and non-Latin-script languages can produce higher WER in production contact center audio compared to clean-studio benchmarks. AWS Transcribe includes language identification, though performance on mid-conversation language switches may vary by language pair and audio conditions.

Solaria-1 is purpose-built for real-world conversational audio, not studio recordings. Our CCaaS use case page details the production accuracy profile across BPO-relevant language families.

Governance controls for call metadata

AWS data governance defaults require active navigation to locate model training opt-outs, regional data residency configurations, and call recording retention policies. For financial services or healthcare contact centers, this creates procurement and audit complexity that legal teams must resolve before go-live.

On Growth and Enterprise plans, the data handling defaults are explicit enough to clear procurement without a legal review: customer audio is never used for model training, no opt-out action is required, and data residency is configurable across EU-west and US-west regions. On the Starter plan, data can be used for model training by default. Across all plans, Gladia holds SOC 2 Type II, ISO 27001, HIPAA, GDPR, and PCI certifications.

Controlling per-call unit costs

At 10,000 hours of contact center audio per month, the cost difference between AWS pricing and our all-inclusive Growth plan is substantial. The table below models the feature-equivalent stack (transcription, PII redaction, and sentiment analysis) for 600,000 minutes (10,000 hours) monthly.

| Cost component | AWS Transcribe Streaming | AWS Call Analytics (feature-equivalent) | Gladia Growth (all-inclusive) |
| --- | --- | --- | --- |
| Base transcription (600K min) | ~$6,000* ($0.01/min, transcription only) | ~$18,000† ($0.03/min, all features bundled) | Contact sales for volume pricing |
| Diarization | Not included | Included | Included |
| PII redaction | Not included | Included | Included (when enabled) |
| Sentiment + categorization | Not included | Included | Included |
| Translation | Separate service charge | Separate service charge | Included |

*AWS Transcribe Streaming at $0.01/min × 600,000 minutes. Transcription only: diarization, PII redaction, and sentiment are not included at this tier.
†AWS Call Analytics at $0.03/min × 600,000 minutes (Tier 1). Includes base transcription, PII redaction, sentiment, and categorization in a single bundled rate.
Figures use AWS's published Transcribe pricing and our Growth plan pricing.

If your operation requires full sentiment analysis and categorization through Call Analytics, the AWS Call Analytics tier at approximately $0.03/min includes these features plus PII redaction and base transcription in one bundled rate. Our all-inclusive async rate produces similar structured outputs on the Growth plan.
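The AWS figures in the table follow from simple per-minute arithmetic, reproduced in the sketch below so you can substitute your own volume. It assumes a flat rate for the whole month (no volume tiering), which is a simplification; the rates are the ones quoted above.

```python
from decimal import Decimal

MINUTES_PER_HOUR = 60


def monthly_cost(hours: int, rate_per_minute: str) -> Decimal:
    """Flat-rate monthly cost in USD: hours x 60 x per-minute rate."""
    return Decimal(hours * MINUTES_PER_HOUR) * Decimal(rate_per_minute)


streaming = monthly_cost(10_000, "0.01")        # transcription only
call_analytics = monthly_cost(10_000, "0.03")   # bundled Call Analytics rate
difference = call_analytics - streaming          # cost of the bundled features
```

Using `Decimal` with string rates avoids floating-point rounding in the totals, which matters once these numbers feed a cost-per-contact model.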

Deployment checklist for production

Use this checklist to move from evaluation to production in a structured sequence.

  1. S3 and KVS setup: Create the S3 bucket for call recordings with versioning enabled and configure KVS stream retention to cover your call volume needs.
  2. IAM roles: Create a Lambda execution role with permissions for KVS access and S3 operations. Scope S3 access to the recordings and transcripts prefixes only, and ensure the Connect instance service role has permissions to write to KVS.
  3. Contact flow configuration: In the Connect console, navigate to Data storage, enable live media streaming, and add Start/Stop media streaming blocks to the relevant contact flow.
  4. Lambda trigger for post-call: Configure S3 event notifications on s3:ObjectCreated:* for the recordings prefix, pointing to the Lambda function calling our async API. For the real-time path, where calls may exceed 15 minutes, replace Lambda with a long-running ECS Fargate task to avoid the hard execution timeout.
  5. CRM/QA platform integration validation: Run sample calls covering your top languages and accented speaker profiles. Verify that speaker labels, sentiment scores, and named entities integrate correctly with your CRM fields using the validation request examples in the API docs. Common integration failures stem from Lambda cold starts (use provisioned concurrency), the 15-minute Lambda timeout on long calls (use ECS Fargate), KVS retention mismatches, and missing IAM permissions on the Lambda execution role. Test with representative calls covering your language and accent mix before expanding to full production volume.

Start with 10 free hours and integrate Gladia with Amazon Connect. Test Solaria-1 on your own multilingual audio to see how it handles language detection, accent-heavy speech, and code-switching across your actual BPO call mix.

FAQs

When should you switch from Contact Lens to Gladia?

Switch when your operation crosses any of these thresholds: BPO volume exceeds 10,000 hours per month (where AWS pricing for full-feature analytics can diverge from our all-inclusive rate), WER on non-English agents is causing downstream QA or CRM issues, or you need code-switching support for bilingual agent populations.

What is the latency impact on agent assist?

Our real-time WebSocket path delivers final transcripts at approximately 300ms latency, keeping agent-assist card rendering within 1-2 seconds of the customer completing a sentence when combined with a low-latency card rendering layer. Partial transcripts arrive frequently, enabling speculative matching against knowledge base entries before the utterance is fully complete.

How should Amazon Connect transcripts be stored?

Store the raw JSON transcript responses in S3 with object-level encryption (SSE-S3 or SSE-KMS) and apply a lifecycle policy aligned to your GDPR or HIPAA retention schedule. Write derived fields (summary, sentiment score, entity list) to your CRM or QA platform immediately after processing. On Growth and Enterprise plans, configure our data retention settings to match your internal policy, with options including zero retention.
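As an illustration of the encryption step, the sketch below builds the arguments for a boto3 `put_object` call with server-side encryption set. The parameter names are the standard S3 ones; the key layout `transcripts/{call_id}.json` and the helper name are assumptions.

```python
import json
from typing import Optional


def transcript_put_kwargs(
    bucket: str,
    call_id: str,
    transcript: dict,
    kms_key_id: Optional[str] = None,
) -> dict:
    """Build put_object arguments that store a transcript encrypted at rest.

    Uses SSE-KMS when a key ID is supplied, otherwise SSE-S3 (AES256).
    """
    kwargs = {
        "Bucket": bucket,
        "Key": f"transcripts/{call_id}.json",
        "Body": json.dumps(transcript).encode("utf-8"),
        "ContentType": "application/json",
        "ServerSideEncryption": "aws:kms" if kms_key_id else "AES256",
    }
    if kms_key_id:
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs
```

With a boto3 client in hand, the call would be `s3_client.put_object(**transcript_put_kwargs(...))`. The lifecycle policy for retention is configured on the bucket separately.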

How does Amazon Connect separate agent and caller audio tracks?

Amazon Connect writes dual-channel audio to KVS as two named tracks: AUDIO_FROM_CUSTOMER and AUDIO_TO_CUSTOMER. These are separate media tracks within the same KVS stream, not mixed into a single channel. This native track separation provides clean source data for speaker attribution, and diarization in the async path via pyannoteAI Precision-2 produces accurate per-speaker labeling.

Key terms glossary

Word Error Rate (WER): The standard metric for measuring transcription accuracy. WER is calculated as the number of substitutions, deletions, and insertions required to convert a transcript into the reference text, divided by the total number of words in the reference. In contact center audio, WER is typically higher than in clean-studio benchmarks due to accented speech, background noise, and overlapping speakers.

Diarization Error Rate (DER): A metric measuring the accuracy of speaker diarization: how well a system assigns speech segments to the correct speaker. DER accounts for missed speech, false alarms, and speaker confusion. In multi-speaker contact center recordings, high DER corrupts per-agent sentiment scores and coaching records downstream.

Speaker diarization: The process of partitioning an audio recording into segments by speaker identity, answering "who spoke when." In post-call QA workflows, accurate diarization is the prerequisite for separating agent and customer utterances before sentiment scoring, script adherence checks, and CRM population.

Code-switching: The practice of alternating between two or more languages within a single conversation or utterance. Common in BPO environments serving bilingual populations, code-switching causes standard ASR engines to fragment output or default to the wrong language model mid-call. Native code-switching detection handles this without requiring a session restart.

After Contact Work (ACW): The period immediately following a customer call during which an agent completes wrap-up tasks: updating CRM records, logging call dispositions, and tagging interaction categories. ACW duration is a direct cost driver in contact centers. Accurate post-call transcription reduces ACW by pre-populating CRM fields with structured outputs rather than requiring manual entry.
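The WER definition in the glossary can be made concrete with a short word-level edit distance. The sketch below is a plain dynamic-programming version with simple lowercasing and whitespace tokenization; published benchmarks usually apply heavier text normalization first, so treat it as a way to sanity-check your own samples rather than reproduce benchmark numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped digit in a five-word reference is one deletion out of five words.
wer = word_error_rate("account number four two seven", "account number four seven")
```

The example shows why a low overall WER can still hide costly errors: a single dropped digit is a small fraction of the words but invalidates the whole account number.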
