

Building a Google Meet transcription bot: step-by-step API integration with real-time captions

Published on April 1, 2026
Ani Ghazaryan

Build a Google Meet transcription bot by combining audio capture via a Playwright headless browser with real-time STT API integration, and ship it in under a week.

TL;DR: Building a Google Meet transcription bot requires solving two separate problems: audio capture via a headless browser (Playwright) and audio processing via an STT API. The integration itself takes less than a day. Where teams lose weeks is evaluating the wrong things, specifically headline rates instead of total cost at scale, English Word Error Rate (WER) instead of multilingual accuracy, and feature lists instead of what is actually included in the base price. This guide covers both the code and the commercial architecture you need to model costs accurately and ship in under a week.

Engineering teams often spend three months building a Google Meet transcription bot, only to find their unit economics break the moment they enable speaker diarization at scale. The bot-joining logic is the easy part. The hard part is choosing an STT engine that holds its accuracy on accented speakers, handles mid-conversation language switches, and bills you at the same rate whether you enable diarization or not.

This guide breaks down how to build a Google Meet transcription bot that handles multilingual audio, returns transcripts in under 300ms, and costs as little as $0.25 per hour on the Growth plan with all features included. We cover the architecture of audio capture, how to stream audio via WebSocket, and how to evaluate the underlying speech-to-text engine for latency, multilingual accuracy, and predictable pricing at scale.

The architecture of a Google Meet transcription bot

A Google Meet transcription bot that runs reliably at scale is a two-layer system. The capture layer joins the meeting and routes the audio stream. The transcription layer receives that stream and returns formatted JSON with speaker labels, language tags, word-level timestamps, and confidence scores.

Each layer has its own failure modes. A capture layer that works flawlessly locally can lose audio when the meeting host mutes participants, when the network drops a packet, or when Chrome updates its audio permission model. Your STT engine can fail separately through hallucinations on silence, accuracy regressions in non-English languages, or WebSocket timeouts on long calls. Designing these layers independently lets you swap or debug either without touching the other.

Bot-based vs. bot-free options

Chrome extensions can intercept audio from the active tab, but they stop working the moment the user closes the browser or navigates away. For a reliable, server-side recording workflow, the main options are a headless browser bot, a desktop recording SDK, or an official meeting platform API where available. Each has different trade-offs in setup complexity, latency, and platform coverage.

The headless browser approach documented by Recall.ai's engineering team uses a Playwright bot to join the call, enable captions, and read the caption container directly from the DOM. The same pattern is available as an open-source reference in the Recall.ai Google Meet bot repository on GitHub.
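A quirk of that DOM-reading pattern is that Meet rewrites the caption node as a sentence grows, so the scraper sees overlapping strings and must diff successive reads. The helper below is an illustrative sketch of that diffing step only: the snapshots list stands in for real reads of the caption container, and no selector or element name here is a stable Google Meet API.

```python
def caption_delta(previous: str, current: str) -> str:
    # Successive DOM reads overlap as the caption node is rewritten in
    # place; return only the unseen suffix, or the full text on a reset
    # (e.g. a new speaker starts a fresh caption).
    if current.startswith(previous):
        return current[len(previous):]
    return current

# Simulated sequence of caption-container reads
snapshots = ["Let's", "Let's confirm", "Let's confirm the budget", "New speaker"]
seen = ""
deltas = []
for snap in snapshots:
    deltas.append(caption_delta(seen, snap))
    seen = snap
# deltas now holds only the newly appended text from each read
```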

Recall.ai raised a $38M Series B at a $250M valuation, led by Bessemer Venture Partners, which reflects both the scale of demand for meeting capture infrastructure and the complexity of building it reliably across platforms. For most product teams, using a managed capture API or the headless browser approach is the faster path to production than building and maintaining a custom capture layer across Google Meet, Zoom, and Microsoft Teams simultaneously.

Evaluating the STT engine for multilingual accuracy and cost at scale

The benchmark that matters is not what the vendor posts on their marketing page but what your WER looks like on your actual audio: accented speakers, overlapping voices, and mid-conversation language switches.

Accuracy benchmarks and hallucination mitigation

Most STT vendor documentation publishes WER figures for clean English audio, but if your product serves Finnish, Swedish, Bengali, or Tagalog speakers, that benchmark tells you almost nothing about production performance.

Our Solaria-1 universal STT model covers 100+ languages, including 42 languages not supported by any other API-level STT provider. The model is evaluated against Mozilla Common Voice and Google FLEURS datasets, covering diverse accents and audio conditions, not curated studio recordings.

Hallucinations are a separate problem from WER. Vanilla Whisper generates text that was never spoken, particularly on silence or low-signal audio. According to our AWS Marketplace listing, we eliminate up to 99% of hallucinations compared to vanilla Whisper, which directly affects how much post-processing your pipeline needs before results reach end users.

"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production." - Verified user review of Gladia

Data governance and compliance

If your product processes meeting recordings from enterprise customers, data governance is a contract-level conversation before it becomes a product one. The relevant questions are whether the vendor retrains their model on your audio by default, which tier opts you out of that, and whether their data residency fits your GDPR obligations.

We don't use customer audio to retrain models at any tier without a separate commercial agreement. That is a default position, not an enterprise-only opt-out. Our service is GDPR-compliant and SOC 2 Type 2-attested, with EU-west and US-west region options. For EU-domiciled products, this removes a common bottleneck in legal review.

"It's based in EU so it fits our GDPR compliance requirements... The product works great. They're improving the product continuously." - Verified user review of Gladia

Unit economics and pricing predictability

Add-on pricing compounds the way interest does: each feature priced separately makes the total bill harder to model, especially at scale. At 1,000 hours per month, the difference between all-inclusive and metered pricing is material.

The table below shows a cost model for 1,000 hours of real-time audio with audio intelligence features enabled, based on our published pricing.

Provider          Base rate (real-time)           Audio intelligence    Cost at 1,000 hrs
Gladia (Growth)   $0.25/hr                        Included              ~$250
Deepgram          Separate per-feature pricing    Token-based           Variable (see pricing page)
AssemblyAI        Separate per-feature pricing    Separate add-ons      Variable (see pricing page)

For contact center workloads with short, variable-length calls, that difference accumulates across thousands of calls per month, and every separately metered feature makes the effective rate harder to project at 10x or 100x current volume.
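The arithmetic behind that table is simple enough to sanity-check in a few lines. In the sketch below, the $0.25/hr Growth rate comes from the published pricing above; the metered base rate and add-on rates are hypothetical, for illustration only:

```python
def monthly_cost(hours, base_rate, addon_rates=()):
    """Monthly STT bill: hours times the effective per-hour rate,
    where the effective rate is the base rate plus any per-feature add-ons."""
    return hours * (base_rate + sum(addon_rates))

# All-inclusive pricing: a single published rate to model
gladia_growth = monthly_cost(1000, 0.25)          # $250

# Metered add-ons (hypothetical rates): each feature shifts the
# effective rate, and projections drift as features are enabled
metered = monthly_cost(1000, 0.15, (0.08, 0.05))  # ~$280 effective
```

The point of the model is less the totals than the shape: the all-inclusive case has one variable, while the metered case adds one per feature, which is what makes it harder to project at 10x volume.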

Prerequisites for the transcription bot

Before writing a line of code, confirm you have the following in place:

  1. Python installed locally or in your CI environment, along with Playwright installed via pip install playwright and browser binaries initialized via playwright install chromium.
  2. Google Workspace account with Meet access enabled. For the Playwright bot approach in this guide, the bot authenticates as a browser user rather than through a Google API OAuth flow, so a standard Google account with Meet access is sufficient. If your bot needs to join meetings on behalf of Workspace users, you will also need service account delegation configured in the Google Admin console. If you choose to authenticate the bot as a specific user via the Google API, OAuth 2.0 credentials are required.
  3. Gladia API key generated from app.gladia.io, which includes 10 hours per month on the free tier with no credit card required.

The complete implementation, including the Python backend and React frontend, is documented in the Gladia Google Meet bot tutorial. You can also watch a no-code playground walkthrough to verify API behavior before writing any integration code.

Step 1: Joining the Google Meet call automatically

The Playwright bot navigates to the Meet URL, handles the pre-join screen, and mutes its own microphone before entering to avoid audio feedback. Meeting participants must be notified that a bot is joining and recording, as consent requirements vary by jurisdiction.

# Adapted from: https://gladia.io/blog/how-to-build-google-meet-transcription-bot-with-python-react-and-gladia-api
import asyncio
from playwright.async_api import async_playwright

async def join_meeting(meet_url: str):
    # Start Playwright without a context manager so the browser outlives
    # this function and can be closed by the caller after the meeting ends
    p = await async_playwright().start()
    browser = await p.chromium.launch(
        headless=True,
        args=[
            "--use-fake-ui-for-media-stream",  # Grants media permissions silently
            "--disable-blink-features=AutomationControlled",
        ]
    )
    context = await browser.new_context(
        permissions=["microphone", "camera"]
    )
    page = await context.new_page()
    await page.goto(meet_url)

    # Dismiss the pre-join screen
    # Note: Selector may vary with Google Meet UI updates
    await page.wait_for_selector('[data-promo-anchor-id="join-button"]')
    await page.click('[data-promo-anchor-id="join-button"]')

    return page, browser

The --use-fake-ui-for-media-stream flag prevents Chromium from blocking on the browser-level permission dialog in a headless environment, which would otherwise stall the join flow with no user to click through.

Step 2: Capturing and streaming meeting audio

After the bot joins the call, you need to route the browser's audio output to a buffer that can be read and streamed to the STT API. The approach here reads PCM frames at 100 ms intervals.

# Source: https://docs.gladia.io/api-reference/v2/live/init
import pyaudio

SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 1600  # 100ms of audio at 16kHz

def open_input_stream():
    # Open a 16-bit mono input stream matching the session configuration
    pa = pyaudio.PyAudio()
    return pa.open(format=pyaudio.paInt16, channels=CHANNELS,
                   rate=SAMPLE_RATE, input=True,
                   frames_per_buffer=CHUNK_SIZE)

def capture_audio_chunk(stream):
    return stream.read(CHUNK_SIZE, exception_on_overflow=False)

Solaria-1 returns partial transcripts at 103-270 ms, with final transcripts at approximately 698 ms for a 3-second utterance. That leaves adequate headroom for network round-trip and rendering within a sub-second user-facing experience.
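Those figures slot directly into the latency-budget arithmetic (audio buffer + inference + network + render). The inference figure below comes from the numbers above; the network and render values are assumptions for illustration, not measurements:

```python
# Latency budget for a live caption, in milliseconds
audio_buffer = 100   # 100ms capture chunks, from the step above
inference = 270      # average partial-transcript latency
network_rtt = 80     # assumed round-trip to the STT endpoint
render = 30          # assumed UI render time

total_ms = audio_buffer + inference + network_rtt + render
print(total_ms)  # 480, well inside a sub-second target
```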

Watch the Gladia Real-Time Webinar for a live demonstration of how the latency budget plays out under different network and audio conditions.

Step 3: Integrating real-time speech-to-text

The Gladia live transcription API follows a two-step pattern: initialize a session via REST to receive a WebSocket URL, then connect and stream audio chunks.

# Source: https://docs.gladia.io/api-reference/v2/live/init
import aiohttp
import asyncio
import json

GLADIA_API_KEY = "your_api_key_here"

async def init_gladia_session():
    async with aiohttp.ClientSession() as session:
        payload = {
            "encoding": "wav/pcm",
            "sample_rate": 16000,
            "bit_depth": 16,
            "channels": 1,
            "language_config": {
                "languages": [],  # Empty enables auto-detection
                "code_switching": True
            }
        }
        headers = {"X-Gladia-Key": GLADIA_API_KEY}
        async with session.post(
            "https://api.gladia.io/v2/live",
            json=payload,
            headers=headers
        ) as response:
            data = await response.json()
            return data["url"]  # wss://api.gladia.io/v2/live?token=...

# Source: https://docs.gladia.io/chapters/speech-to-text-api/pages/live-speech-recognition
import websockets

async def stream_audio(ws_url: str, audio_queue: asyncio.Queue):
    async with websockets.connect(ws_url) as ws:
        async def send_audio():
            while True:
                chunk = await audio_queue.get()
                await ws.send(chunk)

        async def receive_transcripts():
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "transcript":
                    utterance = data["data"]["utterance"]
                    print(f"[{utterance['language']}] {utterance['text']}")

        await asyncio.gather(send_audio(), receive_transcripts())

Setting code_switching: true and leaving the languages array empty tells our Solaria-1 model to detect the active language on each utterance automatically, with no need to specify languages in advance. The live audio intelligence docs cover the full configuration options for this session type.

"Gladia delivers precise speech-to-text transcriptions with reliable timestamps, making it perfect for downstream tasks. It saves time and ensures smooth integration into our workflows." - Verified user review of Gladia

Step 4: Handling multilingual speakers and code-switching

Global meetings do not follow monolingual scripts. A product manager in Montreal switches from English to French to make a point more clearly. A support call in Singapore moves between English and Mandarin. Many STT APIs return garbled output at the language boundary or require you to specify languages in advance, which breaks when speakers switch unexpectedly.

Our Solaria-1 model handles code-switching across all 100 supported languages in both real-time and async modes. Each utterance in the JSON response includes a language field identifying the detected language, so your downstream pipeline can segment, route, or translate by language without additional processing.

// Source: https://docs.gladia.io/chapters/speech-to-text-api/pages/live-speech-recognition
{
  "type": "transcript",
  "data": {
    "utterance": {
      "text": "Let's confirm the budget numbers",
      "language": "en",
      "speaker": "speaker_1",
      "words": [...]
    }
  }
}

{
  "type": "transcript",
  "data": {
    "utterance": {
      "text": "D'accord, on confirme pour vendredi",
      "language": "fr",
      "speaker": "speaker_2",
      "words": [...]
    }
  }
}
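Given per-utterance payloads like the two above, downstream segmentation is a small amount of code. The helper below is an illustrative sketch (not part of any SDK) that groups consecutive utterances by detected language so each run of speech can be routed to a per-language translation or NLP step:

```python
def segment_by_language(utterances):
    """Group consecutive utterances that share a detected language."""
    segments = []
    for u in utterances:
        if segments and segments[-1]["language"] == u["language"]:
            # Same language as the previous utterance: extend the segment
            segments[-1]["texts"].append(u["text"])
        else:
            # Language switch: start a new segment
            segments.append({"language": u["language"], "texts": [u["text"]]})
    return segments

utterances = [
    {"text": "Let's confirm the budget numbers", "language": "en"},
    {"text": "D'accord, on confirme pour vendredi", "language": "fr"},
]
segments = segment_by_language(utterances)
# segments -> one "en" segment followed by one "fr" segment
```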

Speaker diarization is powered by pyannoteAI's Precision-2 model. The speaker diarization documentation covers how speaker labels are assigned and how to configure the minimum number of speakers for your use case. Watch the Gladia x pyannoteAI webinar for a deeper walkthrough of how the diarization pipeline works in production.

"Gladia deliver real time highly accurate transcription with minimal latency, even accross multiple languages and ascents, The API is straightforward and well documented, Making integration into our internal tools quick and easy." - Verified user review of Gladia

Step 5: Generating actionable insights from transcripts

When the session closes, persist the final transcript payload to a database or object storage layer before any downstream processing begins. Applying your retention policies at ingestion time keeps the structured output available for compliance audits, legal discovery, and long-term reporting, alongside everything else your product needs: summaries, action items, sentiment scores, named entities, and CRM sync data.

The post-transcript callback fires at the end of the session with the complete structured output. For async processing, you can also call the async transcription init endpoint with a recording URL and receive the same structured output including the full audio intelligence suite.

# Source: <https://docs.gladia.io/chapters/audio-intelligence/audio-to-llm>
def process_transcript_output(transcript_data: dict):
    summary = transcript_data.get("audio_intelligence", {}).get("summary", "")
    action_items = transcript_data.get("audio_intelligence", {}).get("chapters", [])

    return {
        "summary": summary,
        "action_items": [chapter["headline"] for chapter in action_items]
    }

We include summarization, chapterization, named entity recognition, and sentiment analysis in the base hourly rate. None of these require a separate API call or a separate billing line. The audio-to-LLM documentation covers how to use custom prompts against the transcript for freeform extraction beyond the built-in intelligence features.

For a full reference implementation including the React frontend for displaying live captions and post-meeting summaries, see the Gladia Google Meet bot blog post and the real-time React integration video.

Deploying to production and modeling costs

Infrastructure checklist

Before your bot goes live, confirm the following:

  • WebSocket reconnection logic: Handles dropped connections without losing buffered audio.
  • Audio encoding: Matches the session configuration (16kHz PCM by default, or 8kHz for Twilio-sourced audio).
  • Concurrent session limits: Match your expected meeting volume. Our Scaling plan supports flexible concurrency beyond the 30 concurrent real-time sessions on the Self-Serve plan.
  • Data residency: Configured for EU-west or US-west depending on your customer contracts.
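The first checklist item, reconnection without losing buffered audio, can be prototyped as a small send buffer that retains unacknowledged chunks so a reconnect can replay them. The class below is an illustrative sketch, not part of the Gladia client:

```python
import collections

class AudioSendBuffer:
    """Stages audio chunks until their send is confirmed, so a WebSocket
    reconnect can replay anything that may have been lost in flight."""

    def __init__(self, max_chunks: int = 300):  # ~30s of 100ms chunks
        self._pending = collections.deque(maxlen=max_chunks)

    def stage(self, chunk: bytes) -> None:
        self._pending.append(chunk)   # call before ws.send()

    def ack(self) -> None:
        if self._pending:
            self._pending.popleft()   # call after a successful send

    def resend_queue(self) -> list:
        return list(self._pending)    # replay these on reconnect
```

Capping the deque bounds memory if the connection stays down, at the cost of dropping the oldest audio, which is usually the right trade for live captions.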

Cost model at 1,000 hours per month

At 1,000 hours of real-time audio per month with all features enabled, our Growth plan comes to about $250, based on our published rate of $0.25/hr. That figure includes NER, sentiment analysis, summarization, and code-switching across 100 languages. There are no add-on fees to layer on top. The Starter plan at $0.75/hr real-time offers the same feature set on a pay-as-you-go basis with no volume commitment.

By contrast, a stacked pricing model where diarization and audio intelligence are metered separately introduces a compounding variable for every feature, which makes the effective per-hour rate much harder to project accurately at 10x or 100x current volume.

If you're migrating from an existing STT provider, we publish step-by-step migration guides from AssemblyAI and from Deepgram with code-level diffs for the most common integration patterns.

Start with 10 free hours and have your integration in production in less than a day. Get started at gladia.io with no credit card required, or view the live transcription API reference to review the full WebSocket initialization options before you begin.

FAQs

What is the maximum audio file size for async transcription?

Our async API accepts files up to 1,000 MB and a maximum duration of 135 minutes per request, as documented on the Gladia pricing page.

How many concurrent real-time WebSocket sessions does the Growth plan support?

The Growth plan supports flexible concurrency beyond the 30 concurrent real-time sessions. Contact the Gladia team for specific limits at your target volume.

Does Gladia retrain its models using customer audio?

On paid plans (Growth, Enterprise), customer audio is never used for model retraining, and no opt-out is required. On the free tier, audio may be used for model training, as disclosed on the security page.

What latency should I budget for real-time captions in a meeting bot?

Solaria-1 returns partial transcripts in 103ms at best case and 270ms on average, with final transcripts at approximately 698ms for a 3-second utterance. Allow 50-100ms additional for network round-trip and rendering.

Can the bot handle participants switching languages mid-sentence?

Yes. Setting code_switching: true in the session configuration enables automatic language detection per utterance across all 100 supported languages, with no manual language selection required.

Key terms

Word error rate (WER): The standard metric for STT accuracy, calculated as the number of substitutions, deletions, and insertions divided by the total number of words in the reference transcript. Must be interpreted with reference to a specific language, audio condition, and normalization method to be meaningful.
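That formula is mechanical once you have an alignment against the reference transcript. A minimal sketch, with illustrative error counts:

```python
def wer(substitutions: int, deletions: int, insertions: int, ref_words: int) -> float:
    """Word error rate: (S + D + I) / N over the reference transcript."""
    return (substitutions + deletions + insertions) / ref_words

# 2 substitutions, 1 deletion, 1 insertion against a 40-word reference
print(wer(2, 1, 1, 40))  # 0.1, i.e. 10% WER
```

Note that the counts themselves depend on text normalization (casing, punctuation, number formatting), which is one reason vendor WER figures are not directly comparable.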

Diarization: The process of segmenting an audio stream by speaker identity, answering "who spoke when." In our pipeline, this is powered by pyannoteAI's Precision-2 model and produces per-utterance speaker labels in the transcript output.

Code-switching: A linguistic pattern where a speaker alternates between two or more languages within a single conversation or sentence. Many STT engines return corrupted output at the language boundary or require languages to be specified in advance.

WebSocket: A protocol that holds a persistent, bidirectional connection between client and server, enabling continuous audio chunk delivery and real-time transcript return without the overhead of repeated HTTP handshakes.

Latency budget: The maximum allowable delay between audio capture and transcript display, typically calculated as the sum of audio buffer time, network round-trip, inference time, and rendering time.

SOC 2 Type 2: A compliance attestation issued by an independent auditor (a licensed CPA) confirming that a service organization's security controls operated effectively over a defined observation period, typically 6-12 months.

Hallucination: In STT systems, text generated by the model that was never spoken in the source audio. Most common on silence, low-signal audio, or heavily accented speech where the model fills uncertainty with plausible-sounding output.

Total cost of ownership (TCO): The full cost of using a speech-to-text API at production scale, including base transcription rates, per-feature add-on fees, billing granularity overhead (per-second vs. per-15-second blocks), and engineering time for integration and maintenance.
