TL;DR: Google's native Meet transcription saves a text file to Drive with no real-time API access, no programmatic diarization, and a 30-day deletion window, making it unusable for product teams. Your architectural choice is between a headless browser bot, a Chrome extension using chrome.tabCapture, or the Google Meet Media API (Developer Preview with strict enrollment). All three require a production-grade STT backend. This guide is published by Gladia and covers our API as the STT backend: we deliver highly accurate transcription with code-switching across 100+ languages, diarization included in the base rate, and enterprise-grade security and compliance by default.
Meeting audio is among the most valuable unstructured data your product can capture, and Google Meet's native tooling is built to keep it locked inside Google's ecosystem. If you're building a product that needs speaker attribution, multilingual accuracy, real-time coaching, or searchable meeting archives, a text file saved to the meeting organizer's Drive is a structural dead end.
Building Google Meet transcription at the product level means solving two separate problems: audio capture and audio intelligence. The capture method determines where the audio originates and how it reaches your backend. The STT API determines whether the transcript is accurate, fast, and cost-effective at the volume your product will eventually reach. This guide covers both.
The limitations of Google Meet's native transcription for product teams
Google Meet offers built-in captions and a post-meeting transcript, but both are designed for individual compliance, not product integration. According to the developer documentation from Nylas (a unified API platform that gives developers access to email, calendar, and contacts across multiple providers, including Google, through a single integration), there is no programmatic transcription API in Meet: someone has to click the UI button every time, with no workaround.
The structural constraints compound quickly:
No real-time stream access: Native captions render in the browser but are not exposed via any WebSocket or REST endpoint your application can consume.
Processing delays: Post-meeting transcripts can take more than 45 minutes to generate, as documented by Nylas, which makes any automation requiring immediate availability unreliable.
Data locked in Drive: The Google Meet REST API artifacts guide confirms transcripts are stored as a Google Doc in the meeting organizer's Drive, not in your system. Fetching them requires Drive API access scoped to that user's account.
30-day deletion window: Per the Meet API artifacts documentation, transcript entries are deleted 30 days after the conference ends, which breaks any long-term analytics use case.
No speaker diarization output: Native transcription does not return structured speaker attribution data suitable for downstream processing.
Chat messages excluded: The Meet API does not include chat alongside spoken words, as confirmed in the artifacts guide.
For any team building a meeting intelligence product, these constraints mean you're not evaluating whether to use Google's native transcription: you're choosing which alternative architecture to build.
The three architectural approaches to Google Meet transcription
Three methods exist for capturing audio from Google Meet outside the native transcript: a headless browser bot that joins as a participant, a Chrome extension that captures tab audio client-side, or the official Google Meet Media API. Each solves a different set of constraints and introduces its own trade-offs at the infrastructure layer.
Method 1: The "Bot" approach (headless browser automation)
A headless browser bot joins the Google Meet as a named participant, routes the tab's audio to a virtual sink using PulseAudio or snd-aloop, and streams that audio to your STT backend. Puppeteer is the most common implementation layer: a Node.js library developed by Google's Chrome DevTools team that controls Chrome or Chromium over the DevTools Protocol, running in headless mode by default.
Our Google Meet bot implementation guide covers the full Python and React integration. At a high level, the bot flow works like this:
A headless Chrome instance navigates to the Meet URL.
Browser arguments disable audio output to prevent feedback loops.
A virtual audio driver captures the tab's mixed audio stream.
ffmpeg taps the virtual sink and pipes audio chunks to your STT API.
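On a Linux bot host, the virtual-sink plumbing sketched above typically looks like the following. This is a sketch assuming PulseAudio: the sink name meet_sink is arbitrary, and exact flags vary by distribution and ffmpeg build.

```shell
# Create a virtual sink so the headless browser's audio has somewhere to go.
# "meet_sink" is an arbitrary name; any unused sink name works.
pactl load-module module-null-sink sink_name=meet_sink

# Point the browser's audio output at the virtual sink by making it the
# default output for processes launched from this shell.
export PULSE_SINK=meet_sink

# Tap the sink's monitor source and emit 16 kHz mono 16-bit PCM to stdout,
# ready to chunk and forward to the STT WebSocket.
ffmpeg -f pulse -i meet_sink.monitor -ac 1 -ar 16000 -f s16le -
```

The monitor source (`meet_sink.monitor`) is the key trick: it exposes whatever the browser plays into the sink as a capturable input device.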
What works well:
No end-user installation: Users click a meeting link and your bot joins automatically.
Platform-agnostic: The same architecture works for Zoom, Teams, or any browser-based meeting tool.
Full pipeline control: You own the audio capture, routing, and quality settings end-to-end.
What breaks in production: Bot reliability and maintenance costs are real and ongoing.
Authentication fragility: Login sessions expire and repeated automation can trigger CAPTCHAs, which teams handle through credential rotation, OAuth token-based authentication, service accounts, or CAPTCHA-solving integrations depending on their setup.
UI brittleness: Join flows vary between meetings with different approval paths, preview screens, and permission prompts, and small UI changes break the CSS selectors used to find buttons or captions.
Infrastructure cost: Each bot runs a full browser process, consuming significant CPU and RAM for video rendering even when only audio is needed, making scaling expensive at 100+ concurrent sessions.
The bot approach is the right choice when you need to support users who aren't willing to install anything, or when you're building a product that monitors meetings across multiple platforms from a central service. The failure modes in production are well-documented and worth reviewing before committing to this architecture.
Method 2: The browser extension approach (client-side capture)
A Chrome extension using the chrome.tabCapture API captures a MediaStream containing the audio from the active Meet tab, then streams that audio directly to your STT endpoint. Because the compute happens on the user's machine, you eliminate the server-side cost of running a headless browser per session.
The API constraint is important: chrome.tabCapture.capture requires user action to start, such as clicking the extension's action button. If you only need audio, you can obtain the stream directly in the extension popup, and recording stops when the popup closes. A working pattern using chrome.tabCapture alongside the WebExtension screen capture API gives you a clean audio buffer ready to chunk and send over WebSocket.
What works well:
Zero server-side rendering cost: Audio capture happens on the user's machine, eliminating the infrastructure expense of running a browser per session.
Source-quality audio: You capture the audio before any network compression or relay, which improves downstream transcription accuracy.
Front-end integration: Chrome APIs are straightforward for engineers already familiar with browser extension development.
What breaks in production:
User activation friction: The extension requires manual installation from the Chrome Web Store and must be started by the user for each session.
Browser lock-in: It doesn't work on Firefox or Safari without separate packaging, and it's entirely ineffective for users on desktop Zoom or Teams apps.
API maintenance surface: Chrome API updates require ongoing compatibility monitoring, and tabCapture behavior can change across browser versions in ways that break your integration without warning.
The extension approach is optimal when your users are already browser-native and you can make installation part of your onboarding flow. The lower infrastructure cost makes it attractive for teams modeling unit economics, because you shift compute to the client rather than paying for server-side browser instances.
Method 3: Google Meet Media API (the official but complex route)
The Google Meet Media API is the only official path to a raw audio stream from Google Meet, and it comes with significant access constraints. It's available through the Google Workspace Developer Preview Program, which requires enrollment for your Google Cloud project, your OAuth principal, and all conference participants.
The API overview documentation confirms what the API can do: consume audio streams from Meet conferences and feed them into your own transcription service. What it cannot do is equally important: there are no recordings, transcripts, or chat available through this API. You get the raw stream, not any processed output.
The eligibility constraint is the sharpest limitation for most product teams. Consent requires meeting host rights, meaning Meet Media API apps are only permitted into a call if someone with organizational admin rights to the meeting approves access. For a B2B SaaS product selling to customers across many workspaces, this is a structural blocker.
The Google Workspace Developer Preview enrollment page covers the application process as documented at the time of writing; verify the current enrollment status and requirements directly with Google, as the program and its conditions may have changed since the 2023 announcement. No public pricing tier is documented, and the API remains in Developer Preview. Given the enrollment friction and the consent model, most product teams land on the bot or extension approach for their v1, with the Media API reserved for enterprise configurations where the product owns the workspace.
Selecting the right speech-to-text API for your pipeline
Once you've resolved the audio capture layer, you're making a second distinct decision: which STT API processes that audio. The capture method determines where audio originates and how it gets to your backend. The STT API determines whether the transcript is accurate enough, fast enough, and affordable enough at the volume your product will eventually run.
Accuracy benchmarks for meeting audio
WER matters most when tested on audio that resembles your production conditions. Clean studio recordings don't predict performance on Google Meet calls with background noise, microphone artifacts, and overlapping speakers.
Independent benchmarks at artificialanalysis.ai include WER comparisons across multiple models on standardized test sets. For meeting-specific accuracy in production, Claap reported below 3% WER on their pipeline, which processes the same noisy, multilingual conditions your Google Meet integration will encounter. Claap also reports one hour of video transcribed in under 60 seconds, which speaks to throughput on async workloads.
When you evaluate any STT API, specify the language and audio condition alongside the WER figure. A score on curated audio does not transfer to meeting recordings with HVAC noise, cross-talk, and sub-optimal microphones.
Real-time latency vs. accuracy trade-offs
For live coaching, agent assist, or in-meeting summaries, you need transcripts that arrive while the conversation is still happening, which requires a WebSocket connection to your STT backend rather than a batch upload after the meeting ends.
We use WebSocket connections to deliver real-time, bidirectional communication between the client and server, reducing network overhead compared to repeated REST polling and keeping latency consistent across variable network conditions. That said, for meeting assistant use cases latency is often less critical than stability: most teams rely on async transcription pipelines, and user feedback shows that waiting a few seconds, or even minutes, for stable, accurate meeting notes is perfectly acceptable. This ensures reliable transcripts without compromising the overall user experience.
The hybrid partial-and-final approach is what makes live captions usable. Partial transcripts surface immediately as speech arrives, which lets you render live captions. Final transcripts are processed by a larger model that rewrites the result retrospectively after detecting a natural endpoint, producing higher accuracy without requiring the user to wait for the full utterance to complete. Our Node.js WebSocket integration guide covers how to handle both partial and final transcript events.
Users running high-volume pipelines confirm this holds at production scale:
"Gladia provides a highly accurate real-time speech-to-text solution for high volumes of support and service calls. Latency is low and accuracy high, even for numericals. We've appreciated the quality of support across pre-processing, post-processing, and model optimization." - Verified user on G2
For post-meeting async transcription, the latency budget is less constrained, but accuracy requirements are often higher because users will read and act on the final transcript. Our pre-recorded transcription endpoint handles batch jobs, and you configure features like diarization, sentiment analysis, and named entity recognition in the same API call.
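As a sketch of what that single-call configuration might look like, the request-body builder below assumes the field names sentiment_analysis and named_entity_recognition; check the pre-recorded endpoint reference for the exact parameter names.

```javascript
// Sketch of a pre-recorded (async) transcription request body.
// `diarization` mirrors the live session config shown later in this guide;
// `sentiment_analysis` and `named_entity_recognition` are assumed field
// names used here for illustration.
function buildBatchRequest(audioUrl) {
  return {
    audio_url: audioUrl,
    diarization: true,
    sentiment_analysis: true,        // assumed field name
    named_entity_recognition: true,  // assumed field name
  };
}

const body = buildBatchRequest('https://example.com/meeting.wav');
console.log(JSON.stringify(body));
```

The point of bundling these flags in one call is operational: one job, one result payload, no second pipeline to reconcile speaker labels with sentiment tags.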
Handling multilingual audio and code-switching
Global teams don't stay in one language. A bilingual support rep in Singapore might open a call in English and shift to Mandarin mid-sentence. A Latin American sales team might mix Spanish and Portuguese depending on the customer. Standard models fail here because they lock to the language detected at session start and produce substitution errors whenever the speaker switches.
We built code-switching across 100+ languages into our API, with recognition adapting continuously across the audio stream without resetting when the language changes mid-sentence.
Engineers who've tested this in production confirm the behavior holds:
"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production" - Yassine R. on G2
Another user noted multilingual detection as their primary production use case:
"Gladia delivers real-time highly accurate transcription with minimal latency, even across multiple languages and accents. The API is straightforward and well documented, making integration into our internal tools quick and easy." - Faes W. on G2
If your product serves users in more than two countries, WER on accented audio is not a secondary concern. It's the metric your support team will track through complaint volume when the product reaches those markets.
Step-by-step: Building a basic transcription pipeline
The implementation steps below cover the extension-to-API path because it offers the fastest time to production for teams with browser-based users and avoids the server-side infrastructure costs of the bot approach. The bot architecture uses the same STT layer but adds the headless browser setup upstream.
Capture the audio stream. In your Chrome extension, call chrome.tabCapture.capture when the user activates transcription. This returns a MediaStream containing the Meet tab's mixed audio output.
Initialize the Gladia live transcription session. Send a POST request to https://api.gladia.io/v2/live with your API key and session configuration, as documented in the live transcription init endpoint reference:
```javascript
const response = await fetch('https://api.gladia.io/v2/live', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-gladia-key': '<YOUR_API_KEY>',
  },
  body: JSON.stringify({
    encoding: 'wav/pcm',
    sample_rate: 16000,
    language_config: {
      languages: [],        // empty list enables automatic language detection
      code_switching: true,
    },
    diarization: true,
  }),
});
const { id, url } = await response.json();
```
Open the WebSocket connection. Use the url returned from the init call to establish the WebSocket. The connection stays open for the duration of the meeting.
Stream audio chunks. Convert the MediaStream to PCM audio using the Web Audio API's AudioWorkletProcessor, then send chunks over the WebSocket as binary frames. Per our live transcription migration guide, you can send audio as bytes or base64 and the API detects the format automatically.
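The float-to-PCM conversion at the heart of this step is a small pure function, independent of the AudioWorklet plumbing. A minimal sketch matching the wav/pcm, 16 kHz session config above:

```javascript
// Convert Web Audio float samples (range -1..1) to 16-bit signed PCM,
// the wav/pcm encoding declared in the session init.
function floatTo16BitPCM(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp, then scale: -1 maps to -32768, +1 maps to 32767.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

const pcm = floatTo16BitPCM(Float32Array.from([0, 1, -1, 0.5]));
console.log(Array.from(pcm)); // -> [0, 32767, -32768, 16383]
```

The resulting Int16Array's underlying buffer is what you send over the WebSocket as a binary frame.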
Handle partial and final transcripts. Listen for incoming WebSocket messages and parse the JSON payload. The example below uses the ws Node.js library for the WebSocket client; adapt the event listener syntax to match your WebSocket implementation:
```javascript
// Using the 'ws' Node.js library: npm install ws
websocket.on('message', (data) => {
  const event = JSON.parse(data);
  if (event.type === 'transcript' && event.transcription) {
    if (event.data?.is_final) {
      // Final transcript: trigger downstream processing
      storeFinalTranscript(event.transcription);
      if (event.diarization) {
        processSpeakerSegments(event.diarization.speakers);
      }
    } else {
      // Partial transcript: update live captions
      updateLiveCaptions(event.transcription);
    }
  }
});
```
Full WebSocket event schemas are in our live transcription API reference.
Route to downstream intelligence. Final transcripts feed downstream features: diarized speaker segments, sentiment tags, or LLM summarization via the audio-to-LLM endpoint.
You can watch the full flow in our real-time transcription playground walkthrough, which covers the setup without writing a single line of backend code first.
"Gladia delivers precise speech-to-text transcriptions with reliable timestamps, making it perfect for downstream tasks. It saves time and ensures smooth integration into our workflows." - Verified user on G2
Modeling unit economics at scale (1k to 10k hours)
Transcription pricing looks simple at low volume and becomes a significant line item at scale. The structural question is not just the per-minute rate: it's whether features like speaker diarization, sentiment analysis, and named entity recognition are metered separately or included in the base rate.
When diarization is priced as an add-on, your cost model at 1,000 hours understates the invoice at 10,000 hours by a multiplier that compounds with every feature you enable. For a product that ships diarization as a user-facing feature, you're paying two bills: one for transcription and one for speaker attribution. Add sentiment analysis and named entity recognition as separate line items, and the gap between your forecast and the actual invoice widens at every tier.
Our pricing page structures all core audio intelligence features, including diarization, sentiment analysis, and named entity recognition, at the base rate with no setup fees. We bill for real-time transcription based on the total duration of audio sent through the WebSocket connection, including silence, which you should account for when modeling call center audio with long hold periods.
The table below illustrates the structural cost difference between an all-inclusive and an add-on pricing model at three volume tiers. This comparison uses real competitor pricing data where available, rather than hypothetical estimates, to provide a more accurate view of the landscape:
| Provider | Mode | Base Rate / Effective Rate | Notes |
|---|---|---|---|
| AssemblyAI | Pre-recorded STT | ~$0.15/hr | Base transcription rate (pay-as-you-go) without add-ons. |
| AssemblyAI | Streaming (real-time) | ~$0.15/hr | Same base rate for streaming. |
| AssemblyAI | With common add-ons | ~$0.30–$0.45/hr | With speaker diarization and multiple audio intelligence add-ons included. |
| Gladia | Async (pre-recorded) | ~$0.61/hr | All features included (speaker diarization, language detection, etc.). |
| Deepgram (Nova-3) | Streaming | ~$0.46/hr | Streaming rate for Nova-3. |
| Deepgram (Nova-3) | Streaming + add-ons | ~$0.55/hr | Multilingual streaming with higher tier. |
| Deepgram (Nova-3) | Pre-recorded STT | ~$0.26/hr | Multilingual tier Nova-3 per hour. |
The gap widens further if sentiment analysis or named entity recognition are each metered as additional line items. For actual rates, review our pricing page and model the volume tier that matches your projected usage.
Independent benchmarks at artificialanalysis.ai include cost-per-1,000-minutes data alongside accuracy comparisons that let you build a quality-adjusted cost model. A lower per-minute rate paired with higher WER on your target audio condition can increase downstream correction costs that don't appear on the transcription invoice.
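As a worked sketch of how add-on metering compounds with volume, using the illustrative rates from the table above (in cents per hour, to keep the arithmetic exact):

```javascript
// Illustrative monthly-invoice model: an all-inclusive rate vs. a base rate
// with separately metered add-ons. Rates are the example figures from the
// comparison table, not a quote.
function monthlyCostCents(hours, baseCentsPerHour, addOnCentsPerHour = []) {
  const addOns = addOnCentsPerHour.reduce((sum, r) => sum + r, 0);
  return hours * (baseCentsPerHour + addOns);
}

// Base 15¢/hr plus two add-ons at 15¢/hr each -> effective 45¢/hr:
console.log(monthlyCostCents(1000, 15, [15, 15]) / 100);  // $450 at 1k hours
console.log(monthlyCostCents(10000, 15, [15, 15]) / 100); // $4500 at 10k hours
// All-inclusive 61¢/hr: the forecast scales linearly, with no surprise items.
console.log(monthlyCostCents(10000, 61) / 100);           // $6100 at 10k hours
```

The structural point is not which total is lower at a given tier; it is that the add-on model triples the base rate in this example, so any forecast built on the headline rate alone diverges from the invoice as volume grows.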
Data governance and compliance in transcription
Meeting audio is often the most sensitive data your product touches, containing PII, financial discussions, legal conversations, and HR matters. Your data governance documentation needs to answer four questions before an enterprise customer's legal team will approve the vendor:
Where does the audio go? We process audio on European infrastructure by default to satisfy GDPR requirements, with other geographies and specific cloud providers available per customer request. We are GDPR-compliant and a Data Processing Agreement (DPA) is available for review before contract signature.
Who can access it? We encrypt all data in transit and at rest using TLS, with access scoped to your API key.
Does it train your model? Only Free Plan data may be used for model training. All paid plans, Pro and Enterprise alike, are excluded from model training by default, with no opt-out clause required.
How long do you keep it? By default, data is stored up to 12 months, but custom retention policies include 1-month, 1-week, 1-day, and zero-retention options. After the period ends, all data is removed from the dashboard and server. The delete transcription endpoint also lets you trigger deletion programmatically on demand.
On certifications: our SOC 2 Type 2 certification covers security, availability, confidentiality, processing integrity, and privacy. We are also HIPAA and ISO 27001 certified. A DPA is available for review before contract signature, which removes the most common bottleneck in enterprise security review cycles.
Get started
Start with 10 free hours and have your integration in production in less than a day. All features on paid plans, including diarization, code-switching, sentiment analysis, and named entity recognition, are included at the base rate with no setup fees and no add-ons.
If you're evaluating multiple STT providers, our free tier lets you test on real production audio before committing to a contract. For a personalized walkthrough of the real-time WebSocket implementation for your specific pipeline, book a demo with the team.
Migrating from another provider? Our migration guide from AssemblyAI and migration guide from Deepgram cover endpoint and payload differences to get you running quickly.
Frequently asked questions about Google Meet transcription
Does Google Meet have a real-time transcription API?
No. Google's native transcription saves to Drive after the meeting and can take over 45 minutes to process. The Google Meet Media API provides a raw audio stream in Developer Preview but requires enrollment and consent from the meeting's host organization, making it unsuitable for most multi-tenant SaaS products.
Can you transcribe Google Meet for free?
Yes. Our free tier includes up to 10 hours of transcription per month with diarization and code-switching. Google's native transcription requires a paid Google Workspace plan at the Business Standard tier or higher ($12/user/month); free accounts only get live captions, not full transcription, and neither tier provides programmatic access to the text stream or speaker diarization in a format usable by external applications.
What is the latency difference between a bot and an extension for real-time transcription?
A browser extension streams audio directly from the client to your STT API, eliminating the server-side relay. A bot approach introduces an additional network hop from the server-hosted browser to your backend, which can add 100–200ms depending on your infrastructure topology. Both paths stay within Gladia's sub-300ms STT inference latency budget, but the extension architecture introduces less upstream latency before audio reaches the transcription layer. Total end-to-end pipeline latency, covering capture, transit, inference, and UI rendering, typically targets under 500ms for live caption use cases.
Does Gladia charge extra for speaker diarization on Google Meet audio?
No. Speaker diarization is included in the base rate across paid plans with no separate line item. You configure it as a parameter in the API call, but it applies only to asynchronous transcription. Full configuration options are in the speaker diarization documentation.
How long does it take to integrate Gladia into a bot or extension pipeline?
According to Claap's published case study, they reached production quickly after integrating. The REST and WebSocket endpoints follow standard patterns, and our documentation includes code samples for both the live and pre-recorded paths, which most engineering teams can adapt to their stack in a single sprint.
What happens to my audio after Gladia processes it?
By default, audio is stored up to 12 months on encrypted EU-based infrastructure. Custom retention policies down to zero retention are available on request. Audio from Pro and Enterprise plan users is never used for model training, with no opt-out required.
Key terminology for audio intelligence
Word error rate (WER): The standard metric for evaluating STT accuracy, calculated as the sum of word insertions, substitutions, and deletions divided by the total number of words in the reference transcript. A WER of 0.05 means 5 errors per 100 words. Always specify the language and audio condition when comparing WER across providers, because a score on clean studio audio does not predict performance on noisy meeting recordings.
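A minimal implementation makes the definition concrete. This sketch computes word-level edit distance directly; it skips the text normalization (casing, punctuation) that production scoring pipelines add on top:

```javascript
// Word error rate: (substitutions + insertions + deletions) / reference
// word count, computed as word-level Levenshtein distance.
function wer(reference, hypothesis) {
  const ref = reference.split(/\s+/).filter(Boolean);
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

// One substitution ("sit") and one deletion ("the") over 6 reference words:
console.log(wer('the cat sat on the mat', 'the cat sit on mat')); // 2/6
```

Running this on a sample of your own meeting audio against a hand-corrected reference is the cheapest way to sanity-check a provider's published benchmark against your conditions.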
Speaker diarization: The process of identifying who spoke when in a conversation, assigning each audio segment to a speaker label. In meeting transcription, this produces "Speaker 1: ..." vs. "Speaker 2: ..." output rather than an undifferentiated text block. Our diarization is powered by an industry-leading provider, pyannoteAI, using their “precision-2” model, which delivers impressive benchmarks. This feature is available only through our asynchronous APIs. We do not offer real-time diarization.
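The "Speaker 1: ..." output described above can be produced from labeled segments with a small formatter. The { speaker, text } segment shape here is an assumed illustration, not the exact API payload:

```javascript
// Turn diarized segments into "Speaker N: ..." lines, merging consecutive
// segments from the same speaker into one turn.
function formatDiarizedTranscript(segments) {
  const merged = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last && last.speaker === seg.speaker) {
      last.text += ' ' + seg.text;
    } else {
      merged.push({ speaker: seg.speaker, text: seg.text });
    }
  }
  return merged.map((s) => `Speaker ${s.speaker}: ${s.text}`).join('\n');
}

console.log(formatDiarizedTranscript([
  { speaker: 1, text: 'Shall we' },
  { speaker: 1, text: 'start?' },
  { speaker: 2, text: 'Yes, go ahead.' },
]));
// Speaker 1: Shall we start?
// Speaker 2: Yes, go ahead.
```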
Code-switching: Mid-conversation language changes, where a speaker shifts from one language to another within a single utterance or across consecutive turns. Standard models trained on single-language audio produce high substitution error rates when speakers switch languages because the model defaults to the language detected at session start.
Large language model (LLM): A machine learning model trained on large text corpora to generate and process natural language. In transcription pipelines, LLMs are used for downstream tasks such as summarization, named entity recognition, and meeting intelligence extraction.
Hallucination: An STT (speech-to-text) model generating text that is syntactically plausible but was not spoken in the source audio. This occurs most frequently during silence, background noise, or heavily accented speech, where the model fills gaps with statistically probable but incorrect output.
Latency budget: The total time allocated across the full pipeline, from audio input to rendered output, that keeps the user experience responsive. In a real-time meeting transcription pipeline, the budget covers audio capture latency, network transit, STT inference (sub-300ms at Gladia's layer), and UI rendering, typically constrained to under 500ms total for live caption use cases.
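The budget arithmetic is simple enough to encode directly. The stage figures below are illustrative targets consistent with this glossary entry, not measurements:

```javascript
// Sum per-stage latencies and check them against the end-to-end budget.
function withinBudget(stagesMs, budgetMs) {
  const total = Object.values(stagesMs).reduce((sum, ms) => sum + ms, 0);
  return { total, ok: total <= budgetMs };
}

// Illustrative pipeline: capture + transit + STT inference + UI render
// against the 500ms live-caption target.
const result = withinBudget(
  { capture: 50, networkTransit: 80, sttInference: 300, uiRender: 30 },
  500
);
console.log(result); // { total: 460, ok: true }
```

Framing the budget this way makes trade-offs explicit: every millisecond added upstream (for example, a bot's extra server hop) is a millisecond taken from the slack available to inference and rendering.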