Audio Intelligence API

Unlock essential Language AI tasks powered by state-of-the-art ASR models
Customizable to every app, company and use case

We simplified the most advanced AI models

to make every audio count

Benefits

Unparalleled Quality

Achieving highly accurate transcription is our number one priority, without compromising on speed (1 hour of audio transcribed in under 60 seconds), while remaining one of the most affordable API providers on the market, add-ons included.

100% Security

We take our clients’ privacy very seriously. Our data hosting is fully compliant with all EU and US regulations, and our portfolio is constantly updated with the latest security add-ons like PII redaction to guarantee full legal compliance.

Multilingual Support

Unlike any other provider on the market, our API supports 99 languages for transcription and translation, as well as for all current and upcoming audio intelligence features.

Easy Integration

Our plug-and-play API is compatible with all existing tech stacks and does not require any AI expertise to integrate. Optimized for scalability, our enterprise-grade API is complemented by an intuitive in-app toggle system for the add-ons.

Features

Complete package of Speech-to-Text and Audio Intelligence add-ons, including speaker diarization, code-switching, and translation, to help companies easily capture, enrich and leverage audio data.

Discover our docs

Transcription

coming soon

The Gladia API uses automatic speech recognition (ASR) technology to convert audio files, video files, or URLs to text. It transcribes 1 hour of audio in less than 60 seconds.

Asynchronous
Live (beta)
Automatic language detection
SRT and VTT caption formats
Noise reduction
Custom vocabulary
Automatic punctuation & casing

Diarization

Automatically partitions an audio recording into segments corresponding to different speakers. This is done by analyzing the audio signal and using various techniques to differentiate between different speakers based on their individual voice characteristics. Mono, stereo, and multi-channel files are all supported.

Word-level timestamp

Refers to the process of associating a specific timestamp with each recognized word in the transcribed text output of an ASR system. These timestamps are usually expressed in seconds.

Code-switching

Ability to accurately transcribe and differentiate between multiple languages or dialects within a spoken input. It involves the recognition and appropriate representation of language switches, where the AI system detects and labels the transitions between languages or dialects in the transcribed text.

Translation

beta

Speech-to-text translation is the process of converting spoken language into written text in another language, using automatic speech recognition technology and language modeling. We support translation to and from 99 languages. Full list here.

Dubbing

Word-level timestamp translation, or dubbing, refers to synchronizing translated subtitles or captions with the corresponding audio or video content. It involves generating accurate timestamps for individual words or phrases in the translated text to ensure precise alignment with the original speech.

Subtitling

SRT and VTT are two popular caption formats supported by Gladia that are used for subtitles in multimedia applications. Both formats are optimized to provide synchronized and readable subtitles that enhance the viewing experience.
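As a rough illustration of how word-level timestamps map onto a caption file, the Python sketch below groups timed words into SRT cues. The `(word, start, end)` tuple layout and the `srt_time` and `words_to_srt` helpers are assumptions made for this example, not Gladia's actual response schema or implementation.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start, end) tuples into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w for w, _, _ in chunk)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical word-level timestamps, in seconds.
sample = [("From", 0.0, 0.32), ("audio", 0.32, 0.80),
          ("to", 0.80, 0.95), ("knowledge", 0.95, 1.60)]
print(words_to_srt(sample, max_words=2))
```

A WebVTT file follows the same cue structure, but starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.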

Summarization

coming soon

Refers to the process of condensing and synthesizing spoken content into shorter, more concise textual summaries. It involves converting spoken words and phrases into written text and then applying NLP techniques to analyze and extract key information from the transcript.

Topic classification

coming soon

Based on the IAB2 classification, refers to the process of categorizing content into one of the 698 predefined topic categories for content indexation and more.

Chapterization

coming soon


Keyword extraction (NER)

coming soon

Automatically identifies and extracts keywords and structured entities (named entity recognition, or NER) such as organizations, names, locations, events, dates, and more from audio files and/or unstructured blocks of text.

Emotion detection

coming soon

Our emotion recognition system is built upon research from New York University and aims to accurately identify and distinguish between 27 human emotions across 99 languages.

Sentiment analysis

coming soon

Determining the sentiment or opinion behind a piece of audio, such as a conversation or dialogue, using natural language processing (NLP) techniques. Involves analyzing the words and phrases used in the text to identify whether the sentiment expressed is positive, negative, or neutral.

Encryption

coming soon

Converting an audio file into an unreadable format to prevent unauthorized access, modification, or theft. Designed to protect sensitive or confidential information contained in audio files, such as customer data, trade secrets, or intellectual property.

Speech moderation

coming soon

Automatically identifies and flags hate speech or other inappropriate and offensive verbal content according to pre-determined parameters (19 featured tags, including violence, NSFW, drugs), internal protocols, and external regulations.

PII Redaction / Personal Data Masking (GDPR)

coming soon

Used to detect, tag, and remove any personally identifying information, such as an address, card number, SSN, phone number, and more.
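To make the idea concrete, here is a deliberately simplified Python sketch that masks two kinds of identifiers in a transcript using regular expressions. It does not describe how Gladia's feature works; the patterns, tag names, and `redact` helper are assumptions for illustration only, and real-world PII detection needs far broader coverage than two regexes.

```python
import re

# Illustrative patterns only (assumed for this example).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # US-style 123-45-6789
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),     # 16 digits in groups of 4
}


def redact(text: str) -> str:
    """Replace each match with a tag such as [SSN] or [CARD]."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text


print(redact("My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."))
```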

Pricing

Free

Perfect for developers, early-stage startups, and individuals

$0/month

(10h/month included)

Pro

Designed to grow with scaling digital companies

$0.00017/sec

+ $0.00004 / sec for live transcription
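For a sense of scale, the per-second rates above can be converted to hourly costs. The short Python sketch below does that arithmetic; it assumes the live surcharge is simply added to the base rate and ignores any free allowance, and the constant and function names are illustrative.

```python
from decimal import Decimal

BASE_RATE = Decimal("0.00017")       # USD per second, Pro rate listed above
LIVE_SURCHARGE = Decimal("0.00004")  # USD per second, extra for live transcription


def cost_usd(seconds: int, live: bool = False) -> Decimal:
    """Estimated cost for `seconds` of audio (assumes the surcharge is additive)."""
    rate = BASE_RATE + LIVE_SURCHARGE if live else BASE_RATE
    return rate * seconds


print(cost_usd(3600))             # one hour, asynchronous
print(cost_usd(3600, live=True))  # one hour, live
```

Under these assumptions, one hour of audio costs about $0.61 asynchronously and about $0.76 live.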

Enterprise

Custom plan tailored to the modern enterprise

Contact us

It's the first time we've been able to transcribe video with such accuracy and speed — including when the conversation is technical. Whatever the language or accent, the quality is always there.

ROBIN BONDUELLE, CEO CLAAP

FAQs

Have more questions?

Contact our support team to get what you need.

What is an audio transcription API?
  • An audio transcription API is an API that uses algorithms to analyze audio data and transcribe it into text.
  • Audio transcription technology is sometimes referred to as Speech-to-text or STT. While both refer to technology that results in audio data transcribed into text data, Speech-to-Text is technically a branch of natural language processing (NLP) that converts spoken language into written text, and is powered by AI models.
How does Gladia’s Speech-to-Text API work?
  • Gladia’s audio transcription API - also called a Speech-to-Text API - allows developers and product owners to add transcription to their products by calling on a single API for every audio transcription need.
  • On top of core transcription in 99 languages, Gladia’s API also offers a layer of audio intelligence features, such as diarization, word-level timestamps, various output formats, etc. See the Product page for a full list of Gladia’s features.
What are the key features of Gladia’s audio transcription API?
  • Gladia’s audio transcription API offers a range of features that fall into two broad categories: core audio transcription features, and audio intelligence features.
  • Core transcription includes features such as asynchronous transcription, speaker diarization, word-level timestamps, automatic language detection, punctuation, and casing, while audio intelligence can include features such as emotion detection, summarization, content tagging or PII redaction.
  • You can find the full, up-to-date list of features on Gladia’s Product page. You will also be able to see information such as features in beta, and information on upcoming products.
What audio formats does Gladia support?

Gladia’s audio transcription API supports a wide range of audio formats and codecs. The full list is available here, but make sure to reach out to our team if you encounter any issues with your specific file format.

What languages does the speech-to-text API support?

Gladia’s Speech-to-Text API supports 99 languages: afrikaans, albanian, amharic, arabic, armenian, assamese, azerbaijani, bashkir, basque, belarusian, bengali, bosnian, breton, bulgarian, burmese, castilian, catalan, chinese, croatian, czech, danish, dutch, english, estonian, faroese, finnish, flemish, french, galician, georgian, german, greek, gujarati, haitian, haitian creole, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, italian, japanese, javanese, kannada, kazakh, khmer, korean, lao, latin, latvian, letzeburgesch, lingala, lithuanian, luxembourgish, macedonian, malagasy, malay, malayalam, maltese, maori, marathi, moldavian, moldovan, mongolian, myanmar, nepali, norwegian, nynorsk, occitan, panjabi, pashto, persian, polish, portuguese, punjabi, pushto, romanian, russian, sanskrit, serbian, shona, sindhi, sinhala, sinhalese, slovak, slovenian, somali, spanish, sundanese, swahili, swedish, tagalog, tajik, tamil, tatar, telugu, thai, tibetan, turkish, turkmen, ukrainian, urdu, uzbek, valencian, vietnamese, welsh, yiddish, yoru

How accurate is Gladia’s audio transcription software?
  • Gladia’s API aims to be the most accurate AI-powered transcription software on the market. As of today, we provide top-tier quality for a near error-free experience, while remaining one of the fastest solutions on the market.
  • We are in the process of building benchmarks for languages other than English, but in the meantime, you can reach out directly to our team for information on your languages of interest. Gladia’s audio transcription can also address multiple languages in the same audio file, so you should ask us about that use case if it is relevant to you!
What is the pricing structure for the audio transcription API?
  • We aim to make Gladia’s audio transcription API a perfect balance of quality and speed while still remaining one of the most affordable options on the market.
  • Gladia’s pricing offers three tiers: free access, Pay-as-you-Go, and Enterprise. You can find more information about our pricing on the Pricing page.
What industries or use cases can benefit from the API?

Any company that manages or produces audio or video data can benefit from Gladia’s Speech-to-Text technology. More specifically, we work with the following types of companies and use cases:

  1. Customer or sales audio communications: from sales outreach platforms to call center technology providers, we enable companies to augment their human workforce by eradicating manual data entry or triage, and to improve their performance with detailed analytics and insights.
  2. Audio, video, and media production: streaming platforms, screencast or podcast production software, media platforms and forums, and audio and video recording or sharing products all use audio and video transcription to create more value for their users. Audio transcription makes their content exponentially faster to catalog and search, and much easier to access.
  3. Meetings and workforce management: virtual meeting providers and collaboration platforms can use audio transcription to help their customers store and exploit vast amounts of meeting data, giving them access to a previously untapped source of internal knowledge.
  4. Other use cases: specialized industries such as medicine, law and finance find immense value in speech-to-text technology that is fine-tuned to their specific language.
Are there any usage limits or restrictions?

Gladia’s Speech-to-Text API was built for developers and product owners with flexibility and ease of implementation in mind. Our rate limitations on the number of calls per hour and the total hours of audio transcribed depend on the user tier and are all described on the Pricing page. If your team has high volume requirements, don’t hesitate to contact us to get a custom quote.

How can I get started with implementing Speech-to-Text in my product?

Gladia’s API is extremely easy to implement. Once you have created your account, you’ll be able to generate an API key and use it to make calls to Gladia’s audio transcription API. You can find all the information you need to get started in our developer documentation.

Is Gladia GDPR compliant?

As Gladia already operates in Europe with organizations that require airtight data privacy compliance, Gladia is able to offer GDPR-compliant audio transcription. We are also working on further data privacy and security certifications, so reach out to our team if you have specific requirements on this front.

Is Gladia secure?

At Gladia, we are used to working with organizations with highly sensitive data and extremely tight security requirements. By default, we deliver our audio transcription services in a cloud-hosted environment which can be customized to your geographical footprint, but we are able to deliver on-premises hosting, as well as air-gapped hosting, depending on your security requirements.

From audio
to knowledge

Subscribe to receive Gladia's latest news,
product updates and curated AI content
