What is summarization?

Published on Feb 29, 2024

Summarization in speech-to-text (STT) AI is a popular feature that streamlines the extraction of essential information from spoken content. By condensing lengthy audio recordings or live conversations into concise summaries, STT summarization helps end users understand content and make decisions faster.

The feature combines automatic speech recognition (ASR) systems with large language models (LLMs), which are neural networks trained on vast datasets, to produce summaries tailored to a given use case, such as medical consultations, online meetings and sales calls.

At Gladia, we have developed an innovative approach to summarization, leveraging the capabilities of Mistral and LlamaIndex. In this article, we explore the challenges and limitations of traditional methods and how our approach addresses these issues.

Feel free to try it directly below or keep on reading to learn more about the feature and its deployment.

How summarization works

Summarization in STT operates through a multi-step process, which involves linguistic analysis, machine learning algorithms, and natural language processing techniques to ensure the accuracy and coherence of the summaries.

First, an ASR system like Gladia’s Whisper-Zero transcribes the spoken content into text, converting audio signals into words. Then, an LLM such as OpenAI’s GPT-3.5 or Mistral 7B analyzes this text to identify key phrases, extract important information, and generate summaries based on predetermined criteria.
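The two-step flow described above can be sketched as a small pipeline. This is a minimal illustration, not Gladia's implementation: `summarize_audio`, `transcribe` and `summarize` are hypothetical names, and the two callables stand in for whatever ASR system and LLM you actually use.

```python
from typing import Callable, Dict


def summarize_audio(
    audio_path: str,
    transcribe: Callable[[str], str],
    summarize: Callable[[str, str], str],
    instructions: str = "Summarize the key points.",
) -> Dict[str, str]:
    """Run ASR first, then pass the transcript and instructions to an LLM.

    `transcribe` and `summarize` are injected so any ASR system or LLM
    client can be plugged in; both are hypothetical stand-ins here.
    """
    transcript = transcribe(audio_path)              # step 1: audio -> text
    summary = summarize(transcript, instructions)    # step 2: text -> summary
    return {"transcript": transcript, "summary": summary}
```

Keeping the two steps behind plain callables makes it easy to swap either component without touching the rest of the pipeline.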

In terms of the underlying techniques, summarization can be broadly categorized into two methods: extractive and abstractive.

Extractive summarization, widely used in machine learning systems, involves analyzing language features, such as word frequency and importance, to extract the most significant elements of a text. While effective, these methods are often limited in their ability to capture the nuances of a conversation.
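To make the extractive idea concrete, here is a minimal frequency-based sketch. It is a simplified textbook-style illustration rather than a production method: the stop-word list, the naive sentence splitter and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny stop-word list for illustration only; real systems use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
              "we", "for", "on", "that", "this", "will", "be"}


def content_words(text):
    """Lowercase the text and keep only non-stop-word tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Naive sentence splitter: breaks after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

Because the output is made only of sentences copied from the transcript, this kind of method cannot rephrase or merge ideas, which is the limitation described above.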

Abstractive methods, on the other hand, identify the most important concepts in a conversation, then rephrase and reorganize them to create a summary. Because the output is newly generated text rather than verbatim excerpts, this approach is less tied to the literal wording of the source, but it provides a higher-level understanding of the conversation, resulting in greater clarity for the user.

Key challenges of summarization and how Gladia solves them

Enabling infinite context

One of the significant challenges in summarization is the size of the context. A single hour of audio transcription can result in approximately 25,000 tokens, which can be overwhelming for traditional LLMs. These models have limited context sizes, typically ranging from 10,000 to 30,000 tokens, making it difficult to process lengthy conversations. Moreover, lower-resource languages may require more tokens to represent a single word, further exacerbating the issue.
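A quick back-of-the-envelope calculation shows why context size matters. The sketch below uses the figure of roughly 25,000 tokens per hour quoted above; the function name, the default context window and the number of tokens reserved for the prompt and the generated summary are assumptions chosen for illustration.

```python
import math


def chunks_needed(audio_hours, tokens_per_hour=25_000,
                  context_window=10_000, reserved_tokens=2_000):
    """Estimate how many pieces a transcript must be split into.

    `reserved_tokens` is an assumed budget for the prompt and the
    generated summary, which also have to fit in the context window.
    """
    usable = context_window - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than context_window")
    total_tokens = audio_hours * tokens_per_hour
    return math.ceil(total_tokens / usable)


print(chunks_needed(1))                          # one hour, 10k-token window
print(chunks_needed(2, context_window=30_000))   # two hours, 30k-token window
```

With these assumed numbers, a one-hour recording already needs four passes through a 10,000-token model, and a language that uses more tokens per word would need more.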

To overcome these limitations, we at Gladia have developed an approach based on the work of LlamaIndex. By combining embedding algorithms with techniques such as chunking, we can process contexts of virtually unlimited length. This allows us to create abstractive summaries without practical limits on input or generation length.

The chunking technique, inspired by the work at Meta AI's Research Lab, involves dividing the conversation into smaller segments, enabling the system to maintain attention throughout the conversation.
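As a rough illustration of chunking, the sketch below splits a transcript into word-based segments. It is not Gladia's implementation: the function name and sizes are hypothetical, and the small overlap between neighboring chunks is an added assumption, commonly used so that a sentence cut at a boundary still appears whole in one of the chunks.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into word-based chunks that share `overlap` words with the next one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

A real system would usually count model tokens rather than words and might prefer to split on speaker turns or sentence boundaries.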

Overcoming catastrophic forgetting

Another issue with summarization is the concept of "catastrophic forgetting" (a behavior also described as the "lost in the middle" effect). Traditional LLMs tend to forget critical information in the middle of a conversation, resulting in a loss of precision. This happens because the system's attention is focused on the beginning and end of the conversation, with a significant drop-off in the middle.

By using chunking, we can recombine the segments to ensure that the system maintains attention throughout the conversation, resulting in more accurate summaries.
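One common way to recombine segments is to summarize each chunk separately and then summarize the partial summaries. The sketch below shows that pattern under stated assumptions: `summarize_in_chunks` is a hypothetical name, and `keep_first_sentences` is a toy stand-in for a real LLM call so the example runs on its own.

```python
from typing import Callable, List


def summarize_in_chunks(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize every chunk on its own, then summarize the combined results."""
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]   # one pass per chunk
    if len(partial) == 1:
        return partial[0]
    return summarize("\n".join(partial))               # final pass over the partial summaries


def keep_first_sentences(text: str) -> str:
    """Toy stand-in for an LLM: keeps the first sentence of each line."""
    lines = [line for line in text.splitlines() if line.strip()]
    return " ".join(line.split(".")[0].strip() + "." for line in lines)
```

Since every chunk gets its own pass, content from the middle of a long conversation is treated the same way as content from the beginning or the end.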

Both challenges have been taken into account when designing our Audio-to-LLM feature, enabling you to generate custom summaries, action items and more from your audio using your own prompts.

Output formats for your product

The beautiful thing about summarization is just how customizable the output can be, thanks to: a) an increasing variety of LLMs to pick from; and b) the infinite creativity of prompt engineering, enabling every company to find the perfect combination of prompts to produce desired results.

Companies can choose to deploy and tweak LLMs themselves or go with all-batteries-included audio intelligence APIs like ours. In the latter scenario, summarization capabilities are seamlessly integrated with transcription services.

Currently, Gladia’s API allows you to access the three most common industry-agnostic types of summaries, each catering to specific needs. Here's what they look like in practice:

1. General summary

The general summary provides a comprehensive overview of the transcription, capturing the main points and key details. It serves as a detailed reference for in-depth analysis or review.

2. Concise summary

For quick reference, the concise summary offers a condensed version of the transcription, highlighting only key takeaways. Its goal: efficient information consumption and decision-making.

3. Bullet points

The bullet points summary presents key insights and actionable points in a concise, easily digestible format. It organizes information into bullet-pointed lists, making it ideal for quick reference and strategic planning.

As you can see, with just a few lines of code, you can embed the most common types of summarization into your application. For more information on setting up and using our API, feel free to consult our documentation.
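For illustration, a request that selects one of the three summary types might be prepared as in the sketch below. Treat every specific here as an assumption to check against the API documentation: the endpoint URL, the `x-gladia-key` header, and the `summarization` and `summarization_config` field names are not taken from this article. The code only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed values for illustration; verify against the official API documentation.
ASSUMED_ENDPOINT = "https://api.gladia.io/v2/pre-recorded"
SUMMARY_TYPES = ("general", "concise", "bullet_points")


def build_summary_payload(audio_url, summary_type="general"):
    """Build a JSON body asking for a transcription plus one summary type."""
    if summary_type not in SUMMARY_TYPES:
        raise ValueError("summary_type must be one of %s" % (SUMMARY_TYPES,))
    return {
        "audio_url": audio_url,
        "summarization": True,                           # assumed field name
        "summarization_config": {"type": summary_type},  # assumed field name
    }


def build_request(api_key, payload):
    """Wrap the payload in a POST request object without sending it."""
    return urllib.request.Request(
        ASSUMED_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request would then be a matter of passing the object to `urllib.request.urlopen` and reading the JSON response.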

If you prefer to build your own summarization from scratch using open-source Whisper and GPT-3.5, here is a dedicated tutorial.

Maximizing quality of summaries with prompt engineering

As noted previously, the quality and relevance of summaries depend largely on the prompt provided to the LLM. If you want to have full control over the summarization input parameters, here are some factors to consider.

Prompt engineering involves crafting tailored prompts for specific use cases to optimize the relevance and accuracy of the summaries generated. While high-quality prompt engineering usually requires at least some specialized expertise, businesses can maximize the quality of summaries by following these actionable insights:

Understand use case requirements

Identify the specific objectives and priorities for summarization within your business context. Whether it's capturing meeting minutes, extracting key insights from customer interactions, or summarizing research findings, align the prompt with the desired outcomes.

Pick the right LLM

Selecting a suitable LLM is crucial for ensuring the quality and relevance of summaries. Consider factors such as language proficiency, domain expertise, model capabilities and price (based on the unique token economics of LLMs) when choosing a model for your summarization needs.

Then, evaluate different models based on their performance metrics – preferably the ones based on your own internal tests – to assess the compatibility with your use case.
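Price is the easiest of these factors to estimate up front. The sketch below compares the per-summary cost of candidate models from their per-token pricing. The model names and prices are invented placeholders, not real price lists, and the token counts are assumptions for a one-hour recording.

```python
def summarization_cost(input_tokens, output_tokens,
                       price_in_per_million, price_out_per_million):
    """Cost of one summarization call, given prices per million tokens."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# Hypothetical models and prices (input, output) per million tokens.
HYPOTHETICAL_PRICES = {
    "model-a": (0.50, 1.50),
    "model-b": (3.00, 6.00),
}

# Assumed workload: ~25,000 transcript tokens in, ~500 summary tokens out.
for name, (price_in, price_out) in HYPOTHETICAL_PRICES.items():
    cost = summarization_cost(25_000, 500, price_in, price_out)
    print(f"{name}: {cost:.5f} per one-hour recording")
```

Cost only tells part of the story, so it is worth pairing an estimate like this with quality checks on your own recordings.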

Customize prompts accordingly

Tailor prompts to suit the linguistic style, vocabulary, and domain-specific terminology relevant to your industry or organization. By incorporating relevant keywords and context cues, you can enhance the summarization process and ensure the output aligns with your expectations.
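A small template function is one way to keep prompt customization consistent across use cases. The sketch below is only an example of the idea: the function name, parameters and wording of the instructions are all hypothetical and should be adapted to your own domain.

```python
def build_summary_prompt(transcript, use_case, audience,
                         glossary=(), style="concise paragraphs"):
    """Assemble a summarization prompt from use-case specific pieces."""
    lines = [
        f"You are summarizing a {use_case} for {audience}.",
        f"Write the summary as {style}.",
    ]
    if glossary:
        # Domain terms the model should preserve verbatim.
        lines.append("Keep these domain terms exactly as written: "
                     + ", ".join(glossary) + ".")
    lines.append("Only use information found in the transcript.")
    lines.append("")
    lines.append("Transcript:")
    lines.append(transcript.strip())
    return "\n".join(lines)


prompt = build_summary_prompt(
    "Doctor: How long have you had the cough? Patient: About two weeks.",
    use_case="medical consultation",
    audience="the treating physician",
    glossary=["dyspnea", "SpO2"],
    style="bullet points",
)
print(prompt)
```

Keeping the use case, audience, style and glossary as separate parameters makes it easier to change one element at a time when testing prompt variations.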

To illustrate this, if you’re looking to summarize online meetings, check out our guide with specific examples of prompts for this use case.

Iterate and refine

It’s normal for early attempts at prompt engineering to not yield the desired results. Continuously evaluate the effectiveness of prompts and summaries based on user feedback and performance metrics. Iterate on prompt variations, adjusting parameters and refining language patterns to improve summarization quality over time.
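Iteration is easier with even a crude automatic check. The sketch below ranks prompt variants by how many required terms their summaries mention. Keyword coverage is a deliberately simple proxy chosen for this example, not a recommended quality metric on its own, and the function names are hypothetical.

```python
def keyword_coverage(summary, required_terms):
    """Fraction of required terms that appear in the summary (case-insensitive)."""
    if not required_terms:
        return 1.0
    text = summary.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)


def rank_prompt_variants(summaries_by_variant, required_terms):
    """Return variant names ordered from best to worst coverage."""
    scored = [(keyword_coverage(summary, required_terms), name)
              for name, summary in summaries_by_variant.items()]
    # Highest coverage first; ties are broken alphabetically by name.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

In practice, a check like this works best alongside user feedback and human review of a sample of summaries.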

Conclusion

Summarization is a highly popular feature among end users. Product builders today can choose from an array of open-source and commercial tools for both transcription and summarization to provide the best summarization experience in their product.

Gladia's approach to summarization, leveraging the capabilities of Mistral and LlamaIndex, has overcome key limitations of traditional summarization methods. By using chunking and embedding algorithms, we can process contexts of virtually unlimited length, maintain attention throughout the conversation, and create abstractive summaries with high precision. This innovation has significant implications for the future of speech-to-text technology, enabling users to unlock the full potential of their conversations.

A note of thanks for the contributions of open-source tools like Mistral, Meta's embedding projects, and Jerry Liu's work on LlamaIndex, which have all made this breakthrough in summarization possible.

If you want to unlock these capabilities for your platform with Gladia, feel free to sign up for our API or book a custom demo to chat with our team about your use case and needs.
