RAG for voice platforms: combining the power of LLMs with real-time knowledge
Published on Oct 30, 2024
It happens all the time. A user submits a query to a large language model (LLM) and swiftly gets a response that is clear, comprehensive, and obviously incorrect.
This phenomenon, known as hallucination, usually occurs when the model lacks knowledge or doesn’t have enough context on the topic. Although models can tailor their output to a request, they can only draw on information that existed when they were trained, which may no longer be up to date.
There are several techniques to address this issue. One common approach is to fine-tune the model with domain-specific data, for example, medical or legal jargon.
Another approach is retrieval-augmented generation (RAG), which combines the model's generative capabilities with a retrieval mechanism to access external reputable sources, cross-check facts, and produce accurate and up-to-date answers.
Introduced in 2020 by Facebook AI researchers led by Patrick Lewis, RAG emerged from the need to address a critical shortcoming in early generation-based models: heavy reliance on the knowledge contained within their training data.
In this article, we will examine how RAG works, compare it to fine-tuning, and discuss how it can benefit voice-first platforms.
Context is everything
To generate a relevant and helpful response, LLMs need the right context. However, a vanilla LLM relies only on its generation component and produces responses from the knowledge it absorbed during pre-training. This leaves limited options for fine-tuning and customization.
In addition, maintaining the model and keeping it up-to-date requires regular manual updates and retraining, leading to high-cost, resource-intensive, and time-consuming processes.
RAG is a powerful technique designed to enhance a model's contextual understanding through on-demand retrieval of external data, injecting relevant context into the prompt at runtime.
With RAG, LLMs can access information from additional sources such as customer documentation, web pages, and third-party applications. The data can be obtained through web scraping, API integration, and document indexing, allowing RAG-powered models to provide more accurate, context-aware responses tailored to specific queries.
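Document indexing usually starts by splitting each source into smaller pieces before embedding them. The sketch below shows one simple way to do that, assuming word-based chunks with a fixed overlap; the function name `chunk_text` and the default sizes are illustrative choices, not part of any specific library.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words.

    Sizes are illustrative; production systems often chunk by tokens,
    sentences, or document structure instead.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.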
Here is how the retrieval process works:
User prompt: The user submits a query, which triggers the LLM to create a response. The RAG system converts the query into a vectorized representation called an embedding: a list of numbers that captures the meaning of the query’s text in a form that can be compared with other texts.
Semantic search: The system then runs a similarity search, matching the query embedding against the embeddings in a vector database that holds the external knowledge. The source documents are split into chunks, each containing a segment of data on a particular topic, and the database stores an embedding for each chunk. A similarity metric, such as cosine similarity, determines which chunks are closest to the query embedding, and the closest chunks are fetched to give the LLM the context relevant to the user’s query.
Prompt: The LLM takes the context retrieved from the vector database and the user’s query as input. These are combined with the configured prompt, which instructs the LLM on how to generate a response.
Post-processing: The LLM processes the input according to the prompt and returns a response.
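The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in for a learned embedding model, the in-memory list stands in for a vector database, and the sample chunks and function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Semantic search step: return the top_k chunks closest to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(instructions: str, context: list[str], query: str) -> str:
    """Prompt step: combine instructions, retrieved context, and the query."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"{instructions}\n\nContext:\n{context_block}\n\nQuestion: {query}"

# Hypothetical knowledge base, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of the return being received.",
    "Our support line is open Monday to Friday, 9am to 6pm CET.",
    "Premium plans include priority support and a dedicated account manager.",
]

query = "When is the support line open?"
context = retrieve(query, CHUNKS, top_k=1)
prompt = build_prompt("Answer using only the context below.", context, query)
# `prompt` is what would be sent to the LLM in the final step.
```

A real system would replace `embed` with an embedding model and `retrieve` with a vector database query, but the flow of data between the steps stays the same.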
The difference between RAG and fine-tuning
Both RAG and fine-tuning aim to enhance an LLM’s output and make it more accurate and relevant.
However, RAG allows you to inject real-time context, dependent on your ingestion strategy, into your prompts to a deployed LLM, whereas fine-tuning is limited to context and data available in the model’s training data set.
Some may start with RAG and then fine-tune models to perform a more specific task. Others find that RAG alone is a sufficient method for customization.
Below is a brief overview of the main differences between the two techniques.
Adaptation
Fine-tuning: Once it has been fine-tuned for a specific task, the LLM is static.
RAG: The system keeps evolving, because new sources can be added to its knowledge base over time.
Data training
Fine-tuning: Re-trains the parameters of a model to optimize performance on new data for a specific task. Unlike with RAG, this data must be prepped and cleaned before it can be used.
RAG: Adds information from external sources related to a specific topic, without changing the model's internal parameters.
Versatility
Fine-tuning: A model that hasn’t been fine-tuned for a domain-specific task doesn’t have sufficient knowledge to handle related queries. For example, a model fine-tuned on legal employment contracts is equipped to answer questions about work-related legal issues, but not questions from other domains.
RAG: Can augment the LLM with any information source from any domain, without re-training the model on a new dataset.
Catastrophic forgetting
Fine-tuning: Adapting an LLM to a new task can cause it to forget or lose knowledge learned during the pre-training phase.
RAG: Since RAG does not change the model’s internal parameters, the LLM retains its pre-training knowledge.
Computational requirements
Fine-tuning: Requires extensive computational resources, typically GPUs, for training.
RAG: Requires no retraining, but can still be resource-intensive at runtime, since every query adds an embedding step, a similarity search, and a longer prompt.
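To make the contrast concrete, the sketch below shows how a RAG system's knowledge can change without touching the model: adding one document to the store is enough to change what gets retrieved. The `KnowledgeStore` class, its word-overlap scoring, and the sample pricing data are all hypothetical simplifications of a vector database.

```python
import re
from typing import Optional

def tokens(text: str) -> set[str]:
    """Lowercased word set used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class KnowledgeStore:
    """Hypothetical in-memory store; a real system would use a vector database."""

    def __init__(self) -> None:
        self.documents: list[str] = []

    def add(self, document: str) -> None:
        # Updating knowledge is just an append: no model weights are changed.
        self.documents.append(document)

    def search(self, query: str) -> Optional[str]:
        """Return the document sharing the most words with the query, if any."""
        query_tokens = tokens(query)
        best, best_score = None, 0
        for document in self.documents:
            score = len(query_tokens & tokens(document))
            if score > best_score:
                best, best_score = document, score
        return best

store = KnowledgeStore()
store.add("The 2023 pricing plan starts at 20 euros per month.")

question = "What does the 2024 pricing plan cost?"
before = store.search(question)  # only the outdated 2023 document is available

store.add("The 2024 pricing plan starts at 25 euros per month.")
after = store.search(question)   # the newly ingested document now ranks first
```

Achieving the same update through fine-tuning would mean preparing training data and running another training job.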
Leveraging RAG for voice-first platforms
RAG’s ability to retrieve information from external authoritative sources and use it to generate a response makes it highly valuable in verticals where domain-specific knowledge is required to answer user queries in real time.
For voice-first platforms, such as meeting assistants and contact centers, the combination of speech-to-text AI, LLMs and RAG unlocks a range of benefits.
For instance, one of the significant challenges in conversational AI is understanding context. With RAG, the retrieval step can help maintain context throughout the interaction. Because it matches on meaning rather than exact wording, it can better capture the intent behind user queries and return more precise results than traditional keyword-based search.
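One way to carry context into retrieval, sketched below, is to build the search query from the last few transcribed turns rather than from the latest utterance alone. This is an assumption about how such a system could be wired, not a description of a specific product; the `ConversationContext` class and the sample dialogue are hypothetical.

```python
from collections import deque

class ConversationContext:
    """Keeps the most recent transcribed turns of a call (hypothetical helper)."""

    def __init__(self, max_turns: int = 3) -> None:
        # deque with maxlen silently drops the oldest turn when full.
        self.turns: deque = deque(maxlen=max_turns)

    def add_turn(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

    def retrieval_query(self) -> str:
        """Join recent turns so the retriever sees the exchange, not one utterance."""
        return " ".join(text for _, text in self.turns)

ctx = ConversationContext(max_turns=3)
ctx.add_turn("customer", "I ordered a headset last week.")
ctx.add_turn("agent", "Thanks, I can see the order.")
ctx.add_turn("customer", "Can I return it?")

# "Can I return it?" alone does not say what "it" is; the combined query does.
query = ctx.retrieval_query()
```

The window size is a trade-off: too few turns lose the referent, too many dilute the query with unrelated earlier topics.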
In the long run, RAG-powered systems can analyze customer feedback to identify common themes and pain points and provide insights into current and potential product features and improvements.
As an example, let's see how contact centers can improve customer satisfaction and support agents more effectively by integrating RAG into their systems.
Contact center agents can use RAG-based frameworks to fetch relevant data from the company database to analyze customer history and preferences, tailor responses to individual requirements, and generate personalized recommendations. This can automate response generation, saving time, reducing average handle time, and improving resolution rates.
If you want to learn more about trends in speech AI and how they enable contact centers to augment their agents and streamline workflows, check out our article: Enhancing CX with AI: Key trends to watch 2024
Challenges of implementing RAG with LLMs
When implementing RAG, there are a few challenges and mitigation strategies to consider.
Data privacy
Users can ask questions that cause RAG to fetch critical information from sensitive documents. The unintended exposure of confidential knowledge can lead to costly data breaches. For example, a chatbot using patients’ historical data can improve response quality but also raises concerns about exposing sensitive information. Implementing robust encryption protocols and anonymization techniques to protect personal data can help mitigate potential data privacy issues.
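As a small illustration of anonymization, the sketch below masks email addresses and phone numbers before a document is indexed. The patterns are deliberately simple assumptions and will produce both misses and false positives (for example, long digit sequences such as dates can match the phone pattern). A real deployment would rely on a dedicated PII detection tool alongside encryption and access controls.

```python
import re

# Illustrative patterns only; not exhaustive and not locale-aware.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each detected item with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact jane.doe@example.com or +33 1 23 45 67 89 about the results."
redacted = redact(sample)
```

Running redaction at ingestion time means the sensitive values never reach the vector database, so they cannot be retrieved into a prompt later.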
Poor quality data
RAG frameworks are only as effective as the data sources they rely on to fetch information. Curating the dataset and including multiple cross-referenced materials in the database helps keep responses accurate and relevant.
Information overflow
User queries may lead the system to fetch extensive information from multiple sources. This information overflow can cause the model to lose track of what the original query asked for. "Needle in the haystack" tests, which check whether a specific fact can still be found when it is buried in a large volume of unrelated text, can help verify that the relevant data is the data being retrieved.
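The sketch below applies the needle-in-a-haystack idea to the retrieval step: a known fact is inserted at a random position among distractor chunks, and the test measures how often it appears in the top results. The toy Jaccard retriever, the function names, and the sample data (including the product name) are hypothetical; in practice the same harness would wrap the real retriever.

```python
import random
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank chunks by Jaccard word overlap with the query."""
    query_tokens = tokens(query)

    def score(chunk: str) -> float:
        chunk_tokens = tokens(chunk)
        union = query_tokens | chunk_tokens
        return len(query_tokens & chunk_tokens) / len(union) if union else 0.0

    return sorted(chunks, key=score, reverse=True)[:top_k]

def needle_recall(needle: str, question: str, haystack: list[str],
                  top_k: int = 3, trials: int = 20, seed: int = 0) -> float:
    """Fraction of trials in which the needle is among the top_k results."""
    rng = random.Random(seed)  # seeded so the test is repeatable
    hits = 0
    for _ in range(trials):
        chunks = haystack.copy()
        chunks.insert(rng.randint(0, len(chunks)), needle)  # random position
        if needle in retrieve(question, chunks, top_k):
            hits += 1
    return hits / trials

# Hypothetical distractors and needle.
haystack = [f"Meeting note {i}: the team reviewed the quarterly roadmap."
            for i in range(50)]
needle = "The warranty on the X200 headset lasts 24 months."
question = "How long does the warranty on the X200 headset last?"

recall = needle_recall(needle, question, haystack)
```

A recall well below 1.0 on realistic data would suggest revisiting chunking, the embedding model, or the number of chunks retrieved.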
Security risks
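Maintaining the security of external sources can be challenging. If the storage repositories have weak security protocols, there is a potential threat of data leakage. The same applies to the model provider: data passed to a third-party hosted LLM via a prompt may be retained by the provider for a certain period of time and, depending on the product and its settings, may be used to train their models, so it is worth checking the provider's data usage policy before sending sensitive content.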
Access issues
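Specific documents may have access restrictions. To control access to confidential information, permission metadata should be attached to the indexed documents and checked against the permissions of the user making the query, so that only content the user is allowed to see is retrieved.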
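A minimal sketch of permission-aware retrieval is shown below, assuming each chunk carries a set of roles allowed to read it. The `Chunk` type, the role names, and the sample content are hypothetical; many vector databases offer metadata filtering that can serve the same purpose.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset[str]  # permission metadata stored with the chunk

def filter_by_permission(chunks: list[Chunk], user_roles: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one role with the user."""
    return [chunk for chunk in chunks if chunk.allowed_roles & user_roles]

chunks = [
    Chunk("Return policy: items can be returned within 30 days.",
          frozenset({"agent", "manager"})),
    Chunk("Internal escalation budget per customer.",
          frozenset({"manager"})),
]

# An agent only sees the first chunk; the second never reaches the prompt.
visible = filter_by_permission(chunks, {"agent"})
```

Applying the filter before the similarity search, rather than after generation, ensures restricted text is never placed in the LLM's context in the first place.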
Conclusion
RAG offers an effective way to enhance LLMs, helping to ensure outputs are up to date with external knowledge sources and best practices.
Gladia uses different techniques to improve the quality of output and contextualize an initial prompt, and RAG enhances that ability, making our speech-to-text and audio intelligence API more robust and reliable.
Gladia provides a speech-to-text and audio intelligence API for building virtual meetings, note-taking apps, call center platforms and media products, providing transcription, translation and insights powered by best-in-class ASR, LLMs, and GenAI models.