Prompt injection in speech recognition explained
Published on Mar 2024

Following the release of ChatGPT, prompt engineering for LLMs became one of the most widely discussed fields in AI. Prompt injection in speech recognition, in particular, is a fascinating NLP technique used to guide the underlying model toward more accurate results, and it is worth exploring in more detail.

Drawing on Gladia’s expertise in Audio Intelligence AI, in this blog I will cover how prompt injection is used to enhance Automatic Speech Recognition (ASR), improving on earlier speech adaptation and keyword-boosting methods, which are heavier yet complementary.

Setting the scene

As with many AI systems of recent years, and many earlier ones built with more classical techniques, OpenAI’s Whisper is a model that uses an encoder-decoder architecture, which offers significant advantages by enabling abstract mathematical manipulations.

The basic mechanism here can be illustrated with a triangle connecting audio, latent space, and the final word transcript.

Basic graph explaining the relationship between audio input and output, passing through latent spaces via encoders and decoders

Let’s zoom in a little. Here is the general view of audio processing in Automatic Speech Recognition (ASR).

Main stages of audio processing in Automatic Speech Recognition (ASR)

Before going into the details of how all of these components come together, the first thing to understand is:

The main objective of the prompt injection approach is to provide some guidance to the decoder in order to capture the right words and perform better — thanks to context-setting.

Now, let’s dive into the key elements of ASR and explore why prompt injection is a miraculous addition to the mix that can drastically improve the quality of audio transcription.

What are Features in Speech Recognition?

Performing Speech-to-Text is like solving a big Lego puzzle: the speech needs to be broken into smaller elementary bricks called “features” or “tokens” in order to be processed.

These elementary features are able to “embed” multiple aspects of the original speech segment, such as the speech tone, pitch, volume, or even speech rate.

In speech recognition, these features are extracted during a “pre-processing” step, usually by computing spectrograms or Mel-Frequency Cepstral Coefficient (MFCC) representations of the speech. In effect, the audio becomes a simple “picture” of itself.


Diagram of how spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) are formed.
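
To make this concrete, here is a minimal sketch of that pre-processing step, assuming the open-source librosa library; the audio file name is hypothetical:

```python
# A minimal sketch of feature extraction, assuming the librosa library.
# The file name "call.wav" is hypothetical.
import librosa

# Load the waveform at 16 kHz, a common sample rate for ASR.
waveform, sample_rate = librosa.load("call.wav", sr=16000)

# Mel spectrogram: the "picture" of the audio over time and frequency.
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=80)

# MFCCs: a compact summary of the spectral envelope, per frame.
mfccs = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)

print(mel.shape)    # (80, number_of_frames)
print(mfccs.shape)  # (13, number_of_frames)
```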

What are Encoders and Latent Spaces?

Continuing with the Lego analogy, it’s now time to put some magic magnets on the back of each brick — so that the bricks that were originally together can form a magic link.

During the training phase, the strength of the magnetic links between the different bricks is adjusted, until a fixed picture of “link strengths” between them finally emerges.

This final picture, the Latent Space, contains the Lego bricks (tokens, illustrated as colored dots below), each with a unique identifier.

Passage from encoder to Latent Space
Adapted from https://www.mdpi.com/2076-3417/12/14/7195
Functioning of embeddings, explained

Another way to see it is that during the encoding phase, some elementary parts of speech (i.e. tokens) can be more or less attracted toward other tokens. The vector representations of these interactions (or “magic links”) are known as “embeddings”.

Disclaimer: this explanation is very inaccurate from a scientific perspective, but it helps to emphasize that not all tokens play an equal role in the Latent Space: some are more or less likely to be “activated” or “picked” (not at all if they are very shy ;-) ).

Schematic representation of speech tokens in the Latent Space
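
To illustrate, here is a toy sketch of embeddings in plain NumPy. The vectors are invented for the example, since real embeddings are learned during training:

```python
# Toy embeddings: the vectors below are made up for illustration.
import numpy as np

embeddings = {
    "fiber":    np.array([0.9, 0.1, 0.3]),
    "internet": np.array([0.8, 0.2, 0.4]),
    "cider":    np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    # Tokens whose vectors point the same way are strongly "attracted".
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["fiber"], embeddings["internet"]))  # high
print(cosine_similarity(embeddings["fiber"], embeddings["cider"]))     # low
```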

When I explained this to one of our customers, it came as a revelation: “It’s like Esperanto for computers!” And it’s exactly that.


What is a Decoder?

A “decoder” is a Lego brick builder that combines the pieces based on their interactions (embeddings) and translates the beautiful Lego tapestry into a humanly comprehensible structure: the written language. It takes the unique token identifiers and converts them into their equivalents in words (or pieces of words).

Diagram showing the functioning of a decoder
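
Here is a toy sketch of that last step, greedy decoding: at each step, pick the highest-scoring token id and map it back to text. The vocabulary and scores are invented; a real decoder also conditions each step on the tokens already chosen:

```python
import numpy as np

# A made-up vocabulary mapping token ids to words.
vocab = {0: "the", 1: "fiber", 2: "cider", 3: "is", 4: "fast"}

# One row of scores (logits) per decoding step, one column per token.
logits = np.array([
    [2.1, 0.3, 0.2, 0.1, 0.0],  # step 1 -> "the"
    [0.1, 1.8, 1.5, 0.2, 0.3],  # step 2 -> "fiber"
    [0.0, 0.2, 0.1, 2.4, 0.3],  # step 3 -> "is"
    [0.1, 0.0, 0.2, 0.3, 2.2],  # step 4 -> "fast"
])

token_ids = logits.argmax(axis=1)  # greedy pick at each step
print(" ".join(vocab[int(i)] for i in token_ids))  # "the fiber is fast"
```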

Prompt Injection in Speech Recognition

So finally, what’s “prompt injection”?

Prompt injection is a technique that changes the interactions between the bricks by dropping a giant magnet in the middle of the Magic Magnetic Legos.

It helps the decoder handpick pieces closer to the giant magnet: pieces that might not have been selected originally (because they were too shy) can now emerge by virtue of sitting closer to it.

Diagram to explain the role of prompt injection in ASR: overview
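
In practice, Whisper exposes this context-setting through the initial_prompt argument of the transcribe() function in the open-source openai-whisper package. A minimal sketch, with a hypothetical audio file and prompt:

```python
# A minimal sketch using the open-source openai-whisper package.
# The audio file name and prompt text are hypothetical.
import whisper

model = whisper.load_model("base")

# The prompt is fed to the decoder as preceding context, so tokens
# related to it become more likely to be picked during transcription.
result = model.transcribe(
    "support_call.wav",
    initial_prompt="This conversation is about telecoms and internet equipment.",
)
print(result["text"])
```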

A concrete example: Fiber vs. Cider

Let’s imagine we’re working with a poor-quality audio file where it’s hard to hear if the speaker talks about “Fiber” or “Cider”.

Fiber has strong interactions and is well-connected in the Latent Space, as it appeared very frequently in the training dataset and is linked to many concepts, while Cider is not.

By guiding the transcription with a prompt such as “this conversation is about telecoms and internet equipment”, we drop the giant red magnet far from Cider and close to Fiber, which the AI will then classify as the perfect candidate.

In this scenario, Cider and Fiber remain very close in terms of feature representation (they sound alike in tone, energy, pitch, volume, and so on). In other words, the color and shape of the Cider Lego brick are very close to those of the Fiber one.

But, as the prompt is introduced, the magnetic sticker on the Fiber brick becomes way stronger and more connected to the rest of the Legos in the game (Latent Space).

Illustration of prompt injection in action with words Cider and Fiber

Now, let’s drop another prompt injection: “this conversation is about alcohol and drinks”. The red magnet is now really close to Cider, and even if the magnetic sticker on the back of the Cider brick is not originally as strong as Fiber’s, it’s more likely to be picked up in the end.

Illustration of prompt injection in action with words Cider and Fiber
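
Here is a toy sketch of that flip in plain Python. All numbers are invented for illustration: the audio alone barely favors Fiber, and the prompt’s pull is modeled as a simple additive bonus:

```python
# Toy numbers: the audio alone barely favors "fiber".
acoustic = {"fiber": 0.52, "cider": 0.48}

# Hypothetical affinity of each candidate word to each prompt.
prompt_affinity = {
    "telecoms and internet equipment": {"fiber": 0.9, "cider": 0.1},
    "alcohol and drinks":              {"fiber": 0.1, "cider": 0.9},
}

def pick(prompt, weight=0.5):
    # Blend the acoustic evidence with the context set by the prompt.
    scores = {word: acoustic[word] + weight * prompt_affinity[prompt][word]
              for word in acoustic}
    return max(scores, key=scores.get)

print(pick("telecoms and internet equipment"))  # fiber
print(pick("alcohol and drinks"))               # cider
```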

Other more conventional and complementary techniques

Many other techniques can help make transcription more accurate. For example, keyword boosting can be seen as virtually increasing a Lego brick’s size to make its magnetic sticker bigger and more attractive, and thus more likely to be picked, as sketched below.
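
In scoring terms, keyword boosting amounts to adding a fixed bonus to the boosted tokens’ scores before the pick is made. A toy illustration with invented numbers and a hypothetical keyword list:

```python
import numpy as np

# Invented vocabulary and raw decoder scores (logits).
vocab = ["the", "fiber", "cider", "gladia"]
logits = np.array([1.2, 0.4, 0.5, 0.1])

# Hypothetical keyword list with boost values.
boost = {"gladia": 2.0}

biased = logits.copy()
for word, bonus in boost.items():
    biased[vocab.index(word)] += bonus  # make the brick's magnet bigger

print(vocab[int(np.argmax(logits))])  # "the" (without boosting)
print(vocab[int(np.argmax(biased))])  # "gladia" (with boosting)
```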

Speech adaptation is another technique, which consists of adapting the encoder to understand new “waveform shapes” and adjust their mapping into the Latent Space.
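
One common way to do this is fine-tuning: keep the decoder frozen and update only the encoder on domain-specific audio. A schematic sketch using a pretrained Whisper model from Hugging Face transformers; the training loop is left as a placeholder, since real adaptation requires labeled domain audio:

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Freeze everything, then unfreeze just the encoder, so only the mapping
# from audio features into the latent space gets adapted.
for param in model.parameters():
    param.requires_grad = False
for param in model.model.encoder.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.model.encoder.parameters(), lr=1e-5)

# Placeholder loop: iterate over domain-specific (features, labels) batches.
# for input_features, labels in domain_batches:
#     loss = model(input_features=input_features, labels=labels).loss
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```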

Here is a more comprehensive map of corrective measures to improve speech recognition quality.

Complete overview of prompt injection in ASR

In conclusion, we have seen that prompt injection in speech recognition is a fascinating new tool, albeit one with malicious potential, that complements existing techniques and will help ASR technology reach superior quality standards.
