What is ASR & how do speech recognition models work?

Published on Mar 21, 2024

Automatic speech recognition (ASR) is a cornerstone of many business applications in domains ranging from call centers to smart device engineering. At their core, ASR models, also referred to as Speech-to-Text (STT), intelligently recognize human speech and convert it into a written format.

Modern ASR engines draw on a combination of technologies, including natural language processing (NLP), artificial intelligence (AI), machine learning (ML), and large language models (LLMs). While ASR relies on all of these, it differs from each of them in fundamental ways.

Take NLP, for example, which is frequently confused with ASR. NLP aims to understand natural language, usually in text form, and in some applications to generate new text, whereas ASR focuses on transcribing the spoken word. Understanding how speech recognition works therefore helps you build better products and experiences for your customers and teams.

In this article, we’ll explore the inner workings of ASR models and compare legacy approaches with today's state-of-the-art sequence-to-sequence models. If you’re a developer, AI engineer, CTO, or CPO, you’ll discover a wealth of insights into ASR models like Whisper seq2seq and how they’re vital for transcriptions, captioning, and content creation in business environments.

What is ASR?

ASR stands for Automatic Speech Recognition, a technology that converts spoken language into written text. It’s the foundation behind voice assistants, transcription tools, and real-time communication solutions. Unlike general AI or NLP, which focus on understanding or generating language, ASR is specifically designed to capture and accurately transcribe audio signals into readable text. Modern ASR systems leverage advanced machine learning and deep learning models to handle variations in accents, background noise, and different languages.

How does speech recognition work?

When incorporating speech recognition technologies into your business and customer workflows, it helps to understand how they work. Armed with these insights, you’ll be better able to select the ASR model that best meets your specific requirements. If you’re new to speech recognition, feel free to check out our introductory guide to speech-to-text AI before diving into this.

The traditional speech recognition approach

Let’s take a look at how speech recognition has historically worked. Legacy ASR systems, first introduced in the 1970s, convert analog audio signals into digital data and process it with a decoder that builds sentences from words, based on the data sequence and its context. They remained mainstream for roughly four decades, until the introduction of end-to-end ASR models.

The decoder in a traditional speech recognition system analyzes the input data in conjunction with multiple ML models.

Acoustic Models (AM): The decoder needs acoustic models, such as Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), that model speech patterns statistically in order to predict which sound was spoken.

Language Models: Merely estimating the sound (phoneme) is not enough. You need a language model that can predict the right sequence of words based on a statistical analysis of the language.

Lexicon Models: A lexicon model describes how words are pronounced, including phonetic variations in the language. This helps distinguish between accents and similar-sounding phrases. An example of a lexicon model would be a Finite-State Transducer (FST) with a pronunciation dictionary mapping the word “SPEECH” to “S P IY CH.” All natural language words are represented in the FST as subword units. The acoustic, language, and lexicon models all work in tandem to create the desired output (written text).
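To make this division of labor concrete, here is a minimal sketch of how a lexicon and a language model might work together in a legacy-style decoder. Everything in it is hypothetical and chosen for illustration only: the tiny `LEXICON`, the `BIGRAM_LOGPROB` table, and the helper names. It also assumes the acoustic model has already produced one phoneme sequence per word, which a real system would have to work out itself.

```python
import math

# Hypothetical pronunciation lexicon: word -> phoneme sequence (ARPAbet-style).
LEXICON = {
    "speech": ("S", "P", "IY", "CH"),
    "speak": ("S", "P", "IY", "K"),
    "two": ("T", "UW"),
    "too": ("T", "UW"),
    "recognize": ("R", "EH", "K", "AH", "G", "N", "AY", "Z"),
}

# Hypothetical bigram language model: log P(word | previous word).
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.2),
    ("recognize", "speech"): math.log(0.5),
    ("recognize", "speak"): math.log(0.01),
    ("speech", "too"): math.log(0.3),
    ("speech", "two"): math.log(0.05),
}
UNSEEN_LOGPROB = math.log(1e-4)  # crude floor for unseen word pairs


def words_for(phonemes):
    """Lexicon lookup: every word whose pronunciation matches exactly."""
    return [w for w, pron in LEXICON.items() if pron == tuple(phonemes)]


def decode(phoneme_groups):
    """Pick the most likely word sequence for the given phoneme groups.

    `phoneme_groups` is assumed to be the acoustic model's output, already
    segmented into one phoneme sequence per word.
    """
    # Each hypothesis is (total log-probability, list of words so far).
    hypotheses = [(0.0, ["<s>"])]
    for group in phoneme_groups:
        candidates = words_for(group)
        if not candidates:
            raise ValueError(f"no lexicon entry for {group}")
        extended = []
        for score, words in hypotheses:
            for word in candidates:
                step = BIGRAM_LOGPROB.get((words[-1], word), UNSEEN_LOGPROB)
                extended.append((score + step, words + [word]))
        hypotheses = extended
    best_score, best_words = max(hypotheses, key=lambda h: h[0])
    return best_words[1:]  # drop the start-of-sentence marker
```

In this toy setup, “two” and “too” share a pronunciation, so the lexicon alone cannot choose between them; the language model's scores break the tie. It also hints at why the approach is costly to extend: every new word or language needs new lexicon entries.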

As you might have guessed, this approach necessitates training multiple models independently of each other, which is both time-consuming and expensive. One of the most significant drawbacks, however, is the reliance on the lexicon model.

The success of the traditional speech recognition models depends to a great extent on how well the lexicon models are crafted. Experienced phoneticians collaborate to create a custom set for the language at hand. The process must be repeated for every single language the model aims to support. This makes implementation challenging, especially in dynamic business environments expanding into new markets.

Modern ASR models with end-to-end (E2E) deep learning

Modern ASR models take a disruptive approach to speech processing with end-to-end deep learning. One of the early approaches to building an end-to-end system was introduced in 2014 by researchers Alex Graves from Google DeepMind and Navdeep Jaitly from the University of Toronto.

Essentially, a single complex neural network in a modern ASR model replaces the multi-stage models of legacy systems, which can reduce latency and substantially improve accuracy.

This architecture also does away with independent language, acoustic, and lexicon models. The resulting ASR system functions as a single neural network (as opposed to multiple models in legacy systems), reducing development time and costs. In addition, such systems reach high accuracy levels and support multiple languages owing to their neural architecture.

Figure: Basic architecture of modern ASR models

Modern ASR models are trained on large datasets and often rely on self-supervised deep learning, as it would be extremely challenging for human operators to manually label such large volumes of data. Engineers use large amounts of unlabeled data to build a foundation model, which is then fine-tuned to achieve the desired word error rate (WER). Models can also be benchmarked on this metric to compare performance.
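As a rough illustration of the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below assumes a very simple normalization (lowercasing and splitting on whitespace); real benchmarks typically normalize punctuation and numbers as well, and the function name is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference “the cat sat on the mat” with the output “the cat sat on mat” gives one deletion over six reference words. Note that WER can exceed 1.0 when the output contains many inserted words.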

A great example of a modern ASR model is OpenAI’s Whisper seq2seq model, trained on a huge dataset of 680,000 hours of audio data from many languages. This tends to make it more robust than models trained on smaller, more specialized datasets. The seq2seq model is essentially an ML architecture optimized for tasks that involve sequential data (like speech and text). It comprises two core components, an encoder and a decoder, which together take a sequence as input and generate a related sequence as output. Both can be built from different kinds of neural networks, such as transformers or RNNs.

Encoder: The encoder uses neural networks to process the input sequence and create a vector representation of it (a single fixed-size vector in early seq2seq designs, or a sequence of vectors in attention-based ones). In doing so, the encoder captures context from the input sequence and passes it to the decoder.

Decoder: A ‘decoder’ module takes the encoder output vector and creates an output sequence. The decoder will predict the next tokens of the output sequence based on the received context and its own previous predictions.

Transformers: The Whisper seq2seq model uses an end-to-end transformer architecture to understand context and meaning from the input audio. The model first splits the input audio into 30-second chunks and converts each one into a log-Mel spectrogram before passing it to the encoder. The decoder then predicts the corresponding text.

Recurrent Neural Networks (RNN): Sometimes both the encoder and the decoder use Recurrent Neural Networks (RNN), a specific type of neural network well-suited for sequence prediction. RNN cells can remember information about previously seen elements of the sequence through their internal memory and use it to determine the current output. Unlike transformers, which can process all positions of a sequence in parallel, RNNs process a sequence one step at a time. LSTM (Long Short-Term Memory) models, for example, use an RNN architecture.
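The encoder-decoder interplay described above can be sketched as a greedy decoding loop: the encoder runs once, and the decoder is then called repeatedly, each time seeing the encoder's context plus its own previous predictions. The sketch below is not how Whisper is implemented. The names `toy_encoder`, `toy_decoder_step`, and `greedy_decode` are hypothetical, and the “models” are hand-written stand-ins for trained networks so that the example runs on its own.

```python
from typing import Callable, Dict, List, Sequence

Context = List[float]
START, END = "<start>", "<end>"


def toy_encoder(audio_chunks: Sequence[Sequence[float]]) -> Context:
    """Stand-in encoder: summarize each audio chunk as its mean amplitude."""
    return [sum(chunk) / len(chunk) for chunk in audio_chunks]


def toy_decoder_step(context: Context, previous: List[str]) -> Dict[str, float]:
    """Stand-in decoder step: score the candidate next tokens.

    A real decoder is a trained neural network; this rule only mimics the
    interface (context + previous tokens in, token scores out).
    """
    position = len(previous) - 1  # tokens emitted so far, excluding START
    if position >= len(context):
        return {END: 1.0}
    loud = context[position] > 0.5
    return {"HIGH": 0.9 if loud else 0.1, "low": 0.1 if loud else 0.9, END: 0.0}


def greedy_decode(
    context: Context,
    step: Callable[[Context, List[str]], Dict[str, float]],
    max_tokens: int = 50,
) -> List[str]:
    """Repeatedly pick the highest-scoring next token until END appears."""
    tokens = [START]
    for _ in range(max_tokens):
        scores = step(context, tokens)
        next_token = max(scores, key=scores.get)
        if next_token == END:
            break
        tokens.append(next_token)  # fed back in on the next iteration
    return tokens[1:]
```

Calling `greedy_decode(toy_encoder(chunks), toy_decoder_step)` on a few chunks of sample values yields one token per chunk. Production systems often replace the greedy choice with beam search, which keeps several candidate sequences alive at each step.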

Legacy ASR models vs seq2seq architecture

We’ve summarized the key differences between the traditional and modern ASR approaches below for easy reference.

| Characteristics | Legacy ASR models | Seq2Seq models |
| --- | --- | --- |
| Internal structure | Modular pipeline with acoustic models, lexicon models, and language models | Use end-to-end neural networks with an encoder-decoder architecture to directly convert speech to text |
| Accuracy | Limited accuracy, cannot reach human accuracy levels | High accuracy, can reach human-level accuracy and beyond |
| Versatility | Limited adaptability to diverse inputs (some accents can throw off transcriptions) | Highly adaptable to different accents and languages |
| Training data type | Needs labeled phonetic data to function properly | Works with unlabeled or less labeled data |
| Training technique | Each model needs to be independently trained | The entire model is trained in one go |
| Suitability | Best suited for simple speech-to-text functions | Best suited for complex and real-time speech-to-text applications |
| Speed | Generally slower due to multi-stage component interactions | Typically faster due to parallel processing |
| Error detection and correction | Limited error correction capabilities | Complex error correction and control mechanisms |
| Scalability | Limited scalability | Highly scalable |
| Language support | Support a limited number of languages | Can support a large number of languages |
| Context awareness | Limited context awareness. Probabilistic language models analyze context to predict the next word with limited use of machine learning. | Highly context-aware, use deep learning for comprehensive context analysis |

Which ASR models do we use today?

While legacy models suffer from drawbacks, they are by no means obsolete. In fact, modern ASR models build on concepts from traditional speech recognition, such as acoustic and language modeling. Furthermore, legacy systems are still deployed for some specific tasks.

Generally speaking, though, seq2seq models like Whisper are great at tasks that require natural language understanding. This includes speech recognition, translation, and caption generation. This is, in part, owing to an architecture that gives them advantages when dealing with sequences of varying lengths. That said, modern models do need very large datasets and compute resources to train and run. They can also struggle with long-range dependencies and with rare words.

How to select the right ASR model?

We know how challenging it can be to select an optimal ASR model for your specific needs. We’ve put together a few decisive factors you should consider to streamline the selection process.

Word error rate: The ideal goal of any speech recognition application is a word error rate of zero. In practice, factors beyond your control make some errors unavoidable, so decide how much accuracy your system actually needs when selecting an ASR model. If your application demands very high accuracy, choose a modern ASR system like Whisper seq2seq.

End goal: Consider the requirements of your end users when choosing a model. How will they use the product or service?

Operating environment: ASR performance depends on the background noise to a great extent. If your end users will operate the product or application in noisy environments, you’ll need advanced noise suppression support in your model.

Audio properties: Consider the audio sample rate, bitrate, file format, channels, and duration when selecting a model.

Input audio type: Factor in how varied your input audio will be and the languages and dialects the model will need to support.

Performance: Every model performs differently, so you’ll need to evaluate it based on your specific benchmarks. If you need real-time speech-to-text conversions (such as in smart devices and wearables), choose a model with the lowest latency possible.
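Several of the factors above, audio properties in particular, can be checked programmatically before you commit to a model. As a minimal sketch, the snippet below reads the sample rate, channel count, bit depth, and duration of an uncompressed PCM WAV file using Python's standard `wave` module. The helper names `describe_wav` and `make_silent_wav` are our own, and compressed formats such as MP3 would need a different library.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return basic properties of a PCM WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def make_silent_wav(seconds: float, rate: int = 16000, channels: int = 1) -> io.BytesIO:
    """Build an in-memory 16-bit WAV of silence, purely as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(seconds * rate))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer
```

Running `describe_wav(make_silent_wav(2.0))` reports a two-second, 16 kHz, mono, 16-bit file. Comparing such values against a candidate model's documented input requirements shows whether resampling or channel mixing is needed first.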

Conclusion

Automatic speech recognition plays a pivotal role in numerous business applications, and knowing which kind of ASR system to use for a niche use case can greatly improve the end-user experience.

Traditionally, legacy speech-to-text systems have used a combination of acoustic, lexicon, and language models to predict text. However, these suffer from accuracy constraints and require specialized expertise for phonetic training.

Modern ASR models are increasingly capable and can handle diverse input types, including multiple languages. Their simplified end-to-end deep learning design helps deliver low latency, high accuracy, and high scalability.

When choosing an ASR model for your business, consider its underlying architecture (such as RNN/transformer), input data modalities, and the desired word error rates.

Additional resources

Far-Field Automatic Speech Recognition (2020) by Reinhold Haeb-Umbach et al.
