How do speech recognition models work?

Published on Apr 2024

Automatic speech recognition (ASR) is a cornerstone of many business applications in domains ranging from call centers to smart device engineering. At their core, ASR models, also referred to as Speech-to-Text (STT), intelligently recognize human speech and convert it into a written format.

Modern ASR engines leverage a combination of groundbreaking technologies, including Natural Language Processing (NLP), AI, ML, and LLMs. While ASR relies on all these technologies, it’s very different from each one of them in fundamental ways. 

Take NLP, for example, which is frequently confused with ASR. NLP aims to understand and generate natural language, typically working from text, whereas ASR focuses narrowly on transcribing the spoken word. Understanding how speech recognition works is therefore crucial to building better products and experiences for your customers and teams.

In this article, we’ll explore the inner workings of ASR models and compare legacy approaches with today's state-of-the-art sequence-to-sequence models. If you’re a developer, AI engineer, CTO, or CPO, you’ll discover a wealth of insights into ASR models like Whisper seq2seq and how they’re vital for transcriptions, captioning, and content creation in business environments.

How does speech recognition work? 

When incorporating speech recognition technologies into your business and customer workflows, it helps to understand how they work. Armed with these insights, you’ll be better able to select the ASR model that best meets your specific requirements. If you're new to speech recognition, feel free to check out our introductory guide to speech-to-text AI before diving into this.

The traditional speech recognition approach

Let’s take a look at how speech recognition has historically worked. Legacy ASR systems, first introduced in the 1970s, convert analog audio signals into digital data and process it with a decoder that assembles words and sentences based on the data sequence and context. They remained mainstream for roughly four decades, until the introduction of end-to-end ASR models.

The decoder in a traditional speech recognition system analyzes the input data in conjunction with multiple ML models, which work in tandem to create the desired output (written text):

  • Acoustic Models (AM): The decoder relies on acoustic models, such as Hidden Markov Models (HMM) combined with Gaussian Mixture Models (GMM), that capture natural speech patterns to predict which sound was most likely spoken.
  • Language Models: Merely estimating the sound (phoneme) is not enough. You need a language model that can predict the most likely sequence of words based on a statistical analysis of the language.
  • Lexicon Models: A lexicon model maps each word to the sequence of sounds that make it up, including pronunciation variants. This helps the decoder handle accents and tell apart similar-sounding phrases. An example of a lexicon model would be a Finite-State Transducer (FST) built from a pronunciation dictionary that maps the word “SPEECH” to “S P IY CH.” Every word the system can recognize is represented in the FST as a sequence of subword units.
Major components of legacy ASR models
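
To make the lexicon's role more concrete, here is a minimal sketch of a pronunciation dictionary lookup. It is an illustration only: the `LEXICON` table and both helper functions are hypothetical, only the "SPEECH" entry comes from the text above, and production systems compile such dictionaries into FSTs rather than Python dicts.

```python
from typing import Dict, List, Sequence

# Toy pronunciation lexicon: word -> phoneme sequence.
# Only the "SPEECH" entry comes from the article; "PEACH" is an illustrative addition.
LEXICON: Dict[str, List[str]] = {
    "SPEECH": ["S", "P", "IY", "CH"],
    "PEACH": ["P", "IY", "CH"],
}


def phonemes_to_words(phonemes: Sequence[str],
                      lexicon: Dict[str, List[str]] = LEXICON) -> List[str]:
    """Return every word whose pronunciation exactly matches the phoneme sequence."""
    target = list(phonemes)
    return [word for word, pron in lexicon.items() if pron == target]


def words_to_phonemes(words: Sequence[str],
                      lexicon: Dict[str, List[str]] = LEXICON) -> List[str]:
    """Expand a word sequence into phonemes; raises KeyError for unknown words."""
    out: List[str] = []
    for word in words:
        out.extend(lexicon[word.upper()])
    return out
```

In this sketch, a word that is missing from the dictionary raises a `KeyError`, which mirrors how a legacy system cannot recognize words its lexicon does not cover.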

As you might have guessed, this approach necessitates training multiple models independently of each other, which is both time-consuming and expensive. One of the most significant drawbacks, however, is the reliance on the lexicon model. 

The success of the traditional speech recognition models depends to a great extent on how well the lexicon models are crafted. Experienced phoneticians collaborate to create a custom set for the language at hand. The process must be repeated for every single language the model aims to support. This makes implementation challenging, especially in dynamic business environments expanding into new markets.

Modern ASR models with end-to-end (E2E) deep learning 

Modern ASR models take a disruptive approach to speech processing with end-to-end deep learning. One of the early approaches to building an end-to-end system was introduced in 2014 by researchers Alex Graves from Google DeepMind and Navdeep Jaitly from the University of Toronto.

Essentially, a single complex neural network in a modern ASR model replaces the multi-stage models of legacy systems, which reduces latency and drastically improves performance and accuracy.

This architecture also does away with independent language, acoustic, and lexicon models. The resulting ASR system functions as a single neural network (rather than multiple models, as in legacy systems), reducing development time and costs. In addition, such systems achieve very high accuracy levels and support multiple languages owing to their advanced neural architecture.

Basic architecture of modern ASR models

Modern ASR models are trained on large datasets and often rely on self-supervised deep learning, as it would be extremely challenging for human annotators to manually transcribe such volumes of audio. Engineers use large amounts of unlabeled data to build a foundation model, which is then fine-tuned to achieve the desired word error rate (WER), the share of words the model gets wrong relative to a reference transcript. Models can also be benchmarked on this metric to compare performance.
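
For reference, WER is commonly computed as the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. Below is a minimal sketch of that calculation. The function name and the normalization (lowercasing and splitting on whitespace) are our own assumptions; benchmark suites usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumption: simple normalization by lowercasing and whitespace splitting.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many extra words, so it is an error ratio and not a percentage capped at 100.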

A great example of a modern ASR model is OpenAI’s Whisper, a seq2seq model trained on 680,000 hours of audio data from many languages. This makes it more robust than models trained on smaller, more specialized datasets. A seq2seq model is essentially an ML architecture optimized for tasks that involve sequential data (like speech and text). It comprises two core components, an encoder and a decoder, which together take a sequence as input and generate a related sequence as output. Both can be built from transformers or recurrent neural networks.

  • Encoder: The encoder uses neural networks to process the input sequence and turn it into a vector representation (a single fixed-size vector in classic seq2seq models, a sequence of vectors in transformers). It captures context from the input sequence and passes it to the decoder.
  • Decoder: The decoder takes the encoder output and creates an output sequence. It predicts the next token of the output sequence based on the received context and its own previous predictions.
  • Transformers: The Whisper seq2seq model uses an end-to-end transformer architecture to understand context and meaning from the input audio. The input audio is first split into 30-second chunks, which are converted into log-Mel spectrograms and passed to the encoder. The decoder then predicts the corresponding text.
  • Recurrent Neural Networks (RNN): Sometimes both the encoder and the decoder use Recurrent Neural Networks (RNN), a specific type of neural network well-suited for sequence prediction. RNN cells can remember information about previously seen elements of the sequence through their internal memory and use it to determine the current output. Unlike transformers, which process all positions of a sequence in parallel, RNNs process it one step at a time. LSTM (Long Short-Term Memory) networks, for example, are a widely used type of RNN.
Source: OpenAI
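
The encoder-decoder interaction described above can be sketched as a simple greedy decoding loop. This is a toy illustration under stated assumptions, not Whisper's implementation: `greedy_decode`, `toy_encoder`, and `toy_scores` are hypothetical names, and the two toy functions stand in for the neural networks a real model would use.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # start and end markers (names are illustrative)


def greedy_decode(
    encoder: Callable[[Sequence[float]], object],
    next_token_scores: Callable[[object, List[str]], Dict[str, float]],
    audio_features: Sequence[float],
    max_len: int = 20,
) -> List[str]:
    """Run the encoder once, then emit one token at a time until EOS or max_len."""
    context = encoder(audio_features)
    tokens = [BOS]
    for _ in range(max_len):
        # The decoder sees the encoder context plus its own previous predictions.
        scores = next_token_scores(context, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker


def toy_encoder(features: Sequence[float]) -> Dict[str, int]:
    # Stand-in for a neural encoder: it only records the input length.
    return {"n_frames": len(features)}


def toy_scores(context: object, tokens: List[str]) -> Dict[str, float]:
    # Stand-in for a neural decoder: follows a fixed script based on
    # how many tokens have been emitted so far.
    script = ["hello", "world", EOS]
    step = len(tokens) - 1
    target = script[min(step, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in script}
```

Calling `greedy_decode(toy_encoder, toy_scores, [0.1, 0.2, 0.3])` walks through the script and stops when the end marker scores highest. Real systems often replace the greedy choice with beam search, which keeps several candidate sequences alive at each step.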

Legacy ASR models vs seq2seq architecture

We’ve summarized the key differences between the traditional and modern ASR approaches below for easy reference.

| Characteristics | Legacy ASR models | Seq2Seq models |
| --- | --- | --- |
| Internal structure | Modular pipeline with acoustic models, lexicon models, and language models | Use end-to-end neural networks with an encoder-decoder architecture to directly convert speech to text |
| Accuracy | Limited accuracy, cannot reach human accuracy levels | High accuracy, can reach human-level accuracy and beyond |
| Adaptability | Limited adaptability to diverse inputs (some accents can throw off transcriptions) | Highly adaptable to different accents and languages |
| Training data type | Needs labeled phonetic data to function properly | Works with unlabeled or less labeled data |
| Training technique | Each model needs to be independently trained | The entire model is trained in one go |
| Use cases | Best suited for simple speech-to-text functions | Best suited for complex and real-time speech-to-text applications |
| Speed | Generally slower due to multi-stage component interactions | Typically faster due to parallel processing |
| Error detection and correction | Limited error correction capabilities | Complex error correction and control mechanisms |
| Scalability | Limited scalability | Highly scalable |
| Language support | Support a limited number of languages | Can support a large number of languages |
| Context awareness | Limited context awareness. Probabilistic language models analyze context to predict the next word with limited use of machine learning. | Highly context-aware, use deep learning for comprehensive context analysis |

Which ASR models do we use today?

While legacy models suffer from drawbacks, they are by no means obsolete. In fact, modern ASR models build upon several fundamental concepts from traditional speech recognition systems, such as acoustic and language modeling. Legacy systems are also still deployed for some specific tasks.

Generally speaking, though, seq2seq models like Whisper are great at tasks that require natural language understanding. This includes speech recognition, translation, and caption generation. This is, in part, owing to an architecture that gives them an advantage when dealing with sequences of varying lengths. That said, modern models do need very large datasets and significant compute resources to train and run. They can also struggle with long-range dependencies and with rare words.

How to select the right ASR model?

We know how challenging it can be to select an optimal ASR model for your specific needs. We’ve put together a few decisive factors you should consider to streamline the selection process.

  • Word error rate: The ideal goal of any voice recognition application is a word error rate of zero. In practice, factors beyond your control make some errors inevitable, so factor in the precision and accuracy your system needs when selecting an ASR model. If your application demands high accuracy, consider a modern ASR system like Whisper.
  • End goal: Consider the requirements of your end users when choosing a model. How will they use the product or service?
  • Operating environment: ASR performance depends on the background noise to a great extent. If your end users will operate the product or application in noisy environments, you’ll need advanced noise suppression support in your model.
  • Audio properties: Consider the audio sample rate, bitrate, file format, channels, and duration when selecting a model.
  • Input audio type: Factor in how varied your input audio will be and the languages and dialects the model will need to support.
  • Performance: Every model performs differently, so you’ll need to evaluate it based on your specific benchmarks. If you need real-time speech-to-text conversions (such as in smart devices and wearables), choose a model with the lowest latency possible.
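
One way to apply these criteria is to turn them into hard requirements and then rank whatever passes. The sketch below is only an illustration of that idea: the `Candidate` fields, the `select_model` helper, the model names, and all numbers are hypothetical placeholders, not benchmark results. In practice you would fill them in with measurements from your own test audio.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    wer: float                 # word error rate measured on your own test set
    latency_ms: float          # end-to-end latency measured in your environment
    languages: FrozenSet[str]  # language codes the model supports


def select_model(
    candidates: Iterable[Candidate],
    max_wer: float,
    max_latency_ms: float,
    required_languages: Iterable[str],
) -> Optional[Candidate]:
    """Drop candidates that miss any hard requirement, then prefer the lowest WER."""
    required = frozenset(required_languages)
    eligible = [
        c for c in candidates
        if c.wer <= max_wer
        and c.latency_ms <= max_latency_ms
        and required <= c.languages
    ]
    # Ties on WER are broken by lower latency; None means nothing qualified.
    return min(eligible, key=lambda c: (c.wer, c.latency_ms), default=None)


# Placeholder candidates with made-up numbers, for illustration only.
CANDIDATES = [
    Candidate("model_a", wer=0.08, latency_ms=900.0, languages=frozenset({"en", "fr"})),
    Candidate("model_b", wer=0.12, latency_ms=150.0, languages=frozenset({"en"})),
    Candidate("model_c", wer=0.10, latency_ms=200.0, languages=frozenset({"en", "fr"})),
]
```

With a 300 ms latency budget, for example, the most accurate placeholder model is filtered out before ranking, which reflects the trade-off between accuracy and real-time performance discussed above.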


Automatic speech recognition plays a pivotal role in numerous business applications and knowing which kind of ASR system to use for a niche use case can greatly improve end-user experience. 

Traditionally, legacy speech-to-text systems have used a combination of acoustic, lexicon, and language models to predict text. However, these suffer from accuracy constraints and require specialized expertise for phonetic training. 

Modern ASR models are increasingly capable and can handle diverse input types, including multiple languages. A single end-to-end deep learning model helps reduce latency while improving accuracy and scalability.

When choosing an ASR model for your business, consider its underlying architecture (such as RNN/transformer), input data modalities, and the desired word error rates.

Additional resources

Far-Field Automatic Speech Recognition (2020) by Reinhold Haeb-Umbach et al.
