Fine-tuning ASR models: Key definitions, mechanics, and use cases

Published on
Apr 2024

Many modern AI models are built for general-purpose applications and require fine-tuning for domain-specific tasks. The fine-tuning process involves taking an existing model and training it further on domain-specific data. The additional training allows the model to understand the new data and improve its performance in a particular field.

Automatic Speech Recognition (ASR) models such as Whisper from OpenAI are trained to transcribe audio in different languages to plain text. Like Whisper, most leading ASR models today are built around an encoder-decoder architecture. The encoder extracts auditory features from the input, and the decoder reconstructs the features into a natural language sequence.

Fine-tuning ASR models allows developers to introduce new languages, dialects, accents, and industry-specific terms – such as for healthcare and finance – to the model's knowledge. It also enables them to improve the model on its existing learning.

The ASR architecture can be daunting, especially for new developers or non-technical enthusiasts. Understanding the model, collecting data, training the model, and implementing use cases is a holistic task requiring some expertise. This article will serve as a guide for people interested in leveraging ASR capabilities in their workflows. We will discover the key terminologies related to ASR, understand the concept of fine-tuning, and finally look into Whisper model fine-tuning.

How fine-tuning works

Speech recognition models are sequence-to-sequence models since they accept an audio sequence as input and generate a text sequence as output.  A modern deep-learning-based ASR model consists of two primary modules: acoustic and language models.

The acoustic model decomposes the audio signal, learns various speech patterns, and maps the speech to the probable phonemes spoken. The language model is responsible for the text sequence output. It uses the output of the acoustic model to predict which word comes next in the sequence. The language model uses Natural Language Processing (NLP) to form a complete and accurate sentence. Each of these modules is a deep-learning architecture with various layers.

The fine-tuning process starts from a pre-trained ASR model. Its weights already reflect what the model learned from a large generic dataset. The model is then trained further, using the pre-trained weights as the starting point, on a new domain-specific dataset.

Fine-tuning an ASR model. Credits: ResearchGate

There are various approaches to fine-tuning, but the most common one is to freeze the initial layers and continue training with the last few. 

The starting layers capture generic information from the data, such as general speech patterns and vocabulary. The final layers are more important for task-specific learning, such as specific words and their pronunciations. The weights of these final layers are updated during fine-tuning as the model learns to understand the new data and make relevant predictions.

Why is fine-tuning useful

Fine-tuning works on the notion of ‘Don’t reinvent the wheel.’ It allows developers to improve existing machine learning models without developing or re-training the model from scratch. 

For example, if we have a generic speech recognition model, we can fine-tune it on medical-related audio and texts so the model understands the domain-specific jargon. Fine-tuning is a popular training methodology and has the following benefits:

  • Saves time and resources. Training modern deep-learning models requires extensive datasets and expensive hardware. For reference, the Whisper ASR model was trained on 680,000 hours of audio data, and training GPT-4 reportedly cost more than $100 million. These resource requirements make it difficult for most organizations to train models from scratch. Since fine-tuning is done on a comparatively small (and domain-specific) dataset, training completes faster than re-training the entire model, which saves developers on GPU utilization costs.
  • Improved performance. Fine-tuning further expands the original knowledge base of the model. It allows the model to learn from existing and new data sources and display excellent performance in the new domain.
  • Quick and easy solution for SMEs. It helps small businesses use AI to improve their workflows. Small organizations that can not afford to build proprietary models can use an open-source model, fine-tune it at a fraction of the cost, and use it for niche tasks.

Challenges of fine-tuning

Despite its various benefits, there are certain challenges to fine-tuning ASR models.

  • Technical expertise: Fine-tuning is not a plug-and-play solution. To fine-tune a model, an organization will need an AI/ML expert for data processing and setting up the environment. Moreover, different models have different interfaces, and developers must modify the data into the format the architecture accepts. Sometimes, the process may require the developer to deep-dive into the model implementation to understand the intricate details.
  • Hardware requirements: Despite a shorter training time, fine-tuning still requires expensive hardware for quick training. Fine-tuning larger ASR models typically calls for GPUs with 16GB of VRAM or more, which can be fairly expensive.
  • Data quality: Attaining sufficient, high-quality data for fine-tuning can be challenging. The data collection process can be expensive and time-consuming. However, without it, fine-tuning will not yield good results. 
  • Overfitting and information loss: Fine-tuning allows the model to learn new information but at the risk of diminishing its previous knowledge. The model may perform well on the new training data. However, performance may degrade in other generic scenarios.

Different types of fine-tuning

There are various types of fine-tuning techniques in machine learning (ML). Each of these serves different purposes and is relevant for varying cases. 

1. Tuning the entire model 

In this scenario, we would take the existing weights of the model and continue training on them using the new dataset. This approach updates all the layers and parameters, so every weight is adjusted to the new data. Depending on data volume and quality, this can improve or degrade the model's overall performance.

2. Tuning Particular Layers

The majority of the layers are frozen (left untouched), and only a few (usually the last ones, closest to the output) are trained on new data. The intuition for ASR models is that the existing model has already learned key speech patterns from extensive datasets. The model understands different aspects of speech, noise, and accents. Updating only particular layers will add vocabulary information, such as medical jargon. The model will use knowledge from both aspects to display enhanced performance in the new domain.
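As a rough illustration of layer freezing, the PyTorch sketch below uses a toy three-layer network as a stand-in for a pre-trained model. The layer sizes and the `freeze_all_but_last` helper are hypothetical and are not part of Whisper or any library. It only shows the general mechanism: parameters with `requires_grad = False` receive no gradient updates.

```python
import torch
from torch import nn

# Toy stand-in for a pre-trained network (sizes are arbitrary assumptions).
model = nn.Sequential(
    nn.Linear(80, 64),  # early layer: generic features
    nn.ReLU(),
    nn.Linear(64, 64),  # middle layer
    nn.ReLU(),
    nn.Linear(64, 32),  # final layer: task-specific, kept trainable
)


def freeze_all_but_last(network: nn.Sequential, n_trainable: int = 1) -> None:
    """Freeze every parameterized layer except the last `n_trainable` ones."""
    layers = [m for m in network if list(m.parameters())]
    for layer in layers[: len(layers) - n_trainable]:
        for param in layer.parameters():
            param.requires_grad = False


freeze_all_but_last(model)

# Names of the parameters that will still be updated during training.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Only trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

With a real checkpoint, the same idea applies to whichever submodules you decide to freeze. Which layers to keep trainable is a design choice that depends on the task.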

3. Adding/modifying layers 

Sometimes, it is important to keep all previous knowledge intact. In such a case, we can add new layers to the end of the model and just train them during the fine-tuning process. You may also have to modify the head (output) layer if the use case demands it. For example, in the case of classification, the base model might be trained for two classes, and the fine-tuned model may have to add a third one. An ML developer must add a new head layer (with three neurons) and train it during fine-tuning.

4. Low-Rank Adaptation (LoRA) 

LoRA is an efficient fine-tuning technique mostly used for LLMs but also applicable to other AI models. It keeps the base model's weights frozen and adds an adapter module made of two small weight matrices, whose product approximates the update to a larger weight matrix of the base model. The fine-tuning process adjusts only the adapter weights for the new task, and these are combined with the frozen base weights at inference.
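To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The dimensions, the rank, and the `alpha / rank` scaling are illustrative assumptions (the scaling is a common convention, not something stated in this article), and `lora_forward` is a hypothetical helper rather than a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8  # assumed sizes, for illustration only

W = rng.normal(size=(d_out, d_in))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable adapter matrix
B = np.zeros((d_out, rank))                     # trainable, zero-initialized


def lora_forward(x, W, A, B, alpha=16.0):
    """Apply the frozen weight plus the scaled low-rank update B @ A."""
    scale = alpha / A.shape[0]
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)

full_params = W.size            # parameters touched by full fine-tuning
lora_params = A.size + B.size   # parameters touched by LoRA
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the base layer. In this toy setting the adapter trains 8,192 values instead of 262,144.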

5. Few-Shot Learning

Few-shot is a training technique useful in data-limited scenarios. The model is trained on a few new data examples and learns representations to apply to real-world scenarios. However, this approach relies on model architecture and may not perform well for all.

6. Reinforcement Learning (RL)

RL is a self-tuning mechanism that allows the model to improve itself depending on its output. It uses a reward function to adjust the model's weights. The reward function rewards or penalizes the model depending on the output. A sub-category of RL, RL with Human Feedback, uses human feedback to tune the model output. It does not rely on additional labeled data but uses the feedback to bring its output in line with expectations.

Fine-tuning applications for ASR

An ASR model can be fine-tuned for various purposes. The fine-tuning task depends on the requirements of the end application. Common ASR fine-tuning adaptation types include:

  1. Language
    This aims to add an entirely new language to the model's knowledge base. Since a pre-trained model already understands generic speech patterns, language adaptation helps map this knowledge to the vocabulary of the new language. Language fine-tuning might require a comparatively larger dataset depending on how much of the vocabulary or accents are to be included.
  2. Accents and dialects
    Processing accents is the responsibility of the acoustic module. If an ASR model is trained on an American dataset, it will be troubled by British or Australian accents despite the language being the same (English). Accent adaptation tweaks the model to understand the same words from different pronunciations.
  3. Environmental
    Most open-source AI models are trained on clean, fabricated datasets. These rarely represent real-world scenarios. For ASR, this means that the model is better for transcribing clean audio. Noise adaptation fine-tunes the model to understand speech from different noise conditions such as audio static, echo, general call-center background noise, etc.
  4. Domain
    Fine-tuning an ASR model to adapt to a particular domain or industry. This includes tuning a pre-trained ASR to understand jargon and acronyms related to finance or legal work. Domain adaptation is useful when the model is to be implemented within a particular industry.

Custom vocabulary vs. fine-tuning vs. specialized models

There are various ways to obtain an ASR model for niche tasks. The selection of the right method depends on the project budget, task requirements, and technical expertise available. Let’s discuss the key training approaches and the scenarios in which they fit best.

Custom vocabulary

It is common for generic ASR models to struggle with out-of-vocabulary words. Terms not part of a regular conversation, like ‘Neural Networks’ or ‘Monetarism,’ will not be recognized by the probabilistic language modeling.

Custom vocabulary allows the user to specify domain-specific terms and their phonetic representations to the model. The model includes the terms as part of its language model, and the decoder can generate them as part of the output.

This is a simpler training methodology as it does not require an extensive dataset or a long training routine. It is helpful in cases where the model is missing out on new words but has a general understanding of language patterns. For example, an English ASR model can be tuned to medical-specific vocabulary. This allows it to identify the jargon in medical notes. 

Most speech-to-text APIs today come with a custom vocabulary option, but it’s generally recommended to test and benchmark the overall API accuracy to determine their performance in your specific use case.


Fine-tuning

Fine-tuning is useful to expand the model to a new domain or task. It requires a decent-sized dataset relevant to the new task and specialized expertise and hardware for training. It can be used when an ASR model is required to be tuned for a new language or a dialect.

A fine-tuned model uses the language patterns learned from its earlier dataset and applies them to the new domain. The additional training helps the model learn new patterns and build expertise in the new language.

Compared with custom vocabulary, fine-tuning is better suited to extending the model to a new task or domain than to making small improvements to its existing capabilities.

Specialized models

You may not need a generic solution in many niche cases. A specialized model is purpose-built for a specific task in mind. The model is trained on domain-specific data only and is alien to any other information. For example, an application built to transcribe finance meetings in Arabic will work best with a specialized model since it does not need to understand any other jargon or language.

Specialized models display excellent performance on the target domain. They are built with specific parameters such as sampling rate, bit-depth, and overall audio clarity. Hence, they integrate flawlessly with existing applications.

However, you are unlikely to find an open-source model tailored by default to your precise task and infrastructure. The model will have to be built and trained from scratch. This requires in-house experts, hardware, and training data, all of which are challenging to acquire.

Some speech-to-text providers offer à la carte models like this, but that is, of course, the most expensive option and is not always needed given the alternatives mentioned above. 

Custom vocabulary vs fine-tuning vs custom models in ASR

Whisper model fine-tuning

Now that we understand fine-tuning in the context of ASR, let's explore the fine-tuning of the OpenAI Whisper model. Whisper is open-source and available on HuggingFace to download, use, and fine-tune. The following steps will walk you through the fine-tuning process.

1. Load dataset

The first step is to find a suitable dataset. If you are experimenting, then the Common Voice 11 dataset is a good starting point. It contains over 16,000 validated hours of audio in roughly 100 languages. Each sample provides an audio clip and its transcription. The dataset can be easily loaded from HuggingFace using the `datasets` module in Python.

You must pass the dataset identifier, ‘mozilla-foundation/common_voice_11_0’, and the desired language to the load_dataset function. For experimentation, you may fine-tune on Arabic, whose language identifier is ‘ar’.
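A minimal sketch of this step could look like the following. The `load_common_voice` wrapper is a hypothetical helper, and the import is done lazily so the snippet can be read and imported without triggering a download. Depending on the installed version of `datasets` and on the dataset's access terms, authentication or extra arguments may be needed, so treat this as a starting point rather than a guaranteed recipe.

```python
DATASET_ID = "mozilla-foundation/common_voice_11_0"
LANGUAGE_CODE = "ar"  # Arabic


def load_common_voice(split: str = "train"):
    """Load one split of the Arabic subset of Common Voice 11 (needs network access)."""
    from datasets import load_dataset  # imported here so nothing downloads on import

    return load_dataset(DATASET_ID, LANGUAGE_CODE, split=split)


# Example usage (not executed here because it downloads data):
# train_set = load_common_voice("train")
# test_set = load_common_voice("test")
```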

2. Load a pre-trained Whisper model

The Whisper model is available in various sizes. It ranges from whisper-tiny to whisper-large, and each size has a different number of parameters.

The model selection depends on your use case and available hardware. If you are experimenting, then the whisper-small (with 244 M parameters) should be sufficient. For practical applications, you should use the whisper-large as it performs the best out of all in terms of accuracy, though with a bigger strain on the hardware.

To start with the fine-tuning process, you will have to initialize the whisper feature extractor and tokenizer. Both these modules are provided by HuggingFace in their `transformers` package.

Initialize feature extractor

The feature extractor is loaded for the model that is to be fine-tuned. Since we will be working with whisper-small, we pass the respective model path ("openai/whisper-small") to the initializer. The feature extractor will clip or pad the audio into 30-second segments and create a log-Mel spectrogram of each.
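The clip-or-pad behaviour can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the feature extractor's actual implementation, and `pad_or_trim` is a hypothetical helper name.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # Whisper's expected sampling rate
CHUNK_SECONDS = 30                        # fixed segment length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per segment


def pad_or_trim(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Cut audio longer than `length`, or right-pad shorter audio with zeros."""
    if audio.shape[0] >= length:
        return audio[:length]
    return np.pad(audio, (0, length - audio.shape[0]))


short_clip = pad_or_trim(np.ones(SAMPLE_RATE))   # 1 second of audio
long_clip = pad_or_trim(np.ones(500_000))        # longer than 30 seconds
```

Both clips end up with exactly 480,000 samples, which is what lets every spectrogram share the same shape.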

Initialize tokenizer

A pre-trained Whisper tokenizer can be loaded from the transformers package. Similar to the feature extractor, the tokenizer must also be loaded for the whisper-small model using its respective path and the required language. The tokenizer encodes the text sentences into sequences of token IDs for training.
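Both modules might be initialized along these lines. `load_whisper_preprocessors` is a hypothetical wrapper, and the `language` and `task` values are assumptions chosen for the Arabic transcription example. The import is lazy so that nothing is downloaded until the function is called.

```python
MODEL_ID = "openai/whisper-small"


def load_whisper_preprocessors(language: str = "Arabic", task: str = "transcribe"):
    """Return the feature extractor and tokenizer for whisper-small (needs network access)."""
    from transformers import WhisperFeatureExtractor, WhisperTokenizer

    feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_ID)
    tokenizer = WhisperTokenizer.from_pretrained(MODEL_ID, language=language, task=task)
    return feature_extractor, tokenizer


# Example usage (not executed here because it downloads files):
# feature_extractor, tokenizer = load_whisper_preprocessors()
```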

3. Process data

After initializing the modules, we need to process the data such that it can be passed to the models. The data has to be processed in three ways.

Resample audio

Whisper accepts audio sampled at 16kHz. The audio files present in our dataset are all 48kHz, so they must be resampled before use. You can use various libraries like librosa to resample the audio files. Loop over the dataset, resample each clip, and keep the result in memory.
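As one possible sketch of the resampling step, the snippet below uses SciPy's polyphase resampler rather than librosa. This is a swapped-in alternative chosen to keep the example small, and `resample_audio` is a hypothetical helper. The 440 Hz test tone is synthetic data used only to exercise the function.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_audio(audio: np.ndarray, orig_sr: int = 48_000, target_sr: int = 16_000) -> np.ndarray:
    """Resample a 1-D signal from `orig_sr` to `target_sr` with anti-alias filtering."""
    factor = gcd(orig_sr, target_sr)
    return resample_poly(audio, up=target_sr // factor, down=orig_sr // factor)


# One second of a synthetic 440 Hz tone sampled at 48 kHz.
t = np.arange(48_000) / 48_000
tone_48k = np.sin(2 * np.pi * 440 * t)

tone_16k = resample_audio(tone_48k)
```

One second of 48kHz audio becomes 16,000 samples, so each clip shrinks to a third of its original length.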

Apply feature extraction

Once the audio is resampled, we can proceed to extracting the log-Mel spectrograms. We will loop over the audio files and use the initialized feature extractor to convert the 1-D audio arrays to 2-D spectrograms.

Tokenize outputs

Lastly, we process the output text by passing it to the pre-trained tokenizer. The tokenizer will encode each text sentence to a 1-D vector which is understood by the whisper model.
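The feature-extraction and tokenization steps are often combined into one per-example function. The sketch below assumes each dataset row has an "audio" field (with "array" and "sampling_rate") and a "sentence" field. Those field names are assumptions about the dataset layout. The two stand-in callables are hypothetical and exist only so the snippet runs without downloading a model. In practice you would pass the real feature extractor and tokenizer.

```python
from types import SimpleNamespace


def prepare_example(example, feature_extractor, tokenizer):
    """Turn one raw dataset row into model inputs and label token IDs."""
    audio = example["audio"]
    features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    return {
        "input_features": features.input_features[0],
        "labels": tokenizer(example["sentence"]).input_ids,
    }


# Stand-in callables (hypothetical) that mimic the shape of the real outputs.
def fake_feature_extractor(array, sampling_rate):
    return SimpleNamespace(input_features=[[len(array), sampling_rate]])


def fake_tokenizer(text):
    return SimpleNamespace(input_ids=[ord(ch) for ch in text])


row = {"audio": {"array": [0.0] * 4, "sampling_rate": 16_000}, "sentence": "ab"}
prepared = prepare_example(row, fake_feature_extractor, fake_tokenizer)
```

With a HuggingFace dataset, a function of this kind would typically be applied to every row via the dataset's `map` method.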

4. Prepare training environment

Now that we have the data prepared, we can set up some necessities for training Whisper.

Create PyTorch tensors

The input features and output encodings must be converted to batches of PyTorch tensors recognized by the Whisper model. The feature extractor and tokenizer modules each provide a .pad function, which pads the inputs and the outputs, respectively, to a common length. Passing the `return_tensors="pt"` parameter makes the function return PyTorch tensors.

The padded tokens in the output must be replaced with -100 so that they are ignored during loss calculation.
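The masking step can be illustrated without any model at all. The NumPy sketch below pads a batch of label sequences and then replaces the padded positions with -100. `pad_labels` is a hypothetical helper, and the token IDs are made up for the example.

```python
import numpy as np

IGNORE_INDEX = -100  # label value that the loss function skips


def pad_labels(label_ids, pad_token_id):
    """Pad label sequences to equal length, then mask the padding with -100."""
    max_len = max(len(ids) for ids in label_ids)
    batch = np.full((len(label_ids), max_len), pad_token_id, dtype=np.int64)
    is_real = np.zeros((len(label_ids), max_len), dtype=bool)
    for row, ids in enumerate(label_ids):
        batch[row, : len(ids)] = ids
        is_real[row, : len(ids)] = True
    # Mask by position, not by value, so a real token that happens to share
    # the pad ID is never hidden from the loss.
    return np.where(is_real, batch, IGNORE_INDEX)


labels = pad_labels([[5, 6, 7], [8]], pad_token_id=0)
```

Here `labels` is `[[5, 6, 7], [8, -100, -100]]`: the shorter sequence is padded, and only its real token contributes to the loss.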

Define evaluation metrics

The most common ASR evaluation metric is word error rate (WER). We can initialize the metric from HuggingFace’s ‘evaluate’ package. In addition to the WER metric, we also need to create a function that restores the padded tokens and decodes the predicted vectors and true labels to strings so the metric can be calculated.
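In practice the metric would come from the ‘evaluate’ package (for example via `evaluate.load("wer")`). To show what it measures, here is a from-scratch sketch: WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. `word_error_rate` is a hypothetical helper, not the library's implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The hypothesis has one substituted word and one missing word against six reference words, so `score` is 2/6, or roughly 0.33.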

Load a pre-trained checkpoint

Finally, we need to load the pre-trained weights for the whisper-small model. This can be done using the “WhisperForConditionalGeneration” module from transformers.

Define training arguments

The trainer requires certain arguments to be set before training can be executed. Key ones include:

  • output_dir: Location where the model checkpoints will be saved.
  • generation_max_length: Maximum number of tokens to autoregressively generate during evaluation.
  • save_steps: Number of steps after which a model checkpoint is saved.
  • eval_steps: Number of steps after which the model is evaluated.
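A hedged sketch of these arguments is shown below. The values and the output path are illustrative assumptions, not recommendations from this article. `predict_with_generate` is added on the assumption that evaluation should use generated text, which WER needs. The argument that switches evaluation to a step-based schedule has been renamed across `transformers` versions, so it is deliberately left out here and should be checked against your installed version.

```python
# Illustrative values only; tune them for your dataset and hardware.
TRAINING_KWARGS = {
    "output_dir": "./whisper-small-ar",  # hypothetical checkpoint directory
    "predict_with_generate": True,       # generate text during evaluation
    "generation_max_length": 225,
    "save_steps": 1000,
    "eval_steps": 1000,
}


def build_training_args(**overrides):
    """Create Seq2SeqTrainingArguments from the defaults above plus overrides."""
    from transformers import Seq2SeqTrainingArguments  # lazy import

    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})


# Example usage (requires transformers to be installed):
# training_args = build_training_args(save_steps=500)
```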

Define the Seq2SeqTrainer

All the training arguments are passed to the Seq2SeqTrainer module from HuggingFace. The module is built for sequence-to-sequence training and identifies the model and training task via the defined parameters. The trainer requires the following to prepare the training environment:

  • Training Arguments: The arguments discussed in the last section.
  • Pre-trained Model: The whisper-small checkpoint loaded earlier in this step.
  • Train and Evaluation Dataset: The dataset prepared in step 3.
  • Compute Metrics: The WER metric initialized earlier in this step.
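Putting the pieces together might look like the sketch below. `load_model` and `build_trainer` are hypothetical wrappers. The data collator argument is an assumption, since the trainer needs some way to batch the examples, and the imports are lazy so nothing is downloaded until the functions are called.

```python
MODEL_ID = "openai/whisper-small"


def load_model():
    """Load the pre-trained whisper-small checkpoint (needs network access)."""
    from transformers import WhisperForConditionalGeneration

    return WhisperForConditionalGeneration.from_pretrained(MODEL_ID)


def build_trainer(model, training_args, train_dataset, eval_dataset,
                  data_collator, compute_metrics):
    """Wire the model, arguments, data, and metric into a Seq2SeqTrainer."""
    from transformers import Seq2SeqTrainer

    return Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )


# Example usage (not executed here):
# trainer = build_trainer(load_model(), training_args, train_set, test_set,
#                         data_collator, compute_metrics)
# trainer.train()
```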

5. Execute training

With all the parameters and data in place, the fine-tuning can be executed. This is a simple step as the Seq2SeqTrainer provides a ‘.train’ function which will execute the training. It will load and process the data as required, update model weights, save the fine-tuned checkpoints and log the WER as per the provided arguments.

Following the steps, the Whisper model can be fine-tuned on any dataset, including a custom one.

Final remarks

Fine-tuning ASR models helps train them for downstream tasks with little use of resources and data. Fine-tuning allows the model to expand its knowledge base using new data and be used in a new domain or task. This learning method can be used to train existing ASR models on new languages, dialects, and accents. It saves businesses from the hassle of training a specialized model and allows them to integrate AI into their workflows.

Powered by our proprietary Whisper-Zero model, Gladia provides a speech-to-text API for all common use cases and verticals. Our product unlocks transcriptions in 99 languages, with custom vocabulary, and is tailored to capture a variety of accents. To learn more, get started for free today.

Article written for Gladia by Haziqa Sajid.

