Top 5 Whisper GitHub projects: A practical guide for programmers
Published in April 2024
In September 2022, OpenAI unveiled Whisper, an open-source automatic speech recognition (ASR) model trained on an impressive dataset of 680,000 hours of diverse speech. Since its release, the model has received widespread recognition for its remarkable robustness and accuracy: it rivals human performance in English speech recognition and has set a new standard for multilingual transcription and translation.
This groundbreaking model has not only captured the attention of the academic community but has also led to a proliferation of high-quality open-source projects.
In this guide, we delve into five such projects – whisper.cpp, use-whisper, buzz, whisperX and distil-whisper – for their innovative applications, practical utility, and unique approaches to leveraging Whisper's capabilities. These projects exemplify the versatility of Whisper in various programming environments, from embedded systems to web applications.
1. whisper.cpp by ggerganov
What it does
The project whisper.cpp, developed by ggerganov, ports OpenAI's Whisper model to the C/C++ ecosystem. By adapting the model to a C/C++-compatible format (ggml), whisper.cpp significantly speeds up speech-to-text inference and brings Whisper's advanced speech-to-text capabilities to environments where C/C++ is the language of choice, including command-line applications.
Project activity and maintenance
Highly active, with 809 commits at the time of writing.
Features and uses
Use cases: Ideal for embedded systems, desktop applications, or integration with existing C/C++ codebases.
Platform support: Supports various platforms, including Apple Silicon, Android, and Windows, making it suitable for cross-platform applications.
Application areas: Useful in real-time audio processing and systems with limited resources due to its focus on performance and efficiency.
Why we like it
Whisper.cpp is a testament to the adaptability of AI models across programming landscapes. Its Python bindings make it approachable for a wide range of developers, while the core library brings the power of Whisper to those who prefer working in a C/C++ environment.
Example use
Here's a quick-start guide for whisper.cpp:
Step 1: Clone the repository
Start by cloning the whisper.cpp repository to your local machine. This can be done using the following Git command:
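git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp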
Step 2: Download a ggml model
Next, download a Whisper model that has been converted to the ggml format. For instance, to download the base.en model, use the following bash script included in the repository:
bash ./models/download-ggml-model.sh base.en
This command will download the `base` English model, which balances performance and accuracy.
Step 3: Optional - convert models yourself
If you prefer to convert Whisper models to ggml format yourself, you can find instructions in the `models/README.md` file within the repository. This step is optional and typically not necessary unless you have specific requirements.
Step 4: Build the main example
Compile the main example application provided in the repository. This is done using the make command:
make
Step 5: Transcribe an audio file
Finally, use the compiled application to transcribe an audio file. For example, to transcribe the sample file `jfk.wav`, execute the following command:
./main -f samples/jfk.wav
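By default, main loads the model from models/ggml-base.en.bin, which is where the download script above places it. If your model is stored elsewhere, you can point to it explicitly with the -m flag (a small variation on the repository's example):
./main -m models/ggml-base.en.bin -f samples/jfk.wav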
Following these steps will get you started with whisper.cpp, allowing you to experiment with transcribing audio files using the power of OpenAI's Whisper model in a C++ environment. Alternatively, you can use a Python wrapper such as whispercpp:
from whispercpp import Whisper
import ffmpeg
import numpy as np

w = Whisper.from_pretrained("tiny.en")

# Whisper models expect 16 kHz mono audio
sample_rate = 16000

# Decode the audio file to raw 16-bit PCM using ffmpeg
try:
    y, _ = (
        ffmpeg.input("sample.wav", threads=0)
        .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sample_rate)
        .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
    )
except ffmpeg.Error as e:
    raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

# Convert the PCM bytes to float32 samples in [-1.0, 1.0]
arr = np.frombuffer(y, np.int16).flatten().astype(np.float32) / 32768.0
print(w.transcribe(arr))
2. use-whisper by chengsokdara
What it does
Use-whisper, created by chengsokdara, is a React hook designed to seamlessly integrate OpenAI's Whisper model with web applications. It offers features like speech recording and real-time transcription, making it a powerful tool for developers working with React. The hook simplifies the process of adding sophisticated speech-to-text functionality to web interfaces.
Features and uses
Educational tools: Can be employed in language learning platforms, providing immediate transcription for language practice, pronunciation correction, and other interactive educational activities.
Accessibility features: Enhances the accessibility of web applications for users with disabilities. Speech-to-text capabilities can aid users who have difficulties with traditional input methods, such as typing.
Real-time communication platforms: Integrating use-whisper in platforms like chat applications, web conferencing tools, or customer service interfaces allows for real-time captioning and transcription, benefiting both users with hearing impairments and those in noisy environments.
Content creation tools: Useful for journalists, content creators, and podcasters for real-time transcription of interviews, creating subtitles, or generating written content from spoken words.
Why we like it
The simplicity and effectiveness of use-whisper in providing real-time transcription capabilities to web applications are commendable. Its silence detection feature is a notable enhancement, improving user experience in applications like virtual meetings or language learning tools.
Example use
// React example
import { useWhisper } from '@chengsokdara/use-whisper';

function App() {
  const { transcript, startRecording, stopRecording } = useWhisper({
    apiKey: 'YOUR_OPENAI_API_TOKEN',
    removeSilence: true, // strip silent chunks before transcription
  });
  return (
    <div>
      <button onClick={() => startRecording()}>Start</button>
      <button onClick={() => stopRecording()}>Stop</button>
      <p>Transcript: {transcript.text}</p>
    </div>
  );
}
3. buzz by chidiwilliams
What it does
Buzz, developed by chidiwilliams, offers a variety of functionalities that enhance its versatility in speech-to-text conversion. It supports multiple model backends, including Whisper, Whisper.cpp, Hugging Face models, Faster Whisper, and the OpenAI API, allowing users to choose the one most suitable for their specific needs. Additionally, Buzz is available as a Mac app on the App Store, catering to a broader user base. A notable feature of Buzz is its ability to operate entirely offline, keeping audio and transcriptions on the user's device. This offline functionality is particularly valuable for users concerned about data security and privacy.
Use cases
Valuable for individuals with hearing impairments and for anyone concerned about the privacy of online transcription tools. A downloadable desktop app is also available.
Features and uses
Nature: An application for transcribing audio offline, using OpenAI’s Whisper.
Accessibility: A valuable tool for individuals with hearing impairments, offering more control and independence.
Privacy concerns: Addresses privacy issues by functioning offline and not storing conversation contents.
Enhanced app experience: The Mac-native Buzz app available on the App Store features a more user-friendly interface, audio playback, drag-and-drop import, transcript editing, and search functionality.
Versatility: Tested on various systems, including Ubuntu, indicating wide operating system compatibility.
Generally speaking, users appreciate its offline functionality and independence from third-party cloud solutions. Performance without a CUDA-capable GPU is a point of consideration, especially for users with less powerful hardware.
Why we like it
Buzz addresses key concerns like privacy and accessibility. Its offline functionality and independence from cloud solutions are particularly appealing. The Mac-native Buzz app enhances user experience with features like audio playback and transcript editing.
Example use
# installation
pip install buzz-captions
# Command Line Example
python -m buzz transcribe audio-file.mp3
Here the CLI transcribes a file completely offline, and installation is a single pip command.
4. whisperX by m-bain
What it does
WhisperX, developed by m-bain, is a cutting-edge extension of OpenAI's Whisper model, enhancing it with advanced features like word-level timestamps and speaker diarization. This project stands out for its ability to provide fast and accurate automatic speech recognition, which is crucial for applications requiring detailed and precise transcriptions.
Project activity and maintenance
Active, with 336 commits and the most recent update in November 2023.
Why we like it
WhisperX stands out for its detailed audio transcription capabilities. The addition of speaker diarization and word-level timestamps make it invaluable for tasks requiring high transcription precision.
Example use
To use whisperX from its GitHub repository, follow these steps:
Step 1: Set up the environment
Ensure you have Python 3.10 and PyTorch 2.0 installed. You'll also need NVIDIA libraries such as cuBLAS 11.x and cuDNN 8.x if you plan to run on a GPU. See the whisperX repository on GitHub for full details.
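Step 2: Install whisperX
A minimal setup sketch based on the repository's README at the time of writing (package versions and the exact install command may have changed, so check the repo for current instructions):
conda create --name whisperx python=3.10
conda activate whisperx
pip install whisperx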
Step 3: Transcribe audio from the command line
Once installed, you can use whisperX to transcribe audio files directly from the command line.
Here's an example command:
whisperx sample01.wav --model base --diarize --highlight_words True
Note that the --diarize option relies on pyannote.audio speaker-diarization models, which require a Hugging Face access token. Alternatively, you can import whisperx directly in Python.
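Here is a minimal sketch of the Python API, following the usage shown in the whisperX README (treat model names and parameters as illustrative; they may have changed since this was written):
# Python example (sketch based on the whisperX README)
import whisperx

device = "cuda"  # use "cpu" (with compute_type="int8") if no GPU is available
audio_file = "sample01.wav"

# 1. Transcribe with the batched Whisper backend
model = whisperx.load_model("base", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)
print(result["segments"])  # segment-level transcript

# 2. Align the output to obtain word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
print(result["segments"])  # segments now carry word-level timestamps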
Use cases
Ideal for detailed audio transcription tasks where distinguishing between speakers or precise timing is crucial.
5. distil-whisper by huggingface
What it does
Distil-Whisper is a distilled version of the OpenAI Whisper model, developed by Hugging Face. It is designed to provide fast, efficient speech recognition while maintaining high accuracy: the distilled model is notably faster and smaller than the original, making it well suited to low-latency or resource-constrained environments. Because its output distribution closely matches the original model's, Distil-Whisper can also act as an assistant to the main Whisper model in speculative decoding, accelerating inference while producing the same outputs as Whisper alone.
Project activity and maintenance
Active; the project launched in late 2023.
Example use
# Python Example using Hugging Face Transformers
from transformers import pipeline
model_id = "distil-whisper/distil-large-v2"
asr_pipeline = pipeline("automatic-speech-recognition", model=model_id)
transcription = asr_pipeline("path/to/audio/file.mp3")
print(transcription["text"])
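And here is a sketch of the speculative-decoding setup described above, where Distil-Whisper drafts tokens that the full Whisper model verifies. It follows the pattern from the Distil-Whisper model card, so treat the exact model IDs and arguments as subject to change:
# Python example: Distil-Whisper as an assistant model (speculative decoding)
import torch
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The distilled model proposes draft tokens...
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# ...and the full Whisper model verifies them, so output matches Whisper's
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
print(pipe("path/to/audio/file.mp3")["text"])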
Features and uses
Nature: This is a more efficient, distilled version of Whisper; however, as of this writing, it only supports English.
Efficiency: Offers faster inference and smaller size while maintaining similar accuracy to Whisper.
Robustness: Performs well in low signal-to-noise scenarios and shows fewer word duplicates and lower insertion error rates than Whisper.
Application: Can be used as an assistant model to Whisper, providing a faster alternative for time-critical applications like live transcription or real-time translation.
Why we like it
Distil-whisper strikes a balance between efficiency and accuracy, making it suitable for time-critical applications. Its robust performance in various environments adds to its appeal in both professional and academic settings.
Conclusion
The Whisper-inspired projects described above demonstrate the versatility of speech recognition technology in various programming environments. From enhancing web applications with real-time transcription to creating private, offline transcription tools, these projects offer programmers a wealth of possibilities for application development. This guide aims to provide a starting point for exploring these technologies, empowering programmers to integrate advanced speech recognition into their solutions.
At Gladia, we build an enhanced version of Whisper in the form of a single API, optimized for enterprise-grade projects. If you’re curious about the difference between our API and vanilla Whisper, feel free to check the landing page for our latest model, Whisper-Zero. You may also want to check out this blog post on the key factors to consider when choosing the open-source Whisper route vs. an all-batteries-included API. To try the API, sign up for free below.