Enhancing real-time transcription with WebSockets and Golang
Published on
Mar 2024
Communication has evolved from sending post letters and waiting at phone booths to digital connections, happening simultaneously at a high speed, especially in business environments. Given the unprecedented volumes of voice data generated by companies daily, the ability to document customer calls, conferences, and online meetings in real-time and asynchronously is becoming crucial.
Real-time communication in the past faced technological challenges, including high latency, poor scalability, slow data transfer rates, and heavy infrastructure loads. Advancements of the last decade in Automatic Speech Recognition (ASR), also known as speech-to-text, and WebSockets are changing the narrative, paving the way for efficient, accurate, and affordable real-time transcription solutions at scale.
In this blog, we’ll explore a way to leverage the fusion of WebSockets, Golang, and our speech-to-text API to build applications with real-time transcription capabilities. Sounds like something for you? Let’s begin!
What is real-time transcription?
Real-time transcription is the process of converting live speech to text as it is spoken. This conversion is performed by ASR and Natural Language Processing (NLP) models, used as part of hybrid AI/ML architectures, which produce a near-instant, continuous transformation of spoken words into written text.
The ability to display the transcript as it’s being spoken out with minimal perceptible delay is a key technical requirement for this ASR feature. Latency is the delay between when a speaker utters a word or phrase and when the ASR system produces the corresponding transcription result.
The acceptable range for low latency highly depends on each application's specific needs and end-user expectations. Our average latency at Gladia is around 800 milliseconds, which is optimal for most voice assistants, communication platforms, and industrial and media apps that require real-time control and response.
Most common applications of real-time transcription
Real-time bidirectional communication has diverse applications across multiple industries, impacting day-to-day operations. Real-time transcription is especially useful in scenarios where you need to react to what's being said directly, where very low latency or wait time is required.
Conversational bots are one of the most common use cases for live transcription, which is equally in demand for real-time captioning during live events. Here are some other business scenarios where the technology is becoming increasingly popular:
Online and hybrid meetings: Across teams and individuals, live transcription is valuable in remote and onsite meetings. It helps team members catch up on discussions and brainstorming sessions at their convenience.
Customer support and call centers: Call centers and customer service departments can improve their workflow with live transcription, particularly in the context of call bots, to ensure an instant response to customer inquiries and better customer satisfaction in the long run.
Healthcare: For doctor-patient consultations happening both on site and remotely, live transcription helps healthcare professionals automate note-taking and focus fully on patient examinations.
Finance: For financial institutions, real-time transcription provides updated information to monitor financial markets and ensure swift response to client queries with voice-based agents.
Media: Live broadcasting of media events, such as international conferences and forums, requires real-time captions and translation.
What’s a WebSocket?
WebSocket is an event-driven communication protocol used for real-time communication. Unlike the request-response model of HTTP, where a client first sends a request and then waits for a response (with wait time spanning from seconds to minutes depending on the API provider), a WebSocket connection is bidirectional: once it is established, the client and server can both send data at any time, enabling simultaneous data transmission.
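A WebSocket connection starts life as an ordinary HTTP request that is upgraded. During that opening handshake, defined in RFC 6455, the server proves it understood the upgrade by hashing the client's `Sec-WebSocket-Key` header together with a fixed GUID and returning the result as `Sec-WebSocket-Accept`. Here is a minimal, standard-library-only sketch of that one step; the `acceptKey` helper name is ours, and in practice a WebSocket library would handle this for you.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// websocketGUID is the fixed string RFC 6455 defines for the handshake.
const websocketGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey derives the Sec-WebSocket-Accept response header value from
// the client's Sec-WebSocket-Key request header.
func acceptKey(clientKey string) string {
	sum := sha1.Sum([]byte(clientKey + websocketGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// Sample key taken from the handshake example in RFC 6455.
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))
	// prints s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
}
```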
Advantages of using WebSockets for real-time transcription
Improved user experience through low latency. WebSockets keep delays in transmitting data to a minimum, resulting in nearly instant updates. Users receive transcribed content almost immediately, which creates an interactive experience.
Full-duplex communication. In a full-duplex communication mode, the client does not have to wait for the server before responding to a message – just as you can always send messages over chats or video/audio calls without having to wait for a reply explicitly.
Scalable and flexible. Network overheads are minimized when using WebSockets, improving their scalability and flexibility. Services can easily adapt to varying transcription demands, which limits downtime when the number of users increases.
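To make the full-duplex idea concrete, here is a small sketch in which a client keeps sending chunks while it is simultaneously reading replies. It deliberately uses an in-memory `net.Pipe` connection rather than a real WebSocket, and `transcribeStub` is a hypothetical stand-in that merely upper-cases text, so treat it as an illustration of the communication pattern only.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// transcribeStub stands in for a transcription server: it answers every
// newline-terminated chunk it reads with an upper-cased "transcript" line.
func transcribeStub(conn net.Conn) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		fmt.Fprintf(conn, "transcript: %s\n", strings.ToUpper(scanner.Text()))
	}
}

// exchange sends every chunk from one goroutine while the caller's
// goroutine reads replies from the same connection at the same time.
func exchange(chunks []string) []string {
	client, server := net.Pipe()
	go transcribeStub(server)

	// Sender: does not wait for a reply before sending the next chunk.
	go func() {
		for _, c := range chunks {
			fmt.Fprintf(client, "%s\n", c)
		}
	}()

	// Receiver: collects one reply per chunk, then closes the connection.
	replies := make([]string, 0, len(chunks))
	reader := bufio.NewScanner(client)
	for len(replies) < len(chunks) && reader.Scan() {
		replies = append(replies, reader.Text())
	}
	client.Close()
	return replies
}

func main() {
	for _, reply := range exchange([]string{"hello", "world"}) {
		fmt.Println(reply)
	}
}
```

Because `net.Pipe` is unbuffered, a client that tried to send everything first and read afterwards would stall on the second chunk; separating the sender and receiver is what keeps the conversation flowing.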
WebSockets and Go: How to set up and get the server running
Golang, aka Go, is a statically typed, compiled programming language designed at Google by Rob Pike, Robert Griesemer, and Ken Thompson starting in 2007. It is widely used for backend development, cloud computing, and DevOps.
In this section, we’ll look at how to set up a basic WebSocket in Go.
Prerequisites:
Go installed on your system (Mac, Linux, or Windows)
The HTML file, when displayed on your browser, contains a text box to enter a message and a Send button to transmit the message via the WebSocket to the server.
Golang WebSocket
Step 6: Running the WebSocket server
To run the WebSocket server, enter this command in your terminal:
go run server.go
This command starts the WebSocket server on port 8080.
When a message is sent with the form above, it flows from the client-side web form to the server, where it is displayed in the terminal. This interaction demonstrates the bidirectional communication of the WebSocket server.
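Under the hood, the message typed into the form does not travel as raw text: the browser wraps it in a WebSocket frame and masks the payload with a four-byte key, which the server must undo before printing it. The sketch below decodes one short masked text frame using only the standard library. The `decodeTextFrame` helper is ours and intentionally incomplete (no fragmentation, no payloads over 125 bytes); a WebSocket library normally does this work.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeTextFrame extracts the payload of a single, short, masked text
// frame as sent by a browser. Sketch only: it rejects fragmented frames,
// non-text opcodes, unmasked frames, and payloads longer than 125 bytes.
func decodeTextFrame(frame []byte) (string, error) {
	if len(frame) < 6 {
		return "", errors.New("frame too short")
	}
	fin := frame[0]&0x80 != 0      // final fragment flag
	opcode := frame[0] & 0x0F      // 0x1 means text
	masked := frame[1]&0x80 != 0   // client frames must be masked
	length := int(frame[1] & 0x7F) // 7-bit payload length
	if !fin || opcode != 0x1 || !masked || length > 125 {
		return "", errors.New("unsupported frame")
	}
	if len(frame) < 6+length {
		return "", errors.New("truncated payload")
	}
	key := frame[2:6]
	payload := make([]byte, length)
	for i := range payload {
		payload[i] = frame[6+i] ^ key[i%4] // unmask byte by byte
	}
	return string(payload), nil
}

func main() {
	// Masked "Hello" frame from the examples in RFC 6455.
	frame := []byte{0x81, 0x85, 0x37, 0xfa, 0x21, 0x3d, 0x7f, 0x9f, 0x4d, 0x51, 0x58}
	text, err := decodeTextFrame(frame)
	if err != nil {
		fmt.Println("Error decoding frame:", err)
		return
	}
	fmt.Println(text) // prints Hello
}
```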
Using speech-to-text APIs for real-time transcriptions
Application Programming Interfaces (APIs) are interfaces that let one application talk to another: a client sends a request to the provider's server and receives a response, without needing to know how the service is implemented.
Using speech AI APIs for real-time transcriptions offers advantages such as:
Accuracy: They leverage a hybrid combination of advanced speech recognition models and proprietary algorithms to produce highly accurate transcriptions and audio intelligence features, often fine-tuned to specific languages and industry use cases.
Efficiency: APIs are designed to be efficient at scale, making them ideal for handling real-time transcription demands, even in high-traffic scenarios.
Integration: APIs come in ‘all-batteries-included’ packages that can be directly integrated into various applications and systems, providing flexibility for developers irrespective of their AI expertise while reducing hardware and setup costs.
Gladia's real-time transcription API, powered by optimized Whisper ASR
Gladia's Audio Intelligence API was designed with the goal of simplifying Speech AI integration for developers. It offers a wide range of functionalities, including real-time transcription, that can be built directly into voice assistants, call and note-taking bots, and other speech-based enterprise applications whatever the tech stack.
OpenAI’s original Whisper ASR, on which Gladia’s API is primarily based, does not natively support live transcription or WebSockets. Our team therefore re-engineered Whisper to enable both. With latency as low as 400 milliseconds, the API transcribes audio and video in real time.
Here are the steps to integrate Gladia's live transcription API with your Golang application:
Step 1: Sign up on Gladia to get an API key
Before you can begin integrating Gladia's Audio Intelligence API, you'll need to sign up and obtain an API key. This key will be used to authenticate your requests to the API.
To proceed, create a free account with 10h/month of transcription included on app.gladia.io.
Step 2: Integrating Gladia's real-time transcription API with the Golang app
Following the code examples on our developer documentation, here’s a Golang app that transcribes an audio file using Gladia’s API.
Note: You need an audio file to transcribe; check the supported media formats first. This example uses an m4a audio file.
In your code editor, create a file `audio.go`, and enter the code:
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Step 1: Open the audio file and copy it into a multipart form buffer
	file, err := os.Open("path_to_audio_file")
	// Replace "path_to_audio_file" with the path to the audio on your system
	if err != nil {
		fmt.Println("Error opening audio file:", err)
		return
	}
	defer file.Close()

	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)
	part, err := writer.CreateFormFile("audio", "audioFile")
	if err != nil {
		fmt.Println("Error writing to buffer:", err)
		return
	}
	_, err = io.Copy(part, file)
	if err != nil {
		fmt.Println("Error copying file to buffer:", err)
		return
	}
	// Closing the writer appends the final multipart boundary
	if err = writer.Close(); err != nil {
		fmt.Println("Error finalizing form data:", err)
		return
	}

	// Step 2: Create a request with Gladia’s endpoint and add the API key for authorization
	req, err := http.NewRequest("POST", "https://api.gladia.io/audio/text/audio-transcription/", body)
	if err != nil {
		fmt.Println("Error creating request:", err)
		return
	}
	req.Header.Set("Content-Type", writer.FormDataContentType())
	req.Header.Set("x-gladia-key", "YOUR_API_KEY")
	// Replace "YOUR_API_KEY" with your Gladia API key

	// Step 3: Send the request
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Error sending request:", err)
		return
	}
	defer resp.Body.Close()

	// Step 4: Handle the response
	fmt.Println("Response Status:", resp.Status)

	// Decode the JSON response to extract the transcription
	var jsonResponse map[string]interface{}
	err = json.NewDecoder(resp.Body).Decode(&jsonResponse)
	if err != nil {
		fmt.Println("Error decoding JSON response:", err)
		return
	}

	// Access the 'prediction' field and display the transcribed text
	predictions, ok := jsonResponse["prediction"].([]interface{})
	if !ok {
		fmt.Println("No prediction data found in the response")
		return
	}
	for _, prediction := range predictions {
		predictionMap, ok := prediction.(map[string]interface{})
		if !ok {
			fmt.Println("Invalid prediction format")
			return
		}
		transcription, ok := predictionMap["transcription"].(string)
		if !ok {
			fmt.Println("No transcription found in the prediction data")
			return
		}
		fmt.Println("Transcription:", transcription)
	}
}
Replace:
"path_to_audio_file" with the path to the audio file you want to transcribe on your computer.
"YOUR_API_KEY" with your API key from app.gladia.io.
Code breakdown:
In the program above, `package main` declares that the file belongs to an executable program rather than a library; execution starts in its `main` function.
Next, we import some packages:
bytes: for creating an in-memory buffer to handle the audio file data.
encoding/json: encodes and decodes JSON data.
fmt: handles formatted I/O and prints to the console.
io: copies the audio file data into the buffer.
mime/multipart: builds the multipart/form-data request body that carries the audio file.
net/http: creates and handles HTTP requests to communicate with Gladia's API.
os: opens the audio file.
The `main` function, the program's entry point, handles the integration with Gladia's API. This function:
Opens the audio file and copies it into an in-memory multipart form buffer.
Sets up the HTTP request with the audio file in the body and the API key in the `x-gladia-key` header.
Sends the request, handles the response, and displays the transcribed text upon receiving a successful response (status 200 OK).
Step 3: Run the program in your terminal
Enter `go run audio.go` in the terminal to execute the program.
First, it prints a `200 OK` response status, confirming a successful connection and request to Gladia's API.
It then displays the transcribed text.
The program prints the transcribed text without the additional JSON structure (timestamps, confidence scores, and other data) as it extracts the `transcription` field from the JSON response. This is done using Golang’s `encoding/json` package to parse the JSON data.
Real-time transcription, using WebSockets and the optimized Whisper ASR API by Gladia, significantly enhances real-time communication by offering live speech-to-text functionality. As this technology evolves, it promises increased productivity and the ability to derive more accurate insights from unstructured audio data.
To learn more about Gladia’s approach to enhancing the Whisper transcription performance for companies, check out our new model, Whisper-Zero, or sign up for the API directly below.
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.