Enhancing real-time transcription with WebSockets and Golang

Published on Dec 14, 2023

Communication has evolved from posting letters and waiting at phone booths to digital connections that happen simultaneously and at high speed, especially in business environments. Given the unprecedented volumes of voice data companies generate daily, the ability to document customer calls, conferences, and online meetings, both in real time and asynchronously, is becoming crucial.

In the past, real-time communication faced technological challenges, including high latency, poor scalability, slow data transfer rates, and heavy infrastructure loads. Over the last decade, advances in Automatic Speech Recognition (ASR), also known as speech-to-text, and WebSockets have changed the narrative, paving the way for efficient, accurate, and affordable real-time transcription solutions at scale.

In this blog, we’ll explore a way to leverage the fusion of WebSockets, Golang, and our speech-to-text API to build applications with real-time transcription capabilities. Sounds like something for you? Let’s begin!

What is real-time transcription?

Real-time transcription is the process of converting live speech to text almost instantaneously. This conversion is made possible by cutting-edge ASR and Natural Language Processing (NLP) models, used as part of hybrid AI/ML architectures, which continuously transform spoken words into written text with near-instant results.

The ability to display the transcript as it’s being spoken out with minimal perceptible delay is a key technical requirement for this ASR feature. Latency is the delay between when a speaker utters a word or phrase and when the ASR system produces the corresponding transcription result.

The acceptable range for low latency highly depends on each application's specific needs and end-user expectations. Our average latency at Gladia is around 800 milliseconds, which is optimal for most voice assistants, communication platforms, and industrial and media apps that require real-time control and response.

Most common applications of real-time transcription

Real-time bidirectional communication has diverse applications across multiple industries, impacting day-to-day operations. Real-time transcription is especially useful in scenarios where you need to react to what's being said directly, where very low latency or wait time is required.

Conversational bots are one of the most common use cases for live transcription, and real-time captioning during live events is in equally high demand. Here are some other business scenarios where the technology is becoming increasingly popular:

Online and hybrid meetings: Across teams and individuals, live transcription is valuable in remote and onsite meetings. It helps team members catch up on discussions and brainstorming sessions at their convenience.
Customer support and call centers: Call centers and customer service departments can improve their workflow with live transcription, particularly in the context of call bots, to ensure an instant response to customer inquiries and better customer satisfaction in the long run.
Healthcare: For doctor-patient consultations happening both on site and remotely, live transcription is useful in helping healthcare professionals automate note-taking and focus fully on patient examinations.
Finance: For financial institutions, real-time transcription provides updated information to monitor financial markets and ensure swift response to client queries with voice-based agents.
Media: Live broadcasting of media events, such as international conferences and forums, requires real-time captions and translation.

What’s a WebSocket?

WebSocket is an event-driven communication protocol used for real-time communication. Unlike the request-response model of HTTP, where a client sends a request and then waits for a response (with wait times ranging from seconds to minutes depending on the API provider), a WebSocket creates a persistent bidirectional connection between the client and server, so both sides can send data at any time.

Advantages of using WebSockets for real-time transcription

Improved user experience through low latency. WebSockets keep delays in transmitting data to a minimum, resulting in nearly instant updates. Users receive transcribed content almost immediately, which creates an interactive experience.
Full-duplex communication. In a full-duplex communication mode, the client does not have to wait for the server before responding to a message – just as you can always send messages over chats or video/audio calls without having to wait for a reply explicitly.
Scalable and flexible. Network overheads are minimized when using WebSockets, improving their scalability and flexibility. Services can easily adapt to varying transcription demands, which limits downtime when users increase.

WebSockets and Go: How to set up and get the server running

Golang, aka Go, is a statically typed programming language designed at Google by Robert Griesemer, Rob Pike, and Ken Thompson starting in 2007. It is widely used for backend development, cloud computing, and DevOps.

In this section, we’ll look at how to set up a basic WebSocket in Go.

Prerequisites:

Go installed on your system (Mac, Linux, or Windows)
A code editor, such as VS Code or Sublime Text.

Setting up a WebSocket server in Go

Step 1: Import the required Go package

Any program written in Go is made up of packages, which contain functions that will be used in the program.

For handling WebSockets, a popular choice is the gorilla/websocket package.

Install it from inside your Go module (run `go mod init <module-name>` first if the project has no go.mod file):
`go get github.com/gorilla/websocket`

Open your code editor and create a file named server.go. Then enter the code below:


package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/gorilla/websocket"
)

The net/http package is used to handle HTTP requests.

Step 2: Create the WebSocket upgrader

To upgrade an HTTP connection to a WebSocket connection, we define an upgrader:


var upgrader = websocket.Upgrader{
    ReadBufferSize:  1024,
    WriteBufferSize: 1024,
}

The upgrader configuration defines the buffer sizes, in bytes, for reading and writing.
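For context on what the upgrade involves: under RFC 6455, the client sends a `Sec-WebSocket-Key` header, and the server answers with a `Sec-WebSocket-Accept` value derived from it. The gorilla/websocket upgrader takes care of this step, so the sketch below is purely illustrative and uses only the standard library. The helper name `acceptKey` is an assumption.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// websocketGUID is the fixed string defined by RFC 6455 for the handshake.
const websocketGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey derives the Sec-WebSocket-Accept value from the client's
// Sec-WebSocket-Key: SHA-1 of key+GUID, then base64-encoded.
func acceptKey(clientKey string) string {
	sum := sha1.Sum([]byte(clientKey + websocketGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// Sample key taken from the example in RFC 6455, section 1.3.
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))
	// prints "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
}
```

Once the server responds with status 101 (Switching Protocols) and this header, the same TCP connection carries WebSocket frames instead of HTTP messages.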

Step 3: Handling WebSocket connections

Next, we’ll set up a function to handle WebSocket connections:


func handleConnections(w http.ResponseWriter, r *http.Request) {
    // Upgrade the incoming HTTP request to a WebSocket connection
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println(err)
        return
    }
    defer conn.Close()

    for {
        // Read the next message from the client
        messageType, msg, err := conn.ReadMessage()
        if err != nil {
            log.Println(err)
            return
        }

        log.Printf("Received message: %s", msg)

        // Echo the message back with the same type (text or binary)
        if err := conn.WriteMessage(messageType, msg); err != nil {
            log.Println(err)
            return
        }
    }
}

Step 4: Starting the server

Then configure the server to listen on a specified port. For this example, we’re starting the server on port 8080:


func serveHome(w http.ResponseWriter, r *http.Request) {
    http.ServeFile(w, r, "index.html")
}

func main() {
    http.HandleFunc("/ws", handleConnections)
    http.HandleFunc("/", serveHome)

    fmt.Println("Server started at http://localhost:8080")
    // ListenAndServe blocks and only returns if the server fails to start or stops
    log.Fatal(http.ListenAndServe(":8080", nil))
}

The code above configures an HTTP server using Golang. The main function initializes the server and handles different types of requests.

When we access the root path ('/'), the serveHome function triggers and serves an HTML file named 'index.html' to the browser.

In contrast, when requests are made to the '/ws' path, they are managed by the handleConnections function, which sets up the WebSocket connection.

The server starts at 'http://localhost:8080'.

Step 5: Creating the index.html file

Create an index.html file in the same directory as server.go.

When displayed in your browser, the page (titled "Golang WebSocket") contains a text box to enter a message and a Send button to transmit the message via the WebSocket to the server.

Step 6: Running the WebSocket server

To run the WebSocket server, enter this command in your terminal:


go run server.go

This command starts the WebSocket server on port 8080.

When a message is sent with the form above, it travels from the client-side web form to the server, which logs it in the terminal and echoes it back over the same connection. This interaction demonstrates the bidirectional communication of the WebSocket server.

Using speech-to-text APIs for real-time transcriptions

Application Programming Interfaces (APIs) are interfaces that let applications interact with one another, connecting clients to a server that returns responses based on their requests.

Using speech AI APIs for real-time transcriptions offers advantages such as:

Accuracy: They leverage a hybrid combination of advanced speech recognition models and proprietary algorithms to produce highly accurate transcriptions and audio intelligence features, often fine-tuned to specific languages and industry use cases.
Efficiency: APIs are designed to be efficient at scale, making them ideal for handling real-time transcription demands, even in high-traffic scenarios.
Integration: APIs come in ‘all-batteries-included’ packages that can be directly integrated into various applications and systems, providing flexibility for developers irrespective of their AI expertise while reducing hardware and setup costs.

Gladia's real-time transcription API, powered by optimized Whisper ASR

Gladia's Audio Intelligence API was designed with the goal of simplifying Speech AI integration for developers. It offers a wide range of functionalities, including real-time transcription, that can be built directly into voice assistants, call and note-taking bots, and other speech-based enterprise applications whatever the tech stack.

OpenAI’s original Whisper ASR, on which Gladia’s API is primarily based, does not inherently support live transcription or WebSockets. Our team therefore reengineered Whisper to enable live transcription and WebSocket integration. With latency as low as 400 milliseconds, the API transcribes audio and video in real time.

Here are the steps to integrate Gladia's live transcription API with your Golang application:

Step 1: Sign up on Gladia to get an API key

Before you can begin integrating Gladia's Audio Intelligence API, you'll need to sign up and obtain an API key. This key will be used to authenticate your requests to the API.

To proceed, create a free account with 10h/month of transcription included on app.gladia.io.

Figure: Welcome screen of Gladia's playground, where you can sign up for your API key and test real-time transcription directly.

Step 2: Integrating Gladia's real-time transcription API with the Golang app

Following the code examples in our developer documentation, here’s a Golang app that transcribes an audio file using Gladia’s API.

Note: You need an audio file to transcribe; check the supported media formats first. This example uses an m4a audio file.

In your code editor, create a file `audio.go`, and enter the code:


package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Step 1: Open the audio file and copy it into a multipart form body.
	// Replace "path_to_audio_file" with the path to the audio file on your system.
	file, err := os.Open("path_to_audio_file")
	if err != nil {
		fmt.Println("Error opening audio file:", err)
		return
	}
	defer file.Close()

	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)
	part, err := writer.CreateFormFile("audio", "audioFile")
	if err != nil {
		fmt.Println("Error writing to buffer:", err)
		return
	}
	_, err = io.Copy(part, file)
	if err != nil {
		fmt.Println("Error copying file to buffer:", err)
		return
	}
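	// Close the writer so the final multipart boundary is written to the body
	if err := writer.Close(); err != nil {
		fmt.Println("Error closing multipart writer:", err)
		return
	}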

	// Step 2: We create a request with Gladia’s endpoint and add the API key for authorization
	req, err := http.NewRequest("POST", "https://api.gladia.io/audio/text/audio-transcription/", body)
	if err != nil {
		fmt.Println("Error creating request:", err)
		return
	}
	req.Header.Set("Content-Type", writer.FormDataContentType())
	// Replace "YOUR_API_KEY" with your Gladia API key
	req.Header.Set("x-gladia-key", "YOUR_API_KEY")

	// Step 3: Send the request
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Error sending request:", err)
		return
	}
	defer resp.Body.Close()

	// Step 4: Handle the response
	fmt.Println("Response Status:", resp.Status)
	if resp.StatusCode != http.StatusOK {
		fmt.Println("Request was not successful; no transcription to display")
		return
	}

	// Handling the JSON response to extract the transcription
	var jsonResponse map[string]interface{}
	err = json.NewDecoder(resp.Body).Decode(&jsonResponse)
	if err != nil {
		fmt.Println("Error decoding JSON response:", err)
		return
	}

	// Access the 'prediction' field and display the transcribed text
	predictions, ok := jsonResponse["prediction"].([]interface{})
	if !ok {
		fmt.Println("No prediction data found in the response")
		return
	}

	for _, prediction := range predictions {
		predictionMap, ok := prediction.(map[string]interface{})
		if !ok {
			fmt.Println("Invalid prediction format")
			return
		}

		transcription, found := predictionMap["transcription"].(string)
		if !found {
			fmt.Println("No transcription found in the prediction data")
			return
		}

		fmt.Println("Transcription:", transcription)
	}
}

Replace:

"path_to_audio_file" with the path to the audio file you want to transcribe on your computer.
"YOUR_API_KEY" with your API key from app.gladia.io.

Code breakdown:

In the program above, `package main` declares the main package, which tells the Go toolchain to build a standalone executable. Execution starts at its `main` function.

Next, we import some packages:

bytes: creates an in-memory buffer that holds the multipart request body.
encoding/json: decodes the JSON response.
fmt: handles formatted I/O and prints to the console.
io: copies the audio file data into the buffer.
mime/multipart: builds the multipart/form-data body that carries the audio file.
net/http: creates and sends HTTP requests to communicate with Gladia's API.
os: opens the audio file.

The `main` function handles the integration with Gladia's transcription API. Note that this example uploads a complete audio file in a single HTTP request rather than streaming audio over a WebSocket. The function:

Opens the audio file and copies it into a multipart form body.
Sets up the HTTP request with the form body and the API key in the `x-gladia-key` header.
Sends the request, handles the response, and displays the transcribed text upon receiving a successful response (status 200 OK).

Step 3: Run the program in your terminal

Enter `go run audio.go` in the terminal to execute the program.

First, the program prints a 200 OK response status, confirming a successful connection and request to Gladia's API.

It then displays the transcribed text from the audio.

The program prints the transcribed text without the additional JSON structure (timestamps, confidence scores, and other data) because it extracts only the `transcription` field from the JSON response. This is done using Golang’s `encoding/json` package to parse the JSON data.

🔗Access the complete code on this GitHub repository.

Conclusion

Real-time transcription, using WebSockets and Gladia's optimized Whisper ASR API, significantly enhances real-time communication by offering live speech-to-text functionality. As this technology evolves, it promises increased productivity and the ability to derive more accurate insights from unstructured audio data.

To learn more about Gladia’s approach to enhancing Whisper transcription performance for companies, check out our new model, Whisper-Zero, or sign up for the API on app.gladia.io.

Other relevant resources

[1] Gladia’s Live audio documentation: https://docs.gladia.io/reference/live-audio

[2] Real-time transcription API by Gladia, deep-dive: https://www.gladia.io/blog/real-time-transcription-powered-by-whisper-asr

[3] Introducing Whisper Zero: https://www.gladia.io/blog/introducing-whisper-zero

About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.
