- read

Speech Summarization With GPT-4 and Vosk: Build a Web App in Go

Simone Convertini 60

Speech Summarization With GPT-4 and Vosk: Build a Web App in Go

Simone Convertini
Level Up Coding
Published in
5 min read21 hours ago


Photo by Daniel Sandvik on Unsplash


In a world where information is constantly bombarding us, the ability to streamline and simplify communication is more crucial than ever. We find ourselves drowning in a sea of data, often struggling to extract the valuable insights we truly need. In this article, we’ll explore the possibilities of harnessing LLM technology to build innovative services that can enhance the way we consume information. Specifically, we’ll discuss the concept of a web application designed to extract bullet-point summaries from audio files, making information retrieval faster and more efficient.

Implementation Focus

In this article, our primary focus is on implementing an endpoint for uploading an audio file to trigger the summarization process. To build the web service, we have chosen Golang as our programming language and Gin as our web framework. I love Go routines. And Channels too.

The Architecture Blueprint

Our system design follows the classic principles of microservices, embodying a design philosophy where each service is crafted to execute a specific task. These services, akin to the specialized cogs in a well-oiled machine, will work harmoniously to achieve a common objective, efficient speech summarization. To choreograph this seamless collaboration, we’ve employed RabbitMQ as our message broker, allowing these services to communicate effortlessly, triggering one another as the need arises.

Speech to Text

To kickstart the summarization process, we need to convert spoken words into text. For this crucial task, we turn to Vosk, a speech recognition model that seamlessly balances lightweight efficiency with remarkable power. We will opt for the lighter version of the Vosk model. Our preference is to run multiple instances, prioritizing speed and efficiency over precision in transcription. This approach aligns with our overarching strategy because the transcriptions will undergo re-tokenization for further elaboration by our LLM. Vosk’s multifaceted capabilities render it the ideal choice to fuel our application’s core functionality.

Text Summarization

The next step involves transforming the speech transcription into a concise, easily digestible bullet-point summary. Here, we enlist the assistance of GPT-4, the 32k token context version can handle substantial text transcriptions, making it well-suited for our summarization needs. For smaller text inputs, we can efficiently utilize the more cost-effective GPT-3.5 model.

Process Workflow

Step 1 — Audio File Upload

Upon a user’s initiative to upload an audio file, our system springs into action. We store the incoming audio file in MinIO, a robust and S3-compatible storage system renowned for its scalability and reliability. Simultaneously, we emit a successful upload event to our trusty message broker. As for the management of metadata, we use a MongoDB instance, a versatile and efficient NoSQL database solution. Here’s a snippet of the function performing the audio upload.

Step 2 — Speech Recognition

Our exploration of the inner workings of our speech summarization system now brings us to the core of the process. As the audio file makes its way into our system, we initiate the transcriber service, a pivotal component in our architecture. The transcriber service takes on the task of converting spoken words into text. To achieve this with both efficiency and celerity, we’ve strategically implemented multiple instances of Vosk, which work harmoniously in parallel. Now, let’s take a closer look at the inner workings of this process by examining the routine responsible for listening to messages provided by our message broker.

Step 3 — Text Summarization

Upon completion of the transcription, we activate the summarizer service, which calls GPT-4 to perform the summarization. The resulting summary is then stored in our database. This is the function used to call GPT-4 and make a summarization.

Vosk Parallelization and Performance

One of the most significant challenges we face is the time it takes to complete the transcription process. It’s a real bottleneck that we can’t afford in today’s fast-paced world. To put it into perspective, a single instance of Vosk processing a 3-minute audio file can take anywhere between 4 to 5 minutes. Waiting for what feels like an eternity for our summarization is simply not an option. So, what’s the solution? Parallelization.

Our approach involves parallelizing the transcription process using multiple instances of Vosk serving over multiple WebSocket connections. We pass the audio file buffer to Vosk instances in parallel, and in return, they promptly provide us with the transcribed text.

We supply the audio file buffers to the Vosk routines using a round-robin method, ensuring fair distribution of the workload among the instances. This balanced allocation ensures that each instance efficiently processes its assigned task and load the output to the routine-safe map. To maintain the integrity of the transcription order, we employ a queue mechanism. As soon as the first transcription in the queue is completed, it’s promptly sent to the output provider channel. This method guarantees that the transcriptions are delivered in the correct sequence, preserving the coherence and readability of the summarization. This process continues iteratively until the entire transcription is completed, allowing us to deliver accurate and organized summaries to our users in a timely fashion.

Performance Improvement

We’ve made remarkable progress in reducing the processing time for a 3-minute audio file from 4/5 minutes to less than 2 minutes, achieved by utilizing just three Vosk instances.


In conclusion, this article has outlined my approach to speech summarization using the power of GPT-4 and Vosk within a web application built in Go. In a world inundated with information, the ability to efficiently extract valuable insights from audio content is becoming increasingly useful, and this system addresses this need head-on.

I aim to contribute to the discourse surrounding the fascinating applications of this technologies. I am eager to witness the extent to which these technologies will advance and the multitude of innovative uses that will emerge.

To delve deeper into the implementation or if you’re interested in specific aspects of my work, especially if you need practical examples of the technologies discussed in this article, I recommend checking out the fully functioning demo on the GitHub repository.