A real-time audio transcriber server in Go, using an audio ML model and leveraging Protocol Buffers with gRPC.
TL;DR
It all began when I watched a coworker start a live stream on one of the biggest social media platforms;
after a few minutes, they got a violation warning notification.
As far as they knew, they had not violated the community rules,
since they also believed their video presentation was appropriate.
After a couple of attempts, the violation appeared again, but this time we realized that the violation notification appeared right after they said something.
Idea flow:
- client/end-user sends the audio -> server checks & processes the audio -> sends a response
- if the processed audio contains forbidden keywords -> do something (warn, error, etc.)
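The check step above can be sketched as a simple keyword scan over the transcribed text. This is a minimal illustration; the helper function and the keyword list are hypothetical, not this project's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// containsForbidden reports whether the transcript contains any of the
// forbidden keywords (case-insensitive), returning the first match.
// Hypothetical helper for illustration only.
func containsForbidden(transcript string, forbidden []string) (string, bool) {
	lower := strings.ToLower(transcript)
	for _, kw := range forbidden {
		if strings.Contains(lower, strings.ToLower(kw)) {
			return kw, true
		}
	}
	return "", false
}

func main() {
	forbidden := []string{"badword", "anotherbadword"}
	if kw, ok := containsForbidden("This sentence has a BadWord in it", forbidden); ok {
		fmt.Printf("warning: forbidden keyword detected: %q\n", kw)
	}
}
```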
This looks simple, but configuring, building, and testing the software requires some additional tooling:
- Go core compiler tools
- protoc
- protoc-gen-go:
  `go install google.golang.org/protobuf/cmd/protoc-gen-go@latest`
- protoc-gen-go-grpc:
  `go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest`
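With those plugins installed, the gRPC service is defined in a `.proto` file. As a rough sketch of what such a definition might look like for this use case (the message, field, and service names here are illustrative assumptions, not necessarily this repository's actual `.proto`):

```proto
syntax = "proto3";

package transcriber;

option go_package = "example/transcriber;transcriber";

// AudioChunk carries raw audio samples from the client.
message AudioChunk {
  bytes samples = 1;      // PCM audio data
  int32 sample_rate = 2;  // e.g. 16000
}

// Transcript is the server's response for a processed chunk.
message Transcript {
  string text = 1;
  bool contains_forbidden_keyword = 2;
}

// Transcriber streams audio in and transcripts out.
service Transcriber {
  rpc StreamAudio(stream AudioChunk) returns (stream Transcript);
}
```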
Run and tested on:
- go version go1.25.5 X:nodwarf5 linux/amd64
- the base English model is provided through git-lfs, see `./assets/models/ggml-base.en.bin`
- If `config.audio.json` & `config.grpc.json` don't exist, copy those files from `.json.template` to `.json`.
- You need to build and expose the whisper library and install the model:
  - after the installation, export the include path and library directory
  - you may need to export your `LD_LIBRARY_PATH` if you use a custom path, e.g.:
```sh
# when you define `-DCMAKE_INSTALL_PREFIX=~/` when building the whisper library,
# `cmake --install build/path` will add bin, lib, include, share dirs to the home directory
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/lib"
export C_INCLUDE_PATH="$C_INCLUDE_PATH:$HOME/include"
```
- Use `drun-audio_client.sh` to run the client & `drun-grpc_server.sh` for the gRPC server.
- Use a proper model and check the forbidden keywords; the base English model from ggml is still capable of detecting specific keywords. You may also adjust this as you need, see the whisper model field.
- Local system environment (your PC):
  - build and expose/install the whisper library
  - you can run `./dbuild.sh` to build this project
- Docker/Podman:
  - a container image build script is provided; run `./build-image-docker.sh` for Docker, or `./build-image-podman.sh` for Podman
  - a script to run the built image in a container is provided; run `./run-container-docker.sh` for Docker, or `./run-container-podman.sh` for Podman
  - if you want to change the model, you can substitute the `ggml-base.en.bin` string in the `Dockerfile` and create your own custom build/deployment
When all goes well, you should see something like the following in the log:

```
...
YYYY/MM/DD 14:23:43 server running on 0.0.0.0:20202
...
```

That log comes from `docker logs -f server-backend-audio_transcriber-go` or `podman logs -f server-backend-audio_transcriber-go`.
Design highlights:
- real-time design
- modular structure
- informative logging
- active buffer checking
- keyword-awareness check
- separate goroutines for send/receive
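The send/receive split can be illustrated with a generic duplex pattern. This is a simplified sketch using plain channels in place of the actual gRPC stream, so a slow receive never blocks a send (and vice versa):

```go
package main

import (
	"fmt"
	"sync"
)

// runDuplex drains outgoing chunks and incoming transcripts concurrently,
// one goroutine per direction, and returns how many of each were handled.
func runDuplex(outgoing <-chan []byte, incoming <-chan string) (sent, received int) {
	var wg sync.WaitGroup
	wg.Add(2)

	// sender goroutine: pushes audio chunks toward the server
	go func() {
		defer wg.Done()
		for chunk := range outgoing {
			fmt.Printf("sent %d bytes\n", len(chunk))
			sent++
		}
	}()

	// receiver goroutine: handles transcripts from the server
	go func() {
		defer wg.Done()
		for text := range incoming {
			fmt.Printf("received transcript: %q\n", text)
			received++
		}
	}()

	wg.Wait() // both directions finished
	return sent, received
}

func main() {
	outgoing := make(chan []byte, 2)
	incoming := make(chan string, 1)
	outgoing <- []byte{0x01, 0x02}
	outgoing <- []byte{0x03}
	incoming <- "hello world"
	close(outgoing)
	close(incoming)
	sent, received := runDuplex(outgoing, incoming)
	fmt.Printf("done: sent=%d received=%d\n", sent, received)
}
```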
Concurrency vs thread safety:
- parallelism uses a worker pool
- since the whisper library is not thread-safe, access to it is serialized with a mutex
Latency vs transcription accuracy:
- both audio_client & grpc_server collect ~1 second of audio before it is sent & processed
- this favors throughput over latency
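The ~1 second batching can be sketched as accumulating samples until one second's worth has arrived. The sample rate and chunk sizes here are assumptions for illustration, not the project's actual values:

```go
package main

import "fmt"

const sampleRate = 16000 // assumed 16 kHz mono audio

// batchOneSecond groups incoming sample chunks into ~1-second batches,
// flushing any trailing partial batch at the end.
func batchOneSecond(chunks [][]float32) [][]float32 {
	var batches [][]float32
	var current []float32
	for _, c := range chunks {
		current = append(current, c...)
		if len(current) >= sampleRate { // ~1 second collected
			batches = append(batches, current)
			current = nil
		}
	}
	if len(current) > 0 {
		batches = append(batches, current) // flush the remainder
	}
	return batches
}

func main() {
	// 35 chunks of 1000 samples each = 35000 samples total
	chunks := make([][]float32, 35)
	for i := range chunks {
		chunks[i] = make([]float32, 1000)
	}
	fmt.Println("batches:", len(batchOneSecond(chunks)))
}
```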
Responsiveness vs data loss:
- the audio channel and request queue use a limited buffer:
  - preferably drop rather than block
  - responsive under high load, but audio bursts could slow down the gRPC server
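The drop-rather-than-block behaviour maps directly onto Go's non-blocking channel send. A generic sketch, not the project's exact code:

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send; when the buffer is full it
// drops the chunk instead of stalling the producer.
func tryEnqueue(ch chan []byte, chunk []byte) bool {
	select {
	case ch <- chunk:
		return true
	default:
		return false // buffer full: drop to stay responsive
	}
}

func main() {
	audio := make(chan []byte, 2) // limited buffer
	dropped := 0
	for i := 0; i < 5; i++ {
		if !tryEnqueue(audio, make([]byte, 320)) {
			dropped++
		}
	}
	fmt.Println("dropped:", dropped) // buffer of 2, 5 sends -> 3 dropped
}
```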
Below is a 2-hour stress test using a 12-thread CPU and 32 GB of RAM.
If you have any better options/approaches, I would love to read/see them - @prothegee



