Live speech to IPA transcription in Rust using ML

I finished my rough implementation of a live speech transcription pipeline. It captures my speech from the mic, and processes it with the Multipa model using the Burn framework.
To make it work live I implemented a sliding window that chunks the audio into 2-second fragments, each of which is processed separately and then assembled into a single string. The chunks overlap and are then trimmed, so that audio that was on the border of a cut (where transcription quality drops) is thrown away, and the text for it is taken from the next chunk, where it sits at the center.
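
As a rough sketch of the idea (not the actual Sepple code), the windowing can be planned as a list of sample ranges: each window is fed to the model whole, but only its central part is kept. The names `Window` and `plan_windows`, and the choice to trim the same margin on both sides, are assumptions made for illustration.

```rust
/// One analysis window over the sample buffer (all values are sample indices).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Window {
    /// Range fed to the model: [start, end).
    pub start: usize,
    pub end: usize,
    /// Sub-range whose transcription is kept: [keep_start, keep_end).
    pub keep_start: usize,
    pub keep_end: usize,
}

/// Plan overlapping windows so that the kept ranges tile the whole buffer.
/// `margin` is the number of samples discarded at each inner border.
pub fn plan_windows(total: usize, window: usize, margin: usize) -> Vec<Window> {
    assert!(window > 2 * margin, "window must be longer than both margins");
    // Consecutive windows advance by the size of the kept center.
    let hop = window - 2 * margin;
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + window).min(total);
        let is_first = start == 0;
        let is_last = end == total;
        out.push(Window {
            start,
            end,
            // The outermost borders have no neighbor to cover them, so keep them.
            keep_start: if is_first { start } else { start + margin },
            keep_end: if is_last { end } else { end - margin },
        });
        if is_last {
            break;
        }
        start += hop;
    }
    out
}
```

For example, assuming 16 kHz audio, a 2-second window is 32000 samples; with a hypothetical 0.25-second margin (4000 samples), `plan_windows(80_000, 32_000, 4_000)` yields three windows whose kept ranges cover the 5 seconds without gaps or overlap.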

The inference runs on the CPU for simplicity and compatibility, so at the cost of a bit of pain I added threading (each chunk is processed on a separate thread) to make sure the pipeline can keep up. I’m slightly worried about the performance when this runs alongside Minecraft, but hopefully I will be able to make it as lightweight and unintrusive as possible.
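
A minimal sketch of the one-thread-per-chunk idea, using only the standard library. It is a batch version for illustration: a live pipeline would spawn work as chunks arrive. `transcribe_parallel` and the `transcribe` closure (a stand-in for the model call) are hypothetical names, not Sepple's API.

```rust
use std::thread;

/// Run `transcribe` on every chunk in its own thread and return the
/// results in the original chunk order.
pub fn transcribe_parallel<F>(chunks: &[Vec<f32>], transcribe: F) -> Vec<String>
where
    F: Fn(&[f32]) -> String + Sync,
{
    thread::scope(|scope| {
        // Spawn one scoped thread per chunk; scoped threads may borrow
        // `chunks` and `transcribe` without needing `Arc`.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| {
                let transcribe = &transcribe;
                scope.spawn(move || transcribe(chunk.as_slice()))
            })
            .collect();

        // Joining in spawn order keeps the text in chronological order,
        // even if later chunks finish first.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<String>>()
    })
}
```

Joining the handles in the order they were spawned is what keeps the assembled transcript in order; a thread pool with a fixed number of workers would be a reasonable alternative if spawning per chunk turns out to be too heavy.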

Finally, I reworked how the text is accumulated, switching from a vector of logits to a string. Since IPA has no concept of words, just phonemes, everything I say is appended to one long string.
So I added a very simple silence detection, where if a captured chunk returns no text a space is added to the string.
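
The accumulation with this kind of silence detection could look roughly like the sketch below. `Transcript` and `push_chunk` are made-up names, and collapsing several silent chunks into a single space is an extra assumption on top of what is described above.

```rust
/// Accumulates the IPA text of consecutive chunks into one string.
#[derive(Debug, Default)]
pub struct Transcript {
    text: String,
}

impl Transcript {
    /// Append the text decoded from one chunk.
    /// An empty result is treated as silence and becomes a space.
    pub fn push_chunk(&mut self, chunk_text: &str) {
        let trimmed = chunk_text.trim();
        if trimmed.is_empty() {
            // Assumption: avoid leading spaces and runs of spaces
            // when several silent chunks arrive in a row.
            if !self.text.is_empty() && !self.text.ends_with(' ') {
                self.text.push(' ');
            }
        } else {
            self.text.push_str(trimmed);
        }
    }

    /// The transcript accumulated so far.
    pub fn as_str(&self) -> &str {
        &self.text
    }
}
```

Feeding it the chunk texts `"həˈloʊ"`, `""`, `""`, `"wɜːld"` in that order produces `"həˈloʊ wɜːld"`.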

For easier testing I also improved the main function, so that the program can:

take a file and process it all at once
take a file and process it using the pipeline
run live mic capture on the pipeline

What’s next - improving the detection

In theory this is all I need, and I can start working on the Minecraft mod. However, I’m not satisfied with the quality of the transcription, and at its current level most spell detection will probably fail.
The quality of the model itself is pretty bad. If the description of the Multipa model on GitHub is correct, the model was trained on merely 9 hours of speech data, which probably explains it. The model is also sensitive to noise and to distractions like other people talking, and sometimes outputs garbage.
So in the next devlog I will be experimenting with various techniques for filtering and cleaning up the audio the model receives, and potentially even training a new model myself on a bigger dataset.

Attached video

Working demo of live capture and transcription of me talking.

@Szczurek on Sephyr magic mod · 6 days ago

3h 44m 18s logged

Sephyr - project overview

Sephyr is a unique magic mod for Minecraft. It aims to differentiate itself by:

Introducing a voice-powered spell casting system that’s both fun to use and intricate. You will be reading spells out loud into your mic in a magic language, and what you say will affect the result of the spell.
Having stunning, detailed visuals unique to each spell. This is going to be achieved by spending way too much effort on rendering. I want spell casting to be something satisfying, and the look is a key part of it.

The project consists of two parts:

Sephyr - The Minecraft mod itself, containing the spells, mechanics and visuals, written in Kotlin.
Sepple - Speech processing engine, responsible for transforming your babble into something Sephyr can work with to figure out what you have just said.

Sepple overview

Sepple is the foundation of the voice-powered spell casting. It is an audio processing library written in Rust that Sephyr will depend on and embed. It captures your mic live and pre-processes the speech. Then it uses the Burn framework to run a speech recognition model called Multipa locally and quickly. The model transcribes speech into the International Phonetic Alphabet (IPA), a textual notation for the various sounds that make up spoken language. Finally, it tries to match the result against a custom dictionary of magic words, which will form sentences - the spells you will be casting.

This devlog

Before Stardance launched I had written a rough skeleton for Sepple. With some trouble I managed to import the Multipa model into Burn via an ONNX export from PyTorch, and to implement the post-processing required to decode the model’s output into IPA. I also started working on a pipeline for mic capture and sliding window processing, so that the audio can be transcribed live.
After Stardance started I noticed that the same input audio file gives quite different results in PyTorch than in Rust. I spent a few hours debugging why that is, and found a missing piece in my implementation. For the Multipa model to work correctly, the samples need to be normalized using so-called z-score normalization. That means calculating the mean of the input and its standard deviation (a statistical measure of spread), and applying a formula that rescales the data so that it has a mean of 0 and a standard deviation of 1. After implementing that, the model started to work correctly and to give the same output for the same input on both sides.
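
For context, here is a sketch of what such post-processing typically looks like, assuming the model emits per-frame CTC logits (an assumption here, based on Multipa being a fine-tuned wav2vec 2.0 model). Greedy decoding takes the best token per frame, merges consecutive repeats and drops the blank token. The function name, the `vocab` layout and the `blank` index are illustrative, not Sepple's actual code.

```rust
/// Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
/// `logits` holds one score vector per time frame; `vocab[i]` is the
/// IPA symbol for token id `i`; `blank` is the id of the CTC blank token.
pub fn ctc_greedy_decode(logits: &[Vec<f32>], vocab: &[&str], blank: usize) -> String {
    let mut out = String::new();
    let mut prev: Option<usize> = None;

    for frame in logits {
        // Index of the highest-scoring token in this frame.
        let best = frame
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(id, _)| id);
        let Some(id) = best else { continue }; // skip empty frames

        // Emit a symbol only when the token changes and is not the blank.
        if Some(id) != prev && id != blank {
            if let Some(symbol) = vocab.get(id) {
                out.push_str(symbol);
            }
        }
        prev = Some(id);
    }
    out
}
```

A blank between two identical tokens is what lets the same phoneme appear twice in a row: the frame sequence `h h <blank> ə ə <blank> ə` decodes to `həə`.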
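
A minimal sketch of z-score normalization over a buffer of samples, i.e. x' = (x - mean) / std. The small epsilon added to the variance guards against division by zero on silent input; it mirrors what the Hugging Face wav2vec 2.0 feature extractor does and is an assumption, not necessarily what Sepple uses.

```rust
/// Rescale `samples` in place to zero mean and unit standard deviation.
pub fn zscore_normalize(samples: &mut [f32]) {
    if samples.is_empty() {
        return;
    }
    let n = samples.len() as f32;
    let mean = samples.iter().sum::<f32>() / n;
    // Population variance: the average squared distance from the mean.
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    // Assumption: epsilon keeps the division finite for constant input.
    let std = (variance + 1e-7).sqrt();
    for x in samples.iter_mut() {
        *x = (*x - mean) / std;
    }
}
```

Whether the statistics are computed per chunk or over a whole file changes the result, so the Rust side needs to use the same scope as the PyTorch reference to get matching outputs.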

Picture: “Hello Stardance. Sepple is working now” transcribed to IPA using PyTorch and the fixed Rust implementation. It is not veeery accurate, but it is good enough.