Live speech to IPA transcription in Rust using ML

I finished my rough implementation of a live speech transcription pipeline. It captures my speech from the mic, and processes it with the Multipa model using the Burn framework.
To make it work live, I implemented a sliding window that cuts the audio into 2-second chunks. Each chunk is processed separately, and the results are assembled into a single string. The chunks overlap, and the output of each one is trimmed at the edges: audio that sits on the border of a cut is transcribed less reliably, so that part is thrown away and its text is taken from the next chunk, where the same audio is at the center.
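The overlap-and-trim idea can be sketched roughly like this. It is only an illustration, not the actual pipeline code: the names `Window`, `windows` and `kept_range` are made up, and the numbers I use with it (16 kHz audio, so 32,000 samples per 2-second chunk, with 0.5 seconds trimmed from each side) are assumptions. It also works in sample indices, while the real trimming may happen on the model output instead.

```rust
/// Start and end sample indices of one analysis window (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Window {
    pub start: usize,
    pub end: usize,
}

/// Splits `total` samples into windows of `size` samples whose starts are
/// `hop` samples apart. With `hop < size` consecutive windows overlap.
/// A trailing partial window is not emitted.
pub fn windows(total: usize, size: usize, hop: usize) -> Vec<Window> {
    assert!(hop > 0 && hop <= size, "hop must be in 1..=size");
    let mut out = Vec::new();
    let mut start = 0;
    while start + size <= total {
        out.push(Window { start, end: start + size });
        start += hop;
    }
    out
}

/// The part of a window whose transcription is kept: `margin` samples are
/// dropped on each side, because audio near a cut is transcribed less reliably.
pub fn kept_range(w: Window, margin: usize) -> (usize, usize) {
    assert!(2 * margin <= w.end - w.start, "margin too large for window");
    (w.start + margin, w.end - margin)
}
```

If the hop equals the window size minus twice the margin (16,000 samples with the numbers above), the kept ranges of neighbouring windows line up end to end, so every part of the audio is covered by exactly one window's center.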

The inference runs on the CPU for simplicity and compatibility, so at the cost of a bit of pain I added threading (each chunk is processed on a separate thread) to make sure the pipeline can keep up. I’m slightly worried about the performance when this runs alongside Minecraft, but hopefully I will be able to make it as lightweight and unintrusive as possible.
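A minimal sketch of the one-thread-per-chunk idea, using only the standard library. The function `transcribe_parallel` and the `infer` closure are hypothetical stand-ins for the real model call, and this version takes a batch of chunks at once, whereas the live pipeline receives them over time.

```rust
use std::thread;

/// Runs `infer` on every chunk, each on its own thread, and returns the
/// texts in the same order as the input chunks.
/// `infer` is a placeholder for the actual model inference.
pub fn transcribe_parallel<F>(chunks: &[Vec<f32>], infer: F) -> Vec<String>
where
    F: Fn(&[f32]) -> String + Sync,
{
    let infer = &infer;
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || infer(chunk.as_slice())))
            .collect();
        // Joining in spawn order keeps the output aligned with the input,
        // even if a later chunk finishes before an earlier one.
        handles
            .into_iter()
            .map(|h| h.join().expect("inference thread panicked"))
            .collect::<Vec<String>>()
    })
}
```

Joining the handles in order is what keeps the text from being assembled out of sequence. Scoped threads (`std::thread::scope`, available since Rust 1.63) let the workers borrow the chunks instead of copying them.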

Finally, I reworked how the text is accumulated, switching from a vector of logits to a string. Since IPA has no concept of words, just phonemes, everything I say is appended to one long string.
So I added a very simple silence detection: if a captured chunk returns no text, a space is added to the string.
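In code, the accumulation step could look something like the sketch below. The function name `append_chunk` is made up, and two details are my own assumptions rather than what the pipeline necessarily does: several silent chunks in a row produce only one space, and no space is added before any text has been captured.

```rust
/// Appends the text of one chunk to the running transcript.
/// An empty chunk is treated as silence and becomes a single space.
pub fn append_chunk(transcript: &mut String, chunk_text: &str) {
    let text = chunk_text.trim();
    if text.is_empty() {
        // Assumption: collapse repeated silence and skip a leading space.
        if !transcript.is_empty() && !transcript.ends_with(' ') {
            transcript.push(' ');
        }
    } else {
        transcript.push_str(text);
    }
}
```

Feeding it the chunks "hɛ", "loʊ", an empty chunk, and "wɜːld" would give two space-separated groups of phonemes.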

For easier testing I also improved the main function, so that the program can:

take a file and process it all at once
take a file and process it using the pipeline
run live mic capture on the pipeline
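The three modes map naturally onto an enum. This is just a sketch of how the argument handling could be laid out; the subcommand names `file`, `stream` and `live` and the names `Mode` and `parse_mode` are hypothetical, not the program's real interface.

```rust
/// The three ways the program can run (hypothetical names).
#[derive(Debug, PartialEq)]
pub enum Mode {
    /// Take a file and process it all at once.
    FileWhole(String),
    /// Take a file and feed it through the sliding-window pipeline.
    FilePipeline(String),
    /// Capture from the microphone and run the pipeline live.
    Live,
}

/// Parses the arguments that follow the program name.
pub fn parse_mode(args: &[&str]) -> Result<Mode, String> {
    match args {
        ["file", path] => Ok(Mode::FileWhole(path.to_string())),
        ["stream", path] => Ok(Mode::FilePipeline(path.to_string())),
        ["live"] => Ok(Mode::Live),
        _ => Err("usage: file <path> | stream <path> | live".to_string()),
    }
}
```

Running a file through the pipeline gives repeatable input for comparing the chunked output against the all-at-once output of the same recording.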

What’s next - improving the detection

In theory this is all I need, and I can start working on the Minecraft mod. However, I’m not satisfied with the quality of the transcription, and at its current level most spell detection will probably fail.
The quality of the model itself is pretty bad. If the description of the Multipa model on GitHub is correct, the model was trained on merely 9 hours of speech data, which probably explains it. The model is also sensitive to noise and to distractions like other people talking, and it sometimes outputs garbage.
So in the next devlog I will be experimenting with various techniques for filtering and cleaning up the audio the model receives, and potentially even training a new model myself on a bigger dataset.

Attached video

Working demo of live capture and transcription of me talking.