Sephyr - project overview
Sephyr is a unique magic mod for Minecraft. It aims to differentiate itself by:
- Introducing a voice-powered spell casting system that’s both fun to use and intricate. You will be reading spells out loud into your mic in a magic language, and what you say will affect the result of the spell.
- Having stunning, detailed visuals unique to each spell. This is going to be achieved by spending way too much effort on rendering. I want spell casting to be something satisfying, and the look is a key part of it.
The project consists of two parts:
- Sephyr - The Minecraft mod itself, containing the spells, mechanics and visuals, written in Kotlin.
- Sepple - The speech processing engine, responsible for transforming your babble into something Sephyr can work with to figure out what you have just said.
Sepple overview
Sepple is the foundation of the voice-powered spell casting. It is an audio processing library written in Rust that Sephyr will depend on and embed. It captures your mic live and pre-processes the speech. Then it uses the Burn framework to run a speech recognition model called Multipa locally and quickly. The model transcribes speech into the International Phonetic Alphabet (IPA), a textual notation for the various sounds that make up spoken language. Finally, it tries to match the result against a custom dictionary of magic words, which will form sentences - the spells you will be casting.
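The dictionary matching step could be sketched roughly as below. This is only an illustration under assumptions, not Sepple's actual matcher: it uses plain Levenshtein edit distance as a stand-in for whatever matching Sepple ends up using, and the names `edit_distance`, `closest_word`, the `(word, pronunciation)` dictionary layout and the `max_distance` threshold are all hypothetical.

```rust
/// Levenshtein edit distance between two strings, counted in Unicode scalar
/// values (`char`s) rather than bytes, since IPA symbols are mostly non-ASCII.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the first i chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Returns the dictionary word whose IPA pronunciation is closest to `ipa`,
/// or `None` if nothing is within `max_distance` edits.
/// The (word, pronunciation) pair layout is an assumption for this sketch.
pub fn closest_word<'a>(
    ipa: &str,
    dictionary: &'a [(&'a str, &'a str)],
    max_distance: usize,
) -> Option<&'a str> {
    dictionary
        .iter()
        .map(|(word, pronunciation)| (*word, edit_distance(ipa, pronunciation)))
        .filter(|(_, distance)| *distance <= max_distance)
        .min_by_key(|(_, distance)| *distance)
        .map(|(word, _)| word)
}
```

With a made-up dictionary such as `[("ignis", "iɲis"), ("aqua", "akwa")]`, calling `closest_word("iɲiz", &dictionary, 2)` would pick `ignis`, since the two transcriptions differ by a single symbol. One caveat of comparing per `char`: IPA diacritics written as combining marks count as separate symbols, so a real matcher might want to group them with their base letter first.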
This devlog
Before Stardance launched I had written a rough skeleton for Sepple. With some trouble I managed to import the Multipa model into Burn via an ONNX export from PyTorch, and implemented the post-processing required to decode the model’s output into IPA. I also started working on a pipeline for mic capture and sliding window processing, so that the audio can be transcribed live.
After Stardance started I noticed that the same input audio file gives quite different results in PyTorch than in Rust. I spent a few hours debugging why that is, and found a missing piece in my implementation. For the Multipa model to work correctly, the samples need to be normalized using so-called z-score normalization. That means calculating the mean of the input and its standard deviation (a statistical measure of how spread out the values are), then subtracting the mean from every sample and dividing by the standard deviation, so that the data ends up with a mean of 0 and a standard deviation of 1. After implementing that, the model started to work correctly and give the same output for the same input on both sides.
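For reference, the z-score normalization described above can be sketched in a few lines of Rust. This is a minimal illustration, not the actual Sepple code: the function name `z_score_normalize` and the `eps` parameter (a small constant that keeps silent, all-zero input from dividing by zero) are assumptions, as is the choice of population variance (dividing by `n` rather than `n - 1`).

```rust
/// Rescales `samples` in place so that they have a mean of 0 and a
/// standard deviation of (approximately) 1.
///
/// `eps` is an assumed small constant added to the variance so that
/// constant input (e.g. pure silence) does not cause a division by zero.
pub fn z_score_normalize(samples: &mut [f32], eps: f32) {
    if samples.is_empty() {
        return;
    }
    let n = samples.len() as f32;
    // Mean of the input.
    let mean = samples.iter().sum::<f32>() / n;
    // Population variance: the average squared distance from the mean.
    let variance = samples.iter().map(|s| (s - mean).powi(2)).sum::<f32>() / n;
    let std_dev = (variance + eps).sqrt();
    // z = (x - mean) / std_dev for every sample.
    for s in samples.iter_mut() {
        *s = (*s - mean) / std_dev;
    }
}
```

Calling `z_score_normalize(&mut samples, 1e-7)` on a captured audio buffer before handing it to the model would apply the rescaling. Whether the whole recording or each sliding window should be normalized separately is a design choice this sketch leaves open.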
Picture: “Hello Stardance. Sepple is working now” transcribed to IPA using PyTorch and the fixed Rust implementation. It is not veeery accurate, but it is good enough.