WisprFlow SDK - Stardance

Ship #1 Changes requested

@shash on WisprFlow SDK · about 2 months ago

I'm just going to ramble on and I'm going to describe my whole project.

So basically, I needed a way to transcribe my audio. My eventual goal is to make a Jarvis-type home assistant, so it really needs to pick up my voice well. I often speak Hindi and English at the same time, I use lots of filler words, and I mumble a lot. Obviously, you would expect a voice assistant to catch all of that really well.

There are models such as OpenAI Whisper, but in my experience they are not that good for this use case. Then I found Wispr Flow, which is absolutely exceptional at picking up the little nuances of language when I speak in multiple languages at the same time.

But there was no programmatic way to interact with it, so I messaged one of the owners on LinkedIn, and they gave me permission to reverse engineer it. They said that maintaining an API is a huge overhead for them and, being just a start-up, they have no plans of making one, so they really would not mind me reverse engineering it and building it myself if I wanted it that desperately.

And because I have the Pro plan, I can use it a lot without running out of words.

So I took it on as a challenge and built a Python SDK that also works from the command line: you just give it an audio file and it gives you the final transcribed version.
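To make that flow concrete, here is a minimal sketch of a command-line wrapper around a transcription function. This is not the actual SDK code: `transcribe_file`, `placeholder_transcriber`, and the `Transcriber` type are hypothetical names made up for illustration, and the placeholder backend does no real transcription. It only shows the "audio file in, text out" shape.

```python
import argparse
from pathlib import Path
from typing import Callable, Optional, Sequence

# Hypothetical backend signature: raw audio bytes in, transcribed text out.
Transcriber = Callable[[bytes], str]


def placeholder_transcriber(audio: bytes) -> str:
    """Stand-in backend (assumption); a real one would call the service."""
    return f"<{len(audio)} bytes of audio, not transcribed>"


def transcribe_file(path: Path,
                    transcriber: Transcriber = placeholder_transcriber) -> str:
    """Read an audio file from disk and hand its bytes to the backend."""
    return transcriber(path.read_bytes())


def main(argv: Optional[Sequence[str]] = None,
         transcriber: Transcriber = placeholder_transcriber) -> int:
    """Parse the command line, transcribe the given file, print the text."""
    parser = argparse.ArgumentParser(description="Transcribe an audio file.")
    parser.add_argument("audio_file", type=Path, help="path to the audio file")
    args = parser.parse_args(argv)
    if not args.audio_file.is_file():
        parser.error(f"no such file: {args.audio_file}")
    print(transcribe_file(args.audio_file, transcriber))
    return 0
```

Calling `main(["recording.wav"])` prints whatever the backend returns for that file. Passing the backend in as a parameter keeps the command-line part testable without any network access.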

3 devlogs
5h


@shash on WisprFlow SDK · about 2 months ago

3h 56m 16s logged

https://github.com/ThisisShashwat/wisprflow-sdk/blob/main/demo/demo.md

Hey guys, I’m so excited to announce that I’ve completed everything! The SDK is done, the Path Script is done, and now I can 100% make it work. I’ve been coding for over 4 hours since the last devlog. I’m really frustrated that this took an insane amount of time. I don’t have much experience with JavaScript, but it was enough for me to know what Electron is and how Node works, and even though the code was highly minified (it was Electron code), I was able to figure it out.

I learned a lot while making this. I learned how to reverse engineer Electron apps, and I learned how to use partial scripts to help with the reverse engineering. Obviously, I took aid from ChatGPT and Claude here because I didn’t know a lot, and they’re amazing tools when it comes to learning. So I finally managed to make it; everything is done, and now I’ll finally be pushing it out!


@shash on WisprFlow SDK · about 2 months ago

19m 56s logged

Well, before I tell you the goal of this project, let me describe why I’m making it and what it is trying to do. For that, I’ll need to share some context.

First of all, hi, I am Shashwat, and it’s really nice that I’ve been given an opportunity to participate in Stardance Hack Club. I did not know how devlogs work, so my first devlog was just a dummy devlog. This is the actual devlog.

I did not know that I was supposed to install Hackatime to track my coding time, and it’s really frustrating that it has barely tracked one hour when I’ve been working on this for over 20 hours now, split across days.

Anyway, that does not matter. Coming back: Wispr Flow is basically a commercial, paid voice transcription app. The way it’s different from Siri’s dictation is that Siri does not know context and Siri does not know your names. It does not have a personal dictionary, and the moment you start switching between languages, as you do if you’re bilingual and like to talk that way, it really struggles. It completely butchers everything. Wispr Flow, on the other hand, removes all the filler words and other unnecessary stuff, so you get very professional text.

In the long term, I want to make a Jarvis-like home assistant for myself. I already have the whole home assistant running on my Raspberry Pi, and I want to control it with my voice. I haven’t really found any model that can do what Wispr Flow can. As the name suggests, if you literally whisper into your microphone it can catch that. Even if you’re far away, even if you’re mumbling half asleep, it can still catch that. I don’t know how they’ve accomplished it, but they’ve accomplished it really well.

Again, this is not an advertisement; I have purchased the Pro tier and paid for it myself. But they don’t offer an API. Basically, they had an API but sunsetted it, so they don’t give access anymore. I spoke to their owners, and the owner said it was very complicated and not really worth the investment, so that’s why they stopped the API. If I can reverse engineer it, I have full permission to do that and to use it for my personal projects.

Over the past few days I’ve spent a combined 25 to 30 hours reverse engineering the Wispr Flow app. The progress so far is that I have figured out how their model works, what the structure of their queries and calls is, and how to structure them myself. The part that’s left is to wrap it into a Python SDK, because I want to be able to send a specific 10 seconds of my voice recording and get the text back programmatically in my own Python application. That is the long-term goal.
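As a rough illustration of the "specific 10 seconds" part, here is a minimal sketch that cuts the first 10 seconds out of a WAV recording using only Python's standard `wave` module. The function name `first_seconds` is made up for this example, and it assumes uncompressed PCM WAV input. How the clip is then sent for transcription is not shown, because that part depends on the SDK.

```python
import io
import wave


def first_seconds(wav_bytes: bytes, seconds: float = 10.0) -> bytes:
    """Return a new WAV file (as bytes) with at most the first `seconds` of audio."""
    # Read the original header and only as many frames as we need.
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)

    # Write a fresh WAV with the same format but the shorter frame data.
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
    return out.getvalue()
```

A recording shorter than 10 seconds comes back at its full length, since the frame count is capped by what the file actually contains.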

I’ll talk to you next time. Also, this entire thing was dictated with Wispr Flow, and I’m literally not going to correct anything because I’m pretty sure it caught everything, so I’m just going to hit post. In case you find any mistakes, just excuse me, because I was using voice dictation.


@shash on WisprFlow SDK · about 2 months ago

50m 19s logged

The end goal: a Python SDK that reverse-engineers the Wispr Flow desktop client and exposes its transcription and command APIs through a clean Python interface. Send audio files directly from Python, stream live audio, customize transcription behavior, and receive structured results, with no UI interaction required.
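Here is a minimal sketch of what "stream live audio" and "structured results" could look like from the Python side. Everything in it is hypothetical: `TranscriptionResult`, `StubSession`, `iter_chunks`, and `stream_transcribe` are illustrative names, not the SDK's real interface, and the stub session only counts bytes instead of talking to any service. The default chunk size of 3200 bytes is also an assumption (100 ms of 16 kHz, 16-bit mono audio).

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class TranscriptionResult:
    """Hypothetical structured result; the real SDK's fields may differ."""
    text: str
    language_hints: List[str] = field(default_factory=list)
    chunks_sent: int = 0


def iter_chunks(audio: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield fixed-size slices of raw audio, the way a live stream delivers them."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


class StubSession:
    """Stand-in for a streaming session (assumption); it only counts input."""

    def __init__(self, language_hints: Optional[List[str]] = None):
        self.language_hints = list(language_hints or [])
        self._bytes = 0
        self._chunks = 0

    def send(self, chunk: bytes) -> None:
        self._bytes += len(chunk)
        self._chunks += 1

    def finish(self) -> TranscriptionResult:
        return TranscriptionResult(
            text=f"<{self._bytes} bytes received>",
            language_hints=self.language_hints,
            chunks_sent=self._chunks,
        )


def stream_transcribe(chunks: Iterable[bytes],
                      session: StubSession) -> TranscriptionResult:
    """Feed every chunk to the session, then collect the structured result."""
    for chunk in chunks:
        session.send(chunk)
    return session.finish()
```

For example, `stream_transcribe(iter_chunks(audio), StubSession(["hi", "en"]))` returns a `TranscriptionResult` whose `chunks_sent` field reports how many slices were fed in. A real session would replace the byte counting with calls to the service.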