I built a Python voice assistant called Jarvis. It listens for a wake word (“Jarvis”), processes spoken input, and then responds using text-to-speech or executes basic system actions like opening apps, running commands, or answering queries through an AI/API layer.
How it works (in simple terms)
The flow is basically:
Microphone input
The program constantly listens through the mic until it detects the wake word.
Wake word detection
Once “Jarvis” is detected, it switches into active listening mode.
Speech-to-text
Your speech is converted into text using a speech recognition engine.
Processing layer
The text is either matched to predefined commands (like opening apps or searching) or sent to an AI model/API for a response.
Response output
The response is converted back to speech and played through speakers.
State control
It manages states like “listening”, “thinking”, and “speaking” so it doesn’t overlap audio input and output.
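The flow above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual code: the command table, the `ask_ai` placeholder, and the `run_turn` helper are hypothetical names, and speech-to-text and text-to-speech are replaced by plain strings and a callback.

```python
from enum import Enum, auto
from typing import Callable, Dict, List

WAKE_WORD = "jarvis"


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


def open_app(arg: str) -> str:
    # Hypothetical handler; a real one would launch the application.
    return f"Opening {arg}"


def search(arg: str) -> str:
    # Hypothetical handler; a real one would run a search.
    return f"Searching for {arg}"


# Hypothetical command table: first word of the utterance -> handler.
COMMANDS: Dict[str, Callable[[str], str]] = {"open": open_app, "search": search}


def ask_ai(text: str) -> str:
    # Placeholder for the AI/API layer; a real version would make a request.
    return f"(AI answer to: {text})"


def route(text: str) -> str:
    """Match a predefined command first, otherwise fall back to the AI layer."""
    words = text.lower().split()
    if words and words[0] in COMMANDS:
        return COMMANDS[words[0]](" ".join(words[1:]))
    return ask_ai(text)


def run_turn(wake_text: str, utterance: str, speak: Callable[[str], None]) -> List[State]:
    """Run one wake -> listen -> think -> speak cycle and return the states visited."""
    visited = [State.IDLE]
    if WAKE_WORD not in wake_text.lower():
        return visited  # no wake word, stay idle
    visited.append(State.LISTENING)  # speech-to-text would happen here
    visited.append(State.THINKING)
    reply = route(utterance)
    visited.append(State.SPEAKING)
    speak(reply)  # text-to-speech would happen here
    visited.append(State.IDLE)
    return visited
```

Keeping the routing in one function makes it easy to test the command layer without a microphone or speakers attached.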
Devlog / what actually went wrong
At the start it looked simple, but most of the time was spent fixing edge cases and weird failures.
Mic input was the first issue. It either didn’t pick up sound or used the wrong device entirely. On my system it worked, but on another setup it silently failed because the default audio device index was different. I had to explicitly handle device selection instead of relying on defaults.
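A minimal sketch of explicit device selection, assuming the audio library can give you a list of input device names. The function name `pick_input_device` and the name-matching rule are hypothetical. The point is to fail loudly with the list of available devices rather than silently recording from the wrong one.

```python
from typing import List, Optional


def pick_input_device(names: List[str], preferred: Optional[str] = None) -> int:
    """Return the index of the first device whose name contains `preferred`.

    Matching is case-insensitive. Without a preference the first device is
    used, but a missing or unmatched device raises instead of failing silently.
    """
    if not names:
        raise RuntimeError("No audio input devices found")
    if preferred:
        wanted = preferred.lower()
        for index, name in enumerate(names):
            if wanted in name.lower():
                return index
        raise RuntimeError(f"No input device matching {preferred!r}; available: {names}")
    return 0


# Assumption: with the SpeechRecognition package, `names` could come from
# sr.Microphone.list_microphone_names() and the result could be passed as
# sr.Microphone(device_index=...). Other audio libraries expose similar lists.
```

Storing the preferred device name in a config file (rather than an index) tends to survive moving between machines better, since indices differ per system.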
Speech recognition was inconsistent too. It would misinterpret short commands or fail when there was background noise. I had to tweak sensitivity and improve how it handled pauses in speech so it wouldn’t cut off early.
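To show what "handling pauses" means in practice, here is a small stand-alone sketch of end-of-speech detection. It is an illustration, not the recognizer's internals: `utterance_end` is a hypothetical helper that works on a list of per-frame loudness values and only ends the utterance after several quiet frames in a row.

```python
from typing import Sequence


def utterance_end(energies: Sequence[float], threshold: float, max_silent_frames: int) -> int:
    """Return the frame index at which the utterance is considered finished.

    A frame counts as speech when its energy is at or above `threshold`.
    The utterance ends only after `max_silent_frames` consecutive quiet
    frames following some speech, so short pauses do not cut it off.
    If that never happens, the total number of frames is returned.
    """
    silent_run = 0
    started = False
    for index, energy in enumerate(energies):
        if energy >= threshold:
            started = True
            silent_run = 0
        elif started:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return index
    return len(energies)


frames = [0, 0, 5, 6, 0, 7, 8, 0, 0, 0, 0]
early = utterance_end(frames, threshold=3, max_silent_frames=1)     # stops at the first pause
tolerant = utterance_end(frames, threshold=3, max_silent_frames=3)  # waits out the short pause
```

If the project uses the SpeechRecognition package (an assumption), the comparable settings are `Recognizer.energy_threshold` for sensitivity and `Recognizer.pause_threshold` for how long a silence ends a phrase.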
One major mistake was not separating speaking and listening states. The assistant would sometimes hear its own response and trigger itself again, creating loops. I fixed this by adding a simple lock so it cannot listen while speaking.
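A minimal sketch of that kind of lock, assuming text-to-speech is a blocking call. `SpeechGate` and its method names are hypothetical. While the assistant is speaking, anything the microphone picks up is dropped, so it cannot trigger itself.

```python
import threading
from typing import Callable, Optional


class SpeechGate:
    """Blocks audio input from being processed while the assistant is speaking."""

    def __init__(self) -> None:
        self._speaking = threading.Event()

    def speak(self, text: str, tts: Callable[[str], None]) -> None:
        """Mark the assistant as speaking for the duration of the TTS call."""
        self._speaking.set()
        try:
            tts(text)
        finally:
            # Always release the gate, even if TTS raises.
            self._speaking.clear()

    def accept_audio(self, transcript: str) -> Optional[str]:
        """Return the transcript, or None if it arrived while speaking."""
        if self._speaking.is_set():
            return None
        return transcript
```

Using a `threading.Event` means the flag can be checked safely from a separate listener thread. The `finally` block matters: without it, a TTS error would leave the assistant permanently deaf.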
The biggest structural issue came later with packaging. It worked perfectly when running python main.py, but broke when converted into an executable. The dist version failed due to missing modules, broken file paths, and hardcoded system-specific directories.
I initially assumed the environment would behave the same after packaging, but PyInstaller changes where files live at runtime and which modules end up in the bundle. I had to fix relative paths, remove hardcoded references to my machine, and manually ensure hidden dependencies were included.
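The usual way to handle the path problem is a small helper that resolves bundled files differently depending on whether the program is frozen. The sketch below uses the `sys.frozen` and `sys._MEIPASS` attributes that PyInstaller sets in a bundled app. The helper name `resource_path` and the example file name are hypothetical.

```python
import sys
from pathlib import Path


def resource_path(relative: str) -> Path:
    """Resolve a bundled data file both as a script and as a PyInstaller build.

    In a PyInstaller bundle, sys.frozen is set and sys._MEIPASS points at the
    folder holding the bundled files. When running as a plain script, paths
    are resolved relative to the folder of the script that was launched.
    """
    if getattr(sys, "frozen", False) and hasattr(sys, "_MEIPASS"):
        base = Path(sys._MEIPASS)
    else:
        base = Path(sys.argv[0]).resolve().parent
    return base / relative


# Example with a hypothetical asset name:
beep = resource_path("assets/beep.wav")
```

For modules PyInstaller does not detect on its own, the `--hidden-import` command-line option lets you list them explicitly at build time.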
API keys and environment variables also caused issues. The program would run fine in the terminal but fail in packaged form because the .env file or API key wasn’t being loaded correctly in the new runtime context.
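One way to make this failure visible is to look for the .env file in a known place and raise a clear error if the key is missing. The sketch below is stdlib-only so it runs anywhere. In a real project the python-dotenv package would be the more common choice. The names `runtime_dir`, `load_env_file`, and `require_env` are hypothetical, and placing .env next to the executable is an assumption about the layout.

```python
import os
import sys
from pathlib import Path


def runtime_dir() -> Path:
    """Folder where user-editable files such as .env are expected to live."""
    if getattr(sys, "frozen", False):
        # Packaged build: look next to the executable, not in the temp bundle.
        return Path(sys.executable).resolve().parent
    return Path(sys.argv[0]).resolve().parent


def load_env_file(path: Path) -> int:
    """Load KEY=VALUE lines into os.environ and return how many were added.

    Existing environment variables are not overridden. Blank lines,
    comments, and lines without '=' are skipped.
    """
    if not path.is_file():
        raise FileNotFoundError(f".env not found at {path}")
    loaded = 0
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("\"'")
        if key and key not in os.environ:
            os.environ[key] = value
            loaded += 1
    return loaded


def require_env(name: str) -> str:
    """Return the variable's value or fail with a message naming it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable {name}")
    return value
```

At startup this would be called as `load_env_file(runtime_dir() / ".env")` followed by `require_env(...)` for each key the assistant needs, so a missing file or key stops the program immediately with a readable message.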
What I learned from it
Most of the problems were not in “AI logic” but in system-level behavior: audio devices, file paths, packaging, and runtime differences.
The main takeaways were:
voice input systems are fragile and device-dependent
Python packaging is not equivalent to running a script
state management matters more than expected in real-time systems
bugs usually come from environment mismatch, not the code logic itself