1h 14m 10s logged

making it sound less terrible (two commits, one mission)

these two commits are about one thing: the output was robotic and buzzy and I was tired of it. every change here is about making the synthesizer sound more like a voice and less like a modem.


the renderer got smarter

cross-note co-articulation. renderNote() now returns { chunk, finalFormants }. the stream passes the previous note’s final formants into the next note’s renderer, and the first phoneme interpolates from those formants instead of jumping cold. gaps between notes reset the chain. notes that follow each other seamlessly now blend their formants across the boundary.
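a rough sketch of how that chaining could look, not the actual code. only renderNote() and its { chunk, finalFormants } return shape come from the post; the Note fields, the onsetFormants field (there just so the result can be inspected), renderStream() and the maxGap threshold are all made up for illustration.

```typescript
// hypothetical shapes; only renderNote's { chunk, finalFormants } is from the post
type Formants = number[]; // formant center frequencies in Hz

interface Note {
  start: number;            // seconds
  duration: number;         // seconds
  targetFormants: Formants; // where the note's last phoneme ends up
}

interface RenderResult {
  chunk: Float32Array;
  finalFormants: Formants;
  onsetFormants: Formants;  // illustration only: where the first phoneme starts from
}

// stand-in renderer: audio synthesis is omitted, only the formant hand-off is shown
function renderNote(note: Note, prev: Formants | null): RenderResult {
  // with a previous note, start from its final formants instead of jumping cold
  const onsetFormants = prev ?? note.targetFormants;
  return {
    chunk: new Float32Array(0),
    finalFormants: note.targetFormants,
    onsetFormants,
  };
}

function renderStream(notes: Note[], maxGap = 0.001): RenderResult[] {
  const results: RenderResult[] = [];
  let prev: Formants | null = null;
  let prevEnd = -Infinity;
  for (const note of notes) {
    // a gap between notes resets the chain
    if (note.start - prevEnd > maxGap) prev = null;
    const result = renderNote(note, prev);
    results.push(result);
    prev = result.finalFormants;
    prevEnd = note.start + note.duration;
  }
  return results;
}
```

back-to-back notes inherit the previous note's formants; a note after a rest starts from its own targets.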

per-phoneme envelopes. the old global 5ms attack / 10ms release is gone, replaced with getPhonemeEnvelopeSamples(), which gives each phoneme type its own envelope: plosives get 2ms attack / 15ms decay (sharp burst), other consonants and vowels get 5ms / 3ms. every phoneme segment fades independently.
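something like this, as a sketch. the function name and the millisecond values are from the post; the signature, the PhonemeKind type, the linear fade shape and applyEnvelope() are assumptions, and "decay" is read here as the fade-out at the end of the segment.

```typescript
// hypothetical type; the post only distinguishes plosives from everything else
type PhonemeKind = "plosive" | "consonant" | "vowel";

// assumed signature: returns envelope lengths in samples
function getPhonemeEnvelopeSamples(
  kind: PhonemeKind,
  sampleRate: number,
): { attack: number; decay: number } {
  const ms = kind === "plosive" ? { attack: 2, decay: 15 } : { attack: 5, decay: 3 };
  return {
    attack: Math.round((ms.attack * sampleRate) / 1000),
    decay: Math.round((ms.decay * sampleRate) / 1000),
  };
}

// linear fade-in / fade-out applied in place to one phoneme segment
function applyEnvelope(segment: Float32Array, attack: number, decay: number): void {
  const n = segment.length;
  // clamp so the two fades never overlap on very short segments
  const a = Math.min(attack, Math.floor(n / 2));
  const d = Math.min(decay, n - a);
  for (let i = 0; i < a; i++) segment[i] *= i / a;
  for (let i = 0; i < d; i++) segment[n - 1 - i] *= i / d;
}
```

because the envelope is computed per segment, each phoneme can fade on its own schedule instead of sharing one global ramp.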

diphthong formant sweeping. PhonemeDef got an endFormants field. if a diphthong has both formants and endFormants, the renderer sweeps between them over the phoneme duration. AY now actually glides from /aa/ to /ih/. EY glides from /eh/ to /ih/. OW glides from /oh/ to /uh/. they sound like diphthongs now instead of static vowels.
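a minimal sketch of the sweep, assuming plain linear interpolation (the post doesn't say which curve the renderer uses). PhonemeDef, formants and endFormants are from the post; formantsAt() is a made-up helper, and the AY numbers are ballpark textbook values for /aa/ and /ih/, not the project's data.

```typescript
interface PhonemeDef {
  formants: number[];     // starting formant frequencies in Hz
  endFormants?: number[]; // only present on diphthongs
}

// t runs from 0 (phoneme start) to 1 (phoneme end)
function formantsAt(p: PhonemeDef, t: number): number[] {
  const end = p.endFormants;
  if (!end) return p.formants; // static vowel or consonant: nothing to sweep
  const u = Math.min(1, Math.max(0, t));
  return p.formants.map((f, i) => f + (end[i] - f) * u);
}

// illustrative values only
const AY: PhonemeDef = {
  formants: [730, 1090, 2440],    // roughly /aa/
  endFormants: [390, 1990, 2550], // roughly /ih/
};
```

halfway through the phoneme, formantsAt(AY, 0.5) sits midway between the two targets, which is what makes the vowel glide instead of holding still.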

vibratoOverride. was defined on Note but never read. now it is. per-note vibrato control works.


everything got retuned

glottal source. added shimmer (per-cycle amplitude variation driven by jitter, so the volume wobbles slightly like a real voice). aspiration noise is now high-pass filtered (subtract a lowpass from the raw noise) so it’s airy instead of muddy. aspiration gain bumped from 0.1 to 0.15.
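the "subtract a lowpass from the raw noise" trick can be sketched like this. the one-pole lowpass and its alpha coefficient are assumptions (the post doesn't say which lowpass is used), and highPassNoise() and shimmerGain() are hypothetical names.

```typescript
// high-pass by subtraction: run a one-pole lowpass, then remove it from the input
function highPassNoise(noise: Float32Array, alpha: number): Float32Array {
  const out = new Float32Array(noise.length);
  let low = 0;
  for (let i = 0; i < noise.length; i++) {
    low += alpha * (noise[i] - low); // lowpass tracks the slow, muddy part
    out[i] = noise[i] - low;         // what remains is the airy high end
  }
  return out;
}

// shimmer: per-cycle gain that follows the jitter value for that cycle
// (jitter assumed to be in [-1, 1]; depth is a hypothetical parameter)
function shimmerGain(jitter: number, depth: number): number {
  return 1 + depth * jitter;
}
```

a constant (DC) input decays to silence, while a signal flipping sign every sample passes through almost untouched.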

plosive bursts. noise envelope for plosives changed from symmetric fade to a fast 12ms exponential decay. “pa” now sounds like a burst instead of a pop.
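roughly what an exponential burst envelope looks like. this sketch treats the 12ms as the time constant (the time for the gain to fall to 1/e), which is a guess; burstEnvelope() is a hypothetical name.

```typescript
// exponential decay envelope for the plosive noise burst
function burstEnvelope(lengthSamples: number, sampleRate: number, decayMs = 12): Float32Array {
  const env = new Float32Array(lengthSamples);
  const tau = (decayMs / 1000) * sampleRate; // time constant in samples
  for (let i = 0; i < lengthSamples; i++) {
    env[i] = Math.exp(-i / tau); // full level at onset, then a fast fall-off
  }
  return env;
}
```

unlike a symmetric fade, this starts at full level on the very first sample, so the energy is front-loaded the way a released stop is.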

formant data for everything. Z, ZH, V, DH, Y, W, HH, JH in English all got formant targets. same for z, h, y, w, j in Japanese. consonants that were previously just noise bursts now resonate through the vocal tract. the difference is huge.

vowel bandwidths retuned. defaults went from 80/100/120 to 70/90/130 Hz: the first two got narrower, the third slightly wider. narrower bandwidths = sharper resonant peaks = more vowel-like quality.
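why a narrower bandwidth means a sharper peak, as a sketch. this assumes the formants are built from standard two-pole digital resonators, which the post doesn't confirm; poleRadius() is a hypothetical helper using the usual bandwidth-to-pole-radius relation.

```typescript
// for a two-pole resonator, the pole radius is r = exp(-pi * bandwidth / sampleRate).
// the closer r is to 1, the longer the resonator rings and the sharper its peak.
function poleRadius(bandwidthHz: number, sampleRate: number): number {
  return Math.exp((-Math.PI * bandwidthHz) / sampleRate);
}

const before = poleRadius(80, 48000); // old first-formant default
const after = poleRadius(70, 48000);  // new first-formant default
```

going from 80 Hz to 70 Hz nudges the pole closer to the unit circle, so the resonance gets a bit taller and narrower.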

voice presets retuned. male voice: lower open quotient (0.4), lower speed quotient (0.65), higher tenseness (0.65), less aspiration (0.05). sounds less breathy, more chest voice. female: formant scale 1.18. gender slider in scaleVoice now affects speed quotient and has a gender-dependent tenseness base.

pitch accent. Japanese got resolveAccents() implementing heiban pattern (low first mora, high rest). the stream groups consecutive notes into phrases, calls resolveAccents, and applies the offsets as constant pitch shifts. it’s basic but it makes Japanese phrases have some melodic contour beyond what the score provides.
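a sketch of the heiban idea. resolveAccents is the name from the post, but this signature, the semitone units, the step size and the shiftPitch() helper are all assumptions.

```typescript
// heiban: first mora low, every following mora high.
// returns one pitch offset (in semitones, assumed) per mora in the phrase.
function resolveAccents(moraCount: number, stepSemitones = 2): number[] {
  return Array.from({ length: moraCount }, (_, i) => (i === 0 ? 0 : stepSemitones));
}

// constant pitch shift: move a frequency by a number of semitones
function shiftPitch(frequencyHz: number, semitones: number): number {
  return frequencyHz * Math.pow(2, semitones / 12);
}

// usage: offset every note in a four-mora phrase
const offsets = resolveAccents(4);
const shifted = [220, 220, 220, 220].map((f, i) => shiftPitch(f, offsets[i]));
```

the offset is constant across each note, so it rides on top of whatever pitch the score already specifies.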


four TODO items checked off in one go: co-articulation, phoneme envelopes, diphthong sweeping, vibratoOverride. pitch accent too.

it still doesn’t sound human. but it’s starting to sound like it’s trying. and that’s a big step from where it was.


if you can identify the song in the editor image, good job, you’re cool :D
