Distilling DDSP: Exploring Real-Time Audio Generation on Embedded Systems
Harmonic-plus-Noise (HpN) synthesis decomposes an audio signal into two complementary components: a harmonic component, which models periodic sounds as a sum of sinusoidal oscillators, and a noise component, which captures the non-periodic, broadband content.
A signal $x[n]$ is expressed as:
$ x[n] = e[n] + \sum_{k=1}^N A_k[n] \sin\left(2\pi f_k[n] n T + \phi_k[n]\right) $
where $T$ is the sampling period, $N$ is the number of harmonics, and $A_k[n]$, $f_k[n]$, and $\phi_k[n]$ are, respectively, the amplitude, frequency, and phase of the $k$-th harmonic. The noise component $e[n]$ can be modeled using subtractive synthesis:
$ e[n] = \mathcal{F}\big(\mathcal{N}[n]; \Theta\big), $
where $\mathcal{N}[n]$ is an input noise signal (e.g., white noise or Gaussian noise), $\mathcal{F}$ is a filter function, and $\Theta$ denotes the filter parameters (e.g., cutoff frequency).
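The two equations above can be illustrated with a minimal NumPy sketch. It evaluates the harmonic sum literally as written (so it is most meaningful for constant or slowly varying $f_k[n]$) and, as an assumption made only for this example, takes $\mathcal{F}$ to be an FIR filter whose taps play the role of $\Theta$. The function name `harmonic_plus_noise` and all parameter values are hypothetical.

```python
import numpy as np

def harmonic_plus_noise(amps, freqs, phases, fir_taps, sample_rate, rng=None):
    """Return (x, harmonic, noise) for x[n] = e[n] + sum_k A_k[n] sin(2*pi*f_k[n]*n*T + phi_k[n]).

    amps, freqs, phases: arrays of shape (N, num_samples), one row per harmonic.
    fir_taps: 1-D FIR coefficients (assumed stand-in for the filter parameters Theta).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    num_samples = amps.shape[1]
    n = np.arange(num_samples)   # sample index
    T = 1.0 / sample_rate        # sampling period

    # Harmonic component: sum of N sinusoidal oscillators.
    harmonic = np.sum(amps * np.sin(2.0 * np.pi * freqs * n * T + phases), axis=0)

    # Noise component (subtractive synthesis): filter white Gaussian noise.
    white = rng.standard_normal(num_samples)
    noise = np.convolve(white, fir_taps, mode="same")

    return harmonic + noise, harmonic, noise

# Illustrative values: one second at 16 kHz, three harmonics of 440 Hz.
sample_rate = 16000
num_samples = 16000
k = np.arange(1, 4)[:, None]                        # harmonic numbers 1..3
freqs = 440.0 * k * np.ones((1, num_samples))       # f_k[n], constant over time
amps = (1.0 / k) * np.ones((1, num_samples))        # A_k[n], decaying with k
phases = np.zeros((3, num_samples))                 # phi_k[n]
fir_taps = 0.05 * np.ones(8) / 8.0                  # quiet moving-average low-pass

x, harmonic, noise = harmonic_plus_noise(amps, freqs, phases, fir_taps, sample_rate)
```

For frequencies that change quickly over time, a practical oscillator would typically accumulate phase sample by sample rather than multiply $f_k[n]$ by $nT$; the sketch keeps the form of the equation for clarity.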
The HpN architecture employs a decoder built from recurrent and fully connected layers. Conditioned on a sequence of pitch ($f_0$) and loudness ($L$) frames, the decoder predicts three sets of synthesis controls: the overall amplitude of the audio signal ($A$), the normalized distribution of spectral variations across the harmonics ($c_k$), and the coefficients of the filter used to model the noise component ($h$).
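As a rough illustration of how such decoder outputs could drive the harmonic oscillators, the sketch below converts per-frame controls into per-harmonic frequencies and amplitudes. Several things here are assumptions rather than details given in the text: harmonics are placed at integer multiples of the pitch ($f_k = k f_0$), the normalized distribution $c_k$ is obtained with a softmax over raw decoder outputs, per-harmonic amplitudes are $A_k = A \, c_k$, and harmonics at or above the Nyquist frequency are silenced. The name `controls_to_harmonics` is hypothetical.

```python
import numpy as np

def controls_to_harmonics(f0, amplitude, harmonic_logits, sample_rate):
    """Map per-frame decoder controls to per-harmonic frequencies and amplitudes.

    f0, amplitude: arrays of shape (frames,).
    harmonic_logits: array of shape (frames, N), unnormalized scores per harmonic.
    Returns (freqs, amps), both of shape (frames, N).
    """
    num_harmonics = harmonic_logits.shape[1]
    k = np.arange(1, num_harmonics + 1)

    # Assumption: the k-th harmonic sits at k times the fundamental.
    freqs = f0[:, None] * k[None, :]

    # Assumption: softmax turns raw scores into a normalized distribution c_k.
    z = harmonic_logits - harmonic_logits.max(axis=1, keepdims=True)
    c = np.exp(z)
    c /= c.sum(axis=1, keepdims=True)

    # Assumption: drop harmonics at or above Nyquist, then renormalize.
    c = c * (freqs < sample_rate / 2.0)
    c = c / np.maximum(c.sum(axis=1, keepdims=True), 1e-12)

    # Per-harmonic amplitude: overall amplitude spread across the harmonics.
    amps = amplitude[:, None] * c
    return freqs, amps

# Illustrative values: two frames, four harmonics, uniform scores.
f0 = np.array([220.0, 440.0])
amplitude = np.array([0.5, 1.0])
harmonic_logits = np.zeros((2, 4))
freqs, amps = controls_to_harmonics(f0, amplitude, harmonic_logits, 16000)
```

With this normalization, the per-harmonic amplitudes in each frame sum to the overall amplitude $A$, so the decoder controls loudness and timbre separately.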
| Instrument | Reference | Anchor (LPC) |
|---|---|---|
| 🪈 Flute | | |
| 🎺 Trumpet | | |
| 🎻 Violin | | |
| 🎹 Piano | | |

| Instrument | Full | Reduced | Reduced+AD | Reduced+CD |
|---|---|---|---|---|
| 🪈 Flute | | | | |
| 🎺 Trumpet | | | | |
| 🎻 Violin | | | | |
| 🎹 Piano | | | | |