This is the demonstration page for the paper “Full-band General Audio Synthesis With Score-based Diffusion”, with selected samples generated by the proposed method.
Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40% and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good quality synthesis.
Full-band General Audio Synthesis With Score-based Diffusion.
S. Pascual, G. Bhattacharya, C. Yeh, J. Pons, & J. Serrà.
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023).
arXiv: 2210.14661
The following examples are generated using a classifier-free guidance weight of gamma=2. The model is trained on the UrbanSound8K dataset.
The following examples are generated using a classifier-free guidance weight of gamma=0.
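For readers unfamiliar with the guidance weight, the snippet below sketches how classifier-free guidance typically combines the label-conditioned and unconditional estimates using gamma. It is a minimal, self-contained sketch: the tiny ScoreNet stub, the toy noise schedule, the Euler-style update, and the assumed 48 kHz sample rate are illustrative assumptions, not the paper's actual architecture or sampler, and we assume the common convention in which gamma=0 recovers the purely conditional estimate.

```python
import torch
import torch.nn as nn


class ScoreNet(nn.Module):
    """Stand-in for DAG's score network (illustrative, not the paper's architecture)."""

    def __init__(self, n_labels: int = 10, hidden: int = 64):
        super().__init__()
        # Reserve one extra embedding index as the "no label" (unconditional) token.
        self.label_emb = nn.Embedding(n_labels + 1, hidden)
        self.net = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, x, sigma, label):
        # A real model would condition on the noise level sigma and the label
        # embedding; here we only sketch the call signature.
        return self.net(x) + 0.0 * self.label_emb(label).mean()


def guided_score(model, x, sigma, label, null_label, gamma):
    """Classifier-free guidance: gamma=0 keeps the purely conditional estimate,
    larger gamma pushes it further away from the unconditional one."""
    s_cond = model(x, sigma, label)
    s_uncond = model(x, sigma, null_label)
    return s_cond + gamma * (s_cond - s_uncond)


# One illustrative reverse-diffusion step on one second of audio (assumed 48 kHz).
model = ScoreNet()
x = torch.randn(1, 1, 48000)        # start from pure noise
label = torch.tensor([3])           # some class index, e.g. "dog bark"
null_label = torch.tensor([10])     # the unconditional token
sigma_now, sigma_next = 1.0, 0.8    # two points of a toy noise schedule
score = guided_score(model, x, sigma_now, label, null_label, gamma=2.0)
x = x + (sigma_now ** 2 - sigma_next ** 2) / 2 * score  # illustrative Euler update
```

Higher gamma trades diversity for stronger adherence to the label, which is why the gamma=2 and gamma=0 sets above and below can sound noticeably different for the same class.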
wave splashing
wave lapping
wind whistling
wind gust
rain on grass
rain on umbrella
close water running
stream with birds singing
fire embers
wood fire crackling
horse gallop
horse carriage
large dynamics between notes
soft and reverberant
crowd applauding and cheering
close applauding
footsteps on wooden floor
walking on grass
We use samples from Medley-solos-DB as “inputs” (injected before the first sampling step) to DAG and generate “outputs” with different models.
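One common way to read “injected before the first sampling step” is an SDEdit-style initialization: corrupt the input waveform with noise at the starting noise level and run the usual reverse process from there instead of from pure noise. The sketch below is a minimal sketch under that assumption, not the exact procedure used for these examples; the sigma schedule, the Euler-style update, and the toy model in the usage example are all illustrative.

```python
import torch


def inject_input(x_input: torch.Tensor, sigma_start: float) -> torch.Tensor:
    """Corrupt a clean input waveform with noise at the starting level so the
    reverse process begins from it rather than from pure Gaussian noise."""
    return x_input + sigma_start * torch.randn_like(x_input)


def resynthesize(model, x_input, label, sigmas, guidance_fn):
    """Run an illustrative reverse process starting from the noised input."""
    x = inject_input(x_input, float(sigmas[0]))
    for sigma_now, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        score = guidance_fn(model, x, sigma_now, label)
        x = x + (sigma_now ** 2 - sigma_next ** 2) / 2 * score  # Euler-style step
    return x


# Toy usage with a stand-in "score" model and a linear noise schedule.
toy_model = lambda x, sigma, label: -x      # stands in for the trained network
x_in = torch.randn(1, 1, 44100)             # placeholder for a Medley-solos-DB excerpt
sigmas = torch.linspace(1.0, 0.0, steps=50)
x_out = resynthesize(toy_model, x_in, label=torch.tensor([0]), sigmas=sigmas,
                     guidance_fn=lambda m, x, s, l: m(x, s, l))
```

Because sampling starts from a noised version of the input rather than pure noise, the output keeps some of the input's temporal structure while the conditioning label (piano, car horn, dog bark, etc.) steers its timbre, which is what the input/output pairs below illustrate.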
input
output generated by piano model
input
output generated by carhorn model
input
output generated by piano model
output generated by dog bark model
input
output generated by piano model
output generated by dog bark model