Full-band General Audio Synthesis With Score-based Diffusion

This is the demonstration page of the paper “Full-band General Audio Synthesis With Score-based Diffusion” with some selected samples generated with the proposed method.

Info

Abstract

Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40 and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good-quality synthesis.

Reference

Full-band General Audio Synthesis With Score-based Diffusion.
S. Pascual, G. Bhattacharya, C. Yeh, J. Pons, & J. Serrà.
arXiv:2210.14661. October 2022.

Paper Examples

The following examples are generated using the classifier-free guidance weight gamma=2. The model is trained on the UrbanSound8K dataset.
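As a rough illustration of what the guidance weight does, here is a minimal sketch of classifier-free guidance at a single sampling step. The formulation below (chosen so that gamma=0 reduces to the plain conditional score, matching the showcase examples further down) and the function name are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def guided_score(score_cond, score_uncond, gamma):
    """Classifier-free guidance (sketch): blend the conditional and
    unconditional score estimates of the diffusion model. gamma=0
    yields the plain conditional score; larger gamma (e.g. gamma=2)
    pushes samples more strongly toward the conditioning label."""
    return score_cond + gamma * (score_cond - score_uncond)

# Toy scalar example standing in for per-sample score estimates.
s_cond, s_uncond = np.array([1.0]), np.array([0.2])
print(guided_score(s_cond, s_uncond, 0.0))  # -> [1.0], no guidance
print(guided_score(s_cond, s_uncond, 2.0))  # -> [2.6], amplified
```

In a real sampler, `score_cond` and `score_uncond` would come from two forward passes of the score network, with and without the label embedding.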

Air conditioner

Car horn

Children playing

Dog bark

Drilling

Engine idling

Gunshot

Jackhammer

Siren

Street music

Showcase Examples

The following examples are generated using the classifier-free guidance weight gamma=0.

Wave

wave splashing

wave lapping

Wind

wind whistling

wind gust

Rain

rain on grass

rain on umbrella

River

close water running

stream with birds singing

Fire

fire embers

wood fire crackling

Horse

horse gallop

horse carriage

Piano

large dynamics between notes

soft and reverberant

Applause

crowd applauding and cheering

close applauding

Footstep

footsteps on wooden floor

walking on grass

Style Transfer Examples

We use samples from Medley-solos-DB as “inputs” (injected before the first sampling step) to DAG and generate “outputs” with different models.
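A minimal sketch of that injection step follows. The idea is to start the reverse diffusion not from pure noise but from the input waveform perturbed at the largest noise level; the variance-exploding noising convention and the names `style_transfer_init` and `sigma_max` are assumptions for illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def style_transfer_init(input_audio, sigma_max, seed=None):
    """Initialize the sampler for style transfer (sketch): perturb the
    input waveform with Gaussian noise at the maximum noise scale, then
    run the usual reverse diffusion of the target model from this state
    instead of from pure noise."""
    rng = np.random.default_rng(seed)
    return input_audio + sigma_max * rng.standard_normal(input_audio.shape)

# Usage: x_T keeps the coarse structure of the flute recording, and the
# reverse process of, e.g., the piano model renders it in its own timbre.
flute_waveform = np.zeros(16000)  # placeholder 1-second waveform
x_T = style_transfer_init(flute_waveform, sigma_max=50.0, seed=0)
print(x_T.shape)  # -> (16000,)
```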

Flute

input

output generated by piano model

Trumpet

input

output generated by car horn model

Singing

input

output generated by piano model

output generated by dog bark model

Violin

input

output generated by piano model

output generated by dog bark model