Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion


1. Samples of Editing

1.1. Samples of Text-Based Editing

  1.1.1. Various Samples

# Source Prompt Target Prompt Original Audio Edited Audio Edit Skip
1 A recording of a sneaky jazz song. A recording of a tense classical music score. 90
2 A recording of a hard rock song. A recording of a jazz song. 100
3 A recording of a happy upbeat classical music piece. A recording of a happy upbeat arcade game soundtrack. 100
4 A recording of a rock song. A recording of Arabic music. 90
5 —— A recording of a funky hip hop song. 90
6 Trumpets playing alongside a piano, bass and drums in an upbeat old-timey cool jazz song. A banjo playing alongside a piano, bass and drums in an upbeat old-timey cool country song. 110
7 A recording of an upbeat gospel song. A recording of an upbeat techno song. 100
8 A recording of a happy upbeat song in a Latin jazz style. A recording of a happy upbeat song in a retro arcade game soundtrack style. 110
9 —— A recording of a dark techno song. 110
10 A recording of a dramatic epic Chinese piece. A recording of a dramatic heavy metal piece. 160
11 —— A recording of an upbeat cool jazz song. 110
12 A recording of an old rock song. A recording of a techno song. 110
13 Chinese strings, flutes, and harps playing an upbeat piece. Chinese strings, flutes, and harps playing a somber piece. 120
14 A high quality recording of wind instruments and strings playing. A high quality recording of a piano playing. 130
15 —— A recording of an upbeat arcade game soundtrack. 120
16 A high quality recording of a cat meowing. A high quality recording of a dog barking. 50
17 A high quality recording of a dog barking a lot. A high quality recording of a gun shooting a lot. 100
18 A kid talking loudly. A rooster crowing. 90

  1.1.2. The Effect of Skip Used

# Source Prompt Target Prompt Original Audio Skip=90 Skip=100 Skip=110 Skip=120 Skip=130
1 A recording of a happy upbeat song in a Latin jazz style. A recording of a happy upbeat song in a retro arcade game soundtrack style.
2 A recording of a funky jazz song.
3 Trumpets playing alongside a piano, bass and drums in an upbeat old-timey cool jazz song. A banjo playing alongside a piano, bass and drums in an upbeat old-timey cool country song.


1.2. Samples of Unsupervised Uncertainty-Based Editing

  1.2.1. Various Samples (Strength changes)

# Inversion Prompt Original Audio Edited Audio +PC Edited Audio +2PC PC Interpretation Edit Parameters
1 A high quality recording of flutes and a trumpet playing. Melody change t'∈[200, -1]
Specific t=80 used
PCs 1+2+3
2 A recording of a calm country song. Remove singer t'∈[150, -1]
Specific t=115 used
PCs 1+2+3
3 Just drums t'∈[150, -1]
Specific t=80 used
PCs 1+2+3
4 A recording of a scary classical music piece. Melody change t'∈[150, 50]
Specific t=95 used
PCs 1+2+3
5 A trumpet and a saxophone playing a cool jazz melody, with an accompaniment of a piano, bass and drums. Melody change t'∈[135, 95]
PCs 1+2+3
6 A high quality recording of wind instruments and strings playing. Melody change t'∈[135, 95]
PCs 1+2+3
7 A strings section playing classical music. Minor melody changes t'∈[95, 80]
PCs 1+2+3
8 A high quality recording of a woman singing while a guitar and drums play in the background. Instrument change t'∈[200, -1]
Specific t=65 used
PCs 1+2+3


  1.2.2. Various Samples (PC direction changes)

# Inversion Prompt Edited Audio -γPC Original Audio Edited Audio +γPC PC Interpretation Edit Parameters
1 A high quality recording of a man singing and drums, guitar and bass playing a song, and later a woman is singing. Lead Guitar/Singers emphasis t'∈[115, 80]
PC #1
2 A high quality recording of a man singing and drums, guitar and bass playing a song, and later a woman is singing. Singers/Drums emphasis t'∈[115, 80]
PC #2
3 A recording of rhythmic clapping, a woman singing, and drums and guitar playing. Vibrato strength t'∈[150, -1]
Specific t=120 used
PC #3
4 A high quality recording of a man singing with a rock band accompaniment. Drum-beat style t'∈[200, -1]
Specific t=80 used
PC #1
5 A recording of an old timey rock song from the sixties. Guitar/Singer emphasis t'∈[200, -1]
Specific t=65 used
PCs 1+2+3
6 Isolate Woman/Man t'∈[115, 95]
PC #1



2. Comparisons to Other Methods

2.1. Comparisons of Text-Based Editing

  2.1.1. Music Samples

# Source Prompt Target Prompt Original Audio Ours SDEdit skip=100 SDEdit skip=130 SDEdit skip=160 MusicGen DDIM Inversion
1 A recording of a sneaky jazz song. A recording of a tense classical music score.
skip=90


2 A recording of a rock song. A recording of Arabic music.
skip=90


3 A recording of an upbeat rock song. A recording of an arcade game soundtrack.
skip=100


4 A recording of a dark techno song.
skip=110


5 A recording of a funky hip hop song.
skip=90


6 A recording of an upbeat arcade game soundtrack.
skip=120


7 A recording of an upbeat cool jazz song.
skip=110


8 A recording of an upbeat gospel song. A recording of an upbeat techno song.
skip=100


9 Trumpets playing alongside a piano, bass and drums in an upbeat old-timey cool jazz song. A banjo playing alongside a piano, bass and drums in an upbeat old-timey cool country song.
skip=110


10 A recording of a dramatic epic Chinese piece. A recording of a dramatic heavy metal piece.
skip=160


11 Chinese strings, flutes, and harps playing an upbeat piece. Chinese strings, flutes, and harps playing a somber piece.
skip=120


12 A high quality recording of wind instruments and strings playing. A high quality recording of a piano playing.
skip=130


13 A recording of a happy arcade game soundtrack.
skip=90


14 A recording of a hard rock song. A recording of a jazz song.
skip=100


15 A recording of an old rock song. A recording of a techno song.
skip=110


16 A recording of a happy upbeat song in a Latin jazz style. A recording of a happy upbeat song in a retro arcade game soundtrack style.
skip=110




  2.1.2. Audio Samples

# Source Prompt Target Prompt Original Audio Ours SDEdit skip=50 SDEdit skip=80 SDEdit skip=100 SDEdit skip=130 DDIM Inversion
1 A high quality recording of a cat meowing. A high quality recording of a dog barking.
skip=50
2 A high quality recording of a dog barking a lot. A high quality recording of a gun shooting a lot.
skip=100
3 A kid talking loudly. A rooster crowing.
skip=90


2.2. Comparisons of Unsupervised Uncertainty-Based Editing

# Inversion Prompt Original Audio Our Semantic Edit SDEdit Skip=85 SDEdit Skip=100 SDEdit Skip=115 SDEdit Skip=130 Our Edit Parameters
1 A high quality recording of a man singing and drums, guitar and bass playing a song, and later a woman is singing. t'∈[115, 80]
PC #1
2 A high quality recording of a man singing with a rock band accompaniment. t'∈[200, -1]
Specific t=80 used
PC #1
3 t'∈[150, -1]
Specific t=80 used
PCs 1+2+3
4 A high quality recording of flutes and a trumpet playing. t'∈[200, -1]
Specific t=80 used
PCs 1+2+3
5 A recording of a calm country song. t'∈[150, -1]
Specific t=115 used
PCs 1+2+3
6 A recording of a scary classical music piece. t'∈[150, 50]
Specific t=95 used
PCs 1+2+3
7 A trumpet and a saxophone playing a cool jazz melody, with an accompaniment of a piano, bass and drums. t'∈[135, 95]
PCs 1+2+3
8 A high quality recording of wind instruments and strings playing. t'∈[135, 95]
PCs 1+2+3
9 A strings section playing classical music. t'∈[95, 80]
PCs 1+2+3
10 A recording of an old timey rock song from the sixties. t'∈[200, -1]
Specific t=65 used
PCs 1+2+3
11 A high quality recording of a woman singing while a guitar and drums play in the background. t'∈[200, -1]
Specific t=65 used
PCs 1+2+3



3. Comparison of Unsupervised Editing Directions With Random Directions

# Type Inversion Prompt Edited Audios -γPC Original Audio Edited Audios +γPC PC Interpretation Edit Parameters
1 Random A high quality recording of a man singing with a rock band accompaniment.
γ = -12, γ = -8, γ = -2, γ = 2, γ = 8, γ = 12
t'∈[200, -1]
Specific t=80 used
PC #1
Ours A high quality recording of a man singing with a rock band accompaniment.
γ = -3, γ = -2, γ = -1, γ = 1, γ = 2, γ = 3
Drum-beat style t'∈[200, -1]
Specific t=80 used
PC #1
3 Random
γ = -240, γ = -120, γ = -40, γ = 40, γ = 120, γ = 240
t'∈[115, 95]

PC #1
Ours
γ = -60, γ = -40, γ = -20, γ = 20, γ = 40, γ = 60
Isolate Woman/Man t'∈[115, 95]

PC #1
5 Random A recording of an old timey rock song from the sixties.
γ = -12, γ = -8, γ = -2, γ = 2, γ = 8, γ = 12
t'∈[200, -1]
Specific t=65 used
PCs 1+2+3
Ours A recording of an old timey rock song from the sixties.
γ = -2, γ = -1, γ = -0.5, γ = 0.5, γ = 1, γ = 2
Guitar/Singer emphasis t'∈[200, -1]
Specific t=65 used
PCs 1+2+3