ScenA

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Michael Finkelson1,2, Daniel Segal1, Eitan Richardson1, Shahar Armon1, Nani Goldring1, Poriya Panet1, Nir Zabari1, Benjamin Brazowski1, Or Patashnik2, Yoav HaCohen1

1Lightricks    2Tel Aviv University

ScenA teaser: a high-level natural-language prompt and a set of reference voices are transformed into a multi-speaker conversational scene with overlapping speech, paralinguistic events, and ambient audio.

Our ScenA framework transforms free-form natural language prompts and a set of reference voices into rich, multi-speaker conversational scenes. The prompt alone determines which reference speaks where, with no per-turn tags, transcripts, or identity encoders. This natural language interface enables complex human interactions, including overlapping speech, spontaneous paralinguistic events, and scene-level ambient sound.

Abstract. Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Building on such a foundation model lets us inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
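To make the high-noise-biased timestep idea concrete, here is a minimal sketch. The abstract does not say which distribution ScenA uses, so the Beta(4, 1) choice, the function names, and the convention that t = 1 means pure noise are all assumptions made for illustration only.

```python
import random


def sample_timesteps(n, alpha=4.0, beta=1.0, seed=0):
    """Draw n timesteps in [0, 1].

    Assumed convention: t = 1 is pure noise, t = 0 is clean audio.
    Beta(alpha, beta) with alpha > beta puts most mass near t = 1;
    this particular distribution is an illustrative assumption.
    """
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]


def interpolate(x0, noise, t):
    """Linear flow-matching interpolant x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


# Compare the biased sampler with a uniform one (Beta(1, 1) is uniform).
biased = sample_timesteps(10_000)
uniform = sample_timesteps(10_000, alpha=1.0, beta=1.0)

frac_high_biased = sum(t > 0.5 for t in biased) / len(biased)
frac_high_uniform = sum(t > 0.5 for t in uniform) / len(uniform)
mean_biased = sum(biased) / len(biased)

print(f"share of t > 0.5, biased:  {frac_high_biased:.3f}")
print(f"share of t > 0.5, uniform: {frac_high_uniform:.3f}")
```

At high t the noisy target retains little of the speaker's acoustic signature, so matching a reference by similarity becomes unreliable and the prompt is the remaining cue for who speaks when. Under this sketch, most training samples fall in that regime.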

Overlapping & Synchronized Speech

# Speaker 1 Ref Speaker 2 Ref Prompt ScenA (Ours)
1 Mellow piano plays gently in the background. The speaker from reference 1 says: "I'm Michael." The speaker from reference 2 says: "I'm Grant." Both speakers then exclaim together in absolute synchrony, voices fused as one: "And we're hijacking each other's channels today!"
2 Two female speakers count in unison, voices fully overlapping: "Three... two... one..." Both yell together: "Go!" The speaker from reference 1 says: "Did it actually work?" The speaker from reference 2 says: "I have no idea."

Non-Speech Vocalizations

# Speaker 1 Ref Speaker 2 Ref Prompt ScenA (Ours)
1 The speaker from reference 1 says: "And so the way that quantum tunneling actually works—" He breaks into a violent coughing fit. The speaker from reference 2 asks with concern: "Are you alright?" The speaker from reference 1 clears his throat and rasps: "Yeah, sorry, allergies."
2 Both speakers desperately try to suppress laughter during something serious. The speaker from reference 1 whispers, holding back giggles: "We cannot laugh right now, I'm serious." The speaker from reference 2 snickers under his breath: "I'm trying, I'm trying!" Both burst out laughing simultaneously.
3 A male speaker (reference 1) sobs and breaks his words: "I just— *sob* —I don't know what to do— *sniff* —anymore..." The speaker from reference 2 says softly: "Hey. Hey. Breathe. I'm right here." The male speaker (reference 1) sniffles deeply: "Okay... okay..."
4 The speaker from reference 1 says: "And then the dog goes—" He imitates a goofy bark: "Woof, woof, woof!" The speaker from reference 2 says flatly: "And?" The speaker from reference 1 keeps going: "And the cat is just like, 'meow.'"

Acoustic Scenes & Sound Effects

# Speaker 1 Ref Speaker 2 Ref Prompt ScenA (Ours)
1 A strong, continuous howling wind blows loudly throughout the entire scene without ever stopping or quieting, partially burying the voices. The speaker from reference 1 shouts loudly over the constant wind: "I told you we should've turned back an hour ago!" The speaker from reference 2 yells back over the unrelenting wind: "We're almost there, just hold on!" The wind keeps howling at the same level the whole time.
2 A farm at sunrise: a rooster crows. Chickens cluck softly throughout. The speaker from reference 1 says with a yawn: "Way too early for this." The speaker from reference 2 chuckles: "Welcome to country life." The rooster crows again.
3 A train conductor's announcement plays through speaker compression: "Now arriving at Central Station. Please mind the gap as you exit the train." The speaker from reference 1 says: "That's us." The speaker from reference 2 replies: "Finally, my legs are dead."
4 A loud, steady old grandfather clock ticks distinctly throughout, each tick clearly audible against the quiet room. The speaker from reference 1 says gently: "And how did that make you feel?" The speaker from reference 2 sighs deeply and replies: "Honestly? Pretty awful. I haven't been able to stop thinking about it." The clock keeps ticking distinctly.
5 A roaring stadium crowd cheers continuously. A whistle blows. The speaker from reference 1 shouts excitedly: "Did you see that?! What a goal!" The speaker from reference 2 yells back: "I told you he'd score, I told you!" The crowd's cheering swells louder. A buzzer sounds. The speaker from reference 1: "We're winning! We're actually winning!"
6 Inside a packed nightclub: thumping bass-heavy electronic music dominates the entire scene. The speaker from reference 1 yells right next to the other's ear: "I'm gonna grab a drink!" The speaker from reference 2 yells back: "Get me one too! Vodka soda!" The bass drops harder.
7 An airport terminal: rolling suitcase wheels, distant chatter. A PA announcement plays: "Now boarding flight 1142 to Chicago at gate B7." The speaker from reference 1 says: "That's our gate, let's move." The speaker from reference 2 replies: "Wait, I need to use the bathroom first."
8 A voicemail leaving over a brief beep. A short tone beeps. The speaker from reference 1 says clearly: "Hey, it's me, just calling to check in. Give me a buzz when you get this." A short pause. The speaker from reference 2's voice plays back through a phone speaker, slightly tinny: "Got it, calling you back now."
9 The speaker from reference 1 announces dramatically: "Ladies and gentlemen, prepare to be amazed... behold!" A loud, sharp WHOOSH followed by a hiss as a thick puff of stage smoke bursts out. The speaker from reference 2 gasps and says: "Wait, where did the rabbit come from?!" The speaker from reference 1 chuckles smugly: "A magician never tells."
10 A sudden interrupting moment with sound effects. The speaker from reference 1 says: "Today we're gonna talk about why bees actually—" A loud bee buzzes right past the mic. The speaker from reference 1 yelps: "Ow! Did that just happen on camera?!" The speaker from reference 2 laughs and says: "I think the bees heard you."
11 The speaker from reference 1 says: "Anyway, that's why prime numbers are so—" A loud doorbell rings sharply, cutting him off mid-sentence. The speaker from reference 1 sighs: "One second." The speaker from reference 2 chuckles and says: "We'll just leave that in."
12 The speaker from reference 1 fires: "Capital of France?" The speaker from reference 2 instantly: "Paris!" The speaker from reference 1: "Square root of eighty-one?" The speaker from reference 2: "Nine!" The speaker from reference 1: "Year of moon landing?" The speaker from reference 2: "Sixty-nine!" A loud, harsh game-show buzzer rings out — a sustained electronic BZZZZT.

Described Voices (Single Reference)

# Speaker 1 Ref Speaker 2 Ref Prompt ScenA (Ours)
1 — described in prompt — A speaker (from reference 1) interacts with a deep, booming, intimidating man's voice. The speaker says nervously: "Are you... sure I'm allowed in here?" The deep voice replies in a low, thunderous baritone: "You are now. Don't make me regret it." The speaker swallows audibly and says: "Got it."
2 — described in prompt — A man's voice (from reference 1) and a small cheerful child (around five years old, high-pitched, slightly lispy) are talking. The man says: "What did you do at school today, buddy?" The child replies in a high, cheerful voice: "We painted dinosaurs! Mine was purple!" The man laughs warmly: "Purple dinosaurs are the best kind."
3 — described in prompt — A speaker (from reference 1) and a raspy, gruff old sailor with a hoarse, weathered voice talk. The speaker asks: "Have you ever seen anything like it?" The sailor replies in a deep raspy growl: "Once. Long time ago. Lost three good men that night." A pause. The speaker says quietly: "Tell me about them."
4 — described in prompt — A speaker (from reference 1) talks with a posh, refined British woman with a crisp Received Pronunciation accent. The speaker says: "How was the play?" The British woman replies in a precise, plummy English accent: "Absolutely marvelous, darling. The lead was simply divine, you really must go." The speaker says: "I'll get tickets tomorrow."

Voice Acting & Impressions

# Speaker 1 Ref Speaker 2 Ref Prompt ScenA (Ours)
1 A fake drill sergeant impression. The speaker from reference 1 yells gruffly like a drill sergeant: "Drop and give me twenty, soldier!" The speaker from reference 2 laughs and replies: "I'm literally just trying to eat my cereal." The speaker from reference 1 keeps yelling: "I said TWENTY!"
2 One speaker briefly switches into a fake Italian accent for emphasis. The speaker from reference 1 says: "I made the spaghetti from scratch tonight." The speaker from reference 2 replies in an exaggerated theatrical Italian accent: "Mamma mia, that-a smells-a delizioso!" The speaker from reference 1 laughs.

Conversational & Emotional Range

# Speaker 1 Ref Speaker 2 Ref Prompt ScenA (Ours)
1 A polite interview rhythm. The speaker from reference 1 asks: "So tell me, what got you started in this field?" The speaker from reference 2 replies thoughtfully: "Honestly, it was a complete accident. A friend dragged me to a lecture and I never looked back." The speaker from reference 1 says: "I love that."
2 An emotional shift from calm to angry over the course of the scene. The speaker from reference 1 says calmly: "I'm really trying to stay patient with you here." The speaker from reference 2 replies dismissively: "Yeah, well, try harder." The speaker from reference 1's voice rises sharply: "What did you just say to me?!"
3 A relaxed chat between roommates. The speaker from reference 1 says: "Did you remember to take out the trash?" The speaker from reference 2 replies: "Yep, did it this morning." The speaker from reference 1 says: "You're a lifesaver."
4 Two coworkers commenting on a meeting. The speaker from reference 1 says: "That meeting could've been an email." The speaker from reference 2 replies with a sigh: "It always could've been an email." The speaker from reference 1 chuckles: "Why do we keep going?"

Leaderboard

Comparison with multi-speaker dialogue baselines on CoVoMix2-Dialogue-20s. Bold = best, underline = second-best within each column. cpWER ↓: concatenated min-permutation WER for two-speaker dialogue. cpSIM ↑: concatenated min-permutation SIM-O for two-speaker turns. ACC ↑: speaker-turn assignment accuracy. WER ↓: Word Error Rate. SIM-O ↑: speaker similarity to the prompt (embedding cosine, 0–1). UTMOS ↑: predicted naturalness MOS (1–5). SQUIM ↑: predicted speech quality (TorchAudio-SQUIM MOS).

Method            cpWER ↓   cpSIM ↑   ACC ↑   WER ↓   SIM-O ↑   UTMOS ↑   SQUIM ↑
MOSS-TTSD         0.232     0.547     0.855   0.109   0.443     3.76      4.28
VibeVoice-7B      0.206     0.527     0.821   0.044   0.451     3.58      4.28
VibeVoice-1.5B    0.212     0.503     0.830   0.050   0.423     3.56      4.27
ZipVoice-Dialog   0.176     0.538     0.847   0.032   0.446     3.57      4.34
Dia (Nari Labs)   0.303     0.339     0.757   0.133   0.312     2.69      4.09
ScenA (Ours)      0.145     0.567     0.866   0.020   0.451     3.44      4.32
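For readers unfamiliar with cpWER, the sketch below illustrates only the permutation step of the metric: each speaker's words are concatenated, word-level edit distance is computed under every mapping of hypothesis speakers to reference speakers, and the lowest total is kept. The function names are hypothetical, and the transcription, diarization, and text normalization used in the actual evaluation are not described on this page, so they are omitted here.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Concatenated min-permutation WER.

    Each argument is a list with one entry per speaker, and each entry is a
    list of utterance strings. Both sides are assumed to have the same
    number of speakers.
    """
    refs = [" ".join(utts).split() for utts in ref_by_speaker]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, hyps[k]) for r, k in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )
    return best_errors / total_ref_words


reference = [["that's us"], ["finally my legs are dead"]]
swapped_labels = [["finally my legs are dead"], ["that's us"]]
one_word_wrong = [["that's us"], ["finally my legs are tired"]]

print(cp_wer(reference, swapped_labels))  # speaker labels swapped, words correct
print(cp_wer(reference, one_word_wrong))  # one substitution out of seven words
```

Because the minimum is taken over speaker mappings, a system is not penalized for arbitrary speaker labels, only for putting words in the wrong speaker's stream or getting the words themselves wrong.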

BibTeX

@article{finkelson2026scena,
  title   = {Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors},
  author  = {Finkelson, Michael and Segal, Daniel and Richardson, Eitan and Armon, Shahar and Goldring, Nani and Panet, Poriya and Zabari, Nir and Brazowski, Benjamin and Patashnik, Or and HaCohen, Yoav},
  journal = {arXiv preprint arXiv:2606.19325},
  year    = {2026}
}