DramaBox: direct speech like a scene, with pauses, laughs, and whispers

01 What it does

DramaBox is an open text-to-speech model built around directable speech. Instead of only feeding it words to read, you describe the speaker and the scene, put the spoken dialogue inside quotes, and leave performance notes outside the quotes.

That makes the workflow feel closer to writing a short script than setting a basic emotion control. For example, you can ask for a calm narrator who starts warmly, pauses, laughs under their breath, and then drops into a tense whisper. The model tries to perform those instructions as delivery rather than read them aloud as text.
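To make the quoting rule concrete, here is a minimal Python sketch that assembles a scene prompt from spoken lines and performance notes. It is not part of DramaBox: the `Line`, `Direction`, and `build_scene_prompt` names are hypothetical, and the exact wording the model responds to best is an assumption you would need to check in the Space.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Line:
    """Words to be spoken; rendered inside double quotes."""
    text: str


@dataclass
class Direction:
    """A performance note; rendered outside the quotes."""
    text: str


def build_scene_prompt(speaker: str, beats: List[Union[Line, Direction]]) -> str:
    """Join a speaker description and a list of beats into one prompt string.

    Hypothetical helper: it only enforces the convention described above
    (dialogue inside double quotes, directions outside).
    """
    parts = [speaker.strip()]
    for beat in beats:
        if isinstance(beat, Line):
            if '"' in beat.text:
                raise ValueError("spoken text must not contain double quotes")
            parts.append(f'"{beat.text.strip()}"')
        else:
            parts.append(beat.text.strip())
    return " ".join(parts)


prompt = build_scene_prompt(
    "A calm narrator speaks warmly.",
    [
        Line("Welcome back. I saved you a seat."),
        Direction("She pauses, then laughs under her breath."),
        Line("You really thought I would forget?"),
        Direction("Her voice drops to a tense whisper."),
        Line("Do not turn around."),
    ],
)
print(prompt)
```

Keeping spoken lines and directions as separate types makes it harder to leave a stage direction inside the quotes by accident, where it could end up being read aloud.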

The release is also useful because it is not only a polished hosted demo. The model weights are public, the Space is public, and the GitHub repo includes Python, CLI, and Gradio paths for local testing. That matters for readers who want to move from a browser experiment to their own speech workflow.

02 How to try it

Start with the Hugging Face Space and write one short scene instead of a plain sentence. Put only the words you want spoken inside double quotes, then add one or two stage directions outside the quotes: a pause, a laugh, a sigh, a change in tone, or a whisper.

On the first test, listen for three things: whether the words stay intelligible, whether the stage directions affect the delivery, and whether the performance still sounds natural instead of overacted. If you have a clean 10-second voice reference, try the same script with and without it to see whether the cloned timbre helps or distracts.
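One way to make the first check less subjective is to compare the words you meant to be spoken with a transcript of what you actually heard. The sketch below is a generic Python helper, not part of DramaBox. It assumes straight double quotes around the dialogue and a transcript that you type by hand or get from any speech-to-text tool; the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List

WORD = re.compile(r"[a-z0-9']+")


def spoken_words(script: str) -> List[str]:
    """Return the lowercase words found inside straight double quotes."""
    quoted = re.findall(r'"([^"]*)"', script)
    return WORD.findall(" ".join(quoted).lower())


def word_match_ratio(script: str, transcript: str) -> float:
    """Similarity between the quoted words and the transcript, from 0.0 to 1.0.

    Uses difflib's SequenceMatcher on word lists, so it is a rough
    stand-in for a proper word error rate, not a replacement for one.
    """
    expected = spoken_words(script)
    heard = WORD.findall(transcript.lower())
    return SequenceMatcher(None, expected, heard, autojunk=False).ratio()


script = 'A calm narrator speaks warmly. "Welcome back." She pauses. "Do not turn around."'

# Transcript of a read that spoke only the quoted words.
clean = word_match_ratio(script, "Welcome back. Do not turn around.")

# Transcript of a read where a stage direction was spoken aloud.
leaked = word_match_ratio(script, "Welcome back. She pauses. Do not turn around.")

print(f"clean read: {clean:.2f}, direction read aloud: {leaked:.2f}")
```

A ratio of 1.0 means the transcript matched the quoted words exactly. A lower ratio suggests dropped words or a direction that leaked into the speech, which is worth checking by ear before drawing conclusions.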

If the demo is promising, the GitHub repo gives you the local path. Treat that as a real GPU workflow rather than a casual laptop install. The project notes point to high-memory GPU use for the polished warm-server path, so the Space is the fastest first test for most readers.

03 Caveat

The control is the point, but it is also the failure mode to watch. Longer scenes, too many directions, or aggressive emotional shifts can make speech sound theatrical, unstable, or less text-faithful. The code and weights for local reproduction are public, but practical use likely needs a capable GPU.

04 What you can do with it

Draft character voice lines with specific delivery notes instead of flat narration.
Prototype podcast intros, trailer lines, or game dialogue with controlled pacing.
Test whether stage directions improve voice UX before using a closed TTS API.
Generate alternate reads of the same script by changing only the performance cues.
Evaluate voice-reference cloning on a scripted emotional arc, not just one sentence.
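For the alternate-reads idea, a small sketch can keep the quoted script fixed while swapping only the cues, so any difference you hear comes from the directions. This is plain string handling in Python with hypothetical names (`alternate_reads`, `quoted`); it does not call DramaBox, and the cue wording is only an example.

```python
import re
from typing import Dict, List


def alternate_reads(
    speaker: str, lines: List[str], cue_sets: Dict[str, List[str]]
) -> Dict[str, str]:
    """Build one prompt per cue set, with one cue placed before each line."""
    prompts = {}
    for name, cues in cue_sets.items():
        if len(cues) != len(lines):
            raise ValueError(f"cue set {name!r} needs one cue per line")
        beats = [f'{cue} "{line}"' for cue, line in zip(cues, lines)]
        prompts[name] = " ".join([speaker] + beats)
    return prompts


def quoted(prompt: str) -> List[str]:
    """Return the text inside straight double quotes, in order."""
    return re.findall(r'"([^"]*)"', prompt)


lines = ["I kept your letter.", "All of it."]
reads = alternate_reads(
    "A woman in her forties, alone in a quiet kitchen.",
    lines,
    {
        "warm": ["She smiles and speaks softly.", "After a short pause, almost laughing."],
        "tense": ["She speaks flatly, holding back anger.", "She drops to a whisper."],
    },
)

for name, prompt in reads.items():
    # The spoken words must be identical in every variant.
    assert quoted(prompt) == lines
    print(f"[{name}] {prompt}")
```

Because the spoken words are checked to be identical in every variant, you can paste each prompt into the demo and compare the results as a fair A/B test of the cues alone.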

Neural Expedition · Useful open-source AI, curated without hype.