Notes on Google's New Dialog Model

· Zach Ocean

Google released an API for their 2-way dialog model, the same technology that powers the podcast generation feature in NotebookLM. API docs here.

It supports only 2 speakers. The dialog is represented as a series of speaking turns, each of which is assigned a speaker. There are 2 male and 2 female preset speakers to choose from; no custom voices or voice cloning as of today.

The model outputs are extremely natural, which is what makes the NotebookLM podcast generation so uncanny. I have not heard any other commercial product that comes close. Play has a dialog generation model with custom voice options, but the voices sound less natural and more metallic, and have more audio artifacts.

Here’s a sample from Google’s API docs:

from google.cloud import texttospeech_v1beta1


# Instantiates a client
client = texttospeech_v1beta1.TextToSpeechClient()

multi_speaker_markup = texttospeech_v1beta1.MultiSpeakerMarkup()

turn1 = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
turn1.text = "I've heard that the Google Cloud multi-speaker audio generation sounds amazing!"
turn1.speaker = "R"
multi_speaker_markup.turns.append(turn1)

turn2 = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
turn2.text = "Oh? What's so good about it?"
turn2.speaker = "T"
multi_speaker_markup.turns.append(turn2)

turn3 = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
turn3.text = "Well.."
turn3.speaker = "R"
multi_speaker_markup.turns.append(turn3)

turn4 = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
turn4.text = "well what?"
turn4.speaker = "T"
multi_speaker_markup.turns.append(turn4)

turn5 = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
turn5.text = "Well, you should find it out by yourself!"
turn5.speaker = "R"
multi_speaker_markup.turns.append(turn5)

turn6 = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
turn6.text = "Alright alright, let's try it out!"
turn6.speaker = "T"
multi_speaker_markup.turns.append(turn6)

# Set the text input to be synthesized
synthesis_input = texttospeech_v1beta1.SynthesisInput(multi_speaker_markup=multi_speaker_markup)
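
This snippet only builds the SynthesisInput; to get audio you still pass it to synthesize_speech with the multi-speaker voice. The examples below call a small make_dialog helper instead. That helper isn't from Google's docs, but a minimal sketch might look like this (the voice name, the MP3 encoding, and the cycling of speaker labels across turns are my assumptions):

from itertools import cycle

from google.cloud import texttospeech_v1beta1


def make_dialog(texts, speakers):
    # Sketch only: build multi-speaker markup from a list of turn texts and a
    # list of speaker labels (cycled if there are fewer labels than turns),
    # then synthesize the whole dialog in one request.
    client = texttospeech_v1beta1.TextToSpeechClient()

    markup = texttospeech_v1beta1.MultiSpeakerMarkup()
    for text, speaker in zip(texts, cycle(speakers)):
        turn = texttospeech_v1beta1.MultiSpeakerMarkup.Turn()
        turn.text = text
        turn.speaker = speaker  # preset speaker label, e.g. "R" or "T"
        markup.turns.append(turn)

    synthesis_input = texttospeech_v1beta1.SynthesisInput(multi_speaker_markup=markup)
    voice = texttospeech_v1beta1.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Studio-MultiSpeaker",  # assumed voice name; check the current docs
    )
    audio_config = texttospeech_v1beta1.AudioConfig(
        audio_encoding=texttospeech_v1beta1.AudioEncoding.MP3
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    return response.audio_content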

Hallucinations

Note that in the above example, “Well..” and “well what?” are meant to be uttered by two different speakers. But in my generations, they were consistently uttered by the same speaker. Tweaking the transcript by fixing the ellipsis usage (note that the example from Google’s docs has 2 periods instead of 3 in the ellipsis) or adding filler words fixed the issue. Proper punctuation (or more robust training data) is critical!

Ellipses fix

“Well..” -> “Well...”

Filler word fix

“well what” -> “Um, well what?”

Natural feedback

When one speaker has a long utterance, the model sometimes naturally inserts filler feedback words (“um”, “ah”, “hmm”). (Excuse the text copy/pasted from a YouTube description.)

Note the natural “mmhmm” - not contained in the transcript - at ~54 seconds.

audio_content = make_dialog([
    "I've heard that the Google Cloud multi-speaker audio generation sounds amazing!",
    "Oh? What's so good about it?", 
    "Well..",
    "Um, well what?",
    "Well, you should find it out by yourself!",
    "Alright alright, let's try it out!",
    """
On Tuesday night, president-elect Donald Trump announced that the richest man in the world, Elon Musk, along with entrepreneur Vivek Ramaswamy will head a new initiative in the Trump administration: the Department of Government Efficiency, or "DOGE."

Aside from the very strange fact that internet meme culture has now landed in the White House—Dogecoin is a memecoin—more importantly, what the announcement solidifies is the triumph of the counter-elite. A bunch of oddball outsiders ran against an insular band of out-of-touch elites supported by every celebrity in Hollywood—and they won. And they are about to reshape not just the government but also the culture in ways we can't imagine.

And there was one person I wanted to discuss it with. He is the vanguard of those antiestablishment counter-elites: Peter Thiel."""
], speakers=["R", "T"])
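
make_dialog here returns raw audio bytes (MP3, if the encoding matches the sketch above), so saving the result is just a binary write:

# Write the synthesized dialog to disk; the filename is arbitrary.
with open("dialog.mp3", "wb") as f:
    f.write(audio_content)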

Punctuation and contextual changes

Mostly the voices stay in character. But with the right punctuation and contextual clues (e.g., a heckler at a comedy show) they can exclaim and yell.

audio_content = make_dialog([
    "You know what's weird about living in 2024? Our phones are basically running our lives now. I was at a restaurant the other day, and my phone died. You would've thought I lost a family member. I'm sitting there, staring at this black screen like it's a tiny funeral.",
    "Booooooo!!!!!! You Suck!!! Boooo!!!!",
    "And the worst part? I couldn't even pay for my meal because everything's on my phone now! I had to do this walk of shame to the ATM like some kind of caveman. The teenagers at the next table were looking at me like I was using a sundial to tell time.",
    "Boooo!!!! Get a real job!!!!!",
], speakers=["R", "T"])

I tried a bit to get interruptions/overlapping speech, but without success.

audio_content = make_dialog([
    "You know what's weird about living in 2024? Our phones are basically running our lives now. I was at a restaurant the other day,",
    "Boooo!!!!!! You Suck!!! Boooo!!!!",
    "and my phone died. You would've thought I lost a family member. I'm sitting there, staring at this black screen like it's a tiny funeral.",
    "Booooo!!!!!! You Suck!!! Boooo!!!!",
    "And the worst part? I couldn't even pay for my meal because everything's on my phone now! I had to do this",
    "Boooo!!!! Get a real job!!!!!",
    "walk of shame to the ATM like some kind of caveman. The teenagers at the next table were looking at me like I was using a sundial to tell time.",
    "Boooo!!!! Get a real job!!!!!",
    "Stupid-ass heckler! Go fuck yourself!"
], speakers=["R", "T", "R", "T", "R", "T", "R", "T", "R"])