Start with the listening context
Name the format first: product demo, support message, podcast intro, character line, learning narration, or short ad.
Audio examples
Use these examples to shape pacing, emotion, pronunciation, and voice design before spending credits on a final take.
A strong request separates the words to speak from the performance direction. Tell the model who is speaking, who is listening, the emotional temperature, and any words that need careful pronunciation.
Choose a primary tone such as warm, calm, urgent, playful, documentary, or reassuring instead of stacking many moods.
Use short sentences, commas, line breaks, and bracketed cues like [pause] or [softly] where a human narrator would naturally breathe or shift tone.
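The split between spoken words and performance direction can also be kept explicit when a request is assembled in code. The sketch below is a minimal illustration, assuming a hypothetical helper named build_request. It is not part of any product API and only mirrors the "Style direction ... Script ..." layout used in the examples on this page.

```python
def build_request(style_direction: str, script: str) -> str:
    """Join performance direction and spoken words into one labeled request."""
    # Hypothetical helper: collapse stray whitespace so each part is one clean line.
    style = " ".join(style_direction.split()).rstrip(".")
    spoken = " ".join(script.split())
    if not style or not spoken:
        raise ValueError("both a style direction and a script are required")
    return f"Style direction: {style}. Script: {spoken}"


request = build_request(
    "calm product narrator, confident but not salesy, medium pace",
    "Welcome to your weekly workspace summary. [pause] "
    "Three projects moved forward.",
)
print(request)
```

Keeping the two parts as separate arguments makes it harder to leak direction words into the text that gets spoken.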
Too vague
Read this product update in a nice voice.
Better request
Style direction: calm product narrator, confident but not salesy, medium pace. Script: Welcome to your weekly workspace summary. [pause] Three projects moved forward, two invoices are ready for review, and one deadline needs attention today.
The improved version defines role, tone, pace, and where the listener should feel a pause.
Too emotional
Say sorry with a sad voice.
Better request
Style direction: sincere support specialist, steady pace, warm and accountable. Script: We are sorry for the delay. Your request is already with our review team, and we will send the next update before Friday afternoon.
The voice is empathetic without sounding theatrical, and the script includes concrete next-step information.
Too many adjectives
Make it super exciting, happy, premium, funny, dramatic, and viral.
Better request
Style direction: bright creator voice with a subtle smile, fast but understandable. Script: Your launch video does not need another rewrite. Drop in the script, choose a voice, and export a clean take in minutes.
One performance idea and a compact script usually produce a cleaner take than conflicting mood instructions.
Too generic
A good English voice.
Better request
Voice design prompt: English female narrator in her 30s, warm studio tone, slightly lower pitch, precise consonants, suitable for product tutorials and onboarding videos.
Voice design works better when you describe age range, pitch, texture, articulation, and the use case the voice will be reused for.
Too flat
Read this story dramatically.
Better request
Style direction: cinematic audiobook narrator, low volume at first, slow pace, then warmer after the reveal. Script: [quietly] The hallway light flickered once. Then again. [pause] Mira held her breath. [tense] The door opened by itself. [pause] [relieved] It was only her brother, holding the birthday cake with both hands.
This gives the model a timeline of performance changes. The cues describe when the emotion changes instead of asking for one vague dramatic mood.
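To review that timeline before generating audio, the bracketed cues can be listed in order together with the text each one introduces. This is a rough sketch that assumes cues are always written in square brackets, as in the script above. The function name cue_timeline is hypothetical.

```python
import re

# Matches one bracketed cue such as [pause] or [quietly].
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def cue_timeline(script: str) -> list[tuple[str, str]]:
    """Return (cue, text that follows it) pairs in script order."""
    # With one capturing group, split() alternates: text, cue, text, cue, ...
    parts = CUE_PATTERN.split(script)
    return [
        (parts[i].strip(), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]


story = (
    "[quietly] The hallway light flickered once. Then again. [pause] "
    "Mira held her breath. [tense] The door opened by itself. [pause] "
    "[relieved] It was only her brother, holding the birthday cake with both hands."
)
timeline = cue_timeline(story)
for cue, text in timeline:
    print(f"{cue:>10} | {text}")
```

A cue followed by an empty text column, such as the second [pause] here, marks a spot where two cues sit back to back.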
Unclear speakers
Make this conversation sound real: Are we late? No, we still have time.
Better request
Style direction: light radio drama, natural reactions, keep both characters distinct with pacing rather than extreme voices. Script: Ava [worried, quick]: Are we late? Noah [calm, slight smile]: No. We still have time. Ava [exhale]: Good. I thought the doors closed at eight. Noah [reassuring]: They do. It is only seven forty.
Speaker names, emotional tags, and line breaks make the exchange easier to follow while avoiding overacted character voices.
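When dialogue comes from a spreadsheet or a script tool, the same speaker, cue, and line layout can be generated rather than typed by hand. The following is a minimal sketch. The DialogueLine class and format_dialogue function are hypothetical names, and the one-line-per-turn layout is an assumption based on the example above.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    speaker: str  # character name, e.g. "Ava"
    cue: str      # short emotional tag, e.g. "worried, quick"
    text: str     # the words to speak


def format_dialogue(lines: list[DialogueLine]) -> str:
    """Render each turn as 'Speaker [cue]: text' on its own line."""
    return "\n".join(
        f"{line.speaker} [{line.cue}]: {line.text}" for line in lines
    )


dialogue = format_dialogue([
    DialogueLine("Ava", "worried, quick", "Are we late?"),
    DialogueLine("Noah", "calm, slight smile", "No. We still have time."),
    DialogueLine("Ava", "exhale", "Good. I thought the doors closed at eight."),
    DialogueLine("Noah", "reassuring", "They do. It is only seven forty."),
])
print(dialogue)
```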
No timing structure
Explain how to upload a voice sample.
Better request
Style direction: patient tutorial narrator, medium-slow pace, leave room between steps for screen actions. Script: First, open the Voice Studio. [pause 1s] Choose Create voice. [pause 1s] Upload a clean MP3 or WAV sample. [pause 1s] Read the authorization statement carefully, then confirm only if you have permission to use the voice.
Explicit step boundaries and pauses help the audio fit a product demo or onboarding video without rushed narration.
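For tutorials with many steps, the pause cue can be inserted between steps automatically so that no boundary is missed. This sketch assumes the "[pause 1s]" cue format shown in the example above. Whether a given model honors timed pauses is not confirmed here, and join_steps is a hypothetical helper.

```python
def join_steps(steps: list[str], pause_seconds: int = 1) -> str:
    """Join tutorial steps with a timed pause cue between each pair."""
    if pause_seconds <= 0:
        raise ValueError("pause_seconds must be positive")
    # Assumed cue syntax, copied from the tutorial example: [pause 1s]
    cue = f" [pause {pause_seconds}s] "
    return cue.join(step.strip() for step in steps if step.strip())


tutorial = join_steps([
    "First, open the Voice Studio.",
    "Choose Create voice.",
    "Upload a clean MP3 or WAV sample.",
])
print(tutorial)
```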
Ambiguous text
Say: CVX ships API v2.5 on 05/06 with 1200 new voices.
Better request
Style direction: clear launch announcer, precise pronunciation, no hype. Pronunciation notes: CVX is read as C V X. API is read as A P I. v2.5 is read as version two point five. 05/06 is read as May sixth. 1200 is read as twelve hundred. Script: C V X ships A P I version two point five on May sixth, with twelve hundred new voices.
For names, versions, and dates, writing the spoken form directly is often more reliable than leaving the model to infer pronunciation.
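If the same names and versions recur across many scripts, the spoken forms can be kept in one table and applied before the text is sent. The sketch below is one possible approach, not a product feature. SPOKEN_FORMS and apply_spoken_forms are hypothetical names, and the entries are taken from the launch example above.

```python
import re

# Written form -> spoken form, taken from the launch announcement example.
SPOKEN_FORMS = {
    "CVX": "C V X",
    "API": "A P I",
    "v2.5": "version two point five",
    "05/06": "May sixth",
    "1200": "twelve hundred",
}


def apply_spoken_forms(text: str, spoken_forms: dict[str, str]) -> str:
    """Replace each written form with its spoken form, whole tokens only."""
    if not spoken_forms:
        return text
    # Longest keys first so a short key never wins over a longer one.
    keys = sorted(spoken_forms, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in keys)
    # Lookarounds keep "API" from matching inside a longer word or number.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return pattern.sub(lambda match: spoken_forms[match.group(0)], text)


spoken = apply_spoken_forms(
    "CVX ships API v2.5 on 05/06 with 1200 new voices.", SPOKEN_FORMS
)
print(spoken)
```

Dates stay ambiguous without a table entry: 05/06 could also mean the fifth of June, so the mapping should be written by someone who knows the intended reading.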
Missing breath and space
Read this meditation calmly.
Better request
Style direction: gentle meditation guide, soft volume, unhurried, warm but not sleepy. Script: Settle your shoulders. [long pause] Notice the weight of your hands. [softly] There is nothing to solve right now. [long pause] Breathe in slowly. [pause] Breathe out, and let the room become quiet around you.
Meditation audio depends on silence as much as speech. Longer pause cues and fewer words create a better rhythm.
Language switch is abrupt
Read this in Chinese and English: 欢迎使用 Custom Voices. Create your first voice now.
Better request
Style direction: bilingual product host, smooth code-switching, Mandarin first, English brand words pronounced clearly. Script: 欢迎使用 Custom Voices。[pause] 你可以先上传授权样本，创建自己的 voice profile，然后用文本生成自然的英文或中文音频。
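The request tells the model how to handle mixed-language terms and keeps the bilingual phrasing natural instead of sounding pasted together. In English, the script reads: "Welcome to Custom Voices. You can start by uploading an authorized sample, create your own voice profile, and then generate natural English or Chinese audio from text."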
Poor sample guidance
Upload any clip of the speaker.
Better request
Sample guidance: choose 30 to 90 seconds of clean solo speech, stable microphone distance, no music, no overlapping speakers, and at least a few complete sentences with the speaker's normal tone. Voice use note: after cloning, use style direction for performance changes instead of trying to fix a noisy sample with stronger prompts.
The cloned voice quality starts with the sample. A clean, representative clip gives later style prompts much more room to work.
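The sample guidance can be turned into a quick checklist that runs before upload. This is a sketch under stated assumptions: the metadata is supplied by hand or by other tools, the names SampleInfo and sample_problems are hypothetical, and the threshold of three sentences is an assumed reading of "at least a few complete sentences".

```python
from dataclasses import dataclass


@dataclass
class SampleInfo:
    duration_seconds: float
    has_music: bool
    speaker_count: int
    complete_sentences: int


def sample_problems(sample: SampleInfo) -> list[str]:
    """Return a list of guidance violations; an empty list means none found."""
    problems = []
    if not 30 <= sample.duration_seconds <= 90:
        problems.append("duration should be between 30 and 90 seconds")
    if sample.has_music:
        problems.append("remove background music")
    if sample.speaker_count != 1:
        problems.append("use clean solo speech with one speaker")
    # Assumed threshold: "a few" is interpreted here as three or more.
    if sample.complete_sentences < 3:
        problems.append("include at least a few complete sentences")
    return problems


good = SampleInfo(duration_seconds=45.0, has_music=False,
                  speaker_count=1, complete_sentences=4)
noisy = SampleInfo(duration_seconds=12.0, has_music=True,
                   speaker_count=2, complete_sentences=1)
print(sample_problems(good))
print(sample_problems(noisy))
```

Checks such as stable microphone distance or normal speaking tone still need a human listener.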
Too harsh
Read this warning in a serious voice.
Better request
Style direction: professional compliance narrator, calm authority, neutral pace, no alarm. Script: This voice can only be used with permission from the voice owner. [pause] Do not use generated audio to impersonate, mislead, or imply endorsement without consent.
For policy or safety copy, a controlled neutral tone usually builds more trust than exaggerated severity.