---
name: audio-tts
description: Generate speech audio from text using Qwen3 TTS, or clone a voice from reference audio. Triggered when the user wants to convert text to speech, generate audio, read text aloud, or clone/mimic a voice. Supports multiple speakers, English and Chinese, and emotion/style control.
---
# Qwen3 TTS — Text-to-Speech and Voice Cloning
Generate speech audio from text, or clone a voice from a reference audio file.
## Binaries

- `{baseDir}/scripts/tts` — Text-to-speech generation with named speakers.
- `{baseDir}/scripts/voice_clone` — Voice cloning from a reference audio file.
## Models

- `{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice` — Named speaker TTS (0.6B parameters).
- `{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base` — Voice cloning from reference audio (0.6B parameters).
## Reference Audio

Pre-packaged reference audio files for voice cloning are available at `{baseDir}/scripts/reference_audio/`. Each speaker has two files:

- `{baseDir}/scripts/reference_audio/<speaker_name>.wav` — Reference audio (mono 24kHz 16-bit WAV)
- `{baseDir}/scripts/reference_audio/<speaker_name>.txt` — Transcript of the reference audio
Available reference speakers: `trump`, `elon_musk`.
## When to Use Which Tool

- `tts` — When the user wants to generate speech from text using a named speaker (Vivian, Ryan, etc.). Supports English and Chinese.
- `voice_clone` — When the user wants to clone a specific voice from a reference audio file and generate new speech in that voice. If the user asks to clone a voice by speaker name (e.g., "speak like Trump", "use Elon Musk's voice"), check `{baseDir}/scripts/reference_audio/` for a matching `<speaker_name>.wav` and `<speaker_name>.txt` pair, and use ICL mode with both files.
## Linux Environment Setup
On Linux, the binaries require libtorch shared libraries. Set the library path before running any command:
```bash
export LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH
```
On macOS, no environment setup is needed (the binaries use the MLX backend). All commands below show the macOS form. On Linux, prefix each command with `LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH`.
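The platform branching above can be wrapped in a small shell function so every invocation gets the right environment. This is a minimal sketch, not shipped code: `run_tts_binary` is an illustrative name, and `BASE_DIR` stands in for `{baseDir}`.

```shell
# Sketch: run a TTS binary with the correct per-platform environment.
# BASE_DIR stands in for {baseDir}; run_tts_binary is an illustrative name.
run_tts_binary() {
  if [ "$(uname -s)" = "Linux" ]; then
    # Linux: point the dynamic loader at the bundled libtorch libraries.
    LD_LIBRARY_PATH="$BASE_DIR/scripts/libtorch/lib:$LD_LIBRARY_PATH" "$@"
  else
    # macOS: MLX backend, no extra environment needed.
    "$@"
  fi
}

# Usage:
# run_tts_binary "$BASE_DIR/scripts/tts" \
#   "$BASE_DIR/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice" \
#   "Hello" Vivian english
```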
## Text-to-Speech
Generate speech audio from text with a named speaker.
```bash
{baseDir}/scripts/tts \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  "<text>" \
  <speaker> \
  <language>
```
### Parameters
| Parameter | Required | Description |
|---|---|---|
| model_path | Yes | Path to the model directory |
| text | Yes | The text to synthesize as speech |
| speaker | Yes | Speaker name (see Available Speakers below) |
| language | Yes | `english` or `chinese` |
### Available Speakers
Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, Dylan.
### Output

Generates `output.wav` (24kHz mono WAV) in the current working directory.
### Example
```bash
{baseDir}/scripts/tts \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  "Hello! Welcome to the Qwen3 text-to-speech system." \
  Vivian \
  english
```
## Voice Cloning (ICL Mode)
Clone a voice from a reference audio file using ICL (In-Context Learning). This encodes the reference audio into codec tokens and conditions generation on both the speaker embedding and the reference audio/text transcript, producing high-fidelity voice cloning.
Both a reference audio file and its transcript text are required.
```bash
{baseDir}/scripts/voice_clone \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
  <reference_audio.wav> \
  "<text>" \
  <language> \
  "<reference_text>"
```
### Parameters
| Parameter | Required | Description |
|---|---|---|
| model_path | Yes | Path to the Base model directory |
| reference_audio | Yes | Path to reference WAV file (mono 24kHz 16-bit) |
| text | Yes | The text to synthesize in the cloned voice |
| language | Yes | `english` or `chinese` |
| reference_text | Yes | Transcript of the reference audio |
### Reference Audio Requirements
The reference audio must be a mono 24kHz 16-bit WAV file. Convert from other formats with `ffmpeg`:

```bash
ffmpeg -i input.m4a -ac 1 -ar 24000 -sample_fmt s16 reference.wav
```
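Before running a clone, it can be worth confirming that the converted file really matches the required format. A sketch using `ffprobe` (bundled with ffmpeg):

```shell
# Sketch: check that reference.wav is mono, 24 kHz, 16-bit PCM.
ffprobe -v error -select_streams a:0 \
  -show_entries stream=channels,sample_rate,sample_fmt \
  -of default=noprint_wrappers=1 reference.wav
# A correctly converted file reports channels=1, sample_rate=24000, sample_fmt=s16.
```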
### Output

Generates `output_voice_clone.wav` (24kHz mono WAV) in the current working directory.
### Example
```bash
{baseDir}/scripts/voice_clone \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
  reference.wav \
  "This is a voice cloning test with in-context learning." \
  english \
  "The transcript of what was said in the reference audio."
```
## Workflow

### 1. Determine the Task

- If the user wants to generate speech from text with a named speaker (Vivian, Ryan, etc.), use `tts`.
- If the user wants to clone a voice from an audio file, use `voice_clone`.
- If the user asks to clone a voice by speaker name (e.g., "speak like Trump", "in Elon Musk's voice"), use `voice_clone` with the pre-packaged reference audio.
### 2. Prepare Input

- For `tts`: Identify the text, speaker name, and language from the user's request. Default to `Vivian` and `english` if not specified.
- For `voice_clone` with a named reference speaker:
  - Look up `{baseDir}/scripts/reference_audio/<speaker_name>.wav` and `{baseDir}/scripts/reference_audio/<speaker_name>.txt`.
  - Read the transcript from the `.txt` file.
  - Pass both the `.wav` file and the transcript text.
- For `voice_clone` with a user-provided audio file: Ensure the reference audio is a mono 24kHz 16-bit WAV. Convert if needed using `ffmpeg`. Ask the user for the transcript of the reference audio.
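The named-speaker lookup in step 2 can be sketched as a small helper. This is illustrative only: `resolve_reference`, `REF_WAV`, `REF_TEXT`, and `REF_DIR` (standing in for `{baseDir}/scripts/reference_audio`) are assumed names, not part of the shipped scripts.

```shell
# Sketch: resolve a speaker name to its reference .wav path and transcript.
# REF_DIR stands in for {baseDir}/scripts/reference_audio.
resolve_reference() {
  wav="$REF_DIR/$1.wav"
  txt="$REF_DIR/$1.txt"
  if [ -f "$wav" ] && [ -f "$txt" ]; then
    REF_WAV="$wav"             # path to pass as reference_audio
    REF_TEXT=$(cat "$txt")     # transcript to pass as reference_text
  else
    echo "no reference audio for speaker: $1" >&2
    return 1
  fi
}

# Usage:
# resolve_reference trump && echo "clone with $REF_WAV / $REF_TEXT"
```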
### 3. Run the Command

Run the appropriate binary using the full paths to the binaries and model directories. On Linux, prefix with `LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH`.
### Example: Clone by Speaker Name
If the user says "Say hello world in Trump's voice":
```bash
# Read the transcript
REF_TEXT=$(cat {baseDir}/scripts/reference_audio/trump.txt)

# Run voice clone with ICL mode
{baseDir}/scripts/voice_clone \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
  {baseDir}/scripts/reference_audio/trump.wav \
  "Hello world" \
  english \
  "$REF_TEXT"
```
### 4. Return the Output

The output WAV file will be in the current working directory:

- `tts` produces `output.wav`
- `voice_clone` produces `output_voice_clone.wav`
Inform the user of the output file path.
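When reporting the result, it is safer to confirm the file actually exists first. A minimal sketch, where `report_output` is an illustrative helper name:

```shell
# Sketch: verify the expected output file exists, then report its full path.
report_output() {
  if [ -f "$1" ]; then
    echo "Generated: $(pwd)/$1"
  else
    echo "expected output file $1 was not produced" >&2
    return 1
  fi
}

# Usage:
# report_output output.wav               # after tts
# report_output output_voice_clone.wav   # after voice_clone
```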