LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Paper: https://arxiv.org/abs/2602.15675
Nile-XTTS is a fine-tuned version of XTTS v2 optimized for Egyptian Arabic (اللهجة المصرية) text-to-speech synthesis with zero-shot voice cloning capabilities.
This model was fine-tuned on the NileTTS dataset, comprising 38 hours of Egyptian Arabic speech across medical, sales, and general conversation domains.
| Metric | XTTS v2 (baseline) | Nile-XTTS-v2 (ours) | Relative improvement |
|---|---|---|---|
| WER (lower is better) | 26.8% | 18.8% | 29.9% reduction |
| CER (lower is better) | 8.1% | 4.1% | 49.4% reduction |
| Speaker similarity (higher is better) | 0.713 | 0.755 | +5.9% |
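In TTS evaluation, WER and CER are usually computed by transcribing the synthesized audio with an ASR system and comparing against the input text, while speaker similarity is typically the cosine similarity between speaker embeddings of the reference and generated audio. Below is a minimal sketch of how such metrics can be computed, assuming the `jiwer` package and precomputed speaker embeddings; neither is confirmed as the paper's actual tooling.

```python
# Illustrative metric computation; not necessarily the paper's exact setup.
import jiwer
import torch
import torch.nn.functional as F

# WER/CER: edit-distance error rates between the text given to the TTS
# model and an ASR transcript of the audio it produced.
reference = "the sentence given to the TTS model"
hypothesis = "the ASR transcript of the synthesized audio"
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))

# Speaker similarity: cosine similarity between speaker embeddings of the
# reference recording and the synthesized audio (embeddings assumed given).
emb_ref = torch.randn(512)  # placeholder embedding of the reference clip
emb_gen = torch.randn(512)  # placeholder embedding of the generated clip
print("Speaker similarity:", F.cosine_similarity(emb_ref, emb_gen, dim=-1).item())
```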
Install the Coqui TTS package:

```bash
pip install TTS
```
Then load the model and synthesize speech, cloning the voice from a short reference clip:

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned model configuration and weights
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.cuda()
model.eval()

# Compute speaker conditioning latents from a reference recording;
# this is what enables zero-shot voice cloning
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# Synthesize speech ("Hello, how are you today?" in Egyptian Arabic)
out = model.inference(
    text="مرحبا، إزيك النهارده؟",
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Save the generated waveform (XTTS outputs 24 kHz audio)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
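For lower-latency use, the Coqui TTS XTTS implementation also provides `model.inference_stream`, which yields audio chunks as they are generated instead of one full waveform. A short sketch reusing the latents from above, with the same placeholder text:

```python
# Streaming synthesis: consume audio chunk by chunk as it is generated.
chunks = model.inference_stream(
    "مرحبا، إزيك النهارده؟",
    "ar",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for chunk in chunks:
    # Each chunk is a 1-D tensor of 24 kHz samples; play or buffer it here.
    wav_chunks.append(chunk)

wav = torch.cat(wav_chunks, dim=0).cpu()
torchaudio.save("output_stream.wav", wav.unsqueeze(0), 24000)
```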
If you use this model, please cite: [TO BE ADDED]
This model is released under the Apache 2.0 license, following the original XTTS v2 license.
Base model: coqui/XTTS-v2