LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Paper: https://arxiv.org/abs/2602.15675
Nile-XTTS is a fine-tuned version of XTTS v2 optimized for Egyptian Arabic (اللهجة المصرية) text-to-speech synthesis with zero-shot voice cloning capabilities.
This model was fine-tuned on the NileTTS dataset, comprising 38 hours of Egyptian Arabic speech across medical, sales, and general conversation domains.
| Metric | XTTS v2 (baseline) | Nile-XTTS-v2 (ours) | Relative improvement |
|---|---|---|---|
| WER (lower is better) | 26.8% | 18.8% | 29.9% reduction |
| CER (lower is better) | 8.1% | 4.1% | 49.4% reduction |
| Speaker similarity (higher is better) | 0.713 | 0.755 | +5.9% |
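In TTS evaluation, WER and CER are usually computed by transcribing the synthesized audio with an ASR system and comparing against the input text, while speaker similarity is typically the cosine similarity between speaker embeddings of the reference and generated audio. Below is a minimal sketch of how such metrics can be computed, assuming the `jiwer` package and precomputed speaker embeddings; neither is confirmed as the paper's actual tooling.

```python
# Illustrative metric computation; not necessarily the paper's exact setup.
import jiwer
import torch
import torch.nn.functional as F

# WER/CER: edit-distance error rates between the text given to the TTS
# model and an ASR transcript of the audio it produced.
reference = "the sentence given to the TTS model"
hypothesis = "the ASR transcript of the synthesized audio"
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))

# Speaker similarity: cosine similarity between speaker embeddings of the
# reference recording and the synthesized audio (embeddings assumed given).
emb_ref = torch.randn(512)  # placeholder embedding of the reference clip
emb_gen = torch.randn(512)  # placeholder embedding of the generated clip
print("Speaker similarity:", F.cosine_similarity(emb_ref, emb_gen, dim=-1).item())
```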
Install the Coqui TTS package:

```bash
pip install TTS
```
Then load the model and synthesize speech, cloning the voice from a short reference clip:

```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned model configuration and weights
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.cuda()
model.eval()

# Compute speaker conditioning latents from a reference recording;
# this is what enables zero-shot voice cloning
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False,
)

# Synthesize speech ("Hello, how are you today?" in Egyptian Arabic)
out = model.inference(
    text="مرحبا، إزيك النهارده؟",
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# Save the generated waveform (XTTS outputs 24 kHz audio)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```
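For lower-latency use, the Coqui TTS XTTS implementation also provides `model.inference_stream`, which yields audio chunks as they are generated instead of one full waveform. A short sketch reusing the latents from above, with the same placeholder text:

```python
# Streaming synthesis: consume audio chunk by chunk as it is generated.
chunks = model.inference_stream(
    "مرحبا، إزيك النهارده؟",
    "ar",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for chunk in chunks:
    # Each chunk is a 1-D tensor of 24 kHz samples; play or buffer it here.
    wav_chunks.append(chunk)

wav = torch.cat(wav_chunks, dim=0).cpu()
torchaudio.save("output_stream.wav", wav.unsqueeze(0), 24000)
```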
If you use this model, please cite: [TO BE ADDED]
This model is released under the Apache 2.0 license, following the original XTTS v2 license.
Base model: coqui/XTTS-v2