Vaakya-Open

Vaakya-Open is a high-quality, single-speaker Text-to-Speech (TTS) model developed by Voxaura Labs, designed for English and Hindi voice synthesis. It features a natural female voice optimized for voice-overs, audiobooks, podcasts, narration, assistants, and production-grade applications.

Built with a strong focus on clarity, consistency, and expressiveness, Vaakya-Open is ideal for creators and developers looking for a dependable, studio-like voice that works seamlessly across English, Hindi, and code-mixed inputs.


🔊 Live Demo

👉 Try it instantly using the accompanying Gradio Space:

Vaakya-Open TTS Demo: convert English or Hindi text into natural-sounding speech directly in your browser.

Note: if the Space has gone dormant due to inactivity, you can restart it. A restart takes about 4 minutes (T4 small VM).


🎧 Audio Samples

Listen to sample outputs demonstrating the model's capabilities:

  • English: pure English
  • Hindi: pure Hindi (Devanagari)
  • Code-Mixed: Hindi + English (Hinglish)

All samples are generated at 24 kHz with 16-bit PCM encoding.


✨ Key Highlights

  • 🎙️ Single Professional Female Voice: consistent, warm, and narration-ready
  • 🌐 Bilingual Support: English & Hindi (with natural code-mixing)
  • 🎧 Studio-Quality Audio: trained on pristine 192 kHz recordings, output at 24 kHz
  • ⚡ Low-Latency Inference: suitable for real-time and batch workflows
  • 🧠 Production-Oriented: stable voice characteristics across long passages

🧠 Model Overview

Attribute          Details
Model Name         Vaakya-Open
Model Type         Autoregressive Transformer (Speech-LLM)
Base Architecture  Llama 3B
Speaker            Single speaker (female)
Languages          English, Hindi, code-mixed
Audio Codec        SNAC @ 24 kHz
Sampling Rate      24 kHz (output)
Developed By       Voxaura Labs
License            Apache 2.0

🏗️ Architecture

Vaakya-Open is built on the Orpheus TTS architecture pioneered by Canopy Labs, which treats speech synthesis as a language modeling task. The model generates discrete audio tokens that are decoded into high-quality waveforms.

┌────────────────────────────────────────────────────────────┐
│                VAAKYA-OPEN TTS ARCHITECTURE                │
└────────────────────────────────────────────────────────────┘

    ╔══════════════════╗
    ║   Text Input     ║  English / Hindi / Code-Mixed
    ║                  ║
    ╚════════╤═════════╝
             │
             ▼
    ╔═══════════════════════════╗
    ║   Llama 3B LLM            ║  Autoregressive generation
    ║   (Speech Transformer)    ║  7 tokens per audio frame
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   Audio Tokens            ║  Discrete codes
    ║   (SNAC Format)           ║  Hierarchical 3-level
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   SNAC Decoder            ║  Neural audio codec
    ║   (24kHz)                 ║  Token → Waveform
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   24kHz Audio Waveform    ║  High-quality speech output
    ║   (Output)                ║  Studio-grade quality
    ╚═══════════════════════════╝

How It Works

  1. Text Input: your text (English, Hindi, or code-mixed) is tokenized with a standard text tokenizer
  2. Audio Token Generation: the Llama-based LLM autoregressively generates discrete audio tokens (7 tokens per audio frame)
  3. SNAC Decoding: the SNAC neural codec converts the audio tokens back into a 24 kHz waveform
  4. Output: high-quality speech audio, ready for playback or further processing (the Advanced Usage section below implements these steps end to end)

Key Architectural Features

Component          Specification
LLM Backbone       Llama-style autoregressive transformer (3B parameters)
Audio Tokenizer    SNAC with 7 tokens per frame (flattened sequence)
Tokens per Second  ~83 audio tokens/second
Context Length     2048 tokens
Streaming Support  Yes (via sliding-window decoding)
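
The numbers above imply a simple back-of-the-envelope duration calculation: 7 tokens per frame at roughly 83 tokens per second is about 12 audio frames per second. The snippet below is purely illustrative arithmetic, not part of the model API:

# Illustrative arithmetic based on the table above (not part of the model API).
TOKENS_PER_FRAME = 7
TOKENS_PER_SECOND = 83                       # approximate generation rate

frames_per_second = TOKENS_PER_SECOND / TOKENS_PER_FRAME   # ~11.9 audio frames/s

def estimated_audio_seconds(num_audio_tokens: int) -> float:
    """Rough seconds of speech produced by a given number of audio tokens."""
    return num_audio_tokens / TOKENS_PER_SECOND

# With max_new_tokens=2048 (the full context length), generation tops out at
# roughly 2048 / 83 ~ 24-25 seconds of audio, minus whatever the prompt consumes.
print(f"{estimated_audio_seconds(2048):.1f} s")   # ~24.7 s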

Attribution: This architecture builds on Orpheus TTS by Canopy Labs, which demonstrated that LLMs can achieve human-level speech synthesis.


🎼 Voice & Audio Quality

Attribute                Value
Original Recording Rate  192 kHz (studio-grade)
Training / Output Rate   24 kHz
Output Format            WAV (PCM 16-bit, mono, 24 kHz)
Recording Environment    Controlled studio conditions
Voice Style              Neutral, professional, voice-over friendly

This high-resolution capture pipeline preserves subtle vocal textures, resulting in clean pronunciation, smooth prosody, and reduced artifacts.
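
The table above specifies 16-bit PCM mono WAV at 24 kHz. When saving model output with soundfile (the library used in the examples below), you can request that subtype explicitly; a minimal sketch with a placeholder waveform:

import numpy as np
import soundfile as sf

# Placeholder 1-second tone; replace with the array returned by the model.
sr = 24000
audio = 0.1 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)

# subtype="PCM_16" writes 16-bit PCM, matching the documented output format.
sf.write("output_pcm16.wav", audio, samplerate=sr, subtype="PCM_16")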


🚀 Getting Started

Installation

pip install torch transformers soundfile accelerate
pip install snac  # For audio decoding

For optional 4-bit quantization support:

pip install bitsandbytes

Basic Usage

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModel

model_id = "voxaura-labs/vaakya-open"

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "नमस्ते! This is Vaakya-Open from Voxaura Labs."

with torch.no_grad():
    audio = model.generate_speech(text)

sf.write("output.wav", audio, 24000)

Advanced Usage (Full Pipeline)

For users who need more control over generation parameters:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
import soundfile as sf

# Load model and tokenizer (FP16 for balanced speed and quality)
model = AutoModelForCausalLM.from_pretrained(
    "voxaura-labs/vaakya-open",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("voxaura-labs/vaakya-open")

# Initialize SNAC decoder
# Note: generate_speech() below moves SNAC to CPU before decoding for stability
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
if torch.cuda.is_available():
    snac_model = snac_model.cuda()

# Optional: For faster inference with lower memory usage, use 4-bit quantization:
# from transformers import BitsAndBytesConfig
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=True,
# )
# Then pass quantization_config to from_pretrained() instead of torch_dtype

# Token IDs
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266

# Audio token range: 128266 to 156937 (28,672 tokens total)
# The 7 tokens per frame use offsets: +0, +4096, +8192, +12288, +16384, +20480, +24576
MAX_AUDIO_TOKEN = 156937


def generate_speech(text, temperature=0.5, top_p=0.9):
    """Generate speech from text.

    Args:
        text: Input text (English, Hindi, or code-mixed)
        temperature: Sampling temperature (0.4-0.7 recommended)
        top_p: Nucleus sampling parameter

    Returns:
        numpy array: Audio waveform at 24kHz
    """
    # Move SNAC to CPU for decoding (important for stability)
    snac_model.to("cpu")

    # Tokenize text (automatically adds BOS token)
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Create input sequence: START_HUMAN + [BOS + Text + EOS] + END_HUMAN
    start_token = torch.tensor([[START_OF_HUMAN_TOKEN]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, END_OF_HUMAN_TOKEN]], dtype=torch.int64)  # EOS + END_HUMAN

    input_sequence = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_sequence = input_sequence.to(model.device)

    # Generate audio tokens
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_sequence,
            max_new_tokens=2048,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=END_OF_SPEECH_TOKEN,
            use_cache=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Extract generated tokens
    gen_tokens = generated_ids[0]

    # Find the last occurrence of START_OF_SPEECH token
    sos_indices = (gen_tokens == START_OF_SPEECH_TOKEN).nonzero(as_tuple=True)[0]

    if len(sos_indices) > 0:
        # Start from after the last START_OF_SPEECH token
        start_idx = sos_indices[-1].item() + 1
        cropped_tokens = gen_tokens[start_idx:]
    else:
        cropped_tokens = gen_tokens

    # Remove END_OF_SPEECH tokens
    audio_tokens = cropped_tokens[cropped_tokens != END_OF_SPEECH_TOKEN]

    # Truncate to make divisible by 7
    num_tokens = len(audio_tokens)
    num_frames = num_tokens // 7
    audio_tokens = audio_tokens[:num_frames * 7]

    if len(audio_tokens) == 0:
        raise ValueError("No audio tokens generated")

    # Convert to list and subtract offset
    code_list = [t.item() - AUDIO_CODE_BASE_OFFSET for t in audio_tokens]

    # Decode to audio
    return decode_snac_tokens(code_list)


def decode_snac_tokens(code_list):
    """Decode SNAC tokens to audio waveform.

    This model uses a 7-token interleaved encoding per audio frame with offsets:
    Token 1: offset +0      (level 0, coarse)
    Token 2: offset +4096   (level 1, medium)
    Token 3: offset +8192   (level 2, fine)
    Token 4: offset +12288  (level 2, fine)
    Token 5: offset +16384  (level 1, medium)
    Token 6: offset +20480  (level 2, fine)
    Token 7: offset +24576  (level 2, fine)

    Args:
        code_list: List of audio codes (already offset-adjusted)

    Returns:
        numpy array: Decoded audio waveform
    """
    if not code_list or len(code_list) % 7 != 0:
        raise ValueError("Code list must be non-empty and divisible by 7")

    # Redistribute codes into SNAC's 3-level hierarchy
    layer_1 = []  # Coarse: 1 token per frame
    layer_2 = []  # Medium: 2 tokens per frame
    layer_3 = []  # Fine: 4 tokens per frame

    for i in range(len(code_list) // 7):
        # Extract and adjust each token based on its offset
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i + 1] - 4096)
        layer_3.append(code_list[7*i + 2] - (2*4096))
        layer_3.append(code_list[7*i + 3] - (3*4096))
        layer_2.append(code_list[7*i + 4] - (4*4096))
        layer_3.append(code_list[7*i + 5] - (5*4096))
        layer_3.append(code_list[7*i + 6] - (6*4096))

    # Create hierarchical code tensors for SNAC (on same device as SNAC model)
    snac_device = next(snac_model.parameters()).device
    codes = [
        torch.tensor(layer_1, device=snac_device).unsqueeze(0),
        torch.tensor(layer_2, device=snac_device).unsqueeze(0),
        torch.tensor(layer_3, device=snac_device).unsqueeze(0)
    ]

    # Decode to audio waveform
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)

    return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()


# Example usage
text = "आज का मौसम बहुत अच्छा है। Let's go for a walk!"
audio = generate_speech(text, temperature=0.5, top_p=0.9)
sf.write("output.wav", audio, 24000)
print(f"Generated {len(audio) / 24000:.2f} seconds of audio")
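
The architecture table lists streaming support via sliding-window decoding, but this card does not ship a reference streaming implementation. The sketch below is only one way to approximate the idea with the helpers defined above: decode a small sliding window of SNAC frames as tokens arrive and emit only the newest frames. The window sizes are illustrative, and naive windowing can leave small seams at chunk boundaries.

import numpy as np

WINDOW_FRAMES = 4   # frames of SNAC context decoded per step (illustrative)
NEW_FRAMES = 1      # frames of fresh audio emitted per step (illustrative)

def stream_decode(code_list):
    """Yield 24 kHz audio chunks from an offset-adjusted SNAC code list."""
    emitted = 0
    total_frames = len(code_list) // 7
    while emitted + NEW_FRAMES <= total_frames:
        start = max(0, emitted + NEW_FRAMES - WINDOW_FRAMES)
        window = code_list[start * 7 : (emitted + NEW_FRAMES) * 7]
        audio = decode_snac_tokens(window)                    # helper defined above
        samples_per_frame = len(audio) // (len(window) // 7)
        yield audio[-NEW_FRAMES * samples_per_frame:]         # keep only the newest frames
        emitted += NEW_FRAMES

# Stitching streamed chunks back together should roughly match a one-shot decode,
# apart from possible boundary effects:
# full_audio = np.concatenate(list(stream_decode(code_list)))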

๐ŸŒ Language Support

Language Status Notes
English โœ… Supported Clear, neutral international pronunciation
Hindi โœ… Supported Natural, fluent, and expressive
Codeโ€‘Mixed (Hinglish) โœ… Supported Seamless language switching
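
For reference, the inputs below illustrate each supported mode. The sentences are only examples, and generate_speech() refers to the helper defined in the Advanced Usage section above:

import soundfile as sf

examples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "hindi": "आज का मौसम बहुत अच्छा है।",
    "code_mixed": "कल की meeting को 10 बजे reschedule कर दिया गया है।",
}

for name, sentence in examples.items():
    # generate_speech() is the helper from the Advanced Usage section above
    sf.write(f"sample_{name}.wav", generate_speech(sentence), 24000)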

🧩 Use Cases

Vaakya-Open is well-suited for:

Application                Description
📚 Audiobooks & Narration   Long-form content with a consistent voice
🎥 Video Voice-Overs        Professional dubbing and narration
🏥 Healthcare               Patient reminders, medication instructions
📞 Voice AI Agents          IVR systems, conversational assistants
🧑‍🏫 E-learning              Educational content and tutorials
♿ Accessibility            Screen readers for visually impaired users

🏗️ Training Summary

Attribute             Value
Speaker Count         1 (professional voice-over artist)
Recording Quality     Studio-grade, 192 kHz original capture
Output Quality        24 kHz (downsampled for efficiency)
Data Characteristics  Conversational, narrative, informational
Training Method       LoRA fine-tuning on the Orpheus base model
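
The table states that training used LoRA fine-tuning on the Orpheus base. The exact recipe is not published here, so the following is only a hedged sketch of a typical PEFT LoRA setup for a Llama-style speech LLM; the checkpoint name, rank, and target modules are illustrative assumptions, not the values used for Vaakya-Open.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; substitute the Orpheus checkpoint you intend to tune.
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-pretrained")

lora_config = LoraConfig(
    r=16,                               # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()      # only the small LoRA adapters are trainable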

Training Data Sources

This model was trained using high-quality speech data from publicly available academic datasets developed by premier Indian research institutions:

Dataset            Institution                 Description                                                     License
IndicTTS           IIT Madras                  Hindi and Indian English speech corpus for TTS                  Custom (see below)
SYSPIN TTS Corpus  IISc Bangalore / SPIRE Lab  900+ hours of studio-recorded TTS data in 9 Indian languages   CC-BY-4.0
SPICOR TTS Corpus  IISc Bangalore / SPIRE Lab  97+ hour domain-rich Indian English TTS corpus                 CC-BY-4.0

We gratefully acknowledge these institutions for making their datasets available for research and development in Indian language speech synthesis.


📊 Performance

Metric               Value
Latency (A100-80GB)  ~120 ms
Latency (RTX 4090)   ~200 ms
Real-time Factor     < 0.1x*
Output Sample Rate   24 kHz

*Note: benchmarks were measured with 4-bit quantization (load_in_4bit=True), batch size 1, an average text length of ~20-30 tokens, and max_new_tokens=512. Performance varies significantly with hardware, precision (FP16/FP32/4-bit), text length, and generation parameters. FP16 inference may be 2-3x slower than 4-bit but provides better quality.
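
Real-time factor here is generation time divided by the duration of the audio produced; values below 1.0 mean faster than real time. A quick way to measure it on your own hardware, assuming the generate_speech() helper from the Advanced Usage section above:

import time

text = "This is a short benchmark sentence for measuring the real-time factor."

start = time.perf_counter()
audio = generate_speech(text)
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 24000
rtf = elapsed / audio_seconds   # < 1.0 means faster than real time
print(f"Generated {audio_seconds:.2f}s of audio in {elapsed:.2f}s (RTF = {rtf:.2f})")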


⚠️ Limitations

Limitation        Description
Single Speaker    No speaker switching or voice selection
Language Support  English & Hindi only
Emotion Control   Emotion tokens not yet exposed
Hardware          GPU recommended for real-time inference

🛣️ Roadmap

We are actively working on:

  • 🎭 Emotion & prosody control tokens
  • 🌐 Additional Indian languages (Tamil, Telugu, Bengali, Marathi)
  • 🎙️ Multiple voice variants (male, regional accents)
  • ⚙️ CPU-optimized inference paths
  • 📡 Streaming inference support

📜 License

This project is released under the Apache 2.0 License.

Training Data Licenses

The training data used in this model is subject to the following licenses:

IIT Madras IndicTTS Dataset:

COPYRIGHT 2016 TTS Consortium, TDIL, Meity represented by Hema A Murthy & S Umesh, DEPARTMENT OF Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

The IndicTTS dataset is provided under a permissive license that allows derivative works and free distribution. See the full license for details.

IISc Bangalore SYSPIN & SPICOR Datasets:

The SYSPIN and SPICOR TTS corpora are released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0), which permits sharing and adaptation for any purpose, including commercial use, with appropriate attribution.


🙏 Acknowledgments

Training Data

We gratefully acknowledge the following institutions and projects for providing high-quality speech datasets that made this model possible:

Indian Institute of Technology Madras (IIT Madras)

  • IndicTTS Project: a comprehensive speech corpus for Indian languages developed by Prof. Hema A Murthy, Prof. S Umesh, and the TTS Consortium
  • Funded by: Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), Government of India

Indian Institute of Science, Bangalore (IISc): SPIRE Lab

  • SYSPIN Project (SYnthesizing SPeech in INdian languages): 900+ hours of studio-recorded TTS data in 9 Indian languages, led by Prof. Prasanta Kumar Ghosh
    • Funded by: German Development Cooperation "FAIR Forward - AI for All" (GIZ Germany) and Bhashini AI Solutions Private Limited
  • SPICOR Project: 97+ hours of domain-rich Indian English TTS corpus
  • Dataset URL: https://spiredatasets.iisc.ac.in/

Architecture & Tools

  • Canopy Labs: for pioneering the Orpheus TTS architecture and open-sourcing their work
  • Unsloth: for training optimizations and fine-tuning tools
  • SNAC: Hubert Siuzdak, for the neural audio codec

Voice

  • Voice Artist: for providing high-quality studio recordings

🛡️ Safety, Ethics & Responsible Use

Intended Use

Vaakya-Open is designed for legitimate applications including:

  • 📚 Audiobook narration and content creation
  • 🎥 Video voice-overs and dubbing
  • 🏥 Healthcare communication (patient reminders, medication instructions)
  • 📞 Voice assistants and IVR systems
  • 🧑‍🏫 Educational content and e-learning
  • ♿ Accessibility tools for visually impaired users
  • 🔬 Academic research in speech synthesis

Prohibited Uses

Do not use this model for:

  • โŒ Impersonation โ€” Creating audio that falsely represents a real person without their explicit consent
  • โŒ Fraud & Scams โ€” Generating deceptive audio for phishing, vishing, or financial fraud
  • โŒ Misinformation โ€” Producing fake news, propaganda, or misleading content
  • โŒ Deepfakes โ€” Creating nonโ€‘consensual synthetic media intended to deceive
  • โŒ Harassment โ€” Generating content to bully, threaten, or demean individuals
  • โŒ Illegal Activities โ€” Any use that violates applicable laws or regulations

Transparency Recommendation

If you use Vaakya-Open to generate speech for public-facing applications, we strongly recommend disclosing to end users that they are listening to AI-generated content. Transparency builds trust and helps prevent potential misuse.

Legal Compliance

Users are responsible for ensuring their use of this model complies with:

  • All applicable local, national, and international laws
  • Industry-specific regulations (healthcare, finance, telecommunications, etc.)
  • Platform terms of service where the generated audio is distributed
  • Intellectual property and privacy rights of third parties

Disclaimer

THE DEVELOPERS OF VAAKYA-OPEN ASSUME NO LIABILITY FOR ANY MISUSE OF THIS MODEL.

This model is provided "as is" without warranty of any kind. Voxaura Labs and its contributors are not responsible for any direct, indirect, incidental, or consequential damages arising from the use or misuse of this model or its outputs.

By using this model, you agree to follow all applicable laws, respect the rights and privacy of others, and uphold ethical standards in AI development and deployment.

Ethical AI Commitment

We at Voxaura Labs are committed to the responsible development and use of AI technologies. We believe that:

  1. AI should augment human capabilities, not deceive or harm
  2. Transparency is essential in AI-generated content
  3. Privacy and consent must be respected in voice applications
  4. Accessibility should be a core consideration in voice technology

We encourage the community to use Vaakya-Open as a force for good: to improve accessibility, enhance communication, and create meaningful experiences while respecting the dignity and rights of all individuals.

Reporting Misuse

If you become aware of any misuse of this model, please report it to us at hello@voxygen.ai. We take reports of misuse seriously and will take appropriate action where possible.


💬 About Voxaura Labs & Voxygen.ai

Voxaura Labs builds advanced Voice AI systems focused on realism, scalability, and multilingual accessibility, bridging the gap between human expression and artificial speech.

Voxygen is the product brand of Voxaura Labs, dedicated to making advanced Voice AI technologies accessible through practical applications and tools for creators, developers, and enterprises worldwide.

๐ŸŒ Website ยท ๐Ÿ“ง Contact


📝 Citation

@misc{vaakya2026,
    title={Vaakya-Open: Text-to-Speech for Hindi and English},
    author={Voxaura Labs},
    year={2026},
    publisher={HuggingFace},
    url={https://huggingface.co/voxaura-labs/vaakya-open}
}

If you use this model, please also consider citing the training data sources:

@misc{indictts2016,
    title={IndicTTS: Text-to-Speech for Indian Languages},
    author={Murthy, Hema A and Umesh, S and {TTS Consortium}},
    year={2016},
    institution={Indian Institute of Technology Madras},
    note={TDIL, MeitY},
    url={https://www.iitm.ac.in/donlab/indictts/database}
}

@misc{syspin2025,
    title={SYSPIN_S1.0 Corpus - A TTS Corpus of 900+ hours in nine Indian Languages},
    author={Abhayjeet et al.},
    year={2025},
    institution={Indian Institute of Science, Bengaluru},
    url={https://spiredatasets.iisc.ac.in/syspinCorpus}
}

@misc{spicor2025,
    title={SPICOR TTS_1.0 Corpus - A 97+ hour domain-rich Indian English TTS Corpus},
    author={Abhayjeet et al.},
    year={2025},
    institution={Indian Institute of Science, Bengaluru},
    url={https://spiredatasets.iisc.ac.in/spicortts10}
}

Made in India 🇮🇳

If you use Vaakya-Open in your work, we'd love to hear from you!
