Vaakya-Open

Vaakya-Open is a high-quality, single-speaker Text-to-Speech (TTS) model developed by Voxaura Labs, designed for English and Hindi voice synthesis. It features a natural female voice optimized for voice-overs, audiobooks, podcasts, narration, assistants, and production-grade applications.

Built with a strong focus on clarity, consistency, and expressiveness, Vaakya-Open is ideal for creators and developers looking for a dependable, studio-like voice that works seamlessly across English, Hindi, and code-mixed inputs.


🔊 Live Demo

👉 Try it instantly using the accompanying Gradio Space:

Vaakya-Open TTS Demo: convert English or Hindi text into natural-sounding speech directly in your browser.

Note: if the Space has gone dormant due to inactivity, you can restart it. A restart takes about 4 minutes (T4 small VM).


🎧 Audio Samples

Listen to sample outputs demonstrating the model's capabilities:

  • English: pure English
  • Hindi: pure Hindi (Devanagari)
  • Code-Mixed: Hindi + English (Hinglish)

All samples are generated at 24 kHz with 16-bit PCM encoding.


✨ Key Highlights

  • 🎙️ Single Professional Female Voice: consistent, warm, and narration-ready
  • 🌐 Bilingual Support: English & Hindi (with natural code-mixing)
  • 🎧 Studio-Quality Audio: trained on pristine 192 kHz recordings, output at 24 kHz
  • ⚡ Low-Latency Inference: suitable for real-time and batch workflows
  • 🧠 Production-Oriented: stable voice characteristics across long passages

🧠 Model Overview

Attribute          Details
Model Name         Vaakya-Open
Model Type         Autoregressive Transformer (Speech-LLM)
Base Architecture  Llama 3B
Speaker            Single speaker (female)
Languages          English, Hindi, code-mixed
Audio Codec        SNAC @ 24 kHz
Sampling Rate      24 kHz (output)
Developed By       Voxaura Labs
License            Apache 2.0

🏗️ Architecture

Vaakya-Open is built on the Orpheus TTS architecture pioneered by Canopy Labs, which treats speech synthesis as a language modeling task. The model generates discrete audio tokens that are decoded into high-quality waveforms.

┌────────────────────────────────────────────────────────────┐
│                VAAKYA-OPEN TTS ARCHITECTURE                │
└────────────────────────────────────────────────────────────┘

    ╔══════════════════╗
    ║   Text Input     ║  English / Hindi / Code-Mixed
    ║                  ║
    ╚════════╤═════════╝
             │
             ▼
    ╔═══════════════════════════╗
    ║   Llama 3B LLM            ║  Autoregressive generation
    ║   (Speech Transformer)    ║  7 tokens per audio frame
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   Audio Tokens            ║  Discrete codes
    ║   (SNAC Format)           ║  Hierarchical 3-level
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   SNAC Decoder            ║  Neural audio codec
    ║   (24kHz)                 ║  Token → Waveform
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   24kHz Audio Waveform    ║  High-quality speech output
    ║   (Output)                ║  Studio-grade quality
    ╚═══════════════════════════╝

How It Works

  1. Text Input: your text (English, Hindi, or code-mixed) is tokenized with a standard text tokenizer
  2. Audio Token Generation: the Llama-based LLM autoregressively generates discrete audio tokens (7 tokens per audio frame)
  3. SNAC Decoding: the SNAC neural codec converts the audio tokens back into a 24 kHz waveform
  4. Output: high-quality speech audio, ready for playback or further processing (the Advanced Usage section below implements these steps end to end)

Key Architectural Features

Component          Specification
LLM Backbone       Llama-style autoregressive transformer (3B parameters)
Audio Tokenizer    SNAC with 7 tokens per frame (flattened sequence)
Tokens per Second  ~83 audio tokens/second
Context Length     2048 tokens
Streaming Support  Yes (via sliding-window decoding)
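
The numbers above imply a simple back-of-the-envelope duration calculation: 7 tokens per frame at roughly 83 tokens per second is about 12 audio frames per second. The snippet below is purely illustrative arithmetic, not part of the model API:

# Illustrative arithmetic based on the table above (not part of the model API).
TOKENS_PER_FRAME = 7
TOKENS_PER_SECOND = 83                       # approximate generation rate

frames_per_second = TOKENS_PER_SECOND / TOKENS_PER_FRAME   # ~11.9 audio frames/s

def estimated_audio_seconds(num_audio_tokens: int) -> float:
    """Rough seconds of speech produced by a given number of audio tokens."""
    return num_audio_tokens / TOKENS_PER_SECOND

# With max_new_tokens=2048 (the full context length), generation tops out at
# roughly 2048 / 83 ~ 24-25 seconds of audio, minus whatever the prompt consumes.
print(f"{estimated_audio_seconds(2048):.1f} s")   # ~24.7 s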

Attribution: This architecture builds on Orpheus TTS by Canopy Labs, which demonstrated that LLMs can achieve human-level speech synthesis.


🎼 Voice & Audio Quality

Attribute                Value
Original Recording Rate  192 kHz (studio-grade)
Training / Output Rate   24 kHz
Output Format            WAV (PCM 16-bit, mono, 24 kHz)
Recording Environment    Controlled studio conditions
Voice Style              Neutral, professional, voice-over friendly

This high-resolution capture pipeline preserves subtle vocal textures, resulting in clean pronunciation, smooth prosody, and reduced artifacts.
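
The table above specifies 16-bit PCM mono WAV at 24 kHz. When saving model output with soundfile (the library used in the examples below), you can request that subtype explicitly; a minimal sketch with a placeholder waveform:

import numpy as np
import soundfile as sf

# Placeholder 1-second tone; replace with the array returned by the model.
sr = 24000
audio = 0.1 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)

# subtype="PCM_16" writes 16-bit PCM, matching the documented output format.
sf.write("output_pcm16.wav", audio, samplerate=sr, subtype="PCM_16")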


🚀 Getting Started

Installation

pip install torch transformers soundfile accelerate
pip install snac  # For audio decoding

For optional 4-bit quantization support:

pip install bitsandbytes

Basic Usage

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModel

model_id = "voxaura-labs/vaakya-open"

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "नमस्ते! This is Vaakya-Open from Voxaura Labs."

with torch.no_grad():
    audio = model.generate_speech(text)

sf.write("output.wav", audio, 24000)

Advanced Usage (Full Pipeline)

For users who need more control over generation parameters:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
import soundfile as sf

# Load model and tokenizer (FP16 for balanced speed and quality)
model = AutoModelForCausalLM.from_pretrained(
    "voxaura-labs/vaakya-open",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("voxaura-labs/vaakya-open")

# Initialize SNAC decoder
# Note: generate_speech() below moves SNAC to CPU before decoding for stability
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
if torch.cuda.is_available():
    snac_model = snac_model.cuda()

# Optional: For faster inference with lower memory usage, use 4-bit quantization:
# from transformers import BitsAndBytesConfig
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=True,
# )
# Then pass quantization_config to from_pretrained() instead of torch_dtype

# Token IDs
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266

# Audio token range: 128266 to 156937 (28,672 tokens total)
# The 7 tokens per frame use offsets: +0, +4096, +8192, +12288, +16384, +20480, +24576
MAX_AUDIO_TOKEN = 156937


def generate_speech(text, temperature=0.5, top_p=0.9):
    """Generate speech from text.

    Args:
        text: Input text (English, Hindi, or code-mixed)
        temperature: Sampling temperature (0.4-0.7 recommended)
        top_p: Nucleus sampling parameter

    Returns:
        numpy array: Audio waveform at 24kHz
    """
    # Move SNAC to CPU for decoding (important for stability)
    snac_model.to("cpu")

    # Tokenize text (automatically adds BOS token)
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Create input sequence: START_HUMAN + [BOS + Text + EOS] + END_HUMAN
    start_token = torch.tensor([[START_OF_HUMAN_TOKEN]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, END_OF_HUMAN_TOKEN]], dtype=torch.int64)  # EOS + END_HUMAN

    input_sequence = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_sequence = input_sequence.to(model.device)

    # Generate audio tokens
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_sequence,
            max_new_tokens=2048,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=END_OF_SPEECH_TOKEN,
            use_cache=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Extract generated tokens
    gen_tokens = generated_ids[0]

    # Find the last occurrence of START_OF_SPEECH token
    sos_indices = (gen_tokens == START_OF_SPEECH_TOKEN).nonzero(as_tuple=True)[0]

    if len(sos_indices) > 0:
        # Start from after the last START_OF_SPEECH token
        start_idx = sos_indices[-1].item() + 1
        cropped_tokens = gen_tokens[start_idx:]
    else:
        cropped_tokens = gen_tokens

    # Remove END_OF_SPEECH tokens
    audio_tokens = cropped_tokens[cropped_tokens != END_OF_SPEECH_TOKEN]

    # Truncate to make divisible by 7
    num_tokens = len(audio_tokens)
    num_frames = num_tokens // 7
    audio_tokens = audio_tokens[:num_frames * 7]

    if len(audio_tokens) == 0:
        raise ValueError("No audio tokens generated")

    # Convert to list and subtract offset
    code_list = [t.item() - AUDIO_CODE_BASE_OFFSET for t in audio_tokens]

    # Decode to audio
    return decode_snac_tokens(code_list)


def decode_snac_tokens(code_list):
    """Decode SNAC tokens to audio waveform.

    This model uses a 7-token interleaved encoding per audio frame with offsets:
    Token 1: offset +0      (level 0, coarse)
    Token 2: offset +4096   (level 1, medium)
    Token 3: offset +8192   (level 2, fine)
    Token 4: offset +12288  (level 2, fine)
    Token 5: offset +16384  (level 1, medium)
    Token 6: offset +20480  (level 2, fine)
    Token 7: offset +24576  (level 2, fine)

    Args:
        code_list: List of audio codes (already offset-adjusted)

    Returns:
        numpy array: Decoded audio waveform
    """
    if not code_list or len(code_list) % 7 != 0:
        raise ValueError("Code list must be non-empty and divisible by 7")

    # Redistribute codes into SNAC's 3-level hierarchy
    layer_1 = []  # Coarse: 1 token per frame
    layer_2 = []  # Medium: 2 tokens per frame
    layer_3 = []  # Fine: 4 tokens per frame

    for i in range(len(code_list) // 7):
        # Extract and adjust each token based on its offset
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i + 1] - 4096)
        layer_3.append(code_list[7*i + 2] - (2*4096))
        layer_3.append(code_list[7*i + 3] - (3*4096))
        layer_2.append(code_list[7*i + 4] - (4*4096))
        layer_3.append(code_list[7*i + 5] - (5*4096))
        layer_3.append(code_list[7*i + 6] - (6*4096))

    # Create hierarchical code tensors for SNAC (on same device as SNAC model)
    snac_device = next(snac_model.parameters()).device
    codes = [
        torch.tensor(layer_1, device=snac_device).unsqueeze(0),
        torch.tensor(layer_2, device=snac_device).unsqueeze(0),
        torch.tensor(layer_3, device=snac_device).unsqueeze(0)
    ]

    # Decode to audio waveform
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)

    return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()


# Example usage
text = "आज का मौसम बहुत अच्छा है। Let's go for a walk!"
audio = generate_speech(text, temperature=0.5, top_p=0.9)
sf.write("output.wav", audio, 24000)
print(f"Generated {len(audio) / 24000:.2f} seconds of audio")
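
The architecture table lists streaming support via sliding-window decoding, but this card does not ship a reference streaming implementation. The sketch below is only one way to approximate the idea with the helpers defined above: decode a small sliding window of SNAC frames as tokens arrive and emit only the newest frames. The window sizes are illustrative, and naive windowing can leave small seams at chunk boundaries.

import numpy as np

WINDOW_FRAMES = 4   # frames of SNAC context decoded per step (illustrative)
NEW_FRAMES = 1      # frames of fresh audio emitted per step (illustrative)

def stream_decode(code_list):
    """Yield 24 kHz audio chunks from an offset-adjusted SNAC code list."""
    emitted = 0
    total_frames = len(code_list) // 7
    while emitted + NEW_FRAMES <= total_frames:
        start = max(0, emitted + NEW_FRAMES - WINDOW_FRAMES)
        window = code_list[start * 7 : (emitted + NEW_FRAMES) * 7]
        audio = decode_snac_tokens(window)                    # helper defined above
        samples_per_frame = len(audio) // (len(window) // 7)
        yield audio[-NEW_FRAMES * samples_per_frame:]         # keep only the newest frames
        emitted += NEW_FRAMES

# Stitching streamed chunks back together should roughly match a one-shot decode,
# apart from possible boundary effects:
# full_audio = np.concatenate(list(stream_decode(code_list)))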

๐ŸŒ Language Support

Language Status Notes
English โœ… Supported Clear, neutral international pronunciation
Hindi โœ… Supported Natural, fluent, and expressive
Codeโ€‘Mixed (Hinglish) โœ… Supported Seamless language switching
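
For reference, the inputs below illustrate each supported mode. The sentences are only examples, and generate_speech() refers to the helper defined in the Advanced Usage section above:

import soundfile as sf

examples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "hindi": "आज का मौसम बहुत अच्छा है।",
    "code_mixed": "कल की meeting को 10 बजे reschedule कर दिया गया है।",
}

for name, sentence in examples.items():
    # generate_speech() is the helper from the Advanced Usage section above
    sf.write(f"sample_{name}.wav", generate_speech(sentence), 24000)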

🧩 Use Cases

Vaakya-Open is well-suited for:

Application                Description
📚 Audiobooks & Narration   Long-form content with a consistent voice
🎥 Video Voice-Overs        Professional dubbing and narration
🏥 Healthcare               Patient reminders, medication instructions
📞 Voice AI Agents          IVR systems, conversational assistants
🧑‍🏫 E-learning              Educational content and tutorials
♿ Accessibility            Screen readers for visually impaired users

🏗️ Training Summary

Attribute             Value
Speaker Count         1 (professional voice-over artist)
Recording Quality     Studio-grade, 192 kHz original capture
Output Quality        24 kHz (downsampled for efficiency)
Data Characteristics  Conversational, narrative, informational
Training Method       LoRA fine-tuning on the Orpheus base model
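
The table states that training used LoRA fine-tuning on the Orpheus base. The exact recipe is not published here, so the following is only a hedged sketch of a typical PEFT LoRA setup for a Llama-style speech LLM; the checkpoint name, rank, and target modules are illustrative assumptions, not the values used for Vaakya-Open.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; substitute the Orpheus checkpoint you intend to tune.
base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-pretrained")

lora_config = LoraConfig(
    r=16,                               # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()      # only the small LoRA adapters are trainable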

Training Data Sources

This model was trained using high-quality speech data from publicly available academic datasets developed by premier Indian research institutions:

Dataset            Institution                 Description                                                     License
IndicTTS           IIT Madras                  Hindi and Indian English speech corpus for TTS                  Custom (see below)
SYSPIN TTS Corpus  IISc Bangalore / SPIRE Lab  900+ hours of studio-recorded TTS data in 9 Indian languages   CC-BY-4.0
SPICOR TTS Corpus  IISc Bangalore / SPIRE Lab  97+ hour domain-rich Indian English TTS corpus                 CC-BY-4.0

We gratefully acknowledge these institutions for making their datasets available for research and development in Indian language speech synthesis.


📊 Performance

Metric               Value
Latency (A100-80GB)  ~120 ms
Latency (RTX 4090)   ~200 ms
Real-time Factor     < 0.1x*
Output Sample Rate   24 kHz

*Note: benchmarks were measured with 4-bit quantization (load_in_4bit=True), batch size 1, an average text length of ~20-30 tokens, and max_new_tokens=512. Performance varies significantly with hardware, precision (FP16/FP32/4-bit), text length, and generation parameters. FP16 inference may be 2-3x slower than 4-bit but provides better quality.
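
Real-time factor here is generation time divided by the duration of the audio produced; values below 1.0 mean faster than real time. A quick way to measure it on your own hardware, assuming the generate_speech() helper from the Advanced Usage section above:

import time

text = "This is a short benchmark sentence for measuring the real-time factor."

start = time.perf_counter()
audio = generate_speech(text)
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 24000
rtf = elapsed / audio_seconds   # < 1.0 means faster than real time
print(f"Generated {audio_seconds:.2f}s of audio in {elapsed:.2f}s (RTF = {rtf:.2f})")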


⚠️ Limitations

Limitation        Description
Single Speaker    No speaker switching or voice selection
Language Support  English & Hindi only
Emotion Control   Emotion tokens not yet exposed
Hardware          GPU recommended for real-time inference

🛣️ Roadmap

We are actively working on:

  • 🎭 Emotion & prosody control tokens
  • 🌐 Additional Indian languages (Tamil, Telugu, Bengali, Marathi)
  • 🎙️ Multiple voice variants (male, regional accents)
  • ⚙️ CPU-optimized inference paths
  • 📡 Streaming inference support

📜 License

This project is released under the Apache 2.0 License.

Training Data Licenses

The training data used in this model is subject to the following licenses:

IIT Madras IndicTTS Dataset:

COPYRIGHT 2016 TTS Consortium, TDIL, Meity represented by Hema A Murthy & S Umesh, DEPARTMENT OF Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

The IndicTTS dataset is provided under a permissive license that allows derivative works and free distribution. See the full license for details.

IISc Bangalore SYSPIN & SPICOR Datasets:

The SYSPIN and SPICOR TTS corpora are released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0), which permits sharing and adaptation for any purpose, including commercial use, with appropriate attribution.


🙏 Acknowledgments

Training Data

We gratefully acknowledge the following institutions and projects for providing high-quality speech datasets that made this model possible:

Indian Institute of Technology Madras (IIT Madras)

  • IndicTTS Project: a comprehensive speech corpus for Indian languages developed by Prof. Hema A Murthy, Prof. S Umesh, and the TTS Consortium
  • Funded by: Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), Government of India

Indian Institute of Science, Bangalore (IISc): SPIRE Lab

  • SYSPIN Project (SYnthesizing SPeech in INdian languages): 900+ hours of studio-recorded TTS data in 9 Indian languages, led by Prof. Prasanta Kumar Ghosh
    • Funded by: German Development Cooperation "FAIR Forward - AI for All" (GIZ Germany) and Bhashini AI Solutions Private Limited
  • SPICOR Project: 97+ hours of domain-rich Indian English TTS corpus
  • Dataset URL: https://spiredatasets.iisc.ac.in/

Architecture & Tools

  • Canopy Labs: for pioneering the Orpheus TTS architecture and open-sourcing their work
  • Unsloth: for training optimizations and fine-tuning tools
  • SNAC: Hubert Siuzdak, for the neural audio codec

Voice

  • Voice Artist: for providing high-quality studio recordings

🛡️ Safety, Ethics & Responsible Use

Intended Use

Vaakya-Open is designed for legitimate applications including:

  • 📚 Audiobook narration and content creation
  • 🎥 Video voice-overs and dubbing
  • 🏥 Healthcare communication (patient reminders, medication instructions)
  • 📞 Voice assistants and IVR systems
  • 🧑‍🏫 Educational content and e-learning
  • ♿ Accessibility tools for visually impaired users
  • 🔬 Academic research in speech synthesis

Prohibited Uses

Do not use this model for:

  • โŒ Impersonation โ€” Creating audio that falsely represents a real person without their explicit consent
  • โŒ Fraud & Scams โ€” Generating deceptive audio for phishing, vishing, or financial fraud
  • โŒ Misinformation โ€” Producing fake news, propaganda, or misleading content
  • โŒ Deepfakes โ€” Creating nonโ€‘consensual synthetic media intended to deceive
  • โŒ Harassment โ€” Generating content to bully, threaten, or demean individuals
  • โŒ Illegal Activities โ€” Any use that violates applicable laws or regulations

Transparency Recommendation

If you use Vaakya-Open to generate speech for public-facing applications, we strongly recommend disclosing to end users that they are listening to AI-generated content. Transparency builds trust and helps prevent potential misuse.

Legal Compliance

Users are responsible for ensuring their use of this model complies with:

  • All applicable local, national, and international laws
  • Industry-specific regulations (healthcare, finance, telecommunications, etc.)
  • Platform terms of service where the generated audio is distributed
  • Intellectual property and privacy rights of third parties

Disclaimer

THE DEVELOPERS OF VAAKYA-OPEN ASSUME NO LIABILITY FOR ANY MISUSE OF THIS MODEL.

This model is provided "as is" without warranty of any kind. Voxaura Labs and its contributors are not responsible for any direct, indirect, incidental, or consequential damages arising from the use or misuse of this model or its outputs.

By using this model, you agree to follow all applicable laws, respect the rights and privacy of others, and uphold ethical standards in AI development and deployment.

Ethical AI Commitment

We at Voxaura Labs are committed to the responsible development and use of AI technologies. We believe that:

  1. AI should augment human capabilities, not deceive or harm
  2. Transparency is essential in AI-generated content
  3. Privacy and consent must be respected in voice applications
  4. Accessibility should be a core consideration in voice technology

We encourage the community to use Vaakya-Open as a force for good: to improve accessibility, enhance communication, and create meaningful experiences while respecting the dignity and rights of all individuals.

Reporting Misuse

If you become aware of any misuse of this model, please report it to us at hello@voxygen.ai. We take reports of misuse seriously and will take appropriate action where possible.


💬 About Voxaura Labs & Voxygen.ai

Voxaura Labs builds advanced Voice AI systems focused on realism, scalability, and multilingual accessibility, bridging the gap between human expression and artificial speech.

Voxygen is the product brand of Voxaura Labs, dedicated to making advanced Voice AI technologies accessible through practical applications and tools for creators, developers, and enterprises worldwide.

๐ŸŒ Website ยท ๐Ÿ“ง Contact


📝 Citation

@misc{vaakya2026,
    title={Vaakya-Open: Text-to-Speech for Hindi and English},
    author={Voxaura Labs},
    year={2026},
    publisher={HuggingFace},
    url={https://huggingface.co/voxaura-labs/vaakya-open}
}

If you use this model, please also consider citing the training data sources:

@misc{indictts2016,
    title={IndicTTS: Text-to-Speech for Indian Languages},
    author={Murthy, Hema A and Umesh, S and {TTS Consortium}},
    year={2016},
    institution={Indian Institute of Technology Madras},
    note={TDIL, MeitY},
    url={https://www.iitm.ac.in/donlab/indictts/database}
}

@misc{syspin2025,
    title={SYSPIN_S1.0 Corpus - A TTS Corpus of 900+ hours in nine Indian Languages},
    author={Abhayjeet et al.},
    year={2025},
    institution={Indian Institute of Science, Bengaluru},
    url={https://spiredatasets.iisc.ac.in/syspinCorpus}
}

@misc{spicor2025,
    title={SPICOR TTS_1.0 Corpus - A 97+ hour domain-rich Indian English TTS Corpus},
    author={Abhayjeet et al.},
    year={2025},
    institution={Indian Institute of Science, Bengaluru},
    url={https://spiredatasets.iisc.ac.in/spicortts10}
}

Made in India 🇮🇳

If you use Vaakya-Open in your work, we'd love to hear from you!
