---
license: mit
language:
- en
- gu
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: text-to-speech
tags:
- tts
- indian-accent
---
# Ind-QwenTTS
A lightweight multilingual Text-to-Speech system with accent control for English and Gujarati.
## Features
- Multilingual: English + Gujarati
- Accent Control: Indian & Gujarati accents
- 4 voices (2 male, 2 female)
- Accent transfer capability
- Fast inference with 0.5B parameters
## Supported Voices
| Speaker ID | Language | Accent | Gender |
|-----------|----------|---------|---------|
| `SPK_EN_M_001` | English | Indian | Male |
| `SPK_EN_F_001` | English | Indian | Female |
| `SPK_GU_M_001` | Gujarati | Gujarati | Male |
| `SPK_GU_F_001` | Gujarati | Gujarati | Female |
## Installation
```bash
pip install transformers torch torchaudio snac torchcodec
```
## Usage
```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained("AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16).to(device).eval()
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()
def generate_speech(text, language="english", accent="indian", gender="M", speaker=None, output_file="output.wav"):
    # Fall back to a default voice for the (language, gender) pair
    if speaker is None:
        speaker_map = {
            ("english", "M"): "SPK_EN_M_001",
            ("english", "F"): "SPK_EN_F_001",
            ("gujarati", "M"): "SPK_GU_M_001",
            ("gujarati", "F"): "SPK_GU_F_001"
        }
        speaker = speaker_map.get((language, gender), "SPK_EN_M_001")
    prompt = f"{language}{accent}{gender}{speaker} {text}"
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)
    # Control tokens appended after the text prompt to start audio generation
    start_tokens = torch.tensor([
        tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        tokenizer.convert_tokens_to_ids(""),
        tokenizer.convert_tokens_to_ids(""),
        tokenizer.convert_tokens_to_ids("")
    ], device=device).unsqueeze(0)
    full_input = torch.cat([input_ids, start_tokens], dim=1)
    with torch.no_grad():
        output = model.generate(
            full_input,
            max_new_tokens=1500,
            temperature=0.7,
            top_p=0.85,
            repetition_penalty=1.15,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids("")
        )
    generated_ids = output[0, full_input.shape[1]:]
    # Strip the trailing end-of-speech token, if present
    eos_id = tokenizer.convert_tokens_to_ids("")
    if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
        generated_ids = generated_ids[:-1]
    # Each audio frame is 7 SNAC codes; drop any incomplete final frame
    if len(generated_ids) % 7 != 0:
        trunc_len = (len(generated_ids) // 7) * 7
        generated_ids = generated_ids[:trunc_len]
    if len(generated_ids) == 0:
        print("Error: No audio generated.")
        return
    codes = generated_ids.reshape(-1, 7).T
    # Map token IDs back into the SNAC codebook range [0, 4096)
    snac_offset = model.config.vocab_size - 4096
    codes = codes - snac_offset
    codes = torch.clamp(codes, min=0)
    # De-interleave the 7 codes per frame into SNAC's three codebook levels
    l1 = codes[0, :]
    l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
    l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()
    with torch.inference_mode():
        audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])
    audio_tensor = audio.squeeze(0).cpu()
    torchaudio.save(output_file, audio_tensor, 24000)
    print(f"Saved to {output_file}")

generate_speech(
    text="The competition results will be announced tomorrow morning.",
    language="english",
    accent="indian",
    gender="M",
    output_file="test_english.wav"
)
```
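The decoding step above packs each audio frame as 7 SNAC codes and de-interleaves them into the codec's three codebook levels (1, 2, and 4 codes per frame). A minimal sketch of that reshaping, using dummy sequential IDs in place of real audio tokens:

```python
import torch

# Two dummy frames of 7 codes each (placeholder values, not real audio tokens)
generated_ids = torch.arange(14)

codes = generated_ids.reshape(-1, 7).T  # shape: (7, num_frames)

l1 = codes[0, :]                                                # level 1: 1 code/frame
l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()   # level 2: 2 codes/frame
l3 = torch.stack([codes[2, :], codes[3, :],
                  codes[5, :], codes[6, :]], dim=1).flatten()   # level 3: 4 codes/frame

print(l1.tolist())  # [0, 7]
print(l2.tolist())  # [1, 4, 8, 11]
print(l3.tolist())  # [2, 3, 5, 6, 9, 10, 12, 13]
```

Positions 0, [1, 4], and [2, 3, 5, 6] within each 7-token frame map to levels 1, 2, and 3 respectively, which is why the level tensors grow at 1x, 2x, and 4x the frame count.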
## Examples
**Basic English synthesis:**
```python
generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")
```
**Gujarati synthesis:**
```python
generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
```
## Audio Samples
Here are some samples generated by the model.
| Description | Speaker | Audio |
|:--- |:--- |:--- |
| **Indian English**<br>Standard Generation | Male (`SPK_EN_M_001`) | |
| **Indian English**<br>Long Narrative | Female (`SPK_EN_F_001`) | |
| **Gujarati**<br>Native Speech | Female (`SPK_GU_F_001`) | |
## Parameters
- `text`: Text to synthesize
- `language`: `"english"` or `"gujarati"`
- `accent`: `"indian"` or `"gujarati"`
- `gender`: `"M"` (male) or `"F"` (female)
- `speaker`: Optional specific speaker ID (auto-selected if not provided)
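When `speaker` is omitted, a default voice is chosen per (language, gender) pair, with the Indian-English male voice as the fallback for unknown combinations. A standalone sketch of that lookup (`default_speaker` is an illustrative name, not part of the card's API; the mapping matches the Usage snippet):

```python
def default_speaker(language: str, gender: str) -> str:
    """Pick the default voice ID for a (language, gender) pair.

    Unknown combinations fall back to SPK_EN_M_001, mirroring the
    behaviour of generate_speech in the Usage section.
    """
    speaker_map = {
        ("english", "M"): "SPK_EN_M_001",
        ("english", "F"): "SPK_EN_F_001",
        ("gujarati", "M"): "SPK_GU_M_001",
        ("gujarati", "F"): "SPK_GU_F_001",
    }
    return speaker_map.get((language, gender), "SPK_EN_M_001")

print(default_speaker("gujarati", "F"))  # SPK_GU_F_001
print(default_speaker("hindi", "M"))     # SPK_EN_M_001 (fallback)
```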
## Training Code
Training pipeline and scripts will be open-sourced soon.
## Citation
```bibtex
@misc{ind-qwentts-2025,
title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
author={Aryan Purohit},
year={2025}
}
```