---
license: mit
language:
- en
- gu
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: text-to-speech
tags:
- tts
- indian-accent
---

# Ind-QwenTTS

A lightweight multilingual text-to-speech system with accent control for English and Gujarati.

## Features

- Multilingual: English + Gujarati
- Accent control: Indian and Gujarati accents
- 4 voices (2 male, 2 female)
- Accent transfer capability
- Fast inference with only 0.5B parameters

## Supported Voices

| Speaker ID | Language | Accent | Gender |
|------------|----------|--------|--------|
| `SPK_EN_M_001` | English | Indian | Male |
| `SPK_EN_F_001` | English | Indian | Female |
| `SPK_GU_M_001` | Gujarati | Gujarati | Male |
| `SPK_GU_F_001` | Gujarati | Gujarati | Female |

## Installation

```bash
pip install transformers torch torchaudio snac torchcodec
```

## Usage

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained(
    "AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16
).to(device).eval()
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()


def generate_speech(text, language="english", accent="indian", gender="M",
                    speaker=None, output_file="output.wav"):
    # Auto-select a speaker for the requested language/gender if none is given.
    if speaker is None:
        speaker_map = {
            ("english", "M"): "SPK_EN_M_001",
            ("english", "F"): "SPK_EN_F_001",
            ("gujarati", "M"): "SPK_GU_M_001",
            ("gujarati", "F"): "SPK_GU_F_001",
        }
        speaker = speaker_map.get((language, gender), "SPK_EN_M_001")

    prompt = f"{language}{accent}{gender}{speaker} {text}"
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)

    start_tokens = torch.tensor([
        tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        tokenizer.convert_tokens_to_ids(""),
        tokenizer.convert_tokens_to_ids(""),
        tokenizer.convert_tokens_to_ids(""),
    ], device=device).unsqueeze(0)
    full_input = torch.cat([input_ids, start_tokens], dim=1)

    with torch.no_grad():
        output = model.generate(
            full_input,
            max_new_tokens=1500,
            temperature=0.7,
            top_p=0.85,
            repetition_penalty=1.15,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids(""),
        )

    generated_ids = output[0, full_input.shape[1]:]

    # Strip the trailing end-of-speech token if present.
    eos_id = tokenizer.convert_tokens_to_ids("")
    if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
        generated_ids = generated_ids[:-1]

    # SNAC expects complete 7-token frames; drop any partial trailing frame.
    if len(generated_ids) % 7 != 0:
        trunc_len = (len(generated_ids) // 7) * 7
        generated_ids = generated_ids[:trunc_len]

    if len(generated_ids) == 0:
        print("Error: No audio generated.")
        return

    # Map token ids back into SNAC's 4096-entry codebook range and
    # de-interleave each 7-token frame into the three codebook levels.
    codes = generated_ids.reshape(-1, 7).T
    snac_offset = model.config.vocab_size - 4096
    codes = codes - snac_offset
    codes = torch.clamp(codes, min=0)

    l1 = codes[0, :]
    l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
    l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()

    with torch.inference_mode():
        audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])

    audio_tensor = audio.squeeze(0).cpu()
    torchaudio.save(output_file, audio_tensor, 24000)
    print(f"Saved to {output_file}")


generate_speech(
    text="The competition results will be announced tomorrow morning.",
    language="english",
    accent="indian",
    gender="M",
    output_file="test_english.wav",
)
```

## Examples

**Basic English synthesis:**

```python
generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")
```

**Gujarati synthesis:**

```python
generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
```

## Audio Samples

Here are some samples generated by the model.

| Description | Speaker | Audio |
|:--- |:--- |:--- |
| **Indian English**<br>Standard Generation | Male (`SPK_EN_M_001`) | |
| **Indian English**<br>Long Narrative | Female (`SPK_EN_F_001`) | |
| **Gujarati**<br>Native Speech | Female (`SPK_GU_F_001`) | |

## Parameters

- `text`: Text to synthesize
- `language`: `"english"` or `"gujarati"`
- `accent`: `"indian"` or `"gujarati"`
- `gender`: `"M"` (male) or `"F"` (female)
- `speaker`: Optional specific speaker ID (auto-selected if not provided)

## Training Code

The training pipeline and scripts will be open-sourced soon.

## Citation

```bibtex
@misc{ind-qwentts-2025,
  title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
  author={Aryan Purohit},
  year={2025}
}
```
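## Notes on the Frame Layout

The decoding step in `generate_speech` assumes each audio frame is 7 tokens, which are de-interleaved into SNAC's three codebook levels (1 coarse + 2 mid + 4 fine codes per frame). As a rough sketch of that same logic in pure Python (no model required; the token values below are made up purely for illustration, and `deinterleave` is a hypothetical helper, not part of this repo):

```python
def deinterleave(ids):
    """De-interleave flat 7-token frames into SNAC's three codebook levels,
    mirroring the reshape/stack logic in generate_speech."""
    assert len(ids) % 7 == 0, "SNAC expects complete 7-token frames"
    frames = [ids[i:i + 7] for i in range(0, len(ids), 7)]
    l1 = [f[0] for f in frames]                                  # coarse: 1 code/frame
    l2 = [c for f in frames for c in (f[1], f[4])]               # mid:    2 codes/frame
    l3 = [c for f in frames for c in (f[2], f[3], f[5], f[6])]   # fine:   4 codes/frame
    return l1, l2, l3

# Two illustrative frames: tokens 0..6 and 10..16.
l1, l2, l3 = deinterleave([0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16])
print(l1)  # [0, 10]
print(l2)  # [1, 4, 11, 14]
print(l3)  # [2, 3, 5, 6, 12, 13, 15, 16]
```

This is also why the main script truncates `generated_ids` to a multiple of 7 before decoding: a partial frame cannot be split across the three levels.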