VibeVoice TTS - ONNX for Browser
Microsoft's VibeVoice-Realtime-0.5B converted to ONNX for browser inference.
Architecture
VibeVoice uses a two-LLM architecture with diffusion-based vocoding:
- Text LM (187 MB INT8) - Processes input text to hidden states
- TTS LM (417 MB INT8) - Generates speech latents autoregressively
- Diffusion Head (25 MB INT4) - Denoises latents
- Vocoder (339 MB INT4) - Converts latents to 24 kHz audio
- Acoustic Connector (0.5 MB INT4) - Feeds speech back to TTS LM
Total: ~969 MB quantized (down from 3.8 GB)
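The data flow between the five components can be sketched roughly as below. This is an illustration only: the `Stages` interface, the `synthesize` function, the `numFrames` parameter, and the exact ordering inside the loop are assumptions made for this sketch, not the actual ONNX input/output names or the logic in demo.html.

```typescript
// Hypothetical stage signatures. The real tensor names and shapes of the
// ONNX graphs are not documented here, so each stage is modelled as a
// plain function over Float32Array.
type Vec = Float32Array;

interface Stages {
  textLM: (tokenIds: number[]) => Vec;                    // text tokens -> hidden states
  ttsLM: (textHidden: Vec, feedback: Vec | null) => Vec;  // one autoregressive step
  diffusionHead: (condition: Vec) => Vec;                 // denoise -> speech latent
  acousticConnector: (latent: Vec) => Vec;                // latent -> feedback for the TTS LM
  vocoder: (latents: Vec[]) => Float32Array;              // latents -> 24 kHz audio samples
}

// Assumed loop: encode the text once, then generate latents frame by frame,
// feeding each denoised latent back into the TTS LM via the connector.
function synthesize(stages: Stages, tokenIds: number[], numFrames: number): Float32Array {
  const textHidden = stages.textLM(tokenIds);
  const latents: Vec[] = [];
  let feedback: Vec | null = null;

  for (let i = 0; i < numFrames; i++) {
    const condition = stages.ttsLM(textHidden, feedback);
    const latent = stages.diffusionHead(condition);
    latents.push(latent);
    feedback = stages.acousticConnector(latent);
  }

  return stages.vocoder(latents);
}
```

Passing the stages in as functions keeps the sketch independent of any particular runtime; in a browser each one would presumably wrap an inference session for the corresponding `.onnx` file.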
Files
- text_lm_int8.onnx - Text encoder
- tts_lm_int8.onnx - Speech token generator
- diffusion_head_int4.onnx - Latent denoiser
- vocoder_int4.onnx - Audio synthesizer
- acoustic_connector_int4.onnx - Speech feedback
- tokenizer.json - Qwen2 tokenizer
- config.json - Model configuration
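As a rough aid for planning downloads, the model files can be listed in a small manifest. The pairing of file names with the sizes from the Architecture section is inferred from the names, and `MODEL_FILES`, `totalSizeMB`, `fileUrls`, and the base URL are hypothetical names introduced for this sketch.

```typescript
interface ModelFile {
  name: string;
  role: string;
  sizeMB: number; // approximate size, taken from the Architecture list
}

// Assumed mapping of file names to component sizes.
const MODEL_FILES: ModelFile[] = [
  { name: "text_lm_int8.onnx", role: "Text encoder", sizeMB: 187 },
  { name: "tts_lm_int8.onnx", role: "Speech token generator", sizeMB: 417 },
  { name: "diffusion_head_int4.onnx", role: "Latent denoiser", sizeMB: 25 },
  { name: "vocoder_int4.onnx", role: "Audio synthesizer", sizeMB: 339 },
  { name: "acoustic_connector_int4.onnx", role: "Speech feedback", sizeMB: 0.5 },
];

// Sum of the listed sizes; rounds to the ~969 MB total quoted above.
function totalSizeMB(files: ModelFile[]): number {
  return files.reduce((sum, f) => sum + f.sizeMB, 0);
}

// Build one URL per file from a base URL (hypothetical; use wherever
// the files are actually hosted).
function fileUrls(baseUrl: string, files: ModelFile[]): string[] {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return files.map((f) => base + f.name);
}
```

For example, `fileUrls("https://example.com/models", MODEL_FILES)` yields five URLs that a page could fetch and cache before creating its inference sessions.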
Usage
See demo.html for a browser usage example.
License
Apache 2.0 (same as original VibeVoice model)