
VibeVoice TTS - ONNX for Browser

Microsoft's VibeVoice-Realtime-0.5B converted to ONNX for browser inference.

Architecture

VibeVoice uses a two-LLM architecture with a diffusion head and a vocoder. The conversion consists of five components:

Text LM (187 MB, INT8) - Encodes input text into hidden states
TTS LM (417 MB, INT8) - Generates speech latents autoregressively
Diffusion Head (25 MB, INT4) - Denoises the speech latents
Vocoder (339 MB, INT4) - Converts latents to 24 kHz audio
Acoustic Connector (0.5 MB, INT4) - Feeds generated speech back into the TTS LM
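
The data flow between the five components can be sketched roughly as follows. This is only an illustration of the ordering implied by the list above, not the code in demo.html: the `Stages` interface, the `synthesize` function, the fixed step count, and the stub stages are all hypothetical, and the real ONNX graphs' tensor names, shapes, and stopping condition are not documented in this card.

```typescript
// Hypothetical stage signatures. The real tensor names and shapes of the
// ONNX graphs are not given in this card; these are assumptions.
type Vec = Float32Array;

interface Stages {
  textLM(tokenIds: number[]): Vec;                   // text -> hidden states
  ttsLM(textHidden: Vec, feedback: Vec | null): Vec; // next-step conditioning
  diffusionHead(condition: Vec): Vec;                // denoised speech latent
  acousticConnector(latent: Vec): Vec;               // latent -> TTS LM input
  vocoder(latents: Vec[]): Vec;                      // latents -> audio samples
}

// Assumed loop: encode the text once, then generate one latent per step,
// feeding each latent back into the TTS LM via the acoustic connector.
function synthesize(stages: Stages, tokenIds: number[], maxSteps: number): Vec {
  const textHidden = stages.textLM(tokenIds);
  const latents: Vec[] = [];
  let feedback: Vec | null = null;
  for (let step = 0; step < maxSteps; step++) {
    const condition = stages.ttsLM(textHidden, feedback);
    const latent = stages.diffusionHead(condition);
    latents.push(latent);
    feedback = stages.acousticConnector(latent);
  }
  return stages.vocoder(latents);
}

// Stub stages with made-up dimensions, only to make the sketch runnable.
const LATENT_DIM = 4;
const SAMPLES_PER_LATENT = 8;
const stubStages: Stages = {
  textLM: (ids) => Float32Array.from(ids),
  ttsLM: (h, fb) => new Float32Array(LATENT_DIM).fill(h.length + (fb ? fb.length : 0)),
  diffusionHead: (c) => c.map((x) => x * 0.5),
  acousticConnector: (l) => l.slice(),
  vocoder: (ls) => new Float32Array(ls.length * SAMPLES_PER_LATENT),
};

const audio = synthesize(stubStages, [1, 2, 3], 5);
console.log(audio.length);
```

In a real page, each stub would presumably be replaced by a call into the corresponding ONNX session, and the loop would end on a model-defined stop signal rather than a fixed step count.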

Total: ~969 MB quantized (down from 3.8 GB)
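
As a quick sanity check, the total can be reproduced by summing the component sizes listed above. The snippet below is a small sketch; the `components` array is just the figures from this card, and the compression ratio assumes 1 GB = 1000 MB, which the card does not specify.

```typescript
// Component sizes in MB, copied from the list in this card.
const components: { name: string; sizeMB: number }[] = [
  { name: "Text LM", sizeMB: 187 },
  { name: "TTS LM", sizeMB: 417 },
  { name: "Diffusion Head", sizeMB: 25 },
  { name: "Vocoder", sizeMB: 339 },
  { name: "Acoustic Connector", sizeMB: 0.5 },
];

const totalMB = components.reduce((sum, c) => sum + c.sizeMB, 0);

// Assumption: 1 GB = 1000 MB for the original 3.8 GB figure.
const originalMB = 3.8 * 1000;
const ratio = originalMB / totalMB;

console.log(`Total: ${totalMB} MB, about ${ratio.toFixed(1)}x smaller than the original`);
```

The sum comes to 968.5 MB, which matches the rounded ~969 MB figure, or roughly a quarter of the original download size.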

See demo.html for a browser usage example.

License: Apache 2.0 (same as the original VibeVoice model)
