VibeVoice TTS - ONNX for Browser

Microsoft's VibeVoice-Realtime-0.5B converted to ONNX for browser inference.

Architecture

VibeVoice pairs two language models with a diffusion head and a vocoder. The export is split into five ONNX graphs:

  1. Text LM (187 MB INT8) - Encodes input text into hidden states
  2. TTS LM (417 MB INT8) - Generates speech latents autoregressively
  3. Diffusion Head (25 MB INT4) - Denoises latents
  4. Vocoder (339 MB INT4) - Converts latents to 24 kHz audio
  5. Acoustic Connector (0.5 MB INT4) - Feeds generated speech back into the TTS LM

Total: ~969 MB quantized (down from 3.8 GB)
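
The data flow between the five components can be sketched as follows. This is only an illustration of the order described above: the `Stages` interface, the `synthesize` function, the fixed `steps` count, and the use of plain `Float32Array` values are all assumptions made for the example, not the real input and output signatures of the exported graphs. Real ONNX sessions also run asynchronously, which is left out here for brevity.

```typescript
// Data-flow sketch only. Stage signatures, the fixed step count, and the
// Float32Array payloads are assumptions, not the exported graphs' real I/O.
type Tensor = Float32Array;

interface Stages {
  textLm(tokenIds: number[]): Tensor;                         // text -> hidden states
  ttsLm(textHidden: Tensor, feedback: Tensor | null): Tensor; // one autoregressive step
  diffusionHead(condition: Tensor): Tensor;                   // denoise into a speech latent
  acousticConnector(latent: Tensor): Tensor;                  // latent -> next TTS LM input
  vocoder(latents: Tensor[]): Tensor;                         // latents -> 24 kHz samples
}

function synthesize(stages: Stages, tokenIds: number[], steps: number): Tensor {
  // The Text LM runs once over the whole input.
  const textHidden = stages.textLm(tokenIds);

  const latents: Tensor[] = [];
  let feedback: Tensor | null = null;

  // Each step: TTS LM -> Diffusion Head -> Acoustic Connector (feedback).
  for (let i = 0; i < steps; i++) {
    const condition = stages.ttsLm(textHidden, feedback);
    const latent = stages.diffusionHead(condition);
    latents.push(latent);
    feedback = stages.acousticConnector(latent);
  }

  // The Vocoder turns the collected latents into audio samples.
  return stages.vocoder(latents);
}
```

A real implementation would probably stop on an end-of-speech condition rather than a fixed step count; see demo.html for the actual loop.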

Files

  • text_lm_int8.onnx - Text encoder
  • tts_lm_int8.onnx - Speech latent generator
  • diffusion_head_int4.onnx - Latent denoiser
  • vocoder_int4.onnx - Audio synthesizer
  • acoustic_connector_int4.onnx - Speech feedback
  • tokenizer.json - Qwen2 tokenizer
  • config.json - Model configuration
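
For scripting downloads or checking the size budget, the file list can be written as a small manifest. The sketch below takes file names and sizes from this card; the `ModelFile` type, the `fileUrl` and `totalSizeMB` helpers, and whatever base URL is passed in are hypothetical, since the repository path is not stated here.

```typescript
// File names and sizes come from this card. The helper names and the base
// URL argument are placeholders for illustration.
interface ModelFile {
  name: string;
  stage: string;
  sizeMB?: number; // listed only for the ONNX graphs
}

const FILES: ModelFile[] = [
  { name: "text_lm_int8.onnx", stage: "Text LM", sizeMB: 187 },
  { name: "tts_lm_int8.onnx", stage: "TTS LM", sizeMB: 417 },
  { name: "diffusion_head_int4.onnx", stage: "Diffusion Head", sizeMB: 25 },
  { name: "vocoder_int4.onnx", stage: "Vocoder", sizeMB: 339 },
  { name: "acoustic_connector_int4.onnx", stage: "Acoustic Connector", sizeMB: 0.5 },
  { name: "tokenizer.json", stage: "Tokenizer" },
  { name: "config.json", stage: "Configuration" },
];

// Join a base URL and a file name with exactly one slash between them.
function fileUrl(baseUrl: string, file: ModelFile): string {
  return baseUrl.replace(/\/+$/, "") + "/" + file.name;
}

// Sum the listed sizes to estimate the ONNX download in MB.
function totalSizeMB(files: ModelFile[]): number {
  return files.reduce((sum, f) => sum + (f.sizeMB ?? 0), 0);
}
```

Summing the five ONNX entries gives 968.5 MB, which matches the ~969 MB total quoted in the Architecture section.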

Usage

See demo.html for a browser usage example.
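
As a rough idea of what loading involves, the sketch below fetches the five ONNX graphs in parallel. It is not taken from demo.html. The `loadSessions` helper and the `baseUrl` argument are hypothetical, and the session factory is injected so the example does not depend on a particular runtime. With onnxruntime-web, the factory could be `(url) => ort.InferenceSession.create(url)`.

```typescript
// Sketch only: `createSession` is injected so the example stays runtime-agnostic.
// With onnxruntime-web it could be (url) => ort.InferenceSession.create(url).
const ONNX_FILES = [
  "text_lm_int8.onnx",
  "tts_lm_int8.onnx",
  "diffusion_head_int4.onnx",
  "vocoder_int4.onnx",
  "acoustic_connector_int4.onnx",
];

type CreateSession<S> = (url: string) => Promise<S>;

async function loadSessions<S>(
  baseUrl: string,
  createSession: CreateSession<S>,
): Promise<Map<string, S>> {
  // Strip trailing slashes so each URL has exactly one separator.
  const base = baseUrl.replace(/\/+$/, "");

  // Start all five loads at once and wait for every one to finish.
  const sessions = await Promise.all(
    ONNX_FILES.map((name) => createSession(`${base}/${name}`)),
  );

  // Index the sessions by file name for later lookup.
  const byName = new Map<string, S>();
  ONNX_FILES.forEach((name, i) => byName.set(name, sessions[i]));
  return byName;
}
```

Parallel loading keeps the wait short but holds all of the roughly 969 MB in flight at once; on memory-constrained devices, loading the graphs one after another may be the safer choice.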

License

Apache 2.0 (same as original VibeVoice model)
