MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning


๐Ÿ“– Paper ยท โญ GitHub ยท ๐Ÿ“Š Dataset ยท ๐Ÿค— Checkpoints

Key Features:

  • MMDuet2 is a Video MLLM for proactive interaction, which means that it can not only reply right after the user's turn, but also at any approprite and timely moment during the video playback.

  • With only a 3B model, MMDuet2 is lightweight and fast for real-time interaction.

  • Responses are neither too sparse nor too dense and repetitive, which was a common issue in previous works.

(Demo Video Here)

Usage

MMDuet2 is post-trained from Qwen2.5-VL-3B-Intruct with enhanced proactive interaction ability.

Though this model can be used in the exact same way as Qwen2.5-VL, we recommend using the demo and evaluation code provided in MMDuet2 Github Repo to experience its ability of actively generating replies while a video is playing online.

Downloads last month
20
Safetensors
Model size
4B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support