Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding Paper • 2601.10611 • Published 4 days ago • 23
Transition Matching Distillation for Fast Video Generation Paper • 2601.09881 • Published 5 days ago • 28
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking Paper • 2601.04720 • Published 11 days ago • 45
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization Paper • 2601.05432 • Published 11 days ago • 159
SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices Paper • 2601.08303 • Published 6 days ago • 14
Yume-1.5: A Text-Controlled Interactive World Generation Model Paper • 2512.22096 • Published 24 days ago • 59
VINO: A Unified Visual Generator with Interleaved OmniModal Context Paper • 2601.02358 • Published 14 days ago • 28
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer Paper • 2601.01425 • Published 15 days ago • 50
DreamStyle: A Unified Framework for Video Stylization Paper • 2601.02785 • Published 13 days ago • 23