Diffusion Transformers with Representation Autoencoders Paper • 2510.11690 • Published Oct 13, 2025 • 165
PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models Paper • 2507.17220 • Published Jul 23, 2025 • 1
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models Paper • 2507.23682 • Published Jul 31, 2025 • 23
Learning Getting-Up Policies for Real-World Humanoid Robots Paper • 2502.12152 • Published Feb 17, 2025 • 42
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20, 2025 • 156
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Paper • 2502.13063 • Published Feb 18, 2025 • 73
IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI Paper • 2411.00785 • Published Oct 17, 2024 • 8
Distributional Reinforcement Learning for Multi-Dimensional Reward Functions Paper • 2110.13578 • Published Oct 26, 2021 • 1
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation Paper • 2410.05363 • Published Oct 7, 2024 • 45
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting Paper • 2410.17856 • Published Oct 23, 2024 • 52
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model Paper • 2403.13064 • Published Mar 19, 2024 • 31
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data Paper • 2401.10891 • Published Jan 19, 2024 • 62
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 29
LLM in a flash: Efficient Large Language Model Inference with Limited Memory Paper • 2312.11514 • Published Dec 12, 2023 • 260
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Paper • 2312.10763 • Published Dec 17, 2023 • 19
Holodeck: Language Guided Generation of 3D Embodied AI Environments Paper • 2312.09067 • Published Dec 14, 2023 • 15