SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions Paper • 2506.23046 • Published Jun 29, 2025 • 1
AutoLibra: Agent Metric Induction from Open-Ended Feedback Paper • 2505.02820 • Published May 5, 2025 • 3
Mind the Gap! Static and Interactive Evaluations of Large Audio Models Paper • 2502.15919 • Published Feb 21, 2025 • 4
EgoNormia: Benchmarking Physical Social Norm Understanding Paper • 2502.20490 • Published Feb 27, 2025 • 6
Grounded Persuasive Language Generation for Automated Marketing Paper • 2502.16810 • Published Feb 24, 2025 • 13
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions Paper • 2409.16427 • Published Sep 24, 2024 • 1
What Are Tools Anyway? A Survey from the Language Model Perspective Paper • 2403.15452 • Published Mar 18, 2024
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks Paper • 2412.14161 • Published Dec 18, 2024 • 51
Improve Vision Language Model Chain-of-thought Reasoning Paper • 2410.16198 • Published Oct 21, 2024 • 26
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents Paper • 2310.11667 • Published Oct 18, 2023 • 4
A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest Paper • 2311.10614 • Published Nov 17, 2023
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward Paper • 2404.01258 • Published Apr 1, 2024 • 12
SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents Paper • 2403.08715 • Published Mar 13, 2024 • 21
WebArena: A Realistic Web Environment for Building Autonomous Agents Paper • 2307.13854 • Published Jul 25, 2023 • 25
Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue Paper • 2210.04443 • Published Oct 10, 2022
COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements Paper • 2306.01985 • Published Jun 3, 2023 • 1
FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation Paper • 1810.10147 • Published Oct 24, 2018