PatronusAI/Qwen3-4B-Instruct-2507-Tau2-32-GPT41Teach-notROnly-Merge-6e-5-Q4-32768-1445Jan22 4B • Updated 3 days ago • 5
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments Paper • 2510.01353 • Published Oct 1, 2025 • 3
TRAIL: Trace Reasoning and Agentic Issue Localization Paper • 2505.08638 • Published May 13, 2025 • 6
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models Paper • 2311.08370 • Published Nov 14, 2023
FinanceBench: A New Benchmark for Financial Question Answering Paper • 2311.11944 • Published Nov 20, 2023
GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking Paper • 2412.14140 • Published Dec 18, 2024 • 1
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning Paper • 2503.19193 • Published Mar 24, 2025 • 1