---
title: DOAB Metadata Extraction Evaluation
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# VLM vs Text: Extracting Metadata from Book Covers
Can Vision-Language Models extract metadata from book covers better than text extraction?
## TL;DR

Yes, significantly. On title extraction, VLMs average ~97% accuracy versus ~75% for text extraction; on full metadata extraction, ~80% versus ~71%.

## The Task

Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches (both sketched below):
- VLM (Vision): Send the cover image directly to a Vision-Language Model
- Text Extraction: Extract text from the image first, then send to an LLM
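For concreteness, here is a minimal sketch of the two request paths, assuming an OpenAI-compatible inference endpoint. The endpoint URL, model ids, prompt wording, and the `ocr_text` input are illustrative placeholders, not the exact configuration used in this evaluation:

```python
import base64
from openai import OpenAI

# Assumed local OpenAI-compatible server (e.g. vLLM); adjust for your setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = "Extract the book's title, subtitle, publisher, year, and ISBN as JSON."

def vlm_extract(image_path: str) -> str:
    """Vision path: send the cover image directly to a VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    return resp.choices[0].message.content

def text_extract(ocr_text: str) -> str:
    """Text path: OCR/extract the cover text first, then send only the text to an LLM."""
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",  # illustrative model id
        messages=[{"role": "user", "content": f"{PROMPT}\n\nCover text:\n{ocr_text}"}],
    )
    return resp.choices[0].message.content
```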
## Results

### Title Extraction (simpler task)
| Approach | Average Accuracy |
|---|---|
| VLM | 97% |
| Text | 75% |
### Full Metadata (title, subtitle, publisher, year, ISBN)
| Approach | Average Accuracy |
|---|---|
| VLM | 80% |
| Text | 71% |
VLMs consistently outperform text extraction across both tasks.
## Why VLMs Win
Book covers are visually structured:
- Titles appear in specific locations (usually top/center)
- Typography indicates importance (larger = more likely title)
- Layout provides context that pure text loses
Text extraction flattens this structure, losing valuable spatial information.
## Models Evaluated
VLM Models:
- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)
Text Models:
- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)
Interesting finding: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
## Interactive Features
- Task selector: Switch between Title Extraction and Full Metadata results
- Model size vs accuracy plot: Interactive scatter plot showing efficiency
- Leaderboard: Filter by VLM or Text approach
## Technical Details

- Dataset: DOAB Metadata Extraction (50 samples)
- Evaluation Framework: Inspect AI
- Scoring:
  - Title: flexible matching (handles case, subtitles, punctuation; sketched below)
  - Full Metadata: LLM-as-judge with partial credit
- Logs: `davanstrien/doab-title-extraction-evals`
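The "flexible matching" for titles amounts to normalizing both strings before comparing them. A minimal sketch of that idea (an illustration of the normalization, not the actual Inspect AI scorer used here):

```python
import re
import string

def normalize_title(title: str) -> str:
    """Lowercase, drop any subtitle after ':', strip punctuation, collapse whitespace."""
    title = title.lower().split(":")[0]
    title = title.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", title).strip()

def titles_match(predicted: str, reference: str) -> bool:
    """Flexible comparison: exact match after normalization, or one contains the other."""
    p, r = normalize_title(predicted), normalize_title(reference)
    return p == r or p in r or r in p
```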
## Replicate This

The evaluation logs are stored on Hugging Face and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
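Once loaded, `df` is a pandas DataFrame with one row per evaluation run. The exact columns depend on the Inspect AI version, so inspect them before aggregating; a hypothetical follow-up:

```python
# See what the log frame contains before aggregating
print(df.shape)
print(df.columns.tolist())

# Hypothetical aggregation: mean headline score per model
# (column names such as "model" and "score_headline_value" may differ by Inspect AI version)
if {"model", "score_headline_value"}.issubset(df.columns):
    print(df.groupby("model")["score_headline_value"].mean().sort_values(ascending=False))
```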
## Why This Matters for GLAM
Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:
- Catalog enhancement: Fill gaps in existing records
- Discovery: Make collections more searchable
- Quality assessment: Validate existing metadata
This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.
*Built with Marimo | Evaluation framework: Inspect AI*