---
title: DOAB Metadata Extraction Evaluation
emoji: πŸ“š
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# VLM vs Text: Extracting Metadata from Book Covers
**Can Vision-Language Models extract metadata from book covers better than text extraction?**
## TL;DR
**Yes, significantly.** VLMs achieve ~97% accuracy vs ~75% for text extraction on title extraction, and ~80% vs ~71% on full metadata extraction.
## The Task
Extracting metadata from digitized book covers is a common challenge for libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches (sketched in the example below):
1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
2. **Text Extraction**: Extract text from the image first, then send to an LLM
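A minimal sketch of the two approaches, assuming an OpenAI-compatible endpoint and the `openai` Python client. The endpoint URL, model names, and prompt wording here are illustrative, not the exact ones used in the evaluation.

```python
import base64

from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; substitute your own provider.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

PROMPT = "Extract the book title from this cover. Return only the title."


def vlm_extract(image_path: str, model: str = "Qwen/Qwen3-VL-8B-Instruct") -> str:
    """Approach 1: send the cover image directly to a VLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def text_extract(cover_text: str, model: str = "Qwen/Qwen3-4B-Instruct-2507") -> str:
    """Approach 2: extract text from the image first, then send it to an LLM."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"{PROMPT}\n\nCover text:\n{cover_text}"}],
    )
    return response.choices[0].message.content
```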
## Results
### Title Extraction (simpler task)
| Approach | Average Accuracy |
|----------|-----------------|
| **VLM** | **97%** |
| Text | 75% |
### Full Metadata (title, subtitle, publisher, year, ISBN)
| Approach | Average Accuracy |
|----------|-----------------|
| **VLM** | **80%** |
| Text | 71% |
VLMs consistently outperform text extraction across both tasks.
### Why VLMs Win
Book covers are **visually structured**:
- Titles appear in specific locations (usually top/center)
- Typography indicates importance (larger = more likely title)
- Layout provides context that pure text loses
Text extraction flattens this structure, losing valuable spatial information.
## Models Evaluated
**VLM Models**:
- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)
**Text Models**:
- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)
**Interesting finding**: Qwen3-VL-8B reaches 94% even when given only the extracted text (no image), suggesting it is simply better at this task regardless of modality.
## Interactive Features
- **Task selector**: Switch between Title Extraction and Full Metadata results
- **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
- **Leaderboard**: Filter by VLM or Text approach
## Technical Details
- **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
- **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
- **Scoring**:
  - Title: Flexible matching (handles case, subtitles, punctuation); a sketch of this matching appears below
  - Full Metadata: LLM-as-judge with partial credit
- **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
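A minimal sketch of the kind of flexible title matching described under **Scoring**: normalize case, punctuation, and whitespace, and accept the main title when the reference includes a subtitle. This illustrates the idea rather than reproducing the exact scorer used in the evaluation.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def titles_match(predicted: str, reference: str) -> bool:
    """Flexible comparison: exact match after normalization, or main-title match."""
    pred, ref = normalize(predicted), normalize(reference)
    if pred == ref:
        return True
    # Accept the main title when the reference has a "Title: Subtitle" form.
    return pred == normalize(reference.split(":")[0])
```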
## Replicate This
The evaluation logs are stored on the Hugging Face Hub and can be loaded directly:
```python
# Load the evaluation logs directly from the Hugging Face Hub.
from inspect_ai.analysis import evals_df

# Returns a pandas DataFrame with one row per evaluation log.
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
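From there, results can be summarized with pandas. The column names used below (`model`, `score_headline_value`) are assumptions about the `evals_df` schema rather than confirmed names; check `df.columns` against your copy of the logs first.

```python
# Hypothetical summary: mean headline score per model.
# Column names are assumptions; inspect df.columns to confirm them.
print(df.columns.tolist())

summary = (
    df.groupby("model")["score_headline_value"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)
```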
## Why This Matters for GLAM
Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:
- **Catalog enhancement**: Fill gaps in existing records
- **Discovery**: Make collections more searchable
- **Quality assessment**: Validate existing metadata
This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.
---
*Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*