---
title: DOAB Metadata Extraction Evaluation
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
|
|
|
|
|
# VLM vs Text: Extracting Metadata from Book Covers

**Can Vision-Language Models extract metadata from book covers better than text extraction?**

## TL;DR

**Yes, significantly.** VLMs reach ~97% accuracy on title extraction versus ~75% for text extraction, and ~80% versus ~71% on full metadata extraction.
|
|
|
|
|
## The Task

Extracting metadata from digitized book covers is a common challenge for libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches, sketched in code below:

1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
2. **Text Extraction**: Extract text from the image first, then send it to an LLM
|
|
|
|
|
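To make the two pipelines concrete, here is a minimal sketch using `huggingface_hub`'s `InferenceClient`. The client setup, prompt, and model IDs are illustrative assumptions, not the exact evaluation harness (which uses Inspect AI; see Technical Details below):

```python
# Minimal sketch of the two approaches. The prompt, model IDs, and client
# setup are illustrative assumptions, not the exact evaluation harness.
from huggingface_hub import InferenceClient

client = InferenceClient()

PROMPT = "Extract the title, subtitle, publisher, year, and ISBN from this book cover."


def vlm_extract(image_url: str, model: str = "Qwen/Qwen3-VL-8B-Instruct") -> str:
    """Approach 1: send the cover image directly to a vision-language model."""
    response = client.chat_completion(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    return response.choices[0].message.content


def text_extract(cover_text: str, model: str = "Qwen/Qwen3-4B-Instruct-2507") -> str:
    """Approach 2: text has already been pulled from the image (e.g. via OCR);
    send that flattened text to a text-only LLM."""
    response = client.chat_completion(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\nCover text:\n{cover_text}"}],
    )
    return response.choices[0].message.content
```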
## Results

### Title Extraction (simpler task)

| Approach | Average Accuracy |
|----------|------------------|
| **VLM**  | **97%**          |
| Text     | 75%              |

### Full Metadata (title, subtitle, publisher, year, ISBN)

| Approach | Average Accuracy |
|----------|------------------|
| **VLM**  | **80%**          |
| Text     | 71%              |

VLMs consistently outperform text extraction across both tasks.
|
|
|
|
|
### Why VLMs Win

Book covers are **visually structured**:

- Titles appear in specific locations (usually top/center)
- Typography indicates importance (larger = more likely title)
- Layout provides context that pure text loses

Text extraction flattens this structure, losing valuable spatial information.
|
|
|
|
|
## Models Evaluated

**VLM Models**:

- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)

**Text Models**:

- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)

**Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
|
|
|
|
|
## Interactive Features

- **Task selector**: Switch between Title Extraction and Full Metadata results
- **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
- **Leaderboard**: Filter by VLM or Text approach
|
|
|
|
|
## Technical Details

- **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
- **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
- **Scoring**:
  - Title: Flexible matching (handles case, subtitles, punctuation); see the sketch after this list
  - Full Metadata: LLM-as-judge with partial credit
- **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
|
|
|
|
|
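As an illustration of what flexible title matching involves, here is a simplified sketch: normalize case, punctuation, and whitespace, and accept an answer that omits the subtitle. The actual scorer used in the eval may differ in detail.

```python
import string

# Simplified sketch of flexible title matching; the actual Inspect AI
# scorer may differ in detail.

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def titles_match(predicted: str, target: str) -> bool:
    # Accept an exact match after normalization, or a match against the
    # main title (the part before a ":"-separated subtitle).
    main_title = target.split(":")[0]
    return normalize(predicted) in {normalize(target), normalize(main_title)}

# Example: matches despite different casing and a missing subtitle.
assert titles_match("the open book", "The Open Book: A History of Open Access")
```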
## Replicate This

The evaluation logs are stored on Hugging Face and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

# Load the evaluation logs straight from the Hugging Face dataset repo.
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
|
|
|
|
|
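`evals_df` returns a pandas DataFrame with one row per evaluation run. Exact column names depend on the Inspect AI version, so a sensible first step is to inspect what's available:

```python
# One row per evaluation run; check the available columns before
# aggregating, since exact names vary across Inspect AI versions.
print(df.shape)
print(df.columns.tolist())
```

From there, grouping by model and averaging headline scores should reproduce the leaderboard tables above.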
## Why This Matters for GLAM

Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:

- **Catalog enhancement**: Fill gaps in existing records
- **Discovery**: Make collections more searchable
- **Quality assessment**: Validate existing metadata

This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.
|
|
|
|
|
---

*Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
|
|
|