---
title: DOAB Metadata Extraction Evaluation
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# VLM vs Text: Extracting Metadata from Book Covers

**Can Vision-Language Models extract metadata from book covers better than a text-extraction pipeline?**

## TL;DR

**Yes, significantly.** On title extraction, VLMs average ~97% accuracy versus ~75% for text extraction; on full metadata extraction, ~80% versus ~71%.

## The Task

Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (the GLAM sector: galleries, libraries, archives, and museums). We compared two approaches:

1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
2. **Text Extraction**: Extract text from the image first, then send the text to an LLM

## Results

### Title Extraction (the simpler task)

| Approach | Average Accuracy |
|----------|------------------|
| **VLM**  | **97%**          |
| Text     | 75%              |

### Full Metadata (title, subtitle, publisher, year, ISBN)

| Approach | Average Accuracy |
|----------|------------------|
| **VLM**  | **80%**          |
| Text     | 71%              |

VLMs consistently outperform text extraction across both tasks.

### Why VLMs Win

Book covers are **visually structured**:

- Titles appear in specific locations (usually top/center)
- Typography indicates importance (larger text is more likely to be the title)
- Layout provides context that pure text loses

Text extraction flattens this structure, losing valuable spatial information.

## Models Evaluated

**VLM Models**:

- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)

**Text Models**:

- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)

**Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it is generally better at this task regardless of modality.

## Interactive Features

- **Task selector**: Switch between Title Extraction and Full Metadata results
- **Model size vs. accuracy plot**: Interactive scatter plot showing efficiency
- **Leaderboard**: Filter by VLM or Text approach

## Technical Details

- **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
- **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
- **Scoring**:
  - Title: Flexible matching (handles case, subtitles, punctuation); a sketch of this style of matching appears in the appendix at the end of this page
  - Full Metadata: LLM-as-judge with partial credit
- **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)

## Replicate This

The evaluation logs are stored on Hugging Face and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```

A sketch of summarizing these logs per model is included in the appendix at the end of this page.

## Why This Matters for GLAM

Libraries and archives hold millions of digitized documents whose metadata is incomplete or missing. VLMs offer a promising approach for:

- **Catalog enhancement**: Fill gaps in existing records
- **Discovery**: Make collections more searchable
- **Quality assessment**: Validate existing metadata

This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.

---

*Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
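
## Appendix: Code Sketches

### Flexible title matching

The title scorer described under "Technical Details" tolerates differences in case, punctuation, and the presence or absence of a subtitle. The function below is a minimal, hypothetical sketch of that idea, not the scorer actually used in the evaluation; `normalize` and `titles_match` are illustrative names.

```python
# Hypothetical sketch of "flexible" title comparison: normalize case,
# accents, punctuation, and whitespace, and accept a match against either
# the full title or the part before a colon-separated subtitle.
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def titles_match(predicted: str, reference: str) -> bool:
    """True if prediction and reference agree on the full title or the
    pre-subtitle portion (text before the first colon)."""
    pred_forms = {normalize(predicted), normalize(predicted.split(":")[0])}
    ref_forms = {normalize(reference), normalize(reference.split(":")[0])}
    return bool(pred_forms & ref_forms)


print(titles_match("The open book: a history of reading", "The Open Book"))  # True
```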
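
### Summarizing the evaluation logs

Building on the `evals_df` call shown under "Replicate This", the snippet below sketches how the logs could be turned into a small per-model leaderboard with pandas. The column names `model` and `score_headline_value` are assumptions about the exported schema; check `df.columns` and adjust accordingly.

```python
# Hypothetical follow-up to the "Replicate This" snippet: rank models by
# their mean headline score. The column names "model" and
# "score_headline_value" are assumptions about the evals_df schema --
# inspect df.columns and adjust before relying on this.
from inspect_ai.analysis import evals_df

df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
print(df.columns.tolist())  # confirm the actual column names first

leaderboard = (
    df.groupby("model")["score_headline_value"]  # assumed column names
    .mean()
    .sort_values(ascending=False)
)
print(leaderboard)
```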