---
title: DOAB Metadata Extraction Evaluation
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# VLM vs Text: Extracting Metadata from Book Covers

Can Vision-Language Models extract metadata from book covers better than text extraction?

## TL;DR

**Yes, significantly.** VLMs achieve ~97% accuracy on title extraction versus ~75% for text extraction, and ~80% versus ~71% on full metadata extraction.

## The Task

Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches, sketched in code below:

1. **VLM (Vision):** Send the cover image directly to a Vision-Language Model
2. **Text Extraction:** Extract text from the image first, then send it to an LLM
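
Roughly, the two pipelines look like the following. This is a minimal sketch assuming an OpenAI-compatible endpoint (e.g. a local vLLM server); the endpoint, model names, and prompt are illustrative, not the evaluation's actual Inspect AI harness:

```python
import base64
from openai import OpenAI

# Assumes an OpenAI-compatible inference endpoint; adjust to taste.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = "Extract the book title from this cover. Reply with the title only."


def vlm_extract(image_path: str) -> str:
    """Approach 1: send the cover image directly to a VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def text_extract(ocr_text: str) -> str:
    """Approach 2: OCR the cover first, then send only the text to an LLM."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",  # illustrative model name
        messages=[{"role": "user",
                   "content": f"{PROMPT}\n\nCover text:\n{ocr_text}"}],
    )
    return response.choices[0].message.content
```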

## Results

### Title Extraction (simpler task)

| Approach | Average Accuracy |
|----------|------------------|
| VLM      | 97%              |
| Text     | 75%              |

### Full Metadata (title, subtitle, publisher, year, ISBN)

| Approach | Average Accuracy |
|----------|------------------|
| VLM      | 80%              |
| Text     | 71%              |
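
For concreteness, the record targeted by the full-metadata task could be modeled as below. The class and field definitions are an illustrative assumption mirroring the fields named in the heading, not the evaluation's actual schema:

```python
from pydantic import BaseModel

# Hypothetical record shape for the full-metadata task; the
# evaluation's actual schema may differ.
class BookMetadata(BaseModel):
    title: str
    subtitle: str | None = None
    publisher: str | None = None
    year: int | None = None
    isbn: str | None = None
```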

VLMs consistently outperform text extraction across both tasks.

## Why VLMs Win

Book covers are visually structured:

- Titles appear in predictable locations (usually top or center)
- Typography signals importance (larger type is more likely the title)
- Layout provides context that plain text loses

Text extraction flattens this structure, losing valuable spatial information.

## Models Evaluated

**VLM models:**

- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)

**Text models:**

- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)

**Interesting finding:** Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it is generally better at this task regardless of modality.

## Interactive Features

- **Task selector:** Switch between Title Extraction and Full Metadata results
- **Model size vs. accuracy plot:** Interactive scatter plot showing efficiency
- **Leaderboard:** Filter by VLM or Text approach

## Technical Details

### Replicate This

The evaluation logs are stored on Hugging Face and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

# Load the evaluation logs from the Hugging Face dataset into a dataframe
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
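
`evals_df` returns a pandas DataFrame, so per-model scores can be summarized directly. The column names below (`model`, `score_headline_value`) are an assumption about the log schema; check `df.columns` against your download:

```python
# Column names are an assumption; run df.columns to confirm.
print(
    df.groupby("model")["score_headline_value"]
      .mean()
      .sort_values(ascending=False)
)
```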

## Why This Matters for GLAM

Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:

- **Catalog enhancement:** Fill gaps in existing records
- **Discovery:** Make collections more searchable
- **Quality assessment:** Validate existing metadata

This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.


Built with Marimo | Evaluation framework: Inspect AI