---
title: DOAB Metadata Extraction Evaluation
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# VLM vs Text: Extracting Metadata from Book Covers

Can Vision-Language Models extract metadata from book covers better than text extraction?

## TL;DR

**Yes, significantly.** VLMs achieve ~97% accuracy on title extraction versus ~75% for text extraction, and ~80% versus ~71% on full metadata extraction.

## The Task

Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches, sketched in code below:

1. **VLM (Vision):** Send the cover image directly to a Vision-Language Model
2. **Text Extraction:** Extract text from the image first, then send it to an LLM
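
Roughly, the two pipelines look like the following. This is a minimal sketch assuming an OpenAI-compatible endpoint (e.g. a local vLLM server); the endpoint, model names, and prompt are illustrative, not the evaluation's actual Inspect AI harness:

```python
import base64
from openai import OpenAI

# Assumes an OpenAI-compatible inference endpoint; adjust to taste.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = "Extract the book title from this cover. Reply with the title only."


def vlm_extract(image_path: str) -> str:
    """Approach 1: send the cover image directly to a VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def text_extract(ocr_text: str) -> str:
    """Approach 2: OCR the cover first, then send only the text to an LLM."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",  # illustrative model name
        messages=[{"role": "user",
                   "content": f"{PROMPT}\n\nCover text:\n{ocr_text}"}],
    )
    return response.choices[0].message.content
```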

## Results

### Title Extraction (simpler task)

| Approach | Average Accuracy |
|----------|------------------|
| VLM      | 97%              |
| Text     | 75%              |

### Full Metadata (title, subtitle, publisher, year, ISBN)

| Approach | Average Accuracy |
|----------|------------------|
| VLM      | 80%              |
| Text     | 71%              |
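
For concreteness, the record targeted by the full-metadata task could be modeled as below. The class and field definitions are an illustrative assumption mirroring the fields named in the heading, not the evaluation's actual schema:

```python
from pydantic import BaseModel

# Hypothetical record shape for the full-metadata task; the
# evaluation's actual schema may differ.
class BookMetadata(BaseModel):
    title: str
    subtitle: str | None = None
    publisher: str | None = None
    year: int | None = None
    isbn: str | None = None
```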

VLMs consistently outperform text extraction across both tasks.

## Why VLMs Win

Book covers are visually structured:

- Titles appear in predictable locations (usually top or center)
- Typography signals importance (larger type is more likely the title)
- Layout provides context that plain text loses

Text extraction flattens this structure, losing valuable spatial information.

## Models Evaluated

**VLM models:**

- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)

**Text models:**

- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)

**Interesting finding:** Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it is generally better at this task regardless of modality.

## Interactive Features

- **Task selector:** Switch between Title Extraction and Full Metadata results
- **Model size vs. accuracy plot:** Interactive scatter plot showing efficiency
- **Leaderboard:** Filter by VLM or Text approach

## Technical Details

### Replicate This

The evaluation logs are stored on Hugging Face and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

# Load the evaluation logs from the Hugging Face dataset into a dataframe
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
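
`evals_df` returns a pandas DataFrame, so per-model scores can be summarized directly. The column names below (`model`, `score_headline_value`) are an assumption about the log schema; check `df.columns` against your download:

```python
# Column names are an assumption; run df.columns to confirm.
print(
    df.groupby("model")["score_headline_value"]
      .mean()
      .sort_values(ascending=False)
)
```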

## Why This Matters for GLAM

Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:

- **Catalog enhancement:** Fill gaps in existing records
- **Discovery:** Make collections more searchable
- **Quality assessment:** Validate existing metadata

This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.


Built with Marimo | Evaluation framework: Inspect AI