---
title: DOAB Metadata Extraction Evaluation
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
|
|
|
|
|
# VLM vs Text: Extracting Metadata from Book Covers

**Can Vision-Language Models extract metadata from book covers better than text extraction?**

## TL;DR

**Yes, significantly.** VLMs reach ~97% accuracy on title extraction versus ~75% for text extraction, and ~80% versus ~71% on full metadata extraction.
|
|
|
|
|
## The Task

Extracting metadata from digitized book covers is a common challenge for libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches, sketched in code below:

1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
2. **Text Extraction**: Extract text from the image first, then send it to an LLM
|
|
|
|
|
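To make the two pipelines concrete, here is a minimal sketch using `huggingface_hub`'s `InferenceClient`. The client setup, prompt, and model IDs are illustrative assumptions, not the exact evaluation harness (which uses Inspect AI; see Technical Details below):

```python
# Minimal sketch of the two approaches. The prompt, model IDs, and client
# setup are illustrative assumptions, not the exact evaluation harness.
from huggingface_hub import InferenceClient

client = InferenceClient()

PROMPT = "Extract the title, subtitle, publisher, year, and ISBN from this book cover."


def vlm_extract(image_url: str, model: str = "Qwen/Qwen3-VL-8B-Instruct") -> str:
    """Approach 1: send the cover image directly to a vision-language model."""
    response = client.chat_completion(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    return response.choices[0].message.content


def text_extract(cover_text: str, model: str = "Qwen/Qwen3-4B-Instruct-2507") -> str:
    """Approach 2: text has already been pulled from the image (e.g. via OCR);
    send that flattened text to a text-only LLM."""
    response = client.chat_completion(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\nCover text:\n{cover_text}"}],
    )
    return response.choices[0].message.content
```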
## Results

### Title Extraction (simpler task)

| Approach | Average Accuracy |
|----------|------------------|
| **VLM**  | **97%**          |
| Text     | 75%              |

### Full Metadata (title, subtitle, publisher, year, ISBN)

| Approach | Average Accuracy |
|----------|------------------|
| **VLM**  | **80%**          |
| Text     | 71%              |

VLMs consistently outperform text extraction across both tasks.
|
|
|
|
|
### Why VLMs Win

Book covers are **visually structured**:

- Titles appear in specific locations (usually top/center)
- Typography indicates importance (larger = more likely title)
- Layout provides context that pure text loses

Text extraction flattens this structure, losing valuable spatial information.
|
|
|
|
|
## Models Evaluated

**VLM Models**:

- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)

**Text Models**:

- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)

**Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
|
|
|
|
|
## Interactive Features

- **Task selector**: Switch between Title Extraction and Full Metadata results
- **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
- **Leaderboard**: Filter by VLM or Text approach
|
|
|
|
|
## Technical Details

- **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
- **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
- **Scoring**:
  - Title: Flexible matching (handles case, subtitles, punctuation); see the sketch after this list
  - Full Metadata: LLM-as-judge with partial credit
- **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
|
|
|
|
|
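As an illustration of what flexible title matching involves, here is a simplified sketch: normalize case, punctuation, and whitespace, and accept an answer that omits the subtitle. The actual scorer used in the eval may differ in detail.

```python
import string

# Simplified sketch of flexible title matching; the actual Inspect AI
# scorer may differ in detail.

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def titles_match(predicted: str, target: str) -> bool:
    # Accept an exact match after normalization, or a match against the
    # main title (the part before a ":"-separated subtitle).
    main_title = target.split(":")[0]
    return normalize(predicted) in {normalize(target), normalize(main_title)}

# Example: matches despite different casing and a missing subtitle.
assert titles_match("the open book", "The Open Book: A History of Open Access")
```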
## Replicate This

The evaluation logs are stored on Hugging Face and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

# Load the evaluation logs straight from the Hugging Face dataset repo.
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
|
|
|
|
|
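`evals_df` returns a pandas DataFrame with one row per evaluation run. Exact column names depend on the Inspect AI version, so a sensible first step is to inspect what's available:

```python
# One row per evaluation run; check the available columns before
# aggregating, since exact names vary across Inspect AI versions.
print(df.shape)
print(df.columns.tolist())
```

From there, grouping by model and averaging headline scores should reproduce the leaderboard tables above.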
## Why This Matters for GLAM

Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:

- **Catalog enhancement**: Fill gaps in existing records
- **Discovery**: Make collections more searchable
- **Quality assessment**: Validate existing metadata

This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.
|
|
|
|
|
---

*Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
|
|
|