davanstrien (HF Staff) committed
Commit d925140 · verified · 1 Parent(s): 2528d39

Upload app.py with huggingface_hub
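A commit message in this form is what `huggingface_hub` writes by default when a file is pushed programmatically. A minimal sketch of such an upload, assuming a placeholder Space id (`your-username/your-space` is illustrative, not the repo behind this commit):

```python
# Minimal sketch: push app.py to a Space with huggingface_hub.
# The repo_id is a placeholder; the default commit message for
# upload_file is "Upload <path_in_repo> with huggingface_hub".
from huggingface_hub import HfApi

api = HfApi()  # authenticates via your cached login or HF_TOKEN
api.upload_file(
    path_or_fileobj="app.py",            # local file to upload
    path_in_repo="app.py",               # destination path inside the repo
    repo_id="your-username/your-space",  # placeholder Space id
    repo_type="space",
)
```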

Files changed (1)
  1. app.py +131 -52
app.py CHANGED
@@ -14,16 +14,22 @@ def _():
  def _(mo):
  mo.md(
  """
- # DOAB Metadata Extraction: VLM vs Text

- **Can Vision-Language Models extract metadata from book covers better than text extraction?**

- This dashboard compares VLM (vision) and text-based approaches for extracting metadata from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).

- - **Title Extraction**: Extract just the book title (simpler task)
- - **Full Metadata**: Extract title, subtitle, publisher, year, ISBN (harder task)

- 📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
+ # VLM vs Text: Extracting Metadata from Book Covers

+ **The Task**: Libraries and archives have millions of digitized book covers where metadata is incomplete or missing. Can we use AI to automatically extract titles and other metadata?

+ **The Question**: Should we use Vision-Language Models (VLMs) that "see" the cover image, or extract text first and send it to a standard LLM?

+ **The Answer**: VLMs win decisively for this task.

+ ---
+
+ This evaluation uses the [DOAB (Directory of Open Access Books)](https://huggingface.co/datasets/biglam/doab-metadata-extraction) dataset of academic book covers. We compare two approaches:
+
+ | Approach | How it works |
+ |----------|-------------|
+ | **VLM** | Send the cover image directly to a Vision-Language Model |
+ | **Text** | Extract text from image first (OCR), then send to an LLM |
  """
  )
  return
@@ -60,19 +66,38 @@ def _(evals_df, mo):
  # Convert score to percentage
  df_raw["accuracy"] = df_raw["score_headline_value"] * 100

- # Parameter sizes (manual mapping)
- param_sizes = {
- "hf-inference-providers/Qwen/Qwen3-VL-8B-Instruct": 8,
- "hf-inference-providers/Qwen/Qwen3-VL-30B-A3B-Thinking": 30,
- "hf-inference-providers/zai-org/GLM-4.6V-Flash": 9,
- "hf-inference-providers/openai/gpt-oss-20b": 20,
- "hf-inference-providers/Qwen/Qwen3-4B-Instruct-2507": 4,
- "hf-inference-providers/allenai/Olmo-3-7B-Instruct": 7,
+ # Parameter sizes and URLs (manual mapping)
+ model_info = {
+ "hf-inference-providers/Qwen/Qwen3-VL-8B-Instruct": {
+ "params": 8,
+ "url": "https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct"
+ },
+ "hf-inference-providers/Qwen/Qwen3-VL-30B-A3B-Thinking": {
+ "params": 30,
+ "url": "https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking"
+ },
+ "hf-inference-providers/zai-org/GLM-4.6V-Flash": {
+ "params": 9,
+ "url": "https://huggingface.co/zai-org/GLM-4.6V-Flash"
+ },
+ "hf-inference-providers/openai/gpt-oss-20b": {
+ "params": 20,
+ "url": "https://huggingface.co/openai/gpt-oss-20b"
+ },
+ "hf-inference-providers/Qwen/Qwen3-4B-Instruct-2507": {
+ "params": 4,
+ "url": "https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507"
+ },
+ "hf-inference-providers/allenai/Olmo-3-7B-Instruct": {
+ "params": 7,
+ "url": "https://huggingface.co/allenai/Olmo-3-7B-Instruct"
+ },
  }
- df_raw["param_size_b"] = df_raw["model"].map(param_sizes)
+ df_raw["param_size_b"] = df_raw["model"].apply(lambda x: model_info.get(x, {}).get("params"))
+ df_raw["model_url"] = df_raw["model"].apply(lambda x: model_info.get(x, {}).get("url", ""))

  df_raw
- return df_raw, get_task_category, param_sizes
+ return df_raw, get_task_category, model_info


  @app.cell
@@ -81,7 +106,7 @@ def _(df_raw, mo):
  task_selector = mo.ui.dropdown(
  options=["Title Extraction", "Full Metadata"],
  value="Title Extraction",
- label="Task",
+ label="Select task",
  )
  return (task_selector,)

@@ -102,7 +127,7 @@ def _(df_raw, mo, task_selector):
  task_selector,
  mo.md(
  f"""
- ## Key Results: {task_selector.value}
+ ## Results: {task_selector.value}

  | Approach | Average Accuracy |
  |----------|-----------------|
@@ -127,7 +152,6 @@ def _(mo):
  @app.cell
  def _(alt, df, mo):
  # Interactive scatter plot: model size vs accuracy
- # Labels removed - hover for model details
  chart = alt.Chart(df).mark_circle(size=200, opacity=0.8).encode(
  x=alt.X("param_size_b:Q", title="Parameters (Billions)", scale=alt.Scale(zero=False)),
  y=alt.Y("accuracy:Q", title="Accuracy (%)", scale=alt.Scale(domain=[50, 105])),
@@ -178,63 +202,118 @@ def _(approach_filter, df, mo):
  else:
  filtered_df = df[df["approach"] == approach_filter.value]

- # Create leaderboard
- leaderboard = (
- filtered_df[["model_short", "approach", "param_size_b", "accuracy"]]
- .sort_values("accuracy", ascending=False)
- .reset_index(drop=True)
- )
- leaderboard.columns = ["Model", "Approach", "Params (B)", "Accuracy (%)"]
- leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)
+ # Create leaderboard with clickable model links
+ leaderboard_data = []
+ for _, row in filtered_df.sort_values("accuracy", ascending=False).iterrows():
+ model_link = f"[{row['model_short']}]({row['model_url']})" if row['model_url'] else row['model_short']
+ leaderboard_data.append({
+ "Model": model_link,
+ "Approach": row["approach"],
+ "Params (B)": row["param_size_b"],
+ "Accuracy (%)": round(row["accuracy"], 1),
+ })
+
+ leaderboard_md = "| Model | Approach | Params (B) | Accuracy (%) |\n|-------|----------|------------|-------------|\n"
+ for row in leaderboard_data:
+ leaderboard_md += f"| {row['Model']} | {row['Approach']} | {row['Params (B)']} | {row['Accuracy (%)']} |\n"

  mo.vstack([
  approach_filter,
- mo.ui.table(leaderboard, selection=None),
+ mo.md(leaderboard_md),
  ])
- return filtered_df, leaderboard
+ return filtered_df, leaderboard_data, leaderboard_md
+
+
+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## Why VLMs Win
+
+ Book covers are **visually structured** documents:
+
+ - **Spatial layout**: Titles appear in specific locations (usually top/center)
+ - **Typography**: Larger text = more important (likely the title)
+ - **Visual hierarchy**: Authors, publishers, and other info have distinct styling
+
+ When you extract text first (OCR), you **flatten this structure** into a linear sequence. The model loses the visual cues that make it obvious what's a title vs. a subtitle vs. author name.
+
+ **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it has strong general text understanding - but it still does better (98%) when given the actual images.
+ """
+ )
+ return


  @app.cell
  def _(mo):
  mo.md(
  """
- ## About This Evaluation

- **Task**: Extract metadata from academic book cover images

- **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples

- **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)

- **Scoring**:
- - *Title Extraction*: Custom flexible matching (case-insensitive, handles subtitles)
+ ## The Dataset
+
+ We use the [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) dataset - academic book covers from the Directory of Open Access Books.
+
+ Each sample has:
+ - Cover image (rendered from PDF)
+ - Pre-extracted page text
+ - Ground truth metadata (title, subtitle, publisher, year, ISBN)
+ """
+ )
+ return
+
+
+ @app.cell
+ def _(mo):
+ # Dataset viewer iframe
+ mo.Html(
+ """
+ <iframe
+ src="https://huggingface.co/datasets/biglam/doab-metadata-extraction/embed/viewer/default/train"
+ frameborder="0"
+ width="100%"
+ height="400px"
+ ></iframe>
+ """
+ )
+ return
+
+
+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## Methodology

+ **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/) - an open-source framework for evaluating language models

+ **Sample Size**: 50 books (randomly sampled with fixed seed for reproducibility)
+
+ **Scoring Methods**:
+ - *Title Extraction*: Custom flexible matching scorer
+ - Case-insensitive comparison
+ - Accepts if ground truth is substring of prediction (handles subtitles)
+ - More robust than exact match for this task
  - *Full Metadata*: LLM-as-judge with partial credit

- ### Models Evaluated

- **VLM (Vision-Language Models)**:
- - Qwen3-VL-8B-Instruct (8B params)
- - Qwen3-VL-30B-A3B-Thinking (30B params)
- - GLM-4.6V-Flash (9B params)

- **Text Extraction** (OCR → LLM):
- - gpt-oss-20b (20B params)
- - Qwen3-4B-Instruct-2507 (4B params)
- - Olmo-3-7B-Instruct (7B params)
- - Qwen3-VL-8B-Instruct as text-only LLM (8B params)

- ### Why VLMs Win

- Book covers are **visually structured**:
- - Titles appear in specific locations (usually top/center)
- - Typography indicates importance (larger = more likely title)
- - Layout provides context that pure text loses

- Text extraction flattens this structure, losing valuable spatial information.

+ - Correct (1.0): Title + year + at least one other field
+ - Partial (0.5): Some fields correct
+ - Incorrect (0.0): Mostly wrong

+ **Models via**: [HuggingFace Inference Providers](https://huggingface.co/docs/inference-providers)

+ ---

+ ## Replicate This

+ The evaluation logs are stored on HuggingFace and can be loaded directly:

+ ```python
+ from inspect_ai.analysis import evals_df

+ df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
+ ```

  ---

- *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
+ *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/) | Dataset: [biglam/doab-metadata-extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction)*
  """
  )
  return