davanstrien (HF Staff) committed
Commit a08c225 · verified · 1 Parent(s): 7455617

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +32 -15
  2. app.py +139 -41
  3. requirements.txt +1 -0
README.md CHANGED
Before:

@@ -1,5 +1,5 @@
  ---
- title: DOAB Title Extraction Evaluation
  emoji: 📚
  colorFrom: blue
  colorTo: purple
@@ -8,29 +8,38 @@ pinned: false
  license: mit
  ---

- # VLM vs Text: Extracting Titles from Book Covers

  **Can Vision-Language Models extract metadata from book covers better than text extraction?**

  ## TL;DR

- **Yes, significantly.** VLMs achieve ~97% accuracy vs ~70% for text extraction on the DOAB academic book cover dataset.

  ## The Task

- Extracting titles from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (GLAM sector). We compared two approaches:

  1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
  2. **Text Extraction**: Extract text from the image first, then send to an LLM

  ## Results

  | Approach | Average Accuracy |
  |----------|-----------------|
  | **VLM** | **97%** |
- | Text | 70% |

- VLMs outperform text extraction by ~27 percentage points.

  ### Why VLMs Win

@@ -43,23 +52,31 @@ Text extraction flattens this structure, losing valuable spatial information.

  ## Models Evaluated

- **VLM Models** (96-98% accuracy):
- - Qwen3-VL-8B-Instruct
- - Qwen3-VL-30B-A3B-Thinking
- - GLM-4.6V-Flash

- **Text Models** (68-70% accuracy):
- - gpt-oss-20b
- - Qwen3-4B-Instruct-2507
- - Olmo-3-7B-Instruct

  **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.

  ## Technical Details

  - **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
  - **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
- - **Scoring**: Flexible title matching (handles case, subtitles, punctuation)
  - **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)

  ## Replicate This
 
After:

  ---
+ title: DOAB Metadata Extraction Evaluation
  emoji: 📚
  colorFrom: blue
  colorTo: purple

  license: mit
  ---

+ # VLM vs Text: Extracting Metadata from Book Covers

  **Can Vision-Language Models extract metadata from book covers better than text extraction?**

  ## TL;DR

+ **Yes, significantly.** VLMs achieve ~97% accuracy vs ~75% for text extraction on title extraction, and ~80% vs ~71% on full metadata extraction.

  ## The Task

+ Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (GLAM sector). We compared two approaches (both sketched in code below):

  1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
  2. **Text Extraction**: Extract text from the image first, then send to an LLM
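To make the two approaches above concrete, here is a minimal, illustrative sketch of both calls using `huggingface_hub.InferenceClient` (this is not the Space's actual evaluation code; the model IDs, prompt wording, image URL, and the `extracted_text` variable are assumptions):

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # reads HF_TOKEN from the environment
prompt = "Extract the title of this book cover. Return only the title."

# 1. VLM (Vision): send the cover image directly to a vision-language model
vlm_out = client.chat_completion(
    model="Qwen/Qwen3-VL-8B-Instruct",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.org/cover.jpg"}},
        {"type": "text", "text": prompt},
    ]}],
    max_tokens=128,
)

# 2. Text Extraction: text is extracted from the image beforehand, then sent to a text LLM
extracted_text = "..."  # text previously extracted from the cover image
text_out = client.chat_completion(
    model="Qwen/Qwen3-4B-Instruct-2507",
    messages=[{"role": "user", "content": f"{prompt}\n\nCover text:\n{extracted_text}"}],
    max_tokens=128,
)

print(vlm_out.choices[0].message.content)
print(text_out.choices[0].message.content)
```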

  ## Results

+ ### Title Extraction (simpler task)
+
  | Approach | Average Accuracy |
  |----------|-----------------|
  | **VLM** | **97%** |
+ | Text | 75% |
+
+ ### Full Metadata (title, subtitle, publisher, year, ISBN)

+ | Approach | Average Accuracy |
+ |----------|-----------------|
+ | **VLM** | **80%** |
+ | Text | 71% |
+
+ VLMs consistently outperform text extraction across both tasks.

  ### Why VLMs Win

  ## Models Evaluated

+ **VLM Models**:
+ - Qwen3-VL-8B-Instruct (8B params)
+ - Qwen3-VL-30B-A3B-Thinking (30B params)
+ - GLM-4.6V-Flash (9B params)

+ **Text Models**:
+ - gpt-oss-20b (20B params)
+ - Qwen3-4B-Instruct-2507 (4B params)
+ - Olmo-3-7B-Instruct (7B params)

  **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.

+ ## Interactive Features
+
+ - **Task selector**: Switch between Title Extraction and Full Metadata results
+ - **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
+ - **Leaderboard**: Filter by VLM or Text approach
+
  ## Technical Details

  - **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
  - **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
+ - **Scoring**:
+   - Title: Flexible matching (handles case, subtitles, punctuation); one possible implementation is sketched below
+   - Full Metadata: LLM-as-judge with partial credit
  - **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
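The scorer implementations are not part of this diff. Purely as a rough illustration, a "flexible" title matcher of the kind described under Scoring might normalise case and punctuation and accept a match on the main title when the target carries a subtitle, along these lines (function and field names are assumptions, not the Space's actual code):

```python
import re
import string

def normalize(text: str) -> str:
    # lowercase, strip punctuation, collapse whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def flexible_title_match(predicted: str, target: str) -> bool:
    pred, tgt = normalize(predicted), normalize(target)
    if pred == tgt:
        return True
    # accept the main title when either string includes a subtitle after ":"
    pred_main = normalize(predicted.split(":")[0])
    tgt_main = normalize(target.split(":")[0])
    return pred == tgt_main or pred_main == tgt
```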

  ## Replicate This
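The body of this section is truncated in the diff view. Purely as an illustrative sketch (not the Space's actual task code), a text-side Inspect AI task over the DOAB dataset might be declared like this; the dataset field names (`extracted_text`, `title`) are assumptions, and the real evals use a custom flexible-matching scorer rather than plain `match()`:

```python
# Hypothetical sketch of an Inspect AI task for DOAB title extraction (text approach).
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate

def record_to_sample(record) -> Sample:
    # Field names are assumptions; check the dataset card for the real schema.
    return Sample(
        input=f"Extract the book title from this cover text:\n\n{record['extracted_text']}",
        target=record["title"],
    )

@task
def doab_title_from_text() -> Task:
    return Task(
        dataset=hf_dataset("biglam/doab-metadata-extraction", split="train",
                           sample_fields=record_to_sample),
        solver=generate(),
        # The Space uses a custom flexible title matcher; for the full-metadata task an
        # LLM judge such as model_graded_qa(partial_credit=True) would play that role.
        scorer=match(),
    )

# Example run against one of the evaluated text models:
# eval(doab_title_from_text(), model="hf-inference-providers/Qwen/Qwen3-4B-Instruct-2507")
```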
app.py CHANGED
Before:

@@ -14,11 +14,14 @@ def _():
  def _(mo):
  mo.md(
  """
- # DOAB Title Extraction: VLM vs Text

- **Can Vision-Language Models extract book titles from covers better than text extraction?**

- This dashboard compares VLM (vision) and text-based approaches for extracting titles from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).

  📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
  """
@@ -29,50 +32,134 @@ def _(mo):
  @app.cell
  def _():
  import pandas as pd
  from inspect_ai.analysis import evals_df
- return evals_df, pd


  @app.cell
  def _(evals_df):
  # Load evaluation results from HuggingFace
- df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)

- # Add approach column
- df["approach"] = df["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")

- # Shorten model names
- df["model_short"] = df["model"].apply(lambda x: x.split("/")[-1])

  # Convert score to percentage
- df["accuracy"] = df["score_headline_value"] * 100
- return (df,)


  @app.cell
- def _(df, mo):
  # Calculate summary stats
  vlm_avg = df[df["approach"] == "VLM"]["accuracy"].mean()
  text_avg = df[df["approach"] == "Text"]["accuracy"].mean()
  diff = vlm_avg - text_avg

- mo.md(
- f"""
- ## Key Results

- | Approach | Average Accuracy |
- |----------|-----------------|
- | **VLM (Vision)** | **{vlm_avg:.0f}%** |
- | Text Extraction | {text_avg:.0f}% |

- **VLM advantage: +{diff:.0f} percentage points**

- VLMs significantly outperform text extraction for book cover metadata.
- This is because book covers are **visually structured** - titles appear in specific
- locations with distinctive formatting that VLMs can recognize.
- """
  )
- return diff, text_avg, vlm_avg


  @app.cell
@@ -83,7 +170,7 @@ def _(mo):

  @app.cell
  def _(df, mo):
- # Filter selector
  approach_filter = mo.ui.dropdown(
  options=["All", "VLM", "Text"],
  value="All",
@@ -93,7 +180,7 @@ def _(df, mo):


  @app.cell
- def _(approach_filter, df, mo, pd):
  # Filter data based on selection
  if approach_filter.value == "All":
  filtered_df = df
@@ -102,11 +189,11 @@ def _(approach_filter, df, mo, pd):

  # Create leaderboard
  leaderboard = (
- filtered_df[["model_short", "approach", "accuracy"]]
  .sort_values("accuracy", ascending=False)
  .reset_index(drop=True)
  )
- leaderboard.columns = ["Model", "Approach", "Accuracy (%)"]
  leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)

  mo.vstack([
@@ -117,35 +204,46 @@ def _(approach_filter, df, mo, pd):


  @app.cell
- def _(df, mo):
  mo.md(
  """
  ## About This Evaluation

- **Task**: Extract the title from academic book cover images

  **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples

  **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)

- **Scoring**: Flexible title matching (case-insensitive, handles subtitles)

  ### Models Evaluated

  **VLM (Vision-Language Models)**:
- - Qwen3-VL-8B-Instruct
- - Qwen3-VL-30B-A3B-Thinking
- - GLM-4.6V-Flash

- **Text Extraction**:
- - gpt-oss-20b
- - Qwen3-4B-Instruct-2507
- - Olmo-3-7B-Instruct
- - Qwen3-VL-8B-Instruct (used as text-only LLM)

  ---

- *Built with [Marimo](https://marimo.io) | Evaluation logs on [HuggingFace](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)*
  """
  )
  return
After:

  def _(mo):
  mo.md(
  """
+ # DOAB Metadata Extraction: VLM vs Text

+ **Can Vision-Language Models extract metadata from book covers better than text extraction?**

+ This dashboard compares VLM (vision) and text-based approaches for extracting metadata from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).
+
+ - **Title Extraction**: Extract just the book title (simpler task)
+ - **Full Metadata**: Extract title, subtitle, publisher, year, ISBN (harder task)

  📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
  """

  @app.cell
  def _():
  import pandas as pd
+ import altair as alt
  from inspect_ai.analysis import evals_df
+ return alt, evals_df, pd


  @app.cell
  def _(evals_df):
  # Load evaluation results from HuggingFace
+ df_raw = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)
+
+ # Add metadata columns
+ df_raw["approach"] = df_raw["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
+ df_raw["model_short"] = df_raw["model"].apply(lambda x: x.split("/")[-1])

+ # Determine task category
+ def get_task_category(task_name):
+ if "llm_judge" in task_name:
+ return "Full Metadata"
+ else:
+ return "Title Extraction"

+ df_raw["task_category"] = df_raw["task_name"].apply(get_task_category)

  # Convert score to percentage
+ df_raw["accuracy"] = df_raw["score_headline_value"] * 100
+
+ # Parameter sizes (manual mapping)
+ param_sizes = {
+ "hf-inference-providers/Qwen/Qwen3-VL-8B-Instruct": 8,
+ "hf-inference-providers/Qwen/Qwen3-VL-30B-A3B-Thinking": 30,
+ "hf-inference-providers/zai-org/GLM-4.6V-Flash": 9,
+ "hf-inference-providers/openai/gpt-oss-20b": 20,
+ "hf-inference-providers/Qwen/Qwen3-4B-Instruct-2507": 4,
+ "hf-inference-providers/allenai/Olmo-3-7B-Instruct": 7,
+ }
+ df_raw["param_size_b"] = df_raw["model"].map(param_sizes)
+
+ df_raw
+ return df_raw, get_task_category, param_sizes


  @app.cell
+ def _(df_raw, mo):
+ # Task selector
+ task_selector = mo.ui.dropdown(
+ options=["Title Extraction", "Full Metadata"],
+ value="Title Extraction",
+ label="Task",
+ )
+ return (task_selector,)
+
+
+ @app.cell
+ def _(df_raw, mo, task_selector):
+ # Filter by selected task
+ df = df_raw[df_raw["task_category"] == task_selector.value].copy()
+
  # Calculate summary stats
  vlm_avg = df[df["approach"] == "VLM"]["accuracy"].mean()
  text_avg = df[df["approach"] == "Text"]["accuracy"].mean()
  diff = vlm_avg - text_avg

+ task_desc = "book titles" if task_selector.value == "Title Extraction" else "full metadata (title, subtitle, publisher, year, ISBN)"
+
+ mo.vstack([
+ task_selector,
+ mo.md(
+ f"""
+ ## Key Results: {task_selector.value}

+ | Approach | Average Accuracy |
+ |----------|-----------------|
+ | **VLM (Vision)** | **{vlm_avg:.0f}%** |
+ | Text Extraction | {text_avg:.0f}% |

+ **VLM advantage: +{diff:.0f} percentage points**

+ VLMs {'significantly ' if diff > 15 else ''}outperform text extraction for extracting {task_desc} from book covers.
+ """
+ )
+ ])
+ return df, diff, task_desc, text_avg, vlm_avg
+
+
+ @app.cell
+ def _(mo):
+ mo.md("## Model Size vs Accuracy")
+ return
+
+
+ @app.cell
+ def _(alt, df, mo):
+ # Interactive scatter plot: model size vs accuracy
+ scatter = alt.Chart(df).mark_circle(size=150).encode(
+ x=alt.X("param_size_b:Q", title="Parameters (Billions)", scale=alt.Scale(zero=False)),
+ y=alt.Y("accuracy:Q", title="Accuracy (%)", scale=alt.Scale(domain=[50, 105])),
+ color=alt.Color("approach:N", title="Approach", scale=alt.Scale(domain=["VLM", "Text"], range=["#1f77b4", "#ff7f0e"])),
+ tooltip=[
+ alt.Tooltip("model_short:N", title="Model"),
+ alt.Tooltip("approach:N", title="Approach"),
+ alt.Tooltip("param_size_b:Q", title="Params (B)"),
+ alt.Tooltip("accuracy:Q", title="Accuracy", format=".1f"),
+ ],
+ ).properties(
+ width=500,
+ height=300,
+ ).interactive()
+
+ # Add text labels
+ text = alt.Chart(df).mark_text(
+ align="left",
+ baseline="middle",
+ dx=10,
+ fontSize=11,
+ ).encode(
+ x="param_size_b:Q",
+ y="accuracy:Q",
+ text="model_short:N",
+ color=alt.Color("approach:N", scale=alt.Scale(domain=["VLM", "Text"], range=["#1f77b4", "#ff7f0e"])),
  )
+
+ chart = (scatter + text).configure_axis(
+ labelFontSize=12,
+ titleFontSize=14,
+ )
+
+ mo.ui.altair_chart(chart)
+ return chart, scatter, text


  @app.cell

  @app.cell
  def _(df, mo):
+ # Filter selector for approach
  approach_filter = mo.ui.dropdown(
  options=["All", "VLM", "Text"],
  value="All",


  @app.cell
+ def _(approach_filter, df, mo):
  # Filter data based on selection
  if approach_filter.value == "All":
  filtered_df = df

  # Create leaderboard
  leaderboard = (
+ filtered_df[["model_short", "approach", "param_size_b", "accuracy"]]
  .sort_values("accuracy", ascending=False)
  .reset_index(drop=True)
  )
+ leaderboard.columns = ["Model", "Approach", "Params (B)", "Accuracy (%)"]
  leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)

  mo.vstack([


  @app.cell
+ def _(mo):
  mo.md(
  """
  ## About This Evaluation

+ **Task**: Extract metadata from academic book cover images

  **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples

  **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)

+ **Scoring**:
+ - *Title Extraction*: Custom flexible matching (case-insensitive, handles subtitles)
+ - *Full Metadata*: LLM-as-judge with partial credit

  ### Models Evaluated

  **VLM (Vision-Language Models)**:
+ - Qwen3-VL-8B-Instruct (8B params)
+ - Qwen3-VL-30B-A3B-Thinking (30B params)
+ - GLM-4.6V-Flash (9B params)
+
+ **Text Extraction** (OCR → LLM):
+ - gpt-oss-20b (20B params)
+ - Qwen3-4B-Instruct-2507 (4B params)
+ - Olmo-3-7B-Instruct (7B params)
+ - Qwen3-VL-8B-Instruct as text-only LLM (8B params)
+
+ ### Why VLMs Win
+
+ Book covers are **visually structured**:
+ - Titles appear in specific locations (usually top/center)
+ - Typography indicates importance (larger = more likely title)
+ - Layout provides context that pure text loses

+ Text extraction flattens this structure, losing valuable spatial information.

  ---

+ *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
  """
  )
  return
requirements.txt CHANGED
@@ -3,3 +3,4 @@ pandas>=2.0.0
  inspect-ai>=0.3.0
  huggingface-hub>=0.20.0
  pyarrow>=14.0.0
+ altair>=5.0.0
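To run the dashboard locally with these dependencies, install them and launch the notebook app, e.g. `pip install -r requirements.txt` followed by `marimo run app.py` (this assumes `marimo` itself is pinned in the requirements.txt lines not shown in this hunk, as it is for marimo-based Spaces).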