Upload folder using huggingface_hub
- README.md +32 -15
- app.py +139 -41
- requirements.txt +1 -0
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: DOAB
+title: DOAB Metadata Extraction Evaluation
 emoji: 📚
 colorFrom: blue
 colorTo: purple
@@ -8,29 +8,38 @@ pinned: false
 license: mit
 ---
 
-# VLM vs Text: Extracting
+# VLM vs Text: Extracting Metadata from Book Covers
 
 **Can Vision-Language Models extract metadata from book covers better than text extraction?**
 
 ## TL;DR
 
-**Yes, significantly.** VLMs achieve ~97% accuracy vs ~
+**Yes, significantly.** VLMs achieve ~97% accuracy vs ~75% for text extraction on title extraction, and ~80% vs ~71% on full metadata extraction.
 
 ## The Task
 
-Extracting
+Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (GLAM sector). We compared two approaches:
 
 1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
 2. **Text Extraction**: Extract text from the image first, then send to an LLM
 
 ## Results
 
+### Title Extraction (simpler task)
+
 | Approach | Average Accuracy |
 |----------|-----------------|
 | **VLM** | **97%** |
-| Text |
+| Text | 75% |
+
+### Full Metadata (title, subtitle, publisher, year, ISBN)
 
+| Approach | Average Accuracy |
+|----------|-----------------|
+| **VLM** | **80%** |
+| Text | 71% |
+
+VLMs consistently outperform text extraction across both tasks.
 
 ### Why VLMs Win
 
@@ -43,23 +52,31 @@ Text extraction flattens this structure, losing valuable spatial information.
 
 ## Models Evaluated
 
-**VLM Models
-- Qwen3-VL-8B-Instruct
-- Qwen3-VL-30B-A3B-Thinking
-- GLM-4.6V-Flash
+**VLM Models**:
+- Qwen3-VL-8B-Instruct (8B params)
+- Qwen3-VL-30B-A3B-Thinking (30B params)
+- GLM-4.6V-Flash (9B params)
 
-**Text Models
-- gpt-oss-20b
-- Qwen3-4B-Instruct-2507
-- Olmo-3-7B-Instruct
+**Text Models**:
+- gpt-oss-20b (20B params)
+- Qwen3-4B-Instruct-2507 (4B params)
+- Olmo-3-7B-Instruct (7B params)
 
 **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
 
+## Interactive Features
+
+- **Task selector**: Switch between Title Extraction and Full Metadata results
+- **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
+- **Leaderboard**: Filter by VLM or Text approach
+
 ## Technical Details
 
 - **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
 - **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
-- **Scoring**:
+- **Scoring**:
+  - Title: Flexible matching (handles case, subtitles, punctuation)
+  - Full Metadata: LLM-as-judge with partial credit
 - **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
 
 ## Replicate This
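The README describes the title scorer only at a high level ("flexible matching: handles case, subtitles, punctuation"); the scorer itself is not part of this commit. A minimal sketch of what that kind of matching could look like, with hypothetical helper names (`normalize_title`, `titles_match`) that are not taken from the repo:

```python
# Illustrative sketch only -- the actual scorer used in the evals is not shown in this diff.
import re
import string


def normalize_title(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def titles_match(prediction: str, target: str) -> bool:
    """Accept a prediction matching the full target title, or just the
    main title before a subtitle separator (':' or an en dash)."""
    main_title = re.split(r"[:\u2013]", target)[0]
    return normalize_title(prediction) in {
        normalize_title(target),
        normalize_title(main_title),
    }


# Example: a prediction that omits the subtitle still counts as correct
assert titles_match(
    "open access and the humanities",
    "Open Access and the Humanities: Contexts, Controversies and the Future",
)
```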
app.py CHANGED

@@ -14,11 +14,14 @@ def _():
 def _(mo):
     mo.md(
         """
-        # DOAB
+        # DOAB Metadata Extraction: VLM vs Text
 
-        **Can Vision-Language Models extract
+        **Can Vision-Language Models extract metadata from book covers better than text extraction?**
 
-        This dashboard compares VLM (vision) and text-based approaches for extracting
+        This dashboard compares VLM (vision) and text-based approaches for extracting metadata from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).
+
+        - **Title Extraction**: Extract just the book title (simpler task)
+        - **Full Metadata**: Extract title, subtitle, publisher, year, ISBN (harder task)
 
         📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
         """
@@ -29,50 +32,134 @@ def _(mo):
 @app.cell
 def _():
     import pandas as pd
+    import altair as alt
     from inspect_ai.analysis import evals_df
-    return evals_df, pd
+    return alt, evals_df, pd
 
 
 @app.cell
 def _(evals_df):
     # Load evaluation results from HuggingFace
+    df_raw = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)
+
+    # Add metadata columns
+    df_raw["approach"] = df_raw["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
+    df_raw["model_short"] = df_raw["model"].apply(lambda x: x.split("/")[-1])
 
-    #
+    # Determine task category
+    def get_task_category(task_name):
+        if "llm_judge" in task_name:
+            return "Full Metadata"
+        else:
+            return "Title Extraction"
 
-    df["model_short"] = df["model"].apply(lambda x: x.split("/")[-1])
+    df_raw["task_category"] = df_raw["task_name"].apply(get_task_category)
 
     # Convert score to percentage
+    df_raw["accuracy"] = df_raw["score_headline_value"] * 100
+
+    # Parameter sizes (manual mapping)
+    param_sizes = {
+        "hf-inference-providers/Qwen/Qwen3-VL-8B-Instruct": 8,
+        "hf-inference-providers/Qwen/Qwen3-VL-30B-A3B-Thinking": 30,
+        "hf-inference-providers/zai-org/GLM-4.6V-Flash": 9,
+        "hf-inference-providers/openai/gpt-oss-20b": 20,
+        "hf-inference-providers/Qwen/Qwen3-4B-Instruct-2507": 4,
+        "hf-inference-providers/allenai/Olmo-3-7B-Instruct": 7,
+    }
+    df_raw["param_size_b"] = df_raw["model"].map(param_sizes)
+
+    df_raw
+    return df_raw, get_task_category, param_sizes
 
 
 @app.cell
-def _(
+def _(df_raw, mo):
+    # Task selector
+    task_selector = mo.ui.dropdown(
+        options=["Title Extraction", "Full Metadata"],
+        value="Title Extraction",
+        label="Task",
+    )
+    return (task_selector,)
+
+
+@app.cell
+def _(df_raw, mo, task_selector):
+    # Filter by selected task
+    df = df_raw[df_raw["task_category"] == task_selector.value].copy()
+
     # Calculate summary stats
     vlm_avg = df[df["approach"] == "VLM"]["accuracy"].mean()
     text_avg = df[df["approach"] == "Text"]["accuracy"].mean()
     diff = vlm_avg - text_avg
 
+    task_desc = "book titles" if task_selector.value == "Title Extraction" else "full metadata (title, subtitle, publisher, year, ISBN)"
+
+    mo.vstack([
+        task_selector,
+        mo.md(
+            f"""
+            ## Key Results: {task_selector.value}
 
+            | Approach | Average Accuracy |
+            |----------|-----------------|
+            | **VLM (Vision)** | **{vlm_avg:.0f}%** |
+            | Text Extraction | {text_avg:.0f}% |
 
+            **VLM advantage: +{diff:.0f} percentage points**
 
+            VLMs {'significantly ' if diff > 15 else ''}outperform text extraction for extracting {task_desc} from book covers.
+            """
+        )
+    ])
+    return df, diff, task_desc, text_avg, vlm_avg
+
+
+@app.cell
+def _(mo):
+    mo.md("## Model Size vs Accuracy")
+    return
+
+
+@app.cell
+def _(alt, df, mo):
+    # Interactive scatter plot: model size vs accuracy
+    scatter = alt.Chart(df).mark_circle(size=150).encode(
+        x=alt.X("param_size_b:Q", title="Parameters (Billions)", scale=alt.Scale(zero=False)),
+        y=alt.Y("accuracy:Q", title="Accuracy (%)", scale=alt.Scale(domain=[50, 105])),
+        color=alt.Color("approach:N", title="Approach", scale=alt.Scale(domain=["VLM", "Text"], range=["#1f77b4", "#ff7f0e"])),
+        tooltip=[
+            alt.Tooltip("model_short:N", title="Model"),
+            alt.Tooltip("approach:N", title="Approach"),
+            alt.Tooltip("param_size_b:Q", title="Params (B)"),
+            alt.Tooltip("accuracy:Q", title="Accuracy", format=".1f"),
+        ],
+    ).properties(
+        width=500,
+        height=300,
+    ).interactive()
+
+    # Add text labels
+    text = alt.Chart(df).mark_text(
+        align="left",
+        baseline="middle",
+        dx=10,
+        fontSize=11,
+    ).encode(
+        x="param_size_b:Q",
+        y="accuracy:Q",
+        text="model_short:N",
+        color=alt.Color("approach:N", scale=alt.Scale(domain=["VLM", "Text"], range=["#1f77b4", "#ff7f0e"])),
     )
+
+    chart = (scatter + text).configure_axis(
+        labelFontSize=12,
+        titleFontSize=14,
+    )
+
+    mo.ui.altair_chart(chart)
+    return chart, scatter, text
 
 
 @app.cell
@@ -83,7 +170,7 @@ def _(mo):
 
 @app.cell
 def _(df, mo):
-    # Filter selector
+    # Filter selector for approach
     approach_filter = mo.ui.dropdown(
         options=["All", "VLM", "Text"],
         value="All",
@@ -93,7 +180,7 @@ def _(df, mo):
 
 
 @app.cell
-def _(approach_filter, df, mo, pd):
+def _(approach_filter, df, mo):
     # Filter data based on selection
     if approach_filter.value == "All":
         filtered_df = df
@@ -102,11 +189,11 @@ def _(approach_filter, df, mo, pd):
 
     # Create leaderboard
    leaderboard = (
-        filtered_df[["model_short", "approach", "accuracy"]]
+        filtered_df[["model_short", "approach", "param_size_b", "accuracy"]]
         .sort_values("accuracy", ascending=False)
         .reset_index(drop=True)
     )
-    leaderboard.columns = ["Model", "Approach", "Accuracy (%)"]
+    leaderboard.columns = ["Model", "Approach", "Params (B)", "Accuracy (%)"]
     leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)
 
     mo.vstack([
@@ -117,35 +204,46 @@ def _(approach_filter, df, mo, pd):
 
 
 @app.cell
-def _(
+def _(mo):
     mo.md(
         """
         ## About This Evaluation
 
-        **Task**: Extract
+        **Task**: Extract metadata from academic book cover images
 
         **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples
 
         **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
 
-        **Scoring**:
+        **Scoring**:
+        - *Title Extraction*: Custom flexible matching (case-insensitive, handles subtitles)
+        - *Full Metadata*: LLM-as-judge with partial credit
 
         ### Models Evaluated
 
         **VLM (Vision-Language Models)**:
-        - Qwen3-VL-8B-Instruct
-        - Qwen3-VL-30B-A3B-Thinking
-        - GLM-4.6V-Flash
+        - Qwen3-VL-8B-Instruct (8B params)
+        - Qwen3-VL-30B-A3B-Thinking (30B params)
+        - GLM-4.6V-Flash (9B params)
+
+        **Text Extraction** (OCR → LLM):
+        - gpt-oss-20b (20B params)
+        - Qwen3-4B-Instruct-2507 (4B params)
+        - Olmo-3-7B-Instruct (7B params)
+        - Qwen3-VL-8B-Instruct as text-only LLM (8B params)
+
+        ### Why VLMs Win
+
+        Book covers are **visually structured**:
+        - Titles appear in specific locations (usually top/center)
+        - Typography indicates importance (larger = more likely title)
+        - Layout provides context that pure text loses
 
-        - gpt-oss-20b
-        - Qwen3-4B-Instruct-2507
-        - Olmo-3-7B-Instruct
-        - Qwen3-VL-8B-Instruct (used as text-only LLM)
+        Text extraction flattens this structure, losing valuable spatial information.
 
         ---
 
-        *Built with [Marimo](https://marimo.io) | Evaluation
+        *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
         """
     )
     return
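The new app.py cells above do all of their wrangling inside Marimo; the same summary can be reproduced as a plain script. A sketch mirroring that code, assuming the published log dataset exposes the same columns app.py relies on (`task_name`, `model`, `score_headline_value`):

```python
# Standalone sketch mirroring the data wrangling in app.py above; column names
# are assumed to match what the app uses.
from inspect_ai.analysis import evals_df

# Load the published evaluation logs from the Hugging Face Hub
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)

# Derive the same columns as app.py
df["approach"] = df["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
df["task_category"] = df["task_name"].apply(
    lambda x: "Full Metadata" if "llm_judge" in x else "Title Extraction"
)
df["accuracy"] = df["score_headline_value"] * 100

# Average accuracy per approach for each task, as in the dashboard's summary table
summary = (
    df.groupby(["task_category", "approach"])["accuracy"]
    .mean()
    .round(1)
    .unstack("approach")
)
print(summary)
```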
requirements.txt CHANGED

@@ -3,3 +3,4 @@ pandas>=2.0.0
 inspect-ai>=0.3.0
 huggingface-hub>=0.20.0
 pyarrow>=14.0.0
+altair>=5.0.0