Upload folder using huggingface_hub
- README.md +32 -15
- app.py +139 -41
- requirements.txt +1 -0
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: DOAB
+title: DOAB Metadata Extraction Evaluation
 emoji: 📚
 colorFrom: blue
 colorTo: purple
@@ -8,29 +8,38 @@ pinned: false
 license: mit
 ---
 
-# VLM vs Text: Extracting
+# VLM vs Text: Extracting Metadata from Book Covers
 
 **Can Vision-Language Models extract metadata from book covers better than text extraction?**
 
 ## TL;DR
 
-**Yes, significantly.** VLMs achieve ~97% accuracy vs ~
+**Yes, significantly.** VLMs achieve ~97% accuracy vs ~75% for text extraction on title extraction, and ~80% vs ~71% on full metadata extraction.
 
 ## The Task
 
-Extracting
+Extracting metadata from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (GLAM sector). We compared two approaches:
 
 1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
 2. **Text Extraction**: Extract text from the image first, then send to an LLM
 
 ## Results
 
+### Title Extraction (simpler task)
+
 | Approach | Average Accuracy |
 |----------|-----------------|
 | **VLM** | **97%** |
-| Text |
+| Text | 75% |
+
+### Full Metadata (title, subtitle, publisher, year, ISBN)
 
+| Approach | Average Accuracy |
+|----------|-----------------|
+| **VLM** | **80%** |
+| Text | 71% |
+
+VLMs consistently outperform text extraction across both tasks.
 
 ### Why VLMs Win
 
@@ -43,23 +52,31 @@ Text extraction flattens this structure, losing valuable spatial information.
 
 ## Models Evaluated
 
-**VLM Models
-- Qwen3-VL-8B-Instruct
-- Qwen3-VL-30B-A3B-Thinking
-- GLM-4.6V-Flash
+**VLM Models**:
+- Qwen3-VL-8B-Instruct (8B params)
+- Qwen3-VL-30B-A3B-Thinking (30B params)
+- GLM-4.6V-Flash (9B params)
 
-**Text Models
-- gpt-oss-20b
-- Qwen3-4B-Instruct-2507
-- Olmo-3-7B-Instruct
+**Text Models**:
+- gpt-oss-20b (20B params)
+- Qwen3-4B-Instruct-2507 (4B params)
+- Olmo-3-7B-Instruct (7B params)
 
 **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
 
+## Interactive Features
+
+- **Task selector**: Switch between Title Extraction and Full Metadata results
+- **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
+- **Leaderboard**: Filter by VLM or Text approach
+
 ## Technical Details
 
 - **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
 - **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
-- **Scoring**:
+- **Scoring**:
+  - Title: Flexible matching (handles case, subtitles, punctuation)
+  - Full Metadata: LLM-as-judge with partial credit
 - **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
 
 ## Replicate This
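The README describes the title scorer only at a high level ("flexible matching: handles case, subtitles, punctuation"); the scorer itself is not part of this commit. A minimal sketch of what that kind of matching could look like, with hypothetical helper names (`normalize_title`, `titles_match`) that are not taken from the repo:

```python
# Illustrative sketch only -- the actual scorer used in the evals is not shown in this diff.
import re
import string


def normalize_title(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def titles_match(prediction: str, target: str) -> bool:
    """Accept a prediction matching the full target title, or just the
    main title before a subtitle separator (':' or an en dash)."""
    main_title = re.split(r"[:\u2013]", target)[0]
    return normalize_title(prediction) in {
        normalize_title(target),
        normalize_title(main_title),
    }


# Example: a prediction that omits the subtitle still counts as correct
assert titles_match(
    "open access and the humanities",
    "Open Access and the Humanities: Contexts, Controversies and the Future",
)
```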
app.py CHANGED

@@ -14,11 +14,14 @@ def _():
 def _(mo):
     mo.md(
         """
-        # DOAB
+        # DOAB Metadata Extraction: VLM vs Text
 
-        **Can Vision-Language Models extract
+        **Can Vision-Language Models extract metadata from book covers better than text extraction?**
 
-        This dashboard compares VLM (vision) and text-based approaches for extracting
+        This dashboard compares VLM (vision) and text-based approaches for extracting metadata from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).
+
+        - **Title Extraction**: Extract just the book title (simpler task)
+        - **Full Metadata**: Extract title, subtitle, publisher, year, ISBN (harder task)
 
         📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
         """
@@ -29,50 +32,134 @@ def _(mo):
 @app.cell
 def _():
     import pandas as pd
+    import altair as alt
     from inspect_ai.analysis import evals_df
-    return evals_df, pd
+    return alt, evals_df, pd
 
 
 @app.cell
 def _(evals_df):
     # Load evaluation results from HuggingFace
+    df_raw = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)
+
+    # Add metadata columns
+    df_raw["approach"] = df_raw["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
+    df_raw["model_short"] = df_raw["model"].apply(lambda x: x.split("/")[-1])
 
-    #
+    # Determine task category
+    def get_task_category(task_name):
+        if "llm_judge" in task_name:
+            return "Full Metadata"
+        else:
+            return "Title Extraction"
 
-    df["model_short"] = df["model"].apply(lambda x: x.split("/")[-1])
+    df_raw["task_category"] = df_raw["task_name"].apply(get_task_category)
 
     # Convert score to percentage
+    df_raw["accuracy"] = df_raw["score_headline_value"] * 100
+
+    # Parameter sizes (manual mapping)
+    param_sizes = {
+        "hf-inference-providers/Qwen/Qwen3-VL-8B-Instruct": 8,
+        "hf-inference-providers/Qwen/Qwen3-VL-30B-A3B-Thinking": 30,
+        "hf-inference-providers/zai-org/GLM-4.6V-Flash": 9,
+        "hf-inference-providers/openai/gpt-oss-20b": 20,
+        "hf-inference-providers/Qwen/Qwen3-4B-Instruct-2507": 4,
+        "hf-inference-providers/allenai/Olmo-3-7B-Instruct": 7,
+    }
+    df_raw["param_size_b"] = df_raw["model"].map(param_sizes)
+
+    df_raw
+    return df_raw, get_task_category, param_sizes
 
 
 @app.cell
-def _(
+def _(df_raw, mo):
+    # Task selector
+    task_selector = mo.ui.dropdown(
+        options=["Title Extraction", "Full Metadata"],
+        value="Title Extraction",
+        label="Task",
+    )
+    return (task_selector,)
+
+
+@app.cell
+def _(df_raw, mo, task_selector):
+    # Filter by selected task
+    df = df_raw[df_raw["task_category"] == task_selector.value].copy()
+
     # Calculate summary stats
     vlm_avg = df[df["approach"] == "VLM"]["accuracy"].mean()
     text_avg = df[df["approach"] == "Text"]["accuracy"].mean()
     diff = vlm_avg - text_avg
 
+    task_desc = "book titles" if task_selector.value == "Title Extraction" else "full metadata (title, subtitle, publisher, year, ISBN)"
+
+    mo.vstack([
+        task_selector,
+        mo.md(
+            f"""
+            ## Key Results: {task_selector.value}
 
+            | Approach | Average Accuracy |
+            |----------|-----------------|
+            | **VLM (Vision)** | **{vlm_avg:.0f}%** |
+            | Text Extraction | {text_avg:.0f}% |
 
+            **VLM advantage: +{diff:.0f} percentage points**
 
+            VLMs {'significantly ' if diff > 15 else ''}outperform text extraction for extracting {task_desc} from book covers.
+            """
+        )
+    ])
+    return df, diff, task_desc, text_avg, vlm_avg
+
+
+@app.cell
+def _(mo):
+    mo.md("## Model Size vs Accuracy")
+    return
+
+
+@app.cell
+def _(alt, df, mo):
+    # Interactive scatter plot: model size vs accuracy
+    scatter = alt.Chart(df).mark_circle(size=150).encode(
+        x=alt.X("param_size_b:Q", title="Parameters (Billions)", scale=alt.Scale(zero=False)),
+        y=alt.Y("accuracy:Q", title="Accuracy (%)", scale=alt.Scale(domain=[50, 105])),
+        color=alt.Color("approach:N", title="Approach", scale=alt.Scale(domain=["VLM", "Text"], range=["#1f77b4", "#ff7f0e"])),
+        tooltip=[
+            alt.Tooltip("model_short:N", title="Model"),
+            alt.Tooltip("approach:N", title="Approach"),
+            alt.Tooltip("param_size_b:Q", title="Params (B)"),
+            alt.Tooltip("accuracy:Q", title="Accuracy", format=".1f"),
+        ],
+    ).properties(
+        width=500,
+        height=300,
+    ).interactive()
+
+    # Add text labels
+    text = alt.Chart(df).mark_text(
+        align="left",
+        baseline="middle",
+        dx=10,
+        fontSize=11,
+    ).encode(
+        x="param_size_b:Q",
+        y="accuracy:Q",
+        text="model_short:N",
+        color=alt.Color("approach:N", scale=alt.Scale(domain=["VLM", "Text"], range=["#1f77b4", "#ff7f0e"])),
     )
+
+    chart = (scatter + text).configure_axis(
+        labelFontSize=12,
+        titleFontSize=14,
+    )
+
+    mo.ui.altair_chart(chart)
+    return chart, scatter, text
 
 
 @app.cell
@@ -83,7 +170,7 @@ def _(mo):
 
 @app.cell
 def _(df, mo):
-    # Filter selector
+    # Filter selector for approach
     approach_filter = mo.ui.dropdown(
         options=["All", "VLM", "Text"],
         value="All",
@@ -93,7 +180,7 @@ def _(df, mo):
 
 
 @app.cell
-def _(approach_filter, df, mo, pd):
+def _(approach_filter, df, mo):
     # Filter data based on selection
     if approach_filter.value == "All":
         filtered_df = df
@@ -102,11 +189,11 @@ def _(approach_filter, df, mo, pd):
 
     # Create leaderboard
    leaderboard = (
-        filtered_df[["model_short", "approach", "accuracy"]]
+        filtered_df[["model_short", "approach", "param_size_b", "accuracy"]]
         .sort_values("accuracy", ascending=False)
         .reset_index(drop=True)
     )
-    leaderboard.columns = ["Model", "Approach", "Accuracy (%)"]
+    leaderboard.columns = ["Model", "Approach", "Params (B)", "Accuracy (%)"]
     leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)
 
     mo.vstack([
@@ -117,35 +204,46 @@ def _(approach_filter, df, mo, pd):
 
 
 @app.cell
-def _(
+def _(mo):
     mo.md(
         """
         ## About This Evaluation
 
-        **Task**: Extract
+        **Task**: Extract metadata from academic book cover images
 
         **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples
 
         **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
 
-        **Scoring**:
+        **Scoring**:
+        - *Title Extraction*: Custom flexible matching (case-insensitive, handles subtitles)
+        - *Full Metadata*: LLM-as-judge with partial credit
 
         ### Models Evaluated
 
         **VLM (Vision-Language Models)**:
-        - Qwen3-VL-8B-Instruct
-        - Qwen3-VL-30B-A3B-Thinking
-        - GLM-4.6V-Flash
+        - Qwen3-VL-8B-Instruct (8B params)
+        - Qwen3-VL-30B-A3B-Thinking (30B params)
+        - GLM-4.6V-Flash (9B params)
+
+        **Text Extraction** (OCR → LLM):
+        - gpt-oss-20b (20B params)
+        - Qwen3-4B-Instruct-2507 (4B params)
+        - Olmo-3-7B-Instruct (7B params)
+        - Qwen3-VL-8B-Instruct as text-only LLM (8B params)
+
+        ### Why VLMs Win
+
+        Book covers are **visually structured**:
+        - Titles appear in specific locations (usually top/center)
+        - Typography indicates importance (larger = more likely title)
+        - Layout provides context that pure text loses
 
-        - gpt-oss-20b
-        - Qwen3-4B-Instruct-2507
-        - Olmo-3-7B-Instruct
-        - Qwen3-VL-8B-Instruct (used as text-only LLM)
+        Text extraction flattens this structure, losing valuable spatial information.
 
         ---
 
-        *Built with [Marimo](https://marimo.io) | Evaluation
+        *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
         """
     )
     return
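The new app.py cells above do all of their wrangling inside Marimo; the same summary can be reproduced as a plain script. A sketch mirroring that code, assuming the published log dataset exposes the same columns app.py relies on (`task_name`, `model`, `score_headline_value`):

```python
# Standalone sketch mirroring the data wrangling in app.py above; column names
# are assumed to match what the app uses.
from inspect_ai.analysis import evals_df

# Load the published evaluation logs from the Hugging Face Hub
df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)

# Derive the same columns as app.py
df["approach"] = df["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
df["task_category"] = df["task_name"].apply(
    lambda x: "Full Metadata" if "llm_judge" in x else "Title Extraction"
)
df["accuracy"] = df["score_headline_value"] * 100

# Average accuracy per approach for each task, as in the dashboard's summary table
summary = (
    df.groupby(["task_category", "approach"])["accuracy"]
    .mean()
    .round(1)
    .unstack("approach")
)
print(summary)
```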
requirements.txt CHANGED

@@ -3,3 +3,4 @@ pandas>=2.0.0
 inspect-ai>=0.3.0
 huggingface-hub>=0.20.0
 pyarrow>=14.0.0
+altair>=5.0.0