Improve model card
#3
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,10 +1,12 @@
 ---
-license: apache-2.0
 base_model:
 - Qwen/Qwen2-VL-2B-Instruct
 language:
 - en
 ---
 <div align="center">
 <h1>
 MedVLM-R1
@@ -15,10 +17,17 @@ language:
 <a href="https://arxiv.org/abs/2502.19634" target="_blank">Paper</a>
 </div>

-#
 MedVLM-R1 is a medical Vision-Language Model built upon [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) and fine-tuned using the [GRPO](https://arxiv.org/abs/2402.03300) reinforcement learning framework. Trained on 600 MRI VQA samples from the [HuatuoGPT-Vision dataset](https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data), MedVLM-R1 excels in out-of-distribution performance on CT and X-ray VQA tasks. It also demonstrates explicit medical reasoning capabilities beyond merely providing final answers, ensuring greater interpretability and trustworthiness in clinical applications.

-

 ### 1. Load the model
 ```python
@@ -45,28 +54,8 @@ temp_generation_config = GenerationConfig(
 pad_token_id=151643,
 )
 ```
-### 2. Load the VQA Data
-Pick one of the following examples. These are samples from [OmniMedVQA](https://huggingface.co/datasets/foreverbeliever/OmniMedVQA) data and are bundled by [HuatuoGPT-Vision](https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data).
-
-```python
-question = {"image": ['images/successful_cases/mdb146.png'], "problem": "What content appears in this image?\nA) Cardiac tissue\nB) Breast tissue\nC) Liver tissue\nD) Skin tissue", "solution": "B", "answer": "Breast tissue"}
-
-question = {"image": ["images/successful_cases/person19_virus_50.jpeg"], "problem": "What content appears in this image?\nA) Lungs\nB) Bladder\nC) Brain\nD) Heart", "solution": "A", "answer": "Lungs"}
-
-question = {"image":["images/successful_cases/abd-normal023599.png"],"problem":"Is any abnormality evident in this image?\nA) No\nB) Yes.","solution":"A","answer":"No"}
-
-question = {"image":["images/successful_cases/foot089224.png"],"problem":"Which imaging technique was utilized for acquiring this image?\nA) MRI\nB) Electroencephalogram (EEG)\nC) Ultrasound\nD) Angiography","solution":"A","answer":"MRI"}
-
-question = {"image":["images/successful_cases/knee031316.png"],"problem":"What can be observed in this image?\nA) Chondral abnormality\nB) Bone density loss\nC) Synovial cyst formation\nD) Ligament tear","solution":"A","answer":"Chondral abnormality"}
-
-question = {"image":["images/successful_cases/shoulder045906.png"],"problem":"What can be visually detected in this picture?\nA) Bone fracture\nB) Soft tissue fluid\nC) Blood clot\nD) Tendon tear","solution":"B","answer":"Soft tissue fluid"}
-
-question = {"image":["images/successful_cases/brain003631.png"],"problem":"What attribute can be observed in this image?\nA) Focal flair hyperintensity\nB) Bone fracture\nC) Vascular malformation\nD) Ligament tear","solution":"A","answer":"Focal flair hyperintensity"}
-
-question = {"image":["images/successful_cases/mrabd005680.png"],"problem":"What can be observed in this image?\nA) Pulmonary embolism\nB) Pancreatic abscess\nC) Intraperitoneal mass\nD) Cardiac tamponade","solution":"C","answer":"Intraperitoneal mass"}
-```
-### 3. Run the inference

 ```python
 QUESTION_TEMPLATE = """
 {Question}
@@ -75,7 +64,30 @@ QUESTION_TEMPLATE = """
 2. Then provide the correct single-letter choice (A, B, C, D,...) inside <answer>...</answer> tags.
 3. No extra information or text outside of these tags.
 """

 message = [{
 "role": "user",
 "content": [{"type": "image", "image": f"file://{question['image'][0]}"}, {"type": "text","text": QUESTION_TEMPLATE.format(Question=question['problem'])}]
@@ -101,20 +113,14 @@ output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=
 print(f'model output: {output_text[0]}')

 ```
 ### Failure cases
 MedVLM-R1's reasoning fails on more difficult VQA examples. Although it outputs the correct choice in the following examples, its reasoning is either superficial or contradictory.
 ```python
-
-
-question = {"image":["images/failure_cases/spine010017.png"],"problem":"What can be observed in this image?\nA) Cystic lesions\nB) Fractured bones\nC) Inflamed tissue\nD) Nerve damage","solution":"A","answer":"Cystic lesions"}
-
-question = {"image":["images/failure_cases/ankle056120.png"],"problem":"What attribute can be observed in this image?\nA) Bursitis\nB) Flexor pathology\nC) Tendonitis\nD) Joint inflammation","solution":"B","answer":"Flexor pathology"}
-
-question = {"image":["images/failure_cases/lung067009.png"],"problem":"What is the term for the anomaly depicted in the image?\nA) Pulmonary embolism\nB) Airspace opacity\nC) Lung consolidation\nD) Atelectasis","solution":"B","answer":"Airspace opacity"}
-
 ```

-#
 We thank all machine learning / medical workers for making public codebases / datasets available to the community 🫶🫶🫶

 If you find our work helpful, feel free to give us a cite.
@@ -126,4 +132,4 @@ If you find our work helpful, feel free to give us a cite.
 journal={arXiv preprint arXiv:2502.19634},
 year={2025}
 }
-```

README.md (new version):

 ---
 base_model:
 - Qwen/Qwen2-VL-2B-Instruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: image-text-to-text
 ---
+
 <div align="center">
 <h1>
 MedVLM-R1
 <a href="https://arxiv.org/abs/2502.19634" target="_blank">Paper</a>
 </div>

+# Introduction
 MedVLM-R1 is a medical Vision-Language Model built upon [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) and fine-tuned using the [GRPO](https://arxiv.org/abs/2402.03300) reinforcement learning framework. Trained on 600 MRI VQA samples from the [HuatuoGPT-Vision dataset](https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data), MedVLM-R1 excels in out-of-distribution performance on CT and X-ray VQA tasks. It also demonstrates explicit medical reasoning capabilities beyond merely providing final answers, ensuring greater interpretability and trustworthiness in clinical applications.

+**Paper Abstract:**
+
+Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.
+
+
+Github repo: https://github.com/jzpan/MedVLM-R1
+
+# Quick Start

 ### 1. Load the model
 ```python
 # ... model / processor loading and GenerationConfig setup collapsed in the diff
 pad_token_id=151643,
 )
 ```
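The loading code itself is collapsed in the diff above; only the tail of the generation config is visible. For reference, here is a minimal loading sketch using the standard transformers Qwen2-VL classes. The repository id, dtype, and device settings are assumptions (placeholders), and the `GenerationConfig` arguments other than `pad_token_id=151643` are illustrative.

```python
# Minimal loading sketch; repo id, dtype and device placement are assumptions.
import torch
from transformers import AutoProcessor, GenerationConfig, Qwen2VLForConditionalGeneration

MODEL_PATH = "JZPeterPan/MedVLM-R1"  # hypothetical checkpoint id; replace with the actual one

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # assumed dtype
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Only pad_token_id is visible in the diff; the other arguments are illustrative.
temp_generation_config = GenerationConfig(
    max_new_tokens=1024,
    do_sample=False,
    pad_token_id=151643,
)
```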

+### 2. Question Template
 ```python
 QUESTION_TEMPLATE = """
 {Question}
 # ... lines collapsed in the diff
 2. Then provide the correct single-letter choice (A, B, C, D,...) inside <answer>...</answer> tags.
 3. No extra information or text outside of these tags.
 """
+```
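To make the prompt contract concrete, here is a small usage sketch: the formatted string becomes the text part of the user message in step 4, and the model is expected to return its final choice inside `<answer>...</answer>` tags. The example output in the comment is illustrative, not a recorded model response.

```python
# Format the template with a VQA item from step 3 below.
prompt = QUESTION_TEMPLATE.format(Question=question['problem'])
print(prompt)
# Expected reply format (illustrative): reasoning text, then e.g. <answer>B</answer>
```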
+
+### 3. Load the VQA Data
+Pick one of the following examples. These are samples from the [OmniMedVQA](https://huggingface.co/datasets/foreverbeliever/OmniMedVQA) dataset and are bundled in [HuatuoGPT-Vision](https://huggingface.co/datasets/FreedomIntelligence/Medical_Multimodal_Evaluation_Data).
+
+```python
+question = {"image": ['images/successful_cases/mdb146.png'], "problem": "What content appears in this image?\nA) Cardiac tissue\nB) Breast tissue\nC) Liver tissue\nD) Skin tissue", "solution": "B", "answer": "Breast tissue"}
+
+question = {"image": ["images/successful_cases/person19_virus_50.jpeg"], "problem": "What content appears in this image?\nA) Lungs\nB) Bladder\nC) Brain\nD) Heart", "solution": "A", "answer": "Lungs"}
+
+# ... other example questions
+```

+### 4. Run the inference
+
+```python
 message = [{
 "role": "user",
 "content": [{"type": "image", "image": f"file://{question['image'][0]}"}, {"type": "text","text": QUESTION_TEMPLATE.format(Question=question['problem'])}]
 # ... generation and decoding steps collapsed in the diff
 print(f'model output: {output_text[0]}')

 ```
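The preprocessing, generation, and decoding steps are collapsed in the diff above. For reference, here is a minimal end-to-end sketch following the standard Qwen2-VL inference recipe; the `qwen_vl_utils.process_vision_info` helper, the chat-template call, and the answer-extraction regex are assumptions, while `temp_generation_config`, `generated_ids_trimmed`, and the decoding call mirror the names visible in the hunk headers.

```python
# Minimal inference sketch (assumes the qwen_vl_utils helper from the Qwen2-VL examples).
import re
from qwen_vl_utils import process_vision_info

# Build model inputs from the chat message defined above.
text = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(message)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, strip the prompt tokens, and decode.
generated_ids = model.generate(**inputs, generation_config=temp_generation_config)
generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(f'model output: {output_text[0]}')

# Optionally compare the predicted letter with the ground-truth "solution" field.
match = re.search(r"<answer>\s*([A-D])", output_text[0])
if match:
    print("predicted:", match.group(1), "| ground truth:", question["solution"])
```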
+
 ### Failure cases
 MedVLM-R1's reasoning fails on more difficult VQA examples. Although it outputs the correct choice in the following examples, its reasoning is either superficial or contradictory.
 ```python
+# ... failure case examples
 ```

+# Acknowledgement
 We thank all machine learning / medical workers for making public codebases / datasets available to the community 🫶🫶🫶

 If you find our work helpful, feel free to give us a cite.
 journal={arXiv preprint arXiv:2502.19634},
 year={2025}
 }
+```