---
license: apache-2.0
base_model:
- google/owlvit-base-patch16
pipeline_tag: object-detection
---

# **NoctOWL: Fine-Grained Open-Vocabulary Object Detector**

## **Model Description**

**NoctOWL** (***N***ot **o**nly **c**oarse-**t**ext **OWL**) is an adaptation of **OWL-ViT** (*NoctOWL*) and **OWLv2** (*NoctOWLv2*) designed for **Fine-Grained Open-Vocabulary Detection (FG-OVD)**. Unlike standard open-vocabulary object detectors, which focus primarily on class-level recognition, NoctOWL enhances the ability to detect and distinguish fine-grained object attributes such as color, material, transparency, and pattern.

It maintains a balanced **trade-off between fine- and coarse-grained detection**, making it particularly effective in scenarios requiring detailed object descriptions.

You can find the original code to train and evaluate the model [here](https://github.com/lorebianchi98/FG-OVD/tree/main/benchmarks).

### **Model Variants**
- **NoctOWL Base** (`lorebianchi98/NoctOWL-base-patch16`)
- **NoctOWLv2 Base** (`lorebianchi98/NoctOWLv2-base-patch16`)
- **NoctOWL Large** (`lorebianchi98/NoctOWL-large-patch14`)
- **NoctOWLv2 Large** (`lorebianchi98/NoctOWLv2-large-patch14`)

## **Usage**

### **Loading the Model**
```python
from transformers import OwlViTForObjectDetection, Owlv2ForObjectDetection, OwlViTProcessor, Owlv2Processor

# Load NoctOWL (OWL-ViT-based) model and its processor
model = OwlViTForObjectDetection.from_pretrained("lorebianchi98/NoctOWL-base-patch16")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")

# Load NoctOWLv2 (OWLv2-based) model and its processor
model_v2 = Owlv2ForObjectDetection.from_pretrained("lorebianchi98/NoctOWLv2-base-patch16")
processor_v2 = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
```
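
For the large variants, the same pattern presumably carries over with the matching `patch14` processors from the original Google releases. This pairing is an assumption based on the checkpoint names, not something stated in the original repository:

```python
# Assumed pairing for a large variant (based on checkpoint naming)
model_large = OwlViTForObjectDetection.from_pretrained("lorebianchi98/NoctOWL-large-patch14")
processor_large = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
```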

### **Inference Example**
```python
from PIL import Image
import torch

# Load an input image
image = Image.open("example.jpg")

# Define text prompts (fine-grained descriptions)
text_queries = ["a red patterned dress", "a dark brown wooden chair"]

# Tokenize the queries and preprocess the image
inputs = processor(images=image, text=text_queries, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Raw predictions: per-query classification logits and normalized boxes
logits = outputs.logits
boxes = outputs.pred_boxes

# See the post-processing sketch below for thresholded, pixel-space results
```
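
The raw `logits` and `pred_boxes` can be turned into thresholded, pixel-space detections with the standard OWL-ViT post-processing utilities in `transformers`. A minimal sketch for the NoctOWL/`OwlViTProcessor` pair shown above (the `0.3` score threshold is an arbitrary illustrative choice, not a value from the NoctOWL authors):

```python
# Rescale normalized boxes to the original image size and filter by confidence
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{text_queries[int(label)]}: score {score:.2f}, box {box.tolist()}")
```

The OWLv2 pair works the same way through `Owlv2Processor`; note, however, that OWLv2 preprocessing pads the image to a square, which may need to be accounted for when interpreting box coordinates.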

## **Results**

We report the mean Average Precision (**mAP**) on the Fine-Grained Open-Vocabulary Detection ([FG-OVD](https://lorebianchi98.github.io/FG-OVD/)) benchmarks across different difficulty levels, as well as performance on the rare classes of the LVIS dataset.

| Model | LVIS (Rare) | Trivial | Easy | Medium | Hard | Color | Material | Pattern | Transparency |
|-------|-------------|---------|------|--------|------|-------|----------|---------|--------------|
| OWL (B/16) | 20.6 | 53.9 | 38.4 | 39.8 | 26.2 | 45.3 | 37.3 | 26.6 | 34.1 |
| OWL (L/14) | 31.2 | 65.1 | 44.0 | 39.3 | 26.5 | 43.8 | 44.9 | 36.0 | 29.2 |
| OWLv2 (B/16) | 29.6 | 52.9 | 40.0 | 38.5 | 25.3 | 45.1 | 33.5 | 19.2 | 28.5 |
| OWLv2 (L/14) | **34.9** | 63.2 | 42.8 | 41.2 | 25.4 | 53.3 | 36.9 | 23.3 | 12.2 |
| **NoctOWL (B/16)** | 11.6 | 46.6 | 44.4 | 45.6 | 40.0 | 44.7 | 46.0 | 46.1 | 53.6 |
| **NoctOWL (L/14)** | 26.0 | 57.4 | 54.2 | 54.8 | 48.6 | 53.1 | 56.9 | **49.8** | **57.2** |
| **NoctOWLv2 (B/16)** | 17.5 | 48.3 | 49.1 | 47.1 | 42.1 | 46.8 | 48.2 | 42.2 | 50.2 |
| **NoctOWLv2 (L/14)** | 27.2 | **57.5** | **55.5** | **57.2** | **50.2** | **55.6** | **57.0** | 49.2 | 55.9 |