# CSM-DocExtract-VL (INT4 Quantized)
CSM-DocExtract-VL is a highly optimized, multilingual Vision-Language Model (VLM) engineered specifically for Identity Intelligence automation.
It transforms unstructured images of identity documents into clean, structured JSON data instantly.
## Overview (Layman's Terms)
Imagine having a digital assistant that can look at any identity document (Passport, ID card, Visa) from almost any country, read the text (even in Arabic, Hindi, Cyrillic, or Chinese), and instantly type out a perfectly structured JSON file.
- The Problem: Manual data entry for KYC is slow, prone to human error, and expensive.
- The Solution: This model acts as an ultra-fast, highly accurate data-entry expert that never sleeps. It natively understands both the visual layout of the card and the textual languages, bridging the gap seamlessly.
## Technical Specifications (For Engineers)
This is the 4-bit NF4 quantized version of our fine-tuned 8-Billion parameter Vision-Language Model, designed to run easily on consumer-grade hardware.
- Base Architecture: Qwen3-VL-8B
- Training Framework: Fine-tuned with Unsloth (2x faster training, lower VRAM) and PyTorch.
- Quantization: bitsandbytes INT4 (NF4) with double quantization, drastically reducing compute requirements with only a negligible accuracy drop (see the comparison table below).
- Adapters: LoRA (Low-Rank Adaptation) applied to Vision, Language, Attention, and MLP modules (Rank=32); an illustrative config sketch follows this list.
- Context Window: 1024 / 2048 Tokens.
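For reference, the adapter setup above roughly corresponds to a PEFT `LoraConfig` like the sketch below. This is not the actual training script: the target module names follow common Qwen-VL naming conventions, the alpha and dropout values are assumed defaults, and the vision-tower adapters mentioned in the spec are omitted for brevity.

```python
# Illustrative LoRA setup matching the spec above (rank 32 on attention and
# MLP projections). Module names, alpha, and dropout are assumptions; the
# vision-side adapters mentioned in the spec are not shown here.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                                        # rank, as stated in the spec
    lora_alpha=32,                               # assumed (alpha == rank is common)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.05,                           # assumed default
    task_type="CAUSAL_LM",
)
```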
## Example Input & Output
Input Prompt: `Extract information from this passport image and format it as JSON.`

Output Result:

```json
{
  "document_type": "Passport",
  "issuing_country": "IND",
  "full_name": "John Doe",
  "document_number": "Z1234567",
  "date_of_birth": "1990-01-01",
  "date_of_expiry": "2030-12-31",
  "mrz_data": {
    "line1": "P<INDDOE<<JOHN<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<",
    "line2": "Z1234567<8IND9001015M3012316<<<<<<<<<<<<<<02"
  }
}
```
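In production, the model returns this JSON as plain text and, as LLMs sometimes do, may surround it with extra wording. Downstream code should therefore parse it defensively. Below is a minimal sketch; `raw_output` and the example string are illustrative stand-ins for the decoded model text.

```python
import json
import re

# raw_output stands in for the decoded model text (illustrative example)
raw_output = 'Here is the extracted data: {"document_type": "Passport", "issuing_country": "IND"}'

# Grab the outermost JSON object, ignoring any surrounding chatter
match = re.search(r"\{.*\}", raw_output, re.DOTALL)
try:
    record = json.loads(match.group(0)) if match else None
except json.JSONDecodeError:
    record = None

if record is None:
    print("Could not parse model output; route for manual review")
else:
    print(record["document_type"], record["issuing_country"])
```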
## Architecture & LLD (Low-Level Design)
Below is the workflow showing how the model processes a document image, attends to specific fields, and resolves conflicts (e.g., MRZ vs. printed text):

*(Figure: high-resolution architecture flow for KYC document processing)*
## Performance Comparison: FP16 vs INT4

| Metric | Original Model (FP16) | Quantized Model (INT4) | Impact / Benefit |
|---|---|---|---|
| Model Size (Disk) | ~17.5 GB | ~5.5 GB | ~68% reduction |
| VRAM Required | 16-24 GB | ~6-7 GB | Fits on consumer GPUs (e.g., RTX 3060, T4) |
| Inference Speed | Slower | Faster | Optimized memory bandwidth |
| JSON Accuracy | 93-97% | 92-96% | Negligible drop (≈1%) |
## How to Use (Deployment Code)
You can deploy this model directly on Hugging Face Spaces, Google Colab, or a local server. Ensure `transformers`, `accelerate`, and `bitsandbytes` are installed.
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

# 1. Initialize the 4-bit quantization config (NF4 with double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 2. Load the model & processor
model_id = "Chhagan005/CSM-DocExtract-VL-Q4KM"
print("Loading model... (this might take a moment depending on your bandwidth)")

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

print("Model loaded successfully and is ready for KYC extraction!")
```
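Once the model and processor are loaded, a single document image can be run through the usual chat-template and `generate` flow. The snippet below is a minimal sketch assuming a Qwen-VL-style processor API; the file name `passport_sample.jpg`, the prompt, and the token limit are illustrative.

```python
# Minimal inference sketch (assumes `model` and `processor` from above).
from PIL import Image

image = Image.open("passport_sample.jpg")  # hypothetical local document scan

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract information from this passport image and format it as JSON."},
        ],
    }
]

# Build the prompt text and tensor inputs, then generate
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (skip the echoed prompt)
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```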
## Limitations & Best Practices
- Image Quality: The model performs best on well-lit, glare-free document scans. Severe glare on holograms might obscure text.
- Handwritten Text: This model is optimized for printed text and standard document fonts. Extraction accuracy may degrade with cursive handwriting.
- Hallucination: As with all LLMs, always validate the output in production workflows, e.g., checksum verification on the MRZ strings (a minimal check-digit sketch follows this list).
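MRZ check digits follow the ICAO 9303 scheme (character values weighted 7, 3, 1 and summed modulo 10), so extracted fields can be verified independently of the model. A minimal sketch; the usage values at the bottom are illustrative, not taken from a real document.

```python
# ICAO 9303 check-digit computation for MRZ fields (weights 7, 3, 1).
# Useful for catching OCR mistakes or hallucinated values before they
# reach downstream KYC systems.
def mrz_check_digit(field: str) -> int:
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch.isalpha():
            value = ord(ch.upper()) - ord("A") + 10   # A=10 ... Z=35
        else:                                         # filler character '<'
            value = 0
        total += value * weights[i % 3]
    return total % 10

# Hypothetical usage on extracted fields (values are illustrative):
document_number = "Z1234567"             # from the model's JSON output
printed_check_digit = 1                  # digit printed after the field in the MRZ
padded = document_number.ljust(9, "<")   # the MRZ pads this field to 9 characters
if mrz_check_digit(padded) != printed_check_digit:
    print("Document-number checksum failed; route for manual review")
```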
## Model Tree

Base model: Chhagan005/CSM-DocExtract-VL-HF