TimeLens-8B

πŸ“‘ Paper | πŸ’» Code | 🏠 Project Page | πŸ€— Model & Data

✨ Model Description

TimeLens-8B is an MLLM, finetuned from Qwen3-VL-8B-Instruct, with state-of-the-art video temporal grounding (VTG) performance among open-source models. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe proposed in our paper, using our high-quality VTG training dataset TimeLens-100K.
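
In RLVR, each sampled response is scored by a programmatic verifier rather than a learned reward model; for temporal grounding, the natural verifiable signal is the temporal IoU between the predicted segment and the annotation. The sketch below is our illustration of such a reward, not the exact recipe from the paper:

```python
def iou_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between predicted and ground-truth segments, in seconds.

    Illustrative only; the actual TimeLens reward recipe is defined in the paper.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. a rollout predicting (2.0, 7.5) s against ground truth (3.0, 8.0) s
print(iou_reward((2.0, 7.5), (3.0, 8.0)))  # 0.75
```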

πŸ“Š Performance

TimeLens-8B achieves state-of-the-art video temporal grounding performance among open-source models:

Charades-TimeLens

| Model | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 |
| TimeLens-7B πŸš€ | 70.5 | 55.6 | 28.4 | 48.8 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 |
| TimeLens-8B πŸš€ | 76.6 | 63.0 | 35.2 | 55.2 |

ActivityNet-TimeLens

| Model | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | 44.1 | 31.0 | 16.1 | 31.4 |
| TimeLens-7B πŸš€ | 62.8 | 51.0 | 32.6 | 46.2 |
| Qwen3-VL-8B-Instruct | 62.1 | 51.2 | 34.4 | 46.8 |
| TimeLens-8B πŸš€ | 68.9 | 58.4 | 40.6 | 53.2 |

QVHighlights-TimeLens

| Model | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B πŸš€ | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B πŸš€ | 80.2 | 71.6 | 55.5 | 65.5 |

For a detailed comparison with other models, please refer to the Leaderboard.
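
For reference, R1@m is the percentage of queries whose top-1 predicted segment reaches a temporal IoU of at least m with the ground truth, and mIoU is the IoU averaged over all queries. A minimal sketch of both metrics, assuming top-1 predictions and ground truths as (start, end) pairs in seconds:

```python
def evaluate_grounding(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Compute R1@{0.3, 0.5, 0.7} and mIoU for top-1 temporal grounding predictions."""
    def iou(p, g):
        inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
        union = (p[1] - p[0]) + (g[1] - g[0]) - inter
        return inter / union if union > 0 else 0.0

    ious = [iou(p, g) for p, g in zip(preds, gts)]
    recalls = {m: 100 * sum(i >= m for i in ious) / len(ious) for m in thresholds}
    return recalls, 100 * sum(ious) / len(ious)  # R1@m (%), mIoU (%)
```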

πŸš€ Usage

Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
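
Optionally, verify the environment before running inference (a quick sanity check, not required):

```python
import torch, transformers
print(transformers.__version__)            # expect 4.57.1
print(torch.__version__, torch.cuda.is_available())
```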

Using πŸ€— Transformers for inference:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",
    do_resize=False,
)

# Prepare input
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"  # use resolve/, not blob/, to fetch the raw file

GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

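# Render the chat template into a prompt string, then extract frames and metadata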
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)

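# With return_video_metadata=True, each video comes as a (frames, metadata) pair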
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to("cuda")

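# Greedy decoding for a deterministic grounding answer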
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

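# Drop the prompt tokens so only the newly generated answer is decoded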
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")

Citation

If you find our work helpful for your research and applications, please cite our paper:

TODO