TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
📄 Paper | 💻 Code | 🌐 Project Page | 🤗 Model & Data
TimeLens-8B is an MLLM with state-of-the-art video temporal grounding (VTG) performance among open-source models, finetuned from Qwen3-VL-8B-Instruct. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe proposed in our paper, using our high-quality VTG training dataset, TimeLens-100K.
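What makes the reward "verifiable" is that a grounding prediction can be scored directly against ground-truth timestamps, with no learned judge. As a minimal sketch (our illustration, not the exact reward formulation from the paper), a temporal-IoU reward looks like the following; the same overlap measure also underlies the R1@0.3/0.5/0.7 and mIoU numbers reported below:

```python
def temporal_iou(pred_span, gt_span):
    """Temporal IoU between (start, end) spans in seconds.

    Illustrative sketch of a verifiable reward: it needs only the
    ground-truth annotation, not a learned reward model.
    """
    pred_start, pred_end = pred_span
    gt_start, gt_end = gt_span
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return intersection / union if union > 0 else 0.0

# Example: prediction (12 s, 25 s) vs. ground truth (10 s, 24 s) -> IoU = 0.8
print(temporal_iou((12.0, 25.0), (10.0, 24.0)))
```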
Results on the Charades-TimeLens, ActivityNet-TimeLens, and QVHighlights-TimeLens benchmarks:
| Model | Charades-TimeLens | | | | ActivityNet-TimeLens | | | | QVHighlights-TimeLens | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |
For detailed comparison with other models, please refer to the Leaderboard.
Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
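If model loading fails, a quick environment check (our suggestion, not part of the original instructions) can confirm the pinned versions and that a CUDA GPU is visible, since Flash-Attention 2 requires one:

```python
# Sanity check for the pinned environment (illustrative, not from the original card)
import torch
import transformers

print(transformers.__version__)   # expect 4.57.1
print(torch.__version__)          # expect 2.6.0 (+ CUDA suffix)
print(torch.cuda.is_available())  # must be True for flash_attention_2
```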
Using 🤗 Transformers for Inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",  # left padding keeps batched generation aligned
    do_resize=False,
)

# Prepare input: a natural-language query and the video to ground it in
query = "A man is sitting on a chair"
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"
GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,       # per-frame lower bound on resized pixels
            'total_pixels': 14336 * 28 * 28,  # budget on total pixels across all frames
            'fps': 2,                         # sample the video at 2 frames per second
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

# Render the chat template and decode the video frames along with their metadata
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)
videos, video_metadatas = zip(*videos)  # split (frames, metadata) pairs
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to("cuda")

# Generate, then strip the prompt tokens so only the new answer is decoded
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
If you find our work helpful for your research and applications, please cite our paper:
TODO