OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

Jiali Yao*, Xin Gu*, Xinran Deng, Mengrui Dai, Bing Fan, Zhipeng Zhang, Yan Huang, Heng Fan, Libo Zhang
University of Chinese Academy of Sciences
ICLR 2026

*Indicates Equal Contribution
Indicates Corresponding Authors
Task Comparison

Comparison of existing STVG, which localizes a single object in the query (a), and our OmniSTVG, which localizes all objects in the query (b). Each object in the query and its corresponding spatio-temporal tube in the video are highlighted in the same color.

Abstract

We introduce spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims to localize, spatially and temporally, all targets mentioned in the textual query within videos. Compared to classic STVG, which locates only a single target, OmniSTVG localizes not only an arbitrary number of text-referred targets but also their interacting counterparts in the query, making it more flexible and practical in real scenarios for comprehensive understanding. To facilitate exploration of OmniSTVG, we propose BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, BOSTVG contains 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence, paired with a free-form textual query, contains a varying number of targets, ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To the best of our knowledge, BOSTVG is, to date, the first and largest benchmark for OmniSTVG. To encourage future research, we present a simple yet effective approach, named OmniTube, which draws inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG, and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG.
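
To make the difference between the two task outputs concrete, here is a minimal Python sketch of how a multi-target grounding result might be represented. It is an illustration only, not the BOSTVG annotation format or the OmniTube output format: the names `Box`, `Tube`, `OmniGrounding`, and `classic_view`, the `(x1, y1, x2, y2)` box layout, and the example query are all assumptions. The only detail taken from the text above is the range of 1 to 10 targets per query.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """One spatio-temporal tube: a bounding box for each frame the target is grounded in."""
    phrase: str            # the part of the query this tube grounds
    boxes: Dict[int, Box]  # frame index -> bounding box

    @property
    def span(self) -> Tuple[int, int]:
        """Temporal extent as (first_frame, last_frame), both inclusive."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]


@dataclass
class OmniGrounding:
    """All tubes for one query. Classic STVG would hold exactly one tube here."""
    query: str
    tubes: List[Tube] = field(default_factory=list)

    def validate(self, max_targets: int = 10) -> None:
        # The 1..10 range mirrors the per-query target count described for BOSTVG.
        if not 1 <= len(self.tubes) <= max_targets:
            raise ValueError(f"expected 1..{max_targets} tubes, got {len(self.tubes)}")
        for tube in self.tubes:
            if not tube.boxes:
                raise ValueError(f"tube for {tube.phrase!r} has no boxes")
            if tube.phrase not in self.query:
                raise ValueError(f"{tube.phrase!r} does not occur in the query")


def classic_view(result: OmniGrounding, phrase: str) -> Tube:
    """Reduce a multi-target result to the single tube classic STVG would return."""
    for tube in result.tubes:
        if tube.phrase == phrase:
            return tube
    raise KeyError(phrase)


# Made-up example: three targets mentioned in one query, each with its own tube.
result = OmniGrounding(
    query="a man throws a ball to a dog",
    tubes=[
        Tube("man", {0: (10, 20, 60, 180), 1: (12, 20, 62, 180), 2: (14, 21, 64, 181)}),
        Tube("ball", {1: (70, 90, 80, 100), 2: (110, 80, 120, 90)}),
        Tube("dog", {2: (200, 120, 260, 170), 3: (198, 120, 258, 170)}),
    ],
)
result.validate()
```

In this sketch each tube has its own temporal span, so targets may enter and leave the video at different frames; `classic_view(result, "man")` recovers the single-target case for comparison.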

Overview of OmniTube

Overview of OmniTube, which contains a multimodal encoder, a spatio-temporal decoder, and a spatial-temporal tube generation module to locate all mentioned targets in the textual query.

Poster

ICLR 2026 Poster

BibTeX

@misc{yao2025omnistvg,
      title={OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding}, 
      author={Jiali Yao and Xinran Deng and Xin Gu and Mengrui Dai and Bing Fan and Zhipeng Zhang and Yan Huang and Heng Fan and Libo Zhang},
      year={2025},
      eprint={2503.10500},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10500}, 
}