arXiv:2311.17893 [cs.CV]

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong

Published 2023-11-29, Version 1

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques mostly rely on auxiliary modalities or iterative slot attention to assist in object discovery, which restricts their general applicability and imposes higher computational requirements. To address these challenges, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, using these attention maps, we apply hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will be released at https://github.com/shvdiwnkozbw/SSL-UVOS.
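
The abstract's second stage (grouping tokens by clustering their attention maps) can be illustrated with a toy sketch. This is not the authors' implementation: the function name `cluster_attention`, the cosine-similarity merge rule, the fixed cluster count, and the 4-token attention matrix are all assumptions made for illustration. In the paper, the attention maps come from the trained spatio-temporal Transformer block rather than from hand-written numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_attention(attn, num_clusters):
    """Greedy agglomerative clustering of tokens by their attention rows.

    attn is an N x N list of lists; row i is token i's attention over all
    tokens. Hypothetical helper, not taken from the paper's code.
    """
    n = len(attn)
    clusters = [[i] for i in range(n)]  # start with one cluster per token

    def centroid(members):
        # Mean attention row of all tokens in the cluster.
        return [sum(attn[i][j] for i in members) / len(members) for j in range(n)]

    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters whose mean attention rows are most similar.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]

    labels = [0] * n
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Toy attention: tokens 0 and 1 attend mostly to each other, as do 2 and 3.
toy_attn = [
    [0.45, 0.45, 0.05, 0.05],
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
    [0.10, 0.00, 0.40, 0.50],
]
labels = cluster_attention(toy_attn, num_clusters=2)
print(labels)
```

In this sketch each token is described by whom it attends to, so tokens belonging to the same object end up with similar rows and are merged first. A real pipeline would operate on many more tokens across frames and would likely use an optimized clustering routine.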

Categories: cs.CV

Keywords: self-supervised video object segmentation, real-world multi-object video segmentation

Related articles:

arXiv:2401.13937 [cs.CV] (Published 2024-01-25)

Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention

Quang-Trung Truong, Duc Thanh Nguyen, Binh-Son Hua, Sai-Kit Yeung

arXiv:2204.10846 [cs.CV] (Published 2022-04-22)

Self-Supervised Video Object Segmentation via Cutout Prediction and Tagging

Jyoti Kini, Fahad Shahbaz Khan, Salman Khan, Mubarak Shah

arXiv:2107.12569 [cs.CV] (Published 2021-07-27)

Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian