arXiv:2204.02547 [cs.CV]

Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You

Published 2022-04-06, Version 1

Text-based video segmentation aims to segment a target object in a video according to a describing sentence. Incorporating motion information from optical flow maps with the appearance and linguistic modalities is crucial, yet it has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features for accurate segmentation. Specifically, we propose a multi-modal video transformer that fuses and aggregates multi-modal and temporal features across frames. Furthermore, we design a language-guided feature fusion module that progressively fuses appearance and motion features at each feature level under the guidance of linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences demonstrate the performance and generalization ability of our method against state-of-the-art methods.
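
To make the two fusion-and-alignment ideas concrete, below is a minimal PyTorch sketch. It is not the paper's released implementation: the module and function names (LanguageGuidedFusion, alignment_loss), the tensor shapes, the sigmoid-gated mixing of appearance and motion maps, and the symmetric InfoNCE-style contrastive form of the alignment loss are all illustrative assumptions about one plausible way such components could look.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedFusion(nn.Module):
    """Fuse appearance and motion feature maps at one feature level,
    with a per-channel gate predicted from the sentence embedding
    (an assumed gating form, not the paper's exact module)."""

    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.gate = nn.Linear(lang_dim, vis_dim)          # sentence -> channel gate
        self.proj = nn.Conv2d(2 * vis_dim, vis_dim, 1)    # merge back to vis_dim

    def forward(self, appearance, motion, lang):
        # appearance, motion: (B, C, H, W); lang: (B, lang_dim)
        g = torch.sigmoid(self.gate(lang))[:, :, None, None]  # (B, C, 1, 1)
        mixed = g * appearance + (1.0 - g) * motion            # language-weighted mix
        return self.proj(torch.cat([mixed, appearance], dim=1))

def alignment_loss(visual, lang, temperature: float = 0.07):
    """Symmetric contrastive loss pulling pooled visual and linguistic
    embeddings of the same sample together in a shared space,
    one common way to narrow a cross-modal semantic gap."""
    v = F.normalize(visual, dim=-1)    # (B, D) pooled visual features
    t = F.normalize(lang, dim=-1)      # (B, D) sentence features
    logits = v @ t.t() / temperature   # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

In this sketch, the gate lets the sentence decide, channel by channel, how much the motion stream should contribute relative to appearance, while the loss treats matched visual-language pairs within a batch as positives and all other pairs as negatives.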

Related articles:
arXiv:1910.01197 [cs.CV] (Published 2019-10-02)
Automatic Group Cohesiveness Detection With Multi-modal Features
arXiv:2112.10992 [cs.CV] (Published 2021-12-21, updated 2022-04-24)
Expansion-Squeeze-Excitation Fusion Network for Elderly Activity Recognition
arXiv:1808.02632 [cs.CV] (Published 2018-08-08)
Question-Guided Hybrid Convolution for Visual Question Answering