arXiv:2411.12951 [cs.CV]

On the Consistency of Video Large Language Models in Temporal Comprehension

Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao

Published 2024-11-20 (Version 1)

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet such temporal comprehension capabilities remain neither well studied nor well understood. We therefore conduct a study of prediction consistency -- a key indicator of the robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check whether the model's responses align with this initial grounding, as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video content, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To this end, we propose event temporal verification tuning, which explicitly accounts for consistency, and demonstrate significant improvements in both grounding and consistency. Our data and code will be available at https://github.com/minjoong507/Consistency-of-Video-LLM.
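The consistency probing described above can be sketched with a simple check: compare the moment a model grounds for the original query against the moment it returns under a probe (e.g., a paraphrased query), scored by temporal IoU. This is a minimal illustration, not the paper's implementation; the function names, the moment format `(start, end)` in seconds, and the 0.5 threshold are assumptions for the sketch.

```python
def temporal_iou(a, b):
    """Temporal IoU between two moments, each a (start, end) pair in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # overlap length
    union = max(a[1], b[1]) - min(a[0], b[0])            # span of both moments
    return inter / union if union > 0 else 0.0


def is_consistent(initial_moment, probed_moment, threshold=0.5):
    """A probe response counts as consistent if its grounded moment
    sufficiently overlaps the initial grounding (hypothetical threshold)."""
    return temporal_iou(initial_moment, probed_moment) >= threshold


# Example: the model initially grounds (12.0, 20.0); a paraphrased query
# yields (14.0, 21.0) -- overlapping enough to count as consistent here.
print(is_consistent((12.0, 20.0), (14.0, 21.0)))
```

A full evaluation would aggregate such per-probe decisions over many videos and probe types (video perturbations, query rephrasings, task reformulations) into a consistency rate.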

Related articles: Most relevant | Search more
arXiv:2101.11342 [cs.CV] (Published 2021-01-27)
Towards Improving the Consistency, Efficiency, and Flexibility of Differentiable Neural Architecture Search
arXiv:2311.11865 [cs.CV] (Published 2023-11-20)
VLM-Eval: A General Evaluation on Video Large Language Models
arXiv:2207.13744 [cs.CV] (Published 2022-07-27)
Lighting (In)consistency of Paint by Text