Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Do Video-Language Foundation Models have a Sense of Time?

Piyush Nitin Bagad · Makarand Tapaswi · Cees G Snoek

Keywords: [ contrastive learning ] [ temporal understanding ] [ Video-language models ]


Abstract:

Modelling and understanding time remains a challenge in contemporary video understanding models. Time also appears in language through temporal relations. Video-language models can benefit from having a sense of time, especially since language provides an interface for generalization. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We construct a simple synthetic dataset to measure such temporal understanding in video-language models and find that six existing models struggle to understand even such simple relations. We then posit whether it is feasible to equip these foundation models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without needing data- and compute-intense training from scratch.

Chat is not available.