Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

School of Computer Science and Engineering, Kyungpook National University
*Equal Contribution
†Corresponding authors (equal leading contribution). Email: {s.park, hypark}@knu.ac.kr
CVPR 2025

Abstract

In recent text-video retrieval, the use of additional captions from vision-language models has shown promising performance gains. However, existing models that use additional captions often struggle to capture the rich semantics inherent in the video, including its temporal changes. In addition, incorrect information produced by generative models can lead to inaccurate retrieval.

To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) a dual-modal matching score that adds the query-video and query-narration similarities, and 4) a hard-negative loss that learns discriminative features by combining these two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.

Method

NarVid Framework

The method first generates frame-level captions (the narration) for each video. From the frame-level features of the video and the narration, enhanced features are obtained through cross-modal interaction using co-attention and temporal blocks. (a) These enhanced features are further refined by query-aware adaptive filtering. (b) The query-video and query-narration similarity matrices obtained through multi-granularity matching are then used for both training and inference. (c) To enhance the discriminative ability of the model, we additionally apply a cross-view hard-negative loss during training.
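The scoring steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the softmax-pooling form of the filtering, the temperature `tau`, the margin value, and the exact shape of the cross-view hard-negative loss are all assumptions made for clarity; it only shows how per-frame relevance weighting, dual-modal score fusion, and a cross-view margin loss can fit together.

```python
import numpy as np

# Illustrative sketch only; all names, shapes, and loss forms are assumptions.
rng = np.random.default_rng(0)
B, F, d = 4, 8, 16   # batch of matched query-video pairs, frames per video, feature dim

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

query = l2norm(rng.standard_normal((B, d)))        # query features
video = l2norm(rng.standard_normal((B, F, d)))     # enhanced frame features
narr  = l2norm(rng.standard_normal((B, F, d)))     # frame-caption (narration) features

def filtered_similarity(q, frames, tau=0.1):
    """Query-aware pooling: softmax-weight each frame by its relevance to the query."""
    sim = np.einsum('qd,vfd->qvf', q, frames)      # per-frame cosine similarity
    w = np.exp(sim / tau)
    w /= w.sum(-1, keepdims=True)                  # weights over frames
    return (w * sim).sum(-1)                       # (B, B) similarity matrix

s_qv = filtered_similarity(query, video)           # query-video view
s_qn = filtered_similarity(query, narr)            # query-narration view
score = s_qv + s_qn                                # dual-modal matching score

def cross_view_hard_negative_loss(s_pick, s_score, margin=0.2):
    """Hypothetical cross-view loss: the hardest negative found under one
    similarity view is penalized with a margin under the other view."""
    n = len(s_pick)
    masked = s_pick - np.eye(n) * 1e9              # exclude the positive pair
    hard = masked.argmax(1)                        # hardest-negative index per query
    pos = np.diag(s_score)                         # positive-pair scores
    neg = s_score[np.arange(n), hard]              # cross-view hard-negative scores
    return np.maximum(0.0, margin + neg - pos).mean()

loss = (cross_view_hard_negative_loss(s_qv, s_qn)
        + cross_view_hard_negative_loss(s_qn, s_qv))
```

Retrieval at inference would rank videos per query by `score`; the cross-view pairing is what lets each similarity view correct negatives that the other view finds hard.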

Qualitative Results

To analyze the effectiveness of using narration in text-video retrieval, we provide examples of retrieval results and generated frame-level captions for three datasets: MSVD [1], VATEX [2], and DiDeMo [3]. Additionally, we include examples of incorrect results caused by short and general queries.

BibTeX

@article{hur2025narratingthevideo,
  author    = {Chan Hur and Jeong-hun Hong and Dong-hun Lee and Dabin Kang and Semin Myeong and Sang-hyo Park and Hyeyoung Park},
  title     = {Narrating the Video: Boosting Text-Video Retrieval via Comprehensive
    Utilization of Frame-Level Captions},
  journal   = {arXiv preprint arXiv:2503.05186},
  year      = {2025},
}

References

  1. David Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, 2011.
  2. Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
  3. Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pages 5803–5812, 2017.