Find and Focus: Retrieve and Localize Video Events with Natural Language Queries

Dian Shao* 1, Yu Xiong* 1 , Yue Zhao 1 , Qingqiu Huang 1 , Yu Qiao 2 , Dahua Lin 1

1CUHK-Sensetime Joint Lab, The Chinese University of Hong Kong
2Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
15th European Conference on Computer Vision (ECCV) 2018


The thriving of video sharing services brings new challenges to video retrieval, e.g., the rapid growth in video duration and content diversity. Meeting such challenges calls for new techniques that can effectively retrieve videos with natural language queries. Existing methods along this line, which mostly rely on embedding videos as a whole, remain far from satisfactory for real-world applications due to their limited expressive power. In this work, we aim to move beyond this limitation by delving into the internal structures of both sides, the queries and the videos. Specifically, we propose a new framework called Find and Focus (FIFO), which not only performs top-level matching (paragraph vs. video), but also makes part-level associations, localizing a video clip for each sentence in the query with the help of a focusing guide. These levels are complementary – the top-level matching narrows the search while the part-level localization refines the results. On both ActivityNet Captions and modified LSMDC datasets, the proposed framework achieves remarkable performance gains.
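The two levels described above can be illustrated with a small retrieve-then-rerank sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the "best clip per sentence" rule, and the additive score combination are all assumptions made for illustration, and the embeddings are toy 2-D vectors rather than learned features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find_stage(query_vec, video_vecs, top_k):
    """Top-level matching: rank whole videos against the whole paragraph, keep top_k."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(video_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:top_k]

def focus_stage(sentence_vecs, clip_vecs):
    """Part-level matching: pick the best clip for each sentence.

    Assumes both lists are non-empty. Returns (mean best similarity, clip indices).
    """
    picks, total = [], 0.0
    for s in sentence_vecs:
        sims = [cosine(s, c) for c in clip_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        picks.append(best)
        total += sims[best]
    return total / len(sentence_vecs), picks

def find_and_focus(query_vec, sentence_vecs, video_vecs, video_clips, top_k=2):
    """Rerank the find-stage candidates by adding a part-level score (assumed rule)."""
    results = []
    for top_score, vid in find_stage(query_vec, video_vecs, top_k):
        part_score, picks = focus_stage(sentence_vecs, video_clips[vid])
        results.append((top_score + part_score, vid, picks))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results

# Toy data (hypothetical): three videos, each with two clips, and a two-sentence query.
video_vecs = [[1, 0], [0.8, 0.6], [0, 1]]
video_clips = [
    [[1, 0], [1, 0]],   # video 0: both clips match only the first sentence
    [[1, 0], [0, 1]],   # video 1: one clip per sentence
    [[0, 1], [0, 1]],   # video 2: both clips match only the second sentence
]
query_vec = [1, 0.2]
sentence_vecs = [[1, 0], [0, 1]]

ranked = find_and_focus(query_vec, sentence_vecs, video_vecs, video_clips, top_k=2)
for score, vid, picks in ranked:
    print(f"video {vid}: score={score:.3f}, clip per sentence={picks}")
```

On this toy data, video 0 scores highest in the find stage, but video 1 moves to the top after refinement because each query sentence has a well-matching clip in it. Video 2 is pruned by the find stage and never reaches the part-level step, which is the sense in which the top level narrows the search.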



Retrieval performance (find + focus) on ActivityNet Captions

Method                            Recall@1  Recall@5  Recall@10  Recall@50  Median Rank
Random                                0.02      0.10       0.20       1.02         2458
LSTM-YT                                  0         4          -         24          102
S2VT                                     5        14          -         32           78
Krishna et al.                          14        32          -         65           34
VSE (find stage)                     11.69     34.66      50.03      85.66           10
Ours (find + refine in Top 20)       14.11     37.12      52.13          -           10
Ours (find + refine in Top 100)      14.05     37.40      52.94      86.72            9
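For readers unfamiliar with the metrics in the table, the sketch below shows how Recall@K and median rank are commonly computed from the rank of the ground-truth video for each query. It assumes ranks are 1-indexed and that recall is reported as a percentage, which appears consistent with the Random row but is not stated on this page; the function names and the sample ranks are hypothetical.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth video appears in the top k results."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Median position of the ground-truth video across all queries (lower is better)."""
    return median(ranks)

# Hypothetical 1-indexed ranks of the correct video for five queries.
ranks = [1, 3, 7, 12, 60]

for k in (1, 5, 10, 50):
    print(f"Recall@{k}: {recall_at_k(ranks, k):.1f}")
print("Median rank:", median_rank(ranks))
```

Higher Recall@K and lower median rank both indicate better retrieval, which is how the rows of the table should be compared.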

Examples of Find and Focus Results



    author = {Dian Shao and Yu Xiong and Yue Zhao and Qingqiu Huang and Yu Qiao and Dahua Lin},
    title = {Find and Focus: Retrieve and Localize Video Events with Natural Language Queries},
    booktitle = {15th European Conference on Computer Vision (ECCV)},
    year = {2018}


Coming soon.