###### tags: Paper Reading

# Video Google: A Text Retrieval Approach to Object Matching in Videos

## Outline

This paper is about how to match an object across the frames of a video collection.

## Introduction/Motivation

Video retrieval is an important problem for a search engine, much like searching for an image on Google Image Search. We need an efficient way to find a target object inside a large database of videos. This paper borrows ideas from text retrieval, such as the inverted index and tf-idf weighting, so let's first review how a traditional text search engine works.

## Text retrieval

* First, parse each document, i.e. cut it into individual words.
* Second, represent the words by their stems, e.g. **'walk'**, **'walking'** and **'walks'** are all represented by **'walk'**.
* Third, use a stop list to remove common, uninformative words such as **'an'**, **'a'**, **'the'**.
* Fourth, assign each remaining word a unique ID.
* Fifth, weight the words using term frequency-inverse document frequency (tf-idf). The main idea of tf-idf is that a word appearing frequently within a document is important for that document, but a word appearing in many documents carries little information. Concretely, the weight of word $i$ in document $d$ is $t_i = \frac{n_{id}}{n_d}\log\frac{N}{n_i}$, where $n_{id}$ is the number of occurrences of word $i$ in $d$, $n_d$ is the total number of words in $d$, $n_i$ is the number of documents containing word $i$, and $N$ is the total number of documents.
* Finally, each document is represented by its vector of word weights, and an inverted index maps every word to the documents (and positions) where it occurs.

After these steps, when a query arrives we can rank the documents and also know where in each document the query words appear. (A small code sketch of this pipeline is given in the appendix at the end of this note.)

## Image retrieval

Now, how do we apply these text-retrieval methods to objects in video?

* First, detect feature regions in every frame using the Shape Adapted (SA) and Maximally Stable (MS) region detectors. (For the details of SA and MS, see the paper's "Viewpoint invariant description" section on p. 2.)
* Second, describe each region with a descriptor and cluster the descriptors with k-means; each cluster plays the role of a word in text retrieval, so the clusters are called visual words.
* Third, following the idea of a stop list, remove the most frequent 5% and least frequent 10% of visual words, and use spatial consistency between matched regions to reject far-off outlier matches.
* Fourth, represent all regions in the same cluster by the same visual-word ID, and store in an inverted file which frames each visual word appears in. (A sketch of this visual-word step is also included in the appendix.)

![](https://i.imgur.com/TNb95zz.png)

## Conclusion

This paper taught me a lot about building a search engine for video. Treating visual features as words astonished me: it is not intuitive, yet it delivered top performance for several years. The paper also feeds back into text retrieval, raising the question of how we can build more representative document representations. Above all, this paper bridges two different research areas, and it is a wonderful paper.
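
## Appendix: code sketches

To make the text-retrieval pipeline above concrete, here is a minimal Python sketch of tf-idf weighting with an inverted index. The toy corpus and the helper names (`docs`, `tfidf`, `rank`) are my own and not from the paper, and scoring here is a simple sum of query-term weights rather than the normalized scalar product of tf-idf vectors used in the paper.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: each "document" is already parsed, stemmed, and stop-worded.
docs = {
    "d1": ["walk", "dog", "park"],
    "d2": ["walk", "walk", "car"],
    "d3": ["car", "engine", "repair"],
}

N = len(docs)

# Inverted index: term -> {doc_id: number of occurrences in that doc}
inverted = defaultdict(dict)
for doc_id, terms in docs.items():
    for term, n_id in Counter(terms).items():
        inverted[term][doc_id] = n_id

def tfidf(term, doc_id):
    """Weight of `term` in `doc_id`: (n_id / n_d) * log(N / n_i)."""
    n_id = inverted[term].get(doc_id, 0)   # occurrences of the term in this doc
    n_d = len(docs[doc_id])                # total number of terms in this doc
    n_i = len(inverted[term])              # number of docs containing the term
    return (n_id / n_d) * math.log(N / n_i) if n_i else 0.0

def rank(query_terms):
    """Score documents by summing tf-idf weights of the query terms they contain."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id in inverted.get(term, {}):
            scores[doc_id] += tfidf(term, doc_id)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank(["walk", "car"]))   # ranks d2 highest on this toy data
```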
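
On the image side, the same machinery applies once regions are quantized into visual words. Below is a simplified sketch, assuming the 128-D SIFT descriptors of each frame's SA/MS regions have already been computed (the random `descriptors_per_frame` arrays just stand in for them) and using scikit-learn's k-means; the paper actually builds separate vocabularies for SA and MS regions and clusters with the Mahalanobis distance, which this sketch omits.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed input: one array of 128-D descriptors per video frame,
# computed on SA/MS regions by some feature extractor (dummy data here).
rng = np.random.default_rng(0)
descriptors_per_frame = [rng.random((200, 128)).astype(np.float32) for _ in range(5)]

# 1. Build the visual vocabulary by k-means clustering of all descriptors.
#    (The paper uses thousands of clusters; 50 is just for the toy data.)
K = 50
all_desc = np.vstack(descriptors_per_frame)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_desc)

# 2. Quantize: each region is replaced by the ID of its nearest cluster center,
#    so every frame becomes a "document" of visual-word IDs.
frames_as_words = [kmeans.predict(desc).tolist() for desc in descriptors_per_frame]

# 3. Stop list in the spirit of the paper: drop the most and least frequent words.
counts = np.bincount(np.concatenate(frames_as_words), minlength=K)
order = np.argsort(counts)                                  # ascending frequency
stopped = set(order[-int(0.05 * K):]) | set(order[:int(0.10 * K)])  # top 5%, bottom 10%
frames_as_words = [[w for w in frame if w not in stopped] for frame in frames_as_words]

# These visual-word lists can now be indexed and ranked exactly like text documents,
# e.g. with the tf-idf / inverted-index sketch above.
```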