INVITED PAPER
Efficient Visual Search for Objects in Videos
Visual search using text-retrieval methods can rapidly and accurately
locate objects in videos despite changes in camera viewpoint,
lighting, and partial occlusions.
By Josef Sivic and Andrew Zisserman
ABSTRACT | We describe an approach to generalize the concept of text-based search to nontextual information. In particular, we elaborate on the possibilities of retrieving objects or scenes in a movie with the ease, speed, and accuracy with which Google [9] retrieves web pages containing particular words, by specifying the query as an image of the object or scene. In our approach, each frame of the video is represented by a set of viewpoint invariant region descriptors. These
descriptors enable recognition to proceed successfully despite
changes in viewpoint, illumination, and partial occlusion.
Vector quantizing these region descriptors provides a visual
analogy of a word, which we term a "visual word." Efficient
retrieval is then achieved by employing methods from statistical
text retrieval, including inverted file systems, and text and
document frequency weightings. The final ranking also depends
on the spatial layout of the regions. Object retrieval results
are reported on the full-length feature films "Groundhog Day,"
"Charade," and "Pretty Woman," including searches from
within the movie and also searches specified by external
images downloaded from the Internet. We discuss three
research directions for the presented video retrieval approach
and review some recent work addressing them: 1) building
visual vocabularies for very large-scale retrieval; 2) retrieval of
3-D objects; and 3) more thorough verification and ranking
using the spatial structure of objects.
KEYWORDS | Object recognition; text retrieval; viewpoint and
scale invariance
I. INTRODUCTION
The aim of this research is to retrieve those key frames and
shots of a video containing a particular object with the
ease, speed, and accuracy with which web search engines
such as Google [9] retrieve text documents (web pages)
containing particular words. An example visual object
query and retrieved results are shown in Fig. 1. This paper
investigates whether a text retrieval approach can be
successfully employed for this task.
Identifying an (identical) object in a database of images
is a challenging problem because the object can have a
different size and pose in the target and query images, and
also the target image may contain other objects ("clutter")
that can partially occlude the object of interest. However,
successful methods now exist which can match an object’s
visual appearance despite differences in viewpoint, lighting,
and partial occlusion [22]–[24], [27], [32], [38], [39],
[41], [49], [50]. Typically, an object is represented by a set
of overlapping regions each represented by a vector
computed from the region’s appearance. The region
extraction and descriptors are built with a controlled
degree of invariance to viewpoint and illumination
conditions. Similar descriptors are computed for all images
in the database. Recognition of a particular object proceeds
by nearest neighbor matching of the descriptor vectors,
followed by disambiguating or voting using the spatial
consistency of the matched regions, for example by
computing an affine transformation between the query and target images.
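The matching-and-verification pipeline described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: descriptors are assumed to be plain NumPy vectors, nearest-neighbor matching is disambiguated with a simple distance-ratio test, and spatial consistency is checked by a least-squares affine fit to the matched region centers. The function names are hypothetical.

```python
import numpy as np

def match_descriptors(query_desc, target_desc, ratio=0.8):
    """Nearest-neighbor matching of descriptor vectors.

    A query descriptor is matched to its nearest target descriptor
    only if the nearest distance is clearly smaller than the
    second-nearest (a ratio test), which rejects ambiguous matches.
    Returns a list of (query_index, target_index) pairs.
    """
    matches = []
    for i, d in enumerate(query_desc):
        dists = np.linalg.norm(target_desc - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

def fit_affine(query_pts, target_pts):
    """Least-squares affine transform mapping query points to target points.

    Each correspondence (x, y) -> (u, v) contributes two linear
    equations in the six affine parameters; at least three
    non-collinear correspondences are required.
    Returns [a11, a12, tx, a21, a22, ty].
    """
    n = len(query_pts)
    A = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for k, ((x, y), (u, v)) in enumerate(zip(query_pts, target_pts)):
        A[2 * k] = [x, y, 1, 0, 0, 0]
        A[2 * k + 1] = [0, 0, 0, x, y, 1]
        b[2 * k], b[2 * k + 1] = u, v
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params
```

In practice, the affine fit would be wrapped in a robust estimator (e.g., RANSAC) so that mismatched regions do not corrupt the transformation; the residual of each match under the fitted transform then serves as the spatial-consistency score.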