between the query
and target image [19], [22]. The result is that objects can
be recognized despite significant changes in viewpoint and
some degree of illumination variation, and, because the
representation is built from multiple local regions, despite
partial occlusion, since some of the regions remain visible
in such cases. Examples of
extracted regions and matches are shown in Figs. 2 and 5.
In this paper, we cast this approach as one of text
retrieval. In essence, this requires a visual analogy of a
word, and here we provide this by vector quantizing the
descriptor vectors. The benefit of the text retrieval
approach is that matches are effectively precomputed so
that at run time frames and shots containing any particular
object can be retrieved with no delay. This means that any
object occurring in the video (and conjunctions of objects)
can be retrieved even though there was no explicit interest
in these objects when descriptors were built for the video.

Manuscript received June 10, 2007; revised November 25, 2007. This work was supported in part by the Mathematical and Physical Sciences Division, University of Oxford, and in part by EC Project Vibes. The authors are with the Department of Engineering Science, University of Oxford. Digital Object Identifier: 10.1109/JPROC.2008.916343. Proceedings of the IEEE, Vol. 96, No. 4, April 2008.

Note that the goal of this research is to retrieve
instances of a specific object, e.g., a specific bag or a
building with a particular logo (Figs. 1 and 2). This is in
contrast to retrieval and recognition of "object/scene
categories" [8], [11], [13], [14], [35], [44], sometimes also
called "high-level features" or "concepts" [4], [47], such as
"bags," "buildings," or "cars," where the goal is to find
any bag, building, or car, irrespective of its shape, color,
appearance, or any particular markings/logos.
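The "matches are effectively precomputed" idea can be pictured as an inverted index from visual words to the frames that contain them, exactly as a text search engine indexes documents. The following is our own minimal sketch, not the authors' implementation; the frame and word IDs are invented for illustration:

```python
from collections import defaultdict

# Hypothetical data: each frame is described by the set of visual words
# (quantized descriptor IDs) detected in it.
frames = {
    "frame_001": {12, 47, 90},
    "frame_002": {12, 90},
    "frame_003": {5, 47},
}

# Built once, offline: visual word -> set of frames containing it.
index = defaultdict(set)
for frame_id, words in frames.items():
    for w in words:
        index[w].add(frame_id)

def retrieve(query_words):
    """Return frames containing every visual word of the query object."""
    result = None
    for w in query_words:
        postings = index.get(w, set())
        result = postings if result is None else result & postings
    return sorted(result or [])

# At run time a query reduces to set intersections over the index,
# so frames containing the query object come back with no delay.
print(retrieve({12, 90}))  # -> ['frame_001', 'frame_002']
```

A real system would additionally rank the retrieved frames (e.g., by a tf-idf-style weighting, as in text retrieval) rather than return an unranked intersection; the sketch only shows why the run-time cost is independent of when the descriptors were built.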
We describe the steps by which we are able to use text
retrieval methods for object retrieval in Section II. Then, in
Section III, we evaluate the proposed approach on a ground
truth set of six object queries. Object retrieval results,
including searches from within the movie and specified by
external images, are shown on feature films: "Groundhog
Day" [Ramis, 1993], "Charade" [Donen, 1963], and "Pretty
Woman" [Marshall, 1990]. Finally, in Section IV, we
discuss three challenges for the presented video retrieval
approach and review some recent work addressing them.
II. TEXT RETRIEVAL APPROACH
TO OBJECT MATCHING
This section outlines the steps in building an object
retrieval system by combining methods from computer
vision and text retrieval.
Each frame of the video is represented by a set of
overlapping (local) regions with each region represented by
a visual word computed from its appearance. Section II-A
describes the visual regions and descriptors used.
Section II-B then describes their vector quantization into
visual "words." Sections II-C and II-D then show how text
retrieval techniques are applied to this visual word
representation. We will use the film "Groundhog Day"
as our running example, though the same method is
applied to all the feature films used in this paper.
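To make the region-to-word step above concrete: vector quantization assigns each region's descriptor to the nearest cluster centre of a precomputed vocabulary, and the index of that centre is the visual word. A minimal sketch follows; the toy 2-D descriptors and centres are invented for illustration (real descriptors are high-dimensional, e.g., 128-D):

```python
# Hypothetical visual vocabulary: cluster centres obtained by an
# offline clustering (e.g., k-means) of training descriptors.
vocabulary = [
    (0.0, 0.0),   # visual word 0
    (1.0, 1.0),   # visual word 1
    (5.0, 5.0),   # visual word 2
]

def quantize(descriptor):
    """Map a descriptor to its visual word: index of the nearest centre."""
    def sq_dist(centre):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centre))
    return min(range(len(vocabulary)), key=lambda i: sq_dist(vocabulary[i]))

# Every region descriptor in a frame becomes one "word", so the frame
# is represented as a visual document of word indices.
frame_descriptors = [(0.9, 1.2), (4.8, 5.1), (0.1, -0.2)]
visual_document = [quantize(d) for d in frame_descriptors]
```

Once frames are reduced to such lists of word indices, standard text retrieval machinery (term frequencies, inverted files) applies to them unchanged, which is what the following subsections exploit.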
A. Viewpoint Invariant Description
The goal is to extract a description of an object from an
image which will be largely unaffected by