INSTITUTO DE COMPUTAÇÃO

 

Doctoral Thesis: Otávio Augusto Bizetto Penatti

Date:
29/11/2012 (All day)
Venue:
IC 2 Auditorium - Room 85

Title:

Image and Video Representations based on Visual Dictionaries (Representações de Imagens e Videos baseadas em Dicionários Visuais)

Abstract:

Effectively encoding visual properties from multimedia content is challenging. One popular approach to this challenge is the visual dictionary model. In this model, an image is treated as an unordered set of local features and is represented by the so-called bag-of-(visual-)words vector. In this thesis, we work on three research problems related to the visual dictionary model.
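As an illustration of the bag-of-(visual-)words vector mentioned above, the following is a minimal sketch and not the implementation used in the thesis. It assumes hard assignment (each local feature votes for its nearest visual word) and a dictionary that already exists; the function name, the toy dictionary, and the toy two-dimensional features are all hypothetical.

```python
from math import dist

def bag_of_visual_words(features, dictionary):
    """Map a set of local feature vectors to a normalized visual-word histogram."""
    histogram = [0.0] * len(dictionary)
    for feature in features:
        # Hard assignment (an assumption): vote for the nearest visual word.
        nearest = min(range(len(dictionary)),
                      key=lambda k: dist(feature, dictionary[k]))
        histogram[nearest] += 1.0
    total = sum(histogram)
    # Normalize so images with different numbers of features are comparable.
    return [count / total for count in histogram] if total else histogram

# Toy 2-D "descriptors"; real local descriptors have many more dimensions.
toy_dictionary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
toy_vector = bag_of_visual_words(toy_features, toy_dictionary)
```

The order of the features does not affect the result, which is what "unordered set" means in practice: only the counts per visual word survive.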
The first research problem concerns the generalization power of dictionaries, that is, the ability to represent images from one dataset well even when the dictionary was created from another dataset or from only a sample of the same dataset. We perform experiments on closed datasets as well as in a Web environment. The results suggest that a sample that is diverse in appearance is enough to generate a good dictionary.
The second research problem concerns the spatial information of visual words in the image space, which can be crucial for distinguishing types of objects and scenes. Traditional pooling methods usually discard the spatial configuration of visual words in the image. We propose a pooling method, named Words Spatial Arrangement (WSA), which encodes the relative positions of visual words in the image and generates more compact feature vectors than most existing spatial pooling strategies. Image retrieval experiments show that WSA outperforms the most popular spatial pooling method, Spatial Pyramids.
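To make the point about vector size concrete, the sketch below illustrates the Spatial Pyramids baseline, not WSA, whose details are not given in this abstract. It is a simplified version under stated assumptions: visual words arrive as hypothetical (x, y, word_id) triples, each pyramid level splits the image into a 2^level by 2^level grid, and the per-level weighting used in the standard formulation is omitted.

```python
def spatial_pyramid(words, width, height, vocab_size, levels=2):
    """Concatenate one visual-word histogram per grid cell, for each level."""
    vector = []
    for level in range(levels):
        cells = 2 ** level  # 1x1 grid at level 0, 2x2 at level 1, ...
        grid = [[0] * vocab_size for _ in range(cells * cells)]
        for x, y, word_id in words:
            # Clamp so points on the right or bottom border stay in range.
            col = min(int(x * cells / width), cells - 1)
            row = min(int(y * cells / height), cells - 1)
            grid[row * cells + col][word_id] += 1
        for cell_histogram in grid:
            vector.extend(cell_histogram)
    return vector

# Hypothetical toy input: four visual words in a 100x100 image.
toy_words = [(10, 10, 0), (90, 10, 1), (10, 90, 1), (90, 90, 2)]
pyramid_vector = spatial_pyramid(toy_words, width=100, height=100, vocab_size=3)
```

With a dictionary of 3 words and 2 levels the vector already has 3 × (1 + 4) = 15 dimensions; the size grows with every added level, which is the cost that a more compact spatial encoding aims to avoid.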
The third research problem concerns the lack of semantic information in the visual dictionary model. We show that the absence of semantics in the space of low-level descriptions is reduced when we move to the bag-of-words representation. However, we also show that even the bag-of-words space carries very little semantic information. We therefore investigate moving one step further and propose a representation based on visual words that are less local: our proposed dictionary is based on scenes. We use this dictionary of scenes for video representation in video geocoding experiments, where video geocoding is the task of assigning a geographic location to a given video. The evaluation was performed in the context of the Placing Task of the MediaEval challenge, and the proposed bag-of-scenes model showed promising performance.
 

Instituto de Computação :: Universidade Estadual de Campinas :: Av. Albert Einstein, 1251 - Cidade Universitária, Campinas/SP - Brasil, CEP 13083-852 • Phone: [19] 3521-5838