Abstract—With an increasing number of images that are
available in social media, image annotation has emerged as
an important research topic due to its application in image
matching and retrieval. Most studies cast image annotation into
a multi-label classification problem. The main shortcoming of
this approach is that it requires a large number of training
images with clean and complete annotations in order to learn
a reliable model for tag prediction. We address this limitation by
developing a novel approach that combines the strength of tag
ranking with the power of matrix recovery. Instead of having to
make a binary decision for each tag, our approach ranks tags
in the descending order of their relevance to the given image,
significantly simplifying the problem. In addition, the proposed
method aggregates the prediction models for different tags into
a matrix, and casts tag ranking into a matrix recovery problem.
It introduces the matrix trace norm to explicitly control the
model complexity so that a reliable prediction model can be
learned for tag ranking even when the tag space is large and the
number of training images is limited. Experiments on multiple
well-known image datasets demonstrate the effectiveness of the
proposed framework for tag ranking compared to the state-of-the-art approaches for image annotation and tag ranking.
INTRODUCTION
The popularity of digital cameras and mobile phone cameras has led to an explosive growth of digital images available over the Internet. How to accurately retrieve images
from enormous collections of digital photos has become
an important research topic. Content-based image retrieval
(CBIR) addresses this challenge by identifying the matched
images based on their visual similarity to a query image [1].
However, due to the semantic gap between the low-level visual
features used to represent images and the high-level semantic
tags used to describe image content, limited performance is
achieved by CBIR techniques [1], [2]. To address the limitation
of CBIR, many algorithms have been developed for tag based
image retrieval (TBIR) that represents images by manually
assigned keywords/tags. It allows a user to express his/her information needs through a textual query and find the relevant
images based on the match between the textual query and the
assigned image tags. Recent studies have shown that TBIR is
usually more effective than CBIR in identifying the relevant
images [3].
Since it is time-consuming to manually label images, various
algorithms have been developed for automatic image
annotation [4]–[11]. Many studies view image annotation as
a multi-label classification problem [12]–[17], where in the
simplest case, a binary classification model is built for each
tag. The main shortcoming of this approach is that in order
to train a reliable model for tag prediction, it requires a
large number of training images with clean and complete
annotations. In this work, we focus on the tag ranking approach
for automatic image annotation [18]–[25]. Instead of
having to decide, for each tag, if it should be assigned to
a given image, the tag ranking approach ranks tags in the
descending order of their relevance to the given image. By
avoiding a binary decision for each tag, the tag ranking
approach significantly simplifies the problem, leading to a
better performance than the traditional classification based
approaches for image annotation [25]. In addition, studies have
shown that tag ranking approaches are more robust to noisy
and missing tags than the classification approaches [24].
Although multiple algorithms have been developed for tag
ranking, they tend to perform poorly when the number of
training images is limited compared to the number of tags, a
scenario often encountered in real world applications [26]. In
this work, we address this limitation by casting tag ranking
into a matrix recovery problem [27]. The key idea is to
aggregate the prediction models for different tags into a matrix.
Instead of learning each prediction model independently, we
propose to learn all the prediction models simultaneously by
exploring the theory of matrix recovery, where a trace norm
regularization is introduced to capture the dependence among
different tags and to control the model complexity. We show,
both theoretically and empirically, that with the introduction
of the trace norm regularizer, a reliable prediction model can be
learned for tag ranking even when the tag space is large and
the number of training images is small. We note that although
the trace norm regularization has been studied extensively for
classification [28], [29], this is the first study that exploits trace
norm regularization for tag ranking.
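As a rough illustration of this joint learning scheme, the sketch below stacks all per-tag linear models as columns of a single matrix W and applies the trace norm penalty through its proximal operator, singular value thresholding. The loss (a simple squared loss standing in for a ranking loss), the learning rate, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def svt(W, tau):
    """Singular value thresholding: the proximal operator of the trace norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def learn_tag_models(X, Y, lam=0.1, lr=0.01, iters=200):
    """Jointly learn per-tag linear models as one matrix W of shape (d, m).

    X: (n, d) image features; Y: (n, m) tag relevance targets.
    Proximal gradient descent: a gradient step on the fitting loss,
    followed by singular value thresholding for the trace norm penalty.
    """
    d, m = X.shape[1], Y.shape[1]
    W = np.zeros((d, m))
    for _ in range(iters):
        grad = X.T @ (X @ W - Y) / X.shape[0]  # gradient of the squared loss
        W = svt(W - lr * grad, lr * lam)       # proximal step (trace norm)
    return W
```

Because one trace norm penalty is applied to the whole matrix, the per-tag columns are no longer learned independently, which is how the low-rank structure can capture the dependence among tags.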
The rest of the paper is organized as follows. Section 2 reviews the related work on automatic image annotation and tag ranking.
RELATED WORK
A. Automatic Image Annotation
Automatic image annotation aims to find a subset of keywords/
tags that describes the visual content of an image. It
plays an important role in bridging the semantic gap between
low-level features and high-level semantic content of images.
Most automatic image annotation algorithms can be classified
into three categories: (i) generative models that model the joint
distribution between tags and visual features, (ii) discriminative
models that view image annotation as a classification
problem, and (iii) search based approaches. Below, we will
briefly review approaches in each category.
Both mixture models and topic models, two well-known generative approaches, have been successfully applied
to automatic image annotation. In [12], a Gaussian mixture
model is used to model the dependence between keywords
and visual features. In [32]–[34], kernel density estimation
is applied to model the distribution of visual features and to
estimate the conditional probability of keyword assignments
given the visual features. Topic models annotate images as samples from a specific mixture of topics, where each topic is a joint distribution over image features and annotation keywords. Various topic models have been developed
for image annotation, including probabilistic latent semantic
analysis (pLSA) [35], latent Dirichlet allocation [36], [37] and
hierarchical Dirichlet processes [38]. Since a large number
of training examples are needed for estimating the joint
probability distribution over both features and keywords, the
generative models are unable to handle the challenge of a large tag space with a limited number of training images.
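As a rough sketch of the kernel density estimation idea above (not the exact formulation of [32]–[34]), each tag can be scored for a query image by the average Gaussian kernel weight of the training images that carry it; the bandwidth and the weighting scheme here are illustrative assumptions.

```python
import numpy as np

def kde_tag_scores(x, X_train, tag_matrix, bandwidth=1.0):
    """Score tags for query feature x via kernel density estimation.

    X_train: (n, d) training features; tag_matrix: (n, m) binary tag labels.
    A tag's score is the mean Gaussian kernel weight of the images carrying
    it, a stand-in for the conditional probability of the tag given x.
    """
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to x
    k = np.exp(-d2 / (2.0 * bandwidth ** 2))  # Gaussian kernel weights
    counts = np.maximum(tag_matrix.sum(axis=0), 1)
    return (tag_matrix.T @ k) / counts        # per-tag average kernel weight
```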
Discriminative models [39], [40] view image annotation as a multi-class classification problem, and learn one binary classification model for either one or multiple tags. A 2D
multiresolution hidden Markov model (MHMM) is proposed
to model the relationship between tags and visual content
[41]. A structured max-margin algorithm is developed in [42]
to exploit the dependence among tags. One problem with
discriminative approaches for image annotation is imbalanced
data distribution, because each binary classifier is designed to distinguish images of one class from images of all other classes. This imbalance becomes more severe when the number of classes/tags is
large [43]. Another limitation of these approaches is that they
are unable to capture the correlation among classes, which is
known to be important in multi-label learning. To overcome
these issues, several algorithms [16], [17], [44] have been proposed to harness keyword correlation as additional information.
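A minimal one-vs-rest sketch of the discriminative setting: one independent logistic classifier per tag, with the columns of a weight matrix updated in parallel. It deliberately exhibits the weaknesses discussed above (independent binary decisions, no use of tag correlation); all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def train_one_vs_rest(X, Y, lr=0.1, iters=300):
    """Train one logistic-regression model per tag (one-vs-rest).

    X: (n, d) features; Y: (n, m) binary tag matrix.
    Returns W of shape (d, m); each column is an independent tag classifier.
    """
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-X @ W))        # per-tag probabilities
        W -= lr * X.T @ (P - Y) / X.shape[0]    # gradient step, all tags at once
    return W

def predict_tags(W, x, threshold=0.5):
    """Independent binary decision per tag -- ignores tag correlations."""
    p = 1.0 / (1.0 + np.exp(-x @ W))
    return p >= threshold
```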
The search based approaches are based on the assumption
that visually similar images are more likely to share common
keywords [10]. Given a test image I, these methods first find a set of training images that are visually similar to I, and then assign the tags that are most popular among the similar images.
A divide-and-conquer framework is proposed in [45] which
identifies the salient terms from textual descriptions of visual
neighbours searched from web images. In the Joint Equal
Contribution (JEC) model proposed in [4], multiple distance functions are computed, each based on a different set of visual features, and the nearest neighbors are determined by the average of these distances. TagProp [7] predicts keywords
by taking a weighted combination of tags assigned to nearest
neighbor images. More recently, the sparse coding scheme
and its variations are employed in [5], [9], [14] to facilitate
image label propagation. Similar to the classification methods,
the search based approaches often fail when the number of
training examples is limited.
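The neighbor-voting assumption behind these search-based methods can be sketched as a distance-weighted vote over the k nearest training images. Unlike TagProp, which learns its neighbor weights, this sketch assumes a fixed inverse-distance weighting purely for illustration.

```python
import numpy as np

def neighbor_vote_tags(x, X_train, tag_matrix, k=3):
    """Score tags for query feature x by a weighted vote over the k nearest
    training images, in the spirit of neighbor-voting methods like TagProp.

    X_train: (n, d) training features; tag_matrix: (n, m) binary tag labels.
    """
    d2 = np.sum((X_train - x) ** 2, axis=1)
    nn = np.argsort(d2)[:k]                  # indices of k nearest neighbors
    w = 1.0 / (1.0 + d2[nn])                 # closer neighbors weigh more
    return (w @ tag_matrix[nn]) / w.sum()    # weighted tag frequencies
```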
B. Tag Ranking
Tag ranking aims to learn a ranking function that places relevant tags ahead of irrelevant ones. In the simplest
form, it learns a scoring function that assigns larger values
to the relevant tags than to those irrelevant ones. In [18], the
authors develop a classification framework for tag ranking that
computes tag scores for a test image based on the neighbor
voting. It was extended in [46] to the case where each image
is represented by multiple sets of visual features. Liu et al. [19] utilize kernel density estimation (KDE) to calculate relevance scores for different tags, and perform a random walk to further improve the performance of tag ranking by exploring the correlation between tags. Similarly, Tang et al. [47] propose a two-stage graph-based relevance propagation
approach. In [21], a two-view tag weighting method is proposed
to effectively exploit both the correlation among tags
and the dependence between visual features and tags. In [26],
a max-margin riffled independence model is developed for tag
ranking. As mentioned in the introduction, most existing algorithms for tag ranking tend to perform poorly when the tag space is large and the number of training images is limited.
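In its simplest form, as described at the start of this subsection, a tag ranking is just a sort by score, and its quality can be checked by counting relevant/irrelevant pairs ranked in the wrong order; the sketch below, including the violation-counting metric, is illustrative only.

```python
import numpy as np

def rank_tags(scores, tag_names):
    """Return tag names sorted by descending relevance score."""
    order = np.argsort(-scores)
    return [tag_names[i] for i in order]

def ranking_violations(scores, relevant):
    """Count pairs where an irrelevant tag outscores a relevant one.

    scores: (m,) predicted tag scores; relevant: (m,) boolean mask.
    A perfect tag ranking has zero violations.
    """
    rel = scores[relevant]
    irr = scores[~relevant]
    return int(np.sum(rel[:, None] <= irr[None, :]))
```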