Abstract—Multimedia data with associated semantics is
omnipresent in today’s social online platforms in the form of
keywords, user comments, and so forth. This article presents
a statistical framework designed to infer knowledge in the
imaging domain from the semantic domain. Note that this is
the reverse direction of common computer vision applications.
The framework relates keywords to image characteristics using
a statistical significance test. It scales to millions of images
and hundreds of thousands of keywords. We demonstrate the
usefulness of the statistical framework with three color imaging
applications: 1) semantic image enhancement: re-render an image
in order to adapt it to its semantic context; 2) color naming: find
the color triplet for a given color name; and 3) color palettes: find
a palette of colors that best represents a given arbitrary semantic
context and that satisfies established harmony constraints.
INTRODUCTION
THE semantic gap is a major challenge in the multimedia community. The gap is commonly understood as the difficulty of automatically inferring semantic information from the multimedia domain, as in face recognition or video classification.
In this article we investigate the gap in the reverse direction,
i.e. we are interested in applications that have the semantic
domain as a source and the multimedia domain as a target.
Bridging the gap in the reverse direction (see Fig. 1) might seem counterintuitive at first because the classic forward direction is a ubiquitous problem in our daily digital life: we want computers to gain a semantic understanding of digital content in order to ease search, classification, and so forth. However, today's social community websites often have plenty of user-generated semantic information already attached to multimedia content, such as tagged images on Flickr or user comments for
posted content on Facebook. It is thus reasonable to explore applications
that take existing semantic information as input and
infer meaningful information and actions in the image domain.
Many aspects of the research in this article are similar to
classic computer vision. We use large databases of annotated
photos and characterize the photos with image descriptors. We
then learn relations between the keywords and the image descriptors.
However, our work differs in two key aspects: the type
of descriptors used to describe the images and the mapping function
to relate the two domains.
Our image descriptors, which represent our output domain,
are chosen so that they are meaningful to a human observer.
Consequently, descriptors that are popular in classic computer vision, such as SIFT or HOG, cannot be used. Instead, we use simple color histograms or Fourier-domain histograms because they characterize visible aspects of the image domain, i.e., colors and sharpness, respectively. As the mapping function we can use a simple and very scalable statistical framework capable of associating numeric characteristics with semantic content, because we do not have to discriminate between similar semantic concepts.
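As an illustration of these descriptors, the following Python sketch computes a joint RGB color histogram and a radial Fourier-domain histogram for an image. The bin counts, the color space, and the radial binning are illustrative assumptions, not the exact parameters used in this article.

import numpy as np

def color_histogram(img_rgb, bins=8):
    """Joint RGB histogram, flattened and normalized to sum to 1.
    img_rgb: float array of shape (h, w, 3) with values in [0, 1]."""
    hist, _ = np.histogramdd(img_rgb.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.flatten()
    return hist / hist.sum()

def fourier_histogram(img_rgb, bins=8):
    """Histogram of spectral energy over radial frequency bands,
    a rough proxy for sharpness (more high-frequency energy = sharper)."""
    gray = img_rgb.mean(axis=2)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    radius /= radius.max()                      # normalize radius to [0, 1]
    hist, _ = np.histogram(radius, bins=bins, range=(0, 1), weights=spectrum)
    return hist / hist.sum()

Both descriptors are vectors whose individual dimensions (color bins, frequency bands) remain interpretable by a human observer, which is the property exploited below.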
The statistical framework is based on the Mann-Whitney-Wilcoxon (MWW) rank-sum test, which assesses whether the values in one set are significantly larger or smaller than those in another set. The test is non-parametric and thus does not require assumptions about the distribution of the input values, which is important to guarantee usability for a wide variety of inputs. The MWW test is also quite robust because it uses ranks instead of absolute values to compute the test statistic. Finally, it can
be implemented very efficiently, so that it can handle millions of
images and hundreds of thousands of keywords on a single-processor
machine.
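A minimal sketch of how such a keyword/descriptor association could be computed with the MWW test, using SciPy's mannwhitneyu. The matrix layout, the significance level, and the direction heuristic are assumptions for illustration only, not the exact procedure of this article.

import numpy as np
from scipy.stats import mannwhitneyu

def keyword_associations(descriptors, has_keyword, alpha=0.01):
    """For each descriptor dimension, test whether images tagged with the
    keyword have significantly larger or smaller values than the rest.
    descriptors: array (n_images, n_dims); has_keyword: boolean mask.
    Returns a per-dimension sign: +1, -1, or 0 if not significant."""
    tagged = descriptors[has_keyword]
    rest = descriptors[~has_keyword]
    signs = np.zeros(descriptors.shape[1], dtype=int)
    for d in range(descriptors.shape[1]):
        # Two-sided rank-sum test; non-parametric, so no distributional assumption.
        stat, p = mannwhitneyu(tagged[:, d], rest[:, d], alternative='two-sided')
        if p < alpha:
            # Direction of the effect: is the tagged set shifted up or down?
            signs[d] = 1 if np.median(tagged[:, d]) > np.median(rest[:, d]) else -1
    return signs

Because only ranks are compared, the per-keyword cost is dominated by sorting, which is what allows the test to be run for hundreds of thousands of keywords.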
The first application we present is semantic image enhancement:
given an input image and a semantic expression, the image is re-rendered to optimize it for the given semantic concept (Section IV). We implement two types of enhancement: a color tone mapping and a depth-of-field adaptation. Adapting an image to a specific semantic context usually requires manual interaction and a photo-editing tool, but our approach realizes it fully automatically by leveraging a large database of annotated images.
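To make the idea concrete, the following hypothetical sketch nudges an image's per-channel means toward target values associated with a keyword. It is only a simplified stand-in for the color tone mapping of Section IV, and target_means and strength are invented parameters.

import numpy as np

def semantic_tone_map(img_rgb, target_means, strength=0.5):
    """Blend each channel's mean toward the keyword's target mean.
    img_rgb: float array in [0, 1]; target_means: length-3 array of
    channel means observed for images associated with the keyword."""
    out = img_rgb.copy()
    for c in range(3):
        shift = target_means[c] - img_rgb[..., c].mean()
        out[..., c] = np.clip(img_rgb[..., c] + strength * shift, 0.0, 1.0)
    return out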
The second application is color naming: given a color name, we estimate its color triplet (Section V). This is usually done in
psychophysical experiments where users have to manually
name colors. Our automatic approach allows us to estimate
color values for over 9000 color names in 10 different European
and Asian languages. As no user interaction is required,
we can even estimate color names for languages that we do not
speak ourselves. The results are published in an online color
thesaurus: www.colorthesaurus.com.
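One plausible, deliberately simplified estimator for such a color triplet is to return the center of the color-histogram bin most strongly associated with the color name. The actual estimator is described in Section V; the z_scores input and the 8-bin layout below are assumptions.

import numpy as np

def color_triplet(z_scores, bins=8):
    """z_scores: length bins**3 array of per-bin association strengths for a
    color name (positive = bin over-represented in images with that tag).
    Returns an (r, g, b) triplet in [0, 1] at the strongest bin's center."""
    idx = int(np.argmax(z_scores))
    r, g, b = np.unravel_index(idx, (bins, bins, bins))
    return (np.array([r, g, b]) + 0.5) / bins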
The third application is color palette extraction: given a semantic
expression, extract a corresponding palette of five harmonic
colors (Section VI). Color palette creation usually requires user interaction, for instance on Adobe Kuler, one of the web's most popular color palette tools. The scalability of our framework
allows us to pre-compute associations for a large vocabulary
of 100,000 distinct English words and then propose color
palettes fully automatically. The palettes can be explored online:
www.koloro.org.
We further demonstrate in Section VII that, among several alternatives, the Mann-Whitney-Wilcoxon test is the best choice for the statistical framework in terms of both computational complexity and accuracy. The low computational complexity is essential for scaling the methods to a large, unrestricted vocabulary, as realized for the color palette application.
RELATED WORK
Our work is inspired by and related to research in the areas of
computer vision, image mining, image enhancement and color
naming. In this section we give a brief overview of other relevant
work and put it into perspective with ours.
A lot of research in the overlapping fields of imaging and semantics focuses on inferring semantic information from an image. Examples are automatic image classification or automatic labeling of objects in images. Our work also lies in the
fields of semantics and imaging, but differs in one fundamental
way. Instead of building systems that take images at the input
and provide inferred semantic information at the output we work
in the opposite direction: we consider semantic information as
the input and automatically infer knowledge and actions in the
image domain (see Fig. 1).
Deselaers and Ferrari [1] show that images with semantically
similar annotations also have similar visual attributes and vice
versa. This suggests that it is possible to enhance an image
according to its semantic content by altering relevant image
characteristics based on its annotated keywords. This is demonstrated
in Section IV leveraging the MIR Flickr database [2]
containing 1 million annotated Flickr images.
Automatic image enhancement is a widely studied topic, and several approaches have been proposed by the community. One approach transfers characteristics from one or several example images, such as Reinhard et al.'s color transfer [3] or Kang et al.'s personalized image enhancement [4]. Our enhancement also uses example images, but we select them according to keywords and estimate significant characteristics for
those keywords. Another approach is to classify an image and
then apply an algorithm explicitly designed for that class. Examples
are Moser and Schröder [5] or Ciocca et al. [6], but the authors limit themselves to 7 and 3 classes, respectively, because the different algorithms have to be implemented manually one by one. In our case, we can regard each keyword as a class, and the optimal processing is automatically derived from the example database; we are thus able to handle thousands of classes in one generic workflow.
A recent work in semantic image editing is PixelTone [7].
The authors propose a system where users can edit images with
touch input and spoken commands such as “sharpen the midtones
at the top.” Recognized commands such as sharpen midtones
are mapped to the appropriate operation, which is applied globally or locally in a relevant region determined by a keyword such as top or by the user's touch input. The main difference
between this project and our proposed semantic image
enhancement framework (Section IV) is that PixelTone requires
the user to explicitly verbalize the desired image processing operation.
In contrast, our aim is to exploit any semantic information
that comes along with images such as keywords, image titles
or community comments. This semantic information is originally
not meant to be used for image enhancement, but harnessing
it is a promising approach due to its omnipresence in
social networks and photo sharing platforms.
Annotated image data can also be used to associate words
with colors. An early and well-known work is by Jones and Rehg, who propose a statistical model to find skin tones in annotated images [8]. Other related work proposes systems to find appropriate colors for given song lyrics [9] or uses a PLSA model to learn 24 English color names
from annotated images [10]. Our color naming experiment
presented in Section V is similar, but of considerably larger
scale and incorporates more than 9000 color names from 10
different European and Asian languages. We also discuss in Section VII which statistical measures give the best results in terms of speed and accuracy.
Instead of a single color, it is also possible to extract a color palette for a given topic. There are numerous web services that help artists and designers to find color palettes for a given topic,
such as Color Hunter,6 CSS Drive,7 and Pictaculous.8 Adobe Kuler is the best-known platform; users can either create palettes manually or extract them automatically from an uploaded
image. The palettes can also be annotated with keywords so that
other users can query for them in the database. Color Hunter
provides a service where users can type in a keyword that is
used to query related images from Flickr. The software then extracts
one palette per image and returns them to the user. Our approach to finding color palettes from semantic expressions (see Section VI) does not require such manual interaction: it works fully automatically, leveraging pre-computed associations for a large vocabulary of words.
6[Online]. Available: http://www.colorhunter.com
7[Online]. Available: http://www.cssdrive.com/imagepalette/
8[Online]. Available: http://www.pictaculous.com
omnipresent in today’s social online platforms in the form of
keywords, user comments, and so forth. This article presents
a statistical framework designed to infer knowledge in the
imaging domain from the semantic domain. Note that this is
the reverse direction of common computer vision applications.
The framework relates keywords to image characteristics using
a statistical significance test. It scales to millions of images
and hundreds of thousands of keywords. We demonstrate the
usefulness of the statistical framework with three color imaging
applications: 1) semantic image enhancement: re-render an image
in order to adapt it to its semantic context; 2) color naming: find
the color triplet for a given color name; and 3) color palettes: find
a palette of colors that best represents a given arbitrary semantic
context and that satisfies established harmony constraints.
INTRODUCTION
T HE semantic gap is a major challenge to solve in the multimedia
community. The gap is often understood as the
difficulty to automatically infer semantic information from the
multimedia domain such as in face recognition or video classification.
In this article we investigate the gap in the reverse direction,
i.e. we are interested in applications that have the semantic
domain as a source and the multimedia domain as a target.
Bridging the gap in reverse direction (see Fig. 1) might seem
counterintuitive at first sight because the classic forward direction
is an ubiquitous problem in our daily digital life: we want
computers to gain a semantic understanding of digital content
in order to ease search, classification and so forth. However,
today’s social community websites often have plenty of user
generated semantic information already attached to multimedia
content such as tagged images on Flickr or user comments for
posted content on Facebook. It is thus reasonable to explore applications
that take existing semantic information as input and
infer meaningful information and actions in the image domain.
Many aspects of the research in this article are similar to
classic computer vision. We use large databases of annotated
photos and characterize the photos with image descriptors. We
then learn relations between the keywords and the image descriptors.
However, our work differs in two key aspects: the type
of descriptors used to describe the images and the mapping function
to relate the two domains.
Our image descriptors, which represent our output domain,
are chosen so that they are meaningful to a human observer.
Consequently, descriptors that are popular in classic computer
vision such as SIFT or HoG cannot be used. Instead we use
simple color histograms or Fourier domain histograms because
they characterize visible aspects in the image domain, i.e. colors
and sharpness, respectively. As mapping function we can use a
simple and very scalable statistical framework capable of associating
numeric characteristics to semantic content, because we
do not have to discriminate similar semantic concepts.
The statistical framework is based on the Mann-Whitney-
Wilcoxon (MWW) ranksum test that assess whether the values
in one set are significantly larger or smaller than the ones in another
set. The test is non-parametric and thus does not require
assumptions about the input values’ distribution, which is important
to guarantee usability for versatile inputs. The MWW
test is also considerably robust, because it uses ranks instead of
the absolute values to compute the test statistic. Finally, it can
be implemented very efficiently, so that it can handle millions of
images and hundreds of thousands of keywords on a single-processor
machine.
The first application we present is semantic image enhancement:
given an input image and a semantic expression the image
is re-rendered in order to optimize it for the given semantic concept
(Section IV). We implement two types of enhancements,
which are a color tone mapping and a depth-of-field adaptation.
Adapting an image to a specific semantic context usually requires
a manual interaction and a photo editing tool, but our
approach realizes it fully automatically leveraging a large database
of annotated images.
estimate its color triplet (Section V). This is usually done in
psychophysical experiments where users have to manually
name colors. Our automatic approach allows us to estimate
color values for over 9000 color names in 10 different European
and Asian languages. As there is no user interaction required
we can even estimate color names for languages that we do not
speak ourselves. The results are published in an online color
thesaurus: www.colorthesaurus.com.
The third application is color palette extraction: given a semantic
expression, extract a corresponding palette of five harmonic
colors (Section VI). Color palette creation usually requires
user interaction such as on Adobe Kuler, one of the web’s
most popular color palette tools.1 The scalability of our framework
allows us to pre-compute associations for a large vocabulary
of 100,000 distinct English words and then propose color
palettes fully automatically. The palettes can be explored online:
www.koloro.org.
We further demonstrate in Section VII that among several
alternatives the Mann-Whitney-Wilcoxon test is the optimal
choice for the statistical framework in terms of both computational
complexity and accuracy. The low computational
complexity is essential to scale the methods to a large unrestricted
vocabulary as realized for the color palette application.
RELATED WORK
Our work is inspired by and related to research in the areas of
computer vision, image mining, image enhancement and color
naming. In this section we give a brief overview of other relevant
work and put it into perspective with ours.
A lot of research in the the overlapping fields of imaging and
semantics focusses on inferring semantic information from an
image. Examples are automatic image classification 2 3 or automatic
labeling of objects in images.4 5 Our work also lies in the
fields of semantics and imaging, but differs in one fundamental
way. Instead of building systems that take images at the input
and provide inferred semantic information at the output we work
in the opposite direction: we consider semantic information as
the input and automatically infer knowledge and actions in the
image domain (see Fig. 1).
Deselaers and Ferrari [1] show that images with semantically
similar annotations also have similar visual attributes and vice
versa. This suggests that it is possible to enhance an image
according to its semantic content by altering relevant image
characteristics based on its annotated keywords. This is demonstrated
in Section IV leveraging the MIR Flickr database [2]
containing 1 million annotated Flickr images.
Automatic image enhancement is a widely explored topic
and several approaches are being explored by the community.
several
example images such as Reinhard et al.’s color transfer [3]
or Kang et al.’s personalized image enhancement [4]. Our enhancement
also uses example images, but we select them according
to keywords and estimate significant characteristics for
those keywords. Another approach is to classify an image and
then apply an algorithm explicitly designed for that class. Examples
are Moser and Schröder [5] or Ciocca et al. [6], but
the authors limit to 7 and 3 classes, respectively, because the
different algorithms have to be manually implemented one by
one. In our case we can regard each keyword as a class and the
optimal processing is automatically derived from the example
database; we thus are able to handle thousands of classes in one
generic workflow.
A recent work in semantic image editing is PixelTone [7].
The authors propose a system where users can edit images with
touch input and spoken commands such as “sharpen the midtones
at the top.” Recognized commands such as sharpen midtones
are mapped to the appropriate operation that is applied
globally or locally in the relevant region that is determined by a
keyword such as top or by the user’s touch input. The main difference
between this project and our proposed semantic image
enhancement framework (Section IV) is that PixelTone requires
the user to explicitly verbalize the desired image processing operation.
In contrast, our aim is to exploit any semantic information
that comes along with images such as keywords, image titles
or community comments. This semantic information is originally
not meant to be used for image enhancement, but harnessing
it is a promising approach due to its omnipresence in
social networks and photo sharing platforms.
Annotated image data can also be used to associate words
with colors. An early and famous work is from Jones and
Rehg where the authors propose a statistical model to find
skin tones from annotated images [8]. Other related work proposes
systems to find appropriate colors for given song lyrics
[9] or using a PLSA model to find 24 English color names
from annotated images [10]. Our color naming experiment
presented in Section V is similar, but of considerably larger
scale and incorporates more than 9000 color names from 10
different European and Asian languages. Also we discuss in
Section VII what statistical measures give the best results in
terms of speed and accuracy.
Instead of a single color, it is also possible to extract a color palette for a given topic. There are numerous web services that help artists and designers to find color palettes for a given topic,
such as Color Hunter,6 CSS Drive,7 and Pictaculous.8 Adobe Kuler is the best known platform. The users can create palettes either manually or extract them automatically from an uploaded
image. The palettes can also be annotated with keywords so that
other users can query for them in the database. Color Hunter
provides a service where users can type in a keyword that is
used to query related images from Flickr. The software then extracts
one palette per image and returns them to the user. Our approach
to find color palettes based on semantic expressions (see
6[Online]. Available: http://www.colorhunter.com
7[Online]. Available: http://www.cssdrive.com/imagepalette/
8[Online]. Available: http://www.pictaculous.com
Comments
Post a Comment