Query Aware Determinization of Uncertain Objects

Abstract—This paper considers the problem of determinizing probabilistic data to enable such data to be stored in legacy systems
that accept only deterministic input. Probabilistic data may be generated by automated data analysis/enrichment techniques such as
entity resolution, information extraction, and speech processing. The legacy system may correspond to pre-existing web applications
such as Flickr, Picasa, etc. The goal is to generate a deterministic representation of probabilistic data that optimizes the quality of the
end-application built on deterministic data. We explore such a determinization problem in the context of two different data processing
tasks—triggers and selection queries. We show that approaches such as thresholding or top-1 selection traditionally used for
determinization lead to suboptimal performance for such applications. Instead, we develop a query-aware strategy and show its
advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets.

INTRODUCTION
WITH the advent of cloud computing and the
proliferation of web-based applications, users often
store their data in various existing web applications. Often,
user data is generated automatically through a variety
of signal processing, data analysis/enrichment techniques
before being stored in the web applications. For example,
modern cameras support vision analysis to generate
tags such as indoors/outdoors, scenery, landscape/portrait,
etc. Modern photo cameras often have microphones for
users to speak out a descriptive sentence which is then
processed by a speech recognizer to generate a set of
tags to be associated with the photo [1]. The photo
(along with the set of tags) can be streamed in real-time
using wireless connectivity to Web applications such as
Flickr.
Pushing such data into web applications introduces a
challenge since such automatically generated content is
often ambiguous and may result in objects with probabilistic
attributes. For instance, vision analysis may result in tags
with probabilities [2], [3], and, likewise, automatic speech
recognizer (ASR) may produce an N-best list or a confusion
network of utterances [1], [4]. Such probabilistic data must
be “determinized" before being stored in legacy web applications.
We refer to the problem of mapping probabilistic
data into the corresponding deterministic representation as
the determinization problem.

Many approaches to the determinization problem can
be designed. Two basic strategies are the Top-1 and
All techniques, wherein we choose the most probable
value / all the possible values of the attribute with
non-zero probability, respectively. For instance, a speech
recognition system that generates a single answer/tag for
each utterance can be viewed as using a top-1 strategy.
Another strategy might be to choose a threshold τ
and include all the attribute values with a probability
higher than τ . However, such approaches being agnostic
to the end-application often lead to suboptimal results
as we will see later. A better approach is to design
customized determinization strategies that select a determinized
representation which optimizes the quality of the
end-application.
Consider, for instance, an end application that supports
triggers/alerts on automatically generated content.
Examples of such an end-application includes publish/
subscibe systems such as Google Alert, wherein users
specify their subscriptions in the form of keywords (e.g.,
“California earthquake") and predicates over metadata
(e.g., data type is video). Google Alert forwards all matching
data items to user based on the subscriptions. Now
consider a video about California Earthquake that is to be
published on Youtube. The video contains a set of tags that
were extracted using either automated vision processing
and/or information extraction techniques applied over
transcribed speech. Such automated tools may produce
tags with probabilities (e.g, “California": 0.9, “earthquake":
0.6, “election": 0.7), while the true tags of the video could
be “California" and “earthquake". The determinization
process should associate the video with appropriate tags
such that subscribers who are really interested in the video
(i.e., whose subscription includes the words “California
Earthquake") are notified while other are not overwhelmed
by irrelevant data. Thus, in the example above, the
determinization process should minimize metrics such

as false positives and false negatives that result from a
determinized representation of data.
Now consider a different application such as Flickr, to
which photos are uploaded automatically from cameras
along with tags that may be generated based on speech
annotation or image analysis. Flickr supports effective
retrieval based on photo tags. In such an application, users
may be interested in choosing determinized representation
that optimizes set-based quality metrics such as F-measure
instead of minimizing false positives/negatives.
In this paper, we study the problem of deteminizing
datasets with probabilistic attributes (possibly generated
by automated data analyses/enrichment). Our approach
exploits a workload of triggers/queries to choose the “best"
deterministic representation for two types of applications
– one, that supports triggers on generated content and
another that supports effective retrieval.
Interestingly, the problem of determinization has not
been explored extensively in the past. The most related
research efforts are [5], [6], which explore how to give
deterministic answers to a query (e.g. conjunctive selection
query [5]) over probabilisitc database. Unlike the problem
of determinizing an answer to a query, our goal is
to determinize the data to enable it to be stored in legacy
deterministic databases such that the determinized representation
optimizes the expected performance of queries in
the future. Solutions in [5], [6] can not be straightforwardly
applied to such a determinization problem.1
Overall, the main contributions of this paper are:
• We introduce the problem of determinizing probabilistic
data. Given a workload of triggers/queries, the
main challenge is to find the deterministic representation
of the data which would optimize certain quality
metrics of the answer to these triggers/queries
(Section 2).
• We propose a framework that solves the problem
of determinization by minimizing the expected cost
of the answer to queries. We develop a branchand-
bound algorithm that finds an approximate
1. We will show that in terms of quality of query answer, our proposed
determinization techniques are very close to those applied only
over probabilistic databases, which are not supported by many legacy
applicati

near-optimal solution to the resulting NP-hard problem
(Section 3).
• We address the problem of determinizing a collection
of objects to optimize set-based quality metrics, such
as F-measure. We develop an efficient algorithm that
reaches near-optimal quality (Section 4).
• We extend the solutions to handle a data model
where mutual exclusion exists among tags. We show
that correlations among tags can be leveraged in our
solutions to get better results. We also demonstrate
that our solutions are designed to handle various
types of queries. (Section 5).
• We empirically demonstrate that the proposed algorithms
are very efficient and reach high-quality results
that are very close to those of the optimal solution.
We also demonstrate that they are robust to small
changes in the original query workload (Section 6)

the object, if the data processing technique outputted 100%
accurate results. For instance, in the case where objects
are speech annotated images, the ground truth tags correspond
to the actual spoken words. The ground truth tags
are unknown to the algorithm and are hence not part of
the uncertain object. These tags are only used for describing
the quality metrics we will define later. For the object
in the running example, the ground truth tags could be
G = {dog, beach}. Notice that due to the uncertainty in the
results returned by the data processing techniques being
used, the set of uncertain tags W could be different from
the set of ground truth tags G.
Deterministic representation. A determinization process
for an object O is to select a deterministic representation
to represent O, such that it can be stored in a legacy
system that does not support probabilistic input. We define
a deterministic representation (or answer set) of an object O
as a set of deterministic tags A = {w1,w2, · · · ,w|A|}. For
example, one possible deterministic representation for the
object in the running example is A = {pier, beach}.
When it is not clear from the context, we will use subscript
‘O’ in variables W, G and A to indicate which object
they refer to, e.g. WO refers to “W of object O".
Query workload. A query workload is a set of queries
Q = {Q1,Q2, · · · ,Q|Q|} that will be executed over the deterministic
representation. For clarity of presentation, we will
initially assume that each query Q ∈ Q is a conjunctive
query. We will later discuss how to extend it to other types
of queries. A conjunctive query Q ∈ Q has the following
information associated with it:
1) T = {q1, q2, . . . , q|T|} stores the set of terms associated
with Q = q1∧q2∧· · ·∧q|T|. Given a deterministic
representation AO of an object O, the object satisfies
a conjunctive query Q iff T ⊆ AO.
2) f ∈ [0, 1] encodes the query weight that corresponds
to the relative frequency of Q in the
workload.
We assume that, in general, such a workload is generated
from a past query log, modeled based on the past querying
patterns [7], or derived from frequent terms in past corpus
of data [8]. We will use subscript ‘Q’ for variables T and f to
indicate which query they refer to, e.g. fQ refers to “weight
of query Q". The running example in such that it can be
stored in a legacy system that does not Fig. 1 demonstrates
a workload of 6 conjunctive queries. The query terms and
their weights are also shown.

SPRING SOURCE TECHNOLOGIES

Search This Blog

Query Aware Determinization of Uncertain Objects

Comments

Post a Comment

Popular posts from this blog

Jio

Enabling Cloud Storage Auditing with Verifiable Outsourcing of Key Updates

PUNCHING MACHINE