Abstract—This paper considers the problem of determinizing probabilistic data to enable such data to be stored in legacy systems
that accept only deterministic input. Probabilistic data may be generated by automated data analysis/enrichment techniques such as
entity resolution, information extraction, and speech processing. The legacy system may correspond to pre-existing web applications
such as Flickr and Picasa. The goal is to generate a deterministic representation of probabilistic data that optimizes the quality of the
end-application built on deterministic data. We explore such a determinization problem in the context of two different data processing
tasks—triggers and selection queries. We show that approaches such as thresholding or top-1 selection traditionally used for
determinization lead to suboptimal performance for such applications. Instead, we develop a query-aware strategy and show its
advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets.
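
As a rough illustration of the two baseline strategies named above, the following minimal sketch determinizes a set of probabilistic tags for a single object by thresholding and by top-1 selection. The function names, the threshold value, and the tag probabilities are hypothetical and chosen only for illustration; this is not the query-aware strategy developed in the paper.

```python
from typing import Dict, Set


def threshold_determinize(tag_probs: Dict[str, float], tau: float = 0.5) -> Set[str]:
    """Keep every candidate tag whose probability is at least tau."""
    return {tag for tag, p in tag_probs.items() if p >= tau}


def top1_determinize(tag_probs: Dict[str, float]) -> Set[str]:
    """Keep only the single most probable candidate tag."""
    if not tag_probs:
        return set()
    best = max(tag_probs, key=tag_probs.get)
    return {best}


# Hypothetical output of an automated tagger for one photo (illustrative numbers).
photo = {"beach": 0.45, "ocean": 0.40, "sunset": 0.15}

thresholded = threshold_determinize(photo, tau=0.5)  # no tag reaches 0.5
top1 = top1_determinize(photo)                       # only "beach" survives

print(thresholded, top1)
```

In this toy case, a selection query for "ocean" would miss the photo under both baselines even though the tagger assigns that tag a probability of 0.40, which hints at why a strategy that ignores the query workload can be suboptimal.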