Abstract—As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely SmartCrawler, for efficiently harvesting deep-web interfaces. In the first stage, SmartCrawler performs site-based searching for center pages with the help of
search engines, avoiding visiting a large number of pages. To achieve more accurate results for a focused crawl, SmartCrawler
ranks websites to prioritize highly relevant ones for a given topic. In the second stage, SmartCrawler achieves fast in-site
searching by excavating the most relevant links with adaptive link ranking. To eliminate the bias toward visiting some highly relevant links in hidden web directories, we design a link tree data structure to achieve wider coverage for a website. Our experimental
results on a set of representative domains show the agility and accuracy of our proposed crawler framework, which efficiently
retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.
INTRODUCTION
The deep (or hidden) web refers to content that lies behind searchable web interfaces and cannot be indexed by search engines. Based on extrapolations from a study done at the University of California, Berkeley, the deep web was estimated to contain approximately 91,850 terabytes of data in 2003, while the surface web contained only about 167 terabytes [1]. More recent studies estimated that data volumes had reached 1.9 zettabytes and that 0.3 zettabytes were consumed worldwide in 2007 [2], [3]. An IDC report estimates that the total of all digital data created,
replicated, and consumed will reach 6 zettabytes in 2014 [4]. A significant portion of this huge amount
of data is estimated to be stored as structured or relational data in web databases; the deep web makes up about 96% of all content on the Internet and is 500-550 times larger than the surface web [5], [6].
These data contain a vast amount of valuable information, and entities such as Infomine [7], Clusty [8], and BooksInPrint [9] may be interested in building an index of the deep-web sources in a given domain (such as books). Because these entities cannot access the proprietary web indices of search engines (e.g.,
Google and Baidu), there is a need for an efficient
crawler that is able to accurately and quickly explore the deep web databases.
It is challenging to locate deep-web databases because they are not registered with any search engine, are usually sparsely distributed, and change constantly. To address this problem, previous
work has proposed two types of crawlers, generic crawlers and focused crawlers. Generic crawlers [10],
[11], [12], [13], [14] fetch all searchable forms and cannot focus on a specific topic. Focused crawlers
such as Form-Focused Crawler (FFC) [15] and Adaptive Crawler for Hidden-web Entries (ACHE) [16] can automatically search online databases on a specific topic. FFC is designed with link, page, and form
classifiers for focused crawling of web forms, and ACHE extends it with additional components for form filtering and adaptive link learning. The link classifiers in these crawlers play a pivotal role in
achieving higher crawling efficiency than the best-first crawler [17]. However, these link classifiers are used
to predict the distance to a page containing searchable forms, which is difficult to estimate, especially for delayed-benefit links (links that eventually lead to pages with forms). As a result, the crawler can be inefficiently led to pages without targeted forms.
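To make the distance-estimation problem concrete, the following minimal Python sketch (our own illustration, not the actual FFC/ACHE classifier) scores a candidate link by a predicted number of hops to a page with a searchable form; the keyword weights and the base distance are hypothetical placeholders rather than learned values.

# Minimal illustration of a distance-predicting link classifier
# (a sketch of the general idea, not the FFC/ACHE implementation).
# A real classifier would be trained on labeled crawl paths; the
# weights below are hypothetical placeholders.
import re

FORM_HINTS = {"search": -1.0, "advanced": -0.8, "query": -0.6, "browse": -0.3}

def link_tokens(anchor_text, url):
    """Bag-of-token features from anchor text and URL."""
    return re.findall(r"[a-z]+", (anchor_text + " " + url).lower())

def predicted_distance(anchor_text, url, base_distance=3.0):
    """Estimate hops to a page containing a searchable form.

    Links with no recognizable hints keep the pessimistic base
    distance, which is exactly why delayed-benefit links are hard
    to rank correctly.
    """
    score = base_distance
    for token in link_tokens(anchor_text, url):
        score += FORM_HINTS.get(token, 0.0)
    return max(score, 0.0)

if __name__ == "__main__":
    print(predicted_distance("Advanced search", "http://example.org/books/search"))
    print(predicted_distance("About us", "http://example.org/about"))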
Besides efficiency, quality and coverage of relevant deep-web sources are also challenging. A crawler must produce a large quantity of high-quality results from the most relevant content sources [15], [16], [18], [19], [20], [21]. For assessing source quality, SourceRank ranks the results from the selected sources by computing the agreement between them [20], [21]. When selecting a relevant subset from the available content sources, FFC and ACHE prioritize links that bring an immediate return (links that directly point to pages containing searchable forms) and delayed-benefit links.
But the set of retrieved forms is very heterogeneous.
For example, from a set of representative domains,
on average only 16% of forms retrieved by FFC are
relevant [15], [16]. Furthermore, little work has been
done on the source selection problem when crawling
more content sources [19], [22]. Thus it is crucial to develop smart crawling strategies that are able to quickly discover as many relevant content sources from the deep web as possible.
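As a rough illustration of agreement-based source assessment (a simplified sketch of the idea behind SourceRank, not the published algorithm), the snippet below scores each candidate source by how much its results for a sample query overlap with the results of the other sources; the source names and result identifiers are hypothetical.

# Simplified illustration of agreement-based source scoring,
# inspired by the SourceRank idea (not its actual algorithm).
def pairwise_agreement(results_a, results_b):
    """Jaccard overlap between two result sets."""
    if not results_a and not results_b:
        return 0.0
    return len(results_a & results_b) / len(results_a | results_b)

def score_sources(sample_results):
    """sample_results: {source_name: set(result_ids)} for one sample query."""
    scores = {}
    for name, results in sample_results.items():
        others = [r for n, r in sample_results.items() if n != name]
        scores[name] = sum(pairwise_agreement(results, o) for o in others) / max(len(others), 1)
    return scores

if __name__ == "__main__":
    sample = {
        "source_a": {"isbn1", "isbn2", "isbn3"},
        "source_b": {"isbn2", "isbn3", "isbn4"},
        "source_c": {"isbn9"},  # low agreement suggests a low-quality source
    }
    print(score_sources(sample))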
PROPOSED SYSTEM
In this paper, we propose an effective deep web harvesting framework, namely SmartCrawler, for achieving
both wide coverage and high efficiency for a focused crawler. Based on the observation that deep websites usually contain only a few searchable forms, most of which are within a depth of three [23], [10],
our crawler is divided into two stages: site locating and in-site exploring. The site locating stage helps achieve
wide coverage of sites for a focused crawler, and the in-site exploring stage can efficiently perform searches
for web forms within a site. Our main contributions are:
• We propose a novel two-stage framework to address the problem of searching for hidden-web resources. Our site locating technique employs a reverse searching technique (e.g., using Google's "link:" facility to obtain pages pointing to a given link) and an incremental two-level site prioritizing technique for unearthing relevant sites, achieving more data sources. During the in-site exploring stage, we design a link tree for balanced link prioritizing, eliminating bias toward webpages in popular directories (both ideas are sketched after this list).
• We propose an adaptive learning algorithm that performs online feature selection and uses these features to automatically construct link rankers (see the sketch after this list). In the site locating stage, highly relevant sites are prioritized and the crawling is focused on a topic using the contents of the root pages of sites, achieving more accurate results. During the in-site exploring stage, relevant links are prioritized for fast in-site searching.
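The reverse-searching step of site locating can be sketched as follows; this is an illustrative Python fragment under simplifying assumptions: the search_engine callable stands in for whatever search-engine API is available, and the de-duplication by host is our own heuristic for broadening site coverage.

# Sketch of reverse searching for center pages (illustrative only).
# `search_engine` is assumed to take a query string and return a
# list of result URLs; it is a placeholder, not a real API.
from urllib.parse import urlparse

def reverse_search(known_deep_site, search_engine, max_results=50):
    """Find candidate center pages that link to a known deep-web site."""
    query = "link:" + known_deep_site            # e.g. link:www.abebooks.com
    result_urls = search_engine(query)[:max_results]
    # Keep one candidate per host: center pages from distinct sites
    # broaden coverage more than many pages from the same site.
    seen_hosts, candidates = set(), []
    for url in result_urls:
        host = urlparse(url).netloc
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            candidates.append(url)
    return candidates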
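The link tree idea can be sketched in a similar spirit (our reading of the structure; the actual implementation may differ): in-site links are grouped by their URL directory, and candidates are drawn round-robin across directories so that a single popular directory does not dominate the crawl frontier.

# Sketch of a link tree for balanced in-site link prioritizing
# (illustrative; grouping here uses only the top-level directory).
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def build_link_tree(urls):
    """Group in-site URLs by their top-level directory."""
    tree = defaultdict(list)
    for url in urls:
        path_parts = [p for p in urlparse(url).path.split("/") if p]
        branch = path_parts[0] if path_parts else ""
        tree[branch].append(url)
    return tree

def balanced_order(tree):
    """Interleave links from different directories (round-robin)."""
    ordered = []
    for group in zip_longest(*tree.values()):
        ordered.extend(u for u in group if u is not None)
    return ordered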
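Finally, a minimal sketch of adaptive link ranking with online feature selection, assuming simple token features from anchor text and URLs and a crude top-k weight-pruning step; the update rule and the constants are illustrative and not those used by SmartCrawler.

# Sketch of an adaptive link ranker with online feature selection
# (illustrative simplification).
import re
from collections import defaultdict

class AdaptiveLinkRanker:
    def __init__(self, top_k_features=100):
        self.weights = defaultdict(float)
        self.top_k = top_k_features

    @staticmethod
    def _tokens(anchor_text, url):
        return re.findall(r"[a-z]+", (anchor_text + " " + url).lower())

    def update(self, anchor_text, url, led_to_form):
        """Online update once a followed link is labeled by the crawl."""
        delta = 1.0 if led_to_form else -0.2
        for tok in self._tokens(anchor_text, url):
            self.weights[tok] += delta
        # Crude online feature selection: keep only the strongest tokens.
        if len(self.weights) > self.top_k:
            kept = sorted(self.weights.items(), key=lambda kv: abs(kv[1]),
                          reverse=True)[: self.top_k]
            self.weights = defaultdict(float, kept)

    def score(self, anchor_text, url):
        """Rank score for a candidate in-site link (higher is better)."""
        return sum(self.weights.get(t, 0.0) for t in self._tokens(anchor_text, url))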