Abstract—As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely SmartCrawler, for efficiently harvesting deep-web interfaces. In the first stage, SmartCrawler performs site-based searching for center pages with the help of
search engines, avoiding visiting a large number of pages. To achieve more accurate results for a focused crawl, SmartCrawler
ranks websites to prioritize highly relevant ones for a given topic. In the second stage, SmartCrawler achieves fast in-site
searching by excavating the most relevant links with adaptive link ranking. To eliminate the bias toward visiting some highly relevant links in hidden web directories, we design a link tree data structure to achieve wider coverage for a website. Our experimental
results on a set of representative domains show the agility and accuracy of our proposed crawler framework, which efficiently
retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.
INTRODUCTION
The deep (or hidden) web refers to content that lies behind searchable web interfaces and cannot be indexed by search engines. Based on extrapolations from a study done at the University of California, Berkeley, the deep web was estimated to contain approximately 91,850 terabytes of data in 2003, while the surface web contained only about 167 terabytes [1]. More recent studies estimated that data volumes had reached 1.9 zettabytes and that 0.3 zettabytes were consumed worldwide in 2007 [2], [3]. An IDC report estimates that the total of all digital data created,
replicated, and consumed will reach 6 zettabytes in 2014 [4]. A significant portion of this huge amount
of data is estimated to be stored as structured or relational data in web databases; the deep web makes up about 96% of all content on the Internet and is 500-550 times larger than the surface web [5], [6].
These data contain a vast amount of valuable information, and entities such as Infomine [7], Clusty [8], and BooksInPrint [9] may be interested in building an index of the deep-web sources in a given domain (such as books). Because these entities cannot access the proprietary web indices of search engines (e.g.,
Google and Baidu), there is a need for an efficient
crawler that is able to accurately and quickly explore the deep web databases.
It is challenging to locate deep-web databases because they are not registered with any search engine, are usually sparsely distributed, and change constantly. To address this problem, previous
work has proposed two types of crawlers, generic crawlers and focused crawlers. Generic crawlers [10],
[11], [12], [13], [14] fetch all searchable forms and cannot focus on a specific topic. Focused crawlers
such as Form-Focused Crawler (FFC) [15] and Adaptive Crawler for Hidden-web Entries (ACHE) [16] can automatically search online databases on a specific topic. FFC is designed with link, page, and form
classifiers for focused crawling of web forms, and ACHE extends it with additional components for form filtering and adaptive link learning. The link classifiers in these crawlers play a pivotal role in
achieving higher crawling efficiency than the best-first crawler [17]. However, these link classifiers are used
to predict the distance to a page containing searchable forms, which is difficult to estimate, especially for delayed-benefit links (links that eventually lead to pages with forms). As a result, the crawler can be inefficiently led to pages without targeted forms.
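To make the distance-estimation problem concrete, the following minimal Python sketch (our own illustration, not the actual FFC/ACHE classifier) scores a candidate link by a predicted number of hops to a page with a searchable form; the keyword weights and the base distance are hypothetical placeholders rather than learned values.

# Minimal illustration of a distance-predicting link classifier
# (a sketch of the general idea, not the FFC/ACHE implementation).
# A real classifier would be trained on labeled crawl paths; the
# weights below are hypothetical placeholders.
import re

FORM_HINTS = {"search": -1.0, "advanced": -0.8, "query": -0.6, "browse": -0.3}

def link_tokens(anchor_text, url):
    """Bag-of-token features from anchor text and URL."""
    return re.findall(r"[a-z]+", (anchor_text + " " + url).lower())

def predicted_distance(anchor_text, url, base_distance=3.0):
    """Estimate hops to a page containing a searchable form.

    Links with no recognizable hints keep the pessimistic base
    distance, which is exactly why delayed-benefit links are hard
    to rank correctly.
    """
    score = base_distance
    for token in link_tokens(anchor_text, url):
        score += FORM_HINTS.get(token, 0.0)
    return max(score, 0.0)

if __name__ == "__main__":
    print(predicted_distance("Advanced search", "http://example.org/books/search"))
    print(predicted_distance("About us", "http://example.org/about"))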
Besides efficiency, quality and coverage of relevant deep-web sources are also challenging. A crawler must produce a large quantity of high-quality results from the most relevant content sources [15], [16], [18], [19], [20], [21]. For assessing source quality, SourceRank ranks the results from the selected sources by computing the agreement between them [20], [21]. When selecting a relevant subset from the available content sources, FFC and ACHE prioritize links that bring an immediate return (links that directly point to pages containing searchable forms) and delayed-benefit links.
But the set of retrieved forms is very heterogeneous.
For example, from a set of representative domains,
on average only 16% of forms retrieved by FFC are
relevant [15], [16]. Furthermore, little work has been
done on the source selection problem when crawling
more content sources [19], [22]. Thus it is crucial to develop smart crawling strategies that are able to quickly discover as many relevant content sources from the deep web as possible.
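As a rough illustration of agreement-based source assessment (a simplified sketch of the idea behind SourceRank, not the published algorithm), the snippet below scores each candidate source by how much its results for a sample query overlap with the results of the other sources; the source names and result identifiers are hypothetical.

# Simplified illustration of agreement-based source scoring,
# inspired by the SourceRank idea (not its actual algorithm).
def pairwise_agreement(results_a, results_b):
    """Jaccard overlap between two result sets."""
    if not results_a and not results_b:
        return 0.0
    return len(results_a & results_b) / len(results_a | results_b)

def score_sources(sample_results):
    """sample_results: {source_name: set(result_ids)} for one sample query."""
    scores = {}
    for name, results in sample_results.items():
        others = [r for n, r in sample_results.items() if n != name]
        scores[name] = sum(pairwise_agreement(results, o) for o in others) / max(len(others), 1)
    return scores

if __name__ == "__main__":
    sample = {
        "source_a": {"isbn1", "isbn2", "isbn3"},
        "source_b": {"isbn2", "isbn3", "isbn4"},
        "source_c": {"isbn9"},  # low agreement suggests a low-quality source
    }
    print(score_sources(sample))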
PROPOSED SYSTEM
In this paper, we propose an effective deep web harvesting framework, namely SmartCrawler, for achieving
both wide coverage and high efficiency for a focused crawler. Based on the observation that deep websites usually contain only a few searchable forms, most of which are within a depth of three [23], [10],
our crawler is divided into two stages: site locating and in-site exploring. The site locating stage helps achieve
wide coverage of sites for a focused crawler, and the in-site exploring stage can efficiently perform searches
for web forms within a site. Our main contributions are:
• We propose a novel two-stage framework to address the problem of searching for hidden-web resources. Our site locating technique employs a reverse searching technique (e.g., using Google's "link:" facility to obtain pages pointing to a given link) and an incremental two-level site prioritizing technique for unearthing relevant sites, achieving more data sources. During the in-site exploring stage, we design a link tree for balanced link prioritizing, eliminating bias toward webpages in popular directories (both ideas are sketched after this list).
• We propose an adaptive learning algorithm that performs online feature selection and uses these features to automatically construct link rankers (see the sketch after this list). In the site locating stage, highly relevant sites are prioritized and the crawling is focused on a topic using the contents of the root pages of sites, achieving more accurate results. During the in-site exploring stage, relevant links are prioritized for fast in-site searching.
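The reverse-searching step of site locating can be sketched as follows; this is an illustrative Python fragment under simplifying assumptions: the search_engine callable stands in for whatever search-engine API is available, and the de-duplication by host is our own heuristic for broadening site coverage.

# Sketch of reverse searching for center pages (illustrative only).
# `search_engine` is assumed to take a query string and return a
# list of result URLs; it is a placeholder, not a real API.
from urllib.parse import urlparse

def reverse_search(known_deep_site, search_engine, max_results=50):
    """Find candidate center pages that link to a known deep-web site."""
    query = "link:" + known_deep_site            # e.g. link:www.abebooks.com
    result_urls = search_engine(query)[:max_results]
    # Keep one candidate per host: center pages from distinct sites
    # broaden coverage more than many pages from the same site.
    seen_hosts, candidates = set(), []
    for url in result_urls:
        host = urlparse(url).netloc
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            candidates.append(url)
    return candidates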
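The link tree idea can be sketched in a similar spirit (our reading of the structure; the actual implementation may differ): in-site links are grouped by their URL directory, and candidates are drawn round-robin across directories so that a single popular directory does not dominate the crawl frontier.

# Sketch of a link tree for balanced in-site link prioritizing
# (illustrative; grouping here uses only the top-level directory).
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def build_link_tree(urls):
    """Group in-site URLs by their top-level directory."""
    tree = defaultdict(list)
    for url in urls:
        path_parts = [p for p in urlparse(url).path.split("/") if p]
        branch = path_parts[0] if path_parts else ""
        tree[branch].append(url)
    return tree

def balanced_order(tree):
    """Interleave links from different directories (round-robin)."""
    ordered = []
    for group in zip_longest(*tree.values()):
        ordered.extend(u for u in group if u is not None)
    return ordered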
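Finally, a minimal sketch of adaptive link ranking with online feature selection, assuming simple token features from anchor text and URLs and a crude top-k weight-pruning step; the update rule and the constants are illustrative and not those used by SmartCrawler.

# Sketch of an adaptive link ranker with online feature selection
# (illustrative simplification).
import re
from collections import defaultdict

class AdaptiveLinkRanker:
    def __init__(self, top_k_features=100):
        self.weights = defaultdict(float)
        self.top_k = top_k_features

    @staticmethod
    def _tokens(anchor_text, url):
        return re.findall(r"[a-z]+", (anchor_text + " " + url).lower())

    def update(self, anchor_text, url, led_to_form):
        """Online update once a followed link is labeled by the crawl."""
        delta = 1.0 if led_to_form else -0.2
        for tok in self._tokens(anchor_text, url):
            self.weights[tok] += delta
        # Crude online feature selection: keep only the strongest tokens.
        if len(self.weights) > self.top_k:
            kept = sorted(self.weights.items(), key=lambda kv: abs(kv[1]),
                          reverse=True)[: self.top_k]
            self.weights = defaultdict(float, kept)

    def score(self, anchor_text, url):
        """Rank score for a candidate in-site link (higher is better)."""
        return sum(self.weights.get(t, 0.0) for t in self._tokens(anchor_text, url))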