Statistical Entity Extraction from Web
ABSTRACT
Various kinds of valuable semantic information about real-world entities are
embedded in web pages and databases. Extracting and integrating this entity
information from the Web is of great significance. Compared to traditional
information extraction problems, web entity extraction must solve several
new challenges to take full advantage of the unique characteristics of the Web.
In this paper, we introduce our recent work on statistical extraction of
structured entities, named entities, entity facts and relations from the Web. We
also briefly introduce iKnoweb, an interactive knowledge mining framework for
entity information integration. We will use two novel web applications,
Microsoft Academic Search (aka Libra) and EntityCube, as working examples.
Existing System
The need for collecting and understanding Web
information about a real-world entity (such as a person or a product) is currently
fulfilled manually through search engines. However, information about a single
entity might appear in thousands of Web pages. Even if a search engine could
find all the relevant Web pages about an entity, the user would need to sift
through all these pages to get a complete view of the entity. Some basic
understanding of the structure and the semantics of the web pages could
significantly improve people's browsing and searching experience.
Proposed System
Because the information about a single entity may be distributed across diverse
web sources, entity information integration is required. The most challenging
problem in entity information integration is name disambiguation. This is
because we simply don’t have enough signals on the Web to make automated
disambiguation decisions with high confidence. In many cases, we need the
knowledge in users’ minds to help connect the knowledge pieces automatically
mined by algorithms. We propose a novel knowledge mining framework (called
iKnoweb) that adds people into the knowledge mining loop and interactively
solves the name disambiguation problem with users.
MODULE DESCRIPTION:
1. Web Entity Extraction
2. Detecting Maximum Recognition Units
3. Question Generation
4. Network Effects
5. Interaction Optimization
Modules Description
1. Web Entity Extraction
Ø Visual Layout Features
· Web pages usually contain many explicit or implicit visual separators such as
lines, blank areas, images, font size and color, and element size and position.
These are very valuable for the extraction process. Specifically, they affect two
aspects of our framework: block segmentation and feature function construction.
· Using visual information together with delimiters makes it easy to segment a
web page into semantically coherent blocks, and to segment each block of the
page into an appropriate sequence of elements for web entity extraction.
· Visual information itself can also produce powerful features to assist the
extraction. For example, if an element has the maximal font size and is centered
at the top of a paper header, it will be the title with high probability.
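The title example above can be sketched as a binary feature function. This is only an illustration: the `Element` fields, the pixel coordinates, and the `center_tolerance` and `top_fraction` cutoffs are assumptions made for the sketch, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A rendered page element; all fields are hypothetical illustration values."""
    text: str
    font_size: float  # in points
    x_center: float   # horizontal center of the element, in pixels
    y_top: float      # distance from the top of the block, in pixels


def title_feature(element, block, block_width,
                  center_tolerance=0.1, top_fraction=0.25):
    """Return 1.0 if the element looks like a title by layout alone, else 0.0."""
    max_font = max(e.font_size for e in block)
    block_height = max(e.y_top for e in block) or 1.0
    has_max_font = element.font_size == max_font
    is_centered = abs(element.x_center - block_width / 2) <= center_tolerance * block_width
    is_near_top = element.y_top <= top_fraction * block_height
    return 1.0 if (has_max_font and is_centered and is_near_top) else 0.0


# A toy paper header with three elements (made-up coordinates).
header = [
    Element("Statistical Entity Extraction from Web", 18.0, 400.0, 10.0),
    Element("Author names", 12.0, 400.0, 60.0),
    Element("Abstract", 10.0, 120.0, 120.0),
]
scores = [title_feature(e, header, block_width=800.0) for e in header]
```

In a statistical model such a feature would be one input among many, with its weight learned from labeled pages rather than fixed by hand.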
Ø Text Features
· Text content is the most natural feature to use for entity extraction. In web
pages, there are a lot of HTML elements which only contain very short text
fragments (which are not natural sentences). We do not further segment these
short text fragments into individual words.
· Instead, we consider them as the atomic labeling units for web entity
extraction. For long text sentences/paragraphs within web pages, however, we
further segment them into text fragments using algorithms like Semi-CRF.
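The choice of labeling units described above might look roughly like the following. Note the assumptions: the `MAX_FRAGMENT_WORDS` cutoff is invented for illustration, and the punctuation-based split is only a stand-in for a learned segmenter such as Semi-CRF, which this sketch does not implement.

```python
import re

MAX_FRAGMENT_WORDS = 6  # assumed cutoff between a "short fragment" and "long text"


def labeling_units(element_texts):
    """Turn the text of HTML elements into labeling units."""
    units = []
    for text in element_texts:
        text = text.strip()
        if not text:
            continue
        if len(text.split()) <= MAX_FRAGMENT_WORDS:
            # Short fragment: keep it whole as one atomic labeling unit.
            units.append(text)
        else:
            # Long text: split further. Punctuation splitting is a placeholder
            # for a learned segmentation model, not the paper's method.
            parts = re.split(r"[,;.]\s+|[,;.]$", text)
            units.extend(p.strip() for p in parts if p.strip())
    return units


units = labeling_units([
    "Jane Doe",
    "Example University",
    "She works on web search, data mining, and machine learning at a research lab.",
])
```

The short elements pass through unchanged, while the long sentence is broken into smaller fragments that can each receive a label.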
Ø Knowledge Base Features
· We can treat the information in the knowledge base as additional training
examples to compute the element (i.e. text fragment) emission probability,
which is computed using a linear combination of the emission probabilities of
the words within the element. In this way we can build more robust feature
functions based on the element emission probabilities than those based on the
word emission probabilities.
· The knowledge base can also be used to check for matches between the current
text fragment and stored attributes. We can apply a set of domain-independent
string transformations to compute the matching degrees between them.
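Both knowledge base features can be illustrated with a small sketch. Several choices here are assumptions rather than details from the paper: the linear combination uses uniform weights, word probabilities use add-one smoothing, the string transformations are limited to lowercasing and punctuation removal, and the matching degree is a token-level Jaccard overlap. The venue strings are toy data.

```python
import re
from collections import Counter


def word_emission_probs(kb_values):
    """Estimate P(word | label) from knowledge-base values of one attribute."""
    counts = Counter(w for value in kb_values for w in value.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words (add-one smoothing)
    return lambda word: (counts[word.lower()] + 1) / (total + vocab)


def element_emission_prob(fragment, word_prob):
    """Linear combination (assumed uniform weights) of the word probabilities."""
    words = fragment.split()
    if not words:
        return 0.0
    return sum(word_prob(w) for w in words) / len(words)


def normalize(text):
    """Simple transformations: lowercase, drop punctuation, squeeze spaces."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def matching_degree(fragment, stored_value):
    """Token-level Jaccard overlap after normalization, in [0, 1]."""
    a = set(normalize(fragment).split())
    b = set(normalize(stored_value).split())
    return len(a & b) / len(a | b) if a | b else 0.0


kb_venues = ["International Conference on Data Engineering", "Conference on Data Mining"]
venue_prob = word_emission_probs(kb_venues)
p_venue = element_emission_prob("Conference on Web Data", venue_prob)
p_other = element_emission_prob("Jane Doe", venue_prob)
degree = matching_degree("Intl. Conference on Data Engineering", kb_venues[0])
```

A fragment that shares vocabulary with stored venue names receives a higher emission probability for the venue label than an unrelated fragment, and the matching degree stays high even when the fragment abbreviates one word.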
2. Detecting Maximum Recognition Units
We need to automatically
detect highly accurate knowledge units, and the key here is to ensure that the
precision is higher than or equal to that of human performance.
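One way to read the precision requirement is as a threshold selection problem: accept an automatic decision only when its confidence is high enough that the accepted decisions, measured on labeled validation data, reach a target precision. The sketch below assumes such confidence scores and a labeled validation set exist; the function name, the data, and the 0.95 target standing in for human-level precision are all hypothetical.

```python
def pick_threshold(scored_pairs, target_precision):
    """Return the lowest confidence threshold whose accepted decisions meet
    the target precision, or None if no threshold does.

    scored_pairs: list of (confidence, is_correct) from a labeled validation set.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    best = None
    correct = 0
    for i, (confidence, is_correct) in enumerate(ranked):
        correct += int(is_correct)
        # Only evaluate at the end of a run of equal confidences, because a
        # threshold accepts every decision that shares that confidence value.
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != confidence
        if is_last_of_tie and correct / (i + 1) >= target_precision:
            best = confidence
    return best


# Made-up validation results: (confidence, whether the decision was correct).
validation = [
    (0.99, True), (0.97, True), (0.95, True),
    (0.90, False), (0.85, True), (0.60, False),
]
threshold = pick_threshold(validation, target_precision=0.95)
```

Decisions scoring below the chosen threshold would be left unmerged and handed to the interactive steps that follow.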
3. Question Generation
By asking easy questions, iKnoweb can gain broad knowledge about the targeted
entity. An example question could be: “Is the person a researcher? (Yes or
No)”. The answer can help the system find the topic of the web appearances of
the entity.
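As a rough illustration of how a yes/no answer narrows things down, the sketch below assumes each knowledge unit already carries a set of topic labels. The rule for choosing the question (prefer the topic that splits the candidates most evenly) is an assumption made for this example, not a rule stated in the paper, and the unit data is invented.

```python
def pick_question(units, topics):
    """Pick the topic whose yes/no answer splits the candidate units most evenly."""
    def imbalance(topic):
        yes = sum(1 for u in units if topic in u["topics"])
        return abs(yes - len(units) / 2)
    return min(topics, key=imbalance)


def apply_answer(units, topic, answer_is_yes):
    """Keep only the units consistent with the user's yes/no answer."""
    return [u for u in units if (topic in u["topics"]) == answer_is_yes]


# Hypothetical knowledge units found for one ambiguous name.
units = [
    {"id": "u1", "topics": {"researcher"}},
    {"id": "u2", "topics": {"researcher", "professor"}},
    {"id": "u3", "topics": {"athlete"}},
    {"id": "u4", "topics": {"musician"}},
]
topic = pick_question(units, ["researcher", "athlete", "musician"])
question = f"Is the person a {topic}? (Yes or No)"
remaining = apply_answer(units, topic, answer_is_yes=True)
```

A single easy answer removes half of the candidate units here, which is the kind of broad knowledge the question is meant to collect.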
4. Network Effects
A new user will directly benefit from the knowledge contributed by others, and
our learning algorithm will be improved through users’ participation.
5. Interaction Optimization
This component is
used to determine when to ask questions, and when to invite users to initiate
the interaction and to provide more signals.
System Configuration:-
H/W System Configuration:-
Processor - Pentium III
Speed - 1.1 GHz
RAM - 256 MB (min)
Hard Disk - 20 GB
Floppy Drive - 1.44 MB
Key Board - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA
S/W System Configuration:-
· Operating System : Windows 95/98/2000/XP
· Application Server : WampServer 2.2e
· Front End : HTML, CSS
· Scripts : JavaScript
· Server side Script : PHP
· Database : MySQL
CONCLUSION
How to accurately extract structured information about real-world entities from
the Web has attracted significant interest recently. This paper summarizes our
recent research work on statistical web entity extraction, which aims to extract
and integrate all the related web information about the same entity into a
single information unit. In web entity extraction, it is important to take
advantage of the following unique characteristics of the Web: visual layout,
information redundancy, information fragmentation, and the availability of a
knowledge base. Specifically, we first introduced our vision-based web entity
extraction work, which considers visual layout information and knowledge base
features in understanding the page structure and the text content of a web page.
We then introduced our statistical snowball work, which automatically discovers
text patterns from billions of web pages by leveraging the information
redundancy property of the Web. We also introduced iKnoweb, an interactive
knowledge mining framework, which collaborates with end users to connect the
extracted knowledge pieces mined from the Web and builds an accurate entity
knowledge web.