
Statistical Entity Extraction from Web

                         




 ABSTRACT

There are various kinds of valuable semantic information about real-world entities embedded in web pages and databases. Extracting and integrating this entity information from the Web is of great significance. Compared to traditional information extraction problems, web entity extraction needs to solve several new challenges to take full advantage of the unique characteristics of the Web. In this paper, we introduce our recent work on statistical extraction of structured entities, named entities, entity facts and relations from the Web. We also briefly introduce iKnoweb, an interactive knowledge mining framework for entity information integration. We will use two novel web applications, Microsoft Academic Search (aka Libra) and EntityCube, as working examples.

Existing System
           
The need for collecting and understanding Web information about a real-world entity (such as a person or a product) is currently fulfilled manually through search engines. However, information about a single entity might appear in thousands of Web pages. Even if a search engine could find all the relevant Web pages about an entity, the user would need to sift through all these pages to get a complete view of the entity. Some basic understanding of the structure and the semantics of the web pages could significantly improve people's browsing and searching experience.



Proposed System
         
Because the information about a single entity may be distributed across diverse web sources, entity information integration is required. The most challenging problem in entity information integration is name disambiguation, because we simply don’t have enough signals on the Web to make automated disambiguation decisions with high confidence. In many cases, we need the knowledge in users’ minds to help connect the knowledge pieces automatically mined by algorithms. We propose a novel knowledge mining framework (called iKnoweb) that adds people into the knowledge mining loop and solves the name disambiguation problem interactively with users.


MODULE DESCRIPTION:

1.     Web Entity Extraction
2.     Detecting Maximum Recognition Units
3.     Question Generation
4.     Network Effects
5.     Interaction Optimization



Modules Description



1.     Web Entity Extraction

Ø  Visual Layout Features
·         Web pages usually contain many explicit or implicit visual separators such as lines, blank areas, images, font size and color, and element size and position. They are very valuable for the extraction process. Specifically, they affect two aspects of our framework: block segmentation and feature function construction.
·         Using visual information together with delimiters, it is easy to segment a web page into semantically coherent blocks, and to segment each block of the page into an appropriate sequence of elements for web entity extraction.
·         Visual information itself can also produce powerful features to assist the extraction. For example, if an element has the largest font size and is centered at the top of a paper header, it is the title with high probability.
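
The title heuristic described above can be written as a binary feature function. This is only a minimal sketch: the Element fields, the title_feature name and the 10% centering tolerance are assumptions made for illustration and are not taken from the original work.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A rendered page element (hypothetical structure for this sketch)."""
    text: str
    font_size: float  # rendered font size, in points
    x_center: float   # horizontal center of the element, in pixels
    y_top: float      # distance from the top of the block, in pixels


def title_feature(element, block, block_width, tolerance=0.1):
    """Return 1.0 if the element has the largest font size in the block,
    is horizontally centered, and is the topmost element; otherwise 0.0."""
    largest = element.font_size >= max(e.font_size for e in block)
    # "Centered" means within `tolerance` (assumed 10%) of the block's midline.
    centered = abs(element.x_center - block_width / 2) <= tolerance * block_width
    topmost = element.y_top <= min(e.y_top for e in block)
    return 1.0 if (largest and centered and topmost) else 0.0


header = [
    Element("Statistical Entity Extraction from Web", 18.0, 400.0, 0.0),
    Element("Author names", 11.0, 400.0, 40.0),
]
scores = [title_feature(e, header, block_width=800.0) for e in header]
```

In a statistical model, a feature like this would be one input among many, with its weight learned from labeled pages rather than fixed by hand.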


Ø  Text Features
·         Text content is the most natural feature to use for entity extraction. In web pages, there are many HTML elements that contain only very short text fragments (which are not natural sentences). We do not further segment these short text fragments into individual words.
·         Instead, we consider them as the atomic labeling units for web entity extraction. For long text sentences/paragraphs within web pages, however, we further segment them into text fragments using algorithms like Semi-CRF.
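
The distinction between atomic and segmented units can be sketched as follows. The 8-word cutoff and the function name are assumptions, and the punctuation-based split is only a placeholder for a learned segmenter such as Semi-CRF, which is not implemented here.

```python
import re

MAX_ATOMIC_WORDS = 8  # assumed cutoff between a "short fragment" and "long text"


def labeling_units(element_texts, max_atomic_words=MAX_ATOMIC_WORDS):
    """Turn the text of HTML elements into labeling units.

    Short fragments are kept whole as atomic units. Long text is split
    further; the punctuation split below is a stand-in, not Semi-CRF.
    """
    units = []
    for text in element_texts:
        text = text.strip()
        if not text:
            continue
        if len(text.split()) <= max_atomic_words:
            units.append(text)  # atomic unit: not split into words
        else:
            parts = re.split(r"[,;.]\s*", text)
            units.extend(p.strip() for p in parts if p.strip())
    return units


short_units = labeling_units(["Jane Doe", "Price: $10"])
long_units = labeling_units(
    ["Jane Doe is a professor at Example University, "
     "and she works on web mining; her group is small."]
)
```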

Ø  Knowledge Base Features
           
·         We can treat the information in the knowledge base as additional training examples to compute the element (i.e. text fragment) emission probability, which is computed using a linear combination of the emission probabilities of the words within the element. In this way we can build more robust feature functions based on the element emission probabilities than on the word emission probabilities alone.
·         The knowledge base can also be used to check whether the current text fragment matches any stored attributes. We can apply a set of domain-independent string transformations to compute the matching degree between them.
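
Both knowledge base features can be illustrated with a small sketch. It rests on several assumptions that are not from the original work: uniform weights in the linear combination, add-one smoothing for the word probabilities, lowercasing plus punctuation removal as the string transformations, and difflib's similarity ratio as the matching degree.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def word_emission_probs(kb_values):
    """Estimate P(word | attribute) from knowledge-base values,
    using add-one smoothing so unseen words get a small probability."""
    counts = Counter(w for v in kb_values for w in v.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return lambda w: (counts[w.lower()] + 1) / (total + vocab)


def element_emission_prob(element, word_prob):
    """Linear combination of word emission probabilities.
    Uniform weights (a plain average) are an assumption of this sketch."""
    words = element.split()
    if not words:
        return 0.0
    return sum(word_prob(w) for w in words) / len(words)


def normalize(s):
    """Domain-independent transformations: lowercase, drop punctuation."""
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()


def matching_degree(fragment, kb_values):
    """Best similarity in [0, 1] between the fragment and any stored value."""
    f = normalize(fragment)
    return max(
        (SequenceMatcher(None, f, normalize(v)).ratio() for v in kb_values),
        default=0.0,
    )


kb_names = ["Jane Doe", "John Smith"]  # hypothetical stored "name" attributes
p_name = word_emission_probs(kb_names)
name_like = element_emission_prob("Jane Smith", p_name)
not_name_like = element_emission_prob("Contact Us", p_name)
exact = matching_degree("JANE DOE.", kb_names)
```

Here "Jane Smith" scores higher than "Contact Us" even though that exact name is not in the knowledge base, which is the robustness that element-level probabilities are meant to provide.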




2.     Detecting Maximum Recognition Units

We need to automatically detect highly accurate knowledge units, and the key here is to ensure that the precision is higher than or equal to that of human performance.





3.     Question Generation

By asking easy questions, iKnoweb can gain broad knowledge about the targeted entity. An example question could be: “Is the person a researcher? (Yes or No)”. The answer can help the system find the topic of the web appearances of the entity.
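
One way to picture how a yes/no answer narrows the topic is the sketch below. The cluster dictionaries, the topic labels and the apply_answer function are all hypothetical; the original text does not specify how iKnoweb represents web appearances or uses answers internally.

```python
def apply_answer(clusters, topic, answer_is_yes):
    """Keep the clusters of web appearances consistent with a yes/no
    answer to a question about `topic` (hypothetical representation)."""
    return [c for c in clusters if (c["topic"] == topic) == answer_is_yes]


# Hypothetical clusters of pages that mention the same ambiguous name.
clusters = [
    {"id": 1, "topic": "research"},
    {"id": 2, "topic": "sports"},
    {"id": 3, "topic": "research"},
]

# Question: "Is the person a researcher? (Yes or No)", answered "Yes".
kept = apply_answer(clusters, "research", answer_is_yes=True)
kept_ids = [c["id"] for c in kept]
```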

4.     Network Effects

A new user will directly benefit from the knowledge contributed by others, and our learning algorithm will be improved through users’ participation.

5.     Interaction Optimization

This component is used to determine when to ask questions, and when to invite users to initiate the interaction and to provide more signals.

System Configuration:-

H/W System Configuration:-


Processor        -    Pentium III
Speed            -    1.1 GHz
RAM              -    256 MB (min)
Hard Disk        -    20 GB
Floppy Drive     -    1.44 MB
Keyboard         -    Standard Windows Keyboard
Mouse            -    Two or Three Button Mouse
Monitor          -    SVGA



 

 S/W System Configuration:-


Operating System      -    Windows 95/98/2000/XP
Application Server    -    WampServer 2.2e
Front End             -    HTML, CSS
Scripts               -    JavaScript
Server-side Script    -    PHP
Database              -    MySQL



CONCLUSION
                       

How to accurately extract structured information about real-world entities from the Web has attracted significant interest recently. This paper summarizes our recent research work on statistical web entity extraction, which aims to extract all the related web information about the same entity and integrate it into a single information unit. In web entity extraction, it is important to take advantage of the following unique characteristics of the Web: visual layout, information redundancy, information fragmentation, and the availability of a knowledge base. Specifically, we first introduced our vision-based web entity extraction work, which considers visual layout information and knowledge base features in understanding the page structure and the text content of a web page. We then introduced our statistical snowball work, which automatically discovers text patterns from billions of web pages by leveraging the information redundancy property of the Web. We also introduced iKnoweb, an interactive knowledge mining framework, which collaborates with end users to connect the knowledge pieces mined from the Web and builds an accurate entity knowledge web.
