Abstract—Twitter has attracted millions of users to share and disseminate most up-to-date information, resulting in large volumes of
data produced every day. However, many applications in Information Retrieval (IR) and Natural Language Processing (NLP) suffer
severely from the noisy and short nature of tweets. In this paper, we propose a novel framework for tweet segmentation in a batch
mode, called HybridSeg. By splitting tweets into meaningful segments, the semantic or context information is well preserved and easily
extracted by the downstream applications. HybridSeg finds the optimal segmentation of a tweet by maximizing the sum of the stickiness
scores of its candidate segments. The stickiness score considers the probability of a segment being a phrase in English (i.e.,
global context) and the probability of a segment being a phrase within the batch of tweets (i.e., local context). For the latter, we propose
and evaluate two models to derive local context by considering the linguistic features and term-dependency in a batch of tweets,
respectively. HybridSeg is also designed to iteratively learn from confident segments as pseudo feedback. Experiments on two tweet
data sets show that tweet segmentation quality is significantly improved by learning both global and local contexts compared with using
global context alone. Through analysis and comparison, we show that local linguistic features are more reliable for learning local context
compared with term-dependency. As an application, we show that high accuracy is achieved in named entity recognition by applying
segment-based part-of-speech (POS) tagging.
In this paper, we focus on the task of tweet segmentation.
The goal of this task is to split a tweet into a sequence of consecutive
n-grams (n ≥ 1), each of which is called a segment. A
segment can be a named entity (e.g., a movie title “finding
nemo”), a semantically meaningful information unit (e.g.,
“officially released”), or any other types of phrases which
appear “more than by chance” [1]. Fig. 1 gives an example.
In this example, a tweet “They said to spare no effort to increase
traffic throughput on circle line.” is split into eight segments.
Semantically meaningful segments “spare no effort”,
“traffic throughput” and “circle line” are preserved.
Because these segments preserve semantic meaning of the
tweet more precisely than each of its constituent words
does, the topic of this tweet can be better captured in the
subsequent processing of this tweet. For instance, this
segment-based representation could be used to enhance the
extraction of geographical location from tweets because of
the segment “circle line” [12]. In fact, segment-based representation
has shown its effectiveness over word-based
representation in the tasks of named entity recognition and
event detection [1], [2], [13]. Note that a named entity is a
valid segment, but a segment is not necessarily a
named entity. In [6], the segment “korea versus greece” is
detected for the event related to the world cup match
between Korea and Greece.
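The split in the example above can be computed by dynamic programming over candidate segments, maximizing the summed stickiness score as described in the abstract. The `stickiness()` function below is a hypothetical stand-in for the framework's global/local scoring; the toy scores are chosen only to reproduce the example.

```python
# Sketch of optimal tweet segmentation by dynamic programming.
# stickiness() is a toy stand-in: in HybridSeg this score combines
# global n-gram probabilities and local context.
def stickiness(segment):
    # Illustrative scores only; real values come from context models.
    known = {"spare no effort": 3.0, "traffic throughput": 2.5,
             "circle line": 2.5, "they": 0.5, "said": 0.5, "to": 0.1,
             "increase": 0.8, "on": 0.1}
    return known.get(" ".join(segment), 0.2)

def segment_tweet(words, max_len=4):
    """Return the split maximizing the sum of segment stickiness scores."""
    n = len(words)
    best = [0.0] * (n + 1)   # best[i]: best total score for words[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last segment
    for i in range(1, n + 1):
        best[i] = float("-inf")
        for j in range(max(0, i - max_len), i):
            score = best[j] + stickiness(words[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n          # recover segments by backtracking
    while i > 0:
        segs.append(" ".join(words[back[i]:i]))
        i = back[i]
    return segs[::-1]
```

With the toy scores, the example tweet is split so that “spare no effort”, “traffic throughput”, and “circle line” are kept as whole segments.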
To achieve high quality tweet segmentation, we propose
a generic tweet segmentation framework, named HybridSeg.
HybridSeg learns from both global and local contexts, and has
the ability of learning from pseudo feedback.
Global context. Tweets are posted for information sharing
and communication. The named entities and semantic
phrases are well preserved in tweets. The global context
derived from Web pages (e.g., Microsoft Web N-Gram corpus)
or Wikipedia therefore helps identify the meaningful
segments in tweets. The method realizing the proposed
framework that solely relies on global context is denoted by
HybridSegWeb.
Local context. Tweets are highly time-sensitive so that
many emerging phrases like “She Dancin” cannot be found
in external knowledge bases. However, considering a large
number of tweets published within a short time period (e.g.,
a day) containing the phrase, it is not difficult to recognize
“She Dancin” as a valid and meaningful segment. We
therefore investigate two local contexts, namely local linguistic
features and local collocation. Observe that tweets
from many official accounts of news agencies, organizations,
and advertisers are likely well written. The well preserved
linguistic features in these tweets facilitate named
entity recognition with high accuracy. Each named entity is
a valid segment. The method utilizing local linguistic features
is denoted by HybridSegNER. It obtains confident segments
based on the voting results of multiple off-the-shelf
NER tools. Another method utilizing local collocation
knowledge, denoted by HybridSegNGram, is proposed based
on the observation that many tweets published within a
short time period are about the same topic. HybridSegNGram
segments tweets by estimating the term-dependency within
a batch of tweets.
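One common way to estimate term-dependency within a batch of tweets is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is an illustration of the idea only, not the paper's exact HybridSegNGram model.

```python
# Sketch: local term-dependency in a tweet batch via PMI of adjacent
# word pairs. High PMI suggests the pair forms a collocation, e.g.
# an emerging phrase like "she dancin" repeated across the batch.
import math
from collections import Counter

def adjacent_pmi(tweets):
    """Compute PMI(w1, w2) for every adjacent word pair in the batch."""
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        words = tweet.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (w1, w2), f in bigrams.items():
        p_pair = f / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return pmi
```

A pair that recurs across many tweets in the batch receives a positive PMI even when it appears in no external knowledge base.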
Pseudo feedback. The segments recognized based on local
context with high confidence serve as good feedback to
extract more meaningful segments. The learning from
pseudo feedback is conducted iteratively and the method
implementing the iterative learning is named HybridSegIter.
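The iterative learning described above can be sketched as a simple fixed-point loop: segment the batch, collect high-confidence segments, feed them back, and stop when no new confident segments appear. Both `segment_batch()` and `confidence()` below are hypothetical stand-ins for the framework's segmenter and its confidence estimate.

```python
# Sketch of iterative learning from pseudo feedback. segment_batch()
# and confidence() are hypothetical stand-ins, supplied by the caller.
def iterate_with_feedback(tweets, segment_batch, confidence,
                          threshold=0.8, max_iters=5):
    """Repeatedly segment the batch, feeding confident segments back in."""
    feedback = set()
    for _ in range(max_iters):
        segments = segment_batch(tweets, feedback)
        confident = {s for s in segments if confidence(s) >= threshold}
        if confident <= feedback:   # converged: no new confident segments
            break
        feedback |= confident
    return feedback
```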
We conduct extensive experimental analysis on HybridSeg
on two tweet data sets and evaluate the quality of tweet
segmentation against manually annotated tweets. Our
experimental results show that HybridSegNER and
HybridSegNGram, the two methods incorporating local context
in addition to global context, achieve significant
improvement in segmentation quality over HybridSegWeb,
the method using global context alone. Between the former
two methods, HybridSegNER is less sensitive to parameter
settings than HybridSegNGram and achieves better segmentation
quality. With iterative learning from pseudo feedback,
HybridSegIter further improves the segmentation quality.
As an application of tweet segmentation, we propose
and evaluate two segment-based NER algorithms. Both
algorithms are unsupervised in nature and take tweet segments
as input. One algorithm exploits co-occurrence of
named entities in targeted Twitter streams by applying random
walk (RW), under the assumption that named entities are
likely to co-occur. The other algorithm utilizes
Part-of-Speech (POS) tags of the constituent words in segments.
The segments that are likely to be a noun phrase (NP)
are considered as named entities. Our experimental results
show that (i) the quality of tweet segmentation significantly
affects the accuracy of NER, and (ii) POS-based NER method
outperforms RW-based method on both data sets.
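The POS-based idea can be sketched as a simple tag-pattern check: a segment whose words carry noun-phrase-like tags is kept as a named-entity candidate. The Penn Treebank tags and the pattern below are illustrative assumptions, not the paper's exact rule; the segment's tags are assumed to come from a segment-based POS tagger.

```python
# Sketch of the POS-based filter: keep a segment as a named-entity
# candidate if its tag sequence matches a simple noun-phrase pattern:
# optional determiner, optional adjectives, then one or more nouns.
def looks_like_noun_phrase(pos_tags):
    """True if pos_tags (Penn Treebank tags) matches DT? JJ* NN+."""
    i = 0
    if i < len(pos_tags) and pos_tags[i] == "DT":      # optional determiner
        i += 1
    while i < len(pos_tags) and pos_tags[i] == "JJ":   # optional adjectives
        i += 1
    nouns = 0
    while i < len(pos_tags) and pos_tags[i] in ("NN", "NNS", "NNP", "NNPS"):
        i += 1
        nouns += 1
    return nouns >= 1 and i == len(pos_tags)           # all tags consumed
```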
The remainder of this paper is organized as follows.
Section 2 surveys related work on tweet segmentation.
Section 3 defines tweet segmentation and describes the
proposed framework. Section 4 details how the local context
is exploited in the framework. In Section 5, the segment-
based NER methods are investigated. In Section 6,
we evaluate the proposed HybridSeg framework and the
two segment-based NER methods.
4 LEARNING FROM LOCAL CONTEXT
As illustrated in Fig. 2, the segment phraseness Pr(s) is computed
based on both global and local contexts. Based on
Observation 1, Pr(s) is estimated using the n-gram probability
provided by the Microsoft Web N-Gram service, derived
from English Web pages. We now detail the estimation of
Pr(s) by learning from local context based on Observations
2 and 3. Specifically, we propose learning Pr(s) from the
results of off-the-shelf Named Entity Recognizers
(NERs), and learning Pr(s) from local word collocation in a
batch of tweets. The two corresponding methods utilizing
the local context are denoted by HybridSegNER and
HybridSegNGram, respectively.
4.1 Learning from Weak NERs
To leverage the local linguistic features of well-written
tweets, we apply multiple off-the-shelf NERs trained on formal
texts to detect named entities in a batch of tweets T by
voting. Voting by multiple NERs partially alleviates the
errors due to noise in tweets. Because these NERs are not
specifically trained on tweets, we also call them weak
NERs. Recall that each named entity is a valid segment; thus the
detected named entities are valid segments.
Given a candidate segment s, let f_s be its total frequency
in T. A NER r_i may recognize s as a named entity f_{r_i,s} times.
Note that f_{r_i,s} ≤ f_s, since a NER may recognize only some of
s's occurrences as a named entity in all tweets of T. Assuming
there are m off-the-shelf NERs r_1, r_2, ..., r_m, we further
denote by f^R_s the number of NERs that have detected at
least one occurrence of s as a named entity:

f^R_s = Σ_{i=1}^{m} I(f_{r_i,s}),

where I(f_{r_i,s}) = 1 if f_{r_i,s} > 0, and I(f_{r_i,s}) = 0 otherwise.
We approximate the probability of s being a valid named
entity (i.e., a valid segment) using a voting algorithm defined
by Eq. (4):
P^
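The counts f_{r_i,s} and f^R_s can be computed directly from the recognizers' outputs. In the sketch below, each NER's output over the batch is mocked as a list of detected entity strings (with repetition); this representation is an assumption for illustration.

```python
# Sketch of the voting counts: f_{r_i,s} is how often NER r_i tags
# segment s as a named entity across the batch, and f^R_s counts how
# many of the m NERs detected s at least once (the indicator sum).
from collections import Counter

def ner_vote_counts(ner_outputs, s):
    """ner_outputs: list of m lists, one per off-the-shelf NER, each
    holding the entities that NER detected across the batch T."""
    f_ri_s = [Counter(out)[s] for out in ner_outputs]   # per-NER counts
    f_R_s = sum(1 for f in f_ri_s if f > 0)             # indicator sum
    return f_ri_s, f_R_s
```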