This is a guide to annotating "microblog" entries (i.e. tweets) from Twitter.
`*` is always a token.
"..." is one token.
"dessert/espresso" s.b. "dessert" "/" "espresso" (three tokens).
<del>Most tweets are in and of themselves utterances. </del> NO
Tweets often end with a dangling mixture of URLs and hashtags. Dangling URLs and strings of hashtags and at-mentions are separate utterances.
Example:
#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie #forgotthemilk #dark #espresso #palpiations #ineedanek … http://t.co/pT42jAdnmQ
[#Coffee Instagram by @kmarrero7] [Iced coffee Starbucks newbie] [#forgotthemilk #dark #espresso #palpiations #ineedanek …] [http://t.co/pT42jAdnmQ]
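As a rough sketch of the dangling-material rule, the function below peels trailing URL and hashtag/at-mention runs off a whitespace-split tweet. The names `kind` and `split_dangling` are hypothetical. Two choices are assumptions rather than part of the guidelines: hashtags and at-mentions are treated as one kind of run, and an ellipsis is attached to whatever precedes it. The remaining body still has to be segmented by hand.

```python
def kind(token):
    """Classify a whitespace-separated token (hypothetical helper)."""
    if token.startswith(("http://", "https://")):
        return "url"
    if token.startswith(("#", "@")):
        return "tag"
    if token in ("…", "..."):
        return "ellipsis"
    return "word"


def split_dangling(tokens):
    """Return (body, groups): the tweet body and its trailing URL/tag runs."""
    kinds = []
    for tok in tokens:
        k = kind(tok)
        if k == "ellipsis":
            # Assumption: an ellipsis joins the run of the token before it.
            k = kinds[-1] if kinds else "word"
        kinds.append(k)

    groups = []
    end = len(tokens)
    # Walk backwards, collecting maximal runs until an ordinary word is hit.
    while end > 0 and kinds[end - 1] != "word":
        start = end - 1
        while start > 0 and kinds[start - 1] == kinds[end - 1]:
            start -= 1
        groups.insert(0, tokens[start:end])
        end = start
    return tokens[:end], groups


tweet = ("#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie "
         "#forgotthemilk #dark #espresso #palpiations #ineedanek … "
         "http://t.co/pT42jAdnmQ")
body, groups = split_dangling(tweet.split())
print(body)
print(groups)
```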
When adjacent NPs are not items in a list, each NP can be a separate utterance.
Example:
#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie #forgotthemilk #dark #espresso #palpiations #ineedanek … http://t.co/pT42jAdnmQ
[#Coffee Instagram by @kmarrero7] [Iced coffee Starbucks newbie] [#forgotthemilk #dark #espresso #palpiations #ineedanek …] [http://t.co/pT42jAdnmQ]
In constructed dialog with explicit speakers, each speaker's turn should be thought of as an abbreviated form of "X said ..." and is therefore one utterance.
Example:
Me: “Lexi‚ are you drinking black coffee?“ Lexi: “Yeah‚ like my soul“ http://t.co/VByA66xwsN
[Me: “Lexi‚ are you drinking black coffee?“] [Lexi: “Yeah‚ like my soul“] [http://t.co/VByA66xwsN]
Modern Internet initialisms, such as "idk", "brb", and "lol", are tagged `ACR`. This is intended for the benefit of CMC researchers, who may be interested in these explicitly despite their numerous possible syntactic roles. Additionally, some initialisms aren't constituents: "idk" takes a whole CP as an argument. Better to keep it simple here.
"RT" are IN
.
URLs should be tagged `URL`.
Usernames (beginning with `@`) are tagged `NP`/`NPS`. They're typically names. If they aren't, they've been turned into names.
"..." is ":", since it may not be sentence terminal.
`#` is Twitter's most infamous feature. First and foremost, hashtags retain their original parts of speech, but they should also be marked with an "H" suffix because they're licensed to stand alone in some circumstances. Examples:
`NNH`: "#coffee"
`NNSH`: "#beers"
`NPH`: "#kimkardashian"
`VVGH`: "#winning"
`JJH`: "#hot"
For cases that are more complex, take the head word of the highest-level constituent.
`NNH`: "#BananaSplit"
`NPH`: "#HeladoGourmet"
`NNSH`: "#HatsOnForHarry"
`NNH`: "#Brownie&Cream"
See [Twitter Constituent Parsing](Twitter Constituent Parsing).