This is a guide to annotating "microblog" entries (i.e. tweets) from Twitter.
* is always a token.
"..." is one token.
"dessert/espresso" s.b. "dessert" "/" "espresso" (three tokens).
<del>Most tweets are in and of themselves utterances. </del> NO
Tweets often end with a mixture of dangling URLs and hashtags. Dangling URLs and strings of hashtags and at-mentions are separate utterances.
Example:
#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie #forgotthemilk #dark #espresso #palpiations #ineedanek … http://t.co/pT42jAdnmQ
[#Coffee Instagram by @kmarrero7] [Iced coffee Starbucks newbie] [#forgotthemilk #dark #espresso #palpiations #ineedanek …] [http://t.co/pT42jAdnmQ]
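As a rough sketch of the dangling-material rule, the function below peels trailing URLs and trailing runs of hashtags and at-mentions off the end of a tweet. The names `kind` and `split_dangling` are hypothetical, and attaching a trailing "…" to the hashtag run before it is an assumption read off the example above. The sketch does not split the remaining body into further utterances, and it cannot tell whether a single final hashtag belongs to the sentence; both judgments stay with the annotator.

```python
def kind(token):
    """Classify a whitespace-delimited token for segmentation purposes."""
    if token.startswith(("http://", "https://")):
        return "url"
    if token[0] in "#@":
        return "tag"
    if token in ("…", "..."):
        return "ellipsis"
    return "text"


def split_dangling(tweet):
    """Return the tweet body followed by its dangling utterances."""
    tokens = tweet.split()
    end = len(tokens)
    tail = []  # dangling utterances, collected from the end backwards
    while end > 0:
        k = kind(tokens[end - 1])
        start = end - 1
        # Assumption: a trailing ellipsis joins the tag run before it.
        if k == "ellipsis" and start > 0 and kind(tokens[start - 1]) == "tag":
            k = "tag"
        if k not in ("url", "tag"):
            break
        # Extend the run leftwards while the tokens are of the same kind.
        while start > 0 and kind(tokens[start - 1]) == k:
            start -= 1
        tail.append(" ".join(tokens[start:end]))
        end = start
    body = " ".join(tokens[:end])
    return ([body] if body else []) + tail[::-1]


tweet = ("#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie "
         "#forgotthemilk #dark #espresso #palpiations #ineedanek … "
         "http://t.co/pT42jAdnmQ")
for utterance in split_dangling(tweet):
    print("[" + utterance + "]")
```

On the example tweet this separates the hashtag string and the URL from the body; the body itself is left as one piece.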
If it's not a list, NPs can be separate utterances.
Example:
#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie #forgotthemilk #dark #espresso #palpiations #ineedanek … http://t.co/pT42jAdnmQ
[#Coffee Instagram by @kmarrero7] [Iced coffee Starbucks newbie] [#forgotthemilk #dark #espresso #palpiations #ineedanek …] [http://t.co/pT42jAdnmQ]
Constructed dialog with explicit speakers should be thought of as an abbreviated form of "X said"; each speaker's turn is therefore one utterance.
Example:
Me: “Lexi‚ are you drinking black coffee?“ Lexi: “Yeah‚ like my soul“ http://t.co/VByA66xwsN
[Me: “Lexi‚ are you drinking black coffee?“] [Lexi: “Yeah‚ like my soul“] [http://t.co/VByA66xwsN]
Modern Internet initialisms, such as "idk", "brb", and "lol", are ACR. This is for the benefit of CMC researchers, who may be interested in these explicitly despite their numerous possible syntactic roles. Additionally, some initialisms aren't constituents: "idk" takes a whole CP as an argument. It is better to keep things simple here.
"RT" are IN.
URLs should be tagged URL.
Usernames (beginning with @) are NP/NPS. They are typically names; if they aren't, they have been turned into one.
"..." is ":", since it may not be sentence terminal.
# is Twitter's most infamous feature. First and foremost, hashtags retain their original parts of speech, but they should also be marked "H" because they're licensed to stand alone in some circumstances. Examples:
NNH: "#coffee"
NNSH: "#beers"
NPH: "#kimkardashian"
VVGH: "#winning"
JJH: "#hot"
For cases that are more complex, take the head word of the highest level constituent.
NNH: "#BananaSplit"
NPH: "#HeladoGourmet"
NNSH: "#HatsOnForHarry"
NNH: "#Brownie&Cream"
See [Twitter Constituent Parsing](Twitter Constituent Parsing).
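The hashtag convention can be illustrated with two small helpers. Both names, `split_hashtag` and `hashtag_tag`, are hypothetical. The first breaks a CamelCase hashtag into candidate words so the annotator can find the head; the second appends "H" to the head word's tag. Choosing the head word is not automated here, and all-lowercase hashtags made of several words are not split.

```python
import re

# Runs of capitals, capitalized words, lowercase words, digits, or a symbol.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+|[^\w\s]")


def split_hashtag(hashtag):
    """Split a CamelCase hashtag into its component words."""
    return WORD_RE.findall(hashtag.lstrip("#"))


def hashtag_tag(head_pos):
    """Mark the head word's part-of-speech tag as belonging to a hashtag."""
    return head_pos + "H"


print(split_hashtag("#HatsOnForHarry"))
print(hashtag_tag("NNS"))
```

For "#HatsOnForHarry" the annotator would pick "Hats" (NNS) as the head from the printed word list, giving NNSH as in the examples above.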