As a general rule, hyphenated words should be split, since they can often be spelled apart. For example:
The same logic applies to participles and their argument, as well as 'self':
Spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):
Exceptions which should not be tokenized apart include:
Keep URLs together, even if they contain discernible words or hyphens:
Many dates are written as if they contained a genitive 's. These items should be treated as plurals, and thus as single tokens. For example:
But if a year really does have a genitive 's in it, it should be tokenized separately:
Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:
The <w> tag is not used in cases of morphologically complex words which are analyzed as single tokens, such as:
Each sentence tag <s> receives a type attribute from the following list:
decl - declarative sentence (indicative)
imp - imperative
sub - subjunctive, including modals like would, could, but not indicative future 'will', and deontic 'have to'/'got to' (=must)
q - a polar, yes/no question
wh - a WH question (e.g. who, what, why, where, when, how)
inf - an independent infinitive-headed clause (e.g. 'To kill a mockingbird.', or 'How to dance.')
ger - an independent gerund-headed clause (e.g. 'Finding Nemo')
intj - an interjection utterance ('Yes.', 'Hello!', 'Um...')
frag - a fragment without a subject predicate structure, lacking a finite verb, not covered by the above ('The End.', 'At home.')
multiple - a coordination of two or more types above ('I'm done and you shut up now!' - decl + imp)
other - a construction not covered by the above (e.g. nominal predication 'Nice, that!')
Note that multiple takes priority over other (e.g. decl+other = multiple).
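As a rough illustration of how the type inventory above could be checked mechanically, the following Python sketch flags any <s> element whose type attribute is missing or not in the list. The function name, the <text> wrapper element, and the sample document are assumptions made for this example only; they are not part of the guidelines.

```python
import xml.etree.ElementTree as ET

# Sentence types copied from the list in the guidelines.
S_TYPES = {
    "decl", "imp", "sub", "q", "wh", "inf",
    "ger", "intj", "frag", "multiple", "other",
}


def invalid_sentence_types(xml_text):
    """Return (index, type value) for each <s> with a missing or unknown type.

    Hypothetical helper: assumes the document is well-formed XML with a
    single root element containing the <s> elements.
    """
    root = ET.fromstring(xml_text)
    problems = []
    for index, sentence in enumerate(root.iter("s")):
        s_type = sentence.get("type")  # None if the attribute is absent
        if s_type not in S_TYPES:
            problems.append((index, s_type))
    return problems


# Made-up sample; the <text> wrapper is an assumption for the example.
sample = (
    "<text>"
    '<s type="decl">I am done.</s>'
    '<s type="intj">Hi Lou!</s>'
    '<s type="question">Are you done?</s>'
    "<s>Finding Nemo</s>"
    "</text>"
)

print(invalid_sentence_types(sample))
```

Here the third sentence is reported because "question" is not one of the listed values (it should be q), and the fourth because it has no type attribute at all.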
In certain cases, what looks like a modal can actually be indicative, e.g. 'can' describing ability. This should be tagged as decl if it's simply a statement of fact:
A modal 'can' of potential, not ability, is tagged sub (this is the more common case):
Similarly, 'will' can be used in a non-indicative way, in which case the sentence is tagged sub:
<s type="sub">Boys will be boys</s> (i.e. they may well behave as boys; this is not an indicative future claiming some boys will in fact be boys)
<s type="sub">I couldn't stand it if it spoke.</s> (i.e. whenever it might have spoken, I wouldn't have been able to stand it.)
Forms of address in the vocative combining an interjection and a noun phrase will be tagged intj, e.g. <s type="intj">Hi Lou!</s>.
The category multiple is meant for sentences containing two (or more) complete clauses of varying types (e.g. 'do it and I don't care how!' - imp + decl).
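The rule for coordinated clauses can be sketched as a small Python function. It takes the types of the complete, coordinated main clauses only (subordinate and parenthetical clauses are assumed to have been excluded already) and returns multiple whenever they differ. The function name is hypothetical, and returning the shared type when all clauses agree is an assumption, since the guidelines only describe the case of varying types.

```python
def combine_clause_types(clause_types):
    """Return one sentence type for a coordination of complete main clauses.

    Hypothetical helper. Clauses of varying types yield 'multiple',
    which also covers decl + other, since multiple takes priority over other.
    """
    types = list(clause_types)
    if not types:
        raise ValueError("at least one clause type is required")
    if len(set(types)) == 1:
        # Assumption: identical types simply keep that type.
        return types[0]
    return "multiple"


print(combine_clause_types(["imp", "decl"]))    # 'do it and I don't care how!'
print(combine_clause_types(["decl", "other"]))  # multiple beats other
print(combine_clause_types(["decl", "decl"]))   # same type on both sides
```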
The multiple category does not apply when there is a main clause of one type and a subordinate clause of a different type, e.g. "washing the dishes, John noticed the burglar". In this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type (ger), since there is really only one main matrix clause: the past tense one with "noticed".
The multiple category also does not apply when parenthetical sentences are present; parenthetical sentences may be 'below the level' of the main clause, and so only the type of the main clause applies. For example, the following is a sub type, notwithstanding the parenthetical clause in italics:
For declarative-like phatic insertions (back-channeling and similar), such as "you know" and "I mean", which are typically labeled as organization-phatic in RST (see the RST guidelines) and 'parataxis' in dependencies, we disregard the phatic expression in determining the sentence type. As a result, the following examples are frag and intj, and not multiple or decl:
(frag)
(intj)
This exception does not apply when verbs like 'mean' or 'know' actually specify a complement:
(decl)
There is a hierarchy among the sentence types that sometimes comes into play when a sentence fits two definitions. Specifically, being a question gets ‘first dibs’ on the sentence type. We might want to say that a sentence is both hypothetical and a question, for example "Would you do it if you could?", but we only get one label, and whether or not something is a question is seen as more crucial, so this example gets the type q (yes/no question).
Unless you have mixed sentence types (multiple), priorities are:
- wh beats q (if a question has a wh word, it is wh)
- q beats anything else
- frag beats intj (e.g. "yes, that book" is frag)
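The priorities can be sketched as a short Python function for the case where a single clause fits more than one definition. The function name is hypothetical. The sketch treats wh as outranking everything (it beats q, which beats anything else), and it raises an error for combinations the guidelines do not rank rather than guessing.

```python
def resolve_sentence_type(candidates):
    """Pick one label for a single clause that fits several definitions.

    Hypothetical helper. Does not cover coordinated clauses of
    different types, which are tagged 'multiple' instead.
    """
    options = set(candidates)
    if not options:
        raise ValueError("at least one candidate type is required")
    if "wh" in options:
        return "wh"    # wh beats q, and q beats anything else
    if "q" in options:
        return "q"     # being a question takes precedence
    if len(options) == 1:
        return next(iter(options))
    if options == {"frag", "intj"}:
        return "frag"  # frag beats intj
    # The guidelines give no ranking for other combinations.
    raise ValueError(f"no priority defined for {sorted(options)}")


print(resolve_sentence_type({"sub", "q"}))      # "Would you do it if you could?"
print(resolve_sentence_type({"frag", "intj"}))  # "yes, that book"
```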