As a general rule, hyphenated words should be split, since they can often be spelled apart. For example:
The same logic applies to participles and their argument, as well as 'self':
Spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):
Exceptions which should not be tokenized apart include:
Keep URLs together, even if they contain discernible words or hyphens:
Many dates are written as if they contained a genitive 's. These items should be treated as plurals, and thus as single tokens. For example:
But if a year really does have a genitive 's in it, it should be tokenized separately:
Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:
The <w> tag is not used in cases of morphologically complex words which are analyzed as single tokens, such as:
Each sentence tag <s> receives a type attribute from the following list:
decl - declarative sentence (indicative)
imp - imperative
sub - subjunctive, including modals like would, could, and deontic 'have to'/'got to'/'need to' (=must), but not indicative future 'will'
q - a polar, yes/no question
wh - a WH question (e.g. who, what, why, where, when, how)
inf - an independent infinitive-headed clause (e.g. 'To kill a mockingbird.', or 'How to dance.')
ger - an independent gerund-headed clause (e.g. 'Finding Nemo')
intj - an interjection utterance ('Yes.', 'Hello!', 'Um...')
frag - a fragment without a subject predicate structure, lacking a finite verb, not covered by the above ('The End.', 'At home.')
multiple - a coordination of two or more types above ('I'm done and you shut up now!' - decl + imp)
other - a construction not covered by the above (e.g. nominal predication 'Nice, that!')
Note that multiple takes priority over other (e.g. decl+other = multiple).
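The inventory above is closed, so type values can be checked mechanically. The following is a minimal sketch, not part of the GUM tooling: the names S_TYPES, S_OPEN and invalid_s_types are hypothetical, and the regular expression assumes that type is the only attribute on the tag and is written with double quotes.

```python
import re

# The eleven type values from the list above.
S_TYPES = frozenset({
    "decl", "imp", "sub", "q", "wh", "inf",
    "ger", "intj", "frag", "multiple", "other",
})

# Assumption: the opening tag looks like <s type="...">, with no other attributes.
S_OPEN = re.compile(r'<s\s+type="([^"]*)"\s*>')


def invalid_s_types(text):
    """Return every type value in an <s type="..."> tag that is not in the inventory."""
    return [value for value in S_OPEN.findall(text) if value not in S_TYPES]


sample = (
    '<s type="ger">Finding Nemo</s>\n'
    '<s type="intj">Yes.</s>\n'
    '<s type="question">Why?</s>'
)
print(invalid_s_types(sample))  # ['question']
```

A check like this only catches values outside the list; it cannot tell whether a valid label was the right choice for the sentence.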
In certain cases, what looks like a modal can actually be indicative, e.g. 'can' describing ability - this should be tagged as decl if it's simply a statement of fact:
A modal 'can' of potential, not ability, is tagged sub (this is the more common case):
Similarly, 'will' can be used in a non-indicative way, in which case the sentence is tagged sub:
<s type="sub">Boys will be boys</s> (i.e. they may well behave as boys; this is not an indicative future claiming some boys will in fact be boys)
<s type="sub">I couldn't stand it if it spoke.</s> (i.e. whenever it might have spoken, I wouldn't have been able to stand it)
Forms of address in the vocative combining an interjection and a noun phrase will be tagged intj, e.g. <s type="intj">Hi Lou!</s>.
The frag tag applies to NP, PP, ADVP and ADJP fragments when they are not instantiating one of the other categories. Subordinate clauses without a main clause are other, not frag. For example, the following are frag:
The following are not frag because they belong to a different category:
(other)
(intj)
(intj, adjective interpreted as a response particle)
(other, since there is a subject in non-canonical order with no verb)
Note that the inventory of response adjectives should be kept small, e.g. "fine", "right" when meaning a simple "yes", but not "excellent" (potentially also an evaluation adjective).
The category multiple is meant for sentences containing two (or more) complete clauses of varying types (e.g. "do it and I don’t care how!" - imp + decl).
The multiple category does not apply when there is a main clause of one type and a subordinate clause of a different type, e.g. "washing the dishes, John noticed the burglar" - in this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type (ger), since there is really only one main matrix clause: the past tense one with "noticed".
The multiple category also does not apply when parenthetical sentences are present; parenthetical sentences may be 'below the level' of the main clause, and so only the type of the main clause applies. For example, the following is a sub type, notwithstanding the parenthetical clause in italics:
For declarative-like phatic insertions (back-channeling and similar), such as "you know" and "I mean", which are typically labeled as organization-phatic in RST (see the RST guidelines) and 'parataxis' in dependencies, we disregard the phatic expression in determining the sentence type. As a result, the following examples are frag and intj and not multiple or decl.
(frag)
(intj)
This exception does not apply when verbs like 'mean' or 'know' actually specify a complement:
(decl)
There is a hierarchy among the sentence types that sometimes comes into play when sentences fit two definitions. Specifically, being a question gets ‘first dibs’ on the sentence type. We might have wanted to say about a sentence that it’s both hypothetical and a question, for example: "Would you do it if you could?". But we only get one label, and whether or not something is a question is seen as more crucial, so this example gets the type q (yes/no question).
Unless you have mixed sentence types (multiple), priorities are:
- wh beats q (if a question has a wh word, it is wh)
- q beats anything else
- frag beats intj (e.g. "yes, that book" is frag)
Morphological segmentation in GUM is not annotated as part of LING-4427. It is annotated semi-automatically using three resources, relying primarily on the Unimorph lexical resource (Kirov et al. 2018), specifically using scripts based on the lexicon data here, and expanded using data from CELEX and Universal Segmentations. Analyses are concatenative, using hyphens as separators, and are guaranteed to sum up to the string of each token with only hyphens added. Existing hyphens in a word form are retained and assumed to be meaningful. Analyses cover inflection, derivation and compounding. For example:
Note that stems are retained in their orthographic forms (explanation does not become explain+ation), and 'etymological affixation' in loanwords is not necessarily analyzed (e.g. "ex" is not split off since the corresponding affixation process is no longer interpretable in English).
Pluralization, tense inflection and participial endings are generally segmented. In case of vowel changes, affixes are still separated but vowels are simply taken over from the surface form. If no affix is separable, no segmentation is carried out. For example:
Note that in choosing the segmentation point, preserving the form of the affix takes priority, e.g. 'bigg-er' is preferred to 'big-ger', since there is no suffix '-ger'. As a result, 'giv-en' supersedes a possible 'give-n', which would have preserved the stem form better. The verb forms 'bit' and 'wrote' are not segmented, since the tense is only expressed via vowel change.
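The guarantee that an analysis adds only hyphens to the token string can be checked mechanically. The following is a minimal sketch and not one of the GUM or Unimorph scripts; the function name is_valid_segmentation is hypothetical. It checks the concatenative property only, not whether a segmentation point is the preferred one.

```python
def is_valid_segmentation(token: str, analysis: str) -> bool:
    """Return True if `analysis` is `token` with zero or more hyphens inserted."""
    i = 0  # position of the next unmatched character in the token
    for ch in analysis:
        if i < len(token) and ch == token[i]:
            i += 1        # character agrees with the surface form
        elif ch == "-":
            continue      # an inserted separator, not part of the token
        else:
            return False  # the analysis changed or added a non-hyphen character
    return i == len(token)  # every token character must be accounted for


# Segmentations mentioned in the text above.
print(is_valid_segmentation("bigger", "bigg-er"))  # True
print(is_valid_segmentation("given", "giv-en"))    # True
print(is_valid_segmentation("wrote", "wrote"))     # True (unsegmented)

# A normalized stem breaks the concatenative guarantee.
print(is_valid_segmentation("explanation", "explain-ation"))  # False
```

Both 'giv-en' and 'give-n' pass this check, since each is the token with one hyphen added; choosing between them is a matter of the affix-preservation priority described above, which this sketch does not model.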
Derivational affixes are segmented when the process is associated with an English morphological process, generally corresponding to a more or less transparent derivation, for which at least the meaning of the stem is recoverable. Vowel changes are again simply reproduced in the analysis. For example:
Both prefixes and suffixes are identified (first three examples). Vowel change behavior is demonstrated by 'strength'. For 'fundamental', note that even though '-ment' is a common suffix, the Unimorph lexicon does not recognize an internal derivation, since, although 'fundament' is recognizable as a word and stem, it is not actively derived from 'fund' in English. Finally, although 'describe' is paralleled by e.g. 'inscribe' with other (Latinate) prefixes, there is no underlying active combinatory process in English.
Compounds that are spelled together are separated into their constituents. Affixoids prioritize preservation of the affix form, similarly to inflection. All processes, including inflection and derivation, can occur concurrently. Examples:
Deciding which processes are active in a language or word is inherently subjective. In many cases, the decisions in the lexicon provided by Unimorph may seem odd; however, for simplicity and consistency with Unimorph, we have simply reproduced segmentations from the resource. For example, affixes generally apply regardless of senses, which are not annotated in Unimorph. As a result, the following segmentations occur in the data:
We therefore advise caution in interpreting the analyses cognitively and direct readers to the Unimorph documentation for reference.
Some more examples with many segments: