Dependency annotation generally follows Universal Dependencies, currently version 2.0, based on McDonald et al. (2013) (see https://universaldependencies.org/en/dep/).
Instructions for some special cases follow below the list of labels.
The copula 'be' appears primarily in three constructions:
In the normal predicative construction 'A is B', the nominal predicate 'B' is the root, and 'A' is the nsubj. The verb 'be' itself is a dependent of the predicate B and takes the label cop.
Similarly, when the predicate is a prepositional phrase, the convention is to analyze the nominal head of the prepositional phrase as the root. The rest are dependent on the head: the preposition as case, the copula as cop and the subject as nsubj.
In the existential construction 'there is A', the verb 'be' is taken to mean 'exist', and is labeled as the root. The subject is A (nsubj) and expletive 'there' is labeled expl.
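As a rough illustration of the three copula analyses above, the following sketch encodes one invented sentence per construction as (id, form, head, deprel) tuples. The sentences, the det label on the articles, and the helper names are assumptions made for this example, not part of the guidelines.

```python
# Each token is (id, form, head, deprel); head 0 marks the root.
# Example sentences and the 'det' labels are illustrative assumptions.
PREDICATIVE = [
    (1, "Kim", 4, "nsubj"),
    (2, "is", 4, "cop"),
    (3, "a", 4, "det"),
    (4, "teacher", 0, "root"),
]
PREPOSITIONAL = [
    (1, "Kim", 5, "nsubj"),
    (2, "is", 5, "cop"),
    (3, "in", 5, "case"),
    (4, "the", 5, "det"),
    (5, "garden", 0, "root"),
]
EXISTENTIAL = [
    (1, "There", 2, "expl"),
    (2, "is", 0, "root"),
    (3, "a", 4, "det"),
    (4, "problem", 2, "nsubj"),
]


def root_form(tokens):
    """Return the word form of the single root token."""
    roots = [form for _, form, head, _ in tokens if head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def label_of(tokens, form):
    """Return the deprel of the first token with the given form."""
    return next(rel for _, f, _, rel in tokens if f == form)
```

Only in the existential case does 'is' itself come out as the root; in the other two it is a cop dependent of the predicate.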
Dates with multiple coreferent parts are handled as appositions (appos). For example, "Monday, the 5th" constitutes two mentions of the same day. By the 'rule of first dibs', the apposition goes from 'Monday' to '5th'. When constructing a calendar date, regardless of the order among 'year', 'month' and 'day', the 'day' is always the head. The 'month' and 'year' are dependent on the 'day', receiving 'compound' and 'nmod:unmarked' respectively. In other words, 'February 5' is a type of '5' (not a type of 'February'); years added to dates are seen as temporal modifiers of the day expression.
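The calendar-date rule can be sketched as a small helper that always makes the day the head. The function name date_edges and the (dependent, head, deprel) triple format are hypothetical choices for this illustration.

```python
def date_edges(day, month=None, year=None):
    """Return (dependent, head, deprel) triples for a calendar date.

    The day is the head regardless of surface order; the month attaches
    as 'compound' and the year as 'nmod:unmarked'.
    """
    edges = []
    if month is not None:
        edges.append((month, day, "compound"))
    if year is not None:
        edges.append((year, day, "nmod:unmarked"))
    return edges


# 'February 5, 2020' and '5 February 2020' yield the same edges.
february_fifth = date_edges(day="5", month="February", year="2020")
```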
Image credits of the type 'Image: XYZ' are treated as a construction of their own and are not analyzed as parataxis or nominal predication (root+nsubj). Instead, the convention is to use the dep label to point from the first part ('image') to the head of the second part. This avoids counting these constructions when searching e.g. for subjects or nominal sentences.
The same logic applies to quotation attribution without a speech verb. For example, in:
"To be or not to be" -- Hamlet
The root is in the quotation, and 'Hamlet' is attached to it as dep. This does not apply if a speech verb is present; for example, 'said' is the root in:
"To be or not to be", said Hamlet.
Although the proper noun tag is applied even to (capitalized) adjectives in complex names, syntactic analysis should still treat them as adjectives etc. The rationale is that the POS tag can help find names, while a function label such as amod allows us to identify the internal structure of the name in question.
For complex personal names, we make the first token the head, and if we don't know anything about the internal structure, then everything else is flat from that:
However if there are sub-groups within a first name and a middle or last name, each can form its own 'flat' group, so "Jim Bob Schmidt - Bailey" would be:
The apparent 'double object' construction with 'make' and similar verbs is given a small clause type of analysis, wherein the object of the verb 'make' is seen as the essential role (rather than as the subject of an embedded predication). In other words, 'make A a B' is analyzed as making A to be a B. As a result, the analysis uses the xcomp label emanating from 'make' to signify that the accusative object of 'make' is the same as the subject of the clausal predicate, but the 'thing being made' is internally labeled as the object of the main predication. This can be seen in the image below:
Another way of thinking of this is that the analysis means: make (that woman) (to be the president)
Verbs such as 'let' in "let someone do something" or 'allow' in "allow A to do B" are analyzed as governing an xcomp clause, where the noun following the verb acts as the object of the main clause, not as the subject of the subordinate small clause.
Verbs like 'call' or 'name' appear to take a double accusative object, e.g. "John called [Mary] [a saint]". This makes it hard to distinguish the name argument from the named theme argument. The guidelines instead favor a different analysis using xcomp. The idea is that the naming action creates a direct object (the named) and a small clause with the name as predicate: John performs a naming act, whose object is Mary and the small clause predicate is a saint.
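The xcomp analysis of "John called Mary a saint" can be written out as a token list. This is a minimal sketch: the tuple layout, the det label on the article, and the helper dependents are assumptions for illustration.

```python
# (id, form, head, deprel); head 0 marks the root.
CALL = [
    (1, "John", 2, "nsubj"),
    (2, "called", 0, "root"),
    (3, "Mary", 2, "obj"),     # the named entity is the direct object
    (4, "a", 5, "det"),        # 'det' is an assumption, not from the text
    (5, "saint", 2, "xcomp"),  # the name is the small clause predicate
]


def dependents(tokens, head_form):
    """Map each dependent's form to its deprel for the given head form."""
    head_ids = {tid for tid, form, _, _ in tokens if form == head_form}
    return {form: rel for _, form, head, rel in tokens if head in head_ids}
```

The same shape would carry over to 'make A a B': A is obj of 'make' and B is its xcomp.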
In some cases, a whole phrase can be used in place of a single word, e.g. as a compound modifier. In these cases, the complex modifier should be analyzed internally, and its local root is still attached to the token it modifies with the normal label.
In the example, 'what to buy' is an infinitive + object with an internal analysis, but it functions much like a compound modifier (cf. 'the shopping section'). For this reason, it is attached at its head (the verb) with the function compound.
In adverbial clauses, the subordinating conjunction is labeled as 'mark' by convention for 'if' and 'whether' clauses. However for WH adverbs such as 'when' or 'where' we use advmod, which indicates that they are seen as serving the function of an adverbial in the subordinate clause (i.e. representing the 'place where' or 'time when'). In interrogative clauses too, they are labeled advmod. This applies to 'when', 'where' and 'how', paralleling such adverbs as 'then', 'there', and 'thus'.
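The labeling choice just described amounts to a small lookup. The sketch below covers only the words named in this guideline; the function name and the None fallback for all other words are assumptions.

```python
MARK_SUBORDINATORS = {"if", "whether"}
WH_ADVERBS = {"when", "where", "how"}


def subordinator_label(word):
    """Return 'mark' or 'advmod' for the clause introducers named above.

    Words this guideline does not mention return None.
    """
    w = word.lower()
    if w in MARK_SUBORDINATORS:
        return "mark"
    if w in WH_ADVERBS:
        return "advmod"
    return None
```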
Compare:
Adverbial infinitive clauses, such as purpose clauses, which are not an argument of their embedding clause predicate, are advcl, not xcomp (since they are not complements). A common test to distinguish these is whether or not we can insert 'in order to':
See also the guideline for 'in order to' below.
Comparative adjectives that take 'than Y' dominate the word/phrase 'Y' as obl and 'than' is case dependent on 'Y'. For analytic comparatives, the word 'more' is seen as advmod to the lexical adjective, and 'than' is governed by the lexical adjective as well (e.g. in 'more expensive than...', expensive governs the other two words).
However, 'more than' in 'more than 5 bags' is treated as fixed from 'more' to 'than'.
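The contrast between the two uses of 'more ... than' can be sketched as two token lists. Several details here are assumptions made for illustration: the example phrase 'more expensive than gold', treating each phrase as a standalone fragment with its own root, the nummod label, and attaching 'more' to the numeral in the quantity case.

```python
# (id, form, head, deprel); head 0 marks the root of the fragment.
COMPARATIVE = [
    (1, "more", 2, "advmod"),
    (2, "expensive", 0, "root"),
    (3, "than", 4, "case"),
    (4, "gold", 2, "obl"),
]
QUANTITY = [
    (1, "more", 3, "advmod"),  # attachment to the numeral is an assumption
    (2, "than", 1, "fixed"),   # head-initial: 'more' governs 'than'
    (3, "5", 4, "nummod"),
    (4, "bags", 0, "root"),
]


def attachment(tokens, form):
    """Return (head form, deprel) for the first token with this form."""
    by_id = {tid: f for tid, f, _, _ in tokens}
    for _, f, head, rel in tokens:
        if f == form:
            return (by_id.get(head), rel)
    return None
```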
Raising verbs appear to take a subject that actually belongs to a subordinate predicate semantically. They can be identified by alternations such as "John seems sick" vs. "it seems John is sick" or "I happen to own a boat" vs. "It so happens I own a boat". In both cases, the subject is predicated on in the embedded predicate (e.g. happens(I own a boat), not happen(I), or "I happen"). In these cases, a subject 'it' will be attached as expl, and the clause is attached as csubj. Lexical subjects of verbs like "seem" are attached normally to "seem" at the syntactic level, not to the subordinate clause:
Sentence initial coordinating conjunctions are attached to the root, pointing backwards, with the cc function.
For the adverb 'so' (not to be confused with purpose 'so' tagged IN, or discourse 'so' tagged UH) between two clauses, we attach 'so' as advmod to the second clause (not cc), and connect the clauses as parataxis (not conj):
Footnote markers (the footnote number) should be attached as dep to the root of the constituent that the footnote refers to. If the footnote refers to the entire sentence, then it attaches to the root. If the footnote refers to a smaller constituent, then its root is the source of the dep arrow.
'In order' is seen as a multi-word expression, which may or may not appear with 'to' (cf. 'in order that'). The function of 'in order' is mark and it is attached at the 'in'. The token 'order' is pointed at with fixed as shown below:
The verb of the 'in order to' clause is attached as advcl to the main clause.
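A minimal sketch of the 'in order to' analysis on an invented sentence, "He left in order to rest". Attaching 'to' as mark on the infinitive is an assumption based on common UD practice rather than something stated here; the helper name multiword is likewise hypothetical.

```python
# (id, form, head, deprel); head 0 marks the root.
IN_ORDER_TO = [
    (1, "He", 2, "nsubj"),
    (2, "left", 0, "root"),
    (3, "in", 6, "mark"),      # 'in order' functions as mark, attached at 'in'
    (4, "order", 3, "fixed"),  # 'order' is pointed at with fixed
    (5, "to", 6, "mark"),      # assumption: usual infinitive marker analysis
    (6, "rest", 2, "advcl"),   # clause verb attaches to the main clause
]


def multiword(tokens, head_id):
    """Join a head token with its 'fixed' dependents, in surface order."""
    forms = [
        form
        for tid, form, head, rel in tokens
        if tid == head_id or (head == head_id and rel == "fixed")
    ]
    return " ".join(forms)
```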
Subject clauses can be full finite clauses, as in "[that they came] annoyed me". But the csubj label can also apply to gerund clauses, as in "[doing that] can cause trouble". In both of these cases, the subordinate clause verb is labeled as csubj to the main clause predicate.
By default, if no other clear syntactic relation applies when an academic reference is supplied, its root (usually a first author name) is attached to the root of the clause containing it as dep and the year is attached to the first author as nmod:unmarked:
However if the citation has a distinct syntactic function, the first author is taken as the head and the function is assigned as usual, for example here as the obj of the verb 'see':
References consisting only of a number, e.g. "[4]", function in the same way: the number is the head of the reference, and it is attached as dep to the local root unless it has another normal function (obj, nmod, etc.)
Multiple adjacent references are considered to be coordinated, whether or not an explicit 'and' appears:
Ranges of references with a hyphen are treated as a prepositional "TO" phrase:
When the direct object of a saying verb is a quote, it is labeled as ccomp whether or not the quote is a full clause.
The exception is when the "X said" appears medially, in which case it is considered a parenthetical, with the verb of saying dependent on the speech's root as parataxis.
Verbs of saying can have two objects, direct (obj) and indirect (iobj). Both are present in
In this case, Mary is the indirect object. It's important that, even if what is said is missing, the person being told is still iobj. For example, the following has iobj only:
When used as an introducer of speech, based on EWT precedent, "be like" is treated as a phrasal verb with "be" as the head:
For compound nouns generally written as one word, or as two words separated by a hyphen, that you feel have been incorrectly split apart, treat the relation as a compound.
If 'more than' is modifying a quantity, then the lexical word is the head. 'more than' is an advmod which is internally a fixed.
If 'more than' is used to compare things ('a is more than b'), then it is not a fixed, reverting to obl + case.
The fixed relation is used for certain multi-word idioms that behave as one function word. Fixed relations are always annotated head-initially.
The current list of fixeds includes the following expressions:
Fixed dependencies should be limited to these specific expressions. If you have a word that seems to have been incorrectly split apart, such as 'with out', use goeswith instead. The head is what you feel is the "main" part of the word. goeswith should only be used as a last resort, when you feel you have exhausted all other possible dependencies.
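The head-initial requirement for fixed lends itself to an automatic check. The sketch below is one possible validator under assumed conventions: tokens are (id, form, head, deprel) tuples, and the function name is hypothetical. It does not check membership in the list of permitted expressions.

```python
def non_head_initial_fixed(tokens):
    """Return ids of 'fixed' dependents whose head does not precede them."""
    return [
        tid
        for tid, _, head, rel in tokens
        if rel == "fixed" and head >= tid
    ]


# 'more than 5 bags' with 'more' governing 'than' (head-initial, valid).
GOOD = [
    (1, "more", 3, "advmod"),
    (2, "than", 1, "fixed"),
    (3, "5", 4, "nummod"),
    (4, "bags", 0, "root"),
]
# Same phrase with the fixed arrow reversed (invalid).
BAD = [
    (1, "more", 2, "fixed"),
    (2, "than", 3, "advmod"),
    (3, "5", 4, "nummod"),
    (4, "bags", 0, "root"),
]
```

Running the check over GOOD reports nothing, while BAD flags token 1, whose head follows it.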