Markables for annotation include:
Some more specific constructions and guidelines follow:
Unlike OntoNotes (Weischedel et al. 2012), copula predicates are markables, whether or not they are coreferred to separately from the subject: [John] is [a teacher]. GUM distinguishes such indefinite predicates with a special type of coreference edge ('pred'), but this requires that such phrases are captured as mentions. Negative predication precludes coreference, but markables should still be annotated for negated copula predicates, which can sometimes be referred back to separately:
The same reasoning is applied to cases such as [A] is considered [a kind of B] - both markables are created and they co-refer, albeit with the 'pred' subtype.
By contrast, compare "[A] is like [B]" or "[A] is similar to [B]" (two markables, but no coreference).
Copula coreference is not annotated for modal predication, since the copular identity is not complete and this can lead to contradictions:
Cases with "[A] as [B]" may be annotated as coreferent if they imply identity of referents, similar to copula predication. For example:
In most cases, the span of the markable will be the entire NP including modifiers, such as prepositional phrases that belong to the NP, possessors or genitives. Thus in “[the boy with the blue coat] saw [her]” the spans of the two markables are maximal, as delineated by the brackets.
Covering all modifiers includes clausal expansions, such as relative and infinitival clauses, as long as they expand the head noun. The following examples show clauses that are included inside the markable:
Relatives expanding a controlling verb are not included in a noun's markable:
In some cases, adverbs such as “here” or “there” when referring to a place that has already been mentioned (or similarly “then” for a time) will also need to be made into markables, e.g. “…went to [London]. [There] …”. Such adverbs are not annotated if they are not referring back to a noun phrase or pronoun.
Occasionally, an entire sentence or clause will be referred to by a second referring expression, i.e. discourse deixis. In this case the entire sentence being referred to will be made into a markable, usually an event, and the coreference subtype is 'disc': “[the rain flooded the village.] <- [it] was terrible”.
When such reference does not occur, sentences are not made into markables. Note that if the entire sentence is inside the markable, then sentence final punctuation is also included in the markable (here the final period), but otherwise not (e.g. a sentence final VP markable would omit the final period).
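The punctuation rule for clausal markables can be sketched as follows. This is only an illustration under assumptions that are not part of the guidelines: the function name `clause_markable_span`, the 0-based, end-exclusive token offsets, and the small set of sentence-final punctuation tokens are all hypothetical.

```python
# Hypothetical set of sentence-final punctuation tokens (an assumption).
FINAL_PUNCT = {".", "!", "?"}

def clause_markable_span(sentence_tokens, start, end):
    """Return the (start, end) span of a clausal markable.

    `start` and `end` (end-exclusive) delimit the clause without punctuation.
    If the clause covers the whole sentence apart from its final punctuation,
    the span is extended to include that punctuation; otherwise it is
    returned unchanged.
    """
    has_final_punct = bool(sentence_tokens) and sentence_tokens[-1] in FINAL_PUNCT
    content_end = len(sentence_tokens) - 1 if has_final_punct else len(sentence_tokens)
    if has_final_punct and start == 0 and end == content_end:
        return (0, len(sentence_tokens))
    return (start, end)

tokens = ["the", "rain", "flooded", "the", "village", "."]
whole = clause_markable_span(tokens, 0, 5)  # entire sentence: period included
vp = clause_markable_span(tokens, 2, 5)     # sentence-final VP: period omitted
```

Here the whole-sentence markable absorbs the final period, while the VP markable "flooded the village" stops before it.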
It is possible that a sentence antecedent of a pronoun has been mentioned even before its most recent occurrence. In this case, previous occurrences will also be marked. For example, the clause "push the button for every floor" is considered given because the VP was mentioned in a previous title. In this case, all three markables should be linked together:
Titles and epithets: Words like Mr., but also roles like President, are part of a markable and do not constitute an apposition (see below on appositions). As the third example below shows, plural titles can refer to an entire coordinate markable, each constituent of which is a markable without the title (since a plural title belongs only to the coordination).
Coordinate phrases generally receive separate markables for each component, e.g. “[restaurants] and [hotels]”. If both are referred to together, an additional markable is added: ... [ [restaurants] and [hotels] ]. [They] are always expensive.
In cases where two (or more) nouns are not full NPs, i.e. when they share an article, we only assign one markable by default:
The two submarkables sharing a determiner are also annotated in three exceptional cases:
* We saw [the [car] and [driver]] . [The car] was black and [the driver's] uniform matched [its] color.
If both submarkables would have different entity types (first sentence in example above).
When both submarkables are named entities, which may require separate entity linking (Wikification):
* [The [Galileo] and [Ulysses spacecraft]]
If there is an aggregate mention of a markable whose components have mixed entity types, the entity type of the aggregate is 'abstract', e.g.:
If a repair results in two separate NPs (even if incomplete), both are annotated, and can be coreferent in context. This can be identified by presence of either separate articles or head nouns (but see UD and tagging guidelines on which interrupted words are considered reconstructible). Compare:
States are generally available as individual referents, but City+State also form a markable. Note that the city markable is therefore longer. If Ohio is referred to later on in the text, it is coreferent with the smaller markable.
Abbreviations such as [OH] for Ohio are also accepted as markables.
Entity names within complex tokens are not annotated (e.g. Googleable does not contain a subtoken markable "Google"), but if we have a segmentable hyphenated word such as 'church-related', we can annotate just the subpart 'church' as a markable, since it is a separate token.
In cases where a name for something is given, that name is coreferent with the thing being named, unless the name is being discussed as such, in which case it may form an abstract entity. The rationale for this is that subsequent reference to the name as a concept supersedes its reference to the thing it names.
Authorial citations with author names are taken to be references to the author(s) and year (similar to "Smith said in 2009"):
At the same time, the entire reference is taken to be an abstract entity (the paper or book), so we add a third markable:
Numerical links are taken to be mentions of the work, and therefore abstract in the same way. Coreference is also marked for each matching citation number:
... has been shown in the past ( [17]abstract ) Other studies disagreed ... (see <-coref-- [17]abstract )
In citation numbers in square brackets, only the number is part of the entity span, and square brackets are left outside the entity, since for multiple references we can get: [13, 17, 18]. But even for a single reference "[4]", only the number token is taken as the entity span for consistency.
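As a rough sketch of the bracket rule, the snippet below collects only the number tokens inside square brackets as entity spans. It assumes that brackets and commas are tokenized separately from the numbers. The function name `citation_number_spans` is hypothetical and not part of any GUM tooling.

```python
def citation_number_spans(tokens):
    """Return (token_index, token) pairs for numbers inside [ ... ].

    Brackets and commas are never part of a span; each citation number
    is a single-token entity span.
    """
    spans = []
    inside = False
    for i, tok in enumerate(tokens):
        if tok == "[":
            inside = True
        elif tok == "]":
            inside = False
        elif inside and tok.isdigit():
            spans.append((i, tok))
    return spans

multi = citation_number_spans(["see", "[", "13", ",", "17", ",", "18", "]"])
single = citation_number_spans(["(", "see", "[", "4", "]", ")"])
```

Treating the single reference "[4]" the same way as the multiple reference "[13, 17, 18]" keeps span boundaries consistent across both cases.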
Indefinite pronouns such as 'something' are only annotated if they refer to nominals. Cases referring to verbs can be identified for example by coordination with verbs:
The following examples are not considered referential NPs:
Information status has the following values:
Do not overuse the accessible generic category: not every definite NP is accessible if it is the first mention in the chain. Some examples that are not considered accessible:
Personal pronouns that are inferable in the situation (I/me, you etc.) are accessible the first time they are mentioned (acc:com). They are subsequently tagged as ‘giv’, since they have already been referred to explicitly.
Information status for cataphors: see Coreference below.
There are 10 entity types:
With the exception of bridge relations, two coreferring markables must have the same entity type. This can be tricky when two markables seem to fall into two different categories. For instance, the owners of Steve's Bar may describe it as both their [business]organization and [a dive bar that locals frequent]place. But Steve's Bar is ultimately an organization, and so all markables will have organization entities (we pick the best class for the whole chain).
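A consistency check for this constraint might look like the sketch below. The data layout (a chain as a list of `(mention_text, entity_type)` pairs, with bridge links already excluded) and the function name `inconsistent_types` are assumptions made for illustration. The check only flags a chain; choosing the best class for the whole chain remains an annotator decision.

```python
def inconsistent_types(chain):
    """Return the sorted distinct entity types of a coreference chain
    if there is more than one, otherwise an empty list.

    `chain` is a list of (mention_text, entity_type) pairs linked by
    coref (not bridge) relations.
    """
    types = sorted({etype for _, etype in chain})
    return types if len(types) > 1 else []

steves_bar = [
    ("Steve's Bar", "organization"),
    ("their business", "organization"),
    ("a dive bar that locals frequent", "place"),
]
flagged = inconsistent_types(steves_bar)

# After the annotator settles on 'organization' for the whole chain:
fixed = [(text, "organization") for text, _ in steves_bar]
clean = inconsistent_types(fixed)
```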
An entity is considered salient if and only if it appears in the summary of a document.
Annotate the first mention of a salient entity as salient.
Any nominal/noun-phrase mention in the summary is a valid candidate for being a salient entity.
Exception: If the entity mention markable guidelines cause something not to be a mention, then a canonical mention in the summary will not change that.
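One possible way to operationalize the salience rules is sketched below. The input format is hypothetical: `chains` maps an entity id to the token offsets of its mentions in the document, and `summary_entity_ids` is the set of entity ids mentioned in the summary. Entities without any valid markable in the document receive no annotation, which is one reading of the exception above.

```python
def salient_first_mentions(chains, summary_entity_ids):
    """Map each salient entity id to the offset of its first document mention.

    An entity is salient iff it is mentioned in the summary. Entities with
    no markable in the document are skipped.
    """
    return {
        eid: min(offsets)
        for eid, offsets in chains.items()
        if eid in summary_entity_ids and offsets
    }

chains = {"e1": [12, 3, 40], "e2": [7], "e3": []}
salient = salient_first_mentions(chains, {"e1", "e3"})
```

In this example only `e1` is tagged, at its earliest mention. `e2` does not appear in the summary, and `e3` has no markable in the document.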
The coreference scheme is loosely based on the design principles of the OntoNotes coreference scheme (Weischedel et al. 2012) but with more unrestricted coreference criteria (as in ARRAU, Poesio & Artstein 2008), and with specific relation types, inspired by the TüBa-D/Z coreference scheme (Telljohann et al. 2012), which can be used to include or exclude certain phenomena in the data. A major design principle is that coreference should serve to identify the discourse referent referred to by underspecified expressions such as pronouns, and allow us to track the behavior of discourse referents as their expressions evolve over the course of a discourse, including all mentions of any kind (i.e. not excluding predication or compound modifiers if relevant).
There are two major types of coreference links: coreference proper, and bridging anaphora. Coreference contains six different subtypes of cases which are automatically derived from the 'coref' type, and bridging covers at least three types of cases:
from coref using the syntax trees.
Ages specified after a person's name are considered appositional, following the OntoNotes guidelines. The idea is that a phrase like "[Mr. Smith], [43]" is something like:
The entity type is therefore also person for both markables in this case.
Note that other mentions of ages are abstract, including in "[I] was [16]abstract" (no coref, as in OntoNotes)
In cases where two full NP realizations of the entity are separated by a coordination such as 'and/or', the normal coref type is used, even if there is a subsequent apposition:
If two mentions share an article, they are no longer separate NPs, and they become one markable according to the markable recognition rules (see exceptions under markable definitions above):
Bridging occurs when two entities do not corefer exactly, but the basis for the identifiability of one referent is the previous mention of one or more previous referents. This can be because the second referent forms part of the whole described by the antecedent, or because multiple referents are aggregated into a larger referring expression (see examples below).
If the second referent designating a part or other predictable component of the first referent contains an explicit possessive, the possessive itself should be linked to the first phrase, and no bridging relation needs to be added (since the possessive coreference is explicit).
In the case of inferable parts, the new referent is viewed as ‘accessible’ by way of bridging, with the information status ‘inferable’ (acc:inf).
Aggregate referents, i.e. group referents (Mary, Jake -> they) are viewed as ‘split’ at the anaphor, and the relationship is 'bridge' (information status: acc:aggr).
Examples:
Multiple mentions of ‘I’, ‘me’, ‘you’, ‘your’, ‘mine’ etc. are linked via the coref relationship (subtype ana), just like 3rd person pronouns: [I] <-coref-- [me]. In a conversation, one person's ‘I’ may corefer with a ‘you’ used by another interlocutor.
Instances of ‘I/me’ that are coreferent with a ‘you’ coming from the other speaker in the dialog are considered linked, via the ‘coref’ relationship like other pronoun chains: He went with [you]? <-coref-- [I] went alone.
Pronominal 'one' is linked as ana in both generic uses ([one] usually likes [one's] house) and substitutive uses if strictly coreferent ([which one] did you get? [This one].). In a partitive context, bridge should be used (I have [beer]. Give me [one]; note that 'one' is a subset of the beer).
The indexical adverbs ‘here’ and ‘there’, when they have an explicit antecedent (e.g. 'your new place') qualify as pronouns for the purpose of coreference type, since their interpretation depends completely on the antecedent. The relation is therefore labeled coref (ana).
The reciprocal phrase 'each other' is regarded as anaphoric. It is linked either with a single coref relation, if an aggregate plural mention already exists, or with bridge relations, if the referents that make up 'each other' have only been mentioned separately so far. The information status is acc:aggr in the latter case, otherwise giv.
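The decision for 'each other' can be summarized in a small sketch. The function name `link_each_other` and the representation of antecedents as a list of mention ids are assumptions for illustration: a single id stands for an existing aggregate plural mention, several ids for components mentioned only separately.

```python
def link_each_other(antecedent_ids):
    """Choose relation type and information status for 'each other'.

    One antecedent id: an aggregate plural mention exists -> coref, giv.
    Several ids: only separate component mentions -> bridge, acc:aggr.
    """
    if not antecedent_ids:
        raise ValueError("'each other' needs at least one antecedent")
    if len(antecedent_ids) == 1:
        return {"relation": "coref", "infstat": "giv", "targets": list(antecedent_ids)}
    return {"relation": "bridge", "infstat": "acc:aggr", "targets": list(antecedent_ids)}

with_aggregate = link_each_other(["they_1"])
separate_only = link_each_other(["mary_1", "jake_1"])
```

In the bridge case one relation is created per component, so `targets` lists every separately mentioned referent.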
Cataphora are pronominal or otherwise underspecified elements (including e.g. ‘those’) that precede an occurrence of a non-pronominal element that occurs within the same utterance and resolves their discourse referent.
Cataphora may be annotated in copula sentences as linking to their predicate, if the reference of the pronoun is otherwise unresolvable (see example below). Unlike other relations, cataphora point forwards, from the pronoun to the expression that resolves them. Examples:
Subject in copula sentence:
Information status for cataphors follows the value of their coreferent (the following mention). Thus both a cataphor and its subsequent mention may be considered new.
For clefts, the pronoun is not annotated at all, so there is no cataphora or coreference, though the entire nominal phrase including the cleft clause is marked up as an entity, following OntoNotes guidelines:
Note the non-agreement of 'it' and 'Kim' and the lack of substitutability, indicating no coreference but rather an expletive, non-referential pronoun.
The coref type is used for all types of lexical coreference. Some specific tricky cases that ARE included are: