Entity annotation marks referring expressions in a text, i.e. spans of text that refer to things in the world, and classifies them into entity types. The purpose of entity annotation in Coptic Scriptorium is threefold: to facilitate searches that are restricted to specific entity types (e.g. finding a certain epithet using linguistic annotations, such as 'ⲟⲩⲁⲁⲃ' "holy", but only when applied to a person); to inventory entities (e.g. finding all places mentioned in the Apophthegmata Patrum); and to serve as a gateway to entity linking, which enables searches for specific persons ("John the Baptist") regardless of the exact expression used to mention them. Entity linking itself is outside the scope of the current guidelines.
Entity annotation can be applied to three types of referring expressions:
Almost all nouns and proper nouns correspond to referring expressions, with the exception of non-referring nouns, such as:
One test for referentiality is whether a pronominal or nominal subsequent mention is possible/plausible. For example, the following sounds odd:
We distinguish 11 entity types:
Repeated mentions of the same entity in apposition are annotated as a single span, with no nested spans for the individual mentions of that entity:
Although outwardly very similar, appositions must be distinguished from dislocations, in which a pronominal subject or object is repeated separately. For personal pronouns, the pronoun is simply left out of the nominal span:
If the pronoun is a substitutive demonstrative (ⲡⲁⲓ, ⲧⲁⲓ, ⲛⲁⲓ), then two spans are annotated:
But note that it is also possible for a substitutive demonstrative to stand in true apposition to a noun without dislocation, in which case a single span is annotated as for any apposition:
See the UD Coptic guidelines for more information on identifying dislocation vs. apposition.
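The span counts described above for appositions and dislocations can be summarized as a small decision table. The following is only a sketch: the function name `nominal_span_count` and the string labels `"apposition"` and `"dislocation"` are hypothetical, and the code assumes the current practice in which personal pronouns receive no span of their own.

```python
from typing import Optional

# Substitutive demonstratives named in these guidelines.
SUBSTITUTIVE_DEMONSTRATIVES = {"ⲡⲁⲓ", "ⲧⲁⲓ", "ⲛⲁⲓ"}


def nominal_span_count(relation: str, pronoun: Optional[str] = None) -> int:
    """Return how many entity spans a repeated mention yields.

    relation: "apposition" or "dislocation" (hypothetical labels).
    pronoun:  the pronoun form involved, if any.
    """
    if relation == "apposition":
        # Apposition is always a single span, even with a demonstrative.
        return 1
    if relation == "dislocation":
        # A dislocated substitutive demonstrative gets its own span;
        # a personal pronoun is simply left out of the nominal span.
        return 2 if pronoun in SUBSTITUTIVE_DEMONSTRATIVES else 1
    raise ValueError(f"unknown relation: {relation!r}")
```

Deciding whether a given construction is an apposition or a dislocation remains a manual judgment; the sketch only records what follows once that decision has been made.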
The relative construction expanding an article is annotated as an entity:
However, if the ⲡ is tagged as a copula, that part of the construction is not part of the entity span, since it belongs to a predication. In these instances, we treat the predicate noun phrase as an entity and the relative clause as a subject clause (compare the Universal Dependencies annotation guidelines):
In this example, "God" receives a span, but "who made them grow" is considered a subject clause (i.e. 'who made them grow is God'), which is not nominal and hence not annotated. Note that according to the tagging guidelines, the second ⲡ should be tagged as COP and lemmatized ⲡⲉ in this sentence.
Most body parts are marked as objects, since they are tangible:
However, some referential body parts are considered abstract, notably ϩⲏⲧ "heart".
Other uses of body parts may be totally figurative or idiomatic, in which case they are not annotated.
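The three-way treatment of body parts described above can be sketched as follows. The function name `body_part_type` and the `figurative` flag are hypothetical, and the set of abstract body parts contains only ϩⲏⲧ because that is the only one named here; a real list would need to be extended by annotators.

```python
from typing import Optional

# Body parts treated as abstract; only ϩⲏⲧ "heart" is named in the guidelines.
ABSTRACT_BODY_PARTS = {"ϩⲏⲧ"}


def body_part_type(lemma: str, figurative: bool = False) -> Optional[str]:
    """Return the entity type for a body-part noun, or None if unannotated."""
    if figurative:
        # Totally figurative or idiomatic uses are not annotated.
        return None
    if lemma in ABSTRACT_BODY_PARTS:
        return "abstract"
    # Default: body parts are tangible and therefore objects.
    return "object"
```

Whether a given use is figurative is a judgment the annotator supplies; the sketch does not attempt to detect it.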
Groups of entities are interpreted as the entity type of their constituents, for example, a herd of animals is of the type animal:
An exception to this guideline is groups of people who form an organization: e.g. ⲥⲩⲛⲁⲅⲱⲅⲏ, ⲥⲧⲣⲁⲧⲉⲩⲙⲁ, etc. are 'organization', not 'person'.
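As a rough illustration of the group guideline and its exception, the following sketch assigns a type to a group noun. The function name `group_entity_type` is hypothetical, and the lookup set holds only the two nouns named above, so it is an assumption that further organization nouns would be added in the same way.

```python
# Group nouns that denote organizations; only those named in the guidelines.
ORGANIZATION_GROUP_NOUNS = {"ⲥⲩⲛⲁⲅⲱⲅⲏ", "ⲥⲧⲣⲁⲧⲉⲩⲙⲁ"}


def group_entity_type(head_lemma: str, member_type: str) -> str:
    """Return the entity type of a group mention.

    head_lemma:  lemma of the group noun.
    member_type: entity type of the group's constituents.
    """
    if head_lemma in ORGANIZATION_GROUP_NOUNS:
        # Exception: people forming an organization are 'organization'.
        return "organization"
    # Default: a group inherits the type of its constituents.
    return member_type
```

For example, a herd whose members are of type `animal` stays `animal`, while ⲥⲩⲛⲁⲅⲱⲅⲏ is `organization` even though its members are persons.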
In morphologically complex items containing a noun inside a larger token, that noun cannot be annotated:
We do not mark coordinate entities in addition to their constituents:
Container and substance form two entities, for example:
Pluralized demonyms indicating members of a people are labeled person:
However, peoples mentioned as a people (not as a group of individuals) are labeled organization:
These cases are usually singular and involve a named people. This guideline does not apply to ad-hoc groups of people who do not form an organized entity, e.g. ⲙⲏⲏϣⲉ 'crowd' is still usually 'person'.
Entity expressions interrupted by a copula or particle are annotated as a single span that includes the copula or particle. For example, the following span includes the intervening copula:
Similarly:
Non-adjacent relative clauses are included, unless the interruption contains the verb controlling the head noun (this prevents some possibly very long 'hermeneutical' relatives inside mentions):
But not:
The interruption by the verb 'bark' which is the predicate of 'fox' triggers the guideline to omit the relative clause. Otherwise, the mention could potentially cover the entire clause '[ⲧ ⲃⲁϣⲟⲣ ⲁϣⲕⲁⲕ ⲉⲃⲟⲗ ... ]'.
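The inclusion test for non-adjacent relative clauses can be sketched as a check over the material that separates the head noun from the relative clause. The `Token` class, its `governs_head` flag and the function name `include_relative` are all hypothetical; in practice the flag would come from the annotator's judgment or from a dependency parse.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str
    # True if this token is the verb whose argument is the head noun.
    governs_head: bool = False


def include_relative(interruption: List[Token]) -> bool:
    """Decide whether a non-adjacent relative clause joins the mention.

    interruption: tokens standing between the head noun and the relative.
    The relative is included unless the interruption contains the verb
    controlling the head noun.
    """
    return not any(tok.governs_head for tok in interruption)


# The 'fox' case: the interruption contains the predicate of the head noun.
fox_gap = [Token("ⲁϣⲕⲁⲕ", governs_head=True), Token("ⲉⲃⲟⲗ")]
exclude_fox_relative = not include_relative(fox_gap)
```

With `fox_gap`, the relative clause is left out of the mention, which keeps the span from growing to cover the whole clause.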
Note that ⲣⲱⲙⲉ without an article functions adjectivally here, and is not an entity; the phrase with ⲛⲓⲙ is interrogative and therefore not an entity; but the 'p-et-...' phrase is still annotated, as none of these exceptions apply to it.
The following are considered idioms, in which the constituent nouns are not construed as referential:
In projects where pronouns are annotated (note: currently we do not annotate pronouns!), we recommend that correlative/expletive pronouns not be annotated as entities at all:
The relative converter ⲉⲧ is not considered referential. In relative clauses with explicit subject pronouns, those pronouns are annotated as usual:
Note that this results in the second pronoun pointing back to the span that contains it; this is allowed in WebAnno.
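To make the self-containing configuration concrete, the sketch below models spans as token offsets with an optional link to an antecedent span. This is not WebAnno's data model: the `Span` class, its field names and the token offsets are hypothetical and serve only to show that a span nested inside another span may point back to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    start: int  # index of the first token
    end: int    # index one past the last token
    entity_type: str
    antecedent: Optional["Span"] = None


def points_to_own_container(span: Span) -> bool:
    """True if the span's antecedent is a span that contains it."""
    ante = span.antecedent
    return ante is not None and ante.start <= span.start and span.end <= ante.end


# A nominal span with a relative clause, and a pronoun inside that clause
# which refers back to the whole nominal span (offsets are illustrative).
outer = Span(start=0, end=6, entity_type="person")
inner_pronoun = Span(start=3, end=4, entity_type="person", antecedent=outer)
```

A consistency check over exported annotations could use a test like `points_to_own_container` to recognize this configuration as expected rather than flag it as an error.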