Lexical chains are annotated for words with the same lemma. For example:
- Interest rates below ... The rate ... rate-based ...
Note how 'rate-based' is not considered part of the lexical chain.
In some cases, lexical_chain is annotated for synonyms or other non-identical terms. In such cases, if the similar words can be identified, we annotate lexical_chain as usual, but add 'non_ident' in the notes column. For example:
- [not comparable] ... [vary] - lexical_chain (notes: non_ident)
- Annotate any verb of saying (e.g. "said", "reported") and any subordinator introducing reported speech (e.g. "that")
- 'Unsure' signals are ignored
¶ Choosing source and target
- For satellite-nucleus relations, the satellite is the source and the nucleus is the target
- For multinucs, the child is the source, and the non-terminal multinuc node is the target
- Normal anchored signals are placed on all signalling words, in one contiguous span if possible; otherwise, on multiple contiguous spans
- Discontinuous spans receive an automatically incremented co-index in the 'discontinuous' column (e.g. all parts of the first discontinuous item receive co-index 1, all parts of the next discontinuous item receive 2, etc.)
- The special co-index 0 is used for non-discontinuous annotations that share a row with a discontinuous annotation (e.g. '3|0' marks a row that carries one annotation with discontinuous index '3' and a second, non-discontinuous annotation signified by '0').
- '0' is also used for all other annotations that would otherwise be empty when a '|' is used to separate multiple annotations. For example, if a note applies to only the first of two annotations, we use 'some_note|0'.
- Unanchored signals are placed on the single first token after the position of the annotation
- Multiple signals annotated at the same token are separated by a pipe ('|') in ALL cells of that row's signaling annotation, including:
  - signal (but|items_in_sequence)
  - type (dm|graphical)
  - anchoring (e.g. no|no)
  - relation (List|List)
  - source and target
- If a multi-token signal (e.g. several words on multiple rows in GitDox) overlaps a smaller signal (e.g. a single word), we split the larger span into multiple identical annotations, since the '|' syntax cannot be used for only part of a span.
- 'Tense' signals cover all aspects of tense, aspect and mood, including periphrastic constructions in English
- It is not necessary to automatically annotate every verb in the source and target spans - only occurrences of tenses that matter for the relations being signaled should be annotated.
- In particular, tenses in relative clauses are often not relevant to tense signals affecting the main clause
- For all tense/aspect/mood signals, the entire verbal complex should be annotated, creating parity between simple lexical verbs, periphrastic tenses, and passives:
  - John [went] there (simple past - annotate just the verb)
  - John [had gone] there (periphrastic tense, annotate auxiliary and lexical verb)
  - John [was brought] there (passive, annotate auxiliary and lexical verb)
  - John [was] happy (non-verbal predicate, only the verb should be annotated)
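The pipe convention for rows carrying several signals, together with the '0' placeholder, can be illustrated with a minimal sketch. It assumes a row is available as a dictionary from column name to raw cell string; this layout and the function name `split_signals` are hypothetical and do not reflect the actual GitDox export format. The column names are the ones mentioned in these guidelines.

```python
def split_signals(row):
    """Split one row's pipe-separated cells into one dict per signal.

    `row` maps column names to raw cell strings (an assumed layout,
    not the actual GitDox export format).
    """
    parts = {col: cell.split("|") for col, cell in row.items()}
    lengths = {len(values) for values in parts.values()}
    if len(lengths) != 1:
        # Every cell of the row must list the same number of signals.
        raise ValueError(f"cells disagree on the number of signals: {row}")
    n = lengths.pop()

    # In these columns, '0' (or an empty cell) means "no value for this signal".
    placeholder_cols = {"discontinuous", "notes"}

    signals = []
    for i in range(n):
        signal = {}
        for col, values in parts.items():
            value = values[i]
            if col in placeholder_cols and value in ("0", ""):
                value = None
            signal[col] = value
        signals.append(signal)
    return signals


example_row = {
    "signal": "but|items_in_sequence",
    "type": "dm|graphical",
    "anchoring": "no|no",
    "relation": "List|List",
    "discontinuous": "3|0",
    "notes": "some_note|0",
}

for signal in split_signals(example_row):
    print(signal)
```

Raising an error when the cells disagree on the number of pipe-separated values is a design choice of this sketch; it reflects the rule that the pipe must appear in all cells of the row.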
Some words are interpreted as explicit anchored signals, for instance for genre-based signaling. Examples:
- newspaper_style_attribution - the word 'source', when the text explicitly specifies the source.
- Labels containing + for combined signals are always alphabetized (e.g. always semantic+syntactic, not syntactic+semantic)
- We found a questionable distinction between 'past_participial_clause' and 'nominal_modifier'; the former is used in a non-restrictive vmod clause: [The average of interbank offered rates for dollar deposits in the London market] [based on quotations at five major banks .]
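The alphabetization rule for combined labels listed above can be checked or enforced mechanically. The following is a minimal sketch; the function name `normalize_combined_label` is hypothetical, and it assumes labels are plain strings joined by '+'.

```python
def normalize_combined_label(label):
    """Return a '+'-combined label with its parts in alphabetical order."""
    parts = [part.strip() for part in label.split("+") if part.strip()]
    return "+".join(sorted(parts))


# Both orderings map to the same canonical label.
print(normalize_combined_label("syntactic+semantic"))
print(normalize_combined_label("semantic+syntactic"))
```

A label without '+' is returned unchanged, so the function can be applied to every label in a column.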
If the Signaling Corpus contains a clear annotation error, we do not include that signal, but add a note structured as follows: rem:TYPE:SIGNAL. For example: rem:semantic:lexical_chain
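Notes in the rem:TYPE:SIGNAL format can be read back with a few lines of code. The sketch below is illustrative only: the function name `parse_rem_note` is hypothetical, and it assumes that neither the type nor the signal name contains a colon.

```python
def parse_rem_note(note):
    """Parse a 'rem:TYPE:SIGNAL' note into (type, signal).

    Returns None if the note does not follow that structure.
    """
    parts = note.strip().split(":")
    if len(parts) != 3 or parts[0] != "rem" or not all(parts[1:]):
        return None
    return parts[1], parts[2]


print(parse_rem_note("rem:semantic:lexical_chain"))
print(parse_rem_note("non_ident"))
```

Returning None for other notes (such as 'non_ident') lets the same function be used to filter a notes column for removed signals.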