tag@attribute | meaning |
---|---|
add | text inserted by an editor, e.g. in interviews inside [ ] |
caption | caption for images in the text |
caption@rend | a description of the appearance of the caption (e.g. bold) |
cell | a table cell |
cell@rend | a description of the appearance of the cell (e.g. bold, red background) |
date | date expressions |
date@from | starting date for a range of dates |
date@notAfter | latest possible date for an inexact date |
date@notBefore | earliest possible date for an inexact date |
date@rend | formatting of a date expression (e.g. italics, color) |
date@to | end date for a range of dates |
date@when | date in question, normalized to the format yyyy-mm-dd (day and month can be omitted) |
figure | marks the position of a figure in the text |
figure@rend | a description of the appearance of the figure (e.g. "drawing of four slightly deflated balls") |
foreign | marks a non-English word or phrase |
foreign@xml:lang | gives the language of a non-English word or phrase using its three-character identifier (e.g. "fra") |
gap | a gap in the text (missing words) |
gap@reason | a gap with the reason (e.g. omitted) |
head | marks a heading |
head@rend | a description of the appearance of the heading (e.g. bold, large) |
hi@rend | a highlighted section with a description of its appearance (e.g. bold, italic, green, small-caps, emphatic, lengthened, "space between 'U.' and 'S.'") |
incident | an extralinguistic incident |
incident@type | an incident with type (e.g. laugh, chuckle, whistle, graphic or text appearing on screen, "opens door") |
incident@who | the person responsible for an incident (e.g. #Angela) |
item | item or bullet point in a list |
item@n | item number |
l | a line in poetry |
l@n | a line in poetry with its number |
lg | a line group |
lg@n | a line group with the group's number |
lg@type | line group type (e.g. stanza) |
list | list of bullet points |
list@type | a list type (e.g. ordered, unordered, etc.) |
note | a note |
note@n | a note with number |
note@place | the place of the note (e.g. foot) |
p | a paragraph |
p@rend | a description of the appearance of the paragraph (e.g. bold, indent) |
q | quotation marks not marking a quotation (e.g. scare quotes; the tag is placed outside the quotation marks!) |
quote | a quotation |
quote@rend | the appearance of the quotation (e.g. block, bold) |
ref | an external reference, usually a hyperlink |
ref@rend | the appearance of the reference (e.g. italic) |
ref@target | the target of the reference (usually a URL, if not omitted) |
row | a table row |
s | a main sentence span |
s@type | a sentence with type (e.g. decl, q, wh, frag, imp, ger, intj, sub, multiple, other) |
sic | a section containing an apparent language error, thus in the original |
sic@ana | a corresponding reconstructed target hypothesis in standard English |
sp@who | a section uttered by a particular speaker with a reference to the speaker |
sp@whom | a section uttered with a particular speaker as an addressee |
supplied | text supplied by the annotator |
supplied@reason | reason for adding text (e.g. letter signature from envelope) |
table | a table |
table@cols | a table with the number of columns |
table@rows | a table with the number of rows |
table@rend | the appearance of the table (e.g. bold, red background) |
time | time expressions |
time@from | starting time for a stretch of time |
time@to | end time for a stretch of time |
time@when | time in question, normalized to the format HH:mm:ss (e.g. 16:30:00) |
w | tag to delimit a word, used when two tokens are spelled with no space, e.g. cannot |
Obvious typos and errors should be surrounded by the sic tag but not corrected in the text. Instead, the correction appears in the attribute ana. Later, during lemmatization, they will also receive correctly spelled lemmas. Note that British spelling is not considered an error and should not be marked up in any special way.
I know <sic ana="the">th</sic> way.
Your coat is a lovely colour.
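As a rough illustration of how the sic/ana convention can be read back out of an annotated text, the following Python sketch collects each erroneous form together with its correction. The function name `sic_corrections` and the dummy `<root>` wrapper are assumptions made for this example and are not part of the guidelines.

```python
import xml.etree.ElementTree as ET

def sic_corrections(xml_fragment):
    """Return (original form, correction from @ana) pairs for every <sic> element."""
    # Wrap the fragment in a dummy root so mixed text and tags parse as one tree.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return [(el.text or "", el.get("ana")) for el in root.iter("sic")]

print(sic_corrections('I know <sic ana="the">th</sic> way.'))  # [('th', 'the')]
print(sic_corrections('Your coat is a lovely colour.'))        # []
```

The second call returns an empty list because British spelling carries no sic markup.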
Dates are marked up using the date element, usually with the @when attribute, in the yyyy-mm-dd format. It is possible to annotate dates fully, if they are known from context, even if the text mentions a partial date, e.g.:
<s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>
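A minimal Python sketch of a format check for date@when is given below. It assumes that the day, or the day and the month, are omitted from the right, so that only yyyy, yyyy-mm and yyyy-mm-dd are accepted. That reading of "can be omitted", like the names `WHEN_PATTERN` and `invalid_when_values`, is an assumption of this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only yyyy, yyyy-mm and yyyy-mm-dd are valid normalized values.
WHEN_PATTERN = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")

def invalid_when_values(xml_fragment):
    """Return every date@when value that does not match the assumed format."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    values = [d.get("when") for d in root.iter("date") if d.get("when") is not None]
    return [v for v in values if not WHEN_PATTERN.fullmatch(v)]

example = '<s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>'
print(invalid_when_values(example))  # []
```

The pattern only checks the shape of the value; it does not verify that the day exists in the given month.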
Numbered lists and unnumbered bullet points are considered parts of structural markup, and both are a type of <list>.
The following example illustrates markup for a numbered list:
<list type="ordered">
<item n="1">
<!-- the number 1 is not a token, even though it appeared in the text-->
<p>This is the first step</p>
</item>
</list>
Figures are surrounded by <figure> tags. Although the figures themselves are not preserved in the corpus, they can be described in the attribute @rend of the figure element. Descriptions are only made in the @rend attribute and are not added to the tokens of the text itself (for this reason, the alternative TEI method of using <figDesc> is NOT used).
<figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>
<figure rend="list of suspects in the case and their mug shots">CEO - secretary - ambassador</figure>
<figure rend="picture of a valley"></figure>
Pop-up image descriptions or tooltips (in HTML, things like 'alt' or 'title') are not considered running tokens of the text. They may optionally be included in @rend if desired.
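The separation between figure descriptions and running text can be sketched in Python as follows. The helper names `running_text` and `figure_descriptions` are hypothetical. The sketch only shows that text stored in @rend never appears when the element content is read.

```python
import xml.etree.ElementTree as ET

def running_text(xml_fragment):
    """Return the element text only; attribute values such as @rend are not included."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return " ".join("".join(root.itertext()).split())

def figure_descriptions(xml_fragment):
    """Return the @rend description of every <figure> element."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return [fig.get("rend") for fig in root.iter("figure")]

fig = '<figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>'
print(running_text(fig))         # The Queen in Beijing last year
print(figure_descriptions(fig))  # ['Picture of Queen Elizabeth II']
```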
Typographical information for spans of text in hi@rend should be single words where possible, often derived from corresponding CSS vocabulary. For example, we use 'bold', 'italic', and 'large'. Multiple values are possible and should be separated by spaces:
<hi rend="bold italic large">The Big Picture</hi>
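Because multiple values are separated by spaces, they can be recovered with a plain whitespace split, as in the Python sketch below. The function name `rend_values` is an assumption of this example. The approach only suits single-word values, since a free-text description in @rend would be broken into separate words.

```python
import xml.etree.ElementTree as ET

def rend_values(element):
    """Split a space-separated @rend attribute into its individual values."""
    return element.get("rend", "").split()

hi = ET.fromstring('<hi rend="bold italic large">The Big Picture</hi>')
print(rend_values(hi))  # ['bold', 'italic', 'large']
```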
Literal quotes are surrounded by the 'quote' tags, regardless of whether or not quotation marks are used. Other uses of quotation marks, such as scare quotes, are surrounded by 'q'. Compare the following two uses:
Caesar said <quote>veni, vidi, vici</quote>. You could say that was his <q>" motto "</q>.
Footnotes containing running text (as opposed to bibliographical references realized as numbers hyperlinked to the bibliography) are placed immediately after the paragraph that contains the footnote number. The number in the text is surrounded by ref tags, and the footnote itself is enclosed in a note element:
<p>
Some long text.<ref>1</ref> Paragraph continues. At the end of this paragraph we'll insert the note.
</p>
<p><note place="foot" n="1">This is the footnote, which physically appeared at the bottom of the page, which was the middle of the next paragraph.</note></p>
<p>
Next paragraph. This one is split across pages, but the footnote does not appear in the middle of it, even though it was there graphically.
</p>
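One way to check that footnote numbers and notes line up is sketched below in Python. It assumes that a ref element without @target whose text consists only of digits is a footnote marker, and that the matching note carries the same number in @n. Both the heuristic and the name `unmatched_footnotes` are assumptions of this example, not rules from the guidelines.

```python
import xml.etree.ElementTree as ET

def unmatched_footnotes(xml_fragment):
    """Return (markers without a note, notes without a marker) as sorted lists."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    # Assumption: a <ref> with no @target and purely numeric text is a footnote marker.
    refs = {r.text.strip() for r in root.iter("ref")
            if r.get("target") is None and r.text and r.text.strip().isdigit()}
    notes = {n.get("n") for n in root.iter("note") if n.get("n") is not None}
    return sorted(refs - notes), sorted(notes - refs)

doc = ('<p>Some long text.<ref>1</ref> Paragraph continues.</p>'
       '<p><note place="foot" n="1">This is the footnote.</note></p>')
print(unmatched_footnotes(doc))  # ([], [])
```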
If a deleted comment on Reddit is not replied to within the context included in the document, it may be ignored. However, if the comment is part of a broken thread of responses, its existence can be encoded using an empty sp tag with the speaker set to DELETED, which can then be referred to in the reply:
<sp who="#DELETED"/>
<sp who="#kim" whom="#DELETED">
I agree with you.
</sp>
If two characters in a work of fiction say the same thing at the same time, tag both speakers in alphabetical order, separated by a comma (without a space), in the sp@who attribute:
<p>
<sp who="#Fairy,#Narrator" whom="#Pete">
“No!”
</sp>
we both said at once.
</p>
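The format of a multi-speaker sp@who value can be sketched in Python as follows. The helper names `speakers` and `is_well_formed_who` are hypothetical, and comparing names case-insensitively is an assumption, since the guidelines only say "alphabetical order".

```python
def speakers(who):
    """Split a value such as '#Fairy,#Narrator' into ['Fairy', 'Narrator']."""
    return [name.lstrip("#") for name in who.split(",")]

def is_well_formed_who(who):
    """Check for comma separation without spaces and alphabetical speaker order."""
    names = speakers(who)
    # Assumption: alphabetical order is judged without regard to case.
    return " " not in who and names == sorted(names, key=str.lower)

print(speakers("#Fairy,#Narrator"))            # ['Fairy', 'Narrator']
print(is_well_formed_who("#Fairy,#Narrator"))  # True
print(is_well_formed_who("#Narrator,#Fairy"))  # False
```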
If there are multiple possible addressees and it is not clear who/which subset is being addressed, all possible addressees are included in sp@whom (usually everyone but the speaker). Speech uttered to no specific addressee is left without the @whom attribute.
Incidents will be attached to text as opposed to standing on their own. If the incident overlaps multiple tokens, incorporate those tokens in the incident tag.
<incident type="siren">Ambulance is a little loud, sorry!</incident>
If the incident is in between tokens, attach the incident tag to the very next token. However, if the incident is a sound produced by a speaker, e.g. laughter, and is at the end of that speaker's utterance, attach it to the last token produced by that speaker.
Some tokens that are spelled together cannot be trivially recognized as such after tokenization. Whereas n't or 'll are easy to recognize, the tokens can + not may have been spelled either can not or cannot. To mark the latter case, we add the tag <w> (for 'word') around cannot:
I <w>cannot</w> do this (5 tokens)
If there are graphic section dividers, which separate different sections of the text but do not contain any words, tag them as in the following example:
<p>
<s>* * *</s>
</p>
NOTE: As of GUM v2, the following tags are no longer used:
tag@attribute | meaning |
---|---|
div1 | major, top level section |
div1@n | section number for div1 |
div1@type | type of section (e.g. section, chapter, etc.) |
div2 | same as div1 for a second level nested section |
div2@n | |
div2@type | |
div3 | same as div1 for a third level nested section |
div3@n | |
Other deprecated tags:
tag@attribute | meaning |
---|---|
measure | span of a unit of measurement |
measure@type | a measure type (e.g. currency) |