This is a guide to annotating Irish English.
The guidelines will deal with several levels of analysis:
*Irish English Tokenization - segmentation into words
*Part of speech tagging
*Utterance Segmentation
*Constituent Parsing
*Dependency Parsing
In most cases, tokenization of the Irish English corpus is quite standard.
For partial words, use the target hypothesis: tag the fragment as the word the speaker most plausibly intended.
Example
So uhm <,> then we all <.> dec </.> they all decided they wanted to go to the disco like but I had no money
Token | So | uhm | then | we | all | dec | they | all | decided |
---|---|---|---|---|---|---|---|---|---|
Tag | RB | UH | RB | PRP | RB | VVD | PRP | RB | VVD |
Sometimes, it may be difficult to use target hypothesis. In these cases, see the section UNCLEAR below.
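The tokenization step can be sketched in Python. This is a minimal illustration, not the project's actual tooling: the function name `tokenize` and both regular expressions are assumptions inferred from the markup in the example above (`<,>` for a pause, `<.> ... </.>` around a partial word). The sketch drops markup, as in the token table above, and flags partial words so that an annotator can supply a target hypothesis.

```python
import re

# Assumed markup conventions, inferred from the examples in this guide:
#   <.> ... </.>   wraps a partial (unfinished) word
#   any other <x> or </x> marker (pause, overlap, sentence start) is dropped
PARTIAL = re.compile(r"<\.>\s*(.*?)\s*</\.>")
MARKUP = re.compile(r"</?[^<>\s]+>")


def tokenize(line):
    """Return (token, is_partial) pairs with transcript markup removed."""
    tokens = []
    pos = 0
    for match in PARTIAL.finditer(line):
        # Plain text before the partial word: strip markup, split on spaces.
        before = MARKUP.sub(" ", line[pos:match.start()])
        tokens.extend((tok, False) for tok in before.split())
        # The partial word itself is kept and flagged.
        tokens.extend((tok, True) for tok in match.group(1).split())
        pos = match.end()
    rest = MARKUP.sub(" ", line[pos:])
    tokens.extend((tok, False) for tok in rest.split())
    return tokens


example = "So uhm <,> then we all <.> dec </.> they all decided"
print(tokenize(example))
```

Only `dec` is flagged as partial in this example; every other token passes through unchanged.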
Words with little or no semantic value are tagged as discourse markers (i.e. UH). They are usually affirmative responses in which the word carries less meaning than in its other uses. For example, well in "oh well" no longer carries the sense it has in "the child behaved well".
Examples
Oh right
Token | Oh | right |
---|---|---|
Tag | UH | UH |
Ah cool
Token | Ah | cool |
---|---|---|
Tag | UH | UH |
He rang her alright
Token | He | rang | her | alright |
---|---|---|---|---|
Tag | PRP | VVD | PRP | UH |
Function: "retroactive focusing power, but more importantly, [...] they can be interpreted as countering potential inferences, objections, or doubts." (Miller & Weinert, 1995)
Clause-final 'like' is extremely common, and it neither (a) appears in the same distribution nor (b) has the same function as other forms of 'like'. It should therefore be tagged as UH.
All the people were out like.
Token | All | the | people | were | out | like |
---|---|---|---|---|---|---|
Tag | PDT | DT | NNS | VBD | IN | UH |
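The rule for clause-final 'like' can be sketched as a post-processing step over an existing tagging. This is a hedged illustration only: `retag_final_like` is a hypothetical helper, and it treats utterance-final position as a stand-in for clause-final position, so a clause-final 'like' in the middle of an utterance would still need manual review.

```python
def retag_final_like(tokens, tags):
    """Return a copy of tags in which an utterance-final 'like' is UH.

    Assumption: the last token of the utterance is also clause-final.
    All other tags are left untouched.
    """
    new_tags = list(tags)
    if tokens and tokens[-1].lower() == "like":
        new_tags[-1] = "UH"
    return new_tags


tokens = ["All", "the", "people", "were", "out", "like"]
tags = ["PDT", "DT", "NNS", "VBD", "IN", "IN"]  # hypothetical automatic tagging
print(retag_final_like(tokens, tags))
```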
Example
Did she go out with ye.
Token | Did | she | go | out | with | ye |
---|---|---|---|---|---|---|
Tag | VVD | PRP | VV | IN | IN | PRP |
Either use target hypothesis or the tag XX.
N.B. XX is also used in the Switchboard Corpus for partial words, and unclear parts of speech (Calhoun et al., 2010). Here, we tag partial words using target hypothesis. If the partial word is unclear, then proceed to tag as XX.
Example
Did you go UNCLEAR
Token | Did | you | go | UNCLEAR |
---|---|---|---|---|
Tag | VVD | PRP | VV | XX |
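The decision between the target hypothesis and XX can be sketched as a small fallback function. The names `tag_unclear` and `TOY_LEXICON` are hypothetical, and the lexicon is a stand-in for whatever tagger a project actually uses; only the fallback logic reflects the rule stated above.

```python
# Toy lexicon for illustration only; a real project would query its tagger.
TOY_LEXICON = {"decided": "VVD", "go": "VV"}


def tag_unclear(token, target_hypothesis=None, lexicon=TOY_LEXICON):
    """Tag a partial or unclear token.

    If the annotator supplies a target hypothesis (the word the speaker
    most plausibly intended), the token receives that word's tag.
    Without a hypothesis, or if the toy lexicon does not know the
    hypothesised word, the token is tagged XX.
    """
    if target_hypothesis is None:
        return "XX"
    return lexicon.get(target_hypothesis.lower(), "XX")


print(tag_unclear("dec", "decided"))  # partial word with a clear target
print(tag_unclear("UNCLEAR"))         # no hypothesis available
```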
The utterance should always end after a speaker's turn.
Example
Speaker A: <#> Went in shopping for a while
// Speaker A's turn ends. End of utterance. //
Speaker B: <#> Buy anything
// Speaker B's turn ends. End of utterance. //
Speaker A: <#> Met Nicole in town <#> No I didn't buy anything <#> I 've hardly no money <#> <{> <[> Broke <,>
// Speaker A's turn ends. End of utterance. //
Notice that in this example, Speaker B interrupts Speaker A, who then continues listing the activities from their previous turn. These two turns should be annotated as distinct utterances even though they are closely related.
False starts should be included in the utterance.
Example
<#> But uhm she 's she 's from Galway
Tokens | But | uhm | she | 's | she | 's | from | Galway |
---|---|---|---|---|---|---|---|---|
Utterance | UTTERANCE |
An exception is a false start at the beginning of a sentence in which the lexical item differs significantly from what follows. These should be segmented as distinct utterances. However, the distinction between a false start and topicalization may be ambiguous; in these cases, use your own judgment.
Example
<#> <.> Sat </.> who else <,>
Tokens | Sat | who | else | , |
---|---|---|---|---|
Sentence | SENTENCE | SENTENCE | ||
Utterance | UTTERANCE | UTTERANCE |
Pauses at the end of an utterance should be included.
Example
<#> Yeah <,> she was <{> <[> with her sister </[> <,> <#> She was going in shopping
Tokens | Yeah | , | she | was | with | her | sister | , | She | was | going | in | shopping |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Utterance | UTTERANCE | UTTERANCE |
In most cases, pre-annotated sentence boundaries should be used as utterance boundaries.
Example
<#> So then uhm <,> what 'd I do Sunday then <#> Sunday I did nothing much
Tokens | So | then | uhm | , | what | 'd | I | do | Sunday | then | Sunday | I | did | nothing | much |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sentence | SENTENCE | SENTENCE | |||||||||||||
Utterance | UTTERANCE | UTTERANCE |
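The default segmentation rule (one utterance per pre-annotated `<#>` unit, never crossing a speaker turn) can be sketched as follows. The function names and the `(speaker, text)` input format are assumptions made for illustration. The sketch does not handle the exception for sentence-initial false starts, which still requires annotator judgment.

```python
def split_utterances(turn_text):
    """Split one speaker turn into utterances at each <#> marker."""
    parts = turn_text.split("<#>")
    return [part.strip() for part in parts if part.strip()]


def segment(turns):
    """Segment (speaker, text) turns into (speaker, utterance) pairs.

    Each turn is processed on its own, so an utterance can never
    continue across a change of speaker.
    """
    return [
        (speaker, utterance)
        for speaker, text in turns
        for utterance in split_utterances(text)
    ]


turns = [
    ("A", "<#> Went in shopping for a while"),
    ("B", "<#> Buy anything"),
    ("A", "<#> Met Nicole in town <#> No I didn't buy anything"),
]
for speaker, utterance in segment(turns):
    print(speaker, "|", utterance)
```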
In speech, subject pronouns are frequently dropped. In these cases, null subjects should be marked as an empty category (NONE *).
Examples
<#> Met Nicole in town <#>
<#> Went in shopping for a while
You may notice that the previous examples are annotated as fragments. The question is whether sentences of this kind should be annotated as fragments or as regular sentences. For example, a speaker giving a narrative in the first person may drop subject pronouns while still producing well-formed and complex sentences. We would then expect these to be annotated as sentences, not fragments. However, the boundary is often fuzzy. This guideline therefore adopts the following definition for fragments (FRAG).
"FRAG marks those portions of text that appear to be clauses, but lack too many essential elements. Essential elements include phonologically overt nominal subjects and verbs."
Multiple interjections may appear in clusters or "streams". Phrases containing multiple interjections should be annotated flat.
Example
<#> Oh right yeah
Clause-final LIKE is very frequent in the ICE Ireland Corpus, more so than either clause-initial or clause-medial LIKE (Schweinberger 2011). Many scholars consider the function of clause-final LIKE as a focus marker with backward scope (i.e. modifying the previous clause) (Harris 1993; Miller & Weinert 1999; Anderson 2000; Columbus 2009). Following their discussions, clause-final LIKE should then be attached to the root.
Example
<#> What 's new like </[> <#>
However, it is not always clear whether an instance of LIKE is clause-final.
For example, in the phrase <[> So then <,> </[> she was asking like if we were going out Saturday night, LIKE is syntactically ambiguous.
Clause-initial
So then she was asking [like if we were going out Saturday night]
Clause-final
[So then she was asking like] if we were going out Saturday night
Since the corpus does not include recordings, this may be difficult to determine. Furthermore, the syntactic positions of LIKE are linked to their discourse-pragmatic function (Anderson 1998, 2000; Miller & Weinert 1995; Miller 2009).
The functions of LIKE within the linguistic literature include (Schweinberger 2011):
LIKE can therefore be functionally as well as syntactically ambiguous. In these cases, annotators should rely on their own intuition about the form and function of the sentence containing LIKE.
Spoken corpora contain many disfluencies such as false starts, interruptions, stutters, etc.
For several types of these disfluencies, there are usually two parts: (a) the reparandum, and (b) the repair. The reparandum is defined as the phrase that is subjected to repair.
Example
<#> So uhm <,> then we all <.> dec </.> they all decided they wanted to go to the disco like
In this example, the reparandum is "we all dec", and the repair is "they all decided".
In these cases, the guideline adopts the conventions of the NXT-format Switchboard Corpus (Calhoun et al. 2010), where the reparandum is subsumed within the category EDITED. The token "dec" in the example above is the unfinished form of "decided". Unfinished categories should be annotated with the label UNF. The corresponding parse tree is represented below.
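As a rough illustration of the labelling only, the following Python sketch wraps a tagged reparandum in EDITED and an unfinished token in UNF, and prints the result in bracketed form. The helper names are hypothetical, and the flat structure is an assumption made to keep the example short; the internal phrase structure of the reparandum is not represented.

```python
def leaf(tag, word):
    """Render a tagged word as a bracketed leaf, e.g. (PRP we)."""
    return "(" + tag + " " + word + ")"


def bracket(label, children):
    """Render a labelled node over already-rendered children."""
    return "(" + label + " " + " ".join(children) + ")"


def edited(reparandum, unfinished=()):
    """Wrap (tag, word) pairs in EDITED; wrap unfinished words in UNF.

    Assumption: a flat list of tagged tokens stands in for the
    reparandum's real internal structure.
    """
    children = []
    for tag, word in reparandum:
        node = leaf(tag, word)
        if word in unfinished:
            node = bracket("UNF", [node])
        children.append(node)
    return bracket("EDITED", children)


reparandum = [("PRP", "we"), ("RB", "all"), ("VVD", "dec")]
print(edited(reparandum, unfinished={"dec"}))
# prints: (EDITED (PRP we) (RB all) (UNF (VVD dec)))
```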
Stuttering or hesitation often results in repetition of a word, phrase, or sentence.
The repeated word or phrase (i.e. the second occurrence) should be included within the category REPEAT.
Examples
<#> But uhm she 's she 's from Galway as well though
<#> So she 's <&> laughter </&> she 's in great form like
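Candidate repetitions can be located automatically before an annotator confirms them. The sketch below is one possible approach, not part of the guideline: `find_repeats` and its `max_len` parameter are hypothetical, and it assumes that markup such as `<&> laughter </&>` has already been stripped, so that the two occurrences are adjacent. It returns the span of the second occurrence, which is the part placed under REPEAT.

```python
def find_repeats(tokens, max_len=3):
    """Return (start, end) spans of immediately repeated token sequences.

    Each span covers the second occurrence. Longer repetitions are
    tried before shorter ones at every position.
    """
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            first = tokens[i:i + n]
            second = tokens[i + n:i + 2 * n]
            if len(second) == n and first == second:
                spans.append((i + n, i + 2 * n))
                i += n  # continue from the start of the second occurrence
                break
        else:
            i += 1
    return spans


tokens = ["But", "uhm", "she", "'s", "she", "'s", "from", "Galway"]
print(find_repeats(tokens))
# prints: [(4, 6)]
```

Exact string matching is used here, so a repetition that differs in capitalisation or is interrupted by a filler would be missed and must be marked by hand.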
Unclear or unfamiliar words may sometimes appear in the transcript. The guideline again adopts the conventions of the NXT-format Switchboard Corpus (Calhoun et al. 2010), where unknown, uncertain, or unbracketable elements are subsumed within the category X.
Examples
<#> <[> Did you go <unclear> 1 syll </unclear> </[> </{>
<#> Derv
Sentence-initial 'so' - flat? 'then'
Version 1
#2. Uhm Friday night I didn't do much.
#11. Oh yeah unbelievable - 'Oh yeah' is INTJ together because possible MWE, but in general, each UH is an INTJ.
#13. Went in shopping for a while - added (NONE *) before 'Went'.
#14. Buy anything - made it SQ - target hypothesis.
Version 2
#2. Broke - added (NONE *), short for 'I am broke', therefore ADJP-PRD. Frag or S? Frag because incomplete, missing verb.
#3. What's new like - 'like' is phrase-final so append INTJ to phrase before.
#6. So uhm what else did I do then - sentence-initial 'so' must all be flat.
#11. Did you - FRAG? SQ? where's the verb - 'did'?
#14. Did you go XX - target hypothesis.
#15. Derv - category X.
#17. So yeah - RB is flat.
#19. ...she was asking like - append INTJ UH like at the end of phrase.
#22. Oh right yeah - [Oh right] [yeah]
#31. Cushty - not NP-SBJ, missing verb.
#36. So uhm then we all dec they all decided... - need a label to state false start/disfluency
#39. ...I'd say - frag? PRN
#44. No yeah
#54. Did he - frag, no verb, not S...not NP-SBJ
#58. Ah cool - separate intj?
#59 So she's she's... - false start what to do...
#73 That Cliona's mum
#74 That has Cliana - fragment of an SBAR? WHNP?
#75 Yeah that's right yeah - attach last 'yeah' to S or to VP?
#80 Oh right right - where does the constituency go? I made [Oh right] [right]
#82 Yeah I do yeah
#88 She...you know...her her... - false starts and stuttering/disfluencies.
#90 She's she's... - diff btwn 88 and 90 is unclear, which gets marked as FRAG? first? second?
When two interjections in a row...which is the head?
Phrase-final 'like' is at the end of phrase, inside.
'Sat' - single lexical item...fragment
v2 #14 UNCLEAR...target hypothesis.
'so uhm' #6 v2.
'so yeah' #17 v2.
#22 v2. 'oh right yeah' tagset not same as ours...
#36 false start = frag, label?
#39 I'd say
#48 interjections at end of phrase...
#54 Did he? <- frag? SQ?
#59 So she's she's in great form like - FRAG as part of S? or on its own?
SQ - Do you...
KISS...while the tags follow the PENN standard tagset, it is not an exhaustive use for the following reasons...
Sentence initial 'so'...maybe some test...
CHANGE: Keep utterance boundary the same - helps with constituency, just make them fragments.
#4 what was on <-- was is the root.
No not really #7 - root = really, not clear, right side.
#24 who else <-- root is on who
X is Y <-- Y is root
X is with Y <-- 'is' is root
#30 did you go UNCLEAR --> go->UNCLEAR is dep
#40 so then --> then->So mwe
discourse always from root.
Oh right - root where? #46 --> used mwe, but use tests to determine!
So i went home, i'd say #55 --> I'd say is discourse function or parataxis?
Uhm what s #57 --> 's' is root, right most and is verb?
#66 did she not --> 'did' is the root...
#70 did he --> 'did' is root
#74 ah cool --> mwe
#85 with Fred --> with is root because if Fred is root, cannot link together
#86 With Fred and Ciaran... --> the subject is 'I', therefore with is prep...def interesting...
#89 That Cliona's mum --> null copula
And Noirin... #93 --> parataxis into Noirin
#107 ...so... --> advcl and mark
V.2
#4 so yeah... --> what's root? hierarchy? RB, NN > INTJ as root?
#6 Oh right yeah --> root to the right, [oh right] yeah
#9 Oh wow --> wow is root, oh is mwe?
#10 we all dec they all decided --> repair tag
#11 no yeah --> you can say yeah no, so not mwe...discourse...
#14 repeat tag used...what do you gain what do you lose?
#20 She...you know... <-- used repair for false start, and repeat
#21 contains repeat...
Combinations of Interjections
"combinations of interjections may have special pragmatic functions and distributions, representing regional and personal differences"
"the order of interjections in combinations is often fixed"
"Thus combinations of interjections hang together both functionally-semantically and formally-syntactically."
"combinations of interjections demonstrate solidarity at the level of prosody as well"
"the need for the creation of sub-corpora and qualitative analysis of individual examples"
"interjections bring out both the strengths and weaknesses of corpus investigation...careful qualitative analysis of interjections continues to be necessary to determine particular functions."
Neal R. Norrick. Corpus Pragmatics.
REPEAT - discursive features, take the first occurrence? LIKE - ambiguous - which root?
Anderson, Gisle. 1998. “The pragmatic marker like from a relevance-theoretic perspective”. In Jucker, Andreas H. & Yael Ziv (eds.), Discourse markers: descriptions and theory. Amsterdam & New York: John Benjamins, 147-170.
Anderson, Gisle. 2000. The role of the pragmatic marker like in utterance interpretation. Amsterdam & New York: John Benjamins.
Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style. Ms., Department of Computer and Informational Science, University of Pennsylvania.
Calhoun, Sasha, Jean Carletta, Jason Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation.
Columbus, Georgie. 2009. “Irish like as an invariant tag: evidence from ICE-Ireland”. Paper presented at AACL 2009 (American Association for Corpus Linguistics), October 9th 2009 in Edmonton, Alberta, Canada.
Harris, John. 1993. “The grammar of Irish English”. In: Milroy, James & Leslie Milroy (eds.), 139-186.
Miller, Jim & Regina Weinert. 1995. “The function of LIKE in spoken language”. Journal of Pragmatics 23, 365-393.
Miller, Jim. 2009. “Like and other discourse markers”. In Peters, Pam, Peter Collins & Adam Smith (eds.), Comparative Studies in Australian and New Zealand English: Grammar and beyond. Amsterdam: John Benjamins, 317-338.
Schweinberger, Martin. “A variational approach towards discourse marker LIKE in Irish-English”. In Bettina Migge & Maire Ni Chiosain (eds.), New Perspectives on Irish English. Amsterdam & New York: John Benjamins, 179-201.
Schweinberger, Martin. 2011. The discourse marker like: A corpus-based analysis of selected varieties of English. Hamburg: unpublished PhD dissertation.