Sampson: Consistent annotation of speech-repair structures

Introduction

As Jane Edwards of the University of California has put it (Edwards, 1992: 139), “The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways”. Statistics are meaningless if they are drawn from a database in which the same phenomenon is coded now this way, now that. In the field of grammatical annotation, this principle has been neglected. Many alternative lists of grammatical categories are extant, but often these are not backed up by detailed, rigorous specifications of logical boundaries between the categories: an annotation scheme may specify how to draw a parse tree with richly informative node-labels for clear, “textbook” example sentences, while leaving it quite ambiguous how the scheme should be applied to messy real-life language.

Our SUSANNE scheme (Sampson, 1995; www.grsampson.net/RSue.html) attempted to fill this gap, focusing mainly on the grammar of edited written English. The 500 pages of the published scheme aim to define a uniquely predictable structural analysis for anything occurring in real-life usage. The SUSANNE scheme has been winning a measure of international recognition: “the detail of [the SUSANNE] annotation is unrivalled” (Langendoen, 1997: 600); “impressive … very detailed and thorough” (Mason, 1997: 169, 170); “meticulous treatment of detail” (Leech & Eyes, 1997: 38). Other research groups may prefer to use different lists of grammatical symbols in creating treebanks, but the usage statistics they compile will be of questionable value unless the logical boundaries between those symbols are defined with respect to the multifarious issues that are treated explicitly in the SUSANNE scheme.

My current CHRISTINE project (http://www.grsampson.net/RChristine.html) is extending this undertaking to cover the structural annotation of spoken English, particularly spontaneous, informal spoken English (and, as a by-product, it is generating a 200,000-word annotated corpus of English speech). We are using several sources of data, the chief source being the speech subsection of the British National Corpus (http://info.ox.ac.uk/bnc/), which to my knowledge is unrivalled as a demographically-balanced “fair cross-section” of recent British speech (even if it is not always above reproach with respect to quality of transcription). When complete, the CHRISTINE Corpus will be made freely available by electronic means, legal constraints permitting, as the SUSANNE Corpus already is.

In the domain of spontaneous speech, one significant issue is that of rules for consistently annotating speech repairs – stretches of wording in which a speaker begins to realize one grammatical plan, but breaks off and either starts afresh or continues in conformity to a different plan. (The term speech management phenomena is sometimes used in a similar though broader sense.) Speech-mediated man-machine interaction is nowadays seen as a key future technology, and for this purpose the automatic recognition and analysis of speech repairs will be a crucial technique; so it is very desirable to develop standardized, predictable schemes for registering repair structures in corpus data.

The present paper surveys some of the problems that confront any attempt to make such schemes rigorous. All examples are excerpted from the British National Corpus, and are identified by the three-byte BNC file name followed after a full stop by five digits representing the BNC “s-unit” number, supplemented with leading zeros as necessary. Notations of the SUSANNE/CHRISTINE scheme are explained as needed in the discussion of individual examples; and the notations are displayed in only as much detail as is relevant to the points under discussion. (The full CHRISTINE grammatical labels in the examples quoted below are often considerably more informative than is shown in the present paper.)
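The reference format just described can be handled mechanically. The following is a minimal sketch in Python, not part of the CHRISTINE tooling; the function names format_ref and parse_ref are hypothetical, and the pattern assumes that file names consist of three upper-case letters or digits, as in the examples quoted in this paper.

```python
import re
from typing import Tuple

# Assumed shape: three upper-case letters or digits, a full stop, five digits.
_REF = re.compile(r"^([A-Z0-9]{3})\.(\d{5})$")


def format_ref(file_name: str, s_unit: int) -> str:
    """Build a reference such as 'KCJ.01017', padding with leading zeros."""
    if len(file_name) != 3:
        raise ValueError("file name is assumed to be three characters long")
    if not 0 <= s_unit <= 99999:
        raise ValueError("s-unit number must fit in five digits")
    return f"{file_name}.{s_unit:05d}"


def parse_ref(ref: str) -> Tuple[str, int]:
    """Split a reference into its file name and integer s-unit number."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a well-formed example reference: {ref!r}")
    return match.group(1), int(match.group(2))


print(format_ref("KCJ", 1017))
print(parse_ref("KD6.03060"))
```

Storing the s-unit number as an integer and padding only on output keeps the two representations from drifting apart.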

Approaches to Repair Annotation

When the SUSANNE annotation scheme was developed, the most fully-worked-out and empirically-founded existing approach to annotating speech repairs was that of Howell & Young (1990, 1991), based on Levelt (1983). This approach identified, within a stretch of wording that includes a repair, a set of up to nine “repair milestones”: for instance, the point at which the original grammatical plan is abandoned (the interruption point), and the (earlier) point marking the beginning of the stretch of wording that is destined to be replaced by new wording after the interruption point.

However, Sampson (1995: 448ff.) found that the Howell & Young approach was in one respect too limited, and in another respect too rich, to be satisfactory for NLP research purposes. A speech repair will typically be embedded within a larger structure which in other respects may be grammatically well-formed; the Howell & Young scheme contained no proposals for showing the relationship between the repair and the structure containing it, an issue that may not have been significant for their purposes but which is crucial in an NLP context. In this respect, then, the scheme was unduly limited as it stood. Arguably it was unduly rich, on the other hand, in terms of the range of different milestones used to characterize the progress of a speech repair. When a speaker abandons one grammatical plan for another, the moment when the first plan is abandoned normally corresponds to a real event – the speaker is likely to be conscious of making a change, and he will often produce audible hesitation phenomena. On the other hand, even if the wording after the interruption point replaces a specific stretch of the wording before that point, as in:

so I’ll be I’ll come down KCJ.01017

where the wording from the second I’ll onwards clearly replaces I’ll be, the beginning of that stretch (in the example, the transition between so and I’ll) corresponds to no actual event (when the speaker said so I’ll, he was speaking continuously and surely did not anticipate that he was moving into a stretch of wording that was fated to be replaced).

In this example, the transition between so and I’ll can at any rate logically be identified as initiating the reparandum, in Levelt’s terminology, in the sense that excising I’ll be leaves a sequence so I’ll come down which is perfectly well-formed and plausible as an expression of what the speaker ultimately wanted to say. But, in our experience, it is often an artificial exercise to try to identify the beginning of a reparandum; wording following the interruption point does not always replace a specific stretch of wording preceding that point, such that cutting out that stretch would leave a fully felicitous utterance. Howell & Young (1990) claim that Levelt’s scheme shows promise when applied to their data, which consist of dictaphone recordings; but this is an unusual hybrid genre of language, in the sense that the speaker’s intention is not merely to convey his message but to create good written prose for a secretary to convert to typescript. Despite the fact that this might be expected to make their data more “polished” than spontaneous speech, Howell & Young’s use of Levelt’s scheme nevertheless often seems to require them to make quite complex unmotivated decisions between alternative analyses.
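The excision test applied to KCJ.01017 can be made concrete with a short sketch. It is illustrative only: the function names excise and candidates are hypothetical, and positions are simple indices into the whitespace-separated words. Listing every possible reparandum onset for a fixed interruption point shows why the onset is a decision imposed on the analyst rather than an observable event.

```python
from typing import List


def excise(tokens: List[str], start: int, interruption: int) -> List[str]:
    """Remove tokens[start:interruption], the candidate reparandum."""
    if not 0 <= start <= interruption <= len(tokens):
        raise ValueError("need 0 <= start <= interruption <= len(tokens)")
    return tokens[:start] + tokens[interruption:]


def candidates(tokens: List[str], interruption: int) -> List[str]:
    """Return the residue for every possible reparandum onset."""
    return [" ".join(excise(tokens, start, interruption))
            for start in range(interruption + 1)]


utterance = "so I’ll be I’ll come down".split()
# The interruption point falls before index 3, the second "I’ll".
for start, residue in enumerate(candidates(utterance, 3)):
    print(start, residue)
```

Only the onset at index 1 leaves so I’ll come down; the other onsets leave residues which the analyst must reject on grounds of plausibility, not on the basis of anything marked in the speech signal.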

Levelt developed his speech-repair annotation system in the context of a psycholinguistic theory which attempted to make predictions about what types of repair do and what types do not occur in practice. It is accordingly no criticism of his system that it makes strong assumptions about repair structures; scientific theories ought to be strong, i.e. highly testable. But for the more engineering-oriented purposes of NLP data compilation, it is desirable to use annotation schemes which make weaker assumptions and hence can be readily adapted to represent whatever repair structures turn out to be found in practice, so that data can be registered here and now without waiting for the ultimate psycholinguistic theory of speech repairs to be formulated; data annotated in terms of a general scheme could be used, among other things, as evidence for and against such psycholinguistic theories.

Accordingly, the SUSANNE scheme proposed a modified approach for annotating speech repairs, based mainly on identifying the interruption point, which is represented as a terminal symbol (shown here as “#”) attached as a daughter of the lowest node in a parse tree such that the wording before # and (if any) the wording after # can both be interpreted as (partial or complete) attempts to realize that node. To illustrate with a straightforward example, the sequence

when his <pause> he was with his daughter I said you give me my bloody keys … KCT.10662

is analysed (showing relevant labelled bracketing only) as:

[S [Fa when [N his N] # <pause> [Nas he Nas] [Vsb was Vsb] [P with his daughter P] Fa] I said you give me my bloody keys … S]

– the # node is shown as a daughter of the adverbial clause (Fa) tagma, because both when his … and (when) … he was with his daughter are successive attempts to produce an adverbial clause, which as a whole fits normally into the superordinate main clause (S) as its first constituent. (The symbols N, V, and P stand respectively for noun phrase, verb group, and prepositional phrase; they are supplemented in some cases by subcategory letters giving more detailed classification, for instance he is labelled Nas, being morphologically marked as subject and singular.) Although his appears to have been intended as the first word of a noun phrase that was not completed, we do not attach the symbol # as daughter of a single noun-phrase tagma [N his # he], because it is not plausible that his was the start of a noun phrase that would have been followed by the predicate was with his daughter – the interruption point marks a change of tack for the whole adverbial clause, not just for its subject.

Thus a speech database incorporating this type of annotation does show how a repaired stretch fits grammatically into a wider grammatical structure; and it permits automatic retrieval of most properties that actually apply to a typical speech repair, while so far as possible it avoids the need for analysts to create pseudo-facts that correspond to no realities in the repairs themselves but only to artificial demands of the notation. The annotation could be described as “minimalist”; a notation which slimmed down Levelt’s rich apparatus of repair milestones even further than this might scarcely deserve to be called a system of repair annotation at all.
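As an illustration of the kind of automatic retrieval meant here, the sketch below reads the labelled bracketing given for KCT.10662 and reports, for each # symbol, the label of the tagma it is a daughter of together with the wording before and after it within that tagma. It is a minimal sketch, not the CHRISTINE software: the names Node, parse, terminals and repairs are hypothetical, and it assumes that every bracket is written as a single whitespace-delimited token of the form "[Label" or "Label]".

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union


@dataclass
class Node:
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


def parse(text: str) -> Node:
    """Build a tree from a bracketing with labelled open and close brackets."""
    stack: List[Node] = []
    root = None
    for tok in text.split():
        if tok.startswith("[") and len(tok) > 1:      # opening bracket
            node = Node(tok[1:])
            if stack:
                stack[-1].children.append(node)
            stack.append(node)
        elif tok.endswith("]") and len(tok) > 1:      # closing bracket
            if not stack or stack[-1].label != tok[:-1]:
                raise ValueError(f"unbalanced bracket: {tok}")
            root = stack.pop()
        elif stack:                                   # word, pause or '#'
            stack[-1].children.append(tok)
        else:
            raise ValueError(f"terminal outside any bracket: {tok}")
    if stack or root is None:
        raise ValueError("bracketing is not closed")
    return root


def terminals(item: Union[Node, str]) -> List[str]:
    """Flatten a subtree into its terminal symbols, left to right."""
    if isinstance(item, str):
        return [item]
    return [w for child in item.children for w in terminals(child)]


def repairs(node: Node) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (host label, wording before #, wording after #) per repair."""
    for i, child in enumerate(node.children):
        if isinstance(child, Node):
            yield from repairs(child)
        elif child == "#":
            before = [w for c in node.children[:i] for w in terminals(c)]
            after = [w for c in node.children[i + 1:] for w in terminals(c)]
            yield node.label, before, after


EXAMPLE = ("[S [Fa when [N his N] # <pause> [Nas he Nas] [Vsb was Vsb] "
           "[P with his daughter P] Fa] I said you give me my bloody keys … S]")

for host, before, after in repairs(parse(EXAMPLE)):
    print(host, "|", " ".join(before), "|", " ".join(after))
```

Because the # symbol is an ordinary terminal, no separate repair layer is needed: the host tagma and the two attempts fall out of the tree itself.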

Problematic Issues in Repair Annotation

The CHRISTINE project is testing and improving the definition of this approach by confronting it with real-life data. The remainder of the present paper uses BNC examples to survey and classify various problematic issues that emerge from this undertaking. The paper is frankly concerned with the examination and classification of data, rather than with propounding theories or algorithms. At this stage of our understanding of speech-repair phenomena, investigating the data in an empirical spirit is what is needed.

The overall aim of the SUSANNE and CHRISTINE annotation scheme is to indicate explicitly as much detail of the structure of language as is practical, but to do so in a predictable way, so that two analysts equipped with copies of the scheme and given the same language sample must produce identical annotations. The scheme tries always to provide fallback notations which avoid forcing the analyst to make decisions beyond the evidence available; as a simple case, the noun phrases the man and the men are labelled Ns and Np, being morphologically marked as singular and plural respectively, but the fish is labelled just N. Consequently, the speech-repair annotation system is perceived as problematic in cases where it creates alternative ways to annotate the same stretch of wording, and where there is no convenient way to define a neutral fallback notation. Despite the “slimming down” of Levelt’s notation that has been applied in creating the CHRISTINE scheme, quite a number of such problems do still arise. The examples listed below are classified by types of analytic ambiguity.
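The fallback principle in the noun-phrase case can be expressed as a small sketch. The function name noun_phrase_label, the value names "singular" and "plural", and the three-word lexicon are hypothetical and serve only to reproduce the man / men / fish contrast described above; they are not part of the scheme.

```python
from typing import Dict, Optional

# Toy lexicon (an assumption for illustration): morphologically marked
# number of the head noun, or None where the form carries no marking.
MARKED_NUMBER: Dict[str, Optional[str]] = {
    "man": "singular",
    "men": "plural",
    "fish": None,
}


def noun_phrase_label(number: Optional[str]) -> str:
    """Return Ns or Np when number is marked, else the neutral label N."""
    if number == "singular":
        return "Ns"
    if number == "plural":
        return "Np"
    if number is None:
        return "N"   # fallback: no decision beyond the evidence
    raise ValueError(f"unknown number value: {number!r}")


for head in ("man", "men", "fish"):
    print("the", head, "->", noun_phrase_label(MARKED_NUMBER[head]))
```

The difficulty discussed in the rest of this section is that speech repairs often offer no counterpart to the neutral branch: the analyst has to commit to one of the competing structures.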

Repair v. Co-ordination

Grammatical annotation even of edited written prose must provide for various types of co-ordinate structure, not all of which (apposition, asyndetic co-ordination) are overtly marked by conjunctions. It can then be difficult or impossible to distinguish between repairs in which one tagma is replaced by another, and co-ordinations in which both tagmas are intended. Thus:

yeah but he don’t get done he doesn’t get done that is the problem KD6.03060

she can’t be much cop if she’d open her legs to a first date to a Dutch s- sailor KSS.05002

In the KD6 example, he doesn’t get done could be intended to replace the nonstandard verb form in he don’t get done, implying a structure [S+ but he don’t get done # he doesn’t get done S+]; alternatively the speaker may intend the near-repetition as an appositional reiteration to add emphasis, implying (in SUSANNE notation) a structure [S+ but he don’t get done [S@ he doesn’t get done S@] S+]. In the KSS case, to a Dutch s- sailor could be intended as additional information in apposition to the preceding phrase to a first date, though this might be regarded as a rather literary structure in the context of informal speech; alternatively, the structure may be a speech repair in which to a Dutch s- sailor is substituted for to a first date as the true ground for criticism (though this might raise questions about the speaker’s scale of relative social heinousness).
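The two candidate analyses of the KD6 example can be compared mechanically, which is one way a project might flag stretches where annotators have diverged. The sketch below is illustrative only; the names words and is_conflict are hypothetical, and it assumes that brackets are whitespace-delimited tokens beginning with "[" or ending with "]".

```python
from typing import List


def words(bracketing: str) -> List[str]:
    """Return the spoken words, dropping bracket tokens and '#'."""
    return [tok for tok in bracketing.split()
            if not (tok.startswith("[") or tok.endswith("]") or tok == "#")]


def is_conflict(a: str, b: str) -> bool:
    """True when two annotations cover the same words but differ."""
    return words(a) == words(b) and a.split() != b.split()


REPAIR = "[S+ but he don’t get done # he doesn’t get done S+]"
APPOSITION = "[S+ but he don’t get done [S@ he doesn’t get done S@] S+]"

print(words(REPAIR) == words(APPOSITION))
print(is_conflict(REPAIR, APPOSITION))
```

A check of this kind detects that a divergence exists; it cannot say which analysis is right, which is exactly the gap that the annotation scheme has to close by rule.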

Repair v. Interpolation

Interpolation, whereby a structurally independent tagma is inserted into the middle of a construction that would be complete without it, is normal in written prose; it is often marked orthographically by brackets. Since speech exploits almost all constructions found in writing, it would be arbitrary not to recognize interpolations in the spoken language, but they can be very difficult to distinguish from repairs. E.g.:

that’s a certainty but I don’t know what whether to make one or not I don’t know what to do KSS.04846

but he he the thing is he just thumps them KD6.03085

In the KSS example, it is clear that the speaker makes two attempts to produce a clause after but consisting of I don’t know followed by an indirect question (Fn?). The whether clause might be a well-formed interpolation (I) inserted between these attempts, or alternatively what whether to … could be a repair: the analysis might be either of the following:

[S+ but I don’t know [Fn? what ] # [I whether to make one or not ] I don’t know [Fn? what to do ] S+]

[S+ but I don’t know [Fn? what # whether to make one or not ] # I don’t know [Fn? what to do ] S+]

In the KD6 case, the thing is might be treated as an interpolation inserted into a succession of attempts to realize a subject pronoun, or alternatively the thing is he just thumps them might be a repaired version of a sentence which was initially planned to run simply but he just thumps them.
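The difference between the two analyses offered for KSS.04846 can be summarized by asking which tagma each # symbol is a daughter of. The following sketch does this with a simple stack of open labels; the function name interruption_hosts is hypothetical, and the code assumes the bracketing conventions of the two analyses as printed, where closing brackets may be bare ("]") or labelled ("S+]").

```python
from typing import List


def interruption_hosts(bracketing: str) -> List[str]:
    """Return the label of the innermost open tagma at each '#'."""
    stack: List[str] = []
    hosts: List[str] = []
    for tok in bracketing.split():
        if tok.startswith("["):
            stack.append(tok[1:])          # opening bracket with its label
        elif tok.endswith("]"):
            if not stack:
                raise ValueError("closing bracket without an opening one")
            stack.pop()                    # bare or labelled closing bracket
        elif tok == "#":
            if not stack:
                raise ValueError("'#' outside any tagma")
            hosts.append(stack[-1])
    if stack:
        raise ValueError("bracketing is not closed")
    return hosts


INTERPOLATION = ("[S+ but I don’t know [Fn? what ] # [I whether to make one "
                 "or not ] I don’t know [Fn? what to do ] S+]")
DOUBLE_REPAIR = ("[S+ but I don’t know [Fn? what # whether to make one "
                 "or not ] # I don’t know [Fn? what to do ] S+]")

print(interruption_hosts(INTERPOLATION))
print(interruption_hosts(DOUBLE_REPAIR))
```

Under the interpolation analysis there is a single repair at clause level; under the alternative there is an additional repair inside the indirect question. Any count of repairs per tagma type drawn from the corpus therefore depends on which analysis the scheme prescribes.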

Repair v. Nonstandard Usage

Real-life spontaneous speech of the kind assembled in the BNC contains a large variety of nonstandard turns of phrase. In the case of some well-established and widely-discussed regional variants – say, the omission of definite articles in the speech of East Yorkshire – analysts can reasonably be expected to understand the wording in terms of the dialect rules governing a particular speaker’s usage. But the spectrum of English varieties contains far more grammatical variants than any analyst will be aware of; and oddities of usage may often be one-off performance deviations not referable to any rule, standard or nonstandard. Consider:

I can remember on erm <pause> the other day <pause> I <pause> I accidentally dropped the helicopter on the floor KCA.02691

like I done with the house <pause> I put different parts for there KCA.02674

In standard English, the other day, or on Tuesday, can function as Time adjuncts, but on the other day cannot. Should erm in the former example be taken as marking an interruption point where a planned construction such as on Tuesday was abandoned in favour of the other day, or should on the other day be treated as a nonstandard but intended phrase, without repair? In the latter example, for there seems not to represent any standard English usage, but it might consist of some standard prepositional phrase abandoned after the preposition and replaced by there, or alternatively for there may be well-formed in the speaker’s dialect.

Verb groups are one area particularly rich in repair/nonstandard-usage ambiguities. In you’ve can’t make a a wedding cake … KSS.04868, it is tempting to say that +’ve can’t is so deviant a sequence, and the regular patterns of English verb groups are so well established, that the example must represent a repair, with an interruption point falling between +’ve and can’t. This may be correct; but in the same dialogue alone, the same speaker (female, age 72, Salvation Army, Lancashire) also says it’s be KSS.04811 and don’t you pushing your nose KSS.05009, and another speaker (male, 45, unemployed, Lancashire) says she wents into town with me KSS.04989. Other BNC files contain a great diversity of nonstandard verb groups. Not all of these forms can plausibly be analysed as speech repairs, so perhaps none should be. (In the last example, wents may represent nonstandard phonology rather than grammar, namely the affrication characteristic of Liverpool speech; but the other examples seem to require an account in terms of wording rather than pronunciation.)

A situation which does not involve “repair” as such, but is conveniently included in this section, is exemplified by:

oh she was shouting at him at dinner time … oh god dinner time she was shouting him KD6.03154

The second clause at first sight appears to contain an unrepaired omission of at, required in standard English with the target of a shout (and included in the first clause); the present writer, for one, was not previously aware of any English dialect in which the person shouted at occurs as direct object of shout. However, from other examples in this BNC file it becomes clear that this particular speaker regularly uses shout in that way, so that no omission should be marked in the second clause; the first clause presumably represented the speaker temporarily shifting towards the standard.

How Much Omitted?

Repairs often occur when a first attempt at expressing an idea accidentally omits linguistic forms which are needed for the adequate or correct expression of the idea. Even a relatively austere repair annotation scheme, such as the CHRISTINE scheme, can then find itself forced to answer unanswerable questions about how much has been omitted. Consider:

if that if it was the drama teacher that said that I’m gonna write to her KD6.03122

it’s a big mistake when they let <anonymized name> in into that school KD6.03100

In the former case, there are clearly two attempts at an if clause, with an interruption point between that and the second if; but that preceding this interruption point might have been intended as subject of the first attempt (the role eventually filled by it), or it might be the same that which eventually occurs as object of said (implying that the first attempt at the if clause was abandoned because of a gross omission). No doubt there are other possibilities. In the second example, in may be an attempt at into which was abandoned and restarted after the first syllable (a common repair pattern) – in which case the repair is at word level; or in may be intended as a complete adverb, in which case the repair is at phrase level: [P [ in # into ] [N that school ] ], or [R#P in # into [N that school ] ]. (The label R#P indicates a repaired structure which begins as an adverbial phrase and ends as a prepositional phrase.)
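A label such as R#P can be decoded into the category in which the structure begins and the category in which it ends. The sketch below is an assumption-laden illustration: the function name decode_repair_label is hypothetical, the category table contains only the four symbols glossed in this paper (N, V, P, R), and treating the first letter of each half as the main category, with any later letters as subcategory detail, is an assumption rather than a documented rule of the scheme.

```python
from typing import Dict, Tuple

# Only the symbols glossed in this paper; the full scheme has more.
CATEGORIES: Dict[str, str] = {
    "N": "noun phrase",
    "V": "verb group",
    "P": "prepositional phrase",
    "R": "adverbial phrase",
}


def decode_repair_label(label: str) -> Tuple[str, str]:
    """Split 'X#Y' into descriptions of the start and end categories."""
    start, sep, end = label.partition("#")
    if not (sep and start and end):
        raise ValueError(f"not a repaired-structure label: {label!r}")
    # Assumption: the first letter of each half names the main category.
    return (CATEGORIES.get(start[0], "unknown category"),
            CATEGORIES.get(end[0], "unknown category"))


print(decode_repair_label("R#P"))
```

Under the word-level analysis no such label is needed, since the repair is contained within the preposition; the choice between the two analyses therefore changes the label inventory as well as the tree shape.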

Intentional Discontinuity

Grammatical discontinuity is normally taken to be a performance deviation from the competence rules (of the standard language, or of a nonstandard dialect). Perhaps surprisingly, one not uncommon pattern in CHRISTINE data is that discontinuity is used intentionally to achieve a particular communicative effect. In:

and he takes the mickey out of him which okay then he called him … KD6.03054

the most plausible interpretation of which okay has it saying, in effect, “I am not going to complete the relative clause I have initiated with which, but, were I to do so, that clause would amount to a concession of the issue just raised”. It is not clear whether the concept “speech repair” can appropriately be extended to intentional discontinuities; but, without that concept, it is difficult to see how a structural annotation could indicate what is going on in such an example.

Conclusion

The main purpose of this paper is not to point out that patterns such as those listed above occur frequently in real-life speech. In itself, that observation would be fairly trivial. The aim, rather, is to draw attention to the problems that such repair phenomena create for the enterprise of devising a structural annotation system which is both informative and predictable.

Any grammatical annotation scheme, even one devised for edited written language, will inevitably encounter sporadic examples which force the analyst to guess between alternative acceptable analyses. But a good annotation scheme should not systematically require repeated guesswork with respect to some aspect of structure: it should permit the analyst to specify only what there is positive evidence for in the words spoken. This aim is particularly difficult to achieve in the area of speech repairs. Yet consistent annotation standards are as necessary in the domain of speech repairs as they are in the domain of “grammar” in the strict sense, if NLP researchers are to develop systems capable of handling spontaneous speech.

Acknowledgment

The research reported here was supported by grant R000 23 6443, “Analytic Standards for Spoken Grammatical Performance”, awarded by the Economic and Social Research Council (UK).

References

Edwards, Jane A. (1992). Design principles in the transcription of spoken discourse. In J. Svartvik (Ed.), Directions in Corpus Linguistics (pp. 129–144). Berlin: Mouton de Gruyter.

Howell, P. & Young, K. (1990). Speech repairs: report of work conducted October 1st 1989–March 31st 1990. Department of Psychology, University College London.

Howell, P. & Young, K. (1991). The use of prosody in highlighting alterations in repairs from unrestricted speech. Quarterly Journal of Experimental Psychology, 43A, 733–758.

Langendoen, D.T. (1997). Review of Sampson (1995). Language, 73, 600–603.

Leech, G.N. & Eyes, Elizabeth. (1997). Syntactic annotation: treebanks. In R.G. Garside et al. (Eds.), Corpus Annotation (pp. 35–52). Harlow: Longman.

Levelt, W.J.M. (1983). Monitoring and self-repair in speech. Cognition, 14, 41–104.

Mason, O. (1997). Review of Sampson (1995). International Journal of Corpus Linguistics, 2, 169–172.

Sampson, G.R. (1995). English for the Computer: the SUSANNE Corpus and Analytic Scheme. Oxford: Clarendon Press (Oxford University Press).