The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.
Extending Grammar Annotation Standards to Spontaneous Speech
Anna Rahman and Geoffrey Sampson
University of Sussex
Published in J.M. Kirk, ed., Corpora Galore: Analyses and Techniques in Describing English, Rodopi (Amsterdam), 1999.
Abstract
We examine the problems that arise in extending an explicit, rigorous scheme of grammatical annotation standards for written English into the domain of spontaneous speech. Problems of principle occur in connexion with part-of-speech tagging; the annotation of speech repairs and structurally incoherent speech; logical distinctions dependent on the orthography of written language (the direct/indirect speech distinction); differentiating between nonstandard usage and performance errors; and integrating inaudible wording into analyses of otherwise-clear passages. Perhaps because speech has contributed little in the past to the tradition of philological analysis, it proves difficult in this domain to devise annotation guidelines which permit the analyst to express what is true without forcing him to go beyond the evidence.
Background
To quote Jane Edwards
(1992: 139), “The single most important property of any data base for purposes
of computer-assisted research is that similar instances be encoded in
predictably similar ways”. This principle has not often been
observed in the domain of grammatical annotation. Although many alternative lists of grammatical categories
have been proposed for English and for other languages, in most cases these are
not backed up by detailed, rigorous specifications of boundaries between the
categories. A scheme may define
how to draw a parse tree for a clear, “textbook” example sentence, with node
labels drawn from a large, informative label-alphabet, but may leave it
entirely to analysts’ discretion how to apply the annotation to the messy
constructions that are typical of real-life data.
The SUSANNE scheme, developed over the period 1983–93 (Sampson 1995; www.grsampson.net/RSue.html), is a first published attempt to fill this gap for English; the 500 pages of the scheme aim to define an explicit analysis for
everything that occurs in the language in practice. Figure 1 shows a brief extract from the scheme (the first
two of rather over four pages defining the logical boundaries of the category
P, “prepositional phrase”, one of the simplest categories recognized by the
scheme). Figure 1 gives a flavour
of the kinds of issue that have to be explicitly settled, one way or another,
if a category is to be applied in a consistent fashion. No claim is made that the numerous
annotation rules comprised in the SUSANNE scheme are “correct” with respect to
some psychological or other reality; undoubtedly there are cases where the
opposite choice of rule could have yielded an equally well-defined and
internally consistent annotation scheme.
But, without some
explicit choice of rules on a long list of issues comparable to those discussed
in Figure 1, one has only a list of category-names and symbols, not a
well-defined scheme for applying them.
The SUSANNE scheme has
been achieving a degree of international recognition: “the detail … is unrivalled” (Langendoen 1997: 600);
“impressive … very detailed and thorough” (Mason 1997: 169, 170); “meticulous
treatment of detail” (Leech & Eyes 1997: 38). We are not aware of any alternative annotation scheme (for English,
or for another language) which covers the ground at a comparable level of
detail. (The other schemes that we
know about seem to have been initiated substantially more recently than
SUSANNE, as well as being less detailed.
We do not survey these schemes here; but it is worth mentioning, as
particularly closely related to the work described below, the scheme for
annotating dysfluencies in the Switchboard corpus of American telephone
conversations, http://www.ldc.upenn.edu/myl/DFL-book.pdf.)
Various research groups
may prefer to use different lists of grammatical symbols; but it is not clear
what value will attach to statistics derived from annotated corpora, unless the
boundaries between their categories are defined with respect to the same issues
that the SUSANNE scheme treats explicitly.
Currently, the CHRISTINE
project (www.grsampson.net/RChristine.html) is extending the SUSANNE scheme, which
was based mainly on edited written English, to the domain of spontaneous spoken
English. CHRISTINE is developing
the outline extensions of the SUSANNE scheme for speech which were contained in
Sampson (1995: ch. 6) into a set of annotation guidelines comparable in degree
of detail to the rest of the scheme, “debugging” them by applying them manually
to samples of British English representing a wide variety of regional, social
class, age, and social setting variables.
Figure 2 displays an extract from the corpus of annotated speech
currently being produced through this process. The sources of language samples used by the CHRISTINE project
are the speech section of the British National Corpus (http://info.ox.ac.uk/bnc/), the Reading Emotional Speech Corpus (http://midwich.reading.ac.uk/research/speechlab/emotion/), and the London-Lund Corpus (Svartvik 1990).
Figure 2 is extracted from file KSS of the British National Corpus. (Except where otherwise stated,
examples quoted in later sections of this paper will also come from the BNC,
with the location specified as three-character filename followed after full
stop by five-digit “s-unit number”.
BNC transcriptions include punctuation and capitalization, which are of
questionable status in representations of spoken wording; in Figure 2 these
matters are normalized away, but they have been allowed to stand in examples
quoted in the text below.)
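The citation convention just described (three-character filename, full stop, five-digit s-unit number, occasionally followed by a range) can be checked mechanically. The following sketch assumes only that convention; the regular expression and the function name are ours for illustration, not part of any project tool.

```python
import re

# Illustrative parser for the citation format used in this paper,
# e.g. "KSS.05002", or the range forms "KCJ.01053-5" / "KPC.00999–1002"
# (either a hyphen or an en-dash before the range end).
CITATION = re.compile(r"^([A-Z0-9]{3})\.(\d{5})(?:[-–](\d+))?$")

def parse_citation(ref):
    """Return (filename, s-unit number, optional range end), or None."""
    m = CITATION.match(ref)
    if not m:
        return None
    name, start, end = m.groups()
    return name, int(start), int(end) if end else None

print(parse_citation("KSS.05002"))   # → ('KSS', 5002, None)
```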
In Figure 2, the words
uttered by the speakers are in the next-to-rightmost field. The field to their left classifies the
words, using the SUSANNE tagset supplemented by some additional refinements to
handle special problems of speech:
the part-word i
at byte 0692161 is tagged “VV0v/ice” to show that it is a broken-off attempt to
utter the verb ice. The rightmost field gives the
grammatical analysis of the constructions, in the form of a labelled tree
structure which again uses the SUSANNE conventions. All tagmas are classified formally, with a capital letter
followed in some cases by lower-case subcategory letters: S stands for “main clause”, Nea
represents “noun phrase marked as first-person-singular and subject”, Ve labels
don’t know as a verb
group marked as negative.
Additionally, immediate constituents of clauses are classified
functionally, by a letter after a colon:
Nea:s in the first line shows that I is the subject of say, Fn:o in the third line shows that the nominal clause
(Fn) I don’t know where … is direct object of say.
Three-digit index numbers relate surface to logical structures. Thus where in the seventh line is marked as an
interrogative adverb phrase (Rq) having no logical role (:G) in its own clause
(that is, the clause headed by +’s gon, i.e. is going), but corresponding logically to an unspoken Place
adjunct (“p101”) within the infinitival clause +na (= to) get cake done. (The
character “y” in column 3 identifies a line which contains an element of the
structural analysis rather than a spoken word.)
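The field structure just described can be pictured as a simple record type. The sketch below is ours: the field names and their ordering are an illustrative reading of the description above, not the project's actual file layout.

```python
from dataclasses import dataclass

# One line of a CHRISTINE-style annotated corpus, on the assumption
# (ours, for illustration) that each line carries a byte-count location,
# a status character, a wordtag, the spoken form, and a parse field.
@dataclass
class AnnotatedLine:
    location: str   # byte count, e.g. "0692161"
    status: str     # "y" marks a line holding structure, not a spoken word
    wordtag: str    # e.g. "VV0v/ice": broken-off attempt at the verb "ice"
    word: str       # the spoken form, here the part-word "i"
    parse: str      # labelled-bracketing fragment of the tree structure

line = AnnotatedLine("0692161", "-", "VV0v/ice", "i", "[S[Ve...")
```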
Legal constraints
permitting, the CHRISTINE Corpus will be made freely available electronically,
after completion in December 1999, in the same way as the SUSANNE Corpus
already is.
Defining a rigorous,
predictable structural annotation scheme for spontaneous speech involves a
number of difficulties which are not only additional to, but often different in
kind from, those involved in defining such a scheme for written language. This paper examines various of these
difficulties. In some cases, our
project has already identified tentative annotation rules for addressing these
difficulties, and in these cases we shall mention the decision adopted; but in
other cases we have not yet been able to formulate any satisfactory solution. Even in cases where our project has
chosen a provisional solution, discussing this is not central to our aims in
the present paper. Our goal,
rather, is to identify the types of issue needing to be resolved, and to show
how devising an annotation scheme for speech involves problems of principle, of
a kind that would have been difficult to anticipate before undertaking the
task.
The Software
Engineering Precedent
The following pages will
examine a number of conceptual problems that arise in defining rigorous
annotation standards for spontaneous speech. Nothing will be said about computational technicalities, for
instance the possibilities of designing an automatic parser that could apply
such annotation, or the nature of the software tools used in our project to
support manual annotation. (The
project has developed a range of such tools, but we regard them as being of
interest only to ourselves.)
In our experience, some
computational linguists see a paper of this type as insubstantial and of limited
value in advancing the discipline.
While it is not for us to decide the value of our particular
contribution, as a judgement on a genre we see this attitude as profoundly
wrong-headed. To explain why, let
us draw an analogy with developments in industrial and commercial computing.
Writing programs and
watching them running is fun.
Coding and typing at keyboards are the programmer activities which are
most easy for IT managers to perceive as productive. For both these reasons, in the early decades of computing it
was common for software developers to move fairly quickly from taking on a new
assignment to drafting code – though, unless the assignment was trivially
simple, the first software drafts did not work. Sometimes they could be rescued through debugging – usually
a great deal of debugging.
Sometimes they could not:
the history of IT is full of cases of many-man-year industrial projects
which eventually had to be abandoned as irredeemably flawed without ever
delivering useful results.
There is nowadays a
computer science subdiscipline, software engineering (e.g. Sommerville 1992),
which has as one of its main aims the training of computing personnel to resist
their instincts and to treat coding as a low priority. Case studies have shown (Boehm 1981: 39–41) that the cost of curing programming mistakes rises massively, depending on how late they are caught in the process that begins with analysing a new
programming task and ends with maintenance of completed software. In a well-run modern software house,
tasks and their component subtasks are rigorously documented at progressively
more refined levels of detail, so that unanticipated problems can be detected
and resolved before a line of code is written; programming can almost be
described as the easy bit at the end of a project.
The subject-matter of
computational linguistics, namely human language, is one of the most complex
phenomena dealt with by any branch of IT.
To someone versed in modern industrial software engineering, which mainly
deals with structures and processes much simpler than any natural language, it
would seem very strange that our area of academic computing research could
devote substantially more effort to developing language-processing software
than to analysing in detail the precise specifications which software such as
natural-language parsers should be asked to deliver, and to uncovering hidden
indeterminacies in those specifications.
Accordingly, we make no bones about the data-oriented rather than
technique-oriented nature of the present paper. At the current juncture in computational linguistics,
consciousness-raising about problematic aspects of the subject-matter is a high
priority.
Wordtagging
One fundamental aspect
of grammatical annotation is classifying the grammatical roles of words in
context – wordtagging. The SUSANNE
scheme defined an alphabet of over 350 distinct wordtags for written English,
most of which are equally applicable to the spoken language though a few have
no relevance to speech (for instance, tags for roman numerals, or mathematical
operators). Spoken language also,
however, makes heavy use of “discourse items” (Stenström 1990) having pragmatic
functions with little real parallel in writing: e.g. well as an utterance initiator.
Discourse items fall into classes which in most cases are about as
clearly distinct as the classifications applicable to written words, and the
CHRISTINE scheme provides a set of discourse-item wordtags developed from
Stenström’s classification.
However, where words are ambiguous as between alternative discourse-item
classes, the fact that discourse items are not normally syntactically
integrated into wider structures means that there is little possibility of
finding evidence to resolve the tagging ambiguity.
Thus, three
discourse-item classes are Expletive (e.g. gosh), Response (e.g. ah), and Imitated Noise (e.g. glug glug).
Consider the following extracts from a sample in which children are
“playing horses”, one riding on the other’s back:
KPC.00999–1002 speaker
PS1DV: … all you can do is
<pause> put your belly up and I’ll go flying! … Go on then,
put your belly up! speaker PS1DR: Gung!
KPC.10977 Chuck a chuck a chuck
chuck! Ee ee! Go on then.
In the former case, gung is neither a standard English expletive,
nor an obviously appropriate vocal imitation of anything happening in the horse
game. Conversely, in the latter
case ee could equally
well be the standard Northern regional expletive expressing mildly shocked
surprise, or a vocal imitation of a “riding” noise. In many such cases, the analyst is forced by the current
scheme to make arbitrary guesses, yet clear cases of the discourse-item classes
are too distinct from one another to justify eliminating guesswork by
collapsing the classes into one.
Not all spoken words
posing tagging problems are discourse items. In:
KSU.00396–8
Ah ah! Diddums! Yeah.
any English speaker will
recognize the word diddums
as implying that the speaker regards the hearer as childish, but intuition does
not settle how the word should be tagged (noun? if so, proper or common?); and published dictionaries do not
help. To date we have formulated
no principled rule for choosing an analysis in cases like these.
Speech Repairs
Probably the most
crucial single area where grammatical standards developed for written language
need to be extended to represent the structure of spontaneous spoken utterances
is that of speech repairs. The
CHRISTINE repair annotation system draws on Levelt (1983) and Howell &
Young (1990, 1991), to our knowledge the most fully-worked-out and
empirically-based previously existing approach. This approach identified a set of up to nine repair
milestones within a repaired utterance, for instance the point at which the
speaker’s first grammatical plan is abandoned (the “moment of interruption”),
and the earlier point marking the beginning of the stretch of wording which
will be replaced by new wording after the moment of interruption. However, this approach is not fully
workable for many real-life speech repairs. In one respect it is insufficiently informative: the Levelt/Howell & Young notation
provides no means of showing how a local sequence containing a repair fits into
the larger grammatical architecture of the utterance containing it. In other respects, the notation proves
to be excessively rich: it
requires speech repairs to conform to a canonical pattern from which, in
practice, many repairs deviate.
Accordingly, CHRISTINE
embodies a simplified version of this notation, in which the “moment of interruption”
in a speech repair is marked (by a “#” sign within the stream of words), but no
attempt is made to identify other milestones, and the role of the repaired
sequence is identified by making the “#” node a daughter of the lowest labelled
node in a parse tree such that both the material preceding and the material
following the # are (partial) attempts to realize that category, and the mother
node fits normally into the surrounding structure. This approach works well for the majority of speech repairs,
e.g.:
KBJ.00943
That’s why I said [Ti:o
to get ma ba # , get you back then] …
KCA.02828
I’ll have to [VV0v#
cha # change ] it
In the KBJ case, to
get ma ba (in which ma and ba are truncated words, the former identified by the
wordtagging as too distorted to reconstruct and the latter as an attempt at back as an adverb), and get you back then, are successive attempts to produce an
infinitival clause (Ti) functioning as object (:o) of said.
In the KCA case, cha and change are
successive attempts to produce a single word whose wordtag is VV0v (base form
of verb having transitive and intransitive uses). In Figure 2, the “#” symbol is used at two levels in the
same speaker turn: speaker PS6RC
makes two attempts to realize a main clause (S), and the second attempt begins
with two attempts to pronounce the verb ice.
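The attachment rule for "#" can be caricatured as finding the deepest node that both attempts were trying to realize. The toy function below is ours: it works on root-to-leaf label paths and stands in for the analyst's judgement, not for the project's data format.

```python
# Toy sketch of the CHRISTINE attachment rule for the "#" repair marker:
# "#" hangs from the lowest labelled node such that both the abandoned
# attempt and the restart are (partial) attempts to realize that category.

def lowest_shared_node(path_before, path_after):
    """Given root-downwards label paths for the material before and after
    "#", return the deepest label they share -- the proposed mother of
    the "#" node."""
    shared = None
    for a, b in zip(path_before, path_after):
        if a != b:
            break
        shared = a
    return shared

# KBJ.00943: "to get ma ba" and "get you back then" are both attempts at
# the infinitival object clause Ti:o, so "#" attaches under Ti:o.
print(lowest_shared_node(["S", "Ti:o", "Vi"], ["S", "Ti:o", "N"]))  # → Ti:o
```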
However, although the
CHRISTINE speech-repair notation is less informative than the full
Levelt/Howell & Young scheme, and seems as simple as is consistent with
offering an adequate description of repair structure, applying it consistently
is not always straightforward. In
the first place, as soon as the annotation scheme includes any system for
marking speech repairs, analysts are obliged to decide whether particular
stretches of wording are in fact repairs or well-formed constructions, and this
is often unclear. Sampson (1998)
examined a number of indeterminacies that arise in this area; one of these is
between repairs and appositional structures, as in:
KSS.05002 she can’t be much cop
if she’d open her legs to a first date to a Dutch s- sailor
– where to a Dutch s-
sailor might be intended
to replace to a first date
as the true reason for objecting to the girl, but alternatively to a Dutch
s- sailor could be an
appositional phrase giving fuller and better particulars of the nature of her
offence. Annotation ought not
systematically to require guesswork, but it is hard to see how a neutral
notation could be devised that would allow the analyst to suspend judgment on
such a fundamental issue as whether a stretch of wording is a repair or a
well-formed construction.
Even greater problems
are posed by a not uncommon type of ill-formed utterance that might be called
“syntactically Markovian”, in which each element coheres logically with what immediately
precedes but the utterance as a whole is not coherent. The following examples come from the
London-Lund Corpus, with text numbers followed by first and last tone-unit
numbers for the respective extracts:
S.1.3 0901–3
… of course I would
be willing to um <pause => come into the common-room <pause => and
uh <pause – – –> in fact I would like nothing I would like
better [speaker is undergraduate, age ca 36,
describing interview for Oxbridge fellowship]
S.5.5 0539–45 and
what is happening <pause=> in Britain today <pause –> is ay- demand
for an entirely new foreign policy quite different from the cold war policy
<pause => is emerging from the Left [speaker is Anthony Wedgwood Benn MP on radio
discussion programme]
In the former example, nothing functions simultaneously as the last
uttered word of an intended sequence I would like nothing better and the first uttered word of an implied
sequence something like there is nothing I would like better.
In the latter, the long NP an
entirely new foreign policy quite different from the cold war policy appears to function both as the complement
of the preposition for,
and as subject of is emerging. In such cases one cannot
meaningfully identify a single point where one grammatical plan is abandoned in
favour of another. Because these
structures involve phrases which simultaneously play one grammatical role in
the preceding construction and a different role in the following construction,
they resist analysis in terms of tree-shaped constituency diagrams (or,
equivalently, labelled bracketing of the word-string). Yet constituency analysis is so solidly
established as the appropriate formalism for representing natural-language
structure in general that it seems unthinkable to abandon it merely in order to
deal with one special type of speech repair.
Logical Distinctions
Dependent on the Written Medium
There are cases where
grammatical category distinctions that are highly salient in written English
seem much less significant in the spoken language, so that maintaining them in
the annotation scheme arguably misrepresents the structure of speech. Probably the most important of these is
the direct/indirect speech distinction.
Written English takes great pains to distinguish clearly between direct
speech, involving a commitment to transmit accurately the quoted speaker’s
exact wording, and indirect speech which preserves only the general sense of
the quotation. The SUSANNE
annotation scheme uses categories which reflect this distinction (Q v.
Fn). However, the most crucial
cues to the distinction are orthographic matters such as inverted commas, which
lack spoken counterparts.
Sometimes the distinction can be drawn in spoken English by reference to
pronouns, verb forms, vocatives, etc.:
KD6.03060 … he says he
hates drama because the teacher takes no notice, he said one week Stuart was
hitting me with a stick and the teacher just said calm down you
boys …
– the underlined he (rather than I) implies that the complement of says is indirect speech; me implies that the passage beginning one
week is a direct
quotation, and the imperative form calm and vocative you boys imply that the teacher is quoted directly. But in practice these cues frequently
conflict rather than reinforcing one another:
KCT.10673 [reporting speaker’s own
response to a directly-quoted objection]:
I said well that’s his hard luck!
KCJ.01053–5 well
Billy, Billy says well take that and then he’ll come back
and then he er gone and pay that
In the KCT example, the
discourse item well and
the present tense of [i]s
after past-tense said
suggest direct speech, but his (which from the context denotes the objector) suggests indirect
speech. Likewise in the KCJ
example, well and the
imperative take imply
direct speech, he’ll
rather than I’ll implies
indirect speech. Arguably,
imposing a sharp two-way direct v. indirect distinction on speech is a
distortion; one might instead feel that speech uses a single construction for
reporting others’ utterances, though different instances may contain more or
fewer indicators of the relative directness of the report. On the other hand, logically speaking
the direct v. indirect speech distinction is so fundamental that an annotation
scheme which failed to recognize it could seem unacceptable. (To date, CHRISTINE analyses retain the
distinction.)
Nonstandard Usage
Real-life British speech
contains many differences from standard usage with respect to both individual
words and syntactic patterns.
In the case of
wordtagging, the SUSANNE rule (Sampson 1995: §3.67) was that words used in ways
characteristic of nonstandard dialects are tagged in the same way as the words
that would replace them in standard English. This rule was reasonable in the context of written English,
where nonstandard forms are a peripheral nuisance, but it quickly became
apparent within the CHRISTINE project that the rule is quite impractical for
analysing spontaneous speech which contains a high incidence of such
forms. For CHRISTINE, this particular
rule has been reversed; in general, words used in nonstandard grammatical
functions are given the same wordtags as their standard uses, but the phrases
containing them are tagged in accordance with their grammatical function in
context.
This revised rule tends
to be unproblematic for pronouns and determiners, thus in:
KP4.03497
it’s a bit of fun, it livens up me day
KCT.10705
she told me to have them plums
the underlined words are
wordtagged as object pronouns (rather than as my, those), but the phrases headed by day and plums are tagged as noun phrases. It is more difficult to specify a
predictable way to apply such a rule in the case of nonstandard uses of strong
verb forms, where the word used nonstandardly is head of a phrase requiring a
tag of its own. Standard base
forms can be used in past contexts, e.g.:
KCJ.01096–8
a man bought a horse and give it to her, now it’s won the race
and the solution of
phrasetagging such an instance as a past-tense verb group (Vd) is put into
doubt because frequently nonstandard English omits the auxiliary of the
standard perfective construction, suggesting that give might be replacing given rather than gave; cf.:
KCA.02536 What I done, I
taped it back like that.
KCA.02572 What it is, when you got
snooker on and just snooker you’re quite <pause> content to watch it …
Eisikovits (1987: 134)
argues in effect that the tense system exemplified in clauses like What I
done is the same as that
of standard English, but that a single form done is used for both past tense and past participle in
the nonstandard dialect (in the same way that single forms such as said, allowed are used for both functions in the standard language,
in the case of many other verbs); I done here would correspond straightforwardly to standard I
did. (Eisikovits’s article is based on data
from an Australian urban dialect, but, as Trudgill & Chambers (1991: 52)
rightly point out, the facts are similar for many UK dialects.) But Eisikovits’s analysis seems to
overlook cases like the you got snooker on example (which are quite common in our material)
where got clearly
corresponds to standard have got, meaning “have”, and not to a past tense.
It is quite impractical
for annotation to be based on fully adequate grammatical analyses of each
nonstandard dialect in its own terms; but it is not easy to specify consistent
rules for annotating such uses as deviations from the known, standard
dialect. The CHRISTINE project has
attempted to introduce predictability into the analysis of cases such as those
just discussed, by recognizing an extra nonstandard-English “tense” realized as
past participle not preceded by auxiliary, and by ruling (as an exception to
the general rule quoted earlier) that any verb form used in a nonstandard
structure with past reference will be classified as a past participle (thus give in the KCJ example above is wordtagged as
a nonstandard equivalent of given). This
approach does work well for many cases, but it remains to be seen whether it
deals satisfactorily with all the usages that arise.
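The ruling just stated can be rendered as a one-case override. The sketch below is purely illustrative: the boolean "features" stand in for the analyst's judgements, and the tag "VVNv" is our shorthand for the past-participle counterpart of a VV0v form, used here on the assumption that SUSANNE-style tags pair base and participle forms in this way.

```python
# Toy rendering of the CHRISTINE ruling for nonstandard past forms: a verb
# used in a nonstandard structure with past reference is classified as a
# past participle, whatever its surface form.

def tag_nonstandard_verb(surface_tag, nonstandard_structure, past_reference):
    """Apply the past-participle override; otherwise keep the surface tag."""
    if nonstandard_structure and past_reference:
        return "VVNv"   # treated as nonstandard equivalent of a participle
    return surface_tag

# KCJ.01096-8 "a man bought a horse and give it to her":
# "give" (surface VV0v) is reclassified as an equivalent of "given".
print(tag_nonstandard_verb("VV0v", True, True))   # → VVNv
```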
At the syntactic level,
an example of a nonstandard construction requiring adaptation of the
written-English annotation scheme would be relative clauses containing both
relative pronoun and undeleted relativized NP, unknown in standard English but
usual in various nonstandard dialects, e.g.:
KD6.03075
… bloody Colin who, he borrowed his computer that time, remember?
Here the CHRISTINE
decision is to treat the relativized NP (he) as appositional to the relative pronoun. For the case quoted, this works; but it
will not work if a case is ever encountered where the relativized element is
not the subject of the relative clause.
Examples like this raise the question of what it means to specify
consistent grammatical annotation standards applicable to a spectrum of
different dialects, rather than a single dialect. Written English usually conforms more or less closely to the
norms of the national standard language, so that grammatical dialect variation
is marginal and annotation standards can afford to ignore it. In the context of speech, it cannot be
ignored, but the exercise of specifying annotation standards for unpredictably
varying structures seems conceptually confused.
Dialect Difference v.
Performance Error
Special problems arise
in deciding whether a turn of phrase should be annotated as well-formed with
respect to the speaker’s nonstandard dialect, or as representing standard usage
but with words elided as a performance error. Speakers often do omit necessary words, e.g.:
KD2.03102–3 There’s
one thing I don’t like <pause> and that’s having my photo taken. And it will be hard when we have to
photos.
– it seems safe to
assume that the speaker intended something like have to show photos.
One might take it that a similar process explains the underlined words
in:
KD6.03154 oh she was shouting at
him at dinner time <shift shouting> Steven <shift> oh god dinner
time she was shouting him.
where at is missing; but this is cast in doubt when
other speakers, in separate samples, are found to have produced:
KPC.00332 go in the sitting room
until I shout you for tea
KD2.02798 The spelling mistakes
only occurred when <pause> I was shouted.
– this may add up to
sufficient evidence for taking shout to have a regular transitive use in nonstandard
English.
This problem is particularly
common at the ends of utterances, where the utterance might be interpreted as
broken off before it was grammatically complete (indicated in the SUSANNE
scheme by a “#” terminal node as last daughter of the root node), but might
alternatively be an intentional nonstandard elision. In:
KE2.08744 That’s right, she said
Margaret never goes, I said well we never go for lunch out, we hardly ever
really
the words we hardly
ever really would not
occur in standard English without some verb (if only a placeholding do), so the sequence would most plausibly be
taken as a broken-off utterance of some clause such as we hardly ever really
go out to eat at all; but
it is not difficult to imagine that the speaker’s dialect might allow we
hardly ever really for
standard we hardly ever do really, in which case it would be misleading to include the
“#” sign.
It seems inconceivable
that a detailed annotation scheme could fail to distinguish difference of
dialect from performance error; indeed, a scheme which ignored this distinction
might seem offensive. But analysts
will often in practice have no basis for applying the distinction to particular
examples.
Transcription Inadequacies
One cannot expect every
word of a sample of spontaneous speech recorded in field conditions to be
accurately transcribable from the recordings. Our project relies on transcriptions produced by other
researchers, which contain many passages marked as “unclear”; the same would
undoubtedly be true if we had chosen to gather our own material. A structural annotation system needs to
be capable of assigning an analysis to a passage containing unclear segments;
to discard any utterance or sentence containing a single unclear word would
require throwing away too many data, and would undesirably bias the retained
collection of samples towards utterances that were spoken carefully and may
therefore share some special structural properties.
The SUSANNE scheme uses
the symbol Y to label nodes dominating stretches of wholly unclear speech, or
tagmas which cannot be assigned a grammatical category because they contain
unclear subsegments that make the categorization doubtful. This system is unproblematic, so long
as the unclear material in fact consists of one or more complete grammatical
constituents. Often, however, this
is not so; e.g.:
KCT.10833
Oh we didn’t <unclear> to drink yourselves.
Here it seems sure that
the unclear stretch contained multiple words, beginning with one or more words
that complete the verb group (V) initiated by didn’t; and the relationship of the words to
drink yourselves to the
main clause could be quite different, depending on what the unclear words
were. For instance, if the unclear
words were give you anything, then to drink
would be a modifying tagma within an NP headed by anything; on the other hand, if the unclear stretch
were expect you, then to
drink would be the head of
an object complement clause.
Ideally, a grammatical annotation scheme would permit all the clear
grammar to be indicated, but allow the analyst to avoid implying any decision
about unresolvable issues such as these.
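The ambiguity can be made concrete with a small sketch. The nested-list encoding of labelled bracketings below, and the flattening helper, are our own illustrative devices rather than part of the SUSANNE scheme, and the labels are simplified from the full annotation; the two structures correspond to the two hypothetical completions just discussed.

```python
# Two candidate analyses of KCT.10833, assuming the unclear stretch was
# either "give you anything" or "expect you".  A bracketing is encoded as
# a nested Python list [label, child, child, ...]; leaves are word strings.
# (Enclitic n't is written "+n't", following the corpus convention.)

# Reading 1: "... give you anything to drink yourselves" -- here "to drink"
# heads a modifying infinitival clause inside an NP headed by "anything".
reading1 = ["S", "we", ["V", "did", "+n't", "give"], "you",
            ["N", "anything", ["Ti", "to", "drink", "yourselves"]]]

# Reading 2: "... expect you to drink yourselves" -- here "to drink" heads
# an object-complement clause, a sister of the object "you".
reading2 = ["S", "we", ["V", "did", "+n't", "expect"], "you",
            ["Ti", "to", "drink", "yourselves"]]

def leaves(node):
    """Terminal words of a bracketing, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

print(" ".join(leaves(reading1)))
print(" ".join(leaves(reading2)))
```

The two trees share the clear wording but assign *to drink yourselves* to quite different positions, which is exactly why no single analysis can be asserted.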
Given that clear grammar
is represented in terms of labelled bracketing, however, it is very difficult
to find usable notational conventions that avoid commitment about the
structures to which unclear wording contributes. Our best attempt so far at defining notational conventions
for this area is a set of rules which prescribe, among other things, that the Y
node dominating an inaudible stretch is attached to the lowest node that clearly
dominates at least the first inaudible word, and that clear wording following
an inaudible stretch is attached to the Y node above that stretch if the clear
wording could be part of some unknown grammatical constituent that is initiated
within the inaudible stretch (even if it could equally well not be).
These conventions are
reasonably successful at enabling analysts to produce annotations in a
predictably consistent way; but they have the disadvantage that many structures
produced are undoubtedly different from the grammatical structures of the
wording actually uttered. For
instance, in the example above, the Y above the unclear stretch is made a
daughter of the V dominating didn’t, because that word will have been followed by an
unclear main verb; and to drink yourselves is placed under the Y node, because any plausible
interpretation of the unclarity would make the latter words part of a tagma
initiated within the unclear stretch.
Yet there is no way that to drink yourselves could really be part of a verb group tagma
beginning with didn’t.
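The tree actually assigned under these attachment conventions can be sketched in the same spirit. The nested-list encoding and helper function are our own illustrative devices, and the labels are simplified from the full SUSANNE annotation.

```python
# Tree assigned to KCT.10833 under the attachment conventions: the Y node
# for the unclear stretch hangs beneath the V begun by "didn't", and the
# clear words "to drink yourselves" are placed beneath that Y node, even
# though no real verb group could contain them.
convention_tree = ["S", "oh", "we",
                   ["V", "did", "+n't",
                    ["Y", "<unclear>", "to", "drink", "yourselves"]]]

def terminal_words(node):
    """Terminal words of a bracketing, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in terminal_words(child)]

print(" ".join(terminal_words(convention_tree)))
```

The word sequence is recovered intact, but the dominance relations are, as the text notes, partly artefacts of the conventions rather than claims about the structure the speaker produced.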
Provided that users of
the Corpus bear in mind that a tree structure which includes a Y node makes
only limited claims about the actual structure produced by the speaker, these
conventions are not misleading. But
at the same time they are not very satisfying.
Conclusion
In annotating written
English, where one is drawing on an analytic tradition evolved over centuries,
it seems on the whole to be true that most annotation decisions have definite
answers; where some particular example is vague between two categories, these
tend to be subcategories of a single higher-level category, so a neutral
fallback annotation is available.
(Most English noun phrases are either marked as singular or marked as
plural, and the odd exceptional case such as the fish can at least be classified as a noun
phrase, unmarked for number.) One
way of summarizing many of the problems outlined in the preceding sections is
to say that, in annotating speech, whose special structural features have had
little influence on the analytic tradition, ambiguities of classification
constantly arise that cut across traditional category schemes. In consequence, not only is it often
difficult to choose a notation which attributes specific properties to an
example; unlike with written language, it is also often very difficult to
define fallback notations which enable the annotator to avoid attributing
properties for which there is no evidence, while allowing what can safely be
said to be expressed.
Some members of the
research community may be tempted to feel that a paper focusing on these
problems ranks as self-indulgent hand-wringing in place of serious effort to
move the discipline forward. We
hope that our earlier discussion of software engineering will have shown why
that feeling would be misguided.
Nothing is easier and more appealing than to plunge into the work of
getting computers to deliver some desired behaviour, leaving conceptual
unclarities to be sorted out as and when they arise. Huge quantities of industrial resources have been wasted
over the decades through allowing IT workers to adopt that approach. Natural language processing was one of
the first application areas ever proposed for computers (by Alan Turing in 1948
– Hodges 1983: 382); fifty years later, the level of success of NLP software
(while not insignificant) does not suggest that computational linguistics can
afford to go on ignoring lessons that have already been painfully learned by
more central sectors of the IT industry.
Effort put into
automatic analysis of natural language implies a prior requirement for serious
effort devoted to defining and debugging detailed standard schemes of
linguistic analysis. Our SUSANNE
and CHRISTINE projects have been and are contributing to this goal, but they
are no more than a beginning. We
urge other computational linguists to recognize this area as a priority.
Acknowledgment
The research reported
here was supported by grant R000 23 6443, “Analytic Standards for Spoken
Grammatical Performance”, awarded by the Economic and Social Research Council
(UK).
REFERENCES
Boehm, B.W. 1981. Software Engineering Economics. Prentice-Hall.
Edwards, Jane A. 1992. “Design principles in the transcription of spoken discourse”. In J. Svartvik, ed., Directions in Corpus Linguistics. Mouton de Gruyter.
Eisikovits, Edina. 1987. “Variation in the lexical verb in Inner-Sydney English”. Australian Journal of Linguistics 7.1–24; our page reference is to the reprint in Trudgill & Chambers (1991).
Hodges, A. 1983. Alan Turing: The Enigma of Intelligence. Burnett Books.
Howell, P. & K. Young. 1990. “Speech repairs: report of work conducted October 1st 1989 – March 31st 1990”. Department of Psychology, University College London.
Howell, P. & K. Young. 1991. “The use of prosody in highlighting alterations in repairs from unrestricted speech”. Quarterly Journal of Experimental Psychology 43A.733–58.
Langendoen, D.T. 1997. Review of Sampson (1995). Language 73.600–3.
Leech, G.N. & Elizabeth Eyes. 1997. “Syntactic annotation: treebanks”. Ch. 3 of R.G. Garside et al., eds., Corpus Annotation. Longman.
Levelt, W.J.M. 1983. “Monitoring and self-repair in speech”. Cognition 14.41–104.
Mason, O. 1997. Review of Sampson (1995). International Journal of Corpus Linguistics 2.169–72.
Sampson, G.R. 1995. English for the Computer. Clarendon Press (Oxford).
Sampson, G.R. 1998. “Consistent annotation of speech-repair structures”. In A. Rubio et al., eds., Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 28–30 May 1998, vol. 2.
Sommerville, I. 1992. Software Engineering (4th ed.). Addison-Wesley (Wokingham, Berks.).
Stenström, Anna-Brita. 1990. “Lexical items peculiar to spoken discourse”. In Svartvik (1990).
Svartvik, J., ed. 1990. The London-Lund Corpus of Spoken English. Lund University Press.
Trudgill, P. & J.K. Chambers, eds. 1991. Dialects of English. Longman.
Figure 1
P Prepositional Phrase
§ 4.234 By far the commonest type of P consists of a preposition followed by an N, D, or M.
to [Ns the earth ] G04:0070
behind [D any of these tree-clumps ] G04:0530
in [M a hundred ] A05:0330
Since P is a phrase category, a one-word N, D, or M within a P will not be phrasetagged, unless an N node is required to carry the Nn subcategory – cf. § 4.140: [P in water_NN1u ], [P for [Nns Victor_NP1m ] ].
§ 4.235 On occasion, the complement of a preposition is something other than an N, D, or M; for instance it may be a genitive:
this favourite bar [Po of [G Granville’s ] ] P25.115
experiences [Po of [G my own ] ] G22:1680
or another P (see the examples quoted in § 4.10 above).
§ 4.236 A P may lack any overt complement, because the logical complement has been removed by a grammatical movement rule (in this case the logical grammar annotation system will represent the missing complement by a “ghost node”):
jumping from the chair [Fr she sat [P in ] ] (Leigh Hunt)
[Ns a magical cavern, [Tn [Vn presided ] [P over ] [Pb by the Mistress of the Copper Mountain ] Tn] Ns] C05.102
Much as they had [Ti to look forward [P to ] Ti], … N13:0380
I don’t know what you’re [P up_to_II= ], … N12:0810
§ 4.237 It is common for the complement of a P to be a Tg, e.g.:
[P Because_of [Tg [Ns its important game with Arkansas ] coming up Saturday ] ], <bbold> Baylor <ebold> worked out … A12:1130
… kept him [P from [Tg reaching inside his coat for his gun ] ]. N10:1230
and this analysis is used even where the introductory word is one which has uses as a subordinating conjunction as well as prepositional uses, so that analysis as a finite clause might in principle be considered:
[P Before_ICSt [Tg leaving for India ] ], Curzon came … G08.082
Consumer spending edged down in April [P after_ICSt [Tg rising for two consecutive months Tg] P], … A28:1740
We think this is a lesser risk, however, [P than_CSN [Tg having a pupil [Tb get to a corner and forget how to get round it, … Tb] Tg] P]. E13.095
– such sequences are not parsed as [Fa before [Vg leaving ] … ], etc.
§ 4.238 The solidus (“slash”) is wordtagged IIp and begins a P when unmistakably representing per:
the units being [Nu kg [P +/ [Nu +sq. mm. ] P] Nu]; [S@ English eyes would have preferred [Nup tons [P +/ [Nu +sq. in. ] P] Nup] S@] J77.041
– contrast this with true stress/strain curves J77.037, where the curves may be graphs of stress “against” strain but where the phrase can equally be read as “stress and strain”: in such a case the solidus is wordtagged CC:
[Np true [NN1n& stress [NN1n+ +/ +strain ] ] curves ] J77.037
§ 4.239 A P will not normally include a postmodifier after the N (or other) complement:
… were [P much like ordinary folk ] [Ti to look at ]; … G21.032
I was [P in time ] [Ti to hear the charge, … ] G23.054
– the Ti’s are not analysed as daughters of the P nodes.
§ 4.240 The word like in its prepositional use, wordtagged ICSk, introduces a sequence tagmatagged P even in a case such as:
the … movements seem [P like [Np facets of one personality ] ]; … G42.066
– where a P introduced by a different preposition could scarcely occur as complement of a copular verb.
§ 4.241 For P following comparative as, see §§ 4.321-2 below.
Premodified
P
§ 4.242 A P phrase may include a measure expression preceding the head, as may an Fa clause (§ 4.304 below). The rationale is the same in both cases: logically, such an expression might best be seen as modifying the preposition (in the P case) or the conjunction (in the Fa case), rather than the entire clause or phrase; but the SUSANNE scheme has no category of phrase consisting of an II… or CS… word together with a modifier, so instead these measure expressions are treated as immediate constituents of the Fa or P tagma.
[P [Np Three minutes ] from [Ns the end ] ] a typical bit of Woosnam Soccer technique laid on … A22.142
Figure 2
April
1992, location South Shields, locale home, activity conversation
PS6RC:
f, 72, dial Lancashire, Salvation Army, educ X, class UU
PS6R9:
m, 45, dial Lancashire, unemployed, educ X, class DE
----- PS6RC
0691908 04779 - PPIS1 I [S[Nea:s.Nea:s]
0691917 04779 - VV0v say [V.V]
0691928 04779 - PPIS1 I [Fn:o[Nea:s.Nea:s]
0691937 04779 - VD0 do [Ve.
0691946 04779 - XX +n't .
0691957 04779 - VV0v know .Ve]
0691969 04779 - RRQq where [Fn?:o[Rq:G101.Rq:G101]
0691983 04779 - PPHS1f she [Nas:s.Nas:s]
0691993 04779 - VBZ +'s [Vzut.
0692003 04779 - VVGK gon .Vzut]
0692013 04779 - TO +na [Ti:z[Vi.
0692040 04779 - VV0v get .Vi]
0692051 04779 - NN1n cake [Ns:o.Ns:o]
0692064 04779 - VDN done [Tn:j[Vn.Vn]Tn:j]
0000000 04779 y YG p101 .Ti:z]Fn?:o]
0692076 04779 - RRs yet [Rs:t.Rs:t]Fn:o]S]
0692087 04779 - ? <unclear> [Y.Y]
0692097 04779 - PPY you [S[Ny:s.Ny:s]
0692108 04779 - VMo ca [Vce.
0692117 04779 - XX +n't .
0692128 04779 - VV0v ice .Vce]
0692144 04779 - AT1 a [Ns:o.Ns:o]
0000000 04779 y YR # .
0692161 04779 t VV0v/ice i [V[VV0v#.
0000000 04779 y YR # .
0692179 04779 - VV0v ice .VV0v#]V]
0692194 04779 - AT1 a [Ns:o.
0692203 04779 - NN1n cake .Ns:o]
0692215 04779 - CSi if [Fa:c.
0692226 04779 - PPY you [Ny:s.Ny:s]
0692237 04779 - VH0 have [Vef.
0692248 04779 - XX +n't .
0692259 04779 - VVNv got .Vef]
0692270 04779 - MC1 one [Ms:o.Ms:o]Fa:c]S]
----- PS6R9
0692357 04780 - VH0 have [S?[Vo.Vo]
0692369 04780 - PPY you [Ny:s.Ny:s]
0692380 04780 - VVNv seen [Vrn.Vrn]
0692392 04780 - ? <unclear> [Y:o.Y:o]S?]
0692402 04780 - NN1c mum [Nns".Nns"]
0692421 04780 - ? <unclear> [Y.Y]
0692450 04781 - ? <unclear> [Y.Y]
0692460 04781 - NN1n cake [Ns.
0692472 04781 - NN1c stand .Ns]
0692514 04781 - PPH1 it [S[Ni:s.Ni:s]
0692523 04781 - VBZ +'s [Vzb.Vzb]
0692534 04781 - II at [P:p.
0692544 04781 - AT the [Ns.
0692555 04781 - NN1c back .Ns]P:p]S]
0692575 04781 - PPH1 it [S[Ni:s.Ni:s]
0692584 04781 - VBZ +'s [Vzb.Vzb]
0692594 04781 - II on [P:p.
0692604 04781 - AT1 a [Ns.
0692614 04781 - NN1c poster .Ns]P:p]S]
0692664 04782 - PPH1 it [S[Ni:s.Ni:s]
0692673 04782 - VBZ +'s [Vzb.Vzb]
0692683 04782 - AT1 a [Ns:e.
0692692 04782 - JJ big .
0692703 04782 - MC three [Ns.
0692716 04782 - NN1c tier .Ns]
0692729 04782 - NN1n cake .Ns:e]S]