The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.
Published in Natural Language Engineering 4.281–3, 1998.
Yoshinori Sagisaka, Nick Campbell and Norio Higuchi, editors. Computing Prosody: Computational Models for Processing Spontaneous Speech.
New York: Springer. 1997. ISBN 0-387-94804-X [Price not specified.] xvii + 401 pages.
This book is the proceedings of a workshop
held in Kyoto in Spring, 1995, on “Computational Approaches to Processing the
Prosody of Spontaneous Speech”. It
contains 23 chapters; in terms of their authors’ institutional affiliation,
about half come from Japanese sites, and the others divide as follows: USA five, Germany and the Netherlands
two each, Britain, France, and Sweden one each. The separate chapters are grouped under four headings: “The Prosody of Spontaneous Speech”,
“Prosody and the Structure of the Message”, “Prosody in Speech Synthesis”, and
“Prosody in Speech Recognition”.
Each of these four parts is prefaced by an introduction; several of
these introductory chapters, though brief, are among the more enlightening
elements of the book, not limiting themselves to surveying the following
chapters but offering clear thumbnail sketches of the current intellectual
situation within the respective topic area.
In the opening chapter, for instance, Robert
Ladd points out that a chief difficulty associated with the workshop topic is a
special case of a problem inherent in all studies of spontaneous speech: the need for data to be natural
conflicts with the need for them to be rigorously scientifically controlled,
and Ladd believes that this longstanding tension, far from finding solutions,
is becoming more acute in the present phase of research. His point is reinforced in ch. 2, by
Mary Beckman on “A typology of spontaneous speech”, where the author sums up by
pointing out that this problem has caused her chapter to be “as much a typology
of elicitation paradigms as it has been of spontaneous speech phenomena”. (It is striking that several of these
authors express a need for bodies of data on emotional speech; this need should
be partly met by the Emotion in Speech corpus recently compiled at Reading and
Leeds Universities, see http://midwich.reading.ac.uk/research/speechlab/emotion/.)
The book is one destined to be bought by
libraries rather than by individuals; some chapters are essentially site or
project reports rather than finished analyses of particular scientific
questions. It would probably not be useful for this review to devote
separate remarks to each successive chapter. Instead, I shall
take up a couple of themes which run through several chapters and which may
have wider resonances for readers of this journal.
Chapters in the synthesis part by Nick
Campbell, Jan van Santen, and Hiroaki Kato et al. all bear on the issue of
timing of the spoken segment sequence, with particular relevance to the
possibility of “concatenative” speech synthesis. Campbell distinguishes three fundamental approaches to the
synthesis task: articulatory
synthesis, which generates waveforms via models of the vocal tract; formant
synthesis, which directly models the acoustics of the waveform; and concatenative
synthesis, which constructs novel utterances from prerecorded segments of real
speech. Campbell sees the
concatenative approach as in principle the best way to get natural-sounding speech,
because it captures fine detail which escapes our articulatory or acoustic
models, but he also describes it as the most difficult approach, because
prerecorded segments must be subjected to complex transformations to turn them
into a sequence which is prosodically consistent and appropriate in its
context. Campbell finds that
segment durations, for instance, exhibit quite different patterns of variance
as between read speech and spontaneous speech.
Traditional phoneticians held that segment
durations were determined partly by constraints which stretched out or
compressed individual segments in order to maintain regular overall durations
of larger linguistic units.
According to Abercrombie (1967: 97), for instance, all languages are
either syllable-timed languages (such as French) or stress-timed languages
(such as English), in which successive syllables or successive feet,
respectively, occupy equal times in an utterance. (This widely-held view was questioned by Roach 1982.) Van Santen points out that
concatenative synthesis requires that all articulatory parameters of a
segment-token are stretched or compressed to the same extent, and he reports
research showing that this constraint is observed in practice; however, his
group’s findings also imply that, in American English and Mandarin Chinese,
units such as syllables and feet are not relevant to the rate at which segments
are uttered. If syllables tend to
be pronounced faster in longer words, van Santen suggests that this may be
because there is more phonetic redundancy in long words, rather than because
English is “stress-timed”.
Kato et al., on the other hand, report
acceptability experiments which suggest that Japanese may indeed be
syllable-timed (or, strictly, “mora-timed”), so that lengthening a consonant
sounds best if compensated by shortening of the following vowel, and vice
versa.
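The kind of compensation at issue can be made concrete with a minimal sketch. It is an illustration only and is not taken from Kato et al.: the function name, the millisecond durations, and the 20 ms floor are all assumptions chosen for the example. The idea is simply that a change in consonant duration is offset by an equal and opposite change in the following vowel, so that the duration of the consonant-vowel unit as a whole stays constant.

```python
def compensate(consonant_ms, vowel_ms, delta_ms, floor_ms=20.0):
    """Change a consonant's duration by delta_ms and offset the following vowel.

    Hypothetical illustration of mora-level compensation: the total
    consonant + vowel duration is preserved. Both input durations are
    assumed to be at least floor_ms (an arbitrary minimum, not a
    measured value).
    """
    # The consonant can grow only by what the vowel can give up,
    # and shrink only down to the floor.
    max_increase = vowel_ms - floor_ms
    max_decrease = consonant_ms - floor_ms
    applied = max(-max_decrease, min(delta_ms, max_increase))
    return consonant_ms + applied, vowel_ms - applied


if __name__ == "__main__":
    # Lengthen an 80 ms consonant by 30 ms; the 100 ms vowel shrinks to match.
    print(compensate(80.0, 100.0, 30.0))
```

Whether listeners actually prefer such exact one-for-one compensation is an empirical matter which the sketch does not address; it shows only what strict constancy of the unit's duration would amount to.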
Another issue running through many of the
chapters is the status of the ToBI prosodic labelling system. ToBI was promulgated a few years ago as
a standard annotation scheme for English prosody (Silverman et al. 1992, and
see http://ling.ohio-state.edu/Phonetics/ToBI/Tobi.html), based
on ideas of Janet Pierrehumbert about the structure of American English
intonation (Pierrehumbert 1980).
Very quickly it has become a de facto standard for representing prosody
in all languages studied by speech scientists, though its inventors explicitly
did not claim it to be universally applicable. Several authors in this book acknowledge that ToBI was
designed for American English but nevertheless see it as uncontroversial to
extend it to different languages (Swedish in the chapter by Gösta Bruce et al.,
German in the chapter by Wolfgang Hess et al., Japanese in Alan Black’s
chapter).
The trouble with this is that ToBI is not a
flexible, general-purpose descriptive scheme designed to provide a
representation for any physically possible intonation pattern. It appears to embody strong theoretical
claims about prosody in American English, and there is no a priori reason to
expect other languages to obey its principles even if American English does
so. A number of British
phoneticians, notably Francis Nolan of Cambridge (Nolan & Grabe 1997), have
argued that ToBI represents an uneasy compromise between phonetic and phonemic
transcription, and cannot be made to work satisfactorily even for English in
some of its British dialects. It
may be that British phoneticians are readier than others to query the validity
of ToBI because Britain was a nation with a well-established prior tradition of
prosody annotation (see e.g. O’Connor & Arnold 1961) that embodied very
different claims about intonation structure, though this has now been
completely eclipsed. In this book,
only Alan Black mentions any alternative to ToBI, and he does not seek to
choose between the alternatives.
It seems possible that international speech science is heading for a
situation where it is in practice obligatory to represent prosody in all
languages in terms of a system that is suitable only for one dialect of one
language. It would be reassuring
to see some non-Anglophone speech scientists test the validity of ToBI before
its position becomes wholly unassailable.
The book contains worthwhile material on many
further topics which there is no space to discuss here. It offers a good way to survey what is
happening internationally in the research area which it covers. The book is well produced physically,
though sometimes marred by careless copy-editing. (A French title in a bibliography even uses accented
characters that do not exist in French.)
REFERENCES
Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Nolan, F. & Grabe, E. (1997). Can ToBI transcribe intonational variation in the British Isles? Proceedings of the European Speech Communication Association Workshop on Intonation, 18-20 September 1997. Athens.
O’Connor, J.D. & Arnold, G.F. (1961). Intonation of Colloquial English. London: Longman.
Pierrehumbert, J.B. (1980). The Phonology and Phonetics of English Intonation. Bloomington, Indiana: Indiana University Linguistics Club.
Roach, P.J. (1982). On the distinction between “stress-timed” and “syllable-timed” languages. In Crystal, D. (ed.), Linguistic Controversies, pp. 73-9. London: Arnold.
Silverman, K., et al. (1992). ToBI: a standard for labelling English prosody. In Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 867-70. Banff, Alberta.
Geoffrey Sampson
University of Sussex