Sampson on Sagisaka/Campbell/Higuchi, Computing Prosody

Yoshinori Sagisaka, Nick Campbell and Norio Higuchi, editors. Computing Prosody: Computational Models for Processing Spontaneous Speech. New York: Springer. 1997. ISBN 0-387-94804-X [Price not specified.] xvii + 401 pages.

This book is the proceedings of a workshop held in Kyoto in Spring, 1995, on “Computational Approaches to Processing the Prosody of Spontaneous Speech”. It contains 23 chapters; in terms of their authors’ institutional affiliation, about half come from Japanese sites, and the others divide as follows: USA five, Germany and the Netherlands two each, Britain, France, and Sweden one each. The separate chapters are grouped under four headings: “The Prosody of Spontaneous Speech”, “Prosody and the Structure of the Message”, “Prosody in Speech Synthesis”, and “Prosody in Speech Recognition”. Each of these four parts is prefaced by an introduction; several of these introductory chapters, though brief, are among the more enlightening elements of the book, not limiting themselves to surveying the following chapters but offering clear thumbnail sketches of the current intellectual situation within the respective topic area.

In the opening chapter, for instance, Robert Ladd points out that a chief difficulty associated with the workshop topic is a special case of a problem inherent in all studies of spontaneous speech: the need for data to be natural conflicts with the need for them to be rigorously scientifically controlled, and Ladd believes that this longstanding tension, far from finding solutions, is becoming more acute in the present phase of research. His point is reinforced in ch. 2, by Mary Beckman on “A typology of spontaneous speech”, where the author sums up by pointing out that this problem has caused her chapter to be “as much a typology of elicitation paradigms as it has been of spontaneous speech phenomena”. (It is striking that several of these authors express a need for bodies of data on emotional speech; this need should be partly met by the Emotion in Speech corpus recently compiled at Reading and Leeds Universities, see http://midwich.reading.ac.uk/research/speechlab/emotion/.)

The book is one destined to be bought by libraries rather than by individuals; some chapters are essentially site or project reports rather than finished analyses of particular scientific questions. It would probably not be useful for this review to devote separate remarks to each successive chapter. Rather than that, I shall take up a couple of themes which run through several chapters and which may have wider resonances for readers of this journal.

Chapters in the synthesis part by Nick Campbell, Jan van Santen, and Hiroaki Kato et al. all bear on the issue of timing of the spoken segment sequence, with particular relevance to the possibility of “concatenative” speech synthesis. Campbell distinguishes three fundamental approaches to the synthesis task: articulatory synthesis, which generates waveforms via models of the vocal tract; formant synthesis, which directly models the acoustics of the waveform; and concatenative synthesis, which constructs novel utterances from prerecorded segments of real speech. Campbell sees the concatenative approach as in principle the best way to get natural-sounding speech, because it captures fine detail which escapes our articulatory or acoustic models, but he also describes it as the most difficult approach, because prerecorded segments must be subjected to complex transformations to turn them into a sequence which is prosodically consistent and appropriate in its context. Campbell finds that segment durations, for instance, exhibit quite different patterns of variance as between read speech and spontaneous speech.

Traditional phoneticians held that segment durations were determined partly by constraints which stretched out or compressed individual segments in order to maintain regular overall durations of larger linguistic units. According to Abercrombie (1967: 97), for instance, all languages are either syllable-timed languages (such as French) or stress-timed languages (such as English), in which successive syllables or successive feet, respectively, occupy equal times in an utterance. (This widely-held view was questioned by Roach 1982.) Van Santen points out that concatenative synthesis requires that all articulatory parameters of a segment-token are stretched or compressed to the same extent, and he reports research showing that this constraint is observed in practice; however, his group’s findings also imply that, in American English and Mandarin Chinese, units such as syllables and feet are not relevant to the rate at which segments are uttered. If syllables tend to be pronounced faster in longer words, van Santen suggests that this may be because there is more phonetic redundancy in long words, rather than because English is “stress-timed”.

Kato et al., on the other hand, report acceptability experiments which suggest that Japanese may indeed be syllable-timed (or, strictly, “mora-timed”), so that lengthening a consonant sounds best if compensated by shortening of the following vowel, and vice versa.
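The kind of compensation at issue can be made concrete with a minimal sketch. It is an illustration only, not a reconstruction of the experimental procedure of Kato et al.; the function name compensate and the millisecond values are hypothetical. It lengthens (or, with a negative change, shortens) a consonant and adjusts the following vowel by the opposite amount, so that the combined duration of the consonant-vowel unit stays constant.

```python
def compensate(consonant_ms, vowel_ms, delta_ms):
    """Change the consonant duration by delta_ms and the following
    vowel by the opposite amount, keeping their total unchanged.

    Hypothetical helper for illustration; a positive delta_ms lengthens
    the consonant and shortens the vowel, a negative one does the reverse.
    """
    new_consonant = consonant_ms + delta_ms
    new_vowel = vowel_ms - delta_ms
    # Reject changes that would leave either segment with no duration.
    if new_consonant <= 0 or new_vowel <= 0:
        raise ValueError("change too large for these segment durations")
    return new_consonant, new_vowel


# Invented durations: an 80 ms consonant followed by a 120 ms vowel.
c, v = compensate(80.0, 120.0, 20.0)
print(c, v, c + v)
```

On the hypothesis of mora-timing, a listener would be expected to find the compensated pair more acceptable than one in which only the consonant had been lengthened; the sketch shows the arithmetic of the compensation and makes no claim about perceptual results.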

Another issue running through many of the chapters is the status of the ToBI prosodic labelling system. ToBI was promulgated a few years ago as a standard annotation scheme for English prosody (Silverman et al. 1992, and see http://ling.ohio-state.edu/Phonetics/ToBI/Tobi.html), based on ideas of Janet Pierrehumbert about the structure of American English intonation (Pierrehumbert 1980). Very quickly it has become a de facto standard for representing prosody in all languages studied by speech scientists, though its inventors explicitly did not claim it to be universally applicable. Several authors in this book acknowledge that ToBI was designed for American English but nevertheless see it as uncontroversial to extend it to different languages (Swedish in the chapter by Gösta Bruce et al., German in the chapter by Wolfgang Hess et al., Japanese in Alan Black’s chapter).

The trouble with this is that ToBI is not a flexible, general-purpose descriptive scheme designed to provide a representation for any physically-possible intonation pattern. It appears to embody strong theoretical claims about prosody in American English, and there is no a-priori reason to expect other languages to obey its principles even if American English does so. A number of British phoneticians, notably Francis Nolan of Cambridge (Nolan & Grabe 1997), have argued that ToBI represents an uneasy compromise between phonetic and phonemic transcription, and cannot be made to work satisfactorily even for English in some of its British dialects. It may be that British phoneticians are readier than others to query the validity of ToBI because Britain was a nation with a well-established prior tradition of prosody annotation (see e.g. O’Connor & Arnold 1961) that embodied very different claims about intonation structure, though this has now been completely eclipsed. In this book, only Alan Black mentions any alternative to ToBI, and he does not seek to choose between the alternatives. It seems possible that international speech science is heading for a situation where it is in practice obligatory to represent prosody in all languages in terms of a system that is suitable only for one dialect of one language. It would be reassuring to see some non-Anglophone speech scientists test the validity of ToBI before its position becomes wholly unassailable.

The book contains worthwhile material on many further topics which there is no space to discuss here. It offers a good way to survey what is happening internationally in the research area which it covers. The book is well produced physically, though sometimes marred by careless copy-editing. (A French title in a bibliography even uses accented characters that do not exist in French.)

REFERENCES

Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh: Edinburgh University Press.

Nolan, F. & Grabe, E. (1997). Can ToBI transcribe intonational variation in the British Isles? Proceedings of the European Speech Communication Association Workshop on Intonation, 18-20 September 1997. Athens.

O’Connor, J.D. & Arnold, G.F. (1961). Intonation of Colloquial English. London: Longman.

Pierrehumbert, J.B. (1980). The Phonology and Phonetics of English Intonation. Bloomington, Indiana: Indiana University Linguistics Club.

Roach, P.J. (1982). On the distinction between “stress-timed” and “syllable-timed” languages. In Crystal, D. (ed.), Linguistic Controversies, pp. 73-9. London: Arnold.

Silverman, K., et al. (1992). ToBI: a standard for labelling English prosody. In Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 867-70. Banff, Alberta.

Geoffrey Sampson

University of Sussex