Sampson on Gibbon/Moore/Winski

Dafydd Gibbon, Roger Moore, and Richard Winski, editors. Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter. 1997. ISBN 3-11-015366-1. Price DM 298. xxx + 886 pages.

EAGLES, the Expert Advisory Group on Language Engineering Standards, was a European research project on a rather large scale (almost one million in sterling terms spent over three years from 1993) whose task was to compile descriptions of best practice, and identify and publicize de facto and emerging standards, in all areas of natural language and speech research, particularly areas having practical applications. Although EAGLES as a formal project has terminated, related work under the same name continues. The EAGLES initiative comprised five working groups (Text Corpora, Lexica, Formalisms, Assessment, Speech). The book under review is the output of the Speech working group, which drew heavily on the earlier European project on Speech Assessment Methods (SAM, 1987–93).

So far as I know, Speech is the only EAGLES working group to have published its results in book form (other groups have placed material on the Web). With due respect to the hardworking members of the other groups, the output of the Speech group may be the most valuable single part of the EAGLES enterprise; concepts of standardization and best practice are less easily applicable to areas where research activity depends heavily on inherently-fluid theoretical assumptions and there are fewer hard, quantitative scientific underpinnings.

The hardbound book reviewed here includes a CD-ROM containing a hypertext version of the material, and this is also available on the Web at http://www.deGruyter.de/EAGLES/. After the first introductory chapter, the book is divided into four parts, each subdivided as follows:

I Spoken language system and corpus design

ch. 2 System design

ch. 3 SL corpus design

ch. 4 SL corpus collection

ch. 5 SL corpus representation

II Spoken language characterization

ch. 6 SL lexica

ch. 7 Language models

ch. 8 Physical characterization and description

III Spoken language system assessment

ch. 9 Assessment methodologies and experimental design

ch. 10 Assessment of recognition systems

ch. 11 Assessment of speaker verification systems

ch. 12 Assessment of synthesis systems

IV Spoken language reference materials

– a series of fourteen appendices, dealing with topics ranging from the SAMPA computer-readable phonetic alphabet to a list of the holdings of the Bavarian Archive for Speech Signals

Each of the four parts is also published separately as a paperback book, at about one-eighth of the price of the volume under review.

The individual chapters survey the state of the art in their respective domains (on a global scale, but from a European perspective), identifying points where best practice can be recommended; the recommendations interspersed throughout each chapter are drawn together and restated in a section at the chapter end. (Oddly, although recommendations are printed as separate numbered items both in the body of each chapter and in a block at the end, the numbered lists do not match; for instance, Recommendation 4 in the body of chapter 5 is paraphrased as Recommendation 7 at the end of that chapter.)

The chapters are a somewhat mixed bag. Chapter 7, on language models, is essentially a slice of the theory of mathematical linguistics: it discusses various approaches, including n-gram models, probabilistic context-free grammars, etc., that can be used to obtain estimates of prior probabilities for word-sequences, as needed for speech recognition – where these prior probabilities are multiplied with conditional probabilities (derived from an acoustic model) of acoustic observations given word-sequences, to yield (via Bayes’s Theorem) posterior probabilities of word-sequences given acoustic observations, which are the figures actually needed in order to make speech understanding systems work. This chapter too includes a set of “recommendations”, but in such a theory-laden area the concept of recommending best practice hardly applies. However, chapter 7 is atypical. Most chapters briefly sketch any theoretical material they need to draw on, with references, and focus mainly on the practical decisions that have to be made in capturing data and compiling resources for speech research, or in establishing régimes for fairly and adequately assessing the performance of speech-processing software.
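The Bayesian step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two candidate word-sequences, the probability values, and the function name `posterior` are invented here and are not taken from the book.

```python
def posterior(prior, likelihood):
    """Combine priors with likelihoods via Bayes's Theorem.

    prior[w]      -- P(w): language-model estimate for word-sequence w
    likelihood[w] -- P(acoustics | w): acoustic-model estimate
    Returns P(w | acoustics) for each candidate w.
    """
    # Numerator of Bayes's Theorem for each candidate word-sequence.
    joint = {w: prior[w] * likelihood[w] for w in prior}
    # Normalising constant: total probability of the acoustic observations
    # over the candidates considered.
    evidence = sum(joint.values())
    return {w: p / evidence for w, p in joint.items()}


# Hypothetical figures, chosen only to show the arithmetic.
prior = {"recognize speech": 0.7, "wreck a nice beach": 0.3}
likelihood = {"recognize speech": 0.2, "wreck a nice beach": 0.4}

post = posterior(prior, likelihood)
best = max(post, key=post.get)
```

In this toy case the acoustic model prefers the second candidate, but the language-model prior outweighs that preference, so `best` is the first candidate. A real recognizer would search a very large hypothesis space and would normally work with log probabilities rather than normalising over two candidates.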

The book should help considerably in encouraging laboratories dealing with less-studied languages to move towards the standards and practices evolved in North West Europe, North America, and Japan; but even in these fortunate regions there will be few people responsible for managing teams of speech researchers who will not profit by dipping into the book, to compare their local practice with that of their peers, and to learn about issues in neighbouring areas of the field with which they do not directly deal.

One reason why the book is likely to be consulted widely, on paper or via the Web, is for the appendix on the SAMPA phonetic alphabet, which is the most accessible definition known to me of this emerging standard. SAMPA is a response to the problem that the ASCII character set contains many fewer elements than the IPA phonetic alphabet. IPA symbols can be represented electronically by three-digit numerical codes (another appendix specifies the coding system that has been promulgated by the International Phonetic Association); but, for broad phonetic (phonemic) transcriptions, it is far more convenient to represent sounds by single ASCII characters. SAMPA achieves this by mapping the ASCII characters onto a core subset of the IPA alphabet, and by defining, for any given language, a system of broad transcription that limits itself to those core symbols. The appendix describes the system and specifies its application to sixteen languages of Western and Eastern Europe.

For English the SAMPA transcriptions are quite intuitive in most respects; for instance, thinkers comes out as /"TINk@z/. There is one major exception to the intuitiveness: the pat vowel is represented as an opening curly bracket, /p{t/. This seems inconsistent with the general SAMPA approach, which makes avoidance of queer symbols a higher priority than narrow-phonetic precision; thus (p. 685) for the six vowel phonemes of Bulgarian, the phonetically-exact symbol-set /I E a O U @/ is rejected in favour of the more “ordinary” symbol-set /i e a o u @/. The same principle would suggest that English pat should be represented as /p&t/, using the ampersand sign (which SAMPA assigns to a slightly different vowel sound that does not occur in English or any of the fifteen other languages described). The ampersand has obvious mnemonic aptness to represent the pat vowel, and is used in that way in widely-used existing electronic resources such as Roger Mitton’s computer-usable version of the Oxford Advanced Learner’s Dictionary. The use of a syncategorematic sign such as a bracket to represent a segmental sound is so unnatural that I predict this particular detail of the SAMPA system will be modified in practice.
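The idea of mapping single ASCII characters onto IPA symbols can be sketched as a lookup table. The table below is a deliberately partial assumption: it covers only the symbols needed for the two examples just quoted, and the names `SAMPA_TO_IPA` and `sampa_to_ipa` are hypothetical. The full SAMPA definition also includes multi-character symbols, which this character-by-character sketch does not handle.

```python
# Partial SAMPA-to-IPA table: only the symbols used in the examples above.
SAMPA_TO_IPA = {
    '"': "\u02c8",  # primary stress mark
    "T": "\u03b8",  # voiceless dental fricative, as in "thin"
    "I": "\u026a",  # lax high front vowel, as in "pit"
    "N": "\u014b",  # velar nasal, as in "sing"
    "@": "\u0259",  # schwa
    "{": "\u00e6",  # the "pat" vowel
}


def sampa_to_ipa(transcription):
    """Convert a SAMPA string to IPA one character at a time.

    Characters absent from the table (here the lower-case consonants
    p, t, k, z) are passed through unchanged, since for these SAMPA
    and IPA use the same letter.
    """
    return "".join(SAMPA_TO_IPA.get(ch, ch) for ch in transcription)


thinkers = sampa_to_ipa('"TINk@z')
pat = sampa_to_ipa("p{t")
```

The curly bracket is the only entry in this table whose ASCII form gives no hint of the sound it stands for, which is the point of the objection raised above.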

The book’s recommendations are unassertively expressed (the introduction, p. 26, raises a possibility of moving towards a more “prescriptive” stance in the future, but it is not clear that any of the authors seriously believe that would be desirable); and in general, where I am qualified to judge, the recommendations seem very reasonable. I found only one point which struck me as misguided, where Recommendations 2 and 3 on p. 170, taken together with the statement in Appendix M of the transcription conventions of the Speechdat project, seem to suggest that a conventional reduced pronunciation such as English wanna ought to be represented in an orthographically transcribed speech corpus in its conventional spelling, as want to. The forms want to and wanna are not grammatically interchangeable: thus, one can say Why do you wanna go?, but not *Who do you wanna go? Consequently the difference between wanna and want to matters to a class of researchers who are unlikely to work directly with audio recordings, and it should be shown in an orthographic transcription. (This case is quite different from dialect variation such as rhotic v. non-rhotic pronunciations of car, which have no significance above the phonological level and are appropriately ignored in an orthographic transcription.)

A camel is said to be a horse designed by a committee. It is perhaps inevitable that a book as large as this, involving input from many sites scattered across a continent, has a certain camelious quality. There are repetitions: the Wizard of Oz technique is treated in detail in at least three different chapters, and the American SPRINT speaker verification system is described twice within a single chapter. Some rough edges have survived the editorial process. On p. 810 there are places where the writer left notes to himself about data to be added in a final draft that was never produced; the book contains a fair number of misprints, not all of which are trivial. (I could not follow the definition on p. 190 of “static coverage” of a lexicon, because a crucial word or phrase was omitted: does the concept relate to word-types or tokens?) The SAMPA appendix includes individual opinions expressed in the first person singular, but nothing tells us who wrote this appendix. On p. 58 the French number-words vingt and un are transcribed as containing the same nasal vowel, which they do not (and the transcription clashes with the SAMPA convention for either vowel). But the large task of editing this volume must have been subject to a law of diminishing returns. Another year or so of polishing could have increased the elegance and coherence of the whole, but the gain would have been outweighed by the disadvantage of delaying the dissemination of good practice. The editors’ compromise between perfection and speed was probably a reasonable one.

I shall close by mentioning two errors of detail and a typographical oddity. On p. 125, it is not true that the speech section of the British National Corpus was compiled by regularly sampling speakers’ output “for Y minutes every Z hours”. On p. 212, the categories of derivation and grammatical agreement do not exhaust the field of word morphology (inflecting verbs for tense, for instance, has nothing to do with either). Finally, the typesetting of this book is unique in my experience, in that paragraphs are not marked either by indenting the first line or by space between adjacent paragraphs. The only visible indicator is that the last line of a paragraph usually ends before reaching the right margin; in other words, paragraphing is marked on the right-hand sides but not on the left-hand sides of pages. I found this a surprisingly severe hindrance to rapid reading.

Geoffrey Sampson

Department of Informatics

University of Sussex

Falmer, Brighton BN1 9QH, England