The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.

Published in AISBQ 105.26–7, 2001.


The Spoken Language Translator.  By Manny Rayner, David Carter, Pierrette Bouillon, Vassilis Digalakis, and Mats Wirén.  Cambridge University Press, 2000.  xviii + 337 pp.  ISBN 0 521 77077 7.

 

Reviewed by Geoffrey Sampson, University of Sussex.

 

 

In 300-odd pages this remarkable book gives a clear account of all aspects, including detailed evaluation, of a speech-to-speech translation project, “SLT”, sponsored by the Swedish company Telia and executed over the period 1992-9 by Telia, SRI International (Cambridge), and academic institutions in Greece, Sweden, Switzerland, and elsewhere in Europe.  SLT functions in the ATIS domain (conversations about booking flights), and handles five of the six language-pairs involving the languages Swedish, English, and French.  The book is addressed principally to language researchers, while including enough material on the speech components to show language experts the general nature of the problems faced and solutions adopted in those parts of the project.

 

Speech-to-speech translation – creation of a system which accepts conversational spoken input in one human language and converts it into comprehensible synthesized speech in another – is a challenge which requires almost every function within language and speech technology to be achieved to a greater or lesser extent.  As such, it was selected by the German government for the very large (DM 115 million) flagship Verbmobil initiative, which involved collaboration by a high proportion of all German academic sites active in language and speech, together with a number of leading companies, over about the same period as the SLT project.  At a time when many academic computer scientists in our part of the world are making dejected comments about American dominance and the extent to which industrial research is bypassing academe, Wolfgang Wahlster, leader of the Verbmobil initiative, feels that the Verbmobil experience proves that European, academically-based efforts are well able to take on and beat American competition, provided funding at the right level is made available (Wahlster 1999).

 

In that context it is very interesting that these authors consider that SLT delivered results comparable to those of Verbmobil, with a total budget less than a tenth the size.  They are frank about the fact that it is difficult to make precise numerical comparisons between independent projects; but I am inclined to trust their judgement of broad comparability with Verbmobil – one very refreshing thing about this book is that it seems entirely free of the hype that disfigures a great deal of writing in the computing domain.  When some SLT component worked less well than hoped, they say so.

 

So it seems that even the massive level of funding enjoyed by Verbmobil (which those of us working elsewhere in Europe are unlikely to see replicated) may not be indispensable.  Good organization, good ideas, and genuine enthusiasm on both sides for collaboration across the industrial/academic divide may be more important in enabling European informatics research to establish its place in the sun.

 

SLT naturally built on a great deal of existing work.  The speech-recognition module used SRI’s Hidden Markov Model-based Decipher and Nuance family of systems, though with heavy adaptation.  Unlike comparable speech-recognition work being funded at the same period in the USA by DARPA, the SLT recognition module was required to perform at or near real time; one of its innovations was a novel “genonic” type of HMM which represents a particularly favourable compromise between speed and accuracy (and has since been adopted by others).  Translation exploited the Core Language Engine (Alshawi 1992), again developed by SRI; when I was working in the private sector on parser evaluation in 1990, I found that this was by a clear margin the best of a range of natural-language analysis systems then available.  Output speech was synthesized by TrueTalk (from Entropic Systems).

 

The translation module proper is a hybrid system combining rule-based and statistical processing.  The Core Language Engine, based on a unification-grammar formalism, converts input sentences into their “quasi logical forms” (comparable to the logical forms of predicate calculus, except that quantifier scope is not specified – since scope ambiguities are often quite parallel between natural languages, it is easy to believe that resolving them would be wasted effort with respect to correct translation).  The range of quasi logical forms for one language is similar but not identical to those for another language, so transfer rules are needed to convert source-language into target-language QLFs.  (Differences in manner of expression such as French il me faut X v. English “I need X” are cited as one reason why a single QLF interlingua is unfeasible.)  The set of transfer rules offers alternative target-language QLFs, with statistical preferences among the rules which yield the alternatives; but there are also statistical preferences among target-language QLF structures, without reference to transfers from source-language structures, and the two kinds of preference are balanced off against one another to determine the system’s choice of translation.
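
To caricature in code how the two kinds of preference might be balanced (the toy QLF representations, the figures, and the scoring scheme below are my own inventions for illustration, not the book's actual machinery):

import math

# Illustrative sketch only: candidate target-language "QLFs" are plain strings;
# each carries a preference from the transfer rule that produced it, and an
# independent preference reflecting how natural it is as a target-language
# structure.  The two are combined as a weighted sum of log scores.
transfer_preference = {
    "need(speaker, flight)": 0.6,          # e.g. from rule il_faut -> need
    "be_necessary(flight, speaker)": 0.4,  # e.g. from rule il_faut -> be_necessary
}
target_preference = {
    "need(speaker, flight)": 0.7,
    "be_necessary(flight, speaker)": 0.2,
}

def combined_score(qlf, w_transfer=1.0, w_target=1.0):
    # Balance the transfer-rule preference against the target-language
    # structural preference for one candidate QLF.
    return (w_transfer * math.log(transfer_preference[qlf])
            + w_target * math.log(target_preference[qlf]))

best = max(transfer_preference, key=combined_score)
print(best)    # -> need(speaker, flight)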

 

Of course, conversational speech does not always consist of complete, well-formed sentences, and even when it does a speech recognition system does not resolve utterances unambiguously into the words intended by the speaker.  All inputs are translated at four degrees of “depth”, of which quasi logical forms for sentences are the deepest and word-for-word translation the shallowest; again statistical preferences are used to choose the likeliest translation for each input fragment, with weightings that favour deep translations and large fragments.
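
A minimal sketch of such a weighting scheme, with invented depth labels, weights, and fragment representation (the book's real preference model is considerably more elaborate):

# Illustrative sketch: each candidate covers some fragment of the input at one
# of four translation depths; deeper analyses and larger fragments score higher.
DEPTH_WEIGHT = {
    "qlf": 4.0,            # full quasi-logical-form translation (deepest)
    "partial_qlf": 3.0,
    "phrase": 2.0,
    "word_for_word": 1.0,  # shallowest fallback
}

def fragment_score(depth, n_words):
    # Favour deep translations and large fragments.
    return DEPTH_WEIGHT[depth] * n_words

candidates = [
    ("qlf", 5),            # whole five-word utterance analysed as a sentence
    ("phrase", 3),         # a three-word phrase translated more shallowly
    ("word_for_word", 5),  # word-by-word fallback covering all five words
]
best = max(candidates, key=lambda c: fragment_score(*c))
print(best)    # -> ('qlf', 5)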

 

The relevant statistics are derived largely from a module which allows researchers to train the system by answering system-generated questions about the correct analysis of crucial corpus examples.  This amounts to a convenient, practical way of tuning very general hand-coded grammars to make them specific enough to work within a particular language domain, such as ATIS.  The authors note that many researchers are currently inclined to reject any role for hand-coded grammars, feeling that the only hope for practical natural language processing lies in techniques derived automatically from surface properties of language.  But many important aspects of human language content cannot be addressed without “competence” rules which have to be designed by taking thought, rather than derived mechanically from surface features.  The authors see their most important single contribution as being the demonstration that it is possible to use hand-coded linguistic rules within serious, non-“toy” applications.
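
In spirit, though certainly not in detail, the training loop might be imagined along the following lines; the question format, data structures, and update rule here are my own guesses for illustration:

from collections import Counter

# Illustrative sketch of supervised disambiguation: the system lists the
# analyses its general grammar allows for a corpus example, a researcher says
# which one is correct, and preference counts for the rules used are updated.
preference_counts = Counter()

def train_on_example(sentence, candidate_analyses, ask):
    # 'ask' stands in for the system-generated question put to the researcher;
    # it returns the index of the analysis judged correct.
    choice = ask(sentence, candidate_analyses)
    for rule in candidate_analyses[choice]["rules_used"]:
        preference_counts[rule] += 1

# Toy ATIS-style example with two competing attachment analyses.
analyses = [
    {"reading": "show [flights to Boston]",   "rules_used": ["np_pp_attachment"]},
    {"reading": "show [flights] [to Boston]", "rules_used": ["vp_pp_attachment"]},
]
train_on_example("show flights to Boston", analyses,
                 ask=lambda s, cands: 0)   # pretend the researcher chose analysis 0
print(preference_counts)                   # Counter({'np_pp_attachment': 1})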

 

Part of this demonstration consists of showing that, for the particular domain they are concerned with, training requires a corpus of only five to ten thousand utterances, with no advantage gained from substantially greater amounts of data.  This is typical of the way that, throughout the book, all aspects of development and performance are carefully documented in quantitative terms.

 

Another strand of the project concerned the reusability of expensively-produced resources.  The unification-based English-language analysis system was capable of adaptation to Swedish and to French at a fraction of the time cost that would have been needed to develop independent systems.  Furthermore, the authors discuss a technique whereby L1 → L2 and L2 → L3 transfer rules can be composed to yield L1 → L3 transfer rules semi-automatically.  In the speech area, the Swedish recognizer was bootstrapped from the model for English.  A particularly interesting section discusses how recognizers for regional dialects can be built at low cost by treating dialect adaptation as a special case of language-model training.  (Here I did wonder whether the good results reported depended on the fact that differences between Scanian and Stockholm dialects of Swedish, used for the experiment, seem not to be large, relative to differences between some dialects of English.)
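
The composition idea can be sketched very crudely as follows (the real transfer rules operate on QLF structures rather than atomic predicates, and the composition described in the book is only semi-automatic, with the results checked by hand):

# Toy illustration: transfer "rules" here are just mappings between atomic
# predicates; chaining L1->L2 rules with L2->L3 rules through the shared
# intermediate language yields candidate L1->L3 rules.
swedish_to_english = {"behova": "need", "flyg": "flight"}
english_to_french = {"need": "falloir", "flight": "vol"}

def compose(rules_ab, rules_bc):
    # Compose two transfer-rule sets via their common intermediate language.
    return {a: rules_bc[b] for a, b in rules_ab.items() if b in rules_bc}

swedish_to_french = compose(swedish_to_english, english_to_french)
print(swedish_to_french)   # -> {'behova': 'falloir', 'flyg': 'vol'}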

 

The authors note in conclusion that their project did not result in “a functioning, commercially available speech translation system”.  But I wonder whether either they or their sponsors expected to get all the way to market in seven years’ work.  The impression that arises from this book, though, is of a very solid foundation from which marketable products should ultimately be derivable through incremental improvements.

 

Above all, the authors make it clear at each step what strategic choices they made, and why.  Even if some of their decisions eventually prove to have been mistaken, other researchers are likely to reach better solutions sooner than they otherwise would, with documentation of this quality to react to.

 

On the evidence of the book, the SLT project sounds like a model for large-scale IT research initiatives.  The book is certainly a model of research dissemination.

 

 

References

 

Alshawi, H., ed.  1992.  The Core Language Engine.  MIT Press.

Wahlster, W.  1999.  Interview with Wolfgang Wahlster.  ELSNews 8.4, December 1999, pp. 8-10.