Sampson on Heid, "Zur Strukturierung ..."

Ulrich Heid, Zur Strukturierung von einsprachigen und kontrastiven elektronischen Wörterbüchern. (Lexicographica, Series Maior, 77.) Tübingen: Max Niemeyer, 1997. ix + 255 pp. ISBN 3-484-30977-6.

Reviewed by Geoffrey Sampson, University of Sussex.

Reusability of resources has been a buzzword in language engineering research funded by the European Union over the last few years. It has been seen as important both to find ways of using existing resources, e.g. published paper dictionaries, for the new purposes of computational linguistics, and to design new electronic research resources so as to be multifunctional rather than limited to serving the purposes in connexion with which they are compiled. This book is an account of one large EU research project motivated by the urge for reusability: the DELIS project, which ran from 1993 to 1995. DELIS, and Heid’s book, are presented as bridging the gulf between computational-linguistic treatment of lexical matters and the technicalities of lexicography in the traditional sense – Heid says that various parallels between the two disciplines have been little appreciated in the past.

DELIS stands for “Descriptive lexical specification and tools for corpus-based lexicon building”. The project was co-ordinated by Heid at the University of Stuttgart; other participants included Pisa, Clermont-Ferrand, Amsterdam, and Copenhagen Universities, three dictionary publishers (Van Dale of the Netherlands, OUP, and Den Danske Ordbog), and some software and consultancy firms.

What DELIS aimed to do, in a nutshell, was to define and illustrate a formalism for electronic dictionaries which specifies the organization of their contents so as to achieve two things. First, it should be possible to check automatically that a dictionary is fully explicit; paper dictionaries invariably leave many things implicit, relying on human users’ ability to fill in the gaps by common sense and experience, which is obviously unacceptable in a dictionary for use by computers. Second, two monolingual dictionaries of different languages should be usable together, without adaptation, to support machine translation in either direction between the respective languages.

Heid draws an analogy with the formal specifications which define the class of all and only the well-formed formulae of a formal language, such as the propositional calculus. DELIS aimed in this sense to provide formal specifications for a class of well-formed dictionaries: an electronic file which can be shown to conform to the specifications is guaranteed to have the practically-desirable properties which motivated the project. This might sound like a job for an SGML Document Type Definition. The passage in which Heid first introduces the formal-specification concept is linked to a critique of various inconsistencies in an existing electronic dictionary, “OALD3e”, based on the Third Edition of the Oxford Advanced Learner’s Dictionary (evidently not Roger Mitton’s CUVOALD (Mitton 1986), but a separate computer-usable version of OALD3 developed by OUP themselves); the kind of problems Heid identifies in OALD3e could very readily be solved through SGML. (For instance, the same category is sometimes labelled emphat_pron and sometimes emph_pron.) But Heid says that the kind of formal specification aimed at by DELIS goes beyond consistency in labels and syntactic structuring of entries, which could be codified in an SGML DTD, to include also a content model (Inhaltsmodell) that would to some extent determine whether a given dictionary entry makes sense.
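The difference between the two kinds of check can be made concrete with a minimal Python sketch. Everything in it is a hypothetical illustration: the entry layout, the label inventory, the role names, and the single content rule are assumptions made for this example, and none of them is taken from OALD3e, from an SGML DTD, or from the DELIS specifications.

```python
# Hypothetical closed inventory of category labels (illustrative only).
ALLOWED_LABELS = {"noun", "verb", "emph_pron"}


def label_errors(entries):
    """Consistency check of the kind a DTD-style schema could enforce:
    return (headword, label) pairs whose label is outside the inventory."""
    return [
        (e["headword"], e["label"])
        for e in entries
        if e["label"] not in ALLOWED_LABELS
    ]


def content_errors(entries):
    """Toy content-model check: a verb entry must list at least one
    argument role. Return the headwords of entries that fail the rule."""
    return [
        e["headword"]
        for e in entries
        if e["label"] == "verb" and not e.get("roles")
    ]


# Invented sample entries; the role names are placeholders.
entries = [
    {"headword": "myself", "label": "emphat_pron"},  # variant spelling of the label
    {"headword": "itself", "label": "emph_pron"},
    {"headword": "see", "label": "verb", "roles": ["experiencer", "percept"]},
    {"headword": "hear", "label": "verb", "roles": []},  # well labelled, but incomplete
]

print(label_errors(entries))
print(content_errors(entries))
```

The first function catches the emphat_pron variant. The second flags an entry whose labels are all legal but whose content is incomplete, which is the sort of thing a purely syntactic schema would not notice.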

DELIS sets out to achieve this largely by expressing semantic and grammatical properties of lexical items in terms of a formalism based on constraint logic, TFS (standing for “Typed Feature Structures”), devised by Heid’s Stuttgart colleagues Martin Emele and Rémi Zajac. The actual substance of DELIS semantic descriptions draws on a theory called Frame Semantics due to Charles Fillmore; the main references cited by Heid are both unpublished, but Frame Semantics appears to be an updating of Fillmore’s well-known case grammar theory.

Using the TFS formalism is claimed to enable a DELIS dictionary even to express conditions relating to the interaction between separate, nonadjacent levels of linguistic description, of a kind which have bedevilled machine translation systems in the past. Heid gives a clear illustration of this problem, in connexion with the task of translating the French phrase un train devant attendre le passage d’un express into German. Like French, German does normally allow nouns to be modified by present-participle clauses. In an MT system of the transfer type, the morphosyntactically-analysed French text may be successfully recast in terms of progressively deeper levels of representation, such as “functional structure” and, eventually, “French interface structure”, and replaced by an equivalent German interface structure which in turn will be realized at progressively shallower levels of German representation – until the enterprise falls at the last fence, because the German lexical database cannot supply a present participle corresponding to French devant. (The German modal verb müssen, like English must, has no participle forms; the French text has to be paraphrased in German as ein Zug, der die Durchfahrt eines Expresszugs abwarten muß.) Heid claims that TFS allows the relevant facts to be expressed in a DELIS dictionary in a manner that would allow an MT system to anticipate such problems, and move directly to an acceptable translation without needing to backtrack from plausible translation routes which turn out to lead to dead ends in this way.
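The idea of anticipating the dead end can be illustrated with a small Python sketch. It uses a plain dictionary lookup rather than TFS or constraint logic, and the lexicon layout, the function name, and the strategy labels are all hypothetical. The absence of a participle for müssen follows the statement above.

```python
# Toy German lexicon recording which present-participle form, if any,
# each verb has. The layout is an assumption made for this example.
GERMAN_VERBS = {
    "müssen": {"present_participle": None},       # no participle, per the review
    "warten": {"present_participle": "wartend"},
}


def modifier_strategy(verb):
    """Decide how to render a noun-modifying clause headed by `verb`.

    The lexicon is consulted before any target structure is built, so a
    missing participle steers the choice at the outset instead of causing
    a failure at the final realization step.
    """
    entry = GERMAN_VERBS[verb]
    if entry["present_participle"] is None:
        return "relative_clause"
    return "participle"


print(modifier_strategy("müssen"))
print(modifier_strategy("warten"))
```

A real system would presumably state such facts as constraints that hold across several levels of description at once. The sketch shows only the simpler point that the morphological gap is visible in the lexicon before translation begins.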

This flexibility in TFS could presumably be exploited only once a dictionary compiler had access to some listing of inter-level relational properties that are potentially relevant for translation (or perhaps for other language-engineering applications). A chapter of Heid’s book which is of interest independently of the DELIS system attempts to establish a classification of the kinds of contrastive features among languages which create translation problems, before going on to show how such things can be encoded in TFS.

The outcome of the project, as reflected in this book, is more a programme for dictionary construction than a substantial implementation of the DELIS design decisions; from a single three-year research project one could scarcely expect more. The project produced sample fragments of DELIS dictionaries for verbs of perception and communication in several European languages, but these are an illustration, not an implementation, of the system (indeed Heid points out that the software implementation of TFS is not yet capable of functioning as part of a realistically large system). The goals of DELIS were clearly extremely ambitious, and at this point one can hardly guess whether or not the specifications have the potential to deliver language-engineering resources that succeed in working as envisaged.

One aspect of the book which arouses some misgivings is the relative fewness of its references to relevant work being done outside the framework of EU research funding. I should have liked, for instance, to see some discussion of the choice of TFS rather than DATR (for which see e.g. Evans & Gazdar 1996). I know very little about either of these formalisms, but to me as a nonexpert they sound very comparable, and DATR is both older-established and specifically designed for encoding lexical information, whereas Heid himself points out that TFS was originally designed for other purposes and is in some respects not ideally adapted to this one. I can easily believe that TFS may nevertheless have been a better choice for DELIS than DATR, because of considerations that I know nothing of; but Heid seems never to mention DATR. (I say “seems”; like many books published on the Continent, this one lacks an index, so it is very difficult to be sure that no brief reference is lurking in an out-of-the-way footnote. How do Continental scholars cope without indices?) A reader wanting simply to survey the range of work currently being done to develop dictionaries for use in language-engineering contexts would do better to consult a book like Ooi (1998), though I think no research discussed by Ooi has ambitions as lofty as those of DELIS.

Heid’s book contains plenty of references to others’ work; but the pattern of these references creates an impression that the community of EU-funded language engineering researchers forms something of a world of its own, into which news about research conducted elsewhere, or in Europe under national rather than EU auspices, is relatively slow to penetrate. In fact I believe that this situation is often recognized by members of this research community themselves, and that it is a consequence of EU funding practices. When research projects are executed by consortia linking numerous sites, which by European law must include at least two countries and commonly include many more, the overhead of keeping in touch with research partners must be immense, and perhaps leaves little time over for keeping track of the rest of the world.

Nevertheless, the concept of automatic validation of dictionaries with respect to substantive aspects of content, as envisaged by the DELIS project, is entirely original as far as I know. Heid and his collaborators deserve credit for setting their sights so high.

References

Evans, R. & G. Gazdar (1996) “DATR: A language for lexical knowledge representation”. Computational Linguistics 22.167-216.

Mitton, R. (1986) “A partial dictionary of English in computer-usable form”. Literary and Linguistic Computing 1.214-215.

Ooi, V.B.Y. (1998) Computer Corpus Lexicography. Edinburgh University Press.