The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.
Published in J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Mouton de Gruyter (Berlin), 1992.
We want to give computers the ability to process human languages. But computers use systems of their own which are also called ‘languages’, and which share at least some features with human languages; and we know how computers succeed in processing computer languages, since it is humans who have arranged for them to do so. Inevitably there is a temptation to see the automatic processing of computer languages as a precedent or model for the automatic processing (and perhaps even for the human processing) of human languages. In some cases the precedent may be useful, but clearly we cannot just assume that human languages are similar to computer languages in all relevant ways. In the area of grammatical parsing of human languages, which seems to be acknowledged by common consent as the central problem of natural language processing – ‘NLP’ – at the present time, I believe the computer-language precedent may have misled us. One of the ideas underlying my work is that human languages, as grammatical systems, may be too different from computer languages for it to be appropriate to use the same approaches to automatic parsing.
Although the average computer scientist would probably think of natural-language parsing as a somewhat esoteric task, automatic parsing of computer programming languages such as C or Pop-11 is one of the most fundamental computing operations; before a program written in a user-oriented programming language such as these can be run it must be ‘compiled’ into machine code – that is, automatically translated into a very different, ‘low level’ programming language – and compilation depends on extracting the grammatical structure by virtue of which the C or Pop-11 program is well-formed. To construct a compiler capable of doing this, one begins from a ‘production system’ (i.e. a set of rules) which defines the class of well-formed programs in the relevant user-oriented language. In fact there exist software systems called ‘compiler-compilers’ or ‘parser generators’ which accept as input a production system for a language and automatically yield as output a parser for the language. To the computer scientist it is self-evident that parsing is based on rules for well-formedness in a language.
If one seeks to apply this concept to natural languages, an obvious question is whether rules of well-formedness can possibly be as central for processing natural languages, which have grown by unplanned evolution and accretion over many generations, as they are for processing formal programming languages, which are rule-governed by stipulation. What counts as a valid C program is fixed by Brian Kernighan and Dennis Ritchie – or, now, by the relevant American National Standards Institute committee – and programmers are expected to learn the rules and keep within them. If a programmer inadvertently tries to extend the language by producing a program that violates some detail (perhaps a very minor detail) of the ANSI rules, it is quite all right for the compiler software to reject the program outright. In the case of speakers of natural languages it is not intuitively obvious that their skill revolves in a similar way round a set of rules defining what is well-formed in their mother tongue. It is true that I sometimes hear children or foreigners producing English utterances that sound a little odd, but it seems to me that (for me and for other people) the immediate response is to understand the utterance, and noticing its oddity is a secondary phenomenon if it occurs at all. It is not normally a case, as the compiler model might suggest, of initially hearing the utterance as gibberish and then resorting to special mental processes to extract meaning from it nevertheless.
These points seem fairly uncontroversial, and if the period when linguistics and computer science first became heavily involved with one another had been other than when it was (namely about 1980) they might have led to widespread scepticism among linguists about the applicability of the compiler model to natural language parsing. But intellectual trends within linguistics at that time happened to dovetail neatly with the compiler model. The 1970s had been the high point of Noam Chomsky’s intellectual dominance of linguistics – the period when university linguistics departments routinely treated ‘generative grammar’ as the centrepiece of their first-year linguistics courses, as the doctrine of the phoneme had been twenty years earlier – and, for Chomsky, the leading task of linguistics was to formulate a rigorous definition specifying the range of well-formed sentences for a natural language. Chomsky’s first book began: ‘... The fundamental aim in the linguistic analysis of a language L is to separate the grammatical sequences which are the sentences of L from the ungrammatical sequences which are not sentences of L ...’ (Chomsky 1957, 13).
Chomsky’s reason for treating this aim as fundamental did not have to do with automatic parsing, which was not a topic that concerned him. Chomsky has often, with some justification, contradicted writers who suggested that his kind of linguistics is preoccupied with linguistic automation, or that he believes the mind ‘works like a computer’. In the context of Chomsky’s thought, the reason to construct grammars which generate ‘all and only’ the grammatical sentences of different natural languages was his belief that these grammars turn out to have various highly specific and unpredicted formal properties, which do not differ from one natural language to another, and that these universals of natural grammar are a proof of the fact that (as Chomsky later put it) ‘we do not really learn language; rather, grammar grows in the mind’ (Chomsky 1980, 134) – and that, more generally, an adult’s mental world is not the result of his individual intellectual inventiveness responding to his environment but rather, like his anatomy, is predetermined in much of its detail by his genetic inheritance. For Chomsky formal linguistics was a branch of psychology, not of computer science.
This idea of Chomsky’s that linguistic structure offers evidence for genetic determination of our cognitive life seems when the arguments are examined carefully to be quite mistaken (cf. Sampson 1980; 1989). But it was influential for many years; and its significance for present purposes is that, when linguists and computer scientists began talking to one another, it led the linguists to agree with the computer scientists that the important thing to do with a natural language was to design a production system for it. What linguists call a ‘generative grammar’ is what computer scientists call a ‘production system.’ Arguably, indeed, the rise of NLP gave generative grammar a new lease of life within linguistics. About 1980 it seemed to be losing ground in terms of perceived centrality in the discipline to less formal, more socially-oriented trends, but during the subsequent decade there was a striking revival of formal grammar theory.
Between them, then, these mutually-reinforcing traditions made it seem inevitable that the way to produce parsers for natural languages was to define generative grammars for them and to derive parsers from the generative grammars as compilers are derived from computer-language production systems. Nobody thought the task would be easy: a natural language grammar would clearly be much larger than the grammar for a computer language, which is designed to be simple, and Chomsky’s own research had suggested that natural-language generative grammars were also formally of a different, less computationally tractable type than the ‘context-free’ grammars which are normally adequate for programming languages. But these points do not challenge the principle of parsing as compilation, though they explain why immediate success cannot reasonably be expected.
I do not want to claim that the compiler model for natural language parsing is necessarily wrong; but the head start which this model began with, partly for reasons of historical accident, has led alternative models to be unjustifiably overlooked. Both Geoffrey Leech’s group and mine – but as yet few other computational-linguistics research groups, so far as I know – find it more natural to approach the problem of automatically analysing the grammatically rich and unpredictable material contained in corpora of real-life usage via statistical techniques somewhat akin to those commonly applied in the field of visual pattern recognition, rather than via the logical techniques of formal-language compilation.
To my mind there are two problems about using the compiler model for parsing the sort of language found in resources such as the LOB Corpus. The first is that such resources contain a far greater wealth of grammatical phenomena than standard generative grammars take into account. Since the LOB Corpus represents written English, one obvious example is punctuation marks: English and other European languages use a variety of these, which have quite specific grammatical properties of their own, but textbooks of linguistic theory never in my experience comment on how punctuation marks are to be accounted for in grammatical analyses. This is just one example: there are many, many more. Personal names have their own complex grammar – in English we have titles such as Mrs, Dr which can introduce a name, titles such as Ph.D., Bart which can conclude one, Christian names can be represented by initials but surnames in most contexts cannot, and so on and so forth – yet the textbooks often gloss all this over by offering rules which rewrite ‘NP’ as ‘ProperName’ and ‘ProperName’ simply as John, Mary, .... Addresses have internal structure which is a good deal more complex than that of personal names (and highly language-specific – English addresses proceed from small to large, e.g. house name, street, town, county, while several European languages place the town before the smaller units, for instance), but they rarely figure in linguistics textbooks at all. Money sums (note the characteristic grammar of English £2.50 v. Portuguese 2$50), weights and measures, headlines and captions, and many further items are part of the warp and weft of real-life language, yet are virtually invisible in standard generative grammars.
In one sense that is justifiable. We have seen that the original motivation for generative linguistic research had to do with locating possible genetically-determined cognitive mechanisms, and if such mechanisms existed one can agree that they would be more likely to relate to conceptually general areas of grammar, such as question-formation, than to manifestly culture-bound areas such as the grammar of money or postal addresses. But considerations of that kind have no relevance for NLP as a branch of practical technology. If computers are to deal with human language, we need them to deal with addresses as much as we need them to deal with questions.
The examples I have just quoted are all characteristic more of written than spoken language; my corpus experience has until recently been mainly with the LOB and Brown Corpora of written English. But my group has recently begun to turn its attention to automatic parsing of spoken English, working with the London-Lund Corpus, and it is already quite clear that this too involves many frequent phenomena which play no part in standard generative grammars. Perhaps the most salient is so-called ‘speech repairs’, whereby a speaker who notices himself going wrong backtracks and edits his utterance on the fly. Standard generative grammars would explicitly exclude speech repairs as ‘performance deviations’, and again for theoretical linguistics as a branch of cognitive psychology this may be a reasonable strategy; but speech repairs occur, they fall into characteristic patterns, and practical automatic speech-understanding systems will need to be able to analyse them.
Furthermore, even in the areas of grammar which are common to writing and speech and which linguists would see as part of what a language description ought (at least ideally) to cover, there is a vast amount to be done in terms of listing and classifying the phenomena that occur. Many constructions are omitted from theoretical descriptions not for reasons of principle but because they are not very frequent and/or do not seem to interact in theoretically-interesting ways with central aspects of grammar, and although they may be mentioned in traditional descriptive grammars they are not systematically assigned places in explicit inventories of the resources of the language. One example among very many might be the English the more ... the more ... construction discussed by Fillmore et al. (1988), an article which makes some of the same points I am trying to make about the tendency for much of a language’s structure to be overlooked by the linguist.
All this is to say, then, that there is far more to a natural language than generative linguistics has traditionally recognized. That does not imply that comprehensive generative grammars cannot be written, but it does mean that the task remains to be done. There is no use hoping that one can lift a grammar out of a standard generative-linguistic definition of a natural language and use it with a few modifications as the basis of an adequate parser.
But the second problem, which leads me to wonder whether reasonably comprehensive generative grammars for real-life languages are attainable even in principle, is the somewhat anarchic quality of much of the language one finds in resources such as LOB. If it is correct to describe linguistic behaviour as rule-governed, this is much more like the sense in which car-drivers’ behaviour is governed by the Highway Code than the sense in which the behaviour of material objects is governed by the laws of physics, which can never be violated. When writing carefully for publication, we do stick to most of the rules, and with a police car behind him an Englishman keeps to 30 m.p.h. in a built-up area. But any rule can be broken on occasion. If a tree has fallen on the left side of the road, then common sense overrides the Highway Code and we drive cautiously round on the right. With no police near, ‘30 m.p.h.’ is interpreted as ‘not much over 40’.
So it seems to be with language. To re-use an example that I have quoted elsewhere (Garside et al. 1987, 19): a rule of English that one might have thought rock-solid is that the subject of a finite clause cannot consist wholly of a reflexive pronoun, yet LOB contains the following sentence, from a current-affairs magazine article by Bertrand Russell:
Each side proceeds on the assumption that itself loves peace, but the other side consists of warmongers.
Itself served better than plain it to carry the contrast with the other side, so the grammatical rule gives way to the need for a persuasive rhetorical effect. A tree has blocked the left-hand lane, so the writer drives round on the right and is allowed to do so, even though the New Statesman’s copy-editor is behind him with a blue light on his roof. In this case the grammatical deviation, though quite specific, is subtle; in other cases it can be much more gross. Ten or fifteen years ago I am sure we would all have agreed about the utter grammatical impossibility of the sentence:
*Best before see base of can.
But any theory which treated it as impossible today would have to contend with the fact that this has become one of the highest-frequency sentences of written British English.
Formal languages can be perfectly rule-governed by stipulation; it is acceptable for a compiler to reject a C program containing a misplaced comma. But with a natural language, either the rules which apply are not complete enough to specify what is possible and what is not possible in many cases, or if there is a complete set of rules then language-users are quite prepared to break them. I am not sure which of these better describes the situation, but, either way, a worthwhile NLP system has to apply to language as it is actually used: we do not want it to keep rejecting authentic inputs as ‘ill-formed’.
The conclusion I draw from observations like these is that, if I had to construct a generative grammar covering everything in the LOB Corpus in order to derive a system capable of automatically analysing LOB examples and others like them, the job would be unending. Rules would have to be multiplied far beyond the number found in the completest existing formal linguistic descriptions, and as the task of rule-writing proceeded one would increasingly find oneself trying to make definite and precise statements about matters that are inherently vague and fluid.
In a paper to an ICAME conference (Sampson 1987) I used concrete numerical evidence in order to turn this negative conclusion into something more solid than a personal wail of despair. I looked at statistics on the diversity of grammatical constructions found in the ‘Lancaster-Leeds Treebank’, a ca. 40,000-word subset of the LOB Corpus which I had parsed manually, in collaboration with Geoffrey Leech and his team, in order to create a database (described in Garside et al. 1987, chap. 7, and Sampson 1991) to be exploited for our joint NLP activities. I had drawn labelled trees representing the surface grammatical structures of the sentences, using a set of grammatical categories that were chosen to be maximally uncontroversial and in conformity with the linguistic consensus, and taking great pains to ensure that decisions about constituent boundaries and category membership were consistent with one another across the database, but imposing no prior assumptions about what configurations of grammatical categories can and cannot occur in English. In Sampson (1987) I took the highest-frequency grammatical category (the noun phrase) and looked at the numbers of different types of noun phrase in the data, where a ‘type’ of noun phrase is a particular sequence of one or more daughter categories immediately dominated by a noun phrase node. Types were classified using a very coarse vocabulary of just 47 labels for daughter nodes (14 phrase and clause classes, 28 word-classes, and five classes of punctuation mark), omitting many finer subclassifications that are included in the Treebank. There were 8328 noun phrase tokens in my data set, which between them represented 747 different types, but the frequencies of the types varied greatly: the commonest single type (determiner followed by singular noun) accounted for about 14% of all noun phrase tokens, while many different types were represented by one token each.
The particularly interesting finding emerged when I considered figures on the proportion of all noun phrase tokens belonging to types of not more than a set frequency in the data, and plotted a graph showing the proportion p as a function of the threshold type-frequency f (with f expressed as a fraction of the frequency of the commonest type, so that p = 1 when f = 1). The 58 points for different observed frequencies fell beautifully close to a power curve, p = f^0.4. As the fraction f falls, f^0.4 falls much more slowly: as we consider increasingly low-frequency constructions, the number of different constructions occurring at such frequencies keeps multiplying in a smoothly predictable fashion so that quite sizeable proportions of the data are accounted for even by constructions of the lowest frequencies. (More than 5% of the noun phrase tokens in my data set represented constructions which each occurred just once.) If this regular relationship were maintained in larger samples of data (this is admittedly a big ‘if’ – as yet there simply do not exist carefully-analysed language samples large enough to allow the question to be checked), it would imply that even extremely rare constructions would collectively be reasonably frequent. One in a thousand noun phrase tokens, for instance, would represent some noun phrase type occurring not more than once in a thousand million words. Yet how could one hope to design a grammar that generates ‘all and only’ the correct set of constructions, if ascertaining the set of constructions to be generated requires one to monitor samples of that size?
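The shape of this relationship is easy to experiment with. The sketch below uses invented token counts per construction type (the full frequency list for the 747 Treebank types is not reproduced here): it computes the proportion p of all tokens belonging to types whose frequency is at most a fraction f of the commonest type’s frequency, which is the quantity plotted against the fitted curve p = f^0.4.

```python
def proportion_below(type_counts, f):
    """Proportion of all tokens belonging to types whose frequency is
    at most f, where f is expressed as a fraction of the commonest
    type's frequency (so proportion_below(..., 1.0) == 1.0)."""
    total = sum(type_counts)
    cutoff = f * max(type_counts)
    return sum(c for c in type_counts if c <= cutoff) / total

# Invented token counts per type -- a stand-in for the real Treebank
# distribution, for illustration only.
counts = [1200, 400, 150, 150, 60, 25, 25, 10, 5, 2, 1, 1, 1, 1]

for f in (0.001, 0.01, 0.1, 1.0):
    # compare the observed proportion with the fitted curve f ** 0.4
    print(f, round(proportion_below(counts, f), 3), round(f ** 0.4, 3))
```

On realistic data the interest lies in how slowly the observed proportion falls as f shrinks: f^0.4 at f = 0.001 is still about 0.06, which is why the rarest constructions collectively account for a non-negligible share of the tokens.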
Accordingly, our approach to automatic parsing avoids any use of the concept of well-formedness. In fitting a labelled tree to an input word-string, our system simply asks ‘What labelled tree over these words comes closest to being representative of the configurations in our database of parsed sentences?’ The system does not go on to ask ‘Is that a grammatically “legal” tree?’ – in our framework this question has no meaning.
This general concept of parsing as maximizing conformity with statistical norms is, I think, common to the work of Geoffrey Leech’s team at Lancaster and my Project APRIL, sponsored by the Royal Signals & Radar Establishment and housed at the University of Leeds, under the direction of Robin Haigh since my departure from the academic profession. There are considerable differences between the deterministic techniques used by Leech’s team (see e.g. Garside et al. 1987, chap. 6) and the stochastic APRIL approach, and I can describe only the latter; but although the APRIL technique of parsing by stochastic optimization was an invention of my own, I make no claim to pioneer status with respect to the general concept of probabilistic parsing – this I borrowed from the Leech team.
The APRIL system is described for instance in Sampson et al. (1989). In broad outline the system works like this. We assume that the desired analysis for any input string is always going to be a tree structure with labels drawn from an agreed vocabulary of grammatical categories. For any particular input, say w words in length, the range of solutions available to be considered in principle is simply the class of all distinct tree-structures having w terminal nodes and having labels drawn from the agreed vocabulary on the nonterminal nodes. The root node of a tree is required to have a specified ‘sentence’ label, but apart from that any label can occur on any nonterminal node: a complex, many-layered tree over a long sentence in which every single node between root and ‘leaves’ is labelled ‘prepositional phrase’, say, would in APRIL terms not be an ‘illegal/ill-formed/ungrammatical’ tree, it would just be a quite poor tree in the sense that it would not look much like any of the trees in the Treebank database.
Parsing proceeds by searching the massive logical space of distinct labelled trees to find the best. There are essentially two problems: how is the ‘goodness’ of a labelled tree measured, and how is the particular tree that maximizes this measure located (given that there will be far too many alternative solutions in the solution-space for each to be checked systematically)?
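A back-of-envelope count shows why systematic checking is out of the question. The sketch below simplifies in one obvious way – it restricts attention to binary-branching trees, whereas the trees our scheme fits may branch more freely, so the figure is if anything an underestimate – and takes the 14 phrase and clause classes mentioned earlier as the nonterminal vocabulary, with the root label fixed.

```python
from math import comb

def catalan(n):
    """Number of binary-branching tree shapes over n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)

def labelled_binary_trees(w, k):
    """Binary trees over w leaves with k possible nonterminal labels;
    a full binary tree has w - 1 internal nodes, and the root label
    is fixed, leaving w - 2 freely labelled internal nodes."""
    return catalan(w - 1) * k ** (w - 2)

# Even a modest 20-word input, under these restrictive assumptions,
# yields a solution space far beyond exhaustive search.
print(labelled_binary_trees(20, 14))
```

The count for 20 leaves already runs to some 10^29 distinct labelled trees, which is why a stochastic search procedure, rather than enumeration, is needed.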
The answer to the first question is that individual nodes of a tree are assigned figures of merit by reference to probabilistic transition networks. Suppose some node in a tree to be evaluated is labelled X, and has a sequence of daughters labelled P Q R. This might be a sequence which would commonly be found below an X node in correctly-parsed sentences (if X is ‘noun phrase’, P Q R might be respectively ‘definite article’, ‘singular noun’, ‘relative clause’, say), or it might be some absurd expansion for X (say, ‘comma’, ‘prepositional phrase’, ‘adverbial clause’), or it might be something in between. For the label X (and for every other label in the agreed vocabulary) the system has a transition network which – ignoring certain complications for ease of exposition – includes a path (designed manually) for each of the sequences commonly found below that label in accurately parsed material, to which skip arcs and loop arcs have been added automatically in such a fashion that any label-string whatever, of any length, corresponds to some route through the network. (Any particular label on a high-frequency path can be bypassed via a skip arc, and any extra label can be accepted at any point via a loop arc.) The way the network is designed ensures that (again omitting some complications) it is deterministic – whatever label sequence may be found below an X in a tree, there will be one and only one route by which the X network can accept that sequence. Probabilities are assigned to the arcs of the networks for X and the other labels in the vocabulary by driving the trees of the database over them, which will tend to result in arcs on the manually-designed routes being assigned relatively high probabilities and the automatically-added skip and loop arcs being assigned relatively low probabilities. 
(One might compare the distinction between the manually-designed high-frequency routes and the routes using automatically-added skip or loop arcs to Chomsky’s distinction between ideal linguistic ‘competence’ and deviant ‘performance’ – though this comparison could not be pressed very far: the range of constructions accepted by the manually-designed parts of the APRIL networks alone would not be equated, by us or by anyone else, with the class of ‘competent/well-formed’ constructions.) Then, in essence, the figure of merit assigned to any labelled tree is the product of the probabilities associated with the arcs traversed when the daughter-strings of the various nodes of the tree are accepted by the networks.
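The scoring scheme can be caricatured in a few lines of code. The sketch below is a simplification, not the APRIL implementation: the trained transition networks are replaced by a lookup table of daughter-sequence probabilities (the numbers are invented), with a small fallback probability standing in for the automatically-added skip and loop arcs, so that any expansion whatever receives a nonzero score; log-probabilities are summed rather than probabilities multiplied, which is equivalent and numerically safer.

```python
import math

# Invented daughter-sequence probabilities for a handful of labels --
# a stand-in for the trained networks, not real figures.
SEQ_PROB = {
    ("N", ("AT", "NN")): 0.14,        # determiner + singular noun
    ("N", ("AT", "NN", "Fr")): 0.02,  # determiner + noun + relative clause
    ("S", ("N", "V", "N")): 0.10,
}
FALLBACK = 1e-6  # plays the role of the low-probability skip/loop routes

def node_prob(label, daughter_labels):
    return SEQ_PROB.get((label, tuple(daughter_labels)), FALLBACK)

def tree_merit(tree):
    """Log-merit of a tree (label, children); a child is either a
    subtree or a bare word-tag string.  Higher is better."""
    label, children = tree
    labels = [c if isinstance(c, str) else c[0] for c in children]
    score = math.log(node_prob(label, labels))
    for c in children:
        if not isinstance(c, str):
            score += tree_merit(c)
    return score

# A conventional analysis versus one using an absurd expansion of N:
good = ("S", [("N", ["AT", "NN"]), "V", ("N", ["AT", "NN"])])
bad = ("S", [("N", ["Fr", "Fr"]), "V", ("N", ["AT", "NN"])])
```

The essential point survives the simplification: the absurd tree is not rejected, it merely scores far lower than the conventional one, because its N expansion is routed through what corresponds to the low-probability skip and loop arcs.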
As for the second question: APRIL locates the best tree for an input by a stochastic optimization technique, namely the technique of ‘simulated annealing’ (see e.g. Kirkpatrick et al. 1983; Aarts & Korst 1989). That is, the system executes a random walk through the solution space, evaluating each random move from one labelled tree to another as it is generated, and applying an initially weak but steadily growing bias against accepting moves from ‘better’ to ‘worse’ trees. In this way the system evolves towards an optimal analysis for an input, without needing initially to know whereabouts in the solution space the optimum is located, and without getting trapped at ‘local minima’ – solutions which are in themselves suboptimal but happen to be slightly better than each of their immediately-neighbouring solutions. Stochastic optimization techniques like this one have something of the robust simplicity of Darwinian evolution in the natural world: the process does not ‘know where it is going’, and it may be subject to all sorts of chance accidents on the way, but in the long run it creates highly-valued outcomes through nothing more than random mutation and a tendency to select fitter alternatives.
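The annealing loop itself is simple enough to sketch. Below, a toy state space – bit-strings scored by agreement with a target string – stands in for the space of labelled trees and the tree-evaluation function; the temperature schedule and all numbers are illustrative only.

```python
import math
import random

def anneal(merit, neighbour, state, t0=2.0, cooling=0.999, steps=20000):
    """Generic simulated annealing: random walk through the solution
    space, with an initially weak but steadily growing bias against
    moves from better to worse states."""
    random.seed(0)  # reproducible run
    t = t0
    best = state
    for _ in range(steps):
        cand = neighbour(state)
        delta = merit(cand) - merit(state)
        # accept improvements always; worsenings with prob exp(delta/t),
        # which shrinks towards zero as the temperature t falls
        if delta >= 0 or random.random() < math.exp(delta / t):
            state = cand
        if merit(state) > merit(best):
            best = state
        t *= cooling
    return best

TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
merit = lambda s: sum(a == b for a, b in zip(s, TARGET))

def neighbour(s):
    s = list(s)
    s[random.randrange(len(s))] ^= 1  # flip one randomly-chosen bit
    return s

result = anneal(merit, neighbour, [0] * 10)
```

The process starts from an arbitrary state and does not know where in the space the optimum lies; it simply drifts towards higher-merit states, and early on is free enough to escape local maxima. In APRIL the moves are local rearrangements of a labelled tree rather than bit-flips, but the schedule is of this general shape.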
As yet, APRIL’s performance leaves plenty to be desired. Commonly it gets the structure of an input largely right but with various local errors, either because the tree-evaluation function fails to assign the best score to what is in fact the correct analysis, or because the annealing process ‘freezes’ on a solution whose score is not the best available, or both. Let me give one example, quoted from Sampson et al. (1989), of the outcome of a run on the following sentence (taken from LOB text E23, input to APRIL as a string of wordtags):
The final touch was added to this dramatic interpretation, by placing it to stand on a base of misty grey tulle, representing the mysteries of the human mind.
According to our parsing scheme, the correct analysis is as follows (for the symbols, see Garside et al. 1987, chap. 7, sec. 5):
[S[N the final touch] [Vp was added] [P to [N this dramatic interpretation]], [P by [Tg [Vg placing] [N it] [Ti [Vi to stand] [P on [N a base [P of [N misty grey tulle, [Tg [Vg representing] [N the mysteries [P of [N the human mind]]]]]]]]]]].]
The analysis produced by APRIL on the run in question was as follows:
[S [N the final touch] [Vp was added] [P to [N this dramatic interpretation]], [P by [Tg [Vg placing] [N it] [Ti [Vi to stand] [P on [N a base [P of [N misty grey tulle]]]], [Tg [Vg representing] [N the mysteries] [P of [N the human mind]]]]]].]
That is, of the human mind was treated as an adjunct of representing rather than as a postmodifier of mysteries, and the representing clause was treated as an adjunct of placing rather than as a postmodifier of tulle.
Our method of assessing performance gives this output a mark of 76%, which was roughly average for APRIL’s performance at the time (though some errors which would reduce the percentage score by no more than the errors in this example might look less venial to a human judge). We had then, and still have, a long way to go. But our approach has the great advantage that it is easy to make small incremental adjustments: probabilities can be adjusted on individual transition-network arcs, for instance, without causing the system to crash and fail to deliver any analysis at all for some input (as can relatively easily happen with a compiler-like parser); and the system does not care how grammatical the input is. The example above is in fact a rather polished English sentence; but APRIL would operate in the same fashion on a thoroughly garbled, ill-formed input, evolving the best available analysis irrespective of whether the absolute value of that analysis is high or low. Currently, much of our work on APRIL is concerned with adapting it to deal with spoken English, where grammatical ill-formedness is much commoner than in the edited writing of LOB.
To delve deeper into the technicalities of APRIL would not be appropriate here. But in any case this specific research project is a less significant topic for the corpus linguistics community at large than is the general need, which this and related research has brought into focus for me, for a formal stocktaking of the resources of the languages we work with. Those of us who work with English think of our language as relatively thoroughly studied, yet we have no comprehensive inventory and classification, at the level of precision needed for NLP purposes, of the grammatical phenomena found in real-life written and/or spoken English usage; I surmise that the same is true for other European languages.
By far the largest part of the work of creating the 40,000-word Lancaster-Leeds Treebank lay not in drawing the labelled trees for the individual sentences but in developing a set of analytic categories and maintaining a coherent body of precedents for their application, so as to ensure that anything occurring in the texts could be given a labelled tree structure, and that a decision to mark some sequence off as a constituent of a given category at one point in the texts would always be consistent with decisions to mark off and label comparable sequences elsewhere. It is easy, for instance, to agree that English has a category of ‘adjective phrases’ (encoded in our scheme as J), core examples of which would be premodified adjectives (very small, pale green); but what about cases where an adjective is followed by a prepositional phrase which expands its meaning, as in:
they are alike in placing more emphasis ...
– should in placing ... be regarded as a postmodifier within the J whose head is alike, or is alike a one-word J and the in placing ... sequence a sister constituent? There is no one answer to this question which any English linguist would immediately recognize as obviously correct; and, while some answer might ultimately be derived from theoretical studies of grammar, we cannot expect that theoreticians will decide all such questions for us immediately and with a single voice: notoriously, theories differ, theories change, and many of the tree-drawing problems that crop up have never yet been considered by theoretical grammarians. But, for probabilistic approaches to NLP, we must have some definite answer to this and very many other comparable issues. Statistics extracted from a database of parsed sentences in which some cases of adjective + prepositional phrase were grouped as a single constituent and other, linguistically indistinguishable cases were analysed as separate immediate constituents of the sentence node, on a random basis, would be meaningless and useless. Accordingly, much of the work of creating the database involved imposing and documenting decisions with respect to a multitude of such issues; one strove to make the decisions in a linguistically reasonable way, but the overriding principle was that it was more important to have some clearcut body of explicit analytic precedents, and to follow them consistently, than it was that the precedents should always be indisputably ‘correct’.
The body of ‘parsing law’ that resulted was in no sense a generative grammar – it says nothing about what sequences ‘cannot occur’ or are ‘ungrammatical’, which is the distinctive property of a generative grammar – but what it does attempt to do is to lay down explicit rules for bracketing and labelling in a predictable manner any sequence that does occur, so that as far as possible two analysts independently drawing labelled trees for the same novel and perhaps unusual example of English would be forced by the parsing law to draw identical trees.
Although my own motive for undertaking this precedent-setting task had to do with providing a statistical database to be used by probabilistic parsers, a thoroughgoing formal inventory of a language’s resources is important for NLP progress in other ways too. Now that NLP research internationally is moving beyond the preoccupation with artificially-simple invented examples that characterized its early years, there is a need for research groups to be able routinely to exchange quantities of precise and unambiguous information about the contents of a language; but at present this sort of information exchange is hampered in the domain of grammar by the fact that traditional terminology is used in inconsistent and sometimes vague ways. For instance, various English-speaking linguists use the terms ‘complement’, or ‘predicate’, in quite incompatible ways. Other terms, such as ‘noun phrase’, are used much more consistently in the sense that different groups agree on core examples of the term; but traditional descriptive grammars, such as Quirk et al. (1985) and its lesser predecessors, do not see it as part of their task to define clearcut boundaries between terms that would allow borderline cases to be assigned predictably to one category or another. For computational purposes we need sharpness and predictability.
What I am arguing for – I see it as currently the most pressing need in the NLP discipline – is taxonomic research in the grammatical domain that should yield something akin to the Linnaean taxonomy for the biological world. Traditional grammars describe constructions shading into one another, as indeed they do, but the analogous situation in biology did not prevent Linné imposing sharp boundaries between botanical species and genera. Linné said: Natura non facit saltus. Plantae omnes utrinque affinitatem monstrant, uti territorium in mappa geographica (‘Nature does not make leaps. All plants show affinities on either side, like territories on a geographical map’); but Linné imposed boundaries in this apparent continuum, as nineteenth-century European statesmen created colonial boundaries in the map of Africa. The arrangement of species and genera in the Linnaean system was artificial and in some respects actually conflicted with the natural (i.e. theoretically correct) arrangement, and Linné knew this perfectly well – indeed, he spent part of his career producing fragments of a natural taxonomy, as an alternative to his artificial taxonomy; but the artificial system was based on concrete, objective features which made it practical to apply, and because it did not have to wait on the resolution of theoretical puzzles Linné could make it complete. Artificial though the Linnaean system was, it enabled the researcher to locate a definite name for any specimen (and to know that any other botanist in the world would use the same name for that specimen), and it gave him something approaching an exhaustive conspectus of the ‘data elements’ which a more theoretical approach would need to be able to cope with.
If no-one had ever done what Linné did, then Swedish biologists would continually be wondering what British biologists meant (indeed, Lancastrian biologists would be wondering what Cambridge biologists meant) by, say, cuckoo-pint, and whether cuckoo-pint, cuckoo flower, and ragged robin were one plant, two, or three. Since Linné, we all say Arum maculatum and we know what we are talking about. Computational linguistics, I feel, is still operating more or less on the cuckoo-pint standard. First let us do a proper stocktaking of our material, and then we shall have among other things a better basis for theoretical work.
In one area an excellent start has already been made. Stig Johansson’s Tagged LOB Corpus Users’ Manual (Johansson 1986) includes a great deal of detailed boundary-drawing between adjacent wordtags of the LOB tagset. Leech’s and my groups have refined certain aspects of Johansson’s wordclass taxonomy, making more distinctions in areas such as proper names and numerical and technical items, for instance, but we could not have done what we have done except by building on the foundation provided by Johansson’s work; and it is interesting and surprising to note that, although Johansson (1986) was produced for one very specific and limited purpose (to document the tagging decisions in a specific tagged corpus), the book has to my knowledge no precedent in the level of detail with which it specifies the application of wordclass categories. One might have expected that many earlier linguists would have felt the need to define a detailed set of wordclasses with sufficient precision to allow independent analysts to apply them predictably: but apparently the need was not perceived before the creation of analysed versions of electronic corpora.
With respect to grammatical structure above the level of terminal nodes, i.e. the taxonomy of phrases and clauses, nothing comparable to Johansson’s work has been published. I have referred to my own, unpublished, work done in connexion with the Lancaster-Leeds Treebank; and at present this work is being extended under Project SUSANNE, a project sponsored by the Economic & Social Research Council at the University of Leeds and directed by myself as an external consultant, the goal of which is the creation of an analysed English corpus significantly larger than the Lancaster-Leeds Treebank, and analysed with a comparable degree of detail and self-consistency, but in conformity with an analytic scheme that extends beyond the purely ‘surface’ grammatical notations of the Lancaster-Leeds scheme to represent also the ‘underlying’ or logical structure of sentences where this conflicts with surface structure. We want to develop our probabilistic parsing techniques so that they deliver logical as well as surface grammatical analyses, and a prerequisite for this is a database of logically-parsed material.
The SUSANNE Corpus is based on a grammatically-annotated 128,000-word subset of the Brown Corpus created at Gothenburg University in the 1970s by Alvar Ellegård and his students (Ellegård 1978). The solid work already done by Ellegård’s team has enabled my group to aim to produce a database of a size and level of detail that would otherwise have been far beyond our resources. But the Gothenburg product does have limitations (as its creator recognizes); notably, the annotation scheme used, while covering a rather comprehensive spectrum of English grammatical phenomena, is defined in only a few pages of instructions to analysts. As an inevitable consequence, there are inconsistencies and errors in the way it is applied to the 64 texts from four Brown genres represented in the Gothenburg subset.
We are aiming to make the analyses consistent (as well as representing them in a more transparent notation, and adding extra categories of information); but, as a logically prior task, we are also formulating and documenting a much more detailed set of definitions and precedents for applying the categories used in the SUSANNE Corpus. Our strategy is to begin with the ‘surfacy’ Lancaster-Leeds Treebank parsing scheme, which is already well-defined and documented internally within our group, and to add to it new notations representing the deep-grammar matters marked in the Gothenburg files but not in the Treebank, without altering the well-established Lancaster-Leeds Treebank analyses of surface grammar. (For most aspects of logical grammar it proved easier than one might have expected to define notations that diverse theorists should be able to interpret in their own terms.) Thus the outcome of Project SUSANNE will include an analytic scheme in which the surface-parsing standards of the Lancaster-Leeds parsing law are both enriched by a larger body of precedent and also extended by the addition of standards for deep parsing. (Because the Brown Corpus is American, the SUSANNE analytic scheme has also involved broadening the Lancaster-Leeds Treebank scheme to cover American as well as British usage.)
Project SUSANNE is scheduled for completion in January 1992. I am currently discussing with an academic publisher the possibility of publishing its product as a package incorporating the annotated corpus itself, in electronic form, and the analytic scheme to which the annotations conform, as a book. Corpus-builders have traditionally, I think, seen the manuals they write as secondary items playing a supporting role to the corpora themselves. My view is different. If our work on Project SUSANNE has any lasting value, I am convinced that this will stem primarily from its relatively comprehensive and explicitly-defined taxonomy of English grammatical phenomena. Naturally I hope – and believe – that the SUSANNE Corpus too will prove useful in various ways. But, although the SUSANNE Corpus will be some three times the size of the database we have used as a source of grammatical statistics to date, in terms of sheer size I believe that the SUSANNE and other existing analysed corpora described in Sampson (1991) are due soon to be eclipsed by much larger databases being produced in the USA, notably the ‘Penn Treebank’ being created by Mitchell Marcus of the University of Pennsylvania. The significance of the SUSANNE Corpus will lie not in size but in the detail, depth, and explicitness of its analytic scheme. (Marcus’s Treebank uses a wordtag set that is extremely simple relative to that of the Tagged LOB or Brown Corpora – it contains just 36 tags (Santorini 1990); and, as I understand, the Penn Treebank will also involve quite simple and limited indications of higher-level structure, whether because the difficulty of producing richer annotations grows with the size of a corpus, or because Marcus wishes to avoid becoming embroiled in the theoretical controversies that might be entailed by commitment to any richer annotation scheme.) 
Even if we succeed perfectly in the ambitious task of bringing every detail of the annotations in 128,000 words of text into line with the SUSANNE taxonomic principles, one of the most significant long-term roles of the SUSANNE Corpus itself will be as an earnest of the fact that the rules of the published taxonomy have been evolved through application to real-life data rather than chosen speculatively. I hope our SUSANNE work may thus offer the beginnings of a ‘Linnaean taxonomy of the English language’. It will be no more than a beginning; there will certainly be plenty of further work to be done.
How controversial is the general programme of work in corpus linguistics that I have outlined in these pages? To me it seems almost self-evidently reasonable and appropriate, but it is easy to delude oneself on such matters. The truth is that the rise of the corpus-based approach to computational linguistics has not always been welcomed by adherents of the older, compilation-oriented approach; and to some extent my own work seems to be serving as the representative target for those who object to corpus linguistics. (I cannot reasonably resent this, since I myself have stirred a few academic controversies in the past.)
In particular, a series of papers (Taylor, Grover, & Briscoe 1989; Briscoe 1990) have challenged my attempt, discussed above, to demonstrate that individually rare constructions are collectively so common as to render unfeasible the aim of designing a production system to generate ‘all and only’ the constructions which occur in real-life usage in a natural language. My experiment took for granted the relatively surfacy, theoretically middle-of-the-road grammatical analysis scheme that had been evolved over a series of corpus linguistics projects at Lancaster, in Norway, and at Leeds in order to represent the grammar of LOB sentences in a manner that would as far as possible be uncontroversial and accordingly useful to a wide range of researchers. But of course it is true that a simple, concrete theoretical approach which eliminates controversial elements is itself a particular theoretical approach, which the proponents of more abstract theories may see as mistaken. Taylor et al. believe in a much more abstract approach to English grammatical analysis; and they argue that my findings about the incidence of rare constructions are an artefact of my misguided analysis, rather than being inherent in my data. Their preferred theory of English grammar is embodied in the formal generative grammar of the Alvey Natural Language Tools (‘ANLT’) parsing system (for distribution details see note 1 of Briscoe 1990); Taylor et al. use this grammar to reanalyse my data, and they argue that most of the constructions which I counted as low-frequency are generated by high-frequency rules of the ANLT grammar. According to Taylor et al., the ANLT system is strikingly successful at analysing my data-set, accurately parsing as many as 97% of my noun phrase tokens. My use of the theoretically-unenlightened LOB analytic scheme is, for Briscoe (1990), symptomatic of a tendency for corpus linguistics in general to operate as ‘a self-justifying and hermeneutically sealed sub-discipline’.
Several points in these papers seem oddly misleading. Taylor et al. repeatedly describe the LOB analytic scheme as if it were much more a private creation of my own than it actually was, thereby raising in their readers’ minds a natural suspicion that problems such as those described in Sampson (1987) might well stem purely from an idiosyncratic analytic scheme which is possibly ill-defined, ill-judged, and/or fixed so as to help me prove my point. One example relates to the system, used in my 1987 investigation, whereby the detailed set of 132 LOB wordtags is reduced to a coarser classification by grouping certain classes of cognate tags under more general ‘cover tags’. Referring to this system, Taylor et al. comment that ‘Sampson ... does not explain the extent to which he has generalised types in this fashion’; ‘Sampson ... gives no details of this procedure’; Briscoe (1990) adds that an attempt I made to explain the facts to him in correspondence ‘does not shed much light on the generalisations employed ... as Garside et al. (1987) does not give a complete listing of cover-tags’. In fact I had no hand in defining the system of cover tags which was used in my experiment (or in defining the wordtags on which the cover tags were based). The cover tags were defined, in a perfectly precise manner, by a colleague (Geoffrey Leech, as it happens) and were in routine use on research projects directed by Leech at Lancaster in which Lolita Taylor was an active participant. Thus, although it is true that my paper did not give enough detail to allow an outsider to check the nature or origin of the cover-tag system (and outside readers may accordingly have been receptive to the suggestions of Taylor et al. on this point), Taylor herself was well aware of the true situation. She (and, through her, Briscoe) had access to the details independently of my publications, and independently of Garside et al. (1987). 
(They had closer access than I, since I had left Lancaster at the relevant time while Taylor et al. were still there.)
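The cover-tag mechanism at issue is in any case simple to illustrate. In the Python sketch below the particular tag-to-cover-tag mapping is invented for illustration; it is not Leech’s actual scheme (which, as noted, is not listed in full in Garside et al. 1987), though the detailed tags shown are of the general kind found in the LOB tagset.

```python
# Illustrative only: this mapping is my own invention, not the
# cover-tag definitions actually used in the 1987 experiment.
COVER_TAGS = {
    "NN": "N", "NNS": "N", "NN$": "N",              # common nouns
    "NP": "R", "NPS": "R",                          # proper nouns
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V",  # verb forms
}

def cover(tag):
    # Map a detailed wordtag to its coarser cover tag; tags outside
    # any grouping simply stand for themselves.
    return COVER_TAGS.get(tag, tag)

tagged = [("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
print([cover(tag) for word, tag in tagged])   # ['N', 'V', 'RB']
```

The point of such a mapping is to collapse cognate tags so that constructions differing only in, say, noun number are counted as the same structural type.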
Then, although Taylor et al. (1989) and Briscoe (1990) claim that the ANLT grammar is very successful at parsing the range of noun phrase structures on which my research was based, the status of this claim is open to question in view of the fact that the grammar was tested only manually. The ANLT grammar was created as part of an automatic parsing system, and Taylor et al. say that they tried to check the data using the automatic parser but had to give up the attempt: sometimes parses failed not because of inadequacies in the grammar but because of ‘resource limitations’, and sometimes so many alternative parses were generated that it was impractical to check whether these included the correct analysis. But anyone with experience of highly complex formal systems knows that it is not easy to check their implications manually. Even the most painstakingly designed computer programs turn out to behave differently in practice from what their creators intend and expect; and likewise the only satisfactory way to test whether a parser accepts an input is to run the parser over the input automatically. Although much of the text of Briscoe (1990) is word for word identical with Taylor et al. (1989), Briscoe suppresses the paragraphs explaining that the checking was done manually, saying simply that examples were ‘parsed using the ANLT grammar. Further details of this process ... can be found in Taylor et al. (1989)’ (the latter publication being relatively inaccessible).
I was particularly surprised by the success rate claimed for the ANLT grammar in view of my own experience with this particular system. It happens that I was recently commissioned by a commercial client to develop assessment criteria for automatic parsers and to apply them to a range of systems; the ANLT parser was one of those I tested (using automatic rather than manual techniques), and its performance was strikingly poor both absolutely and by comparison with its leading competitor, SRI International’s Core Language Engine (‘CLE’: Alshawi et al. 1989). I encountered no ‘resource limitation’ problems – the ANLT system either found one or more analyses for an input or else finished processing the input with an explicit statement that no analyses were found; but the latter message repeatedly occurred in response to inputs that were very simple and unquestionably well-formed. Sentences such as Can you suggest an alternative?, Are any of the waiters students?, and Which college is the oldest? proved unparsable. (For application-related reasons my test-set consisted mainly of questions. I cite the examples here in normal orthography, though the ANLT system requires the orthography of its inputs to be simplified in various ways: e.g. capitals must be replaced by lower-case letters, and punctuation marks eliminated.) I did not systematically examine performance on the data-set of Sampson (1987), which was not relevant to the commission I was undertaking, but the grammar had features which appeared to imply limited performance on realistically complex noun phrase structures. The only form of personal name that seemed acceptable to the system was a one-word Christian name: the lexical coding system had no category more precise than ‘proper name’. 
Often I could find no way to reconcile a standard form of real-life English proper name with the orthographic limitations imposed by the ANLT system on its inputs – I tried submitting the very standard type of sovereign’s name King John VIII in each of the forms:
king john viii
king john 8
king john eight
king john the eighth
but each version led to parsing failure.
It is true that the ANLT system tested by me was ‘Release 1’, dated November 1987, while Taylor et al. (1989) discuss also a ‘2nd release’ dated 1989. But the purely manual testing described by Taylor et al. seems to me insufficient evidence to overcome the a priori implausibility of such a dramatic performance improvement between 1987 and 1989 versions as their and my findings jointly imply.
A problem in any theoretically-abstract analytic approach is that depth of vision tends to be bought at the cost of a narrow focus, which overlooks part of the richness and diversity present in the data. Taylor et al. are open about one respect in which this is true of their approach to natural language parsing: in reanalysing my data they stripped out all punctuation marks occurring within the noun phrases, because ‘we do not regard punctuation as a syntactic phenomenon’. That is, the range of constructions on which the ANLT parsing system is claimed to perform well is not the noun phrases of a 40,000-word sample of written English, but the noun phrases of a sample of an artificial language derived by eliminating punctuation marks from written English. With respect to my data-set this is quite a significant simplification, because more than a tenth of the vocabulary of symbols used to define the noun phrase structures consists of cover tags for punctuation marks. Of course, where the ANLT system does yield the right analysis for an input it is in one sense all the more admirable if this is achieved without exploiting the cues offered by punctuation. But on the other hand punctuation is crucial to many of the constructions which I have discussed above as needing more attention than they have received from the computational linguistics of the past. A Harvard-style bibliographical reference, for instance, as in Smith (1985: 32) writes ..., is largely defined by its use of brackets and colon. It would be unfortunate to adopt a theory which forced one to ignore an aspect of the English language as significant as punctuation, and I do not understand Taylor et al.’s attempt to justify this by denying that punctuation is a ‘syntactic phenomenon’: punctuation is physically there, as much part of the written language as the alphabetic words are, and with as much right to be dealt with by systems for automatically processing written language.
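The dependence of such constructions on punctuation is easy to demonstrate. The Python fragment below is a sketch of my own; the pattern is a rough approximation of the Harvard citation shape, not a definitive grammar of it. It shows that a Harvard-style reference is recognizable essentially through its brackets and colon, and ceases to be recognizable once those marks are stripped away.

```python
import re

# Rough approximation of a Harvard-style reference: a capitalized
# surname followed by a parenthesized year, colon, and page number.
HARVARD = re.compile(r"[A-Z][a-z]+ \(\d{4}: ?\d+\)")

text = "Smith (1985: 32) writes that ..."
# Simulate the Taylor et al. move: delete the punctuation marks.
stripped = text.replace("(", "").replace(")", "").replace(":", "")

print(bool(HARVARD.search(text)))      # True: brackets and colon intact
print(bool(HARVARD.search(stripped)))  # False: construction lost
```

Once the brackets and colon are gone, nothing distinguishes the reference from an arbitrary name followed by two numbers.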
I do not believe that the choice of a concrete rather than abstract intellectual framework, which allows researchers to remain open to such phenomena, can reasonably be described as retreat into ‘a self-justifying and hermeneutically sealed sub-discipline’.
The most serious flaw in Taylor et al.’s paper is that they misunderstand the nature of the problem raised in Sampson (1987). According to Taylor et al., I assumed that in a generative grammar each distinct noun phrase type ‘will be associated with one rule’, and I argued that ‘any parsing system based on generative rules will need a large or open-ended set of spurious “rules” which ... only apply once’; Taylor et al. point out, rightly, that a typical generative grammar will generate many of the individual constructions in my data-set through more than one rule-application, and consequently a relatively small set of rules can between them generate a relatively large range of constructions. But corpus linguists are not as ignorant of alternative approaches to linguistic analysis as Taylor et al. suppose. I had explicitly tried in my 1987 paper to eliminate the possibility of misunderstandings such as theirs by writing: ‘the problem is not that the number of distinct noun phrase types is very large. A generative grammar can define a large (indeed, infinitely large) number of alternative expansions for a symbol by means of a small number of rules.’ As I went on to say, the real problem lies in knowing which expansions should and which should not be generated. If extremely rare constructions cannot be ignored because they are collectively frequent enough to represent an important part of a language, then it is not clear how we could ever hope to establish the class of constructions, all of the (perhaps infinitely numerous) members of which and only the members of which should be generated by a generative grammar – even though, if such a class could be established, it may be that a generative grammar could define it using a finite number of rules. Briscoe (1990, note 3) comments on a version of this point which I made in a letter prompted by the publication of Taylor et al. (1989), but in terms which suggest that he has not yet understood it. 
According to Briscoe, I ‘impl[y] that we should declare rare types ungrammatical, by fiat, and not attempt to write rules for them’. I have written nothing carrying this implication.
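The distinction between compactness of a rule set and correctness of the class it defines can be made concrete with a toy grammar. The four rules below are entirely my own invention for illustration; they show how a handful of recursive rules already yields dozens of expansions within a small depth bound (and infinitely many without one). Compactness was never the issue: the issue is knowing which of the generable expansions a grammar ought to license.

```python
# A toy context-free grammar: three NP rules plus one PP rule.
# Symbols not listed as rule left-hand sides count as terminals.
RULES = {
    "NP": [["Det", "N"], ["Det", "N", "PP"], ["NP", "and", "NP"]],
    "PP": [["P", "NP"]],
}

def expand(cat, depth):
    # Enumerate the terminal-symbol strings derivable from cat within
    # a recursion depth bound, one string per derivation.
    if cat not in RULES:
        return [[cat]]
    if depth == 0:
        return []
    results = []
    for rhs in RULES[cat]:
        seqs = [[]]
        for sym in rhs:
            seqs = [s + t for s in seqs for t in expand(sym, depth - 1)]
        results.extend(seqs)
    return results

print(len(RULES["NP"]), "NP rules")               # 3 NP rules
print(len(expand("NP", 4)), "NP derivations")     # 39 derivations at depth 4
```

A generative grammar of this kind defines a large class cheaply; the question raised in my 1987 paper is how one could ever establish, from observed usage, exactly which class it should be.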
Taylor et al. examine the residue of noun phrases in my data-set which they accept that the ANLT grammar cannot deal with, and they suggest various ways in which the ANLT rule-set might be extended to cope with such cases. Their suggestions are sensible, and it may well be that adopting them would improve the system’s performance. My suspicion, though, is that with a real-life language there will be no end to this process. When one looks carefully to see where a rainbow meets the ground, it often looks easy to reach that spot; but we know that, having done so, one is no closer to the rainbow. I believe the task of producing an observationally adequate definition of usage in a natural language is like that. That is why I personally prefer to work on approaches to automatic parsing that do not incorporate any distinction between grammatical/well-formed/legal and ungrammatical/ill-formed/illegal.
But let me not seem to claim too much. The compilation model for language processing has real virtues: in particular, when the compilation technique works at all it is far more efficient, in terms of quantity of processing required, than a stochastic optimizing technique. In domains involving restricted, relatively well-behaved input language, the compilation model may be the only one worth considering; and it seems likely that as NLP applications multiply there will be such domains – it is clear, for instance, that language consciously addressed by humans to machines tends spontaneously to adapt to the perceived limitations of the machines. And even in the case of unrestricted text or speech I am certainly not saying that my probabilistic APRIL system is superior to the ANLT system. To be truthful, at present neither of these systems is very good. Indeed, I would go further: it is difficult to rank natural language parsers on a single scale, because they differ on several incommensurable parameters, but if I had to select one general-purpose English-language parsing system as best overall among those now existing, I would vote for the CLE – which is a compiler-like rather than probabilistic parser. The CLE too leaves a great deal to be desired, and the probabilistic approach is so new that I personally feel optimistic about the possibility that in due course it may overtake the compilation approach, at least in domains requiring robust performance with unrestricted inputs; but at present this is a purely hypothetical forecast.
What I do strongly believe is that there is a great deal of important natural-language grammar, often related to cultural rather than logical matters, over and beyond the range of logic-related core constructions on which theoretical linguists commonly focus; and that it will be very regrettable if the discipline as a whole espouses abstract theories which prevent those other phenomena being noticed. How far would botany or zoology have advanced, if communication among researchers had been hampered because no generally-agreed comprehensive taxonomy and nomenclature could be established pending final resolution of species relationships through comparison of amino-acid sequences? We need a systematic, formal stocktaking of everything in our languages; this will help theoretical analysis to advance, rather than get in its way, and it can be achieved only through the compilation and investigation of corpora.
Aarts, E. and J. Korst. 1989. Simulated Annealing and Boltzmann Machines. Chichester: Wiley.
Alshawi, H. et al. 1989. Research Programme in Natural Language Processing: Final Report. Prepared by SRI International for the Information Engineering Directorate Natural Language Processing Club. (Alvey Project no. ALV/PRJ/IKBS/105, SRI Project no. 2989.)
Briscoe, E.J. 1990. ‘English noun phrases are regular: a reply to Professor Sampson’. In J. Aarts and W. Meijs, eds., Theory and Practice in Corpus Linguistics. Amsterdam: Rodopi.
Chomsky, A.N. 1957. Syntactic Structures. The Hague: Mouton.
–. 1980. Rules and Representations. Oxford: Blackwell.
Ellegård, A. 1978. The Syntactic Structure of English Texts. Gothenburg Studies in English, 43.
Fillmore, C.J., et al. 1988. ‘Regularity and idiomaticity in grammatical constructions’. Language 64.501-538.
Garside, R.G., et al., eds. 1987. The Computational Analysis of English. London: Longman.
Johansson, S. 1986. The Tagged LOB Corpus Users’ Manual. Bergen: Norwegian Computing Centre for the Humanities.
Kirkpatrick, S. et al. 1983. ‘Optimization by simulated annealing’. Science 220.671-80.
Quirk, R. et al. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Sampson, G.R. 1980. Making Sense. Oxford: Oxford University Press.
–. 1987. ‘Evidence against the “grammatical”/“ungrammatical” distinction’. In W. Meijs, ed., Corpus Linguistics and Beyond. Amsterdam: Rodopi.
–. 1989. ‘Language acquisition: growth or learning?’ Philosophical Papers 18.203-240.
–. 1991. ‘Analysed corpora of English: a consumer guide’. In Martha Pennington and V. Stevens, eds., Computers in Applied Linguistics. Clevedon, Avon: Multilingual Matters.
– et al. 1989. ‘Natural language analysis by stochastic optimization: a progress report on Project APRIL’. Journal of Experimental and Theoretical Artificial Intelligence 1.271-87.
Santorini, Beatrice. 1990. Annotation Manual for the Penn Treebank Project (preliminary draft dated 28.3.1990). University of Pennsylvania.
Taylor, Lolita, et al. 1989. ‘The syntactic regularity of English noun phrases’. In Proceedings of the Fourth Annual Meeting of the European Chapter of the Association for Computational Linguistics, University of Manchester Institute of Science and Technology.
In fact these two concepts were not independent developments: the early work of Chomsky and some of his collaborators, such as M.P. Schützenberger, lay at the root of formal language theory within computer science, as well as of linguistic theory – though few of the linguists who became preoccupied with generative grammars in the 1960s and 1970s had any inkling of the role played by the equivalent concept in computer science.
‘APRIL’ stands for ‘Annealing Parser for Realistic Input Language’. The current phase of Project APRIL is funded under Ministry of Defence contract no. D/ER1/9/4/2062/151.
‘SUSANNE’ stands for ‘Surface and Underlying Structural Analyses of Natural English’. Project SUSANNE is funded under ESRC grant no. R00023 1142/3.
Taylor et al. quote the percentage to two places of decimals, but I hardly imagine their claim is intended to be so precise.
The name in the test data was King Henry VIII, but it happened that the name Henry was not in the ANLT dictionary and therefore I made the test fair by substituting a name that was.
For my expository purposes it is an awkward complication that the house style of the present publication incorporates an unusual variant of the Harvard system which substitutes comma for colon.