The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.

Published in J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Mouton de Gruyter (Berlin), 1992.
Probabilistic Parsing
Geoffrey Sampson
Bentham, Yorkshire
We want to give computers the ability to process human languages. But computers use systems of their own
which are also called ‘languages’, and which share at least some features with
human languages; and we know how computers succeed in processing computer
languages, since it is humans who have arranged for them to do so. Inevitably there is a temptation to see
the automatic processing of computer languages as a precedent or model for the
automatic processing (and perhaps even for the human processing) of human
languages. In some cases the
precedent may be useful, but clearly we cannot just assume that human languages
are similar to computer languages in all relevant ways. In the area of grammatical parsing of
human languages, which seems to be acknowledged by common consent as the
central problem of natural language processing – ‘NLP’ – at the present time, I
believe the computer-language precedent may have misled us. One of the ideas underlying my work is
that human languages, as grammatical systems, may be too different from
computer languages for it to be appropriate to use the same approaches to
automatic parsing.
Although the average computer scientist would probably think of natural-language
parsing as a somewhat esoteric task, automatic parsing of computer programming
languages such as C or Pop-11 is one of the most fundamental computing
operations; before a program written in a user-oriented programming language
such as these can be run it must be ‘compiled’ into machine code – that is,
automatically translated into a very different, ‘low level’ programming
language – and compilation depends on extracting the grammatical structure by
virtue of which the C or Pop-11 program is well-formed. To construct a compiler capable of
doing this, one begins from a ‘production system’ (i.e. a set of rules) which
defines the class of well-formed programs in the relevant user-oriented
language. In fact there exist
software systems called ‘compiler-compilers’ or ‘parser generators’ which
accept as input a production system for a language and automatically yield as
output a parser for the language.
To the computer scientist it is self-evident that parsing is based on
rules for well-formedness in a language.
If one seeks to apply this concept to natural languages, an obvious
question is whether rules of well-formedness can possibly be as central for
processing natural languages, which have grown by unplanned evolution and
accretion over many generations, as they are for processing formal programming
languages, which are rule-governed by stipulation. What counts as a valid C program is fixed by Brian Kernighan
and Dennis Ritchie – or, now, by the relevant American National Standards
Institute committee – and programmers are expected to learn the rules and keep
within them. If a programmer
inadvertently tries to extend the language by producing a program that violates
some detail (perhaps a very minor detail) of the ANSI rules, it is quite all
right for the compiler software to reject the program outright. In the case of speakers of natural
languages it is not intuitively obvious that their skill revolves in a similar
way round a set of rules defining what is well-formed in their mother
tongue. It is true that I
sometimes hear children or foreigners producing English utterances that sound a
little odd, but it seems to me that (for me and for other people) the immediate
response is to understand the utterance, and noticing its oddity is a secondary
phenomenon if it occurs at all. It
is not normally a case, as the compiler model might suggest, of initially
hearing the utterance as gibberish and then resorting to special mental
processes to extract meaning from it nevertheless.
These points seem fairly uncontroversial, and if the period when
linguistics and computer science first became heavily involved with one another
had been other than when it was (namely about 1980) they might have led to
widespread scepticism among linguists about the applicability of the compiler
model to natural language parsing.
But intellectual trends within linguistics at that time happened to
dovetail neatly with the compiler model.
The 1970s had been the high point of Noam Chomsky’s intellectual
dominance of linguistics – the period when university linguistics departments
routinely treated ‘generative grammar’ as the centrepiece of their first-year
linguistics courses, as the doctrine of the phoneme had been twenty years
earlier – and, for Chomsky, the leading task of linguistics was to
formulate a rigorous definition specifying the range of well-formed sentences
for a natural language. Chomsky’s
first book began: ‘... The fundamental aim in the linguistic
analysis of a language L is to separate the grammatical sequences which are the sentences of L
from the ungrammatical
sequences which are not sentences of L ...’ (Chomsky 1957, 13).
Chomsky’s reason for treating this aim as fundamental did not have to do
with automatic parsing, which was not a topic that concerned him. Chomsky has often, with some
justification, contradicted writers who suggested that his kind of linguistics
is preoccupied with linguistic automation, or that he believes the mind ‘works
like a computer’. In the context
of Chomsky’s thought, the reason to construct grammars which generate ‘all and
only’ the grammatical sentences of different natural languages was his belief
that these grammars turn out to have various highly specific and unpredicted
formal properties, which do not differ from one natural language to another,
and that these universals of natural grammar are a proof of the fact that (as
Chomsky later put it) ‘we do not really learn language; rather, grammar grows
in the mind’ (Chomsky 1980, 134) – and that, more generally, an adult’s mental
world is not the result of his individual intellectual inventiveness responding
to his environment but rather, like his anatomy, is predetermined in much of
its detail by his genetic inheritance.
For Chomsky formal linguistics was a branch of psychology, not of
computer science.
This idea of Chomsky’s that linguistic structure offers evidence for
genetic determination of our cognitive life seems when the arguments are
examined carefully to be quite mistaken (cf. Sampson 1980; 1989). But it was influential for many years;
and its significance for present purposes is that, when linguists and computer
scientists began talking to one another, it led the linguists to agree with the
computer scientists that the important thing to do with a natural language was
to design a production system for it.
What linguists call a ‘generative grammar’ is what computer scientists
call a ‘production system.’[1]
Arguably, indeed, the rise of NLP gave generative grammar a new lease of
life within linguistics. About
1980 it seemed to be losing ground in terms of perceived centrality in the
discipline to less formal, more socially-oriented trends, but during the
subsequent decade there was a striking revival of formal grammar theory.
Between them, then, these mutually-reinforcing traditions made it seem
inevitable that the way to produce parsers for natural languages was to define
generative grammars for them and to derive parsers from the generative grammars
as compilers are derived from computer-language production systems. Nobody thought the task would be
easy: a natural language grammar
would clearly be much larger than the grammar for a computer language, which is
designed to be simple, and Chomsky’s own research had suggested that
natural-language generative grammars were also formally of a different, less
computationally tractable type than the ‘context-free’ grammars which are
normally adequate for programming languages. But these points do not challenge the principle of parsing
as compilation, though they explain why immediate success cannot reasonably be
expected.
I do not want to claim that the compiler model for natural language
parsing is necessarily wrong; but the head start which this model began with,
partly for reasons of historical accident, has led alternative models to be
unjustifiably overlooked. Both
Geoffrey Leech’s group and mine – but as yet few other
computational-linguistics research groups, so far as I know – find it more
natural to approach the problem of automatically analysing the grammatically rich
and unpredictable material contained in corpora of real-life usage via
statistical techniques somewhat akin to those commonly applied in the field of
visual pattern recognition, rather than via the logical techniques of
formal-language compilation.
To my mind there are two problems about using the compiler model for
parsing the sort of language found in resources such as the LOB Corpus. The first is that such resources
contain a far greater wealth of grammatical phenomena than standard generative
grammars take into account. Since
the LOB Corpus represents written English, one obvious example is punctuation
marks: English and other European
languages use a variety of these, they have their own quite specific
grammatical properties, but textbooks of linguistic theory never in my
experience comment on how punctuation marks are to be accounted for in
grammatical analyses. This is just
one example: there are many, many
more. Personal names have their
own complex grammar – in English we have titles such as Mrs, Dr which can introduce a name, and titles such as Ph.D., Bart which can conclude one; Christian names can be
represented by initials but surnames in most contexts cannot; and so on and so
forth – yet the textbooks often gloss all this over by offering rules which
rewrite ‘NP’ as ‘ProperName’ and ‘ProperName’ simply as John, Mary, ....
Addresses have internal structure which is a good deal more complex than
that of personal names (and highly language-specific – English addresses
proceed from small to large, e.g. house name, street, town, county, while
several European languages place the town before the smaller units, for
instance), but they rarely figure in linguistics textbooks at all. Money sums (note the characteristic
grammar of English £2.50
v. Portuguese 2$50),
weights and measures, headlines and captions, and many further items are part
of the warp and weft of real-life language, yet are virtually invisible in
standard generative grammars.
In one sense that is justifiable.
We have seen that the original motivation for generative linguistic
research had to do with locating possible genetically-determined cognitive
mechanisms, and if such mechanisms existed one can agree that they would be
more likely to relate to conceptually general areas of grammar, such as
question-formation, than to manifestly culture-bound areas such as the grammar
of money or postal addresses. But
considerations of that kind have no relevance for NLP as a branch of practical technology. If computers are to deal with human
language, we need them to deal with addresses as much as we need them to deal
with questions.
The examples I have just quoted are all characteristic more of written
than spoken language; my corpus experience has until recently been mainly with
the LOB and Brown Corpora of written English. But my group has recently begun to turn its attention to
automatic parsing of spoken English, working with the London-Lund Corpus, and
it is already quite clear that this too involves many frequent phenomena which
play no part in standard generative grammars. Perhaps the most salient is so-called ‘speech repairs’,
whereby a speaker who notices himself going wrong backtracks and edits his
utterance on the fly. Standard
generative grammars would explicitly exclude speech repairs as ‘performance
deviations’, and again for theoretical linguistics as a branch of cognitive
psychology this may be a reasonable strategy; but speech repairs occur, they
fall into characteristic patterns, and practical automatic speech-understanding
systems will need to be able to analyse them.
Furthermore, even in the areas of grammar which are common to writing
and speech and which linguists would see as part of what a language description
ought (at least ideally) to cover, there is a vast amount to be done in terms
of listing and classifying the phenomena that occur. Many constructions are omitted from theoretical descriptions
not for reasons of principle but because they are not very frequent and/or do
not seem to interact in theoretically-interesting ways with central aspects of
grammar, and although they may be mentioned in traditional descriptive grammars
they are not systematically assigned places in explicit inventories of the
resources of the language. One
example among very many might be the English the more ... the more ... construction discussed by Fillmore et al.
(1988), an article which makes some of the same points I am trying to make
about the tendency for much of a language’s structure to be overlooked by the
linguist.
All this is to say, then, that there is far more to a natural language
than generative linguistics has traditionally recognized. That does not imply that comprehensive
generative grammars cannot be written, but it does mean that the task remains
to be done. There is no use hoping
that one can lift a grammar out of a standard generative-linguistic definition
of a natural language and use it with a few modifications as the basis of an
adequate parser.
But the second problem, which leads me to wonder whether reasonably
comprehensive generative grammars for real-life languages are attainable even
in principle, is the somewhat anarchic quality of much of the language one
finds in resources such as LOB. If
it is correct to describe linguistic behaviour as rule-governed, this is much
more like the sense in which car-drivers’ behaviour is governed by the Highway
Code than the sense in which the behaviour of material objects is governed by
the laws of physics, which can never be violated. When writing carefully for publication, we do stick to most
of the rules, and with a police car behind him an Englishman keeps to 30 m.p.h.
in a built-up area. But any rule
can be broken on occasion. If a
tree has fallen on the left side of the road, then common sense overrides the
Highway Code and we drive cautiously round on the right. With no police near, ‘30 m.p.h.’ is
interpreted as ‘not much over 40’.
So it seems to be with language.
To re-use an example that I have quoted elsewhere (Garside et al. 1987,
19): a rule of English that one
might have thought rock-solid is that the subject of a finite clause cannot
consist wholly of a reflexive pronoun, yet LOB contains the following sentence,
from a current-affairs magazine article by Bertrand Russell:
Each side proceeds on the assumption that itself
loves peace, but the other side consists of warmongers.
Itself served
better than plain it
to carry the contrast with the other side, so the grammatical rule gives way to the need for a
persuasive rhetorical effect. A
tree has blocked the left-hand lane, so the writer drives round on the right
and is allowed to do so, even though the New Statesman’s copy-editor is behind him with a blue
light on his roof. In this case
the grammatical deviation, though quite specific, is subtle; in other cases it
can be much more gross. Ten or
fifteen years ago I am sure we would all have agreed about the utter
grammatical impossibility of the sentence:
*Best before see base of can.
But any theory which treated it as impossible today would have to contend
with the fact that this has become one of the highest-frequency sentences of
written British English.
Formal languages can be perfectly rule-governed by stipulation; it is
acceptable for a compiler to reject a C program containing a misplaced
comma. But with a natural
language, either the rules which apply are not complete enough to specify what
is possible and what is not possible in many cases, or if there is a complete
set of rules then language-users are quite prepared to break them. I am not sure which of these better
describes the situation, but, either way, a worthwhile NLP system has to apply
to language as it is actually used:
we do not want it to keep rejecting authentic inputs as ‘ill-formed’.
The conclusion I draw from observations like these is that, if I had to
construct a generative grammar covering everything in the LOB Corpus in order
to derive a system capable of automatically analysing LOB examples and others
like them, the job would be unending.
Rules would have to be multiplied far beyond the number found in the
completest existing formal linguistic descriptions, and as the task of
rule-writing proceeded one would increasingly find oneself trying to make
definite and precise statements about matters that are inherently vague and
fluid.
In a paper to an ICAME conference (Sampson 1987) I used concrete
numerical evidence in order to turn this negative conclusion into something
more solid than a personal wail of despair. I looked at statistics on the diversity of grammatical
constructions found in the ‘Lancaster-Leeds Treebank’, a ca 40,000-word subset
of the LOB Corpus which I had parsed manually, in collaboration with Geoffrey
Leech and his team, in order to create a database (described in Garside et al.
1987, chap. 7, and Sampson 1991) to be exploited for our joint NLP
activities. I had drawn labelled
trees representing the surface grammatical structures of the sentences, using a
set of grammatical categories that were chosen to be maximally uncontroversial
and in conformity with the linguistic consensus, and taking great pains to
ensure that decisions about constituent boundaries and category membership were
consistent with one another across the database, but imposing no prior
assumptions about what configurations of grammatical categories can and cannot
occur in English. In Sampson
(1987) I took the highest-frequency grammatical category (the noun phrase) and
looked at the numbers of different types of noun phrase in the data, where a
‘type’ of noun phrase is a particular sequence of one or more daughter
categories immediately dominated by a noun phrase node. Types were classified using a very
coarse vocabulary of just 47 labels for daughter nodes (14 phrase and clause
classes, 28 word-classes, and five classes of punctuation mark), omitting many
finer subclassifications that are included in the Treebank. There were 8328 noun phrase tokens in
my data set, which between them represented 747 different types, but the
frequencies of the individual types varied greatly:
the commonest single type (determiner followed by singular noun)
accounted for about 14% of all noun phrase tokens, while many different types
were represented by one token each.
The particularly interesting finding emerged when I considered figures
on the proportion of all noun phrase tokens belonging to types of not more than
a set frequency in the data, and plotted a graph showing the proportion p as a function of the threshold
type-frequency f (with
f expressed as a
fraction of the frequency of the commonest type, so that p = 1 when f = 1).
The 58 points for different observed frequencies fell beautifully close
to the power-law curve p = f^0.4. As the
fraction f falls, f^0.4 falls much more slowly: as we consider increasingly
low-frequency constructions, the number of different constructions occurring at
such frequencies keeps multiplying in a smoothly predictable fashion so that quite
sizeable proportions of the data are accounted for even by constructions of the
lowest frequencies. (More than 5%
of the noun phrase tokens in my data set represented constructions which each
occurred just once.) If this
regular relationship were maintained in larger samples of data (this is
admittedly a big ‘if’ – as yet there simply do not exist carefully-analysed
language samples large enough to allow the question to be checked), it would
imply that even extremely rare constructions would collectively be reasonably
frequent. One in a thousand noun
phrase tokens, for instance, would represent some noun phrase type occurring
not more than once in a thousand million words. Yet how could one hope to design a grammar that generates
‘all and only’ the correct set of constructions, if ascertaining the set of
constructions to be generated requires one to monitor samples of that size?
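To make the arithmetic of this relationship concrete, here is a minimal sketch in Python. The token list and daughter-label strings are invented for illustration and are not drawn from the Treebank itself; with real counts one would compare the observed proportion against f^0.4, whereas the toy sample here will not, of course, reproduce the curve.

```python
from collections import Counter

def proportion_at_or_below(np_expansions, threshold_fraction):
    """Proportion of noun phrase tokens whose type frequency is no more than
    threshold_fraction times the frequency of the commonest type."""
    type_counts = Counter(np_expansions)               # tokens per type
    commonest = max(type_counts.values())
    cutoff = threshold_fraction * commonest
    covered = sum(count for count in type_counts.values() if count <= cutoff)
    return covered / len(np_expansions)

# Hypothetical token list: each item is the daughter-label sequence of one
# noun phrase token, e.g. "AT NN" for determiner followed by singular noun.
sample = (["AT NN"] * 50 + ["AT JJ NN"] * 20 + ["PP"] * 10 +
          ["NN NN"] * 5 + ["AT NN P"] * 2 + ["JJ NN Fr"])
for f in (1.0, 0.5, 0.1, 0.02):
    print(f, proportion_at_or_below(sample, f), f ** 0.4)  # observed p versus f^0.4
```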
Accordingly, our approach to automatic parsing avoids any use of the
concept of well-formedness. In
fitting a labelled tree to an input word-string, our system simply asks ‘What
labelled tree over these words comes closest to being representative of the
configurations in our database of parsed sentences?’ The system does not go on to ask ‘Is that a grammatically
“legal” tree?’ – in our framework this question has no meaning.
This general concept of parsing as maximizing conformity with
statistical norms is, I think, common to the work of Geoffrey Leech’s team at
Lancaster and my Project APRIL, sponsored by the Royal Signals & Radar
Establishment and housed at the University of Leeds, under the direction of
Robin Haigh since my departure from the academic profession.[2]
There are considerable differences between the deterministic techniques
used by Leech’s team (see e.g. Garside et al. 1987, chap. 6) and the stochastic APRIL approach, and I can
describe only the latter; but although the APRIL technique of parsing by
stochastic optimization was an invention of my own, I make no claim to pioneer
status with respect to the general concept of probabilistic parsing – this I
borrowed from the Leech team.
The APRIL system is described for instance in Sampson et al. (1989). In broad outline the system works like
this. We assume that the desired
analysis for any input string is always going to be a tree structure with
labels drawn from an agreed vocabulary of grammatical categories. For any particular input, say w words in length, the range of solutions
available to be considered in principle is simply the class of all distinct
tree-structures having w
terminal nodes and having labels drawn from the agreed vocabulary on the
nonterminal nodes. The root node
of a tree is required to have a specified ‘sentence’ label, but apart from that
any label can occur on any nonterminal node: a complex, many-layered tree over a long sentence in which
every single node between root and ‘leaves’ is labelled ‘prepositional phrase’,
say, would in APRIL terms not be an ‘illegal/ill-formed/ungrammatical’ tree, it
would just be a quite poor tree in the sense that it would not look much like
any of the trees in the Treebank database.
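As a way of making this idea concrete, the following is a minimal sketch (with invented label names, and not APRIL's actual data structure) of what counts as a candidate analysis: the only requirements are the right sequence of leaves, a root carrying the sentence label, and nonterminal labels drawn from the agreed vocabulary, with no test of grammaticality anywhere.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Node:
    label: str
    children: List[Union["Node", str]]    # daughters are subtrees or word tokens

def leaves(node: Node) -> List[str]:
    """The terminal word string covered by a (sub)tree, left to right."""
    out: List[str] = []
    for child in node.children:
        out.extend(leaves(child) if isinstance(child, Node) else [child])
    return out

def in_solution_space(tree: Node, words: List[str], vocabulary: set) -> bool:
    """A candidate analysis needs only the right leaves, a root labelled 'S',
    and nonterminal labels from the agreed vocabulary: no well-formedness test."""
    def labels_ok(node: Node) -> bool:
        return node.label in vocabulary and all(
            labels_ok(child) for child in node.children
            if isinstance(child, Node))
    return (tree.label == "S"
            and leaves(tree) == list(words)
            and labels_ok(tree))
```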
Parsing proceeds by searching the massive logical space of distinct
labelled trees to find the best.
There are essentially two problems: how is the ‘goodness’ of a labelled tree measured, and how
is the particular tree that maximizes this measure located (given that there
will be far too many alternative solutions in the solution-space for each to be
checked systematically)?
The answer to the first question is that individual nodes of a tree are
assigned figures of merit by reference to probabilistic transition
networks. Suppose some node in a
tree to be evaluated is labelled X, and has a sequence of daughters labelled
P Q R. This might be a
sequence which would commonly be found below an X node in correctly-parsed
sentences (if X is ‘noun phrase’, P Q R might be respectively
‘definite article’, ‘singular noun’, ‘relative clause’, say), or it might be
some absurd expansion for X (say, ‘comma’, ‘prepositional phrase’, ‘adverbial
clause’), or it might be something in between. For the label X (and for every other label in the agreed
vocabulary) the system has a transition network which – ignoring certain
complications for ease of exposition – includes a path (designed manually) for
each of the sequences commonly found below that label in accurately parsed
material, to which skip arcs and loop arcs have been added automatically in
such a fashion that any label-string whatever, of any length, corresponds to
some route through the network.
(Any particular label on a high-frequency path can be bypassed via a
skip arc, and any extra label can be accepted at any point via a loop
arc.) The way the network is
designed ensures that (again omitting some complications) it is deterministic –
whatever label sequence may be found below an X in a tree, there will be one
and only one route by which the X network can accept that sequence. Probabilities are assigned to the arcs
of the networks for X and the other labels in the vocabulary by driving the
trees of the database over them, which will tend to result in arcs on the
manually-designed routes being assigned relatively high probabilities and the
automatically-added skip and loop arcs being assigned relatively low
probabilities. (One might compare
the distinction between the manually-designed high-frequency routes and the
routes using automatically-added skip or loop arcs to Chomsky’s distinction
between ideal linguistic ‘competence’ and deviant ‘performance’ – though this
comparison could not be pressed very far:
the range of constructions accepted by the manually-designed parts of
the APRIL networks alone would not be equated, by us or by anyone else, with
the class of ‘competent/well-formed’ constructions.) Then, in essence, the figure of merit assigned to any
labelled tree is the product of the probabilities associated with the arcs
traversed when the daughter-strings of the various nodes of the tree are
accepted by the networks.
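By way of illustration, here is a much-simplified sketch of that evaluation. All state names, category labels and probabilities below are invented; the real APRIL networks are far larger, and their skip and loop arcs are added and trained automatically rather than being lumped into single constants as here.

```python
# One heavily simplified network, for the mother label 'N' (noun phrase).
# Each state maps a daughter label to a manually-designed arc (next_state, prob);
# a label with no such arc is taken by a low-probability loop arc that stays in
# the same state, so every daughter sequence is accepted by exactly one route.
NETWORKS = {
    "N": {
        "start": {"AT": ("det", 0.60), "NN": ("head", 0.20)},
        "det":   {"JJ": ("det", 0.25), "NN": ("head", 0.65)},
        "head":  {"P": ("head", 0.15), "Fr": ("head", 0.10)},
    },
}
LOOP_PROB = 0.01   # stand-in for the automatically-added skip/loop arcs
END_PROB = 0.10    # probability of stopping in the current state

def node_merit(mother, daughter_labels):
    """Product of arc probabilities along the unique accepting route."""
    net = NETWORKS.get(mother, {})
    state, merit = "start", 1.0
    for label in daughter_labels:
        arc = net.get(state, {}).get(label)
        if arc:
            state, prob = arc
        else:
            prob = LOOP_PROB            # loop arc: accept the label, stay put
        merit *= prob
    return merit * END_PROB

def tree_merit(tree):
    """Figure of merit of a tree given as (label, children); children are
    either further (label, children) pairs or bare wordtag strings."""
    label, children = tree
    daughters = [c[0] if isinstance(c, tuple) else c for c in children]
    merit = node_merit(label, daughters)
    for child in children:
        if isinstance(child, tuple):
            merit *= tree_merit(child)
    return merit

# A common expansion scores far higher than an absurd one for the same label:
print(tree_merit(("N", ["AT", "JJ", "NN"])))   # determiner, adjective, noun
print(tree_merit(("N", [",", "P", "Fa"])))     # comma, prep phrase, adv clause
```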
As for the second question:
APRIL locates the best tree for an input by a stochastic optimization
technique, namely the technique of ‘simulated annealing’ (see e.g. Kirkpatrick
et al. 1983; Aarts & Korst 1989).
That is, the system executes a random walk through the solution space,
evaluating each random move from one labelled tree to another as it is
generated, and applying an initially weak but steadily growing bias against
accepting moves from ‘better’ to ‘worse’ trees. In this way the system evolves towards an optimal analysis
for an input, without needing initially to know whereabouts in the solution
space the optimum is located, and without getting trapped at ‘local minima’ –
solutions which are in themselves suboptimal but happen to be slightly better
than each of their immediately-neighbouring solutions. Stochastic optimization techniques like
this one have something of the robust simplicity of Darwinian evolution in the
natural world: the process does
not ‘know where it is going’, and it may be subject to all sorts of chance
accidents on the way, but in the long run it creates highly-valued outcomes
through nothing more than random mutation and a tendency to select fitter
alternatives.
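The acceptance rule at the heart of this process can be sketched briefly. In the sketch below, random_neighbour and tree_merit stand for whatever move generator and figure-of-merit function the parser supplies (they are placeholders here, not APRIL's actual routines), and the cooling-schedule numbers are arbitrary; merits are compared on a logarithmic scale because they are products of many small arc probabilities.

```python
import math
import random

def anneal(initial_tree, random_neighbour, tree_merit,
           start_temp=1.0, cooling=0.999, steps=100_000):
    """Random walk over candidate trees with a steadily growing bias against
    accepting moves from better to worse trees."""
    current = initial_tree
    current_merit = tree_merit(current)
    temp = start_temp
    for _ in range(steps):
        candidate = random_neighbour(current)        # small random change
        candidate_merit = tree_merit(candidate)
        if candidate_merit >= current_merit:
            accept = True                            # never refuse an improvement
        elif candidate_merit > 0:
            # Accept a worsening with a probability that shrinks as the
            # temperature falls.
            accept = random.random() < math.exp(
                (math.log(candidate_merit) - math.log(current_merit)) / temp)
        else:
            accept = False
        if accept:
            current, current_merit = candidate, candidate_merit
        temp *= cooling                              # cool: strengthen the bias
    return current
```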
As yet, APRIL’s performance leaves plenty to be desired. Commonly it gets the structure of an
input largely right but with various local errors, either because the
tree-evaluation function fails to assign the best score to what is in fact the
correct analysis, or because the annealing process ‘freezes’ on a solution
whose score is not the best available, or both. Let me give one example, quoted from Sampson et al. (1989),
of the outcome of a run on the following sentence (taken from LOB text E23,
input to APRIL as a string of wordtags):
The final touch was added to this dramatic
interpretation, by placing it to stand on a base of misty grey tulle,
representing the mysteries of the human mind.
According to our parsing scheme, the correct analysis is as follows (for
the symbols, see Garside et al. 1987, chap. 7, sec. 5):
[S[N the final touch] [Vp was added] [P to [N this dramatic interpretation]], [P by [Tg [Vg placing] [N it] [Ti [Vi to stand] [P on [N a base [P of [N misty grey tulle, [Tg [Vg representing] [N the mysteries [P of [N the human mind]]]]]]]]]]].]
The analysis produced by APRIL on the run in question was as follows:
[S [N the final touch] [Vp was added] [P to [N this dramatic interpretation]], [P by [Tg [Vg placing] [N it] [Ti [Vi to stand] [P on [N a base [P of [N misty grey tulle]]]], [Tg [Vg representing] [N the mysteries] [P of [N the human mind]]]]]].]
That is, of the human mind was treated as an adjunct of representing rather than as a postmodifier of mysteries, and the representing clause was treated as an adjunct of placing rather than as a postmodifier of tulle.
Our method of assessing performance gives this output a mark of 76%,
which was roughly average for APRIL’s performance at the time (though some
errors which would reduce the percentage score by no more than the errors in
this example might look less venial to a human judge). We had then, and still have, a long way
to go. But our approach has the
great advantage that it is easy to make small incremental adjustments: probabilities can be adjusted on
individual transition-network arcs, for instance, without causing the system to
crash and fail to deliver any analysis at all for some input (as can relatively
easily happen with a compiler-like parser); and the system does not care how
grammatical the input is. The
example above is in fact a rather polished English sentence; but APRIL would
operate in the same fashion on a thoroughly garbled, ill-formed input, evolving
the best available analysis irrespective of whether the absolute value of that
analysis is high or low.
Currently, much of our work on APRIL is concerned with adapting it to
deal with spoken English, where grammatical ill-formedness is much commoner
than in the edited writing of LOB.
To delve deeper into the technicalities of APRIL would not be
appropriate here. But in any case
this specific research project is a less significant topic for the corpus
linguistics community at large than is the general need, which this and related
research has brought into focus for me, for a formal stocktaking of the
resources of the languages we work with.
Those of us who work with English think of our language as relatively
thoroughly studied, yet we have no comprehensive inventory and classification,
at the level of precision needed for NLP purposes, of the grammatical phenomena
found in real-life written and/or spoken English usage; I surmise that the same
is true for other European languages.
By far the largest part of the work of creating the 40,000-word
Lancaster-Leeds Treebank lay not in drawing the labelled trees for the
individual sentences but in developing a set of analytic categories and
maintaining a coherent body of precedents for their application, so as to
ensure that anything occurring in the texts could be given a labelled tree
structure, and that a decision to mark some sequence off as a constituent of a
given category at one point in the texts would always be consistent with
decisions to mark off and label comparable sequences elsewhere. It is easy, for instance, to agree that
English has a category of ‘adjective phrases’ (encoded in our scheme as J),
core examples of which would be premodified adjectives (very small, pale green); but what about cases where an adjective
is followed by a prepositional phrase which expands its meaning, as in:
they are alike in placing more emphasis ...
– should in placing ... be regarded as a postmodifier within the J whose head is alike, or is alike a one-word J and the in placing ... sequence a sister constituent? There is no one answer to this question
which any English linguist would immediately recognize as obviously correct;
and, while some answer might ultimately be derived from theoretical studies of
grammar, we cannot expect that theoreticians will decide all such questions for
us immediately and with a single voice:
notoriously, theories differ, theories change, and many of the
tree-drawing problems that crop up have never yet been considered by
theoretical grammarians. But, for
probabilistic approaches to NLP, we must have some definite answer to this and
very many other comparable issues.
Statistics extracted from a database of parsed sentences in which some
cases of adjective + prepositional phrase were grouped as a single constituent
and other, linguistically indistinguishable cases were analysed as separate
immediate constituents of the sentence node, on a random basis, would be
meaningless and useless.
Accordingly, much of the work of creating the database involved imposing
and documenting decisions with respect to a multitude of such issues; one
strove to make the decisions in a linguistically reasonable way, but the
overriding principle was that it was more important to have some clearcut body
of explicit analytic precedents, and to follow them consistently, than it was
that the precedents should always be indisputably ‘correct’.
The body of ‘parsing law’ that resulted was in no sense a generative
grammar – it says nothing about what sequences ‘cannot occur’ or are
‘ungrammatical’, which is the distinctive property of a generative grammar –
but what it does attempt to do is to lay down explicit rules for bracketing and
labelling in a predictable manner any sequence that does occur, so that as far
as possible two analysts independently drawing labelled trees for the same
novel and perhaps unusual example of English would be forced by the parsing law
to draw identical trees.
Although my own motive for undertaking this precedent-setting task had
to do with providing a statistical database to be used by probabilistic parsers,
a thoroughgoing formal inventory of a language’s resources is important for NLP
progress in other ways too. Now
that NLP research internationally is moving beyond the preoccupation with
artificially-simple invented examples that characterized its early years, there
is a need for research groups to be able routinely to exchange quantities of
precise and unambiguous information about the contents of a language; but at
present this sort of information exchange is hampered in the domain of grammar
by the fact that traditional terminology is used in inconsistent and sometimes
vague ways. For instance, various
English-speaking linguists use the terms ‘complement’, or ‘predicate’, in quite
incompatible ways. Other terms,
such as ‘noun phrase’, are used much more consistently in the sense that
different groups agree on core examples of the term; but traditional
descriptive grammars, such as Quirk et al. (1985) and its lesser predecessors,
do not see it as part of their task to define clearcut boundaries between terms that would allow borderline
cases to be assigned predictably to one category or another. For computational purposes we need
sharpness and predictability.
What I am arguing for – I see it as currently the most pressing need in
the NLP discipline – is taxonomic research in the grammatical domain that
should yield something akin to the Linnaean taxonomy for the biological world. Traditional grammars describe
constructions shading into one another, as indeed they do, but the analogous
situation in biology did not prevent Linné imposing sharp boundaries between
botanical species and genera.
Linné said: Natura non facit saltus. Plantae omnes utrinque affinitatem monstrant, uti territorium in mappa geographica ('Nature does not make leaps; all plants show affinities on every side, like territories on a geographical map'); but Linné imposed boundaries in this apparent continuum, as
nineteenth-century European statesmen created colonial boundaries in the map of
Africa. The arrangement of species
and genera in the Linnaean system was artificial and in some respects actually
conflicted with the natural (i.e. theoretically correct) arrangement, and Linné
knew this perfectly well – indeed, he spent part of his career producing
fragments of a natural taxonomy, as an alternative to his artificial taxonomy;
but the artificial system was based on concrete, objective features which made
it practical to apply, and because it did not have to wait on the resolution of
theoretical puzzles Linné could make it complete. Artificial though the Linnaean system was, it enabled the
researcher to locate a definite name for any specimen (and to know that any
other botanist in the world would use the same name for that specimen), and it
gave him something approaching an exhaustive conspectus of the ‘data elements’
which a more theoretical approach would need to be able to cope with.
If no-one had ever done what Linné did, then Swedish biologists would
continually be wondering what British biologists meant (indeed, Lancastrian
biologists would be wondering what Cambridge biologists meant) by, say,
cuckoo-pint, and whether cuckoo-pint, cuckoo flower, and ragged robin were one
plant, two, or three. Since Linné,
we all say Arum maculatum and we know what we are talking about. Computational linguistics, I feel, is still operating more
or less on the cuckoo-pint standard.
First let us do a proper stocktaking of our material, and then we shall
have among other things a better basis for theoretical work.
In one area an excellent start has already been made. Stig Johansson’s Tagged LOB Corpus
Manual (Johansson 1986) includes a great deal of detailed boundary-drawing
between adjacent wordtags of the LOB tagset. Leech’s and my groups have refined certain aspects of
Johansson’s wordclass taxonomy, making more distinctions in areas such as
proper names and numerical and technical items, for instance, but we could not
have done what we have done except by building on the foundation provided by
Johansson’s work; and it is interesting and surprising to note that, although
Johansson (1986) was produced for one very specific and limited purpose (to document
the tagging decisions in a specific tagged corpus), the book has to my
knowledge no precedent in the level of detail with which it specifies the
application of wordclass categories.
One might have expected that many earlier linguists would have felt the
need to define a detailed set of wordclasses with sufficient precision to allow
independent analysts to apply them predictably: but apparently the need was not perceived before the
creation of analysed versions of electronic corpora.
With respect to grammatical structure above the level of terminal nodes,
i.e. the taxonomy of phrases and clauses, nothing comparable to Johansson’s
work has been published. I have
referred to my own, unpublished, work done in connexion with the
Lancaster-Leeds Treebank; and at present this work is being extended under
Project SUSANNE, a project sponsored by the Economic & Social Research
Council at the University of Leeds and directed by myself as an external
consultant, the goal of which is the creation of an analysed English corpus
significantly larger than the Lancaster-Leeds Treebank, and analysed with a
comparable degree of detail and self-consistency, but in conformity with an
analytic scheme that extends beyond the purely ‘surface’ grammatical notations
of the Lancaster-Leeds scheme to represent also the ‘underlying’ or logical
structure of sentences where this conflicts with surface structure.[3]
We want to develop our probabilistic parsing techniques so that they
deliver logical as well as surface grammatical analyses, and a prerequisite for
this is a database of logically-parsed material.
The SUSANNE Corpus is based on a grammatically-annotated 128,000-word
subset of the Brown Corpus created at Gothenburg University in the 1970s by
Alvar Ellegård and his students (Ellegård 1978). The solid work already done by Ellegård’s team has enabled
my group to aim to produce a database of a size and level of detail that would
otherwise have been far beyond our resources. But the Gothenburg product does have limitations (as its
creator recognizes); notably, the annotation scheme used, while covering a
rather comprehensive spectrum of English grammatical phenomena, is defined in
only a few pages of instructions to analysts. As an inevitable consequence, there are inconsistencies and
errors in the way it is applied to the 64 texts from four Brown genres
represented in the Gothenburg subset.
We are aiming to make the analyses consistent (as well as representing
them in a more transparent notation, and adding extra categories of
information); but, as a logically prior task, we are also formulating and
documenting a much more detailed set of definitions and precedents for applying
the categories used in the SUSANNE Corpus. Our strategy is to begin with the ‘surfacy’ Lancaster-Leeds
Treebank parsing scheme, which is already well-defined and documented
internally within our group, and to add to it new notations representing the
deep-grammar matters marked in the Gothenburg files but not in the Treebank,
without altering the well-established Lancaster-Leeds Treebank analyses of
surface grammar. (For most aspects
of logical grammar it proved easier than one might have expected to define
notations that diverse theorists should be able to interpret in their own
terms.) Thus the outcome of
Project SUSANNE will include an analytic scheme in which the surface-parsing
standards of the Lancaster-Leeds parsing law are both enriched by a larger body
of precedent and also extended by the addition of standards for deep
parsing. (Because the Brown Corpus
is American, the SUSANNE analytic scheme has also involved broadening the
Lancaster-Leeds Treebank scheme to cover American as well as British usage.)
Project SUSANNE is scheduled for completion in January 1992. I am currently discussing with an
academic publisher the possibility of publishing its product as a package
incorporating the annotated corpus itself, in electronic form, and the analytic
scheme to which the annotations conform, as a book. Corpus-builders have traditionally, I think, seen the
manuals they write as secondary items playing a supporting role to the corpora
themselves. My view is
different. If our work on Project
SUSANNE has any lasting value, I am convinced that this will stem primarily
from its relatively comprehensive and explicitly-defined taxonomy of English
grammatical phenomena. Naturally I
hope – and believe – that the SUSANNE Corpus too will prove useful in various
ways. But, although the SUSANNE
Corpus will be some three times the size of the database we have used as a
source of grammatical statistics to date, in terms of sheer size I believe that
the SUSANNE and other existing analysed corpora described in Sampson (1991) are
due soon to be eclipsed by much larger databases being produced in the USA,
notably the ‘Penn Treebank’ being created by Mitchell Marcus of the University
of Pennsylvania. The significance
of the SUSANNE Corpus will lie not in size but in the detail, depth, and
explicitness of its analytic scheme.
(Marcus’s Treebank uses a wordtag set that is extremely simple relative
to that of the Tagged LOB or Brown Corpora – it contains just 36 tags
(Santorini 1990); and, as I understand, the Penn Treebank will also involve
quite simple and limited indications of higher-level structure, whether because
the difficulty of producing richer annotations grows with the size of a corpus,
or because Marcus wishes to avoid becoming embroiled in the theoretical
controversies that might be entailed by commitment to any richer annotation
scheme.) Even if we succeed
perfectly in the ambitious task of bringing every detail of the annotations in
128,000 words of text into line with the SUSANNE taxonomic principles, one of
the most significant long-term roles of the SUSANNE Corpus itself will be as an
earnest of the fact that the rules of the published taxonomy have been evolved
through application to real-life data rather than chosen speculatively. I hope our SUSANNE work may thus offer
the beginnings of a ‘Linnaean taxonomy of the English language’. It will be no more than a beginning;
there will certainly be plenty of further work to be done.
How controversial is the general programme of work in corpus linguistics
that I have outlined in these pages?
To me it seems almost self-evidently reasonable and appropriate, but it
is easy to delude oneself on such matters. The truth is that the rise of the corpus-based approach to
computational linguistics has not always been welcomed by adherents of the
older, compilation-oriented approach; and to some extent my own work seems to
be serving as the representative target for those who object to corpus
linguistics. (I cannot reasonably
resent this, since I myself have stirred a few academic controversies in the
past.)
In particular, a series of papers (Taylor, Grover, & Briscoe 1989;
Briscoe 1990) have challenged my attempt, discussed above, to demonstrate that
individually rare constructions are collectively so common as to render
unfeasible the aim of designing a production system to generate ‘all and only’
the constructions which occur in real-life usage in a natural language. My experiment took for granted the
relatively surfacy, theoretically middle-of-the-road grammatical analysis
scheme that had been evolved over a series of corpus linguistics projects at
Lancaster, in Norway, and at Leeds in order to represent the grammar of LOB
sentences in a manner that would as far as possible be uncontroversial and
accordingly useful to a wide range of researchers. But of course it is true that a simple, concrete theoretical
approach which eliminates controversial elements is itself a particular
theoretical approach, which the proponents of more abstract theories may see as
mistaken. Taylor et al. believe in
a much more abstract approach to English grammatical analysis; and they argue
that my findings about the incidence of rare constructions are an artefact of
my misguided analysis, rather than being inherent in my data. Their preferred theory of English
grammar is embodied in the formal generative grammar of the Alvey Natural Language
Tools (‘ANLT’) parsing system (for distribution details see note 1 of Briscoe
1990); Taylor et al. use this grammar to reanalyse my data, and they argue that
most of the constructions which I counted as low-frequency are generated by
high-frequency rules of the ANLT grammar.
According to Taylor et al., the ANLT system is strikingly successful at
analysing my data-set, accurately parsing as many as 97% of my noun phrase
tokens.[4]
My use of the theoretically-unenlightened LOB analytic scheme is, for Briscoe
(1990), symptomatic of a tendency for corpus linguistics in general to operate
as ‘a self-justifying and hermeneutically sealed sub-discipline’.
Several points in these papers seem oddly misleading. Taylor et al. repeatedly describe the
LOB analytic scheme as if it were much more a private creation of my own than
it actually was, thereby raising in their readers’ minds a natural suspicion
that problems such as those described in Sampson (1987) might well stem purely
from an idiosyncratic analytic scheme which is possibly ill-defined,
ill-judged, and/or fixed so as to help me prove my point. One example relates to the system, used
in my 1987 investigation, whereby the detailed set of 132 LOB wordtags is reduced
to a coarser classification by grouping certain classes of cognate tags under
more general ‘cover tags’.
Referring to this system, Taylor et al. comment that ‘Sampson ... does
not explain the extent to which he has generalised types in this fashion’;
‘Sampson ... gives no details of this procedure’; Briscoe (1990) adds that an
attempt I made to explain the facts to him in correspondence ‘does not shed
much light on the generalisations employed ... as Garside et al. (1987) does
not give a complete listing of cover-tags’. In fact I had no hand in defining the system of cover tags
which was used in my experiment (or in defining the wordtags on which the cover
tags were based). The cover tags
were defined, in a perfectly precise manner, by a colleague (Geoffrey Leech, as
it happens) and were in routine use on research projects directed by Leech at
Lancaster in which Lolita Taylor was an active participant. Thus, although it is true that my paper
did not give enough detail to allow an outsider to check the nature or origin
of the cover-tag system (and outside readers may accordingly have been
receptive to the suggestions of Taylor et al. on this point), Taylor herself
was well aware of the true situation.
She (and, through her, Briscoe) had access to the details independently
of my publications, and independently of Garside et al. (1987). (They had closer access than I, since I
had left Lancaster at the relevant time while Taylor et al. were still there.)
Then, although Taylor et al. (1989) and Briscoe (1990) claim that the
ANLT grammar is very successful at parsing the range of noun phrase structures
on which my research was based, the status of this claim is open to question in
view of the fact that the grammar was tested only manually. The ANLT grammar was created as part of
an automatic parsing system, and Taylor et al. say that they tried to check the
data using the automatic parser but had to give up the attempt: sometimes parses failed not because of
inadequacies in the grammar but because of ‘resource limitations’, and
sometimes so many alternative parses were generated that it was impractical to
check whether these included the correct analysis. But anyone with experience of highly complex formal systems
knows that it is not easy to check their implications manually. Even the most painstakingly designed
computer programs turn out to behave differently in practice from what their
creators intend and expect; and likewise the only satisfactory way to test
whether a parser accepts an input is to run the parser over the input
automatically. Although much of
the text of Briscoe (1990) is word for word identical with Taylor et al.
(1989), Briscoe suppresses the paragraphs explaining that the checking was done
manually, saying simply that examples were ‘parsed using the ANLT grammar. Further details of this process ... can
be found in Taylor et al. (1989)’ (the latter publication being relatively
inaccessible).
I was particularly surprised by the success rate claimed for the ANLT
grammar in view of my own experience with this particular system. It happens that I was recently
commissioned by a commercial client to develop assessment criteria for
automatic parsers and to apply them to a range of systems; the ANLT parser was
one of those I tested (using automatic rather than manual techniques), and its
performance was strikingly poor both absolutely and by comparison with its
leading competitor, SRI International’s Core Language Engine (‘CLE’: Alshawi et al. 1989). I encountered no ‘resource limitation’
problems – the ANLT system either found one or more analyses for an input or
else finished processing the input with an explicit statement that no analyses
were found; but the latter message repeatedly occurred in response to inputs
that were very simple and unquestionably well-formed. Sentences such as Can you suggest an alternative?, Are any of the waiters students?, and Which college is the oldest? proved unparsable. (For application-related reasons my
test-set consisted mainly of questions.
I cite the examples here in normal orthography, though the ANLT system
requires the orthography of its inputs to be simplified in various ways: e.g. capitals must be replaced by
lower-case letters, and punctuation marks eliminated.) I did not systematically examine
performance on the data-set of Sampson (1987), which was not relevant to the
commission I was undertaking, but the grammar had features which appeared to
imply limited performance on realistically complex noun phrase structures. The only form of personal name that
seemed acceptable to the system was a one-word Christian name: the lexical coding system had no
category more precise than ‘proper name’.
Often I could find no way to reconcile a standard form of real-life
English proper name with the orthographic limitations imposed by the ANLT system
on its inputs – I tried submitting the very standard type of sovereign’s name King
John VIII in each of the
forms:
king john viii
king john 8
king john eight
king john the eighth
but each version led to parsing failure.[5]
It is true that the ANLT system tested by me was ‘Release 1’, dated
November 1987, while Taylor et al. (1989) discuss also a ‘2nd release’ dated
1989. But the purely manual
testing described by Taylor et al. seems to me insufficient evidence to
overcome the a priori
implausibility of such a dramatic performance improvement between 1987 and 1989
versions as their and my findings jointly imply.
A problem in any theoretically-abstract analytic approach is that depth
of vision tends to be bought at the cost of a narrow focus, which overlooks
part of the richness and diversity present in the data. Taylor et al. are open about one
respect in which this is true of their approach to natural language
parsing: in reanalysing my data
they stripped out all punctuation marks occurring within the noun phrases,
because ‘we do not regard punctuation as a syntactic phenomenon’. That is, the range of constructions on
which the ANLT parsing system is claimed to perform well is not the noun
phrases of a 40,000-word sample of written English, but the noun phrases of a
sample of an artificial language derived by eliminating punctuation marks from
written English. With respect to
my data-set this is quite a significant simplification, because more than a
tenth of the vocabulary of symbols used to define the noun phrase structures
consists of cover tags for punctuation marks.
Of course, where the ANLT system does yield the right analysis for an
input it is in one sense all the more admirable if this is achieved without exploiting
the cues offered by punctuation.
But on the other hand punctuation is crucial to many of the
constructions which I have discussed above as needing more attention than they
have received from the computational linguistics of the past. A Harvard-style bibliographical
reference, for instance, as in Smith (1985: 32) writes ..., is largely defined by its use of
brackets and colon.[6]
It would be unfortunate to adopt a theory which forced one to ignore an
aspect of the English language as significant as punctuation, and I do not
understand Taylor et al.’s attempt to justify this by denying that punctuation
is a ‘syntactic phenomenon’:
punctuation is physically there, as much part of the written language as
the alphabetic words are, and with as much right to be dealt with by systems
for automatically processing written language.
I do not believe that the choice of a concrete rather than abstract
intellectual framework, which allows researchers to remain open to such
phenomena, can reasonably be described as retreat into ‘a self-justifying and
hermeneutically sealed sub-discipline’.
The most serious flaw in Taylor et al.’s paper is that they
misunderstand the nature of the problem raised in Sampson (1987). According to Taylor et al., I assumed
that in a generative grammar each distinct noun phrase type ‘will be associated
with one rule’, and I argued that ‘any parsing system based on generative rules
will need a large or open-ended set of spurious “rules” which ... only apply
once’; Taylor et al. point out, rightly, that a typical generative grammar will
generate many of the individual constructions in my data-set through more than
one rule-application, and consequently a relatively small set of rules can
between them generate a relatively large range of constructions. But corpus linguists are not as
ignorant of alternative approaches to linguistic analysis as Taylor et al.
suppose. I had explicitly tried in
my 1987 paper to eliminate the possibility of misunderstandings such as theirs
by writing: ‘the problem is not that the number of distinct noun phrase
types is very large. A generative
grammar can define a large (indeed, infinitely large) number of alternative
expansions for a symbol by means of a small number of rules.’ As I went on to say, the real problem
lies in knowing which expansions should and which should not be generated. If extremely rare constructions cannot
be ignored because they are collectively frequent enough to represent an
important part of a language, then it is not clear how we could ever hope to
establish the class of constructions, all of the (perhaps infinitely numerous)
members of which and only the members of which should be generated by a
generative grammar – even though, if such a class could be established, it may
be that a generative grammar could define it using a finite number of
rules. Briscoe (1990, note 3)
comments on a version of this point which I made in a letter prompted by the
publication of Taylor et al. (1989), but in terms which suggest that he has not
yet understood it. According to
Briscoe, I ‘impl[y] that we should declare rare types ungrammatical, by fiat,
and not attempt to write rules for them’.
I have written nothing carrying this implication.
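
The point may be worth illustrating with a toy example, my own and drawn neither from the ANLT grammar nor from my data-set. A handful of rewrite rules, one of them recursive, defines infinitely many expansions of the symbol NP; what the rules do not tell us is which of the strings they admit correspond to constructions that writers of English actually produce.

import random

# A toy generative grammar: five symbols and a few rules, yet the recursion
# NP -> NP PP makes the set of admissible NP expansions infinite.
TOY_RULES = {
    "NP":  [["Det", "N"], ["NP", "PP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["report"], ["committee"]],
    "P":   [["of"], ["on"]],
}

def expand(symbol, depth=0):
    """Rewrite a symbol down to terminal words, choosing rules at random."""
    if symbol not in TOY_RULES:
        return [symbol]                      # terminal word
    if depth > 5:
        rule = TOY_RULES[symbol][0]          # cut the recursion off so the sketch halts
    else:
        rule = random.choice(TOY_RULES[symbol])
    words = []
    for child in rule:
        words.extend(expand(child, depth + 1))
    return words

# expand("NP") may yield "the report", "a report on the committee",
# "the report of a committee on a report of the committee", and so on.
# Writing such rules is easy; deciding which of the infinitely many strings
# they admit should, and which should not, be generated is the hard problem.
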
Taylor et al. examine the residue of noun phrases in my data-set which
they accept the ANLT grammar cannot deal with, and they suggest various
ways in which the ANLT rule-set might be extended to cope with such cases. Their suggestions are sensible, and it
may well be that adopting them would improve the system’s performance. My suspicion, though, is that with a
real-life language there will be no end to this process. When one looks carefully to see where a
rainbow meets the ground, it often seems easy to reach that spot; but we know
that, having reached it, one is no closer to the rainbow. I believe the task of producing an
observationally adequate definition of usage in a natural language is like
that. That is why I personally
prefer to work on approaches to automatic parsing that do not incorporate any
distinction between grammatical/well-formed/legal and
ungrammatical/ill-formed/illegal.
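
In outline the alternative works roughly as follows. What appears below is a deliberately simplified sketch in Python, not the APRIL algorithm itself, whose details are given in Sampson et al. (1989); the ‘neighbour’ and ‘plausibility’ functions are placeholders for components which a real system would derive from an analysed corpus. No candidate analysis is ever rejected as illegal; each is merely scored for plausibility, and a simulated-annealing search of the kind described by Kirkpatrick et al. (1983) moves through the space of analyses towards a high-scoring tree.

import math
import random

def anneal(initial_analysis, neighbour, plausibility, steps=10000, start_temp=1.0):
    """Search for a plausible analysis by stochastic optimization.

    No analysis is ruled out as 'ungrammatical': every candidate receives a
    score, and the search gravitates towards high-scoring trees as the
    temperature falls.
    """
    current = initial_analysis
    current_score = plausibility(current)
    for step in range(steps):
        temperature = start_temp * (1.0 - step / steps) + 1e-9
        candidate = neighbour(current)              # e.g. re-attach one constituent
        candidate_score = plausibility(candidate)   # e.g. a sum of log frequencies
        gain = candidate_score - current_score
        # Improvements are always accepted; worsenings are sometimes accepted,
        # more readily early on, so the search can escape local optima.
        if gain >= 0 or random.random() < math.exp(gain / temperature):
            current, current_score = candidate, candidate_score
    return current

Because every tree receives some score, a procedure of this kind always returns an analysis, however unorthodox the input; that robustness is what is bought at the cost of the extra computation discussed in the next paragraph.
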
But let me not seem to claim too much. The compilation model for language processing has real
virtues: in particular, when the
compilation technique works at all it is far more efficient, in terms of
quantity of processing required, than a stochastic optimizing technique. In domains involving restricted,
relatively well-behaved input language, the compilation model may be the only
one worth considering; and it seems likely that as NLP applications multiply
there will be such domains – it is clear, for instance, that language
consciously addressed by humans to machines tends spontaneously to adapt to the
perceived limitations of the machines.
And even in the case of unrestricted text or speech I am certainly not
saying that my probabilistic APRIL system is superior to the ANLT system. To be truthful, at present neither of
these systems is very good.
Indeed, I would go further:
it is difficult to rank natural language parsers on a single scale,
because they differ on several incommensurable parameters, but if I had to
select one general-purpose English-language parsing system as best overall
among those now existing, I would vote for the CLE – which is a compiler-like
rather than probabilistic parser.
The CLE too leaves a great deal to be desired, and the probabilistic
approach is so new that I personally feel optimistic about the possibility that
in due course it may overtake the compilation approach, at least in domains
requiring robust performance with unrestricted inputs; but at present this is a
purely hypothetical forecast.
What I do strongly believe is that there is a great deal of important
natural-language grammar, often related to cultural rather than logical
matters, over and beyond the range of logic-related core constructions on which
theoretical linguists commonly focus; and that it will be very regrettable if
the discipline as a whole espouses abstract theories which prevent those other
phenomena being noticed. How far
would botany or zoology have advanced, if communication among researchers had been
hampered because no generally-agreed comprehensive taxonomy and nomenclature
could be established pending final resolution of species relationships through
comparison of amino-acid sequences?
We need a systematic, formal stocktaking of everything in our languages;
this will help theoretical analysis to advance, rather than get in its way, and
it can be achieved only through the compilation and investigation of corpora.
REFERENCES
Aarts, E. and J. Korst. 1989. Simulated Annealing and Boltzmann Machines. Chichester: Wiley.
Alshawi, H. et al. 1989. Research Programme in Natural Language Processing: Final Report. Prepared by SRI International for the Information Engineering Directorate Natural Language Processing Club. (Alvey Project no. ALV/PRJ/IKBS/105, SRI Project no. 2989.)
Briscoe, E.J. 1990. ‘English noun phrases are regular: a reply to Professor Sampson’. In J. Aarts and W. Meijs, eds., Theory and Practice in Corpus Linguistics. Amsterdam: Rodopi.
Chomsky, A.N. 1957. Syntactic Structures. The Hague: Mouton.
–. 1980. Rules and Representations. Oxford: Blackwell.
Ellegård, A. 1978. The Syntactic Structure of English Texts. Gothenburg Studies in English, 43.
Fillmore, C.J., et al. 1988. ‘Regularity and idiomaticity in grammatical constructions’. Language 64.501-538.
Garside, R.G., et al., eds. 1987. The Computational Analysis of English. London: Longman.
Johansson, S. 1986. The Tagged LOB Corpus Users’ Manual. Bergen: Norwegian Computing Centre for the Humanities.
Kirkpatrick, S. et al. 1983. ‘Optimization by simulated annealing’. Science 220.671-80.
Quirk, R. et al. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Sampson, G.R. 1980. Making Sense. Oxford: Oxford University Press.
–. 1987. ‘Evidence against the “grammatical”/“ungrammatical” distinction’. In W. Meijs, ed., Corpus Linguistics and Beyond. Amsterdam: Rodopi.
–. 1989. ‘Language acquisition: growth or learning?’ Philosophical Papers 18.203-240.
–. 1991. ‘Analysed corpora of English: a consumer guide’. In Martha Pennington and V. Stevens, eds., Computers in Applied Linguistics. Clevedon, Avon: Multilingual Matters.
– et al. 1989. ‘Natural language analysis by stochastic optimization: a progress report on Project APRIL’. Journal of Experimental and Theoretical Artificial Intelligence 1.271-87.
Santorini, Beatrice. 1990. Annotation Manual for the Penn Treebank Project (preliminary draft dated 28.3.1990). University of Pennsylvania.
Taylor, Lolita, et al. 1989. ‘The syntactic regularity of English noun phrases’. In Proceedings of the Fourth Annual Meeting of the European Chapter of the Association for Computational Linguistics, University of Manchester Institute of Science and Technology.
[1]In fact these two concepts were not independent
developments: the early work of
Chomsky and some of his collaborators, such as M.P. Schützenberger, lay at the
root of formal language theory within computer science, as well as of
linguistic theory – though few of the linguists who became preoccupied with
generative grammars in the 1960s and 1970s had any inkling of the role played
by the equivalent concept in computer science.
[2]‘APRIL’ stands for ‘Annealing Parser for
Realistic Input Language’. The
current phase of Project APRIL is funded under Ministry of Defence contract no.
D/ER1/9/4/2062/151.
[3]‘SUSANNE’ stands for ‘Surface and
Underlying Structural Analyses of Natural English’. Project SUSANNE is funded under ESRC grant no. R00023
1142/3.
[4]Taylor et al. quote the percentage to two
places of decimals, but I hardly imagine their claim is intended to be so
precise.
[5]The name in the test data was King
Henry VIII, but it
happened that the name Henry was not in the ANLT dictionary and therefore I made the test fair by
substituting a name that was.
[6]For my expository purposes it is an awkward complication that the house style of the present publication incorporates an unusual variant of the Harvard system which substitutes comma for colon.