The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version. Published in Literary and Linguistic Computing 8.267–73, 1993.
School of Cognitive and Computing Sciences
University of Sussex
Natural language
research needs something akin to a “Linnaean taxonomy”, identifying and
rigorously specifying boundaries between the various structural categories of a
language, to allow data to be collected and exchanged in unambiguous form. The author has made a first attempt to
provide such a taxonomy for English grammar; the scheme is to appear in book
form, and an electronic corpus annotated in conformity with it has been
available since 1992.
A central
requirement for SALT (speech and language technology) progress is a
comprehensive stocktaking and classification of the linguistic phenomena
(word-types, grammatical constructions, etc.) that are found in real-life
written and spoken usage in relevant natural languages, emphasizing
comprehensive coverage and explicitness of classification rather than
theoretical depth. Those of us
whose native language is English think of our language as relatively thoroughly
studied, yet despite the length of time during which computational linguists
have addressed the task of processing English, such a classification scheme has
not been available for our language.
I surmise that the same holds for other European languages.
In the case of
English I have recently attempted to fill this gap by producing a parsing
scheme – the “SUSANNE” scheme – which offers explicit proposals for grammatical
taxonomy that the research community may adopt, alter, extend, or otherwise
treat as it sees fit. The SUSANNE
annotation scheme is certainly not presented as “the right scheme” for
describing English grammar, and indeed one of the points I aim to make in what
follows is that “correctness” is not an applicable concept in this domain. What matters is that an annotation
scheme should be practical, publicly known, unambiguous, comprehensive, and
explicit; it is quite possible that alternative schemes might fulfil these
criteria equally well while being very different from one another in their
details. The chief purpose of this
paper is to explain why, at the present juncture in the development of speech
and language technology, this sort of work is worth doing: why information technology needs
grammatical taxonomies.
Natural language
processing (NLP) systems crucially need the ability to parse – to infer the
structure of an input text or spoken utterance. Parsing is widely recognized as “[t]he central problem”
(Obermeier 1989:69) in virtually all NLP applications. This is relatively obvious in the case
of applications towards the “intelligent/knowledge-based” end of the spectrum,
such as question-answering systems (front ends to databases), or machine
translation. In both these areas,
the largest problem lies in analysing the input (“understanding” users’
questions in the case of a question-answering system, or source-language texts
in the case of a machine-translation system); if this can be achieved, synthesizing
appropriate responses (answers to questions, or target-language translations)
is a lesser difficulty. Even in
areas which seem prima facie not to require natural-language “understanding”, parsing is also
needed. Automatic speech
recognition, for instance for voice-driven typewriters, needs the ability to
tell what is a grammatically-plausible arrangement of words in order to
constrain the alternative word-hypotheses offered by processing the speech
signal.
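To picture this constraining role, here is a minimal sketch in Python (the word lattice and the plausibility scores are invented for the example, and stand in for whatever grammar or language model a real recogniser would consult):

```python
from itertools import product

# Competing word-hypotheses for one utterance, as an imaginary acoustic
# front end might offer them: each position has alternatives.
lattice = [["recognise", "wreck a nice"], ["speech", "beach"]]

# A toy "grammatical plausibility" table standing in for a parser or
# language model; a real system would score whole analyses, not word pairs.
plausibility = {
    ("recognise", "speech"): 0.9,
    ("recognise", "beach"): 0.2,
    ("wreck a nice", "beach"): 0.6,
    ("wreck a nice", "speech"): 0.1,
}

# Pick the candidate word sequence that the "grammar" finds most plausible.
best = max(product(*lattice), key=lambda seq: plausibility.get(tuple(seq), 0.0))
print(" ".join(best))   # -> recognise speech
```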
Because of its
practical significance, very many groups internationally have been working on
the automatic parsing problem for English – sometimes within the framework of
development of a particular application system, but often as a freestanding
research problem. (For recent
surveys see e.g. Reyle & Rohrer 1988, Grune & Jacobs 1990; and see
items in DARPA (1989) and subsequent Proceedings in the same series.) The great complexity of any natural
language and consequent long-term nature of the process of developing
natural-language parsers, together with the fact that humans’ ability to decode
the structure of their mother-tongue seems not to be significantly geared to
individual topic areas but rather to be general, make it appropriate to tackle
automatic parsing of a natural language as an independent research goal aiming
to produce systems that can be slotted into diverse applications.
It would be natural to suppose, given that automatic parsing has been a widely-pursued goal over many years, that there is general agreement about what the analyses of English sentences should look like – what target an English-language parser should be aiming at. This is far from
true. The normal situation has
been that an individual research group makes its own independent decisions
about the intended output of its parser, in such a way that one group’s
analyses are not just notationally distinct from but usually substantially
non-equivalent to those of other groups.
Furthermore, description/definition of target analysis schemes has
tended not to be a high priority, so it is quite difficult for an outsider to
know just what structural properties of English a particular group’s parser
aims to specify; researchers have usually been far more concerned to publicize
their parsing system
(the nature of the software they have created in order to move from raw input
to analysed output) than to publicize their parsing scheme (the nature of the structural analyses
comprised in the output of a parsing system) – indeed it is not always seen as
important to codify the latter explicitly even for a research group’s internal
purposes. And it is clear that the
(explicit or implicit) parsing schemes of virtually all groups are highly
incomplete: any such scheme will
offer no specific analyses for very many phenomena that frequently occur in
English.
There are (at
least) two reasons for this state of affairs, both stemming from aspects of the
recent history of linguistics.
First, computational linguists have tended to treat their subject as a
branch of theoretical linguistics, and theoretical linguistics has for decades
been concerned with rival notational systems for capturing highly abstract
generalizations about a limited range of “core” grammatical constructions, such
as relative clauses, or verb complements.
To a theoretical linguist it is simply not part of his goals to use an
analytic system which comprehensively covers everything that occurs in the
language in practice, which represents analytic distinctions in a maximally
straightforward, self-explanatory fashion, or which coincides with the
notations used by rival theorists; there are valid intellectual reasons why
these should not be goals for theoretical linguistics, but the result has been
(since it is largely the same people who practice both disciplines) that they
have not become goals of computational linguistics either, where their lack is
unfortunate.
Secondly, within
linguistics there has been a tradition (which has only recently begun to
dissolve) of hostility towards corpus studies – the reasons for this are
analysed by Aarts & van den Heuvel (1985: 303ff.); yet it is only through
work with corpora (large samples of language as used in real life) that the
analyst is forced to confront the great diversity of linguistic phenomena that
occur in practice and to seek an analytic scheme comprehensive enough to cope
with them. If the linguist relies
on data invented by himself in his role as native-speaker of the language, as
has been more usual (not because linguists are lazy, but as a consequence of
methodological axioms about “competence” and “performance” (Chomsky 1965) which
are respectable within theoretical linguistics though, again, they are less
relevant to practical NLP research), then it is near-inevitable that the
linguist will focus on a limited range of phenomena which the research
community has picked out as posing interesting problems, while overlooking many
other phenomena that happen never to have struck anyone as noteworthy. (Some linguists, e.g. Labov (1975), would
argue that the emphasis on invented rather than observed data has led to
significant distortion even of those facts that are taken into account, but my
point does not depend on this relatively controversial claim.)
Some specific
consequences are obvious.
Written-language punctuation, for instance, is normally excluded from
grammatical analysis altogether.
NLP applications often concern written rather than spoken language, and
the sentences discussed by theoretical and computational linguists commonly
involve the formal, elaborate style characteristic of the written mode; but
theoretical linguists have never discussed punctuation, and there is no
consensus among computational linguists about how (or whether) to include
punctuation marks in parse-trees (despite the fact that for automatic analysis
of written language punctuation marks are highly significant, comparable in
importance to grammatical words such as “of” or “the”).
Again, real-life (written and spoken) language contains many
high-frequency phenomena such as dates (August 7th 1992), weights and measures (five foot ten), Harvard-style bibliographical
references in academic literature (Greenberg (1963: 90) wrote ...), addresses (10, Bridge Rd, Ambridge,
Borsetshire BC21 7EW),
etc. etc., which have their own characteristic structures in different
languages (compare the varying national formats for postal addresses, or
compare Portuguese 2$50
with American $2.50,
for instance); but theoretical linguists – and indeed those who produce
language descriptions of a more traditional type, such as (for English) the
series of grammars by Randolph Quirk and his collaborators culminating in Quirk
et al. (1985) – perceive them as peripheral, and for these phenomena too there
is no consensus about how they should be analysed. Yet for practical NLP applications they will often be as
important as many of the constructions that theoretical linguists see as part
of the “core” of language.
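As a crude illustration that such “peripheral” phenomena nevertheless have perfectly definite, language- and culture-specific structure, here is a minimal sketch (the regular expressions are rough approximations invented for this example; they belong to no published annotation scheme):

```python
import re

# Rough, invented patterns for a few of the phenomena mentioned above;
# a real scheme would need far more care (ordinal forms, county names, etc.).
patterns = {
    "date_english":     r"(January|February|March|April|May|June|July|August|"
                        r"September|October|November|December)\s+\d{1,2}(st|nd|rd|th)?\s+\d{4}",
    "money_american":   r"\$\d+\.\d{2}",    # $2.50
    "money_portuguese": r"\d+\$\d{2}",      # 2$50
    "uk_postcode":      r"\b[A-Z]{1,2}\d{1,2}\s+\d[A-Z]{2}\b",   # BC21 7EW
}

sample = "Greenberg (1963: 90) wrote on August 7th 1992; it cost $2.50 (or 2$50) at BC21 7EW."
for name, pattern in patterns.items():
    print(name, bool(re.search(pattern, sample)))   # each prints True
```

A genuine taxonomy would of course need to classify such items within the grammatical annotation itself, not merely recognise them.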
The neglected
areas just listed relate chiefly to written language; but there are as many or
perhaps more phenomena characteristic of spoken English which tend to fall
outside the purview of computational linguistics, with its focus on relatively
formal, impersonal language; cf. Allwood et al. (1990). It is clear, for instance, that
word-classification schemes developed for tagging written English words are
likely to be very inadequate for tagging the words of spoken utterances, which
are full of items serving discourse rather than logical functions (good work
has been done in this area by Swedish researchers, e.g. Stenström 1990,
Altenberg 1990).
Roger Moore of
the Speech Research Unit at the Defence Research Agency, Malvern, urges (in a
prepublication draft of his paper for the present volume) that speech science
and technology have now developed to the point where they face “an overwhelming
need for agreed standards” in transcribing the structure of everyday,
unprepared speech; he notes that no such conventions are currently known to
exist, and suggests that speech scientists themselves are not best qualified to
devise conventions relating to structural features. One problem is that speech – particularly “private” speech
such as face-to-face or telephone conversations, as opposed to lectures,
broadcasts, etc. – contains a high incidence of phenomena such as speech
repairs and hesitations which tend to be invisible in standard grammatical
description, since this is usually based on a “competence” version of
linguistic behaviour that excludes them.
From the limited work I have done in this area to date (see §6 below),
it is clear that existing attempts to specify notations for these phenomena
(notably Levelt 1983, Howell & Young 1990) are unsatisfactory: they depend on spoken language
conforming to patterns which, in practice, are frequently violated.[1]
Both in the case
of writing and in that of speech, NLP applications require the ability to
penetrate beyond the surface grammar of a text or utterance to disentangle its
logic. Great attention has been
paid to this issue by linguistic theorists in recent decades with respect to
the “competent” language characteristic of writing, where it largely involves
the reconstruction of deleted items whose identity is implied by the surface
grammar, and recovery of the logical position of “transformationally moved”
items. In addition to these
matters, though, spoken language involves a large extra layer of
surface/logical contrasts having to do with the unannounced changes of tack,
breaking off of utterances before their logical completion, production of
logically-confused utterances, etc., which are frequent in speech but are
normally edited out of writing. It
will be a long time before automatic language processing systems are capable of
dealing adequately with naturalistic speech in which such phenomena are salient;
but a prerequisite for advances in that direction is availability of databases
showing what patterns the phenomena fall into in practice, and this in turn
presupposes adequate annotation schemes.
Furthermore, even
in the areas of language which are shared between writing and speech and which
linguists would see as part of what a language description ought (at least
ideally) to cover, there is a vast amount to be done in terms of listing and
classifying the phenomena that occur.
Many constructions are omitted from theoretical descriptions not for
reasons of principle but because they are not very frequent and/or do not seem
to interact in theoretically-interesting ways with central aspects of grammar,
and although they are mentioned in traditional grammars they are not
systematically assigned places in explicit inventories of the resources of the
language. One example among very
many might be the English “the more ... the more ...” construction discussed by Fillmore et al.
(1988), an article which makes some of the same points I am trying to make
about the tendency for much of a language’s structure to be overlooked by the
linguist. Discussion between
research groups about the grammatical resources of a language is hampered by
the fact that traditional terminology is used in inconsistent and sometimes
vague ways. For instance, various
English-speaking linguists use the terms “complement” and “predicate” in quite
incompatible ways. Other terms,
such as “noun phrase”, are used much more consistently, in the sense that
different groups agree on core examples of the term; but traditional grammars
devote little attention to defining clearcut boundaries between such terms that would allow
unclear cases to be assigned predictably to one category or another. The work of producing the tagged
version of the LOB Corpus of written British English forced Stig Johansson to
produce a short-book-length specification of boundaries between the 136 tags
used to classify English words; this work (Johansson 1986) is so far as I know
unique in English linguistics.
Johansson’s manual is by no means the last word to be said on word
classification, and there is as much or, in my view, even more to be done in
the area of classifying grammatical constituents.
Unlike the
writers of traditional language descriptions, theoretical linguists are in one
sense heavily concerned with the definition of boundaries between grammatical
constructions. A theoretician
might well be interested in the question whether or not (to borrow an example
from Garside et al. (1987: ch. 7)) the wording following “is” in the sentence “A dog is as much God’s handiwork as a man” should be classified as a noun phrase.
But the sense in which a theoretician would address himself to this
question is different from the sense in which it requires an answer for the
purposes of the linguistic stocktaking advocated here. For the theoretician, the question
would be whether the phrase “as much God’s handiwork as a man” “really is” derived from the same node as
core examples of noun phrases, such as pronouns or proper names, in the most
psychologically-correct or explanatorily adequate formal definition of
English. A question of this sort
is very deep, and can be answered only provisionally and for a limited number
of grammatical phenomena. For NLP
purposes, the most pressing need I perceive is for an explicit, comprehensive
classification scheme to be imposed on a language, without too many worries about
whether its details are psychologically or otherwise correct, so that we can
all talk about the elements of the language using a common notation and knowing
that we mean the same thing by our notational categories and that the set of
categories is reasonably exhaustive.
What I believe we
need in computational linguistics is something like the Linnaean taxonomy for
the biological world. The Linnaean
system of plant nomenclature did not always agree with the natural,
biologically-valid classification – and Linné knew that. Its great advantages were that it
enabled the researcher to locate a definite name for any specimen (and to know
that any other botanist in the world would use the same name for that
specimen), and that it gave him something approaching an exhaustive conspectus
of the “data elements” which a more theoretical approach would need to be able
to cope with. The confusions
inherent in traditional vernacular species names were bypassed by adopting an
artificial system of Latin binomials; likewise a standardized grammatical
taxonomy could avoid misunderstandings about e.g. “predicate” or “complement”
by using neutral code letters or the like rather than descriptive words for its
categories. (For further discussion of the analogy between linguistic and
biological taxonomy, cf. Sampson (1992).)
I wrote above
that the tabulation of structural phenomena should be “reasonably
exhaustive”. To aim at perfect
comprehensiveness would be to follow a mirage, since at grammatical and
semantic levels any natural language is an open-ended system. Language is an intellectual product,
and the diversity of constructions that a speaker/writer can use is not
constrained by any physical limits.
I have published a mathematical investigation (Sampson 1987) which tends
to suggest that there is no finite bound to the range of distinct constructions
in English, so that as ever larger corpora are examined additional
constructions will always continue to be found. Richard Sharman of the IBM UK Scientific Centre likens the
grammar of a natural language to a fractal object such as a coastline, in which
new detail continues to emerge indefinitely as one looks closer and
closer. But, while an ultimately
exact statement of the shape of France is impossible, one can do much better than
say “France is hexagonal”. One
important desideratum for an IT-oriented grammatical taxonomy for a language is
informed judgment about what levels of detail it is appropriate to specify for
different areas of the language in our current state of knowledge.
While theoretical
and computational linguists have not striven for comprehensiveness of coverage,
they have put considerable effort into identifying divergences between the
“surface structure” of natural-language utterances and their “underlying
structure” or “logical form”. (To
illustrate this distinction via the classic example: “John is eager to please” and “John is easy to please” share the same surface structure, but logically their grammar is quite different: in one case “John” is the logical subject and in the other case the logical object of “please”. Likewise it might be said that active/passive pairs such as “John ate the toast” v. “The toast was eaten by John” are distinct in surface grammar but logically equivalent.)  For many
SALT applications, identifying the underlying logic of an input is a necessary
stage of analysis; only for a few special cases such as text-to-speech systems
is surface parsing alone arguably sufficient. However, there is much more divergence between various
theorists’ conceptions of logical form than of surface structure. It is probably safe to say that
everyone agrees on representing the surface grammar of sentences by means of
labelled tree structures (or some notation clearly equivalent to labelled
trees), though the alphabet of node-labels would differ considerably from
research group to research group, and to a lesser extent the shapes of the
trees drawn for particular sentences would also differ. In the area of logical form, however,
although some researchers would again use labelled trees to represent the facts
others would use quite different methods of representation (for a survey see
Winograd 1983). Sometimes it is
not clear whether these differences are notational or substantive. Thus, Winograd (op. cit.) represents
the logical forms output by the ATN parsers with which he is centrally
concerned by means of diagrams that look superficially quite unlike labelled
trees, and which are never brought into relationship with the trees that
Winograd displays in connexion with other systems of analysis he discusses; yet
these diagrams can be mechanically converted into labelled tree structures that
are unorthodox in only one or two minor respects.
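To picture the point about labelled trees concretely, here is a minimal sketch using the eager/easy pair cited above (the node labels and the “logical role” marks are invented purely for this illustration; they are not the SUSANNE categories nor any particular group’s alphabet): one surface shape, two logical analyses.

```python
# One surface shape serves both sentences ...
def surface_tree(adjective):
    return ("S",
            ("NP", "John"),
            ("VP", ("V", "is"),
                   ("AdjP", ("Adj", adjective),
                            ("Vinf", "to please"))))

# ... but the logical relation of "John" to "please" differs.
logical_role_of_john = {"eager": "logical subject of please",
                        "easy":  "logical object of please"}

for adj in ("eager", "easy"):
    print(surface_tree(adj))
    print("  John =", logical_role_of_john[adj])
```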
By contrast, in other areas the differences between notations for logical form are entirely real rather than merely notational, and hard to resolve. An example
would be the question of how to distinguish the various arguments of the
predicate element (usually the verb) of a clause. The arguments include the items that appear as subject,
direct object, etc. in surface grammar, but some researchers regard these
categories as unhelpful for specifying the logic of a clause (note that the
grammatical subject of a verb is by no means always the “doer” of the
action). Numerous alternative
proposals are available in the literature; many are couched in terms of
Fillmorean “case theory” (Fillmore 1968), but they diverge widely with respect
to the sets of cases recognized, and other schemes again use concepts other
than case.
The importance of
logical analysis for NLP applications, already alluded to, might suggest that
any project oriented towards such an application would be forced to develop a
well-defined analytic approach in this area. Surprisingly, this is not always true of even the largest projects. The European Communities’ EUROTRA project
for machine translation between all the official languages of the member-states
was probably the largest and most expensive NLP project anywhere in the world
(after a pilot phase lasting several years it was fully established in 1982 and
subsequently employed on the order of one hundred full-time researchers at any one time, spread over all EC member states).
According to Bente Maegaard (1989: 44), even at that late date the
EUROTRA representations of source- and target-language logical structures
resolved the issue discussed above about identifying the various arguments of a
verb simply by labelling them “arg1”, “arg2”, etc. – i.e. they said nothing
substantive at all about this important aspect of logical structure and merely
tried to rely on the accident that Western European languages usually order
corresponding arguments of corresponding verbs in the same sequence. This is a striking illustration of the
way in which the level of sophistication of NL analysis targets is currently
lagging behind the sophistication of the software being created to execute NL
analysis.
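The gap between purely positional argument labels and a substantive case-role analysis can be pictured with a small invented example (the labels “arg1”/“arg2” and “Agent”/“Patient” below are used only in the generic spirit of the schemes just discussed; the analyses are illustrative, not EUROTRA’s or Fillmore’s actual output):

```python
# Positional labelling: arguments numbered by the order in which they appear.
positional = {
    "John ate the toast":          {"arg1": "John",      "arg2": "the toast"},
    "The toast was eaten by John": {"arg1": "The toast", "arg2": "John"},
}

# Case-role labelling: arguments classified by their relation to the action.
case_roles = {
    "John ate the toast":          {"Agent": "John", "Patient": "the toast"},
    "The toast was eaten by John": {"Agent": "John", "Patient": "the toast"},
}

# The active/passive pair comes out differently under positional labels,
# but receives one and the same logical analysis under case roles.
print(positional["John ate the toast"] == positional["The toast was eaten by John"])  # False
print(case_roles["John ate the toast"] == case_roles["The toast was eaten by John"])  # True
```

Which of these (or some third scheme) is preferable is exactly the kind of substantive question on which analysis schemes currently diverge.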
My Project
SUSANNE has now produced a first attempt at a comprehensive grammatical
taxonomy for English, covering word-classification, surface grammatical
structure, logical grammar, and limited aspects of word meaning.
SUSANNE (a
project sponsored by the UK Economic and Social Research Council[2] from 1988
to 1992) had the initial aim of creating a parsed sample of English to serve as
a source of statistics for use in probabilistic NLP techniques. This aim was achieved; but, as the work
proceeded, it became increasingly clear that the chief value of the project lay
in the rigorous taxonomic scheme that was developed to ensure that an analysis
was available for any phenomenon occurring in the language and that analyses of
different texts were always consistent with one another. By the time the project was complete,
there were a number of parsed English corpora in existence, the largest of
which (Mitchell Marcus’s Pennsylvania Treebank, described elsewhere in this
volume) dwarfs the SUSANNE Corpus in size; but the explicit SUSANNE taxonomic
scheme is so far as I am aware the only extant large-scale attempt at a
rigorous, comprehensive grammatical taxonomy for any natural language which is reproducible, in the sense that two analysts both
armed with the scheme and faced with the same sample of realistically “messy”
text but working independently must annotate the grammar of the text
identically. Only because the
corpus is limited in size was it feasible to examine its contents with the
degree of intensity needed in order to produce and document an annotation
scheme which approaches the ideal of being sensitive to the full range of
grammatical subtleties found in the language it represents. The SUSANNE Corpus may still have a
value as a statistical database; but, as I now see it, the main raison
d’être of the corpus is
as a guarantee of the fact that the parsing scheme is developed from and
applicable to realistic data, rather than being a mere aprioristic invention.
The first release
of the electronic SUSANNE Corpus, including a documentation file giving a brief
outline of the analytic scheme, was circulated by the Oxford Text Archive via
anonymous ftp (file transfer protocol) beginning in October 1992;
correspondence received showed that within six months it was already in use in
academic and commercial research environments in many countries on at least
four continents.[3] The book
defining the parsing scheme in detail was completed in 1993, and is to be
published by Oxford University Press under the title English for the
Computer. I stress that this scheme cannot claim
to be more than a first attempt; even if it achieves a degree of recognition in
the research community, it unquestionably leaves a great deal of room for
extension and improvement in time to come.
Project SUSANNE
grew out of work I did from 1983 onwards to develop a surface parsing scheme
for the language of the LOB Corpus of written British English (cf. Garside et
al. 1987: ch. 7), in connexion with the Lancaster automatic parsing
project. The SUSANNE scheme
retained the body of analytic precedents developed in that work, and built on
them by adding new types of annotation to represent logical or “deep” grammar,
by refining existing areas of the annotation scheme (e.g. the classification of
words) in order to relate them to aspects of grammar not previously considered,
and by defining all the boundaries between adjacent analytic categories as
precisely as possible. The aim of
the SUSANNE scheme is to provide a predictable way of recording every aspect of
English grammar that is sufficiently precise to be susceptible of being reduced
to a formal notation.
Project SUSANNE
proceeded by taking a 128,000-word subset of the Brown Corpus of written
American English which had been manually analysed, though in a somewhat obscure
and inconsistent fashion, by Alvar Ellegård’s group at Gothenburg (Ellegård
1978), and reformatting, modifying, correcting, and extending the notations
used in that resource. Thus the
SUSANNE taxonomy as it now exists is based on samples of both British and
American varieties of English. It
focuses mainly on written English; however, a project on automatic parsing of
speech[4] which I was directing at the same period,
sponsored by the Ministry of Defence, was generating annotation standards for
the distinctive grammatical phenomena of spoken English which were designed to
be consistent with the SUSANNE standards for annotating written English, and
the book English for the Computer includes this material.
Like all complex
research enterprises, SUSANNE was a team effort. I should like to take this opportunity to record my debt to
the colleagues who helped to shape the SUSANNE scheme: Hélène Knight, Tim Willis, Nancy
Glaister, David Tugwell, and above all Robin Haigh.
In some
linguistic traditions, terms such as “taxonomy” and “botanizing” have strongly
negative connotations (see e.g. Katz 1971: 31ff.). But that is because different linguists have different
goals. If one’s aim is to use
natural language as a window onto the human cognitive faculty, I quite agree
that one should go for theoretically-insightful description of selected areas
of language. But in this paper I
am seeking to address readers who are involved with construction of automatic
systems for processing natural language and speech to achieve
economically-useful tasks, which is a different goal; for this purpose I believe taxonomy is desirable
and a high current priority.
The advantages
that will flow from availability of a comprehensive linguistic taxonomy are of
two kinds: more adequate
natural-language analysis systems, and greater sophistication among the user
community about the systems on offer.
A standard
taxonomic scheme will encourage the development of more adequate
natural-language analysis systems, by displaying to the system-builder a
relatively complete check-list, tabulated in the relatively formal fashion that
is appropriate for use in a computing context, of the constructions found in
the language and the structural categories relevant for description of those
constructions at different analytic levels. At present, someone designing language-analysis software who
aims to make his system comprehensive has no straightforward way to monitor how
much of the total task he has covered.
Linguistics textbooks list and discuss only a special subset of the
total range of linguistic phenomena. “Descriptive” or “pedagogical” grammars
are far more catholic in their coverage, but their discursive style makes them
difficult to use for SALT purposes:
what are needed are explicit lists of possibilities classified in terms of
explicit criteria, but descriptive grammars often leave such lists to be
inferred by the reader while including much material that is redundant and
confusing for computational purposes.
Descriptive grammars offer something like a zoo-visitor’s guide rather
than a Linnaean taxonomy. The wide availability of a systematic listing of linguistic phenomena will inevitably act as a spur encouraging system developers to ensure that they can handle all, or at least more than at present, of the constructions of the written and/or spoken language.
Furthermore, the
availability of a conspectus of the analytic categories used by other groups
will enable system designers to make more judicious choices about how their own
systems should represent the phenomena they cover. For any given analytic area it will become easy either to
make a reasoned decision that a particular existing scheme embodies the best
available practice and should be followed, or else to construct a new scheme
which offers a clear improvement over all extant alternatives. At present, it is all too easy to waste
much time reinventing wheels, in the shape of language-analysis schemes that
are less adequate than schemes which have already been worked out, but which
are documented only sketchily and in obscure publications or unpublished
research reports.
Turning to the
customer for a natural-language analysis system, or for an application system
that embodies natural language analysis:
he wants to know what he is getting for his money. At present that is not really possible,
since natural language is so massively complex and there are no
generally-understood benchmarks in terms of which a system’s performance can be
stated. A standard taxonomy makes
it easy for system-designers to state in concise but precise terms what areas of English their systems are, and what areas they are not, claimed to cope with,
and to state the variant analytic concepts used (where the choice between
alternative analysis schemes might be relevant to a customer’s decisions).
This situation
should have two beneficial effects on the market. By facilitating informed judgments about which systems are
more and which less successful, it will speed up the process by which good work
is accepted and extended while poorer work is winnowed away. And the fact that customers for the
technology will no longer feel themselves to be buying a pig in a poke should
make the market more receptive to SALT overall, thus promoting industrial
uptake.
Taxonomic work of
the sort described must be carried out separately for separate languages. At one time it was a truism within the
discipline of linguistics that each language must be analysed in its own terms,
and that (as Leonard Bloomfield put it (1933: 20)) “Features which we think
ought to be universal may be absent from the very next language that becomes
accessible”. More recently, with
the emphasis among theoretical linguists on possible biological bases of
language structure, this point of view has tended to be lost sight of. But even from a theoretical viewpoint it
is acknowledged that not all linguistic properties are biologically determined
and therefore universal; and a practical, NLP-oriented taxonomic scheme must be
heavily concerned with aspects of language which are governed by culture rather
than by biology, and which therefore are very unlikely to be constant as
between languages of separate societies.
It may possibly be that there are universal constraints on the relative
ordering, say, of subject, verb, and object within a clause and that these are
a consequence of genetically-inherited mechanisms, but it is hardly likely that
genetics will have implications for the way a nation chooses to set out its
postal addresses.[5] Clearly one
would be quite unlikely to produce the most practically-adequate classification
scheme for, say, the words of Chinese, or even of Italian, if one began with
the presumption that this is likely to be closely similar to a scheme that has
already been worked out for English.
However, although
the details of a linguistic taxonomy will be specific to an individual
language, the principles to which such taxonomies need to conform will be
common ones. Compare the way in
which the detailed classifications of different plant genera are very different
one from another, but all equally obey the complex laws of nomenclature
promulgated by the International Committee on Botanical Nomenclature. Thus, taxonomic work on any one
language should have much to gain from being carried out with a lively
awareness of taxonomic standards worked out for other languages.
[1] Johansson et al. (1991)
have compiled a survey of existing conventions for speech transcription, Stig
Johansson being convener of a three-man group (with Doug Biber and myself)
charged with formulating recommendations in this domain for the US/EC-sponsored
Text Encoding Initiative.
[2] ESRC reference R00023
1142, “Construction of an Analysed Corpus of English”. The name “SUSANNE” stands for “Surface
and Underlying Structural Analyses of Naturalistic English”.
[3] The SUSANNE Corpus is
available freely and without formality to anyone who wishes to use it, from www.grsampson.net → downloadable research
resources.
[4] Ministry of Defence
reference D/ER/1/9/4/2062/151(RSRE), “A Speech-Oriented Stochastic Parser”.
[5] The above remarks are not
intended to imply agreement with the thesis that “core” aspects of language
structure are genetically determined. As it happens, I believe that that thesis is quite
wrong (cf. Sampson 1989). But for
present purposes it is unnecessary to enter into this debate; even if there
were some truth in the nativist account of certain aspects of linguistic
structure, it could scarcely help and might well distort the development of
practical, IT-oriented grammatical taxonomies of diverse languages to treat
that truth as central to the work of developing them.
REFERENCES
J. Aarts & T. van den Heuvel 1985 “Computational tools for the syntactic analysis of corpora”. Linguistics 23.303-35.
J. Allwood et al. 1990 “Speech management – on the non-written life of
speech”. Nordic Journal of
Linguistics 13.3-48.
B. Altenberg 1990 “Spoken
English and the dictionary”. In
Svartvik 1990.
L. Bloomfield 1933 Language.
Holt.
A.N. Chomsky 1965 Aspects
of the Theory of Syntax. MIT Press.
DARPA 1989 Speech
and Natural Language. Morgan Kaufmann.
A. Ellegård 1978 The
Syntactic Structure of English Texts, Gothenburg Studies in English, 43.
C.J. Fillmore 1968 “The case
for case”. In E. Bach & R.T.
Harms, eds., Universals in Linguistic Theory. Holt,
Rinehart and Winston.
C.J. Fillmore et al. 1988 “Regularity and idiomaticity in grammatical
constructions”. Language 64.501-538.
R.G. Garside et al., eds. 1987 The Computational Analysis of English.
Longman.
D. Grune & C.J.H. Jacobs 1990 Parsing Techniques: A Practical Guide.
Ellis Horwood.
P. Howell & K. Young 1990 “Speech repairs: report of work conducted October 1st 1989 –
March 31st 1990”. Report, Department
of Psychology, University College London.
S. Johansson 1986 The
Tagged LOB Corpus: Users’ Manual. Norwegian Computing Centre
for the Humanities (Bergen).
S. Johansson et al. 1991 “Working paper on spoken texts”. Report to Text Encoding Initiative working meeting, Myrdal,
Norway, November 1991.
J.J. Katz 1971 The
Underlying Reality of Language and Its Philosophical Import.
Harper & Row.
W. Labov 1975 “Empirical
foundations of linguistic theory”.
In R. Austerlitz, ed., The Scope of American Linguistics.
Pieter de Ridder Press.
W.J.M. Levelt 1983
“Monitoring and self-repair in speech”. Cognition 14.41-104.
B. Maegaard 1989
“EUROTRA: the machine
translation project of the European Communities”. In J.A. Campbell & J. Cuena, eds., Perspectives in
Artificial Intelligence, vol. ii. Ellis Horwood.
K.K. Obermeier 1989 Natural
Language Processing Technologies in Artificial Intelligence: The Science and
Industry Perspective. Ellis Horwood.
R. Quirk et al. 1985 A
Comprehensive Grammar of the English Language.
Longman.
U. Reyle & C. Rohrer 1988 Natural Language Parsing and Linguistic Theories. Kluwer.
G.R. Sampson 1987 “Evidence
against the ‘grammatical’/‘ungrammatical’ distinction”. In W. Meijs, ed., Corpus Linguistics
and Beyond. Rodopi.
G.R. Sampson 1989 “Language
acquisition: growth or
learning?” Philosophical Papers 18.203-40.
G.R. Sampson 1992
“Probabilistic
parsing”. In
J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82.
Mouton de Gruyter.
A.-B. Stenström 1990 “Lexical items peculiar to spoken discourse”. In Svartvik 1990.
J. Svartvik, ed. 1990 The
London-Lund Corpus of Spoken English. Lund
University Press.
T. Winograd 1983 Language
as a Cognitive Process, vol. 1: Syntax.
Addison-Wesley.
BIOGRAPHICAL NOTE
Born in 1944,
Geoffrey Sampson studied Oriental languages, linguistics, and computer science
at Cambridge and Yale during the 1960s.
After some years at Oxford and the LSE he worked for more than a decade
at the University of Lancaster, where he was a founder member with Geoffrey
Leech and Roger Garside of the Unit for Computer Research on English
Language. He left Lancaster for
the University of Leeds in 1985, and later moved into self-employment as a
language engineering consultant.
In 1991 he returned to the academic profession at the University of
Sussex, where he is Director of the Centre for Advanced Software Applications
and Chairman of Computer Science & Artificial Intelligence.