The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.

Published in Literary and Linguistic Computing 8.267–73, 1993.


The Need for Grammatical Stocktaking

 

 

Geoffrey Sampson

 

School of Cognitive and Computing Sciences

University of Sussex

 

 

 

ABSTRACT

 

Natural language research needs something akin to a “Linnaean taxonomy”, identifying and rigorously specifying boundaries between the various structural categories of a language, to allow data to be collected and exchanged in unambiguous form.  The author has made a first attempt to provide such a taxonomy for English grammar; the scheme is to appear in book form, and an electronic corpus annotated in conformity with it has been available since 1992.

 

 

1          Introduction

 

A central requirement for SALT (speech and language technology) progress is a comprehensive stocktaking and classification of the linguistic phenomena (word-types, grammatical constructions, etc.) that are found in real-life written and spoken usage in relevant natural languages, emphasizing comprehensive coverage and explicitness of classification rather than theoretical depth.  Those of us whose native language is English think of our language as relatively thoroughly studied, yet despite the length of time during which computational linguists have addressed the task of processing English, such a classification scheme has not been available for our language.  I surmise that the same holds for other European languages.

 

In the case of English I have recently attempted to fill this gap by producing a parsing scheme – the “SUSANNE” scheme – which offers explicit proposals for grammatical taxonomy that the research community may adopt, alter, extend, or otherwise treat as it sees fit.  The SUSANNE annotation scheme is certainly not presented as “the right scheme” for describing English grammar, and indeed one of the points I aim to make in what follows is that “correctness” is not an applicable concept in this domain.  What matters is that an annotation scheme should be practical, publicly known, unambiguous, comprehensive, and explicit; it is quite possible that alternative schemes might fulfil these criteria equally well while being very different from one another in their details.  The chief purpose of this paper is to explain why, at the present juncture in the development of speech and language technology, this sort of work is worth doing:  why information technology needs grammatical taxonomies.

 

 

2          The present situation

 

Natural language processing (NLP) systems crucially need the ability to parse – to infer the structure of an input text or spoken utterance.  Parsing is widely recognized as “[t]he central problem” (Obermeier 1989:69) in virtually all NLP applications.  This is relatively obvious in the case of applications towards the “intelligent/knowledge-based” end of the spectrum, such as question-answering systems (front ends to databases), or machine translation.  In both these areas, the largest problem lies in analysing the input (“understanding” users’ questions in the case of a question-answering system, or source-language texts in the case of a machine-translation system); if this can be achieved, synthesizing appropriate responses (answers to questions, or target-language translations) is a lesser difficulty.  Even in areas which seem prima facie not to require natural-language “understanding”, parsing is also needed.  Automatic speech recognition, for instance for voice-driven typewriters, needs the ability to tell what is a grammatically-plausible arrangement of words in order to constrain the alternative word-hypotheses offered by processing the speech signal.
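To make the last point concrete, the following is a minimal sketch (in Python, with invented data; it stands in for the far more sophisticated techniques used in real recognizers, and nothing in it is drawn from any particular system) of how even crude distributional statistics over word sequences can be used to prefer one word-hypothesis sequence over another offered by an acoustic front end.

```python
# Minimal sketch: score competing word-hypothesis sequences with bigram
# statistics gathered from a small training sample.  All data are invented.
from collections import defaultdict

def train_bigrams(sentences):
    counts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        for prev, curr in zip(["<s>"] + words, words + ["</s>"]):
            counts[prev][curr] += 1
    return counts

def score(counts, words, smoothing=1e-6):
    """Product of smoothed bigram relative frequencies for a word sequence."""
    p = 1.0
    for prev, curr in zip(["<s>"] + words, words + ["</s>"]):
        total = sum(counts[prev].values())
        p *= (counts[prev][curr] + smoothing) / (total + smoothing)
    return p

training = [["the", "dog", "barked"], ["the", "cat", "sat"], ["a", "dog", "sat"]]
model = train_bigrams(training)

# Two competing hypotheses for the same stretch of speech:
hypotheses = [["the", "dog", "sat"], ["dog", "the", "sat"]]
print(max(hypotheses, key=lambda h: score(model, h)))   # ['the', 'dog', 'sat']
```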

 

Because of its practical significance, very many groups internationally have been working on the automatic parsing problem for English – sometimes within the framework of development of a particular application system, but often as a freestanding research problem.  (For recent surveys see e.g. Reyle & Rohrer 1988, Grune & Jacobs 1990; and see items in DARPA (1989) and subsequent Proceedings in the same series.)  The great complexity of any natural language and consequent long-term nature of the process of developing natural-language parsers, together with the fact that humans’ ability to decode the structure of their mother-tongue seems not to be significantly geared to individual topic areas but rather to be general, make it appropriate to tackle automatic parsing of a natural language as an independent research goal aiming to produce systems that can be slotted in to diverse applications.

 

It would be natural to suppose, when automatic parsing has been a widely-pursued goal over many years, that there is general agreement about what the analyses of English sentences should look like – what target an English-language parser should be aiming at.  This is far from true.  The normal situation has been that an individual research group makes its own independent decisions about the intended output of its parser, in such a way that one group’s analyses are not just notationally distinct from but usually substantially non-equivalent to those of other groups.  Furthermore, description/definition of target analysis schemes has tended not to be a high priority, so it is quite difficult for an outsider to know just what structural properties of English a particular group’s parser aims to specify; researchers have usually been far more concerned to publicize their parsing system (the nature of the software they have created in order to move from raw input to analysed output) than to publicize their parsing scheme (the nature of the structural analyses comprised in the output of a parsing system) – indeed it is not always seen as important to codify the latter explicitly even for a research group’s internal purposes.  And it is clear that the (explicit or implicit) parsing schemes of virtually all groups are highly incomplete:  any such scheme will offer no specific analyses for very many phenomena that frequently occur in English.

 

There are (at least) two reasons for this state of affairs, both stemming from aspects of the recent history of linguistics.  First, computational linguists have tended to treat their subject as a branch of theoretical linguistics, and theoretical linguistics has for decades been concerned with rival notational systems for capturing highly abstract generalizations about a limited range of “core” grammatical constructions, such as relative clauses, or verb complements.  To a theoretical linguist it is simply not part of his goals to use an analytic system which comprehensively covers everything that occurs in the language in practice, which represents analytic distinctions in a maximally straightforward, self-explanatory fashion, or which coincides with the notations used by rival theorists; there are valid intellectual reasons why these should not be goals for theoretical linguistics, but the result has been (since it is largely the same people who practice both disciplines) that they have not become goals of computational linguistics either, where their lack is unfortunate.

 

Secondly, within linguistics there has been a tradition (which has only recently begun to dissolve) of hostility towards corpus studies – the reasons for this are analysed by Aarts & van den Heuvel (1985: 303ff.); yet it is only through work with corpora (large samples of language as used in real life) that the analyst is forced to confront the great diversity of linguistic phenomena that occur in practice and to seek an analytic scheme comprehensive enough to cope with them.  If the linguist relies on data invented by himself in his role as native-speaker of the language, as has been more usual (not because linguists are lazy, but as a consequence of methodological axioms about “competence” and “performance” (Chomsky 1965) which are respectable within theoretical linguistics though, again, they are less relevant to practical NLP research), then it is near-inevitable that the linguist will focus on a limited range of phenomena which the research community has picked out as posing interesting problems, while overlooking many other phenomena that happen never to have struck anyone as noteworthy.  (Some linguists, e.g. Labov (1975), would argue that the emphasis on invented rather than observed data has led to significant distortion even of those facts that are taken into account, but my point does not depend on this relatively controversial claim.)

 

 

3          Overlooked phenomena

 

Some specific consequences are obvious.  Written-language punctuation, for instance, is normally excluded from grammatical analysis altogether.  NLP applications often concern written rather than spoken language, and the sentences discussed by theoretical and computational linguists commonly involve the formal, elaborate style characteristic of the written mode; but theoretical linguists have never discussed punctuation, and there is no consensus among computational linguists about how (or whether) to include punctuation marks in parse-trees (despite the fact that for automatic analysis of written language punctuation marks are highly significant, comparable in importance to grammatical words such as of or the).  Again, real-life (written and spoken) language contains many high-frequency phenomena such as dates (August 7th 1992), weights and measures (five foot ten), Harvard-style bibliographical references in academic literature (Greenberg (1963: 90) wrote ...), addresses (10, Bridge Rd, Ambridge, Borsetshire BC21 7EW), etc. etc., which have their own characteristic structures in different languages (compare the varying national formats for postal addresses, or compare Portuguese 2$50 with American $2.50, for instance); but theoretical linguists – and indeed those who produce language descriptions of a more traditional type, such as (for English) the series of grammars by Randolph Quirk and his collaborators culminating in Quirk et al. (1985) – perceive them as peripheral, and for these phenomena too there is no consensus about how they should be analysed.  Yet for practical NLP applications they will often be as important as many of the constructions that theoretical linguists see as part of the “core” of language.
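By way of illustration, here is a minimal sketch of how a few of these "peripheral" but high-frequency formats might be explicitly recognized and assigned categories; the category names and patterns are my own inventions for the purpose of the example, not part of any published scheme.

```python
import re

# Illustrative recognizers for a few of the formats mentioned in the text.
PATTERNS = [
    ("DATE", re.compile(r"\b(January|February|March|April|May|June|July|August|"
                        r"September|October|November|December) "
                        r"\d{1,2}(st|nd|rd|th)? \d{4}\b")),
    ("MEASURE", re.compile(r"\b(four|five|six) foot (two|four|six|eight|ten|eleven)\b")),
    ("HARVARD_REF", re.compile(r"\b[A-Z][a-z]+ \(\d{4}: \d+\)")),
]

def classify(text):
    """Return (category, matched string) pairs for each recognized phenomenon."""
    hits = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((label, m.group(0)))
    return hits

print(classify("Greenberg (1963: 90) wrote that on August 7th 1992 a man of five foot ten ..."))
# [('DATE', 'August 7th 1992'), ('MEASURE', 'five foot ten'),
#  ('HARVARD_REF', 'Greenberg (1963: 90)')]
```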

 

The neglected areas just listed relate chiefly to written language; but there are as many or perhaps more phenomena characteristic of spoken English which tend to fall outside the purview of computational linguistics, with its focus on relatively formal, impersonal language; cf. Allwood et al. (1990).  It is clear, for instance, that word-classification schemes developed for tagging written English words are likely to be very inadequate for tagging the words of spoken utterances, which are full of items serving discourse rather than logical functions (good work has been done in this area by Swedish researchers, e.g. Stenström 1990, Altenberg 1990).

 

Roger Moore of the Speech Research Unit at the Defence Research Agency, Malvern, urges (in a prepublication draft of his paper for the present volume) that speech science and technology have now developed to the point where they face “an overwhelming need for agreed standards” in transcribing the structure of everyday, unprepared speech; he notes that no such conventions are currently known to exist, and suggests that speech scientists themselves are not best qualified to devise conventions relating to structural features.  One problem is that speech – particularly “private” speech such as face-to-face or telephone conversations, as opposed to lectures, broadcasts, etc. – contains a high incidence of phenomena such as speech repairs and hesitations which tend to be invisible in standard grammatical description, since this is usually based on a “competence” version of linguistic behaviour that excludes them.  From the limited work I have done in this area to date (see §6 below), it is clear that existing attempts to specify notations for these phenomena (notably Levelt 1983, Howell & Young 1990) are unsatisfactory:  they depend on spoken language conforming to patterns which, in practice, are frequently violated.[1]

 

Both in the case of writing and in that of speech, NLP applications require the ability to penetrate beyond the surface grammar of a text or utterance to disentangle its logic.  Great attention has been paid to this issue by linguistic theorists in recent decades with respect to the “competent” language characteristic of writing, where it largely involves the reconstruction of deleted items whose identity is implied by the surface grammar, and recovery of the logical position of “transformationally moved” items.  In addition to these matters, though, spoken language involves a large extra layer of surface/logical contrasts having to do with the unannounced changes of tack, breaking off of utterances before their logical completion, production of logically-confused utterances, etc., which are frequent in speech but are normally edited out of writing.  It will be a long time before automatic language processing systems are capable of dealing adequately with naturalistic speech in which such phenomena are salient; but a prerequisite for advances in that direction is availability of databases showing what patterns the phenomena fall into in practice, and this in turn presupposes adequate annotation schemes.

 

 

4          The need for classification

 

Furthermore, even in the areas of language which are shared between writing and speech and which linguists would see as part of what a language description ought (at least ideally) to cover, there is a vast amount to be done in terms of listing and classifying the phenomena that occur.  Many constructions are omitted from theoretical descriptions not for reasons of principle but because they are not very frequent and/or do not seem to interact in theoretically-interesting ways with central aspects of grammar, and although they are mentioned in traditional grammars they are not systematically assigned places in explicit inventories of the resources of the language.  One example among very many might be the English the more ... the more ... construction discussed by Fillmore et al. (1988), an article which makes some of the same points I am trying to make about the tendency for much of a language’s structure to be overlooked by the linguist.  Discussion between research groups about the grammatical resources of a language is hampered by the fact that traditional terminology is used in inconsistent and sometimes vague ways.  For instance, various English-speaking linguists use the terms “complement” and “predicate” in quite incompatible ways.  Other terms, such as “noun phrase”, are used much more consistently, in the sense that different groups agree on core examples of the term; but traditional grammars devote little attention to defining clearcut boundaries between such terms that would allow unclear cases to be assigned predictably to one category or another.  The work of producing the tagged version of the LOB Corpus of written British English forced Stig Johansson to produce a short-book-length specification of boundaries between the 136 tags used to classify English words; this work (Johansson 1986) is so far as I know unique in English linguistics.  Johansson’s manual is by no means the last word to be said on word classification, and there is as much or, in my view, even more to be done in the area of classifying grammatical constituents.

 

Unlike the writers of traditional language descriptions, theoretical linguists are in one sense heavily concerned with the definition of boundaries between grammatical constructions.  A theoretician might well be interested in the question whether or not (to borrow an example from Garside et al. (1987: ch. 7)) the wording following is in the sentence A dog is as much God’s handiwork as a man should be classified as a noun phrase.  But the sense in which a theoretician would address himself to this question is different from the sense in which it requires an answer for the purposes of the linguistic stocktaking advocated here.  For the theoretician, the question would be whether as much God’s handiwork as a man “really is” derived from the same node as core examples of noun phrases, such as pronouns or proper names, in the most psychologically-correct or explanatorily adequate formal definition of English.  A question of this sort is very deep, and can be answered only provisionally and for a limited number of grammatical phenomena.  For NLP purposes, the most pressing need I perceive is for an explicit, comprehensive classification scheme to be imposed on a language, without too many worries about whether its details are psychologically or otherwise correct, so that we can all talk about the elements of the language using a common notation and knowing that we mean the same thing by our notational categories and that the set of categories is reasonably exhaustive.

 

What I believe we need in computational linguistics is something like the Linnaean taxonomy for the biological world.  The Linnaean system of plant nomenclature did not always agree with the natural, biologically-valid classification – and Linné knew that.  Its great advantages were that it enabled the researcher to locate a definite name for any specimen (and to know that any other botanist in the world would use the same name for that specimen), and that it gave him something approaching an exhaustive conspectus of the “data elements” which a more theoretical approach would need to be able to cope with.  The confusions inherent in traditional vernacular species names were bypassed by adopting an artificial system of Latin binomials; likewise a standardized grammatical taxonomy could avoid misunderstandings about e.g. “predicate” or “complement” by using neutral code letters or the like rather than descriptive words for its categories. (For further discussion of the analogy between linguistic and biological taxonomy, cf. Sampson (1992).)

 

I wrote above that the tabulation of structural phenomena should be “reasonably exhaustive”.  To aim at perfect comprehensiveness would be to follow a mirage, since at grammatical and semantic levels any natural language is an open-ended system.  Language is an intellectual product, and the diversity of constructions that a speaker/writer can use is not constrained by any physical limits.  I have published a mathematical investigation (Sampson 1987) which tends to suggest that there is no finite bound to the range of distinct constructions in English, so that as ever larger corpora are examined additional constructions will always continue to be found.  Richard Sharman of the IBM UK Scientific Centre likens the grammar of a natural language to a fractal object such as a coastline, in which new detail continues to emerge indefinitely as one looks closer and closer.  But, while an ultimately exact statement of the shape of France is impossible, one can do much better than say “France is hexagonal”.  One important desideratum for an IT-oriented grammatical taxonomy for a language is informed judgment about what levels of detail it is appropriate to specify for different areas of the language in our current state of knowledge.
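The kind of empirical check that lies behind the open-endedness claim can be sketched simply: extract local tree shapes (a mother label together with its sequence of daughter labels, standing proxy here for "distinct constructions") from a parsed corpus, and track how many distinct types have appeared after successive slices of text.  Applied to real treebank data the resulting curve keeps rising rather than flattening as a closed, finite inventory would; the names below are illustrative rather than drawn from any actual treebank.

```python
def distinct_constructions_growth(local_trees, slice_size=1000):
    """local_trees: iterable of (mother_label, tuple_of_daughter_labels) pairs.

    Returns the cumulative number of distinct local-tree types seen after
    each successive slice of `slice_size` trees.
    """
    seen = set()
    growth = []
    for i, tree in enumerate(local_trees, 1):
        seen.add(tree)
        if i % slice_size == 0:
            growth.append(len(seen))
    return growth

# Toy call on a handful of rules extracted from a parsed sample:
sample = [("S", ("NP", "VP")), ("NP", ("AT", "NN")), ("NP", ("AT", "JJ", "NN"))]
print(distinct_constructions_growth(sample, slice_size=1))    # [1, 2, 3]
```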

 

 

5          Logical structure

 

While theoretical and computational linguists have not striven for comprehensiveness of coverage, they have put considerable effort into identifying divergences between the “surface structure” of natural-language utterances and their “underlying structure” or “logical form”.  (To illustrate this distinction via the classic example:  John is eager to please and John is easy to please share the same surface structure, but logically their grammar is quite different:  in one case John is the logical subject and in the other case the logical object of please.  Likewise it might be said that active/passive pairs such as John ate the toast v. The toast was eaten by John are distinct in surface grammar but logically equivalent.)  For many SALT applications, identifying the underlying logic of an input is a necessary stage of analysis; only for a few special cases such as text-to-speech systems is surface parsing alone arguably sufficient.  However, there is much more divergence between various theorists’ conceptions of logical form than of surface structure.  It is probably safe to say that everyone agrees on representing the surface grammar of sentences by means of labelled tree structures (or some notation clearly equivalent to labelled trees), though the alphabet of node-labels would differ considerably from research group to research group, and to a lesser extent the shapes of the trees drawn for particular sentences would also differ.  In the area of logical form, however, although some researchers would again use labelled trees to represent the facts others would use quite different methods of representation (for a survey see Winograd 1983).  Sometimes it is not clear whether these differences are notational or substantive.  Thus, Winograd (op. cit.) represents the logical forms output by the ATN parsers with which he is centrally concerned by means of diagrams that look superficially quite unlike labelled trees, and which are never brought into relationship with the trees that Winograd displays in connexion with other systems of analysis he discusses; yet these diagrams can be mechanically converted into labelled tree structures that are unorthodox in only one or two minor respects.
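A minimal illustration of the point, using a toy node structure and category labels of my own devising rather than the notation of any particular scheme: the two sentences can be given one and the same surface tree shape, while their logical records assign John to opposite argument slots of please.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str                       # category label, e.g. "S", "NP", "VP"
    children: list = field(default_factory=list)
    word: Optional[str] = None       # filled in at leaf nodes only

def leaf(label, word):
    return Node(label, word=word)

# One surface shape serves both sentences:  [S [NP John] [VP is ADJ [INF to please]]]
def surface(adjective):
    return Node("S", [
        Node("NP", [leaf("N", "John")]),
        Node("VP", [leaf("V", "is"), leaf("ADJ", adjective),
                    Node("INF", [leaf("TO", "to"), leaf("V", "please")])]),
    ])

eager_sentence = surface("eager")
easy_sentence  = surface("easy")

# The logical records differ:  who pleases whom?
logical_eager = {"predicate": "please", "logical_subject": "John",
                 "logical_object": None}         # John does the pleasing
logical_easy  = {"predicate": "please", "logical_subject": None,
                 "logical_object": "John"}       # John gets pleased
```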

 

By contrast, in other areas differences between notations for logical form are entirely real and hard to resolve.  An example would be the question of how to distinguish the various arguments of the predicate element (usually the verb) of a clause.  The arguments include the items that appear as subject, direct object, etc. in surface grammar, but some researchers regard these categories as unhelpful for specifying the logic of a clause (note that the grammatical subject of a verb is by no means always the “doer” of the action).  Numerous alternative proposals are available in the literature; many are couched in terms of Fillmorean “case theory” (Fillmore 1968), but they diverge widely with respect to the sets of cases recognized, and other schemes again use concepts other than case.
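To illustrate the distinction at issue with the active/passive pair quoted earlier (the role labels below are mine, borrowed loosely from Fillmorean case terminology, and reproduce no particular published scheme): surface grammatical relations and argument roles come apart precisely where the grammatical subject is not the "doer".

```python
# Surface relations versus argument roles for an active/passive pair.
active  = {"sentence": "John ate the toast",
           "surface":  {"subject": "John", "direct_object": "the toast"},
           "roles":    {"agent": "John", "patient": "the toast"}}

passive = {"sentence": "The toast was eaten by John",
           "surface":  {"subject": "the toast", "by_phrase": "John"},
           "roles":    {"agent": "John", "patient": "the toast"}}

# The surface subjects differ, but the role assignment is identical in both:
assert active["roles"] == passive["roles"]
assert active["surface"]["subject"] != passive["surface"]["subject"]
```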

 

The importance of logical analysis for NLP applications, already alluded to, might suggest that any project oriented towards such an application would be forced to develop a well-defined analytic approach in this area.  Surprisingly, this is not always true of even the largest projects.  The European Communities’ EUROTRA project for machine translation between all the official languages of the member-states was probably the largest and most expensive NLP project anywhere in the world (after a pilot phase lasting several years it was fully established in 1982 and subsequently employed on the order of one hundred full-time researchers at any time, spread over all EC member states).  According to Bente Maegaard (1989: 44), even at that late date the EUROTRA representations of source- and target-language logical structures resolved the issue discussed above about identifying the various arguments of a verb simply by labelling them “arg1”, “arg2”, etc. – i.e. they said nothing substantive at all about this important aspect of logical structure and merely tried to rely on the accident that Western European languages usually order corresponding arguments of corresponding verbs in the same sequence.  This is a striking illustration of the way in which the level of sophistication of NL analysis targets is currently lagging behind the sophistication of the software being created to execute NL analysis.

 

 

6          The SUSANNE taxonomy

 

My Project SUSANNE has now produced a first attempt at a comprehensive grammatical taxonomy for English, covering word-classification, surface grammatical structure, logical grammar, and limited aspects of word meaning.

 

SUSANNE (a project sponsored by the UK Economic and Social Research Council[2] from 1988 to 1992) had the initial aim of creating a parsed sample of English to serve as a source of statistics for use in probabilistic NLP techniques.  This aim was achieved; but, as the work proceeded, it became increasingly clear that the chief value of the project lay in the rigorous taxonomic scheme that was developed to ensure that an analysis was available for any phenomenon occurring in the language and that analyses of different texts were always consistent with one another.  By the time the project was complete, there were a number of parsed English corpora in existence, the largest of which (Mitchell Marcus’s Pennsylvania Treebank, described elsewhere in this volume) dwarfs the SUSANNE Corpus in size; but the explicit SUSANNE taxonomic scheme is so far as I am aware the only extant large-scale attempt at a rigorous, comprehensive grammatical taxonomy for any natural language which is reproducible, in the sense that two analysts both armed with the scheme and faced with the same sample of realistically “messy” text but working independently must annotate the grammar of the text identically.  Only because the corpus is limited in size was it feasible to examine its contents with the degree of intensity needed in order to produce and document an annotation scheme which approaches the ideal of being sensitive to the full range of grammatical subtleties found in the language it represents.  The SUSANNE Corpus may still have a value as a statistical database; but, as I now see it, the main raison d’être of the corpus is as a guarantee of the fact that the parsing scheme is developed from and applicable to realistic data, rather than being a mere aprioristic invention.
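The reproducibility criterion can itself be given an operational form.  The sketch below is my illustration only, not project code, and the tag and parse strings in it are invented: two independent analyses of the same text are compared word by word, and any residual disagreements are surfaced as candidates for tighter boundary definitions in the scheme.

```python
def compare_annotations(analyst_a, analyst_b):
    """Each argument: one (word, wordtag, parse_field) tuple per word of the text."""
    assert len(analyst_a) == len(analyst_b), "the two analyses cover different texts"
    disagreements = [(i, a, b)
                     for i, (a, b) in enumerate(zip(analyst_a, analyst_b)) if a != b]
    agreement = 1.0 - len(disagreements) / len(analyst_a)
    return agreement, disagreements

a = [("The", "ART", "[S[N:s"), ("dog", "NOUN", "N:s]"), ("barked", "VERB_PAST", "[V.V]S]")]
b = [("The", "ART", "[S[N:s"), ("dog", "NOUN", "N:s]"), ("barked", "VERB_PRES", "[V.V]S]")]
rate, diffs = compare_annotations(a, b)
print(rate)      # 0.666...: the third word was tagged differently by the two analysts
print(diffs)
```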

 

The first release of the electronic SUSANNE Corpus, including a documentation file giving a brief outline of the analytic scheme, was circulated by the Oxford Text Archive via anonymous ftp (file transfer protocol) beginning in October 1992; correspondence received showed that within six months it was already in use in academic and commercial research environments in many countries on at least four continents.[3]  The book defining the parsing scheme in detail was completed in 1993, and is to be published by Oxford University Press under the title English for the Computer.  I stress that this scheme cannot claim to be more than a first attempt; even if it achieves a degree of recognition in the research community, it unquestionably leaves a great deal of room for extension and improvement in time to come.

 

Project SUSANNE grew out of work I did from 1983 onwards to develop a surface parsing scheme for the language of the LOB Corpus of written British English (cf. Garside et al. 1987: ch. 7), in connexion with the Lancaster automatic parsing project.  The SUSANNE scheme retained the body of analytic precedents developed in that work, and built on them by adding new types of annotation to represent logical or “deep” grammar, by refining existing areas of the annotation scheme (e.g. the classification of words) in order to relate them to aspects of grammar not previously considered, and by defining all the boundaries between adjacent analytic categories as precisely as possible.  The aim of the SUSANNE scheme is to provide a predictable way of recording every aspect of English grammar that is sufficiently precise to be susceptible of being reduced to a formal notation.

 

Project SUSANNE proceeded by taking a 128,000-word subset of the Brown Corpus of written American English which had been manually analysed, though in a somewhat obscure and inconsistent fashion, by Alvar Ellegård’s group at Gothenburg (Ellegård 1978), and reformatting, modifying, correcting, and extending the notations used in that resource.  Thus the SUSANNE taxonomy as it now exists is based on samples of both British and American varieties of English.  It focuses mainly on written English; however, a project on automatic parsing of speech[4] which I was directing at the same period, sponsored by the Ministry of Defence, was generating annotation standards for the distinctive grammatical phenomena of spoken English which were designed to be consistent with the SUSANNE standards for annotating written English, and the book English for the Computer includes this material.

 

Like all complex research enterprises, SUSANNE was a team effort.  I should like to take this opportunity to record my debt to the colleagues who helped to shape the SUSANNE scheme:  Hélène Knight, Tim Willis, Nancy Glaister, David Tugwell, and above all Robin Haigh.

 

 

7          Scientific and commercial benefits

 

In some linguistic traditions, terms such as “taxonomy” and “botanizing” have strongly negative connotations (see e.g. Katz 1971: 31ff.).  But that is because different linguists have different goals.  If one’s aim is to use natural language as a window onto the human cognitive faculty, I quite agree that one should go for theoretically-insightful description of selected areas of language.  But in this paper I am seeking to address readers who are involved with construction of automatic systems for processing natural language and speech to achieve economically-useful tasks, which is a different goal; for this purpose I believe taxonomy is desirable and a high current priority.

 

The advantages that will flow from availability of a comprehensive linguistic taxonomy are of two kinds:  more adequate natural-language analysis systems, and greater sophistication among the user community about the systems on offer.

 

A standard taxonomic scheme will encourage the development of more adequate natural-language analysis systems, by displaying to the system-builder a relatively complete check-list, tabulated in the relatively formal fashion that is appropriate for use in a computing context, of the constructions found in the language and the structural categories relevant for description of those constructions at different analytic levels.  At present, someone designing language-analysis software who aims to make his system comprehensive has no straightforward way to monitor how much of the total task he has covered.  Linguistics textbooks list and discuss only a special subset of the total range of linguistic phenomena. “Descriptive” or “pedagogical” grammars are far more catholic in their coverage, but their discursive style makes them difficult to use for SALT purposes:  what is needed is explicit lists of possibilities classified in terms of explicit criteria, but descriptive grammars often leave such lists to be inferred by the reader while including much material that is redundant and confusing for computational purposes.  Descriptive grammars offer something like a zoo-visitor’s guide rather than a Linnaean taxonomy.  The wide availability of a systematic listing of linguistic phenomena will inevitably act as a spur encouraging system developers to ensure that they can handle all, or at least more than at present, of the constructions of the written and/or spoken language.
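The "check-list" use of a taxonomy described here admits a very direct implementation; the sketch below (with invented category names, not those of any actual scheme) simply reports which construction categories of a published inventory a system's output does and does not yet produce.

```python
# Invented inventory of construction categories, standing in for a published scheme.
TAXONOMY = {"relative_clause", "verb_complement", "date_expression",
            "postal_address", "measure_phrase", "speech_repair",
            "comparative_correlative"}    # e.g. "the more ... the more ..."

def coverage_report(categories_handled):
    """Compare the categories a system actually produces against the full inventory."""
    covered = TAXONOMY & categories_handled
    missing = TAXONOMY - categories_handled
    return {"covered": sorted(covered),
            "missing": sorted(missing),
            "coverage": len(covered) / len(TAXONOMY)}

report = coverage_report({"relative_clause", "verb_complement", "date_expression"})
print(round(report["coverage"], 2))   # 0.43 -- and the uncovered categories are listed
print(report["missing"])
```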

 

Furthermore, the availability of a conspectus of the analytic categories used by other groups will enable system designers to make more judicious choices about how their own systems should represent the phenomena they cover.  For any given analytic area it will become easy either to make a reasoned decision that a particular existing scheme embodies the best available practice and should be followed, or else to construct a new scheme which offers a clear improvement over all extant alternatives.  At present, it is all too easy to waste much time reinventing wheels, in the shape of language-analysis schemes that are less adequate than schemes which have already been worked out, but which are documented only sketchily and in obscure publications or unpublished research reports.

 

Turning to the customer for a natural-language analysis system, or for an application system that embodies natural language analysis:  he wants to know what he is getting for his money.  At present that is not really possible, since natural language is so massively complex and there are no generally-understood benchmarks in terms of which a system’s performance can be stated.  A standard taxonomy makes it easy for system-designers to state in concise but precise terms which areas of English their systems are claimed to cope with and which they are not, and to state the variant analytic concepts used (where the choice between alternative analysis schemes might be relevant to a customer’s decisions).

 

This situation should have two beneficial effects on the market.  By facilitating informed judgments about which systems are more and which less successful, it will speed up the process by which good work is accepted and extended while poorer work is winnowed away.  And the fact that customers for the technology will no longer feel themselves to be buying a pig in a poke should make the market more receptive to SALT overall, thus promoting industrial uptake.

 

 

8          Taxonomies for different languages

 

Taxonomic work of the sort described must be carried out separately for separate languages.  At one time it was a truism within the discipline of linguistics that each language must be analysed in its own terms, and that (as Leonard Bloomfield put it (1933: 20)) “Features which we think ought to be universal may be absent from the very next language that becomes accessible”.  More recently, with the emphasis among theoretical linguists on possible biological bases of language structure, this point of view has tended to be lost sight of.  But even from a theoretical viewpoint it is acknowledged that not all linguistic properties are biologically determined and therefore universal; and a practical, NLP-oriented taxonomic scheme must be heavily concerned with aspects of language which are governed by culture rather than by biology, and which therefore are very unlikely to be constant as between languages of separate societies.  It may possibly be that there are universal constraints on the relative ordering, say, of subject, verb, and object within a clause and that these are a consequence of genetically-inherited mechanisms, but it is hardly likely that genetics will have implications for the way a nation chooses to set out its postal addresses.[5]  Clearly one would be quite unlikely to produce the most practically-adequate classification scheme for, say, the words of Chinese, or even of Italian, if one began with the presumption that this is likely to be closely similar to a scheme that has already been worked out for English.

 

However, although the details of a linguistic taxonomy will be specific to an individual language, the principles to which such taxonomies need to conform will be common ones.  Compare the way in which the detailed classifications of different plant genera are very different one from another, but all equally obey the complex laws of nomenclature promulgated by the International Committee on Botanical Nomenclature.  Thus, taxonomic work on any one language should have much to gain from being carried out with a lively awareness of taxonomic standards worked out for other languages.

 

 

Notes

 

[1] Johansson et al. (1991) have compiled a survey of existing conventions for speech transcription, Stig Johansson being convener of a three-man group (with Doug Biber and myself) charged with formulating recommendations in this domain for the US/EC-sponsored Text Encoding Initiative.

 

[2] ESRC reference R00023 1142, “Construction of an Analysed Corpus of English”.  The name “SUSANNE” stands for “Surface and Underlying Structural Analyses of Naturalistic English”.

 

[3] The SUSANNE Corpus is available freely and without formality to anyone who wishes to use it, from the downloadable research resources at www.grsampson.net.

 

[4] Ministry of Defence reference D/ER/1/9/4/2062/151(RSRE), “A Speech-Oriented Stochastic Parser”.

 

[5] The above remarks are not intended to imply agreement with the thesis that “core” aspects of language structure are genetically determined.   As it happens, I believe that that thesis is quite wrong (cf. Sampson 1989).  But for present purposes it is unnecessary to enter into this debate; even if there were some truth in the nativist account of certain aspects of linguistic structure, it could scarcely help and might well distort the development of practical, IT-oriented grammatical taxonomies of diverse languages to treat that truth as central to the work of developing them.

 

 

 

REFERENCES

 

J. Aarts & T. van den Heuvel  1985  “Computational tools for the syntactic analysis of corpora”.  Linguistics 23.303-35.

J. Allwood et al.  1990  “Speech management – on the non-written life of speech”.  Nordic Journal of Linguistics 13.3-48.

B. Altenberg  1990  “Spoken English and the dictionary”.  In Svartvik 1990.

L. Bloomfield  1933  Language.  Holt.

A.N. Chomsky  1965  Aspects of the Theory of Syntax.  MIT Press.

DARPA  1989  Speech and Natural Language.  Morgan Kaufmann.

A. Ellegård  1978  The Syntactic Structure of English Texts, Gothenburg Studies in English, 43.

C.J. Fillmore  1968  “The case for case”.  In E. Bach & R.T. Harms, eds., Universals in Linguistic Theory.  Holt, Rinehart and Winston.

C.J. Fillmore et al.  1988  “Regularity and idiomaticity in grammatical constructions”.  Language 64.501-538.

R.G. Garside et al., eds.  1987  The Computational Analysis of English.  Longman.

D. Grune & Ceriel J.H. Jacobs  1990  Parsing Techniques: A Practical Guide.  Ellis Horwood.

P. Howell & K. Young  1990  “Speech repairs: report of work conducted October 1st 1989 – March 31st 1990”.  Report, Department of Psychology, University College London.

S. Johansson  1986  The Tagged LOB Corpus: Users’ Manual.  Norwegian Computing Centre for the Humanities (Bergen).

S. Johansson et al.  1991  “Working paper on spoken texts”.  Report to Text Encoding Initiative working meeting, Myrdal, Norway, November 1991.

J.J. Katz  1971  The Underlying Reality of Language and Its Philosophical Import.  Harper & Row.

W. Labov  1975  “Empirical foundations of linguistic theory”.  In R. Austerlitz, ed., The Scope of American Linguistics.  Pieter de Ridder Press.

W.J.M. Levelt  1983  “Monitoring and self-repair in speech”.  Cognition 14.41-104.

Bente Maegaard  1989  “EUROTRA:  the machine translation project of the European Communities”.  In J.A. Campbell & J. Cuena, eds., Perspectives in Artificial Intelligence, vol. ii.  Ellis Horwood.

K.K. Obermeier  1989  Natural Language Processing Technologies in Artificial Intelligence: The Science and Industry Perspective.  Ellis Horwood.

R. Quirk et al.  1985  A Comprehensive Grammar of the English Language.  Longman.

U. Reyle & C. Rohrer  1988  Natural Language Parsing and Linguistic Theories. Kluwer.

G.R. Sampson  1987  “Evidence against the ‘grammatical’/‘ungrammatical’ distinction”.  In W. Meijs, ed., Corpus Linguistics and Beyond.  Rodopi.

G.R. Sampson  1989  “Language acquisition:  growth or learning?”  Philosophical Papers 18.203-40.

G.R. Sampson  1992  “Probabilistic parsing”.  In J. Svartvik, ed., Directions in Corpus Linguistics:  Proceedings of Nobel Symposium 82.  Mouton de Gruyter.

Anna-Brita Stenström  1990  “Lexical items peculiar to spoken discourse”.  In Svartvik 1990.

J. Svartvik, ed.  1990  The London-Lund Corpus of Spoken English.  Lund University Press.

T. Winograd  1983  Language as a Cognitive Process, vol. 1: Syntax.  Addison-Wesley.

 

 

 

BIOGRAPHICAL NOTE

 

Born in 1944, Geoffrey Sampson studied Oriental languages, linguistics, and computer science at Cambridge and Yale during the 1960s.  After some years at Oxford and the LSE he worked for more than a decade at the University of Lancaster, where he was a founder member with Geoffrey Leech and Roger Garside of the Unit for Computer Research on English Language.  He left Lancaster for the University of Leeds in 1985, and later moved into self-employment as a language engineering consultant.  In 1991 he returned to the academic profession at the University of Sussex, where he is Director of the Centre for Advanced Software Applications and Chairman of Computer Science & Artificial Intelligence.