Sampson: Reflections of a Dendrographer

Reflections of a dendrographer

Geoffrey Sampson

1 Introduction

If you live in the English countryside, now and then a plane flies overhead and photographs your house, and a day or two later someone calls offering to sell you the picture. On my study wall I have such a photograph, taken in the summer of 1983 when I was a junior colleague of Geoffrey Leech at the University of Lancaster and lived up the valley in the Yorkshire Dales. If you examine the picture closely, at one side of the garden you can see a white disc, with a small pink disc intersecting its edge. The white disc was a garden table, and the pink disc the top of my bald head, and I remember the day well: I had just begun to contribute to one of Geoffrey’s research projects. Geoffrey Leech and Roger Garside had got sponsorship to develop a statistics-based automatic parser, and I had undertaken to produce the database of manually-annotated language samples needed to train it; on a sunny day I had taken my work outdoors.

When I see that picture I often wonder whether I would have been quite so enthusiastic, if I had realized what I was getting into. On and off, but much more on than off, I and researchers working alongside me have been drawing parse-trees for samples of various genres of language ever since. When our current “LUCY” project concludes, in 2003, we shall have almost exactly reached the twentieth anniversary of that photograph.

In this paper, I shall survey some of the diverse findings about the English language which have emerged from this work, before going on to make some comments about methodological lessons I draw from it for our discipline.

2 Then and now

To set that work of twenty years ago in context and explain why the Lancaster project needed someone to draw some trees for it, let me recall how things were in computational linguistics back then. It happened that that same year, 1983, saw the inaugural meeting, in Pisa, of the European Chapter of the Association for Computational Linguistics. Attending that meeting, I collected a few typical instances of the kinds of language sample which the software systems under discussion at Pisa were designed to cope with, as follows (I make no apology for the fact that this and the immediately following example-sets are repeated from earlier publications in which I have made similar points):

(1) a. Whatever is linguistic is interesting.

b. A ticket was bought by every man.

c. The man with the telescope and the umbrella kicked the ball.

d. Hans bekommt von dieser Frau ein Buch.

e. John and Bill went to Pisa. They delivered a paper.

f. Maria é andata a Roma con Anna.

g. Are you going to travel this summer? Yes, to Sicily.

There is certainly nothing wrong with these dapper little sentences as far as they go, but I think many readers will recognize that they are not representative of the complexities of real-life usage; they are manifestly artificial. Here, by contrast, are a few utterances drawn from different points in our CHRISTINE annotated corpus of spontaneous speech in the UK in the past decade (based on a subset of the demographically-sampled speech section of the British National Corpus):[1]

(2) a. well you want to nip over there and see what they come on on the roll

b. can we put erm New Kids # no not New Kids Wall Of # you know

c. well it was Gillian and # and # erm {pause} and Ronald’s sister erm {pause} and then er {pause} a week ago last night erm {pause} Jean and I went to the Lyceum together to see Arsenic and Old Lace

d. lathered up, started to shave {unclear} {pause} when I come to clean it there weren’t a bloody blade in, the bastards had pinched it

e. but er {pause} I don’t know how we got onto it {pause} er sh- # and I think she said something about oh she knew her tables and erm {pause} you know she’d come from Hampshire apparently and she # {pause} an- # an- yo- # you know er we got talking about ma- and she’s taken her child away from {pause} the local school {pause} and sen- # is now going to a little private school up {pause} the Teign valley near Teigngrace apparently fra-

The difference in texture is very obvious – and note that this is not a matter of well-behaved written language versus the anarchy of speech. Insofar as the Pisa examples are attributable to one rather than the other language mode, the question-and-answer form of (1g) suggests that they too are intended to represent spoken language.

The real difference is that the Pisa examples were invented, while the CHRISTINE examples are authentic. Authentic language, even polished published writing, standardly contains complications and inconsequentialities which are unconsciously avoided by computational linguists who invent their examples. Here is a random assortment of sentences from the LOB Corpus[2] of published British English:

(3) a. Sing slightly flat.

b. Mr. Baring, who whispered and wore pince-nez, was seventy if he was a day.

c. Advice – Concentrate on the present.

d. Say the power-drill makers, 75 per cent of major breakdowns can be traced to neglect of the carbon-brush gear.

e. But he remained a stranger in a strange land.

Note, for instance, that (3a) contains the word flat functioning as an adverb but looking like an adjective. In (3b), the phrase Mr. Baring contains a full stop followed after a space by a capital which, exceptionally for this configuration, does not mark a sentence boundary. In (3c), the grammatical relationship between the material before and after the dash is inexplicit and unclear. Sentence (3d) has peculiar word-order; only (3e) might be regarded as perfectly representing English grammatical “competence”, by those who find that concept helpful. The LOB examples are by no means as messy as the CHRISTINE specimens (if I had looked for messy examples in LOB rather than choosing sentences at random I could have found examples far more anarchic than these), but they are less bland and straightforward than the Pisa material. Authentic, real-life language is rarely as well-behaved as invented language, which twenty years ago was the only kind that almost any computational linguists took into account.

So, at that time, software developments such as Geoffrey’s and Roger’s which were intended to cope with authentic language badly needed information on what sort of structural phenomena actually occur in real-life language use; the kind of resource needed was what Geoffrey baptized a “treebank”, a sample of sentences annotated with labelled trees identifying their grammatical structure. This is what I was beginning to produce on that summer day in 1983, adding structural annotations to a subset of sentences from the LOB Corpus. Since then, our team at Sussex have extended this work to diverse genres of English, including the spontaneous conversational speech of the CHRISTINE Corpus, and the written output of schoolchildren at Key Stage 2 within our current LUCY project.

To make the discussion more concrete, Figure 1 shows an extract from the CHRISTINE Corpus displaying the annotation of (2a) above, which was uttered by a speaker code-named Gemma006 (female, age 28, housewife, SW England dialect). It would not be appropriate here to describe the annotation symbols in detail (for that, see Sampson 1995), but for instance the fourth column classifies the words uttered, using a vocabulary of about 370 wordtags, some of which are highly specific (the role of well as a topic-initiating marker in speech is unique to that word, and is accordingly assigned a unique tag UW); the sixth column uses labelled bracketing to show the phrase and clause structures into which the words fit, marking logical as well as surface structure where these differ. (Thus the element s101 in the node dominating you shows that this word is the subject not only of the verb want but also, logically speaking, of the infinitival subordinate clause beginning to nip over.)

FIGURE 1 ABOUT HERE

Since 1983, of course, the value of treebanks has become internationally recognized (and the term which Geoffrey coined has become standard throughout the discipline). Examples such as the Penn Treebank[3] have been developed, which dwarf in size any treebanks I have been responsible for. If there remains anything distinctive about the approach of our team to the task of compiling treebanks, it lies in the emphasis on explicitness and minute detail in the scheme of structural analysis. Our treebanking work is heavily informed by a principle which was neatly summarized by Jane Edwards of the University of California, Berkeley (Edwards 1992: 139): “The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways”. The over-riding goal of our group is not to produce the largest possible annotated samples of the language, but to specify the most precise and comprehensive guidelines possible for annotating real-life English, so that, ideally, no matter what turn of phrase crops up in speech or writing, our annotation scheme should always have a single predictable way of marking its structure.

I have described the work elsewhere as an attempt to do for the English language something like what Linnaeus did for the botanical world in the 18th century: to provide a well-defined classification scheme that allows anyone, anywhere to assign any given specimen unambiguously to a particular classificatory slot. Focusing on exactness of classification inevitably means that the annotated corpora associated with our work are small, compared to some of those now available, but getting the underlying scheme of classification right is a necessary precondition for developing large annotated collections that allow statistics to be extracted with some confidence that apples have always been counted with apples and oranges with oranges. From our point of view, the explicit annotation scheme is the central output of our research effort, and the corpora that we develop in the process of debugging the annotation scheme should be seen as secondary by-products (though in practice it seems that this scale of priorities is not one which others can easily be persuaded to share).

In this paper I shall first give some examples of the things we are finding out about English language and human linguistic behaviour through our work on samples of different genres, before going on to discuss some of the lessons I draw from this work for the methodology of our discipline.

3 Some discoveries

3.1 Basic sentence types

One of the earliest discoveries to emerge from our treebanking work on written English from the LOB Corpus (Sampson 1987a: 90) related to basic clause structures in English. Linguistic textbooks quite often suggest that the most basic patterns of all include the three-part pattern “subject – transitive verb – object”, and the two-part pattern “subject – intransitive verb”. For instance, Fromkin & Rodman (1983: 209) quote as their first two examples of English sentence types the sentences the child found the puppy, and the lazy child slept. But treebank data suggested that these patterns are widely different in status. Subject – transitive verb – object is indeed a very frequent pattern; but subject – intransitive verb, as an entire clause, is strikingly infrequent. If the verb of a main clause does not take an object, there is almost always some other constituent, such as an adverbial element, following it – this is so usual that it seems to be a systematic feature of the language. Sentences of the pattern the lazy child slept do occur, but they do not seem to be a basic type in any obvious sense of “basic”.

3.2 Left branching

A more substantial finding related to a classic statement by Victor Yngve (e.g. 1961) about constraints on left-branching structures in English. It is unquestionable that English parse-trees tend to grow much more freely in the northwest-to-southeast direction than the northeast-to-southwest direction. Consider, for instance, the structure shown in Figure 2 for one of Yngve’s example phrases, as good a young man for the job as you will ever find: the maximum number of left branches between a new word and the root is three (for the words a and young), and most words have only one or two left branches in their lineages, whereas the lineages of ever and find contain respectively four and five right branches. In the structure of a longer utterance, this difference in the incidence of left and right branching will often be far greater. Yngve’s suggestion was that there may actually be a fixed limit on the maximum number of left-branches permissible in the lineage of any word, whereas right-branching can be multiplied without limit; he suggested reasons having to do with psychological processing mechanisms which could explain the hypothetical existence of a left-branching limit, and he argued that many of the grammatical devices of English have been evolved in order to allow us to say the various things we want to say without violating the limit.
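The notion of counting left branches in a word's lineage can be made concrete with a minimal sketch. It is an illustration only, not the procedure used in the research: it assumes that a "left branch" is any step from a mother node to a daughter that is not her rightmost daughter, and the toy tree and the function name left_branch_counts are hypothetical (the tree is not the structure of Figure 2).

```python
# A nonterminal is (label, [children]); a leaf is a plain string (a word).
# Assumption: a "left branch" is a step to any non-rightmost daughter.

def left_branch_counts(tree):
    """Return [(word, n)], n being the number of left branches
    on the path from the root down to that word."""
    results = []

    def walk(node, lefts):
        if isinstance(node, str):
            results.append((node, lefts))
            return
        _label, children = node
        last = len(children) - 1
        for i, child in enumerate(children):
            # Only a step to a non-rightmost daughter adds to the count.
            walk(child, lefts + (1 if i < last else 0))

    walk(tree, 0)
    return results


# Hypothetical toy tree, for illustration only.
toy = ("S", [("NP", ["the", "dog"]),
             ("VP", ["chased", ("NP", ["a", "cat"])])])

for word, n in left_branch_counts(toy):
    print(word, n)
```

Under this reading, a hypothesis of Yngve's kind would amount to a ceiling on the largest count found anywhere in a corpus, which is the sort of claim that only becomes testable once counts like these can be gathered over many trees.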

FIGURE 2 ABOUT HERE

When Yngve was writing, well before linguists routinely had access to computers, it was not practical to go beyond isolated examples and monitor the incidence of left-branching in a systematic way. More recently, I have explored the issue (Sampson 1997) with respect to our “SUSANNE” annotated corpus of published American English (based on a 13 per cent subset of the Brown Corpus). It turned out that Yngve’s hypothesis about the asymmetry was mistaken (in a way that he could not have discovered from the data available to him). There is not a fixed maximum for the number of left branches between sentence roots and individual words – higher degrees of left-branching occur with a smaller proportion of words, in a smooth, probabilistic fashion. What is fixed, rather, is the probability of expanding a non-rightmost node-daughter into a multi-word tagma – this can be identified as a figure which is remarkably constant for sentences that are widely different in length (and it has later turned out to vary very little even as between the casual spoken utterances of CHRISTINE and the carefully-edited writing of SUSANNE).

This is interesting as a pure scientific finding – a correction to a long-standing belief about English structure which could not have been made without electronic treebanks of real-life language data. But it is significant also for more practical language-engineering purposes. The constraint as Yngve envisaged it was a global property of entire parse-trees rather than a local fact about individual “productions” (pairings of mother and daughter-sequence within a tree). That made it difficult to incorporate within natural language processing software. The true constraint is local rather than global, and in consequence less problematic from the natural language processing viewpoint.

3.3 Rare constructions are common

A further finding from our written-English treebanks related to the completeness of grammars. (This material was first published as Sampson (1987b), and led to some controversy which is summarized, with references, in the revised version in Sampson (2001: chap. 10).) In our first, small, treebank I looked at the range of frequencies of alternative grammatical constructions (all the different productions in which the category of the mother-node was “noun phrase”, the commonest category in the data). A few productions (e.g. noun phrase expanded as determiner followed by singular noun) occur again and again; others occur infrequently or as “one-offs”. But there are many infrequent constructions: constructions which are individually rare are collectively common. Furthermore, there turned out to be a regular, predictable relationship between construction frequencies, and the number of different constructions in the data which shared those frequencies. The relationship in our data could be stated as follows:

if m is the frequency of the commonest construction (say, 28 cases per thousand words)

and if f is the relative frequency of some construction (that is, f is a fraction between 0 and 1, and fm is the absolute frequency of the construction)

then the proportion of all construction-tokens which represent construction-types of relative frequency f or less is about f^0.4

Since our data-set was small, we cannot know that this relationship would continue to hold for values of f lower than the frequency represented by one-offs in that data-set. But within the limits of our material, the relationship was very consistent; and consider what it would imply, if it continued to hold for frequencies lower than our data-set allowed us to monitor. It would imply that:

one in every ten constructions recurs at most once per 10,000 words

one in every hundred constructions recurs at most once per 3.5 million words

one in every thousand constructions recurs at most once per billion words
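The arithmetic behind these three figures can be checked with a short sketch. It takes m = 28 per thousand words and the exponent 0.4 from the statement above; the function name words_per_occurrence is introduced here purely for illustration. For a given share p of construction-tokens, it solves f^0.4 = p for f and then converts the absolute frequency fm into a number of words per occurrence.

```python
M_PER_WORD = 28 / 1000   # commonest construction: 28 cases per thousand words
EXPONENT = 0.4           # exponent of f in the relationship stated above


def words_per_occurrence(token_share):
    """Words of text per occurrence of a construction at the frequency
    threshold below which `token_share` of all tokens fall."""
    f = token_share ** (1 / EXPONENT)   # solve f ** 0.4 == token_share
    return 1 / (f * M_PER_WORD)         # f * m is occurrences per word


for share in (0.1, 0.01, 0.001):
    print(share, round(words_per_occurrence(share)))
```

The results come out at roughly 11,000 words, 3.6 million words, and 1.1 billion words respectively, in line with the rounded figures quoted above.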

One per cent, or even one-tenth of one per cent, of all the construction-tokens occurring in edited writing seem far too many for language-processing software systems to ignore as unimportant “performance deviations” or the like. Yet what hope would there be of constructing accurate and comprehensive generative grammars, if establishing the boundaries of the range of structures to be generated requires us to monitor numerous phenomena which each occur only once in millions or even billions of words? Surely, systems which succeed in processing language as it is used in real life cannot hope to operate on the basis of rigid definitions which generate “all and only” the well-formed sentences of a language; they will have to embody some more fluid, probabilistic concept of grammaticality.

It is true that the specific relationship identified above depends on our definition of sameness or difference among constructions; if the counting of productions had recognized fewer or more distinctions among grammatical categories, the exponent of f might have been some other figure than 0.4. But the classification of constructions used in this research was notably crude, with only basic distinctions recognized; an adequate generative grammar would surely need a more refined classification, in which case the exponent of f would be lower and the problem just discussed would be worse.

Linguists often respond to information about the relationship discussed above by treating it as a special case of “Zipf’s Law”. I do not myself find it helpful to invoke Zipf in this connection; apart from the fact that he discussed the frequencies of concrete items, such as words, rather than construction-types, his “law” is at best an oversimplification (see e.g. Mandelbrot 1957: 24) and, in so far as it approximates to the truth, arguably a logical necessity (Miller 1957). What the relationship discussed here seems to say about the distribution of grammatical constructions is not at all a logical truism, since it runs counter to the pattern that generative linguistics appears to predict.

3.4 Who uses complex grammar?

Turning to findings emerging from our more recent treebanking of spontaneous spoken language, one of these (Sampson 2001: chap. 4) has to do with the incidence of grammatical complexity in the speech of different groups of the population. By “grammatical complexity” I mean what this term means in traditional schoolroom grammar lessons: the extent to which utterances involve subordinate clauses. Any word of an utterance is assigned an integer corresponding to the number of nodes labelled with clause categories which occur in the path from that word to the root of its parse-tree. So, for instance, in an utterance Was this the one you wanted, pet?, the first four words would each score 1, as elements of the (interrogative) main clause; you and wanted would each score 2, as components of a relative clause within that main clause; the vocative pet would score 0, being outside any clause structure. Each identified speaker in the CHRISTINE Corpus (Stage I) was assigned a complexity index representing the mean figure for all words uttered by that speaker.
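The scoring just described can be sketched as follows. This is an illustration under stated assumptions rather than the software used on CHRISTINE: the node labels (UTT, MAIN_CLAUSE, REL_CLAUSE, NP) are placeholders and not the wordtags or category labels of the actual annotation scheme, and the bracketing of the example utterance is a plausible one chosen to reproduce the scores given above.

```python
# A nonterminal is (label, [children]); a leaf is a plain string (a word).
# Placeholder labels, assumed for illustration only.
CLAUSE_LABELS = {"MAIN_CLAUSE", "REL_CLAUSE"}


def clause_depths(node, depth=0):
    """Return [(word, score)], the score being the number of clause
    nodes on the path from the word to the root."""
    if isinstance(node, str):
        return [(node, depth)]
    label, children = node
    if label in CLAUSE_LABELS:
        depth += 1
    scores = []
    for child in children:
        scores.extend(clause_depths(child, depth))
    return scores


def complexity_index(trees):
    """Mean clause-depth score over all words in a speaker's utterances."""
    depths = [d for tree in trees for _word, d in clause_depths(tree)]
    return sum(depths) / len(depths)


# "Was this the one you wanted, pet?" with an assumed bracketing.
utterance = ("UTT", [
    ("MAIN_CLAUSE", ["Was", "this",
                     ("NP", ["the", "one",
                             ("REL_CLAUSE", ["you", "wanted"])])]),
    "pet",
])

print(clause_depths(utterance))
print(complexity_index([utterance]))
```

For this single utterance the seven words score 1, 1, 1, 1, 2, 2 and 0, so a speaker who said nothing else would have an index of 8/7, or about 1.14.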

Complexity differences in this sense were an important part of the alleged contrast made famous a generation ago by Basil Bernstein (1971) between a working-class “restricted linguistic code” and a middle-class “elaborated code”. Again, though, in the 1970s Bernstein did not have access (and nor did any other linguist) to the kind of machine-readable data on a social cross-section of spontaneous speech that was used in compiling the CHRISTINE Corpus. Bernstein constructed his own data-sets on working-class and middle-class speech via an experiment with children in two schools which was both limited and, arguably, too artificial to show anything about natural usage (though it may well have been the best that could be done at the period).

The CHRISTINE data do not confirm Bernstein’s claim of a consistent correlation between linguistic complexity and social class. They do, however, show a correlation with a different demographic variable: age. As Figure 3 suggests,[4] British people apparently tend to speak in ways that become structurally more complex as they get older, and this growth in complexity continues throughout life, not as steeply in middle age as in childhood but not flattening out into a “steady state” after the end of an alleged “critical period” either. Sampson (2001: 70) applies a statistical significance test to show that the upward slope during the adult part of life is apparently real.

FIGURE 3 ABOUT HERE

Apart from its possible relevance for psycholinguistic theories about a time-limited innate “language acquisition device”, this finding has wider interest for our picture of human intellectual life generally. With respect to one observable aspect of language behaviour which is closely tied to the logical structure of thought, it seems that not only the minority who pursue intellectual careers, but the population as a whole, tend to go on growing subtler throughout their lives: that is surely something that society needs to know and reckon with in many policy areas. (Writing as one who is closer to retirement than to the beginning of his career, I find it heartening.) I have written “apparently” and “seems” because, although the data examined so far do achieve statistical significance, the figures are not yet as robust as one would like; it is a high priority to enlarge the body of treebanked material so that findings of this sort can be put on a solider basis.

3.5 A dying tense

A very different finding from the CHRISTINE/I speech material which is already statistically as solid as anyone could ask relates to regional differences in the English tense/aspect system (Sampson forthcoming a). Anglicists are familiar with the idea that verbs in English can be modified to mark various combinations of the semantic categories Past (ate), Perfect (has eaten), Progressive (is eating), and Passive (is eaten). A number of authors (e.g. Harris 1991) have mentioned that Irish English differs from British English in lacking the Perfect term in this system; the historical explanation for that difference is contested.

However, CHRISTINE data show that in the conversational speech of the past decade, the Perfect is almost as marginal for speakers in southern England as it is for Irish speakers, though the Perfect remains alive and well in the Midlands and Northern England, and in Wales and Scotland. Apart from the human interest of this finding, it has practical implications for matters such as literacy education and development of dialogue-driven interactive automatic systems; so far as I know it was entirely unsuspected before the onset of treebanking initiatives; indeed it is hard to imagine how it could have been detected earlier.

3.6 Learning to use written grammar

In our current research, we are looking at written English in modern Britain; we are “treebanking” samples of the output not only of skilled adult writers, as represented in published material, but also of teenagers and children. At age five children arrive at school as (usually) fluent speakers of English; if all goes well, they leave school, a dozen or so years later, skilled in the rather different strategies and conventions of the written language. One of the issues on which we hope our current “LUCY” project[5] will shed light is the trajectory taken by children in moving from one range of linguistic competences to the other.

To date, surprisingly little solid information is available about this (Perera 1984 is a rare and admirable study). Often, myths or guesswork pass for science. For instance, Cheshire & Milroy (1993: 8), in one of the few published discussions of grammatical differences between colloquial spoken English and the standard written language, claim that what they call “dislocated” structures – their example is these cats, they’re hungry – are “unacceptab[le]” in written English. In fact, dislocated sentences are normal in high-status published writing. As it happens, three of the six SUSANNE examples used to illustrate the structure in Sampson (1995: §4.525) are drawn from the genre “biography and belles lettres”. Conversely, a major strand in generative linguists’ discussion of “poverty of the stimulus” depends on the assumption that the type of English question which involves inverting the order of subject and verb is as well-established in speech as it is in the written language; according to Chomsky (1976: 31), children “unerringly” form questions like Is the man who is tall in the room? from the corresponding declarative sentences. But CHRISTINE data suggest (cf. Sampson forthcoming b) that the use of this kind of question structure is very limited in spontaneous speech – and furthermore, that when even adults embark on uttering such structures, they often break off and substitute alternative, simpler forms of words before the structure is complete.

Looking at the spontaneous written output of children aged 9 to 12, i.e. in roughly the Key Stage 2 age-range (cf. Sampson forthcoming c), we find that in one basic respect its structural properties are approximately intermediate between those of conversational speech and those of adults’ published prose. As one would surely expect, the grammatical tree structures of published prose are “bushier” than those of speech: the average nonterminal node in published writing dominates twice as many words as the average nonterminal node in conversational speech, and the average nonterminal node in the child-writing tree structures accounts for a little more than the halfway figure between these two. If one looks more closely at the structural differences, though, the relationships between speech, child writing, and adult writing are not always what one might predict.

There are two ways that tree structures can be “bushy”. Individual constructions can have many immediate constituents; or the trees can be highly recursive, so that paths downwards from the root nodes fork again and again before reaching the words at the bottom of the structure, even though each individual “fork” may have few “tines”. We might say that tree structures can be “wide”, or they can be “deep” – either property yields a high average wordage per nonterminal node; and of course a tree may be both wide and deep.
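The three quantities involved here can be pinned down with a small sketch. The precise measures are assumptions made for illustration: "wordage" is the mean number of words dominated per nonterminal node, "width" the mean number of immediate constituents per nonterminal node, and "depth" the mean number of nonterminal nodes above each word. The function name bushiness and the two toy trees are hypothetical.

```python
# A nonterminal is (label, [children]); a leaf is a plain string (a word).

def bushiness(tree):
    """Return (mean words dominated per nonterminal,
               mean daughters per nonterminal,
               mean number of nonterminals above each word)."""
    dominated = []     # words under each nonterminal
    daughters = []     # immediate constituents of each nonterminal
    word_depths = []   # nonterminals above each word

    def walk(node, depth):
        if isinstance(node, str):
            word_depths.append(depth)
            return 1
        _label, children = node
        n_words = sum(walk(child, depth + 1) for child in children)
        dominated.append(n_words)
        daughters.append(len(children))
        return n_words

    walk(tree, 0)

    def mean(xs):
        return sum(xs) / len(xs)

    return mean(dominated), mean(daughters), mean(word_depths)


# The same four words arranged as a "wide" tree and as a "deep" tree.
wide = ("X", ["a", "b", "c", "d"])
deep = ("X", ["a", ("X", ["b", ("X", ["c", "d"])])])

print(bushiness(wide))   # (4.0, 4.0, 1.0)
print(bushiness(deep))   # (3.0, 2.0, 2.25)
```

The wide tree gets its wordage from a single node with four daughters, while the deep tree gets it from repeated two-way forking, so comparing the second and third figures across genres shows which source of bushiness is at work.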

It turns out that width is completely invariant as between speech, child writing, and published writing; all the extra bushiness in the latter genres derives from greater average depth.

This presumably implies that learning to become a skilled user of the written language involves learning to use some kinds of construction more frequently than is usual in conversational speech. In order to begin to get a lead on this, I compared frequencies of various constructions (as a proportion of all constructions) in the three genres. Some constructions (e.g. noun phrases, comparative clauses) are more frequent in published writing than in speech, and (necessarily, therefore) some constructions (e.g. determiner phrases, adverbial clauses) are less common in published writing than in speech. I looked particularly at how the rates in children’s writing compared with the rates in speech and in published prose. In the case of almost all phrase categories, the child-writing figure was closer – often much closer – to published writing than to speech. Not so with clause categories, where in some cases the same children’s writing still reflected an incidence more characteristic of speech.

Clauses tend to be more complicated things than phrases, so it is not surprising to find that children take longer to adapt to written norms in the case of clause constructions than phrase constructions. Strikingly, though, there is one kind of subordinate clause which is more frequent in published writing than in speech and where the child-writing figure is closer to the published-writing figure than to the speech figure: namely, relative clauses, whether complete (e.g. one pup who looked just like his mother) or reduced (e.g. a girl named Jennifer, “Amazon adventure” also by Willard Price). For some reason which is as yet quite mysterious to me, it seems that children find it easier or have a greater propensity to adapt to the norms of adult writing in the case of relative clauses than in the case of other clause constructions which are more characteristic of written than spoken English, such as present-participle clauses (e.g. by adding some more to it).

In terms of formal complexity, relative clauses are among the most challenging constructions in English: so this finding is not at all predictable. Yet it is not even the case that the relative clauses which children use frequently in writing are limited to the simplest kinds of relative clause: in our data the children’s writing exploited the complex possibilities allowed by the relative clause construction to, if anything, an even greater extent than the published adult writing. I find this remarkable, whatever the explanation may be; it is the kind of finding which could only possibly emerge from the kinds of treebank resource that I have been describing.

4 The lessons of software engineering

I hope the foregoing has shown that there is no shortage of new things to be discovered about language and languages, once we study them through the empirical medium of electronic corpora. I turn now to an issue which has not yet been properly confronted, even by linguists and natural language processing researchers who accept the importance of empirical, corpus-based methods.

If we hark back to the neat examples in (1) above from the 1983 Pisa meeting, the central problem which they represented was that researchers developing computational approaches to natural language were far more enthusiastic about planning and writing program code, than about cataloguing the complexities of the material which their programs would need to deal with. The research projects from which the example sentences in (1) emerged were by no means as trivial and unsophisticated as the texture of those examples might suggest, but the sophistication lay almost wholly in the complex logic of the language-processing software developed by various projects, rather than in the analysis of the problems which the software needed to deal with.

This is a syndrome which is very familiar within the wider discipline of information technology. Any textbook on software engineering contains a historical section near the beginning which describes how software development began, in the 1950s and 1960s, as a craft process in which individual programmers wrote code to implement specifications that were largely “in their head” rather than explicitly formalized; and the books go on to describe how this approach broke down when more challenging projects were attempted.

In the middle to late 1960s, truly large software systems were attempted … It was discovered that the problems in building large software systems were not a matter of putting computer instructions together. Rather, the problems being solved were not well understood, at least not by everyone involved in the project or by any single individual … Replacing an individual required an extensive amount of training about the “folklore” of the project requirements and the system design. (Ghezzi et al. 1991: 4)

The response to this discovery by information technologists was the creation of software engineering as a discipline governing the process by which software projects are planned and implemented. The essence of software engineering is to put investigation and detailed documentation of the requirements at the heart of the process. Writing the actual computer code to implement the specifications, instead of being central, becomes a minor aspect of the work, carried out at the end and involving little creativity.

Learning to heed the maxims of software engineering does not come naturally to people recruited into the information technology profession. Programming is fun: by instinct, people want to get stuck straight in to the imaginative work of devising a program logic which corresponds to their intuitive understanding of a problem, writing code to implement that logic, and watching it run – deferring as long as possible the chores of documentation, and modification to cope with unforeseen difficulties. But the industry has learned that people have to resist these instincts, if complex projects are ever to be successful. To quote one famous estimate of the cost of postponing thorough requirements analysis, Barry Boehm (1981: 39-41) suggests that the expense of eliminating errors in software logic can grow by a factor of up to 100, depending on how late in the development process the errors are detected – assuming that the consequences of the errors are not so crucial that the project has to be abandoned, which was often what happened in practice.

Natural languages are among the most complicated systems that human beings deal with. The problems which caused the collapse of so many business software projects in the 1960s and 1970s can be expected to apply in full to our field. Indeed an area of information technology such as natural language processing, where a high proportion of activity takes place within academe, is specially vulnerable to these problems. Compared to industrial employees, members of the academic world enjoy a high degree of autonomy in their working lives; that makes it particularly hard to persuade people to minimize the creative aspects of their work. Also, cost overruns of a factor of two or three, never mind Boehm’s 100, are quite enough to force abandonment of an academic project. It might seem naïve of computational linguists in the 1980s to have worked almost exclusively with language data that were invented and therefore artificially well-behaved, but given the realities of academic life it was understandable.

By now, many computational linguists have come to grasp the importance of working with real-life corpus data rather than invented examples. This is a good start, but it is only part of the lesson that software engineering has to teach. Automatic parsing projects, for instance, might nowadays typically aim to parse real-life rather than artificial inputs; but still one reads and hears far more discussion of the nature and performance of the parsing systems, than of the detailed parsing schemes which define the “target” analyses that an automatic parser either hits or misses. There is still a feeling in the air that, yes, we need corpora as a source of quantitative, statistical data, and as realistically complicated test-beds for our automatic systems; but we are speakers of our language, and as linguists we are heirs to a millennia-old tradition of language analysis, so we do not really need corpora in order to learn about the range of features that automatic language-analysis systems need to deal with. We know what the problems are, though we may need corpora to help us solve them.

Yet the people who first designed, say, accountancy software were undoubtedly people who kept track of their own budgets, as most of us have to do, and were surely in many cases also professional accountants, well-versed in the knowledge base of that profession: but these things turned out not to be enough. A successful software system requires a detailed prior taxonomy of issues which even the professionals would not otherwise ever need to make explicit. And if that is true of a field like accountancy, where the range of variables is limited, how much more must it be true of natural languages, each of which contains tens of thousands of words, each with its own idiosyncratic behaviour.

Remember my quotation from Jane Edwards, earlier, about the centrality of the principle that “similar instances be encoded in predictably similar ways”. If our work depends on treebanks, we cannot implement that principle without a minutely-detailed taxonomy of grammatical structures. If an adverb of place is followed by a more specific prepositional phrase, as in down on the field, are the elements down and on the field sisters? Should down be treated as a subordinate element pre-modifying the prepositional phrase? Or, conversely, should on the field be seen as an appositional postmodifier of down? Does it make a difference if the adverb specifies the extent to which the prepositional relationship holds, as in shortly before sunrise? And such questions must be asked and answered not just for “core” linguistic constructions such as prepositional phrases, but for all the other bits and pieces that crop up in real-life language use and are not often mentioned in textbooks of linguistic theory. In a postal address, what is the structural relationship between the street name and the town name? Is the relationship between the house number and the street name the same – even if they are not separated by a comma? Does it make a difference if the house has a name rather than a number (in which case there must be a comma)? And so on, and on, and on.

These questions sound trivial. In a sense they surely are trivial; yet, unless people are willing to put in the effort of specifying explicit answers to them, there is no hope of getting reliable statistics about structural usage. Without explicit standards, treebanks would represent similar constructions now one way, now another, and figures extracted from them would be largely meaningless.
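
The point can be illustrated with a toy calculation. The sketch below is hypothetical: the bracketings, the use of the labels R (adverb phrase) and P (prepositional phrase), and the function name are illustrative assumptions, not the notation of any actual annotation scheme. It shows how a figure extracted from six instances of one construction depends on whether the annotators followed a single documented convention.

```python
# Two hypothetical bracketings of the same construction, "down on the field".
# R = adverb phrase, P = prepositional phrase (labels used for illustration only).
SISTERS = "[R down] [P on the field]"   # adverb and prepositional phrase as sisters
PREMOD = "[P [R down] on the field]"    # adverb pre-modifies the prepositional phrase

# Six instances annotated with no explicit standard: analysts chose freely.
inconsistent = [SISTERS, PREMOD, SISTERS, PREMOD, PREMOD, SISTERS]

# The same six instances annotated under one documented convention.
consistent = [PREMOD] * 6


def premodified_pp_rate(treebank):
    """Proportion of instances analysed as a P containing an R daughter."""
    return sum(1 for tree in treebank if tree == PREMOD) / len(treebank)


if __name__ == "__main__":
    print(premodified_pp_rate(inconsistent))
    print(premodified_pp_rate(consistent))
```

The language sampled is identical in both lists; only the annotation practice differs, yet the measured rate of "prepositional phrases with an adverbial premodifier" is 0.5 in one case and 1.0 in the other.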

Theoretical linguists tend to feel that many of these questions are intellectually unreal, because they have no “right” answer. I certainly do not believe that an English-speaker’s psychological language-processing machinery implies that street names are grammatically subordinate to house numbers, or vice versa. But that is not the point. Very often, in drawing tree structures for language examples, a priori it would be as reasonable to draw the tree in one way as another way, but we have to make a choice and stick to it, if our treebank is to be consistent. The decision is imposed on the data rather than discovered from the data, but that does not mean that it is unnecessary to make the decision and document it.

One indicator of the low priority given at present to this sort of detailed taxonomic work is the emphasis that is laid on size of corpora and treebanks. Of course we all want larger rather than smaller research resources, other things being equal: for statistical work “there’s no data like more data”. Several of the findings quoted in §3 earlier were less solid than they might be, because of the small size of the treebanks on which they were based. But the experience of our team is that, if the work of developing a treebank focuses on uncovering and resolving all the detailed open issues about structure that are thrown up by real-life examples, it is not possible to produce really large treebanks within a reasonable time.

Our SUSANNE treebank, on which the 500-page analytic scheme of Sampson (1995) was based, contains about 130,000 words. Our current LUCY project aims to produce a 200,000-word treebank. Relative to some of the treebanks now available, these figures are insignificant. However, projects whose priority is to produce the largest possible quantity of analysed material sometimes explicitly adopt the strategy of asking analysts to annotate whatever they feel confident about and skipping the awkward bits. For some purposes, where there is an urgent current need for a large amount of partially-analysed data, that may be a valid approach. In the longer term, the discipline will need quantities of fully and reliably annotated material, and processing systems which deal appropriately with all the diverse structural details found in real-life language. The only plausible route to that outcome, I believe, is taxonomic activity that proceeds painstakingly enough to uncover and document the awkward cases as well as the obvious things. Unfortunately, this approach cannot be hurried.

5 New taxonomic challenges

5.1 A field without maps

For those who are willing to join in this enterprise of structural taxonomy, there is no shortage of areas to be investigated. To give an impression of how quickly, in developing treebanks, one reaches analytic issues on which standard linguistics gives no guidance at all, let me cite a few of the areas for which we have had to develop novel structural “case law” on our CHRISTINE project.

5.2 Speech repairs

Probably the largest single problem in drawing parse-trees for spontaneous spoken utterances is deciding how to analyse speech repairs, such as:

(4) and that any bonus he # anything he gets over that is a bonus

– where the speaker breaks off after the first he, “rewinds” to an earlier point in the speech stream, and tries again, so that any bonus he is replaced by anything he. We chose a minimalist approach to annotating the milestones in speech repairs, indicating the “point of interruption” (shown in (4) as a hash symbol) but not attempting to mark any as the word rewound to, or gets as the first new word after the edit. Our annotation scheme has to be able to deal with whatever kinds of speech repair crop up in real life, and we doubt that these will always be “well-formed” in the sense of Levelt (1989: 486), permitting these additional points to be identified unambiguously. Even on a minimalist approach, though, many rules must be laid down about how the repair fits into the surrounding tree-structure. Should the first he be identified as the beginning of a relative clause, for instance, even though no wording which would identify the constituent as a clause is actually produced? Should the successive sequences any bonus he … and anything he … be treated as two separate noun phrases, or as sequential wording below a single noun-phrase node? The CHRISTINE research required decisions to be imposed on these and many comparable issues.

5.3 Direct versus reported speech

Sometimes, structural analysis of speech turns out to blur category distinctions which are fundamental in written language. Written English goes to considerable lengths, via punctuation and choice of pronouns, to maintain a clear distinction between direct and reported speech. The distinction seems socially important: we need to know whether a writer is undertaking to convey the exact words, or only the general sense, of a third-party utterance. Sometimes, the distinction is clear in speech: in the CHRISTINE utterance:

(5) he said he hates drama because the teacher takes no notice, he said one week Stuart was hitting me with a stick and the teacher just said calm down you boys

what follows the first said is unambiguously reported speech (he hates, not I hate), while the material after the second and third instances of said is unambiguously direct speech (hitting me rather than hitting him, imperative calm down and vocative you boys). But in many other cases the signals are contradictory. Consider (6a-b):

(6) a. I said well that’s his hard luck

b. well Billy, Billy says well take that and then he’ll come back

The discourse item well after said/says usually signals a direct quotation. The imperative take that points in the same direction. But his and he (rather than your and I respectively) would in written English clearly imply reported speech. Are we to say that the direct/reported-speech distinction does not apply to the spoken medium? If so, which structural categories can safely be extended from written to spoken English?

5.4 Markovian syntax

Even the basic concept of analysing grammar in terms of tree structures is called into question by the frequent occurrence in speech of what we call “Markovian” sequences, such as (7):

(7) that’s your project for the rest of the week is to try and area-select without picking the broad outline up

As one scans (7) through a window of limited size, at each point the structure seems normal; but the whole does not cohere together as a statement or pair of statements, because the phrase your project for the rest of the week is serving simultaneously as complement of the preceding (i)s and subject of the following is. This sort of pattern is too frequent to dismiss as a sporadic “performance error”. Are we to treat it as evidence against the assumption that grammar organizes words into hierarchical tree structures? Surely not – that assumption has deeper roots than almost anything we think we know about grammar. But if we do not abandon it, how, concretely, is a treebanker to annotate an example like (7)?

5.5 Inaudible wording

One possible explanation for the intractability of issues like these (cf. Rahman and Sampson 2000: 309) may be that the grammatical tradition is biased even more heavily than is usually realized towards the structure of written rather than spoken language. Until the recent invention of recording technology, the spoken word was evanescent, and even now it takes great effort to turn quantities of speech into transcribed form so that it can be studied at leisure. If speech had always been as accessible to study as written language, it may be that the grammatical tradition would have developed along different lines, so that, for instance, direct v. reported speech was a less clear-cut contrast, and “Markovian” structures did not conflict with the usual notation.

Other problems in annotating speech, though, stem not from the established traditions of grammatical analysis but from the nature of the medium. Consider the problem of inaudible wording. Any annotation standards applicable to spontaneous speech must reckon with the fact that some passages will be untranscribable, because of accidents of recording, noise in the environment, or the like. Annotation standards for speech which could only be applied predictably to wording that is 100 per cent clear would be of limited value. But if a passage to be annotated contains inaudible material, this may create large problems about the analysis of the surrounding, clear wording. Consider a CHRISTINE example:

(8) oh we didn’t {unclear} to drink yourselves

The original utterance might have been something like, say, we didn’t give you anything to drink yourselves – in which case, yourselves is a daughter of the same clause node as we. But it could equally well have been something like we didn’t expect you to want anything to drink yourselves – in which case yourselves is an immediate constituent of a clause subordinate to the one containing we didn’t. An adequate scheme for annotating speech structure must prescribe some analysis which explicitly fails to commit itself on whether yourselves is a sister node or a “niece” node to we (and that is what the CHRISTINE scheme does).

5.6 Dialect versus error

A further type of problem in prescribing guidelines for treebanking speech, which usually does not impinge on those analysing written language, has to do with non-standard usage, and how to draw the line between usage which is odd because of a simple error and usage which seems odd, because it does not fit the pattern of the standard language, but may be quite regular for the speaker involved. In annotating passages from the spoken section of BNC for the CHRISTINE Corpus, we quite often encountered cases when necessary words were omitted (whether by a slip of the tongue on the speaker’s part, or a transcriber’s error, we had no way of knowing). For instance, in (9):

(9) There’s one thing I don’t like {pause} and that’s having my photo taken. And it will be hard when we have to photos.

there seems little doubt that the closing words, have to photos, are an error for an intended phrase like have to show photos or have to have photos. We mark such cases as containing omissions. When I encountered the example:

(10) oh she was shouting at him at dinner time Steven oh god dinner time she was shouting him

initially I assumed that shouting him was another case of essentially the same thing: the speaker intended to say shouting at him (as she did earlier in the utterance), but somehow the at got lost. Later, though, we encountered further examples (the three utterances (10-11) were by different speakers in three different BNC files):

(11) a. go in the sitting room until I shout you for tea

b. the spelling mistakes only occurred when {pause} I was shouted

This perhaps shifts the balance of probability towards the hypothesis that shout is a transitive verb taking the person shouted at as object, for some speakers – though personally I had never encountered such a usage before. If that is right, then it would be quite misleading to mark (10-11) as containing omissions.

Ideally, treebankers would have such complete knowledge of regional and other dialects that they would never be in doubt between cases of these kinds. In reality, no such paragons exist. So how should speech treebanking proceed? A decision to treat any nonstandard wording as erroneous would lead to an absurdly “prescriptive” picture of the language; yet, if conversely we said that every garbled passage is to be annotated on the assumption that it might be grammatical in someone’s idiolect, we would often find it quite impossible to construct analyses for what are in reality accidental misprints or the like.

5.7 Annotating unskilled writing

Perhaps it will be no surprise to readers to hear that spontaneous spoken language involves annotation problems that are not easy to solve in terms of the standard grammatical tradition. But speech is not the only area where the work of establishing a usable structural taxonomy has barely begun. In our current LUCY project we find that children and teenagers often produce tangled wording which deviates from the standards of both colloquial spoken and standard written English, in something like the way that a foreigner’s English might deviate. In some cases, one can see what they are trying to write, in other cases one cannot. We want to annotate these passages in a predictable and insightful way – from the perspective of educational theory they are one of the most significant aspects of this material. But (so far as we are aware) there are no precedents at all for identifying the structural features of this type of prose, and we are having to develop guidelines completely from scratch.

6 In conclusion

I hope I have said enough to show the reader that there is a world of work to be undertaken in order to realize the potential inherent in natural-language treebanks. Anything that we and others have already done in this area only scratches the surface of an enormous topic.

Already, as I tried to show in the earlier part of this paper, we have made interesting discoveries about English that fall outside anything that could emerge from pre-corpus, intuition-based linguistics. But it seems certain that these are no more than a hint at the wealth of discoveries in store for the time when we have larger and more diverse treebanks, and a more numerous community of treebank researchers are exploring a wider range of issues. As a corpus linguist at the beginning of the 21st century, I genuinely feel the way that Isaac Newton claimed, surely mock-modestly, to feel:

like a boy playing on the sea-shore and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me. (Spence 1966: §1259)

One thing is sure, though. Even those few pretty shells and pebbles would never have come into my ken if I had stayed inland, working exclusively with sentences like A ticket was bought by every man. So it is appropriate in this context to acknowledge the debt which I and many others owe to the man who showed us the path down to the beach. I should like to express my warmest gratitude to Geoffrey Leech.

References

Bernstein, B. 1971. Class, Codes and Control, vol. 1: Theoretical Studies Towards a Sociology of Language. Routledge & Kegan Paul.

Boehm, B. 1981. Software Engineering Economics. Prentice-Hall (Englewood Cliffs, New Jersey).

Cheshire, Jenny and J. Milroy 1993. “Syntactic variation in non-standard dialects”. In J. & Lesley Milroy, eds., Real English: The Grammar of English Dialects in the British Isles. Longman.

Chomsky, A.N. 1976. Reflections on Language. Temple Smith (London).

Edwards, Jane 1992. “Design principles in the transcription of spoken discourse”. In J. Svartvik, ed., Directions in Corpus Linguistics. Mouton de Gruyter.

Fromkin, Victoria and R. Rodman 1983. An Introduction to Language, 3rd edition. Holt, Rinehart & Winston.

Ghezzi, C., M. Jazayeri, and D. Mandrioli 1991. Fundamentals of Software Engineering. Prentice-Hall International.

Harris, J. 1991. “Conservatism versus substratal transfer in Irish English” (revised version). In P. Trudgill & J.K. Chambers (eds.), Dialects of English: studies in grammatical variation, pp. 191-212. Longman.

Levelt, W.J.M. 1989. Speaking: From Intention to Articulation. MIT Press.

Mandelbrot, B. 1957. “Linguistique statistique macroscopique”. In L. Apostel, B. Mandelbrot, and A. Morf, Logique, langage et théorie de l’information, pp. 1-78. Presses Universitaires de France (Paris).

Miller, G.A. 1957. “Some effects of intermittent silence”. American Journal of Psychology 70.311-14.

Perera, Katharine 1984. Children’s Writing and Reading: Analysing Classroom Language. Basil Blackwell (Oxford) in association with André Deutsch.

Rahman, Anna and G.R. Sampson 2000. “Extending grammar annotation standards to spontaneous speech”. In J.M. Kirk, ed., Corpora Galore, pp. 295-311. Rodopi (Amsterdam).

Sampson, G.R. 1987a. “The grammatical database and parsing scheme”. In R.G. Garside, G.N. Leech, and G.R. Sampson, eds., The Computational Analysis of English, pp. 82-96. Longman.

Sampson, G.R. 1987b. “Evidence against the ‘grammatical’/‘ungrammatical’ distinction”. In W. Meijs, ed., Corpus Linguistics and Beyond, pp. 219-26. Rodopi (Amsterdam). [A version reprinted as ch. 10 of Sampson (2001).]

Sampson, G.R. 1995. English for the Computer. Clarendon Press (Oxford).

Sampson, G.R. 1997. “Depth in English grammar”. Journal of Linguistics 33.131-51; reprinted as ch. 4 of Sampson (2001).

Sampson, G.R. 2001. Empirical Linguistics. Continuum.

Sampson, G.R. forthcoming a. “Regional variation in the English verb qualifier system”. To be in English Language and Linguistics.

Sampson, G.R. forthcoming b. “Exploring the richness of the stimulus”. To be in a Special Issue of The Linguistic Review on “The Poverty of the Stimulus”.

Sampson, G.R. forthcoming c. “The structure of children’s writing: moving from spoken to adult written norms”. To be in the Proceedings of ICAME 2001.

Spence, J. 1966. Observations, Anecdotes, and Characters of Books and Men, ed. by J.M. Osborn, 2 vols. Clarendon Press (Oxford).

Yngve, V.H. 1961. “The depth hypothesis”. In R. Jakobson (ed.), Structure of Language and its Mathematical Aspects, American Mathematical Society (Providence, Rhode Island); reprinted in F.W. Householder (ed.), Syntactic Theory I: Structuralist, Penguin.

T02_0120	-----	Gemma006
T02_0123	.....	00328
T02_0126	0051229	*	UW	well	.
T02_0129	0051241	|	PPY	you	[S[Ny:s101.Ny:s101]
T02_0132	0051252	|	VV0v	want	[V.V]
T02_0135	0000000	y	YG	-	[Ti:o[s101.s101]
T02_0138	0051264	|	TO	to	[Vi.
T02_0141	0051274	|	VV0v	nip	.Vi]
T02_0144	0051285	|	RP	over	[R:q.
T02_0147	0051285	|	RLh	there	.R:q]
T02_0150	0051303	|	CC	and	[Ti+.
T02_0153	0000000	y	YG	-	[s101.s101]
T02_0156	0051315	|	VV0v	see	[V.V]
T02_0159	0051326	|	DDQ	what	[Fn?:o[Dq:G103.Dq:G103]
T02_0162	0051338	|	PPHS2	they	[Nap:s.Nap:s]
T02_0165	0051350	|	VV0i	come	[V.V]
T02_0168	0051362	|	II	on	[P:p.
T02_0171	0000000	y	YG	-	[103.103]P:p]
T02_0174	0051372	|	II	on	[P:p.
T02_0177	0051382	|	AT	the	[Ns.
T02_0180	0051394	.	NN1c	roll	.Ns]P:p]Fn?:o]Ti+]Ti:o]S]

Figure 1

[1] Stage I of the CHRISTINE Corpus has been available for downloading over the internet, with no restrictions on use, since 2000; from my home page www.grsampson.net follow the link to “downloadable research resources”. For the BNC see info.ox.ac.uk/bnc/.

[4] The two data points for the 9-12 age range in Figure 3 correspond to the fact that one child is an outlier with an exceptionally low complexity index; independent considerations suggest that he may have been a non-native speaker. The cross symbol shows the mean including this speaker, the circle symbol shows the mean with the outlier excluded.

[5] For information about the LUCY project, follow the appropriate link from my www.grsampson.net home page.