The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version. Published in Natural Language Engineering 4.363–5, 1998.
Sidney Greenbaum, ed. Comparing English Worldwide: The International Corpus of
English. Oxford: Clarendon Press, 1996. ISBN 0-19-823582-8. xvi + 286 pages.
Most of the work that has gone into compiling
English-language corpora since the pioneering Brown Corpus in the 1960s has
been devoted to American or British English, which are the standard languages
of the two societies with the largest communities of English-speaking
computational linguists, and the English varieties with most significance in
the context of commercial language engineering. But Britain and the USA are not the only English-speaking
countries. The International
Corpus of English (ICE), an enterprise initiated by Sidney Greenbaum in 1989
whose participants are oriented more towards Arts-based language studies than
towards industrial applications, aims to develop a co-ordinated collection of “World
Englishes”. When complete, ICE
will contain a subcorpus for each of 18 countries or regions, in some of which
(e.g. Australia) English is the native language, while in others (e.g. Nigeria)
it is the common language of public life, though the mother tongue of few
inhabitants. This book, containing 19
chapters by different participants in the ICE effort, is the first easily
accessible survey of what is planned and what has been achieved.
Each ICE subcorpus is intended to resemble Brown and LOB
in comprising 500 samples of 2000 words each (about a million words in all), but half the samples will be
speech, and the written samples will include manuscript as well as printed
material. The period represented
will be the 1990s, though some teams have had to ignore the original planned
cutoff of 1994. Samples are
intended to represent the “educated” or “standard” English of the respective
country, but this is a highly problematic concept – the book contains many
references to debates about whether countries such as Nigeria can be described
as having a separate local standard variety of English, and/or whether it is
desirable for them to develop one.
Like Brown and LOB, the ICE subcorpora aim to allot subsets of their
samples to different genres in a systematic way, but mechanical equivalence is
not possible because of the varying roles of English in different societies;
thus Philip Bolt and Kingsley Bolton of the ICE-Hong Kong team point out that newsreaders
on English-language television and radio stations in Hong Kong are usually
expatriates. ICE itself does not
include the English of people who have learned it as a foreign language in
countries such as France, but a chapter by Sylviane Granger describes a sister
project, the International Corpus of Learner English, which aims to fill this
gap.
ICE lays stress on securing copyright permissions for the
material it uses, so that the output can eventually be freely published. Because the participating teams are
independent groups responsible for finding their own support, their progress is
inevitably unequal. Greenbaum
states that each subcorpus is envisaged as being annotated with wordtags and
parsing information, but at the time of writing this was implemented only for
the British subcorpus; and “implemented” here seems to mean that the work is
under way, not that it has been completed. No subcorpora are yet available for public distribution, in
annotated or unannotated form, though it sounds as though some of the unannotated
versions are close to completion.
Even without grammatical annotation, the texts require
considerable markup for matters such as typographic details and overlaps
between speakers’ utterances. The
markup used is influenced by SGML; some contributors state that it is SGML, but
this seems to refer to “cosmetic” aspects, for instance start and end tags in
the form <…>, </…>, and use of ISO 8879 &…; codes for non-ASCII characters. Edward Blachman, Charles Meyer, and
Robert Morris of the USA ICE team make it clear that the ICE markup is not SGML
in the strict sense, since it is not controlled by an explicit Document Type
Definition, though they believe
that it should be possible to infer an implicit DTD from the compilers’
usage. (In some passages, markup
seems not to conform to SGML even “cosmetically”. For instance, <indig=Urdu>
to identify a word of the indigenous language Urdu in an Indian English
sentence is surely not a possible SGML tag, since it does not begin with a
generic identifier.)
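A conformant rendering would begin the tag with a generic identifier and carry the language name as an attribute value, along the lines of the sketch below; the element and attribute names are my own invention for illustration, not taken from the ICE documentation:

    <indig lang="Urdu">…</indig>

Under an explicit DTD, indig would be declared as an element and lang as one of its attributes.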
Space does not allow individual discussion of each of the
diverse contributions to the book; but two aspects of the ICE enterprise raise
general issues which may interest readers of this journal.
One has to do with grammatical annotation schemes. For the British subcorpus, at least,
the work of adding wordtags and parse structure is evidently well advanced, and
there is plenty of information given about the semiautomatic techniques being
used to carry this out, and about the coding schemes used to represent the
grammatical facts. These coding schemes
are unrelated to the schemes used by other well-known English-language corpus
projects, and in this respect ICE strategy seems representative of the
discipline. Whenever a new corpus-linguistics effort is launched, it usually
devises from scratch its own ranges of grammatical categories and codes.
I find this regrettable. With respect to wordtagging, ten years ago I published a
tabulation of the main systems of classifying English words known to me then
(Sampson 1987), in the hope of encouraging future researchers to build on
earlier work rather than always beginning anew at ground level; I have
continued to try to promote this idea since (e.g. Sampson 1995). To date I detect very little enthusiasm
for such pooling of effort.
In one sense, researchers are quite right to resist
standardization. It would be
disastrous if annotation conventions were imposed on research groups, perhaps
by funding agencies keen to enforce uniformity for political reasons. There have been hints of this in
connexion with EU language engineering research; but natural language
processing is far too young a discipline to know what categories and data
structures are most appropriate – researchers must be free to experiment.
Much of the time, however, it is clear that one group’s scheme
is different from another’s not because either group has consciously decided
that an innovation might be desirable, but just because there is no tradition
of re-using this sort of material.
Lack of standardization in this sense militates against scientifically
valuable advance. The real
difficulty in devising a grammatical annotation scheme for a natural language
lies not in listing a set of categories, but in defining the boundaries between
categories with sufficient precision that they can be applied in a predictable
way to the endlessly diverse turns of phrase that occur in real-life
usage. A research group that sets
itself the task of doing this from scratch for all its categories will have its
work cut out merely to sharpen everything to the point where particular boundaries
can be recognized as unsatisfactory, and hence worth experimenting with.
Thus, Jan Tent and France Mugler of the ICE-Fiji team
note, among other distinctive properties of Fiji English, “The use of verbal
particles as verbs: ‘I been come down and off the light … ’; ‘You want me to on
the alarm?’”. This is one way (and there will obviously
be very many others) in which any standard system of grammatical annotation
based on British and American English would need to be modified to deal with
the task confronting the ICE enterprise.
One can imagine alternative principles that might be chosen to reconcile
an existing scheme with such data.
For instance, one might specify that words will always receive tags
appropriate to their classification in a dictionary of the “metropolitan” variety
of English, so that on and off remain particles, and Fiji English is represented
as having infinitives headed
by non-verbs; alternatively, one might prefer to reclassify words in terms of
the wider structures they enter into, so that on and off become verbs (perhaps
special uninflectable verbs) in Fiji English. Choice between these approaches is likely to have real
consequences – one approach might prove harder to apply consistently than the
other, or one might lend itself better than the other to automatic language
processing; but it will be difficult to put the effort they merit into
making choices like this if one is committed to defining all one’s grammatical
categories from scratch, even in the many cases where taking over existing
definitions would be unproblematic.
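To make the alternatives concrete, the second Fiji English sentence might be wordtagged in the two ways sketched below; the tags are Brown-style codes chosen for illustration, not the codes actually used by ICE:

    You_PPSS want_VB me_PPO to_TO on_RP the_AT alarm_NN ?    (on kept as a particle, following the metropolitan dictionary)
    You_PPSS want_VB me_PPO to_TO on_VB the_AT alarm_NN ?    (on reclassified as an uninflectable verb)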
Another way in which this book crystallizes a puzzle I
have about the development of the discipline relates to software tools for
working with corpora. Considerable
ICE resources have evidently been used to create software systems for helping
users of the completed subcorpora execute tasks such as producing concordances
and locating specified grammatical configurations. Again, the ICE enterprise in this respect is operating in a
way that is quite normal in computational linguistics. I know from personal experience that
someone who supplies a language corpus without also supplying purpose-built
software for working with it is widely regarded as having left a job half-done.
It is hard to see this as a wise policy for allocating scarce
research resources. In practice
there are usually two possibilities when one wants to exploit corpus data. Often, one wants to put very obvious
and simple questions to the corpus; in that case, it is usually possible to get
answers via general-purpose Unix commands like grep and wc, avoiding the
overhead of learning special-purpose software. Sometimes, the questions one wants to put are original and
unobvious; in those cases, the developer of a corpus utility is unlikely to
have anticipated that anyone might want to ask them, so one has to write one’s
own program to extract the information.
No doubt there are intermediate cases where a corpus utility will do the
job and grep will not. I am not
convinced that these cases are common enough to justify learning to use such
software, let alone writing it.
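To illustrate the simple end of the scale: given a plain-text sample file, obvious questions can often be answered directly with standard Unix commands, as in the sketch below (the filename is invented for illustration):

    grep -c 'alarm' s1a-001.txt    # count the lines containing a given word form
    wc -w s1a-001.txt              # count the running words in the sample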
Since the ICE materials are not yet published, most of
the book is concerned with the compilation process rather than with scientific
findings that have emerged from the corpus. One exception is the final chapter by Mark Huckvale and Alex
Fang, which uses speech sections from the British subcorpus that are both
grammatically annotated and recorded to high acoustic standards in order to
explore the interaction between grammar and prosodic phonology. Real discoveries have been made, and
many more can be expected from this very promising research avenue.
ICE is not really a single project: it is a loose federation of disparate
and widely scattered research groups who have been guided, co-ordinated, and
inspired by one man’s leadership.
Unhappily, Professor Sidney Greenbaum died in the year this book was
published. It is too early to
guess how the ICE enterprise will develop without him.
REFERENCES
Sampson, G.R. (1987) Alternative grammatical coding systems. In R.G. Garside et al. (eds.), The Computational Analysis of English. London: Longman.
Sampson, G.R. (1995) English for the Computer. Oxford: Clarendon Press.
Geoffrey Sampson
University of Sussex