The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version. Published in Natural Language Engineering 4.363–5, 1998.
Sidney Greenbaum, ed. Comparing English Worldwide: The International Corpus of
English. Oxford: Clarendon Press, 1996. ISBN 0-19-823582-8. xvi + 286 pages.
Most of the work that has gone into compiling
English-language corpora since the pioneering Brown Corpus in the 1960s has
been devoted to American or British English, which are the standard languages
of the two societies with the largest communities of English-speaking
computational linguists, and the English varieties with most significance in
the context of commercial language engineering. But Britain and the USA are not the only English-speaking
countries. The International
Corpus of English (ICE), an enterprise initiated by Sidney Greenbaum in 1989
whose participants are oriented more towards Arts-based language studies than
towards industrial applications, aims to develop a co-ordinated collection of “World
Englishes”. When complete, ICE
will contain a subcorpus for each of 18 countries or regions, in some of which
(e.g. Australia) English is the native language, while in others (e.g. Nigeria)
it is the common language of public life, though the mother tongue of few
inhabitants. This book, containing 19
chapters by different participants in the ICE effort, is the first easily
accessible survey of what is planned and what has been achieved.
Each ICE subcorpus is intended to resemble Brown and LOB
in comprising 500 samples of 2000 words each (about a million words in all), but half the samples will be
speech, and the written samples will include manuscript as well as printed
material. The period represented
will be the 1990s, though some teams have had to ignore the original planned
cutoff of 1994. Samples are
intended to represent the “educated” or “standard” English of the respective
country, but this is a highly problematic concept – the book contains many
references to debates about whether countries such as Nigeria can be described
as having a separate local standard variety of English, and/or whether it is
desirable for them to develop one.
Like Brown and LOB, the ICE subcorpora aim to allot subsets of their
samples to different genres in a systematic way, but mechanical equivalence is
not possible because of the varying roles of English in different societies;
thus Philip Bolt and Kingsley Bolton of the ICE-Hong Kong team point out that newsreaders
on English-language television and radio stations in Hong Kong are usually
expatriates. ICE itself does not
include the English of people who have learned it as a foreign language in
countries such as France, but a chapter by Sylviane Granger describes a sister
project, the International Corpus of Learner English, which aims to fill this
gap.
ICE lays stress on securing copyright permissions for the
material it uses, so that the output can eventually be freely published. Because the participating teams are
independent groups responsible for finding their own support, their progress is
inevitably unequal. Greenbaum
states that each subcorpus is envisaged as being annotated with wordtags and
parsing information, but at the time of writing this was implemented only for
the British subcorpus; and “implemented” here seems to mean that the work is
under way, not that it has been completed. No subcorpora are yet available for public distribution, in
annotated or unannotated form, though it sounds as though some of the unannotated
versions are close to completion.
Even without grammatical annotation, the texts require
considerable markup for matters such as typographic details and overlaps
between speakers’ utterances. The
markup used is influenced by SGML; some contributors state that it is SGML, but
this seems to refer to “cosmetic” aspects, for instance start and end tags in
the form <…>, </…>, and use of ISO 8879 &…; codes for non-ASCII characters. Edward Blachman, Charles Meyer, and
Robert Morris of the USA ICE team make it clear that the ICE markup is not SGML
in the strict sense, since it is not controlled by an explicit Document Type
Definition, though they believe
that it should be possible to infer an implicit DTD from the compilers’
usage. (In some passages, markup
seems not to conform to SGML even “cosmetically”. For instance, <indig=Urdu>
to identify a word of the indigenous language Urdu in an Indian English
sentence is surely not a possible SGML tag, since it does not begin with a
generic identifier.)
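A conformant rendering would begin the tag with a generic identifier and carry the language name as an attribute value, along the lines of the sketch below; the element and attribute names are my own invention for illustration, not taken from the ICE documentation:

    <indig lang="Urdu">…</indig>

Under an explicit DTD, indig would be declared as an element and lang as one of its attributes.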
Space does not allow individual discussion of each of the
diverse contributions to the book; but two aspects of the ICE enterprise raise
general issues which may interest readers of this journal.
One has to do with grammatical annotation schemes. For the British subcorpus, at least,
the work of adding wordtags and parse structure is evidently well advanced, and
there is plenty of information given about the semiautomatic techniques being
used to carry this out, and about the coding schemes used to represent the
grammatical facts. These coding schemes
are unrelated to the schemes used by other well-known English-language corpus
projects, and in this respect ICE strategy seems representative of the
discipline. Whenever a new corpus-linguistics effort is launched, it usually
devises from scratch its own ranges of grammatical categories and codes.
I find this regrettable. With respect to wordtagging, ten years ago I published a
tabulation of the main systems of classifying English words known to me then
(Sampson 1987), in the hope of encouraging future researchers to build on
earlier work rather than always beginning anew at ground level; I have
continued to try to promote this idea since (e.g. Sampson 1995). To date I detect very little enthusiasm
for such pooling of effort.
In one sense, researchers are quite right to resist
standardization. It would be
disastrous if annotation conventions were imposed on research groups, perhaps
by funding agencies keen to enforce uniformity for political reasons. There have been hints of this in
connexion with EU language engineering research; but natural language
processing is far too young a discipline to know what categories and data
structures are most appropriate – researchers must be free to experiment.
Much of the time, however, it is clear that one group’s scheme
is different from another’s not because either group has consciously decided
that an innovation might be desirable, but just because there is no tradition
of re-using this sort of material.
Lack of standardization in this sense militates against scientifically
valuable advance. The real
difficulty in devising a grammatical annotation scheme for a natural language
lies not in listing a set of categories, but in defining the boundaries between
categories with sufficient precision that they can be applied in a predictable
way to the endlessly diverse turns of phrase that occur in real-life
usage. A research group that sets
itself the task of doing this from scratch for all its categories will have its
work cut out merely to sharpen everything to the point where particular boundaries
can be recognized as unsatisfactory, and hence worth experimenting with.
Thus, Jan Tent and France Mugler of the ICE-Fiji team
note, among other distinctive properties of Fiji English, “The use of verbal
particles as verbs: ‘I been come down and off the light … ’; ‘You want me to on
the alarm?’”. This is one way (and there will obviously
be very many others) in which any standard system of grammatical annotation
based on British and American English would need to be modified to deal with
the task confronting the ICE enterprise.
One can imagine alternative principles that might be chosen to reconcile
an existing scheme with such data.
For instance, one might specify that words will always receive tags
appropriate to their classification in a dictionary of the “metropolitan” variety
of English, so that on and off remain particles, and Fiji English is represented
as having infinitives headed
by non-verbs; alternatively, one might prefer to reclassify words in terms of
the wider structures they enter into, so that on and off become verbs (perhaps
special uninflectable verbs) in Fiji English. Choice between these approaches is likely to have real
consequences – one approach might prove harder to apply consistently than the
other, or one might lend itself better than the other to automatic language
processing; but it will be difficult to put the effort they merit into
making choices like this if one is committed to defining all one’s grammatical
categories from scratch, even in the many cases where taking over existing
definitions would be unproblematic.
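To make the alternatives concrete, the second Fiji English sentence might be wordtagged in the two ways sketched below; the tags are Brown-style codes chosen for illustration, not the codes actually used by ICE:

    You_PPSS want_VB me_PPO to_TO on_RP the_AT alarm_NN ?    (on kept as a particle, following the metropolitan dictionary)
    You_PPSS want_VB me_PPO to_TO on_VB the_AT alarm_NN ?    (on reclassified as an uninflectable verb)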
Another way in which this book crystallizes a puzzle I
have about the development of the discipline relates to software tools for
working with corpora. Considerable
ICE resources have evidently been used to create software systems for helping
users of the completed subcorpora execute tasks such as producing concordances
and locating specified grammatical configurations. Again, the ICE enterprise in this respect is operating in a
way that is quite normal in computational linguistics. I know from personal experience that
someone who supplies a language corpus without also supplying purpose-built
software for working with it is widely regarded as having left a job half-done.
It is hard to see this as a wise policy for allocating scarce
research resources. In practice
there are usually two possibilities when one wants to exploit corpus data. Often, one wants to put very obvious
and simple questions to the corpus; in that case, it is usually possible to get
answers via general-purpose Unix commands like grep and wc, avoiding the
overhead of learning special-purpose software. Sometimes, the questions one wants to put are original and
unobvious; in those cases, the developer of a corpus utility is unlikely to
have anticipated that anyone might want to ask them, so one has to write one’s
own program to extract the information.
No doubt there are intermediate cases where a corpus utility will do the
job and grep will not. I am not
convinced that these cases are common enough to justify learning to use such
software, let alone writing it.
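To illustrate the simple end of the scale: given a plain-text sample file, obvious questions can often be answered directly with standard Unix commands, as in the sketch below (the filename is invented for illustration):

    grep -c 'alarm' s1a-001.txt    # count the lines containing a given word form
    wc -w s1a-001.txt              # count the running words in the sample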
Since the ICE materials are not yet published, most of
the book is concerned with the compilation process rather than with scientific
findings that have emerged from the corpus. One exception is the final chapter by Mark Huckvale and Alex
Fang, which uses speech sections from the British subcorpus that are both
grammatically annotated and recorded to high acoustic standards in order to
explore the interaction between grammar and prosodic phonology. Real discoveries have been made, and
many more can be expected from this very promising research avenue.
ICE is not really a single project: it is a loose federation of disparate
and widely scattered research groups who have been guided, co-ordinated, and
inspired by one man’s leadership.
Unhappily, Professor Sidney Greenbaum died in the year this book was
published. It is too early to
guess how the ICE enterprise will develop without him.
REFERENCES
Sampson, G.R. (1987) Alternative grammatical coding systems. In R.G. Garside et al. (eds.), The Computational Analysis of English. London: Longman.
Sampson, G.R. (1995) English for the Computer. Oxford: Clarendon Press.
Geoffrey Sampson
University of Sussex