Corpus Linguistics: readings in a widening discipline
edited by Geoffrey Sampson and
Corpus linguistics means research based on large machine-readable samples
(“corpora”) of real-life written or spoken language usage.
It is a branch of linguistics and of computational natural-language
processing that was very much a minority hobby fifteen or twenty years ago.
It has now widened out to become a major focus, both for advancing
our intellectual understanding of human language,
and for developing economically-valuable language engineering systems.
This anthology reprints a collection of key articles in the field.
Many people are finding themselves newly drawn into
corpus linguistics activities without much background knowledge
of where the subject has come from, or what its overall shape is.
In particular, people with a humanities background are often
uncomfortable with more technical aspects of the subject,
while researchers who are essentially computer scientists may know
very little of the traditional linguistic ideas which underlie
their work and which, often, justify the research projects that they undertake.
Because the field was a minority interest until recently,
classic papers which would help newcomers to “read themselves in”
have tended to appear in obscure, hard-to-get-hold-of sources.
By reprinting 42 key articles
(with dates of first appearance ranging
from 1952 to 2002),
our book addresses this problem. In particular, we aim
to give readers from both arts-based and technical backgrounds an
introduction to the other side of the subject. As well as a
general introductory chapter, we have provided each article
with an editorial introduction, putting it into context for
the benefit of newcomers to the field.
We also include a leavening of papers that are beguiling as
well as instructive. We hope our book may, among other things,
help academics to “sell” corpus linguistics to their students.
The text of the original articles has been completely re-set for this collection, and tables and
graphics professionally re-drawn (from sometimes crudely-reproduced originals) to a
common and high visual standard. This collection may often be the most convenient place
to consult the papers included, even for those who have access to earlier editions.
By now, Corpus Linguistics is a recognized textbook on university courses
in places as distant as California and France.
Some critical comment:
easily accessible and thoroughly rewarding to read … This
excellent book should be required reading for students and teachers
involved in corpus-based research … an impressive volume
— Jonathan Clenton, on
an ideal source ... Beyond the selection of papers, the
“value added” material in this collection is uniformly helpful
and well done ... a wonderful addition to the currently
available textbooks on corpus linguistics
- — Robert Malouf, in
an extremely valuable
resource to own, not only for corpus linguists as reference, but also for
those newly interested in the area to understand the wider field
- — Ute Knoch, on
Your book is a source of inspiration for my students
- — Geoffrey Williams, Université de Bretagne-Sud
a volume to be highly recommended
- — Milica Gačić, in Corpus Linguistics and Linguistic Theory
- From The Structure of English (1952) Charles Carpenter Fries
- A standard corpus of edited present-day American English (1965) W. Nelson Francis
- On the distribution of noun-phrase types in English clause-structure (1971) F.G.A.M. Aarts
- Predicting text segmentation into tone units (1986) Bengt Altenberg
- Typicality and meaning potentials (1986) Patrick Hanks
- Historical drift in three English genres (1987) Douglas Biber and Edward Finegan
- Corpus creation (1987) John Sinclair
- Cleft and pseudo-cleft constructions in English spoken and written discourse (1987)
Peter C. Collins
- What is wrong with adding one? (1989) William Gale and Kenneth Church
- A statistical approach to machine translation (1990) Peter F. Brown et al.
- A point of verb syntax in south-western British English: an analysis of a dialect
continuum (1991) Ossi Ihalainen
- Using corpus data in the Swedish Academy grammar (1991) Staffan Hellberg
- On the history of that/zero as object clause links in English (1991)
- Encoding the British National Corpus (1992) Gavin Burnage and Dominic Dunlop
- Computer corpora — what do they tell us about culture? (1992) Geoffrey
Leech and Roger Fallon
- Representativeness in corpus design (1992) Douglas Biber
- A corpus-driven approach to grammar: principles, methods, and examples (1993)
- Structural ambiguity and lexical relations (1993) Donald Hindle and Mats Rooth
- Irony in the text or insincerity in the writer? The diagnostic potential of semantic
prosodies (1993) William Louw
- Building a large annotated corpus of English: the Penn Treebank (1993) Mitchell
P. Marcus et al.
- Automatically extracting collocations from corpora for language learning (1994)
Kenji Kita et al.
- Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation
labels (1995) E.J. Briscoe and J.A. Carroll
- Why a Fiji corpus? (1996) Jan Tent and France Mugler
- Treebank grammars (1996) Eugene Charniak
- English corpus linguistics and the foreign-language teaching syllabus (1996)
- Data-oriented language processing: an overview (1996) L.W.M. Bod and R.J.H. Scha
- Conflict talk: a comparison of the verbal disputes between adolescent females in two
corpora (1996) Ingrid Kristine Hasund and Anna-Brita Stenström
- Assessing agreement on classification tasks: the kappa statistic (1996)
- Linguistic and interactional features of Internet Relay Chat (1996) Christopher
- Distinguishing systems and distinguishing senses: new evaluation methods for
word-sense disambiguation (1997) Philip Resnik and David Yarowsky
- Qualification and certainty in L1 and L2 students’ writing (1997) Kenneth
Hyland and John Milton
- Analysing and predicting patterns of DAMSL utterance tags (1998) Mark G. Core
- Assessing claims about language use with corpus data — swearing and abuse (1998)
Anthony McEnery et al.
- The syntax of disfluency in spontaneous spoken language (1998) David McKelvie
- The use of large text corpora for evaluating text-to-speech systems (1998)
Louis C.W. Pols et al.
- The Prague Dependency Treebank: how much of the underlying syntactic structure can be
tagged automatically? (1999) Alena Böhmová and Eva Hajičová
- Reflections of a dendrographer (1999) Geoffrey Sampson
- A generic approach to software support for linguistic annotation using XML (2000)
Jean Carletta et al.
- Europe’s ignored languages (2001) Anthony McEnery
- Semi-automatic tagging of intonation in French spoken corpora (2001) Estelle
Campione and Jean Véronis
- Web as corpus (2001) Adam Kilgarriff
- Intonational variation in the British Isles (2002) Esther Grabe and Brechtje Post
Corpus Linguistics: readings in a widening discipline
is published by
of London and New York. New or used copies available via relevant
xv + 524 pp., 2004. ISBN (hardback) 0-8264-6013-5; (paperback) 0-8264-8803-X.
last changed 29 Nov 2010