Geoffrey Sampson


Corpus Linguistics:
readings in a widening discipline

edited by Geoffrey Sampson and Diana McCarthy

Corpus linguistics means research based on large machine-readable samples — “corpora” — of real-life written or spoken language usage. This is a branch of linguistics and of computational natural-language processing that was very much a minority hobby fifteen or twenty years ago. It has now widened out to become a major focus, both for advancing our intellectual understanding of human language, and for developing economically-valuable language engineering systems. This anthology reprints a collection of key articles in the field.

Many people are finding themselves newly drawn into corpus linguistics activities without much background knowledge of where the subject has come from, or what its overall shape is. In particular, people with a humanities background are often uncomfortable with more technical aspects of the subject, while researchers who are essentially computer scientists may know very little of the traditional linguistic ideas which underlie their work and which, often, justify the research projects that employ them.

Because the field was a minority interest until recently, classic papers which would help newcomers to “read themselves in” have tended to appear in obscure, hard-to-get-hold-of sources. By reprinting 42 key articles (with dates of first appearance ranging from 1952 to 2002), our book addresses this problem. In particular, we aim to give readers from both arts-based and technical backgrounds an accessible introduction to the other side of the subject. As well as a general introductory chapter, we have provided each article with an editorial introduction, putting it into context for the benefit of newcomers to the field.

We also include a leavening of papers that are beguiling as well as instructive. We hope our book may, among other things, help academics to “sell” corpus linguistics to their students.

The text of the original articles has been completely re-set for this collection, and tables and graphics professionally re-drawn (from sometimes crudely-reproduced originals) to a common and high visual standard. This volume may therefore often be the most convenient place to consult the papers included, even for those who have access to earlier editions.

By now, Corpus Linguistics is a recognized textbook on university courses in places as distant as California and France.

Some critical comment:

easily accessible and thoroughly rewarding to read … This excellent book should be required reading for students and teachers involved in corpus-based research … an impressive volume
— Jonathan Clenton, on The LINGUIST List

an ideal source … Beyond the selection of papers, the “value added” material in this collection is uniformly helpful and well done … a wonderful addition to the currently available textbooks on corpus linguistics
— Robert Malouf, in Computational Linguistics

an extremely valuable resource to own, not only for corpus linguists as reference, but also for those newly interested in the area to understand the wider field
— Ute Knoch, on The LINGUIST List

Your book is a source of inspiration for my students
— Geoffrey Williams, Université de Bretagne-Sud

a volume to be highly recommended
— Milica Gačić, in Corpus Linguistics and Linguistic Theory


  1. Introduction
  2. From The Structure of English (1952) Charles Carpenter Fries
  3. A standard corpus of edited present-day American English (1965) W. Nelson Francis
  4. On the distribution of noun-phrase types in English clause-structure (1971) F.G.A.M. Aarts
  5. Predicting text segmentation into tone units (1986) Bengt Altenberg
  6. Typicality and meaning potentials (1986) Patrick Hanks
  7. Historical drift in three English genres (1987) Douglas Biber and Edward Finegan
  8. Corpus creation (1987) John Sinclair
  9. Cleft and pseudo-cleft constructions in English spoken and written discourse (1987) Peter C. Collins
  10. What is wrong with adding one? (1989) William Gale and Kenneth Church
  11. A statistical approach to machine translation (1990) Peter F. Brown et al.
  12. A point of verb syntax in south-western British English: an analysis of a dialect continuum (1991) Ossi Ihalainen
  13. Using corpus data in the Swedish Academy grammar (1991) Staffan Hellberg
  14. On the history of that/zero as object clause links in English (1991) Matti Rissanen
  15. Encoding the British National Corpus (1992) Gavin Burnage and Dominic Dunlop
  16. Computer corpora — what do they tell us about culture? (1992) Geoffrey Leech and Roger Fallon
  17. Representativeness in corpus design (1992) Douglas Biber
  18. A corpus-driven approach to grammar: principles, methods, and examples (1993) Gill Francis
  19. Structural ambiguity and lexical relations (1993) Donald Hindle and Mats Rooth
  20. Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies (1993) William Louw
  21. Building a large annotated corpus of English: the Penn Treebank (1993) Mitchell P. Marcus et al.
  22. Automatically extracting collocations from corpora for language learning (1994) Kenji Kita et al.
  23. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels (1995) E.J. Briscoe and J.A. Carroll
  24. Why a Fiji corpus? (1996) Jan Tent and France Mugler
  25. Treebank grammars (1996) Eugene Charniak
  26. English corpus linguistics and the foreign-language teaching syllabus (1996) Dieter Mindt
  27. Data-oriented language processing: an overview (1996) L.W.M. Bod and R.J.H. Scha
  28. Conflict talk: a comparison of the verbal disputes between adolescent females in two corpora (1996) Ingrid Kristine Hasund and Anna-Brita Stenström
  29. Assessing agreement on classification tasks: the kappa statistic (1996) Jean Carletta
  30. Linguistic and interactional features of Internet Relay Chat (1996) Christopher C. Werry
  31. Distinguishing systems and distinguishing senses: new evaluation methods for word-sense disambiguation (1997) Philip Resnik and David Yarowsky
  32. Qualification and certainty in L1 and L2 students’ writing (1997) Kenneth Hyland and John Milton
  33. Analysing and predicting patterns of DAMSL utterance tags (1998) Mark G. Core
  34. Assessing claims about language use with corpus data — swearing and abuse (1998) Anthony McEnery et al.
  35. The syntax of disfluency in spontaneous spoken language (1998) David McKelvie
  36. The use of large text corpora for evaluating text-to-speech systems (1998) Louis C.W. Pols et al.
  37. The Prague Dependency Treebank: how much of the underlying syntactic structure can be tagged automatically? (1999) Alena Böhmová and Eva Hajičová
  38. Reflections of a dendrographer (1999) Geoffrey Sampson
  39. A generic approach to software support for linguistic annotation using XML (2000) Jean Carletta et al.
  40. Europe’s ignored languages (2001) Anthony McEnery
  41. Semi-automatic tagging of intonation in French spoken corpora (2001) Estelle Campione and Jean Véronis
  42. Web as corpus (2001) Adam Kilgarriff
  43. Intonational variation in the British Isles (2002) Esther Grabe and Brechtje Post

Corpus Linguistics: readings in a widening discipline is published by Continuum International of London and New York. New or used copies are available via the relevant British or American Amazon pages.

xv + 524 pp., 2004. ISBN (hardback) 0-8264-6013-5; (paperback) 0-8264-8803-X.

last changed 29 Nov 2010