Geoffrey Sampson

Corpus Linguistics:
readings in a widening discipline

edited by Geoffrey Sampson and Diana McCarthy

Corpus linguistics means research based on large machine-readable samples (“corpora”) of real-life written or spoken language usage. This is a branch of linguistics and of computational natural-language processing that was very much a minority hobby fifteen or twenty years ago. It has now widened out to become a major focus, both for advancing our intellectual understanding of human language, and for developing economically valuable language engineering systems. This anthology reprints a collection of key articles in the field.

Many people are finding themselves newly drawn into corpus linguistics activities without much background knowledge of where the subject has come from, or what its overall shape is. In particular, people with a humanities background are often uncomfortable with more technical aspects of the subject, while researchers who are essentially computer scientists may know very little of the traditional linguistic ideas which underlie their work and which, often, justify the research projects that employ them.

Because the field was a minority interest until recently, classic papers which would help newcomers to “read themselves in” have tended to appear in obscure, hard-to-get-hold-of sources. By reprinting 42 key articles (with dates of first appearance ranging from 1952 to 2002), our book addresses this problem. In particular, we aim to give readers from both arts-based and technical backgrounds an accessible introduction to the other side of the subject. As well as a general introductory chapter, we have provided each article with an editorial introduction, putting it into context for the benefit of newcomers to the field.

We also include a leavening of papers that are beguiling as well as instructive. We hope our book may, among other things, help academics to “sell” corpus linguistics to their students.

The text of the original articles has been completely re-set for this collection, and tables and graphics professionally re-drawn (from sometimes crudely reproduced originals) to a common and high visual standard. This may often be the most convenient location to consult the papers included, even for those who have access to earlier editions.

By now, Corpus Linguistics is a recognized textbook on university courses in places as distant as California and France.

Some critical comment:

easily accessible and thoroughly rewarding to read … This excellent book should be required reading for students and teachers involved in corpus-based research … an impressive volume

— Jonathan Clenton (Osaka University), on The LINGUIST List

an ideal source … Beyond the selection of papers, the “value added” material in this collection is uniformly helpful and well done … a wonderful addition to the currently available textbooks on corpus linguistics

— Robert Malouf (San Diego State University), in Computational Linguistics

an extremely valuable resource to own, not only for corpus linguists as reference, but also for those newly interested in the area to understand the wider field

— Ute Knoch (University of Auckland), on The LINGUIST List

Your book is a source of inspiration for my students

— Geoffrey Williams, Université de Bretagne-Sud

a volume to be highly recommended

— Milica Gačić (University of Zagreb), in Corpus Linguistics and Linguistic Theory

a diverse yet accessible collection

— Jack Grieve (Northern Arizona University), in Corpora

Introduction
From The Structure of English (1952) Charles Carpenter Fries
A standard corpus of edited present-day American English (1965) W. Nelson Francis
On the distribution of noun-phrase types in English clause-structure (1971) F.G.A.M. Aarts
Predicting text segmentation into tone units (1986) Bengt Altenberg
Typicality and meaning potentials (1986) Patrick Hanks
Historical drift in three English genres (1987) Douglas Biber and Edward Finegan
Corpus creation (1987) John Sinclair
Cleft and pseudo-cleft constructions in English spoken and written discourse (1987) Peter C. Collins
What is wrong with adding one? (1989) William Gale and Kenneth Church
A statistical approach to machine translation (1990) Peter F. Brown et al.
A point of verb syntax in south-western British English: an analysis of a dialect continuum (1991) Ossi Ihalainen
Using corpus data in the Swedish Academy grammar (1991) Staffan Hellberg
On the history of that/zero as object clause links in English (1991) Matti Rissanen
Encoding the British National Corpus (1992) Gavin Burnage and Dominic Dunlop
Computer corpora — what do they tell us about culture? (1992) Geoffrey Leech and Roger Fallon
Representativeness in corpus design (1992) Douglas Biber
A corpus-driven approach to grammar: principles, methods, and examples (1993) Gill Francis
Structural ambiguity and lexical relations (1993) Donald Hindle and Mats Rooth
Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies (1993) William Louw
Building a large annotated corpus of English: the Penn Treebank (1993) Mitchell P. Marcus et al.
Automatically extracting collocations from corpora for language learning (1994) Kenji Kita et al.
Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels (1995) E.J. Briscoe and J.A. Carroll
Why a Fiji corpus? (1996) Jan Tent and France Mugler
Treebank grammars (1996) Eugene Charniak
English corpus linguistics and the foreign-language teaching syllabus (1996) Dieter Mindt
Data-oriented language processing: an overview (1996) L.W.M. Bod and R.J.H. Scha
Conflict talk: a comparison of the verbal disputes between adolescent females in two corpora (1996) Ingrid Kristine Hasund and Anna-Brita Stenström
Assessing agreement on classification tasks: the kappa statistic (1996) Jean Carletta
Linguistic and interactional features of Internet Relay Chat (1996) Christopher C. Werry
Distinguishing systems and distinguishing senses: new evaluation methods for word-sense disambiguation (1997) Philip Resnik and David Yarowsky
Qualification and certainty in L1 and L2 students’ writing (1997) Kenneth Hyland and John Milton
Analysing and predicting patterns of DAMSL utterance tags (1998) Mark G. Core
Assessing claims about language use with corpus data — swearing and abuse (1998) Anthony McEnery et al.
The syntax of disfluency in spontaneous spoken language (1998) David McKelvie
The use of large text corpora for evaluating text-to-speech systems (1998) Louis C.W. Pols et al.
The Prague Dependency Treebank: how much of the underlying syntactic structure can be tagged automatically? (1999) Alena Böhmová and Eva Hajičová
Reflections of a dendrographer (1999) Geoffrey Sampson
A generic approach to software support for linguistic annotation using XML (2000) Jean Carletta et al.
Europe’s ignored languages (2001) Anthony McEnery
Semi-automatic tagging of intonation in French spoken corpora (2001) Estelle Campione and Jean Véronis
Web as corpus (2001) Adam Kilgarriff
Intonational variation in the British Isles (2002) Esther Grabe and Brechtje Post

Corpus Linguistics: readings in a widening discipline is published by Continuum, now an imprint of Bloomsbury Publishing, of London, Sydney, New York, and New Delhi. New or used copies are available via the relevant British or American Amazon pages.

xv + 524 pp., 2004. ISBN (hardback) 0-8264-6013-5; (paperback) 0-8264-8803-X; also available as PDF e-book.

Geoffrey Sampson

last changed 7 Dec 2020
