Geoffrey Sampson


The LUCY Corpus

Structure in Written English in the UK

The LUCY Corpus is now freely available for downloading via the Resources page on this site. It is a structurally annotated sample (“treebank”) of present-day British written English, representing not only the polished writing of published documents, but also the less-skilled or unskilled writing of young adults at the end of secondary and beginning of tertiary education, and of children aged nine to twelve in various types of school and parts of the country. For a detailed statement of the contents of the LUCY Corpus, see its documentation file.

The LUCY Corpus is named after St Lucy, patron saint of authors.

The LUCY Corpus was compiled at Sussex University over the years 2000 to 2003 with the sponsorship of the Economic and Social Research Council (UK). It represents a sister resource to the SUSANNE Corpus (of published American English) and the CHRISTINE Corpus (of spontaneous spoken British English). Like these earlier treebanks, LUCY uses the highly detailed and comprehensive scheme of structural annotation known as the “SUSANNE scheme”, which is widely recognized as the most precise system of its kind available. Some comments by outside observers on this scheme include:

The LUCY annotations go beyond the SUSANNE scheme in one respect. A chief aim in compiling the corpus was to shed light on the processes through which children acquire writing skills. For this purpose it was necessary to extend the annotation scheme so that it could identify cases where an unskilled writer fails to put words together in a meaningful way. The LUCY annotations include a number of devices for showing what has gone wrong in such cases: see section 6 of the documentation file.

Apart from its potential as a source of information on the realities of skilled written usage in modern Britain, it is hoped that the LUCY Corpus will enable researchers to deepen our currently quite limited understanding of the trajectories that children take as they move from being competent speakers of English towards mastery of the rather separate norms of the written language.

Already, early research comparing the child and adult writing in LUCY with the information on grammatical structures in spontaneous speech in CHRISTINE has revealed unexpected regularities in the sequence in which children master written usage. Certain relatively complex grammatical structures are deployed in children's writing earlier than other structures which look as though they ought to be simpler. (See G.R. Sampson, “The structure of children's writing: moving from spoken to adult written norms”, in S. Granger and S. Petch-Tyson, eds., Extending the Scope of Corpus-Based Research, Rodopi, pp. 177-93, 2003, and online.)

We hope that, through findings like this, the availability of the LUCY Corpus may permit new and useful contributions to the improvement of literacy teaching methods.

The researchers who worked on the LUCY Corpus under my direction were Anna Babarczy, Alan Morris, and (briefly) Anna Rahman. I should like to record my warmest thanks for their efforts.

Geoffrey Sampson

last changed 7 Nov 2005