The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.

The structure of children’s writing:  moving from spoken to adult written norms[1]



Geoffrey Sampson


School of Cognitive and Computing Sciences, University of Sussex


Published in S. Granger and S. Petch-Tyson, eds, Extending the Scope of Corpus-Based Research, Rodopi (Amsterdam), 2003.




1          Introduction


Most children arrive at school speaking English fluently. If all goes well they complete compulsory schooling, a decade or so later, as skilled users of the written language. Written and spoken structural norms differ in a number of ways (see e.g. Miller & Weinert 1998). The compilation of structurally annotated electronic corpora of spoken and written English is starting to open up new possibilities of studying the trajectory taken in moving from one stage to the other.


Our recent CHRISTINE and current LUCY projects have been compiling annotated corpora of everyday conversational English, and of various genres of written English, including published writing of many kinds and children’s writing, all annotated according to the same very detailed scheme of structural annotation (defined in Sampson 1995).[2]  This paper represents a first attempt to extract findings shedding light on the process of writing-skills acquisition from the partly complete CHRISTINE and LUCY corpora.


2          The language samples


For the purposes of this analysis, the annotated language samples to hand at the time when the research was done are divided into three groups, which I shall refer to as “speech”, “published writing”, and “child writing”.


The “speech” samples consist of 39 extracts[3] taken from random points in the “demographically-sampled speech” section of the British National Corpus (Burnard 1995); they thus represent the usage of a cross section of the UK population in the 1990s, balanced in terms of age, sex, region, and social class, in the speech events of their everyday lives – overwhelmingly, informal conversation with family, colleagues, and acquaintances.


The 31 “published writing” samples are drawn from random points in BNC written-language files taken from sources which are published, and which contain a low incidence of linguistic errors or features that would be regarded as solecisms by professional editors.[4]  They include extracts from sources as diverse as novels, industry house organs, social science textbooks, computer magazines, etc. The “published writing” can thus be seen collectively as in some sense representative of the target towards which writing-skills education is directed.


The “child writing” is taken from material published by a research project sponsored by the Nuffield Foundation in the 1960s. Researchers visited a number of schools in London, Kent, Sussex, and Yorkshire collecting various kinds of data on pupils’ use of oral and written language. (The schools included state primary and grammar schools, one secondary modern, and one then-novel comprehensive school; their locations appear to have been suburban and semi-rural rather than either “inner-city” or fully rural.) For the exercise relevant here, the researchers invited children aged from nine to twelve in 1964-5 to write essays on a choice of open-ended topics (e.g. “My hobby”, “Our last holiday”) which were transcribed into typescript and published as Handscombe (1967a, 1967b). The present study uses 67 of these, comprising roughly equal quantities of wording from ages 9, 10, 11, and 12 and from the two sexes.


A very rough indication of the wordage in the three samples is:




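[Table: approximate wordage of the speech, child writing, and published writing samples. The figures are not preserved in this online version.]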



Exact figures are difficult to specify because (particularly in the case of spontaneous speech) there are real questions about how words should be counted. These figures, and the data used for the following statistical analyses, exclude written material identified as “headings” rather than continuous text. The statistical analyses of grammatical tree structures also ignore root nodes (representing written paragraphs or oral speaker-turns) and all words immediately dominated by them, which means for instance that spoken “discourse items” such as “well” or “yes” are ignored when they are not included within a larger grammatical construction.


The speech sample includes speech by both adults and children; the age range is from one to 84 years.  For present research purposes, all this material is treated as a single sample. (Sampson 2001: ch. 5 examines variation of linguistic structure with age within the speech sample.)  The analyses below also treat the Nuffield material as a single sample, although in future work I plan to investigate development over the nine-to-twelve-year-old age range which it represents.


3          The suitability of the child-writing sample


Being almost forty years old, our sample of child writing might be seen as rather dated, but in one way its age is a positive advantage for comparison with the BNC published-writing data:  children aged from nine to twelve in the mid-1960s are centrally placed within the generation span likely to have been composing published writing in the last decade. The fact that the child-writing sample is somewhat biased towards children who overcame the eleven-plus hurdle also presumably makes it a better-than-random match to the class of adults whose writing gets into print.


We know that incidence of different grammatical constructions in children’s writing is heavily influenced by the overall nature of a particular writing task (Perera 1984: 239) – and the same is obviously true of adult prose, for instance one will not find interrogatives in a weather forecast.  So the fit between the Nuffield material and the BNC published writing might be criticized because the prose genres are different.  That is unavoidable:  children do not write social-science textbooks, and the middle-aged do not normally write little narratives about their last holiday.  But the two written data-sets are a good match in that they both represent what people are capable of writing spontaneously at the respective stages in life.  A common problem with corpora of children’s or young people’s writing is that the writing tasks are unnatural and the prose style derivative.  We have found when trying to use first-year undergraduates’ coursework to sample young adults’ writing, for instance, that its texture commonly seems to have more to do with the nature of the set books prescribed by lecturers than with the students’ spontaneous usage – not in the sense that the students are plagiarizing the books word for word, but they very often seem to interpret the coursework task as requiring a sort of stylistic pastiche which one can guess to be quite different from anything they would write spontaneously.  The Nuffield project was very successful in getting away from this kind of problem, by providing a wide choice of general topics known to be attractive to children at the respective ages, and by making it obvious to the children that this was a voluntary activity separate from the school curriculum, in which their performance would not be graded.  The resulting corpus (which is far larger than the sample annotated to date by our LUCY project) gives every appearance of being the spontaneous, unaided output of children of different levels of skill.  
(I have often been surprised that greater use has not been made of it since its original publication.)


4          Writing “wordier” than speech


An initial expectation is that words in published writing are likely to be organized into more elaborate constructions, on average, than words in spontaneous speech.  Converting this intuitive idea into a precise, quantifiable concept is not as straightforward as it may sound.  Linguists oriented to written language research might think in terms of “greater average sentence length”, but that concept is not usable in connexion with spontaneous speech: spoken utterances are not divided unambiguously into “sentences” (Miller & Weinert 1998: ch. 2), and in general the area nearest the roots of parse-trees is where the annotation tends to be most debatable.


What I have done as an attempt to get round that problem is to look at the mean length in words of all grammatical constructions, averaging over not just the most inclusive tagmas, such as sentences, but their constituents, and the constituents of their constituents, and so on down to include each nonterminal node that dominates more than one word. The figures are:




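[Table: mean construction length in words for speech, child writing, and published writing. The figures are not preserved in this online version.]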



If one accepts this as a reasonable way to measure “wordiness”, then we can say that the average construction in published writing is indeed about twice as long as the average spoken construction; and the average construction in child writing is intermediate in length – in terms of these figures one might say that the children have moved a little more than halfway from spoken to written norms.


(There is undoubtedly something odd about averaging over a set of elements, some of which are parts of other elements in the set.  I have used this method of measuring “wordiness” because no better method occurs to me, but nothing in later sections of this paper depends on it.)
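The measure described above can be made concrete with a minimal Python sketch. It is an illustration only, not the software used for this study: the tuple-based tree encoding, the function names, and the node labels are all assumptions introduced here. Following the description in the text, every nonterminal node that dominates more than one word contributes its word count, and the root node is left out.

```python
from statistics import mean

# Assumed encoding (not the project's file format): a nonterminal node is a
# (label, children) tuple and a word is a plain string.

def count_words(node):
    """Return the number of words dominated by a node."""
    if isinstance(node, str):
        return 1
    _label, children = node
    return sum(count_words(child) for child in children)

def construction_lengths(node, is_root=True):
    """Yield the word count of each non-root node dominating more than one word."""
    if isinstance(node, str):
        return
    _label, children = node
    size = count_words(node)
    if not is_root and size > 1:
        yield size
    for child in children:
        yield from construction_lengths(child, is_root=False)

# Illustrative tree for "the dog was barking at the cat"; labels are hypothetical.
tree = ("ROOT", [
    ("S", [
        ("N", ["the", "dog"]),
        ("V", ["was", "barking"]),
        ("P", ["at", ("N", ["the", "cat"])]),
    ]),
])

lengths = list(construction_lengths(tree))
mean_length = mean(lengths)
print(lengths, mean_length)
```

In this toy tree the clause is counted once as a seven-word construction and its parts are then counted again, which is exactly the overlap between elements acknowledged in the preceding paragraph.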


One incidental point about the above figures is that the speech data contain a structural feature that is completely absent from writing: “speech repairs”, where the speaker embarks on a construction, breaks off, and starts again using the same or different words, e.g. Has he # has he gone Rovers? In our annotation scheme all the words in a speech repair like this are ultimately dominated by a single nonterminal node (see Sampson 1995: 448ff. for details on the method of annotating speech repairs), so those nodes will dominate relatively long sequences of words; yet, despite the fact that repair structures are frequent in spontaneous speech and are edited out of writing, we see that speech still comes out on average with relatively short tagmas.


5          Width v. depth in parse-trees


If we forget about the details of particular grammatical constructions and think just about the abstract geometry of parse-trees, there are fundamentally two ways in which average construction length can differ. Branching can be wide, or it can be deep.


By wide branching I mean that individual tagmas can have many immediate constituents (daughter nodes): for instance “a man” is a noun phrase with two daughters (the two words), whereas “a funny little man who made us laugh” is in terms of our scheme a noun phrase with five daughters (the first four words, and the relative clause “who made us laugh”). Miller & Weinert (1998: 135ff.) claim that “wide” noun phrases such as the latter example are common in writing and rare or absent in spontaneous speech.


By deep branching I mean the extent to which structures exploit the recursive properties of grammar, to produce long chains of branches between words and root nodes.[5]  A tree in which branching is “deep” in this sense will dominate many words even if each nonterminal node has just two daughters.


Width and depth are not mutually exclusive, and one might expect both to contribute to the difference in “wordiness” between the genres. In our data that turns out not to be so. Average numbers of immediate constituents (ICs) per construction are as follows:




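[Table: mean number of immediate constituents per construction for speech, child writing, and published writing. The figures are not preserved in this online version.]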



It must surely be just a coincidence that the first two numbers agree to four significant figures, but nevertheless (contrary to what is suggested by the Miller & Weinert passage cited above) our three genres are evidently all very similar in terms of node “width”.


(I should mention that these averages cover only nodes which do have at least two daughters. Our annotation scheme recognizes some “unary branching” nodes – for instance, in “she was feeling depressed”, we count “she” as a noun phrase consisting of just a pronoun, and “depressed” as a past participle clause consisting just of a verb group consisting just of a past participle. Here and in other statistics quoted below, single-daughter tagmas are omitted from the averages, because their status seems much more theory-dependent and debatable than that of multi-daughter tagmas. If unary branching nodes are included in the calculations of average width, then there are differences, but quite small differences, between the genres:




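[Table: mean number of immediate constituents per construction, with unary-branching nodes included, for speech, child writing, and published writing. The figures are not preserved in this online version.]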



This might correspond for instance to speech using more pronouns while writing uses more explicit referring expressions.)
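The two ways of averaging node width discussed above can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure actually used: the tree encoding, the function name, and the labels in the example (including “Tn” for the past participle clause) are assumptions made for the sketch. The example tree follows the analysis of “she was feeling depressed” given in the text.

```python
from statistics import mean

# Assumed encoding: a nonterminal node is a (label, children) tuple and a
# word is a plain string. The root node is skipped, as in the text.

def daughter_counts(node, include_unary=False, is_root=True):
    """Yield the number of daughters of each non-root nonterminal node.

    Single-daughter nodes are skipped unless include_unary is True.
    """
    if isinstance(node, str):
        return
    _label, children = node
    if not is_root and (include_unary or len(children) >= 2):
        yield len(children)
    for child in children:
        yield from daughter_counts(child, include_unary, is_root=False)

# "she was feeling depressed": a one-word noun phrase, a two-word verb group,
# and a past participle clause containing a one-word verb group.
tree = ("ROOT", [
    ("S", [
        ("N", ["she"]),
        ("V", ["was", "feeling"]),
        ("Tn", [("V", ["depressed"])]),
    ]),
])

width_multi = mean(list(daughter_counts(tree)))
width_all = mean(list(daughter_counts(tree, include_unary=True)))
print(width_multi, width_all)
```

Including the three unary nodes pulls the average down from 2.5 to 1.6 in this toy case, which shows why a genre with many one-word phrases (such as pronoun subjects) would score lower on the second measure.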


When we look at depth, the picture changes. What I have done here is to examine average depth of words not in terms of raw numbers of branches between word and root node, but specifically in terms of how many of the nodes dominating a word are labelled with clause categories, that is how deeply embedded the words are in terms of subordinate clauses. Again, this makes the numbers less theory-dependent. For instance, some linguists believe in “VP” units, which mean that objects have more branches above them than subjects, whereas our annotation scheme treats subjects and objects as sister nodes; but linguists are in much more agreement about where to recognize subordinate clauses. (For further technical details about the computation of depth figures, compare Sampson 2001: ch. 5.) Mean word depths are:




[Table: mean word depth for speech, child writing, and published writing. The figures are not preserved in this online version.]



These means may not look very different, but that is a consequence of the way depth is calculated. You can never have a subordinate clause without words in a matrix clause to introduce it, so increasing the complexity of a sentence in terms of extra layers of subordination does not increase the average depth of all words proportionately. Differences in average depth figures always seem small relative to the corresponding structural differences in sentence complexity. But the differences between these means are statistically extremely robust. Even in the case of child writing v. speech, a one-tailed t-test gave a significance statistic far larger than the critical value for the p < 0.0001 threshold, which was the largest critical value I found in the literature (and for the two other pairwise comparisons the significance statistics are massively larger still).
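The point that extra layers of subordination raise mean depth less than proportionately can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the study: the tree encoding, the clause labels “S” and “Fn”, and the decision to count the main clause as depth 1. Depth here is simply the number of clause-labelled nodes dominating a word.

```python
from statistics import mean

# Hypothetical clause labels: "S" for a main clause, "Fn" for a nominal clause.
CLAUSE_LABELS = {"S", "Fn"}

def word_depths(node, depth=0):
    """Yield, for each word, the number of clause-labelled nodes dominating it."""
    if isinstance(node, str):
        yield depth
        return
    label, children = node
    if label in CLAUSE_LABELS:
        depth += 1
    for child in children:
        yield from word_depths(child, depth)

# "he left"
simple = ("S", [("N", ["he"]), ("V", ["left"])])
# "I think he left"
embedded = ("S", [("N", ["I"]), ("V", ["think"]),
                  ("Fn", [("N", ["he"]), ("V", ["left"])])])
# "you know I think he left"
twice = ("S", [("N", ["you"]), ("V", ["know"]),
               ("Fn", [("N", ["I"]), ("V", ["think"]),
                       ("Fn", [("N", ["he"]), ("V", ["left"])])])])

for tree in (simple, embedded, twice):
    depths = list(word_depths(tree))
    print(max(depths), mean(depths))
```

In these toy sentences the deepest word moves from depth 1 to 2 to 3, while the mean moves only from 1 to 1.5 to 2, because the matrix-clause words that introduce each subordinate clause stay at the shallower level.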


This is not to say that the whole of the large difference in mean depth between speech and published writing is necessarily attributable to the difference in linguistic mode.  The speech sample represents the full age range in society, while published writers are likely to be older than the average of the population as a whole; in earlier work (Sampson 2001: ch. 5) I found that mean depth of grammatical structures increases with age throughout life, so the authors of the published prose may be people whose speech structures are more complex than average. But mean depth in middle-aged speakers’ oral output is only about five per cent greater than the mean for all utterances in the speech sample, whereas the published-writing v. speech differential shown above is about seven times that.[6]  (The smallness of these age-related differences in speech structure made it seem reasonable to treat our full set of annotated speech samples collectively as representing the spoken language in which children are already fluent when they begin schooling; if I had used only speech by children as the basis for comparison with child writing, the quantity of available material might not have been enough to yield statistically-reliable findings.)


6          Interim summary


Summing up what we have found so far, then: it seems that constructions are on average wordier in writing than in speech, that this difference relates entirely to depth of branching, i.e. grammatical recursion, rather than to node width, and that the child writing is in this respect intermediate between spontaneous speech and published writing.


It might sound as though I am taking wordiness to be desirable in its own right. Of course I agree that simple written style is good style, and it is a pity for written prose to abandon the pithy, punchy structure of speech where it does not need to. But often it does need to. Subordinate clauses have work to do, and if there are more of them in published writing than in spontaneous speech that is, surely, mainly because writing is used to express ideas that are logically subtler than those expressed in social chat. There may have been periods in the past when writers deliberately and pompously made their sentence structures more ramified and verbose than was necessary to get their messages across; I do not believe this has much to do with the structural differences between writing and speech in the BNC. But if I am pressed on this issue, my fall-back position would be that published writing in some sense represents the end point of the process which children acquiring literacy skills are in fact embarked on, whether we think it is the ideal end point or not; so it is surely interesting to look at the trajectory which children who begin as illiterate but fluent speakers take to reach that end point.


7          Phrase and clause categories


Let us look at the grammatical differences between the genres in more detail. Presumably differences in depth of recursion are likely to correlate with differential usage of particular grammatical constructions that allow recursion. Our annotation scheme categorizes tagmas other than main clauses, at the coarsest level, into eight types of phrase and fifteen types of subordinate clause. (There are many refined subcategories, but we shall not look at those here; and we shall not look at the frequency of main clauses, which are necessarily a higher proportion of all tagmas in a genre which has less recursion. Also: in the speech data a small proportion of parse tree nodes are explicitly labelled as unclassifiable, usually because of inaudible wording;[7] these nodes are ignored in the statistics below.)


You can get recursive structure in English without subordinate clauses, for instance through nesting of noun phrases and prepositional phrases in a structure like:


[the key [in [the top drawer [of [the cabinet [by [the fridge]]]]]]]


– but the measure of recursion used in the figures earlier in this paper counted only recursion involving clause subordination, and my guess is that it is differential use of clause subordination which is mainly or exclusively responsible for the greater “wordiness” of published writing than spontaneous speech.[8]
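The nesting in the bracketed example above can be checked mechanically. The following Python sketch (the function name is an assumption introduced here) finds the deepest level of bracket nesting in such a string; it counts every phrase, whereas the depth measure used earlier in the paper counts only clause nodes, so this structure would add nothing to that measure.

```python
def max_bracket_depth(text):
    """Return the deepest level of square-bracket nesting in a bracketed string."""
    depth = 0
    deepest = 0
    for ch in text:
        if ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without an opening bracket")
    if depth != 0:
        raise ValueError("unclosed opening bracket")
    return deepest

# The noun phrase from the text: four noun phrases and three prepositional
# phrases, each nested inside the previous one.
example = ("[the key [in [the top drawer [of [the cabinet [by [the fridge"
           + "]" * 7)

print(max_bracket_depth(example))
```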


Before giving quantitative breakdowns of the use of particular grammatical categories, I should explain one special point about the figures for child writing. This material naturally sometimes contains grammatical errors; our annotation scheme has developed a system for recording such deviations together with the target structures apparently being aimed at, which goes beyond the notation of Sampson (1995) (which was developed for edited prose in which grammatical errors are less frequent).  The statistics quoted in the present paper count “target” constructions; when these are not correctly realized, the statistics take no account of that failure.  There are many ways in which one could analyse the child-writing material statistically, but this is both the simplest and arguably the most suitable initial approach; we know that learners must sometimes make mistakes when they try to do difficult things, and surely it is more interesting to monitor what they attempt, than to monitor the essentially accidental shapes of their failures. 


(I would add that in any case for the 9-12-year-olds’ writing examined in this paper, this choice probably has little impact on the statistics presented.  The area in which our project is finding really tangled deviant structures is the undergraduate coursework mentioned earlier, where the nature of the task seems to push the writers towards prose that is logically more complex than they would produce spontaneously.  That material has not been examined in the present study.)


8          Use of phrase categories


Taking phrase categories first: for each of the eight grammatical categories and for each pair of the three genres I applied the chi-squared test (with Yates’s correction for continuity) to a two by two contingency table with the columns representing the two genres, and rows representing instances of that category versus instances of all other grammatical categories. In the table below, figures represent percentages of all tagmas counted in the relevant sample which are instances of the category shown. Asterisks mark significant differences between the two figures on either side,[9] using the following code:



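[Key to significance markers, in order: p < 0.05; p < 0.01; p < 0.001; not significant. The marker symbols are not preserved in this online version.]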


In all eight cases, the differences between speech and published-writing figures are significant at the p < 0.001 level.[10]





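[Table: percentage of all tagmas belonging to each phrase category, with columns for speech, child writing, and published writing, and significance markers between adjacent columns. Rows: noun phrase, verb group, prepositional phrase, adjective phrase, adverb phrase, number phrase, determiner phrase, genitive phrase. The figures are not preserved in this online version.]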






In each case but that of number phrases, the figure for child writing is closer, often much closer, to the published-writing figure than to the speech figure. This is true both for categories which are commoner in published writing, such as noun phrase, prepositional phrase, and genitive phrase, and for the converse cases, e.g. determiner phrase. Broadly speaking it seems that in the area of phrase grammar these children have proceeded quite a long way along the path of adaptation to the norms of “model” written prose.
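For readers who want to see the test described at the start of this section in operation, here is a minimal Python sketch of the chi-squared statistic with Yates’s correction for a two by two table. It is not the software used for the study. The function names are assumptions, the counts in the usage example are invented purely for illustration, and the one-to-three-star marking is the conventional notation rather than necessarily the code used in the print version.

```python
import math

def yates_chi_squared(a, b, c, d):
    """Chi-squared with Yates's continuity correction for the 2x2 table

                     genre 1   genre 2
        category        a         b
        all others      c         d

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    # The correction subtracts n/2 from |ad - bc|, floored at zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    statistic = n * corrected ** 2 / denominator
    # With one degree of freedom the upper-tail probability is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

def significance_marker(p):
    """Conventional star notation (an assumption, see the lead-in)."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Invented counts: 300 of 1000 tagmas in one genre and 200 of 1000 in another
# are instances of the category under test.
statistic, p_value = yates_chi_squared(300, 200, 700, 800)
print(round(statistic, 3), significance_marker(p_value))
```

One such table is built for each category and each pair of genres, so the eight phrase categories and three genre pairs give twenty-four separate tests.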


It is hardly necessary to add that I fully realize how very broad-brush such an analysis is. Even if “model” writing is characterized by a much higher incidence of noun phrases than speech (which often uses one-word pronouns rather than explicit phrases as subject, object, etc.), obviously the aim of writing-skills education is not anything so crass as “getting the percentage of noun phrases up into the thirties”. Writing skills are about using an appropriate construction to express a particular idea in a particular context, not about percentages. But it seems that, systematically, the right construction in written contexts is much more often a noun phrase than is the case in speech. From the simple figures alone we cannot tell whether the children’s choices of grammatical construction are appropriate even when their frequency matches the published writing frequency; conceivably they could be using as many noun phrases as published writing, but in all the wrong grammatical contexts. That does not seem very likely a priori, though (and if it were true, our team’s annotation task would surely be far harder than it is). With data sources as rich in detail as these, we have to start somewhere in analysing them, and counting percentages of main grammatical categories seems a reasonable first way into the material.


9          Use of subordinate clause categories


For subordinate clauses, the picture is rather different. Of the fifteen types recognized by our scheme, I ignore the rare cases of infinitival relative clauses and for-to clauses, neither of which shows a frequency as high as 0.1 per cent in any of the three genres.  Figures for the other thirteen categories are:[11]





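[Table: percentage of all tagmas belonging to each subordinate-clause category, with columns for speech, child writing, and published writing, and significance markers between adjacent columns. Rows: infinitival clause, adverbial clause, nominal clause, verbless clause, present participle clause, relative clause, antecedentless relative, bare non-finite clause, past participle clause, comparative clause, “with” clause, special “as” clause, whiz-deleted relative. The figures are not preserved in this online version.]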







The most frequent of all subordinate-clause types, the infinitival clause, occurs at essentially the same rate in all three genres. For the other twelve categories, the frequency differences between speech and published writing are significant at the p < 0.001 level (except “with” clauses, where the significance level is only p < 0.05).


These twelve categories fall into four groups, depending on whether their frequency in speech is greater or less than in published writing (S > P or S < P), and on whether their frequency in child writing is closer to that in speech or to that in published writing (C ≈ S or C ≈ P).
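The four-way grouping just defined amounts to two comparisons per category, which the following Python sketch spells out. The function name is an assumption, the percentages in the usage example are invented for illustration and are not the study’s figures, and the treatment of exact ties (which the text does not discuss) is an arbitrary choice noted in the comments.

```python
def classify(speech, child, published):
    """Assign a clause category to one of the four groups described in the text.

    Arguments are the category's frequencies in the three genres.
    Ties are resolved arbitrarily: equal speech and published frequencies
    fall under "S < P", and a child figure equidistant from both falls
    under "C ≈ P".
    """
    direction = "S > P" if speech > published else "S < P"
    if abs(child - speech) < abs(child - published):
        nearer = "C ≈ S"
    else:
        nearer = "C ≈ P"
    return f"{direction}, {nearer}"

# Invented percentages (speech, child writing, published writing).
examples = {
    "hypothetical type 1": (2.0, 0.5, 1.0),
    "hypothetical type 2": (2.0, 1.9, 1.0),
    "hypothetical type 3": (0.5, 0.6, 1.5),
    "hypothetical type 4": (0.5, 1.2, 1.5),
}

for name, (s, c, p) in examples.items():
    print(name, "->", classify(s, c, p))
```

The first invented case shows that a child-writing figure lying outside the range spanned by the other two genres still counts as closer to whichever of them it is nearer to.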


S > P, C ≈ P


Nominal clauses, verbless clauses, antecedentless relative clauses, and bare non-finite clauses are less frequent in published writing than speech, and the child-writing figure is closer to published writing; as we might put it, children have successfully learned to ration their use of them.


In the case of verbless clauses, the child-writing figure is actually much lower even than the published-writing figure, which perhaps reflects teachers’ injunctions to “write in complete sentences”.  A similar relationship obtains between the three figures for nominal clauses, which at this point I do not understand – I would not have guessed beforehand that this category was commoner in speech than writing. (Possibly one explanation might be the frequency, in speech, of introductory hedging phrases like “I think …” or “you know …”, where the material following “think” or “know” will be analysed as a nominal clause object of the respective verb – I have not yet looked into this.)


Antecedentless relative clauses, and bare non-finite clauses, do feel like relatively “intimate” constructions – the latter because their use is restricted mainly to the verb “make” meaning “force” and to verbs of perception, and the former because formal prose tends to favour explicit antecedents (think of the way that stuffy writing uses “that which” in contexts where “what” would be far more idiomatic); so the differences between the three genres are unsurprising.


S > P, C ≈ S


Adverbial clauses are commoner in speech than in published writing, and the child-writing figure is about the same as the speech figure.


S < P, C ≈ S


Present participle clauses, comparative clauses, “with” clauses, and special “as” clauses are more frequent in published writing than in speech, and the child-writing figure remains closer to the speech figure.


S < P, C ≈ P


Finally, there are relative clauses (with explicit antecedents), whiz-deleted relative clauses, and past participle clauses. These are constructions used more frequently in published writing than in speech; and the frequencies in the child writing are closer to the former than to the latter. (In the case of past participle clauses, the child writing frequency is admittedly not far from the mid-point between the other two genres.)


These three categories are also, logically speaking, varieties of the same construction, in which a nominal element is postmodified by a clause in which the nominal plays a grammatical role. A whiz-deleted relative is a relative clause in which the main verb is a form of BE and in which that verb, and the relative pronoun, are “understood” rather than made explicit. A past participle clause is, or at least can be, a whiz-deleted relative clause based on a passive construction, where what is left after the relative pronoun and BE are suppressed begins with a past participle. In our scheme, the category “past participle clause” also covers tagmas which are similar in their internal structure but occur in functions other than noun postmodifiers, e.g. (to hear the winner’s name) called out; but most past participle clauses in the child-writing sample are cases functioning as reduced relatives. (The great majority of these are clauses based on the participles called or named, e.g. (a road) called the Ring, (a girl) named Jennifer.)


Summing up, then: if we think of children’s acquisition of writing skills as, in part, the replacement of the grammatical habits of conversational speech with the norms of adult writing, it seems that, at the stage represented by our child writing data, the children have already achieved much of this adaptation with respect to phrasal constructions (whether this means using more of one type or fewer of another type); but less adaptation has occurred with respect to clause constructions. For a number of types of clause, the children’s written usage remains closer to spoken norms (C ≈ S). For various clause-types which are used less in published writing than in speech (S > P), the children have learned to reduce their usage. But the only clause categories used more in published writing than in speech and where the child writing has risen close to the published norms (S < P, C ≈ P) are various kinds of (full or reduced) relative clauses.[12]


10        The complexity of the relative construction


It seems easily understandable that children will take longer to adapt to adult norms (where these involve increased rather than decreased use) in the case of subordinate clauses, which are complex structures, than in the case of phrases. I find it more surprising that adaptation occurs sooner with relative clauses than other kinds of subordinate clause. This surely cannot be because relative clauses are structurally simpler; considered as abstract formal structures, relative clauses seem strikingly more complicated than some other subordinate clause types.


Assuming that declarative main clauses can be seen as basic, producing a relative clause involves modifying that basic structure by deleting some element which may be related only remotely to the main verb of the declarative structure (for instance it may be a subordinate constituent of an immediate constituent of the structure), yielding a word-sequence that would be bizarre in isolation.  In some cases an appropriate relative pronoun is used, and if the relativized item is object of a preposition then the preposition may be shifted (“Pied-Piped”) to precede the relative pronoun.  Cases with zero relative pronoun are formally simpler, but are arguably no simpler to master, since the logical relationship between clause and antecedent is inexplicit and may be very diverse.


Adverbial clauses or nominal clauses, by contrast, are constructed simply by prefixing a subordinating conjunction to a declarative structure; in the case of nominal clauses, although they may be signalled by the conjunction that, not even this is necessary.  Admittedly, these two categories are S > P cases (their formal simplicity may perhaps be relevant to their high speech frequency), so there is no issue about how children develop the skill of using them in writing.  But, among S < P categories, present participle clauses, for instance, are surely no more formally complex than relative clauses – one might well see them as less complex – yet their child-writing frequency is little different from their speech frequency.  It is not easy to understand why relative clauses should lead present participle clauses so strikingly in the degree to which child writers increase their use of them towards adult written norms.


11        Simple v. complex relatives


It is true that some relative clauses are simpler than others.  A subject relative (a relative clause in which the relativized item is clause subject) has the same shape as a declarative clause, with a wh- pronoun in place of the subject, and since the logical and surface position of the relative pronoun is immediately adjacent to the antecedent it is straightforward to interpret.  Likewise a relative in which the relativized item is the whole of an adjunct of the clause (e.g. (every time) we hit a wave, where the relativized item is a Time adjunct of hit) has the same shape as a declarative (there is no obvious gap, because adjuncts are optional extras), and the logical relationship between relative clause and antecedent is usually clear because the antecedent is a general noun like time or place.  If relative clauses are more profuse in child writing than the complexity of the construction would lead one to expect, one might guess that this is because children confine themselves to the simplest types of relative, so that for them the construction is not a complex one.  Children’s written English might occupy an earlier point on the Keenan-Comrie relativization hierarchy (Keenan & Comrie 1977) than adult written English.


From a limited sampling it appears that this is not so.  It would be tedious to check all the relative clauses in our data manually, so I checked forty “full” relative clauses (i.e. not whiz-deleted relatives or past participle clauses) from each of the speech, child writing, and published writing samples.[13]  I classified the 120 relative clauses according to whether the relativized element is:


A, subject of the relative clause: (the Christmas story,) which took place many years ago


B, an entire adjunct of the relative clause:  (every time) we hit a wave


C, object or complement of the relative clause:  (a small animal) they catch


D, a constituent of a phrase constituent of the relative clause: (the person) to whom it points


E, a constituent of a phrase constituent of a phrase constituent of the relative clause:  (some flowers […]) that I do not know the name of


F, a constituent of a subordinate clause constituent of the relative clause:  (I am in K. [House],) which I naturally think is the best


G, a constituent of a phrase constituent of a subordinate clause constituent of the relative clause:  (a village dance) which the headmistress has forbidden any of the girls to go to


(Examples in italics are quoted from the child-writing sample in each case.)


Intuitively, the sequence A to G roughly corresponds to complexity of relative clause types, so if it were true that relative clauses in children’s writing were simpler than in adult writing, one would expect the breakdown to show higher figures for child writing than for published writing in rows A and B, with the child-writing figures declining to zero in lower rows.[14] 


It is true that the simpler relative structures are more frequent than the more complex structures in all three genres, but in other respects the figures by no means conform to that prediction:[15]




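[Table not preserved in this online version: number of relative clauses of each type A to G (rows) in the speech, child writing, and published writing samples (columns), forty clauses per sample.]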

The proportion of A- and B-type relatives is actually considerably higher in the published writing than in either the child-writing or the speech sample.  These samples are admittedly small and possibly unrepresentative, but if the frequency of relative clauses in child writing were explainable in terms of children using simple versions of the construction, one might expect this to be visible even in small samples.


The figures seem to imply that the relative clause construction used in the child-writing sample is the full adult relative-clause construction; and, hence, that complexity of different constructions is not a reliable predictor of the extent to which the constructions will be deployed in child writing.
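As a rough illustration of the comparison made in this section, the following Python sketch tallies type labels A to G for a sample of relative clauses and computes the share of the two simplest types, A and B.  The function names and the label lists are invented for illustration: they are not part of the CHRISTINE or LUCY software, and the toy labels are not the figures obtained in this study.

```python
from collections import Counter

# Types A to G as defined in the classification above;
# A (subject) and B (entire adjunct) are treated as the "simple" types.
TYPES = "ABCDEFG"
SIMPLE = {"A", "B"}


def tally(labels):
    """Count how many clauses fall under each type, in A-to-G order."""
    counts = Counter(labels)
    unknown = set(counts) - set(TYPES)
    if unknown:
        raise ValueError(f"unknown relative-clause type(s): {sorted(unknown)}")
    return {t: counts.get(t, 0) for t in TYPES}


def simple_share(labels):
    """Proportion of clauses whose relativized item is of type A or B."""
    if not labels:
        raise ValueError("no clauses to classify")
    return sum(1 for t in labels if t in SIMPLE) / len(labels)


# Invented labels for illustration only; NOT the figures from this study.
toy_samples = {
    "genre_x": list("AAABCCCDEF"),
    "genre_y": list("AAAAABBCCD"),
}

for genre, labels in toy_samples.items():
    print(genre, tally(labels), f"A+B share = {simple_share(labels):.2f}")
```

If children confined themselves to the simplest relatives, the A+B share computed in this way would be expected to be higher for child writing than for published writing, with the counts for the later types falling towards zero.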


12        Unanswered questions


If children make heavy written use of relative clauses earlier than some simpler constructions, one would like to know what it is about relative clauses that permits or encourages this.  Are relative clauses for some reason more useful, in the kinds of written communication represented in the Nuffield material, than some other subordinate-clause types which are needed for adults’ more diverse communicative goals?  Is it that relative clauses, though formally complex, represent a more straightforward development from simpler and earlier written usage than some other constructions?  At this stage I cannot even guess at the answers.


In other respects, too, it is clear that the foregoing has only begun to scratch the surface of what can potentially be learned from resources like the LUCY and CHRISTINE Corpora.  Once consistently annotated samples are available in machine-readable form, the questions one can ask about the acquisition of writing skills are limited only by the researcher’s ingenuity.





Burnard, L.  1995.  Users Reference Guide for the British National Corpus Version 1.0.  Oxford University Computing Services.

Handscombe, R.J., ed.  1967a.  The Written Language of Nine and Ten-Year Old Children.  (Nuffield Foreign Languages Teaching Materials Project, Reports and Occasional Papers, no. 24.)  Leeds University.

Handscombe, R.J., ed.  1967b.  The Written Language of Eleven and Twelve-Year Old Children.  (Nuffield Foreign Languages Teaching Materials Project, Reports and Occasional Papers, no. 25.)  Leeds University.

Keenan, E.L. and B. Comrie  1977.  “Noun phrase accessibility and Universal Grammar”.  Linguistic Inquiry 8.63-99.

Miller, J. and Regina Weinert  1998.  Spontaneous Spoken Language: Syntax and Discourse.  Clarendon Press (Oxford).

Perera, Katharine  1984.  Children’s Writing and Reading: Analysing Classroom Language.  Basil Blackwell  (Oxford) in association with André Deutsch.

Sampson, G.R.  1995.  English for the Computer.  Clarendon Press (Oxford).

Sampson, G.R.  1997.  “Depth in English grammar”.  Journal of Linguistics 33.131-51; reprinted as ch. 4 of Sampson (2001).

Sampson, G.R.  1999.  “CHRISTINE Corpus, Stage I: Documentation”.

Sampson, G.R.  2001.  Empirical Linguistics.  Continuum.

Yngve, V.H.  1961.  “The depth hypothesis”.  In R. Jakobson (ed.), Structure of Language and its Mathematical Aspects, American Mathematical Society (Providence, Rhode Island); reprinted in F.W. Householder (ed.), Syntactic Theory I: Structuralist, Penguin.



[1] I am grateful to Anna Babarczy and Alan Morris for their contributions to the research resources used in this study, and to Gerald Gazdar, Adam Kilgarriff, and Anna Babarczy for comments on versions of the paper.  Responsibility for its shortcomings is mine alone.


[2] The CHRISTINE and LUCY projects were/are sponsored by the Economic and Social Research Council (UK) under contracts R 000 23 6443 and R 000 23 8146.  Stage I of the CHRISTINE Corpus is available for downloading from (follow link to “downloadable resources”); when complete the LUCY Corpus will be made as accessible as copyright restrictions permit.


[3] Of the forty text files in CHRISTINE Stage I, Release 2, file T40 was omitted from this study because of a format error which interfered with the statistics extraction software.


[4] It is intended that, when complete, the LUCY Corpus will also contain a section of adult writing which is ephemeral or which, even if published, has a relatively high incidence of deviations from standard usage; for that reason, items of the latter kind were excluded from the “published writing” sample which has already been annotated.


[5] In an earlier study (Sampson 1997) I used the term “depth” in a different sense, inspired by the work of Victor Yngve (e.g. 1961), to refer to the extent to which parse-trees contain left-branching structures. (The left-branchingness measure of that study, applied to the present data, gives mean figures which are similar for all three genres.) “Depth” in the present paper refers to distance between leaf and root nodes, and not to a measure of asymmetricality between left and right branching.


[6] Unfortunately the depth figures in the study quoted above are not directly comparable with those shown here; in that study I averaged over all words, including discourse items not contained in clauses (which were assigned depth 0) – this was a reasonable approach in research which compared the oral output of different speakers, but becomes less appropriate when speech is compared with writing.


[7] See Sampson (1999: §9) on the rules by which our scheme annotates such cases.


[8] I have not checked whether the wordiness differential might be partly attributable to phrase within phrase recursion of the kind just illustrated – it is not entirely clear how, formally, one should tease apart the contributions of different types of recursion; but, impressionistically, noun phrase within prepositional phrase within noun phrase structures seem very common in the speech data.


[9] More strictly, between the raw figures from which those percentages are calculated.  The chi-squared test does not apply to percentages.
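The point can be illustrated with a minimal Python sketch of the Pearson chi-squared statistic.  The function name and the counts are invented for illustration and are not figures from this study; the sketch is not the statistics extraction software used here.  Two tables with identical percentages but different sample sizes yield different statistics, which is why the test has to be run on the raw counts.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of raw counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Invented counts for illustration only; NOT figures from this study.
small = [[40, 60], [50, 50]]      # 100 items per row
large = [[400, 600], [500, 500]]  # same percentages, ten times the data

# With one degree of freedom the 5% critical value is 3.841: the small
# table falls below it and the large table lies well above it.
print(round(chi_squared(small), 3), round(chi_squared(large), 3))
```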


[10] For detailed definitions of these categories and the subordinate-clause categories discussed below, see Sampson (1995).  It is not possible in a brief space to illustrate the full range of constructions covered, but I give one example from the child-writing sample for each category:


noun phrase                 all the first formers

verb group                   had been

prepositional phrase    in the world

adjective phrase          very small

adverb phrase              as soon as she’s used to her toys

number phrase            the other two

determiner phrase       any of the girls

genitive phrase            Mary Todd’s



[11] Again I give a single example from the child writing for each category (wording in brackets is included to show the context, and is not part of the example tagma):


infinitival clause                     to keep her out of trouble

adverbial clause                      if he had him

nominal clause                        that it is a Four of Diamonds

verbless clause                        (they go in to dinner,) then the second bell

present participle clause         (by) adding some more to it

relative clause                         (one pup) who looked just like his mother

antecedentless relative            What I like doing

bare non-finite clause             (make us) do the right thing

past participle clause              (a girl) named Jennifer

comparative clause                 (as black) as Alan’s is fair

with clause                              (a yellow door …) with the name wrote on it

special as clause                     (field archers do not use sights) as target archers do

whiz-deleted relative              (“Amazon Adventure”) also by Willard Price



[12] It is interesting to compare these findings with those of Perera (1984), an excellent book which is the only previous substantial study of the grammar of child writing known to me, though written slightly too early to exploit the possibilities now opened up by computer manipulation of machine-readable annotated corpora. Perera’s table 19, p. 232, does not match our finding of child writing assimilating to adult norms earlier with respect to use of relative clauses than other subordinate clause types (though she does note that relative clauses increase in frequency more rapidly during the school years than other clause types, p. 234). Exact comparisons between Perera’s and our findings are difficult, for one thing because her statistics relate to children’s speech and children’s writing but do not give comparative figures for adult writing.


[13] For speech I took all the full relative clauses (omitting one case whose type could not be determined because it was broken off before completion) in CHRISTINE files T07, T14, T21, T28, T35, and the first five from T20.  For child writing I took the first twenty relative clauses in both the 9-10-year-old and the 11-12-year-old files, which were not in a systematic sequence – the forty cases were produced by nine 9-year-olds, seven 11-year-olds, and one 12-year-old.  For published writing I took the first twenty cases from a passage of Independent sports reporting (part of BNC file A4B) and from an extract from a book on provision of legal services in Britain (part of BNC file GVH).


[14] At one point (1998: 109), Miller & Weinert claim in effect that the standard English relative clause construction occurs in spontaneous speech only in patterns A to C.  They note the existence of an alternative construction which occurs only in speech, and is more transparent because the relativized item is represented by a pronoun in its logical position, as in the book that I found these words on its pages; Miller & Weinert say that the relativized item can play far more diverse roles in this latter construction.  However, Miller & Weinert’s claim about the spoken use of the standard construction seems to be contradicted by examples they quote at other points (e.g. the shop I bought it in, their p. 106 – and see our data in the table below).  The alternative construction involving a “shadow pronoun” does occur in our CHRISTINE speech data, though impressionistically it is far rarer than standard relative clauses, and so far as I have noticed it does not occur at all in the child writing data.


[15] The bracketed figure in row D corresponds to the fact that the one 9-year-old type D relative clause in the sample is deviant:  (there are many others [scil. birds]) in which I often read about.  The bracketed figure in row F relates to the spoken example (all) she’s supposed to do now, which by the rules of our scheme is analysed as having the relativized item as object of an infinitival clause subject of supposed.  One might well prefer to see BE supposed to as a quasi-modal construction, in which case the F figure under “speech” would reduce from 2 to 1 and the C figure would increase from 13 to 14.