The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version. Published in Journal of Linguistics 33.131–51, 1997.
Depth in English grammar
Geoffrey Sampson
University of Sussex
ABSTRACT
Corpus data are used to investigate Yngve’s claim
that English usage avoids grammatical structures in which the number of left
branches between any word and the root node of a sentence exceeds some fixed
limit. The data do display a
marked bias against left-branching, but the pattern of word-depths does not
conform to Yngve’s concept of a sharp limit. The bias could alternatively reflect a statistical
invariance in the incidence of left-branching, but whether this is so depends
on how left-branching is counted.
Six nonequivalent measures are proposed; it turns out that one (and only
one) of these yields strikingly constant figures for left-branching in
real-life sentences over a wide range of lengths. This is not the measure suggested by Yngve’s formulation; it
is the measure whose invariance is arguably the most favourable for
computational tractability.
1. History and implications of the Yngve Hypothesis
Victor Yngve drew attention (1960, 1961) to an
asymmetry in English grammar favouring right-branching over left-branching
structures: although various
individual grammatical constructions create left-branching, Yngve believed that
the use of these constructions is in practice constrained in such a way that
the ‘depth’ of any word in a sentence never exceeds some fixed limit, perhaps
seven. He offered an explanation
for this in terms of psychological processing mechanisms. Lees (1961) and Fodor et al. (1974:
408ff.) argued that the relevant psychological considerations are more complex
than Yngve supposed, and that the depth constraints in languages such as
Japanese and Turkish are quite different from that in English.
Note
that the term ‘depth’, in Yngve’s usage, refers purely to the quantity of
left-branching contained in the path linking a terminal node to the root node
of a grammatical tree structure (we shall become more precise shortly about how
this is counted). It is necessary
to stress this to avoid misunderstanding, because the term ‘depth’ is used
quite differently in connexion with tree structures by computer scientists, for
whom the depth of a terminal node is the total number of branches (of any kind)
between itself and the root. Thus,
for a computer scientist, the rightmost terminal node of a tree may have a
large depth, but for Yngve the depth of the last word of a sentence is
necessarily zero. Yngve’s papers
on this topic have attained such classic status in linguistics that I have
chosen to follow his usage here.
The computer scientists’ ‘depth’ is a quantity which plays no part in
the present discussion, so I have not needed to adopt any particular term for
it.[1]
The
question of left-branching became linked for linguists with that of multiple
central embedding (Miller & Chomsky 1963, Reich 1969), which is claimed to
be constrained very tightly indeed.
Occasionally it is suggested (e.g. Lyons 1991: 116) that Yngve’s
hypothesis might have resulted from taking what is in reality a constraint on
central embedding to be a more general constraint on left-branching. But these issues should be kept
distinct. There unquestionably is
a strikingly low incidence in English of left-branching in general – that is,
of multi-word constituents occurring anywhere other than as rightmost daughters
of their containing constructions.
One of the most immediately noticeable features of any
grammatically-analysed English corpus which uses brackets to delimit
constituents is the frequent occurrence of long sequences of right brackets at
the same point in a text, while sequences of adjacent left brackets are few and
short. On the other hand the
empirical facts about central embedding are less clear. De Roeck et al. (1982) and Sampson
(1996) give evidence that the kinds of construction claimed by writers such as
Church (1982: 24 n. 32) and Stabler (1994: 315-316) to be ‘unacceptable’ do
quite often occur in ordinary usage in English and other languages. There may be a special avoidance of
central embedding; alternatively, the alleged rarity of multiple central
embedding might just reflect the familiar principle that, when one counts
examples of anything, the more detailed the criterion which examples are
required to meet, the fewer cases one will find. This paper will study the general English tendency for
multi-word constituents to occur at the end of their containing construction,
ignoring the separate issue whether constituents which violate this tendency
are significantly less frequent in the middle than at the beginning of the
higher unit.
Writing
before the availability of computers and grammatically-analysed corpora, Yngve
noted (1960: 461) that ‘It is difficult to determine what the actual [depth]
limit is’; his figure of seven seems to have been a surmise based on psychological
findings about memory limitations in other domains, rather than on an empirical
survey of linguistic usage (which would scarcely have been feasible at that
period). Fodor et al. (1974: 414)
echoed Yngve’s point about the difficulty of checking empirically just what the
depth patterns are in real-life usage.
But it is fairly clear that Yngve’s conception involves a sharp
cut-off: up to the depth limit
(whether this is seven or another number) many words are found, beyond the
limit none. He illustrates his
concept with a diagram (reproduced here by permission of the American
Mathematical Society as (1), after Yngve 1961: 134, Fig. 5) of the kind of
structure that would be expected with a depth limit of three; of the fifteen
terminal nodes in (1), apart from the last (which necessarily has depth 0)
there are three at depth 1, six at depth 2, and five at depth 3. Yngve’s caption to the diagram reads
‘If the temporary memory can contain only three symbols, the structures it can
produce are limited to a depth of three and can never penetrate the dotted
line.’
FIG. 1 ABOUT HERE
Yngve’s
depth hypothesis is significant for computational linguistics, because – leaving aside the question whether sentences violating the depth limit should be regarded as ‘ungrammatical’ or as ‘grammatical but unacceptable’, a distinction that we shall not discuss – it seems to imply that English
grammatical usage is determined in part by a nonlocal constraint. Since the phrase-structure rules of
English grammar allow some left-branching and are recursive, it appears that
the class of structures they generate should include structures with excessive
left-branching, which would have to be filtered out by a mechanism that
responds to the overall shape of a tree rather than to the relationship between
a mother node and its immediate daughter nodes. Statistical optimizing techniques for automatic parsing,
such as those of Sampson et al. (1989) and Black et al. (1993), which select the
analysis for an input string which maximizes a continuous measure of
grammatical plausibility, might need to build depth of left-branching into
their evaluation metrics as a consideration to be traded off against local
mother/daughter-node relationships.
While
there is undoubtedly something right about Yngve’s depth hypothesis, to an
empirically-minded corpus linguist the postulation of a fixed limit to depth of
left-branching has a suspicious air.
Corpus linguists tend rather to think of high- and low-frequency
grammatical configurations, with an ‘impossible’ structure being one that
departs so far from the norm that its probability is in practice
indistinguishable from zero, but without sharp cut-offs between the ‘possible’
and the ‘impossible’. The aim of
this paper is to bring corpus evidence to bear on the task of discovering
precisely what principle lies behind the tendency to asymmetry observed by
Yngve in English. We shall find
that the answer is clear-cut; that it does not imply a sharp cut-off between
acceptable and unacceptable depths of left-branching; and that it has positive
consequences for the computational-tractability issues canvassed above.
2. Evidence used to assess the hypothesis
The corpus used for this purpose is the SUSANNE
Corpus (Sampson 1995).[2]
This is an approximately 130,000-word subset of the Brown Corpus of
edited American English, equipped with annotations identifying its surface and
logical grammatical structure. The
SUSANNE Corpus was developed in conjunction with the SUSANNE analytic scheme
(op. cit.). This is a set of
annotation symbols and detailed rules for applying them to difficult cases,
which is intended to come as close as possible to the ideal of defining
grammatical analyses for written and spoken English that are predictable (in the sense that different
analysts independently applying the scheme to the same sample of English must
produce identical annotations), comprehensive
(in the sense that everything found in real-life usage receives an analysis,
and all aspects of English surface and logical grammar which are definite
enough to be susceptible of explicit annotation are indicated), and consensual (in that the scheme avoids taking
sides on analytic issues which are contested between rival linguistic theories,
choosing instead a ‘middle-of-the-road’ analysis into which alternative
theorists’ analyses can be translated).
Its 130,000 words make the SUSANNE Corpus far from the largest analysed
corpus of English now available, but limited size is the penalty paid to
achieve high reliability of the analysis of each individual sentence – for
present purposes that is important.[3]
The research reported below used Release 3 of the SUSANNE Corpus,
completed in March 1994; the many
proofreading techniques to which this version was subjected before release
included scanning the entire text formatted by software which uses indentation
to reflect the constituency structure implied by the SUSANNE annotations, so
that most errors which would affect the conclusions of the present research
should have been detected and eliminated.
Although
the SUSANNE analytic scheme aims to be ‘consensual’ as just defined, obviously
many individual linguistic theorists would prefer different structural analyses
for particular constructions.
However, although this might lead to some changes in the individual
figures reported below, the overall conclusions are sufficiently clear-cut to
make it reasonable to hope that they would be unaffected by such modifications,
provided these were carried out consistently.[4]
Some
readers may think it unfortunate that the present investigation is based on
written rather than spoken English; if constraints on left-branching derive
from psychological processing considerations (as Yngve believed) it is likely
that these considerations impact more directly on spontaneous speech than on
writing. At present there is to my
knowledge no analysed corpus of spontaneous spoken English which would have
been suitable for the purpose. But
in any case, transcriptions of spontaneous speech tend not to contain long chains
even of right-branching structure, and they contain many editing phenomena
which make it difficult to analyse an utterance in terms of a single coherent
tree-structure; so that it is questionable whether an analysed corpus of
spontaneous speech could be used for this research, even if we had one. The highly-ramified structures
discussed by Yngve (1960) are in fact much more characteristic of written than
of spoken English, and I believe that an analysed corpus of written English may
offer the best opportunity to take his work further.
3. Preparation of the test data
In order to study
left-branching, it was necessary to modify the structures of the SUSANNE Corpus
in a number of respects:
(i) The SUSANNE analytic scheme treats
punctuation marks as ‘words’ with their own place in parse trees; and it
recognizes ‘ghost’ elements (or ‘empty nodes’) – terminal nodes marking the
logical position of elements which appear elsewhere in surface structure, and
which have no concrete realization of their own. Punctuation marks are not likely to be relevant to our
present concerns (with respect to human syntactic processing they are written
markers of structure rather than elements forming part of a syntactic
structure); and ghost elements are too theory-dependent to be appropriately
included in an empirical investigation such as ours (Yngve discussed only the
structuring of concrete words).
Therefore all terminal nodes of these two types, and any nonterminals
dominating only such nodes, were pruned out of the SUSANNE structures.
(ii) Any tree whose root node is labelled
Oh, ‘heading’, was eliminated:
this covers items such as numbered chapter titles, and other forms whose
internal structure often has little to do with the grammar of running English
text.
(iii) Apart from ‘headings’, the SUSANNE
texts are divided by the analysis into units whose root nodes are labelled O,
‘paragraph’. A paragraph normally
consists of an unstructured chain of sentences (interspersed with
sentence-final punctuation marks which were eliminated at step (i)). Yngve’s thesis relates to structure
within individual sentences; therefore O nodes were eliminated, and the units
within which left-branching was examined were the subtrees whose roots are
daughters of O nodes in the unmodified Corpus. Not all of these units are grammatically ‘complete
sentences’; occasionally, for instance, a noun phrase functions as an immediate
constituent of a SUSANNE paragraph.
The present investigation paid no attention to whether root nodes of
trees in the modified Corpus had the label S or some other label.
(iv) Some SUSANNE tree structures contain
nodes below the root, representing categories such as ‘direct quotation’, which
with respect to their internal constituency are equivalent to root nodes. For the present investigation, the
links between such ‘rootrank nodes’ (Sampson 1995: §4.40) and their daughters
were severed: thus left-branching
was measured within the sentence(s) of a direct quotation without reference to
the sentence within which the quotation was embedded, and when left-branching
was measured in that quoting sentence the quotation was treated as a single
terminal node.
(v) The SUSANNE analytic scheme treats
certain sequences of typographic words, e.g. up to date used as an
adjective, as grammatically equivalent to single words. Any node labelled with an ‘idiomtag’
(Sampson 1995: §3.55) was treated as terminal, and the structure below it in
the unmodified SUSANNE Corpus was ignored.
(vi) The SUSANNE analytic scheme makes
limited use of singulary-branching structure. For instance, a gerundive clause consisting of a present
participle and nothing more will be assigned a node labelled with a clausetag
dominating only a node labelled with a verb-group tag dominating only a node
labelled with a present-participle wordtag. Numerical measures of left-branching might take singulary
branching into account in different ways, depending on exactly how the measures
were defined, but intuitively it seems unlikely that singulary branching is
significant in this connexion; and again singulary-branching nodes seem to be
entities that are too theory-laden to be considered in the present
context. (What would it mean to
assert that the grammatical configuration just cited is a case of three
separate units that happen to be coterminous, rather
than a case of a one-word unit that happens to play three roles? – many would
see these as different ways of talking about the same facts.) Therefore singulary branching was
eliminated by collapsing pairs of mother and only-daughter nodes into single
nodes.
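Step (vi) is mechanical enough to be worth stating as code. The following is an illustrative sketch only (Python, with a nonterminal represented as a (label, children) tuple and a word as a plain string; this minimal representation is assumed for exposition and is not the SUSANNE file format):

    # Sketch of step (vi): eliminate singulary branching by collapsing each
    # mother/only-daughter pair into a single node, working bottom-up.
    # Which label survives a merger does not matter for depth counting;
    # here the daughter's label (or the word itself) is kept.
    def collapse_unary(tree):
        if isinstance(tree, str):                 # a word: nothing to collapse
            return tree
        label, children = tree
        children = [collapse_unary(child) for child in children]
        if len(children) == 1:
            return children[0]                    # mother and only daughter merge
        return (label, children)

On this representation the gerundive example cited under (vi), a clause node over a verb-group node over a single participle, collapses to the participle itself.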
4. Counts of word depths
The first question put to the resulting set of
sentence structures was whether Yngve’s concept of a sharp limit to the
permissible degree of ‘depth’ is borne out in the data. Let us say that the lineage of a word is the class of nodes
including the leaf node (terminal node) associated with that word, the root
node of its tree, and all the intermediate nodes on the unique path between
leaf and root nodes; and let us say that a node e is a younger sister of a node d if d
and e are immediately dominated by the same ‘mother’ node and e
is further right than d.
Then Yngve’s concept of the ‘depth’ of a word corresponds to:
(2) The
total number of younger sisters of all the nodes in the word’s lineage.
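As an illustrative sketch, definition (2) can be computed directly on the same minimal nested-tuple representation used in the sketch in §3 (again, an assumed representation, not the SUSANNE format):

    # Word depths by Yngve's definition (2): the total number of younger
    # sisters of every node on the path from the root down to the word,
    # the word's own leaf node included.
    def yngve_depths(tree, depth=0):
        if isinstance(tree, str):                 # a word: report its depth
            return [(tree, depth)]
        _label, children = tree
        pairs = []
        for i, child in enumerate(children):
            younger = len(children) - i - 1       # younger sisters of this child
            pairs.extend(yngve_depths(child, depth + younger))
        return pairs

Note that on an entirely right-branching binary tree this assigns depth 1, not 0, to every word except the last; that point is taken up below, in the discussion preceding definition (4).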
The number of words in the modified SUSANNE Corpus
having various depths in this sense is shown in Table 1.
TABLE 1 ABOUT HERE
Table
1 gives us not a picture of a phenomenon that occurs freely up to a cut-off
point and thereafter not at all, but of one which, above a low depth, becomes
steadily less frequent with increasing depth until, within the finite quantity
of available data, its probability becomes indistinguishable from zero.
However,
although (2) is the definition of ‘depth’ that corresponds most directly to
Yngve’s exposition, there are two aspects of it which might be called into
question. In the first place,
‘depth’ in this sense can arise as much through a single node having many younger
sisters as through a long lineage of nodes each having one younger sister. This is illustrated by the one word in
SUSANNE having depth 13, which is the first word[5] of the sentence Constitutional
government, popular vote, trial by jury, public education, labor unions,
cooperatives, communes, socialized ownership, world courts, and the veto power
in world councils are but a few examples (Brown Corpus and SUSANNE Corpus
location code G11:0310). The
SUSANNE analysis of this sentence is shown in (3); nodes contributing to the
depth count of the first word are underlined.[6]
FIG. 3 ABOUT HERE
Although
in principle the existence of individual nodes with large numbers of daughters
and the existence of long lineages of nodes each having one younger sister are
two quite different aspects of tree-shape, for Yngve the distinction was
unimportant because he believed that branching in English grammatical
structures is always or almost always binary (Yngve 1960: 455). But this seems to have been less an
empirical observation about English grammar than an analytical principle Yngve
chose to impose on English grammar.
In the case of multi-item co-ordinations such as the one in (3), for
instance, where semantics implies no internal grouping of the conjuncts, I know
of no empirical reason to assume that the co-ordination should be analysed as a
hierarchy of binary co-ordinations; in SUSANNE analyses, which avoid positing
structure except where there are positive reasons to do so, many nodes have
more than two daughters. Where
SUSANNE has a single node with three or more daughters, it seems that Yngve
regularly assumed a right-branching hierarchy of binary nodes. This implies that ‘depth’ measured on
SUSANNE trees will best approximate to Yngve’s concept if each node having
younger sister(s) contributes exactly one to the depth of the words it
dominates, rather than nodes having many younger sisters making a greater
contribution. In that way, depth
figures for words dominated by nodes with many daughters will be the same as
they would be in the corresponding Yngvean trees containing only binary
nodes. (To make the point quite
explicit: although I do not myself
believe that grammatical branching is always binary, I am proposing that we
count word depth in a way that gives the same results whether that is so or not.)
Secondly,
even the most right-branching tree must have an irreducible minimum of left
branches. A tree in which all
nonterminal nodes other than the root are rightmost daughters ought surely to
be described as containing no left-branching at all; yet by Yngve’s definition
each word other than the last will have a depth of one, rather than zero (and
the average word depth will consequently depend on how many words there
are). This inconsistency could be
cured by ignoring the leaf node when counting left-branching in a lineage.
Accordingly,
I suggest that a more appropriate definition than (2) of the depth of a word
would be:
(4) The
total number of those nonterminal nodes in the word’s lineage which have at
least one younger sister.
Thus, consider terminal node e in tree (5):
Counted according to (2), the depth of e is
four, the relevant younger sister nodes being F, j, k, L. Counted according to (4), the depth of e
is two, the contributing nonterminals being B and C. If the distribution of depths among
SUSANNE words is recomputed using definition (4), the results are given in
Table 2.
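To make the contrast concrete, here is a sketch of definition (4) in the same representation, checked against a tree of the shape described for (5). Since the diagram itself is missing from this online version, the structure used below, A(B(C(e, L), j, k), F), with capitals for nonterminals and dummy words under F and L, is a reconstruction from the prose description, not the original figure:

    # Word depths by revised definition (4): count only nonterminal lineage
    # nodes having at least one younger sister; the leaf itself never counts.
    def depths_def4(tree, depth=0):
        if isinstance(tree, str):
            return [(tree, depth)]
        _label, children = tree
        pairs = []
        for i, child in enumerate(children):
            step = 1 if i < len(children) - 1 else 0    # has a younger sister?
            if isinstance(child, str):
                pairs.append((child, depth))            # leaves add nothing
            else:
                pairs.extend(depths_def4(child, depth + step))
        return pairs

    # Reconstructed tree (5); 'x1' etc. are dummy words assumed under F and L.
    tree5 = ('A', [('B', [('C', ['e', ('L', ['x1', 'x2'])]), 'j', 'k']),
                   ('F', ['x3', 'x4'])])
    assert dict(yngve_depths(tree5))['e'] == 4          # definition (2)
    assert dict(depths_def4(tree5))['e'] == 2           # definition (4)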
TABLE 2 ABOUT HERE
The
decline is now much steeper, but again we seem to be looking at a continuously
decreasing probability which eventually becomes indistinguishable from zero in
a finite data-set, rather than at a sharp cut-off. The four words at depth 5 are the words New York, United
States occurring in the respective sentences Two errors by New York
Yankee shortstop Tony Kubek in the eleventh inning donated four unearned runs
and a 5-to-2 victory to the Chicago White Sox today (A11:1840), and Vital
secrets of Britain’s first atomic submarine, the Dreadnought, and, by
implication, of the entire United States navy’s still-building nuclear sub
fleet, were stolen by a London-based soviet spy ring, secret service agents
testified today (A20:0010).
These examples seem intuitively to relate more closely than the (3)
example to the depth phenomenon with which Yngve was concerned; their SUSANNE
analyses are (6) and (7).
FIG. 6 ABOUT HERE
FIG. 7 ABOUT HERE
It
is true that, if depth is counted in terms of definition (4) rather than
Yngve’s original definition (2), then Table 2 shows that the SUSANNE data are
logically compatible with a fixed maximum depth of seven. But to explain the figures of Table 2
in terms of a fixed depth limit is scientifically unsatisfactory, because it is
too weak a hypothesis to account for the patterning in the data. To give an analogy: a table of the numbers of
twentieth-century Europeans who attain various ages at death would, in the
upper age ranges, show declining figures for increasing age until zero was
reached at some age in the vicinity of 120. Logically this would be compatible with a theory that human
life is controlled by a biological clock which brings about death at age 125
unless the person happens to die earlier; but such a theory would be
unconvincing. In itself it fails
to explain why we do not meet numerous 124-year-olds – to explain that we need
some theory such as cumulative genetic transcription errors as cells repeatedly
divide leading to increased probability of fatal maladies; and, if we adopt a
theory of this latter kind, it is redundant also to posit a specific fixed
maximum which is rarely or never attained.
What
we would like to do is to find some numerical property obeyed by the SUSANNE
trees which is more specific than ‘no depth greater than seven’, which is
invariant as between short and long sentences, and which predicts that the
number of words at a given depth will decline as depth increases.
In
the following sections I address this issue in the abstract, prescinding from
psychological questions about how human beings might produce or understand
grammatical structures, and instead treating the set of observed SUSANNE
parsetrees purely as a collection of shapes in which some invariant property is
sought. The ratio of psychological
theorizing to empirical description in this area has been rather high to date,
and the balance deserves to be redressed.
Having found an empirical result I shall not wholly refrain from
speculation about possible processing implications, but these will be very
tentative; the central aim of the work reported here is to establish the
empirical facts rather than to draw psychological conclusions.
5. Alternative measures of left-branching
One possible invariant might be mean depth (in the
(4) sense) of the various words in a sentence. If there were no tendency to avoid left-branching, then mean
word depth would be higher in long sentences than in short sentences, because
more words imply longer lineages between terminal nodes and root, and the
lineages would contain left-branching as frequently as right-branching. Yngve’s picture of a depth boundary
that remains fixed however long a sentence grows suggests that mean word depth
might be constant over different sentence lengths; this could be true despite
the occasional incidence of words with unusually large depth figures.
However,
if we choose to compute the asymmetry of sentence structures by an averaging
procedure over all parts of the tree, rather than by taking a single maximum
figure, then averaging word depth is not the only way to do this. Two other possibilities present themselves. One could take the mean, over the nonterminal
nodes, of the proportion of each node’s daughters which are left-branching
nodes – that is, which are
themselves nonterminal and are not the rightmost daughter. Or one could take the mean, again over
the nonterminal nodes, of the proportion of all words ultimately dominated by a
node which are not dominated by the rightmost daughter of the node and are not
immediately dominated by the node.
Let us call these three statistical properties of a tree structure the depth-based measure, the production-based measure, and the realization-based
measure respectively.
A
low figure for any of these three measures implies that a tree has relatively
little left-branching. But the
measures are not equivalent.
Consider for instance the three six-leaf tree structures (8), (9), and
(10):
By the depth-based measure, the most left-branching
of the three structures is (8); by the production-based measure, the most
left-branching is (9); by the realization-based measure, the most
left-branching is (10).[7]
So far as I am aware, other methods of calculating degree of
left-branching will assign a ranking to the various trees having a given number
of leaf nodes that will be identical or near-identical to the ranking assigned
by one of these three measures.
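For explicitness, the three raw measures can be computed as follows (a sketch in the same representation, reusing depths_def4 from the sketch in §4):

    def leaves(tree):
        # All words dominated by a node, in left-to-right order.
        if isinstance(tree, str):
            return [tree]
        _label, children = tree
        return [word for child in children for word in leaves(child)]

    def raw_measures(tree):
        # Returns (RD, RP, RR): mean word depth by definition (4); mean
        # proportion of left-branching daughters per nonterminal; mean
        # proportion of a nonterminal's words lying under its
        # left-branching daughters.
        pairs = depths_def4(tree)
        rd = sum(d for _w, d in pairs) / float(len(pairs))
        p_scores, r_scores = [], []
        stack = [tree]
        while stack:
            node = stack.pop()
            _label, children = node
            lb = [c for i, c in enumerate(children)
                  if i < len(children) - 1 and not isinstance(c, str)]
            p_scores.append(len(lb) / float(len(children)))
            r_scores.append(sum(len(leaves(c)) for c in lb)
                            / float(len(leaves(node))))
            stack.extend(c for c in children if not isinstance(c, str))
        return (rd,
                sum(p_scores) / len(p_scores),
                sum(r_scores) / len(r_scores))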
None of the three measures gives figures for different trees which are directly
comparable when the trees have different numbers of leaf nodes (i.e. dominate
sentences of different lengths).
An entirely right-branching tree, in which nonterminal nodes are always
rightmost daughters of their mothers, will score zero by each of the three
measures. But, for each of the
measures, the score for an entirely left-branching tree will depend on sentence
length. Writing w for the
number of leaf nodes (words) dominated by a tree, the maximum score will be:
(w - 2)(w + 1) / 2w for the depth-based measure;
(w - 2) / 2(w - 1) for the production-based measure;
[(w - 2) - H(w) + 3/2] / (w - 1) for the realization-based measure, where H(w) = 1 + 1/2 + … + 1/w is the w-th harmonic number.
[The formulae above were missing in this online version; the expressions given here are reconstructions derived from the definitions of the measures, as the scores attained by a maximally left-branching binary tree, and the print publication remains definitive.]
We might therefore normalize the measures to a common
scale by dividing the raw figures by the appropriate one of these three
quantities. The resulting
normalized measures give us a meaningful way of comparing the positions
occupied by sentences of any lengths on a scale from 1, for ‘completely
left-branching’, to 0, for ‘completely right-branching’ (with respect to any
one of the three definitions of asymmetry).
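The normalization, together with a check that the three maxima given above are indeed attained by a maximally left-branching binary tree (the formulae being reconstructions, the check is worth making explicit), can be sketched thus:

    from math import fsum

    def caterpillar(w):
        # A fully left-branching binary tree over w >= 2 words.
        node = ('X', ['w1', 'w2'])
        for i in range(3, w + 1):
            node = ('X', [node, 'w%d' % i])     # nonterminal always leftmost
        return node

    def maxima(w):
        # Maximum (RD, RP, RR) for a w-word sentence, per the formulae above;
        # h_w is the w-th harmonic number.
        h_w = fsum(1.0 / k for k in range(1, w + 1))
        return ((w - 2) * (w + 1) / (2.0 * w),
                (w - 2) / (2.0 * (w - 1)),
                ((w - 2) - h_w + 1.5) / (w - 1))

    def normalized_measures(tree):
        # (ND, NP, NR): raw measures rescaled to the 0-to-1 range.
        w = len(leaves(tree))
        return tuple(raw / mx for raw, mx in zip(raw_measures(tree), maxima(w)))

    for w in (6, 12, 24, 48):
        assert all(abs(a - b) < 1e-9
                   for a, b in zip(raw_measures(caterpillar(w)), maxima(w)))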
I
shall refer to the six resulting statistical measures of left-branching as RD,
RP, RR, ND, NP, NR, for raw v. normalized depth-, production-, and realization-based
measures. The question now is
which, if any, of these six measures yields figures for structural asymmetry in
English that show little variance with different lengths of sentence.
6. Incidence of left-branching computed by alternative measures
In order to answer this question I grouped the
sentences of the modified SUSANNE Corpus into sets by length; for each set up
to length w = 47 I computed the six asymmetry measures for the sentences
in the set, and took their means.
(The maximum length of sentences examined was fixed at 47 because, above
this length, not all lengths are represented in the data by at least ten
instances. Up to w = 47 the smallest number of instances for any single sentence-length is nineteen, at w = 45.) For very short sentences the means
display some patternless fluctuations, which is not too surprising: with few words and even fewer
nonterminals to average over, one should perhaps not expect statistical
measures of a tree’s topological properties to be very informative.[8]
But the runs of figures from w = 7 up to w = 47
(covering a total of 5963 sentences) display very clear trends, summarized in
Table 3, which for each of the six measures gives the overall mean and standard
deviation of the 41 individual means for different sentence lengths, together
with the linear correlation coefficient r between sentence length and
individual mean asymmetry figure.
TABLE 3 ABOUT HERE
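The aggregation just described amounts to the following sketch, where corpus is assumed to be a list of trees in the representation used in the earlier sketches:

    from collections import defaultdict
    from math import fsum

    def per_length_means(corpus, lo=7, hi=47):
        # Group trees by word count; average each raw measure within a length.
        by_len = defaultdict(list)
        for tree in corpus:
            by_len[len(leaves(tree))].append(raw_measures(tree))
        return {w: tuple(sum(col) / float(len(col)) for col in zip(*by_len[w]))
                for w in range(lo, hi + 1) if by_len[w]}

    def pearson_r(xs, ys):
        # Linear correlation coefficient between two equal-length sequences.
        n = float(len(xs))
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = fsum((x - mx) ** 2 for x in xs)
        syy = fsum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5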
The
measure closest to Yngve’s concept, RD, shows a very strong positive
correlation (r = 0.96) between length and depth: individual mean RD figures range from
0.38 for 8-word sentences up to 0.98 for 47-word sentences. Normalizing the depth measure merely
reverses the sign of the correlation (r = –0.93): individual mean ND figures range
between 0.136 for length 7 and 0.040 for length 41.
By
far the most consistent measure of left-branching is RP, which shows
essentially no correlation with sentence length (r = 0.093). Mean RP figures for different sentence
lengths cluster tightly (low standard deviation) round the overall mean of
0.094; the lowest individual mean is 0.084 for length 45, the highest is 0.102
for length 44. It is evidently RP
which gives rise to the limited left-branching which Yngve took for an absolute
bar on lineages containing more than a fixed maximum number of left branches.
The
normalized production-based measure of left-branching, and the
realization-based measures, are not as precisely correlated with sentence
length as the depth-based measures, but absolute correlation coefficients over
0.6 make it clear that these measures are not candidates for the invariant
quantity adumbrated by Yngve.
Individual means range from 0.22 (NP), 0.123 (RR), 0.189 (NR), for
length 7, down to 0.17 (NP), 0.085 (RR), 0.094 (NR), for length 45.
I
do not suggest that the incidence of words at different Yngvean depths can be
predicted purely from statistics on the average incidence of nonterminal and
terminal daughters in individual productions. If that were possible, the figures of Table 2 would display
a regularity that we do not find.
Assuming that not only the proportion L of left-branching
daughters but also the mean number b of daughter nodes per mother node,
and the proportion R of rightmost daughters which are nonterminal, are
constant for different sentence-lengths, then each figure in Table 2 ought to
differ by a constant factor bL/(1 – R) from its
predecessor. Even if the figures
of Table 2 were not to hand, we would know that things are not that simple. The great majority of root nodes in the
modified SUSANNE Corpus have the same label S, ‘main clause’, and the class of
those productions which share some particular mother label will not in general
contain the same proportion of left-branching daughters as found in all
productions (the fact, recorded in Table 2, that there are more depth-1 than
depth-0 words in the Corpus shows that productions having S to the left of the
arrow have a relatively high proportion of left-branching daughters). Likewise the mean proportion of
left-branching daughters for category labels which themselves occur on
left-branching daughter nodes is very likely to deviate from the overall mean
in one direction or the other.
Considerations like these imply that we cannot predict an expected
pattern of word depths against which Table 2 can be tested. But, once we know that the overall
incidence of left-branching productions is a low constant frequency for
sentences of different lengths, there is no need of further explanation for the
fact that the figures in Table 2 dwindle to zero after the first few rows, and
hence for Yngve’s impression that depths above about seven never occur in
practice.[9]
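The factor bL/(1 - R) quoted above can be reconstructed as follows; the reasoning and the notation M(d) are mine, and the constancy assumptions are those stated in the text. Let M(d) be the expected number of nonterminal nodes whose leaf daughters receive depth d under definition (4). Such a node arises either as a left-branching daughter of a node one level up (each mother contributing bL such daughters on average) or as the rightmost nonterminal daughter of a node at the same level (R such daughters per mother on average). In LaTeX notation:

    M_d \;=\; bL\,M_{d-1} \;+\; R\,M_d
    \qquad\Longrightarrow\qquad
    M_d \;=\; \frac{bL}{1-R}\,M_{d-1}

Since every leaf daughter of a level-d nonterminal is a word of depth d, the expected number of words at each depth is proportional to M(d), so successive rows of Table 2 would differ by the constant factor bL/(1 - R) if the assumptions held.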
7. Implications of the findings
From a computational perspective, the significance of
the fact that RP is the invariant measure is that this is the one measure of
asymmetry which depends purely on local grammatical facts. A context-free grammar with
probabilities associated with alternative productions gives an invariant mean
RP figure for sentences of different lengths; if any of the other five measures
had proved to be invariant with sentence length, that would have implied some
mechanism controlling global tree shape, separate from the class of allowable
productions. Thus the finding may
represent good news for computational tractability.
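A toy demonstration of the point (the grammar and its probabilities below are invented for illustration, not estimated from SUSANNE): sampling from this probabilistic context-free grammar and applying per_length_means from the sketch in §6 yields a mean RP close to 0.05 at every sentence length, while mean RD grows steadily with length.

    import random

    RULES = [(0.10, ('X', 'X')),    # one left-branching daughter
             (0.35, ('w', 'X')),    # right-branching only
             (0.55, ('w', 'w'))]    # terminate

    def generate():
        # Expand 'X' top-down; mean offspring 0.55 < 1, so expansion
        # terminates with probability one (cf. note 10).
        r, acc = random.random(), 0.0
        for p, rhs in RULES:
            acc += p
            if r < acc:
                return ('X', [generate() if s == 'X' else 'w' for s in rhs])
        return ('X', ['w', 'w'])    # guard against floating-point rounding

    random.seed(0)
    trees = [generate() for _ in range(20000)]
    means = per_length_means(trees, lo=4, hi=20)
    # The second component (RP) of each value hovers near 0.05 at every
    # length; the first component (RD) increases with length.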
Admittedly,
even the invariance of RP might require an explanation in nonlocal terms, if
the grammatical structures to be explained were to incorporate the singulary
branching which was eliminated from the modified SUSANNE Corpus (§3, (vi)
above). For instance, if pronouns
are introduced into clauses via rules which rewrite clause categories as
sequences including the category ‘noun phrase’ at different points, and
separate rules which rewrite ‘noun phrase’ alternatively as a pronoun or a
multi-word sequence, then a probabilistic context-free grammar could not ensure
that subjects are commonly pronouns and that multi-word noun phrases occur much
more often clause-finally. But the
grammar of English could be defined without singulary branching, by using rules
in which e.g. pronouns occur directly in the expansions of clause categories.
It
is interesting that the invariant measure is RP rather than NP. One interpretation of this finding
might perhaps be that sentences are not in practice constructed by choosing the
words they are to contain and then organizing those words into a suitable
grammatical structure; rather, the grammatical structures are chosen
independently of sentence-length considerations, and the expansion process
terminates simply because productions having no nonterminals to the right of
the arrow have a certain probability and hence will sooner or later be chosen.[10]
It
is hard to accept that the consistent mean left-branching figure for English
productions could be caused by a fixed limit to the number of items held in the
speaker’s/writer’s short-term memory, as Yngve argued: that mechanism would give invariant RD
rather than invariant RP figures.
If the language used low frequency of left-branching productions (that
is, productions which add one to the Yngvean depth of the words ultimately
dominated by their left-branching daughter node) as a strategy to avoid
generating trees containing words deeper than some fixed limit such as seven,
it would be a very inefficient strategy:
most words would be at a depth much less than the limit, ‘wasting’
available memory, and even so there would occasionally be a violation of the
limit. I suggest that fixed
numerical limits may play little role in the psychological processing of
language.
It
would be interesting to discover whether the different incidence of Yngvean
depth found in languages such as Japanese and Turkish can equally be accounted
for by left-branching production frequencies fixed at different
language-specific values.
8. Summary
We have seen that Yngve was right in saying that
English grammatical usage embodies a systematic bias against left-branching
constructions. But corpus evidence
of a kind that has become available only since Yngve published his hypothesis
suggests that the nature of that bias is rather different from what Yngve seems
to have supposed. It is not that
English enforces a left-branching depth maximum which is frequently reached but
never exceeded. Rather, there is a
specific probability of including a left-branching nonterminal category among
the immediate constituents of a construction; this probability is independent
of the wider sentence structure within which the construction is embedded, but
because the probability is small, the incidence of words becomes lower, and eventually vanishingly low, at successively greater depths.
REFERENCES
Aho, A. V., Hopcroft, J. E., & Ullman,
J. D. (1974). The design and
analysis of computer algorithms.
Reading, Mass.:
Addison-Wesley.
Black, E., Garside, R. G., & Leech, G.
N. (eds.) (1993). Statistically-driven
computer grammars of English: the IBM/Lancaster approach. Language and Computers: Studies in
Practical Linguistics 8.
Amsterdam: Rodopi.
Booth, T. L. & Thompson, R. A.
(1973). Applying probability
measures to abstract languages. IEEE
Transactions on Computers C-22.
442-450.
Church, K. W. (1982). On memory limitations in natural
language processing.
Bloomington, Indiana:
Indiana University Linguistics Club.
De Roeck, Anne, Johnson, R., King,
Margaret, Rosner, M., Sampson, G. R., & Varile, N. (1982). A myth about centre-embedding. Lingua 58. 327-340.
Ellegård, A. (1978). The syntactic structure of English
texts: a computer-based study of four kinds of text in the Brown University
Corpus. Gothenburg Studies in
English 43. Gothenburg: Acta Universitatis Gothoburgensis.
Fodor, J. A., Bever, T. G., & Garrett,
M. F. (eds.) (1974). The psychology of language: an introduction
to psycholinguistics and generative grammar. London:
McGraw-Hill.
Hofland, K. & Johansson, S.
(1982). Word frequencies in
British and American English.
Bergen: Norwegian Computing Centre for the Humanities.
Householder, F. W. (ed.) (1972). Syntactic theory 1: structuralist. Harmondsworth, Mddx: Penguin.
Jakobson, R. (ed.) (1961). Structure of language and its
mathematical aspects.
Proceedings of Symposia in Applied Mathematics 12. Providence, Rhode Island: American Mathematical Society.
Knuth, D. E. (1973). The art of computer programming:
vol. 3, sorting and searching.
Reading, Mass.:
Addison-Wesley.
Lees, R. B. (1961). Comments on Hockett’s paper. In Jakobson, ed. (1961). 266-267.
Lyons, J. (1991). Chomsky. (3rd edition.) London: Fontana.
Miller, G. A. & Chomsky, A. N.
(1963). Finitary models of
language users. In Luce, R. D.,
Bush, R. R., & Galanter, E. (eds.), Handbook of mathematical psychology,
vol. 2. London: Wiley.
Reich, P. A. (1969). The finiteness of natural
language. Language 45. 831-843. (Reprinted in Householder, ed. (1972). 258-272.)
Sampson, G. R. (1995). English
for the computer: the
SUSANNE Corpus and analytic scheme.
Oxford: Clarendon Press.
Sampson, G. R. (1996). From
central embedding to corpus
linguistics.
Sampson, G. R., Haigh, R., & Atwell,
E. S. (1989). Natural language
analysis by stochastic optimization: a progress report on Project APRIL. Journal of Experimental and
Theoretical Artificial Intelligence 1. 271-287.
Stabler, E. P., Jr (1991). Avoid the pedestrian’s paradox. In Berwick, R. C., Abney, S. P., &
Tenny, Carol (eds.) Principle-based parsing: computation and
psycholinguistics.
London: Kluwer. 199-237.
Stabler, E. P. (1994). The finite connectivity of linguistic
structure. In Clifton, C., Jr,
Frazier, Lyn, & Rayner, K. (eds.) Perspectives on sentence processing. Hillsdale, N.J.: Lawrence Erlbaum. 303-336.
Steedman, M. J. (1989). Grammar, interpretation, and processing
from the lexicon. In
Marslen-Wilson, W. (ed.) Lexical representation and process. London: MIT Press.
463-504.
Tesnière, L. (1965). Eléments de syntaxe structurale
(2nd ed.). Paris: Klincksieck.
Yngve, V. H. (1960). A model and an hypothesis for language
structure. Proceedings of the
American Philosophical Society 104.
444-466.
Yngve, V. H. (1961). The depth hypothesis. In Jakobson, ed. (1961). 130-138. (Reprinted in Householder, ed. (1972). 115-123.)
Depth Words
0 7851
1 30798
2 34352
3 26459
4 16753
5 9463
6 4803
7 2125
8 863
9 313
10 119
11 32
12 4
13 1
14+ 0
Table 1
Depth counts by Yngve’s definition
Depth Words
0 55866
1 64552
2 12164
3 1274
4 76
5 4
6+ 0
Table 2
Depth counts by revised definition
RD ND RP NP RR NR
mean 0.73 0.067 0.094 0.20 0.10 0.12
s.d. 0.19 0.023 0.0038 0.0091 0.0075 0.020
r 0.96 –0.93 0.093 –0.61 –0.83 –0.88
Table 3
Distribution
of mean structural asymmetry at different sentence lengths, for six measures of
asymmetry
Raw Depth-Based Measure (RD): word depth (by definition (4)),
averaged over the words of a sentence
Raw Production-Based Measure (RP): proportion of the daughters of a
nonterminal node which are themselves nonterminal and nonrightmost, averaged
over the nonterminals of a sentence
Raw Realization-Based Measure (RR): proportion of the words dominated by a nonterminal which fall under a daughter of that node which is itself nonterminal and nonrightmost, averaged over the nonterminals of a sentence
Normalized Measures (ND, NP, NR): for each sentence the corresponding raw
measure is converted to a number between 0 and 1 by dividing by the largest
possible raw figure for a sentence of the same length
Address:
School of Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH
[1]Computer scientists’
quantitative measures of tree structure (e.g. Knuth 1973: 451ff., Aho et al.
1974: 167 Ex. 4.33) specify the extent to which a tree departs from perfect
‘balance’ where the paths between terminal nodes and root are all the same
length: this affects the
efficiency of algorithms which access data held in tree structures. These measures ignore the extent to
which departures from balance occur in one direction rather than the other,
which is the topic of the present paper but is not normally significant in a
computing context.
[2]The SUSANNE Corpus was
produced by a project sponsored by the Economic and Social Research Council
(UK), reference no. R000 23 1142, using a resource developed earlier by Alvar
Ellegård of the University of Gothenburg (Ellegård 1978). The Corpus is distributed free of
charge by anonymous ftp [via Sampson’s
Resources page
– link replacing out-of-date information in
original text].
[3]Although SUSANNE contains
only a fraction of the Brown Corpus material, if the latter is accepted as a
‘fair cross-section’ of the language there is some reason to see SUSANNE as
comparably representative: it
contains equal quantities of prose from each of the four broad genre categories
established by Hofland & Johansson (1982: 22-27) from objective evidence.
[4]Likewise, provided one
agrees that grammatical structure can be represented in terms of labelled
trees, I believe it is not important for what follows whether one takes the
trees to be defined by unitary phrase-structure rules, by separate
immediate-dominance and linear-precedence constraints, or otherwise. The conclusions certainly would be
affected if one held a view of English grammar such as that attributed by
Stabler (1991: 200) to ‘Steedman and others’ (see e.g. Steedman 1989: 466),
according to which English syntax is based on left-branching structures. But I am not aware of any support
whatever for such an analysis from literature whose primary focus is on the
empirical facts of English, rather than on theories of psychological
processing. If this view were
accepted, it would be difficult to explain why most linguists take it as a
truism that English and Japanese, for instance, have widely-different overall
syntactic architectures.
[5]Misprinted as Consitutional in the source text from which
the Brown Corpus was compiled.
[6]Note that the YC nodes
dominating commas, being punctuation nodes, were eliminated from the modified
Corpus used in this study.
[7]The respective scores are
as follows:
(8) (9) (10)
depth-based 1.50 1.00 0.67
production-based 0.20 0.25 0.17
realization-based 0.327 0.325 0.333
I illustrate the
calculations for the case of tree (9).
For the depth-based measure, the nonterminals having younger sisters are
the two lowest, hence the depth (by definition (4)) of the leaf nodes in
left-to-right sequence is 0, 2, 2, 1, 1, 0 – total 6, averaged over six leaves
gives 1.00. For the
production-based measure, the left-branching nodes are again the two lowest
nonterminals, hence the proportion of left-branching daughters for the
nonterminals in sequence from the root downwards is 0, 0.5, 0.5, 0: average 0.25. For the realization-based measure, the relevant proportions
of words for the nonterminals in sequence from the root downwards are 0/6, 4/5,
2/4, 0/2: average 0.325.
[8]Some of the short
‘sentences’ in the SUSANNE Corpus consist of material such as prices shown
numerically which, like ‘headings’ (see §3 above), can scarcely be seen as
representing natural language structure in the ordinary sense.
[9]My discussion (like
Yngve’s) has assumed a phrase-structure representation of sentence grammar, in
which all the words of a sentence are associated with terminal nodes of a tree
structure, and nonterminal nodes are labelled with grammatical categories. It would be interesting to consider
whether generalizations about depth in English would be affected if one chose a
dependency representation of grammatical structure (Tesnière 1965), in which
nonterminal as well as terminal nodes are associated with words, and the
mother/daughter relationship between nodes represents the head/modifier rather
than the whole/part relationship.
A dependency tree is notationally equivalent to a phrase-structure tree
in which one daughter of each nonterminal node is marked as head, so facts
about depth in phrase-structure trees should be mechanically translatable into
facts about dependency trees. But
the respective statements would not necessarily be equally straightforward – it
might be that the facts about depth in English are more naturally stated in
terms of one notation rather than the other; and conceivably the availability
of headship information in dependency trees could permit generalizations to be
stated in a stronger form lacking a translation into phrase-structure
notation. I have not pursued these
issues.
[10]Booth & Thompson
(1973) have shown that it is possible to construct pathological probabilistic
context-free grammars which give a more than infinitesimal probability for
derivations to expand endlessly without terminating, but this is surely more a
mathematical curiosity than a finding relevant to natural language: if English can be described by a
probabilistic grammar, it will presumably be a non-pathological grammar.