The following online article has been derived mechanically
from an MS produced on the way towards conventional print
publication.
Many details are likely to deviate from the print version;
figures and footnotes may even be missing altogether, and
where negotiation with journal editors has led to improvements
in the published wording, these will not be reflected in this
online version.
Shortage of time makes it impossible for me to offer a
more careful rendering.
I hope that placing this imperfect version online may be useful
to some readers, but they should note that the published version
is the definitive text.

STATISTICAL LINGUISTICS

The longest-established important application of statistical techniques to linguistic problems is stylometry, a method of resolving disputed authorship (usually in a literary context, occasionally for forensic purposes) by finding statistical properties of text that are characteristic of individual writers, such as mean word or sentence length, or frequencies of particular words (see e.g. Morton 1978).

Such a technique was proposed by Augustus
De Morgan in 1851 in connexion with the authorship of the Pauline Epistles;
concrete investigations, using methods of greater or lesser statistical
sophistication, have been carried out by many scholars, beginning with T.C.
Mendenhall in 1887. The
mathematical foundations of the topic were studied by Gustav Herdan (e.g.
Herdan 1966), who showed for instance that the right way to measure vocabulary
richness (type/token ratios), avoiding dependence on size of sample, is to
divide log(types) by log(tokens).
A.Q. Morton’s stylometric demonstration that no more than five Epistles
can be attributed to Paul attracted considerable attention in 1963, from a
public intrigued by the idea that a computer (seen at that time as an obscure
scientific instrument) might yield findings of religious significance. However, this work remains
controversial. Other leading
investigations were carried out by George Udny Yule on *De Imitatione Christi*, by Alvar Ellegård on the *Letters of
Junius*, and by Frederick Mosteller and
David Wallace on the *Federalist Papers*.
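
To make Herdan’s point above concrete, here is a minimal sketch in Python (function names are my own, and the tokenized text is assumed to be available as a list of word strings) contrasting the naive type/token ratio, which falls steadily as a sample grows, with the log(types)/log(tokens) measure, which is far less dependent on sample size.

```python
import math

def type_token_ratio(tokens):
    """Naive vocabulary-richness measure: distinct words / total words."""
    return len(set(tokens)) / len(tokens)

def herdan_c(tokens):
    """Herdan's measure: log(types) / log(tokens), largely independent of sample size."""
    return math.log(len(set(tokens))) / math.log(len(tokens))

# Applied to nested samples of one text (tokens[:1000], tokens[:10000], ...),
# the first figure drops sharply while the second stays roughly level.
```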

More recently, Douglas Biber (e.g. Biber 1995) has moved stylometry away from concern with individual authorship puzzles to shed light on broader considerations of historical evolution of style, and on genre differences in English and other languages, by means of factor analysis applied to grammatical features of texts.

A very different area is the use of
numerical measures of resemblance between pairs of languages in order to
establish the shape of the Stammbaum (family tree) of relationships among
languages of a family. Early work
(e.g. by Ernst Förstemann in 1852 and Jan Czekanowski in 1927) used
phonological and grammatical properties to measure resemblance; but writers
such as Morris Swadesh in the 1950s considered mainly the proportions of words
for “core concepts” which remained cognate in pairs of descendant
languages. The terms *lexicostatistics* and *glottochronology* are both used to describe such research (on which
see e.g. Embleton 1986), although “glottochronology” is often reserved for work
which assumes a constant rate of vocabulary replacement (an assumption regarded
by many as unwarranted), and which thereby yields absolute dates for
language-separation events. This
type of research has recently been placed on a sounder theoretical footing by
drawing on the axioms of biological cladistics. Don Ringe and others have used cladistic techniques to
investigate relationships between the main branches of the Indo-European
language family; in many respects their work confirms traditional views, but it
gives unexpected results for the Germanic branch (which includes English). Johanna Nichols (1992) has used a
different biologically-inspired approach, applying cluster-analysis techniques
adapted from population genetics to statistics concerning the geographical
incidence of grammatical features, in order to investigate early, large-scale
human migration patterns. She
claims that language statistics demonstrate that the Americas must have been colonized
much earlier than was previously believed.
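
For concreteness, the glottochronological calculation mentioned above can be sketched as follows. The retention rate of 86% of core vocabulary per millennium is the conventional figure, and the constant-rate assumption built into the formula is exactly the one that many researchers regard as unwarranted; the function name and example figure are illustrative only.

```python
import math

def glottochronological_depth(shared_cognates, retention_rate=0.86):
    """Estimate time since two languages separated, in millennia.

    shared_cognates: proportion of core-vocabulary items still cognate
                     between the two languages (0 < value <= 1).
    retention_rate:  assumed proportion of core vocabulary retained per
                     millennium in each line of descent (the constant-rate
                     assumption many regard as unwarranted).
    """
    return math.log(shared_cognates) / (2 * math.log(retention_rate))

# e.g. two languages still sharing 70% cognates on a core-vocabulary list:
print(round(glottochronological_depth(0.70), 2), "millennia")   # about 1.2
```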

A third statistical approach to language was based on a generalization about word frequencies, stated by J.B. Estoup as early as 1916, but later publicized by George Kingsley Zipf, with whose name it is nowadays usually associated. Zipf’s Law, *r* × *f* ≈ constant, says that, in a long text, the rank *r* of a word (commonest word = 1, second commonest = 2, etc.), multiplied by the word’s frequency *f*, will give roughly the same figure for all words. (Zipf also stated a second law, which asserted a log-linear relationship between word frequencies, and numbers of words sharing a particular frequency; he believed this to be a corollary of his first law, but that is incorrect – mathematically the two generalizations are independent of one another.) Zipf explained his “law” in terms of a Principle of Least Effort, on which he based a wide-ranging (and in some respects controversial) theory of human behaviour.

Although called a “law”, Zipf’s finding is an approximation only; commonly it is quite inaccurate for the highest and lowest ranks, though reasonably correct in the middle ranges. In the 1950s, Benoît Mandelbrot modified Zipf’s Law to make it more empirically adequate, setting this work in the context of a proposed new discipline of “macrolinguistics”, which was intended to bear the same relation to grammar (or “microlinguistics”) as thermodynamics bears to the mechanics of individual gas molecules (Apostel et al. 1957). It is debatable whether Zipf’s Law tells us anything surprising or deep about the nature of language or linguistic behaviour; George Miller argued in 1957 that the “law” was virtually a statistical necessity. But, as Christopher Manning and Hinrich Schütze have put it, Zipf’s Law does remain a good way to summarize the insight that “what makes frequency-based approaches to language hard is that almost all words are rare”.
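
To make the formula concrete, here is a minimal sketch that tabulates rank × frequency for a tokenized text; Zipf’s Law predicts the product should be roughly constant, and the deviations at the extremes are just those which refinements such as Mandelbrot’s address. The function and the sample ranks are illustrative, not drawn from any particular study.

```python
from collections import Counter

def zipf_table(tokens, ranks=(1, 2, 3, 10, 100, 1000)):
    """Print rank, frequency and rank*frequency for selected ranks in a word list."""
    freqs = Counter(tokens).most_common()          # [(word, frequency), ...] ordered by rank
    for r in ranks:
        if r <= len(freqs):
            word, f = freqs[r - 1]
            print(f"rank {r:5d}  freq {f:7d}  rank*freq {r * f:8d}  ({word})")

# Zipf's Law predicts the rank*freq column should be roughly constant,
# typically deviating most at the very top and bottom of the ranking.
```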

In the 1960s and 1970s, numerical techniques fell out of favour in linguistics. (William Labov’s pioneering work on the statistical rules governing social variation in language was carried out in conscious opposition to prevailing orthodoxies.) The subsequent revival of numerical approaches was due partly to a swing of intellectual fashion, but important roles were also played by two concrete developments: the creation of large samples or “corpora” of language in computer-readable form, the first of which was published as early as 1964, and the greatly increased accessibility of computers from the early 1980s. These factors made it easy for linguists to examine statistical properties of language which it would previously have been impractical to study.

Some corpus-based statistical work is an
extension of traditional descriptive linguistics. Here are three examples (among many which could have been
quoted). Our understanding of
lexical differences between British and American English has been enlarged by
Hofland & Johansson’s publication of comprehensive tables of statistically
significant differences between the frequencies of words in the two dialects
(Hofland & Johansson 1982):
these show, for instance, that masculine words such as *he*, *boy*, *man*, words referring to military or violent concepts,
but also words from the religious domain, are significantly more frequent in
American English, while feminine words, hedging words such as *but* and *possible*, words referring to family relationships, and the word *disarmament*, are significantly commoner in British English. (At least, this was true forty years
ago, when the corpora analysed by Hofland & Johansson were compiled.) Harald Baayen has demonstrated that
objective statistical methods for measuring the relative productivity of
various word-derivation processes in English give results which sometimes
contradict both linguists’ intuitions and non-statistical linguistic
theory. Sampson (2001, ch. 5)
finds that the grammatical complexity of speech seems to increase with age of
the speaker, not just in childhood but through middle and old age, contrary to
what might be predicted from “critical period” theories of language learning.
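
One standard way of testing whether a word’s frequency differs significantly between two corpora (not necessarily the exact procedure Hofland & Johansson used) is a 2×2 chi-square test on the word’s occurrences against the corpus totals; the figures below are invented for illustration.

```python
def chi_square_word(freq_a, total_a, freq_b, total_b):
    """2x2 chi-square statistic for one word's frequency in two corpora.

    freq_a / freq_b: occurrences of the word in corpus A / corpus B.
    total_a / total_b: total word tokens in corpus A / corpus B.
    """
    other_a, other_b = total_a - freq_a, total_b - freq_b
    n = total_a + total_b
    expected = [
        (freq_a, total_a * (freq_a + freq_b) / n),
        (freq_b, total_b * (freq_a + freq_b) / n),
        (other_a, total_a * (other_a + other_b) / n),
        (other_b, total_b * (other_a + other_b) / n),
    ]
    return sum((obs - exp) ** 2 / exp for obs, exp in expected)

# A value above about 3.84 is significant at the 5% level (1 degree of freedom).
print(round(chi_square_word(120, 1_000_000, 60, 1_000_000), 2))   # about 20
```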

Alongside investigations with purely scholarly goals, corpus-based statistical techniques have also come into play as an alternative to rule-based techniques in language engineering. (For a survey, see Manning & Schütze (1999).) A central function in many automatic language processing systems is parsing, i.e. grammatical analysis. The classic approach to automatic parsing of natural language treated the task as essentially similar to that of “compiling” computer programs written in a formal language, with a parser being derived from a generative grammar defining the class of legal inputs. However, this approach encounters difficulties if natural languages fail in practice to obey rigorous grammatical rules, or if (as is often the case) the rules allow very large numbers of alternative parses for sentences of only moderate length. Accordingly, the 1990s saw an upsurge of interest in parsing techniques based on language models which include statistical information distilled from corpora. For instance, probabilities may be assigned in various ways to the alternative rewrites allowed by a phrase-structure grammar; or the language model may altogether eschew the concept “ill-formed structure” (and “ill-formed string”) and assign numerical scores to all possible ways of drawing a labelled tree over any word-string, with the correct parse scoring higher than alternatives in the case of those strings which are good sentences. Various (deterministic or stochastic) optimizing techniques are then used to locate the best-scoring analysis available for an input; and statistical optimizing techniques may also be used to induce the probabilistic grammar or language model automatically from data, which may take the form of a “treebank” – a language sample in which sentences have been equipped with their correct parse trees.
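
As a minimal illustration of the first of these options (assigning probabilities to the rewrites of a phrase-structure grammar), the following sketch scores a candidate parse tree as the product of the probabilities of the rules it uses. The toy grammar and its probabilities are invented for illustration; lexical probabilities and the search for the best-scoring tree are omitted.

```python
# Each rewrite rule of the grammar carries a probability; a candidate parse
# tree is scored by multiplying the probabilities of the rules it uses.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)):       0.4,
    ("VP", ("V", "NP")):  0.7,
    ("VP", ("V",)):       0.3,
}

def tree_probability(tree):
    """tree = (label, [child trees or word strings]); product of rule probabilities."""
    label, children = tree
    if all(isinstance(c, str) for c in children):   # preterminal node: lexical probabilities ignored here
        return 1.0
    child_labels = tuple(c[0] for c in children)
    p = rule_prob[(label, child_labels)]
    for child in children:
        p *= tree_probability(child)
    return p

parse = ("S", [("NP", [("N", ["dogs"])]),
               ("VP", [("V", ["bark"])])])
print(tree_probability(parse))   # 0.4 * 0.3 ≈ 0.12
```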

These approaches have brought
computational linguistics into a closer relationship with speech research,
where it has been recognized since the 1970s that the messiness and
unpredictability of speech signals make probabilistic optimizing techniques
centrally important for automatic speech recognition. In speech recognition (see e.g. Jelinek 1997), one is given
a physical speech signal and aims to locate the sequence of words which is most
probable, given that signal. In
other words, one wishes to maximize a conditional probability *p*(*w*|*s*); and empirical data give one estimates for the converse conditional probabilities *p*(*s*|*w*), that is the probabilities of particular signals,
given various word-sequences.
Various useful language-engineering applications can be cast in similar
terms. By Bayes’s Theorem, *p*(*w*|*s*) = *p*(*s*|*w*) × *p*(*w*) / *p*(*s*); so, in order to calculate the quantity we are interested
in, we need to be able to estimate the *p*(*w*)’s, the prior
probabilities of various word sequences in the language. Speech researchers commonly estimate *p*(*w*)’s
using crude “*n*-gram” (usually
bigram or trigram) language models, which estimate the probability of a long
word-sequence simply from the frequencies of the overlapping pairs or triples
of words comprised in the sequence, ignoring its linguistic structure. Linguists’ hopes that grammar could be
deployed to improve on the performance of *n*-gram models have not so far borne fruit. Conversely, simple collocation-based
techniques have begun to be used for tasks that were previously assumed to
depend crucially on subtle linguistic analysis: for instance, there have been experiments in automatic
development of machine translation systems by extracting parallel collocations
from large bodies of bilingual text, such as the Canadian *Hansard*. In
general, one consequence of the introduction of statistical techniques in
natural language processing has been a shift towards simpler grammar formalisms
or language models than those which were popular in linguistics previously.
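
A minimal sketch of the kind of crude bigram language model just described, estimating the prior probability *p*(*w*) of a word sequence from overlapping word pairs, might look as follows; the toy corpus is invented, and the smoothing that real systems need for unseen bigrams is omitted.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence                 # sentence-start marker
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sequence_probability(words, unigrams, bigrams):
    """Estimate p(w) as the product of conditional bigram probabilities."""
    p = 1.0
    for prev, word in zip(["<s>"] + words, words):
        p *= bigrams[(prev, word)] / unigrams[prev]   # p(word | prev)
    return p

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
uni, bi = train_bigram_model(corpus)
print(sequence_probability(["the", "dog", "sleeps"], uni, bi))   # 1 * 2/3 * 1/2 ≈ 0.33
```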

A separate trend in computational research on language, also crucially dependent on statistical or probabilistic concepts, is the development of psychological models of linguistic competence which contradict the longstanding assumption that knowing a language involves mastery of a set of language rules. One such approach was put forward in the 1980s under the alternative names “connexionism”, “Parallel Distributed Processing”, or “neural network” models (a similar but less sophisticated model had flourished briefly in the 1960s under the name “perceptrons”). Fundamental to the PDP approach is the idea that, where behaviour seems to be governed by intelligible rules, e.g. rules of a grammar, these are in reality only a crude way of envisaging the net consequences of the actions of numerous simple processors, each of which operates in a probabilistic manner on inputs and outputs which are meaningless in isolation from the network. PDP was put forward as a new general theory of human cognition, but its most impressive early successes (see e.g. Rumelhart, McClelland, et al. 1986) were relevant chiefly to language and speech. By the end of the century, some commentators (e.g. Pinker 1999) felt that PDP had not fully lived up to its initial promise in this domain; but in the mean time, Rens Bod (1999) had developed a quite different statistically-based theory of human grammar processing, “data-oriented parsing”, which equally implied that languages are not encoded in users’ minds as rules.
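
For readers unfamiliar with such networks, the following is a minimal sketch of the kind of simple processor the PDP approach envisages: a unit that forms a weighted sum of its inputs and squashes the result into the range 0 to 1. The weights here are hand-set for illustration only; in a real network they would be learned from data (for instance by back-propagation), and nothing hangs on the particular numbers.

```python
import math

def unit_output(inputs, weights, bias):
    """One PDP-style unit: weighted sum of inputs passed through a logistic squashing function."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

print(unit_output([1.0, 0.0, 1.0], [0.8, -0.5, 0.3], -0.2))   # about 0.71
```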

Although several contributors to
statistical linguistics have been distinguished mathematicians, the subject has
usually exploited standard mathematical concepts without making novel
contributions to statistical theory.
But sometimes linguistic phenomena have played a part in the advance of
statistical thought. Notably,
Andrei Andreevich Markov, the founder of the theory of stochastic chain
processes (Cox & Miller 1965), used the emission of successive linguistic
units in discourse as his example of such a process; his classic 1913 article
was based on counts of letter-sequence frequencies in Pushkin’s *Eugene
Onegin*. One offshoot of Markov’s work which might have been expected
significantly to influence the subsequent development of linguistics, Claude
Shannon’s “information theory” (Shannon & Weaver 1949), in fact made little
impact for several decades on mainstream linguistic theory. But Markovian concepts acquired a new
importance in speech and language processing research from the 1980s onwards,
for instance through the use of “hidden Markov models” in speech
recognition. (The *n*-gram models discussed earlier treat languages as if
they were governed by simple Markov processes.)
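
In the same spirit as Markov’s counts, the following sketch estimates letter-to-letter transition probabilities from a text; Markov himself counted vowel/consonant alternations in *Eugene Onegin*, so this is an illustration of the general idea rather than a reconstruction of his procedure.

```python
from collections import Counter, defaultdict

def letter_transition_probs(text):
    """Estimate p(next letter | current letter) from letter-pair counts in a string."""
    letters = [c for c in text.lower() if c.isalpha()]
    pair_counts = Counter(zip(letters, letters[1:]))
    context_counts = Counter(letters[:-1])
    probs = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        probs[a][b] = n / context_counts[a]
    return probs

probs = letter_transition_probs("statistical linguistics treats language as a chain")
print(probs["a"])   # distribution over letters observed to follow "a"
```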

Again, there is a fundamental statistical problem (first analysed by Alan Turing) to which language is specially relevant, concerning inferences about the distribution of species in a population from a finite sample in which many species are unrepresented (Sampson 2001, ch. 7). In 1976, Bradley Efron and Ronald Thisted used these ideas to estimate the number of words that Shakespeare knew but happened not to use in the extant writings (and when, in 1985, a new poem attributed to Shakespeare came to light, Thisted and Efron applied the theory in order to conclude that the attribution was very likely correct). In the early 1970s, H.S. Sichel introduced a novel probability law, the generalized inverse Gaussian-Poisson distribution, in connexion with related issues.
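
A minimal sketch of the central Turing/Good idea (often called Good–Turing estimation): the probability mass of word types *not* seen in a sample is estimated from the proportion of tokens belonging to types seen exactly once, and an observed count of r is adjusted to (r + 1) × N[r+1] / N[r], where N[r] is the number of types seen exactly r times. The function below is illustrative and omits the smoothing of the count-of-counts that practical versions require.

```python
from collections import Counter

def good_turing(tokens):
    """Return (estimated probability mass of unseen types, adjusted counts by r)."""
    type_counts = Counter(tokens)
    n_r = Counter(type_counts.values())            # N[r]: how many types occur exactly r times
    total = len(tokens)
    p_unseen = n_r[1] / total                      # mass reserved for unseen types
    adjusted = {r: (r + 1) * n_r[r + 1] / n_r[r]
                for r in n_r if n_r[r + 1] > 0}    # adjusted counts r*
    return p_unseen, adjusted

p0, adjusted = good_turing("to be or not to be that is the question".split())
print(p0, adjusted)   # 0.6 of the probability mass reserved for unseen words in this toy sample
```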

BIBLIOGRAPHY

Apostel, Léo, Benoît Mandelbrot, & Albert Morf (1957) *Logique, langage et théorie de l’information*. (Études d’épistémologie génétique, 3.) Paris: Presses Universitaires de France.

Biber, Douglas (1995) *Dimensions of Register Variation: a Cross-Linguistic Comparison*. Cambridge: Cambridge University Press.

Bod, Rens (1999) *Beyond Grammar: an Experience-Based Theory of Language*. Cambridge: Cambridge University Press.

Cox, D.R. & H.D. Miller (1965) *The Theory of Stochastic Processes*. London: Methuen.

Embleton, Sheila (1986) *Statistics in Historical Linguistics*. Bochum: Brockmeyer.

Herdan, Gustav (1966) *The Advanced Theory of Language as Choice and Chance*. Berlin: Springer.

Hofland, Knut & Stig Johansson (1982) *Word Frequencies in British and American English*. Bergen: Norwegian Computing Centre for the Humanities.

Jelinek, Frederick (1997) *Statistical Methods for Speech Recognition*. Cambridge, Mass.: MIT Press.

Manning, Christopher & Hinrich Schütze (1999) *Foundations of Statistical Natural Language Processing*. Cambridge, Mass.: MIT Press.

Morton, Andrew Q. (1978) *Literary Detection: How to Prove Authorship and Fraud in Literature and Documents*. Epping, Essex: Bowker.

Nichols, Johanna (1992) *Linguistic Diversity in Space and Time*. Chicago: University of Chicago Press.

Pinker, Steven (1999) *Words and Rules: The Ingredients of Language*. London: Weidenfeld & Nicolson.

Rumelhart, David E., James L. McClelland, et al. (1986) *Parallel Distributed Processing: Explorations in the Microstructure of Cognition* (2 vols). Cambridge, Mass.: MIT Press.

Sampson, Geoffrey (2001) *Empirical Linguistics*. London: Continuum.

Shannon, Claude E., & Warren Weaver (1949) *The Mathematical Theory of Communication*. Urbana, Illinois: University of Illinois Press.

Zipf, George Kingsley (1949) *Human Behavior and the Principle of Least Effort*. Cambridge, Mass.: Addison-Wesley.