Published in R.C. Carrasco & J. Oncina, eds., Grammatical Inference and Applications, Springer, 1994.
Stochastic Optimization of a Probabilistic Language Model
M.D. Dennis, A.M. Wallington, G.R. Sampson
University of Sussex
Abstract
The system described generates a language model for
use in the APRIL natural-language parser.
The APRIL language model takes the form of a set of typical productions
(pairings of mother label with sequence of daughter labels) for each
non-terminal category; the system analyses an input by seeking a labelled tree
over the words of the input which offers the best fit between each production
in the tree and some production of the language model. Previous versions of APRIL used
hand-designed language models, but this became a serious research bottleneck. This paper presents a system which uses
stochastic optimization to reduce a large set of observed productions to a set
of “prototype productions” which are few in number but which nevertheless
successfully typify the large observed set.
1 The APRIL Parser
This paper discusses a system which automatically
derives a language model, for use in a robust statistically-based
natural-language parser, from a grammatically-analysed sample of the relevant
natural language.
The parser which the system is intended to serve is
the APRIL parser, an early version of which was described in Sampson et al.
(1989). APRIL treats the problem
of finding a satisfactory grammatical analysis of an input text as a
statistical optimization problem.
Given an input string, it looks for the labelled tree over the units of
the string which maximizes a continuous measure of grammatical plausibility,
where plausibility is assessed by comparing the various local configurations
jointly making up a tree with the incidence of identical or similar
configurations in a database of correctly-parsed natural-language
material. The stochastic
optimizing technique of simulated annealing is used to locate the best solution
within the solution-space for an input.
This space contains all possible tree structures having the appropriate
number of terminal nodes, with all possible assignments of labels to the
nonterminal nodes of the structures; no labelled tree is ruled out as
grammatically “illegal”; instead, an analysis that a linguist would view as
absurd is rejected because it is quantitatively less plausible than other
analyses, by reference to the statistics of the parsed database. (Simulated annealing – see e.g. Aarts
& Korst (1989) – involves executing a random walk through a solution space,
subjecting the individual steps of the walk to an initially weak but steadily
increasing bias against steps from better to worse solutions: in a wide range of problem domains this
allows a system to converge on the global optimum solution while avoiding being
trapped by local optima.)
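As a concrete illustration of the annealing scheme just described, the following is a minimal, generic Python sketch; it is not Project APRIL's code, and the neighbour and cost callables and the schedule constants are illustrative assumptions:

```python
# A generic simulated-annealing skeleton (illustrative only; neighbour(),
# cost() and the schedule constants are assumptions, not APRIL's own code).
import math
import random

def anneal(initial, neighbour, cost, t_start=10.0, t_min=0.01,
           alpha=0.95, steps_per_temp=100):
    """Random walk with a steadily increasing bias against uphill moves."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_min:
        for _ in range(steps_per_temp):
            candidate = neighbour(current)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept worsenings with a
            # probability that shrinks as the temperature t falls.
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # cool the temperature
    return best, best_cost
```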
In this way, the APRIL parser addresses the problem
that rule-based parsers are often fragile when confronted with messy real-life
language data that do not always conform perfectly to well-defined rules. By the nature of the simulated
annealing algorithm, APRIL must yield some solution for any input, even a
highly ungrammatical input: APRIL
seeks the most plausible available analysis, without caring whether the
absolute degree of plausibility of that analysis is high or (as is likely in
the case of an ungrammatical input) relatively low.
For early versions of the APRIL system, inputs
consisted of individual English sentences which an analyst had segmented into
words and tagged (that is, wordclass codes had been assigned to the
words). The current phase of
Project APRIL[1] is producing a system whose inputs are
paragraphs of English in the form of raw character strings lacking any analytic
information. Using an electronic
dictionary, the system begins by forming a word-hypothesis lattice whose arcs
span segments of the input which may constitute words (not always obvious, e.g.
in the case of punctuation or idioms) and which are labelled with candidate
wordtags and tag-frequency information.
The parse annealer then seeks the optimum labelled tree whose terminal
nodes constitute some path through this lattice. (The word-hypothesis lattice structure is formally similar
to data structures typically generated by speech-recognition systems; because
of research-resource availability, Project APRIL is currently working with
written English, but the system is designed for ease of transfer in due course
to the speech domain, where problems of grammatically-messy usage are even more
salient.)
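A word-hypothesis lattice of the kind just described might be sketched as follows; the field names and example entries are our own assumptions, not the project's actual data structures:

```python
# A sketch of a word-hypothesis lattice; names and values are illustrative.
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    start: int        # character offset where the candidate word begins
    end: int          # character offset just past the candidate word
    form: str         # the substring hypothesized to constitute one word
    wordtag: str      # a candidate wordtag for that form
    frequency: float  # relative frequency of this tag for this form

# One arc per (span, candidate tag); competing arcs may span the same
# characters, e.g. where an idiom could be one word or two.
lattice = [
    WordHypothesis(0, 6, "cannot", "modal-verb", 0.7),
    WordHypothesis(0, 3, "can", "modal-verb", 0.2),
    WordHypothesis(0, 3, "can", "common-noun", 0.1),
]
```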
The APRIL approach requires a language model capable
of assigning a numerical measure of plausibility to any “production” – that is,
any pairing of a mother label, drawn from an agreed vocabulary of nonterminal
categories, with a finite sequence of daughter labels each drawn from the union
of that vocabulary with the agreed vocabulary of wordtags. The current phase of Project APRIL uses
as its categories those of the SUSANNE annotation scheme (Sampson 1994), a set
of symbols and rules for applying them to difficult cases which has been
designed as an explicit, comprehensive taxonomic scheme for representing
written and spoken, surface and logical English grammar. The statistics of the APRIL language
model are derived from the 130,000-word SUSANNE analysed Corpus, omitting a
subset reserved for testing.[2]
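A production in the sense just defined is simply a mother label paired with a finite daughter sequence; a minimal sketch, using the self-explanatory descriptive terms employed in this paper rather than actual SUSANNE codes:

```python
# A "production": a mother label paired with a daughter-label sequence.
# Names are illustrative; the real categories are SUSANNE symbols.
from dataclasses import dataclass

@dataclass(frozen=True)
class Production:
    mother: str                  # a nonterminal category
    daughters: tuple[str, ...]   # nonterminal categories and/or wordtags

p = Production("noun-phrase", ("article", "adjective-phrase", "common-noun"))
```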
2 The Use of Prototype Productions
Because of the complexity and unpredictability of
real-life usage, the range of grammatical configurations found in the SUSANNE
Corpus is immensely diverse. Even
using simplified category labels which omit many of the details of the full
SUSANNE annotation scheme, the parse-trees of the SUSANNE Corpus embody on the
order of 14,000 distinct productions, some of which recur many times but many
of which are hapax legomena.
Furthermore, it is certain that other samples of English would throw up
yet further productions. It would
be unreasonable to treat the production-set of the SUSANNE Corpus as a
definitive grammar of English for parsing purposes. Rather, it must be treated as a source from which a smaller,
more tractable language model is to be distilled. Furthermore, this tractable language model must do more than
assign some measure of plausibility to productions that actually
occur in English: the nature of
the APRIL algorithm means that the system will repeatedly have to evaluate
crazy, randomly-labelled parse-trees, so the language model needs to be able to
assign some (often low) plausibility figure to any production whatsoever,
provided only that mother and daughter labels are drawn from the agreed
vocabularies.
APRIL solves this problem via the concept of
“prototype” daughter sequences.
For any given mother label, say “noun phrase”,[3] there will be very many distinct daughter
sequences in the data, but these will be groupable into families of sequences,
such that each member of a particular family can be seen as deriving from a
single prototype sequence for the family by insertion of extra labels into the
sequence and/or omission of some of the elements from the sequence. Thus, for “noun phrase”, one prototype
sequence might be the single element “personal-pronoun”, another might be
“article + adjective-phrase + common-noun”. Once a set of prototype sequences is chosen for each mother
category, the APRIL language model is constructed automatically as a set of
transition networks of the form shown in Figure 1. A prototype sequence supplies the labels for successive arcs
on the “spinal route” through a network, and these arcs are supplemented in a
standardized manner with skip arcs allowing any arc on a spinal route to be
omitted, and loop arcs allowing extra elements to be inserted at any
point. Thus, Figure 1 shows the
network that would be constructed for a prototype sequence A B C. Arcs labelled e
are jump arcs transited without consuming input; labels such as “V – BC”
mean “the full vocabulary of labels, omitting B and C”:
Figure 1 Skip-and-Loop transition network
Probabilities are assigned to the arcs, and to
alternative symbols on the loop arcs, by running the parsed data over the
network system; and the resulting model is used to evaluate productions during
the parsing process by driving the daughter sequence of a production over the
networks for the different prototype sequences for the relevant mother label,
multiplying the probabilities of the various arcs transited and of the
particular prototype, and taking the highest of the resulting products for the
various prototypes. The
arrangement of skip and loop arcs guarantees that any label sequence whatever
will be accepted deterministically on any prototype network.[4]
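The evaluation just described might be sketched schematically as follows. The real model estimates a probability for every arc and for the alternative symbols on each loop arc by running the parsed data over the networks; in this sketch a single placeholder probability per arc type stands in for those estimates, and the greedy deterministic traversal follows our reading of Figure 1. All names and figures are illustrative assumptions:

```python
# Schematic scoring of a daughter sequence against one prototype's
# skip-and-loop network (placeholder probabilities; illustrative only).
def network_score(prototype, daughters,
                  p_match=0.6, p_skip=0.05, p_loop=0.05):
    prob, i = 1.0, 0                    # i indexes the spinal route
    for d in daughters:
        if d in prototype[i:]:
            j = i + prototype[i:].index(d)
            prob *= p_skip ** (j - i)   # skip arcs over omitted elements
            prob *= p_match             # spinal arc labelled d
            i = j + 1
        else:
            prob *= p_loop              # loop arc inserting an extra symbol
    return prob * p_skip ** (len(prototype) - i)  # skip unused elements

def plausibility(prototypes_with_prob, daughters):
    """Best product of prototype probability and network score."""
    return max(p * network_score(proto, daughters)
               for proto, p in prototypes_with_prob)

# e.g. a noun-phrase daughter sequence against two prototypes:
protos = [(("article", "adjective-phrase", "common-noun"), 0.4),
          (("personal-pronoun",), 0.3)]
print(plausibility(protos, ("article", "common-noun")))
```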
The adequacy of the language model, and hence the
performance of the APRIL parser, is therefore crucially dependent on the
quality of the prototype sequences for various nonterminal categories. In earlier phases of Project APRIL,
prototype sequences were derived from lists of observed productions by
hand. Although the parsed corpus
then used was much smaller than the SUSANNE Corpus and used less detailed
grammatical categories (so that the number of observed productions was
smaller), this proved to be a very serious bottleneck in system
development. It seemed highly
desirable to automate the process of deriving prototypes from observed
productions.
3 Optimization of Prototypes
In the current phase of the project, the process of
deriving prototypes from data is itself treated as an optimization problem to
be solved by simulated annealing.
For any given nonterminal category, a list is formed of the observed
daughter sequences, with one entry for each sequence type, together with a
record of the number of observed instances (tokens) for each type. A data structure is created which has a
slot for each element in each daughter sequence, as in Figure 2 (the labels and
figures in which are illustrative only).
An annealing run involves a random walk through the space of alternative
combinations of assignments of the values On and Off to the slots in such a
structure. At each step of the
run, a randomly-chosen slot is changed from On to Off or vice versa. In the situation illustrated, the
current solution assigns to the mother label M the prototype set {X Y Z, X
B Y, P Y, C Y Z, Q R S U}.
Figure 2 Data structure for observed/prototype
sequences
Since we are concerned with generating a prototype set
of daughters for a specific mother, no On/Off slots are needed for the mother
label.
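The slot structure of Figure 2 and the annealer's single move type might be sketched as follows; since the figure itself is illustrative, the observed sequences and the decoding of slots into prototypes here are our own reading of it:

```python
# Sketch of the On/Off slot structure and the annealing move (illustrative).
import random

# One row per observed daughter-sequence type, one On/Off slot per element.
observed = [("X", "Y", "Z"), ("X", "B", "Y", "C"), ("P", "Q", "Y")]
slots = [[True] * len(seq) for seq in observed]   # all slots start On

def step(slots):
    """Flip one randomly chosen slot from On to Off or vice versa."""
    row = random.randrange(len(slots))
    col = random.randrange(len(slots[row]))
    slots[row][col] = not slots[row][col]

def current_prototypes(observed, slots):
    """The prototype set encoded by the current slot assignment."""
    protos = set()
    for seq, row in zip(observed, slots):
        proto = tuple(sym for sym, on in zip(seq, row) if on)
        if proto:                       # all-Off rows contribute nothing
            protos.add(proto)
    return protos
```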
For a prototype set to be useful as part of a language
model for the APRIL parser, it needs to contain few sequences, but there should
be a good degree of similarity between each observed sequence and some
prototype sequence. These
requirements tend to conflict. The
evaluation metric used in annealing a prototype set balances the requirements
in a manner now to be described.
We first define a cost for an individual prototype
sequence relative to an individual observed sequence. Consider the prototype sequence A B C (which gives rise to
the skip-and-loop network of Figure 1), relative to the observed sequence X B Y
C. The observed sequence is driven
over the network, and a counter is incremented by 2 for each traversal of a
loop arc or a skip arc (i.e. a jump arc between two odd-numbered states), by 1
for each traversal of an arc labelled with an element of the prototype
sequence, and by 0 for each traversal of a jump arc from an even- to an odd-numbered
state. In the case quoted, the
result is 8, the transitions being as shown in the following table:
Input   Move                        Cumulative Sum
start   go to state 0               0
X       loop V-ABC                  2
B       edge e from state 0 to 1    2
B       edge e from state 1 to 3    4
B       edge B from state 3 to 4    5
Y       loop V-C                    7
C       edge e from state 4 to 5    7
C       edge C from state 5 to 6    8
end     edge e to final state 7     8
A prototype has a higher cost, the more dissimilar it
is from the observed sequence, but it has a positive cost even if identical to
the observed sequence (since we want prototype sequences to be few). The resulting figure is “normalized” by
being divided by the summed lengths of the prototype and observed sequences,
and the normalized figure is squared.
Normalization is intended to prevent figures for long observed or
prototype sequences from swamping figures for short sequences in evaluating a
set of prototypes; the squaring step was introduced as an empirical response to
the need to make prototype-sequence cost increase more than linearly with
increase in dissimilarity between the respective symbol-strings. Thus the cost of prototype A B C
relative to observed X B Y C is (8 ÷ (3 + 4))² ≈ 1.306.
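The individual cost just defined can be sketched as follows; the greedy traversal is our reading of the arc-cost rules above (the project's actual code is not reproduced here), and it reproduces the worked example:

```python
# Individual prototype-vs-observed cost (a sketch of the rules above).
def raw_cost(prototype, observed):
    cost, i = 0, 0                       # i indexes the spinal route
    for sym in observed:
        if sym in prototype[i:]:
            j = i + prototype[i:].index(sym)
            cost += 2 * (j - i)          # skip arcs over omitted elements
            cost += 1                    # spinal arc matching sym
            i = j + 1
        else:
            cost += 2                    # loop arc inserting an extra symbol
    return cost + 2 * (len(prototype) - i)  # skip any unconsumed elements

def normalized_cost(prototype, observed):
    n = raw_cost(prototype, observed) / (len(prototype) + len(observed))
    return n * n                         # squaring sharpens dissimilarity

# The worked example: prototype A B C against observed X B Y C.
assert raw_cost(("A", "B", "C"), ("X", "B", "Y", "C")) == 8
print(round(normalized_cost(("A", "B", "C"), ("X", "B", "Y", "C")), 3))
# prints 1.306
```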
Having defined the cost of an individual prototype
sequence relative to an individual observed sequence, we now define the cost of
a set of prototype
sequences relative to a set of observed sequences.
For each observed sequence, that prototype sequence is found whose cost
relative to the observed sequence is lowest, and this cost is multiplied by the
frequency of instances of the observed sequence within the data. The resulting figures for all observed
sequences are summed, and the sum is multiplied by the number of those
prototype sequences within the prototype set which are lowest-cost sequences relative
to at least one observed sequence.
It is this product which the language-model annealer
seeks to minimize during an annealing run; and, on termination of the run, the
set of prototypes output consists of just those prototype sequences whose cost
contributed to the value of the final solution, omitting “useless” prototypes
whose slots are currently On but which are not the cheapest prototype relative
to any observed sequence.
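The set cost and the filtering of useless prototypes can be sketched directly from this definition, reusing normalized_cost from the sketch above; the frequency-weighted data are illustrative:

```python
# Prototype-set cost, following the definition above (illustrative data).
def set_cost(prototypes, observed_with_freq):
    total, useful = 0.0, set()
    for seq, freq in observed_with_freq:
        # Cheapest prototype for this observed sequence.
        best = min(prototypes, key=lambda p: normalized_cost(p, seq))
        total += freq * normalized_cost(best, seq)
        useful.add(best)
    # Multiply by the number of prototypes that are cheapest for at least
    # one observed sequence; "useless" prototypes drop out of the output.
    return total * len(useful), useful

data = [(("X", "B", "Y", "C"), 12),    # (sequence type, token count)
        (("A", "B", "C"), 7),
        (("A", "C"), 3)]
cost, useful = set_cost({("A", "B", "C"), ("A", "B")}, data)
```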
Evidently, this evaluation metric implies a great deal
of processing at each step of a language-model annealing run. The code has been carefully designed to
minimize the amount of recalculation required; this is essential for the
generation of a language model in a reasonable time. Currently the time taken to generate a prototype set for the
mother label “sentence” (the label for which our data give the largest solution
space) is about two days on a Sun Sparc 10. (For this label there are just over 10,000 On/Off slots,
hence the solution space has over 2^10000 elements to be
searched.)
4 Performance of the Language-Model Annealer
The results discussed here were generated after the
language-model annealer had been implemented for only about a fortnight, and
(as normal with simulated annealing applications) much of that time was devoted
to empirical experimentation with different annealing schedules and different
weightings attached to various aspects of the evaluation metric. Nevertheless, the system already
appears likely to be able to generate a suitable language model for APRIL
purposes. Almost all the
prototype-sequences generated at current settings of the language-model
annealer look sensible, and, where a particular sequence seems clearly
unsuitable for inclusion in the language model, it normally belongs to a class
(for instance, one-element sequences consisting just of a punctuation mark)
which could be mechanically identified.
(In practice, there would be no reason to rule out a limited use of
manual post-editing in compiling the APRIL language model; but initial impressions
are that it may not be needed.)
The nature of the prototype sets generated is not
wholly similar to what was initially expected. We saw above that the last step in measuring the cost of an
individual prototype sequence relative to an individual observed sequence
consists of squaring the normalized figure derived from the transition
network. One can modify the
evaluation function by taking the normalized figure to a power lower or higher
than two. However, this tends
simultaneously to affect both the length and the number of prototype sequences
generated. Using a low power such
as one, individual sequences become too short to be adequate (it is unlikely
that a language model would function satisfactorily if the great majority of
prototype sequences contained only a single daughter symbol). A power of two is
large enough to give reasonably long prototype sequences, typically containing
two, three, or four symbols, and yields not unreasonable numbers of prototypes.
For example, 190 observed “verb group” sequences were reduced to 17 prototype
sequences, 165 observed “adjective phrase” sequences were reduced to nine, 242
“prepositional phrases” were reduced to nine, and 2844 “noun phrases” were
reduced to 88 (the latter figure is rather higher than was envisaged). Higher
powers still give much larger numbers of prototypes.
Another obvious way of adjusting the evaluation
function is, when deriving the cost of a prototype set relative to
an observed-sequence set, to multiply the summed individual costs not by the
number of “useful prototypes” but by some power of that number. From preliminary experiments it seems
likely that we might achieve a reasonable reduction in the number of
prototypes, without reducing their typical length, by taking the number of
useful prototypes to a fractional power somewhere between one and two before
multiplying into the summed individual costs – though this would be at the
expense of a considerable increase in processing time. (Using a power as high as two gives far
too few prototypes.)
However, the numbers of prototypes generated by the current evaluation
criterion are on occasion (as in the case of the “noun phrase” category quoted
above) rather larger than initially envisaged. Even so, on examining the
sequences generated we are at this early stage inclined to think that we
should perhaps not give priority to increasing the subtlety of the
prototype-set evaluation function: the system may be telling us that our
original intuitive guesses underestimated the number of sequences needed
for an adequate language model in these cases.
It is particularly noteworthy that a number of the
prototype sequences generated would be quite unlikely to occur to a human
analyst executing the task using his linguistic intuition and scanning the many
observed sequences – some of them seem at first sight peculiar or implausible,
and yet on reflection they may well be empirically satisfactory. One example for the mother category
“noun phrase” is the daughter sequence “article + noun + comma + comma”. Two commas will never occur without
something inserted between them, and a human analyst would be far more likely
to include the intervening element in a prototype than to include the commas
and omit what separates them. Yet
English noun postmodifiers do include many diverse grammatical
categories, and these are often surrounded by commas; it may be that in
practice the commas are a better cue to the nature of the construction than is
the identity of the intervening element.
Another example for the “noun phrase” mother category is the one-element
daughter sequence “attributive-only adjective”, such as former.
At first sight, it seems quite undesirable to give good marks to a
parse-tree in which an adjective is categorized as a noun phrase – if a single
adjective is analysed as forming a phrase at all, under the SUSANNE scheme it
would normally be an adjective phrase.
However, the rules of the SUSANNE scheme insert adjective-phrase nodes
above single adjectives only in predicative position, which is precisely where
a word like former
cannot occur. And, although former alone is unlikely to occur as a noun
phrase, a phrase such as the former will indeed be labelled as a noun phrase with
understood noun. So again the
prototype may serve its purpose well although to a human analyst it initially
looks misleading.
5 A Sample Category
In order to give a clearer impression of the nature of
the observed data used by the system and of the relationship between input and
output, we now give details of input and output for one non-terminal category,
adjective phrase, in the experiments discussed in the preceding section. Table 1 shows the 165 daughter-label
sequences observed below the mother-label J, “adjective phrase”, in our data;
the leftmost column gives the number of instances of each production in the
data. In the experiment discussed,
the system derived from these the following set of nine prototype sequences:
JJ
Tn
JA
JJ P
YH
Tg
JB
RR JJ
P
Space precludes giving a complete set of translations
of the simplified SUSANNE symbols here, but the following list gives brief
definitions of a number of them, including all those appearing in the above
prototype sequences:
AT   article
D    determiner phrase, e.g. so many
DA   determiner capable of following article, e.g. more
FB   prefix
Fc   comparative clause, e.g. than we have
JB   attributive adjective, e.g. former
JJ   general adjective
N    noun phrase
NN1  singular common noun
P    prepositional phrase
R    adverb phrase
RG   qualifier, e.g. very
RR   general adverb
Tg   present-participle clause, often one-word, e.g. depressing
Tn   past-participle clause, e.g. beaten
XX   not
YH   hyphen
6 Closing Remarks
It must be pointed out that, at the time of drafting
this paper, it is not yet possible to say how successful an
automatically-generated language model will turn out to be at the task of
preferring correct analyses to incorrect ones. It will be some time yet before a complete
automatically-generated language model can be included in a functioning
parse-annealer. We expect to have
reached this stage, and to be in a position to discuss at least some early
results from this version of the APRIL parser, by the time of the Alicante
meeting.
If the technique does prove successful in practice, it
offers an attractively re-usable method for generating grammatical models for
natural languages. We do not
assert that every natural language is necessarily as amenable as English to the
APRIL system of parsing by reference to prototype sequences. (Languages with very free word order,
or with very large ranges of inflexional forms, might lend themselves less well
than English to this approach.)
But many modern European languages would prima facie appear as suitable
as English for this approach. If a
satisfactory language model for English can be generated in the manner
described here, it should equally be possible to generate a model for another
such language, given only a grammatically-analysed corpus – and the latter is
surely a sine qua non for most serious NLP-oriented research.
REFERENCES
E. Aarts and J. Korst 1989. Simulated Annealing and Boltzmann Machines. Wiley.
G.R. Sampson 1994. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Forthcoming from Oxford University Press.
G.R. Sampson et al. 1989. “Natural language analysis by stochastic optimization”. Journal of Experimental and Theoretical Artificial Intelligence 1.271-87.
[1]Project APRIL is currently sponsored jointly by the UK Science and Engineering Research Council and Ministry of Defence, under grant GR/J06108, “A full natural-language annealing parser”.
[2]The SUSANNE Corpus is distributed free of charge via anonymous ftp. To obtain a copy, take the link ‘downloadable research resources’ from www.grsampson.net and follow the instructions there. [New footnote replacing obsolete information in original publication.]
[3]For the sake of clarity, the technical codes of the SUSANNE annotation scheme are replaced in the bulk of this paper by self-explanatory descriptive terms.
[4]The skip-and-loop network concept as used in the APRIL parser is due to Robin Haigh, now of the University of Leeds Computing Service.