Published in R.C. Carrasco & J. Oncina, eds., Grammatical Inference and Applications, Springer, 1994.
Stochastic Optimization of a Probabilistic Language Model
M.D. Dennis, A.M. Wallington, G.R. Sampson
University of Sussex
Abstract
The system described generates a language model for
use in the APRIL natural-language parser.
The APRIL language model takes the form of a set of typical productions
(pairings of mother label with sequence of daughter labels) for each
non-terminal category; the system analyses an input by seeking a labelled tree
over the words of the input which offers the best fit between each production
in the tree and some production of the language model. Previous versions of APRIL used
hand-designed language models, but this became a serious research bottleneck. This paper presents a system which uses
stochastic optimization to reduce a large set of observed productions to a set
of “prototype productions” which are few in number but which nevertheless
successfully typify the large observed set.
1 The APRIL Parser
This paper discusses a system which automatically
derives a language model, for use in a robust statistically-based
natural-language parser, from a grammatically-analysed sample of the relevant
natural language.
The parser which the system is intended to serve is
the APRIL parser, an early version of which was described in Sampson et al.
(1989). APRIL treats the problem
of finding a satisfactory grammatical analysis of an input text as a
statistical optimization problem.
Given an input string, it looks for the labelled tree over the units of
the string which maximizes a continuous measure of grammatical plausibility,
where plausibility is assessed by comparing the various local configurations
jointly making up a tree with the incidence of identical or similar
configurations in a database of correctly-parsed natural-language
material. The stochastic
optimizing technique of simulated annealing is used to locate the best solution
within the solution-space for an input.
This space contains all possible tree structures having the appropriate
number of terminal nodes, with all possible assignments of labels to the
nonterminal nodes of the structures; no labelled tree is ruled out as
grammatically “illegal”; instead, an analysis that a linguist would view as
absurd is rejected because it is quantitatively less plausible than other
analyses, by reference to the statistics of the parsed database. (Simulated annealing – see e.g. Aarts
& Korst (1989) – involves executing a random walk through a solution space,
subjecting the individual steps of the walk to an initially weak but steadily
increasing bias against steps from better to worse solutions: in a wide range of problem domains this
allows a system to converge on the global optimum solution while avoiding being
trapped by local optima.)
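As a concrete illustration of the annealing scheme just described, the following is a minimal, generic Python sketch; it is not Project APRIL's code, and the neighbour and cost callables and the schedule constants are illustrative assumptions:

```python
# A generic simulated-annealing skeleton (illustrative only; neighbour(),
# cost() and the schedule constants are assumptions, not APRIL's own code).
import math
import random

def anneal(initial, neighbour, cost, t_start=10.0, t_min=0.01,
           alpha=0.95, steps_per_temp=100):
    """Random walk with a steadily increasing bias against uphill moves."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_min:
        for _ in range(steps_per_temp):
            candidate = neighbour(current)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept worsenings with a
            # probability that shrinks as the temperature t falls.
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # cool the temperature
    return best, best_cost
```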
In this way, the APRIL parser addresses the problem
that rule-based parsers are often fragile when confronted with messy real-life
language data that do not always conform perfectly to well-defined rules. By the nature of the simulated
annealing algorithm, APRIL must yield some solution for any input, even a
highly ungrammatical input: APRIL
seeks the most plausible available analysis, without caring whether the
absolute degree of plausibility of that analysis is high or (as is likely in
the case of an ungrammatical input) relatively low.
For early versions of the APRIL system, inputs
consisted of individual English sentences which an analyst had segmented into
words and tagged (that is, wordclass codes had been assigned to the
words). The current phase of
Project APRIL[1] is producing a system whose inputs are
paragraphs of English in the form of raw character strings lacking any analytic
information. Using an electronic
dictionary, the system begins by forming a word-hypothesis lattice whose arcs
span segments of the input which may constitute words (not always obvious, e.g.
in the case of punctuation or idioms) and which are labelled with candidate
wordtags and tag-frequency information.
The parse annealer then seeks the optimum labelled tree whose terminal
nodes constitute some path through this lattice. (The word-hypothesis lattice structure is formally similar
to data structures typically generated by speech-recognition systems; because
of research-resource availability, Project APRIL is currently working with
written English, but the system is designed for ease of transfer in due course
to the speech domain, where problems of grammatically-messy usage are even more
salient.)
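A word-hypothesis lattice of the kind just described might be sketched as follows; the field names and example entries are our own assumptions, not the project's actual data structures:

```python
# A sketch of a word-hypothesis lattice; names and values are illustrative.
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    start: int        # character offset where the candidate word begins
    end: int          # character offset just past the candidate word
    form: str         # the substring hypothesized to constitute one word
    wordtag: str      # a candidate wordtag for that form
    frequency: float  # relative frequency of this tag for this form

# One arc per (span, candidate tag); competing arcs may span the same
# characters, e.g. where an idiom could be one word or two.
lattice = [
    WordHypothesis(0, 6, "cannot", "modal-verb", 0.7),
    WordHypothesis(0, 3, "can", "modal-verb", 0.2),
    WordHypothesis(0, 3, "can", "common-noun", 0.1),
]
```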
The APRIL approach requires a language model capable
of assigning a numerical measure of plausibility to any “production” – that is,
any pairing of a mother label, drawn from an agreed vocabulary of nonterminal
categories, with a finite sequence of daughter labels each drawn from the union
of that vocabulary with the agreed vocabulary of wordtags. The current phase of Project APRIL uses
as its categories those of the SUSANNE annotation scheme (Sampson 1994), a set
of symbols and rules for applying them to difficult cases which has been
designed as an explicit, comprehensive taxonomic scheme for representing
written and spoken, surface and logical English grammar. The statistics of the APRIL language
model are derived from the 130,000-word SUSANNE analysed Corpus, omitting a
subset reserved for testing.[2]
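A production in the sense just defined is simply a mother label paired with a finite daughter sequence; a minimal sketch, using the self-explanatory descriptive terms employed in this paper rather than actual SUSANNE codes:

```python
# A "production": a mother label paired with a daughter-label sequence.
# Names are illustrative; the real categories are SUSANNE symbols.
from dataclasses import dataclass

@dataclass(frozen=True)
class Production:
    mother: str                  # a nonterminal category
    daughters: tuple[str, ...]   # nonterminal categories and/or wordtags

p = Production("noun-phrase", ("article", "adjective-phrase", "common-noun"))
```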
2 The Use of Prototype Productions
Because of the complexity and unpredictability of
real-life usage, the range of grammatical configurations found in the SUSANNE
Corpus is immensely diverse. Even
using simplified category labels which omit many of the details of the full
SUSANNE annotation scheme, the parse-trees of the SUSANNE Corpus embody on the
order of 14,000 distinct productions, some of which recur many times but many
of which are hapax legomena.
Furthermore, it is certain that other samples of English would throw up
yet further productions. It would
be unreasonable to treat the production-set of the SUSANNE Corpus as a
definitive grammar of English for parsing purposes. Rather, it must be treated as a source from which a smaller,
more tractable language model is to be distilled. Furthermore, this tractable language model must do more than
assign some measure of plausibility to productions that actually
occur in English: the nature of
the APRIL algorithm means that the system will repeatedly have to evaluate
crazy, randomly-labelled parse-trees, so the language model needs to be able to
assign some (often low) plausibility figure to any production whatsoever,
provided only that mother and daughter labels are drawn from the agreed
vocabularies.
APRIL solves this problem via the concept of
“prototype” daughter sequences.
For any given mother label, say “noun phrase”,[3] there will be very many distinct daughter
sequences in the data, but these will be groupable into families of sequences,
such that each member of a particular family can be seen as deriving from a
single prototype sequence for the family by insertion of extra labels into the
sequence and/or omission of some of the elements from the sequence. Thus, for “noun phrase”, one prototype
sequence might be the single element “personal-pronoun”, another might be
“article + adjective-phrase + common-noun”. Once a set of prototype sequences is chosen for each mother
category, the APRIL language model is constructed automatically as a set of
transition networks of the form shown in Figure 1. A prototype sequence supplies the labels for successive arcs
on the “spinal route” through a network, and these arcs are supplemented in a
standardized manner with skip arcs allowing any arc on a spinal route to be
omitted, and loop arcs allowing extra elements to be inserted at any
point. Thus, Figure 1 shows the
network that would be constructed for a prototype sequence A B C. Arcs labelled e
are jump arcs transited without consuming input; labels such as “V – BC”
mean “the full vocabulary of labels, omitting B and C”:
Figure 1 Skip-and-Loop transition network
Probabilities are assigned to the arcs, and to
alternative symbols on the loop arcs, by running the parsed data over the
network system; and the resulting model is used to evaluate productions during
the parsing process by driving the daughter sequence of a production over the
networks for the different prototype sequences for the relevant mother label,
multiplying the probabilities of the various arcs transited and of the
particular prototype, and taking the highest of the resulting products for the
various prototypes. The
arrangement of skip and loop arcs guarantees that any label sequence whatever
will be accepted deterministically on any prototype network.[4]
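The evaluation just described might be sketched schematically as follows. The real model estimates a probability for every arc and for the alternative symbols on each loop arc by running the parsed data over the networks; in this sketch a single placeholder probability per arc type stands in for those estimates, and the greedy deterministic traversal follows our reading of Figure 1. All names and figures are illustrative assumptions:

```python
# Schematic scoring of a daughter sequence against one prototype's
# skip-and-loop network (placeholder probabilities; illustrative only).
def network_score(prototype, daughters,
                  p_match=0.6, p_skip=0.05, p_loop=0.05):
    prob, i = 1.0, 0                    # i indexes the spinal route
    for d in daughters:
        if d in prototype[i:]:
            j = i + prototype[i:].index(d)
            prob *= p_skip ** (j - i)   # skip arcs over omitted elements
            prob *= p_match             # spinal arc labelled d
            i = j + 1
        else:
            prob *= p_loop              # loop arc inserting an extra symbol
    return prob * p_skip ** (len(prototype) - i)  # skip unused elements

def plausibility(prototypes_with_prob, daughters):
    """Best product of prototype probability and network score."""
    return max(p * network_score(proto, daughters)
               for proto, p in prototypes_with_prob)

# e.g. a noun-phrase daughter sequence against two prototypes:
protos = [(("article", "adjective-phrase", "common-noun"), 0.4),
          (("personal-pronoun",), 0.3)]
print(plausibility(protos, ("article", "common-noun")))
```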
The adequacy of the language model, and hence the
performance of the APRIL parser, is therefore crucially dependent on the
quality of the prototype sequences for various nonterminal categories. In earlier phases of Project APRIL,
prototype sequences were derived from lists of observed productions by
hand. Although the parsed corpus
then used was much smaller than the SUSANNE Corpus and used less detailed
grammatical categories (so that the number of observed productions was
smaller), this proved to be a very serious bottleneck in system
development. It seemed highly
desirable to automate the process of deriving prototypes from observed
productions.
3 Optimization of Prototypes
In the current phase of the project, the process of
deriving prototypes from data is itself treated as an optimization problem to
be solved by simulated annealing.
For any given nonterminal category, a list is formed of the observed
daughter sequences, with one entry for each sequence type, together with a
record of the number of observed instances (tokens) for each type. A data structure is created which has a
slot for each element in each daughter sequence, as in Figure 2 (the labels and
figures in which are illustrative only).
An annealing run involves a random walk through the space of alternative
combinations of assignments of the values On and Off to the slots in such a
structure. At each step of the
run, a randomly-chosen slot is changed from On to Off or vice versa. In the situation illustrated, the
current solution assigns to the mother label M the prototype set {X Y Z, X
B Y, P Y, C Y Z, Q R S U}.
Figure 2 Data structure for observed/prototype
sequences
Since we are concerned with generating a prototype set
of daughters for a specific mother, no On/Off slots are needed for the mother
label.
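The slot structure of Figure 2 and the annealer's single move type might be sketched as follows; since the figure itself is illustrative, the observed sequences and the decoding of slots into prototypes here are our own reading of it:

```python
# Sketch of the On/Off slot structure and the annealing move (illustrative).
import random

# One row per observed daughter-sequence type, one On/Off slot per element.
observed = [("X", "Y", "Z"), ("X", "B", "Y", "C"), ("P", "Q", "Y")]
slots = [[True] * len(seq) for seq in observed]   # all slots start On

def step(slots):
    """Flip one randomly chosen slot from On to Off or vice versa."""
    row = random.randrange(len(slots))
    col = random.randrange(len(slots[row]))
    slots[row][col] = not slots[row][col]

def current_prototypes(observed, slots):
    """The prototype set encoded by the current slot assignment."""
    protos = set()
    for seq, row in zip(observed, slots):
        proto = tuple(sym for sym, on in zip(seq, row) if on)
        if proto:                       # all-Off rows contribute nothing
            protos.add(proto)
    return protos
```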
For a prototype set to be useful as part of a language
model for the APRIL parser, it needs to contain few sequences, but there should
be a good degree of similarity between each observed sequence and some
prototype sequence. These
requirements tend to conflict. The
evaluation metric used in annealing a prototype set balances the requirements
in a manner now to be described.
We first define a cost for an individual prototype
sequence relative to an individual observed sequence. Consider the prototype sequence A B C (which gives rise to
the skip-and-loop network of Figure 1), relative to the observed sequence X B Y
C. The observed sequence is driven
over the network, and a counter is incremented by 2 for each traversal of a
loop arc or a skip arc (i.e. a jump arc between two odd-numbered states), by 1
for each traversal of an arc labelled with an element of the prototype
sequence, and by 0 for each traversal of a jump arc from an even- to an odd-numbered
state. In the case quoted, the
result is 8, the transitions being as shown in the following table:
Input   Move                        Cumulative Sum
start   go to state 0               0
X       loop V-ABC                  2
B       edge e from state 0 to 1    2
B       edge e from state 1 to 3    4
B       edge B from state 3 to 4    5
Y       loop V-C                    7
C       edge e from state 4 to 5    7
C       edge C from state 5 to 6    8
end     edge e to final state 7     8
A prototype has a higher cost, the more dissimilar it
is from the observed sequence, but it has a positive cost even if identical to
the observed sequence (since we want prototype sequences to be few). The resulting figure is “normalized” by
being divided by the summed lengths of the prototype and observed sequences,
and the normalized figure is squared.
Normalization is intended to prevent figures for long observed or
prototype sequences from swamping figures for short sequences in evaluating a
set of prototypes; the squaring step was introduced as an empirical response to
the need to make prototype-sequence cost increase more than linearly with
increase in dissimilarity between the respective symbol-strings. Thus the cost of prototype A B C
relative to observed X B Y C is (8 ÷ (3 + 4))² ≈ 1.306.
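The individual cost just defined can be sketched as follows; the greedy traversal is our reading of the arc-cost rules above (the project's actual code is not reproduced here), and it reproduces the worked example:

```python
# Individual prototype-vs-observed cost (a sketch of the rules above).
def raw_cost(prototype, observed):
    cost, i = 0, 0                       # i indexes the spinal route
    for sym in observed:
        if sym in prototype[i:]:
            j = i + prototype[i:].index(sym)
            cost += 2 * (j - i)          # skip arcs over omitted elements
            cost += 1                    # spinal arc matching sym
            i = j + 1
        else:
            cost += 2                    # loop arc inserting an extra symbol
    return cost + 2 * (len(prototype) - i)  # skip any unconsumed elements

def normalized_cost(prototype, observed):
    n = raw_cost(prototype, observed) / (len(prototype) + len(observed))
    return n * n                         # squaring sharpens dissimilarity

# The worked example: prototype A B C against observed X B Y C.
assert raw_cost(("A", "B", "C"), ("X", "B", "Y", "C")) == 8
print(round(normalized_cost(("A", "B", "C"), ("X", "B", "Y", "C")), 3))
# prints 1.306
```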
Having defined the cost of an individual prototype
sequence relative to an individual observed sequence, we now define the cost of
a set of prototype
sequences relative to a set of observed sequences.
For each observed sequence, that prototype sequence is found whose cost
relative to the observed sequence is lowest, and this cost is multiplied by the
frequency of instances of the observed sequence within the data. The resulting figures for all observed
sequences are summed, and the sum is multiplied by the number of those
prototype sequences within the prototype set which are lowest-cost sequences relative
to at least one observed sequence.
It is this product which the language-model annealer
seeks to minimize during an annealing run; and, on termination of the run, the
set of prototypes output consists of just those prototype sequences whose cost
contributed to the value of the final solution, omitting “useless” prototypes
whose slots are currently On but which are not the cheapest prototype relative
to any observed sequence.
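The set cost and the filtering of useless prototypes can be sketched directly from this definition, reusing normalized_cost from the sketch above; the frequency-weighted data are illustrative:

```python
# Prototype-set cost, following the definition above (illustrative data).
def set_cost(prototypes, observed_with_freq):
    total, useful = 0.0, set()
    for seq, freq in observed_with_freq:
        # Cheapest prototype for this observed sequence.
        best = min(prototypes, key=lambda p: normalized_cost(p, seq))
        total += freq * normalized_cost(best, seq)
        useful.add(best)
    # Multiply by the number of prototypes that are cheapest for at least
    # one observed sequence; "useless" prototypes drop out of the output.
    return total * len(useful), useful

data = [(("X", "B", "Y", "C"), 12),    # (sequence type, token count)
        (("A", "B", "C"), 7),
        (("A", "C"), 3)]
cost, useful = set_cost({("A", "B", "C"), ("A", "B")}, data)
```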
Evidently, this evaluation metric implies a great deal
of processing at each step of a language-model annealing run. The code has been carefully designed to
minimize the amount of recalculation required; this is essential for the
generation of a language model in a reasonable time. Currently the time taken to generate a prototype set for the
mother label “sentence” (the label for which our data give the largest solution
space) is about two days on a Sun Sparc 10. (For this label there are just over 10,000 On/Off slots,
hence the solution space has over 2^10000 elements to be
searched.)
4 Performance of the Language-Model Annealer
The results discussed here were generated after the
language-model annealer had been implemented for only about a fortnight, and
(as normal with simulated annealing applications) much of that time was devoted
to empirical experimentation with different annealing schedules and different
weightings attached to various aspects of the evaluation metric. Nevertheless, the system already
appears likely to be able to generate a suitable language model for APRIL
purposes. Almost all the
prototype-sequences generated at current settings of the language-model
annealer look sensible, and, where a particular sequence seems clearly
unsuitable for inclusion in the language model, it normally belongs to a class
(for instance, one-element sequences consisting just of a punctuation mark)
which could be mechanically identified.
(In practice, there would be no reason to rule out a limited use of
manual post-editing in compiling the APRIL language model; but initial impressions
are that it may not be needed.)
The nature of the prototype sets generated is not
wholly similar to what was initially expected. We saw above that the last step in measuring the cost of an
individual prototype sequence relative to an individual observed sequence
consists of squaring the normalized figure derived from the transition
network. One can modify the
evaluation function by taking the normalized figure to a power lower or higher
than two. However, this tends
simultaneously to affect both the length and the number of prototype sequences
generated. Using a low power such
as one, individual sequences become too short to be adequate (it is unlikely
that a language model would function satisfactorily if the great majority of
prototype sequences contained only a single daughter symbol). A power of two is
large enough to give reasonably long prototype sequences, typically containing
two, three, or four symbols, and yields not unreasonable numbers of prototypes.
For example, 190 observed “verb group” sequences were reduced to 17 prototype
sequences, 165 observed “adjective phrase” sequences were reduced to nine, 242
“prepositional phrases” were reduced to nine, and 2844 “noun phrases” were
reduced to 88 (the latter figure is rather higher than was envisaged). Higher
powers still give much larger numbers of prototypes.
Another obvious way of adjusting the evaluation
function is, when deriving the cost of a prototype set relative to
an observed-sequence set, to multiply the summed individual costs not by the
number of “useful prototypes” but by some power of that number. From preliminary experiments it seems
likely that we might achieve a reasonable reduction in the number of
prototypes, without reducing their typical length, by taking the number of
useful prototypes to a fractional power somewhere between one and two before
multiplying into the summed individual costs – though this would be at the
expense of a considerable increase in processing time. (Using a power as high as two gives far
too few prototypes.)
However, the numbers of prototypes generated by the current evaluation
criterion are on occasion (as in the case of the “noun phrase” category quoted
above) rather larger than initially envisaged. Even so, on examining the
sequences generated we are at this early stage inclined to think that we
should perhaps not give priority to increasing the subtlety of the
prototype-set evaluation function: the system may be telling us that our
original intuitive guesses underestimated the number of sequences needed
for an adequate language model in these cases.
It is particularly noteworthy that a number of the
prototype sequences generated would be quite unlikely to occur to a human
analyst executing the task using his linguistic intuition and scanning the many
observed sequences – some of them seem at first sight peculiar or implausible,
and yet on reflection they may well be empirically satisfactory. One example for the mother category
“noun phrase” is the daughter sequence “article + noun + comma + comma”. Two commas will never occur without
something inserted between them, and a human analyst would be far more likely
to include the intervening element in a prototype than to include the commas
and omit what separates them. Yet
English noun postmodifiers do include many diverse grammatical
categories, and these are often surrounded by commas; it may be that in
practice the commas are a better cue to the nature of the construction than is
the identity of the intervening element.
Another example for the “noun phrase” mother category is the one-element
daughter sequence “attributive-only adjective”, such as former.
At first sight, it seems quite undesirable to give good marks to a
parse-tree in which an adjective is categorized as a noun phrase – if a single
adjective is analysed as forming a phrase at all, under the SUSANNE scheme it
would normally be an adjective phrase.
However, the rules of the SUSANNE scheme insert adjective-phrase nodes
above single adjectives only in predicative position, which is precisely where
a word like former
cannot occur. And, although former alone is unlikely to occur as a noun
phrase, a phrase such as the former will indeed be labelled as a noun phrase with
understood noun. So again the
prototype may serve its purpose well although to a human analyst it initially
looks misleading.
5 A Sample Category
In order to give a clearer impression of the nature of
the observed data used by the system and of the relationship between input and
output, we now give details of input and output for one non-terminal category,
adjective phrase, in the experiments discussed in the preceding section. Table 1 shows the 165 daughter-label
sequences observed below the mother-label J, “adjective phrase”, in our data;
the leftmost column gives the number of instances of each production in the
data. In the experiment discussed,
the system derived from these the following set of nine prototype sequences:
JJ
Tn
JA
JJ P
YH
Tg
JB
RR JJ
P
Space precludes giving a complete set of translations
of the simplified SUSANNE symbols here, but the following list gives brief
definitions of a number of them, including all those appearing in the above
prototype sequences:
AT   article
D    determiner phrase, e.g. so many
DA   determiner capable of following article, e.g. more
FB   prefix
Fc   comparative clause, e.g. than we have
JB   attributive adjective, e.g. former
JJ   general adjective
N    noun phrase
NN1  singular common noun
P    prepositional phrase
R    adverb phrase
RG   qualifier, e.g. very
RR   general adverb
Tg   present-participle clause, often one-word, e.g. depressing
Tn   past-participle clause, e.g. beaten
XX   not
YH   hyphen
6 Closing Remarks
It must be pointed out that, at the time of drafting
this paper, it is not yet possible to say how successful an
automatically-generated language model will turn out to be at the task of
preferring correct analyses to incorrect ones. It will be some time yet before a complete
automatically-generated language model can be included in a functioning
parse-annealer. We expect to have
reached this stage, and to be in a position to discuss at least some early
results from this version of the APRIL parser, by the time of the Alicante
meeting.
If the technique does prove successful in practice, it
offers an attractively re-usable method for generating grammatical models for
natural languages. We do not
assert that every natural language is necessarily as amenable as English to the
APRIL system of parsing by reference to prototype sequences. (Languages with very free word order,
or with very large ranges of inflexional forms, might lend themselves less well
than English to this approach.)
But many modern European languages would prima facie appear as suitable
as English for this approach. If a
satisfactory language model for English can be generated in the manner
described here, it should equally be possible to generate a model for another
such language, given only a grammatically-analysed corpus – and the latter is
surely a sine qua non for most serious NLP-oriented research.
REFERENCES
E. Aarts and J. Korst 1989. Simulated Annealing and Boltzmann Machines. Wiley.
G.R. Sampson 1994. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Forthcoming from Oxford University Press.
G.R. Sampson et al. 1989. “Natural language analysis by stochastic optimization”. Journal of Experimental and Theoretical Artificial Intelligence 1.271-87.
[1]Project APRIL is currently sponsored jointly by the UK Science and Engineering Research Council and Ministry of Defence, under grant GR/J06108, “A full natural-language annealing parser”.
[2]The SUSANNE Corpus is distributed free of charge via anonymous ftp. To obtain a copy, take the link ‘downloadable research resources’ from www.grsampson.net and follow the instructions there. [New footnote replacing obsolete information in original publication.]
[3]For the sake of clarity, the technical codes of the SUSANNE annotation scheme are replaced in the bulk of this paper by self-explanatory descriptive terms.
[4]The skip-and-loop network concept as used in the APRIL parser is due to Robin Haigh, now of the University of Leeds Computing Service.