The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.

Published in International J. of Corpus Linguistics 5.53–68, 2000.



A proposal for improving the measurement of parse accuracy

 

 

Geoffrey Sampson

 

School of Cognitive and Computing Sciences

University of Sussex

Falmer, Brighton BN1 9QH, England

 

[email protected]

 

 

Abstract

 

Widespread dissatisfaction has been expressed with the measure of parse accuracy used in the Parseval programme, based on the location of constituent boundaries.  Scores on the Parseval metric are perceived as poorly correlated with intuitive judgments of goodness of parse; the metric applies only to a restricted range of grammar formalisms; and it is seen as divorced from applications of NLP technology.  The present paper defines an alternative metric, which measures the accuracy with which successive words are fitted into parsetrees.  (The original statement of this metric is believed to have been the earliest published proposal about quantifying parse accuracy.)  The metric defined here gives overall scores that quantify intuitive concepts of good and bad parsing relatively directly, and it gives scores for individual words which enable the location of parsing errors to be pinpointed.  It applies to a wider range of grammar formalisms, and is tunable for specific parsing applications.

 

 


 

1.  Alternative parse-evaluation metrics

 

There has recently been a growth of interest in techniques for evaluating parser output (see e.g. Sutcliffe et al. (1996); Lin (1998); Carroll, Briscoe, & Sanfilippo (1998), Gaizauskas, Hepple, & Huyck (1998), and other contributions to Rubio et al. (1998); and especially Carroll, Basili, et al. (1998)).  The Grammar Evaluation Interest Group (GEIG) metric, based on comparing bracketings between a candidate parse and a “gold standard” parse for a given language sample (Black, Abney, et al. 1991; Grishman, Macleod, & Sterling 1992), has become a standard through its use in the Parseval parser evaluation programme; but many contributors to Carroll, Basili, et al. (1998) expressed dissatisfaction with it.

 

The earliest (to my knowledge) published parse-evaluation metric, which was invented by the present author, was based on a fundamentally different principle from the GEIG approach (see Sampson, Haigh, & Atwell 1989: 278; Sampson 1996: 66-8).  This leaf-ancestor assessment technique was in practice eclipsed by the success of the Parseval programme, though Briscoe & Carroll (1996: 144-5), who were aware of both metrics, pointed out that GEIG parse assessment is unsatisfactory in a way that leaf-ancestor assessment is not.  (Briscoe & Carroll also discussed a third metric, related to the GEIG system, used by Black, Garside, et al. (1993: 10), but they described this as the least satisfactory of all.)  If the GEIG metric is now coming to be perceived as inadequate, it may be worth restating the leaf-ancestor alternative, more fully than was done in the original publication, and explaining its relative strengths.

 

Both approaches to parse evaluation assume that the task of parsing is to assign a labelled tree structure, often represented as a labelled bracketing, to a string of input elements, commonly words.  (If the language sample being parsed is a multi-sentence text, the analysis may of course consist of a sequence of bracketings or trees; purely for simplicity of exposition I shall write as if input strings always consist of a single sentence.)  They assume that the assessor has access to a particular structure which is deemed to be correct (the “gold standard”), and they aim to measure the distance between the gold standard and the analysis being assessed (the “candidate parse”), which will also be a labelled tree over the same input string.  The GEIG scheme compares the sets of word-sequences bracketed together as units in gold-standard and candidate parses respectively, and counts:

 

•   the number of matched bracketings (sequences found in both sets) as a proportion of the candidate bracketings – the “precision”

 

•   the number of matched bracketings as a proportion of the gold-standard bracketings – the “recall”

 

•   the number of “crossing” brackets, where a bracketing in the gold-standard set overlaps a bracketing in the candidate set but each also contains material not included in the other.

 

In the original version of the GEIG scheme, only the location of brackets, not their labels, was taken into account; more recent versions of the scheme (e.g. Magerman 1995, Collins 1996) have adapted it to take into account identity or difference between bracket-labels.  Some complications have also been introduced to deal with specially problematic grammatical phenomena.
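
To make the comparison concrete, the following minimal sketch (in Python; the code and names are mine, not part of the GEIG specification) computes the three unlabelled measures just listed, with each bracketing represented as a (start, end) span over word positions.  Counting the candidate brackets that cross at least one gold-standard bracket is one common convention for the third measure.

    # Minimal sketch of the unlabelled GEIG measures; each bracketing is a
    # (start, end) span over word positions, end exclusive.

    def geig_scores(candidate, gold):
        """candidate, gold: sets of (start, end) spans for the two parses."""
        matched = candidate & gold
        precision = len(matched) / len(candidate) if candidate else 0.0
        recall = len(matched) / len(gold) if gold else 0.0

        def crossing(a, b):
            # Two spans cross if they overlap but neither contains the other.
            return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

        # One convention: count candidate brackets crossing some gold bracket.
        crossings = sum(1 for c in candidate if any(crossing(c, g) for g in gold))
        return precision, recall, crossings

    # Toy example: the candidate shares one of three gold bracketings, and one
    # of its brackets crosses the gold-standard bracketing.
    gold = {(0, 2), (2, 5), (0, 5)}
    cand = {(1, 3), (0, 5)}
    print(geig_scores(cand, gold))      # (0.5, 0.3333333333333333, 1)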

 

By contrast, leaf-ancestor assessment scores the analysis of each terminal element (commonly, word) in a candidate parse-tree by comparing the sequences of node-labels on the paths linking the terminal element to the root node in gold-standard and candidate analyses respectively, crediting the terminal element with twice the number of matching labels as a proportion of all the labels in the two paths.  The score assigned to an entire sentence is the mean of the scores for its individual words.

 

To give a simple illustration of one reason why the leaf-ancestor metric might be preferred to the GEIG assessment system, consider a sentence such as The weather forecast today was poor.  Suppose one’s gold-standard scheme makes today a postmodifier of forecast, so that the first constituent of a correct parse will be the four-word noun phrase the weather forecast today, as a paraphrase of “today’s weather forecast”; and suppose a candidate parse instead treats today as an adverbial sentence adjunct, sister of a three-word noun phrase the weather forecast.  Under the GEIG approach, the candidate parse will receive no credit at all for categorizing the words the weather forecast as belonging to a noun phrase; the three-word bracketing fails to match any bracketing in the gold-standard parse.  Under leaf-ancestor assessment, these three words will be treated as correctly parsed, and only the score for today will pull the average for the sentence down.  This might seem intuitively a more reasonable reflection of the extent to which the candidate parse is wrong.

 

Note that this contrast between GEIG and leaf-ancestor approaches is independent of choice of particular gold-standard analyses.  If the reader holds a view of English grammar according to which today really is a sister rather than a daughter of the weather forecast constituent in the sentence quoted, then that is his gold standard, but the question would then be what credit should be given to a candidate parse which returns the weather forecast today as a single noun phrase constituent; again, GEIG will give no credit, leaf-ancestor assessment will give high marks to the first three words and a lower score only to today.  Likewise, the contrast between assessment techniques is in principle independent of the particular alphabet of node-labels used, which may be limited to a few basic categories such as noun phrase, adjective phrase, etc., or may include fuller details, for instance “singular noun phrase functioning as subject”.  (However, if node-labels are detailed, we shall see below that the leaf-ancestor assessment technique can be refined by incorporating concepts of partial matching between node labels.)

 

Participants in the Carroll, Basili, et al. (1998) workshop repeatedly found it necessary to draw a distinction between extrinsic and intrinsic parse-assessment methods.  Extrinsic methods are those intended to compare the performance of separate parsing systems, designed by different groups with their own ideas about linguistic structure.  Intrinsic methods (of which leaf-ancestor assessment is one) are methods intended to be applied in situations where a particular gold standard of “correct” parsing is accepted, and the aim of assessment is to quantify how closely the output of a parser approximates to the agreed target.  Extrinsic assessment imposes heavy constraints on the evaluation metric; only those aspects can meaningfully be evaluated which are common to different groups’ gold standards, and in the current state of linguistic theorizing this will hardly extend to more than the location of (some) constituent boundaries – different groups’ alphabets of node labels are very diverse.  Consequently the GEIG assessment scheme, which lays exclusive or at least heavy emphasis on locating constituent boundaries, may well be the only sort of scheme that can appropriately be used for the Parseval programme.  But all scientific linguists agree that there is much more to grammatical structure than is represented by an unlabelled bracketing, even though they disagree about the detailed nature of that extra information.  Linguists who work with rich annotation schemes that express fuller accounts of linguistic structure need delicate metrics to measure the accuracy of such annotation; these metrics ought not to be distorted by the special constraints associated with competition between groups using diverse annotation schemes.

 

 

2.  Leaf-ancestor assessment defined

 

Before entering into a detailed comparison between leaf-ancestor and GEIG schemes, I shall define leaf-ancestor assessment more precisely than I have done above.

 

First, some measure of similarity between node-labels is chosen.  In the simple case, this measure will assign the value 1 to identical labels and 0 to nonidentical labels; more refined versions of leaf-ancestor assessment can use other mappings from the class of label-pairs into the interval [0, 1].

 

For each element in a string whose parse is to be assessed, the procedure establishes the lineages of that element with respect to candidate and gold-standard trees, that is, the sequences of node-labels found on the path between leaf and root nodes in the respective trees.  (The examples displayed below use the convention of placing leaf and root ends of a lineage to left and right respectively across the page.)

 

Precisely which labels are included at the two ends of a lineage will depend on what structural information has to be discovered by a parser rather than being given in advance; for purposes of leaf-ancestor assessment, lineages should include all of the former information and none of the latter.  If a parsing scheme assigns the same initial symbol S to the root of every well-formed tree, it would be appropriate for root labels to be omitted from lineages.  If the parser has to discover the wordtags (part of speech classes) of the input words, these labels will be included at the leaf ends of lineages, but if the parser is given wordtags as input data, then lineages will include only higher-level labels.

 

Similarly, although in the most usual case the string elements whose lineages are considered will be words, that need not always be so.  If one of the analytic tasks to be executed by the parser is defining the segmentation of a character-string into words (what is often referred to by computer scientists as “tokenization”, though this term seems to embody a misunderstanding of the type/token distinction), then the leaves of candidate and gold-standard parsetrees will correspond to individual characters, and words will be associated with the nodes at the first level above the leaves.

 

To ensure that a set of lineages fully defines the structure of a tree, extra symbols must be inserted into some lineages.  Whenever a leaf is the first leaf dominated by a nonterminal that dominates more than one leaf, a left-boundary symbol is inserted, as a separate element of the lineage for that leaf, immediately before the label of the highest node in which that leaf is initial.  Likewise, a right-boundary symbol is inserted in the lineage of a leaf immediately after the label of the highest node for which that leaf is the last of multiple dominated leaves, when there is such a node.  (Without this proviso, a set of lineages for the elements of a string would fail to distinguish between structures in which, say, a nonterminal labelled A was realized as a single daughter labelled B which in turn was realized as four leaves p q r s,  and a similar structure in which the A node was realized as two B nodes respectively realized as p q and as r s.)
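
The following sketch illustrates how lineages with boundary symbols can be extracted (the code, the tree encoding – a tree is either a word or a (label, children) pair – and the function names are my own, not part of the scheme); applied to a fragment of the gold-standard tree from the worked example of section 3 below, it reproduces the opening lines of Figure 3.

    def lineages(tree):
        """Return (word, lineage) pairs; each lineage runs leaf-to-root, with
        '[' and ']' boundary symbols inserted by the rule described above."""
        records = []                        # (word, leaf index, ancestor chain)

        def walk(node, pos, ancestors):
            if isinstance(node, str):       # a leaf (a word)
                records.append((node, pos, ancestors))
                return 1
            label, children = node
            info = {"label": label, "start": pos}    # leaf span; end set below
            count = 0
            for child in children:
                count += walk(child, pos + count, [info] + ancestors)
            info["end"] = pos + count       # end exclusive
            return count

        walk(tree, 0, [])
        result = []
        for word, i, chain in records:      # chain runs parent, ..., root
            multi = lambda n: n["end"] - n["start"] > 1
            # highest multi-leaf ancestors in which this leaf is first / last
            opener = next((n for n in reversed(chain) if multi(n) and n["start"] == i), None)
            closer = next((n for n in reversed(chain) if multi(n) and n["end"] == i + 1), None)
            lineage = []
            for n in chain:
                if n is opener:
                    lineage.append("[")
                lineage.append(n["label"])
                if n is closer:
                    lineage.append("]")
            result.append((word, lineage))
        return result

    # A fragment of the gold-standard tree from section 3 (wordtags omitted);
    # the full sentence would of course continue after 'was'.
    fragment = ("S", [
        ("Ns", ["the", "closest", "thing",
                ("P", ["to", ("Ns", ["a", "home"])])]),
        ("Vsb", ["was"]),
    ])
    for word, lin in lineages(fragment)[:6]:
        print(word, " ".join(lin))
    # the Ns [ S   /   closest Ns S   /   ...   /   home Ns P Ns ] S  (cf. Figure 3)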

 

If a candidate parsetree is partly correct but has errors, then commonly the candidate and gold-standard lineages of a string element will share some labels, which will appear in the same order in both lineages, but which will be interspersed in one or both lineages with nonmatching labels.  The assessment procedure measures the similarity between candidate and gold-standard lineages for a given string element by considering the class of possible order-preserving mappings from some subset of the candidate lineage elements onto some subset of the gold-standard lineage elements.  (If A₁ B₂ C₃ D₄ E₅ F₆ is a candidate lineage and A₁ C₂ E₃ F₄ D₅ is the corresponding gold-standard lineage, the mapping (1,1), (3,2), (5,3), (6,4), which links some of the identically-labelled node pairs, preserves ordering, but the mapping (1,1), (3,2), (4,5), (5,3), (6,4), which links all of them, does not preserve ordering:  the link joining the D’s crosses the links between the E’s and between the F’s.)  Any such mapping is assigned the sum of the label-similarity scores for the linked label-pairs; in the case of the order-preserving mapping just quoted, assuming the simple identical/nonidentical label-similarity metric, the sum is 4.  The similarity of two lineages over a leaf is defined as twice the value of the highest-scoring order-preserving mapping, divided by the total number of elements in the two lineages; for the case just discussed, this will be (2 × 4) ÷ (6 + 5), or 0.73 to two significant figures.  The value of an entire candidate parsetree is defined as the mean similarity of the lineage-pairs for its respective leaves.  Any candidate parse must then receive a score between 0 and 1, with 1 reserved for totally-accurate parses.
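
The definition can be made concrete with the following sketch (my own code, not an official implementation of the metric); the highest-scoring order-preserving mapping is found with a standard dynamic programme of the longest-common-subsequence family, which also accommodates graded label similarities.

    # Sketch of the lineage-similarity computation just defined.  The best
    # order-preserving mapping is found by a longest-common-subsequence-style
    # dynamic programme.

    def label_sim(a, b):
        """Simple measure: 1 for identical labels, 0 for nonidentical ones."""
        return 1.0 if a == b else 0.0

    def lineage_similarity(cand, gold, sim=label_sim):
        m, n = len(cand), len(gold)
        # best[i][j] = value of the best order-preserving mapping between the
        # first i candidate labels and the first j gold-standard labels
        best = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                best[i][j] = max(best[i - 1][j],
                                 best[i][j - 1],
                                 best[i - 1][j - 1] + sim(cand[i - 1], gold[j - 1]))
        return 2 * best[m][n] / (m + n)

    def parse_score(cand_lineages, gold_lineages, sim=label_sim):
        """Mean lineage similarity over the leaves of a candidate parsetree."""
        scores = [lineage_similarity(c, g, sim)
                  for c, g in zip(cand_lineages, gold_lineages)]
        return sum(scores) / len(scores)

    # The example from the text: candidate A B C D E F against gold A C E F D.
    print(round(lineage_similarity(list("ABCDEF"), list("ACEFD")), 2))   # 0.73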

 

 

3.  A worked example

 

As an example, consider a misparsing of the sentence The closest thing to a home was a string hammock and, when it rained, some palm fronds draped over sticks, in which the word draped has been misinterpreted as an active past tense having some palm fronds as its subject, and the wording from and onwards is treated as a clause co-ordinated with the noun phrase a string hammock.  (The example is adapted from Voutilainen 1998.)  In the SUSANNE grammatical annotation scheme (Sampson 1995) the gold-standard and candidate parses will be as in Figures 1 and 2 respectively.  (To simplify discussion of the example, I omit functiontags identifying constituents as subject, time adjunct, etc., and I omit all wordtags, though the alternative interpretations of draped as past tense or past participle would correspond to different wordtaggings in the SUSANNE scheme and in most other schemes known to me.) 

 

FIGURES 1 AND 2 ABOUT HERE

 

The lineages of the successive words are shown in Figure 3.  Until hammock, lineages are identical in candidate and gold standard; from that point on, candidate lineages are shown to the right of gold-standard lineages, and unmatched symbols in the lineage-pairs are printed in italics.  The last column gives the score for successive leaf nodes.  For instance, in the lineages for the last word, sticks, there are four matched pairs of symbols among eleven symbols all told, giving a score of 0.73 for this word; the mean score for the parsetree is 0.82.
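
For instance, feeding the two lineages of sticks from Figure 3 into the lineage_similarity sketch of section 2 reproduces the score in the last column.

    # The last line of Figure 3, scored with the lineage_similarity sketch from
    # section 2; the symbols are copied straight from the figure.
    gold = ["P", "Tn", "Np+", "N", "S", "]"]
    cand = ["P", "S+", "N", "S", "]"]
    print(round(lineage_similarity(cand, gold), 2))    # 0.73, as in Figure 3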

 

FIGURE 3 ABOUT HERE

 

Note that this scoring system not only yields an assessment for a candidate parsetree as a whole, but also provides indications of where the parser is failing:  the scores imply that the worst-parsed word in this example is draped, which seems intuitively correct.  The GEIG system offers nothing comparable.

 

An obvious refinement would be to use a node-label similarity measure which gave partial credit for partly-matching labels.  In the SUSANNE annotation scheme, the second and subsequent characters of a node-label normally represent subcategorizations of the basic category represented by the first character; hence, for this scheme, a crude but reasonable way to measure node-label similarity might be to score a label-pair as the number of characters in the maximal identical initial substrings of the two labels, divided by the total number of characters in the labels.  The words some palm fronds are analysed as part of a conjoined plural noun phrase (Np+) in the gold standard, and as a non-conjoined plural noun phrase (Np) in the candidate; by the label-similarity measure just described, the score for this label-pair would rise from zero to 4/5, and hence the scores for some and fronds, and for palm, would rise from 0.5 and 0.57 to 0.6 and 0.69 respectively.
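
One possible rendering of this measure in code (my own interpretation of the description, counting the matching initial substring once in each label, so that identical labels still score 1) is the following.

    # Sketch of a prefix-based label-similarity measure: count the characters
    # of the maximal identical initial substrings in both labels and divide by
    # the total number of characters in the two labels.

    def prefix_label_sim(a, b):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return 2 * k / (len(a) + len(b))

    print(prefix_label_sim("Np+", "Np"))    # 0.8, i.e. 4/5
    print(prefix_label_sim("Ns", "Ns"))     # 1.0: identical labels score fully
    print(prefix_label_sim("Vn", "Vd"))     # 0.5: partial credit for verb labels

Such a function can be passed as the sim argument of the lineage_similarity sketch in section 2, so that partial credit accrues wherever candidate and gold-standard labels share a basic category.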

 

 

4.  Advantages of leaf-ancestor assessment

 

One reason for preferring leaf-ancestor assessment to the GEIG approach is that the former quantifies success at what seems intuitively to be the essence of parsing, namely determining what sort of large objects are constituted by the small objects which can be directly observed.  By contrast, under the GEIG approach “it is unclear how much of a success (or failure) it is to achieve high (or low) scores” (Carroll, Briscoe, & Sanfilippo 1998: 449); “it is unclear as to how the score on this metric relates to success in parsing” (Srinivasan et al. 1998).  Lin (1996: 13-14) argues that “phrase boundaries do not have much to do with the meaning of a sentence … parse trees should be evaluated according to more semantically relevant features”; he gives an example where evaluation based on locating phrase boundaries gives a good score for what is in reality a very poor analysis. 

 

Conversely, Carroll & Briscoe (1996: 97) and Lin (1998: 99-100) point out that a poor GEIG score may be produced by a relatively trivial parsing error:  for instance, attaching one constituent many levels in a parsetree above or below the point at which it is attached in the gold-standard parse will give many crossing brackets, though the error may seem minor in the case, say, of a brief aside inserted between dashes.  Under leaf-ancestor assessment, misattaching one constituent will affect the scores only of the words in that constituent, so it can affect the overall average for the sentence to only a limited extent.  The longer the misattached constituent, the more the average will be dragged down, and this is as it should be:  a larger proportion of the input has been misanalysed.

 

Leaf-ancestor assessment has further advantages.  The GEIG approach is tied to the phrase-structure grammar formalism which has traditionally been used in the English-speaking world; Carroll & Briscoe (1998) point out that it cannot easily be adapted to parsing schemes using the dependency formalism which has been described as “the ‘indigenous’ syntactic theory of Europe” (Hudson 1990: 107) and which seems to be used increasingly by British and North American computational linguists also.  (Conversely, the evaluation metric recently proposed by Lin Dekang (Lin 1998) is tied to dependency grammar, and cannot be applied to phrase-structure notation.)  Leaf-ancestor assessment, on the other hand, can be applied equally well either to dependency parsing or to phrase-structure parsing.  Figures 4 and 5 respectively display gold-standard and candidate parses, expressed in the Functional Dependency Grammar notation of Timo Järvinen & Pasi Tapanainen (http://www.ling.helsinki.fi/~tapanain/dg/), for the example sentence used above; the candidate parse has errors equivalent to those of the phrase-structure analysis in Figure 2.  Figure 6 compares gold-standard and candidate lineages, in a manner parallel to Figure 3 for the phrase-structure case; in Figure 6, the lineage for each leaf begins with a function label identifying the semantic relationship between the leaf and its head, and continues with a series of integers identifying the chain of dependent/head relationships between the leaf and the head word of the sentence.  (Thus, word 1, the, is dependent on word 3, thing, which is dependent on word 7, was, which is the sentence head.)  The mean score for the candidate parse is 0.89.

 

FIGURES 4, 5, AND 6 ABOUT HERE

 

Figure 6 expresses information redundantly; the succession of integers after the first in each lineage is determined by the pairings of word-number with the number of the word’s head.  (For instance, because line 3 shows that thing is dependent on word 7, it is guaranteed that any lineage containing 3 will contain 7 in the next position.)  Consequently it might be better, in applying leaf-ancestor assessment to dependency structures, to omit all but the first integer from lineages.  (In this version, leaf-ancestor assessment becomes similar to Lin’s technique mentioned above; but that technique is restricted to dependency notation, whereas we have seen that leaf-ancestor assessment applies equally to phrase-structure parsetrees, at present by far the more widely-used formalism.)  The comma on line 12 would then have alternative lineages “tmp 20” v. “tmp 19” for a score of 0.5; fronds on line 19 would have “subj 20” v. “cc 7” for a score of zero; and so forth.  The principle remains the same:  a candidate analysis is scored by averaging, over successive words, the extent to which the words are correctly fitted into the overall structure.
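
Either variant can be scored with the machinery already sketched; for example, applying the lineage_similarity function of section 2 to line 12 of Figure 6 gives the full-lineage score quoted in the figure and the truncated-lineage score just mentioned.

    # Line 12 of Figure 6, scored first with full lineages and then with the
    # truncated variant in which only the first integer is retained.
    print(round(lineage_similarity(["tmp", "19", "7", "0"],
                                   ["tmp", "20", "7", "0"]), 2))        # 0.75
    print(round(lineage_similarity(["tmp", "19"], ["tmp", "20"]), 2))   # 0.5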

 

Another advantage of leaf-ancestor assessment relates to applications of NLP technology.  Several authors note that the GEIG system is not well-geared to evaluating parsers in terms of usefulness for particular NLP applications.  According to Srinivasan et al. (1998), “Applications that use a parser may be interested in specific structure, Noun Phrases, Appositives, Predicative Nominatives, Subject-Verb-Object relations and so on.  The [GEIG] metric is not fine-grained enough to evaluate parses with respect to specific syntactic phenomena.  It divorces the parser from the application the parser is embedded in”.  Basili, Pazienza, & Zanzotto (1998) point out that a particular application in the domain of information extraction might be “crucially dependent on the ability of the parsing system to capture verbal dependencies (e.g. argumental PPs or temporal expressions)”, and not on other aspects of grammar.  When an application depends on getting a limited subset of parse decisions right, leaf-ancestor assessment can readily be tuned to give predominant weight to those aspects of analysis.  For instance, in totalling the label-pair similarity figures to arrive at a value for a lineage-pair, weighting factors could be applied to increase the contribution from pairs in which the gold-standard label is a “relevant” category, and reduce the contribution from other pairs.
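
As an illustration only (the choice of “relevant” categories and of weights below is my own, not part of the proposal itself), the lineage-similarity computation of section 2 can be modified so that label-pairs whose gold-standard member belongs to an application-relevant set contribute more heavily, with the normalization adjusted so that a fully correct parse still scores 1.

    # Illustrative sketch of application-specific weighting; the "relevant"
    # label set and the weights are assumptions made for the example.

    def weighted_lineage_similarity(cand, gold, relevant,
                                    w_relevant=1.0, w_other=0.25):
        def weight(label):
            return w_relevant if label in relevant else w_other

        def sim(c, g):                  # weighted identical/nonidentical measure
            return weight(g) if c == g else 0.0

        m, n = len(cand), len(gold)
        best = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):       # same dynamic programme as in section 2
            for j in range(1, n + 1):
                best[i][j] = max(best[i - 1][j], best[i][j - 1],
                                 best[i - 1][j - 1] + sim(cand[i - 1], gold[j - 1]))
        # Normalize by the total weight of both lineages rather than their raw
        # length, so that a completely correct parse still scores 1.
        total = sum(weight(x) for x in cand) + sum(weight(x) for x in gold)
        return 2 * best[m][n] / total if total else 1.0

    # Two hypothetical candidate lineages, each with a single wrong label.
    relevant = {"N", "Ns", "Np", "Np+"}             # noun-phrase categories
    gold = ["Np", "Fa", "S"]
    print(round(weighted_lineage_similarity(["Ns", "Fa", "S"], gold, relevant), 2))  # 0.33
    print(round(weighted_lineage_similarity(["Np", "Fc", "S"], gold, relevant), 2))  # 0.83

Under the unweighted measure the two candidate lineages in this toy example would both score 0.67; with noun-phrase labels treated as the relevant categories, an error on the noun-phrase label is punished far more severely (0.33) than an otherwise comparable error on another label (0.83).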

 

Not all phrase-structure parsing schemes represent sentence analyses entirely in terms of labelled trees over the words of a sentence.  Often, a parsing scheme expresses differences between surface and logical grammatical structure by inserting “trace” or “ghost” elements, corresponding to no written or spoken material, as extra leaf elements in the string under analysis.  (For instance, in the full version of the SUSANNE parsing scheme which includes indications of logical grammar, the past-participle clause at the right-hand side of Figure 1 would begin with a ghost element labelled to show that fronds is underlyingly the subject of draped.)  This means that candidate and gold-standard analyses will not necessarily cover identical strings of leaves; but leaf-ancestor assessment can be adapted to this situation in a natural manner by treating any trace element occurring in one parse but not the other as a word whose lineage-similarity score is zero.

 

As we saw above, a further advantage of the leaf-ancestor metric is that it gives meaningful scores for individual elements of an input string, as well as an overall score for a parse-tree.

 

 

5.  Possible disadvantages

 

Turning to negative considerations, the one objection raised against leaf-ancestor assessment by Carroll, Briscoe, & Sanfilippo (1998: 449) is that it can be difficult to know how to use a percentage measure of partial parsing accuracy, as opposed to a simple verdict of identity or nonidentity between candidate and gold-standard parse:  “for instance with a ‘80% correct’ tree, how important is the remaining 20% to the correct interpretation of the sentence?”  This is a fair point, but it does not discriminate between leaf-ancestor and GEIG assessment techniques:  both deliver quantitative measures of partial accuracy.  Furthermore, in situations where it is possible to specify which categories of parse error matter and which are relatively trivial, it should be possible under leaf-ancestor assessment to tune the node-label similarity measure, or to weight the contributions to lineage-similarity computations, so that the former errors are punished more severely than the latter in determining overall parse scores.

 

Another objection to leaf-ancestor assessment might be that the need to optimize mappings between subsets of candidate and gold-standard lineage elements makes this technique relatively computationally intensive.  For any pair of lineages to be compared, having respectively m and n elements, the maximum score has in principle to be sought over all order-preserving mappings from subsets of one lineage onto subsets of the other, the number of which grows rapidly with m and n.  But evaluating parse accuracy will not normally be a function executed in real time, so this is unlikely to be an issue in practice.  (In particular cases, less computation may be needed; for instance, when node-label pairs are evaluated simply as identical or nonidentical, candidate and gold-standard lineage-pairs can be evaluated very efficiently.)

 

In any case, there is no virtue in a metric being easy to calculate, if it measures the wrong thing.  The main point of this paper is to urge that leaf-ancestor assessment delivers figures which succeed in quantifying human perceptions of relatively accurate or inaccurate parsing, whereas GEIG figures are only remotely related to parsing accuracy.  If that is true, considerations of computational cost are scarcely relevant.

 

The GEIG metric was criticized above as focusing unduly on the location of constituent boundaries.  Leaf-ancestor assessment might be criticized for giving greater weight to structure in the higher than in the lower reaches of parsetrees.  Because higher nodes dominate more words than the nodes below them, a labelling error high in a tree will drag down the average value assigned to the tree further than will an otherwise similar error low in the tree.  Since higher structure is relatively remote from observation, leaf-ancestor assessment might be seen as a conservative assessment technique:  it gives greater weight to what is harder to establish.  It could be argued that this bias is an imperfection. 

 

Nevertheless, I suggest that leaf-ancestor assessment comes closer than other extant parse-evaluation techniques to formalizing our pretheoretic intuitions about degrees of parsing accuracy.

 

 


References

 

Basili, R., Maria Teresa Pazienza, & F.M. Zanzotto (1998)  “Evaluating a robust parser for Italian”, in Carroll, Basili, et al. (1998).

Black, E., S. Abney, et al. (1991)  “A procedure for quantitatively comparing the syntactic coverage of English grammars”, in Proceedings of the Speech and Natural Language Workshop, DARPA, February 1991, Pacific Grove, Calif., Morgan Kaufmann, pp. 306-11.

Black, E., R.G. Garside, et al., eds. (1993)  Statistically-driven Computer Grammars of English: the IBM/Lancaster Approach.  Amsterdam: Rodopi.

Briscoe, E.J. & J.A. Carroll (1996)  “A probabilistic LR parser of part-of-speech and punctuation labels”.  In Jenny Thomas & M.H. Short, eds., Using Corpora for Language Research, London: Longman, pp. 135-50.

Carroll, J.A., R. Basili, et al., eds. (1998)  Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation, Granada, Spain, 26 May 1998.

Carroll, J.A. & E.J. Briscoe (1996) “Apportioning development effort in a probabilistic LR parsing system through evaluation”, in E. Brill & K. Church, eds., Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, 17-18 May 1996, pp. 92-100.

Carroll, J.A., E.J. Briscoe, & A. Sanfilippo (1998)  “Parser evaluation: a survey and a new proposal”.  In Rubio et al. (1998), pp. 447-54.

Collins, M.J. (1996)  “A new statistical parser based on bigram lexical dependencies”, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 24-27 June 1996, Santa Cruz, California, pp. 184-91.

Gaizauskas, R., M. Hepple, & C. Huyck (1998)  “A scheme for comparative evaluation of diverse parsing systems”.  In Rubio et al. (1998), pp. 143-9.

Grishman, R., C. Macleod, & J. Sterling (1992)  “Evaluating parsing strategies using standardized parse files”, in Proceedings of the Third Conference on Applied Natural Language Processing, 31 March–3 April 1992, Trento, Italy, pp. 156-61.

Hudson, R.A. (1990)  English Word Grammar.  Oxford: Blackwell.

Lin Dekang (1996)  “Dependency-based parser evaluation: a study with a software manual corpus”.  In Sutcliffe et al. (1996), pp. 13-24.

Lin Dekang (1998)  “A dependency-based method for evaluating broad-coverage parsers”.  Natural Language Engineering 4.97-114.

Magerman, D. (1995)  “Statistical decision-tree models for parsing”, in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Massachusetts Institute of Technology, 26-30 June 1995, pp. 276-83.

Rubio, A., et al., eds. (1998)  Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 28-30 May 1998.

Sampson, G.R. (1995)  English for the Computer.  Oxford: Clarendon Press.

Sampson, G.R. (1996)  Evolutionary Language Understanding.  London: Cassell.

Sampson, G.R., R. Haigh, & E.S. Atwell (1989)  “Natural language analysis by stochastic optimization: a progress report on Project APRIL”.  Journal of Experimental and Theoretical Artificial Intelligence 1.271-87.

Srinivasan, B. [printed as “Bangalore, S.”], A. Sarkar, Christine Doran, & Beth Ann Hockey (1998)  “Grammar and parser evaluation in the XTAG project”, in Carroll, Basili, et al. (1998).

Sutcliffe, R.F.E., H.-D. Koch, & Annette McElligott, eds. (1996)  Industrial Parsing of Software Manuals.  Amsterdam: Rodopi.

Voutilainen, A. (1998)  “Helsinki taggers and parsers”.  Paper presented at the 19th Annual Meeting of the International Computer Archive of Modern English (ICAME19-98), Newcastle, Co. Down.


the      Ns  [   S                                              1.0

closest  Ns  S                                                  1.0

thing    Ns  S                                                  1.0

to       [   P   Ns  S                                          1.0

a        [   Ns  P   Ns  S                                      1.0

home     Ns  P   Ns  ]   S                                      1.0

was      Vsb S                                                  1.0

a        [   N   S                                              1.0

string   N   S                                                  1.0

hammock  N   S                                                  1.0

and      [   Np+ N   S                [   S+  N   S             0.75

+,       Np+ N   S                    S+  N   S                 0.67

when     Rq  [   Fa  Np+ N   S        Rq  [   Fa  S+  N   S     0.83

it       Ni  Fa  Np+ N   S            Ni  Fa  S+  N   S         0.8

rained   Vd  Fa  ]   Np+ N   S        Vd  Fa  ]   S+  N   S     0.83

+,       Np+ N   S                    S+  N   S                 0.67

some     Np+ N   S                    [   Np  S+  N   S         0.5

palm     Np+ N   S                    Np  S+  N   S             0.57

fronds   Np+ N   S                    Np  ]   S+  N   S         0.5

draped   Vn  [   Tn  Np+ N   S        Vd  S+  N   S             0.4

over     [   P   Tn  Np+ N   S        [   P   S+  N   S         0.73

sticks   P   Tn  Np+ N   S   ]        P   S+  N   S   ]         0.73

 

Figure 3

 

 

1 the       det   3  7  0                                       1.0

2 closest   attr  3  7  0                                       1.0

3 thing     subj  7  0                                          1.0

4 to        mod   3  7  0                                       1.0

5 a         det   6  4  3  7  0                                 1.0

6 home      pcomp 4  3  7  0                                    1.0

7 was       main  0                                             1.0

8 a         det   10  7  0                                      1.0

9 string    attr  10  7  0                                      1.0

10 hammock  comp  7  0                                          1.0

11 and      cc    7  0                                          1.0

12 +,       tmp   20  7  0            tmp   19  7  0            0.75

13 when     tmp   15  20  7  0        tmp   15  19  7  0        0.8

14 it       subj  15  20  7  0        subj  15  19  7  0        0.8

15 rained   tmp   20  7  0            tmp   19  7  0            0.75

16 +,       tmp   20  7  0            tmp   19  7  0            0.75

17 some     det   19  20  7  0        det   19  7  0            0.89

18 palm     attr  19  20  7  0        attr  19  7  0            0.89

19 fronds   subj  20  7  0            cc    7  0                0.57

20 draped   cc    7  0                mod   19  7  0            0.57

21 over     ha    20  7  0            ha    20  19  7  0        0.89

22 sticks   pcomp 21  20  7  0        pcomp 21  20  19  7  0    0.91

 

Figure 6