The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version.

Published in Nelleke Oostdijk & P. de Haan, eds., Corpus-Based Research into Language, Rodopi (Amsterdam), 1994.


 

SUSANNE:  A Domesday Book of English Grammar

 

 

Geoffrey Sampson

 

School of Cognitive and Computing Sciences

University of Sussex

Falmer, Brighton BN1 9QH, England

 

 

 

 

 

INTRODUCTION

 

 

The SUSANNE Corpus has been created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive taxonomy and annotation scheme for the (logical and surface) grammar of English for NLP (natural language processing) purposes.[1]  Copies are now available to the research community freely and without formalities.  Release 1 of the Corpus has been distributed via anonymous ftp over the Internet by the Oxford Text Archive since October 1992; after six months, messages received from users show that it is by now in use in a variety of academic and commercial research environments in many countries on at least four continents.  (The procedure for acquiring a copy is detailed in the Appendix.)

 

The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and the boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.[2]  The SUSANNE scheme may be likened to a ‘Linnaean taxonomy’ of the grammatical domain:  its aim (comparable to that of Linnaeus’s eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers’ linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for NLP researchers to register everything that occurs in real-life usage systematically and unambiguously, without misunderstandings over local uses of analytic terminology.

 

Alternatively, one may liken the SUSANNE analytic scheme to the Domesday Book commissioned by William the Bastard after his conquest of England:  the scheme describes English grammar as Domesday describes eleventh-century English geography, not discursively or with attention to human interest, but comprehensively and in a terse, systematic format which specifies just enough information to permit the application of consistent procedures (in the Domesday case, taxation procedures).

 

There are numerous reasons why taxonomic work of this kind is a high priority at the current juncture in the history of natural language processing.  Such work is needed both to facilitate the development of more adequate NLP systems, and to create a greater level of sophistication in the user community about the systems available.

 

By offering a comprehensive check-list of phenomena which a fully-adequate NLP system needs to be able to handle (which include many linguistic structures commonly ignored by theoretical linguistics – consider for instance addresses, weights and measures, the placement of punctuation marks within grammatical structures, all very significant for practical natural language processing but scarcely visible within orthodox linguistic descriptions), a taxonomy enables the system builder to monitor what areas of the total task he has covered and to focus his efforts on major gaps.  And by publicly specifying a ‘default’ analysis for every construction, a taxonomy enables the system builder to put effort into defining alternative analytic norms only where he has positive reasons for diverging from the default analysis – at present, for lack of a public taxonomy, each research group must define its analytic standards independently from the ground up, or else (as often happens) leave them vague in many respects.

 

At the same time, a public taxonomy facilitates the definition of objective benchmarks allowing the achievements of particular NLP systems to be measured and expressed in terms that are generally understood:  thus encouraging the replacement of inferior by superior systems, and enabling potential clients for the technology to assess in advance the scope of systems they are thinking of investing in.  These developments are essential if natural language processing is to complete the transition from the status of an academic pastime into a mature component of the information technology industry.  Cf. Sampson (1992, forthcoming).

 

The SUSANNE analytic scheme is defined in detail in a book by myself, English for the Computer, forthcoming from Oxford University Press.  The Chairman of the Analysis and Interpretation Working Group of the US/EC-sponsored Text Encoding Initiative has proposed its adoption as a recognised TEI standard.  The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these.

 

Regrettably, Release 1 of the SUSANNE Corpus is not a ‘TEI-conformant’ resource, though aspects of the annotation scheme have been decided in such a way as to facilitate a move to TEI conformance in later releases.  The working timetable of the Initiative meant that relevant aspects of the TEI Guidelines were not yet complete at the point when the SUSANNE Corpus was ready for initial release; delaying this release would have been unfortunate.

 

The brief description of the SUSANNE Corpus contained in the remainder of this article cannot replace the very detailed statements, illustrated with numerous Brown and LOB Corpus examples, to be found in English for the Computer; any user aiming to do serious work with the Corpus or the SUSANNE annotation scheme would probably need to consult the book.  In a sense, the Corpus is pointless without the book.  Nevertheless, prospective users may find a summary statement helpful, as giving an impression of the scope of the analytic scheme.

 

 

 

BACKGROUND

 

 

The present SUSANNE annotation scheme originated in work carried out by myself in collaboration with Professor Geoffrey Leech, F.B.A., and others in the years 1983‑85 to produce a database of manually analysed sentences from the LOB Corpus of written British English; this database, which has not been (and will not now be) published, is described in Garside et al. (1987: ch. 7).  The annotation scheme of this ‘Lancaster-Leeds Treebank’ represented surface grammar only, without indications of logical form.  It subsequently seemed desirable to extend this scheme to include methods for representing logical grammar, and to refine both surface and logical aspects of the annotation scheme by applying it to a larger body of texts.  The only way that a parsing scheme can in practice be made increasingly adequate is in the way that the English Common Law develops, by collecting and systematizing the body of precedents generated through detailed consideration of more and more individual cases that arise in real life.  Accordingly, Project SUSANNE took a subset of the Brown Corpus of written American English which had been manually analysed by Alvar Ellegård’s group at Gothenburg (Ellegård 1978), and reworked the annotations in this under-used resource in order to turn them into a scheme consistent with that used in the Lancaster-Leeds Treebank but including specifications of logical as well as surface structure:  several categories of information not indicated in either Lancaster-Leeds or Gothenburg schemes were also added.[3]

 

The finished SUSANNE parsing scheme has thus been developed on the basis of samples of both British and American English.  It is oriented chiefly towards written language; however, on another project sponsored by the Royal Signals and Radar Establishment[4] my team produced extensions to the SUSANNE scheme for annotating the distinctive grammatical phenomena of spoken English, and these extensions are specified in English for the Computer (though they are not used in the SUSANNE Corpus and are not discussed further here).  It should be noted also that the scheme has emerged through a process of detailed critical discussion of analytic standards by some ten people over a decade; apart from myself, the leading role in the early years of these discussions was taken by Geoffrey Leech, whose standing as an English grammarian needs no emphasis.

 

The SUSANNE Corpus itself comprises an approximately 128,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme.  The original motives for producing this database included that of providing better statistics than any then available[5] for probabilistic automatic-parsing techniques, such as those of my APRIL annealing parser project.[6]  Statistically-based automatic language processing needs data analysed in a very consistent fashion, and hence requires a very explicit analytic scheme.  In terms of quantity of language examples analysed, Project SUSANNE was overtaken after its inception by projects (notably Mitchell Marcus’s Pennsylvania Treebank project, cf. chapter 00 of this volume) which have used quasi-industrial methods to generate far larger bodies of grammatically-analysed material.  However, the SUSANNE scheme may be unparalleled in the extent to which its categories have been refined and tested through detailed consideration of the almost endless small quirks of the texts to which they have been applied, and in the degree of precision to which the resulting guidelines for using the categories have been documented – thus defining analytic standards which permit annotation of future material to be extremely self-consistent.  Accordingly the SUSANNE Corpus is offered to the research community primarily as a demonstration of the application of the parsing scheme, evidencing the fact that the scheme has survived the test of experience rather than being merely aprioristic.  The SUSANNE Corpus functions, as it were, like a collection of type specimens appended to a botanical taxonomy.

 

Although Release 1 of the SUSANNE Corpus has undergone considerable proof-checking, it unquestionably still contains many errors.[7]  I aim to issue future releases correcting these; I shall be extremely grateful if users discovering errors will log them and send me details, preferably by post rather than e-mail.

 

 

 

STRUCTURE OF THE CORPUS

 

 

The SUSANNE Corpus consists of 64 data files together with a documentation file.  Each data file contains an annotated version of one 2000+ word text from the Brown Corpus.  Files average about 83 kilobytes in size, so the entire Corpus totals about 5.3 megabytes.  The data file names are those of the respective Brown texts, e.g. A01, N18; the documentation file is named ‘SUSANNE.doc’.  Sixteen texts are drawn from each of the following Brown genre categories:

 

            A          press reportage
            G          belles lettres, biography, memoirs
            J          learned (mainly scientific and technical) writing
            N          adventure and Western fiction

 

The Corpus thus samples each of the four broad genre groups established on the basis of word-frequency data by Hofland & Johansson (1982: 27).

 

Each data file has a line (terminating in a newline character) for each word of the original text; but ‘words’ for SUSANNE purposes are often smaller than words in the ordinary orthographic sense:  for instance, punctuation marks and the apostrophe-s suffix are treated as separate words and assigned lines of their own.  (For details on the rules by which orthographic words are segmented, as well as on all other analytic matters mentioned below, see English for the Computer.)

 

For an example see Figure 1, which displays a short section from file A10 (part of the analysis of a news report from The Oregonian newspaper).

 

Each line of a SUSANNE data file has six fields separated by tabs (that is, there is one tab after each of fields 1 to 5, but a newline after field 6).  Each field on every line contains at least one character.  The six fields on each line are:

 

            1          reference

            2          status

            3          wordtag

            4          word

            5          lemma

            6          parse
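The record layout just listed lends itself to a very simple reader.  The following is a minimal sketch only, not part of the Corpus distribution:  the names SusanneLine and parse_line are hypothetical, and the sample record is invented for illustration rather than copied from a data file.

```python
from typing import NamedTuple


class SusanneLine(NamedTuple):
    """One Corpus record: the six fields in their fixed order."""
    reference: str
    status: str
    wordtag: str
    word: str
    lemma: str
    parse: str


def parse_line(raw: str) -> SusanneLine:
    """Split one record into its six tab-separated fields."""
    fields = raw.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    if any(field == "" for field in fields):
        raise ValueError("every field must contain at least one character")
    return SusanneLine(*fields)


# Invented record for illustration only; not copied from the Corpus.
sample = "A10:0630e\t-\tVVDi\tseemed\tseem\t[Vd.Vd]\n"
record = parse_line(sample)
print(record.wordtag, record.word)
```

Rejecting empty fields reflects the rule that each field on every line contains at least one character.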

 

Apart from the tab and newline characters used to structure fields and records (that is, lines), all bytes in each of the 64 SUSANNE data files are drawn from a subset of the 94 graphic character allocations of the International Reference Version (‘IRV’) of ISO 646:1983 ‘Information Processing – ISO 7-bit coded character set for information interchange’, from hexadecimal 21 (exclamation mark) to hex 7E (tilde).  These codes are assumed for SUSANNE purposes to represent the graphic symbols assigned by the IRV system.  Twelve members of the IRV character set are never used in the Corpus, namely (all codes hexadecimal):

 

            23        gate

            24        generalized currency unit

            27        prime

            2F        solidus

            5C       reverse solidus

            5E        circumflex

            5F        underline

            60        grave

            7B       opening curly bracket

            7C       vertical bar

            7D       closing curly bracket

            7E        tilde

 

The space character (hex 20), which ISO 646 classifies as a control code, also does not occur in SUSANNE data files.
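The arithmetic of the character set can be checked mechanically:  94 graphic codes less the twelve unused ones leaves 82.  The sketch below works with code positions only (it makes no claim about which glyph a given code displays as), and the function name illegal_characters is hypothetical.

```python
# Graphic code positions of ISO 646 IRV: hex 21 to hex 7E inclusive.
IRV_GRAPHIC = {chr(code) for code in range(0x21, 0x7F)}

# The twelve code positions listed above as never used in the Corpus.
UNUSED_CODES = (0x23, 0x24, 0x27, 0x2F, 0x5C, 0x5E,
                0x5F, 0x60, 0x7B, 0x7C, 0x7D, 0x7E)

ALLOWED = IRV_GRAPHIC - {chr(code) for code in UNUSED_CODES}


def illegal_characters(line: str) -> set[str]:
    """Return the characters of a record that lie outside the permitted set.

    Tab and newline are structural separators and are therefore ignored.
    """
    return {ch for ch in line if ch not in ALLOWED and ch not in "\t\n"}


print(len(IRV_GRAPHIC), len(ALLOWED))
```

A check of this kind is one way of detecting a file that has been damaged in transfer, for instance by the conversion of tabs to spaces.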

 

Where text characters cannot be adequately represented directly within the resulting 82-member character set, they are represented by entity names within angle brackets.  Where possible these are drawn from Appendix D to ISO 8879:1986, ‘Information Processing – Text & Office Systems – Standard Generalized Markup Language (SGML)’.  For instance, <ldquo> stands for opening double inverted commas, <eacute> for lower-case ‘e’ with acute accent.  Symbols in angle brackets are used also to stand for such things as typographical shifts, which for purposes of grammatical analysis are conveniently represented as items within the word-sequence:  e.g. <bital> means ‘begin italics’.

 

 

 

REFERENCE FIELD

 

 

The reference field contains nine bytes which give each line a reference number that is unique across the SUSANNE Corpus, e.g. ‘A10:0630e’.  The first three bytes (here A10) are the file name; the fourth byte is always a colon; bytes 5 to 8 (here 0630) are the number of the line in the ‘Bergen I’ version of the Brown Corpus on which the relevant word appears (Brown line numbers normally increment in tens, with occasional odd numbers interpolated); and the ninth byte is a lower-case letter differentiating successive words that appear on the same Brown line.  (SUSANNE lines are lettered continuously from ‘a’, omitting ‘l’ and ‘o’.)
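A reference of this shape can be decomposed with a regular expression.  This is a sketch under stated assumptions:  the pattern assumes that file names always consist of one capital letter and two digits (as in A01 and N18), and parse_reference is a hypothetical name.

```python
import re
import string

# Assumed shape: file name (letter + two digits), colon, four digits, one letter.
REFERENCE = re.compile(r"^(?P<file>[A-Z]\d{2}):(?P<brown_line>\d{4})(?P<letter>[a-z])$")

# Word letters run continuously from 'a' but omit 'l' and 'o'.
WORD_LETTERS = [c for c in string.ascii_lowercase if c not in "lo"]


def parse_reference(ref: str) -> tuple[str, int, int]:
    """Return (file name, Brown line number, 1-based position of the word on that line)."""
    match = REFERENCE.match(ref)
    if match is None or match["letter"] not in WORD_LETTERS:
        raise ValueError(f"not a valid reference: {ref!r}")
    position = WORD_LETTERS.index(match["letter"]) + 1
    return match["file"], int(match["brown_line"]), position


print(parse_reference("A10:0630e"))
```

Because ‘l’ is skipped, a word lettered ‘m’ is the twelfth rather than the thirteenth word on its Brown line.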

 

 

 

STATUS FIELD

 

 

The status field contains one byte.  The letters ‘A’ and ‘S’ show that the word is an ‘abbreviation’ or ‘symbol’, respectively, as defined by Brown Corpus codes (Francis & Kučera 1989: 12).  The letter ‘E’ shows that the word is (or is part of) a misprint or solecism in the original text (details are logged in English for the Computer).  On the great majority of lines, to which none of these three categories apply, the status field contains a hyphen character.

 

 

 

WORDTAG FIELD

 

 

The SUSANNE wordtag set is based on the ‘Lancaster’ tagset listed in Garside et al. (1987: Appendix B); additional grammatical distinctions have been drawn in this set, and these are indicated by suffixing lower-case letters to the Lancaster tags.  For instance, seemed is tagged VVD (past tense of verb) in the Lancaster scheme, but VVDi (past tense of intransitive – including copular – verb) in the SUSANNE scheme.  Apart from the lower-case extensions, the wordtags are normally identical to the Lancaster tags:  punctuation marks are assigned alphabetical tags beginning Y… (e.g. YC for comma), and the dollar sign which appears in some Lancaster tags for genitive words is replaced by G (e.g. GG for the apostrophe-s suffix), so that the modified Lancaster tags always consist wholly of alphanumeric characters, beginning with two capital letters.  (In a few cases, tags from the Lancaster set have been merged or eliminated from the SUSANNE scheme in the light of experience.)

 

The tag YG appears in the wordtag field to represent a ‘trace’ – the logical position of a constituent which has been shifted elsewhere, or deleted, in the surface grammatical structure.

 

The SUSANNE wordtag set comprises 352 distinct wordtags, not counting tags for elements of ‘grammatical idioms’ (see below); a few of these wordtags are never used in the SUSANNE Corpus.  The wordtags are listed, and their application rigorously defined, in English for the Computer – in the case of closed wordclasses, by enumeration of their members, and in the case of open classes by rules for choice between alternative tags.  These rules refer to information in a specified published dictionary (the Oxford Advanced Learner’s Dictionary of Current English, 3rd edition).

 

 

 

WORD FIELD

 

 

The word field contains a segment of the text, often coinciding with a word in the orthographic sense but sometimes, as noted above, including only part of an orthographic word.  In general the word field represents all and only those typographical distinctions in the original documents which were recorded in the Brown Corpus (Francis & Kučera 1989: 10–15), though in certain cases the SUSANNE Corpus has gone behind the Brown Corpus to reconstruct typographical details omitted from Brown.

 

Certain characters have special meanings in the word field, as follows:

 

+             (occurs only as first byte of the word field) shows that the contents of the field were not separated in the original text from the immediately-preceding text segment by whitespace (e.g. in the case of a punctuation mark, or part of a hyphenated sequence split over successive SUSANNE lines);

 

-             the line corresponds to no text material (it represents the ‘trace’ for a grammatically-moved element);

 

< … >   enclose entity names for special typographical features, as discussed above, either taken from ISO 8879:1986 Appendix D or created for the SUSANNE Corpus – for instance <pand> stands for ‘either plus sign or ampersand’, since the Brown Corpus makes no distinction between these characters.
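These conventions are enough to rebuild an approximation of the running text from successive word fields.  The sketch below assumes that a field consisting solely of a hyphen always marks a trace, and it leaves entity names such as <ldquo> untranslated; join_words is a hypothetical name and the sample sequence is invented.

```python
def join_words(word_fields: list[str]) -> str:
    """Rebuild a rough text string from successive word fields."""
    pieces: list[str] = []
    for field in word_fields:
        if field == "-":
            continue                  # trace: corresponds to no text material
        if field.startswith("+"):
            pieces.append(field[1:])  # no whitespace before this segment
        else:
            if pieces:
                pieces.append(" ")    # ordinary word: separated by whitespace
            pieces.append(field)
    return "".join(pieces)


# Invented sequence for illustration only.
print(join_words(["<ldquo>", "+Yes", "+,", "he", "said", "-", "+."]))
```

The result is only a rough rendering, since the original line breaks and the exact amount of whitespace are not recoverable from the word field alone.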

 

 

 

LEMMA FIELD

 

 

The lemma field shows the dictionary headword of which the text word is a form:  the field shows base forms for words which are inflected in the text, and eliminates typographical variations (such as sentence-initial capitalization) which are not inherent to the word but relate to its use in context.  (In the case of ‘words’ to which the dictionary-form concept is inappropriate, e.g. numerals and punctuation marks, the lemma field contains a hyphen.)  Orthographic forms in the lemma field are those of the Oxford Advanced Learner’s Dictionary of Current English, 3rd edition.

 

Project SUSANNE aimed also to indicate the senses which polysemous words bear in context, via codes relating word-tokens to numbered subsenses in a specified dictionary.  The book English for the Computer provides a detailed coding scheme for representing this information.  Unfortunately, this aspect of the project’s output proved to contain a number of inadequacies, and the information does not appear in Release 1 of the Corpus.  It is hoped to include it in later releases.

 

 

 

PARSE FIELD


The contents of the sixth field represent the central raison d’être of the SUSANNE Corpus.  They code the grammatical structure of texts as a sequence of labelled trees, having a leaf node for each Corpus line.

 

Each text is treated as a sequence of ‘paragraphs’ separated by ‘headings’.  (Figure 1 includes one complete one-sentence paragraph, ending at line A10:0650f, and the first sentence of the following paragraph.)  A ‘paragraph’ normally coincides with an ordinary orthographic paragraph; a ‘heading’ may consist of actual verbal material, or may be merely a typographical paragraph division, symbolized <minbrk> in the word field.  Conceptually, the structure of each paragraph or heading is a labelled tree with root node labelled O (Oh for a heading), and with a leaf node labelled with a wordtag for each SUSANNE word or trace, i.e. each line of the Corpus.  There will commonly be many intermediate labelled nodes.

 

Such a tree is represented as a bracketed string in the ordinary way, with the labels of nonterminal nodes written ‘inside’ both opening and closing brackets (that is, to the right of opening brackets and to the left of closing brackets).  This bracketed string is then adapted as follows for inclusion in successive SUSANNE parse fields.  Wherever an opening bracket immediately follows a closing bracket, the string is segmented, yielding one segment per leaf node; and within each such segment, the sequence opening-bracket + wordtag + closing-bracket, representing the leaf node, is replaced by a full stop.  Thus each parse field contains exactly one full stop, corresponding to a terminal node labelled with the contents of the wordtag field, sometimes preceded by labelled opening bracket(s) and sometimes followed by labelled closing bracket(s), corresponding to higher tagmas which begin or end with the word on the line in question.
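The adaptation is reversible:  substituting the bracketed wordtag for the full stop on each line and concatenating the results restores the full bracketed string.  The following is a minimal sketch of that reversal.  The function name rebuild_bracketing is hypothetical, WT1 is a placeholder wordtag, and the two-line ‘paragraph’ is invented for illustration, not taken from the Corpus.

```python
def rebuild_bracketing(lines: list[tuple[str, str]]) -> str:
    """Restore the bracketed string from (wordtag, parse field) pairs in Corpus order."""
    segments = []
    for wordtag, parse in lines:
        if parse.count(".") != 1:
            raise ValueError(f"parse field must contain exactly one full stop: {parse!r}")
        opening, closing = parse.split(".")
        # The full stop stands for the leaf node: bracket + wordtag + bracket.
        segments.append(f"{opening}[{wordtag}]{closing}")
    return "".join(segments)


# Invented two-word paragraph; WT1 is a placeholder wordtag.
sample = [
    ("WT1", "[O[S[Ns:s.Ns:s]"),
    ("VVDi", "[Vd.Vd]S]O]"),
]
print(rebuild_bracketing(sample))
```

Each nonterminal label appears twice in the restored string, once after its opening bracket and once before its closing bracket, which makes mismatched brackets easy to detect.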

 

Brackets are square except in the case of nodes immediately dominating the ‘trace’ wordtag YG, which are represented with angle brackets.

 

Nonterminal node labels in the SUSANNE scheme contain up to three types of information:  a formtag, a functiontag, and an index, in that order.  In a label containing a formtag and one or both of the other two elements, a colon separates the formtag from the other elements.  A functiontag is always a single alphabetic character, and an index is a sequence of three digits; restrictions on valid combinations of elements within a node label mean that complex labels can always be unambiguously decomposed into their elements.
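A label can be decomposed roughly as follows.  The restrictions that guarantee unambiguous decomposition are not spelled out in this article, so the sketch relies on a heuristic of its own:  a label without a colon is read as a bare functiontag and/or index only if it consists of at most one letter followed by exactly three digits, and otherwise as a bare formtag.  The names NodeLabel and split_label are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class NodeLabel(NamedTuple):
    formtag: Optional[str]
    functiontag: Optional[str]
    index: Optional[str]


# Functiontag (one letter) and index (three digits), each optional.
_TAIL = re.compile(r"^(?P<function>[A-Za-z])?(?P<index>\d{3})?$")


def split_label(label: str) -> NodeLabel:
    """Split a nonterminal label into formtag, functiontag and index."""
    if ":" in label:
        formtag, tail = label.split(":", 1)
    elif re.fullmatch(r"[A-Za-z]?\d{3}", label):
        formtag, tail = None, label   # assumed: label with no formtag
    else:
        return NodeLabel(label, None, None)
    match = _TAIL.match(tail)
    if match is None or not tail:
        raise ValueError(f"cannot decompose label: {label!r}")
    return NodeLabel(formtag or None, match["function"], match["index"])


for example in ("Ns:S152", "Ti:s", "s152"):
    print(example, split_label(example))
```

A label consisting of a single letter is ambiguous under this heuristic and is treated as a formtag; a fuller implementation would use the bracket type to settle such cases, since nodes immediately dominating traces are written with angle brackets.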

 

 

 

RANKS OF CONSTITUENT

 

 

Apart from nodes immediately dominating traces, all nodes have labels including formtags, which identify the internal properties of the word or word-sequence dominated by the node.  The shape of a parse-tree is defined in terms of a hierarchy of formtag ranks:

 

            1          wordlevel formtags (begin with two capital letters; formtags of all other ranks begin with one capital and contain no further capitals)

 

            2          phraselevel formtags (begin with one of:  N V J R P D M G)

 

            3          clauselevel formtags (begin with one of:  S F T Z L A W)

 

            4          rootlevel formtags (begin with one of:  O Q I)

 

Each grammatical clause, whether consisting of one or more words, is given a node labelled with a clauselevel formtag.  Each immediate constituent of a clause, whether there are one or more such constituents and whether the constituent consists of one or more words, is given a node labelled with a phraselevel formtag, unless the constituent belongs to a wordlevel category that has no corresponding phraselevel category (e.g. punctuation marks, existential there, conjunctions), or to a rootlevel category (e.g. a direct quotation, formtagged Q).  Thus a clause consisting just of a verb will be assigned a clauselevel formtag (e.g. Tg for present-participle clause) which singularily dominates a phraselevel formtag (e.g. Vg for ‘verb group beginning with present participle’) which in turn singularily dominates a wordlevel formtag (e.g. VVGi for ‘present participle of intransitive verb’).
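Since rank is signalled by the initial characters of a formtag, it can be read off mechanically.  The sketch below applies only the initial-letter rules given in the hierarchy above; formtag_rank is a hypothetical name.

```python
PHRASE_INITIALS = set("NVJRPDMG")
CLAUSE_INITIALS = set("SFTZLAW")
ROOT_INITIALS = set("OQI")


def formtag_rank(formtag: str) -> str:
    """Classify a formtag as 'word', 'phrase', 'clause' or 'root'."""
    # Wordlevel formtags are the only ones that begin with two capitals.
    if len(formtag) >= 2 and formtag[0].isupper() and formtag[1].isupper():
        return "word"
    initial = formtag[:1]
    if initial in PHRASE_INITIALS:
        return "phrase"
    if initial in CLAUSE_INITIALS:
        return "clause"
    if initial in ROOT_INITIALS:
        return "root"
    raise ValueError(f"unrecognised formtag: {formtag!r}")


for tag in ("Tg", "Vg", "VVGi", "Q"):
    print(tag, formtag_rank(tag))
```

The three tags Tg, Vg and VVGi are those of the one-word clause described above, running from clause rank down to word rank.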

 

Other than by these rules, and in certain other limited circumstances specified in English for the Computer, singulary branching does not occur.  An intermediate phraselevel node is inserted between a higher phraselevel node and a sequence of words dominated by it only if two or more of those words form a coherent constituent within the higher phrase.  A clause which fills a slot standardly filled by a phrase (e.g. a nominal clause as subject or object) will not have a phrase node above the clause node unless the clause proper is preceded and/or followed by modifying elements that are not part of the clause.

 

Detailed rules for deciding constituency in various debatable cases, for placing items such as punctuation marks within parse trees, for extending the application of the categories and structuring rules to linguistic phenomena such as addresses or weights and measures which are not commonly taken into account in linguistic theorizing, and so forth, are laid down in English for the Computer.

 

 

 

FUNCTIONTAGS AND INDICES

 

 

Functiontags, identifying roles such as surface subject, logical object, time adjunct, are assigned to all immediate constituents of clauses, except for their verb-group heads and certain other constituents for which function labelling is inappropriate.

 

Indices are assigned to pairs of nodes to show referential identity between items which are in certain defined grammatical relationships to one another.  Thus, in Figure 1, the sequence feeling that evacuation plans … would not work is given the label Ns:S152, in which the formtag Ns identifies the tagma as a singular noun phrase (note that in context feeling occurs in its nominal use – one might have expected the sentence to read a feeling … , but corpus linguistics takes language as it finds it), the capital S shows that the tagma is surface subject of the seemed clause (in an existential clause the subject, which determines verb agreement, standardly follows the verb), and the index 152 shows that this tagma is identifiable with the logical subject (s152) of the to be clause.  The label Ti:s on this latter clause shows that it is an infinitival clause (formtag Ti), which as a whole, including its displaced logical subject, forms the logical subject (s) of seemed.

 

In some cases, movement rules displace a constituent into a tagma within which it has no grammatical role (for instance, an adverb which is logically a clause constituent may interrupt the verb group – sequence of auxiliary verbs and main verb – which heads the clause):  in such cases the functiontag is G (‘guest’).  Constituents which do not logically belong below the node which immediately dominates them in surface structure are always given G functiontags and indices linking them to their logical position.  With that exception (and with one other exception not discussed here relating to co-ordination), functiontagging is used only for immediate constituents of clauses.

 

English for the Computer lists the types of surface/logical-grammar discordance which are represented by the SUSANNE scheme, and the approved methods of representing them.  The SUSANNE analysis is always chosen so as to be as far as possible neutral as between alternative linguistic theories.

 

 

 

THE FORMTAGS

 

 

The SUSANNE formtags are as follows:

 

            Rootlevel Formtags

 

            O          paragraph

            Oh       heading

            Ot       title (e.g. of book)

            Q          quotation

            I          interpolation

            Iq       tag question

            Iu       scientific citation

 

            Clauselevel Formtags

 

            S          main clause

            Ss       quoting clause embedded within quotation

            Fa       adverbial clause

            Fn       nominal clause

            Fr       relative clause

            Ff       ‘fused’ relative

            Fc       comparative clause

            Tg       present participle clause

            Ti       infinitival clause

            Tn       past participle clause

            Tf       for-to clause

            Tb       ‘bare’ nonfinite clause

            Tq       infinitival relative clause

            Z          reduced (‘whiz-deleted’) relative clause

            L          other verbless clause

            A          special as clause

            W          with clause

 

            Phraselevel Formtags

 

            N          noun phrase

            V          verb group

            J          adjective phrase

            R          adverb phrase

            P          prepositional phrase

            D          determiner phrase

            M          numeral phrase

            G          genitive phrase

 

The various phrase categories take lower-case subcategory symbols which can be combined in any meaningful combination (e.g. the verb group would not work is formtagged Vdce).  The phrase subcategories are:

 

            Vo       operator section of verb group, when separated from remainder of verb group e.g. by subject-auxiliary inversion

            Vr       remainder of verb group from which operator has been separated

            Vm       verb group beginning with am

            Va       verb group beginning with are

            Vs       verb group beginning with was

            Vz       verb group beginning with other 3rd-singular verb

            Vw       verb group beginning with were

            Vj       verb group beginning with be

            Vd       verb group beginning with past tense

            Vi       infinitival verb group

            Vg       verb group beginning with present participle

            Vn       verb group beginning with past participle

            Vc       verb group beginning with modal

            Vk       verb group containing emphatic DO

            Ve       negative verb group

            Vf       perfective verb group

            Vu       progressive verb group

            Vp       passive verb group

            Vb       verb group ending with BE

            Vx       verb group lacking main verb

            Vt       catenative verb group

 

            Nq       wh- noun phrase

            Nv       wh…ever noun phrase

            Ne       I/me head

            Ny       you head

            Ni       it head

            Nj       adjective head

            Nn       proper name

            Nu       unit noun head

            Na       marked as subject

            No       marked as nonsubject

            Ns       singular noun phrase

            Np       plural noun phrase

 

            Jq       wh- adjective phrase

            Jv       wh…ever adjective phrase

            Jx       measured absolute adjective phrase

            Jr       measured comparative adjective phrase

            Jh       postmodified adjective phrase

 

            Rq       wh- adverb phrase

            Rv       wh…ever adverb phrase

            Rx       measured absolute adverb phrase

            Rr       measured comparative adverb phrase

            Rs       adverb conducive to asyndeton

            Rw       quasi-nominal adverb

 

            Po       of phrase

            Pb       by phrase

            Pq       wh- prepositional phrase

            Pv       wh…ever prepositional phrase

 

            Dq       wh- determiner phrase

            Dv       wh…ever determiner phrase

            Ds       singular determiner phrase

            Dp       plural determiner phrase

 

            Ms       phrase headed by one

 

Subcategory symbols are omitted where they are implied by more specific subcategories; thus a verb group beginning with was will be labelled Vs, not Vsd.
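The omission rule can be illustrated with a minimal sketch in Python. The implication table here is hypothetical: its single entry (for verb groups, s implies d) is the only implication stated in the text, and the function name is invented for illustration.

```python
# Hypothetical implication table. Only the entry below comes from the text
# (for verb groups, "s" implies "d"); any further entries would have to be
# taken from the full scheme documentation.
IMPLIED_BY = {"V": {"s": {"d"}}}


def prune_subcategories(category: str, subcategories: str) -> str:
    """Return the formtag with implied subcategory symbols removed."""
    table = IMPLIED_BY.get(category, {})
    implied = set()
    for symbol in subcategories:
        # Collect every symbol that is implied by a more specific one.
        implied |= table.get(symbol, set())
    kept = [s for s in subcategories if s not in implied]
    return category + "".join(kept)


print(prune_subcategories("V", "sd"))  # Vs
```

Symbols with no entry in the table pass through unchanged, so a tag such as Ns is returned as it stands.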

 

 

 

NON-ALPHANUMERIC FORMTAG SUFFIXES

 

 

Formtags may also contain non-alphanumeric symbols, including:

 

            ?          interrogative clause

            *          imperative clause

            %          subjunctive clause

            !          exclamatory clause or other item

            "          vocative item

 

Other non-alphanumeric symbols represent co-ordination structure.  Under the SUSANNE scheme, second and subsequent conjuncts in a co-ordination are analysed as subordinate to the first conjunct; thus a co-ordination of the form:

 

            c, y, and w

 

(where c, y, etc. are word-sequences of any grammatical rank) would be assigned a structure of the form:

 

            [c, [y], [and w]]
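As a rough illustration, the bracketing above can be built mechanically from a list of conjuncts. The following Python sketch uses nested lists for tree nodes; the function name is invented, punctuation is left out, and the default conjunction "and" is an assumption made only for the example.

```python
def nest_coordination(conjuncts, conjunction="and"):
    """Nest later conjuncts as subordinate nodes after the first conjunct.

    Each conjunct after the first gets a node (a list) of its own; the
    last one also contains the conjunction that introduces it.
    """
    if len(conjuncts) < 2:
        raise ValueError("a co-ordination needs at least two conjuncts")
    first, *middle, last = conjuncts
    tree = [first]
    tree.extend([c] for c in middle)   # conjuncts without a conjunction
    tree.append([conjunction, last])   # conjunct introduced by a conjunction
    return tree


print(nest_coordination(["c", "y", "w"]))  # ['c', ['y'], ['and', 'w']]
```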

 

The formtag of the entire co-ordination is determined by the properties of the first conjunct (except for singular/plural subcategories in the case of phrase categories to which these apply); the later conjuncts (which will often be grammatically reduced) have nodes of their own whose formtags mark them as ‘subordinate conjuncts’.  The following symbols relate to co-ordination (and apposition) structure:

 

            +          subordinate conjunct introduced by conjunction

            -          subordinate conjunct not introduced by conjunction

            @          appositional element

            &          co-ordinate structure acting as first conjunct within a higher co-ordination (marked in certain cases only)
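A reader processing formtags by program might separate the alphanumeric part from the trailing symbols along the following lines. This Python sketch assumes that the symbols always come at the end of the formtag and covers only the nine symbols listed above; the function name is invented, and the text says the list of symbols is not exhaustive.

```python
# Glosses copied from the two symbol lists above.
SYMBOLS = {
    "?": "interrogative clause",
    "*": "imperative clause",
    "%": "subjunctive clause",
    "!": "exclamatory clause or other item",
    '"': "vocative item",
    "+": "subordinate conjunct introduced by conjunction",
    "-": "subordinate conjunct not introduced by conjunction",
    "@": "appositional element",
    "&": "co-ordinate structure acting as first conjunct within a higher co-ordination",
}


def split_formtag(formtag: str):
    """Split a formtag into its alphanumeric part and glosses of its symbols."""
    cut = len(formtag)
    # Walk back from the end while the characters are known symbols.
    while cut > 0 and formtag[cut - 1] in SYMBOLS:
        cut -= 1
    return formtag[:cut], [SYMBOLS[ch] for ch in formtag[cut:]]


print(split_formtag("Ns+"))
# ('Ns', ['subordinate conjunct introduced by conjunction'])
```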

 

Co-ordination is recognised as occurring between words as well as between higher-rank tagmas; Figure 1 contains no example, but for instance in he bought apples and bananas the phrase apples and bananas would be analysed as a simple noun phrase singulary dominating a co-ordination of nouns, rather than as a co-ordination of one-word noun phrases.  Therefore nonterminal nodes may have formtags consisting of wordtags followed by co-ordination symbols, thus (using WT to stand for an arbitrary wordtag):

 

            WT&     co-ordination of words

            WT+     conjunct within wordlevel co-ordination that is introduced by a conjunction

            WT-     conjunct within wordlevel co-ordination not introduced by a conjunction

 

(A wordlevel co-ordination always takes an ampersand on its formtag; phrase or clause co-ordinations do so only in very restricted circumstances.)

 

Also, certain sequences of orthographic words, in certain uses, are regarded as functioning grammatically as single words (‘grammatical idioms’).  For instance, the sequence in keeping with is normally treated as a grammatical idiom, equivalent to a single preposition (for which the wordtag is II).  In such cases, the nonterminal node dominating the sequence has a formtag consisting of an equals sign suffixed to the corresponding wordtag; and the individual words composing the idiom are not wordtagged in their own right, but receive tags with numerical suffixes reflecting their membership of an idiom.  (The sequence in keeping with is formtagged II=, and the words in, keeping, with in this context are wordtagged II31 II32 II33.)  English for the Computer includes exhaustive listings of closed-class grammatical idioms.
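The idiom tagging can be sketched in Python as follows. The sketch assumes that the first digit of the suffix gives the number of words in the idiom and the second digit the position of the word within it; this reading is inferred from the single example II31 II32 II33 and is not stated in the text. The function name is invented.

```python
def idiom_tags(wordtag: str, words):
    """Return (formtag, per-word tags) for a grammatical idiom.

    Assumption: the suffix is <length digit><position digit>, which is
    only well defined for idioms of two to nine words.
    """
    n = len(words)
    if not 2 <= n <= 9:
        raise ValueError("sketch only handles idioms of two to nine words")
    formtag = wordtag + "="
    tags = [f"{wordtag}{n}{i}" for i in range(1, n + 1)]
    return formtag, tags


print(idiom_tags("II", ["in", "keeping", "with"]))
# ('II=', ['II31', 'II32', 'II33'])
```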

 

Note that formtags of the forms WT& WT+ WT- WT= rank as wordlevel formtags for the purposes of determining tree structure as discussed above.
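A simple test for these four forms might look like the Python sketch below. The set of wordtags has to be supplied by the caller; the set used in the example is hypothetical apart from II, which is the only wordtag mentioned in this section, and the function name is invented.

```python
WORDLEVEL_SUFFIXES = ("&", "+", "-", "=")


def is_wordlevel_formtag(formtag: str, wordtags) -> bool:
    """True if the formtag has one of the forms WT& WT+ WT- WT=."""
    # The final character must be one of the four symbols, and what
    # precedes it must be a known wordtag.
    return formtag[-1:] in WORDLEVEL_SUFFIXES and formtag[:-1] in wordtags


known_wordtags = {"II"}  # illustrative set, not the full wordtag list
print(is_wordlevel_formtag("II=", known_wordtags))  # True
print(is_wordlevel_formtag("Ns+", known_wordtags))  # False
```

Because the check relies on the wordtag set, a phrase formtag such as Ns+ is rejected even though it ends in a co-ordination symbol.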

 

 

 

THE FUNCTIONTAGS

 

 

Functiontags divide into complement and adjunct tags:  broadly, a given complement tag can occur at most once in any clause, but a clause may contain multiple adjuncts of the same type.

 

It was originally planned to classify complements in terms of some version of Fillmorean Case Grammar.  The most fully worked-out version of case theory, including specimen case frames for numerous English verbs and other predicates, is that of Stockwell et al. (1973), and the SUSANNE team set out to develop this into a scheme capable of specifying an unambiguous case assignment for all complements found in corpus material.  After strenuous and protracted efforts, this attempt failed; the nature of the logical relationships which various predicates in real-life usage contract with their arguments proved too diverse to handle in this fashion, and the team believe that they have ‘tested to destruction’ the hypothesis that core clause structure in English can adequately be described in terms of a limited set of ‘cases’.  Instead, the finished SUSANNE Corpus classifies complements in terms of the semantically less informative, but more predictable, traditional concepts of subject and object.

 

The scheme of adjunct categories has been developed from the classification of Quirk et al. (1985), though some modifications have been introduced in the light of experience in applying the categories to corpus data.

 

            Complement Functiontags

 

            s          logical subject

            o          logical direct object

            S          surface (and not logical) subject

            O          surface (and not logical) direct object

            i          indirect object

            u          prepositional object

            e          predicate complement of subject

            j          predicate complement of object

            a          agent of passive

            n          particle of phrasal verb

            z          complement of catenative

            x          relative clause having higher clause as antecedent

            G          ‘guest’ having no grammatical role within its tagma

 

            Adjunct Functiontags

 

            p          place

            q          direction

            t          time

            h          manner or degree

            m          modality

            c          contingency

            r          respect

            w          comitative

            k          benefactive

            b          absolute
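The broad distinction between complement and adjunct tags suggests a simple consistency check on the functiontags found within one clause. The Python sketch below uses the tag letters from the two lists above; the function name and the representation of a clause as a list of functiontag letters are assumptions made for illustration. Since the text presents the at-most-once property only as a broad tendency, a reported repetition is a prompt for inspection, not necessarily an error.

```python
from collections import Counter

# Tag letters copied from the two lists above.
COMPLEMENT_TAGS = set("soSOiuejanzxG")
ADJUNCT_TAGS = set("pqthmcrwkb")


def repeated_complements(functiontags):
    """Return the complement tags that occur more than once in a clause."""
    counts = Counter(t for t in functiontags if t in COMPLEMENT_TAGS)
    return sorted(tag for tag, n in counts.items() if n > 1)


print(repeated_complements(["s", "o", "t", "t", "p"]))  # []
print(repeated_complements(["s", "s", "t"]))            # ['s']
```

Repeated adjunct tags, such as the two t tags in the first call, are never reported.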

 

 

Detailed guidelines for the application of these functional categories are included in English for the Computer.

 

 


 

APPENDIX:  How to retrieve a copy of the SUSANNE Corpus

 

 

[Since this article was published, I have abandoned the Oxford Text Archive as a distribution centre for up-to-date versions of the SUSANNE Corpus and similar resources, in favour of a server under my own control. To get hold of a copy of the most up-to-date version of the Corpus at any time, follow the link to ‘downloadable research resources’ from my home page at www.grsampson.net and follow the instructions given there.]

 

 

 

 

 


 

REFERENCES

 

 

Ellegård, A.  1978.  The Syntactic Structure of English Texts.  Gothenburg Studies in English 43.  Gothenburg: Acta Universitatis Gothoburgensis.

 

Francis, W.N. and H. Kučera.  1989.  Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (corrected and revised edition).  Providence, Rhode Island: Department of Linguistics, Brown University.

 

Garside, R.G., G.N. Leech, and G.R. Sampson (eds.).  1987.  The Computational Analysis of English.  London: Longman.

 

Hofland, K. and S. Johansson.  1982.  Word Frequencies in British and American English.  London: Longman.

 

Quirk, R., S. Greenbaum, G.N. Leech, and J. Svartvik.  1985.  A Comprehensive Grammar of the English Language.  London: Longman.

 

Sampson, G.R.  1991.  Analysed corpora of English: a consumer guide.  In Computers in Applied Linguistics, ed. by Martha Pennington and V. Stevens [dated 1992 but published in 1991].  181-200.  Clevedon, Avon: Multilingual Matters.

 

Sampson, G.R.  1992.  Probabilistic parsing.  In Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, ed. by J. Svartvik.  425-47.  Berlin: Mouton de Gruyter.

 

Sampson, G.R.  Forthcoming.  The need for grammatical stocktaking.  To appear in Proceedings of the 1992 Pisa Symposium on European Textual Corpora, ed. by N. Ostler.

 

Stockwell, R.P., P. Schachter, and Barbara Hall Partee.  1973.  The Major Syntactic Structures of English.  New York: Holt, Rinehart and Winston.

 



 



[1]The support of the Economic and Social Research Council (UK) is gratefully acknowledged.  Project SUSANNE, ‘Construction of an Analysed Corpus of English’, was funded by ESRC award no. R00023 1142 from 1988 to 1992.  ‘SUSANNE’ stands for ‘Surface and underlying structural analyses of naturalistic English’.  I should like to express my warmest thanks to the team who worked on Project SUSANNE, namely Robin Haigh, Hélène Knight, Tim Willis, and Nancy Glaister, and to David Tugwell who also contributed to the SUSANNE scheme.

 

[2]Note that a sharp distinction is drawn here between the terms ‘scheme’ and ‘system’.  A ‘parsing scheme’, or ‘analytic scheme’, refers to a range of notations and guidelines for using them which prescribe to a human analyst what the appropriate grammatical annotation for a language example should be.  A parsing ‘system’ on the other hand refers to a software system which automatically produces analyses (according to some parsing scheme) of input language examples.  A parsing scheme defines the target which a parsing system hits (or fails to hit).  The SUSANNE Corpus represents part of the definition of a parsing scheme.  It has been produced largely manually, not as the output of an automatic parsing system.

 

[3]I thank Alvar Ellegård for permission to circulate a research resource derived from the work of his group.

 

[4]APRIL Phase 2, ‘A speech-oriented stochastic parser’:  see footnote 6 below.

 

[5]Analysed corpora available at the outset of Project SUSANNE are surveyed in Sampson (1991).

 

[6]Phases 1 (1986–9) and 2 (1989–91) of Project APRIL were sponsored by the Royal Signals and Radar Establishment (Ministry of Defence), under MoD contracts nos. D/ER/1/9/4/2062/0128 and D/ER/1/9/4/2062/0151. APRIL Phase 3 (1992–95), ‘A full natural language annealing parser’, which is to produce a self-contained annealing parser system suitable for distribution to and evaluation by the research community, is sponsored jointly by the Science and Engineering Research Council (UK) and the UK Ministry of Defence, under grant no. GR/J06108.

 

[7]For instance there are numerous incorrect attachments of postmodifying phrases in Release 1.