SUSANNE – A Domesday Book of English Grammar

Geoffrey Sampson

School of Cognitive and Computing Sciences

University of Sussex

Falmer, Brighton BN1 9QH, England

INTRODUCTION

The SUSANNE Corpus has been created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive taxonomy and annotation scheme for the (logical and surface) grammar of English for NLP (natural language processing) purposes.[1] Copies are now available to the research community freely and without formalities. Release 1 of the Corpus has been distributed via anonymous ftp over the Internet by the Oxford Text Archive since October 1992; after six months, messages received from users show that it is by now in use in a variety of academic and commercial research environments in many countries on at least four continents. (The procedure for acquiring a copy is detailed in the Appendix.)

The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and the boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.[2] The SUSANNE scheme may be likened to a ‘Linnaean taxonomy’ of the grammatical domain: its aim (comparable to that of Linnaeus’s eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers’ linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for NLP researchers to register everything that occurs in real-life usage systematically and unambiguously, without misunderstandings over local uses of analytic terminology.

Alternatively, one may liken the SUSANNE analytic scheme to the Domesday Book commissioned by William the Bastard after his conquest of England: the scheme describes English grammar as Domesday describes eleventh-century English geography, not discursively or with attention to human interest, but comprehensively and in a terse, systematic format which specifies just enough information to permit the application of consistent procedures (in the Domesday case, taxation procedures).

There are numerous reasons why taxonomic work of this kind is a high priority at the current juncture in the history of natural language processing. Such work is needed both to facilitate the development of more adequate NLP systems, and to create a greater level of sophistication in the user community about the systems available.

By offering a comprehensive check-list of phenomena which a fully-adequate NLP system needs to be able to handle (which include many linguistic structures commonly ignored by theoretical linguistics – consider for instance addresses, weights and measures, the placement of punctuation marks within grammatical structures, all very significant for practical natural language processing but scarcely visible within orthodox linguistic descriptions), a taxonomy enables the system builder to monitor what areas of the total task he has covered and to focus his efforts on major gaps. And by publicly specifying a ‘default’ analysis for every construction, a taxonomy enables the system builder to put effort into defining alternative analytic norms only where he has positive reasons for diverging from the default analysis – at present, for lack of a public taxonomy, each research group must define its analytic standards independently from the ground up, or else (as often happens) leave them vague in many respects.

At the same time, a public taxonomy facilitates the definition of objective benchmarks allowing the achievements of particular NLP systems to be measured and expressed in terms that are generally understood: thus encouraging the replacement of inferior by superior systems, and enabling potential clients for the technology to assess in advance the scope of systems they are thinking of investing in. These developments are essential if natural language processing is to complete the transition from the status of an academic pastime into a mature component of the information technology industry. Cf. Sampson (1992, forthcoming).

The SUSANNE analytic scheme is defined in detail in a book by myself, English for the Computer, forthcoming from Oxford University Press. The Chairman of the Analysis and Interpretation Working Group of the US/EC-sponsored Text Encoding Initiative has proposed its adoption as a recognised TEI standard. The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these.

Regrettably, Release 1 of the SUSANNE Corpus is not a ‘TEI-conformant’ resource, though aspects of the annotation scheme have been decided in such a way as to facilitate a move to TEI conformance in later releases. The working timetable of the Initiative meant that relevant aspects of the TEI Guidelines were not yet complete at the point when the SUSANNE Corpus was ready for initial release; delaying this release would have been unfortunate.

The brief description of the SUSANNE Corpus contained in the remainder of this article cannot replace the very detailed statements, illustrated with numerous Brown and LOB Corpus examples, to be found in English for the Computer; any user aiming to do serious work with the Corpus or the SUSANNE annotation scheme would probably need to consult the book. In a sense, the Corpus is pointless without the book. Nevertheless, prospective users may find a summary statement helpful, as giving an impression of the scope of the analytic scheme.

BACKGROUND

The present SUSANNE annotation scheme originated in work carried out by myself in collaboration with Professor Geoffrey Leech, F.B.A., and others in the years 1983‑85 to produce a database of manually analysed sentences from the LOB Corpus of written British English; this database, which has not been (and will not now be) published, is described in Garside et al. (1987: ch. 7). The annotation scheme of this ‘Lancaster-Leeds Treebank’ represented surface grammar only, without indications of logical form. It subsequently seemed desirable to extend this scheme to include methods for representing logical grammar, and to refine both surface and logical aspects of the annotation scheme by applying it to a larger body of texts. The only way that a parsing scheme can in practice be made increasingly adequate is in the way that the English Common Law develops, by collecting and systematizing the body of precedents generated through detailed consideration of more and more individual cases that arise in real life. Accordingly, Project SUSANNE took a subset of the Brown Corpus of written American English which had been manually analysed by Alvar Ellegård’s group at Gothenburg (Ellegård 1978), and reworked the annotations in this under-used resource in order to turn them into a scheme consistent with that used in the Lancaster-Leeds Treebank but including specifications of logical as well as surface structure: several categories of information not indicated in either Lancaster-Leeds or Gothenburg schemes were also added.[3]

The finished SUSANNE parsing scheme has thus been developed on the basis of samples of both British and American English. It is oriented chiefly towards written language; however, on another project sponsored by the Royal Signals and Radar Establishment[4] my team produced extensions to the SUSANNE scheme for annotating the distinctive grammatical phenomena of spoken English, and these extensions are specified in English for the Computer (though they are not used in the SUSANNE Corpus and are not discussed further here). It should be noted also that the scheme has emerged through a process of detailed critical discussion of analytic standards by some ten people over a decade; apart from myself, the leading role in the early years of these discussions was taken by Geoffrey Leech, whose standing as an English grammarian needs no emphasis.

The SUSANNE Corpus itself comprises an approximately 128,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme. The original motives for producing this database included that of providing better statistics than any then available[5] for probabilistic automatic-parsing techniques, such as those of my APRIL annealing parser project.[6] Statistically-based automatic language processing needs data analysed in a very consistent fashion, and hence requires a very explicit analytic scheme. In terms of quantity of language examples analysed, Project SUSANNE was overtaken after its inception by projects (notably Mitchell Marcus’s Pennsylvania Treebank project, cf. chapter 00 of this volume) which have used quasi-industrial methods to generate far larger bodies of grammatically-analysed material. However, the SUSANNE scheme may be unparalleled in the extent to which its categories have been refined and tested through detailed consideration of the almost endless small quirks of the texts to which they have been applied, and in the degree of precision to which the resulting guidelines for using the categories have been documented – thus defining analytic standards which permit annotation of future material to be extremely self-consistent. Accordingly the SUSANNE Corpus is offered to the research community primarily as a demonstration of the application of the parsing scheme, evidencing the fact that the scheme has survived the test of experience rather than being merely aprioristic. The SUSANNE Corpus functions, as it were, like a collection of type specimens appended to a botanical taxonomy.

Although Release 1 of the SUSANNE Corpus has undergone considerable proof-checking, it unquestionably still contains many errors.[7] I aim to issue future releases correcting these; I shall be extremely grateful if users discovering errors will log them and send me details, preferably by post rather than e-mail.

STRUCTURE OF THE CORPUS

The SUSANNE Corpus consists of 64 data files together with a documentation file. Each data file contains an annotated version of one 2000+ word text from the Brown Corpus. Files average about 83 kilobytes in size, so the entire Corpus totals about 5.3 megabytes. The data file names are those of the respective Brown texts, e.g. A01, N18; the documentation file is named ‘SUSANNE.doc’. Sixteen texts are drawn from each of the following Brown genre categories:

            A         press reportage
            G         belles lettres, biography, memoirs
            J         learned (mainly scientific and technical) writing
            N         adventure and Western fiction

The Corpus thus samples each of the four broad genre groups established on the basis of word-frequency data by Hofland & Johansson (1982: 27).

Each data file has a line (terminating in a newline character) for each word of the original text; but ‘words’ for SUSANNE purposes are often smaller than words in the ordinary orthographic sense: for instance, punctuation marks and the apostrophe-s suffix are treated as separate words and assigned lines of their own. (For details on the rules by which orthographic words are segmented, as well as on all other analytic matters mentioned below, see English for the Computer.)

For an example see Figure 1, which displays a short section from file A10 (part of the analysis of a news report from The Oregonian newspaper).

Each line of a SUSANNE data file has six fields separated by tabs (that is, there is one tab after each of fields 1 to 5, but a newline after field 6). Each field on every line contains at least one character. The six fields on each line are:

1 reference

2 status

3 wordtag

4 word

5 lemma

6 parse
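As a minimal sketch of how a record in this layout might be read, the following Python fragment splits one line into its six fields and enforces the constraint that no field is empty. The names SusanneLine and parse_line, and the sample record, are hypothetical and introduced only for illustration; the sample is not quoted from the Corpus.

```python
from typing import NamedTuple


class SusanneLine(NamedTuple):
    """One record of a SUSANNE data file (hypothetical helper type)."""
    reference: str
    status: str
    wordtag: str
    word: str
    lemma: str
    parse: str


def parse_line(line: str) -> SusanneLine:
    """Split a tab-separated record into its six non-empty fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    if any(field == "" for field in fields):
        raise ValueError("every field must contain at least one character")
    return SusanneLine(*fields)


# Illustrative record made up for this sketch (an assumption, not Corpus data).
sample = "A10:0630e\t-\tVVDi\tseemed\tseem\t[Vd.Vd]\n"
record = parse_line(sample)
print(record.wordtag, record.lemma)
```

A named tuple keeps the field order of the file visible in the code while still allowing access by name.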

Apart from the tab and newline characters used to structure fields and records (that is, lines), all bytes in each of the 64 SUSANNE data files are drawn from a subset of the 94 graphic character allocations of the International Reference Version (‘IRV’) of ISO 646:1983 ‘Information Processing – ISO 7-bit coded character set for information interchange’, from hexadecimal 21 (exclamation mark) to hex 7E (tilde). These codes are assumed for SUSANNE purposes to represent the graphic symbols assigned by the IRV system. Twelve members of the IRV character set are never used in the Corpus, namely (all codes hexadecimal):

23 gate

24 generalized currency unit

27 prime

2F solidus

5C reverse solidus

5E circumflex

5F underline

60 grave

7B opening curly bracket

7C vertical bar

7D closing curly bracket

7E tilde

The space character, hex 20, which is classified by ISO 646 as a control code, also does not occur in SUSANNE data files.
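The character inventory just described can be checked mechanically. The sketch below, in Python, builds the permitted set from the 94 graphic codes less the twelve unused ones and reports any byte of a record that falls outside it. The function name unexpected_bytes and the sample byte strings are assumptions made for illustration.

```python
from typing import List, Tuple

# The twelve IRV codes listed above as never used in the Corpus (hexadecimal).
UNUSED_CODES = frozenset({0x23, 0x24, 0x27, 0x2F, 0x5C, 0x5E,
                          0x5F, 0x60, 0x7B, 0x7C, 0x7D, 0x7E})

# The 94 graphic characters of ISO 646 IRV: hex 21 to hex 7E inclusive.
GRAPHIC_CODES = frozenset(range(0x21, 0x7F))

# Codes that may appear inside a field.
ALLOWED_CODES = GRAPHIC_CODES - UNUSED_CODES

# Tab and newline structure fields and records.
STRUCTURAL_CODES = frozenset({0x09, 0x0A})


def unexpected_bytes(record: bytes) -> List[Tuple[int, int]]:
    """Return (position, byte value) pairs for bytes outside the permitted set."""
    return [(pos, byte) for pos, byte in enumerate(record)
            if byte not in ALLOWED_CODES and byte not in STRUCTURAL_CODES]


print(len(GRAPHIC_CODES), len(ALLOWED_CODES))
print(unexpected_bytes(b"A10:0630e\t-\n"))   # a clean fragment
print(unexpected_bytes(b"a b"))              # contains a space, hex 20
```

The set difference leaves 94 minus 12, that is 82, permitted codes.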

Where text characters cannot be adequately represented directly within the resulting 82-member character set, they are represented by entity names within angle brackets. Where possible these are drawn from Appendix D to ISO 8879:1986, ‘Information Processing – Text & Office Systems – Standard Generalized Markup Language (SGML)’. For instance, <ldquo> stands for opening double inverted commas, <eacute> for lower-case ‘e’ with acute accent. Symbols in angle brackets are used also to stand for such things as typographical shifts, which for purposes of grammatical analysis are conveniently represented as items within the word-sequence: e.g. <bital> means ‘begin italics’.

REFERENCE FIELD

The reference field contains nine bytes which give each line a reference number that is unique across the SUSANNE Corpus, e.g. ‘A10:0630e’. The first three bytes (here A10) are the file name; the fourth byte is always a colon; bytes 5 to 8 (here 0630) are the number of the line in the ‘Bergen I’ version of the Brown Corpus on which the relevant word appears (Brown line numbers normally increment in tens, with occasional odd numbers interpolated); and the ninth byte is a lower-case letter differentiating successive words that appear on the same Brown line. (SUSANNE lines are lettered continuously from ‘a’, omitting ‘l’ and ‘o’.)
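The layout of the reference field lends itself to a simple pattern match. The following Python sketch decomposes a reference into file name, Brown line number and word letter, and converts the letter into an ordinal on the assumption that lettering restarts at ‘a’ on each Brown line. The names Reference and parse_reference are hypothetical, and the file-name pattern (one capital letter followed by two digits) is inferred from the examples A01 and N18 rather than stated by the scheme.

```python
import re
from typing import NamedTuple

# Letters used to distinguish words on one Brown line: 'l' and 'o' are omitted.
WORD_LETTERS = "abcdefghijkmnpqrstuvwxyz"


class Reference(NamedTuple):
    file_name: str     # bytes 1 to 3, e.g. "A10"
    brown_line: int    # bytes 5 to 8, e.g. 630
    word_letter: str   # byte 9, e.g. "e"
    word_ordinal: int  # 1 for 'a', 2 for 'b', ... (assumed interpretation)


_REFERENCE = re.compile(r"([A-Z][0-9]{2}):([0-9]{4})([" + WORD_LETTERS + r"])")


def parse_reference(field: str) -> Reference:
    """Decompose a nine-byte reference such as 'A10:0630e'."""
    match = _REFERENCE.fullmatch(field)
    if match is None:
        raise ValueError("not a valid reference field: %r" % field)
    file_name, line_digits, letter = match.groups()
    ordinal = WORD_LETTERS.index(letter) + 1
    return Reference(file_name, int(line_digits), letter, ordinal)


print(parse_reference("A10:0630e"))
```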

STATUS FIELD

The status field contains one byte. The letters ‘A’ and ‘S’ show that the word is an ‘abbreviation’ or ‘symbol’, respectively, as defined by Brown Corpus codes (Francis & Kučera 1989: 12). The letter ‘E’ shows that the word is (or is part of) a misprint or solecism in the original text (details are logged in English for the Computer). On the great majority of lines, to which none of these three categories apply, the status field contains a hyphen character.

WORDTAG FIELD

The SUSANNE wordtag set is based on the ‘Lancaster’ tagset listed in Garside et al. (1987: Appendix B); additional grammatical distinctions have been drawn in this set, and these are indicated by suffixing lower-case letters to the Lancaster tags. For instance, seemed is tagged VVD (past tense of verb) in the Lancaster scheme, but VVDi (past tense of intransitive – including copular – verb) in the SUSANNE scheme. Apart from the lower-case extensions, the wordtags are normally identical to the Lancaster tags: punctuation marks are assigned alphabetical tags beginning Y… (e.g. YC for comma), and the dollar sign which appears in some Lancaster tags for genitive words is replaced by G (e.g. GG for the apostrophe-s suffix), so that the modified Lancaster tags always consist wholly of alphanumeric characters, beginning with two capital letters. (In a few cases, tags from the Lancaster set have been merged or eliminated from the SUSANNE scheme in the light of experience.)

The tag YG appears in the wordtag field to represent a ‘trace’ – the logical position of a constituent which has been shifted elsewhere, or deleted, in the surface grammatical structure.

The SUSANNE wordtag set comprises 352 distinct wordtags, not counting tags for elements of ‘grammatical idioms’ (see below); a few of these wordtags are never used in the SUSANNE Corpus. The wordtags are listed, and their application rigorously defined, in English for the Computer – in the case of closed wordclasses, by enumeration of their members, and in the case of open classes by rules for choice between alternative tags. These rules refer to information in a specified published dictionary (the Oxford Advanced Learner’s Dictionary of Current English, 3rd edition).

WORD FIELD

The word field contains a segment of the text, often coinciding with a word in the orthographic sense but sometimes, as noted above, including only part of an orthographic word. In general the word field represents all and only those typographical distinctions in the original documents which were recorded in the Brown Corpus (Francis & Kučera 1989: 10–15), though in certain cases the SUSANNE Corpus has gone behind the Brown Corpus to reconstruct typographical details omitted from Brown.

Certain characters have special meanings in the word field, as follows:

+ (occurs only as first byte of the word field) shows that the contents of the field were not separated in the original text from the immediately-preceding text segment by whitespace (e.g. in the case of a punctuation mark, or part of a hyphenated sequence split over successive SUSANNE lines);

- the line corresponds to no text material (it represents the ‘trace’ for a grammatically-moved element);

< … > enclose entity names for special typographical features, as discussed above, either taken from ISO 8879:1986 Appendix D or created for the SUSANNE Corpus – for instance <pand> stands for ‘either plus sign or ampersand’, since the Brown Corpus makes no distinction between these characters.
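These conventions are enough to rebuild an approximation of the running text from successive word fields. The Python sketch below is one possible reading of them: a field consisting solely of a hyphen is skipped as a trace, a leading plus sign glues the segment to its predecessor, and entity names are expanded where a replacement is known. The function name detokenize, the two-entry entity table and the sample word sequences are assumptions for illustration; unknown entity names are left as they stand.

```python
import re
from typing import Iterable, List

# Entity names are written within angle brackets, e.g. <eacute>.
_ENTITY = re.compile(r"<([A-Za-z0-9]+)>")

# A deliberately tiny table covering only the two SGML examples in the text.
KNOWN_ENTITIES = {"ldquo": "\u201c", "eacute": "\u00e9"}


def _expand(match: "re.Match") -> str:
    """Replace a known entity; leave anything else (e.g. <bital>) untouched."""
    return KNOWN_ENTITIES.get(match.group(1), match.group(0))


def detokenize(word_fields: Iterable[str]) -> str:
    """Rebuild running text from a sequence of SUSANNE word fields."""
    segments: List[str] = []
    for field in word_fields:
        if field == "-":
            continue                      # trace: no text material
        glued = field.startswith("+")     # no whitespace before this segment
        text = _ENTITY.sub(_expand, field[1:] if glued else field)
        if glued and segments:
            segments[-1] += text
        else:
            segments.append(text)
    return " ".join(segments)


print(detokenize(["The", "caf<eacute>", "-", "closed", "+."]))
```

Typographical-shift entities are kept in the output here; a fuller treatment would need the complete entity list given in the Corpus documentation.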

LEMMA FIELD

The lemma field shows the dictionary headword of which the text word is a form: the field shows base forms for words which are inflected in the text, and eliminates typographical variations (such as sentence-initial capitalization) which are not inherent to the word but relate to its use in context. (In the case of ‘words’ to which the dictionary-form concept is inappropriate, e.g. numerals and punctuation marks, the lemma field contains a hyphen.) Orthographic forms in the lemma field are those of the Oxford Advanced Learner’s Dictionary of Current English, 3rd edition.

Project SUSANNE aimed also to indicate the senses which polysemous words bear in context, via codes relating word-tokens to numbered subsenses in a specified dictionary. The book English for the Computer provides a detailed coding scheme for representing this information. Unfortunately, this aspect of the project’s output proved to contain a number of inadequacies, and the information does not appear in Release 1 of the Corpus. It is hoped to include it in later releases.

PARSE FIELD

The contents of the sixth field represent the central raison d’être of the SUSANNE Corpus. They code the grammatical structure of texts as a sequence of labelled trees, having a leaf node for each Corpus line.

Each text is treated as a sequence of ‘paragraphs’ separated by ‘headings’. (Figure 1 includes one complete one-sentence paragraph, ending at line A10:0650f, and the first sentence of the following paragraph.) A ‘paragraph’ normally coincides with an ordinary orthographic paragraph; a ‘heading’ may consist of actual verbal material, or may be merely a typographical paragraph division, symbolized <minbrk> in the word field. Conceptually, the structure of each paragraph or heading is a labelled tree with root node labelled O (Oh for a heading), and with a leaf node labelled with a wordtag for each SUSANNE word or trace, i.e. each line of the Corpus. There will commonly be many intermediate labelled nodes.

Such a tree is represented as a bracketed string in the ordinary way, with the labels of nonterminal nodes written ‘inside’ both opening and closing brackets (that is, to the right of opening brackets and to the left of closing brackets). This bracketed string is then adapted as follows for inclusion in successive SUSANNE parse fields. Wherever an opening bracket immediately follows a closing bracket, the string is segmented, yielding one segment per leaf node; and within each such segment, the sequence opening-bracket + wordtag + closing-bracket, representing the leaf node, is replaced by a full stop. Thus each parse field contains exactly one full stop, corresponding to a terminal node labelled with the contents of the wordtag field, sometimes preceded by labelled opening bracket(s) and sometimes followed by labelled closing bracket(s), corresponding to higher tagmas which begin or end with the word on the line in question.
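The adaptation is reversible: substituting the bracketed wordtag for the full stop on each line and concatenating the results restores the bracketed string. The following Python sketch illustrates this under the assumption that wordtag and parse fields are supplied as pairs; the function name rebuild_bracketing and the three-line sample tree (including the particular tag values) are made up for illustration and are not taken from Figure 1.

```python
from typing import Iterable, Tuple


def rebuild_bracketing(lines: Iterable[Tuple[str, str]]) -> str:
    """Reassemble a bracketed tree from (wordtag, parse field) pairs."""
    pieces = []
    for wordtag, parse in lines:
        if parse.count(".") != 1:
            raise ValueError("parse field must contain exactly one full stop: %r" % parse)
        # The full stop stands for the leaf node: opening bracket, wordtag, closing bracket.
        pieces.append(parse.replace(".", "[" + wordtag + "]"))
    return "".join(pieces)


# Hypothetical one-sentence paragraph of three Corpus lines.
example = [
    ("PPH1", "[O[S[Ni:s.Ni:s]"),   # opens paragraph, clause and subject phrase
    ("VVDi", "[Vd.Vd]"),           # a one-word verb group
    ("YF", ".S]O]"),               # final punctuation closes clause and paragraph
]
print(rebuild_bracketing(example))
```

In the rebuilt string every boundary between successive pieces is a closing bracket followed by an opening bracket, which is exactly where the segmentation rule cuts.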

Brackets are square except in the case of nodes immediately dominating the ‘trace’ wordtag YG, which are represented with angle brackets.

Nonterminal node labels in the SUSANNE scheme contain up to three types of information: a formtag, a functiontag, and an index, in that order. In a label containing a formtag and one or both of the other two elements, a colon separates the formtag from the other elements. A functiontag is always a single alphabetic character, and an index is a sequence of three digits; restrictions on valid combinations of elements within a node label mean that complex labels can always be unambiguously decomposed into their elements.
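A node label can be taken apart along the lines just stated. The Python sketch below treats anything before a colon as the formtag and expects the remainder to be a single letter, three digits, or both. Because a label without a colon could in principle be either a bare formtag or a bare functiontag, the sketch relies on the caller to say whether the node immediately dominates a trace (the one case in which a formtag is absent); that parameter, like the names NodeLabel and split_label, is an assumption of this illustration rather than part of the scheme.

```python
import re
from typing import NamedTuple, Optional


class NodeLabel(NamedTuple):
    formtag: Optional[str]
    functiontag: Optional[str]
    index: Optional[str]


# A functiontag is one alphabetic character; an index is three digits.
_TAIL = re.compile(r"([A-Za-z])?([0-9]{3})?")


def split_label(label: str, dominates_trace: bool = False) -> NodeLabel:
    """Decompose a nonterminal label into formtag, functiontag and index."""
    if dominates_trace:
        formtag, tail = None, label          # no formtag on such nodes
    elif ":" in label:
        formtag, tail = label.split(":", 1)
    else:
        return NodeLabel(label, None, None)  # formtag only
    match = _TAIL.fullmatch(tail)
    if tail == "" or match is None:
        raise ValueError("cannot decompose label: %r" % label)
    return NodeLabel(formtag, match.group(1), match.group(2))


print(split_label("Ns:S152"))
print(split_label("Ti:s"))
print(split_label("s152", dominates_trace=True))
```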

RANKS OF CONSTITUENT

Apart from nodes immediately dominating traces, all nodes have labels including formtags, which identify the internal properties of the word or word-sequence dominated by the node. The shape of a parse-tree is defined in terms of a hierarchy of formtag ranks:

1 wordlevel formtags (begin with two capital letters; formtags of all other ranks begin with one capital and contain no further capitals)

2 phraselevel formtags (begin with one of: N V J R P D M G)

3 clauselevel formtags (begin with one of: S F T Z L A W)

4 rootlevel formtags (begin with one of: O Q I)
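The rank of a formtag follows directly from its initial characters, as the list above indicates. A minimal Python sketch of that test is given here; the function name formtag_rank and the returned rank names are chosen for illustration.

```python
PHRASE_INITIALS = frozenset("NVJRPDMG")
CLAUSE_INITIALS = frozenset("SFTZLAW")
ROOT_INITIALS = frozenset("OQI")


def formtag_rank(formtag: str) -> str:
    """Classify a formtag as 'word', 'phrase', 'clause' or 'root'."""
    head = formtag[:2]
    # Wordlevel formtags begin with two capital letters.
    if len(head) == 2 and head.isalpha() and head.isupper():
        return "word"
    initial = formtag[:1]
    if initial in PHRASE_INITIALS:
        return "phrase"
    if initial in CLAUSE_INITIALS:
        return "clause"
    if initial in ROOT_INITIALS:
        return "root"
    raise ValueError("unrecognised formtag: %r" % formtag)


for tag in ("VVGi", "Vg", "Tg", "Q"):
    print(tag, formtag_rank(tag))
```

The two-capital test must come first, since the first letter of a wordlevel tag such as VVGi would otherwise be mistaken for a phraselevel initial.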

Each grammatical clause, whether consisting of one or more words, is given a node labelled with a clauselevel formtag. Each immediate constituent of a clause, whether there are one or more such constituents and whether the constituent consists of one or more words, is given a node labelled with a phraselevel formtag, unless the constituent belongs to a wordlevel category that has no corresponding phraselevel category (e.g. punctuation marks, existential there, conjunctions), or to a rootlevel category (e.g. a direct quotation, formtagged Q). Thus a clause consisting just of a verb will be assigned a clauselevel formtag (e.g. Tg for present-participle clause) which singularily dominates a phraselevel formtag (e.g. Vg for ‘verb group beginning with present participle’) which in turn singularily dominates a wordlevel formtag (e.g. VVGi for ‘present participle of intransitive verb’).

Other than by these rules, and in certain other limited circumstances specified in English for the Computer, singulary branching does not occur. An intermediate phraselevel node is inserted between a higher phraselevel node and a sequence of words dominated by it only if two or more of those words form a coherent constituent within the higher phrase. A clause which fills a slot standardly filled by a phrase (e.g. a nominal clause as subject or object) will not have a phrase node above the clause node unless the clause proper is preceded and/or followed by modifying elements that are not part of the clause.

Detailed rules for deciding constituency in various debatable cases, for placing items such as punctuation marks within parse trees, for extending the application of the categories and structuring rules to linguistic phenomena such as addresses or weights and measures which are not commonly taken into account in linguistic theorizing, and so forth, are laid down in English for the Computer.

FUNCTIONTAGS AND INDICES

Functiontags, identifying roles such as surface subject, logical object, time adjunct, are assigned to all immediate constituents of clauses, except for their verb-group heads and certain other constituents for which function labelling is inappropriate.

Indices are assigned to pairs of nodes to show referential identity between items which are in certain defined grammatical relationships to one another. Thus, in Figure 1, the sequence feeling that evacuation plans … would not work is given the label Ns:S152, in which the formtag Ns identifies the tagma as a singular noun phrase (note that in context feeling occurs in its nominal use – one might have expected the sentence to read a feeling … , but corpus linguistics takes language as it finds it), the capital S shows that the tagma is surface subject of the seemed clause (in an existential clause the subject, which determines verb agreement, standardly follows the verb), and the index 152 shows that this tagma is identifiable with the logical subject (s152) of the to be clause. The label Ti:s on this latter clause shows that it is an infinitival clause (formtag Ti), which as a whole, including its displaced logical subject, forms the logical subject (s) of seemed.

In some cases, movement rules displace a constituent into a tagma within which it has no grammatical role (for instance, an adverb which is logically a clause constituent may interrupt the verb group – sequence of auxiliary verbs and main verb – which heads the clause): in such cases the functiontag is G (‘guest’). Constituents which do not logically belong below the node which immediately dominates them in surface structure are always given G functiontags and indices linking them to their logical position. With that exception (and with one other exception not discussed here relating to co-ordination), functiontagging is used only for immediate constituents of clauses.

English for the Computer lists the types of surface/logical-grammar discordance which are represented by the SUSANNE scheme, and the approved methods of representing them. The SUSANNE analysis is always chosen so as to be as far as possible neutral as between alternative linguistic theories.

THE FORMTAGS

The SUSANNE formtags are as follows:

Rootlevel Formtags

O paragraph

Oh heading

Ot title (e.g. of book)

Q quotation

I interpolation

Iq tag question

Iu scientific citation

Clauselevel Formtags

S main clause

Ss quoting clause embedded within quotation

Fa adverbial clause

Fn nominal clause

Fr relative clause

Ff ‘fused’ relative

Fc comparative clause

Tg present participle clause

Ti infinitival clause

Tn past participle clause

Tf for-to clause

Tb ‘bare’ nonfinite clause

Tq infinitival relative clause

Z reduced (‘whiz-deleted’) relative clause

L other verbless clause

A special as clause

W with clause

Phraselevel Formtags

N noun phrase

V verb group

J adjective phrase

R adverb phrase

P prepositional phrase

D determiner phrase

M numeral phrase

G genitive phrase

The various phrase categories take lower-case subcategory symbols which can be combined in any meaningful combination (e.g. the verb group would not work is formtagged Vdce). The phrase subcategories are:

Vo operator section of verb group, when separated from remainder of verb group e.g. by subject-auxiliary inversion

Vr remainder of verb group from which operator has been separated

Vm verb group beginning with am

Va verb group beginning with are

Vs verb group beginning with was

Vz verb group beginning with other 3rd-singular verb

Vw verb group beginning with were

Vj verb group beginning with be

Vd verb group beginning with past tense

Vi infinitival verb group

Vg verb group beginning with present participle

Vn verb group beginning with past participle

Vc verb group beginning with modal

Vk verb group containing emphatic DO

Ve negative verb group

Vf perfective verb group

Vu progressive verb group

Vp passive verb group

Vb verb group ending with BE

Vx verb group lacking main verb

Vt catenative verb group

Nq wh- noun phrase

Nv wh…ever noun phrase

Ne I/me head

Ny you head

Ni it head

Nj adjective head

Nn proper name

Nu unit noun head

Na marked as subject

No marked as nonsubject

Ns singular noun phrase

Np plural noun phrase

Jq wh- adjective phrase

Jv wh…ever adjective phrase

Jx measured absolute adjective phrase

Jr measured comparative adjective phrase

Jh postmodified adjective phrase

Rq wh- adverb phrase

Rv wh…ever adverb phrase

Rx measured absolute adverb phrase

Rr measured comparative adverb phrase

Rs adverb conducive to asyndeton

Rw quasi-nominal adverb

Po of phrase

Pb by phrase

Pq wh- prepositional phrase

Pv wh…ever prepositional phrase

Dq wh- determiner phrase

Dv wh…ever determiner phrase

Ds singular determiner phrase

Dp plural determiner phrase

Ms phrase headed by one

Subcategory symbols are not included if implied by more specific subcategories; thus a verb group beginning with was will be labelled Vs, not Vsd.
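Since subcategory symbols are single lower-case letters, a phrase formtag can be glossed symbol by symbol. The Python sketch below does this for verb groups only, using the glosses from the list above; the function name verb_group_subcategories is hypothetical, and the decision to stop at the first character that is not a lower-case letter (so that co-ordination suffixes are ignored) is an assumption of the sketch.

```python
from typing import List

# Verb-group subcategory symbols, transcribed from the list above.
VERB_GROUP_SUBCATEGORIES = {
    "o": "operator section of verb group",
    "r": "remainder of verb group from which operator has been separated",
    "m": "verb group beginning with am",
    "a": "verb group beginning with are",
    "s": "verb group beginning with was",
    "z": "verb group beginning with other 3rd-singular verb",
    "w": "verb group beginning with were",
    "j": "verb group beginning with be",
    "d": "verb group beginning with past tense",
    "i": "infinitival verb group",
    "g": "verb group beginning with present participle",
    "n": "verb group beginning with past participle",
    "c": "verb group beginning with modal",
    "k": "verb group containing emphatic DO",
    "e": "negative verb group",
    "f": "perfective verb group",
    "u": "progressive verb group",
    "p": "passive verb group",
    "b": "verb group ending with BE",
    "x": "verb group lacking main verb",
    "t": "catenative verb group",
}


def verb_group_subcategories(formtag: str) -> List[str]:
    """Gloss each subcategory symbol of a phraselevel V formtag."""
    if not formtag.startswith("V") or formtag[1:2].isupper():
        raise ValueError("not a verb-group formtag: %r" % formtag)
    glosses = []
    for symbol in formtag[1:]:
        if not symbol.islower():
            break  # stop at any non-alphanumeric suffix
        if symbol not in VERB_GROUP_SUBCATEGORIES:
            raise ValueError("unknown subcategory symbol: %r" % symbol)
        glosses.append(VERB_GROUP_SUBCATEGORIES[symbol])
    return glosses


print(verb_group_subcategories("Vdce"))
```

For Vdce, the formtag cited for would not work, the three symbols are read as past tense, modal and negative.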

NON-ALPHANUMERIC FORMTAG SUFFIXES

Formtags may also contain non-alphanumeric symbols, including:

? interrogative clause

* imperative clause

% subjunctive clause

! exclamatory clause or other item

" vocative item

Other non-alphanumeric symbols represent co-ordination structure. Under the SUSANNE scheme, second and subsequent conjuncts in a co-ordination are analysed as subordinate to the first conjunct; thus a co-ordination of the form:

c, y, and w

(where c, y, etc. are word-sequences of any grammatical rank) would be assigned a structure of the form:

[c, [y], [and w]]

The formtag of the entire co-ordination is determined by the properties of the first conjunct (except for singular/plural subcategories in the case of phrase categories to which these apply); the later conjuncts (which will often be grammatically reduced) have nodes of their own whose formtags mark them as ‘subordinate conjuncts’. The following symbols relate to co-ordination (and apposition) structure:

+ subordinate conjunct introduced by conjunction

- subordinate conjunct not introduced by conjunction

@ appositional element

& co-ordinate structure acting as first conjunct within a higher co-ordination (marked in certain cases only)

Co-ordination is recognised as occurring between words as well as between higher-rank tagmas; Figure 1 contains no example, but for instance in he bought apples and bananas the phrase apples and bananas would be analysed as a simple noun phrase singularily dominating a co-ordination of nouns, rather than as a co-ordination of one-word noun phrases. Therefore nonterminal nodes may have formtags consisting of wordtags followed by co-ordination symbols, thus (using WT to stand for an arbitrary wordtag):

WT& co-ordination of words

WT+ conjunct within wordlevel co-ordination that is introduced by a conjunction

WT- conjunct within wordlevel co-ordination not introduced by a conjunction

(A wordlevel co-ordination always takes an ampersand on its formtag; phrase or clause co-ordinations do so only in very restricted circumstances.)

Also, certain sequences of orthographic words, in certain uses, are regarded as functioning grammatically as single words (‘grammatical idioms’). For instance, in keeping with is normally treated as a grammatical idiom, equivalent to a single preposition (for which the wordtag is II). In such cases, the nonterminal node dominating the sequence has a formtag consisting of an equals sign suffixed to the corresponding wordtag; and the individual words composing the idiom are not wordtagged in their own right, but receive tags with numerical suffixes reflecting their membership of an idiom. (The sequence in keeping with is formtagged II=, and the words in, keeping, with in this context are wordtagged II31 II32 II33.) English for the Computer includes exhaustive listings of closed-class grammatical idioms.
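Idiom membership can be recovered from a sequence of wordtags. The Python sketch below rests on an assumption that goes beyond what is stated here: that in a suffix such as 31, 32, 33 the first digit gives the number of words in the idiom and the second the position of the word within it, as the single example II31 II32 II33 suggests. The function name idiom_spans and the sample tag sequence are hypothetical.

```python
import re
from typing import List, Sequence, Tuple

# Base wordtag followed by two digits: idiom length, then position (assumed reading).
_IDIOM_TAG = re.compile(r"([A-Z]{2}[A-Za-z]*)([2-9])([1-9])")


def idiom_spans(wordtags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, formtag) for each complete idiom; end is exclusive."""
    spans = []
    i = 0
    while i < len(wordtags):
        match = _IDIOM_TAG.fullmatch(wordtags[i])
        if match is not None and match.group(3) == "1":
            base, size = match.group(1), int(match.group(2))
            expected = ["%s%d%d" % (base, size, k) for k in range(1, size + 1)]
            if list(wordtags[i:i + size]) == expected:
                # The dominating node takes the base wordtag plus an equals sign.
                spans.append((i, i + size, base + "="))
                i += size
                continue
        i += 1
    return spans


# Hypothetical tag sequence containing the idiom "in keeping with".
tags = ["AT", "II31", "II32", "II33", "NN1"]
print(idiom_spans(tags))
```

Requiring the full run of expected tags guards against treating an ordinary wordtag that happens to end in digits as an idiom member.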

Note that formtags of the forms WT& WT+ WT- WT= rank as wordlevel formtags for the purposes of determining tree structure as discussed above.

THE FUNCTIONTAGS

Functiontags divide into complement and adjunct tags: broadly, a given complement tag can occur at most once in any clause, but a clause may contain multiple adjuncts of the same type.

It was originally planned to classify complements in terms of some version of Fillmorean Case Grammar. The most fully worked-out version of case theory, including specimen case frames for numerous English verbs and other predicates, is that of Stockwell et al. (1973), and the SUSANNE team set out to develop this into a scheme capable of specifying an unambiguous case assignment for all complements found in corpus material. After strenuous and protracted efforts, this attempt failed; the nature of the logical relationships which various predicates in real-life usage contract with their arguments proved too diverse to handle in this fashion, and the team believe that they have ‘tested to destruction’ the hypothesis that core clause structure in English can adequately be described in terms of a limited set of ‘cases’. Instead, the finished SUSANNE Corpus classifies complements in terms of the semantically less informative, but more predictable, traditional concepts of subject and object.

The scheme of adjunct categories has been developed from the classification of Quirk et al. (1985), though some modifications have been introduced in the light of experience in applying the categories to corpus data.

Complement Functiontags

s logical subject

o logical direct object

S surface (and not logical) subject

O surface (and not logical) direct object

i indirect object

u prepositional object

e predicate complement of subject

j predicate complement of object

a agent of passive

n particle of phrasal verb

z complement of catenative

x relative clause having higher clause as antecedent

G ‘guest’ having no grammatical role within its tagma

Adjunct Functiontags

p place

q direction

t time

h manner or degree

m modality

c contingency

r respect

w comitative

k benefactive

b absolute

Detailed guidelines for the application of these functional categories are included in English for the Computer.
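The distinction drawn at the start of this section, that a given complement tag occurs at most once in a clause while adjunct tags may repeat, lends itself to a simple consistency check. The sketch below is illustrative only: the function name and the representation of a clause as a list of one-letter functiontags are assumptions made for the example, and since the rule is stated as holding only ‘broadly’, a reported duplicate is a prompt for inspection rather than proof of an annotation error.

```python
from collections import Counter

# The two tag inventories, transcribed from the lists above.
COMPLEMENT_TAGS = {
    "s": "logical subject",
    "o": "logical direct object",
    "S": "surface (and not logical) subject",
    "O": "surface (and not logical) direct object",
    "i": "indirect object",
    "u": "prepositional object",
    "e": "predicate complement of subject",
    "j": "predicate complement of object",
    "a": "agent of passive",
    "n": "particle of phrasal verb",
    "z": "complement of catenative",
    "x": "relative clause having higher clause as antecedent",
    "G": "'guest' having no grammatical role within its tagma",
}

ADJUNCT_TAGS = {
    "p": "place",
    "q": "direction",
    "t": "time",
    "h": "manner or degree",
    "m": "modality",
    "c": "contingency",
    "r": "respect",
    "w": "comitative",
    "k": "benefactive",
    "b": "absolute",
}


def repeated_complements(clause_functiontags):
    """Return, in sorted order, the complement tags used more than once.

    `clause_functiontags` is a hypothetical flat list of the functiontags
    borne by the immediate constituents of one clause. Repeated adjunct
    tags are permitted and are therefore never reported.
    """
    counts = Counter(clause_functiontags)
    return sorted(
        tag for tag, n in counts.items()
        if tag in COMPLEMENT_TAGS and n > 1
    )
```

A clause whose constituents are tagged s, o, t, t, p passes the check, since only the time adjunct is repeated; a clause tagged s, s, t would be flagged for its two logical subjects.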

APPENDIX: How to retrieve a copy of the SUSANNE Corpus

[Since this article was published, I have abandoned the Oxford Text Archive as a distribution centre for up-to-date versions of the SUSANNE Corpus and similar resources, in favour of a server under my own control. To get hold of a copy of the most up-to-date version of the Corpus at any time, follow the link to ‘downloadable research resources’ from my home page at www.grsampson.net and follow the instructions given there.]

REFERENCES

Ellegård, A. 1978. The Syntactic Structure of English Texts. Gothenburg Studies in English 43. Gothenburg: Acta Universitatis Gothoburgensis.

Francis, W.N. and H. Kučera. 1989. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (corrected and revised edition). Providence, Rhode Island: Department of Linguistics, Brown University.

Garside, R.G., G.N. Leech, and G.R. Sampson (eds.). 1987. The Computational Analysis of English. London: Longman.

Hofland, K. and S. Johansson. 1982. Word Frequencies in British and American English. London: Longman.

Quirk, R., S. Greenbaum, G.N. Leech, and J. Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.

Sampson, G.R. 1991. Analysed corpora of English: a consumer guide. In Computers in Applied Linguistics, ed. by Martha Pennington and V. Stevens [dated 1992 but published in 1991]. 181-200. Clevedon, Avon: Multilingual Matters.

Sampson, G.R. 1992. Probabilistic parsing. In Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, ed. by J. Svartvik. 425-47. Berlin: Mouton de Gruyter.

Sampson, G.R. Forthcoming. The need for grammatical stocktaking. To appear in Proceedings of the 1992 Pisa Symposium on European Textual Corpora, ed. by N. Ostler.

Stockwell, R.P., P. Schachter, and Barbara Hall Partee. 1973. The Major Syntactic Structures of English. New York: Holt, Rinehart and Winston.
