The following online article has been derived mechanically from an MS produced on the way towards conventional print publication. Many details are likely to deviate from the print version; figures and footnotes may even be missing altogether, and where negotiation with journal editors has led to improvements in the published wording, these will not be reflected in this online version. Shortage of time makes it impossible for me to offer a more careful rendering. I hope that placing this imperfect version online may be useful to some readers, but they should note that the print version is definitive. I shall not let myself be held to the precise wording of an online version, where this differs from the print version. Published in Nelleke Oostdijk & P. de Haan, eds., Corpus-Based Research into Language, Rodopi (Amsterdam), 1994.
SUSANNE: A Domesday Book of English Grammar
Geoffrey Sampson
School of Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England
INTRODUCTION
The SUSANNE Corpus has been created, with the sponsorship of the Economic
and Social Research Council (UK), as part of the process of developing a
comprehensive taxonomy and annotation scheme for the (logical and surface)
grammar of English for NLP (natural language processing) purposes.[1]
Copies are now available to the research community freely and without
formalities. Release 1 of the
Corpus has been distributed via anonymous ftp over the Internet by the Oxford
Text Archive since October 1992; after six months, messages received from users
show that it is by now in use in a variety of academic and commercial research
environments in many countries on at least four continents. (The procedure for acquiring a copy is
detailed in the Appendix.)
The SUSANNE scheme attempts to provide a method of representing all
aspects of English grammar which are sufficiently definite to be susceptible of
formal annotation, with the categories and the boundaries between categories
specified in sufficient detail that, ideally, two analysts independently
annotating the same text and referring to the same scheme must produce the same
structural analysis.[2]
The SUSANNE scheme may be likened to a ‘Linnaean taxonomy’ of the
grammatical domain: its aim
(comparable to that of Linnaeus’s eighteenth-century taxonomy for the domain of
botany) is not to identify categories which are theoretically optimal or which
necessarily reflect the psychological organization of speakers’ linguistic
competence, but simply to offer a scheme of categories and ways of applying
them that make it practical for NLP researchers to register everything that
occurs in real-life usage systematically and unambiguously, without
misunderstandings over local uses of analytic terminology.
Alternatively, one may liken the SUSANNE analytic scheme to the Domesday
Book commissioned by William the Bastard after his conquest of England: the scheme describes English grammar as
Domesday describes eleventh-century English geography, not discursively or with
attention to human interest, but comprehensively and in a terse, systematic
format which specifies just enough information to permit the application of
consistent procedures (in the Domesday case, taxation procedures).
There are numerous reasons why taxonomic work of this kind is a high
priority at the current juncture in the history of natural language
processing. Such work is needed
both to facilitate the development of more adequate NLP systems, and to create
a greater level of sophistication in the user community about the systems
available.
By offering a comprehensive check-list of phenomena which a
fully-adequate NLP system needs to be able to handle (which include many
linguistic structures commonly ignored by theoretical linguistics – consider
for instance addresses, weights and measures, the placement of punctuation
marks within grammatical structures, all very significant for practical natural
language processing but scarcely visible within orthodox linguistic
descriptions), a taxonomy enables the system builder to monitor what areas of
the total task he has covered and to focus his efforts on major gaps. And by publicly specifying a ‘default’
analysis for every construction, a taxonomy enables the system builder to put
effort into defining alternative analytic norms only where he has positive
reasons for diverging from the default analysis – at present, for lack of a
public taxonomy, each research group must define its analytic standards
independently from the ground up, or else (as often happens) leave them vague
in many respects.
At the same time, a public taxonomy facilitates the definition of
objective benchmarks allowing the achievements of particular NLP systems to be
measured and expressed in terms that are generally understood: thus encouraging the replacement of
inferior by superior systems, and enabling potential clients for the technology
to assess in advance the scope of systems they are thinking of investing
in. These developments are
essential if natural language processing is to complete the transition from the
status of an academic pastime into a mature component of the information
technology industry. Cf. Sampson
(1992, forthcoming).
The SUSANNE analytic scheme is defined in detail in a book by myself, English
for the Computer,
forthcoming from Oxford University Press.
The Chairman of the Analysis and Interpretation Working Group of the
US/EC-sponsored Text Encoding Initiative has proposed its adoption as a
recognised TEI standard. The
SUSANNE scheme aims to specify annotation norms for the modern English
language; it does not cover other languages, although it is hoped that the
general principles of the SUSANNE scheme may prove helpful in developing
comparable taxonomies for these.
Regrettably, Release 1 of the SUSANNE Corpus is not a ‘TEI-conformant’
resource, though aspects of the annotation scheme have been decided in such a
way as to facilitate a move to TEI conformance in later releases. The working timetable of the Initiative
meant that relevant aspects of the TEI Guidelines were not yet complete at the
point when the SUSANNE Corpus was ready for initial release; delaying this
release would have been unfortunate.
The brief description of the SUSANNE Corpus contained in the remainder
of this article cannot replace the very detailed statements, illustrated with
numerous Brown and LOB Corpus examples, to be found in English for the
Computer; any user aiming
to do serious work with the Corpus or the SUSANNE annotation scheme would
probably need to consult the book.
In a sense, the Corpus is pointless without the book. Nevertheless, prospective users may
find a summary statement helpful, as giving an impression of the scope of the
analytic scheme.
BACKGROUND
The present SUSANNE annotation scheme originated in work carried out by
myself in collaboration with Professor Geoffrey Leech, F.B.A., and others in
the years 1983‑85 to produce a database of manually analysed sentences
from the LOB Corpus of written British English; this database, which has not
been (and will not now be) published, is described in Garside et al. (1987: ch.
7). The annotation scheme of this
‘Lancaster-Leeds Treebank’ represented surface grammar only, without
indications of logical form. It
subsequently seemed desirable to extend this scheme to include methods for
representing logical grammar, and to refine both surface and logical aspects of
the annotation scheme by applying it to a larger body of texts. The only way that a parsing scheme can
in practice be made increasingly adequate is in the way that the English Common
Law develops, by collecting and systematizing the body of precedents generated
through detailed consideration of more and more individual cases that arise in
real life. Accordingly, Project
SUSANNE took a subset of the Brown Corpus of written American English which had
been manually analysed by Alvar Ellegård’s group at Gothenburg (Ellegård 1978),
and reworked the annotations in this under-used resource in order to turn them
into a scheme consistent with that used in the Lancaster-Leeds Treebank but
including specifications of logical as well as surface structure: several categories of information not
indicated in either Lancaster-Leeds or Gothenburg schemes were also added.[3]
The finished SUSANNE parsing scheme has thus been developed on the basis
of samples of both British and American English. It is oriented chiefly towards written language; however, on
another project sponsored by the Royal Signals and Radar Establishment[4] my team produced extensions to the
SUSANNE scheme for annotating the distinctive grammatical phenomena of spoken
English, and these extensions are specified in English for the Computer (though they are not used in the SUSANNE
Corpus and are not discussed further here). It should be noted also that the scheme has emerged through
a process of detailed critical discussion of analytic standards by some ten
people over a decade; apart from myself, the leading role in the early years of
these discussions was taken by Geoffrey Leech, whose standing as an English
grammarian needs no emphasis.
The SUSANNE Corpus itself comprises an approximately 128,000-word subset
of the Brown Corpus of American English, annotated in accordance with the
SUSANNE scheme. The original
motives for producing this database included that of providing better
statistics than any then available[5] for probabilistic automatic-parsing
techniques, such as those of my APRIL annealing parser project.[6]
Statistically-based automatic language processing needs data analysed in
a very consistent fashion, and hence requires a very explicit analytic
scheme. In terms of quantity of
language examples analysed, Project SUSANNE was overtaken after its inception
by projects (notably Mitchell Marcus’s Pennsylvania Treebank project, cf.
chapter 00 of this volume) which have used quasi-industrial methods to generate
far larger bodies of grammatically-analysed material. However, the SUSANNE scheme may be unparalleled in the
extent to which its categories have been refined and tested through detailed
consideration of the almost endless small quirks of the texts to which they
have been applied, and in the degree of precision to which the resulting
guidelines for using the categories have been documented – thus defining
analytic standards which permit annotation of future material to be extremely
self-consistent. Accordingly the
SUSANNE Corpus is offered to the research community primarily as a
demonstration of the application of the parsing scheme, evidencing the fact
that the scheme has survived the test of experience rather than being merely
aprioristic. The SUSANNE Corpus
functions, as it were, like a collection of type specimens appended to a
botanical taxonomy.
Although Release 1 of the SUSANNE Corpus has undergone considerable
proof-checking, it unquestionably still contains many errors.[7]
I aim to issue future releases correcting these; I shall be extremely
grateful if users discovering errors will log them and send me details,
preferably by post rather than e-mail.
STRUCTURE OF THE CORPUS
The SUSANNE Corpus consists of 64 data files together with a
documentation file. Each data file
contains an annotated version of one 2000+ word text from the Brown
Corpus. Files average about 83
kilobytes in size, so the entire Corpus totals about 5.3 megabytes. The data file names are those of the
respective Brown texts, e.g. A01, N18; the documentation file is named
‘SUSANNE.doc’. Sixteen texts are drawn from each of the following Brown genre
categories:
A   press reportage
G   belles lettres, biography, memoirs
J   learned (mainly scientific and technical) writing
N   adventure and Western fiction
The Corpus thus samples each of the four broad genre groups established
on the basis of word-frequency data by Hofland & Johansson (1982: 27).
Each data file has a line (terminating in a newline character) for each
word of the original text; but ‘words’ for SUSANNE purposes are often smaller
than words in the ordinary orthographic sense, for instance punctuation marks
and the apostrophe-s suffix are treated as separate words and assigned lines of
their own. (For details on the
rules by which orthographic words are segmented, as well as on all other
analytic matters mentioned below, see English for the Computer.)
For an example see Figure 1, which displays a short section from file
A10 (part of the analysis of a news report from The Oregonian newspaper).
Each line of a SUSANNE data file has six fields separated by tabs (that
is, there is one tab after each of fields 1 to 5, but a newline after field
6). Each field on every line
contains at least one character.
The six fields on each line are:
1 reference
2 status
3 wordtag
4 word
5 lemma
6 parse
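The record layout just described can be read with a few lines of code. The following is a minimal sketch, not part of the Corpus distribution: the names `SusanneLine` and `parse_line` are hypothetical, and the sample record is made up for illustration rather than copied from file A10.

```python
from typing import NamedTuple


class SusanneLine(NamedTuple):
    """One record of a SUSANNE data file, in field order 1 to 6."""
    reference: str
    status: str
    wordtag: str
    word: str
    lemma: str
    parse: str


def parse_line(line: str) -> SusanneLine:
    """Split one line into its six tab-separated fields.

    Raises ValueError if the line does not have exactly six fields or if
    any field is empty (every field holds at least one character).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    if any(field == "" for field in fields):
        raise ValueError("empty field")
    return SusanneLine(*fields)


# Illustrative record only (invented for this sketch, not quoted from the Corpus).
sample = parse_line("A10:0630e\t-\tVVDi\tseemed\tseem\t[Vd.Vd]\n")
```

Returning a named tuple keeps the six fields addressable by the names used in the list above.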
Apart from the tab and newline characters used to structure fields and
records (that is, lines), all bytes in each of the 64 SUSANNE data files are
drawn from a subset of the 94 graphic character allocations of the
International Reference Version (‘IRV’) of ISO 646:1983 ‘Information Processing
– ISO 7-bit coded character set for information interchange’, from hexadecimal
21 (exclamation mark) to hex 7E (tilde).
These codes are assumed for SUSANNE purposes to represent the graphic
symbols assigned by the IRV system.
Twelve members of the IRV character set are never used in the Corpus,
namely (all codes hexadecimal):
23   gate
24   generalized currency unit
27   prime
2F   solidus
5C   reverse solidus
5E   circumflex
5F   underline
60   grave
7B   opening curly bracket
7C   vertical bar
7D   closing curly bracket
7E   tilde
The space character, hex 20, which is classified by ISO 646 as a control
code, also does not occur in SUSANNE data files.
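The character inventory can be checked mechanically: 94 graphic positions minus the twelve unused codes leaves 82. The sketch below is an illustration only; the names `GRAPHIC`, `UNUSED`, `ALLOWED` and `invalid_characters` are hypothetical, and characters are identified purely by code position.

```python
from typing import List

# The 94 graphic positions, hex 21 (exclamation mark) to hex 7E (tilde).
GRAPHIC = {chr(code) for code in range(0x21, 0x7F)}

# The twelve positions listed above as never used in the Corpus.
UNUSED = {chr(code) for code in (0x23, 0x24, 0x27, 0x2F, 0x5C, 0x5E,
                                 0x5F, 0x60, 0x7B, 0x7C, 0x7D, 0x7E)}

# What remains is the character set available inside a field.
ALLOWED = GRAPHIC - UNUSED


def invalid_characters(field: str) -> List[str]:
    """Return the characters of a field that fall outside the allowed set."""
    return [ch for ch in field if ch not in ALLOWED]
```

A check of this kind would flag, for example, a stray space inside a field, since hex 20 is outside the graphic range.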
Where text characters cannot be adequately represented directly within
the resulting 82-member character set, they are represented by entity names
within angle brackets. Where
possible these are drawn from Appendix D to ISO 8879:1986, ‘Information
Processing – Text & Office Systems – Standard Generalized Markup Language
(SGML)’. For instance, <ldquo>
stands for opening double inverted commas, <eacute> for lower-case ‘e’ with acute
accent. Symbols in angle brackets
are used also to stand for such things as typographical shifts, which for purposes
of grammatical analysis are conveniently represented as items within the
word-sequence: e.g. <bital>
means ‘begin italics’.
REFERENCE FIELD
The reference field contains nine bytes which give each line a reference
number that is unique across the SUSANNE Corpus, e.g. ‘A10:0630e’. The first three bytes (here A10) are
the file name; the fourth byte is always a colon; bytes 5 to 8 (here 0630) are
the number of the line in the ‘Bergen I’ version of the Brown Corpus on which
the relevant word appears (Brown line numbers normally increment in tens, with
occasional odd numbers interpolated); and the ninth byte is a lower-case letter
differentiating successive words that appear on the same Brown line. (SUSANNE lines are lettered
continuously from ‘a’, omitting ‘l’ and ‘o’.)
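As a sketch of how the nine bytes decompose, the function below splits a reference into its three parts. The names are hypothetical, and two points are assumptions of mine rather than statements from the scheme: that file names always have the shape letter plus two digits (as in A01, N18), and that no Brown line carries more SUSANNE words than the 24 available letters.

```python
import re
import string
from typing import Tuple

# Lettering sequence for words on one Brown line: a to z without 'l' and 'o'.
WORD_LETTERS = [c for c in string.ascii_lowercase if c not in "lo"]

# Assumed shape: file name (letter + two digits), colon, four digits, one letter.
REFERENCE = re.compile(r"^([A-Z]\d{2}):(\d{4})([a-z])$")


def parse_reference(reference: str) -> Tuple[str, int, int]:
    """Return (file name, Brown line number, 1-based position on that line)."""
    match = REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"malformed reference: {reference!r}")
    file_name, brown_line, letter = match.groups()
    if letter not in WORD_LETTERS:
        raise ValueError(f"letter {letter!r} is not used in references")
    return file_name, int(brown_line), WORD_LETTERS.index(letter) + 1
```

For the reference quoted above, `parse_reference("A10:0630e")` yields the file name A10, Brown line 630, and position 5 on that line.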
STATUS FIELD
The status field contains one byte. The letters ‘A’ and ‘S’ show that the word is an
‘abbreviation’ or ‘symbol’, respectively, as defined by Brown Corpus codes
(Francis & Kučera 1989: 12). The letter ‘E’ shows that the word is (or is part of) a
misprint or solecism in the original text (details are logged in English for
the Computer). On the great majority of lines, to
which none of these three categories apply, the status field contains a hyphen
character.
WORDTAG FIELD
The SUSANNE wordtag set is based on the ‘Lancaster’ tagset listed in
Garside et al. (1987: Appendix B); additional grammatical distinctions have
been drawn in this set, and these are indicated by suffixing lower-case letters
to the Lancaster tags. For
instance, seemed is
tagged VVD (past tense of verb) in the Lancaster scheme, but VVDi (past tense
of intransitive – including copular – verb) in the SUSANNE scheme. Apart from the lower-case extensions,
the wordtags are normally identical to the Lancaster tags: punctuation marks are assigned
alphabetical tags beginning Y… (e.g. YC for comma), and the dollar sign which
appears in some Lancaster tags for genitive words is replaced by G (e.g. GG for the
apostrophe-s suffix), so that the modified Lancaster tags always consist wholly
of alphanumeric characters, beginning with two capital letters. (In a few cases, tags from the
Lancaster set have been merged or eliminated from the SUSANNE scheme in the
light of experience.)
The tag YG appears in the wordtag field to represent a ‘trace’ –
the logical position of a constituent which has been shifted elsewhere, or
deleted, in the surface grammatical structure.
The SUSANNE wordtag set comprises 352 distinct wordtags, not counting
tags for elements of ‘grammatical idioms’ (see below); a few of these wordtags
are never used in the SUSANNE Corpus.
The wordtags are listed, and their application rigorously defined, in English
for the Computer – in the
case of closed wordclasses, by enumeration of their members, and in the case of
open classes by rules for choice between alternative tags. These rules refer to information in a
specified published dictionary (the Oxford Advanced Learner’s Dictionary of
Current English, 3rd
edition).
WORD FIELD
The word field contains a segment of the text, often coinciding with a
word in the orthographic sense but sometimes, as noted above, including only
part of an orthographic word. In
general the word field represents all and only those typographical distinctions
in the original documents which were recorded in the Brown Corpus (Francis
& Kučera 1989: 10–15), though in certain cases
the SUSANNE Corpus has gone behind the Brown Corpus to reconstruct
typographical details omitted from Brown.
Certain characters have special meanings in the wordfield, as follows:
+ (occurs
only as first byte of the wordfield) shows that the contents of the field were
not separated in the original text from the immediately-preceding text segment
by whitespace (e.g. in the case of a punctuation mark, or part of a hyphenated
sequence split over successive SUSANNE lines);
- the
line corresponds to no text material (it represents the ‘trace’ for a
grammatically-moved element);
< … > enclose entity names for special typographical
features, as discussed above, either taken from ISO 8879:1986 Appendix D or
created for the SUSANNE Corpus – for instance
<pand> stands for
‘either plus sign or ampersand’, since the Brown Corpus makes no distinction
between these characters.
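The conventions for `+` and `-` are enough to reassemble running text from a sequence of word fields. The following is a rough sketch under stated assumptions: the function name is hypothetical, entity names are passed through unexpanded, and a field consisting of a lone hyphen is always taken to be a trace.

```python
from typing import Iterable


def join_word_fields(word_fields: Iterable[str]) -> str:
    """Rebuild a text string from successive SUSANNE word fields."""
    pieces = []
    for field in word_fields:
        if field == "-":
            # A trace corresponds to no text material: skip it.
            continue
        if field.startswith("+"):
            # No whitespace separated this segment from the preceding one.
            pieces.append(field[1:])
        else:
            if pieces:
                pieces.append(" ")
            pieces.append(field)
    return "".join(pieces)


# Invented word sequence for illustration; <ldquo> is left as an entity name.
text = join_word_fields(["<ldquo>", "+Yes", "+,", "-", "he", "said", "+."])
```

Here `text` is `<ldquo>Yes, he said.`: the comma and full stop attach to the preceding segments, and the trace contributes nothing.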
LEMMA FIELD
The lemma field shows the dictionary headword of which the text word is
a form: the field shows base forms
for words which are inflected in the text, and eliminates typographical
variations (such as sentence-initial capitalization) which are not inherent to
the word but relate to its use in context. (In the case of ‘words’ to which the dictionary-form concept
is inappropriate, e.g. numerals and punctuation marks, the lemma field contains
a hyphen.) Orthographic forms in
the lemma field are those of the Oxford Advanced Learner’s Dictionary of
Current English, 3rd
edition.
Project SUSANNE aimed also to indicate the senses which polysemous words
bear in context, via codes relating word-tokens to numbered subsenses in a
specified dictionary. The book English
for the Computer provides
a detailed coding scheme for representing this information. Unfortunately, this aspect of the
project’s output proved to contain a number of inadequacies, and the
information does not appear in Release 1 of the Corpus. It is hoped to include it in later
releases.
PARSE FIELD
The contents of the sixth field represent the central raison d’être of the
SUSANNE Corpus. They code the
grammatical structure of texts as a sequence of labelled trees, having a leaf
node for each Corpus line.
Each text is treated as a sequence of ‘paragraphs’ separated by
‘headings’. (Figure 1 includes one
complete one-sentence paragraph, ending at line A10:0650f, and the first
sentence of the following paragraph.)
A ‘paragraph’ normally coincides with an ordinary orthographic
paragraph; a ‘heading’ may consist of actual verbal material, or may be merely
a typographical paragraph division, symbolized <minbrk> in the word field. Conceptually, the structure of each
paragraph or heading is a labelled tree with root node labelled O (Oh for a heading),
and with a leaf node labelled with a wordtag for each SUSANNE word or trace,
i.e. each line of the Corpus.
There will commonly be many intermediate labelled nodes.
Such a tree is represented as a bracketed string in the ordinary way,
with the labels of nonterminal nodes written ‘inside’ both opening and closing
brackets (that is, to the right of opening brackets and to the left of closing
brackets). This bracketed string
is then adapted as follows for inclusion in successive SUSANNE parse
fields. Wherever an opening
bracket immediately follows a closing bracket, the string is segmented,
yielding one segment per leaf node; and within each such segment, the sequence
opening-bracket + wordtag + closing-bracket, representing the leaf node, is
replaced by full stop. Thus each
parse field contains exactly one full stop, corresponding to a terminal node
labelled with the contents of the wordtag field, sometimes preceded by labelled
opening bracket(s) and sometimes followed by labelled closing bracket(s),
corresponding to higher tagmas which begin or end with the word on the line in
question.
Brackets are square except in the case of nodes immediately dominating
the ‘trace’ wordtag YG, which are represented with angle brackets.
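The adaptation described above is reversible: putting the leaf node back in place of each full stop and concatenating the fields recovers the bracketed string. The sketch below illustrates this under my own assumptions: function names are hypothetical, a leaf is written as the wordtag in plain square brackets, and the three sample rows are invented rather than taken from the Corpus.

```python
from typing import Iterable, Tuple


def expand_parse_field(parse: str, wordtag: str) -> str:
    """Replace the single full stop of a parse field by the leaf node."""
    if parse.count(".") != 1:
        raise ValueError("a parse field contains exactly one full stop")
    return parse.replace(".", "[" + wordtag + "]")


def rebuild_tree(rows: Iterable[Tuple[str, str]]) -> str:
    """Concatenate expanded parse fields; rows are (wordtag, parse) pairs."""
    return "".join(expand_parse_field(parse, tag) for tag, parse in rows)


def is_balanced(tree: str) -> bool:
    """Check that opening and closing brackets (square or angle) pair up."""
    depth = 0
    for ch in tree:
        if ch in "[<":
            depth += 1
        elif ch in "]>":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Invented three-word paragraph: article, noun, intransitive past-tense verb.
rows = [("AT", "[O[S[Ns:s."), ("NN1c", ".Ns:s]"), ("VVDi", "[Vd.Vd]S]O]")]
tree = rebuild_tree(rows)
```

For these rows `tree` is `[O[S[Ns:s[AT][NN1c]Ns:s][Vd[VVDi]Vd]S]O]`, and segmenting that string wherever an opening bracket follows a closing bracket gives back exactly the three fields.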
Nonterminal node labels in the SUSANNE scheme contain up to three types
of information: a formtag, a functiontag, and an index, in that order. In a label containing a formtag and one or both of the other
two elements, a colon separates the formtag from the other elements. A functiontag is always a single
alphabetic character, and an index is a sequence of three digits; restrictions
on valid combinations of elements within a node label mean that complex labels
can always be unambiguously decomposed into their elements.
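A label such as Ns:S152 or Ti:s can be taken apart along these lines. The sketch below is deliberately simplified and its names are hypothetical. It does not implement the full restrictions on valid combinations, which are not given here; in particular, treating a colon-free label that ends in three digits (such as s152) as a label without a formtag is an assumption of mine.

```python
import re
from typing import NamedTuple, Optional


class NodeLabel(NamedTuple):
    formtag: Optional[str]
    functiontag: Optional[str]
    index: Optional[str]  # kept as a string so leading zeros survive


# Functiontag: one letter; index: three digits; either may be absent.
_ROLE = re.compile(r"^([A-Za-z])?(\d{3})?$")
# Assumed shape of a label with no formtag (e.g. on a node above a trace).
_BARE_ROLE = re.compile(r"^([A-Za-z])?(\d{3})$")


def split_label(label: str) -> NodeLabel:
    """Decompose a nonterminal label into formtag, functiontag and index."""
    if ":" in label:
        formtag, rest = label.split(":", 1)
        match = _ROLE.match(rest)
        if not formtag or not rest or match is None:
            raise ValueError(f"cannot decompose label: {label!r}")
        return NodeLabel(formtag, match.group(1), match.group(2))
    match = _BARE_ROLE.match(label)
    if match is not None:
        return NodeLabel(None, match.group(1), match.group(2))
    return NodeLabel(label, None, None)
```

With these assumptions, `split_label("Ns:S152")` gives formtag Ns, functiontag S and index 152, while `split_label("Ti:s")` gives formtag Ti and functiontag s with no index.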
RANKS OF CONSTITUENT
Apart from nodes immediately dominating traces, all
nodes have labels including formtags, which identify the internal properties of
the word or word-sequence dominated by the node. The shape of a parse-tree is defined in terms of a hierarchy
of formtag ranks:
1   wordlevel formtags (begin with two capital letters; formtags of all other ranks begin with one capital and contain no further capitals)
2   phraselevel formtags (begin with one of: N V J R P D M G)
3   clauselevel formtags (begin with one of: S F T Z L A W)
4   rootlevel formtags (begin with one of: O Q I)
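Because rank is signalled by the opening characters of a formtag, it can be determined without a lookup table. A minimal sketch follows; the function and constant names are hypothetical.

```python
import string

PHRASE_INITIALS = set("NVJRPDMG")
CLAUSE_INITIALS = set("SFTZLAW")
ROOT_INITIALS = set("OQI")


def formtag_rank(formtag: str) -> str:
    """Classify a formtag by rank from its initial characters."""
    # Two leading capitals mark a wordlevel formtag; every other rank
    # starts with a single capital and contains no further capitals.
    if len(formtag) >= 2 and all(c in string.ascii_uppercase for c in formtag[:2]):
        return "wordlevel"
    initial = formtag[:1]
    if initial in PHRASE_INITIALS:
        return "phraselevel"
    if initial in CLAUSE_INITIALS:
        return "clauselevel"
    if initial in ROOT_INITIALS:
        return "rootlevel"
    raise ValueError(f"unrecognised formtag: {formtag!r}")
```

The two-capitals test has to come first, since a wordtag such as VVGi begins with a letter that is also a phraselevel initial.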
Each grammatical clause, whether consisting of one or more words, is
given a node labelled with a clauselevel formtag. Each immediate constituent of a clause, whether there are
one or more such constituents and whether the constituent consists of one or
more words, is given a node labelled with a phraselevel formtag, unless the
constituent belongs to a wordlevel category that has no corresponding
phraselevel category (e.g. punctuation marks, existential there, conjunctions), or to a rootlevel
category (e.g. a direct quotation, formtagged Q).
Thus a clause consisting just of a verb will be assigned a clauselevel
formtag (e.g. Tg for present-participle clause) which singularily
dominates a phraselevel formtag (e.g. Vg for ‘verb group beginning with present
participle’) which in turn singularily dominates a wordlevel formtag (e.g. VVGi for ‘present
participle of intransitive verb’).
Other than by these rules, and in certain other limited circumstances
specified in English for the Computer, singulary branching does not occur. An intermediate phraselevel node is
inserted between a higher phraselevel node and a sequence of words dominated by
it only if two or more of those words form a coherent constituent within the
higher phrase. A clause which
fills a slot standardly filled by a phrase (e.g. a nominal clause as subject or
object) will not have a phrase node above the clause node unless the clause
proper is preceded and/or followed by modifying elements that are not part of
the clause.
Detailed rules for deciding constituency in various debatable cases, for
placing items such as punctuation marks within parse trees, for extending the
application of the categories and structuring rules to linguistic phenomena
such as addresses or weights and measures which are not commonly taken into
account in linguistic theorizing, and so forth, are laid down in English for
the Computer.
FUNCTIONTAGS AND INDICES
Functiontags, identifying roles such as surface subject, logical object,
time adjunct, are assigned to all immediate constituents of clauses, except for
their verb-group heads and certain other constituents for which function
labelling is inappropriate.
Indices are assigned to pairs of nodes to show referential identity
between items which are in certain defined grammatical relationships to one
another. Thus, in Figure 1, the
sequence ‘feeling that evacuation plans … would not work’ is given the label Ns:S152, in which
the formtag Ns identifies the tagma as a singular noun phrase (note
that in context ‘feeling’ occurs in its nominal use – one might have expected the sentence to read ‘a feeling
…’, but corpus linguistics takes language as it finds it), the capital
S shows that the tagma is
surface subject of the ‘seemed’ clause (in an existential clause the subject, which determines verb
agreement, standardly follows the verb), and the index 152 shows that this tagma is identifiable
with the logical subject (s152) of the ‘to be’ clause.
The label Ti:s on this latter clause shows that it is an infinitival
clause (formtag Ti), which as a whole, including its displaced logical
subject, forms the logical subject (s) of ‘seemed’.
In some cases, movement rules displace a constituent into a tagma within
which it has no grammatical role (for instance, an adverb which is logically a
clause constituent may interrupt the verb group – sequence of auxiliary verbs
and main verb – which heads the clause):
in such cases the functiontag is G (‘guest’). Constituents which do not logically belong below the node
which immediately dominates them in surface structure are always given G functiontags
and indices linking them to their logical position. With that exception (and with one other exception not
discussed here relating to co-ordination), functiontagging is used only for
immediate constituents of clauses.
English for the Computer lists the types of surface/logical-grammar discordance which are
represented by the SUSANNE scheme, and the approved methods of representing
them. The SUSANNE analysis is
always chosen so as to be as far as possible neutral as between alternative
linguistic theories.
THE FORMTAGS
The SUSANNE formtags are as follows:
Rootlevel Formtags
O    paragraph
Oh   heading
Ot   title (e.g. of book)
Q    quotation
I    interpolation
Iq   tag question
Iu   scientific citation
Clauselevel Formtags
S    main clause
Ss   quoting clause embedded within quotation
Fa   adverbial clause
Fn   nominal clause
Fr   relative clause
Ff   ‘fused’ relative
Fc   comparative clause
Tg   present participle clause
Ti   infinitival clause
Tn   past participle clause
Tf   for-to clause
Tb   ‘bare’ nonfinite clause
Tq   infinitival relative clause
Z    reduced (‘whiz-deleted’) relative clause
L    other verbless clause
A    special as clause
W    with clause
Phraselevel Formtags
N    noun phrase
V    verb group
J    adjective phrase
R    adverb phrase
P    prepositional phrase
D    determiner phrase
M    numeral phrase
G    genitive phrase
The various phrase categories take lower-case subcategory symbols which
can be combined in any meaningful combination (e.g. the verb group ‘would not
work’ is formtagged Vdce). The phrase subcategories are:
Vo   operator section of verb group, when separated from remainder of verb group e.g. by subject-auxiliary inversion
Vr   remainder of verb group from which operator has been separated
Vm   verb group beginning with am
Va   verb group beginning with are
Vs   verb group beginning with was
Vz   verb group beginning with other 3rd-singular verb
Vw   verb group beginning with were
Vj   verb group beginning with be
Vd   verb group beginning with past tense
Vi   infinitival verb group
Vg   verb group beginning with present participle
Vn   verb group beginning with past participle
Vc   verb group beginning with modal
Vk   verb group containing emphatic DO
Ve   negative verb group
Vf   perfective verb group
Vu   progressive verb group
Vp   passive verb group
Vb   verb group ending with BE
Vx   verb group lacking main verb
Vt   catenative verb group
Nq   wh- noun phrase
Nv   wh…ever noun phrase
Ne   I/me head
Ny   you head
Ni   it head
Nj   adjective head
Nn   proper name
Nu   unit noun head
Na   marked as subject
No   marked as nonsubject
Ns   singular noun phrase
Np   plural noun phrase
Jq   wh- adjective phrase
Jv   wh…ever adjective phrase
Jx   measured absolute adjective phrase
Jr   measured comparative adjective phrase
Jh   postmodified adjective phrase
Rq   wh- adverb phrase
Rv   wh…ever adverb phrase
Rx   measured absolute adverb phrase
Rr   measured comparative adverb phrase
Rs   adverb conducive to asyndeton
Rw   quasi-nominal adverb
Po   of phrase
Pb   by phrase
Pq   wh- prepositional phrase
Pv   wh…ever prepositional phrase
Dq   wh- determiner phrase
Dv   wh…ever determiner phrase
Ds   singular determiner phrase
Dp   plural determiner phrase
Ms   phrase headed by one
Subcategory symbols are not included if implied by more specific
subcategories, thus a verb group beginning was will be labelled Vs, not Vsd.
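Since each subcategory symbol is a single lower-case letter, a combined formtag such as Vdce can be read off letter by letter. The sketch below covers only the verb-group subcategories listed above; the names are hypothetical, and formtags carrying non-alphanumeric suffixes are not handled.

```python
from typing import List

VERB_GROUP_SUBCATEGORIES = {
    "o": "operator section of verb group",
    "r": "remainder of verb group from which operator has been separated",
    "m": "verb group beginning with am",
    "a": "verb group beginning with are",
    "s": "verb group beginning with was",
    "z": "verb group beginning with other 3rd-singular verb",
    "w": "verb group beginning with were",
    "j": "verb group beginning with be",
    "d": "verb group beginning with past tense",
    "i": "infinitival verb group",
    "g": "verb group beginning with present participle",
    "n": "verb group beginning with past participle",
    "c": "verb group beginning with modal",
    "k": "verb group containing emphatic DO",
    "e": "negative verb group",
    "f": "perfective verb group",
    "u": "progressive verb group",
    "p": "passive verb group",
    "b": "verb group ending with BE",
    "x": "verb group lacking main verb",
    "t": "catenative verb group",
}


def describe_verb_group(formtag: str) -> List[str]:
    """List the subcategory glosses encoded in a verb-group formtag."""
    if not formtag.startswith("V"):
        raise ValueError(f"not a verb-group formtag: {formtag!r}")
    glosses = []
    for symbol in formtag[1:]:
        if symbol not in VERB_GROUP_SUBCATEGORIES:
            raise ValueError(f"unknown subcategory symbol: {symbol!r}")
        glosses.append(VERB_GROUP_SUBCATEGORIES[symbol])
    return glosses
```

Applied to Vdce, this returns the glosses for d, c and e in that order: past tense, modal, negative.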
NON-ALPHANUMERIC FORMTAG SUFFIXES
Formtags may also contain non-alphanumeric symbols, including:
?   interrogative clause
*   imperative clause
%   subjunctive clause
!   exclamatory clause or other item
"   vocative item
Other non-alphanumeric symbols represent co-ordination structure. Under the SUSANNE scheme, second and
subsequent conjuncts in a co-ordination are analysed as subordinate to the
first conjunct; thus a co-ordination of the form:
c, y, and w
(where c, y, etc. are word-sequences of any
grammatical rank) would be assigned a structure of the form:
[c, [y], [and w]]
The formtag of the entire co-ordination is determined by the properties
of the first conjunct (except for singular/plural subcategories in the case of
phrase categories to which these apply); the later conjuncts (which will often
be grammatically reduced) have nodes of their own whose formtags mark them as
‘subordinate conjuncts’. The
following symbols relate to co-ordination (and apposition) structure:
+   subordinate conjunct introduced by conjunction
-   subordinate conjunct not introduced by conjunction
@   appositional element
&   co-ordinate structure acting as first conjunct within a higher co-ordination (marked in certain cases only)
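The shape of a co-ordination under this analysis can be illustrated with nested lists. This is a loose sketch, not the scheme's own procedure. The function name is hypothetical; punctuation is ignored; the conjunction is assumed to introduce only the final conjunct; and the labels of later conjuncts are formed simply by suffixing `-` or `+` to a formtag supplied by the caller, which glosses over the subcategory details of real formtags.

```python
from typing import List, Sequence, Union

Tree = List[Union[str, "Tree"]]


def coordinate(formtag: str, conjuncts: Sequence[str],
               conjunction: str = "and") -> Tree:
    """Build [label, first, [label-, c2], ..., [label+, conjunction, cN]]."""
    if not conjuncts:
        raise ValueError("at least one conjunct is required")
    first, later = conjuncts[0], conjuncts[1:]
    # The whole co-ordination takes its label from the first conjunct.
    tree: Tree = [formtag, first]
    for position, conjunct in enumerate(later):
        if position == len(later) - 1:
            # Final conjunct: introduced by the conjunction, marked '+'.
            tree.append([formtag + "+", conjunction, conjunct])
        else:
            # Medial conjunct: no conjunction, marked '-'.
            tree.append([formtag + "-", conjunct])
    return tree
```

Each later conjunct hangs below the node of the first conjunct, so that the first conjunct's properties label the whole structure.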
Co-ordination is recognised as occurring between words as well as
between higher-rank tagmas; Figure 1 contains no example, but for instance in ‘he
bought apples and bananas’
the phrase ‘apples and bananas’ would be analysed as a simple noun phrase singularily dominating a
co-ordination of nouns, rather than as a co-ordination of one-word noun
phrases. Therefore nonterminal
nodes may have formtags consisting of wordtags followed by co-ordination
symbols, thus (using WT to stand for an arbitrary wordtag):
WT&   co-ordination of words
WT+   conjunct within wordlevel co-ordination that is introduced by a conjunction
WT-   conjunct within wordlevel co-ordination not introduced by a conjunction
(A wordlevel co-ordination always takes an ampersand on its formtag;
phrase or clause co-ordinations do so only in very restricted circumstances.)
Also, certain sequences of orthographic words, in certain uses, are
regarded as functioning grammatically as single words (‘grammatical
idioms’). For instance, ‘in keeping with’ is normally
treated as a grammatical idiom, equivalent to a single preposition (for which
the wordtag is II). In
such cases, the nonterminal node dominating the sequence has a formtag
consisting of an equals sign suffixed to the corresponding wordtag; and the
individual words composing the idiom are not wordtagged in their own right, but
receive tags with numerical suffixes reflecting their membership of an
idiom. (The sequence in keeping
with is formtagged II=, and the words
in, keeping, with in this context are wordtagged II31 II32 II33.) English for the Computer includes exhaustive listings of
closed-class grammatical idioms.
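The idiom-tagging convention can be illustrated with a short Python sketch. The reading of the numerical suffix assumed here (first digit: number of words in the idiom; second digit: position of the word within it) is inferred from the single example II31 II32 II33 and is not spelled out in the text; the function names and the restriction to idioms of two to nine words are likewise assumptions of the sketch.

```python
def idiom_formtag(wordtag):
    """Formtag for the nonterminal node dominating a grammatical idiom:
    the corresponding wordtag with an equals sign suffixed."""
    return wordtag + "="


def idiom_wordtags(wordtag, words):
    """Return one tag per word of the idiom.

    Assumed suffix scheme: <number of words><position>, each a single
    digit, as suggested by the example II31 II32 II33.
    """
    n = len(words)
    if not 2 <= n <= 9:
        raise ValueError("sketch handles idioms of 2 to 9 words only")
    return [f"{wordtag}{n}{i}" for i in range(1, n + 1)]


words = ["in", "keeping", "with"]
print(idiom_formtag("II"))           # II=
print(idiom_wordtags("II", words))   # ['II31', 'II32', 'II33']
```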
Note that formtags of the forms WT&, WT+, WT- and WT= rank as wordlevel
formtags for the purposes of determining tree structure as discussed above.
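A program processing the annotation might recognise these wordlevel formtags along the following lines. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the set of known wordtags has to be supplied by the caller, since nothing in this section says how a wordtag can be told apart from a phrase or clause tag by its spelling alone.

```python
# Suffixes that turn a wordtag into a wordlevel formtag, as listed above.
WORDLEVEL_SUFFIXES = {
    "&": "co-ordination of words",
    "+": "conjunct introduced by a conjunction",
    "-": "conjunct not introduced by a conjunction",
    "=": "grammatical idiom",
}


def split_wordlevel_formtag(formtag, known_wordtags):
    """Return (wordtag, suffix) if `formtag` is a known wordtag followed by
    one of the wordlevel suffixes; otherwise return None."""
    if len(formtag) < 2:
        return None
    base, suffix = formtag[:-1], formtag[-1]
    if suffix in WORDLEVEL_SUFFIXES and base in known_wordtags:
        return base, suffix
    return None


# "II" is the preposition wordtag mentioned in the text; "WT" is the
# text's placeholder for an arbitrary wordtag.
known = {"II", "WT"}
print(split_wordlevel_formtag("II=", known))   # ('II', '=')
print(split_wordlevel_formtag("WT&", known))   # ('WT', '&')
print(split_wordlevel_formtag("II", known))    # None
```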
THE FUNCTIONTAGS
Functiontags divide into complement and adjunct tags:
broadly, a given complement tag can occur at most once in any clause,
but a clause may contain multiple adjuncts of the same type.
It was originally planned to classify complements in terms of some
version of Fillmorean Case Grammar.
The most fully worked-out version of case theory, including specimen
case frames for numerous English verbs and other predicates, is that of
Stockwell et al. (1973), and the SUSANNE team set out to develop this into a
scheme capable of specifying an unambiguous case assignment for all complements
found in corpus material. After
strenuous and protracted efforts, this attempt failed; the nature of the
logical relationships which various predicates in real-life usage contract with
their arguments proved too diverse to handle in this fashion, and the team
believe that they have ‘tested to destruction’ the hypothesis that core clause
structure in English can adequately be described in terms of a limited set of
‘cases’. Instead, the finished
SUSANNE Corpus classifies complements in terms of the semantically less
informative, but more predictable, traditional concepts of subject and object.
The scheme of adjunct categories has been developed from the
classification of Quirk et al. (1985), though some modifications have been
introduced in the light of experience in applying the categories to corpus
data.
Complement Functiontags
s   logical subject
o   logical direct object
S   surface (and not logical) subject
O   surface (and not logical) direct object
i   indirect object
u   prepositional object
e   predicate complement of subject
j   predicate complement of object
a   agent of passive
n   particle of phrasal verb
z   complement of catenative
x   relative clause having higher clause as antecedent
G   ‘guest’ having no grammatical role within its tagma
Adjunct Functiontags
p   place
q   direction
t   time
h   manner or degree
m   modality
c   contingency
r   respect
w   comitative
k   benefactive
b   absolute
Detailed guidelines for the application of these functional categories
are included in English for the Computer.
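The two tables, together with the broad principle that a given complement tag occurs at most once in a clause while adjunct tags may repeat, can be expressed as a simple consistency check. The sketch below is an illustration only: the function name and the idea of passing a clause as a flat list of its constituents' functiontags are assumptions, and since the principle is stated only as holding ‘broadly’, the check flags repetitions for inspection rather than treating them as errors.

```python
from collections import Counter

COMPLEMENT_TAGS = {
    "s": "logical subject",
    "o": "logical direct object",
    "S": "surface (and not logical) subject",
    "O": "surface (and not logical) direct object",
    "i": "indirect object",
    "u": "prepositional object",
    "e": "predicate complement of subject",
    "j": "predicate complement of object",
    "a": "agent of passive",
    "n": "particle of phrasal verb",
    "z": "complement of catenative",
    "x": "relative clause having higher clause as antecedent",
    "G": "'guest' having no grammatical role within its tagma",
}

ADJUNCT_TAGS = {
    "p": "place", "q": "direction", "t": "time", "h": "manner or degree",
    "m": "modality", "c": "contingency", "r": "respect",
    "w": "comitative", "k": "benefactive", "b": "absolute",
}


def repeated_complements(clause_functiontags):
    """Return the complement tags occurring more than once among the
    functiontags of one clause's constituents (adjuncts are ignored)."""
    counts = Counter(t for t in clause_functiontags if t in COMPLEMENT_TAGS)
    return sorted(tag for tag, n in counts.items() if n > 1)


print(repeated_complements(["s", "o", "t", "t", "p"]))   # []
print(repeated_complements(["s", "s", "o", "t"]))        # ['s']
```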
APPENDIX: How to retrieve a copy of the SUSANNE Corpus
[Since this article was published, I have abandoned the Oxford Text Archive as a distribution centre for up-to-date versions of the SUSANNE Corpus and similar resources, in favour of a server under my own control. To get hold of a copy of the most up-to-date version of the Corpus at any time, follow the link to ‘downloadable research resources’ from my home page at www.grsampson.net and follow the instructions given there.]
REFERENCES
Ellegård, A. 1978. The Syntactic Structure of English Texts. Gothenburg Studies in English 43. Gothenburg: Acta Universitatis Gothoburgensis.
Francis, W.N. and H. Kučera. 1989. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (corrected and revised edition). Providence, Rhode Island: Department of Linguistics, Brown University.
Garside, R.G., G.N. Leech, and G.R. Sampson (eds.). 1987. The Computational Analysis of English. London: Longman.
Hofland, K. and S. Johansson. 1982. Word Frequencies in British and American English. London: Longman.
Quirk, R., S. Greenbaum, G.N. Leech, and J. Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Sampson, G.R. 1991. Analysed corpora of English: a consumer guide. In Computers in Applied Linguistics, ed. by Martha Pennington and V. Stevens [dated 1992 but published in 1991]. 181-200. Clevedon, Avon: Multilingual Matters.
Sampson, G.R. 1992. Probabilistic parsing. In Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, ed. by J. Svartvik. 425-47. Berlin: Mouton de Gruyter.
Sampson, G.R. Forthcoming. The need for grammatical stocktaking. To appear in Proceedings of the 1992 Pisa Symposium on European Textual Corpora, ed. by N. Ostler.
Stockwell, R.P., P. Schachter, and Barbara Hall Partee. 1973. The Major Syntactic Structures of English. New York: Holt, Rinehart and Winston.
[1] The support of the Economic and Social Research Council (UK) is gratefully acknowledged. Project SUSANNE, ‘Construction of an Analysed Corpus of English’, was funded by ESRC award no. R00023 1142 from 1988 to 1992. ‘SUSANNE’ stands for ‘Surface and underlying structural analyses of naturalistic English’. I should like to express my warmest thanks to the team who worked on Project SUSANNE, namely Robin Haigh, Hélène Knight, Tim Willis, and Nancy Glaister, and to David Tugwell who also contributed to the SUSANNE scheme.
[2] Note that a sharp distinction is drawn here between the terms ‘scheme’ and ‘system’. A ‘parsing scheme’, or ‘analytic scheme’, refers to a range of notations and guidelines for using them which prescribe to a human analyst what the appropriate grammatical annotation for a language example should be. A parsing ‘system’ on the other hand refers to a software system which automatically produces analyses (according to some parsing scheme) of input language examples. A parsing scheme defines the target which a parsing system hits (or fails to hit). The SUSANNE Corpus represents part of the definition of a parsing scheme. It has been produced largely manually, not as the output of an automatic parsing system.
[3] I thank Alvar Ellegård for permission to circulate a research resource derived from the work of his group.
[4] APRIL Phase 2, ‘A speech-oriented stochastic parser’: see footnote 6 below.
[5] Analysed corpora available at the outset of Project SUSANNE are surveyed in Sampson (1991).
[6] Phases 1 (1986–9) and 2 (1989–91) of Project APRIL were sponsored by the Royal Signals and Radar Establishment (Ministry of Defence), under MoD contracts nos. D/ER/1/9/4/2062/0128 and D/ER/1/9/4/2062/0151. APRIL Phase 3 (1992–95), ‘A full natural language annealing parser’, which is to produce a self-contained annealing parser system suitable for distribution to and evaluation by the research community, is sponsored jointly by the Science and Engineering Research Council (UK) and the UK Ministry of Defence, under grant no. GR/J06108.
[7] For instance, there are numerous incorrect attachments of postmodifying phrases in Release 1.