The SUSANNE Corpus: Documentation

Release 5, 2000.08.11

Geoffrey Sampson
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England


Change of Address

In the past, the SUSANNE Corpus and other language-engineering resources produced by my research team have been scattered at different internet locations not all under my control, and they have more than once been shifted to new addresses without notification to me. I apologize to users for the frustrations this has sometimes caused. To avoid such problems in future, I have acquired my own internet domain, which I intend to maintain indefinitely. My home page has now moved to:

http://www.grsampson.net/

From now on this will always include a pointer to a list of the current locations of the SUSANNE Corpus and other downloadable research resources produced under my direction. In due course, those resources may themselves be shifted into the grsampson.net domain.

Contents

  • Version Information
  • 1. Introduction
  • 2. Structure of the Corpus
  • 3. The Lexicon File
  • 4. The Text Files
  • 4.1 Field Structure
  • 4.2 Reference Field
  • 4.3 Status Field
  • 4.4 Wordtag Field
  • 4.5 Word Field
  • 4.6 Lemma Field
  • 4.7 Parse Field
  • 4.8 Ranks of Constituent
  • 4.9 Functiontags and Indices
  • 4.10 The Formtags
  • 4.11 Non-alphanumeric Formtag Suffixes
  • 4.12 The Functiontags
  • 5. Innovations in Release 5
  • 5.1 Correction of Analytic Errors
  • 5.2 Additional File
  • 5.3 Reference Field Format
  • 5.4 Subdivision of Texts
  • 5.5 Revisions to the Annotation Scheme
  • 6. Errors in the Source Texts
  • 7. Sources
  • Notes
  • References
  • URL List

    Version Information

    Release 5 is the first new release of the SUSANNE Corpus for almost six years, and incorporates larger changes than did previous releases. Rather than being detailed here, these changes are discussed in a separate section of this document, "Innovations in Release 5", §5 below.

    Release 4 of 1994.11.07 corrected a handful of errors discovered in checking the proofs of English for the Computer and otherwise.

    Release 3 of 1994.04.04 corrected errors which came to light during the process of finalizing the MS of the book English for the Computer. One proofreading technique applied in the creation of Release 3 was to read through the entire Corpus text printed in a format which used indentation to display the parse-field bracketing structure, in order to catch structural errors such as inappropriate placement of postmodifier constituents within parse trees. Also, this documentation file was provided with a detailed listing of misprints and similar errors in the Corpus texts, showing which of them stem from the original publications (and are therefore preserved in the SUSANNE Corpus), and which were introduced in the work of creating the Brown Corpus (and have accordingly been eliminated from SUSANNE).

    Release 2, dated 1993.06.02, corrected a number of errors found in Release 1; I am grateful to all those users who helped to find them. It also contained one minor change in annotation conventions: in the parse field, from Release 2 onwards all node labels are written within square brackets (Release 1 included a redundant distinction between square brackets for ordinary nodes and angle brackets for "ghost" (or "trace") nodes, which are distinguished in several other ways). This documentation file now includes a listing of the text sources on which the Corpus is based, and incorporates some minor changes in wording.

    Release 1 of the SUSANNE Corpus was completed on 1992.09.06.

    1. Introduction

    The SUSANNE Corpus was created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive language-engineering-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.¹ The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis. The SUSANNE scheme may be likened to a "Linnaean taxonomy" of the grammatical domain: its aim (comparable to that of Linnaeus's eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers' linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for language-engineering researchers to register everything that occurs in real-life usage systematically and unambiguously, and for researchers at different sites to exchange empirical grammatical data without misunderstandings over local uses of analytic terminology. On reasons why such a scheme is needed at the present juncture in language-engineering research, see e.g. Sampson (1992, 2001 ch. 6).

    Note that a sharp distinction is drawn here between the terms scheme and system. A "parsing scheme", or "analytic scheme", refers to a range of notations and guidelines for using them which prescribe to a human analyst what the appropriate grammatical annotation for a language example should be. A parsing "system" on the other hand refers to a software system which automatically produces analyses (according to some parsing scheme) of input language examples. A parsing scheme defines the target which a parsing system hits (or misses). The SUSANNE Corpus represents part of the definition of a parsing scheme. It has been produced largely manually, not as the output of an automatic parsing system.

    The SUSANNE analytic scheme is defined in detail in a book by myself, English for the Computer, published under the Clarendon imprint of Oxford University Press in 1995 (Sampson 1995 - abbreviated below as EFC). The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these.

    Although I collaborated during the later stages of developing the SUSANNE Corpus with the US/EU-sponsored Text Encoding Initiative (URL 1), the SUSANNE Corpus is not a "TEI-conformant" resource. Various aspects of the annotation scheme were decided in such a way as to facilitate a possible move to TEI conformance in later releases, but the working timetable of the Initiative meant that relevant aspects of the TEI Guidelines were not yet complete at the point when the SUSANNE Corpus was ready for release. The TEI Guidelines are in any case very general, but at the time of writing it seems possible that the Ide/Véronis "Corpus Encoding Standard" (URL 2) may become a special case of the TEI system which achieves recognition as a standard way of encoding this type of information. If that happens, it is my intention to circulate a CES-conformant version of SUSANNE alongside the current version. (However, I believe many users will always prefer to work with the Corpus in its current fixed-field format.)

    The brief description of the SUSANNE Corpus which follows cannot replace the very detailed statements to be found in EFC, and any user aiming to do serious work with the Corpus or its annotation scheme would need to consult the book. Nevertheless, it may be useful to have a summary statement included with the electronic Corpus.

    The present SUSANNE annotation scheme originated in work carried out by myself in collaboration with Prof. Geoffrey Leech FBA and others in the years 1983-85 to produce a database of manually analysed sentences from the LOB Corpus of written British English, as a source of statistics for probabilistic automatic-parsing techniques; this database, which has not been (and will not now be) published, is described in Garside et al. (1987: ch. 7). The annotation scheme of this "Lancaster-Leeds Treebank" represented surface grammar only, without indications of logical form. It subsequently seemed desirable to extend this scheme to include methods for representing logical grammar, and to refine both surface and logical aspects of the annotation scheme by applying it to a larger body of texts. The only way that a parsing scheme can in practice be made increasingly adequate is in the way that the English Common Law develops, by collecting and systematizing the body of precedents generated through detailed consideration of more and more individual cases that arise in real life. Accordingly, the SUSANNE Project took a subset of the Brown Corpus of written American English which had been manually analysed by Alvar Ellegård's group at Gothenburg (Ellegård 1978), and reworked the annotations in this under-used resource in order to turn them into a scheme consistent with that used in the Lancaster-Leeds Treebank but including specifications of logical as well as surface structure: several categories of information not indicated in either Lancaster-Leeds or Gothenburg schemes were also added.² (For the Brown and LOB Corpora, see e.g. Garside et al. (1987: 4-5), URL 3.)

    The SUSANNE analytic scheme has thus been developed on the basis of samples of both British and American English. It was initially oriented towards written language only, and the SUSANNE Corpus contains exclusively written-language samples. However, in later work sponsored first by the Royal Signals and Radar Establishment, and more recently by the Economic and Social Research Council, my team has produced extensions to the scheme for annotating the distinctive structural phenomena of the spoken language, and has applied these to samples of recent spontaneous spoken English (the CHRISTINE Corpus, URL 4). The first stage of the CHRISTINE Corpus, comprising analyses of a demographically balanced cross-section of English spoken in all parts of the UK within the last decade, was released in August 1999 (URL 5), and is one of the first analysed speech corpora to become available anywhere in the English-speaking world (or, so far as I know, outside it). The speech-related aspects of the analytic scheme are outlined in ch. 6 of EFC, and discussed in greater detail in the CHRISTINE documentation file (which is available as a Web page at URL 6).

    It should be noted that the SUSANNE analytic scheme has emerged through a process of detailed critical discussion of analytic standards by about a dozen people over almost twenty years. Apart from myself, the leading role in the early years of these discussions was taken by Geoffrey Leech, whose standing as an English grammarian needs no emphasis.

    The SUSANNE Corpus itself comprises an approximately 130,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme. The original motives for producing this database included that of providing better statistics for probabilistic parsing; but in this respect the SUSANNE Project was overtaken after its inception by projects (notably Mitchell Marcus's Pennsylvania Treebank project, cf. Marcus et al. (1993), URL 7) which have used quasi-industrial methods to generate far larger bodies of grammatically-analysed material. However, the SUSANNE scheme may be unparalleled in the extent to which its categories have been refined and tested through detailed consideration of the almost endless small quirks of the texts to which they have been applied, and in the degree of precision to which the resulting guidelines for using the categories have been documented - thus defining analytic standards which permit annotation of future material to be extremely self-consistent. The SUSANNE scheme has been winning a measure of international recognition in this respect; for instance, a 1995 report of the European Union EAGLES language-engineering standards initiative described it as "a unique achievement", and D. Terence Langendoen, then President of the Linguistic Society of America, wrote in 1997 that its "detail ... is unrivalled" (Langendoen 1997: 600).³

    Accordingly the SUSANNE Corpus is offered to the research community primarily as a demonstration of the application of the parsing scheme, evidencing the fact that the scheme has survived the test of experience rather than being a merely aprioristic system. The SUSANNE Corpus functions, as it were, like a collection of type specimens appended to a botanical taxonomy.

    Although the Corpus itself was created as ancillary to the parsing scheme, it has been pleasing, during the years since its initial release, to find that it has been widely used as a research resource in its own right, often by researchers and groups very distant from the site where it was created. There are no legal restrictions on copying or using the SUSANNE Corpus, though it would not be a friendly act for an individual or agency other than the Oxford Text Archive to set up as an alternative distributor, or (without permission) to exploit the information in the Corpus in order to promote a rival annotation scheme. Any person or group publishing work based on the SUSANNE Corpus is requested to acknowledge the roles of the Economic and Social Research Council (UK) as sponsor and the University of Sussex as grantholder.

    Each successive release of the SUSANNE Corpus has eliminated errors discovered in earlier releases. The number of errors found and corrected between releases has fallen very considerably as the years have gone by, but there will undoubtedly still be some left. I shall be extremely grateful if users discovering errors will log them and send me details.

    2. Structure of the Corpus

    Release 5 of the SUSANNE Corpus comprises 66 files: 64 text files, the lexicon file described in §3 below, and this documentation file.

    Sixteen texts are drawn from each of the following Brown genre categories:

    A press reportage
    G belles lettres, biography, memoirs
    J learned (mainly scientific and technical) writing
    N adventure and Western fiction

    The Corpus thus samples each of the four broad genre groups established on the basis of word-frequency data by Hofland & Johansson (1982: 27). For publication details of the original texts, see §7, "Sources", below.

    The text files average about 83 kilobytes in size; the entire Corpus totals about six megabytes. The names of the text files are those of the respective Brown texts, e.g. A01, N18.

    The Corpus comprises data files only, structured in a way that makes the task of extracting information as straightforward as possible. We do not see it as part of our task to produce special-purpose software for data extraction. We could not do that, since we have no way of knowing what sorts of questions future researchers will want to pose to our data. (SUSANNE has been used for various kinds of research that I had no thought of when I first put it into circulation.)

    This last point seems worth making, because since the first release of SUSANNE I have more than once encountered comments suggesting that, in failing to supply accompanying utility software, we left a job half done. In response, let me quote remarks I made in a recent book review (Sampson 1998: 365) about the approach which sees utility software as an essential accompaniment to corpus data:

    It is hard to see this as a wise policy for allocating scarce research resources. In practice there are usually two possibilities when one wants to exploit corpus data. Often, one wants to put very obvious and simple questions to the corpus; in that case, it is usually possible to get answers via general-purpose Unix commands like grep and wc, avoiding the overhead of learning special-purpose software. Sometimes, the questions one wants to put are original and un-obvious; in those cases, the developer of a corpus utility is unlikely to have anticipated that anyone might want to ask them, so one has to write one's own program to extract the information. No doubt there are intermediate cases where a corpus utility will do the job and grep will not. I am not convinced that these cases are common enough to justify learning to use such software, let alone writing it.

    3. The Lexicon File

    The lexicon file contains an alphabetized list of all pairs of wordform and wordtag that occur at least once in the Corpus. This is an innovation in Release 5, prompted by a recommendation relating to speech corpora made by the EAGLES Spoken Language Working Group (Gibbon et al. 1997: 170, Recommendation 6). (For the EAGLES initiative, see URL 9.) Including such a list is potentially no less valuable for written-language than for speech corpora, and separate listing of grammatically-distinct uses of single wordforms is an obvious way of increasing its value.

    Each line of the lexicon file contains a wordform followed by a wordtag, separated by a tab character, and terminated by a newline. Wordforms differing only in the case of one or more letters are separately listed (e.g. OSLO and Oslo have separate entries). The alphabetization uses the sequence A, a, B, b, ... (rather than A, B, ..., Z, a, b, ...).
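
    As an illustration of how little code the lexicon format demands, the following Python sketch reads lexicon lines and sorts them in an interleaved-case order. It is a sketch under stated assumptions, not part of the Corpus distribution: the function names are invented, the tags TAG1 and TAG2 are placeholders rather than real wordtags, and the sort key is only one plausible reading of the A, a, B, b, ... sequence (the treatment of non-letters is a guess).

```python
def parse_lexicon_line(line):
    """Split one lexicon line into (wordform, wordtag)."""
    wordform, wordtag = line.rstrip("\n").split("\t")
    return wordform, wordtag


def interleaved_key(wordform):
    """Sort key placing each capital immediately before its lower-case
    counterpart (A, a, B, b, ...) rather than all capitals first.
    Assumption: non-letters are simply ordered by character code."""
    return [(ch.lower(), ch.islower()) for ch in wordform]


# Hypothetical entries; "TAG1" and "TAG2" are placeholders, not real wordtags.
sample = ["Oslo\tTAG1\n", "apple\tTAG2\n", "OSLO\tTAG1\n", "Apple\tTAG2\n"]
entries = [parse_lexicon_line(line) for line in sample]
entries.sort(key=lambda pair: interleaved_key(pair[0]))
for wordform, wordtag in entries:
    print(wordform, wordtag)
```

    With this key the four sample wordforms come out in the order Apple, apple, OSLO, Oslo, so that forms differing only in case stay adjacent.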

    4. The Text Files

    4.1 Field Structure

    Each file has a line (terminating in a newline character) for each "word" of the original text; but "words" for SUSANNE purposes are often smaller than words in the ordinary orthographic sense: for instance, punctuation marks and the apostrophe-s suffix are treated as separate words and assigned lines of their own. (For details on the rules by which orthographic words are segmented, as well as on all other analytic matters mentioned below, see EFC.)

    Each line of a SUSANNE file has six fields separated by tabs (that is, there is one tab after each of fields 1 to 5, but a newline after field 6). Each field on every line contains at least one character. A typical short sequence of lines is:

    N06:0180.12	-	NN1u	Baldness	baldness	[S[Ns:s.Ns:s]
    N06:0180.15	-	VBDZ	was	be	[Vsu.
    N06:0180.18	-	VVGt	attacking	attack	.Vsu]
    N06:0180.21	-	APPGm	his	his	[Ns:o.
    N06:0180.24	-	NN1c	pate	pate	.Ns:o]S]
    

    The six fields on each line are:

    1. reference
    2. status
    3. wordtag
    4. word
    5. lemma
    6. parse
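
    Because every line has exactly six tab-separated, non-empty fields, a record can be read with ordinary string operations. The Python sketch below is illustrative only; the names SusanneLine and parse_line are invented for the example and do not belong to any software supplied with the Corpus.

```python
from collections import namedtuple

# Field names follow the list above; the class name is illustrative.
SusanneLine = namedtuple(
    "SusanneLine", "reference status wordtag word lemma parse"
)


def parse_line(line):
    """Split one Corpus line into its six tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6 or not all(fields):
        raise ValueError("expected six non-empty fields: %r" % line)
    return SusanneLine(*fields)


record = parse_line("N06:0180.15\t-\tVBDZ\twas\tbe\t[Vsu.\n")
print(record.wordtag, record.word, record.lemma)
```

    The check on field count and emptiness simply restates the two guarantees given above, so a failure would point to a damaged line rather than to a legitimate variant.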

    Apart from the tab and newline characters used to structure fields and records, all bytes in each of the 64 SUSANNE files are drawn from a subset of the 94 graphic character allocations of the International Reference Version ("IRV") of ISO 646:1983 Information Processing - ISO 7-bit coded character set for information interchange, from hexadecimal 21 (exclamation mark) to hex 7E (tilde). These codes are assumed for SUSANNE purposes to represent the graphic symbols assigned by the IRV system. Twelve members of the IRV character set are not used in the Corpus, namely (all codes hexadecimal):

    23 gate
    24 generalized currency unit
    27 prime
    2F solidus
    5C reverse solidus
    5E circumflex
    5F underline
    60 grave
    7B opening curly bracket
    7C vertical bar
    7D closing curly bracket
    7E tilde

    The space character, hex 20, which is classified by ISO 646 as a control code, also does not occur in the SUSANNE Corpus.
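
    The arithmetic of the character set (94 graphic code points less the twelve listed) can be checked mechanically, and the same set can be used to test a line for stray bytes. The following Python sketch assumes nothing beyond the code points given above; the names are invented for the example, and the check operates on code points only, not on the glyphs a particular system displays for them.

```python
# The 94 graphic code points of ISO 646 IRV: hex 21 to hex 7E inclusive.
GRAPHIC = {chr(code) for code in range(0x21, 0x7F)}

# The twelve code points listed above as unused in the Corpus.
UNUSED = {chr(code) for code in (0x23, 0x24, 0x27, 0x2F, 0x5C, 0x5E,
                                 0x5F, 0x60, 0x7B, 0x7C, 0x7D, 0x7E)}

ALLOWED = GRAPHIC - UNUSED          # characters that may occur in a field
STRUCTURAL = {"\t", "\n"}           # field and record separators


def unexpected_characters(text):
    """Return the characters in text that the description rules out."""
    return set(text) - ALLOWED - STRUCTURAL


print(len(GRAPHIC), len(ALLOWED))
print(unexpected_characters("N06:0180.15\t-\tVBDZ\twas\tbe\t[Vsu.\n"))
```

    A line padded with spaces, for example, would be reported as containing the unexpected character hex 20.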

    Where text characters cannot be adequately represented directly within the resulting 82-member character set, they are represented by entity names within angle brackets. Where possible these are drawn from Appendix D to ISO 8879:1986, Information Processing - Text & Office Systems - Standard Generalized Markup Language (SGML). For instance, <eacute> stands for é. Symbols in angle brackets are used also to represent such things as typographical shifts, which for purposes of grammatical analysis are conveniently represented as items within the word-sequence: e.g. <bital> stands for "begin italics". The complete set of such entity names used in SUSANNE Release 5 is listed in EFC, §2.32, except for <docbrk>, which is new in Release 5 and is defined in §5.4 of this document, below.

    4.2 Reference Field

    The reference field contains eleven bytes which give each line a reference number that is unique across the SUSANNE Corpus, e.g. N06:0180.15. The first three bytes (here N06) are the file name; the fourth byte is always a colon; bytes 5 to 8 (here 0180) are the number of the line in the "Bergen I" version of the Brown Corpus on which the relevant word appears (Brown line numbers normally increment in tens, with occasional odd numbers interpolated); the ninth byte is always a full stop; and bytes 10 and 11 (here 15) are a two-digit number identifying the individual SUSANNE line, i.e. the individual word or punctuation mark (word numbers normally increment in threes, again with occasional intermediate numbers). The word-numbering system is one of the respects in which Release 5 differs from previous SUSANNE releases.
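
    The fixed layout of the reference field lends itself to a single regular expression. The Python sketch below is a minimal illustration; it assumes that file names always consist of one capital letter followed by two digits (as in A01 and N18) and that the line and word numbers are purely numeric, and the function name is invented for the example.

```python
import re

# Three-byte file name, colon, four-digit Brown line number, full stop,
# two-digit word number: eleven bytes in all.
REFERENCE = re.compile(r"([A-Z]\d{2}):(\d{4})\.(\d{2})")


def parse_reference(reference):
    """Return (file name, Brown line number, word number)."""
    match = REFERENCE.fullmatch(reference)
    if match is None:
        raise ValueError("not a Release 5 reference: %r" % reference)
    return match.group(1), int(match.group(2)), int(match.group(3))


print(parse_reference("N06:0180.15"))
```

    For the sample reference this yields the file name N06, Brown line 180 and word number 15.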

    4.3 Status Field

    The status field contains one byte. The letters A and S show that the word is an "abbreviation" or "symbol", respectively, as defined by Brown Corpus codes (Francis & Kučera 1989: 12). The letter E shows that the word is (or is part of) a misprint or solecism in the original text (for details see "Errors in the Source Texts", §6 below). On the great majority of lines, to which none of these three categories apply, the status field contains a hyphen character (this applies to each line in the short SUSANNE extract displayed in §4.1 above).

    4.4 Wordtag Field

    The SUSANNE wordtag set is based on the "Lancaster" tagset listed in Garside et al. (1987: Appendix B), with some additional distinctions and modifications. In line N06:0180.15 (see §4.1 above), the wordtag VBDZ applies to the word was (only). The SUSANNE tagset comprises 353 distinct wordtags, not counting tags for elements of "grammatical idioms" (see below); a few of these wordtags never occur in the SUSANNE Corpus. The wordtags are listed, and their application rigorously defined, in EFC - in the case of closed wordclasses, by enumeration of their members, and in the case of open classes by rules for choice between alternative tags. These rules refer to information in a specified published dictionary (the Oxford Advanced Learner's Dictionary of Current English, 3rd edition).

    Note particularly that the tag YG appears in the wordtag field to represent a "ghost" - the logical position of a constituent which has been shifted elsewhere, or deleted, in the surface grammatical structure.

    4.5 Word Field

    The word field contains a segment of the text, often coinciding with a word in the orthographic sense but sometimes, as noted above, including only part of an orthographic word. (In line N06:0180.15 the word field contains was.) In general the word field represents all and only those typographical distinctions in the original documents which were recorded in the Brown Corpus (Francis & Kučera 1989: 10-15), though in certain cases the SUSANNE Corpus has gone behind the Brown Corpus to reconstruct typographical details omitted from Brown.

    Certain characters have special meanings in the word field, as follows:

    + (occurs only as first byte of the word field) shows that the contents of the field were not separated in the original text from the immediately-preceding text segment by whitespace (e.g. in the case of a punctuation mark, or part of a hyphenated sequence split over successive SUSANNE lines);
    - the line corresponds to no text material (it represents the "ghost" or "trace" for a grammatically-moved element);
    <...> enclose entity names for special typographical features, as listed in EFC, §2.32.
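
    These conventions are enough to rebuild an approximate running text from successive word fields. The Python sketch below is illustrative only: it assumes that a field consisting solely of a hyphen is always a ghost and that a leading + is always the attachment marker, it leaves entity names in angle brackets untouched, and the final two items of the sample (a ghost and a full stop) are invented to exercise the rules.

```python
def join_words(word_fields):
    """Rebuild a rough running text from a sequence of word fields."""
    pieces = []
    for word in word_fields:
        if word == "-":
            continue                    # ghost: no text material
        if word.startswith("+"):
            pieces.append(word[1:])     # attach to the preceding segment
        else:
            if pieces:
                pieces.append(" ")      # whitespace separated it originally
            pieces.append(word)
    return "".join(pieces)


# Word fields from the extract in 4.1, plus an invented ghost and full stop.
print(join_words(["Baldness", "was", "attacking", "his", "pate", "-", "+."]))
```

    The result for the sample is the string "Baldness was attacking his pate." with the full stop attached directly to the preceding word.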

    4.6 Lemma Field

    The lemma field shows the dictionary headword of which the text word is a form: the field shows base forms for words which are inflected in the text, and eliminates typographical variations (such as sentence-initial capitalization) which are not inherent to the word but relate to its use in context. (In line N06:0180.15, the lemma field contains be as the base form of was.) In the case of "words" to which the dictionary-form concept is inappropriate, e.g. numerals and punctuation marks, the lemma field contains a hyphen. Orthographic forms in the lemma field are those of a specified dictionary (the Oxford Advanced Learner's Dictionary of Current English, 3rd edition).

    The SUSANNE project aimed also to indicate the senses which polysemous words bear in context, via codes relating word-tokens to numbered subsenses in a specified dictionary. The book English for the Computer provides a detailed coding scheme for representing this information. Unfortunately, this aspect of the project's output proved to contain a number of inadequacies, and the information was not included in the finished Corpus. In the years since the initial release of SUSANNE, wordsense-coding has developed norms of its own, independent of our early work on the SUSANNE project (see e.g. Kilgarriff and Palmer 2000, URL 10), so it no longer seems appropriate to revise and incorporate that material.

    4.7 Parse Field

    The contents of the sixth field represent the central raison d'être of the SUSANNE Corpus. They code the grammatical structure of texts as a sequence of labelled trees, having a leaf node for each Corpus line.

    Each text is treated as a sequence of "paragraphs" separated by "headings". A "paragraph" normally coincides with an ordinary orthographic paragraph; a "heading" may consist of actual verbal material, or may be merely a typographical paragraph division, symbolized <minbrk> in the word field. (See §5.4, "Subdivision of Texts", for more detail on paragraphs and headings.) Conceptually, the internal structure of each paragraph or heading is a labelled tree with root node labelled O (Oh for a heading), and with a leaf node labelled with a wordtag for each SUSANNE word or trace, i.e. each line of the Corpus. There will commonly be many intermediate labelled nodes.

    Such a tree is represented as a bracketed string in the ordinary way, with the labels of nonterminal nodes written "inside" both opening and closing brackets (that is, to the right of opening brackets and to the left of closing brackets). This bracketed string is then adapted as follows for inclusion in successive SUSANNE parse fields. Wherever an opening bracket immediately follows a closing bracket, the string is segmented, yielding one segment per leaf node; and within each such segment, the sequence opening-bracket + wordtag + closing-bracket, representing the leaf node, is replaced by a full stop. Thus each parse field contains exactly one full stop, corresponding to a terminal node labelled with the contents of the wordtag field, sometimes preceded by labelled opening bracket(s) and sometimes followed by labelled closing bracket(s), corresponding to higher tagmas which begin or end with the word on the line in question. In line N06:0180.15, the parse field entry [Vsu. shows that was is the first word of a tagma was attacking, which as a whole is a Vsu, i.e. a progressive verb group beginning with the word was.
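
    The segmentation just described is easily reversed. The Python sketch below, which is illustrative only and uses invented function names, restores the bracketed string from (wordtag, parse field) pairs by substituting each full stop with its bracketed wordtag, and also lists the labels opened and closed on a single line. The sample rows are the five lines of the extract in §4.1, which form a fragment of a paragraph rather than a complete tree.

```python
def expand_parse_fields(rows):
    """Rebuild the bracketed string for a sequence of (wordtag, parse) pairs.

    Each parse field holds exactly one full stop, standing for the leaf
    node [wordtag]; substituting it back and concatenating the fields
    undoes the segmentation.
    """
    segments = []
    for wordtag, parse in rows:
        if parse.count(".") != 1:
            raise ValueError("expected exactly one full stop: %r" % parse)
        segments.append(parse.replace(".", "[" + wordtag + "]"))
    return "".join(segments)


def split_parse_field(parse):
    """Return (labels opened, labels closed) on one line."""
    before, after = parse.split(".")
    opened = before.split("[")[1:]      # "[S[Ns:s" -> ["S", "Ns:s"]
    closed = after.split("]")[:-1]      # "Ns:o]S]" -> ["Ns:o", "S"]
    return opened, closed


rows = [("NN1u", "[S[Ns:s.Ns:s]"), ("VBDZ", "[Vsu."), ("VVGt", ".Vsu]"),
        ("APPGm", "[Ns:o."), ("NN1c", ".Ns:o]S]")]
print(expand_parse_fields(rows))
print(split_parse_field("[S[Ns:s.Ns:s]"))
```

    Keeping a stack of the opened labels and popping it for each closed label would turn the same information into an explicit tree, and would also detect any line whose closing label does not match the most recent opening label.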

    Nonterminal node labels in the SUSANNE scheme contain up to three types of information: a formtag, a functiontag, and an index, in that order. In a label containing a formtag and one or both of the other two elements, a colon separates the formtag from the other elements. A functiontag is always a single alphabetic character, and an index is a sequence of three digits; restrictions on valid combinations of elements within a node label mean that complex labels can always be unambiguously decomposed into their elements.
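
    A label can be decomposed along these lines with a short routine. The Python sketch below is a rough illustration rather than a statement of the scheme's rules, which are given in EFC: it assumes that a label without a colon is a bare formtag unless it consists of an optional single letter followed by exactly three digits, in which case it is taken to be the formtag-less label of a ghost node. The function name is invented, and the sample labels are taken from examples in this document.

```python
import re

# Assumed shapes: a functiontag is one letter, an index three digits.
TAIL = re.compile(r"([A-Za-z])?(\d{3})?")


def split_label(label):
    """Return (formtag, functiontag, index); absent parts are None."""
    if ":" in label:
        formtag, tail = label.split(":", 1)
    elif re.fullmatch(r"[A-Za-z]?\d{3}", label):
        formtag, tail = None, label     # assumed ghost label, e.g. s999
    else:
        formtag, tail = label, ""       # bare formtag, e.g. Vsu
    match = TAIL.fullmatch(tail)
    if match is None:
        raise ValueError("unrecognised label: %r" % label)
    return formtag, match.group(1), match.group(2)


for label in ("Ns:s", "Vsu", "Nns:O999", "s999"):
    print(label, split_label(label))
```

    The index is kept as a string so that leading zeros, if any occur, are not lost.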

    In total the parse-trees of Release 3 of SUSANNE comprised 267,046 nodes, of which 4,383 were roots and 156,584 were leaves. In Release 5, the number of leaf nodes is only marginally different at 156,622; other parse nodes have not been re-counted, but their numbers are likely also to be very close to the figures for Release 3.

    4.8 Ranks of Constituent

    Apart from nodes immediately dominating "ghost" elements, all nodes have labels including formtags, which identify the internal properties of the word or word-sequence dominated by the node. The shape of a parse-tree is defined in terms of a hierarchy of formtag ranks:

    1. wordrank formtags (begin with two capital letters; formtags of all other ranks begin with one capital and contain no further capitals)
    2. phrasetags (begin with one of: N V J R P D M G)
    3. clausetags (begin with one of: S F T Z L A W)
    4. rootrank formtags (begin with one of: O Q I)
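
    Since rank is determined by the opening letters of a formtag, it can be computed directly. The Python sketch below encodes only the initial-letter rules listed above; the function name and the rank names it returns are invented for the example.

```python
PHRASE_INITIALS = set("NVJRPDMG")
CLAUSE_INITIALS = set("SFTZLAW")
ROOT_INITIALS = set("OQI")


def formtag_rank(formtag):
    """Classify a formtag by the initial-letter rules listed above."""
    if len(formtag) >= 2 and formtag[:2].isalpha() and formtag[:2].isupper():
        return "wordrank"               # two initial capitals
    initial = formtag[:1]
    if initial in PHRASE_INITIALS:
        return "phrase"
    if initial in CLAUSE_INITIALS:
        return "clause"
    if initial in ROOT_INITIALS:
        return "root"
    raise ValueError("unrecognised formtag: %r" % formtag)


for tag in ("VVGi", "Vg", "Tg", "Q", "Oh"):
    print(tag, formtag_rank(tag))
```

    The wordrank test must come first, because wordrank formtags such as VVGi share their initial letter with phrasetags or clausetags.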

    Each grammatical clause, whether consisting of one or more words, is given a node labelled with a clausetag. Each immediate constituent of a clause, whether there are one or more such constituents and whether the constituent consists of one or more words, is given a node labelled with a phrasetag, unless the constituent belongs to a wordrank category that has no corresponding phraserank category (e.g. punctuation marks, conjunctions), or to a rootrank category (e.g. a direct quotation, formtagged Q). Thus a clause consisting of one verb will be assigned a clausetag (e.g. Tg for present-participle clause) which singularily dominates a phrasetag (e.g. Vg for "verb group beginning with present participle") which in turn singularily dominates a wordrank formtag (e.g. VVGi for "present participle of intransitive verb").

    Other than by these rules, and in certain other limited circumstances specified in EFC, singulary branching does not occur. An intermediate phrase node is inserted between a higher phrase node and a sequence of words dominated by it only if two or more of those words form a coherent constituent within the higher phrase. A clause which fills a slot standardly filled by a phrase (e.g. a nominal clause as subject or object) will not have a phrase node above the clause node unless the clause proper is preceded and/or followed by modifying elements that are not part of the clause.

    Detailed rules for deciding constituency in various debatable cases, for placing items such as punctuation marks within parse trees, etc. are laid down in EFC.

    4.9 Functiontags and Indices

    Functiontags, identifying roles such as surface subject, logical object, time adjunct, are assigned to all immediate constituents of clauses, except for their verb-group heads and certain other constituents for which function labelling is inappropriate.

    Indices are assigned to pairs of nodes to show referential identity between items which are in certain defined grammatical relationships to one another. For instance, a phrase raised out of a lower clause to act as object in a higher clause, as in John expected Mary to admit it, will be assigned an index identical to that assigned to the ghost element which marks the logical position of the item in the lower clause. The (artificial) example quoted would be represented as:

    [Nns:s John] expected [Nns:O999 Mary] [Ti:o [s999 GHOST] to admit [Ni:o it]]

    - where the index 999 shows that the ghost acting as logical subject (symbolized s) of the admit clause is coreferential with Mary which acts as surface object (O) of the expected clause; the logical object (o) of the expected clause being the entire infinitival subordinate clause (Ti). In some cases, movement rules displace a constituent into a tagma within which it has no grammatical role (for instance, an adverb which is logically a clause constituent may interrupt the verb group - sequence of auxiliary verbs and main verb - of the clause): in such cases the functiontag is G ("guest"). Constituents which do not logically belong below the node which immediately dominates them in surface structure are always given G functiontags and indices linking them to their logical position. With that exception (and with one other exception not discussed here relating to co-ordination), functiontagging is used only for immediate constituents of clauses. EFC lists the categories of surface/logical-grammar discordance which are represented by the SUSANNE scheme, and the approved methods of representing them. The SUSANNE analysis is always chosen so as to be as far as possible neutral as between alternative linguistic theories.

    4.10 The Formtags

    The SUSANNE formtags are as follows:

    Rootrank Formtags

    O paragraph
    Oh heading
    Ot title (e.g. of book)
    Q quotation
    I interpolation
    Iq tag question
    Iu technical reference

    Clausetags

    S main clause
    Ss embedded quoting clause
    Fa adverbial clause
    Fn nominal clause
    Fr relative clause
    Ff fused relative
    Fc comparative clause
    Tg present participle clause
    Tn past participle clause
    Ti infinitival clause
    Tf for-to clause
    Tb bare nonfinite clause
    Tq infinitival relative clause
    W with clause
    A special as clause
    Z reduced ("whiz-deleted") relative
    L miscellaneous verbless clause

    Phrasetags

    V verb group
    N noun phrase
    J adjective phrase
    R adverb phrase
    P prepositional phrase
    D determiner phrase
    M numeral phrase
    G genitive phrase

    The various phrase categories take lower-case subcategory symbols which can be combined in any meaningful combination (e.g. the verb group must have been noticed would be formtagged Vcfp). The phrase subcategories are:

    Vo operator section of verb group, when separated from remainder of V e.g. by subject-auxiliary inversion
    Vr remainder of V from which Vo has been separated
    Vm V beginning with am
    Va V beginning with are
    Vs V beginning with was
    Vz V beginning with other 3rd-singular verb
    Vw V beginning with were
    Vj V beginning with be
    Vd V beginning with past tense
    Vi infinitival V
    Vg V beginning with present participle
    Vn V beginning with past participle
    Vc V beginning with modal
    Vk V containing emphatic DO
    Ve negative V
    Vf perfective V
    Vu progressive V
    Vp passive V
    Vb V ending with BE
    Vx V lacking main verb
    Vt catenative V
    Nq wh- N
    Nv wh...ever N
    Ne I/me as whole or head
    Ny you as whole or head
    Ni it as whole or head
    Nj adjectival head
    Nn proper name
    Nu unit of measurement as head
    Na marked as subject
    No marked as nonsubject
    Ns marked as singular
    Np marked as plural
    Jq wh- J
    Jv wh...ever J
    Jx measured absolute J
    Jr measured comparative J
    Jh "heavy" (postmodified) J
    Rq wh- R
    Rv wh...ever R
    Rx measured absolute R
    Rr measured comparative R
    Rs adverb conducive to asyndeton
    Rw quasi-nominal adverb
    Po of phrase
    Pb by phrase
    Pq wh- P
    Pv wh...ever P
    Dq wh- D
    Dv wh...ever D
    Ds marked as singular
    Dp marked as plural
    Ms M headed by one
    Gq wh- G
    Gv wh...ever G

    In later work (see §13.3 of the CHRISTINE/I Documentation file), two further M subcategories, Mp and Mq, were introduced which had been omitted by oversight from the original SUSANNE scheme. To these should logically be added Mv for wh...ever numeral phrase, though no case of that has been encountered. It has not so far been practical to include these retrospectively in the SUSANNE Corpus.
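
    The way subcategory letters combine on a phrasetag can be sketched as a table lookup. The sketch below is illustrative only: it covers just the V and N tables from the list above, ignores the non-alphanumeric suffixes, and the function name decode_formtag is hypothetical.

```python
# Glosses copied from the phrasetag and subcategory lists in this section.
PHRASE = {"V": "verb group", "N": "noun phrase"}

SUBCATS = {
    "V": {
        "o": "operator section of verb group", "r": "remainder of V",
        "m": "V beginning with am", "a": "V beginning with are",
        "s": "V beginning with was", "z": "V beginning with other 3rd-singular verb",
        "w": "V beginning with were", "j": "V beginning with be",
        "d": "V beginning with past tense", "i": "infinitival V",
        "g": "V beginning with present participle",
        "n": "V beginning with past participle",
        "c": "V beginning with modal", "k": "V containing emphatic DO",
        "e": "negative V", "f": "perfective V", "u": "progressive V",
        "p": "passive V", "b": "V ending with BE",
        "x": "V lacking main verb", "t": "catenative V",
    },
    "N": {
        "q": "wh- N", "v": "wh...ever N", "e": "I/me as whole or head",
        "y": "you as whole or head", "i": "it as whole or head",
        "j": "adjectival head", "n": "proper name",
        "u": "unit of measurement as head", "a": "marked as subject",
        "o": "marked as nonsubject", "s": "marked as singular",
        "p": "marked as plural",
    },
}

def decode_formtag(formtag):
    """Gloss a V or N phrase formtag made of a category letter plus subcategories."""
    category, letters = formtag[0], formtag[1:]
    if category not in SUBCATS:
        raise ValueError(f"category not covered by this sketch: {category!r}")
    table = SUBCATS[category]
    return [PHRASE[category]] + [table[letter] for letter in letters]

print(decode_formtag("Vcfp"))
# prints ['verb group', 'V beginning with modal', 'perfective V', 'passive V']
```

    Keeping one table per phrase category matters because the same letter means different things in different categories: s is "V beginning with was" on a verb group but "marked as singular" on a noun phrase.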

    4.11 Non-alphanumeric Formtag Suffixes

    Formtags may also contain non-alphanumeric symbols, including:

    ? interrogative clause
    * imperative clause
    % subjunctive clause
    ! exclamatory clause or other item
    " vocative item

    Other non-alphanumeric symbols represent co-ordination structure. Under the SUSANNE scheme, second and subsequent conjuncts in a co-ordination are analysed as subordinate to the first conjunct; thus a co-ordination of the form:

    chi, psi, and omega
    (whatever the grammatical rank of the word-sequences chi, psi, etc.) would be assigned a structure of the form:
    [chi, [psi], [and omega]]

    The formtag of the entire co-ordination is determined by the properties of the first conjunct (except for singular/plural subcategories in the case of phrase categories to which these apply); the later conjuncts (which will often be transformationally reduced) have nodes of their own whose formtags mark them as "subordinate conjuncts". The following symbols relate to co-ordination (and apposition) structure:

    + subordinate conjunct introduced by conjunction
    - subordinate conjunct not introduced by conjunction
    @ appositional element
    & co-ordinate structure acting as first conjunct within a higher co-ordination (marked in certain cases only)

    Co-ordination is recognised as occurring between words as well as between higher-rank tagmas. Therefore nonterminal nodes may have formtags consisting of wordtags followed by co-ordination symbols, thus (using WT to stand for an arbitrary wordtag):

    WT& co-ordination of words
    WT+ conjunct within wordlevel co-ordination that is introduced by a conjunction
    WT- conjunct within wordlevel co-ordination not introduced by a conjunction
    (A wordlevel co-ordination always takes an ampersand on its formtag; phrase or clause co-ordinations do so only in very restricted circumstances.)

    Also, certain sequences of orthographic words, in certain uses, are regarded as functioning grammatically as single words ("grammatical idioms"). For instance, none the less would normally be treated as a grammatical idiom, equivalent to an adverb (for which the wordtag is RR). In such cases, the nonterminal node dominating the sequence has a formtag consisting of an equals sign suffixed to the corresponding wordtag; and the individual words composing the grammatical idiom are not wordtagged in their own right, but receive tags with numerical suffixes reflecting their membership of an idiom. (The sequence none the less would be formtagged RR=, and the words none, the, and less in this context would be wordtagged RR31 RR32 RR33.) EFC includes exhaustive listings of closed-class grammatical idioms.

    Note that formtags of the forms WT& WT+ WT- WT= count as wordrank formtags for the purposes of determining tree structure as discussed above.
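
    The tagging of a grammatical idiom can be sketched as follows. The reading of the numerical suffix used here (first digit the number of words in the idiom, second digit the position of the word within it) is an assumption inferred from the single example RR31 RR32 RR33 above; EFC should be consulted for the authoritative rule. The function names are hypothetical.

```python
def idiom_wordtags(wordtag, n_words):
    """Wordtags for the successive words of an idiom equivalent to `wordtag`.

    Assumes a two-digit suffix: idiom length, then position in the idiom.
    """
    if not 2 <= n_words <= 9:
        raise ValueError("this sketch only covers idioms of 2 to 9 words")
    return [f"{wordtag}{n_words}{position}" for position in range(1, n_words + 1)]

def idiom_formtag(wordtag):
    """Formtag of the nonterminal node dominating the idiom."""
    return wordtag + "="

words = ["none", "the", "less"]
print(idiom_formtag("RR"), list(zip(words, idiom_wordtags("RR", len(words)))))
# prints RR= [('none', 'RR31'), ('the', 'RR32'), ('less', 'RR33')]
```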

    4.12 The Functiontags

    Functiontags divide into complement and adjunct tags: broadly, a given complement tag can occur at most once in any clause, but a clause may contain multiple adjuncts of the same type. The scheme of adjunct categories was developed from the classification of Quirk et al. (1985).

    Complement Functiontags

    s logical subject
    o logical direct object
    i indirect object
    u prepositional object
    e predicate complement of subject
    j predicate complement of object
    a agent of passive
    S surface (and not logical) subject
    O surface (and not logical) direct object
    G "guest" having no grammatical role within its tagma

    Adjunct Functiontags

    p place
    q direction
    t time
    h manner or degree
    m modality
    c contingency
    r respect
    w comitative
    k benefactive
    b absolute

    Other Functiontags

    n particle of phrasal verb
    x relative clause having higher clause as antecedent
    z complement of catenative

    Detailed guidelines for the application of these functional categories are included in EFC.

    5. Innovations in Release 5

    5.1 Correction of Analytic Errors

    As with other releases after the first, Release 5 includes a number of corrections of analytic errors in particular passages. Relative to Release 4 there are twenty or so of these. Many of this batch of errors were brought to my attention by Vladimir V. Gojol of the National Institute of Informatics, Bucharest, to whom I offer thanks.

    There are also, however, a number of larger-scale differences between Release 5 and previous SUSANNE releases.

    5.2 Additional File

    The lexicon file is new in Release 5; this has already been discussed in §3 above.

    5.3 Reference Field Format

    A "structural" difference between Release 5 and its predecessors relates to the format of the first field (the "reference field") in each line. The reference field gives each line of the SUSANNE Corpus (hence, each word or punctuation mark of the texts) a unique identifying code. Up to Release 4, this had a format exemplified by N06:0180e, where N06 is a Brown file name, 0180 is the number of a line in the Bergen I version of the Brown Corpus (which is a horizontal, i.e. many words per line, version), and e identifies the individual word within the Bergen I line - successive words were given successive lower-case letters. In Release 5, the last byte is replaced by a full stop followed by a pair of digits: N06:0180e becomes N06:0180.15.

    The motive for this change is that, whenever the Corpus was corrected in a way that changed the division of the text into word-tokens (for instance, perhaps the analytic guidelines required a hyphenated word to be split across multiple lines), under the old system not only the lines immediately affected but many following lines had to be painstakingly edited, because the lettering was continuous. By moving to two-digit line identifiers it becomes possible to leave gaps (the numbers usually go up in threes), so corrections to the Corpus are less arduous. This is arguably a case of shutting the stable door, since I hope that far fewer corrections will be needed in future than the large number that have been made since the initial release. Nevertheless, the change seemed worth making.
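
    The relationship between the two reference formats can be sketched in a few lines. The sketch assumes the default case in which word numbers go up in threes, so that the letter e (the fifth word) corresponds to 15 as in the example above; since the numbers only usually go up in threes, the conversion cannot be relied on for every line of the Corpus. The function names are hypothetical.

```python
import re

# Old style: N06:0180e   New style: N06:0180.15
OLD_REF = re.compile(r"^([A-Z]\d{2}):(\d{4})([a-z])$")
NEW_REF = re.compile(r"^([A-Z]\d{2}):(\d{4})\.(\d{2})$")

def old_to_new(ref, step=3):
    """Convert a pre-Release-5 reference, assuming numbers rise by `step`."""
    m = OLD_REF.match(ref)
    if m is None:
        raise ValueError(f"not an old-style reference: {ref!r}")
    text, line, letter = m.groups()
    position = (ord(letter) - ord("a") + 1) * step
    return f"{text}:{line}.{position:02d}"

def parse_new(ref):
    """Split a Release 5 reference into (text name, Bergen I line, word number)."""
    m = NEW_REF.match(ref)
    if m is None:
        raise ValueError(f"not a Release 5 reference: {ref!r}")
    text, line, position = m.groups()
    return text, int(line), int(position)

print(old_to_new("N06:0180e"))
# prints N06:0180.15
print(parse_new("N06:0180.15"))
# prints ('N06', 180, 15)
```

    With gaps of this kind, a word-token inserted between N06:0180.15 and the following token can take an unused number such as 16 without any later line being renumbered.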

    5.4 Subdivision of Texts

    The system by which texts are subdivided into paragraphs and larger units was not described adequately in previous versions of the Documentation file, and has been slightly changed for Release 5.

    Text structure above the paragraph level was not central to the concerns either of the SUSANNE project, or of the Brown Corpus which provided our source material. Brown (see Francis & Kučera 1989: 10) recognized just two levels of text subdivision. For any particular 2000-word text extract, a "major" subdivision was defined as the highest-order division that happened to fall within the extract (for instance a chapter boundary, or a section boundary if the whole extract fell within one chapter); a "minor" subdivision was defined as any lower-order subdivision within that extract, down to and including paragraph boundaries. This, at least, is how the Brown Manual describes their system. Taken literally, though, it implies that paragraphs should be "major" subdivisions for extracts that contain no higher-order divisions, and every extract should include at least one major boundary. In reality, only a minority of the 64 Brown texts used in SUSANNE include anything labelled as a "major" subdivision. Paragraph boundaries are always coded as minor subdivisions even in the many extracts that contain no higher-order divisions.

    When a text subdivision is introduced by a heading, Brown codes it in a way that is translated in SUSANNE by surrounding the heading with the "words" <bmajhd>...<emajhd> (for a major heading) or <bminhd>...<eminhd> (for a minor heading). The wording of the heading is shown in Brown (and therefore also in SUSANNE) in all capitals, irrespective of the original typography. When a subdivision has no heading, it is preceded by the SUSANNE "word" <majbrk> or <minbrk> - thus paragraphs are separated by <minbrk> symbols.

    The text following a heading is not treated as a separate paragraph from the heading (the <minbrk> symbol never follows immediately after <emajhd> or <eminhd> - nor does <minbrk> occur immediately before <bmajhd> or <bminhd>). There is no concept of balancing a "beginning of subdivision" marker with an "end of subdivision" marker, akin to SGML <p>...</p>. The Brown Corpus was compiled long before the creation of SGML and its descendant systems, and SUSANNE simply followed the logic of the Brown annotation in this area.

    In one respect, however, Release 5 has introduced a change. Many of the 2000-word Brown "texts" are made up of two or more short pieces from separate sources, grouped together to achieve a Corpus in which all "texts" are of roughly equal length for statistical convenience. (In SUSANNE this applies mainly, but not exclusively, to the newspaper texts of Category A.) A single "text" may include the whole or parts of separate items from one issue of a newspaper, items from different issues, or even items from different newspapers. (Details are given in §7, "Sources", below.) In the Brown Corpus, junctions between separate items within one "text" were given no special marker. If the following item began with a heading, it would start with (the Brown equivalent of) a <bmajhd> or <bminhd> element; but often it was marked merely as a paragraph break. In consequence, there are cases where, illogically, a subdivision internal to a single item was marked as a more important break than the junction between what were originally entirely separate documents.

    SUSANNE Release 5 addresses this problem by introducing a new symbol, <docbrk>, to mark junctions within SUSANNE files between separate documents (including separate news stories from the same issue of a newspaper).

    If the material following a <docbrk> happens to be a heading, it will begin with a <bmajhd> or <bminhd> symbol as normal. This is the only situation in the SUSANNE Corpus where <...brk> and <...hd> lines are adjacent.
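
    The subdivision symbols described above are enough to cut a file into documents and paragraph-level units. The sketch below is illustrative only: it assumes that the input is the sequence of word-field values of one SUSANNE file, with the symbols spelled exactly as in this section, and the function name split_documents is hypothetical. Following the convention stated above, a heading is kept in the same unit as the text that follows it.

```python
BREAKS = {"<minbrk>", "<majbrk>"}
HEADING_OPEN = {"<bmajhd>", "<bminhd>"}
HEADING_CLOSE = {"<emajhd>", "<eminhd>"}

def split_documents(words):
    """Return a list of documents, each a list of units (lists of words)."""
    documents, units, current = [], [], []

    def close_unit():
        nonlocal current
        if current:
            units.append(current)
            current = []

    def close_document():
        nonlocal units
        close_unit()
        if units:
            documents.append(units)
            units = []

    for word in words:
        if word == "<docbrk>":
            close_document()          # junction between separate documents
        elif word in BREAKS or word in HEADING_OPEN:
            close_unit()              # a new subdivision starts here
        elif word in HEADING_CLOSE:
            continue                  # heading stays with the following text
        else:
            current.append(word)
    close_document()
    return documents

sample = ["<bminhd>", "NEWS", "<eminhd>", "First", "story", ".",
          "<minbrk>", "More", ".", "<docbrk>", "Second", "story", "."]
print(split_documents(sample))
# prints [[['NEWS', 'First', 'story', '.'], ['More', '.']], [['Second', 'story', '.']]]
```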

    A separate point worth mentioning about the SUSANNE heading and paragraph system is that it is arguably rather illogical for the beginning- and end-heading marks to be included as first and last daughters of Oh (heading) tagmas, and for break symbols to be given Oh nodes above them. Elsewhere in the grammatical annotation scheme, boundary markers are treated as sisters rather than daughters of the units they bound. For instance, full stop and other sentence-final punctuation marks are right-sisters, not last daughters, of S tagmas. However, it has not seemed worth revising the Corpus in this respect. The linguistically-interesting top-level tagmas are those labelled O (paragraph) rather than Oh, and they would be unaffected by such a revision.

    5.5 Revisions to the Annotation Scheme

    Work on the CHRISTINE speech corpus over the years up to 1999 led to many additions and refinements to the annotation scheme of English for the Computer. The bulk of these related to specifically spoken-language phenomena; but the work also uncovered various errors and inconsistencies in the guidelines on analysis of features that occur in written language.

    New decisions about how to deal with these features were specified in §15 of the CHRISTINE documentation file (URL 6), and Release 5 of SUSANNE has been brought into line with most of them. Many of the points are rather trivial and are not listed separately here. Certain points are discussed, however, because they proved problematic to implement for SUSANNE.

    as well as

    The CHRISTINE documentation file describes it as a mistake in EFC that this idiom is classified as CC rather than II. That was a response to instances of as well as in CHRISTINE/I where the idiom functioned unmistakably as a preposition. But the truth is that there are other cases where it is unmistakably a conjunction. As well as should be tagged as either II= or CC= according to context (and there are many contexts where it is hard to make a choice one way or the other).

    Vx

    The CHRISTINE documentation file, §15, re p. 194 of EFC, says that the definition of Vx should explicitly exclude verb groups in which DO replaces a more specific main verb, e.g. you must do (CHRISTINE T12.03913) for you must remember. But this statement is itself insufficiently specific. In infiltrating the state as they did in the Republican administration (SUSANNE A06:1580), the word did is tagged Vdx because they did could be seen as standing for they did infiltrate; in you must do (which I believe is a specifically British rather than American usage), must do is Vc rather than Vcx because it can only be seen as standing for must remember, not *must do remember. Likewise in he will visit...as he never fails to do (SUSANNE G06:0980), to do is Vi rather than Vix because to do can only be taken to stand for to visit, not *to do visit.

    appositional elements

    EFC, §4.501, noted that SUSANNE had not marked one-word appositional elements within phrases with ...@ tags, and noted that this was illogical and regrettable since one-word conjuncts within phrases are regularly given ...+ or ...- nodes. In CHRISTINE (§15 of the documentation file, re §4.507 of EFC), quoted words or phrases which occur in apposition are regularly tagged Q@, and this was advocated as desirable for annotating written language also: thus the example quoted in EFC, the word mug meant the object which ..., would take the annotation [Ns:s the word [Q@ mug_NN1c]] .... Unfortunately it has not been practical to locate and change the various relevant passages in SUSANNE.

    Mp, Mq, Mv

    I noted in §4.10 above that SUSANNE Release 5 has not applied these new numeral-phrase subcategories.

    left-grouped coordination

    EFC, §4.475, pointed out that the ...& notation for left-grouped coordination had been applied in SUSANNE only to coordinations where the main conjunct is a clause rather than a phrase, though in subsequent work we have applied it across the board. SUSANNE Release 5 has not corrected earlier SUSANNE versions in this respect.

    indices in reduced relative clauses

    Our team has also made one new decision about analytic practice which was too recent to include in the CHRISTINE documentation file. This relates to the use of the ghost and index system within postmodifying tagmas which logically are reduced relative clauses, tagged Tg, Ti, Tn, or Z. According to the EFC guidelines, these should resemble unreduced Fr clauses in containing ghosts indexed to their antecedents.

    This proved to be an aspect of the scheme to which analysts often found it difficult to conform in practice; and the indexing system adds no real information in the case of Tg and Tn postmodifiers, because the understood element of the clause, if represented as a ghost, will always be functiontagged :s in a Tg and :S in a Tn. Therefore, in our current practice, ghosts are not included in postmodifying Tg or Tn tagmas. They are still included in Ti or Z tagmas, where their functiontag is not straightforwardly predictable from the surrounding structure (and, in the case of postmodifying Ti's, where there is a difference between instances which are reduced relative clauses, containing an understood element coreferential with the antecedent, and instances where no element is understood and no possibility of inserting a ghost arises).

    There has been no attempt to rework SUSANNE in line with this recent change of practice. The use of ghost elements in SUSANNE is believed to conform fairly well to the EFC standard, and it would be a pity to remove explicit, correct information from the files, even if it could be reconstructed deterministically.

    6. Errors in the Source Texts

    The SUSANNE Corpus aims to reflect the incidence of errors found in real-life written English, and therefore to reproduce those errors which stem from the original texts on which the Brown Corpus was based, while correcting errors that were introduced into the Brown Corpus during the process of corpus-construction. The SUSANNE Corpus was developed from the "Bergen I" version of the Brown Corpus; whenever a form in Bergen I appeared erroneous (and was not discussed explicitly in the Brown Corpus Manual), the SUSANNE team checked whether the error reflected the original source. (In some cases we went to the original publications; in many other cases we asked W. Nelson Francis and Andrew Mackie of Brown University to consult the copies of those publications which had been used in compiling the Brown Corpus - their help is gratefully acknowledged.) When an error (or apparent error) reflects the form found in the original publication, it is preserved in the SUSANNE Corpus (flagged by an E in the status field); otherwise, the SUSANNE Corpus restores the text of the original publication and the status field ignores the error. Where errors are original, wordtagging and grammatical analysis are applied to the erroneous text as best they can be by analogy with correct forms (cf. EFC, §4.26).

    For the benefit of users of different versions of the Brown Corpus, a list is included below of all the apparent errors examined, both those which turned out to be original and those which were introduced subsequently and are corrected in SUSANNE. In one or two cases it is open to question whether some odd usage found in the original sources is in fact a misprint or merely an unusual use of words. In cases of doubt the SUSANNE policy was to assume that the original wording was intended, and not to put E in the status field, but these cases too are listed below.

    It is of course all too possible that the task of creating the SUSANNE Corpus may have introduced new errors in the word field, and that some errors in Bergen I which were not original have been allowed to stand by oversight. Users who discover text errors not logged below are encouraged to bring them to my attention for correction in subsequent releases.

    In the following list, "BgI" represents the Bergen I version of the Brown Corpus, and "BCUM" represents the 1989 edition of the Brown Manual (Francis & Kučera 1989). The simple comment "X for Y" means that the form X occurs in the original text and appears to be a mistake which should read Y; the form X stands in the SUSANNE Corpus, with E in the status field. (Suggested corrections cannot be certain, and in a few cases may represent a misunderstanding of the writer's intention even assuming this was well-defined.) BCUM attempts to list cases of this sort exhaustively; where such a case is not logged there, the comment "not logged in BCUM" is added. The comment "BgI has X for Y", on the other hand, means that the erroneous form X is not original but was introduced in the process of creating the Brown Corpus; SUSANNE has restored the original form Y, and the status field contains a hyphen.

    A01:0370 BgI omits double closing inverted commas between money and full stop
    A02:0560 BgI has be to for to be
    A02:0730 county for County in Lamar county Hospital District
    A03:0840 words are missing
    A04:0150 conspicious for conspicuous
    A04:1880 double closing inverted commas omitted between problem and full stop (not logged in BCUM)
    A04:1900 ot for to
    A04:1930 ond for and
    A05:0200 full stop for comma
    A06:0250 adminstration for administration
    A06:0410 Nothing for Noting
    A06:0670 rebound appears to be malapropism for redound (not logged in BCUM)
    A06:1680 alloted for allotted
    A08:0470 statutes for statues
    A08:0770 builtin for built-in
    A08:0910 full stop where there should be comma or no punctuation mark (not logged in BCUM)
    A09:0350 BgI has full stop for comma after firms (the copy of text A09 held at Brown University is now too faded to be sure that this error is not original, but it is assumed to be an error of transcription)
    A09:0760 accomodations for accommodations
    A09:1420 severly for severely
    A10:0190 ritiuality for (?) rituality
    A10:0580 in grammatically redundant before which
    A10:0590 double closing inverted commas omitted between following and full stop (not logged in BCUM)
    A10:1330 Diety for Deity
    A10:1890 conpired for conspired
    A11:0060 draought for drought
    A11:0060 BgI has 3-to-o for 3-to-0
    A11:0090 righthandler for righthander
    A11:0230 It is not clear whether the form A's as an abbreviation for Athletics' should be regarded as an error. This extract regularly abbreviates the team name Athletics as A's, but in this line it occurs as a genitive; it is an open question whether English orthography requires a second apostrophe following the s in such a case. (If it does, this would be the sole SUSANNE line calling for two different markers in the status field.)
    A11:0820 rookie-of-the year for rookie-of-the-year
    A11:0860 6-foot 3 inch for (?) 6-foot-3-inch
    A11:1120 Dimaggio for (?) DiMaggio
    A11:1400 BgI has redundant comma following initial "B." in Norman B. Small
    A11:1760 Wellsley College for (?) Wellesley College
    A12:0210 double closing inverted commas omitted between training and full stop (not logged in BCUM)
    A12:1160 Owl's for Owls' (not logged in BCUM)
    A12:1460 out linebackers may be error for our linebackers but SUSANNE Corpus assumes out was intended
    A13:0040 noun missing after countless (not logged in BCUM)
    A13:0140 Bear's for Bears' (not logged in BCUM)
    A13:0140 gruonded for grounded
    A13:0350 are for is (not logged in BCUM)
    A13:1000 Closing double inverted commas appear to be needed between gone and full stop, but this is not an error in the original; the Brown text consists of six short extracts from separate news stories, and the extract ending on line 1000 breaks off in the middle of a quotation.
    A13:1120 closing double inverted commas omitted after leaguer
    A14:1060 redundant space inserted between diamond- and studded
    A19:1260 Dresbach's for Dresbachs' (not logged in BCUM)
    A19:1560 BgI has up to for up
    A20:0970 advance for advanced
    G01:1640 is with plural subject (not logged in BCUM)
    G01:1680 Northeners for Northerners
    G02:1370 socal for social
    G03:1110, 1630 two tokens of preprepared are assumed to be intended, not misprints
    G04:0480 grazer assumed to be error for grazier
    G05:0110 aromatick occurs as archaism
    G05:0410 BgI has ?t for It
    G06:0630 Wozzek for Wozzeck (title of an opera by Alban Berg) (not logged in BCUM)
    G06:1330 closing single inverted comma missing after Meinung
    G06:1550 BgI has double rather than single inverted comma before the angry
    G09:1170 comma missing after light
    G09:1290 kaleidescope for kaleidoscope
    G10:0380-0450 The original has a quotation immediately followed by a note in square brackets, all set blocked. The Brown Corpus substitutes inverted commas for blocking to represent quotation, by the policy discussed in EFC, §2.25. In BgI the <minbrk> for paragraph boundary is shown as following the opening inverted commas on line 0380; SUSANNE normalizes this to <minbrk> followed by <ldquo>.
    G10:0390, 0410 BgI has two instances of Negroes for original negroes
    G10:0670 reremained for remained
    G10:0840 differences for difference (not logged in BCUM)
    G11:0310 Consitutional for Constitutional
    G11:0510 discernable for discernible
    G11:0870 terrestial for terrestrial
    G11:1300 determing for determining
    G12:0300 BgI has "man;" for man's
    G13:1160 pyschiatrist for psychiatrist
    G17:0330 Original has the Hound of <bital> Heaven's <eital> pursuit; presumably all or none of [the] Hound of Heaven (title of a poem by Francis Thompson) should be in italics.
    G17:1050 full stop for comma
    G17:1760 BgI has And for and
    G18:0610 original has is after plural subject poems
    G22:0370 BgI has "man;" for man's
    J01:0190 BgI has 1o for 10
    J01:0190 BgI has 6ooº for 600º
    J02:0060 BgI omits full stop after energy
    J02:0220 BgI has co0ling for cooling
    J02:0350 up for up to before <formul> (not logged in BCUM)
    J02:0400 BgI has it for It
    J02:0860 BgI omits full stop after 30
    J02:1030 has for gas (not logged in BCUM)
    J02:1040 space omitted between to and the
    J03:0570 as for is
    J04:1370 electron for electrons
    J05:1670 a for an
    J06:1760 BgI has oserved for observed
    J08:0170 areosol for aerosol
    J08:0270 assesment for assessment
    J08:0440 meterological for meteorological
    J08:0560 full stop for comma
    J08:0860 on-sure for on-shore
    J08:1620 prowazwki for prowazekii (not logged in BCUM)
    J09:0600 used for use
    J09:0740 BgI has EEAE for DEAE
    J09:0850 BgI has the for The
    J10:1170 BgI has full stop for comma after honeybee
    J12:1830 BgI has maybe for may be
    J17:1210 parsympathetic for parasympathetic
    J21:0250 meets for meet
    J21:0880 Original lacks a comma after <formul> where context appears to call for it; BCUM logs this as a typographical error, but SUSANNE assumes that the comma was deliberately omitted to avoid confusion with the formula.
    J22:0700 of for or (not logged in BCUM)
    J23:0770 humilation for humiliation
    J23:0930-40 BgI has united states for United States
    J23:1720 BgI has it for It
    J24:1900 BgI has our for Our
    N02:0560 BgI has two for Two
    N02:1610 BgI omits he after where
    N03:0100 BgI has My for my
    N03:0550 BgI omits opening double inverted commas before Just
    N03:1450 BgI has mike for Mike
    N04:0140 BgI omits opening double inverted commas before The
    N04:0550 BgI omits full stop after other
    N06:0100 of for or
    N07:1430 The apostrophe after dollars is arguably redundant or requires a following word such as worth, but the original text is as shown; no E in status field since the grammatical position is debatable
    N09:0330 ommission for omission
    N09:1080 BgI omits opening double inverted commas before I'm
    N10:0080 coosie's for cooky's or cookie's
    N10:1280 BgI has redundant space after the h of Marshal
    N12:1220 shout for shot
    N13:1500 onct in original is assumed not to be an error but to represent a dialect pronunciation of once
    N14:1010 BgI omits opening double inverted commas before It'll
    N14:1110 original has redundant opening double inverted commas before Pat
    N14:1160 down off for off down
    N14:1600 original lacks was before pumping
    N15:0090 opportunities for opportunity
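
    The conventions of the list above lend themselves to a rough mechanical triage. The sketch below is a heuristic only, not a description of how the list was produced: it accepts entries with a single four-digit line reference, treats any comment beginning "BgI" as a transcription error corrected in SUSANNE, treats a short form (at most three words) followed by "for" as an original error, and leaves everything else to be read by hand. The function name and the category labels are hypothetical.

```python
import re

ENTRY = re.compile(r"^([A-Z]\d{2}):(\d{4})\s+(.+)$")
# Heuristic: an erroneous form of at most three words, then "for".
SIMPLE_X_FOR_Y = re.compile(r"^\S+(?: \S+){0,2} for ")

def classify(entry):
    """Classify one list entry; raises ValueError on unsupported references."""
    m = ENTRY.match(entry.strip())
    if m is None:
        raise ValueError(f"unsupported entry: {entry!r}")
    text, line, comment = m.groups()
    if comment.startswith("BgI "):
        kind = "transcription"      # corrected in SUSANNE
    elif SIMPLE_X_FOR_Y.match(comment):
        kind = "original"           # preserved in SUSANNE
    else:
        kind = "unclassified"       # needs to be read by hand
    return {
        "text": text,
        "line": line,
        "kind": kind,
        "not_logged_in_bcum": "not logged in BCUM" in comment,
    }

print(classify("A02:0560 BgI has be to for to be")["kind"])
# prints transcription
print(classify("A04:0150 conspicious for conspicuous")["kind"])
# prints original
```

    Entries with discursive comments, such as A06:0670 or A11:0230, come out as "unclassified", which is deliberate: their status can only be settled by reading the comment.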

    7. Sources

    The SUSANNE Corpus is based on 64 of the 500 texts of the Brown Corpus. The 64 SUSANNE texts are the following (publication details summarized from Francis & Kučera (1989), dates shown in ISO yyyy.mm.dd order):

    A01
    The Atlanta Constitution
    1961.11.04 issue, p. 1, 2 items
    1961.08.17 issue, p. 6, 2 items
    1961.03.06 issue, p. 1, 3 items; p. 18, 1 item
    A02
    The Dallas Morning News
    1961.02.17 issue, section 1, p. 5, 5 items; p. 12, 2 items
    Chicago Daily Tribune
    1961.02.10 issue, part 1, p. 4, 1 item
    A03
    Chicago Daily Tribune
    1961.07.25 issue, p. 1, 3 items
    1961.02.10 issue, p. 1, 2 items
    A04
    The Christian Science Monitor
    1961.05.11 issue, p. 1, 3 items
    A05
    The Providence Journal
    1961.07.23 issue, p. 19, 1 item
    1961.07.16 issue, sec. 1, p. 9, 1 item
    1961.07.19 issue, p. 5, 1 item
    1961.07.20 issue, p. 5, 1 item
    1961.07.22 issue, p. 17, 1 item
    A06
    Newark Evening News
    1961.03.22 issue, p. 25, 6 items
    A07
    The New York Times
    1961.06.19 issue, p. 1, 7 items
    A08
    The Times-Picayune [New Orleans]
    1961.01.01 issue, sec. 2, p. 3, 4 items
    A09
    The Philadelphia Inquirer
    1961.05.10 issue, p. 49, 4 items
    Chicago Daily Tribune
    1961.02.10 issue, sec. F, p. 9, 1 item
    1961.10.25 issue, sec. I, p. 16, 1 item
    A10
    The Oregonian [Portland]
    1961.10.24 issue, p. 8, 5 items
    1961.11.29 issue, p. 12, 3 items
    1961.10.24 issue, p. 8, 1 item
    A11
    The Sun [Baltimore]
    1961.03.18 issue, pp. 15 and 18, 6 items
    A12
    The Dallas Morning News
    1961.10.10 issue, sec. 2
    p. 1, 1 item
    p. 2, 1 item
    p. 3, 3 items
    A13
    Rocky Mountain News [Denver, Colorado]
    1961.05.02 issue
    p. 50, 3 items
    p. 51, 2 items
    The Dallas Morning News
    1961.10.10 issue, sec. 2, p. 1, 1 item
    A14
    The New York Times
    1961.01.24 issue, p. 23, 5 items
    A19
    The Sun [Baltimore]
    1961.01.08 issue, p. 36, 8 items
    1961.12.10 issue, sec. C, p. 1, 4 items
    A20
    Chicago Daily Tribune
    1961.02.10 issue
    pp. 1 and 2, 1 item
    p. 9, 1 item
    p. 2, 2 items
    p. 9, 1 item
    G01 Edward P. Lawton, "Northern Liberals and Southern Bourbons", The Georgia Review, 15 (1961), 254-259
    G02 Arthur S. Miller, "Toward a Concept of National Responsibility", The Yale Review, LI:2 (December 1961), 186-191
    G03 Peter Wyden, "The Chances of Accidental War", The Saturday Evening Post, 1961.06.03, 18-19 and 60-61
    G04 Eugene Burdick, "The Invisible Aborigine", Harper's Magazine, 223:1336 (September 1961), 70-72
    G05 Terence O'Donnell, "Evenings at the Bridge", Horizon, III:5 (May 1961), 26-30
    G06
    The American-German Review, October-November 1961
    pp. 26-28: Ruth Berges, "William Steinberg, Pittsburgh's Dynamic Conductor"
    pp. 28-29: Henry W. Koller, "German Youth Looks at the Future"
    G07 Richard B. Morris, "Seven Who Set Our Destiny", The New York Times Magazine, 1961.02.19, 9 and 69-70
    G08 Frank Murphy, "New Southern Fiction: Urban or Agrarian?", The Carolina Quarterly, 13:2 (Spring 1961), 18-25
    G09 Selma Jeanne Cohen, "Avant-Garde Choreography", Criticism: A Quarterly for Literature and the Arts, vol. III, no. 1 (Winter 1961), 24-28
    G10 Clarence Streit, "How the Civil War Kept You Sovereign" [chapter 8 of Freedom's Frontier - Atlantic Union Now], Freedom and Union, 16:2 (February 1961), 16-18
    G11 Frank Oppenheimer, "Science and Fear - A Discussion of Some Fruits of Scientific Understanding", The Centennial Review, 5:4 (Fall 1961), 404-409
    G12 Tom F. Driver, "Beckett by the Madeleine", Columbia University Forum, 4:3 (Summer 1961), 21-24
    G13 Charles Glicksberg, "Sex in Contemporary Literature", The Colorado Quarterly, 9:3 (Winter 1961), 278-282
    G17 Randall Stewart, "A Little History, a Little Honesty: A Southern Viewpoint", The Georgia Review, 15:1 (Spring 1961), 10-15
    G18 Charles Wharton Stork, "Verner von Heidenstam", American-Scandinavian Review, 49:1 (March 1961), 39-43
    G22 Kenneth Reiner, "Coping with Runaway Technology", The Ethical Outlook, XLVII:3 (May-June 1961), 91-95
    J01 Cornell H. Mayer, "Radio Emission of the Moon and Planets", in Gerard P. Kuiper & Barbara M. Middlehurst (eds.), Planets and Satellites. Vol. 3 of The Solar System. University of Chicago Press, 1961, pp. 442-446
    J02 Raymond C. Binder et al. (eds.), Proceedings of the 1961 Heat Transfer and Fluid Mechanics Institute, Stanford University Press, 1961, pp. 193-196
    J03 Harry H. Hull, "The Normal Forces and Their Thermodynamic Significance", Transactions of the Society of Rheology, V (1961), 120-125
    J04 James A. Ibers et al., "Proton magnetic resonance study of polycrystalline HCrO2", The Physical Review, 121:6 (1961.03.15), 1620-1622
    J05 Jay C. Harris & John R. Van Wazer, "Detergent building", in J.R. Van Wazer (ed.), Phosphorus and its Compounds, Interscience Publishers, Inc., 1961, pp. 1732-1737
    J06 Francis J. Johnston & John E. Willard, "The exchange reaction between Chlorine and Carbon Tetrachloride", Journal of Physical Chemistry, 65 (February 1961), 317-318
    J07 J.F. Vedder, "Micrometeorites", in Francis S. Johnson (ed.), Satellite Environment Handbook, Stanford University Press, 1961, pp. 92-97
    J08 LeRoy Fothergill, "Biological Warfare", in Peter Gray (ed.), The Encyclopedia of the Biological Sciences, Reinhold Publishing Corporation, 1961, pp. 145-149
    J09 M. Yokoyama et al., "Chemical and serological characteristics of blood group antibodies in the ABO and Rh systems", The Journal of Immunology, 87 (1961), 56-60
    J10 B.J.D. Meeuse, The Story of Pollination, The Ronald Press Company, 1961, pp. 104-108
    J12 Richard F. McLaughlin et al., "A study of the subgross pulmonary anatomy in various mammals", The American Journal of Anatomy, 108 (1961), 154-157
    J17 E. Gellhorn, "Prolegomena to a theory of the emotions", Perspectives in Biology and Medicine, 4 (1961), 426-431
    J21 C.R. Wylie, Jr., "Line involutions in S3 whose singular lines all meet in a twisted curve", Proceedings of the American Mathematical Society, 12 (1961), 335-339
    J22 Max F. Millikan & Donald L.M. Blackmer (eds.), The Emerging Nations: Their Growth and United States Policy, Little, Brown and Company, 1961, pp. 136-142
    J23 Joyce O. Hertzler, American Social Institutions; A Sociological Analysis, Allyn and Bacon, Inc., 1961, pp. 478-482
    J24 Howard J. Parad, "Preventive casework: problems and implications", The Social Welfare Forum, 1961, Columbia University Press for the National Conference on Social Welfare, 1961, pp. 186-191
    N01 Wayne D. Overholser, The Killer Marshal, Dell Publishing Co., 1963 [copyright 1961], pp. 53-58
    N02 Clifford Irving, The Valley, McGraw-Hill Book Company, Inc., 1961, pp. 262-267
    N03 Cliff Farrell, Trail of the Tattered Star, Doubleday & Company, 1961, pp. 168-173
    N04 James D. Horan, The Shadow Catcher, Crown Publishers, Inc., 1961, pp. 248-253
    N05 Richard Ferber, Bitter Valley, Dell Publishing Company, 1961, pp. 9-17
    N06 Thomas Anderson, Here Comes Pete Now, Random House, 1961, pp. 4-12
    N07 Todhunter Ballard, The Night Riders, Pocket Books, Inc., 1961, pp. 5-11
    N08 Mary Savage, Just For Tonight, Dodd, Mead & Company, 1961, pp. 114-120
    N09 Jim Thompson, The Transgressors, The New American Library of World Literature, Inc., 1961, pp. 9-13
    N10 Joseph Chadwick, No Land Is Free, Avon Book Division, Hearst Corporation, 1961, pp. 21-26
    N11 Gene Caesar, Rifle For Rent, Monarch Books, Inc., 1963 [copyright 1961], pp. 46-51
    N12 Edwin Booth, Outlaw Town, Ballantine Books, Inc., 1961, pp. 103-108
    N13 Martha Ferguson McKeown, Mountains Ahead, G.P. Putnam's Sons, 1961, pp. 390-395
    N14 Peter Field, Rattlesnake Ridge, Jefferson House, Inc., 1961, pp. 164-172
    N15 Donald J. Plantz, Sweeney Squadron, Dell Publishing Co., Inc., 1961, pp. 133-138
    N18 Peter Bains, "With Women...Education Pays Off", Monsieur, 4:2 (February 1961), 17 and 77-78

    Notes

    ¹The support of the Economic and Social Research Council is gratefully acknowledged. The SUSANNE Project, "Construction of an Analysed Corpus of English", was funded by ESRC award no. R00023 1142, over the period 1988 to 1992. "SUSANNE" stands for "Surface and underlying structural analyses of naturalistic English". I should like to express my warmest thanks to the team who worked on the SUSANNE Project, namely Robin Haigh, Hélène Knight, Tim Willis, and Nancy Glaister.

    ²I thank Alvar Ellegård for permission to circulate a research resource derived from the work of his group.

    ³This is not to suggest that the Pennsylvania Treebank analytic system is crude or ill-defined. Although the Pennsylvania team began by focusing on the quantity of material analysed, they have since published electronically a definition of their own analytic scheme (URL 8) which is closely comparable in degree of refinement to that of English for the Computer. Since my group and I were first in the field, it would now be difficult for us to consider abandoning the SUSANNE scheme for a radically different one, even if some alternative were to prove demonstrably superior.

    References

    A. Ellegård  (1978)  The Syntactic Structure of English Texts. Gothenburg Studies in English, 43.

    W.N. Francis & H. Kučera  (1989)  Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (corrected and revised edition). Department of Linguistics, Brown University, Providence, Rhode Island. [In case of browsers which fail to display the appropriate Unicode subset, note that the third character of the second author's surname is c with the Czech hachek diacritic.]

    R.G. Garside, G.N. Leech, & G.R. Sampson, eds.  (1987)  The Computational Analysis of English. Longman.

    D. Gibbon, R. Moore, & R. Winski, eds.  (1997)  Handbook of Standards and Resources for Spoken Language Systems. Mouton.

    K. Hofland & S. Johansson  (1982)  Word Frequencies in British and American English. Longman.

    A. Kilgarriff & M. Palmer, eds.  (2000)  Computers and the Humanities special issue on Senseval, vol. 34, issues 1-2.

    D.T. Langendoen  (1997)  Review of Sampson (1995). Language 73.600-3.

    M.P. Marcus, B. Santorini, & M.A. Marcinkiewicz  (1993)  "Building a large annotated corpus of English: the Penn Treebank". Computational Linguistics 19.313-30.

    R. Quirk, S. Greenbaum, G. Leech, & J. Svartvik  (1985)  A Comprehensive Grammar of the English Language. Longman.

    G.R. Sampson  (1992)  "Probabilistic parsing". In J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Mouton de Gruyter.

    G.R. Sampson  (1995)  English for the Computer: The SUSANNE Corpus and Analytic Scheme. Clarendon Press (Oxford University Press).

    G.R. Sampson  (1998)  Review of S. Greenbaum, ed., Comparing English Worldwide. Natural Language Engineering 4.363-5.

    G.R. Sampson  (2001)  Empirical Linguistics. Continuum International.

    URL List

    1  http://www.uic.edu/orgs/tei/
    2  http://www.cs.vassar.edu/CES/
    3  http://www.hd.uib.no/icame.html
    4  http://www.grsampson.net/RChristine.html
    5  ftp://ftp.cogs.susx.ac.uk/pub/users/geoffs/CHRISTINE1.tar.Z
    6  http://www.grsampson.net/ChrisDoc.html
    7  http://www.cis.upenn.edu/~treebank/home.html
    8  ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/
    9  http://www.ilc.pi.cnr.it/EAGLES/home.html
    10  http://www.itri.brighton.ac.uk/events/senseval/