Geoffrey Sampson
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England
Change of Address

In the past, the SUSANNE Corpus and other language-engineering resources produced by my research team have been scattered at different internet locations not all under my control, and they have more than once been shifted to new addresses without notification to me. I apologize to users for the frustrations this has sometimes caused. To avoid such problems in future, I have acquired my own internet domain, which I intend to maintain indefinitely. My home page has now moved to:

From now on this will always include a pointer to a list of the current locations of the SUSANNE Corpus and other downloadable research resources produced under my direction. In due course, those resources may themselves be shifted into the grsampson.net domain.
Release 5 is the first new release of the SUSANNE Corpus for almost six years, and incorporates larger changes than did previous releases. Rather than being detailed here, these changes are discussed in a separate section of this document, "Innovations in Release 5", §5 below.
Release 4 of 1994.11.07 corrected a handful of errors discovered in checking the proofs of English for the Computer and otherwise.
Release 3 of 1994.04.04 corrected errors which came to light during the process of finalizing the MS of the book English for the Computer. One proofreading technique applied in the creation of Release 3 was to read through the entire Corpus text printed in a format which used indentation to display the parse-field bracketing structure, in order to catch structural errors such as inappropriate placement of postmodifier constituents within parse trees. Also, this documentation file was provided with a detailed listing of misprints and similar errors in the Corpus texts, showing which of them stem from the original publications (and are therefore preserved in the SUSANNE Corpus), and which were introduced in the work of creating the Brown Corpus (and have accordingly been eliminated from SUSANNE).
Release 2, dated 1993.06.02, corrected a number of errors found in Release 1; I am grateful to all those users who helped to find them. It also contained one minor change in annotation conventions: in the parse field, from Release 2 onwards all node labels are written within square brackets (Release 1 included a redundant distinction between square brackets for ordinary nodes and angle brackets for "ghost" (or "trace") nodes, which are distinguished in several other ways). This documentation file now includes a listing of the text sources on which the Corpus is based, and incorporates some minor changes in wording.
Release 1 of the SUSANNE Corpus was completed on 1992.09.06.
The SUSANNE Corpus was created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive language-engineering-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.¹ The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis. The SUSANNE scheme may be likened to a "Linnaean taxonomy" of the grammatical domain: its aim (comparable to that of Linnaeus's eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers' linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for language-engineering researchers to register everything that occurs in real-life usage systematically and unambiguously, and for researchers at different sites to exchange empirical grammatical data without misunderstandings over local uses of analytic terminology. On reasons why such a scheme is needed at the present juncture in language-engineering research, see e.g. Sampson (1992, 2001 ch. 6).
Note that a sharp distinction is drawn here between the terms scheme and system. A "parsing scheme", or "analytic scheme", refers to a range of notations and guidelines for using them which prescribe to a human analyst what the appropriate grammatical annotation for a language example should be. A parsing "system" on the other hand refers to a software system which automatically produces analyses (according to some parsing scheme) of input language examples. A parsing scheme defines the target which a parsing system hits (or misses). The SUSANNE Corpus represents part of the definition of a parsing scheme. It has been produced largely manually, not as the output of an automatic parsing system.
The SUSANNE analytic scheme is defined in detail in a book by myself, English for the Computer, published under the Clarendon imprint of Oxford University Press in 1995 (Sampson 1995 - abbreviated below as EFC). The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these.
Although I collaborated during the later stages of developing the SUSANNE Corpus with the US/EU-sponsored Text Encoding Initiative (URL 1), the SUSANNE Corpus is not a "TEI-conformant" resource. Various aspects of the annotation scheme were decided in such a way as to facilitate a possible move to TEI conformance in later releases, but the working timetable of the Initiative meant that relevant aspects of the TEI Guidelines were not yet complete at the point when the SUSANNE Corpus was ready for release. The TEI Guidelines are in any case very general, but at the time of writing it seems possible that the Ide/Véronis "Corpus Encoding Standard" (URL 2) may become a special case of the TEI system which achieves recognition as a standard way of encoding this type of information. If that happens, it is my intention to circulate a CES-conformant version of SUSANNE alongside the current version. (However, I believe many users will always prefer to work with the Corpus in its current fixed-field format.)
The brief description of the SUSANNE Corpus which follows cannot replace the very detailed statements to be found in EFC, and any user aiming to do serious work with the Corpus or its annotation scheme would need to consult the book. Nevertheless, it may be useful to have a summary statement included with the electronic Corpus.
The present SUSANNE annotation scheme originated in work carried out by myself in collaboration with Prof. Geoffrey Leech FBA and others in the years 1983-85 to produce a database of manually analysed sentences from the LOB Corpus of written British English, as a source of statistics for probabilistic automatic-parsing techniques; this database, which has not been (and will not now be) published, is described in Garside et al. (1987: ch. 7). The annotation scheme of this "Lancaster-Leeds Treebank" represented surface grammar only, without indications of logical form. It subsequently seemed desirable to extend this scheme to include methods for representing logical grammar, and to refine both surface and logical aspects of the annotation scheme by applying it to a larger body of texts. The only way that a parsing scheme can in practice be made increasingly adequate is in the way that the English Common Law develops, by collecting and systematizing the body of precedents generated through detailed consideration of more and more individual cases that arise in real life. Accordingly, the SUSANNE Project took a subset of the Brown Corpus of written American English which had been manually analysed by Alvar Ellegård's group at Gothenburg (Ellegård 1978), and reworked the annotations in this under-used resource in order to turn them into a scheme consistent with that used in the Lancaster-Leeds Treebank but including specifications of logical as well as surface structure: several categories of information not indicated in either Lancaster-Leeds or Gothenburg schemes were also added.² (For the Brown and LOB Corpora, see e.g. Garside et al. (1987: 4-5), URL 3.)
The SUSANNE analytic scheme has thus been developed on the basis of samples of both British and American English. It was initially oriented towards written language only, and the SUSANNE Corpus contains exclusively written-language samples. However, in later work sponsored first by the Royal Signals and Radar Establishment, and more recently by the Economic and Social Research Council, my team has produced extensions to the scheme for annotating the distinctive structural phenomena of the spoken language, and has applied these to samples of recent spontaneous spoken English (the CHRISTINE Corpus, URL 4). The first stage of the CHRISTINE Corpus, comprising analyses of a demographically balanced cross-section of English spoken in all parts of the UK within the last decade, was released in August 1999 (URL 5), and is one of the first analysed speech corpora to become available anywhere in the English-speaking world (or, so far as I know, outside it). The speech-related aspects of the analytic scheme are outlined in ch. 6 of EFC, and discussed in greater detail in the CHRISTINE documentation file (which is available as a Web page at URL 6).
It should be noted that the SUSANNE analytic scheme has emerged through a process of detailed critical discussion of analytic standards by about a dozen people over almost twenty years. Apart from myself, the leading role in the early years of these discussions was taken by Geoffrey Leech, whose standing as an English grammarian needs no emphasis.
The SUSANNE Corpus itself comprises an approximately 130,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme. The original motives for producing this database included that of providing better statistics for probabilistic parsing; but in this respect the SUSANNE Project was overtaken after its inception by projects (notably Mitchell Marcus's Pennsylvania Treebank project, cf. Marcus et al. (1993), URL 7) which have used quasi-industrial methods to generate far larger bodies of grammatically-analysed material. However, the SUSANNE scheme may be unparalleled in the extent to which its categories have been refined and tested through detailed consideration of the almost endless small quirks of the texts to which they have been applied, and in the degree of precision to which the resulting guidelines for using the categories have been documented - thus defining analytic standards which permit annotation of future material to be extremely self-consistent. The SUSANNE scheme has been winning a measure of international recognition in this respect; for instance, a 1995 report of the European Union EAGLES language-engineering standards initiative described it as "a unique achievement", and D. Terence Langendoen, then President of the Linguistic Society of America, wrote in 1997 that its "detail ... is unrivalled" (Langendoen 1997: 600).³
Accordingly the SUSANNE Corpus is offered to the research community primarily as a demonstration of the application of the parsing scheme, evidencing the fact that the scheme has survived the test of experience rather than being a merely aprioristic system. The SUSANNE Corpus functions, as it were, like a collection of type specimens appended to a botanical taxonomy.
Although the Corpus itself was created as ancillary to the parsing scheme, it has been pleasing, during the years since its initial release, to find that it has been widely used as a research resource in its own right, often by researchers and groups very distant from the site where it was created. There are no legal restrictions on copying or using the SUSANNE Corpus, though it would not be a friendly act for an individual or agency other than the Oxford Text Archive to set up as an alternative distributor, or (without permission) to exploit the information in the Corpus in order to promote a rival annotation scheme. Any person or group publishing work based on the SUSANNE Corpus is requested to acknowledge the roles of the Economic and Social Research Council (UK) as sponsor and the University of Sussex as grantholder.
Each successive release of the SUSANNE Corpus has eliminated errors discovered in earlier releases. The number of errors found and corrected between releases has fallen very considerably as the years have gone by, but there will undoubtedly still be some left. I shall be extremely grateful if users discovering errors will log them and send me details.
Release 5 of the SUSANNE Corpus comprises 66 files:
Sixteen texts are drawn from each of the following Brown genre categories:

A    press reportage
G    belles lettres, biography, memoirs
J    learned (mainly scientific and technical) writing
N    adventure and Western fiction

The Corpus thus samples each of the four broad genre groups established on the basis of word-frequency data by Hofland & Johansson (1982: 27). For publication details of the original texts, see §7, "Sources", below.
The text files average about 83 kilobytes in size; the entire Corpus totals about six megabytes. The names of the text files are those of the respective Brown texts, e.g. A01, N18.
The Corpus comprises data files only, structured in a way that makes the task of extracting information as straightforward as possible. We do not see it as part of our task to produce special-purpose software for data extraction. We could not do that, since we have no way of knowing what sorts of questions future researchers will want to pose to our data. (SUSANNE has been used for various kinds of research that I had no thought of when I first put it into circulation.)
This last point seems worth making, because since the first release of SUSANNE I have more than once encountered comments suggesting that, in failing to supply accompanying utility software, we left a job half done. In response, let me quote remarks I made in a recent book review (Sampson 1998: 365) about the approach which sees utility software as an essential accompaniment to corpus data:
It is hard to see this as a wise policy for allocating scarce research resources. In practice there are usually two possibilities when one wants to exploit corpus data. Often, one wants to put very obvious and simple questions to the corpus; in that case, it is usually possible to get answers via general-purpose Unix commands like grep and wc, avoiding the overhead of learning special-purpose software. Sometimes, the questions one wants to put are original and un-obvious; in those cases, the developer of a corpus utility is unlikely to have anticipated that anyone might want to ask them, so one has to write one's own program to extract the information. No doubt there are intermediate cases where a corpus utility will do the job and grep will not. I am not convinced that these cases are common enough to justify learning to use such software, let alone writing it.
The lexicon file contains an alphabetized list of all pairs of wordform and wordtag that occur at least once in the Corpus. This is an innovation in Release 5, prompted by a recommendation relating to speech corpora made by the EAGLES Spoken Language Working Group (Gibbon et al. 1997: 170, Recommendation 6). (For the EAGLES initiative, see URL 9.) Including such a list is potentially no less valuable for written-language than for speech corpora, and separate listing of grammatically-distinct uses of single wordforms is an obvious way of increasing its value.
Each line of the lexicon file contains a wordform followed by a wordtag, separated by a tab character, and terminated by a newline. Wordforms differing only in the case of one or more letters are separately listed (e.g. OSLO and Oslo have separate entries). The alphabetization uses the sequence A, a, B, b, ... (rather than A, B, ..., Z, a, b, ...).
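The lexicon format just described can be read with a few lines of general-purpose code. The following Python sketch is purely illustrative and is not part of the SUSANNE distribution; the function names are hypothetical, and the sort key is only one plausible reading of the "A, a, B, b, ..." sequence (it compares character by character, which the description above does not spell out).

```python
def parse_lexicon_line(line):
    """Split one lexicon line into a (wordform, wordtag) pair."""
    wordform, wordtag = line.rstrip("\n").split("\t")
    return wordform, wordtag


def lexicon_sort_key(wordform):
    """Sort key for the sequence A, a, B, b, ...

    Assumption: letters are compared case-insensitively, character by
    character, with an upper-case letter preceding its lower-case twin.
    """
    return [(ch.lower(), ch.islower()) for ch in wordform]


entry = parse_lexicon_line("was\tVBDZ\n")
ordered = sorted(["b", "Oslo", "a", "OSLO", "B", "A"], key=lexicon_sort_key)
```

Under this key OSLO sorts before Oslo, consistently with their being listed as separate entries.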
Each file has a line (terminating in a newline character) for each "word" of the original text; but "words" for SUSANNE purposes are often smaller than words in the ordinary orthographic sense: for instance, punctuation marks and the apostrophe-s suffix are treated as separate words and assigned lines of their own. (For details on the rules by which orthographic words are segmented, as well as on all other analytic matters mentioned below, see EFC.)
Each line of a SUSANNE file has six fields separated by tabs (that is, there is one tab after each of fields 1 to 5, but a newline after field 6). Each field on every line contains at least one character. A typical short sequence of lines is:
N06:0180.12    -    NN1u     Baldness     baldness    [S[Ns:s.Ns:s]
N06:0180.15    -    VBDZ     was          be          [Vsu.
N06:0180.18    -    VVGt     attacking    attack      .Vsu]
N06:0180.21    -    APPGm    his          his         [Ns:o.
N06:0180.24    -    NN1c     pate         pate        .Ns:o]S]
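Because the format is fixed-field, a line can be taken apart by splitting on tabs. The Python sketch below is an illustration only, not software supplied with the Corpus; the names SusanneLine and parse_susanne_line are hypothetical, and the six attribute names simply follow the field names used in the descriptions later in this document.

```python
from typing import NamedTuple


class SusanneLine(NamedTuple):
    """One Corpus line; attribute names are illustrative labels for fields 1-6."""
    reference: str
    status: str
    wordtag: str
    word: str
    lemma: str
    parse: str


def parse_susanne_line(line: str) -> SusanneLine:
    """Split a tab-separated Corpus line into its six non-empty fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6 or any(field == "" for field in fields):
        raise ValueError("expected six non-empty tab-separated fields")
    return SusanneLine(*fields)


record = parse_susanne_line("N06:0180.15\t-\tVBDZ\twas\tbe\t[Vsu.\n")
```

Here record.word is the text segment was and record.parse is the string [Vsu. from the second line of the extract.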
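The six fields on each line are, in order: the reference field, the status field, the wordtag field, the word field, the lemma field, and the parse field.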
Apart from the tab and newline characters used to structure fields and records, all bytes in each of the 64 SUSANNE files are drawn from a subset of the 94 graphic character allocations of the International Reference Version ("IRV") of ISO 646:1983 Information Processing - ISO 7-bit coded character set for information interchange, from hexadecimal 21 (exclamation mark) to hex 7E (tilde). These codes are assumed for SUSANNE purposes to represent the graphic symbols assigned by the IRV system. Twelve members of the IRV character set are not used in the Corpus, namely (all codes hexadecimal):
23    gate
24    generalized currency unit
27    prime
2F    solidus
5C    reverse solidus
5E    circumflex
5F    underline
60    grave
7B    opening curly bracket
7C    vertical bar
7D    closing curly bracket
7E    tilde

The space character, hex 20, which is classified by ISO 646 as a control code, also does not occur in the SUSANNE Corpus.
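The character-set restriction is easy to check mechanically. The following Python sketch is illustrative only (the names EXCLUDED, ALLOWED and illegal_bytes are hypothetical); it builds the permitted set from the range hex 21 to hex 7E less the twelve unused codes listed above, and reports any byte in a file that is neither a permitted graphic character nor a tab or newline.

```python
# The twelve IRV codes stated above to be unused in the Corpus.
EXCLUDED = {0x23, 0x24, 0x27, 0x2F, 0x5C, 0x5E,
            0x5F, 0x60, 0x7B, 0x7C, 0x7D, 0x7E}

# The 94 graphic characters run from hex 21 to hex 7E inclusive.
ALLOWED = {code for code in range(0x21, 0x7F) if code not in EXCLUDED}

# Tab and newline structure fields and records.
STRUCTURAL = {0x09, 0x0A}


def illegal_bytes(data: bytes) -> set:
    """Return the set of byte values in data that the format does not permit."""
    return {b for b in data if b not in ALLOWED and b not in STRUCTURAL}


problems = illegal_bytes(b"N06:0180.15\t-\tVBDZ\twas\tbe\t[Vsu.\n")
```

The set ALLOWED has 94 - 12 = 82 members, and a well-formed line such as the one above yields an empty set of problems.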
Where text characters cannot be adequately represented directly within the resulting 82-member character set, they are represented by entity names within angle brackets. Where possible these are drawn from Appendix D to ISO 8879:1986, Information Processing - Text & Office Systems - Standard Generalized Markup Language (SGML). For instance, <eacute> stands for é. Symbols in angle brackets are used also to represent such things as typographical shifts, which for purposes of grammatical analysis are conveniently represented as items within the word-sequence: e.g. <bital> stands for "begin italics". The complete set of such entity names used in SUSANNE Release 5 is listed in EFC, §2.32, except for <docbrk>, which is new in Release 5 and is defined in §5.4 of this document, below.
The reference field contains eleven bytes which give each line a reference number that is unique across the SUSANNE Corpus, e.g. N06:0180.15. The first three bytes (here N06) are the file name; the fourth byte is always a colon; bytes 5 to 8 (here 0180) are the number of the line in the "Bergen I" version of the Brown Corpus on which the relevant word appears (Brown line numbers normally increment in tens, with occasional odd numbers interpolated); the ninth byte is always a full stop; and bytes 10 and 11 (here 15) are a two-digit number identifying the individual SUSANNE line, i.e. the individual word or punctuation mark (word numbers normally increment in threes, again with occasional intermediate numbers). The word-numbering system is one of the respects in which Release 5 differs from previous SUSANNE releases.
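The eleven-byte layout lends itself to a regular expression. The Python sketch below is illustrative, with hypothetical names; it assumes that a file name is always one capital letter followed by two digits (as in A01 and N18) and that the line and word numbers consist entirely of digits.

```python
import re

# Bytes 1-3 file name, byte 4 colon, bytes 5-8 Brown line number,
# byte 9 full stop, bytes 10-11 word number within the line.
REFERENCE = re.compile(r"([A-Z]\d{2}):(\d{4})\.(\d{2})")


def parse_reference(reference):
    """Return (file name, Brown line number, word number) for a reference field."""
    match = REFERENCE.fullmatch(reference)
    if match is None:
        raise ValueError("not a well-formed reference: " + reference)
    name, brown_line, word_number = match.groups()
    return name, int(brown_line), int(word_number)


parts = parse_reference("N06:0180.15")
```

For the example in the text this gives the file name N06, Brown line 180, and word number 15.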
The status field contains one byte. The letters A and S show that the word is an "abbreviation" or "symbol", respectively, as defined by Brown Corpus codes (Francis & Kučera 1989: 12). The letter E shows that the word is (or is part of) a misprint or solecism in the original text (for details see "Errors in the Source Texts", §6 below). On the great majority of lines, to which none of these three categories apply, the status field contains a hyphen character (this applies to each line in the short SUSANNE extract displayed in §4.1 above).
The SUSANNE wordtag set is based on the "Lancaster" tagset listed in Garside et al. (1987: Appendix B), with some additional distinctions and modifications. In line N06:0180.15 (see §4.1 above), the wordtag VBDZ applies to the word was (only). The SUSANNE tagset comprises 353 distinct wordtags, not counting tags for elements of "grammatical idioms" (see below); a few of these wordtags never occur in the SUSANNE Corpus. The wordtags are listed, and their application rigorously defined, in EFC - in the case of closed wordclasses, by enumeration of their members, and in the case of open classes by rules for choice between alternative tags. These rules refer to information in a specified published dictionary (the Oxford Advanced Learner's Dictionary of Current English, 3rd edition).
Note particularly that the tag YG appears in the wordtag field to represent a "ghost" - the logical position of a constituent which has been shifted elsewhere, or deleted, in the surface grammatical structure.
The word field contains a segment of the text, often coinciding with a word in the orthographic sense but sometimes, as noted above, including only part of an orthographic word. (In line N06:0180.15 the word field contains was.) In general the word field represents all and only those typographical distinctions in the original documents which were recorded in the Brown Corpus (Francis & Kučera 1989: 10-15), though in certain cases the SUSANNE Corpus has gone behind the Brown Corpus to reconstruct typographical details omitted from Brown.
Certain characters have special meanings in the word field, as follows:
+        (occurs only as first byte of the word field) shows that the contents of the field were not separated in the original text from the immediately-preceding text segment by whitespace (e.g. in the case of a punctuation mark, or part of a hyphenated sequence split over successive SUSANNE lines);
-        the line corresponds to no text material (it represents the "ghost" or "trace" for a grammatically-moved element);
<...>    enclose entity names for special typographical features, as listed in EFC, §2.32.
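These conventions are enough to rebuild an approximation of the running text from successive word fields. The Python sketch below is illustrative only: the function name join_words is hypothetical, entity names in angle brackets are simply passed through unchanged, and the final item +. in the usage example is an invented line standing for a sentence-final full stop, not a line quoted from the Corpus.

```python
def join_words(word_fields):
    """Rebuild running text from a sequence of word-field values."""
    pieces = []
    for field in word_fields:
        if field == "-":
            # Ghost or trace line: corresponds to no text material.
            continue
        if field.startswith("+"):
            # Not separated from the preceding segment by whitespace.
            pieces.append(field[1:])
        else:
            if pieces:
                pieces.append(" ")
            pieces.append(field)
    return "".join(pieces)


text = join_words(["Baldness", "was", "attacking", "his", "pate", "+."])
```

With the inputs shown, text is the string "Baldness was attacking his pate."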
The lemma field shows the dictionary headword of which the text word is a form: the field shows base forms for words which are inflected in the text, and eliminates typographical variations (such as sentence-initial capitalization) which are not inherent to the word but relate to its use in context. (In line N06:0180.15, the lemma field contains be as the base form of was.) In the case of "words" to which the dictionary-form concept is inappropriate, e.g. numerals and punctuation marks, the lemma field contains a hyphen. Orthographic forms in the lemma field are those of a specified dictionary (the Oxford Advanced Learner's Dictionary of Current English, 3rd edition).
The SUSANNE project aimed also to indicate the senses which polysemous words bear in context, via codes relating word-tokens to numbered subsenses in a specified dictionary. The book English for the Computer provides a detailed coding scheme for representing this information. Unfortunately, this aspect of the project's output proved to contain a number of inadequacies, and the information was not included in the finished Corpus. In the years since the initial release of SUSANNE, wordsense-coding has developed norms of its own, independent of our early work on the SUSANNE project (see e.g. Kilgarriff and Palmer 2000, URL 10), so it no longer seems appropriate to revise and incorporate that material.
The contents of the sixth field represent the central raison d'être of the SUSANNE Corpus. They code the grammatical structure of texts as a sequence of labelled trees, having a leaf node for each Corpus line.
Each text is treated as a sequence of "paragraphs" separated by "headings". A "paragraph" normally coincides with an ordinary orthographic paragraph; a "heading" may consist of actual verbal material, or may be merely a typographical paragraph division, symbolized <minbrk> in the word field. (See §5.4, "Subdivision of Texts", for more detail on paragraphs and headings.) Conceptually, the internal structure of each paragraph or heading is a labelled tree with root node labelled O (Oh for a heading), and with a leaf node labelled with a wordtag for each SUSANNE word or trace, i.e. each line of the Corpus. There will commonly be many intermediate labelled nodes.
Such a tree is represented as a bracketed string in the ordinary way, with the labels of nonterminal nodes written "inside" both opening and closing brackets (that is, to the right of opening brackets and to the left of closing brackets). This bracketed string is then adapted as follows for inclusion in successive SUSANNE parse fields. Wherever an opening bracket immediately follows a closing bracket, the string is segmented, yielding one segment per leaf node; and within each such segment, the sequence opening-bracket + wordtag + closing-bracket, representing the leaf node, is replaced by a full stop. Thus each parse field contains exactly one full stop, corresponding to a terminal node labelled with the contents of the wordtag field, sometimes preceded by labelled opening bracket(s) and sometimes followed by labelled closing bracket(s), corresponding to higher tagmas which begin or end with the word on the line in question. In line N06:0180.15, the parse field entry [Vsu. shows that was is the first word of a tagma was attacking, which as a whole is a Vsu, i.e. a progressive verb group beginning with the word was.
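The adaptation described above can be reversed: substituting the bracketed wordtag for the full stop in each parse field and concatenating the results restores the bracketed string. The Python sketch below is an illustration with hypothetical names, applied to the five-line extract from N06 shown earlier (which is a fragment, so its outermost node here is S rather than a paragraph node O).

```python
def rebuild_bracketing(lines):
    """Restore the bracketed string from (wordtag, parse field) pairs."""
    parts = []
    for wordtag, parse in lines:
        if parse.count(".") != 1:
            raise ValueError("a parse field must contain exactly one full stop")
        # The full stop stands for the leaf node [wordtag].
        parts.append(parse.replace(".", "[" + wordtag + "]"))
    return "".join(parts)


EXTRACT = [
    ("NN1u", "[S[Ns:s.Ns:s]"),
    ("VBDZ", "[Vsu."),
    ("VVGt", ".Vsu]"),
    ("APPGm", "[Ns:o."),
    ("NN1c", ".Ns:o]S]"),
]

bracketing = rebuild_bracketing(EXTRACT)
# bracketing == "[S[Ns:s[NN1u]Ns:s][Vsu[VBDZ][VVGt]Vsu][Ns:o[APPGm][NN1c]Ns:o]S]"
```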
Nonterminal node labels in the SUSANNE scheme contain up to three types of information: a formtag, a functiontag, and an index, in that order. In a label containing a formtag and one or both of the other two elements, a colon separates the formtag from the other elements. A functiontag is always a single alphabetic character, and an index is a sequence of three digits; restrictions on valid combinations of elements within a node label mean that complex labels can always be unambiguously decomposed into their elements.
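A label can be decomposed along these lines with two regular expressions. The Python sketch below is illustrative and makes an assumption that is not stated in this document: a label without a colon is treated as a formtag-less label (as on a node dominating a ghost) if it consists of an optional letter followed by three digits, and as a bare formtag otherwise. EFC remains the authority on which combinations are valid.

```python
import re

# formtag ":" then functiontag and/or index.
WITH_FORMTAG = re.compile(r"([^:]+):([A-Za-z])?(\d{3})?")
# Assumed shape of a label lacking a formtag: optional functiontag, then index.
WITHOUT_FORMTAG = re.compile(r"([A-Za-z])?(\d{3})")


def split_label(label):
    """Return (formtag, functiontag, index); absent elements are None."""
    if ":" in label:
        match = WITH_FORMTAG.fullmatch(label)
        if match is None or match.group(2) is None and match.group(3) is None:
            raise ValueError("cannot decompose label: " + label)
        return match.group(1), match.group(2), match.group(3)
    match = WITHOUT_FORMTAG.fullmatch(label)
    if match is not None:
        return None, match.group(1), match.group(2)
    return label, None, None


subject = split_label("Ns:s")
```

The index is kept as a three-character string so that any leading zeros survive.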
In total the parse-trees of Release 3 of SUSANNE comprised 267,046 nodes, of which 4383 were roots and 156,584 were leaves. In Release 5, the number of leaf nodes is only marginally different at 156,622; other parse nodes have not been re-counted, but their numbers are likely also to be very close to the figures for Release 3.
Apart from nodes immediately dominating "ghost" elements, all nodes have labels including formtags, which identify the internal properties of the word or word-sequence dominated by the node. The shape of a parse-tree is defined in terms of a hierarchy of formtag ranks:
Each grammatical clause, whether consisting of one or more words, is given a node labelled with a clausetag. Each immediate constituent of a clause, whether there are one or more such constituents and whether the constituent consists of one or more words, is given a node labelled with a phrasetag, unless the constituent belongs to a wordrank category that has no corresponding phraserank category (e.g. punctuation marks, conjunctions), or to a rootrank category (e.g. a direct quotation, formtagged Q). Thus a clause consisting of one verb will be assigned a clausetag (e.g. Tg for present-participle clause) which singularily dominates a phrasetag (e.g. Vg for "verb group beginning with present participle") which in turn singularily dominates a wordrank formtag (e.g. VVGi for "present participle of intransitive verb").
Other than by these rules, and in certain other limited circumstances specified in EFC, singulary branching does not occur. An intermediate phrase node is inserted between a higher phrase node and a sequence of words dominated by it only if two or more of those words form a coherent constituent within the higher phrase. A clause which fills a slot standardly filled by a phrase (e.g. a nominal clause as subject or object) will not have a phrase node above the clause node unless the clause proper is preceded and/or followed by modifying elements that are not part of the clause.
Detailed rules for deciding constituency in various debatable cases, for placing items such as punctuation marks within parse trees, etc. are laid down in EFC.
Functiontags, identifying roles such as surface subject, logical object, time adjunct, are assigned to all immediate constituents of clauses, except for their verb-group heads and certain other constituents for which function labelling is inappropriate.
Indices are assigned to pairs of nodes to show referential identity between items which are in certain defined grammatical relationships to one another. For instance, a phrase raised out of a lower clause to act as object in a higher clause, as in John expected Mary to admit it, will be assigned an index identical to that assigned to the ghost element which marks the logical position of the item in the lower clause. The (artificial) example quoted would be represented as:
[Nns:s John] expected [Nns:O999 Mary] [Ti:o [s999 GHOST] to admit [Ni:o it]]
- where the index 999 shows that the ghost acting as logical subject (symbolized s) of the admit clause is coreferential with Mary which acts as surface object (O) of the expected clause; the logical object (o) of the expected clause being the entire infinitival subordinate clause (Ti). In some cases, movement rules displace a constituent into a tagma within which it has no grammatical role (for instance, an adverb which is logically a clause constituent may interrupt the verb group - sequence of auxiliary verbs and main verb - of the clause): in such cases the functiontag is G ("guest"). Constituents which do not logically belong below the node which immediately dominates them in surface structure are always given G functiontags and indices linking them to their logical position. With that exception (and with one other exception not discussed here relating to co-ordination), functiontagging is used only for immediate constituents of clauses. EFC lists the categories of surface/logical-grammar discordance which are represented by the SUSANNE scheme, and the approved methods of representing them. The SUSANNE analysis is always chosen so as to be as far as possible neutral as between alternative linguistic theories.

The SUSANNE formtags are as follows:
Rootrank Formtags

O     paragraph
Oh    heading
Ot    title (e.g. of book)
Q     quotation
I     interpolation
Iq    tag question
Iu    technical reference

Clausetags

S     main clause
Ss    embedded quoting clause
Fa    adverbial clause
Fn    nominal clause
Fr    relative clause
Ff    fused relative
Fc    comparative clause
Tg    present participle clause
Tn    past participle clause
Ti    infinitival clause
Tf    for-to clause
Tb    bare nonfinite clause
Tq    infinitival relative clause
W     with clause
A     special as clause
Z     reduced ("whiz-deleted") relative
L     miscellaneous verbless clause

Phrasetags

V     verb group
N     noun phrase
J     adjective phrase
R     adverb phrase
P     prepositional phrase
D     determiner phrase
M     numeral phrase
G     genitive phrase
The various phrase categories take lower-case subcategory symbols which can be combined in any meaningful combination (e.g. the verb group must have been noticed would be formtagged Vcfp). The phrase subcategories are:
Vo    operator section of verb group, when separated from remainder of V e.g. by subject-auxiliary inversion
Vr    remainder of V from which Vo has been separated
Vm    V beginning with am
Va    V beginning with are
Vs    V beginning with was
Vz    V beginning with other 3rd-singular verb
Vw    V beginning with were
Vj    V beginning with be
Vd    V beginning with past tense
Vi    infinitival V
Vg    V beginning with present participle
Vn    V beginning with past participle
Vc    V beginning with modal
Vk    V containing emphatic DO
Ve    negative V
Vf    perfective V
Vu    progressive V
Vp    passive V
Vb    V ending with BE
Vx    V lacking main verb
Vt    catenative V
Nq    wh- N
Nv    wh...ever N
Ne    I/me as whole or head
Ny    you as whole or head
Ni    it as whole or head
Nj    adjectival head
Nn    proper name
Nu    unit of measurement as head
Na    marked as subject
No    marked as nonsubject
Ns    marked as singular
Np    marked as plural
Jq    wh- J
Jv    wh...ever J
Jx    measured absolute J
Jr    measured comparative J
Jh    "heavy" (postmodified) J
Rq    wh- R
Rv    wh...ever R
Rx    measured absolute R
Rr    measured comparative R
Rs    adverb conducive to asyndeton
Rw    quasi-nominal adverb
Po    of phrase
Pb    by phrase
Pq    wh- P
Pv    wh...ever P
Dq    wh- D
Dv    wh...ever D
Ds    marked as singular
Dp    marked as plural
Ms    M headed by one
Gq    wh- G
Gv    wh...ever G
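Since subcategory symbols are simply concatenated after the category letter, a formtag such as Vcfp can be glossed by table lookup. The Python sketch below is illustrative only; it covers the verb-group subcategories listed above, uses hypothetical names, and handles only a plain V followed by lower-case subcategory letters (not formtags carrying non-alphanumeric symbols).

```python
# Verb-group subcategory symbols, copied from the list above.
V_SUBCATEGORIES = {
    "o": "operator section of verb group",
    "r": "remainder of V from which Vo has been separated",
    "m": "V beginning with am",
    "a": "V beginning with are",
    "s": "V beginning with was",
    "z": "V beginning with other 3rd-singular verb",
    "w": "V beginning with were",
    "j": "V beginning with be",
    "d": "V beginning with past tense",
    "i": "infinitival V",
    "g": "V beginning with present participle",
    "n": "V beginning with past participle",
    "c": "V beginning with modal",
    "k": "V containing emphatic DO",
    "e": "negative V",
    "f": "perfective V",
    "u": "progressive V",
    "p": "passive V",
    "b": "V ending with BE",
    "x": "V lacking main verb",
    "t": "catenative V",
}


def describe_verb_group(formtag):
    """Gloss each subcategory symbol of a verb-group formtag such as Vcfp."""
    if not formtag.startswith("V"):
        raise ValueError("not a verb-group formtag: " + formtag)
    return [V_SUBCATEGORIES[symbol] for symbol in formtag[1:]]


glosses = describe_verb_group("Vcfp")
```

For must have been noticed (Vcfp) the glosses are "V beginning with modal", "perfective V" and "passive V".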
In later work (see §13.3 of the CHRISTINE/I Documentation file), two further M subcategories were introduced which had been omitted by oversight from the original SUSANNE scheme:
Formtags may also contain non-alphanumeric symbols, including:
? | interrogative clause |
* | imperative clause |
% | subjunctive clause |
! | exclamatory clause or other item |
" | vocative item |
Other non-alphanumeric symbols represent co-ordination structure. Under the SUSANNE scheme, second and subsequent conjuncts in a co-ordination are analysed as subordinate to the first conjunct; thus a co-ordination of the form:
chi, psi, and omega
(whatever the grammatical rank of the word-sequences chi, psi, etc.) would be assigned a structure of the form:
[chi, [psi], [and omega]]
The formtag of the entire co-ordination is determined by the properties of the first conjunct (except for singular/plural subcategories in the case of phrase categories to which these apply); the later conjuncts (which will often be transformationally reduced) have nodes of their own whose formtags mark them as "subordinate conjuncts". The following symbols relate to co-ordination (and apposition) structure:
+ | subordinate conjunct introduced by conjunction |
- | subordinate conjunct not introduced by conjunction |
@ | appositional element |
& | co-ordinate structure acting as first conjunct within a higher co-ordination (marked in certain cases only) |
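The treatment of later conjuncts as subordinate to the first can be sketched as follows. This is an illustration only: the function name coordination_tree and the nested-list representation are assumptions made for the example, not part of the SUSANNE file format, and punctuation marks are ignored.

```python
def coordination_tree(first_conjunct, later_conjuncts):
    """Build a nested list for a co-ordination.

    later_conjuncts is a list of (conjunction, conjunct) pairs, where
    conjunction is None for a conjunct not introduced by a conjunction.
    Each later conjunct becomes a subordinate node labelled '+' or '-'.
    (Hypothetical representation, chosen for this illustration only.)
    """
    tree = [first_conjunct]
    for conjunction, conjunct in later_conjuncts:
        if conjunction is None:
            tree.append(["-", conjunct])
        else:
            tree.append(["+", conjunction, conjunct])
    return tree


# The schematic example 'chi, psi, and omega' from the text:
print(coordination_tree("chi", [(None, "psi"), ("and", "omega")]))
```

The outer list corresponds to the node whose formtag is determined by the first conjunct; each inner list corresponds to a "subordinate conjunct" node.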
Co-ordination is recognised as occurring between words as well as between higher-rank tagmas. Therefore nonterminal nodes may have formtags consisting of wordtags followed by co-ordination symbols, thus (using WT to stand for an arbitrary wordtag):
WT& | co-ordination of words |
WT+ | conjunct within wordlevel co-ordination that is introduced by a conjunction |
WT- | conjunct within wordlevel co-ordination not introduced by a conjunction |
(A wordlevel co-ordination always takes an ampersand on its formtag; phrase or clause co-ordinations do so only in very restricted circumstances.)
Also, certain sequences of orthographic words, in certain uses, are regarded as functioning grammatically as single words ("grammatical idioms"). For instance, none the less would normally be treated as a grammatical idiom, equivalent to an adverb (for which the wordtag is RR). In such cases, the nonterminal node dominating the sequence has a formtag consisting of an equals sign suffixed to the corresponding wordtag; and the individual words composing the grammatical idiom are not wordtagged in their own right, but receive tags with numerical suffixes reflecting their membership of an idiom. (The sequence none the less would be formtagged RR=, and the words none, the, and less in this context would be wordtagged RR31 RR32 RR33.) EFC includes exhaustive listings of closed-class grammatical idioms.
Note that formtags of the forms WT& WT+ WT- WT= count as wordrank formtags for the purposes of determining tree structure as discussed above.
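The tagging of grammatical idioms can be illustrated with a short sketch. It assumes, on the basis of the single example RR31 RR32 RR33, that the numerical suffix consists of the number of words in the idiom followed by the position of the word within it; that reading, and the function name idiom_tags, are assumptions of this example.

```python
def idiom_tags(wordtag, words):
    """Return (formtag, per-word tags) for a grammatical idiom.

    Assumption: each word's suffix is <idiom length><position>, as in
    RR31 RR32 RR33 for 'none the less'. Idioms outside 2-9 words are
    rejected because the single-digit scheme assumed here cannot
    represent them.
    """
    length = len(words)
    if not 2 <= length <= 9:
        raise ValueError("idiom must contain between 2 and 9 words")
    formtag = wordtag + "="
    wordtags = ["%s%d%d" % (wordtag, length, position)
                for position in range(1, length + 1)]
    return formtag, wordtags


print(idiom_tags("RR", ["none", "the", "less"]))
```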
Functiontags divide into complement and adjunct tags: broadly, a given complement tag can occur at most once in any clause, but a clause may contain multiple adjuncts of the same type. The scheme of adjunct categories was developed from the classification of Quirk et al. (1985).
Complement Functiontags
s | logical subject |
o | logical direct object |
i | indirect object |
u | prepositional object |
e | predicate complement of subject |
j | predicate complement of object |
a | agent of passive |
S | surface (and not logical) subject |
O | surface (and not logical) direct object |
G | "guest" having no grammatical role within its tagma |
Adjunct Functiontags
p | place |
q | direction |
t | time |
h | manner or degree |
m | modality |
c | contingency |
r | respect |
w | comitative |
k | benefactive |
b | absolute |
Other Functiontags
n | participle of phrasal verb |
x | relative clause having higher clause as antecedent |
z | complement of catenative |
Detailed guidelines for the application of these functional categories are included in EFC.
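The distinction between complement and adjunct tags suggests a simple consistency check, sketched below. It assumes that the functiontags of the immediate constituents of one clause have already been collected into a list; the function name repeated_complements is hypothetical. Since the one-per-clause rule is stated only as holding "broadly", anything the check reports is a candidate for review rather than a definite error.

```python
from collections import Counter

# Tag inventories copied from the functiontag tables.
COMPLEMENT_TAGS = set("soiuejaSOG")
ADJUNCT_TAGS = set("pqthmcrwkb")


def repeated_complements(functiontags):
    """Return complement tags occurring more than once in one clause.

    functiontags: functiontag letters of the clause's immediate
    constituents (an assumed, pre-extracted input). Adjunct tags may
    repeat freely and are ignored.
    """
    counts = Counter(tag for tag in functiontags if tag in COMPLEMENT_TAGS)
    return sorted(tag for tag, count in counts.items() if count > 1)


print(repeated_complements(["s", "o", "t", "t", "p"]))  # adjuncts may repeat
print(repeated_complements(["s", "s", "o", "t"]))       # repeated subject
```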
As with other releases after the first, Release 5 includes a number of corrections of analytic errors in particular passages. Relative to Release 4 there are twenty or so of these. Many of this batch of errors were brought to my attention by Vladimir V. Gojol of the National Institute of Informatics, Bucharest, to whom I offer thanks.
There are also, however, a number of larger-scale differences between Release 5 and previous SUSANNE releases.
The lexicon file is new in Release 5; this has already been discussed in §3 above.
A "structural" difference between Release 5 and its predecessors relates to the format of the first field (the "reference field") in each line. The reference field gives each line of the SUSANNE Corpus (hence, each word or punctuation mark of the texts) a unique identifying code. Up to Release 4, this had a format exemplified by N06:0180e, where N06 is a Brown file name, 0180 is the number of a line in the Bergen I version of the Brown Corpus (which is a horizontal, i.e. many words per line, version), and e identifies the individual word within the Bergen I line - successive words were given successive lower-case letters. In Release 5, the last byte is replaced by a full stop followed by a pair of digits: N06:0180e becomes N06:0180.15.
The motive for this change is that, whenever the Corpus was corrected in a way that changed the division of the text into word-tokens (for instance, perhaps the analytic guidelines required a hyphenated word to be split across multiple lines), under the old system not only the lines immediately affected but many following lines had to be painstakingly edited, because the lettering was continuous. By moving to two-digit line identifiers it becomes possible to leave gaps (the numbers usually go up in threes), so corrections to the Corpus are less arduous. This is arguably a case of shutting the stable door, since I hope that far fewer corrections will be needed in future than the large number that have been made since the initial release. Nevertheless, the change seemed worth making.
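The change of reference-field format can be sketched as a conversion function. This is an illustration under a stated assumption: it spaces the new numbers in threes from the start of each Bergen I line, which reproduces the example N06:0180e / N06:0180.15 but, since the numbers only "usually" go up in threes, will not match every line of the actual Release 5 files. The function name old_to_new_reference is hypothetical.

```python
import re

# Release 4 style: Brown file name, Bergen I line number, one letter.
OLD_REFERENCE = re.compile(r"^([A-Z]\d{2}):(\d{4})([a-z])$")


def old_to_new_reference(old_reference, step=3):
    """Convert e.g. 'N06:0180e' to 'N06:0180.15'.

    Assumption: the n-th word of a Bergen I line receives the number
    n * step, with step defaulting to 3.
    """
    match = OLD_REFERENCE.match(old_reference)
    if match is None:
        raise ValueError("not a Release 4 reference: %r" % old_reference)
    text_name, line_number, letter = match.groups()
    position = ord(letter) - ord("a") + 1  # a=1, b=2, ...
    return "%s:%s.%02d" % (text_name, line_number, position * step)


print(old_to_new_reference("N06:0180e"))
```

Leaving gaps of this kind means that a newly split word-token can be given an unused number between its neighbours without renumbering the lines that follow.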
The system by which texts are subdivided into paragraphs and larger units was not described adequately in previous versions of the Documentation file, and has been slightly changed for Release 5.
Text structure above the paragraph level was not central to the concerns either of the SUSANNE project, or of the Brown Corpus which provided our source material. Brown (see Francis & Kučera 1989: 10) recognized just two levels of text subdivision. For any particular 2000-word text extract, a "major" subdivision was defined as the highest-order division that happened to fall within the extract (for instance a chapter boundary, or a section boundary if the whole extract fell within one chapter); a "minor" subdivision was defined as any lower-order subdivision within that extract, down to and including paragraph boundaries. This, at least, is how the Brown Manual describes their system. Taken literally, though, it implies that paragraphs should be "major" subdivisions for extracts that contain no higher-order divisions, and every extract should include at least one major boundary. In reality, only a minority of the 64 Brown texts used in SUSANNE include anything labelled as a "major" subdivision. Paragraph boundaries are always coded as minor subdivisions even in the many extracts that contain no higher-order divisions.
When a text subdivision is introduced by a heading, Brown codes it in a way that is translated in SUSANNE by surrounding the heading with the "words" <bmajhd>...<emajhd> (for a major heading) or <bminhd>...<eminhd> (for a minor heading). The wording of the heading is shown in Brown (and therefore also in SUSANNE) in all capitals, irrespective of the original typography. When a subdivision has no heading, it is preceded by the SUSANNE "word" <majbrk> or <minbrk> - thus paragraphs are separated by <minbrk> symbols.
The text following a heading is not treated as a separate paragraph from the heading (the <minbrk> symbol never follows immediately after <emajhd> or <eminhd> - nor does <minbrk> occur immediately before <bmajhd> or <bminhd>). There is no concept of balancing a "beginning of subdivision" marker with an "end of subdivision" marker, akin to SGML <p>...</p>. The Brown Corpus was compiled long before the creation of SGML and its descendant systems, and SUSANNE simply followed the logic of the Brown annotation in this area.
In one respect, however, Release 5 has introduced a change. Many of the 2000-word Brown "texts" are made up of two or more short pieces from separate sources, grouped together to achieve a Corpus in which all "texts" are of roughly equal length for statistical convenience. (In SUSANNE this applies mainly, but not exclusively, to the newspaper texts of Category A.) A single "text" may include the whole or parts of separate items from one issue of a newspaper, items from different issues, or even items from different newspapers. (Details are given in §7, "Sources", below.) In the Brown Corpus, junctions between separate items within one "text" were given no special marker. If the following item began with a heading, it would start with (the Brown equivalent of) a <bmajhd> or <bminhd> element; but often it was marked merely as a paragraph break. In consequence, there are cases where, illogically, a subdivision internal to a single item was marked as a more important break than the junction between what were originally entirely separate documents.
SUSANNE Release 5 addresses this problem by introducing a new symbol, <docbrk>, to mark junctions within SUSANNE files between separate documents (including separate news stories from the same issue of a newspaper).
If the material following a <docbrk> happens to be a heading, it will begin with a <bmajhd> or <bminhd> symbol as normal. This is the only situation in the SUSANNE Corpus where <...brk> and <...hd> lines are adjacent.
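The adjacency constraints on break and heading symbols can be checked mechanically. The sketch below assumes that the word-field values of a file have already been extracted, in order, into a list of strings; the function name adjacency_violations is hypothetical.

```python
BREAKS = {"<majbrk>", "<minbrk>", "<docbrk>"}
HEADING_OPENERS = {"<bmajhd>", "<bminhd>"}
HEADING_CLOSERS = {"<emajhd>", "<eminhd>"}


def adjacency_violations(words):
    """Return (index, left, right) for each disallowed adjacent pair.

    The only break/heading adjacency treated as legitimate is <docbrk>
    immediately followed by a heading-opening symbol.
    """
    violations = []
    for index in range(len(words) - 1):
        left, right = words[index], words[index + 1]
        break_after_heading = left in HEADING_CLOSERS and right in BREAKS
        break_before_heading = (left in BREAKS and left != "<docbrk>"
                                and right in HEADING_OPENERS)
        if break_after_heading or break_before_heading:
            violations.append((index, left, right))
    return violations


sample = ["<docbrk>", "<bmajhd>", "TITLE", "<emajhd>", "Text", "<minbrk>", "More"]
print(adjacency_violations(sample))
```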
A separate point worth mentioning about the SUSANNE heading and paragraph system is that it is arguably rather illogical for the beginning- and end-heading marks to be included as first and last daughters of Oh (heading) tagmas, and for break symbols to be given Oh nodes above them. Elsewhere in the grammatical annotation scheme, boundary markers are treated as sisters rather than daughters of the units they bound. For instance, full stop and other sentence-final punctuation marks are right-sisters, not last daughters, of S tagmas. However, it has not seemed worth revising the Corpus in this respect. The linguistically-interesting top-level tagmas are those labelled O (paragraph) rather than Oh, and they would be unaffected by such a revision.
Work on the CHRISTINE speech corpus over the years up to 1999 led to many additions and refinements to the annotation scheme of English for the Computer. The bulk of these related to specifically spoken-language phenomena; but the work also uncovered various errors and inconsistencies in the guidelines on analysis of features that occur in written language.
New decisions about how to deal with these features were specified in §15 of the CHRISTINE documentation file (URL 6), and Release 5 of SUSANNE has been brought into line with most of them. Many of the points are rather trivial and are not listed separately here. Certain points are discussed, however, because they proved problematic to implement for SUSANNE.
The CHRISTINE documentation file describes it as a mistake in EFC that the idiom as well as is classified as CC rather than II. That was a response to instances of as well as in CHRISTINE/I where the idiom functioned unmistakably as a preposition. But the truth is that there are other cases where it is unmistakably a conjunction. As well as should be tagged as either II= or CC= according to context (and there are many contexts where it is hard to make a choice one way or the other).
The CHRISTINE documentation file, §15, re p. 194 of EFC says that the definition of Vx should explicitly exclude verb groups in which DO replaces a more specific main verb, e.g. you must do (CHRISTINE T12.03913) for you must remember. But this statement is itself insufficiently specific. In infiltrating the state as they did in the Republican administration (SUSANNE A06:1580), the word did is tagged Vdx because they did could be seen as standing for they did infiltrate; in you must do (which I believe is a specifically British rather than American usage), must do is Vc rather than Vcx because it can only be seen as standing for must remember, not *must do remember. Likewise in he will visit...as he never fails to do (SUSANNE G06:0980), to do is Vi rather than Vix because to do can only be taken to stand for to visit, not *to do visit.
EFC, §4.501, noted that SUSANNE had not marked one-word appositional elements within phrases with ...@ tags, and noted that this was illogical and regrettable since one-word conjuncts within phrases are regularly given ...+ or ...- nodes. In CHRISTINE (§15 of the documentation file, re §4.507 of EFC), quoted words or phrases which occur in apposition are regularly tagged Q@, and this was advocated as desirable for annotating written language also: thus the example quoted in EFC, the word mug meant the object which ..., would take the annotation [Ns:s the word [Q@ mug_NN1c]] .... Unfortunately it has not been practical to locate and change the various relevant passages in SUSANNE.
I noted in §4.10 above that SUSANNE Release 5 has not applied the two further numeral-phrase (M) subcategories introduced in the CHRISTINE work.
EFC, §4.475, pointed out that the ...& notation for left-grouped coordination had been applied in SUSANNE only to coordinations where the main conjunct is a clause rather than a phrase, though in subsequent work we have applied it across the board. SUSANNE Release 5 has not corrected earlier SUSANNE versions in this respect.
Our team has also made one new decision about analytic practice which was too recent to include in the CHRISTINE documentation file. This relates to the use of the ghost and index system within postmodifying tagmas which logically are reduced relative clauses, tagged Tg, Ti, Tn, or Z. According to the EFC guidelines, these should resemble unreduced Fr clauses in containing ghosts indexed to their antecedents.
This proved to be an aspect of the scheme to which analysts often found it difficult to conform in practice; and the indexing system adds no real information in the case of Tg and Tn postmodifiers, because the understood element of the clause, if represented as a ghost, will always be functiontagged :s in a Tg and :S in a Tn. Therefore, in our current practice, ghosts are not included in postmodifying Tg or Tn tagmas. They are still included in Ti or Z tagmas, where their functiontag is not straightforwardly predictable from the surrounding structure (and, in the case of postmodifying Ti's, where there is a difference between instances which are reduced relative clauses, containing an understood element coreferential with the antecedent, and instances where no element is understood and no possibility of inserting a ghost arises).
There has been no attempt to rework SUSANNE in line with this recent change of practice. The use of ghost elements in SUSANNE is believed to conform fairly well to the EFC standard, and it would be a pity to remove explicit, correct information from the files, even if it could be reconstructed deterministically.
The SUSANNE Corpus aims to reflect the incidence of errors found in real-life written English, and therefore to reproduce those errors which stem from the original texts on which the Brown Corpus was based, while correcting errors that were introduced into the Brown Corpus during the process of corpus-construction. The SUSANNE Corpus was developed from the "Bergen I" version of the Brown Corpus; whenever a form in Bergen I appeared erroneous (and was not discussed explicitly in the Brown Corpus Manual), the SUSANNE team checked whether the error reflected the original source. (In some cases we went to the original publications; in many other cases we asked W. Nelson Francis and Andrew Mackie of Brown University to consult the copies of those publications which had been used in compiling the Brown Corpus - their help is gratefully acknowledged.) When an error (or apparent error) reflects the form found in the original publication, it is preserved in the SUSANNE Corpus (flagged by an E in the status field); otherwise, the SUSANNE Corpus restores the text of the original publication and the status field ignores the error. Where errors are original, wordtagging and grammatical analysis are applied to the erroneous text as best they can be by analogy with correct forms (cf. EFC, §4.26).
For the benefit of users of different versions of the Brown Corpus, a list is included below of all the apparent errors examined, both those which turned out to be original and those which were introduced subsequently and are corrected in SUSANNE. In one or two cases it is open to question whether some odd usage found in the original sources is in fact a misprint or merely an unusual use of words. In cases of doubt the SUSANNE policy was to assume that the original wording was intended, and not to put E in the status field, but these cases too are listed below.
It is of course all too possible that the task of creating the SUSANNE Corpus may have introduced new errors in the word field, and that some errors in Bergen I which were not original have been allowed to stand by oversight. Users who discover text errors not logged below are encouraged to bring them to my attention for correction in subsequent releases.
In the following list, "BgI" represents the Bergen I version of the Brown Corpus, "BCUM" represents the 1989 edition of the Brown Manual (Francis & Kučera 1989). The simple comment "X for Y" means that the form X occurs in the original text and appears to be a mistake which should read Y; the form X stands in the SUSANNE Corpus, with E in the status field. (Suggested corrections cannot be certain, and in a few cases may represent a misunderstanding of the writer's intention even assuming this was well-defined.) BCUM attempts to list cases of this sort exhaustively; where such a case is not logged there, the comment "not logged in BCUM" is added. The comment "BgI has X for Y", on the other hand, means that the erroneous form X is not original but was introduced in the process of creating the Brown Corpus; SUSANNE has restored the original form Y, and the status field contains a hyphen.
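The two comment conventions can be told apart mechanically, as in the following sketch. The function name classify_comment and the dictionary keys are hypothetical. The sketch handles only the two simple patterns "X for Y" and "BgI has X for Y"; a comment with no " for " yields None, while longer free-text comments that happen to contain " for " will be split at the first occurrence and may need checking by hand.

```python
import re

BGI_PATTERN = re.compile(r"^BgI has (.+?) for (.+)$")
ORIGINAL_PATTERN = re.compile(r"^(.+?) for (.+)$")


def classify_comment(comment):
    """Classify one comment from the error list.

    Returns a dict giving the source of the error, the erroneous form,
    the intended form, and the expected status-field value ('E' for an
    original error preserved in SUSANNE, '-' for a Bergen I error that
    SUSANNE corrects), or None if neither simple pattern applies.
    """
    text = comment.replace("(not logged in BCUM)", "").strip()
    match = BGI_PATTERN.match(text)
    if match:
        return {"source": "BgI", "erroneous": match.group(1),
                "intended": match.group(2), "status": "-"}
    match = ORIGINAL_PATTERN.match(text)
    if match:
        return {"source": "original", "erroneous": match.group(1),
                "intended": match.group(2), "status": "E"}
    return None


print(classify_comment("BgI has be to for to be"))
print(classify_comment("conspicious for conspicuous"))
```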
A01:0370 | BgI omits double closing inverted commas between money and full stop |
A02:0560 | BgI has be to for to be |
A02:0730 | county for County in Lamar county Hospital District |
A03:0840 | words are missing |
A04:0150 | conspicious for conspicuous |
A04:1880 | double closing inverted commas omitted between problem and full stop (not logged in BCUM) |
A04:1900 | ot for to |
A04:1930 | ond for and |
A05:0200 | full stop for comma |
A06:0250 | adminstration for administration |
A06:0410 | Nothing for Noting |
A06:0670 | rebound appears to be malapropism for redound (not logged in BCUM) |
A06:1680 | alloted for allotted |
A08:0470 | statutes for statues |
A08:0770 | builtin for built-in |
A08:0910 | full stop where there should be comma or no punctuation mark (not logged in BCUM) |
A09:0350 | BgI has full stop for comma after firms (the copy of text A09 held at Brown University is now too faded to be sure that this error is not original, but it is assumed to be an error of transcription) |
A09:0760 | accomodations for accommodations |
A09:1420 | severly for severely |
A10:0190 | ritiuality for (?) rituality |
A10:0580 | in grammatically redundant before which |
A10:0590 | double closing inverted commas omitted between following and full stop (not logged in BCUM) |
A10:1330 | Diety for Deity |
A10:1890 | conpired for conspired |
A11:0060 | draought for drought |
A11:0060 | BgI has 3-to-o for 3-to-0 |
A11:0090 | righthandler for righthander |
A11:0230 | It is not clear whether the form A's as an abbreviation for Athletics' should be regarded as an error. This extract regularly abbreviates the team name Athletics as A's, but in this line it occurs as a genitive; it is an open question whether English orthography requires a second apostrophe following the s in such a case. (If it does, this would be the sole SUSANNE line calling for two different markers in the status field.) |
A11:0820 | rookie-of-the year for rookie-of-the-year |
A11:0860 | 6-foot 3 inch for (?) 6-foot-3-inch |
A11:1120 | Dimaggio for (?) DiMaggio |
A11:1400 | BgI has redundant comma following initial "B." in Norman B. Small |
A11:1760 | Wellsley College for (?) Wellesley College |
A12:0210 | double closing inverted commas omitted between training and full stop (not logged in BCUM) |
A12:1160 | Owl's for Owls' (not logged in BCUM) |
A12:1460 | out linebackers may be error for our linebackers but SUSANNE Corpus assumes out was intended |
A13:0040 | noun missing after countless (not logged in BCUM) |
A13:0140 | Bear's for Bears' (not logged in BCUM) |
A13:0140 | gruonded for grounded |
A13:0350 | are for is (not logged in BCUM) |
A13:1000 | Closing double inverted commas appear to be needed between gone and full stop, but this is not an error in the original; the Brown text consists of six short extracts from separate news stories, and the extract ending on line 1000 breaks off in the middle of a quotation. |
A13:1120 | closing double inverted commas omitted after leaguer |
A14:1060 | redundant space inserted between diamond- and studded |
A19:1260 | Dresbach's for Dresbachs' (not logged in BCUM) |
A19:1560 | BgI has up to for up |
A20:0970 | advance for advanced |
G01:1640 | is with plural subject (not logged in BCUM) |
G01:1680 | Northeners for Northerners |
G02:1370 | socal for social |
G03:1110, 1630 | two tokens of preprepared are assumed to be intended, not misprints |
G04:0480 | grazer assumed to be error for grazier |
G05:0110 | aromatick occurs as archaism |
G05:0410 | BgI has ?t for It |
G06:0630 | Wozzek for Wozzeck (title of an opera by Alban Berg) (not logged in BCUM) |
G06:1330 | closing single inverted comma missing after Meinung |
G06:1550 | BgI has double rather than single inverted comma before the angry |
G09:1170 | comma missing after light |
G09:1290 | kaleidescope for kaleidoscope |
G10:0380-0450 | The original has a quotation immediately followed by a note in square brackets, all set blocked. The Brown Corpus substitutes inverted commas for blocking to represent quotation, by the policy discussed in EFC, §2.25. In BgI the <minbrk> for paragraph boundary is shown as following the opening inverted commas on line 0380; SUSANNE normalizes this to <minbrk> followed by <ldquo>. |
G10:0390, 0410 | BgI has two instances of Negroes for original negroes |
G10:0670 | reremained for remained |
G10:0840 | differences for difference (not logged in BCUM) |
G11:0310 | Consitutional for Constitutional |
G11:0510 | discernable for discernible |
G11:0870 | terrestial for terrestrial |
G11:1300 | determing for determining |
G12:0300 | BgI has "man;" for man's |
G13:1160 | pyschiatrist for psychiatrist |
G17:0330 | Original has the Hound of <bital> Heaven's <eital> pursuit; presumably all or none of [the] Hound of Heaven (title of a poem by Francis Thompson) should be in italics. |
G17:1050 | full stop for comma |
G17:1760 | BgI has And for and |
G18:0610 | original has is after plural subject poems |
G22:0370 | BgI has "man;" for man's |
J01:0190 | BgI has 1o for 10 |
J01:0190 | BgI has 6ooº for 600º |
J02:0060 | BgI omits full stop after energy |
J02:0220 | BgI has co0ling for cooling |
J02:0350 | up for up to before <formul> (not logged in BCUM) |
J02:0400 | BgI has it for It |
J02:0860 | BgI omits full stop after 30 |
J02:1030 | has for gas (not logged in BCUM) |
J02:1040 | space omitted between to and the |
J03:0570 | as for is |
J04:1370 | electron for electrons |
J05:1670 | a for an |
J06:1760 | BgI has oserved for observed |
J08:0170 | areosol for aerosol |
J08:0270 | assesment for assessment |
J08:0440 | meterological for meteorological |
J08:0560 | full stop for comma |
J08:0860 | on-sure for on-shore |
J08:1620 | prowazwki for prowazekii (not logged in BCUM) |
J09:0600 | used for use |
J09:0740 | BgI has EEAE for DEAE |
J09:0850 | BgI has the for The |
J10:1170 | BgI has full stop for comma after honeybee |
J12:1830 | BgI has maybe for may be |
J17:1210 | parsympathetic for parasympathetic |
J21:0250 | meets for meet |
J21:0880 | Original lacks a comma after <formul> where context appears to call for it; BCUM logs this as a typographical error, but SUSANNE assumes that the comma was deliberately omitted to avoid confusion with the formula. |
J22:0700 | of for or (not logged in BCUM) |
J23:0770 | humilation for humiliation |
J23:0930-40 | BgI has united states for United States |
J23:1720 | BgI has it for It |
J24:1900 | BgI has our for Our |
N02:0560 | BgI has two for Two |
N02:1610 | BgI omits he after where |
N03:0100 | BgI has My for my |
N03:0550 | BgI omits opening double inverted commas before Just |
N03:1450 | BgI has mike for Mike |
N04:0140 | BgI omits opening double inverted commas before The |
N04:0550 | BgI omits full stop after other |
N06:0100 | of for or |
N07:1430 | The apostrophe after dollars is arguably redundant or requires a following word such as worth, but the original text is as shown; no E in status field since the grammatical position is debatable |
N09:0330 | ommission for omission |
N09:1080 | BgI omits opening double inverted commas before I'm |
N10:0080 | coosie's for cooky's or cookie's |
N10:1280 | BgI has redundant space after the h of Marshal |
N12:1220 | shout for shot |
N13:1500 | onct in original is assumed not to be an error but to represent a dialect pronunciation of once |
N14:1010 | BgI omits opening double inverted commas before It'll |
N14:1110 | original has redundant opening double inverted commas before Pat |
N14:1160 | down off for off down |
N14:1600 | original lacks was before pumping |
N15:0090 | opportunities for opportunity |
The SUSANNE Corpus is based on 64 of the 500 texts of the Brown Corpus. The 64 SUSANNE texts are the following (publication details summarized from Francis & Kučera (1989), dates shown in ISO yyyy.mm.dd order):
A01 |
|
A02 |
|
A03 |
|
A04 |
|
A05 |
|
A06 |
|
A07 |
|
A08 |
|
A09 |
|
A10 |
|
A11 |
|
A12 |
|
A13 |
|
A14 |
|
A19 |
|
A20 |
|
G01 | Edward P. Lawton, "Northern Liberals and Southern Bourbons", The Georgia Review, 15 (1961), 254-259 |
G02 | Arthur S. Miller, "Toward a Concept of National Responsibility", The Yale Review, LI:2 (December 1961), 186-191 |
G03 | Peter Wyden, "The Chances of Accidental War", The Saturday Evening Post, 1961.06.03, 18-19 and 60-61 |
G04 | Eugene Burdick, "The Invisible Aborigine", Harper's Magazine, 223:1336 (September 1961), 70-72 |
G05 | Terence O'Donnell, "Evenings at the Bridge", Horizon, III:5 (May 1961), 26-30 |
G06 |
|
G07 | Richard B. Morris, "Seven Who Set Our Destiny", The New York Times Magazine, 1961.02.19, 9 and 69-70 |
G08 | Frank Murphy, "New Southern Fiction: Urban or Agrarian?", The Carolina Quarterly, 13:2 (Spring 1961), 18-25 |
G09 | Selma Jeanne Cohen, "Avant-Garde Choreography", Criticism A Quarterly for Literature and the Arts, vol. III, no. 1 (Winter 1961), 24-28 |
G10 | Clarence Streit, "How the Civil War Kept You Sovereign" [chapter 8 of Freedom's Frontier - Atlantic Union Now], Freedom and Union, 16:2 (February 1961), 16-18 |
G11 | Frank Oppenheimer, "Science and Fear - A Discussion of Some Fruits of Scientific Understanding", The Centennial Review, 5:4 (Fall 1961), 404-409 |
G12 | Tom F. Driver, "Beckett by the Madeleine", Columbia University Forum, 4:3 (Summer 1961), 21-24 |
G13 | Charles Glicksberg, "Sex in Contemporary Literature", The Colorado Quarterly, 9:3 (Winter 1961), 278-82 |
G17 | Randall Stewart, "A Little History, a Little Honesty: A Southern Viewpoint", The Georgia Review, 15:1 (Spring 1961), 10-15 |
G18 | Charles Wharton Stork, "Verner von Heidenstam", American-Scandinavian Review, 49:1 (March 1961), 39-43 |
G22 | Kenneth Reiner, "Coping with Runaway Technology", The Ethical Outlook, XLVII:3 (May-June 1961), 91-95 |
J01 | Cornell H. Mayer, "Radio Emission of the Moon and Planets", in Gerard P. Kuiper & Barbara M. Middlehurst (eds.), Planets and Satellites. Vol. 3 of The Solar System. University of Chicago Press, 1961, pp. 442-446 |
J02 | Raymond C. Binder et al. (eds.), Proceedings of the 1961 Heat Transfer and Fluid Mechanics Institute, Stanford University Press, 1961, pp. 193-196 |
J03 | Harry H. Hull, "The Normal Forces and Their Thermodynamic Significance", Transactions of the Society of Rheology, V (1961), 120-125 |
J04 | James A. Ibers et al., "Proton magnetic resonance study of polycrystalline HCrO2", The Physical Review, 121:6 (1961.03.15), 1620-1622 |
J05 | Jay C. Harris & John R. Van Wazer, "Detergent building", in J.R. Van Wazer (ed.), Phosphorus and its Compounds, Interscience Publishers, Inc., 1961, pp. 1732-1737 |
J06 | Francis J. Johnston & John E. Willard, "The exchange reaction between Chlorine and Carbon Tetrachloride", Journal of Physical Chemistry, 65 (February 1961), 317-318 |
J07 | J.F. Vedder, "Micrometeorites", in Francis S. Johnson (ed.), Satellite Environment Handbook, Stanford University Press, 1961, pp. 92-97 |
J08 | LeRoy Fothergill, "Biological Warfare", in Peter Gray (ed.), The Encyclopedia of the Biological Sciences, Reinhold Publishing Corporation, 1961, pp. 145-149 |
J09 | M. Yokoyama et al., "Chemical and serological characteristics of blood group antibodies in the ABO and Rh systems", The Journal of Immunology, 87 (1961), 56-60 |
J10 | B.J.D. Meeuse, The Story of Pollination, The Ronald Press Company, 1961, pp. 104-108 |
J12 | Richard F. McLaughlin et al., "A study of the subgross pulmonary anatomy in various mammals", The American Journal of Anatomy, 108 (1961), 154-157 |
J17 | E. Gellhorn, "Prolegomena to a theory of the emotions", Perspectives in Biology and Medicine, 4 (1961), 426-431 |
J21 | C.R. Wylie, Jr., "Line involutions in S3 whose singular lines all meet in a twisted curve", Proceedings of the American Mathematical Society, 12 (1961), 335-339 |
J22 | Max F. Millikan & Donald L.M. Blackmer (eds.), The Emerging Nations: Their Growth and United States Policy, Little, Brown and Company, 1961, pp. 136-142 |
J23 | Joyce O. Hertzler, American Social Institutions; A Sociological Analysis, Allyn and Bacon, Inc., 1961, pp. 478-482 |
J24 | Howard J. Parad, "Preventive casework: problems and implications", The Social Welfare Forum, 1961, Columbia University Press for the National Conference on Social Welfare, 1961, pp. 186-191 |
N01 | Wayne D. Overholser, The Killer Marshal, Dell Publishing Co., 1963 [copyright 1961], pp. 53-58 |
N02 | Clifford Irving, The Valley, McGraw-Hill Book Company, Inc., 1961, pp. 262-267 |
N03 | Cliff Farrell, Trail of the Tattered Star, Doubleday & Company, 1961, pp. 168-173 |
N04 | James D. Horan, The Shadow Catcher, Crown Publishers, Inc., 1961, pp. 248-253 |
N05 | Richard Ferber, Bitter Valley, Dell Publishing Company, 1961, pp. 9-17 |
N06 | Thomas Anderson, Here Comes Pete Now, Random House, 1961, pp. 4-12 |
N07 | Todhunter Ballard, The Night Riders, Pocket Books, Inc., 1961, pp. 5-11 |
N08 | Mary Savage, Just For Tonight, Dodd, Mead & Company, 1961, pp. 114-120 |
N09 | Jim Thompson, The Transgressors, The New American Library of World Literature, Inc., 1961, pp. 9-13 |
N10 | Joseph Chadwick, No Land Is Free, Avon Book Division, Hearst Corporation, 1961, pp. 21-26 |
N11 | Gene Caesar, Rifle For Rent, Monarch Books, Inc., 1963 [copyright 1961], pp. 46-51 |
N12 | Edwin Booth, Outlaw Town, Ballantine Books, Inc., 1961, pp. 103-108 |
N13 | Martha Ferguson McKeown, Mountains Ahead, G.P. Putnam's Sons, 1961, pp. 390-395 |
N14 | Peter Field, Rattlesnake Ridge, Jefferson House, Inc., 1961, pp. 164-172 |
N15 | Donald J. Plantz, Sweeney Squadron, Dell Publishing Co., Inc., 1961, pp. 133-138 |
N18 | Peter Bains, "With Women...Education Pays Off", Monsieur, 4:2 (February 1961), 17 and 77-78 |
¹The support of the Economic and Social Research Council is gratefully acknowledged. The SUSANNE Project, "Construction of an Analysed Corpus of English", was funded by ESRC award no. R00023 1142, over the period 1988 to 1992. "SUSANNE" stands for "Surface and underlying structural analyses of naturalistic English". I should like to express my warmest thanks to the team who worked on the SUSANNE Project, namely Robin Haigh, Hélène Knight, Tim Willis, and Nancy Glaister.
²I thank Alvar Ellegård for permission to circulate a research resource derived from the work of his group.
³This is not to suggest that the Pennsylvania Treebank analytic system is crude or ill-defined. Although the Pennsylvania team began by focusing on quantity of material analysed, they have since published electronically a definition of their own analytic scheme (URL 8) which is very comparable in degree of refinement to that of English for the Computer. Having been first in the field, my group and I would now find it difficult to consider abandoning the SUSANNE scheme for a radically different one, even if some alternative were to prove demonstrably superior.
A. Ellegård (1978) The Syntactic Structure of English Texts. Gothenburg Studies in English, 43.
W.N. Francis & H. Kučera (1989) Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (corrected and revised edition). Department of Linguistics, Brown University, Providence, Rhode Island. [In case of browsers which fail to display the appropriate Unicode subset, note that the third character of the second author's surname is c with the Czech hachek diacritic.]
R.G. Garside, G.N. Leech, & G.R. Sampson, eds. (1987) The Computational Analysis of English. Longman.
D. Gibbon, R. Moore, & R. Winski, eds. (1997) Handbook of Standards and Resources for Spoken Language Systems. Mouton.
K. Hofland & S. Johansson (1982) Word Frequencies in British and American English. Longman.
A. Kilgarriff & Martha Palmer, eds. (2000) Computers and the Humanities special issue on Senseval, vol. 34, issues 1-2.
D.T. Langendoen (1997) Review of Sampson (1995). Language 73.600-3.
M.P. Marcus, Beatrice Santorini, & Mary Ann Marcinkiewicz (1993) "Building a large annotated corpus of English: the Penn Treebank". Computational Linguistics 19.313-30.
R. Quirk, S. Greenbaum, G. Leech, & J. Svartvik (1985) A Comprehensive Grammar of the English Language. Longman.
G.R. Sampson (1992) "Probabilistic parsing". In J. Svartvik, ed., Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Mouton de Gruyter.
G.R. Sampson (1995) English for the Computer: The SUSANNE Corpus and Analytic Scheme. Clarendon Press (Oxford University Press).
G.R. Sampson (1998) Review of S. Greenbaum, ed., Comparing English Worldwide. Natural Language Engineering 4.363-5.
G.R. Sampson (2001) Empirical Linguistics. Continuum International.
1 | http://www.uic.edu/orgs/tei/ |
2 | http://www.cs.vassar.edu/CES/ |
3 | http://www.hd.uib.no/icame.html |
4 | http://www.grsampson.net/RChristine.html |
5 | ftp://ftp.cogs.susx.ac.uk/pub/users/geoffs/CHRISTINE1.tar.Z |
6 | http://www.grsampson.net/ChrisDoc.html |
7 | http://www.cis.upenn.edu/~treebank/home.html |
8 | ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/ |
9 | http://www.ilc.pi.cnr.it/EAGLES/home.html |
10 | http://www.itri.brighton.ac.uk/events/senseval/ |