This Web version of Christopher Powell’s SEMiSUSANNE readme file was prepared by Geoffrey Sampson on 17 Jan 2006.
SEMiSUSANNE is a semantically tagged and structurally annotated corpus formed from the union of the SUSANNE and SemCor corpora. It consists of the 33 documents common to both corpora, retains the one-word-per-line structure of SUSANNE, but is extended to reflect the WordNet senses expressed by SemCor. This release of SEMiSUSANNE employs WordNet 1.6 senses.
SEMiSUSANNE is identical to SUSANNE in structure, but has two additional fields appended to each line: a single/compound field followed by a sense field.
The sense field comprises a two-part string. The first part is a single character drawn from the set {n,v,j,r} and indicates the noun, verb, adjective or adverb part of speech, respectively. The second part of the string is numeric, and corresponds to the ‘hereiam’ value of a WordNet synset within the database indicated by the part-of-speech character. Thus the sense field ‘n475542’ is a synset located at byte-offset 475542 in the WordNet noun database, and encodes the sense for ‘irregularity’ (behaviour that breaches the rule or etiquette), e.g.:
A01:0030.06 - NN2 irregularities irregularity .Np:s] 0 n475542
This encoding was employed to enable rapid access to the synset (when using the WordNet API), and to eliminate the synonym matching problem encountered when comparing senses using the formal WordNet sense keys. I refer the reader to the WordNet documentation for details of synsets, sense keys, and the WordNet API.
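As an illustration of the encoding described above, the following Python sketch splits a sense field into its part-of-speech character and byte offset, then seeks directly to that offset in a WordNet data file. It does not use the WordNet API. The function names, the dictionary directory argument and the data file names (data.noun, data.verb, data.adj, data.adv, as found in Unix WordNet distributions) are assumptions made for this example, so adjust them to your installation.

```python
import os
import re
from typing import Optional, Tuple

# Assumed file names for the four WordNet databases (Unix-style naming).
DATA_FILES = {"n": "data.noun", "v": "data.verb", "j": "data.adj", "r": "data.adv"}

# A sense field is one of n, v, j, r followed by a decimal byte offset.
SENSE_RE = re.compile(r"([nvjr])(\d+)")


def parse_sense(field: str) -> Optional[Tuple[str, int]]:
    """Split a sense field such as 'n475542' into ('n', 475542).

    Returns None for the hyphen used on lines that carry no sense.
    """
    if field == "-":
        return None
    match = SENSE_RE.fullmatch(field)
    if match is None:
        raise ValueError(f"not a sense field: {field!r}")
    return match.group(1), int(match.group(2))


def read_synset_record(dict_dir: str, field: str) -> Optional[str]:
    """Return the raw data-file line that starts at the encoded byte offset."""
    parsed = parse_sense(field)
    if parsed is None:
        return None
    pos, offset = parsed
    path = os.path.join(dict_dir, DATA_FILES[pos])
    # Open in binary mode so that seek() counts bytes, not characters.
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.readline().decode("latin-1").rstrip("\r\n")
```

For example, parse_sense('n475542') gives ('n', 475542). The offsets are only meaningful against the WordNet 1.6 database files, since byte offsets change between WordNet releases.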
SUSANNE encodes one word per line whereas SemCor encodes one sense per line, so some jiggery-pokery is needed to align the two in the case of compounds. Essentially, the sense of a compound is assigned to each of its constituent words, and is therefore repeated on each corresponding SEMiSUSANNE line. The single/compound field is given the value ‘1’ (one) for the first line of the compound, and is incremented for each subsequent line of that compound, e.g.:
A01:0010.09 - NP1s Fulton Fulton [Nns. 1 n17954
A01:0010.12 - NNL1cb County county .Nns] 2 n17954
A01:0010.15 - JJ Grand grand . 3 n17954
A01:0010.18 - NN1c Jury jury .Nns:s] 4 n17954
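To recover one sense per unit from the one-word-per-line layout, the single/compound field can be used to regroup lines. The Python sketch below is a minimal illustration, not part of the corpus distribution. It assumes that each line has eight whitespace-separated fields with the word in the fourth, that the lines of a compound are contiguous, and the name sense_units is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def sense_units(lines: Iterable[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (words, sense) pairs, merging the lines of each compound."""
    words: List[str] = []
    sense: Optional[str] = None
    for line in lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # assumption: skip anything that is not an 8-field line
        word, position, tag = fields[3], fields[6], fields[7]
        if position in ("-", "0", "1"):
            # A hyphen, a single word, or the first word of a compound
            # all close whatever unit was being collected.
            if words:
                yield words, sense
            words, sense = [], None
            if position == "-":
                continue  # no sense recorded on this line
            words, sense = [word], tag
        else:
            # Position 2, 3, ...: a further word of the current compound.
            if not words:
                sense = tag
            words.append(word)
    if words:
        yield words, sense
```

Applied to the four ‘Fulton County Grand Jury’ lines above, this yields a single unit holding all four words with the sense ‘n17954’.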
For single words, the single/compound field contains a ‘0’ (zero), e.g.:
A01:0010.21 - VVDv said say [Vd.Vd] 0 v682542
SEMiSUSANNE follows SUSANNE by using the hyphen character to fill empty fields: where a word carries no sense, both the single/compound and sense fields contain a hyphen, e.g.:
A01:0010.06 - AT The the [O[S[Nns:s. - -
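A reader of the corpus needs to treat the hyphen as ‘no value’ in the two appended fields. The following Python sketch parses one line into a record and maps those hyphens to None. The names Token and parse_line are hypothetical, the names given to the first six fields are my own labels for the SUSANNE columns, and splitting on whitespace is an assumption based on the examples shown here.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    reference: str   # e.g. 'A01:0010.06'
    status: str      # kept as a string; SUSANNE itself uses '-' here
    wordtag: str
    word: str
    lemma: str
    parse: str
    compound_index: Optional[int]  # 0 = single word, 1.. = position in compound
    sense: Optional[str]           # e.g. 'v682542', or None when untagged


def parse_line(line: str) -> Token:
    """Parse one SEMiSUSANNE line, turning '-' in the last two fields into None."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}: {line!r}")
    *susanne_fields, compound, sense = fields
    return Token(
        *susanne_fields,
        None if compound == "-" else int(compound),
        None if sense == "-" else sense,
    )
```

With this sketch, the line for ‘The’ above has both compound_index and sense set to None, while the line for ‘said’ has compound_index 0 and sense ‘v682542’.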
For part of my Ph.D. research* I needed a corpus from which I could extract known noun-sense and verb-sense pairs in both subject–verb and verb–object relations. As such a corpus proved difficult to find, I had to build one. To avoid the labours of hand-crafting, I looked at available corpora and found that SUSANNE and SemCor had a useful overlap. Thirty-three documents form only a small corpus, but it became the ‘gold standard’ I needed to evaluate my word-sense disambiguation algorithms. The name SEMiSUSANNE is intended to show that the corpus is derived from the SUSANNE corpus, is SEMantically tagged, and is (just over) half the size of (i.e. semi) SUSANNE.
I hope that SEMiSUSANNE is useful to you in your research. If you wish to ask any questions about or comment on SEMiSUSANNE then feel free to contact me.
Chris Powell, Jan 2006.
e: chris.powell at ashmus.ox.ac.uk
*“From E-Language to I-Language: Elements of a Pre-Processor for the Construction-Integration Model”, Chris Powell, Oxford Brookes University, 2005.