SEMiSUSANNE Corpus: Documentation


This Web version of Christopher Powell’s SEMiSUSANNE readme file was prepared by Geoffrey Sampson on 17 Jan 2006.



1. What is SEMiSUSANNE?

SEMiSUSANNE is a semantically tagged and structurally annotated corpus formed from the union of the SUSANNE and SemCor corpora. It consists of the 33 documents common to both corpora, retains the one-word-per-line structure of SUSANNE, but is extended to reflect the WordNet senses expressed by SemCor. This release of SEMiSUSANNE employs WordNet 1.6 senses.

2. What is the structure of SEMiSUSANNE?

SEMiSUSANNE is identical to SUSANNE in structure, but has two additional fields appended: a single/compound status field, followed by a WordNet sense field.

SEMiSUSANNE therefore consists of eight fields per word – the original six fields of SUSANNE, the single/compound status field, and the WordNet sense field.
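
As an illustration of the eight-field layout, the following Python sketch splits one corpus line into named fields. The names Token and parse_line, and the field names themselves, are illustrative assumptions and not part of the corpus distribution. The sketch also assumes that fields are separated by runs of whitespace and that no field contains whitespace.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One SEMiSUSANNE line; the field names are illustrative assumptions."""
    reference: str  # SUSANNE field 1, e.g. "A01:0030.06"
    status: str     # SUSANNE field 2
    wordtag: str    # SUSANNE field 3, e.g. "NN2"
    word: str       # SUSANNE field 4
    lemma: str      # SUSANNE field 5
    parse: str      # SUSANNE field 6
    compound: str   # added field 7: single/compound status
    sense: str      # added field 8: WordNet sense


def parse_line(line: str) -> Token:
    """Split one corpus line into its eight fields.

    Assumes whitespace-separated fields with no embedded whitespace.
    """
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return Token(*fields)


token = parse_line(
    "A01:0030.06  -\tNN2\tirregularities\tirregularity\t.Np:s]\t0\tn475542"
)
print(token.lemma, token.compound, token.sense)  # irregularity 0 n475542
```

Keeping every field as a string preserves the hyphen fillers exactly as they appear in the corpus; conversion to numbers can be left to the code that needs it.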

3. How do the sense-tags work?

The sense field comprises a two-part string. The first part is a single character drawn from the set {n,v,j,r}, indicating the noun, verb, adjective or adverb part-of-speech respectively. The second part is numeric, and corresponds to the ‘hereiam’ value of a WordNet synset within the database indicated by the part-of-speech character. Thus the sense field ‘n475542’ is a synset located at byte-offset 475542 in the WordNet noun database, and encodes the sense for ‘irregularity’ (behaviour that breaches the rule or etiquette), e.g.:

A01:0030.06  -	NN2	irregularities	irregularity	.Np:s]	0	n475542

This encoding was employed to enable rapid access to the synset (when using the WordNet API), and to eliminate the synonym matching problem encountered when comparing senses using the formal WordNet sense keys. I refer the reader to the WordNet documentation for details of synsets, sense keys, and the WordNet API.
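
To make the encoding concrete, here is a minimal Python sketch that splits a sense field into its part-of-speech and byte-offset parts. The names POS_NAMES and parse_sense_tag are hypothetical helpers introduced for illustration. The sketch does not call the WordNet API; it only decodes the string.

```python
# Part-of-speech letters as listed in the text above.
POS_NAMES = {"n": "noun", "v": "verb", "j": "adjective", "r": "adverb"}


def parse_sense_tag(tag: str):
    """Split a sense field such as 'n475542' into (pos_name, byte_offset).

    Returns None for the hyphen that fills the field on untagged words.
    """
    if tag == "-":
        return None
    pos, digits = tag[:1], tag[1:]
    if pos not in POS_NAMES or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"malformed sense tag: {tag!r}")
    return POS_NAMES[pos], int(digits)


pos_name, offset = parse_sense_tag("n475542")
# WordNet data files print synset offsets as 8-digit, zero-padded numbers.
print(pos_name, f"{offset:08d}")  # noun 00475542
```

Because the numeric part is a byte offset, a program holding the matching WordNet 1.6 data file open could seek directly to that position, which is the rapid access the encoding is meant to allow.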

4. How are compounds encoded?

SUSANNE encodes one word per line whereas SemCor encodes one sense per line, so some jiggery-pokery is needed to align the two in the case of compounds. Essentially, the sense of a compound is assigned to each of its constituent words, and is therefore repeated on each corresponding SEMiSUSANNE line. The single/compound field is given the value ‘1’ (one) for the first line of the compound, and is incremented for each subsequent line of that compound, e.g.:

	
	A01:0010.09  -	NP1s	Fulton	Fulton	[Nns.	1	n17954
	A01:0010.12  -	NNL1cb	County	county	.Nns]	2	n17954
	A01:0010.15  -	JJ	Grand	grand	.	3	n17954
	A01:0010.18  -	NN1c	Jury	jury	.Nns:s]	4	n17954

5. How are single words encoded?

For single words, the single/compound field contains a ‘0’ (zero), e.g.:

A01:0010.21  -	VVDv	said	say	[Vd.Vd]	0	v682542

6. What about closed-class words?

SEMiSUSANNE follows SUSANNE by using the hyphen character to fill the single/compound and sense fields, e.g.:

A01:0010.06  -	AT	The	the	[O[S[Nns:s.	-	-
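
The three cases described in sections 4 to 6 can be handled in a single pass over the corpus. The Python sketch below regroups word lines into one unit per sense, which roughly recovers the one-sense-per-line view of SemCor. The function name group_senses and the (word, compound field, sense field) row layout are assumptions made for illustration; the rows themselves are taken from the examples above.

```python
def group_senses(rows):
    """Group (word, compound_field, sense_field) rows into sense units.

    Yields (words, sense) pairs; sense is None for untagged words.
    """
    unit, sense, last = [], None, 0
    for word, compound, sense_field in rows:
        # "-" (closed-class word) and "0" (single word) both map to index 0.
        index = int(compound) if compound.isdigit() else 0
        if unit and index >= 2 and index == last + 1:
            unit.append(word)  # next word of the current compound
        else:
            if unit:
                yield unit, sense  # the previous unit is complete
            unit = [word]
            sense = None if sense_field == "-" else sense_field
        last = index
    if unit:
        yield unit, sense


ROWS = [
    ("The", "-", "-"),
    ("Fulton", "1", "n17954"),
    ("County", "2", "n17954"),
    ("Grand", "3", "n17954"),
    ("Jury", "4", "n17954"),
    ("said", "0", "v682542"),
]

for words, sense in group_senses(ROWS):
    print(" ".join(words), sense)
# The None
# Fulton County Grand Jury n17954
# said v682542
```

Checking that each index is exactly one more than the previous one means a stray value such as ‘2’ following a single word starts a new unit instead of being silently merged into it.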

7. Why SEMiSUSANNE?

For part of my Ph.D. research* I needed a corpus from which I could extract known noun-sense and verb-sense pairs in both subject–verb and verb–object relations. As such a corpus proved difficult to find, I had to build one. To avoid the labours of hand-crafting, I looked at available corpora and found that SUSANNE and SemCor had a useful overlap. Thirty-three documents form only a small corpus, but it became the ‘gold standard’ I needed to evaluate my word-sense disambiguation algorithms. The name SEMiSUSANNE is intended to show that the corpus is derived from the SUSANNE corpus, is SEMantically tagged, and is (just over) half the size of (i.e. semi) SUSANNE.

Contact

I hope that SEMiSUSANNE is useful to you in your research. If you wish to ask any questions about or comment on SEMiSUSANNE then feel free to contact me.

Chris Powell, Jan 2006.

e: chris.powell at ashmus.ox.ac.uk



*“From E-Language to I-Language: Elements of a Pre-Processor for the Construction-Integration Model”, Chris Powell, Oxford Brookes University, 2005.