The Treebank Semantics Parsed Corpus (TSPC)

The Treebank Semantics Parsed Corpus (abbreviated TSPC) is a collection of English text with hand-worked tree annotation for close to half a million words. Highlights include:

labelled constituent structure
assignments of grammatical function
inclusion of verb codes
information to resolve anaphoric dependencies

Further results derived from the analysis, notably dependency graphs, can be seen with the static interface.

A brief history

Construction of the TSPC has been ongoing since 2012. The resource was started as a sister corpus to the Kainoki treebank for Japanese. Since its inception, the TSPC has formed a testing ground for the Treebank Semantics method of processing constituency tree annotations to return logic-based meaning representations. Providing the base input for building a database of meaning representations remains a major application of the corpus, but the corpus has also grown into a resource for general use.

Search interface

The TSPC is associated with a powerful user interface that enables search using virtually any aspect of the annotation. Results of specific searches can be downloaded in the form of annotated data. The source data to which the search interface links is being updated constantly to reflect improvements in analysis.

For examples of search patterns, see: Help with finding constructions of Grammar and Beyond 3B in the TSPC.

About the annotation

There is a Parsing Guide that describes the annotation scheme in detail.

The corpus, while on a smaller scale, is comparable to the resources of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993), Parsed ICE Corpora (Nelson, Wallis, and Aarts, 2002), and the Penn Historical Parsed Corpora (Santorini, 2016). The annotation is most closely related to the SUSANNE Corpus and Analytic Scheme (Sampson, 1995), and nearly three quarters of the data was converted from annotation that followed the SUSANNE scheme.

Differentiating factors of the current resource are the richness of disambiguation information contained, and the high degree of normalised structure present. Both of these properties assist the automatic creation of meaning representations.

About the data

The corpus data comes from a wide range of sources, as detailed in the following table:

ID prefix	Counts	Date	Details
a	70 files; 5,274 trees; 88,682 words	1859–2021	diverse selection of data from the MASC component of the American National Corpus (Ide et al. 2010) (6 files; 242 trees; 4,400 words), the LOB Corpus (Johansson et al. 1986) (6 files; 666 trees; 12,214 words), the CLC FCE Dataset (Yannakoudakis et al. 2011) (10 files; 362 trees; 4,573 words), the Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles (MASTAR 2011) (3 files; 177 trees; 2,749 words), The Little Prince Corpus (AMR Project 2016) (1 file; 103 trees; 1,282 words), the Groningen Meaning Bank (Bos et al. 2017) (1 file; 122 trees; 2,216 words), Project Gutenberg and other internet sources (44 files; 3,705 trees; 62,530 words)
lucy_bnc	39 files; 3,816 trees; 79,419 words	1990s	written data samples from the British National Corpus: academic prose, biographies/autobiographies, broadsheet national newspapers, commerce and finance, miscellaneous texts, non-academic prose, novels and short stories, popular magazines; almost complete overlap with the BNC component of the LUCY Corpus (Sampson 2005)
lucy_child	150 files; 1,792 trees; 25,831 words	1965	child writing from the LUCY Corpus: 9-year-old (48 files; 574 trees; 7,836 words), 10-year-old (29 files; 356 trees; 5,115 words), 11-year-old (36 files; 420 trees; 5,927 words), 12-year-old (37 files; 442 trees; 6,953 words)
christine	40 files; 20,083 trees; 80,712 words	1990s	selection of informal face-to-face conversations from the British National Corpus (BNC Consortium 2007); complete overlap with the CHRISTINE Corpus (Sampson 2000)
lucy_student	47 files; 1,301 trees; 27,325 words	1990s	student work from the LUCY Corpus: A-Level General Studies script, Access-course coursework, first-year undergraduate essays
susanne	64 files; 7,057 trees; 130,615 words	1961	written data samples from the BROWN Corpus (Francis and Kucera 1979): press reportage, belles lettres/biography/essays, learned and scientific writings, adventure and western fiction; complete overlap with the dataset used for the SUSANNE Corpus (Sampson 1995); some overlap with the BROWN component of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) (32 files; 4,082 trees; 65,311 words)
x	22 files; 4,531 trees; 35,171 words	-	FraCaS dataset (Cooper et al. 1996) (1 file; 502 trees; 3,502 words) and miscellaneous textbook examples
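
As a quick cross-check of the overall size, the per-prefix counts in the table can be summed. The sketch below is illustrative only: the numbers are copied by hand from the table as it stands here, the names `COUNTS` and `totals` are made up for the example, and the figures will drift as the corpus is updated.

```python
# Per-prefix counts copied by hand from the table above: (files, trees, words).
# COUNTS and totals are hypothetical names used only for this illustration.
COUNTS = {
    "a": (70, 5274, 88682),
    "lucy_bnc": (39, 3816, 79419),
    "lucy_child": (150, 1792, 25831),
    "christine": (40, 20083, 80712),
    "lucy_student": (47, 1301, 27325),
    "susanne": (64, 7057, 130615),
    "x": (22, 4531, 35171),
}


def totals(counts):
    """Sum files, trees and words across all ID prefixes."""
    files = sum(c[0] for c in counts.values())
    trees = sum(c[1] for c in counts.values())
    words = sum(c[2] for c in counts.values())
    return files, trees, words


if __name__ == "__main__":
    files, trees, words = totals(COUNTS)
    print(f"{files} files; {trees:,} trees; {words:,} words")
```

With the counts as listed, this gives 432 files, 43,854 trees and 467,755 words, which matches the description of the corpus as approaching half a million words.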

Mistakes

As with any annotated text corpus, there are mistakes in the TSPC. The corpus is ongoing work and is under continuous improvement and correction. Mistakes will be corrected as they become apparent and as time allows. We would be grateful to be made aware of mistakes and will endeavour to eliminate them with the help of users (contact).

Attribution

Presentations of research results using the TSPC should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):

Butler, Alastair (2022) “The Treebank Semantics Parsed Corpus (TSPC)”, Hirosaki University. Available from: entrees.github.io (accessed 28 September 2022).

Terms of use

This work is licensed under a Creative Commons Attribution 4.0 International License.