The Treebank Semantics Parsed Corpus (TSPC)

Front Page

The Treebank Semantics Parsed Corpus (abbreviated TSPC) is a corpus of English with hand-worked tree analysis covering nearly half a million words. Highlights include:

Further results derived from the analysis, notably dependency graphs, can be seen with the search interface.

A brief history

Construction of the TSPC has been ongoing since 2012. The corpus was started as a sister corpus to the Kainoki treebank for Japanese (then the Keyaki treebank). Since its inception, the corpus has served as a testing ground for the Treebank Semantics method of processing constituency tree annotations to return logic-based meaning representations. Providing the base input for building a database of meaning representations remains a major application of the corpus, but the corpus has also grown into a resource for general use.

About the annotation

There is a Parsing Guide that describes the annotation scheme in detail.

The corpus, while on a smaller scale, is comparable to the resources of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993), Parsed ICE Corpora (Nelson, Wallis, and Aarts, 2002), and the Penn Historical Parsed Corpora (Santorini, 2016). The annotation is most closely related to the SUSANNE Corpus and Analytic Scheme (Sampson, 1995), and nearly three quarters of the data was converted from annotation that followed the SUSANNE scheme.

What distinguishes the current resource is the richness of the disambiguation information it contains and its high degree of normalised structure. Both of these properties assist the automatic creation of meaning representations.

Search Interface

The TSPC is associated with a powerful user interface that enables search using virtually any aspect of the annotation. Results of specific searches can be downloaded in the form of annotated data. The source data to which the search interface links is being updated constantly to reflect improvements in analysis.

For examples of search patterns, see: Help with finding constructions of Grammar and Beyond 3B in the TSPC.

Mistakes

As with any annotated text corpus, there are mistakes in the TSPC. The corpus is a work in progress and is under continuous improvement and correction. Mistakes will be corrected as they become apparent and as time allows. We would be grateful to be made aware of mistakes and will endeavour to eliminate them with the help of users (contact).

Attribution

Presentations of research results using the TSPC should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):

Butler, Alastair (2021) “The Treebank Semantics Parsed Corpus (TSPC)” https://entrees.github.io (accessed 26 December 2021).

Terms of use

This work is licensed under a Creative Commons Attribution 4.0 International License.
