Chapter 1 Grammar and Parsing

1.1 Introduction

The term GRAMMAR refers to the system (mechanism or otherwise) for building language expressions. When something is built, there is a natural impulse to take it apart. Taking apart language is known as PARSING. With care, we can put dismantled parts back together, and while doing so we can also add trails of instructions known as annotations. Such enhancements to language data act as handles to access the structural analysis revealed by parsing. This book introduces the annotation approach of the Treebank Semantics Parsed Corpus (TSPC). The TSPC is a corpus of English for general use, with hand-worked analysis of approaching half a million words.

1.2 A hierarchy of layers

With parsing being the activity of taking apart language expressions, the aim is to reach smaller and smaller layers of grammatical analysis. To start, let's consider the layers of structure that are met when following possible paths through Figure 1.1, from the highest sentence layer through to the lowest word layers, via clause and possibly phrase layers. An arrow X ⟶ Y is interpreted as ‘X immediately contains Y’. Dashed arrows indicate optionality.

Figure 1.1: Paths through layers of grammatical analysis

As an internal layer in the hierarchy, a CLAUSE is reached from the outermost sentence layer, and subsequently a PHRASE can be reached from a clause layer. In addition, clauses can occur within clauses, phrases can occur within phrases, and clauses can occur within phrases.

A SENTENCE is at the top of the hierarchy, so it is the largest unit considered. (A discourse grammar would look above the sentence layer.) A WORD is at the other end of the hierarchy. (It is possible to go inside words to consider morphology, that is, the components from which words are made, which would be lower in the hierarchy of structure.) Chapter 2 will detail the range of possible words, but for considering the hierarchy, it is helpful to start from an outlook where words are of two kinds:

+VERB words (and so immediately contained within a clause layer)
-VERB words (non-verb words)

1.2.1 Identifying sentence and word layers

Sentence and word layers are fairly easily identified by conventions of the English writing system. Sentences are delimited by an initial capital letter and a final full-stop (or question-mark or exclamation-mark). Words are written as single character strings that are delimited by a space on each side (or by a punctuation mark other than a hyphen or apostrophe). But these conventions are not always followed. An important exception is when a sequence of words behaves syntactically in an idiomatic way (e.g. ‘as long as’, ‘with regard to’, ‘by the way’). Such word sequences will be analysed as if they were single words, with the components joined by underscore characters into a single character string, as in (1.1).

(1.1): as_long_as
with_regard_to
by_the_way
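
The joining convention of (1.1) can be sketched in Python. This is only an illustration: the function name `join_multiword` and the three-entry list are assumptions made here, not part of the TSPC tooling, and a purely string-based merge cannot tell idiomatic uses from literal ones (e.g. ‘by the way’ naming an actual route), which is a matter for the hand-worked analysis.

```python
# Hypothetical list of idiomatic sequences; the entries are the examples of (1.1).
MULTIWORD = [("as", "long", "as"), ("with", "regard", "to"), ("by", "the", "way")]


def join_multiword(tokens, multiword=MULTIWORD):
    """Merge listed word sequences into single underscore-joined tokens.

    Scans left to right and prefers the longest listed sequence at each position.
    Matching ignores case, but the original spelling is kept in the output.
    """
    candidates = sorted(multiword, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for seq in candidates:
            n = len(seq)
            if tuple(t.lower() for t in tokens[i:i + n]) == seq:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No listed sequence starts here: keep the word as it is.
            out.append(tokens[i])
            i += 1
    return out


print(join_multiword("You can stay as long as you like".split()))
# ['You', 'can', 'stay', 'as_long_as', 'you', 'like']
```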

To allow for integration into the parse analysis, punctuation points, quotation marks, and brackets (‘.’ ‘?’ ‘!’ ‘:’ ‘;’ ‘,’ ‘-’ ‘(’ ‘)’ etc.) are treated as words; see section 2.10 for details.

1.2.2 A simple sentence

A simple sentence consists of one clause, so Figure 1.1 paths will start: SENTENCE,CLAUSE. This initial clause layer is known as the MATRIX clause, which is symbolised most generally with IP-MAT. IP is an abbreviation for Inflectional Phrase, which is another way to say ‘clause’. A clause contains minimally a verb. Consider the simple sentence (1.2), which consists of a single verb contained within a single clause.

(1.2): Look!

This can be given the paths of Figure 1.2 from sentence layer to words, with the punctuation treated as a word.

Figure 1.2: Paths through ‘Look!’

The parse information of Figure 1.2 can be represented with bracketed notation as in (1.3), which follows CorpusSearch format (Randall 2009).

(1.3): ( (IP-MAT (+VERB Look) (-VERB !))
(ID 101_a_saint_exupery_1943))

With CorpusSearch format, every tree has a “wrapper”. A wrapper is a pair of unlabelled parentheses surrounding the tree content together with an ID node. Furthermore, for the TSPC, an ID node contains a character string that begins with a number corresponding to the number of the tree in its corpus file. This tree number is followed by an underscore (‘_’) and then by the name of the corpus file (minus any extension). Other character string material may follow the filename, provided there is an intervening semicolon (‘;’).
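
The make-up of an ID string just described can be unpacked with a few lines of Python. This is a minimal sketch under the stated description only: the names `TreeId` and `parse_id` are invented for illustration, and the pattern assumes that the filename itself contains no semicolon.

```python
import re
from typing import NamedTuple, Optional


class TreeId(NamedTuple):
    number: int            # position of the tree in its corpus file
    filename: str          # corpus file name, minus any extension
    extra: Optional[str]   # material after a semicolon, if present


# digits, an underscore, the filename, then optionally ';' and further material
_ID_PATTERN = re.compile(r"(\d+)_([^;]+)(?:;(.*))?")


def parse_id(text: str) -> TreeId:
    """Split an ID string into tree number, filename, and optional extra material."""
    match = _ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an ID of the expected shape: {text!r}")
    number, filename, extra = match.groups()
    return TreeId(int(number), filename, extra)


print(parse_id("101_a_saint_exupery_1943"))
# TreeId(number=101, filename='a_saint_exupery_1943', extra=None)
```

Only the first underscore is taken as the separator, so underscores inside the filename (as in a_saint_exupery_1943) are left alone.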

Another way to present the parse information from sentence layer to words is with (1.4).

(1.4): IP-MAT,+VERB,Look
IP-MAT,-VERB,!
ID,101_a_saint_exupery_1943

With (1.4), each word of the parse comes at the end of its own line, with each line presenting path information from the root (IP-MAT) layer through to a word layer. The last line is the ID node for the parse.
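
To make the relation between the bracketed notation of (1.3) and the path lines of (1.4) concrete, here is a small Python sketch that reads a bracketed tree and prints one path per word. It is not the CorpusSearch tool; `read_tree` and `paths` are names chosen here for illustration, and the reader assumes that words never contain bare parentheses.

```python
import re


def read_tree(text):
    """Read one bracketed tree into nested lists.

    A labelled node becomes [label, child, ...]; words stay as plain strings.
    The unlabelled wrapper becomes a list whose items are all lists.
    """
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected an opening parenthesis")
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # step over the closing parenthesis
        return items

    return node()


def paths(node, prefix=()):
    """Yield a comma-joined path from the root layer to each word."""
    if isinstance(node[0], list):
        # Unlabelled wrapper: descend without adding anything to the path.
        for child in node:
            yield from paths(child, prefix)
        return
    label, *children = node
    for child in children:
        if isinstance(child, str):
            yield ",".join(prefix + (label, child))
        else:
            yield from paths(child, prefix + (label,))


tree = read_tree("( (IP-MAT (+VERB Look) (-VERB !)) (ID 101_a_saint_exupery_1943))")
for line in paths(tree):
    print(line)
# IP-MAT,+VERB,Look
# IP-MAT,-VERB,!
# ID,101_a_saint_exupery_1943
```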

1.2.3 A complex sentence

Consider (1.5) as an example of the interplay between internal clause and phrase layers.

(1.5): He wanted to show what happened to anyone who tried to start trouble.

Analysis of (1.5) leads to the bracketed structure of (1.6).

(1.6): ( (IP-MAT
    (PHRASE (-VERB He))
    (+VERB wanted)
    (CLAUSE (-VERB to) (+VERB show)
      (CLAUSE
        (PHRASE (-VERB what))
        (+VERB happened)
        (PHRASE (-VERB to)
          (PHRASE (-VERB anyone)
            (CLAUSE
              (PHRASE (-VERB who))
              (+VERB tried)
              (CLAUSE (-VERB to) (+VERB start)
                (PHRASE (-VERB trouble))))))))
    (-VERB .))
  (ID 154_susanne_n12))

When viewed in terms of paths from sentence layer to words, the analysis seen with (1.6) becomes (1.7).

(1.7): IP-MAT,PHRASE,-VERB,He
IP-MAT,+VERB,wanted
IP-MAT,CLAUSE,-VERB,to
IP-MAT,CLAUSE,+VERB,show
IP-MAT,CLAUSE,CLAUSE,PHRASE,-VERB,what
IP-MAT,CLAUSE,CLAUSE,+VERB,happened
IP-MAT,CLAUSE,CLAUSE,PHRASE,-VERB,to
IP-MAT,CLAUSE,CLAUSE,PHRASE,PHRASE,-VERB,anyone
IP-MAT,CLAUSE,CLAUSE,PHRASE,PHRASE,CLAUSE,PHRASE,-VERB,who
IP-MAT,CLAUSE,CLAUSE,PHRASE,PHRASE,CLAUSE,+VERB,tried
IP-MAT,CLAUSE,CLAUSE,PHRASE,PHRASE,CLAUSE,CLAUSE,-VERB,to
IP-MAT,CLAUSE,CLAUSE,PHRASE,PHRASE,CLAUSE,CLAUSE,+VERB,start
IP-MAT,CLAUSE,CLAUSE,PHRASE,PHRASE,CLAUSE,CLAUSE,PHRASE,-VERB,trouble
IP-MAT,-VERB,.
ID,154_susanne_n12

Notably, the paths of (1.7) conform to Figure 1.1 above. They start with a sentence and clause layer (IP-MAT), and then can have clauses within clauses, phrases within clauses, phrases within phrases, clauses within phrases, verbs with occasional non-verb words within clauses, and non-verb words within phrases.

1.3 Outlook

Each of the units of structure that label the brackets of (1.3) and (1.6) above, and paths of (1.4) and (1.7) above, is either a clause or a phrase. It is typical to refer to such units as CONSTITUENTS.

We have so far categorised words into being either:

+VERB words (e.g. look, wanted, start)
-VERB words (e.g. he, to, town)

Because of this categorisation, we have already been able to distinguish clause constituents from phrase constituents:

an immediately contained +VERB word gives a clause layer
without an immediately contained +VERB word, a non-word unit is a phrase
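
The two-way test above is simple enough to state as code. The following Python sketch assumes a nested-list representation of trees, with a word written as a tag plus a string; the representation and the names `is_word` and `classify` are illustrative assumptions, not part of the annotation scheme. Labels of non-word units are left as "?" to show that the test recovers them from the words alone.

```python
def is_word(node):
    """A word node is a tag followed by a single string, e.g. ["+VERB", "tried"]."""
    return len(node) == 2 and isinstance(node[1], str)


def classify(node):
    """Return "word", "clause", or "phrase" for a node.

    A non-word unit is a clause if it immediately contains a +VERB word;
    otherwise it is a phrase. Verbs buried deeper inside do not count.
    """
    if is_word(node):
        return "word"
    has_verb = any(is_word(child) and child[0] == "+VERB" for child in node[1:])
    return "clause" if has_verb else "phrase"


# ‘who tried to start trouble’ from (1.6), with the unit labels left open
relative = ["?",
            ["?", ["-VERB", "who"]],
            ["+VERB", "tried"],
            ["?", ["-VERB", "to"], ["+VERB", "start"],
             ["?", ["-VERB", "trouble"]]]]

print(classify(relative))        # clause
print(classify(relative[1]))     # phrase
print(classify(relative[3]))     # clause
print(classify(relative[3][3]))  # phrase
```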

We will want to further distinguish phrases into being of various kinds based on the words they contain. To this end, we can divide -VERB words into classes such as:

noun (N)
adjective (ADJ)
adverb (ADV)

With such divisions, phrases can be sub-divided into:

noun phrases (NP) that minimally contain a noun
adjective phrases (ADJP) that minimally contain an adjective
adverb phrases (ADVP) that minimally contain an adverb

Establishing the range of word classes is the aim of chapter 2. Looking at phrase classes in detail is pursued in chapter 3. A big part of chapters 2 and 3 will be to raise and answer questions of FORM, that is, how words and constituents look — their shape or appearance, and what their internal structure happens to be. But form is not the full story required for parsing: In order to gain motivation for having structure, we will often need to be equally concerned to explore the FUNCTION that words and constituents perform in the larger structures which contain them.

The dual concern for form and function in chapters 2 and 3 will be carried over to an exploration of clause constituents in chapter 4. Chapter 5 details options available for realising complex sentences with clause-within-clause subordination.

The parsing practice developed over chapters 3, 4, and 5 will involve specifying form and function for the constituents named PHRASE and CLAUSE above. IP-MAT as a constituent name may also need further specification. For example, the annotation of (1.3)/(1.4) above will be further refined so that IP-IMP replaces IP-MAT to identify sentence (1.2) as an imperative.

Chapters 6, 7, and 8 integrate verb codes that signal the clause requirements of verbs.

Chapter 9 extends coverage to constructions with displacements where full analysis can require indexing. Chapter 10 discusses exceptional scope, control and anaphora. Chapter 11 clears up loose ends.

1.3.1 Notational conventions

Table 1.1 details useful notational conventions used in this book.

Notation	Unit	Example
italics	word-form	the word look
SMALL CAPITALS	lexeme	the verb lexeme LOOK (look, looks, looked, looking)
{...}	morpheme	the suffix {s}
`TELETYPE`	annotation	the matrix clause tag label `IP-MAT`

Table 1.1: Notational conventions

1.4 About the annotation approach

This book's annotation approach builds on alternative annotation schemes for English. A central goal has been to consolidate strengths of existing methods in a single annotation scheme with a high degree of normalised structure. Ideas were taken from the following annotation schemes:

The SUSANNE Corpus and Analytic Scheme (Sampson 1995)
The ICE Parsing Scheme (Nelson, Wallis and Aarts 2002)
The Penn Treebank Scheme (Marcus, Santorini and Marcinkiewicz 1993)
The Penn Historical Parsed Corpora Scheme (Santorini 2016)

From the SUSANNE scheme, there is adoption of form and function information, such that the approach taken in this book can be linked most closely to the SUSANNE scheme. Moreover, a large percentage (nearly three quarters) of the Treebank Semantics Parsed Corpus (TSPC) consists of data converted from annotation that followed the SUSANNE scheme.

Regarding form and function information, the SUSANNE scheme is closely related to the English grammars of Quirk et al. (1972, 1985).

The ICE Parsing Scheme similarly follows the Quirk et al. grammars. In addition, ICE is notable for its rich range of features, and care has been taken to ensure that the current annotation allows many of these features to be derived automatically.

The Penn Historical Corpora scheme, which itself draws on the bracketed approach of the Penn Treebank scheme, has informed the ‘look’ of the annotation. This includes:

the tag labelling: CP and IP for clause layers; ADJP, ADVP, NP and PP for phrase layers
that label extensions are added to CP, IP, ADJP, ADVP, NP and PP to indicate functions
the presentation of conjunction structure with CONJP layers

However, it should be noted that, labels aside, the tag set of the current scheme is most compatible with the SUSANNE scheme, especially with regard to function marking. Moreover, word class tags closely overlap with the Lancaster word class tagging systems, especially the UCREL CLAWS5 tag set used for the British National Corpus (BNC Consortium 2005).

The annotation also contains plenty that is innovative. For example, normalised structure is achieved with internal layers at:

clause levels (ILYR)
noun phrase levels (NLYR)
adjective phrase levels (AJLYR)
adverb phrase levels (AVLYR)

These internal layers are required when their corresponding levels of structure include CONJP layers.

Another area of innovation is the verb code integration of chapters 6, 7 and 8. The codes of chapter 6 classify catenative verbs; cf. Huddleston and Pullum (2002, p.1220). These are verbs of a verb sequence that come before the main verb of a clause, and structure internal to the clause annotation (IP-PPL-CAT) further supports this type of verb. Additional codes of chapter 6 with very particular distributions are included from Hornby (1975). The verb codes for main verbs in chapter 7 are from the mnemonic system of the fourth edition of the Oxford Advanced Learner's Dictionary (OALD4; Cowie 1989).

The most innovative aspect of the annotation gives the TSPC its name: It can be fed to the Treebank Semantics evaluation system (Butler 2021). Treebank Semantics processes constituency tree annotations and returns logic-based meaning representations. While the annotation seldom includes indexing, results calculated with Treebank Semantics resolve both inter- and intra-clause dependencies, including cross-sentential anaphoric dependencies. It is sometimes especially helpful to see meaning representations from Treebank Semantics as dependency graphs (e.g., from the on-line corpus interface) to make visually apparent the connections that the design of the annotation captures.