Grammars and parsing

7.1. Grammars and parsing

The syntax of a language is specified by a grammar, which is a set of grammar rules of the form

Category1 --> Category2,Category3.
Category2 --> [Terminal].

Here, CategoryX denotes a syntactic category, specifying the type of a sentence part (e.g. noun, noun phrase, etc.). The first rule states that a Category2 followed by a Category3 is a Category1. For instance, the fact that a sentence may consist of a noun phrase followed by a verb phrase is expressed by the rule

sentence --> noun_phrase,verb_phrase.

A terminal is any word which occurs in the language. The second rule above assigns a syntactic category to a word. For instance:

noun --> [bicycle].

Syntactic categories are also called non-terminals.

A grammar which specifies a tiny bit of the English language is given below. As in clausal logic, grammar rules are separated by periods.

sentence             --> noun_phrase,verb_phrase.
noun_phrase          --> proper_noun.
noun_phrase          --> article,adjective,noun.
noun_phrase          --> article,noun.
verb_phrase          --> intransitive_verb.
verb_phrase          --> transitive_verb,noun_phrase.
article              --> [the].
adjective            --> [lazy].
adjective            --> [rapid].
proper_noun          --> [achilles].
noun                 --> [turtle].
intransitive_verb    --> [sleeps].
transitive_verb      --> [beats].
/** <examples> ?- phrase(sentence,[achilles,beats,the,lazy,turtle]). ?- phrase(sentence,L),reverse(L,L). */

Some sentences generated by this grammar are: ‘the lazy turtle sleeps’, ‘Achilles beats the turtle’, and ‘the rapid turtle beats Achilles’. The grammatical structure of these sentences can be described by a parse tree, which is a tree containing the words of the sentence as leaves, and the syntactic categories assigned to parts of the sentence as nodes (Figure 7.1).


Figure 7.1 Parse tree for the sentence ‘the rapid turtle beats Achilles’.

Exercise 7.1

Redraw this parse tree in the manner of an SLD proof tree, where ‘resolvents’ are partially parsed sentences such as


and ‘clauses’ are grammar rules.

Such a parse tree can be constructed by starting with the non-terminal sentence, and repeatedly replacing non-terminals by the right-hand side of an applicable rule, until the given sentence is obtained as a sequence of terminals. This method is called top-down parsing. Alternatively, we could start with the sentence and look for parts of it which occur on the right-hand side of a rule, and replace that part of the sentence with the non-terminal on the left-hand side of the rule, until we obtain the single non-terminal sentence. This procedure is called bottom-up parsing. It should be noted that both methods require search: at any stage, several rules might be applicable.

Exercise 7.2

Draw the search space generated by the above grammar for a top-down parse, if grammar rules are applied to sentences from left to right. Discuss the similarities and differences with SLD-trees.

In general, grammar rules are allowed to be recursive. For instance, a noun phrase can contain several adjectives, as described by the following rules:

noun_phrase          --> article,noun_phrase2.
noun_phrase2         --> noun.
noun_phrase2         --> adjective,noun_phrase2.
article              --> [the].
adjective            --> [lazy].
adjective            --> [rapid].
noun                 --> [turtle].
/** <examples> ?- phrase(noun_phrase,[the,lazy,rapid,turtle]). ?- phrase(noun_phrase,L). */

This set of rules allows ‘the lazy rapid turtle’ as a noun phrase. Recursion extends the descriptive power of a grammar considerably, by allowing repetitive structures.

Grammars like the ones we have seen are called context-free grammars. This name derives from the fact that only one non-terminal is allowed on the left of a grammar rule. A grammar rule which contains several non-terminals on its left-hand side is called context-sensitive: some of those non-terminals act as a context for the others, allowing the rule to be used only when that context is present. As an example, consider a grammar which would rule out sentences like ‘the turtles sleeps’, in which the ‘plurality’ (singular, plural) of noun and verb disagree. A candidate would be:

sentence                   --> noun_phrase,plurality,verb_phrase.
noun_phrase                --> article,noun.
plurality                  --> singular.
plurality                  --> plural.
verb_phrase                --> intransitive_verb.
article                    --> [the].
noun,singular              --> [turtle],singular.
noun,plural                --> [turtles],plural.
singular,intransitive_verb --> [sleeps].
plural,intransitive_verb   --> [sleep].

In this grammar, the non-terminal plurality creates a context for the applicability of the rewrite rules for noun and intransitive verb. Procedural programming languages like Pascal are also, to some extent, context-sensitive: statements like X:=10 can only be parsed in the context created by the declaration of the variable X (if it is declared to be a Boolean, the statement is illegal). Apart from this, such programming languages are context-free: each statement can be parsed without referring to its context.

Context-sensitive grammars greatly increase the complexity of the parsing task; moreover, the grammatical structure of sentences cannot be simply described by a parse tree. In this chapter, we will restrict attention to context-free grammars, extended with some Prolog-specific features. The resulting grammars are called Definite Clause Grammars, and will be introduced in the next section.