.. -*- mode: rst -*-
.. include:: ../definitions.txt

=======================
7. Grammars and Parsing
=======================

------------
Introduction
------------

Early experiences with the kind of grammar taught in school are often
confusing.  Written work may be graded by a teacher who red-lines all
the grammar errors they won't put up with.  Like the dangling
preposition in the last sentence, or sentences like this one which
lack a main verb.  Learning English as a second language, it may be
difficult to discover which of these errors need to be fixed
(or *needs* to be fixed?).
Correct punctuation is something of an obsession for many writers and
editors (as our own students have observed).  Of course, it is all in
the name of effective communication.  In the following example, the
interpretation of a relative clause as restrictive or non-restrictive
depends on the presence of commas alone:

.. ex::
  .. _rest:
  .. ex:: The presidential candidate, who was extremely popular,
          smiled broadly.
  .. _nonrest:
  .. ex:: The presidential candidate who was extremely popular smiled broadly.

In rest_, we assume there is just one presidential candidate, and say
two things about her: that she was popular and that she smiled. In
nonrest_, on the other hand, we use the description `who was extremely
popular`:lx: as a means of identifying for the hearer which of several
candidates we are referring to. 

It is clear that some of these rules are important.  Others seem to be
vestiges of antiquated style that preoccupies only the most crusty
curmudgeons.  As an example, consider the injunction that
:lx:`however` |mdash| when used to mean *nevertheless* |mdash| must not appear
at the start of a sentence.  Pullum argues that Strunk and White were
merely insisting that English usage should conform to "an utterly
unimportant minor statistical detail of style concerning adverb
placement in the literature they knew" [languagelog.org].

When reading, we sometimes find that we have to stop and re-read a
sentence in order to resolve an ambiguity.  Curiously, it seems possible
to combine *unambiguous* words to create ambiguous sentences:

.. ex::
  .. ex:: Fighting animals could be dangerous.
  .. ex:: Visiting relatives can be tiresome.

Perhaps another kind of syntactic variation, word order, is easier to
understand.  We know that the two sentences `Kim likes Sandy`:lx: and
`Sandy likes Kim`:lx: have different meanings, and that `likes Sandy
Kim`:lx: is simply ungrammatical.  Similarly, we know that the
following two sentences are equivalent:

.. ex::
  .. ex:: The farmer *loaded* the cart with sand
  .. ex:: The farmer *loaded* sand into the cart

However, consider the semantically similar verbs `filled`:lx: and `dumped`:lx:.
Now the word order cannot be altered (ungrammatical sentences are
prefixed with an asterisk.)

.. ex::
  .. ex:: The farmer *filled* the cart with sand
  .. ex:: \*The farmer *filled* sand into the cart
  .. ex:: \*The farmer *dumped* the cart with sand
  .. ex:: The farmer *dumped* sand into the cart

A further curious fact is that we are able to access the meaning of
sentences we have not encountered.  It is not difficult to concoct an
entirely novel sentence, one that has probably never been used before
in the history of the language, and yet all speakers of the language
will agree about its meaning.  In fact, the set of possible sentences is
infinite, given that there is no upper bound on length.
Consider the following passage from a children's story, containing a
rather impressive sentence:

   You can imagine Piglet's joy when at last the ship came in sight of
   him. In after-years he liked to think that he had been in Very
   Great Danger during the Terrible Flood, but the only danger he had
   really been in was the last half-hour of his imprisonment, when
   Owl, who had just flown up, sat on a branch of his tree to comfort
   him, and told him a very long story about an aunt who had once laid
   a seagull's egg by mistake, and the story went on and on, rather
   like this sentence, until Piglet who was listening out of his
   window without much hope, went to sleep quietly and naturally,
   slipping slowly out of the window towards the water until he was
   only hanging on by his toes, at which moment, luckily, a sudden
   loud squawk from Owl, which was really part of the story, being
   what his aunt said, woke the Piglet up and just gave him time to
   jerk himself back into safety and say, "How interesting, and did
   she?"  when -- well, you can imagine his joy when at last he saw
   the good ship, Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear)
   coming over the sea to rescue him...  (from A.A. Milne *In which
   Piglet is Entirely Surrounded by Water*)

Our ability to produce and understand entirely new sentences, of
arbitrary length, demonstrates that the set of well-formed sentences in
English is infinite.  The same case can be made for any human language.

This chapter presents grammars and parsing, as the formal and
computational methods for investigating and modelling the linguistic
phenomena we have been touching on (or tripping over).
As we shall see, patterns of well-formedness and ill-formedness in a
sequence of words can be understood with respect to the underlying
:dt:`phrase structure` of the sentences.  We can develop formal
models of these structures using grammars and parsers.
As before, the motivation is natural language *understanding*.  How
much more of the meaning of a text can we access when we can reliably
recognize the linguistic structures it contains?  Having read in a
text, can a program 'understand' it enough to be able to answer simple
questions about "what happened?" or "who did what to whom?"  Also as
before, we will develop simple programs to process annotated corpora
and perform useful tasks.

-------------------------
What's the Use of Syntax?
-------------------------

Earlier chapters focussed on words: how to identify them, how to
analyse their morphology, and how to assign them to classes via
part-of-speech tags. We have also seen how to identify recurring
sequences of words (i.e. n-grams). Nevertheless, there seem to be
linguistic regularities which cannot be described simply in terms of
n-grams. In this section we will see why it is useful to have some
kind of syntactic representation of sentences.  In particular, we will
see that there are systematic aspects of meaning which are much easier
to capture once we have established a level of syntactic structure.


Syntactic Ambiguity
-------------------

We have seen that sentences can be ambiguous.  If we overheard someone
say :lx:`I went to the bank`, we wouldn't know whether it was
a river bank or a financial institution.  This ambiguity concerns
the meaning of the word :lx:`bank`, and is a kind of :dt:`lexical
ambiguity`.

However, other kinds of ambiguity cannot be explained in terms of
ambiguity of specific words. We can construct simple examples of
syntactic ambiguity involving coordinating conjunctions like `and`:lx:
and `or`:lx:. Consider the following sentence:

.. ex:: Kim left or Dana arrived and everyone cheered.

It should be obvious to you that there are two distinct
interpretations of this sentence. How should we account for the
difference? If you are familiar with propositional logic, you will not
be surprised at the idea of using brackets to represent semantic
structure:

.. ex::
  .. ex::  Kim arrived or (Dana left and everyone cheered)
  .. ex::  (Kim arrived or Dana left) and everyone cheered

We can describe this ambiguity in terms of the semantic `scope`:dt: of
`or`:lx: and `and`:lx:\ : in the
reading represented by (6a), the operator `or`:lx:  takes
the conjoined sentence `Dana arrived and everyone cheered`:lx: as one
of its arguments, and therefore is said to have wider scope than
`and`:lx:. Conversely, in (6b), the operator `and`:lx: has wider scope than
`or`:lx:.
One convenient way of representing this scope difference at a structural
level is by means of a `tree diagram`:dt:. 

.. ex::
  .. ex::
    .. tree:: (S <S Kim arrived>
                 (conj or)
                 (S <S Dana left> (conj and) <S everyone cheered>))
  .. ex::
    .. tree:: (S (S <S Kim arrived> (conj or) <S Dana left>)
                 (conj and)
                 <S everyone cheered>)

Note that linguistic trees grow upside down: the node labeled `S`:gc:
is the `root`:dt: of the tree, while the `leaves`:dt: of the tree are
labeled with the words.

In NLTK-Lite, you can easily produce trees like this yourself with the
following commands:
 
.. doctest-ignore::
    >>> from nltk_lite.parse import bracket_parse
    >>> sent = '(S (S Kim arrived) (conj or) (S Dana left))'
    >>> tree = bracket_parse(sent)
    >>> tree.draw()

Conveniently, the resulting tree object supports Python's standard
array operations for accessing its children:

    >>> tree[0]
    (S: 'Kim' 'arrived')

A second example of scope ambiguity involves adjectives:
`old men and women`:lx:.
Does `old`:lx: have wider scope than `and`:lx:, or is it the other way
round? In fact, both interpretations are possible.

For our third illustration of ambiguity, we look at
prepositional phrases.
Consider a sentence like: :lx:`I saw the man with a telescope`.  Who
has the telescope?  To clarify what is going on here, consider the
following pair of sentences:

.. ex::
  .. ex:: The policeman saw a burglar *with a gun*.
         (not some other burglar)
  .. ex:: The policeman saw a burglar *with a telescope*.
         (not with his naked eye)

In both cases, there is a prepositional phrase introduced by
:lx:`with`.  In the first case this phrase modifies the noun
:lx:`burglar`, and in the second case it modifies the verb :lx:`saw`.
We could again think of this in terms of scope: does the prepositional
phrase (`PP`:gc:) just have scope over the `NP`:gc:
`a burglar`:lx:, or does it have scope over
the whole verb phrase? Again, we can represent the difference in terms
of tree structure:

.. ex::
  .. ex::
    .. tree:: (S <NP the policeman>
                 (VP (V saw)
                     (NP <NP the burglar>
                         <PP with a gun>)))
  .. ex::
    .. tree:: (S <NP the policeman>
                 (VP (V saw)
                     <NP the burglar>
                     <PP with a telescope>))

We can generate these trees in Python as follows:

    >>> s1 = '(S (NP the policeman) (VP (V saw) (NP (NP the burglar) (PP with a gun))))'
    >>> s2 = '(S (NP the policeman) (VP (V saw) (NP the burglar) (PP with a telescope)))'
    >>> tree1 = bracket_parse(s1)
    >>> tree2 = bracket_parse(s2)
   
We can see that they are trees over the same sequence of words (that
is, the two trees have the same `leaves`:dt:):

    >>> tree1.leaves() == tree2.leaves
    True

On the other hand, they have different `heights`:dt: (given by the
number of nodes in the longest branch of the tree, starting at `S`:gc:
and descending to the words):

    >>> tree1.height() == tree2.height()
    False

In general, how can we determine whether a prepositional phrase
modifies the preceding noun or verb? This problem is often described
with the label `PP attachment`:dt:.  The :dt:`Prepositional Phrase
Attachment Corpus`, included with NLTK-Lite as `ppattach`, makes it
possible for us to study this question systematically.  The corpus is
derived from the IBM-Lancaster Treebank of Computer Manuals and from
the Penn Treebank, and distills out only the essential information
about `PP`:gc: attachment. Consider the following sentence from the
WSJ:

.. ex:: Four of the five surviving workers have asbestos-related diseases,
      including three with recently diagnosed cancer.

The corresponding line in the  `ppattach` corpus is this:

.. ex::
   ::

      16 including three with cancer N

That is, it includes an identifier for the original sentence, the
head of the relevant verb phrase (i.e., `including`:lx:), the head of
the verb's `NP`:gc: object (`three`:lx:), the preposition
(`with`:lx:), and the head noun within the prepositional phrase
(`cancer`:lx:). Finally, it contains an 'attachment' feature (``N`` or
``V``) to indicate whether the prepositional phrase attaches to
(modifies) the noun phrase or the verb phrase. 
Here are some further examples:

.. ex::
   :: 

     47830 allow visits between families N
     47830 allow visits on peninsula V
     42457 acquired interest in firm N
     42457 acquired interest in 1986 V

The attachments in the above examples can also be made explicit by
using phrase groupings as follows:

.. ex::
   :: 

     allow (NP visits (PP between families))
     allow (NP visits) (PP on peninsula)
     acquired (NP interest (PP in firm))
     acquired (NP interest) (PP in 1986)

Observe in each case that the argument of the verb is either a single
complex expression ``(visits (between families))`` or a pair of
simpler expressions ``(visits) (on peninsula)``.
We can access this corpus from NLTK-Lite as follows:

    >>> from nltk_lite.corpora import ppattach, extract
    >>> from pprint import pprint
    >>> item = extract(16, ppattach.dictionary('devset'))
    >>> pprint(item)
    {'attachment': 'N',
     'noun1': 'three',
     'noun2': 'cancer',
     'prep': 'with',
     'sent': '16',
     'verb': 'including'}

If we go back to our first examples of `PP`:gc: attachment ambiguity,
it appears as though it is the `PP`:gc: itself (e.g., `with a gun`:lx:
versus `with a telescope`:lx:) that determines the attachment. However,
we can use this corpus to find examples where other factors come in to play.
The following program uses ``MinimalSet`` to find pairs of entries in the
corpus which have different attachments based on the *verb* only.

    >>> from nltk_lite.utilities import MinimalSet
    >>> ms = MinimalSet()
    >>> for entry in ppattach.dictionary('training'):
    ...     target  = entry['attachment']
    ...     context = (entry['noun1'], entry['prep'], entry['noun2'])
    ...     display = (target, entry['verb'])
    ...     ms.add(context, target, display)
    >>> for context in ms.contexts():
    ...     print context, ms.display_all(context)

Here is one of the pairs found by the program.

.. ex::
   :: 

      received (NP offer) (PP from group)
      rejected (NP offer (PP from group))

This finding gives us clues to a structural difference: the verb
:lx:`receive` usually comes with two following arguments; we receive
something *from* someone.  In contrast, the verb :lx:`reject` only
needs a single following argument; we can reject something without
needing to say where it originated from. We expect that if you look at
the data, you will be able to come up with further ideas about the
factors that influence `PP`:gc: attachment.

Constituency
------------

We claimed earlier that one of the motivations for building syntactic
structure was to help make explicit how a sentence says "who did what
to whom". Let's just focus for a while on the "who" part of this
story: in other words, how can syntax tell us what the subject of a
sentence is? At first, you might think this task is rather simple
|mdash| so simple indeed that we don't need to bother with syntax. In
a sentence such as

.. ex:: The fierce dog bit the man.

we know that it is the dog that is doing the biting. So we could
say that the noun phrase immediately preceding the verb is the
subject of the sentence. And we might try to make this more explicit
in terms of sequences part-of-speech tags.  Let's try to come up with a simple
definition of `noun phrase`:gc:; we might start off with something
like this:

.. ex::
   ::

      DT JJ* NN

We're using regular expression notation here in the form of
`JJ*`:gc: to indicate a sequence of zero or more `JJ`:gc:\s. So this
is intended to say that a noun phrase can consist of a
determiner, possibly followed by some adjectives, followed by a
noun. Then we can go on to say that if we can find a sequence of
tagged words like this that precedes a word tagged as a verb, then
we've identified the subject. But now think about this sentence:

.. ex:: The child with a fierce dog bit the man.

This time, it's the child that is doing the biting. But the tag
sequence preceding the verb is:

.. ex::
   ::

      DT NN IN DT JJ NN

Our previous attempt at identifying the subject would have
incorrectly come up
with `the fierce dog`:lx: as the subject.

So our next hypothesis would have to be a bit more complex. For
example, we might say that the subject can be identified as any string
matching the following pattern before the verb:

.. ex::
   ::
 
      DT JJ* NN (IN DT JJ* NN)*

In other words, we need to find a noun phrase followed by zero or more
sequences consisting of a preposition followed by a noun phrase. Now
there are two unpleasant aspects to this proposed solution. The first
is aesthetic: we are forced into repeating the sequence of tags (`DT
JJ* NN`:gc:) that constituted our initial notion of noun phrase, and
our initial notion was in any case a drastic simplification. More
worrying, this approach still doesn't work! For consider the following
example:

.. _seagull:
.. ex:: The seagull that attacked the child with the fierce dog bit the man.

This time the seagull is the culprit, but it won't be detected as subject by our
attempt to match sequences of tags. So it seems that we need a
richer account of how words are *grouped* together into patterns, and
a way of referring to these groupings at different points in the
sentence structure. This idea of grouping is often called
syntactic `constituency`:dt:.

As we have just seen, a well-formed sentence of a language is more
than an arbitrary sequence of words from the language.  Certain kinds
of words usually go together.  For instance, determiners like `the`:lx:
are typically followed by adjectives or nouns, but not by verbs.
Groups of words form intermediate structures called phrases or
:dt:`constituents`.  These constituents can be identified using
standard syntactic tests, such as substitution, movement and
coordination.  For example, if a sequence of words can be replaced
with a pronoun, then that sequence is likely to be a constituent.
According to this test, we can infer that the italicised string in the
following example is a constituent, since it can be replaced by
`they`:lx:\:

.. ex::
  .. ex:: *Ordinary daily multivitamin and mineral supplements* could 
         help adults with diabetes fight off some minor infections.
  .. ex:: *They* could help adults with diabetes fight off some minor
         infections.


In order to identify whether a phrase is the subject of a sentence, we
can use the construction called `Subject-Auxiliary Inversion`:dt: in
English. This construction allows us to form so-called Yes-No
Questions. That is, corresponding to the statement in have1_, we have
the question in have2_:

.. ex::
  .. _have1:
  .. ex:: All the cakes have been eaten.
  .. _have2:
  .. ex:: Have *all the cakes* been eaten?

Roughly speaking, if a sentence already contains an auxiliary verb,
such as `has`:lx: in have1_, then we can turn it into a Yes-No
Question by moving the auxiliary verb 'over' the subject noun phrase
to the front of the sentence. If there is no auxiliary in the
statement, then we insert the appropriate form of `do`:lx: as the
fronted auxiliary and replace the tensed main verb by its base form:

.. ex::
  .. ex:: The fierce dog bit the man.
  .. ex:: Did *the fierce dog* bite the man?

As we would hope, this test also confirms our earlier claim about the
subject constituent of seagull_:

.. ex:: Did *the seagull that attacked the child with the fierce dog* bite
       the man?

To sum up then, we have seen that the notion of constituent brings a
number of benefits. By having a constituent labeled `noun phrase`:gc:,
we can provide a unified statement of the classes of word that
constitute that phrase, and reuse this statement in describing noun
phrases wherever they occur in the sentence. Second, we can use the
notion of a noun phrase in defining the subject of sentence, which in
turn is a crucial ingredient in determining the "who does what to
whom" aspect of meaning.


More on NLTK's Trees
--------------------

We have been discussing structural differences between sentences, and
we have been probing these structures by substituting words and
phrases.  We have informally shown how sentence structures can be
represented using syntactic trees, and we will now look at these
structures in a bit more detail.

Consider the following example:

.. ex::
  .. tree:: (S (NP Lee) (VP (V saw) (NP the dog)))

**Terminology:**
A tree is a set of connected nodes, each of which is labeled with a
category.  It common to use a 'family' metaphor to talk about the
relationships of nodes in a tree: for example, `S`:gc: is the
`parent`:dt: of `VP`:gc:; conversely `VP`:gc: is a `daughter`:dt: (or
`child`:dt:) of `S`:gc:.  Also, since `NP`:gc: and `VP`:gc: are both
daughters of `S`:gc:, they are also `sisters`:dt:. 

Each production in a CFG corresponds to a tree of depth one; we call
these `local`:dt:` trees.


Although it is helpful to represent trees in a graphical format, for
computational purposes we usually need a more text-oriented
representation. One standard method is to use a combination of bracket
and labels to indicate the structure, as shown here:

.. doctest-ignore::
      (S 
         (NP  'Lee')
         (VP 
            (V 'saw')
            (NP 
               (Det 'the')
               (N  'dog'))))

The conventions for displaying trees in NLTK are similar:

.. doctest-ignore::
      (S: (NP: 'Lee') (VP: (V: 'saw') (NP: 'the' 'dog')))

In such trees, the node value is a string containing the tree's
constituent type (e.g., `NP`:gc: or `VP`:gc:), while the children encode
the hierarchical contents of the tree [#]_.

.. [#] Although the ``Tree`` class is usually used for encoding
   syntax trees, it can be used to encode *any* homogeneous hierarchical
   structure that spans a text (such as morphological structure or
   discourse structure).  In the general case, leaves and node values do
   not have to be strings.

Trees are created with the ``Tree`` constructor, which takes a
node value and a list of zero or more children.  Here's an example of
a simple NLTK-Lite tree with a single child node, where the latter is
a token:

    >>> from nltk_lite.parse.tree import Tree
    >>> tree1 = Tree('NP', ['John'])
    >>> tree1
    (NP: 'John')

Here is an example with two children:

    >>> tree2 = Tree('NP', ['the', 'man'])
    >>> tree2
    (NP: 'the' 'man')

Finally, here is a more complex example, where one of the
children is itself a tree:

    >>> tree3 = Tree('VP', ['saw', tree2])
    >>> tree3
    (VP: 'saw' (NP: 'the' 'man'))

A tree's root node value is accessed with the ``node`` property, and
its leaves are accessed with the ``leaves()`` method:

    >>> tree3.node
    'VP'
    >>> tree3.leaves()
    ['saw', 'the', 'man']

One common way of defining the subject of a sentence `S`:gc: in
English is as *the noun phrase that is the daughter of* `S`:gc: *and
the sister of* `VP`:gc:. Although we cannot access subjects directly,
in practice we can get something similar by using `tree
positions`:dt:. Consider ``tree4`` defined as follows:

    >>> tree4 = Tree('S', [tree1, tree3])
    >>> tree4
    (S: (NP: 'John') (VP: 'saw' (NP: 'the' 'man')))

Now we can just use indexing to access the subtrees of this tree:

    >>> tree4[0]
    (NP: 'John')
    >>> tree4[1]
    (VP: 'saw' (NP: 'the' 'man'))

Since the value of ``tree4[1]`` is itself a tree, we can index into that as well:

    >>> tree4[1][0]
    'saw'
    >>> tree4[1][1]
    (NP: 'the' 'man')

The printed representation for complex trees can be difficult to read.
In these cases, the ``draw`` method can be very useful. 

.. doctest-ignore::
    >>> tree3.draw()

This method opens a new window, containing a graphical representation
of the tree:

.. image:: ../images/parse_draw.png
   :scale: 70

The tree display window allows you to zoom in and out;
to collapse and expand subtrees; and to print the graphical
representation to a postscript file (for inclusion in a document).

To compare multiple trees in a single window, we can use the
``draw_trees()`` method:

.. doctest-ignore::
    >>> from nltk_lite.draw.tree import draw_trees
    >>> draw_trees(tree1, tree2, tree3)

The ``Tree`` class implements a number of other useful methods.  See
the ``Tree`` reference documentation for more information about these
methods.

The ``nltk_lite.corpora`` module defines the ``treebank`` corpus,
which contains a collection of hand-annotated parse trees for English
text, derived from the Penn Treebank.

    >>> from nltk_lite.corpora import treebank, extract
    >>> print extract(0, treebank.parsed())
    (S:
      (NP-SBJ:
        (NP: (NNP: 'Pierre') (NNP: 'Vinken'))
        (,: ',')
        (ADJP: (NP: (CD: '61') (NNS: 'years')) (JJ: 'old'))
        (,: ','))
      (VP:
        (MD: 'will')
        (VP:
          (VB: 'join')
          (NP: (DT: 'the') (NN: 'board'))
          (PP-CLR:
            (IN: 'as')
            (NP: (DT: 'a') (JJ: 'nonexecutive') (NN: 'director')))
          (NP-TMP: (NNP: 'Nov.') (CD: '29'))))
      (.: '.'))


Exercises
---------

#. a) Write code to produce two trees, one for each reading of the phrase
      `old men and women`:lx:

   #) Encode any of the trees presented in this chapter as a labeled
      bracketing and use the ``nltk_lite.parse`` module's
      ``bracket_parse()`` method to check that it is well-formed. Now
      use the ``draw()`` to display the tree.

   #) As in (a) above, draw a tree for `The woman saw a man last Thursday`:lx:.

#. Using tree positions, list the subjects of the first 100
   sentences in the Penn treebank; to make the results easier to view,
   limit the extracted subjects to subtrees whose height is 2.

--------------------
Context Free Grammar
--------------------

As we have seen, languages are infinite |mdash| there is no principled
upper-bound on the length of a sentence.  Nevertheless, we would like
to write programs that can process well-formed sentences.  It turns
out that we can characterize what we mean by well-formedness using a
grammar.  The way that finite grammars are able to describe an
infinite set uses `recursion`:dt:.  (We already came across this idea
when we looked at regular expressions: the finite expression ``a+`` is
able to describe the infinite set ``{a, aa, aaa, aaaa, ...}``).  Apart
from their compactness, grammars usually capture important structural
and distributional properties of the language, and can be used to map
between sequences of words and abstract representations of meaning.
Even if we were to impose an upper bound on sentence length to ensure
the language was finite, we would probably still want to come up with
a compact representation in the form of a grammar.

A `grammar`:dt: is a formal system which specifies which sequences of
words are well-formed in the language, and which provides one or more
phrase structures for well-formed sequences.  We will be looking at
:dt:`context-free grammar` (CFG), which is a collection of
`productions`:dt: of the form `S`:gc: |rarr| `NP VP`:gc:.  This says
that a constituent `S`:gc: can consist of sub-constituents `NP`:gc:
and `VP`:gc:. Similarly, the production `VB`:gc: |rarr| ``'help'``
means that the constituent `VB`:gc: can consist of the string
`help`:lx:.  For a phrase structure tree to be well-formed relative to
a grammar, each non-terminal node and its children must correspond to
a production in the grammar.


A Simple Grammar
----------------

Let's start off by looking at a simple context-free grammar:

.. _G1:
.. ex::

 | S |rarr| NP VP
 | NP |rarr| Det N PP
 | NP |rarr| Det N
 | VP |rarr| V NP PP
 | VP |rarr| V NP
 | VP |rarr| V
 | PP |rarr| P NP
 |
 | Det |rarr| 'the'
 | Det |rarr| 'a'
 | N |rarr| 'man' | 'park' | 'dog' | 'telescope'
 | V |rarr| 'saw' | 'walked'
 | P |rarr| 'in' | 'with'

This grammar contains productions involving various syntactic categories,
as laid out in the following table:

.. table:: Syntactic Categories

  ======    ====================    =====================
  Symbol    Meaning                 Example
  ======    ====================    =====================
  S         sentence                *the man walked*
  NP        noun phrase             *a dog*
  VP        verb phrase             *saw a park*
  PP        prepositional phrase    *with a telescope*
  ..        ..                      ..
  Det       determiner              ..
  N         noun                    ..
  V         verb                    ..
  P         preposition             ..
  ======    ====================    =====================

**Terminology:**
The grammar consists of productions, where each production involves a
single `non-terminal`:dt: (e.g. `S`:gc:, `NP`:gc:), an arrow, and one
or more non-terminals and `terminals`:dt: (e.g. `walked`:lx:).
The productions are often divided into two main groups.
The `grammatical productions`:dt: are those without a terminal on
the right-hand side.  The `lexical productions`:dt: are those having
a terminal on the right-hand side.
A special case of non-terminals are the `pre-terminals`:dt:, which
appear on the left-hand side of lexical productions.

In order to get started with developing simple grammars of your own, you
will probably find it convenient to play with the recursive descent
parser demo, which is invoked as follows:

.. doctest-ignore::
  >>> from nltk_lite.draw import rdparser
  >>> rdparser.demo()

The demo opens a window which displays a list of grammar rules in the
lefthand pane and the current parse diagram in the central pane:

.. image:: ../images/parse_rdparsewindow.png
   :scale: 70

The demo comes with the grammar in `G1`_ already loaded. We will
discuss the parsing algorithm in greater detail below, but for the
time being you can get a rough idea of how it works by using the
*autostep* button.

If we parse the string `The dog saw a man in the park` using
the grammar in `G1`_, we end up with two trees:

.. ex::
  .. ex::
    .. tree:: (S (NP (Det the) (N dog))
                 (VP (V saw)
                     (NP (NP (Det a) (N man)))
                     (PP (P in) (NP (Det the) (N park)))))
  .. ex::
    .. tree:: (S (NP (Det the) (N dog))
                 (VP (V saw)
                     (NP (Det a)
                         (N man)
                         (PP (P in) (NP (Det the) (N park))))))

Since our grammar assigns two distinct structures, the sentence is
said to be :dt:`structurally ambiguous`.  The ambiguity in question is called
a :dt:`PP attachment ambiguity`, as we saw earlier in this chapter.
As you may recall, it is an ambiguity about attachment since the
`PP`:gc: `in the park`:lx: needs to be attached to one of two places
in the tree: either as a daughter of `VP`:gc: or else as a daughter of
`NP`:gc:.

As we noted earlier, there is also a difference in interpretation:
where the `PP`:gc: is attached to `VP`:gc:, the intended interpretation is
that the event of seeing took place in the park, while if the `PP`:gc:
is attached to `NP`:gc:, being in the park is a property of the `NP`:gc:
referent; that is, the man was in the park, but the agent of the
seeing |mdash| the dog |mdash| might have been somewhere else (e.g.,
sitting on the balcony of an apartment overlooking the park).  As we
will see, dealing with ambiguity is a key challenge in parsing.

Exercises
---------

#. In the recursive descent parser demo, experiment with changing the
   sentence to be parsed by selecting *Edit Text* in the *Edit* menu.

#. Can the grammar in `G1`_ be used to describe sentences which are
   more than 20 words in length?

#. You can modify the grammar in the recursive descent parser demo
   by selecting *Edit Grammar*  in the *Edit* menu. Change
   the first expansion rule, namely ``NP -> Det N PP``, to ``NP -> NP
   PP``. Using the *Step* button, try to build a parse tree. What happens?



Recursion
---------

Observe that sentences can be nested within sentences, with no limit
to the depth:

.. ex::
  .. ex:: Jodie won the 100m freestyle
  .. ex:: 'The Age' reported that Jodie won the 100m freestyle
  .. ex:: Sandy said 'The Age' reported that Jodie won the 100m freestyle
  .. ex::  I think Sandy said 'The Age' reported that Jodie won the 100m freestyle

This nesting is explained in terms of :dt:`recursion`. A grammar is
said to be :dt:`recursive` if a category occurring on the lefthand
side of a production (such as `S`:gc: in this case) also appears on
the righthand side of a production. If this dual occurrence takes
place in *one and the same production*, then we have :dt:`direct
recursion`; otherwise we have :dt:`indirect recursion`. There is no
recursion in `G1`_. However, grammar `G2`_ below illustrates both kinds of
recursive rule:

.. _G2:
.. ex::

 | S  |rarr| NP VP
 | NP |rarr| Det Nom
 | NP |rarr| Det Nom PP
 | NP |rarr| PropN
 | Nom |rarr| Adj Nom
 | Nom |rarr| N
 | VP |rarr| V NP PP
 | VP |rarr| V NP
 | VP |rarr| V S
 | VP |rarr| V
 | PP |rarr| P NP
 |
 | PropN |rarr| 'John' | 'Mary' 
 | Det |rarr| 'the'
 | Det |rarr| 'a'
 | N |rarr| 'man' | 'woman' | 'park' | 'dog' | 'lead' | 'telescope' | 'butterfly'
 | Adj  |rarr| 'fierce' | 'black' |  'big' | 'European'
 | V |rarr| 'saw' | 'chased' | 'barked'  | 'disappeared' | 'said' | 'reported' 
 | P |rarr| 'in' | 'with' 

Notice that the production `Nom`:gc: |rarr| `Adj Nom`:gc: (where
`Nom`:gc: is the category of nominals) involves direct
recursion on the category `Nom`:gc:, whereas indirect recursion on `S`:gc:
arises from the combination of two productions, namely `S`:gc: |rarr|
`NP VP`:gc: and `VP`:gc: |rarr| `V S`:gc:.  

To illustrate recursion in this grammar, we show first of all a tree
involving nested nominal phrases:

.. ex::
  .. tree::
   (S (NP (Det a) (Nom (Adj fierce)(Nom (Adj black) (N dog))))
       (VP (V chased)
          (NP (Det the) (Nom (Adj big)(Nom (Adj European) (N butterfly))))))

Next, observe how we can embed one `S`:gc:  constituent into another:

.. ex::
  .. tree::
    (S (NP (Det the) (N man))
       (VP (V said)
           (S (NP (Det the) (N woman))
              (VP (V reported)
                  (S (NP (Det the) (N dog))
                           (VP (V barked)))))))
   

If you did the exercises for the last section, you will have noticed
that the recursive descent parser fails to deal properly with the
following rule:

.. ex::  
  .. parsed-literal:: 

   NP |rarr| NP PP

From a linguistic point of view, this rule is perfectly respectable,
and will allow us to derive trees like this:

.. ex::
  .. tree::
    (S (NP 
           (NP 
               (NP (Det the) (N man))
               (PP (P with) (NP  (Det a) (N dog))))
            (PP (P in  (NP  (Det the) (N park)))))
         (VP (V disappeared)))

More schematically, the trees will be of the following shape:

.. _leftrec:
.. ex::
  .. tree::
    (NP (NP (NP (NP Det N) PP) PP) PP)

leftrec_ is an example of a `left recursive`:dt: structure. Such
structures seem to occur rather frequently in analyses of English, and
the failure of recursive descent parsers to deal adequately with left
recursion means that we will need to find alternative approaches, as
we will discuss later in this chapter.

Heads, Complements and Modifiers
--------------------------------

Let us take a closer look at verbs.
The grammar `G2`_ correctly generates examples like subcat1_,
corresponding to the four productions with `VP`:gc: on the lefthand side:

.. _subcat1:
.. ex::
   .. ex:: The woman gave the telescope to the dog.
   .. ex:: The woman saw a man.
   .. ex:: A man said that the woman disappeared.
   .. ex:: The dog barked.

That is, `gave`:lx: can occur with a following `NP`:gc: and `PP`:gc:; 
`saw`:lx: can occur with a following `NP`:gc:; 
`said`:lx: can occur with a following `S`:gc:; 
and `barked`:lx: can occur with no following phrase.
In these cases, `NP`:gc:, `PP`:gc: and `S`:gc: are called :dt:`complements`
of the respective verbs, and the verbs themselves are called
:dt:`heads` of the verb phrase.

However, there are fairly strong constraints on what verbs can occur
with what complements. Thus, we would like our grammars to mark the
following examples as ungrammatical [#]_:

.. _subcat2:
.. ex:: 
   .. ex:: \*The woman disappeared the telescope to the dog.
   .. ex:: \*The dog barked a man.
   .. ex:: \*A man gave that the woman disappeared.
   .. ex:: \*A man said.

.. [#] It should be borne in mind that it is possible to create
   examples which involve 'non-standard' but interpretable
   combinations of verbs and complements. Thus, we can, at a stretch,
   interpret `the man disappeared the dog`:lx: as meaning that the man
   made the dog disappear. We will ignore such examples here.

How can we ensure that our grammar correctly excludes the
ungrammatical examples in subcat2_?  We need some way of constraining
grammar productions which expand `VP`:gc: so that verbs *only* cooccur
with their correct complements. We do this by dividing the class of
verbs into `subcategories`:dt:, each of which is associated with a
different set of complements. For example, `transitive verbs`:dt: such
as `saw`:lx:, `kissed`:lx: and `hit`:lx: require a following `NP`:gc:
object complement. Borrowing from the terminology of chemistry, we
sometimes refer to the `valency`:dt: of a verb, that is, its capacity
to combine with a sequence of arguments and thereby compose a verb
phrase.

Let's introduce a new category label for such verbs, namely
`TV`:gc: (for Transitive Verb), and use it in the following productions:

.. ex::
   .. parsed-literal::

     VP |rarr| TV NP
     TV |rarr| 'saw' | 'kissed' | 'hit'

Now `*the dog barked the man`:lx: is excluded since we haven't listed
`barked`:lx: as a `V_tr`:gc:, but `the woman saw a man`:lx: is still allowed.
The following table provides more examples of labels for verb subcategories.

======    ====================    ========================
Verb Subcategories
----------------------------------------------------------
Symbol    Meaning                 Example
======    ====================    ========================
IV        intransitive verb       *barked*
TV        transitive verb         *saw a man*
DatV      dative verb             *gave a dog to a man*
SV        sentential verb         *said that a dog barked*
======    ====================    ========================

The revised grammar for `VP`:gc: will now look like this:

.. _subcat3:
.. ex::
   .. parsed-literal:: 

      VP |rarr| DatV NP PP
      VP |rarr| TV NP
      VP |rarr| SV S
      VP |rarr| IV 

      DatV |rarr| 'gave' | 'donated' | 'presented'
      TV |rarr| 'saw' | 'kissed' | 'hit' | 'sang'
      SV |rarr| 'said' | 'knew' | 'alleged'
      IV |rarr| 'barked' | 'disappeared' | 'elapsed' | 'sang'

Notice that according to subcat3_, a given lexical item can belong to more
than one subcategory. For example, `sang`:lx: can occur both with and
without a following `NP`:gc: complement.




Lexical Acquisition
-------------------

We have seen increasingly detailed grammars, e.g., identifying
different kinds of verbs.  How are we to acquire this information in a
scalable way?  One method is to return to the chunking methods.  For
example, we saw in the `Chunking <chunk.html>`__ chapter that it is
possible to collapse chunks down to the chunk label, thus:

.. ex::
   ::

     gave NP
     gave up NP in NP
     gave NP up
     gave NP NP
     gave NP to NP

We can use this as raw material to guide us as we manually construct
more grammar productions.

Review of CFGs
--------------

We have seen that a CFG contains terminal and nonterminal symbols, and
rules which dictate how constituents are expanded into other
constituents and words. In this section, we provide some formal definitions.

A CFG is a 4-tuple |langle|\ `N`:math:, |Sigma|, `P`:math:, `S`:math:\
|rangle|\ , where:

*  |Sigma| is a set of *terminal* symbols (e.g., lexical items);

*  `N`:math: is a set of *non-terminal* symbols (the category labels); 

*  `P`:math: is a set of *productions* of the form `A`:math: |rarr| |alpha|, where

   * `A`:math: is a non-terminal, and 
   
   * |alpha| is a string of symbols from (`N`:math: |cup| |Sigma|)*
     (i.e., strings of either terminals or non-terminals);

*  `S`:math: is the *start symbol*.

A `derivation`:dt: of a string from a non-terminal `A`:math:
in grammar `G`:math: is the result of successively applying
productions from  `G`:math: to `A`:math:.  For example, deriv1_ is a
derivation of `the dog with a telescope`:lx: for the grammar in G1_.

.. _deriv1:
.. ex::
  ::

     NP
     Det N PP
     the N PP
     the dog PP
     the dog P NP
     the dog with NP
     the dog with Det N
     the dog with a N
     the dog with a telescope

Although we have chosen here to expand the leftmost non-terminal
symbol at each stage, this is not obligatory; productions can be
applied in any order. Thus,  derivation deriv1_
could equally have started off in the following manner:

.. _deriv2:
.. ex::
  ::

     NP
     Det N PP
     Det N P NP
     Det N with NP
     ...

We can also write derivation deriv1_ as:

.. _deriv3:
.. ex::
   `NP`:gc: |DoubleRightArrow| `Det N PP`:gc:
   |DoubleRightArrow| `the`:lx: `N PP`:gc:
   |DoubleRightArrow| `the dog`:lx: `PP`:gc:
   |DoubleRightArrow| `the dog`:lx: `P NP`:gc:
   |DoubleRightArrow| `the dog with`:lx: `NP`:gc:
   |DoubleRightArrow| `the dog with a`:lx: `N`:gc:
   |DoubleRightArrow| `the dog with a telescope`:lx:

where |DoubleRightArrow| means "derives in one step". 
We use |DoubleRightArrow|\ * to mean "derives in zero or more steps":

* |alpha| |DoubleRightArrow|\ * |alpha| for any string |alpha|, and

* if |alpha| |DoubleRightArrow|\ * |beta| and |beta|
  |DoubleRightArrow| |gamma|, then |alpha| |DoubleRightArrow|\ *
  |gamma|.

We write `A`:math: |DoubleRightArrow|\ * |alpha| to indicate that
|alpha| can be derived from  `A`:math:.


Context Free Grammars in NLTK-Lite
----------------------------------

Context free grammars are encoded by the ``cfg.Grammar`` class.  The easiest
way to construct a grammar object is from the standard string representation
of grammars:

    >>> productions = '''
    ... S -> NP VP
    ... VP -> V NP | V NP PP
    ... V -> "saw" | "ate"
    ... NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    ... Det -> "a" | "an" | "the" | "my"
    ... N -> "dog" | "cat" | "cookie"
    ... PP -> P NP
    ... P -> "on" | "by" | "with"
    ... '''

Now we can convert this string into a grammar object:

    >>> from nltk_lite import parse
    >>> grammar = parse.cfg.parse_grammar(productions)
    >>> grammar
    <Grammar with 21 productions>

Next, we can build a parser for this grammar:

    >>> from nltk_lite import parse
    >>> rd_parser = parse.RecursiveDescent(grammar)

Finally, we can use the parser to parse a sentence:

    >>> from nltk_lite import tokenize
    >>> sent = list(tokenize.whitespace("Mary saw Bob"))
    >>> for p in rd_parser.get_parse_list(sent):
    ...      print p
    (S: (NP: 'Mary') (VP: (V: 'saw') (NP: 'Bob')))
  

Exercises
---------

1. Extend the grammar in `G2`_ with productions which expand prepositions as
   intransitive, transitive and requiring a `PP`:gc:
   complement. Based on these productions, use the method of the
   preceding exercise to draw
   a tree for the sentence `Lee ran away home`:lx:\.

2. Pick some common verbs.

  a) Write a program to find those verbs in the PP Attachment Corpus
     included with NLTK-Lite.  Find any cases where the same verb
     exhibits two different attachments, but where the first noun,
     or second noun, or preposition, stay unchanged (as we saw in
     the PP Attachment Corpus example data above).

  b) Devise CFG grammar productions to cover some of these cases.

3. **Lexical Acquisition:**
   Identify some English verbs that are near-synonyms, such as the
   :lx:`dumped/filled/loaded` example from earlier in this chapter.
   Use the chunking method to study the complementation patterns of
   these verbs.  Create a grammar to cover these cases.  Can the verbs
   be freely substituted for each other, or are their constraints?
   Discuss your findings.


-------
Parsing
-------

A :dt:`parser` is a computational system which processes input
sentences according to the productions of a grammar, and builds one or
more constituent structures which conform to the grammar. While a
grammar is a declarative specification of well-formedness, a parser is
a procedural interpretation of the grammar.  We can think of the
parser as searching through the space of possible trees licensed by a
grammar, to find one that has the required sentence along its fringe.
Following on from our description of context free grammars, we will
now describe some simple parsers that work with them.

Parsing is important in linguistics and natural language processing
for a variety of reasons.  A parser permits a grammar to be evaluated
against a potentially large collection of test sentences, helping the
linguist to identify shortcomings in their analysis.  A parser can
also be used as a model of psycholinguistic processing, with the goal
of explaining the processing difficulties that humans have with
certain syntactic constructions (e.g., the so-called 'garden path'
sentences).  There are many NL applications which involve parsing at
some point; for example, we would expect the natural language
questions submitted to a question-answering system to undergo parsing
as an initial step.

The Parser Interface 
--------------------

The ``parse`` module defines the ``ParseI`` interface, which in turn defines
the two methods which all parsers should support:


1. The ``parse`` method returns the single best parse for a given
   text.  The text is represented as a list of word tokens.  If no
   parses are found for the given text, then ``parse`` returns
   ``None``.

#. The ``get_parse_list`` method returns a list of the parses for the
   given text.

For example, here is what the recursive descent parser generates for a
simple sentence and grammar:

    >>> from nltk_lite import tokenize, parse
    >>> sent = list(tokenize.whitespace('I saw a man in the park'))
    >>> rd_parser = parse.RecursiveDescent(grammar)
  
    >>> for p in rd_parser.get_parse_list(sent):
    ...     print p
    (S:
      (NP: 'I')
      (VP:
        (V: 'saw')
        (NP:
          (Det: 'a')
          (N: 'man')
          (PP: (P: 'in') (NP: (Det: 'the') (N: 'park'))))))
    (S:
      (NP: 'I')
      (VP:
        (V: 'saw')
        (NP: (Det: 'a') (N: 'man'))
        (PP: (P: 'in') (NP: (Det: 'the') (N: 'park')))))


Recursive Descent Parsing 
-------------------------

The simplest kind of parser interprets the grammar as a specification
of how to break a high-level goal into several lower-level subgoals.
The top-level goal is to find an `S`:gc:.  The `S`:gc: |rarr| `NP VP`:gc:
production permits the parser to replace this goal with two subgoals:
find an `NP`:gc:, then find a `VP`:gc:.  Each of these subgoals can be
replaced in turn by sub-sub-goals, using productions that have `NP`:gc:
and `VP`:gc: on their left-hand side.  Eventually, this expansion
process leads to subgoals such as: find the word `telescope`:lx:.  Such
subgoals can be directly compared against the input string, and
succeed if the next word is matched.  If there is no match the parser
must back up and try a different alternative.

The recursive descent parser builds a parse tree during the above
process.  With the initial goal (find an `S`), the `S` root node
is created.  As the above process recursively expands its goals using
the productions of the grammar, the parse tree is extended downwards
(hence the name *recursive descent*).  We can see this in action using
the parser demonstration ``nltk_lite.draw.rdparser``.  To run this
demonstration, use the following commands:

.. doctest-ignore::
  >>> from nltk_lite.draw import rdparser
  >>> rdparser.demo()

Six stages of the execution of this parser are shown below:

+----------------------------------------------------------------------------------+
|                        Six Stages of a Recursive Descent Parser                  |
+===========================+==========================+===========================+
| a. Initial stage          | b. After two productions | c. After matching "the"   |
|                           |                          |                           |
| |rdparser1|               | |rdparser2|              | |rdparser3|               |
+---------------------------+--------------------------+---------------------------+
| d. Failing to match "man" | e. Completed parse       | f. Backtracking           |
|                           |                          |                           |
| |rdparser4|               | |rdparser5|              | |rdparser6|               |
+---------------------------+--------------------------+---------------------------+

.. |rdparser1| image:: ../images/rdparser1.png
.. |rdparser2| image:: ../images/rdparser2.png
.. |rdparser3| image:: ../images/rdparser3.png
.. |rdparser4| image:: ../images/rdparser4.png
.. |rdparser5| image:: ../images/rdparser5.png
.. |rdparser6| image:: ../images/rdparser6.png

During this process, the parser sometimes must choose between several
possible productions.  For example, in going from step c to step d, it
tries to find productions with `N`:gc: on the left-hand side.  The
first of these is `N`:gc: |rarr| `man`:lx:.  When this does not work
it backtracks, and tries other `N`:gc: productions in order, under it
gets to `N`:gc: |rarr| `dog`:lx:, which matches the next word in the
input sentence.  Much later, as shown in step e, it finds a complete
parse.  This is a tree which covers the entire sentence, without any
dangling edges.  Once a parse has been found, we can get the parser to
look for additional parses.  Again it will backtrack and explore other
choices of production in case any of them result in a parse.

The Recursive Descent Parser in NLTK
------------------------------------

The ``nltk_lite.parse`` module defines ``RecursiveDescent``, a
simple recursive implementation of a top-down parser.
Recursive descent parsers are created from ``Grammars`` by the
``RecursiveDescent`` constructor.

    >>> from nltk_lite import parse
    >>> rd_parser = parse.RecursiveDescent(grammar)
    >>> sent = list(tokenize.whitespace('I saw a man'))
    >>> rd_parser.get_parse_list(sent)
    [(S: (NP: 'I') (VP: (V: 'saw') (NP: (Det: 'a') (N: 'man'))))]

The constructor takes an optional parameter ``trace``.  If ``trace``
is greater than zero, then the parser will describe the steps that it
takes as it parses a text.

Problems with Recursive Descent Parsing
---------------------------------------

Recursive descent parsing has three key shortcomings.  First,
left-recursive productions like `NP`:gc: |rarr| `NP PP`:gc: send it
into an infinite loop.  Second, the parser wastes a lot of time
considering words and structures that do not correspond to the input
sentence.  Third, the backtracking process may discard parsed
constituents that will need to be rebuilt again later.  For example,
backtracking over `VP`:gc: |rarr| `V NP`:gc: will discard the subtree
created for the `NP`:gc:.  If the parser then proceeds with `VP`:gc:
|rarr| `V NP PP`:gc:, then the `NP`:gc: subtree must be created all
over again.

Recursive descent parsing is a kind of `top-down parsing`:dt:.
Top-down parsers use a grammar to *predict* what the input will be,
before inspecting the input!  However, since the input is available to
the parser all along, it would be more sensible to consider the input
sentence from the very beginning.  This approach is called
`bottom-up parsing`:dt:, and we will see an example in the next section.

Shift-Reduce Parsing 
--------------------

The simplest kind of bottom-up parsing is known as `shift-reduce
parsing`:dt:.  In common with all bottom-up parsers, a shift-reduce
parser tries to find sequences of words and phrases that correspond
to the *right-hand* side of a grammar production, and replace them
with the left-hand side, until the whole sentence is reduced to
an `S`:gc:.

The shift-reduce parser repeatedly pushes the next input word onto a
stack; this is the `shift`:dt: operation.  If the top *n* items on the
stack match the *n* items on the right-hand side of some production,
then they are all popped off the stack, and the item on the left-hand
side of the production is pushed on the stack.  This replacement of
the top *n* items with a single item is the `reduce`:dt: operation.
(This reduce operation may only be applied to the top of the stack;
reducing items lower in the stack must be done before later items are
pushed onto the stack.)  The parser finishes when all the input is
consumed and there is only one item remaining on the stack, a parse
tree with an `S`:gc: node as its root.

The shift-reduce parser builds a parse tree during the above process.
If the top of stack holds the word `dog`:lx:, and if the grammar has a
production `N`:gc: |rarr| `dog`:lx:, then the reduce operation causes the word
to be replaced with the parse tree for this production.  For
convenience we will represent this tree as ``N(dog)``.  At a later
stage, if the top of the stack holds two items ``Det(the) N(dog)`` and
if the grammar has a production `NP`:gc: |rarr| `Det N`:gc: then the reduce
operation causes these two items to be replaced with ``NP(Det(the),
N(dog))``.  This process continues until a parse tree for the entire
sentence has been constructed.  We can see this in action using the
parser demonstration ``nltk_lite.draw.srparser``.  To run this
demonstration, use the following commands:

.. doctest-ignore::
    >>> from nltk_lite.draw import srparser
    >>> srparser.demo()

Six stages of the execution of this parser are shown below:

+----------------------------------------------------+
| Six Stages of a Shift-Reduce Parser                |
+====================================================+
| .. image:: ../images/srparser1.png                 |
|                                                    |
| a. Initial State                                   |
+----------------------------------------------------+
| .. image:: ../images/srparser2.png                 |
|                                                    |
| b. After one shift                                 |
+----------------------------------------------------+
| .. image:: ../images/srparser3.png                 |
|                                                    |
| c. After shift reduce shift                        |
+----------------------------------------------------+
| .. image:: ../images/srparser4.png                 |
|                                                    |
| d. After recognizing the second NP                 |
+----------------------------------------------------+
| .. image:: ../images/srparser5.png                 |
|                                                    |
| e. Complex NP                                      |
+----------------------------------------------------+
| .. image:: ../images/srparser6.png                 |
|                                                    |
| f. Final Step                                      |
+----------------------------------------------------+

The Shift Reduce Parser in NLTK
-------------------------------

The ``nltk_lite.parse`` module defines ``ShiftReduce``, a simple
implementation of a shift-reduce parser.  This parser does not
implement any backtracking, so it is not guaranteed to find a parse
for a text, even if one exists.  Furthermore, it will only find at
most one parse, even if more parses exist.

Shift reduce parsers are created from ``Grammars`` by the
``ShiftReduceParse`` constructor.  The constructor takes an optional
parameter ``trace``.  As with the recursive descent parser, this value
specifies how verbosely the parser should describe the steps that it
takes as it parses a text: 

    >>> sr_parse = parse.ShiftReduce(grammar, trace=1)

The following example shows the trace output generated by
``sr_parser`` on a simple sentence: 

    >>> sent = list(tokenize.whitespace('I saw a man'))
    >>> sr_parse.parse(sent)
    Parsing 'I saw a man'
        [ * I saw a man]
      S [ 'I' * saw a man]
      R [ <NP> * saw a man]
      S [ <NP> 'saw' * a man]
      R [ <NP> <V> * a man]
      S [ <NP> <V> 'a' * man]
      R [ <NP> <V> <Det> * man]
      S [ <NP> <V> <Det> 'man' * ]
      R [ <NP> <V> <Det> <N> * ]
      R [ <NP> <V> <NP> * ]
      R [ <NP> <VP> * ]
      R [ <S> * ]
    (S: (NP: 'I') (VP: (V: 'saw') (NP: (Det: 'a') (N: 'man'))))

NLTK also defines a graphical demonstration tool for the
shift reduce parser:

.. doctest-ignore::
    >>> from nltk.draw.srparser import demo
    >>> demo()

Problems with Shift Reduce Parser
---------------------------------

A shift-reduce parser may fail to parse the sentence, even though the
sentence is well-formed according to the grammar.  In such cases,
there are no remaining input words to shift, and there is no way to
reduce the remaining items on the stack, as exemplified in the left
example below.  The parser entered this blind alley at an earlier
stage shown in the middle example below, when it reduced instead of
shifted.  This situation is called a `shift-reduce conflict`:dt:.  At
another possible stage of processing shown in the right example below,
the parser must choose between two possible reductions, both matching
the top items on the stack: `V`:gc: |rarr| `V NP PP`:gc: or `NP`:gc: |rarr|
`NP PP`:gc:.  This situation is called a `reduce-reduce conflict`:dt:.

+----------------------------------------------------+
| Conflict in Shift-Reduce Parsing                   |
+====================================================+
| .. image:: ../images/srparser7.png                 |
+----------------------------------------------------+
| .. image:: ../images/srparser8.png                 |
+----------------------------------------------------+
| .. image:: ../images/srparser9.png                 |
+----------------------------------------------------+

.. To do: diagram showing search tree with success and failure.

Shift-reduce parsers may implement policies for resolving such
conflicts.  For example, they may address shift-reduce conflicts by
shifting only when no reductions are possible, and they may address
reduce-reduce conflicts by favouring the reduction operation that removes
the most items from the stack.  No such policies are failsafe however.

The advantages of shift-reduce parsers over recursive descent parsers
is that they only build structure that corresponds to the words in the
input.  Furthermore, they only build each sub-structure once,
e.g. ``NP(Det(the), N(man))`` is only built and pushed onto the stack
a single time, regardless of whether it will later be used by the `V`:gc:
|rarr| `V NP PP`:gc: reduction or the `NP`:gc: |rarr| `NP PP`:gc: reduction.

The Left-Corner Parser
----------------------

One of the problems with the recursive descent parser is that it can
get into an infinite loop.  This is because it applies the grammar
productions blindly, without considering the actual input sentence.
A left-corner parser is a hybrid between the bottom-up and top-down
approaches we have seen.

Grammar G2_ allows us to produce the following parse of `John saw
Mary`:lx:\ :

.. _jmtree:
.. ex::
  .. tree::
   (S (NP John) 
      (VP (V saw)
         (NP Mary)))

Recall that the grammar in G2_ has the following rules for expanding `NP`:gc:\ :

.. ex::
   .. _r1:
   .. ex:: `NP`:gc: |rarr| `Det Nom`:gc:
   .. _r2:
   .. ex:: `NP`:gc: |rarr| `Det Nom PP`:gc:
   .. _r3:
   .. ex:: `NP`:gc: |rarr| `PropN`:gc: 

Suppose we ask you to first look at tree jmtree_, and then decide
which of the `NP`:gc: rules you'd want a recursive descent parser to
apply first |mdash| obviously, r3_ is the right choice! How do you
know that it would be pointless to apply r1_ or r2_ instead? Because
neither of these rules will derive a string whose first word is
`John`:lx:.  That is, we can easily tell that in a successful
parse of `John saw Mary`:lx:, the parser has to expand `NP`:gc: in
such a way that `NP`:gc: derives the string `John`:lx: |alpha|. More
generally, we say that a category `B`:math: is a `left-corner`:dt: of
a tree rooted in `A`:math: if  `A`:math: |DoubleRightArrow|\ *
`B`:math: |alpha|.

.. ex::
  .. tree:: <A B a>

A `left-corner parser`:dt: is a top-down parser with bottom-up filtering.
Unlike an ordinary recursive descent parser, it does not get trapped
in left recursive productions. 

.. Suppose `X`:gc: is a non-terminal that
   expands to some sequence `Y`:subscript:`1` |cdots| `Y`:subscript:`n`
   of terminals and non-terminals.  We call `Y`:subscript:`1` the
   `left-corner`:dt: of `X`.

Before starting its work, a left-corner parser preprocesses the
context-free grammar to build a table where each row contains two
cells, the first holding a non-terminal, and the second holding the
collection of possible left corners of that non-terminal. Table lc_
illustrates this for the grammar from G2_.

.. _lc:
.. table:: Left-Corners in G2_
 
 ========  ============
 Category  Left-Corners
 ========  ============
 S         NP
 NP        Det, PropN
 VP        V
 PP        P
 ========  ============

Each time a production is considered by the parser, it checks that the
next input word is compatible with at least one of the pre-terminal
categories in the left-corner table.

Exercises
---------

#. **Left-corner parser:** Develop a left-corner parser (inheriting from
   ``ParseI``), based on the recursive descent parser.

#. Compare the performance of the top-down, bottom-up, and left-corner
   parsers using the same grammar and three grammatical test
   sentences. Use ``time.time()`` to log the amount of time each
   parser takes on the same sentence.  Write a function which runs all
   three parsers on all three sentences, and prints a 3-by-3 grid of
   times, as well as row and column totals. Discuss your findings.

#. Extend NLTK's shift-reduce parser to incorporate backtracking, so
   that it is guaranteed to find all parses that exist (i.e. it is `complete`:dt:).

----------
Conclusion
----------

We began this chapter talking about confusing encounters with grammar
at school.  We just wrote what we wanted to say, and our work was
handed back with red marks showing all our grammar mistakes.  If this
kind of 'grammar' seems like secret knowledge, the linguistic approach
we have taken in this chapter is quite the opposite: grammatical
structures are made explicit as we build trees on top of sentences.
We can write down the grammar productions, and parsers can build the
trees automatically.  This thoroughly objective approach
is widely referred to as `generative grammar`:dt:.

Note that we have only considered 'toy grammars,' small grammars that
illustrate the key aspects of parsing.  But there is an obvious
question as to whether the general approach can be scaled up to cover
large corpora of natural languages. How hard would it be to construct
such a set of productions by hand? In general, the answer is: *very
hard*. Even if we allow ourselves to use various formal devices that
give much more succinct representations of grammar productions (some
of which will be discussed in the next chapter), it is still extremely
difficult to keep control of the complex interactions between the many
productions required to cover the major constructions of a
language. In other words, it is hard to modularize grammars so that
one portion can be developed independently of the other parts. This in
turn means that it is difficult to distribute the task of grammar
writing across a team of linguists. Another difficulty is that as the
grammar expands to cover a wider and wider range of constructions,
there is a corresponding increase in the number of analyses which are
admitted for any one sentence. In other words, ambiguity increases
with coverage.

Despite these problems, there are a number of large collaborative
projects which have achieved interesting and impressive results in
developing rule-based grammars for several languages. Examples are the
Lexical Functional Grammar (LFG) Pargram project
(http://www2.parc.com/istl/groups/nltt/pargram/), the Head-Driven
Phrase Structure Grammar (HPSG) LinGO Matrix framework
(http://www.delph-in.net/matrix/), and the Lexicalized Tree Adjoining
Grammar XTAG Project (http://www.cis.upenn.edu/~xtag/).

---------------
Further Reading
---------------

McCawley (1998) *The Syntactic Phenomena of English*.
Chicago University Press.

Rodney D. Huddleston, Geoffrey K. Pullum (2002).
*The Cambridge Grammar of the English Language*.
Cambridge University Press.

.. include:: footer.txt






  
