Breadth-first and Depth-first search
Given a tree, we can visit its nodes in a depth-first
manner (working down a path )
or in a breadth-first manner
(working level by level)
In the parsing process we build a parse tree. If more
than one rule can be applied, the process itself can
be represented as a tree, each node corresponding to a
currently built parse tree.
Hence, depth-first or breadth-first search refer to the way
we accomplish the parsing process.
Complexity of parsing
The top-down and bottom-up parsing methods described above fall into the category of
exhaustive search algorithms. At each derivation step we apply a grammar
rule either increasing the length of the current string from 1 symbol
(the starting symbol) to the target string (only terminal symbols)
or decreasing the length from all terminal symbols to the starting symbol.
Thus the number of the derivation steps is bound by the length of the parsed string.
At each step we may have a choice among several rules that expand the same non-terminal symbol
(in top-down parsing), or that have the same right-hand side (in bottom-up parsing).
Therefore the complexity of the exhaustive parsing is exponential
in the length of the parsed string and hence extremely inefficient.
There are two widely used algorithms for parsing, both based on dynamic programming,
that reduce the complexity to O(n3), where n is the length of the parsed string:
Earley’s algorithm, published in 1970,
and Cocke-Younger-Kasami (CYK) algorithm, developed 1965 – 1967.
Below you can see the description of Earley’s algorithm.
Earley's algorithm (1970)
The input string is represented in the following way:
a1 a2 a3 ... an
0 1 2 3 ... n-1 n
The meaning is: a1 starts from position 0 and ends in position 1.
The string a1a2 starts in position 0
and ends in position 2, etc.
"Dotted" rules:
-
A → .B (i,i)
We expect that a substring starting at position i is of type B
- A → B.C (i,j) for a
substring x, if x = uv,
B =>* u, and the substring u starts
at i and ends at j
This means that we have recognized part of the string
(u form
i to j)
as corresponding to B,
and we have to explore the rest
of the string (v) to see if it corresponds to C.
- A → w. (i,j)
This means that the string w starting at i and
ending at j is recognized as A
A table E(N+1, N+1) is built, N = the length of the
input string.
The algorithm has two parts:
Part 1: Initialization:
For top-down parsing, in E(0,0) we store all rules S → .
w
Part 2:
The following steps are repeated until no dotted rules
are added into the table:
- Prediction:
If A → w1. B w2
is in E(i, j), we add all rules B → . v to
E(j, j)
We have recognized w1 from i to j and we expect
a substring starting at j to be of type B
- Moving the dot (reduction):
If A → w1. is in E(i, j)
and if B → w1. A w2
is in some E(k, i),
we add B → w1 A . w2
to E(k, j).
We have recognized a substring w1 from i
to j to be of type A.
We record this fact in all rules
that expect A from position i by moving the dot
after A and expanding the interval to j.
- Completion:
If A → . xj is in E(j-1, j-1) ,
we add A → xj. in E(j-1,j).
The meaning is: if A is a pre-terminal symbol and the current
terminal symbol is xj,
we consider
xj to
be recognized.
The algorithm works in cycles: prediction,
reduction, completion.
Since in a given cell in the table we may have more
than one dotted rule, we can explore
the possible solutions
either in a depth-first manner or in a
breadth-first manner,
by storing the dotted rules to be processed
in a stack (depth-first) or in a queue (breadth-first).
The algorithm as described above works in a top-down manner.
It can be modified to work in a bottom-up manner by changing the initialization step.
Example