CmSc 365 Theory of Computation


Context-Free Grammars - Parsing

Learning goals

Exam-like questions and problems


Parsing (syntactic analysis) is the process of determining whether a string is in the language defined by the grammar
and, if the answer is yes, deriving the structure of the string, i.e. building the parse tree.

  1. Top down parsing

    • Start with the starting symbol S
    • Given current string x A y, and a rule A → w, rewrite the string as x w y

      Here A is a non-terminal symbol; x, y, and w are strings of terminal and/or non-terminal symbols
      and may also be empty.

    If we can derive the input string from S, then the input string belongs to the language.

    Building the parse tree:

    Initially we build a parse tree of one node - S.
    Then, each time a rule A → A1 A2 … An is used, we create nodes A1, A2, .. An
    and link them as children to the node A.

    Here A is a non-terminal symbol; A1, A2, …, An may be terminal and/or non-terminal symbols.
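
    As an illustration, here is a minimal Python sketch of exhaustive top-down parsing (breadth-first).
    The grammar S → aSb | ab (the language a^n b^n) and all names are made-up examples, and the sketch
    assumes a grammar with no empty rules, so that a sentential form longer than the input can be safely discarded.

      from collections import deque

      # Hypothetical example grammar: S -> aSb | ab  (the language a^n b^n, n >= 1)
      GRAMMAR = {
          "S": [["a", "S", "b"], ["a", "b"]],
      }

      def top_down_parse(target, start="S"):
          """Return True if the string of terminals `target` is derivable from `start`."""
          target = list(target)
          queue = deque([[start]])          # each entry is a sentential form (list of symbols)
          seen = {tuple([start])}
          while queue:
              form = queue.popleft()
              if form == target:
                  return True
              if len(form) > len(target):   # safe to prune: the grammar has no empty rules
                  continue
              # expand the leftmost non-terminal with every applicable rule A -> w
              for i, sym in enumerate(form):
                  if sym in GRAMMAR:
                      for rhs in GRAMMAR[sym]:
                          new_form = form[:i] + rhs + form[i + 1:]   # x A y  becomes  x w y
                          if tuple(new_form) not in seen:
                              seen.add(tuple(new_form))
                              queue.append(new_form)
                      break
          return False

      print(top_down_parse("aabb"))   # True
      print(top_down_parse("aab"))    # False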

  2. Bottom up parsing

    • Start with the input string
    • If the current string is x w y and there is a rule A → w, rewrite the string as x A y

    If we can derive S from the input string, then the input string belongs to the language.

    Building the parse tree:

    For an input string of length N, initially we create N nodes corresponding to the terminal symbols.
    Then, each time a rule A → A1 A2 … An is used, we create a node A and link it
    as a parent node to the nodes A1, A2, …, An.
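
    A matching Python sketch of exhaustive bottom-up parsing (breadth-first), under the same assumptions
    (the made-up grammar S → aSb | ab and no empty rules), is given below.

      from collections import deque

      GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}   # made-up example grammar

      def bottom_up_parse(target, start="S"):
          """Return True if the string of terminals `target` reduces to `start`."""
          queue = deque([tuple(target)])     # each entry is a sentential form
          seen = {tuple(target)}
          while queue:
              form = queue.popleft()
              if form == (start,):
                  return True
              # try every rule A -> w at every position: rewrite x w y as x A y
              for lhs, rhss in GRAMMAR.items():
                  for rhs in rhss:
                      n = len(rhs)
                      for i in range(len(form) - n + 1):
                          if list(form[i:i + n]) == rhs:
                              new_form = form[:i] + (lhs,) + form[i + n:]
                              if new_form not in seen:
                                  seen.add(new_form)
                                  queue.append(new_form)
          return False

      print(bottom_up_parse("aabb"))   # True
      print(bottom_up_parse("abab"))   # False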

  3. Non-determinism

    This is the case when at a given step more than one rule can be applied.
    Non-determinism is handled by exploring all possibilities.

  4. Breadth-first and Depth-first search

    Given a tree, we can visit its nodes in a depth-first manner (working down a path)
    or in a breadth-first manner (working level by level).

    In the parsing process we build a parse tree. If at some step more than one rule can be applied,
    the parsing process itself can be represented as a tree of choices, with each node corresponding
    to a partially built parse tree.

    Hence, depth-first or breadth-first search refer to the way we accomplish the parsing process.
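
    The Python sketch below illustrates the point: the only difference between depth-first and
    breadth-first parsing is whether the frontier of sentential forms still to be explored is used
    as a stack or as a queue. The grammar and names are the same made-up example as above.

      from collections import deque

      GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}   # made-up example grammar

      def parse(target, start="S", depth_first=True):
          target = list(target)
          frontier = deque([[start]])
          seen = {tuple([start])}
          while frontier:
              # stack (LIFO) gives depth-first search, queue (FIFO) gives breadth-first search
              form = frontier.pop() if depth_first else frontier.popleft()
              if form == target:
                  return True
              if len(form) > len(target):        # assumes no empty rules
                  continue
              for i, sym in enumerate(form):
                  if sym in GRAMMAR:             # several rules may apply here: non-determinism
                      for rhs in GRAMMAR[sym]:
                          new_form = form[:i] + rhs + form[i + 1:]
                          if tuple(new_form) not in seen:
                              seen.add(tuple(new_form))
                              frontier.append(new_form)
                      break
          return False

      print(parse("aaabbb", depth_first=True))    # True
      print(parse("aaabbb", depth_first=False))   # True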

  5. Complexity of parsing

    The top-down and bottom-up parsing methods described above fall into the category of exhaustive search algorithms. At each derivation step we apply a grammar rule, either growing the current string from one symbol (the starting symbol) towards the target string (only terminal symbols), or shrinking it from the target string towards the starting symbol. Thus the number of derivation steps is bounded by the length of the parsed string.

    At each step we may have a choice among several rules that expand the same non-terminal symbol (in top-down parsing), or that have the same right-hand side (in bottom-up parsing).

    Therefore exhaustive parsing has complexity exponential in the length of the parsed string and hence is extremely inefficient.

    There are two widely used algorithms for parsing, both based on dynamic programming, that reduce the complexity to O(n³), where n is the length of the parsed string: Earley's algorithm, published in 1970, and the Cocke-Younger-Kasami (CYK) algorithm, developed between 1965 and 1967.

    Below you can see the description of Earley’s algorithm.

  6. Earley's algorithm (1970)

    The input string is represented in the following way:

    
    	   a1   a2   a3    ...      an
    	0     1    2    3  ...  n-1    n
    

    The meaning is: a1 starts at position 0 and ends at position 1; the string a1a2 starts at position 0 and ends at position 2, etc.
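
    For example, for the input string ab (n = 2), the symbol a occupies the span (0, 1), the symbol b occupies the span (1, 2), and the whole string ab occupies the span (0, 2).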

    "Dotted" rules:

    • A → .B (i,i)
      We expect that a substring starting at position i is of type B
    • A → B.C (i,j) for a substring x, where x = uv, B =>* u, and the substring u starts at position i and ends at position j

      This means that we have recognized part of the string (u, from i to j) as corresponding to B,
      and we have to explore the rest of the string (v) to see whether it corresponds to C.

    • A → w. (i,j)
      This means that the string w starting at i and ending at j is recognized as A
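
    As an illustration, take the (made-up) rule S → A B. The dotted rule S → . A B (0, 0) says that we expect
    a string of type A B starting at position 0; S → A . B (0, 3) says that positions 0 through 3 have been
    recognized as an A and a B is expected starting at position 3; and S → A B . (0, 5) says that the substring
    from position 0 to position 5 has been recognized as an S.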

    A table E(N+1, N+1) is built, where N is the length of the input string.

    The algorithm has two parts:

    Part 1: Initialization:

    For top-down parsing, in E(0,0) we store all rules S → . w

    Part 2:

    The following steps are repeated until no dotted rules are added into the table:

    1. Prediction:

      If A → w1 . B w2 is in E(i, j), we add all rules B → . v to E(j, j).

      We have recognized w1 from i to j and we expect a substring starting at j to be of type B.

    2. Moving the dot (reduction):

      If A → w1 . is in E(i, j) and B → w3 . A w2 is in some E(k, i),
      we add B → w3 A . w2 to E(k, j).

      We have recognized a substring w1 from i to j to be of type A.
      We record this fact in all rules that expect A from position i by moving the dot
      after A and expanding the interval to j.

    3. Completion:

      If A → . xj is in E(j-1, j-1), we add A → xj . to E(j-1, j).

      The meaning is: if A is a pre-terminal symbol and the current terminal symbol is xj,
      we consider xj to be recognized.

    The algorithm works in cycles: prediction, reduction, completion.

    Since in a given cell in the table we may have more than one dotted rule, we can explore
    the possible solutions either in a depth-first manner or in a breadth-first manner,
    by storing the dotted rules to be processed in a stack (depth-first) or in a queue (breadth-first).

    The algorithm as described above works in a top-down manner.
    It can be modified to work in a bottom-up manner by changing the initialization step.
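
    Below is a compact Python sketch of the algorithm as described above. It keeps the chart as a set of
    dotted rules (A → w1 . w2, spanning (i, j)) and repeats prediction, reduction and completion until no
    new dotted rules are added. The grammar is a made-up example; one simplification is that the completion
    step here moves the dot over any terminal that matches the next input symbol, which contains the
    pre-terminal case described above as a special case.

      GRAMMAR = {                      # hypothetical example grammar: a^n b^n, n >= 1
          "S": [("a", "S", "b"), ("a", "b")],
      }

      def earley(tokens, grammar=GRAMMAR, start="S"):
          n = len(tokens)
          # an item is (lhs, rhs, dot, i, j): lhs -> rhs[:dot] . rhs[dot:] spanning (i, j)
          chart = {(start, tuple(rhs), 0, 0, 0) for rhs in grammar[start]}   # initialization
          changed = True
          while changed:                                   # repeat until no items are added
              changed = False
              for (lhs, rhs, dot, i, j) in list(chart):
                  new_items = set()
                  if dot < len(rhs):
                      nxt = rhs[dot]
                      if nxt in grammar:                   # prediction: expect nxt starting at j
                          for rhs2 in grammar[nxt]:
                              new_items.add((nxt, tuple(rhs2), 0, j, j))
                      elif j < n and tokens[j] == nxt:     # completion: the next terminal matches
                          new_items.add((lhs, rhs, dot + 1, i, j + 1))
                  else:                                    # reduction: lhs recognized from i to j
                      for (lhs2, rhs2, dot2, k, end2) in list(chart):
                          if end2 == i and dot2 < len(rhs2) and rhs2[dot2] == lhs:
                              new_items.add((lhs2, rhs2, dot2 + 1, k, j))
                  if not new_items <= chart:
                      chart |= new_items
                      changed = True
          # accept if some rule for the starting symbol is fully recognized over (0, n)
          return any(lhs == start and dot == len(rhs) and (i, j) == (0, n)
                     for (lhs, rhs, dot, i, j) in chart)

      print(earley(list("aabb")))   # True
      print(earley(list("aab")))    # False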



Learning goals

  1. Be able to explain top-down and bottom-up parsing and how the parsing tree is built in each case
  2. Know the basic idea of Earley's algorithm


Exam-like questions and problems

  1. Given a CFG and a string, give a top-down derivation and build the parse tree in a top-down manner.
  2. Given a CFG and a string, give the derivation from the string to the starting symbol of the grammar
    and build the parse tree in a bottom-up manner.


Created by Lydia Sinapova