CS374: Programming Language Principles - Parsers and Interpreters

Activity Goals

The goals of this activity are:
  1. To remove left recursion and to left factor a grammar for LL(1) parsing, where possible
  2. To describe the components of an LL(k) and an LR(k) parser
  3. To implement a simple parser given a grammar, from code or using tools such as lex and yacc
  4. To generate a parse table for a given grammar

The Activity

Directions

Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity on the Class Activity Questions discussion board. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.

Model 1: An LL(1) Parser Implemented with Recursive Descent

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
// LL(1) means Left-To-Right scanning, Left-To-Right Parsing, and one character lookahead for parsing
/*
  expr: term addsub term
      | term addsub term addsub expr
      | term
  term: factor muldiv factor
      | factor muldiv factor muldiv term
      | factor
  factor: num
        | "(" expr ")"
  addsub: "+" | "-"
  muldiv: "*" | "/"
  num: [0-9]+
 */
 
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
 
int expr(void);
char token;
 
// remove the next character from the stream
char pop() {
    char ch = getchar();
    printf("Popped %c\n", ch);
    return ch;
}
 
// check the next character, but don't remove it from the stream
char peek() {
    char ch = pop();
    ungetc(ch, stdin);
    printf("Pushed %c\n", ch);
    return ch;
}
 
int readint() {
    int numchars = 0;
    char* val = (char*) malloc(1); // for the trailing \0
     
    token = peek();
    while(isdigit(token)) {
        pop();
         
        numchars = numchars + 1;
        val = (char*) realloc(val, numchars + 1); // for the trailing \0
         
        val[numchars-1] = token;
         
        token = peek();
    }
     
    val[numchars] = '\0'; // null terminate
     
    int result = atoi(val);
     
    printf("Read int from string %s (length %d): %d\n", val, strlen(val), result);
     
    free(val);
     
    return result;
}
 
int factor(void) {
    printf("In factor\n");
     
    int value;
    token = peek();
 
    if (token == '(') { // resolve parens as a whole new expression, resolving to a value, regardless of where they occur in the input (since they are highest precedence)
        pop(); // (
        value = expr();
        token = pop(); // )
 
        if(token != ')') {
            printf("Error parsing factor, imbalanced parenthesis\n");
            exit(-1);
        }
    } else if (isdigit(token)) {
        value = readint();
    } else {
        printf("Error parsing factor\n");
    }
     
    printf("Factor value: %d\n", value);
 
    return value;
}
 
int term(void) {
    printf("In term\n");
     
    // get the mandatory first factor
    int value = factor();
    token = peek();
 
    // now handle the optional second factor
    if(token == '*') {
        pop(); // *
        value *= factor();      
    } else if(token == '/') {
        pop(); // /
        value /= factor();
    } // these are optional in the grammar, so no error if this is not found
 
    token = peek();
     
    // now handle the optional third term
    // since we're multiplying now, we only want to consider other multiplications and divisions, so that we don't accidentally add or subtract out of order.  the term grammar rule only has * and / in it, so that's safe.  We've built order of operations into our grammar!  It seems backwards, since we're worried about only multiplying and dividing in groups, but the expr's are "at the top" and will parse the + and - expressions once they resolve at the bottom.  That means these bottom "terms" resolve in their entirety first to a number that we *then* add together or subtract.
    // because there is no left recursion, we always know what to do next
    // but there is a common left prefix that we can try to simplify
    if(token == '*') {
        pop(); // *
        value *= term();      
    } else if(token == '/') {
        pop(); // /
        value /= term();
    } // these are optional in the grammar, so no error if this is not found   
 
    printf("Term value: %d\n", value);
     
    return value;
}
 
int expr() {
    printf("In expr\n");
     
    // get the mandatory first term
    int value = term();
    token = peek();
 
    // now handle the optional second term
    if(token == '+') {
        pop(); // +
        value += term();       
    } else if(token == '-') {
        pop(); // -
        value -= term();
    } // these are optional in the grammar, so no error if this is not found
 
    token = peek();
     
    // now handle the optional third expr
    // expr's have + and - in them, so it's safe to keep doing those since we're only adding and subtracting here (no risk of doing a multiplication)
    if(token == '+') {
        pop(); // +
        value += expr();       
    } else if(token == '-') {
        pop(); // -
        value -= expr();
    } // these are optional in the grammar, so no error if this is not found
     
    printf("Expression value: %d\n", value);
     
    return value;
}
 
int main(void) {
    token = peek();
    int result = expr();
    printf("Result: %d\n", result);
 
    return 0;
}

Questions

  1. Try this example with some sample calculations.
  2. An LL(1) parser requires a grammar with no left recursion, and with unique nonterminals starting each production. Are these requirements met? If not, on what inputs will this parser fail? How can we fix it?

Model 2: An Alternative Grammar

1
2
3
4
5
6
7
8
9
10
11
/*
 expr: term expr2
 expr2: addsub expr | null
 term: factor term2
 term2: muldiv term | null
 factor: num
       | "(" expr ")"
 addsub: "+" | "-"
 muldiv: "*" | "/"
 num: [0-9]+
*/

Questions

  1. What is different about this grammar from the original one?

Model 3: Parsing with yacc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
%{
 
#include <stdio.h>
#include <stdlib.h>
 
extern int yylex();
extern int yyparse();
extern FILE* yyin;
 
void yyerror(const char* s);
%}
 
%union {
    int ival;
}
 
%token<ival> T_INT
%token T_PLUS T_MINUS T_MULTIPLY T_DIVIDE T_LEFT T_RIGHT T_NEWLINE
 
%type<ival> expr term factor
%start calc
 
%%
 
calc:                               { printf("EOF\n"); exit(0); }
    | expr T_NEWLINE                { printf("%d\n", $1); exit(0); } // no return value needed, no type
 
expr: term T_PLUS expr              { $$ = $1 + $3; }
    | term T_MINUS expr             { $$ = $1 - $3; }
    | term                          { $$ = $1; }
     
term: factor T_MULTIPLY term        { $$ = $1 * $3; }
    | factor T_DIVIDE term          { $$ = $1 / $3; }
    | factor                        { $$ = $1; }
     
factor: T_INT                       { $$ = $1; }            // just resolve the value!
      | T_LEFT expr T_RIGHT         { $$ = $2; }            // expr will resolve to a value!
       
%%
 
int main() {
    yyin = stdin;
     
    do {
        yyparse();
    } while(!feof(yyin));
}
 
void yyerror(const char* s) {
    fprintf(stderr, "Parse error: %s\n", s);
    exit(1);
}
 
// the token types are as defined in the %union above

Questions

  1. How would you revise this to support the corrected LL(1) grammar?

Model 4: Generating an LR(0) Parse Table

Begin with a grammar definition:
1
2
3
4
/*
   S -> Xyx
   X -> xX | y
*/

Augment with a clean starting state:
1
2
3
4
5
/*
   S' -> S
   S -> Xyx
   X -> xX | y
*/

State 0 is the S' -> .S initial state plus the closure, . indicates the S hasn't been read yet. Closure includes whatever can be produced by the production (plus any productions that can be executed from those - as indicated by a dot to the left of the nonterminal).
1
2
3
4
5
6
/*
   State 0:
       S' -> .S
       S -> .Xyx
       X -> .xX | .y
*/

Now, transition from State 0 on each of the possible productions in the closure:
  • State 0 has 4 outputs upon reading S, X, x, and y. State 1: State 0 -> State 1 on reading S
    Move the dot to the right of the production that results from reading this transition, and perform the closure on productions that follow the dot (here, there are none). This state contains S' -> S.
  • State 2: State 0 -> State 2 on reading X
    S -> X.yx
  • State 3: State 0 -> State 3 on reading x
    X -> x.X; X -> .xX | .y
  • State 4: State 0 -> State 4 on reading y
    X -> y.
  • States 2 and 3 have additional outputs, so we proceed until the dot is all the way to the right. State 5: State 2 -> State 5 on reading y
    S -> Xy.x
  • State 6: State 3 -> State 6 on reading X
    X -> xX.
  • State 3 -> State 3 on reading x. Notice the productions in the closure are the same as in state 3.
    X -> x.X; X -> .xX | .y
  • Similarly, State 3 -> State 4 on reading y
  • Now, state 5 can continue - State 7: State 5 -> State 7 on reading x
    S -> Xyx.

Now the parse table: one state row per state above (0 through 7) ACTION is the set of terminals plus $ meaning the end of the string, and GOTO is the set of nonterminals besides the augmented first state S' -> S Fill in what state you go to if you read that terminal or nonterminal from that state. For example, state 0 goes to 1 on S, and it goes to 2 on X, per the rules at the top. Go to state 3 on x, and state 4 on y, and we call these shifts because the dot is still to the left of the end of the production, indicating a shift, as opposed to a reduce after the dot reaches the end of the production. The augmented state (state 1) accepts on end of string. The remaining states (4, 6, 7) have dots on the right. These are our reduce states. States 4 and 6 produce X, while state 7 produces S. State 4 is X -> y., State 6 is X -> xX., and State 7 is S -> Xyx.. We reduce these state rows to their corresponding production. X was initiated by state 2, due to its X. in the production, so we reduce on all inputs to r2 in state 4 and state 6. State 1 contains S. via S' -> S., so this is an r1 from state 7.
STATE ACTION x y $ GOTO S X
0 s3 s4 1 2
1 acc
2 s5
3 s3 s4 6
4 r2 r2 r2
5 s7
6 r2 r2 r2
7 r1 r1 r1

This can be programmed using an if statement (for example: if state == 0 and input == x, then state = 3, etc.). On reductions, you've read a whole statement. By shifting the individual tokens, you also know what inputs were provided to those productions. This is what tools like yacc do.

Questions

  1. Draw the state machine (the states and transitions) implemented by this LR(0) parse table.
  2. Augment the grammar and generate an LR(0) parse table for the grammar S -> XX; X -> xXy; X -> x; X -> y

Model 5: LR(0) Parse Tables

Wikipedia
Curr Lookahead LHS Goto
State Current Rules int id * + eof Sums Products Value
0 Goal → Sums eof 8 9 1 4 7
1 Goal → Sums eof
Sums → Sums + Products
2 done
2 Sums → Sums + Products 8 9 3 7
3 Sums → Sums + Products
Products → Products * Value
5 r1
r1
4 Sums → Products
Products → Products * Value
5 r2
r2
5 Products → Products * Value 8 9 6
6 Products → Products * Value r3 r3 r3
7 Products → Value r4 r4 r4
8 Value → int r5 r5 r5
9 Value → id r6 r6 r6

Questions

  1. Click on the image to see the steps involved with parsing A*2 + 1.
  2. Describe these steps in terms of the LR(0) parse table.
  3. What does the L, R, and 0 refer to in LR(0)?

Submission

Submit your answers to the questions using the Class Activity Questions discussion board. You may also respond to questions or comments made by others, or ask follow-up questions there. Answer any reflective prompt questions in the Reflective Journal section of your OneNote Classroom personal section.