CS374: Programming Language Principles - Parsers and Interpreters

Activity Goals

The goals of this activity are:

To explain the limitations of an LL(1) parser
To remove left recursion and to left factor a grammar for LL(1) parsing, where possible

The Activity

Directions

Consider the activity models and answer the questions provided. First reflect on these questions on your own briefly, before discussing and comparing your thoughts with your group. Appoint one member of your group to discuss your findings with the class, and the rest of the group should help that member prepare their response. Answer each question individually from the activity, and compare with your group to prepare for our whole-class discussion. After class, think about the questions in the reflective prompt and respond to those individually in your notebook. Report out on areas of disagreement or items for which you and your group identified alternative approaches. Write down and report out questions you encountered along the way for group discussion.

Model 1: An LL(1) Parser Implemented with Recursive Descent

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

// LL(1) means Left-To-Right scanning, Left-To-Right Parsing, and one character lookahead for parsing
// https://gist.github.com/mlabbe/81d667bac36aa60787fee60e3647a0a8
// https://github.com/meyerd/flex-bison-example
/*
  expr: term addsub term 
      | term addsub term addsub expr
      | term
  term: factor muldiv factor
      | factor muldiv factor muldiv term
      | factor
  factor: num
        | "(" expr ")"
  addsub: "+" | "-"
  muldiv: "*" | "/"
  num: [0-9]+
 */
 
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
 
int expr(void);
char token;
 
// remove the next character from the stream
char pop() {
    char ch = getchar();
    printf("Popped %c\n", ch);
    return ch;
}
 
// check the next character, but don't remove it from the stream
char peek() {
    char ch = pop();
    ungetc(ch, stdin);
    printf("Pushed %c\n", ch);
    return ch;
}
 
int readint() {
    int numchars = 0;
    char* val = (char*) malloc(1); // for the trailing \0
     
    token = peek();
    while(isdigit(token)) {
        pop(); 
         
        numchars = numchars + 1;
        val = (char*) realloc(val, numchars + 1); // for the trailing \0
         
        val[numchars-1] = token;
         
        token = peek();
    }
     
    val[numchars] = '\0'; // null terminate
     
    int result = atoi(val);
     
    printf("Read int from string %s (length %d): %d\n", val, strlen(val), result);
     
    free(val);
     
    return result;
}
 
int factor(void) {
    printf("In factor\n");
     
    int value;
    token = peek();
 
    if (token == '(') { // resolve parens as a whole new expression, resolving to a value, regardless of where they occur in the input (since they are highest precedence)
        pop(); // (
        value = expr();
        token = pop(); // )
 
        if(token != ')') {
            printf("Error parsing factor, imbalanced parenthesis\n");
            exit(-1);
        }
    } else if (isdigit(token)) {
        value = readint();
    } else {
        printf("Error parsing factor\n");
    }
     
    printf("Factor value: %d\n", value);
 
    return value;
}
 
int term(void) {
    printf("In term\n");
     
    // get the mandatory first factor
    int value = factor();
    token = peek();
 
    // now handle the optional second factor
    if(token == '*') {
        pop(); // *
        value *= factor();       
    } else if(token == '/') {
        pop(); // /
        value /= factor();
    } // these are optional in the grammar, so no error if this is not found
 
    token = peek();
     
    // now handle the optional third term
    // since we're multiplying now, we only want to consider other multiplications and divisions, so that we don't accidentally add or subtract out of order.  the term grammar rule only has * and / in it, so that's safe.  We've built order of operations into our grammar!  It seems backwards, since we're worried about only multiplying and dividing in groups, but the expr's are "at the top" and will parse the + and - expressions once they resolve at the bottom.  That means these bottom "terms" resolve in their entirety first to a number that we *then* add together or subtract.
    // because there is no left recursion, we always know what to do next
    // but there is a common left prefix that we can try to simplify
    if(token == '*') {
        pop(); // *
        value *= term();       
    } else if(token == '/') {
        pop(); // /
        value /= term();
    } // these are optional in the grammar, so no error if this is not found    
 
    printf("Term value: %d\n", value);
     
    return value;
}
 
int expr() {
    printf("In expr\n");
     
    // get the mandatory first term
    int value = term();
    token = peek();
 
    // now handle the optional second term
    if(token == '+') {
        pop(); // +
        value += term();        
    } else if(token == '-') {
        pop(); // -
        value -= term();
    } // these are optional in the grammar, so no error if this is not found
 
    token = peek();
     
    // now handle the optional third expr
    // expr's have + and - in them, so it's safe to keep doing those since we're only adding and subtracting here (no risk of doing a multiplication)
    if(token == '+') {
        pop(); // +
        value += expr();        
    } else if(token == '-') {
        pop(); // -
        value -= expr();
    } // these are optional in the grammar, so no error if this is not found
     
    printf("Expression value: %d\n", value);
     
    return value;
}
 
int main(void) {
    token = peek();
    int result = expr();
    printf("Result: %d\n", result);
 
    return 0;
}

Questions

Try this example with some sample calculations.
An LL(1) parser requires a grammar with no left recursion, and with unique nonterminals starting each production. Are these requirements met? If not, on what inputs will this parser fail? How can we fix it?
Typically, a scanner functions pops each token, stores it in a variable, and returns the type of constant read. Modify this program to include a function addsub that pops a token and returns a constant representing the type of token found (either a '+' or a '-' character).
Move the scanning functions to a separate .c file with a corresponding header, and include that header from the scanner code module and main parser code module. Don't forget to use an include guard to ensure that the file is only included once during compilation.

Submission

I encourage you to submit your answers to the questions (and ask your own questions!) using the Class Activity Questions discussion board. You may also respond to questions or comments made by others, or ask follow-up questions there. Answer any reflective prompt questions in the Reflective Journal section of your OneNote Classroom personal section. You can find the link to the class notebook on the syllabus.