Writing my own interpreter for Lox, Part 2 - Parsing Expressions
Previously I have written a scanner for the Lox language. I will continue working through the Crafting Interpreters book, specifically chapters 5 and 6 on Representing Code and Parsing Expressions.
In this post, I work through how code can be represented in a richer format than as a list of Token
s. Specifically, we look at Abstract Syntax Trees (AST) representation and how we can translate a series of Tokens in Lox to ASTs. We will only focus on expressions in this chapter but from what I gather it isn't very different for statements and other constructs.
Representing Code
When building the scanner, we relied on an implicit grammar that dictates how individual characters in source code are grouped into tokens. For Lox, this lexical grammar can be described using a regular language. To define richer representations, like complex nested expressions, we need something more powerful (than regular grammars). Using such representations we can define clearly what strings are allowed and what aren't and more importantly we can generate parse trees as a representation of the source code.
Context-Free Grammars (CGF)
I'm already somewhat familiar with CFGs so I will only highlight what I think are the essential aspects of it:
- They are simple enough to have efficient parsing algorithms i.e., finding whether and how a string is derivable in a CFG.
- They are complex enough to represent the syntax (grammatical structure) of most programming languages. For example, check here's the CFG representation of the C language.
I think this table really helps:
Terminology | Lexical Grammar | Syntactic Grammar |
---|---|---|
Alphabet | Characters in source code | Lexemes / Tokens |
Strings | Lexemes / Tokens | Expression / Statement / Code Block |
Formal Grammar | Regular (for Lox) | Context Free (for Lox) |
Stage of the compiler | Scanner | Parser |
Representation | List | Tree (syntax tree / parse tree) |
Grammar for Expressions in Lox
The syntactic grammar is very complex compared to the lexical grammar, so for now, we just deal with certain expressions: Literals, Unary and Binary Expressions, and Parentheses.
The grammar for these expressions is easy to state and understand:
expresion → literal | unary | binary | grouping
literal → NUMBER | STRING | "true" | "false" | "nil"
grouping → "(" expression ")"
unary → ("-" | "!" ) expression
binary → expression operator expression
operator → "==" | "!=" | "<" | ">" | "<=" | ">=" | "+" | "-" | "*" | "/"
Abstract Syntax Trees
Trees as a representation makes sense given our grammar is recursive. We define Expr
abstract base class and derive other classes from it: Unary
, Binary
etc.
class Expr(ABC):
pass
class Binary(Expr):
def __init__(self, left: Expr, operator: Token, right: Expr):
self.left = left
self.operator = operator
self.right = right
class Grouping(Expr):
def __init__(self, expression: Expr):
self.expression = expression
class Literal(Expr):
def __init__(self, value):
self.value = value
class Unary(Expr):
def __init__(self, operator: Token, right: Expr):
self.operator = operator
self.right = right
Now, just like how we had to handle different types of Tokens, we must now think of handling different kind of Expressions. At runtime, they will be evaluated differently based on the type of expression they are.
- Long chain of if else type testing is slow because expression types which come later in the chain would take longer to execute.
- Oh My OOPness! Why don't we define a method like
evaluate
and have each expression class implement it? Because the expression/tree classes span a few stages of the compiler, and if we add methods to the class relevant for every stage we are mixing concerns.
The book explains this "expression problem" and how the Visitor pattern, which approximates functional-style in an OOP language, is an appropriate solution.
Visitor Pattern and the AST Printer
Here's the code for the abstract visitor class for Expr
and a corresponding accept method for each class the extends Expr
:
class ExprVisitor(ABC):
def visit_binary(expr):
raise NotImplementedError()
def visit_grouping(expr):
raise NotImplementedError()
def visit_literal(expr):
raise NotImplementedError()
def visit_unary(expr):
raise NotImplementedError()
class Expr(ABC):
def accept(self, visitor: ExprVisitor):
raise NotImplementedError()
class Binary(Expr):
def __init__(self, left: Expr, operator: Token, right: Expr):
self.left = left
self.operator = operator
self.right = right
def accept(self, visitor: ExprVisitor):
return visitor.visit_binary(self)
class Grouping(Expr):
def __init__(self, expression: Expr):
self.expression = expression
def accept(self, visitor: ExprVisitor):
return visitor.visit_grouping(self)
class Literal(Expr):
def __init__(self, value):
self.value = value
def accept(self, visitor: ExprVisitor):
return visitor.visit_literal(self)
class Unary(Expr):
def __init__(self, operator: Token, right: Expr):
self.operator = operator
self.right = right
def accept(self, visitor: ExprVisitor):
return visitor.visit_unary(self)
Now, to implement a printer for the AST, we write a ASTPrinter
class that extends this ExprVisitor
abstract class. There's nothing interesting here, except that I am trying to print friendly each expression type using the _parenthesize
function.
class AstPrinter(ExprVisitor):
def print(self, expr: Expr):
return expr.accept(self)
def visit_binary(self, expr: Binary):
return self._parenthesize(expr.operator.lexeme, [expr.left, expr.right])
def visit_grouping(self, expr: Grouping):
return self._parenthesize("group", [expr.expression])
def visit_literal(self, expr: Literal):
if isinstance(expr.value, bool):
return 'true' if expr.value else 'false'
if isinstance(expr.value, int) or isinstance(expr.value, float):
return str(expr.value)
return str(expr.value) if expr.value else 'nil'
def visit_unary(self, expr: Unary):
return self._parenthesize(expr.operator.lexeme, [expr.right])
def _parenthesize(self, name: str, exprs: list[Expr]) -> str:
builder = ["(", name]
for expr in exprs:
builder.append(" ")
builder.append(expr.accept(self))
builder.append(")")
return "".join(builder)
We can test it by printing an AST:
expression = Binary(
Unary(
Token(TokenType.MINUS, 1, "-", None),
Literal(123)
),
Token(TokenType.STAR, 1, "*", None),
Grouping(
Literal(45.67)
)
)
print(AstPrinter().print(expression))
The main takeaway is here how we are using the visitor pattern as an indirection to add functional style to OOP constructs.
Parsing Expressions
The expression grammar mentioned earlier is easy to understand and define. However, it's ambiguous: there exists a string with more than one parse tree. If we were just checking whether the series of tokens satisfy the syntax, then ambiguity wouldn't matter. But, we are trying to convert the sequence of tokens into a richer representation i.e., a parse tree. If more than one parse tree is possible for a given program then that can affect the meaning of the program.
In efforts to make it unambiguous, we define a separate rule for each precedence level and obtain the following unambiguous grammar. Showing each operator has well defined precedence and consistent associativity in the grammar is an effective way to argue an expression grammar is unambiguous.
expression → equality
equality → comparison ( ( "!=" | "==" ) comparison )*
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )*
term → factor ( ( "-" | "+" ) factor )*
factor → unary ( ( "/" | "*" ) unary )*
unary → ( "!" | "-" ) unary | primary
primary → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")"
Here's the Parser
class we will implement in the rest of the post. We initialise it with a list of tokens obtained from scanning phase, and uses some helper functions similar to the Scanner
class to perform parsing. The entry point would be the expression
function to produce a AST for a single expression.
class Parser:
def __init__(self, tokens: list[Token]):
self.tokens = tokens
self.current = 0
def is_at_end(self):
return self.peek().token_type == TokenType.EOF
def check(self, token_type: TokenType) -> bool:
if self.is_at_end():
return False
return self.peek().token_type == token_type
def advance(self):
if not self.is_at_end():
self.current += 1
def previous(self):
return self.tokens[self.current - 1]
def peek(self):
return self.tokens[self.current]
def match(self, *token_types) -> bool:
for token_type in token_types:
if self.check(token_type):
self.advance()
return True
return False
def consume(self, token_type, error_msg):
if self.check(token_type):
return self.advance()
# error handling tbd
def expression(self):
raise NotImplementedError()
Booleans, Nil, Number and String Literals
These literals are the simplest expressions to parse since they're just single tokens that map directly to literal values. For boolean literals, we need to handle the true
and false
keywords. When the parser encounters these tokens, it creates a Literal
expression node with the corresponding boolean value. So, in Parser
class, we add:
def primary(self):
if self.match(TokenType.FALSE):
return Literal(False)
if self.match(TokenType.TRUE):
return Literal(True)
if self.match(TokenType.NIL):
return Literal(None)
if self.match(TokenType.NUMBER, TokenType.STRING):
return Literal(self.previous().literal)
The nil
literal in Lox represents the absence of a value, similar to null
in other languages. In our Python implementation, we represent this as None
.
When these tokens are encountered during parsing, we simply consume the token and create a Literal
AST node with the appropriate value.
Parentheses
Parentheses are used for grouping expressions overriding the natural operator precedence. For example, in (1 + 2) * 3
, the parentheses force the addition to be evaluated before the multiplication.
From our grammar, parenthesized expressions are part of the primary
rule:
primary → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")"
Parsing parentheses involves three steps:
- Consume the opening
(
- Parse the expression inside the parentheses
- Consume the closing
)
We add this to our primary()
method:
def primary(self):
if self.match(TokenType.FALSE):
return Literal(False)
if self.match(TokenType.TRUE):
return Literal(True)
if self.match(TokenType.NIL):
return Literal(None)
if self.match(TokenType.NUMBER, TokenType.STRING):
return Literal(self.previous().literal)
if self.match(TokenType.LEFT_PAREN):
expr = self.expression()
self.consume(TokenType.RIGHT_PAREN, "Expect ')' after expression.")
return Grouping(expr)
# Handle error case - unexpected token
The key insight here is that parsing the inner expression is recursive: we call self.expression()
which will parse according to the full expression grammar. This means parentheses can contain any valid expression, including other parenthesized expressions.
The Grouping
AST node we return contains the inner expression. During evaluation, grouping simply evaluates its inner expression, they only affect the parsing structure and thus the order of operations.
Unary operators
From our grammar, unary expressions are defined as:
unary → ( "!" | "-" ) unary | primary
This grammar rule is interesting because it's recursive and right-associative. This means unary operators can be chained together like !!true
or --5
, and they associate from right to left. For example, --5
is parsed as -(-(5))
.
The implementation follows the grammar directly:
def unary(self):
if self.match(TokenType.BANG, TokenType.MINUS):
operator = self.previous()
right = self.unary()
return Unary(operator, right)
return self.primary()
Unary operators have higher precedence than binary operators, which is why unary
appears in the grammar hierarchy before the binary operator rules. This ensures that -a + b
is parsed as ((-a) + b)
rather than (-(a + b))
.
Arithmetic operators
Arithmetic operators are binary operators that perform mathematical operations. Lox supports four arithmetic operators: +
, -
, *
, and /
. These operators have different precedence levels: *
and /
have higher precedence than +
and -
.
From our grammar, arithmetic operations are split across two rules to handle precedence:
term → factor ( ( "-" | "+" ) factor )*
factor → unary ( ( "/" | "*" ) unary )*
Unlike unary operators which are right-associative, binary operators are left-associative. This means 1 - 2 - 3
is parsed as ((1 - 2) - 3)
rather than (1 - (2 - 3))
.
The implementation uses a common pattern for parsing left-associative binary operators:
def term(self):
expr = self.factor()
while self.match(TokenType.MINUS, TokenType.PLUS):
operator = self.previous()
right = self.factor()
expr = Binary(expr, operator, right)
return expr
def factor(self):
expr = self.unary()
while self.match(TokenType.SLASH, TokenType.STAR):
operator = self.previous()
right = self.unary()
expr = Binary(expr, operator, right)
return expr
This approach handles left-associativity because each new operator becomes the root of a new Binary node, with the previous expression as its left child. The precedence is handled by the fact that term()
calls factor()
, ensuring multiplication and division are parsed before addition and subtraction.
Comparison operators
Lox supports four comparison operators: >
, >=
, <
, and <=
. These operators have lower precedence than arithmetic operators, meaning 1 + 2 > 3
is parsed as ((1 + 2) > 3)
.
From our grammar:
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )*
Like arithmetic operators, comparison operators are left-associative. While expressions like 1 < 2 < 3
are syntactically valid and parse as ((1 < 2) < 3)
, they don't behave as mathematical notation since the first comparison returns a boolean.
The implementation follows the same pattern as arithmetic operators:
def comparison(self):
expr = self.term()
while self.match(TokenType.GREATER, TokenType.GREATER_EQUAL,
TokenType.LESS, TokenType.LESS_EQUAL):
operator = self.previous()
right = self.term()
expr = Binary(expr, operator, right)
return expr
Since comparison()
calls term()
for its operands, arithmetic expressions are parsed with higher precedence. This ensures that complex expressions like a + b > c * d
are correctly parsed as ((a + b) > (c * d))
.
Equality operators
Lox supports two equality operators: ==
(equal) and !=
(not equal). These operators have the lowest precedence among all binary operators, meaning they are evaluated last in an expression.
From our grammar:
equality → comparison ( ( "!=" | "==" ) comparison )*
This low precedence means that 1 + 2 == 3 * 1
is parsed as ((1 + 2) == (3 * 1))
, allowing complex arithmetic and comparison expressions on both sides of the equality check.
Like other binary operators, equality operators are left-associative. Expressions like a == b != c
parse as ((a == b) != c)
, though such chaining is rarely meaningful since the first equality returns a boolean.
The implementation is straightforward:
def expression(self):
return self.equality()
def equality(self):
expr = self.comparison()
while self.match(TokenType.BANG_EQUAL, TokenType.EQUAL_EQUAL):
operator = self.previous()
right = self.comparison()
expr = Binary(expr, operator, right)
return expr
The expression()
method serves as the entry point for parsing expressions. This completes our precedence hierarchy: expression
→ equality
→ comparison
→ term
→ factor
→ unary
→ primary
.
Syntactic errors
First, I will have a parser_error
helper function that takes a token that we couldn't parse and a message and reports it to stderr
.
def parser_error(token: Token, msg):
if token.token_type == TokenType.EOF:
report(token.line, " at end", msg)
else:
report(token.line, f" at '{token.lexeme}'", msg)
Then, I define a ParseError
exception type with a static method that helps create an Exception
.
class ParseError(Exception):
def __init__(self, msg):
super().__init__(msg)
@staticmethod
def error(token: Token, msg: str):
parser_error(token, msg)
return ParseError(msg)
Finally, in the main script, I invoke parsing as follows:
try:
expr = parser.expression()
if expr:
print(AstPrinter().print(expr))
except Exception:
exit(65)
So far, we have defined the expression grammar for Lox, implemented a recursive descent parser (recursively parses starting from the top rule in the grammar), and used Visitor design pattern to implement AstPrinter
. Next time, we will actually evaluate the expressions we are able to parse this way.