Writing my own interpreter for Lox, Part 2 - Parsing Expressions

13 Jul, 2025

Previously I have written a scanner for the Lox language. I will continue working through the Crafting Interpreters book, specifically chapters 5 and 6 on Representing Code and Parsing Expressions.

In this post, I work through how code can be represented in a richer format than as a list of Tokens. Specifically, we look at Abstract Syntax Trees (AST) representation and how we can translate a series of Tokens in Lox to ASTs. We will only focus on expressions in this chapter but from what I gather it isn't very different for statements and other constructs.

Representing Code

When building the scanner, we relied on an implicit grammar that dictates how individual characters in source code are grouped into tokens. For Lox, this lexical grammar can be described using a regular language. To define richer representations, like complex nested expressions, we need something more powerful (than regular grammars). Using such representations we can define clearly what strings are allowed and what aren't and more importantly we can generate parse trees as a representation of the source code.

Context-Free Grammars (CGF)

I'm already somewhat familiar with CFGs so I will only highlight what I think are the essential aspects of it:

They are simple enough to have efficient parsing algorithms i.e., finding whether and how a string is derivable in a CFG.
They are complex enough to represent the syntax (grammatical structure) of most programming languages. For example, check here's the CFG representation of the C language.

I think this table really helps:

Terminology	Lexical Grammar	Syntactic Grammar
Alphabet	Characters in source code	Lexemes / Tokens
Strings	Lexemes / Tokens	Expression / Statement / Code Block
Formal Grammar	Regular (for Lox)	Context Free (for Lox)
Stage of the compiler	Scanner	Parser
Representation	List	Tree (syntax tree / parse tree)

Grammar for Expressions in Lox

The syntactic grammar is very complex compared to the lexical grammar, so for now, we just deal with certain expressions: Literals, Unary and Binary Expressions, and Parentheses.

The grammar for these expressions is easy to state and understand:

expresion       → literal | unary | binary | grouping
literal         → NUMBER | STRING | "true" | "false" | "nil"
grouping        → "(" expression ")"
unary           → ("-" | "!" ) expression
binary          → expression operator expression
operator        → "==" | "!=" | "<" | ">" | "<=" | ">=" | "+" | "-" | "*" | "/"

Abstract Syntax Trees

Trees as a representation makes sense given our grammar is recursive. We define Expr abstract base class and derive other classes from it: Unary, Binary etc.

class Expr(ABC):
    pass

class Binary(Expr):
    def __init__(self, left: Expr, operator: Token, right: Expr):
        self.left = left
        self.operator = operator
        self.right = right
        
class Grouping(Expr):
    def __init__(self, expression: Expr):
        self.expression = expression
    
class Literal(Expr):
    def __init__(self, value):
        self.value = value
        
class Unary(Expr):
    def __init__(self, operator: Token, right: Expr):
        self.operator = operator
        self.right = right

Now, just like how we had to handle different types of Tokens, we must now think of handling different kind of Expressions. At runtime, they will be evaluated differently based on the type of expression they are.

Long chain of if else type testing is slow because expression types which come later in the chain would take longer to execute.
Oh My OOPness! Why don't we define a method like evaluate and have each expression class implement it? Because the expression/tree classes span a few stages of the compiler, and if we add methods to the class relevant for every stage we are mixing concerns.

The book explains this "expression problem" and how the Visitor pattern, which approximates functional-style in an OOP language, is an appropriate solution.

Visitor Pattern and the AST Printer

Here's the code for the abstract visitor class for Expr and a corresponding accept method for each class the extends Expr:

class ExprVisitor(ABC):    
    def visit_binary(expr):
        raise NotImplementedError()
    
    def visit_grouping(expr):
        raise NotImplementedError()
    
    def visit_literal(expr):
        raise NotImplementedError()
    
    def visit_unary(expr):
        raise NotImplementedError()

class Expr(ABC):
    def accept(self, visitor: ExprVisitor):
        raise NotImplementedError()

class Binary(Expr):
    def __init__(self, left: Expr, operator: Token, right: Expr):
        self.left = left
        self.operator = operator
        self.right = right
    
    def accept(self, visitor: ExprVisitor):
        return visitor.visit_binary(self)
        
class Grouping(Expr):
    def __init__(self, expression: Expr):
        self.expression = expression
    
    def accept(self, visitor: ExprVisitor):
        return visitor.visit_grouping(self)
    
class Literal(Expr):
    def __init__(self, value):
        self.value = value
        
    def accept(self, visitor: ExprVisitor):
        return visitor.visit_literal(self)
        
class Unary(Expr):
    def __init__(self, operator: Token, right: Expr):
        self.operator = operator
        self.right = right
    
    def accept(self, visitor: ExprVisitor):
        return visitor.visit_unary(self)

Now, to implement a printer for the AST, we write a ASTPrinter class that extends this ExprVisitor abstract class. There's nothing interesting here, except that I am trying to print friendly each expression type using the _parenthesize function.

class AstPrinter(ExprVisitor):
    def print(self, expr: Expr):
        return expr.accept(self)
    
    def visit_binary(self, expr: Binary):
        return self._parenthesize(expr.operator.lexeme, [expr.left, expr.right])
    
    def visit_grouping(self, expr: Grouping):
        return self._parenthesize("group", [expr.expression])
    
    def visit_literal(self, expr: Literal):
        if isinstance(expr.value, bool):
            return 'true' if expr.value else 'false'
        if isinstance(expr.value, int) or isinstance(expr.value, float):
            return str(expr.value)
        return str(expr.value) if expr.value else 'nil'
    
    def visit_unary(self, expr: Unary):
        return self._parenthesize(expr.operator.lexeme, [expr.right])
    
    def _parenthesize(self, name: str, exprs: list[Expr]) -> str:
        builder = ["(", name]
        for expr in exprs:
            builder.append(" ")
            builder.append(expr.accept(self))
        builder.append(")")
        return "".join(builder)

We can test it by printing an AST:

expression = Binary(
    Unary(
        Token(TokenType.MINUS, 1, "-", None),
        Literal(123)
    ),
    Token(TokenType.STAR, 1, "*", None),
    Grouping(
        Literal(45.67)
    )
)
print(AstPrinter().print(expression))

The main takeaway is here how we are using the visitor pattern as an indirection to add functional style to OOP constructs.

Parsing Expressions

The expression grammar mentioned earlier is easy to understand and define. However, it's ambiguous: there exists a string with more than one parse tree. If we were just checking whether the series of tokens satisfy the syntax, then ambiguity wouldn't matter. But, we are trying to convert the sequence of tokens into a richer representation i.e., a parse tree. If more than one parse tree is possible for a given program then that can affect the meaning of the program.

In efforts to make it unambiguous, we define a separate rule for each precedence level and obtain the following unambiguous grammar. Showing each operator has well defined precedence and consistent associativity in the grammar is an effective way to argue an expression grammar is unambiguous.

expression     → equality
equality       → comparison ( ( "!=" | "==" ) comparison )*
comparison     → term ( ( ">" | ">=" | "<" | "<=" ) term )*
term           → factor ( ( "-" | "+" ) factor )*
factor         → unary ( ( "/" | "*" ) unary )*
unary          → ( "!" | "-" ) unary | primary
primary        → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")"

Here's the Parser class we will implement in the rest of the post. We initialise it with a list of tokens obtained from scanning phase, and uses some helper functions similar to the Scanner class to perform parsing. The entry point would be the expression function to produce a AST for a single expression.

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens
        self.current = 0
        
    def is_at_end(self):
        return self.peek().token_type == TokenType.EOF
        
    def check(self, token_type: TokenType) -> bool:
        if self.is_at_end():
            return False
        return self.peek().token_type == token_type

    def advance(self):
        if not self.is_at_end():
            self.current += 1
    
    def previous(self):
        return self.tokens[self.current - 1]
    
    def peek(self):
        return self.tokens[self.current]
    
    def match(self, *token_types) -> bool:
        for token_type in token_types:
            if self.check(token_type):
                self.advance()
                return True
        return False        
        
    def consume(self, token_type, error_msg):
        if self.check(token_type):
            return self.advance()

        # error handling tbd
    
    def expression(self):
        raise NotImplementedError()

Booleans, Nil, Number and String Literals

These literals are the simplest expressions to parse since they're just single tokens that map directly to literal values. For boolean literals, we need to handle the true and false keywords. When the parser encounters these tokens, it creates a Literal expression node with the corresponding boolean value. So, in Parser class, we add:

def primary(self):
    if self.match(TokenType.FALSE):
        return Literal(False)
    
    if self.match(TokenType.TRUE):
        return Literal(True)
    
    if self.match(TokenType.NIL):
        return Literal(None)
    
    if self.match(TokenType.NUMBER, TokenType.STRING):
        return Literal(self.previous().literal)

The nil literal in Lox represents the absence of a value, similar to null in other languages. In our Python implementation, we represent this as None.

When these tokens are encountered during parsing, we simply consume the token and create a Literal AST node with the appropriate value.

Parentheses

Parentheses are used for grouping expressions overriding the natural operator precedence. For example, in (1 + 2) * 3, the parentheses force the addition to be evaluated before the multiplication.

From our grammar, parenthesized expressions are part of the primary rule:

primary → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")"

Parsing parentheses involves three steps:

Consume the opening (
Parse the expression inside the parentheses
Consume the closing )

We add this to our primary() method:

def primary(self):
    if self.match(TokenType.FALSE):
        return Literal(False)
    
    if self.match(TokenType.TRUE):
        return Literal(True)
    
    if self.match(TokenType.NIL):
        return Literal(None)
    
    if self.match(TokenType.NUMBER, TokenType.STRING):
        return Literal(self.previous().literal)
    
    if self.match(TokenType.LEFT_PAREN):
        expr = self.expression()
        self.consume(TokenType.RIGHT_PAREN, "Expect ')' after expression.")
        return Grouping(expr)
    
    # Handle error case - unexpected token

The key insight here is that parsing the inner expression is recursive: we call self.expression() which will parse according to the full expression grammar. This means parentheses can contain any valid expression, including other parenthesized expressions.

The Grouping AST node we return contains the inner expression. During evaluation, grouping simply evaluates its inner expression, they only affect the parsing structure and thus the order of operations.

Unary operators

From our grammar, unary expressions are defined as:

unary → ( "!" | "-" ) unary | primary

This grammar rule is interesting because it's recursive and right-associative. This means unary operators can be chained together like !!true or --5, and they associate from right to left. For example, --5 is parsed as -(-(5)).

The implementation follows the grammar directly:

def unary(self):
    if self.match(TokenType.BANG, TokenType.MINUS):
        operator = self.previous()
        right = self.unary()
        return Unary(operator, right)
    
    return self.primary()

Unary operators have higher precedence than binary operators, which is why unary appears in the grammar hierarchy before the binary operator rules. This ensures that -a + b is parsed as ((-a) + b) rather than (-(a + b)).

Arithmetic operators

Arithmetic operators are binary operators that perform mathematical operations. Lox supports four arithmetic operators: +, -, *, and /. These operators have different precedence levels: * and / have higher precedence than + and -.

From our grammar, arithmetic operations are split across two rules to handle precedence:

term   → factor ( ( "-" | "+" ) factor )*
factor → unary ( ( "/" | "*" ) unary )*

Unlike unary operators which are right-associative, binary operators are left-associative. This means 1 - 2 - 3 is parsed as ((1 - 2) - 3) rather than (1 - (2 - 3)).

The implementation uses a common pattern for parsing left-associative binary operators:

def term(self):
    expr = self.factor()
    
    while self.match(TokenType.MINUS, TokenType.PLUS):
        operator = self.previous()
        right = self.factor()
        expr = Binary(expr, operator, right)
    
    return expr

def factor(self):
    expr = self.unary()
    
    while self.match(TokenType.SLASH, TokenType.STAR):
        operator = self.previous()
        right = self.unary()
        expr = Binary(expr, operator, right)
    
    return expr

This approach handles left-associativity because each new operator becomes the root of a new Binary node, with the previous expression as its left child. The precedence is handled by the fact that term() calls factor(), ensuring multiplication and division are parsed before addition and subtraction.

Comparison operators

Lox supports four comparison operators: >, >=, <, and <=. These operators have lower precedence than arithmetic operators, meaning 1 + 2 > 3 is parsed as ((1 + 2) > 3).

From our grammar:

comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )*

Like arithmetic operators, comparison operators are left-associative. While expressions like 1 < 2 < 3 are syntactically valid and parse as ((1 < 2) < 3), they don't behave as mathematical notation since the first comparison returns a boolean.

The implementation follows the same pattern as arithmetic operators:

def comparison(self):
    expr = self.term()
    
    while self.match(TokenType.GREATER, TokenType.GREATER_EQUAL, 
                     TokenType.LESS, TokenType.LESS_EQUAL):
        operator = self.previous()
        right = self.term()
        expr = Binary(expr, operator, right)
    
    return expr

Since comparison() calls term() for its operands, arithmetic expressions are parsed with higher precedence. This ensures that complex expressions like a + b > c * d are correctly parsed as ((a + b) > (c * d)).

Equality operators

Lox supports two equality operators: == (equal) and != (not equal). These operators have the lowest precedence among all binary operators, meaning they are evaluated last in an expression.

From our grammar:

equality → comparison ( ( "!=" | "==" ) comparison )*

This low precedence means that 1 + 2 == 3 * 1 is parsed as ((1 + 2) == (3 * 1)), allowing complex arithmetic and comparison expressions on both sides of the equality check.

Like other binary operators, equality operators are left-associative. Expressions like a == b != c parse as ((a == b) != c), though such chaining is rarely meaningful since the first equality returns a boolean.

The implementation is straightforward:

def expression(self):
    return self.equality()

def equality(self):
    expr = self.comparison()
    
    while self.match(TokenType.BANG_EQUAL, TokenType.EQUAL_EQUAL):
        operator = self.previous()
        right = self.comparison()
        expr = Binary(expr, operator, right)
    
    return expr

The expression() method serves as the entry point for parsing expressions. This completes our precedence hierarchy: expression → equality → comparison → term → factor → unary → primary.

Syntactic errors

First, I will have a parser_error helper function that takes a token that we couldn't parse and a message and reports it to stderr.

def parser_error(token: Token, msg):
    if token.token_type == TokenType.EOF:
        report(token.line, " at end", msg)
    else:
        report(token.line, f" at '{token.lexeme}'", msg)

Then, I define a ParseError exception type with a static method that helps create an Exception.

class ParseError(Exception):
    def __init__(self, msg):
        super().__init__(msg)

    @staticmethod
    def error(token: Token, msg: str):
        parser_error(token, msg)
        return ParseError(msg)

Finally, in the main script, I invoke parsing as follows:

try:
    expr = parser.expression()
    if expr:
        print(AstPrinter().print(expr))
except Exception:
    exit(65)

So far, we have defined the expression grammar for Lox, implemented a recursive descent parser (recursively parses starting from the top rule in the grammar), and used Visitor design pattern to implement AstPrinter. Next time, we will actually evaluate the expressions we are able to parse this way.

#compilers #lox-interpreter #programming-languages #software-engineering