Sai Sasank Y

Writing my own interpreter for Lox, Part 1 - Scanner

Crafting Interpreters seems to be a high-quality, hands-on resource for learning about programming languages and their implementation. In this post, I implement a scanner for the Lox programming language in Python, writing my worklog as I go. The complete source code for this project is here.

Scanning - Overview

The first stage of compiling or interpreting a program is scanning (also known as lexing or tokenizing). This involves reading the source file and emitting tokens (meaningful groups of characters). Consider the example from the book: given the following source code:

var x = (min + max) / 2;

The tokenizer would output a list of tokens that looks something like:

[var, x, =, (, min, +, max, ), /, 2, ;]

Let's write main.py and a run.sh script that executes main.py on a Lox source file.

import sys

def main():
    if len(sys.argv) < 3:
        print("Usage: ./run.sh tokenize <filename>", file=sys.stderr)
        exit(1)

    command = sys.argv[1]
    filename = sys.argv[2]

    if command != "tokenize":
        print(f"Unknown command: {command}", file=sys.stderr)
        exit(1)

    with open(filename) as file:
        file_contents = file.read()

    if file_contents:
        raise NotImplementedError("Scanner not implemented")
    else:
        print("EOF  null")

if __name__ == "__main__":
    main()

run.sh is fairly minimal.

#!/bin/sh
set -e # Exit early if any commands fail
exec pipenv run python3 -m app.main "$@"

Scanning parentheses and braces

In this section, the goal is to scan source code consisting only of the parentheses and braces (, ), { and }. For example, consider the following Lox source code:

()}{

It consists of the following five tokens: (, ), }, { and EOF. EOF is a special token that always comes last; it's a convenient way to mark the end of the source file. Let's write scanner.py to scan a source file like this. It will have three things: 1) a TokenType enum, 2) a Token class and 3) a Scanner class.

from enum import Enum

TokenType = Enum(
    'TokenType',
    [
        'LEFT_PAREN',
        'RIGHT_PAREN',
        'LEFT_BRACE',
        'RIGHT_BRACE',
        'EOF',
    ]
)
    
class Token:
    def __init__(
        self,
        token_type: TokenType,
        line: int,
        lexeme: str = '',
        literal: str | float = 'null'
    ):
        self.token_type = token_type
        self.line = line
        self.lexeme = lexeme
        self.literal = literal
        
    def __repr__(self):
        return f"{self.token_type.name} {self.lexeme} {self.literal}"
    

class Scanner:
    def __init__(self, src: str) -> None:
        self.src = src
        self.start = 0
        self.current = 0
        self.line = 1
        self.tokens = []
        
    def is_at_end(self) -> bool:
        return self.current >= len(self.src)
        
    def scan_token(self):
        ch = self.advance()
        if ch == '(':
            lexeme = self.src[self.start: self.current]
            token = Token(TokenType.LEFT_PAREN, self.line, lexeme)
            self.tokens.append(token)
        elif ch == ')':
            lexeme = self.src[self.start: self.current]
            token = Token(TokenType.RIGHT_PAREN, self.line, lexeme)
            self.tokens.append(token)
        elif ch == '{':
            lexeme = self.src[self.start: self.current]
            token = Token(TokenType.LEFT_BRACE, self.line, lexeme)
            self.tokens.append(token)
        elif ch == '}':
            lexeme = self.src[self.start: self.current]
            token = Token(TokenType.RIGHT_BRACE, self.line, lexeme)
            self.tokens.append(token)
        else:
            raise ValueError(f"Unknown character at line {self.line}: {ch}")
        
    def advance(self):
        ch = self.src[self.current]
        self.current += 1
        return ch
    
    def scan(self):
        while not self.is_at_end():
            self.start = self.current
            self.scan_token()
        self.tokens.append(Token(TokenType.EOF, self.line))

The code is pretty self-explanatory at this point.

  1. TokenType enumerates the different types of tokens we will have. Right now, it's just the five we need: LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE and EOF.

  2. Token is basically a dataclass that stores the type of the token, the line at which it appears, the corresponding lexeme and the literal. The lexeme is the raw substring of the source program that corresponds to the token; for a LEFT_PAREN token the lexeme would be (. Not all tokens correspond to a lexeme; for EOF, for example, we use the empty string. Finally, some tokens have a literal value associated with them, like the number 42 in the line var x = 42;.

  3. Scanner has one main method called scan. It is initialised with the source code and keeps track of the line we're scanning (we are not updating this yet), the start position of the token being scanned, and the current position within the source (required because we have to deal with multi-character lexemes). Each token is appended to the tokens attribute.

It may seem over-engineered right now, but as the variety of tokens grows (multi-character tokens, identifiers, etc.) these little abstractions will prove useful.
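
A quick sanity check of the scanner so far, printing each token via the __repr__ we defined:

scanner = Scanner("()}{")
scanner.scan()
for token in scanner.tokens:
    print(token)
# LEFT_PAREN ( null
# RIGHT_PAREN ) null
# RIGHT_BRACE } null
# LEFT_BRACE { null
# EOF  null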

Scanning other single-character tokens

Now, we'll add support for the following single-character tokens: ,, ., -, +, ;, : and *. We will add the following members to our TokenType enum:

        'PLUS',
        'MINUS',
        'STAR',
        
        'SEMICOLON',
        'COLON',
        'COMMA',
        'DOT',

We will create a helper method called add_token to avoid code duplication:

def add_token(self, token_type: TokenType):
    lexeme = self.src[self.start : self.current]
    token = Token(token_type, self.line, lexeme)
    self.tokens.append(token)

Finally, we handle the new tokens inside scan_token as usual:

        elif ch == '+':
            self.add_token(TokenType.PLUS)
        elif ch == '-':
            self.add_token(TokenType.MINUS)
        elif ch == '*':
            self.add_token(TokenType.STAR)
        elif ch == ';':
            self.add_token(TokenType.SEMICOLON)
        elif ch == ':':
            self.add_token(TokenType.COLON)
        elif ch == ',':
            self.add_token(TokenType.COMMA)
        elif ch == '.':
            self.add_token(TokenType.DOT)

Lexical Errors

If the source code contains characters Lox doesn't support, what should we do? Currently, we raise a ValueError reporting the line and the character, but we can do better: print the error to the standard error stream and exit with code 65, which the sysexits.h convention (EX_DATAERR) uses for malformed input data (strictly speaking, not a standard). We import sys to use sys.stderr and define a new member had_error inside the Scanner class, initialised to False, so that later stages of compilation know not to proceed. We make the following change to report the error:

elif ch == '.':
    self.add_token(TokenType.DOT)
else:
    print(f"[line {self.line}] Error: Unexpected character: {ch}", file=sys.stderr)
    self.had_error = True
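
Scanning an unsupported character now reports the error and sets the flag instead of raising:

s = Scanner("@")
s.scan()            # prints "[line 1] Error: Unexpected character: @" to stderr
print(s.had_error)  # True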

In main.py we check on had_error and exit with code 65 if it's true.

if scanner.had_error:
    exit(65)
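
For context, here's roughly how the tokenize branch of main.py ends up looking. This is a sketch; it assumes scanner.py sits next to main.py in the app package, matching the python3 -m app.main invocation in run.sh.

from .scanner import Scanner  # assumes app/scanner.py

with open(filename) as file:
    file_contents = file.read()

scanner = Scanner(file_contents)
scanner.scan()
for token in scanner.tokens:
    print(token)
if scanner.had_error:
    exit(65)

Note that an empty file still prints EOF  null, since scan() always appends the EOF token.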

Assignment and equality operators

We will handle the following two operators: = and ==. This case is a bit more interesting. When we encounter = in the source file, we need to look ahead one character to decide whether we have == or just =. So we need the ability to look ahead and, depending on what we see, conditionally increment self.current. We define a match method:

def match(self, expected):
    if self.is_at_end():
        return False
    if self.src[self.current] == expected:
        self.current += 1
        return True
    return False

This checks whether the next character after = is another =. Only if it is do we increment self.current and return True. So in the scan_token method, we have:

elif ch == '=':
    if self.match('='):
        self.add_token(TokenType.EQUAL_EQUAL)
    else:
        self.add_token(TokenType.EQUAL)
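
For example, scanning === should yield EQUAL_EQUAL followed by EQUAL:

s = Scanner("===")
s.scan()
print([t.token_type.name for t in s.tokens])
# ['EQUAL_EQUAL', 'EQUAL', 'EOF']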

Negation and inequality operators

We create new token enum values and call them BANG and BANG_EQUAL to represent ! and != respectively. The logic to process them is similar.

elif ch == "!":
    if self.match("="):
        self.add_token(TokenType.BANG_EQUAL)
    else:
        self.add_token(TokenType.BANG)

Relational operators

Similarly, we can handle the relational operators:

elif ch == "<":
    if self.match("="):
        self.add_token(TokenType.LESS_EQUAL)
    else:
        self.add_token(TokenType.LESS)
elif ch == ">":
    if self.match("="):
        self.add_token(TokenType.GREATER_EQUAL)
    else:
        self.add_token(TokenType.GREATER)

Division and Comments

Dealing with division is a bit tricky, since comments also start with a /. Here's how we do it: if we see //, we read (and discard) characters until the end of the line (or the end of the file, if it's the last line). If it's just /, we emit TokenType.SLASH.

elif ch == "/":
    if self.match("/"):
         # read the comment until a new line or EOF
         while self.peek() != "\n" and not self.is_at_end():
              self.advance()
    else:
         self.add_token(TokenType.SLASH)
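
The loop above uses a peek helper, which looks at the current character without consuming it and returns a "\0" sentinel at the end of input (it also appears in the full listing below):

def peek(self):
    if self.is_at_end():
        return "\0"
    return self.src[self.current]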

Whitespace

We can mostly ignore whitespace. The one exception is a newline, where we bump the line counter so that error reporting stays accurate.

elif ch == ' ' or ch == '\t' or ch == '\r':
     return
elif ch == '\n':
     self.line += 1


String literals

If we encounter a double quote, we start scanning a string literal. We define a string method to read the rest of the string (and report an error if the string is unterminated).

elif ch == '"':
     self.string()

Here's the helper. Note that we also extend add_token to accept an optional literal argument (defaulting to "null"), so that string and number tokens can carry their values:

def string(self):
    while self.peek() != '"' and not self.is_at_end():
        if self.peek() == "\n":
            self.line += 1
        self.advance()

    if self.is_at_end():
        print(f"[line {self.line}] Error: Unterminated string.", file=sys.stderr)
        self.had_error = True
    else:
        self.advance()  # consume closing double-quote
        value_str = self.src[self.start + 1 : self.current - 1]
        self.add_token(TokenType.STRING, value_str)
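
A quick check; the lexeme keeps the quotes, while the literal strips them:

s = Scanner('"hello"')
s.scan()
print(s.tokens[0])
# STRING "hello" hello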

Number literals

We define a few helper functions to identify and read tokens that are numbers.

def is_digit(self, ch):
    # compare against the ASCII digits directly, so only 0-9 match
    return "0" <= ch <= "9"

def peek_next(self):
    if self.current + 1 >= len(self.src):
        return "\0"
    return self.src[self.current + 1]

def number(self):
    # consume the integer part
    while self.is_digit(self.peek()):
        self.advance()

    # consume a fractional part only if the "." is followed by a digit
    if self.peek() == "." and self.is_digit(self.peek_next()):
        self.advance()  # consume the "."
        while self.is_digit(self.peek()):
            self.advance()
    value_double = float(self.src[self.start : self.current])
    self.add_token(TokenType.NUMBER, value_double)

To recognize numbers we simply do:

elif self.is_digit(ch):
    self.number()
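
Note that the literal is stored as a float, so an integer lexeme like 42 carries the literal 42.0:

s = Scanner("42 3.14")
s.scan()
for t in s.tokens:
    print(t)
# NUMBER 42 42.0
# NUMBER 3.14 3.14
# EOF  null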

Identifiers and Keywords

To scan identifiers, we check whether the current character is a letter or underscore using is_alpha, and then keep reading characters while we continue to see alphanumeric characters or underscores.

def is_alpha(self, ch):
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_"

def identifier(self):
    while not self.is_at_end() and (
        self.is_alpha(self.src[self.current])
        or self.is_digit(self.src[self.current])
    ):
        self.advance()
    text = self.src[self.start : self.current]
    if text in self.keywords:
        self.add_token(self.keywords[text])
    else:
        self.add_token(TokenType.IDENTIFIER)

We also store the keywords of our language in the Scanner class, so that if a scanned identifier turns out to be a keyword, we emit the keyword's token type instead.

keywords = {
    "and": TokenType.AND,
    "class": TokenType.CLASS,
    "else": TokenType.ELSE,
    "false": TokenType.FALSE,
    "for": TokenType.FOR,
    "fun": TokenType.FUN,
    "if": TokenType.IF,
    "nil": TokenType.NIL,
    "or": TokenType.OR,
    "print": TokenType.PRINT,
    "return": TokenType.RETURN,
    "super": TokenType.SUPER,
    "this": TokenType.THIS,
    "true": TokenType.TRUE,
    "var": TokenType.VAR,
    "while": TokenType.WHILE,
}
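
Putting it all together:

s = Scanner("var answer = 42;")
s.scan()
print([t.token_type.name for t in s.tokens])
# ['VAR', 'IDENTIFIER', 'EQUAL', 'NUMBER', 'SEMICOLON', 'EOF']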

We now have a working scanner.

Here's the complete code for scanner.py:

from enum import Enum
import sys

TokenType = Enum(
    "TokenType",
    [
        "LEFT_PAREN",
        "RIGHT_PAREN",
        "LEFT_BRACE",
        "RIGHT_BRACE",
        "PLUS",
        "MINUS",
        "STAR",
        "SEMICOLON",
        "COLON",
        "COMMA",
        "DOT",
        "SLASH",
        "EQUAL",
        "EQUAL_EQUAL",
        "BANG",
        "BANG_EQUAL",
        "GREATER",
        "LESS",
        "GREATER_EQUAL",
        "LESS_EQUAL",
        "STRING",
        "NUMBER",
        "IDENTIFIER",
        "AND",
        "CLASS",
        "ELSE",
        "FALSE",
        "FOR",
        "FUN",
        "IF",
        "NIL",
        "OR",
        "PRINT",
        "RETURN",
        "SUPER",
        "THIS",
        "TRUE",
        "VAR",
        "WHILE",
        "EOF",
    ],
)


class Token:
    def __init__(
        self,
        token_type: TokenType,
        line: int,
        lexeme: str = "",
        literal: str | float = "null",
    ):
        self.token_type = token_type
        self.line = line
        self.lexeme = lexeme
        self.literal = literal

    def __repr__(self):
        return f"{self.token_type.name} {self.lexeme} {self.literal}"


class Scanner:
    keywords = {
        "and": TokenType.AND,
        "class": TokenType.CLASS,
        "else": TokenType.ELSE,
        "false": TokenType.FALSE,
        "for": TokenType.FOR,
        "fun": TokenType.FUN,
        "if": TokenType.IF,
        "nil": TokenType.NIL,
        "or": TokenType.OR,
        "print": TokenType.PRINT,
        "return": TokenType.RETURN,
        "super": TokenType.SUPER,
        "this": TokenType.THIS,
        "true": TokenType.TRUE,
        "var": TokenType.VAR,
        "while": TokenType.WHILE,
    }

    def __init__(self, src: str) -> None:
        self.src = src
        self.start = 0
        self.current = 0
        self.line = 1
        self.tokens = []
        self.had_error = False

    def is_at_end(self) -> bool:
        return self.current >= len(self.src)

    def add_token(self, token_type: TokenType, literal="null"):
        lexeme = self.src[self.start : self.current]
        token = Token(token_type, self.line, lexeme, literal)
        self.tokens.append(token)

    def scan_token(self):
        ch = self.advance()
        if ch == "(":
            self.add_token(TokenType.LEFT_PAREN)
        elif ch == ")":
            self.add_token(TokenType.RIGHT_PAREN)
        elif ch == "{":
            self.add_token(TokenType.LEFT_BRACE)
        elif ch == "}":
            self.add_token(TokenType.RIGHT_BRACE)
        elif ch == "+":
            self.add_token(TokenType.PLUS)
        elif ch == "-":
            self.add_token(TokenType.MINUS)
        elif ch == "*":
            self.add_token(TokenType.STAR)
        elif ch == ";":
            self.add_token(TokenType.SEMICOLON)
        elif ch == ":":
            self.add_token(TokenType.COLON)
        elif ch == ",":
            self.add_token(TokenType.COMMA)
        elif ch == ".":
            self.add_token(TokenType.DOT)
        elif ch == "=":
            if self.match("="):
                self.add_token(TokenType.EQUAL_EQUAL)
            else:
                self.add_token(TokenType.EQUAL)
        elif ch == "!":
            if self.match("="):
                self.add_token(TokenType.BANG_EQUAL)
            else:
                self.add_token(TokenType.BANG)
        elif ch == "<":
            if self.match("="):
                self.add_token(TokenType.LESS_EQUAL)
            else:
                self.add_token(TokenType.LESS)
        elif ch == ">":
            if self.match("="):
                self.add_token(TokenType.GREATER_EQUAL)
            else:
                self.add_token(TokenType.GREATER)
        elif ch == "/":
            if self.match("/"):
                # read the comment until a new line or EOF
                while self.peek() != "\n" and not self.is_at_end():
                    self.advance()
            else:
                self.add_token(TokenType.SLASH)
        elif ch == '"':
            self.string()
        elif self.is_digit(ch):
            self.number()
        elif self.is_alpha(ch):
            self.identifier()
        elif ch == " " or ch == "\t" or ch == "\r":
            return
        elif ch == "\n":
            self.line += 1
        else:
            print(
                f"[line {self.line}] Error: Unexpected character: {ch}", file=sys.stderr
            )
            self.had_error = True

    def match(self, expected):
        if self.is_at_end():
            return False
        if self.src[self.current] == expected:
            self.current += 1
            return True
        return False

    def peek(self):
        if self.is_at_end():
            return "\0"
        return self.src[self.current]

    def peek_next(self):
        if self.current + 1 >= len(self.src):
            return "\0"
        return self.src[self.current + 1]

    def advance(self):
        ch = self.src[self.current]
        self.current += 1
        return ch

    def string(self):
        while self.peek() != '"' and not self.is_at_end():
            if self.peek() == "\n":
                self.line += 1
            self.advance()

        if self.is_at_end():
            print(f"[line {self.line}] Error: Unterminated string.", file=sys.stderr)
            self.had_error = True
        else:
            self.advance()  # consume closing double-quote
            value_str = self.src[self.start + 1 : self.current - 1]
            self.add_token(TokenType.STRING, value_str)

    def is_digit(self, ch):
        # compare against the ASCII digits directly, so only 0-9 match
        return "0" <= ch <= "9"

    def number(self):
        while self.is_digit(self.peek()):
            self.advance()

        if self.peek() == "." and self.is_digit(self.peek_next()):
            self.advance()
            while self.is_digit(self.peek()):
                self.advance()
        value_double = float(self.src[self.start : self.current])
        self.add_token(TokenType.NUMBER, value_double)

    def is_alpha(self, ch):
        return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_"

    def identifier(self):
        while not self.is_at_end() and (
            self.is_alpha(self.src[self.current])
            or self.is_digit(self.src[self.current])
        ):
            self.advance()
        text = self.src[self.start : self.current]
        if text in self.keywords:
            self.add_token(self.keywords[text])
        else:
            self.add_token(TokenType.IDENTIFIER)

    def scan(self):
        while not self.is_at_end():
            self.start = self.current
            self.scan_token()
        self.tokens.append(Token(TokenType.EOF, self.line))

Testing the Scanner

Now that we have a complete scanner implementation, we can test it with various inputs using the run.sh script. Here's an example test; ideally we would write unit tests (a sketch follows the token output below). Note that the scanner doesn't care whether the input is syntactically valid Lox; real Lox, for instance, declares methods without the fun keyword, but the scanner tokenizes it all the same.

Create a test file test.lox:

class Circle {
    fun init(radius) {
        this.radius = radius;
    }
    
    fun area() {
        // Calculates area
        return 3.14159 * this.radius * this.radius;
    }
}

var circle = Circle();
circle.init(5.0);
print circle.area();

Run the scanner:

./run.sh tokenize test.lox

This produces a token stream covering keywords, identifiers, operators, punctuation and literals:

CLASS class null
IDENTIFIER Circle null
LEFT_BRACE { null
FUN fun null
IDENTIFIER init null
LEFT_PAREN ( null
IDENTIFIER radius null
RIGHT_PAREN ) null
LEFT_BRACE { null
THIS this null
DOT . null
IDENTIFIER radius null
EQUAL = null
IDENTIFIER radius null
SEMICOLON ; null
RIGHT_BRACE } null
FUN fun null
IDENTIFIER area null
LEFT_PAREN ( null
RIGHT_PAREN ) null
LEFT_BRACE { null
RETURN return null
NUMBER 3.14159 3.14159
STAR * null
THIS this null
DOT . null
IDENTIFIER radius null
STAR * null
THIS this null
DOT . null
IDENTIFIER radius null
SEMICOLON ; null
RIGHT_BRACE } null
RIGHT_BRACE } null
VAR var null
IDENTIFIER circle null
EQUAL = null
IDENTIFIER Circle null
LEFT_PAREN ( null
RIGHT_PAREN ) null
SEMICOLON ; null
IDENTIFIER circle null
DOT . null
IDENTIFIER init null
LEFT_PAREN ( null
NUMBER 5.0 5.0
RIGHT_PAREN ) null
SEMICOLON ; null
PRINT print null
IDENTIFIER circle null
DOT . null
IDENTIFIER area null
LEFT_PAREN ( null
RIGHT_PAREN ) null
SEMICOLON ; null
EOF  null
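
As for unit tests, here's a minimal unittest sketch (assuming the app/scanner.py package layout used by run.sh):

import unittest
from app.scanner import Scanner, TokenType

class ScannerTest(unittest.TestCase):
    def test_punctuation(self):
        s = Scanner("(){};,")
        s.scan()
        types = [t.token_type for t in s.tokens]
        self.assertEqual(
            types,
            [
                TokenType.LEFT_PAREN, TokenType.RIGHT_PAREN,
                TokenType.LEFT_BRACE, TokenType.RIGHT_BRACE,
                TokenType.SEMICOLON, TokenType.COMMA, TokenType.EOF,
            ],
        )

    def test_unexpected_character(self):
        s = Scanner("@")
        s.scan()
        self.assertTrue(s.had_error)

if __name__ == "__main__":
    unittest.main()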

We can now demonstrate (and test for) the scanner's ability to handle single- and multi-character tokens, comments, whitespace, string and number literals, identifiers, keywords, and lexical errors.

#compilers #lox-interpreter #programming-languages #software-engineering