Writing my own interpreter for Lox, Part 1 - Scanner
Crafting Interpreters seems to be a high-quality, hands-on resource for learning about programming languages and their implementation. In this post, I implement a scanner for the Lox programming language in Python and keep a worklog as I go. The complete source code for this project is here.
Scanning - Overview
The first stage of compiling or interpreting a program is scanning (also known as lexing or tokenizing). This involves reading the source file and emitting tokens, i.e. meaningful groups of characters. Taking the example from the book, given the following source code:
var x = (min + max) / 2;
The tokenizer would output a list of tokens that looks something like:
[var, x, =, (, min, +, max, ), /, 2, ;]
Let's write main.py along with a run.sh script that executes main.py on a Lox source file.
import sys
def main():
if len(sys.argv) < 3:
print("Usage: ./run.sh tokenize <filename>", file=sys.stderr)
exit(1)
command = sys.argv[1]
filename = sys.argv[2]
if command != "tokenize":
print(f"Unknown command: {command}", file=sys.stderr)
exit(1)
with open(filename) as file:
file_contents = file.read()
if file_contents:
raise NotImplementedError("Scanner not implemented")
else:
print("EOF null")
if __name__ == "__main__":
main()
run.sh is fairly minimal.
#!/bin/sh
set -e # Exit early if any commands fail
exec pipenv run python3 -m app.main "$@"
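To try it out (assuming pipenv is installed and a Pipfile exists for the project), make the script executable and point it at a Lox file. With the main.py above, an empty file should print EOF null:

chmod +x run.sh
touch empty.lox
./run.sh tokenize empty.lox   # prints: EOF null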
Scanning parentheses and braces
In this section, the goal is to scan source code consisting only of the parentheses and braces (, ), {, }. For example, consider the following Lox source code:
()}{
It consists of five tokens: (, ), }, { and a final EOF. EOF is a special token that always comes last; it's a convenient way to mark the end of the source file. Let's write scanner.py to scan a source file like this. It will contain three things: 1) a TokenType enum, 2) a Token class and 3) a Scanner class.
from enum import Enum
TokenType = Enum(
'TokenType',
[
'LEFT_PAREN',
'RIGHT_PAREN',
'LEFT_BRACE',
'RIGHT_BRACE',
'EOF',
]
)
class Token:
def __init__(
self,
token_type: TokenType,
line: int,
lexeme: str = '',
literal: str | float = 'null'
):
self.token_type = token_type
self.line = line
self.lexeme = lexeme
self.literal = literal
def __repr__(self):
return f"{self.token_type.name} {self.lexeme} {self.literal}"
class Scanner:
def __init__(self, src: str) -> None:
self.src = src
self.start = 0
self.current = 0
self.line = 1
self.tokens = []
def is_at_end(self) -> bool:
return self.current >= len(self.src)
def scan_token(self):
ch = self.advance()
if ch == '(':
lexeme = self.src[self.start: self.current]
token = Token(TokenType.LEFT_PAREN, self.line, lexeme)
self.tokens.append(token)
elif ch == ')':
lexeme = self.src[self.start: self.current]
token = Token(TokenType.RIGHT_PAREN, self.line, lexeme)
self.tokens.append(token)
elif ch == '{':
lexeme = self.src[self.start: self.current]
token = Token(TokenType.LEFT_BRACE, self.line, lexeme)
self.tokens.append(token)
elif ch == '}':
lexeme = self.src[self.start: self.current]
token = Token(TokenType.RIGHT_BRACE, self.line, lexeme)
self.tokens.append(token)
else:
raise ValueError(f"Unknown character at line {self.line}: {ch}")
def advance(self):
ch = self.src[self.current]
self.current += 1
return ch
def scan(self):
while not self.is_at_end():
self.start = self.current
self.scan_token()
self.tokens.append(Token(TokenType.EOF, self.line))
The code is pretty self-explanatory at this point.
- TokenType enumerates the different kinds of tokens we will have. Right now, it's just the five we need: LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE and EOF.
- Token is basically a dataclass that stores the type of the token, the line it appears on, the corresponding lexeme and the literal. A lexeme is the raw substring of the source program that the token corresponds to; for a LEFT_PAREN token the lexeme is (. Not every token corresponds to a lexeme; consider EOF, for which we use the empty string. Finally, some tokens have a literal value associated with them, like the number 42 when you have a line var x = 42;.
- Scanner has one main method, scan. The class is initialised with the source code and keeps track of the line we're scanning (currently we are not updating this), the start position of the current token, and the current scanning position (this is needed because we have to deal with multi-character lexemes / tokens). Each token is appended to the tokens attribute.
It may seem over-engineered right now, but as the variety of tokens grows (multi-character tokens, identifiers etc.) these little abstractions will pay off.
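To actually see tokens, main.py can replace the NotImplementedError with a few lines like the following. This is a minimal sketch; the relative import assumes scanner.py lives next to main.py in the app package that run.sh invokes:

from .scanner import Scanner  # assumed location of scanner.py

scanner = Scanner(file_contents)
scanner.scan()
for token in scanner.tokens:
    print(token)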
Scanning other single-character tokens
Now, we'll add support for the following single-character tokens: ,, ., -, +, ;, : and *. We add the corresponding names to our TokenType enum:
'PLUS',
'MINUS',
'STAR',
'SEMICOLON',
'COLON',
'COMMA',
'DOT',
We will create a helper method called add_token to avoid code-duplication:
def add_token(self, token_type: TokenType):
lexeme = self.src[self.start : self.current]
token = Token(token_type, self.line, lexeme)
self.tokens.append(token)
Finally, we handle the new tokens inside scan_token as usual:
elif ch == '+':
self.add_token(TokenType.PLUS)
elif ch == '-':
self.add_token(TokenType.MINUS)
elif ch == '*':
self.add_token(TokenType.STAR)
elif ch == ';':
self.add_token(TokenType.SEMICOLON)
elif ch == ':':
self.add_token(TokenType.COLON)
elif ch == ',':
self.add_token(TokenType.COMMA)
elif ch == '.':
self.add_token(TokenType.DOT)
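As a quick sanity check, the new tokens scan as expected when we drive the Scanner class directly:

scanner = Scanner("+-.,;*")
scanner.scan()
print([t.token_type.name for t in scanner.tokens])
# ['PLUS', 'MINUS', 'DOT', 'COMMA', 'SEMICOLON', 'STAR', 'EOF']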
Lexical Errors
If the source code contains characters Lox doesn't support, what should we do? Currently, we raise a ValueError reporting the line and the character, but we can do better: print the error to the standard error stream and exit with code 65 to represent malformed input data (this comes from BSD's sysexits.h and is a convention rather than a strict standard). We import sys to use sys.stderr and add a member had_error to the Scanner class, initialised to False. We then make the following change to report the error:
elif ch == '.':
self.add_token(TokenType.DOT)
else:
print(f"[line {self.line}] Error: Unexpected character: {ch}", file=sys.stderr)
self.had_error = True
In main.py we check had_error and exit with code 65 if it's true.
if scanner.had_error:
exit(65)
Assignment and equality operators
We will handle two operators: = and ==. This case is a bit more interesting. When we encounter = in the source file, we need to look ahead one character to decide whether we have == or a lone = followed by something else, like an identifier. So we need a way to look ahead and, depending on what we see, conditionally increment self.current. We define a match method:
def match(self, expected):
    if self.is_at_end():
        return False
    if self.src[self.current] == expected:
        self.current += 1
        return True
    return False
This lets us check whether the character after = is another =. Only if it is do we consume it, incrementing self.current and returning True. So in the scan_token method, we have:
elif ch == '=':
if self.match('='):
self.add_token(TokenType.EQUAL_EQUAL)
else:
self.add_token(TokenType.EQUAL)
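This gives us maximal munch behaviour: the scanner prefers the longest token it can form. For example, === scans as == followed by =:

scanner = Scanner("===")
scanner.scan()
print([t.token_type.name for t in scanner.tokens])
# ['EQUAL_EQUAL', 'EQUAL', 'EOF']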
Negation and inequality operators
We create new token enum values and call them BANG and BANG_EQUAL to represent ! and != respectively. The logic to process them is similar.
elif ch == "!":
if self.match("="):
self.add_token(TokenType.BANG_EQUAL)
else:
self.add_token(TokenType.BANG)
Relational operators
Similarly, we can handle the relational operators:
elif ch == "<":
if self.match("="):
self.add_token(TokenType.LESS_EQUAL)
else:
self.add_token(TokenType.LESS)
elif ch == ">":
if self.match("="):
self.add_token(TokenType.GREATER_EQUAL)
else:
self.add_token(TokenType.GREATER)
Division and Comments
Dealing with division is a bit tricky, since comments also start with a /. Here's how we handle it: if we see //, we read (and discard) characters until the end of the line (or end of file, if it's the last line). If it's just a single /, we emit TokenType.SLASH.
elif ch == "/":
if self.match("/"):
# read the comment until a new line or EOF
while self.peek() != "\n" and not self.is_at_end():
self.advance()
else:
self.add_token(TokenType.SLASH)
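This relies on a peek helper that, unlike advance, looks at the current character without consuming it, returning a NUL sentinel at the end of the file:

def peek(self):
    if self.is_at_end():
        return "\0"
    return self.src[self.current]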
Whitespace
We can pretty much ignore whitespace.
elif ch == ' ' or ch == '\t' or ch == '\r':
return
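Newlines are almost whitespace too, except that we bump the line counter first so error messages stay accurate:

elif ch == "\n":
    self.line += 1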
String literals
If we encounter a double quote, we start scanning a string literal. We define a string method that reads the rest of the string (or reports an error if we never find the closing quote).
elif ch == '"':
self.string()
Helper function:
def string(self):
while self.peek() != '"' and not self.is_at_end():
if self.peek() == "\n":
self.line += 1
self.advance()
if self.is_at_end():
print(f"[line {self.line}] Error: Unterminated string.", file=sys.stderr)
self.had_error = True
else:
self.advance() # consume closing double-quote
value_str = self.src[self.start + 1 : self.current - 1]
self.add_token(TokenType.STRING, value_str)
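Since string() now passes a literal value along, add_token gains an optional literal parameter defaulting to 'null':

def add_token(self, token_type: TokenType, literal="null"):
    lexeme = self.src[self.start : self.current]
    token = Token(token_type, self.line, lexeme, literal)
    self.tokens.append(token)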
Number literals
We define a few helper methods to identify and read number tokens. Note that number() consumes a decimal point only when it's followed by another digit (checked via peek_next), so a trailing dot is left for the next token.
def is_digit(self, ch):
    # only the ASCII digits 0-9 count as digits in Lox
    return "0" <= ch <= "9"
def peek_next(self):
if self.current + 1 >= len(self.src):
return "\0"
return self.src[self.current + 1]
def number(self):
while self.is_digit(self.peek()):
self.advance()
if self.peek() == "." and self.is_digit(self.peek_next()):
self.advance()
while self.is_digit(self.peek()):
self.advance()
value_double = float(self.src[self.start : self.current])
self.add_token(TokenType.NUMBER, value_double)
To recognize numbers we simply do:
elif self.is_digit(ch):
self.number()
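Here's that behaviour in action: the trailing dot is scanned as a separate DOT token rather than swallowed into the number.

scanner = Scanner("3.14.")
scanner.scan()
for token in scanner.tokens:
    print(token)
# NUMBER 3.14 3.14
# DOT . null
# EOF  null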
Identifiers and Keywords
To scan identifiers, we check whether the first character is a letter or underscore using is_alpha, then keep reading characters as long as we see letters, digits, or underscores.
def is_alpha(self, ch):
    # letters and underscore can start an identifier
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_"
def identifier(self):
while not self.is_at_end() and (
self.is_alpha(self.src[self.current])
or self.is_digit(self.src[self.current])
):
self.advance()
text = self.src[self.start : self.current]
if text in self.keywords:
self.add_token(self.keywords[text])
else:
self.add_token(TokenType.IDENTIFIER)
We also store the keywords of the language as a class attribute on Scanner, so that when a scanned identifier turns out to be a keyword we emit the corresponding keyword token instead.
keywords = {
"and": TokenType.AND,
"class": TokenType.CLASS,
"else": TokenType.ELSE,
"false": TokenType.FALSE,
"for": TokenType.FOR,
"fun": TokenType.FUN,
"if": TokenType.IF,
"nil": TokenType.NIL,
"or": TokenType.OR,
"print": TokenType.PRINT,
"return": TokenType.RETURN,
"super": TokenType.SUPER,
"this": TokenType.THIS,
"true": TokenType.TRUE,
"var": TokenType.VAR,
"while": TokenType.WHILE,
}
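The keyword lookup happens only after the whole identifier has been consumed, so a keyword prefix doesn't get split off. For example, orchid is one IDENTIFIER, not OR plus chid:

scanner = Scanner("or orchid")
scanner.scan()
print([t.token_type.name for t in scanner.tokens])
# ['OR', 'IDENTIFIER', 'EOF']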
We now have a working scanner.
Here's the complete code for scanner.py:
from enum import Enum
import sys
TokenType = Enum(
"TokenType",
[
"LEFT_PAREN",
"RIGHT_PAREN",
"LEFT_BRACE",
"RIGHT_BRACE",
"PLUS",
"MINUS",
"STAR",
"SEMICOLON",
"COLON",
"COMMA",
"DOT",
"SLASH",
"EQUAL",
"EQUAL_EQUAL",
"BANG",
"BANG_EQUAL",
"GREATER",
"LESS",
"GREATER_EQUAL",
"LESS_EQUAL",
"STRING",
"NUMBER",
"IDENTIFIER",
"AND",
"CLASS",
"ELSE",
"FALSE",
"FOR",
"FUN",
"IF",
"NIL",
"OR",
"PRINT",
"RETURN",
"SUPER",
"THIS",
"TRUE",
"VAR",
"WHILE",
"EOF",
],
)
class Token:
def __init__(
self,
token_type: TokenType,
line: int,
lexeme: str = "",
literal: str | float = "null",
):
self.token_type = token_type
self.line = line
self.lexeme = lexeme
self.literal = literal
def __repr__(self):
return f"{self.token_type.name} {self.lexeme} {self.literal}"
class Scanner:
keywords = {
"and": TokenType.AND,
"class": TokenType.CLASS,
"else": TokenType.ELSE,
"false": TokenType.FALSE,
"for": TokenType.FOR,
"fun": TokenType.FUN,
"if": TokenType.IF,
"nil": TokenType.NIL,
"or": TokenType.OR,
"print": TokenType.PRINT,
"return": TokenType.RETURN,
"super": TokenType.SUPER,
"this": TokenType.THIS,
"true": TokenType.TRUE,
"var": TokenType.VAR,
"while": TokenType.WHILE,
}
def __init__(self, src: str) -> None:
self.src = src
self.start = 0
self.current = 0
self.line = 1
self.tokens = []
self.had_error = False
def is_at_end(self) -> bool:
return self.current >= len(self.src)
def add_token(self, token_type: TokenType, literal="null"):
lexeme = self.src[self.start : self.current]
token = Token(token_type, self.line, lexeme, literal)
self.tokens.append(token)
def scan_token(self):
ch = self.advance()
if ch == "(":
self.add_token(TokenType.LEFT_PAREN)
elif ch == ")":
self.add_token(TokenType.RIGHT_PAREN)
elif ch == "{":
self.add_token(TokenType.LEFT_BRACE)
elif ch == "}":
self.add_token(TokenType.RIGHT_BRACE)
elif ch == "+":
self.add_token(TokenType.PLUS)
elif ch == "-":
self.add_token(TokenType.MINUS)
elif ch == "*":
self.add_token(TokenType.STAR)
elif ch == ";":
self.add_token(TokenType.SEMICOLON)
elif ch == ":":
self.add_token(TokenType.COLON)
elif ch == ",":
self.add_token(TokenType.COMMA)
elif ch == ".":
self.add_token(TokenType.DOT)
elif ch == "=":
if self.match("="):
self.add_token(TokenType.EQUAL_EQUAL)
else:
self.add_token(TokenType.EQUAL)
elif ch == "!":
if self.match("="):
self.add_token(TokenType.BANG_EQUAL)
else:
self.add_token(TokenType.BANG)
elif ch == "<":
if self.match("="):
self.add_token(TokenType.LESS_EQUAL)
else:
self.add_token(TokenType.LESS)
elif ch == ">":
if self.match("="):
self.add_token(TokenType.GREATER_EQUAL)
else:
self.add_token(TokenType.GREATER)
elif ch == "/":
if self.match("/"):
# read the comment until a new line or EOF
while self.peek() != "\n" and not self.is_at_end():
self.advance()
else:
self.add_token(TokenType.SLASH)
elif ch == '"':
self.string()
elif self.is_digit(ch):
self.number()
elif self.is_alpha(ch):
self.identifier()
elif ch == " " or ch == "\t" or ch == "\r":
return
elif ch == "\n":
self.line += 1
else:
print(
f"[line {self.line}] Error: Unexpected character: {ch}", file=sys.stderr
)
self.had_error = True
def match(self, expected):
if self.is_at_end():
return False
if self.src[self.current] == expected:
self.current += 1
return True
return False
def peek(self):
if self.is_at_end():
return "\0"
return self.src[self.current]
def peek_next(self):
if self.current + 1 >= len(self.src):
return "\0"
return self.src[self.current + 1]
def advance(self):
ch = self.src[self.current]
self.current += 1
return ch
def string(self):
while self.peek() != '"' and not self.is_at_end():
if self.peek() == "\n":
self.line += 1
self.advance()
if self.is_at_end():
print(f"[line {self.line}] Error: Unterminated string.", file=sys.stderr)
self.had_error = True
else:
self.advance() # consume closing double-quote
value_str = self.src[self.start + 1 : self.current - 1]
self.add_token(TokenType.STRING, value_str)
    def is_digit(self, ch):
        # only the ASCII digits 0-9 count as digits in Lox
        return "0" <= ch <= "9"
def number(self):
while self.is_digit(self.peek()):
self.advance()
if self.peek() == "." and self.is_digit(self.peek_next()):
self.advance()
while self.is_digit(self.peek()):
self.advance()
value_double = float(self.src[self.start : self.current])
self.add_token(TokenType.NUMBER, value_double)
    def is_alpha(self, ch):
        # letters and underscore can start an identifier
        return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_"
def identifier(self):
while not self.is_at_end() and (
self.is_alpha(self.src[self.current])
or self.is_digit(self.src[self.current])
):
self.advance()
text = self.src[self.start : self.current]
if text in self.keywords:
self.add_token(self.keywords[text])
else:
self.add_token(TokenType.IDENTIFIER)
    def scan(self):
        while not self.is_at_end():
            self.start = self.current
            self.scan_token()
        self.tokens.append(Token(TokenType.EOF, self.line))
Testing the Scanner
Now that we have a complete scanner, we can test it on various examples using the run.sh script. Here's an example test (ideally, we would also write unit tests):
Create a test file test.lox:
class Circle {
fun init(radius) {
this.radius = radius;
}
fun area() {
// Calculates area
return 3.14159 * this.radius * this.radius;
}
}
var circle = Circle();
circle.init(5.0);
print circle.area();
Run the scanner:
./run.sh tokenize test.lox
This produces the following token stream, covering classes, methods, object instantiation, and method calls. (The snippet isn't strictly valid Lox grammar, since Lox methods are declared without the fun keyword, but the scanner only cares about tokens, not grammar.)
CLASS class null
IDENTIFIER Circle null
LEFT_BRACE { null
FUN fun null
IDENTIFIER init null
LEFT_PAREN ( null
IDENTIFIER radius null
RIGHT_PAREN ) null
LEFT_BRACE { null
THIS this null
DOT . null
IDENTIFIER radius null
EQUAL = null
IDENTIFIER radius null
SEMICOLON ; null
RIGHT_BRACE } null
FUN fun null
IDENTIFIER area null
LEFT_PAREN ( null
RIGHT_PAREN ) null
LEFT_BRACE { null
RETURN return null
NUMBER 3.14159 3.14159
STAR * null
THIS this null
DOT . null
IDENTIFIER radius null
STAR * null
THIS this null
DOT . null
IDENTIFIER radius null
SEMICOLON ; null
RIGHT_BRACE } null
RIGHT_BRACE } null
VAR var null
IDENTIFIER circle null
EQUAL = null
IDENTIFIER Circle null
LEFT_PAREN ( null
RIGHT_PAREN ) null
SEMICOLON ; null
IDENTIFIER circle null
DOT . null
IDENTIFIER init null
LEFT_PAREN ( null
NUMBER 5.0 5.0
RIGHT_PAREN ) null
SEMICOLON ; null
PRINT print null
IDENTIFIER circle null
DOT . null
IDENTIFIER area null
LEFT_PAREN ( null
RIGHT_PAREN ) null
SEMICOLON ; null
EOF null
We can now demonstrate (and test for) the scanner's ability to handle:
- All single and multi-character tokens
- String and number literals
- Keywords vs identifiers
- Comments (which are ignored)
- Error cases with appropriate error messages
- Complex Lox programs