
Lexing and Parsing: Building a Java(ish) Interpreter in Haskell
Table of contents
Timeline: | April 2017 | |
Languages Used: | Haskell | |
School: | Colorado College | |
Course: | CP341: Programming Languages |
My Last Stand (with a Functional Language)
It was April of my senior year of college. The finish line was in sight. Just one more final project stood between me and graduation. The course? Programming Languages. The challenge? Build a fully-functional interpreter for a subset of Java—lovingly nicknamed “Javaish”—using a language I had only just been introduced to: Haskell.
💡 Functional ≠ Logic Programming
While Haskell shares some declarative properties with Prolog, a logic programming language I used to build a sudoku solver one year before this, in functional programming the focus is more on computing and evaluating mathematical functions, as opposed to expressing relationships and facts within a logical system.
At Colorado College, we had the Block Plan, which meant this entire course, from researching a comprehensive history of programming languages, to learning Haskell’s syntax and submitting the final project, was compressed into three and a half weeks. It was a crash course in imperative vs functional programming, compilers, and sleep deprivation. And honestly? I loved every minute of it.
Building an interpreter from scratch felt like peering under the hood of programming itself. It’s one thing to write code; it’s another to write code that understands and runs other code.
From Text to Tree: Tokenizing and Parsing
The first step in any interpreter is to make sense of the source code, which is really just a long string of text. This process has two parts:
- Lexing/Tokenizing: Chopping the string into meaningful chunks, or “tokens.” For example,
class
,{
,myVar
,123
. - Parsing: Arranging those tokens into a structured tree (an Abstract Syntax Tree, or AST) that represents the program’s logic.
For this, I used the classic Haskell duo: Alex for lexing and Happy for parsing. They’re the Haskell world’s answer to the venerable lex
and yacc
.
The tokenizer definition in javaish_tok.hs
maps raw strings to my custom Token
types.
1-- From javaish_tok.hs
2alex_action_1 = addPosn TClass
3alex_action_2 = addPosn TNew
4alex_action_3 = addPosn TString
5-- ...and so on for every keyword and symbol...
6alex_action_19 = addPosnS( TIntLiteral . read )
7alex_action_32 = addPosnS( TIdent )
Once Alex gave me a stream of tokens, Happy’s job was to assemble them into our AST, which I defined in javaish_parse.hs
. The ExpOnly
data type is a great example of how I represented different kinds of expressions.
1-- From javaish_parse.hs
2data ExpOnly
3 = ExpOp Exp Char Exp
4 | ExpComOp Exp Char Exp
5 | ExpArray Exp Exp
6 | ExpFCall Exp Ident [ Exp ]
7 | ExpInt Int
8 | ExpNewInt Exp
9 | ExpBool Bool
10 | ExpIdent Ident
11 | ExpNewIdent Ident
12 | ExpThis
13 | ExpNot Exp
14 | ExpLength Exp
15 | ExpError
16 deriving (Show, Eq)
It’s just so clean! Using Haskell’s algebraic data types to model the language’s structure felt incredibly powerful.
Visualizing the AST for Javaish
This Abstract Syntax Tree structure supports a Java-like object-oriented language with classes, methods, arrays, and typical programming constructs.
graph src
1graph TD
2 Program["Program<br/>MainClass + [ClassDecl]"]
3
4 MainClass["MainClass<br/>MClass name args statement"]
5 ClassDecl["ClassDecl<br/>name extends fields methods"]
6
7 MethodDecl["MethodDecl<br/>type name params locals stmts returnExp"]
8
9 Statement["Statement<br/>SList | SIfElse | SWhile<br/>SPrint | SAssign | SArrayAssign"]
10
11 Expression["Expression (Exp)"]
12
13 %% Core Expression Types (Middle Level)
14 ExpOp["ExpOp<br/>Binary Operations<br/>(+, -, *, &&, <)"]
15 ExpComOp["ExpComOp<br/>Comparison Operations"]
16 ExpArray["ExpArray<br/>Array Access<br/>arr[index]"]
17 ExpFCall["ExpFCall<br/>Method Call<br/>obj.method(args)"]
18 ExpNot["ExpNot<br/>Logical Negation<br/>!expr"]
19 ExpLength["ExpLength<br/>Array Length<br/>arr.length"]
20
21 %% Leaf Expressions - Level 1 (Literals)
22 ExpInt["ExpInt<br/>Integer Literal<br/>42"]
23 ExpBool["ExpBool<br/>Boolean Literal<br/>true/false"]
24 ExpIdent["ExpIdent<br/>Variable Reference<br/>varName"]
25
26 %% Leaf Expressions - Level 2 (Object/Creation)
27 ExpThis["ExpThis<br/>this"]
28 ExpNewInt["ExpNewInt<br/>New Array<br/>new int[size]"]
29 ExpNewIdent["ExpNewIdent<br/>New Object<br/>new ClassName()"]
30
31 %% Main Structure Relationships
32 Program --> MainClass
33 Program --> ClassDecl
34 ClassDecl --> MethodDecl
35 MethodDecl --> Statement
36 MethodDecl --> Expression
37 Statement --> Expression
38
39 %% Expression to Operation Relationships
40 Expression --> ExpOp
41 Expression --> ExpComOp
42 Expression --> ExpArray
43 Expression --> ExpFCall
44 Expression --> ExpNot
45 Expression --> ExpLength
46
47 %% Operations to Leaf Expressions
48 Expression --> ExpInt
49 Expression --> ExpBool
50 Expression --> ExpIdent
51 Expression --> ExpThis
52 Expression --> ExpNewInt
53 Expression --> ExpNewIdent
54
55 %% Expression recursion - showing how expressions can contain other expressions
56 ExpOp --> |left operand| Expression
57 ExpOp --> |right operand| Expression
58 ExpComOp --> |left operand| Expression
59 ExpComOp --> |right operand| Expression
60 ExpArray --> |array expr| Expression
61 ExpArray --> |index expr| Expression
62 ExpFCall --> |object expr| Expression
63 ExpFCall --> |arguments| Expression
64 ExpNot --> |operand| Expression
65 ExpLength --> |array expr| Expression
66 ExpNewInt --> |size expr| Expression
67
68 %% Invisible relationships to organize leaf nodes in levels
69 %% Level 2
70 ExpOp ~~~ ExpNot
71 ExpComOp ~~~ ExpLength
72 ExpFCall ~~~ ExpArray
73 %% Level 3
74 ExpNot ~~~ ExpInt
75 ExpNot ~~~ ExpBool
76 ExpLength ~~~ ExpIdent
77 ExpLength ~~~ ExpThis
78 ExpArray ~~~ ExpNewInt
79 ExpArray ~~~ ExpNewIdent
80
81 %% Styling
82 classDef expressionNode fill:#e1f5fe,stroke:#0277bd,color:#333
83 classDef literalNode fill:#f3e5f5,stroke:#7b1fa2,color:#333
84 classDef operatorNode fill:#fff3e0,stroke:#ef6c00,color:#333
85 classDef structureNode fill:#e8f5e8,stroke:#2e7d32,color:#333
86
87 class Expression,ExpOp,ExpComOp,ExpArray,ExpFCall,ExpNot,ExpLength expressionNode
88 class ExpInt,ExpBool,ExpIdent,ExpThis literalNode
89 class ExpNewInt,ExpNewIdent operatorNode
90 class Program,MainClass,ClassDecl,MethodDecl,Statement structureNode
91
92 %% Link styling for visibility on both light and dark backgrounds
93 %% Regular connections - visible with good contrast
94 linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 stroke:#666666,stroke-width:2px
95
96 %% Invisible connections - transparent
97 linkStyle 29,30,31,32,33,34,35,36,37 stroke:transparent,stroke-width:0px
Key Expression Categories
1. Binary Operations (ExpOp
, ExpComOp
)
- Arithmetic:
+
,-
,*
- Logical:
&&
- Comparison:
<
- Both operands are expressions (recursive structure)
2. Array Operations (ExpArray
, ExpLength
, ExpNewInt
)
- Array Access:
arr[index]
- accesses array element - Array Length:
arr.length
- gets array size - Array Creation:
new int[size]
- creates new array
3. Method Calls (ExpFCall
)
- Object-oriented method invocation:
obj.method(arg1, arg2, ...)
- Receiver expression + method name + argument list
4. Object Operations (ExpNewIdent
, ExpThis
)
- Object Creation:
new ClassName()
- instantiate new object - This Reference:
this
- reference to current object
5. Primitive Values (ExpInt
, ExpBool
, ExpIdent
)
- Literals: Integer and boolean constants
- Variables: Identifier references
6. Unary Operations (ExpNot
)
- Logical Negation:
!expr
- boolean negation
The Heart of the Interpreter: Making it Run
With a beautiful AST in hand, the real magic could begin: evaluation. The core of the interpreter is a set of recursive functions that walk the tree and “do” what the code says.
The state of our “Javaish” program—variables, class definitions, objects—was managed in a few key data structures. The Value
type shows how we could represent all the possible data types within our interpreted language.
1-- From interpreter.hs
2data Value
3 = VInt Int
4 | VFloat Float
5 | VString String
6 | VObj ObjInfo
7 | VBool Bool
8 | VArray Int [ Value ]
9 deriving( Eq, Show )
The main workhorse was the interp_stmt
function. It’s essentially a giant case
statement that uses pattern matching on the AST. This is where the elegance of Haskell really shines. You can see how cleanly it handles an if/else
statement:
1-- From interpreter.hs
2interp_stmt ( Stmt stmt p ) =
3 case stmt of
4 -- { <stmts> }
5 SList stmts -> forM_ stmts interp_stmt
6
7 -- if <test> then <thenStmt> else <elseStmt>
8 SIfElse test thenStmt elseStmt -> do
9 VBool testTrue <- interp_exp test
10 if testTrue then
11 interp_stmt thenStmt
12 else
13 interp_stmt elseStmt
14
15 -- ... and cases for all other statement types
Using Monads for State Management
Some say only the weak-minded require monads to manage state in their Haskell code. But good luck avoiding them entirely!
💡 The Impurity of State
Haskell is a “pure” functional language, wherein a function’s outputs depend solely on its inputs, with no side effects allowed. But interpreters need to manage state (like changing the value of a variable). To accomplish that, I had to venture into the “impure” world using
IORef
, which is basically a mutable reference.
Call me “impure” all you want, but the IO
monad was an essential part of the interpreter.hs
file to manage the mutable call stack and object field tables. Those are necessary for simulating the state changes inherent in an imperative language like Javaish within Haskell’s pure functional paradigm.
I imported the Data.IORef
module, which provides the tools for mutable references in Haskell’s IO
monad.
ObjInfo
Data Type (indirect reference):1data ObjInfo = OI( Maybe ( Ident, ObjInfo, IORef SymbolTable ) ) 2 deriving( Eq )
The
ObjInfo
data type, which defines the structure of an object, includes anIORef SymbolTable
. This means that the fields of an object are stored in a mutableSymbolTable
reference.interpret
Function (initialization of call stack):1interpret classes whole_program = do 2 -- each frame of the call stack includes: 3 -- - "this"'s class 4 -- - "this"'s field table 5 -- - local variable table for the current call 6 -- callStack :: IORef [ ( Ident, IORef SymbolTable, SymbolTable ) ] -- Original commented out type 7 callStack <- newIORef []
Here,
callStack
is initialized as anIORef
holding a list of tuples representing call frames.newIORef []
creates an empty mutable reference.interp_prog
Function (initialization ofthisFields
):1interp_prog ( Program ( MClass name args body _ ) _ _ ) = do 2 thisFields <- newIORef Map.empty 3 -- TODO: add real args to local state 4 writeIORef callStack [ ( VObj (OI (Just (name, OI Nothing, thisFields) ) ), Map.singleton args ( VString "" ) ) ] 5 interp_stmt body
thisFields
is created as anIORef
to hold the fields of thethis
object for the main class. Its value is then immediately written into thecallStack
.SAssign
Statement (writing to locals or fields):1SAssign lhs rhs -> do 2 v <- interp_exp rhs 3 ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals ) : s ) <- readIORef callStack 4 case Map.lookup lhs locals of 5 Just _ -> 6 let locals' = Map.insert lhs v locals in 7 writeIORef callStack ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals' ) : s ) 8 Nothing -> do 9 fields <- readIORef fieldsRef 10 let fields' = Map.insert lhs v fields 11 writeIORef fieldsRef fields'
This is a crucial part. When an assignment happens:
readIORef callStack
retrieves the current call stack.fieldsRef
is extracted from the currentthis
object (which itself is anIORef SymbolTable
).- If
lhs
is a local variable,writeIORef callStack
updates the stack with the new local variable map. - If
lhs
is a field,readIORef fieldsRef
gets the current fields,Map.insert
updates them, andwriteIORef fieldsRef
commits the change back to the mutable field table.
SArrayAssign
Statement (writing to array elements):1SArrayAssign lhs idx rhs -> do 2 VInt i <- interp_exp idx 3 v <- interp_exp rhs 4 ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals ) : s ) <- readIORef callStack 5 case Map.lookup lhs locals of 6 Just ( VArray length items ) -> 7 -- ... (updates local array) ... 8 writeIORef callStack ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals' ) : s ) 9 Nothing -> do 10 fields <- readIORef fieldsRef 11 let Just ( VArray length items ) = Map.lookup lhs fields 12 -- ... (updates field array) ... 13 writeIORef fieldsRef fields'
Similar to
SAssign
, this block usesreadIORef
andwriteIORef
to update either local array variables (by updating the entire local map on the stack) or object fields (by updating thefieldsRef
directly).ExpIdent
Expression (reading variable values):1ExpIdent var -> do 2 ( ( VObj (OI (Just ( _, _, thisFields ) ) ), locals ) : _ ) <- readIORef callStack 3 case Map.lookup var locals of 4 Just val -> return val 5 Nothing -> do 6 fieldTable <- readIORef thisFields 7 let Just val = Map.lookup var fieldTable 8 return val
When an identifier is evaluated, it first checks local variables. If not found, it uses
readIORef thisFields
to look up the variable in the object’s mutable field table.create_obj
Function (creating new object instances):1create_obj cname = do 2 let Just( parent_name, fields, _ ) = Map.lookup cname classes 3 fieldTable <- newIORef fields 4 parent <- make_parent_from_name parent_name 5 return $ VObj $ OI $ Just( cname, parent, fieldTable )
When a new object is created, its
fieldTable
is initialized as a newIORef
holding the default field values for that class.
In essence, IORef
is used throughout interpreter.hs
to manage the dynamic, mutable state (call stack and object fields) that is fundamental to interpreting an imperative language.
All Together Now: Running “Javaish” Code
After weeks of work, it was time for the moment of truth. Our professor provided a test file, test.javaish
, that used classes, inheritance, and method calls to see if everything worked.
1// From test.javaish
2class parent {
3 int g;
4
5 public int f1(int a) {
6 { System.out.println( 2 ); }
7 return a;
8 }
9}
10
11class kiddo extends parent {
12
13 public int f2(int b) {
14 { System.out.println( 3 ); }
15 return this.f1(b);
16 }
17}
18
19class T01 {
20 public static void main( String [] args )
21 {
22 System.out.println( (new kiddo()).f2(1) );
23 }
24}
This program creates a kiddo
object, calls its f2
method (which prints 3
), which in turn calls the parent
’s f1
method (which prints 2
), and finally prints the return value (1
).
Seeing the correct output appear on my screen after piping this file into my compiled Haskell program was a satisfying moment.
Final Thoughts
This project was more than just a final assignment; it was bringing concepts into practice I had learned in CP404 Theory of Computation. It forced me to grapple with complex ideas like parsing, evaluation, state management, and functional programming paradigms (including mutability with monads!).
It taught me that the languages we use fundamentally shape how we think about problems. And sometimes, the best way to understand a language is to try and build one yourself.
More Resources
Source Code
The complete source code, including the tokenizer, parser, and interpreter, is available on GitHub. You can see the full progression from Day 1 through the end of the course. Feel free to browse, but please don’t judge my 2017-era Haskell too harshly!