Thumbnail image

Lexing and Parsing: Building a Java(ish) Interpreter in Haskell

Table of contents

Timeline:April 2017
Languages Used:Haskell
School:Colorado College
Course:CP341: Programming Languages

My Last Stand (with a Functional Language)

It was April of my senior year of college. The finish line was in sight. Just one more final project stood between me and graduation. The course? Programming Languages. The challenge? Build a fully-functional interpreter for a subset of Java—lovingly nicknamed “Javaish”—using a language I had only just been introduced to: Haskell.

💡 Functional ≠ Logic Programming

While Haskell shares some declarative properties with Prolog, a logic programming language I used to build a sudoku solver one year before this, in functional programming the focus is more on computing and evaluating mathematical functions, as opposed to expressing relationships and facts within a logical system.

At Colorado College, we had the Block Plan, which meant this entire course, from researching a comprehensive history of programming languages, to learning Haskell’s syntax and submitting the final project, was compressed into three and a half weeks. It was a crash course in imperative vs functional programming, compilers, and sleep deprivation. And honestly? I loved every minute of it.

Building an interpreter from scratch felt like peering under the hood of programming itself. It’s one thing to write code; it’s another to write code that understands and runs other code.

From Text to Tree: Tokenizing and Parsing

The first step in any interpreter is to make sense of the source code, which is really just a long string of text. This process has two parts:

  1. Lexing/Tokenizing: Chopping the string into meaningful chunks, or “tokens.” For example, class, {, myVar, 123.
  2. Parsing: Arranging those tokens into a structured tree (an Abstract Syntax Tree, or AST) that represents the program’s logic.

For this, I used the classic Haskell duo: Alex for lexing and Happy for parsing. They’re the Haskell world’s answer to the venerable lex and yacc.

The tokenizer definition in javaish_tok.hs maps raw strings to my custom Token types.

1-- From javaish_tok.hs
2alex_action_1 =  addPosn TClass
3alex_action_2 =  addPosn TNew
4alex_action_3 =  addPosn TString
5-- ...and so on for every keyword and symbol...
6alex_action_19 =  addPosnS( TIntLiteral . read )
7alex_action_32 =  addPosnS( TIdent )

Once Alex gave me a stream of tokens, Happy’s job was to assemble them into our AST, which I defined in javaish_parse.hs. The ExpOnly data type is a great example of how I represented different kinds of expressions.

 1-- From javaish_parse.hs
 2data ExpOnly
 3    = ExpOp Exp Char Exp
 4    | ExpComOp Exp Char Exp
 5    | ExpArray Exp Exp
 6    | ExpFCall Exp Ident [ Exp ]
 7    | ExpInt Int
 8    | ExpNewInt Exp
 9    | ExpBool Bool
10    | ExpIdent Ident
11    | ExpNewIdent Ident
12    | ExpThis
13    | ExpNot Exp
14    | ExpLength Exp
15    | ExpError
16    deriving (Show, Eq)

It’s just so clean! Using Haskell’s algebraic data types to model the language’s structure felt incredibly powerful.

Visualizing the AST for Javaish

This Abstract Syntax Tree structure supports a Java-like object-oriented language with classes, methods, arrays, and typical programming constructs.

Mermaid diagram: Abstract Syntax Tree for Javaish

graph src
 1graph TD
 2    Program["Program<br/>MainClass + [ClassDecl]"]
 3
 4    MainClass["MainClass<br/>MClass name args statement"]
 5    ClassDecl["ClassDecl<br/>name extends fields methods"]
 6
 7    MethodDecl["MethodDecl<br/>type name params locals stmts returnExp"]
 8
 9    Statement["Statement<br/>SList | SIfElse | SWhile<br/>SPrint | SAssign | SArrayAssign"]
10
11    Expression["Expression (Exp)"]
12
13    %% Core Expression Types (Middle Level)
14    ExpOp["ExpOp<br/>Binary Operations<br/>(+, -, *, &&, <)"]
15    ExpComOp["ExpComOp<br/>Comparison Operations"]
16    ExpArray["ExpArray<br/>Array Access<br/>arr[index]"]
17    ExpFCall["ExpFCall<br/>Method Call<br/>obj.method(args)"]
18    ExpNot["ExpNot<br/>Logical Negation<br/>!expr"]
19    ExpLength["ExpLength<br/>Array Length<br/>arr.length"]
20
21    %% Leaf Expressions - Level 1 (Literals)
22    ExpInt["ExpInt<br/>Integer Literal<br/>42"]
23    ExpBool["ExpBool<br/>Boolean Literal<br/>true/false"]
24    ExpIdent["ExpIdent<br/>Variable Reference<br/>varName"]
25
26    %% Leaf Expressions - Level 2 (Object/Creation)
27    ExpThis["ExpThis<br/>this"]
28    ExpNewInt["ExpNewInt<br/>New Array<br/>new int[size]"]
29    ExpNewIdent["ExpNewIdent<br/>New Object<br/>new ClassName()"]
30
31    %% Main Structure Relationships
32    Program --> MainClass
33    Program --> ClassDecl
34    ClassDecl --> MethodDecl
35    MethodDecl --> Statement
36    MethodDecl --> Expression
37    Statement --> Expression
38
39    %% Expression to Operation Relationships
40    Expression --> ExpOp
41    Expression --> ExpComOp
42    Expression --> ExpArray
43    Expression --> ExpFCall
44    Expression --> ExpNot
45    Expression --> ExpLength
46
47    %% Operations to Leaf Expressions
48    Expression --> ExpInt
49    Expression --> ExpBool
50    Expression --> ExpIdent
51    Expression --> ExpThis
52    Expression --> ExpNewInt
53    Expression --> ExpNewIdent
54
55    %% Expression recursion - showing how expressions can contain other expressions
56    ExpOp --> |left operand| Expression
57    ExpOp --> |right operand| Expression
58    ExpComOp --> |left operand| Expression
59    ExpComOp --> |right operand| Expression
60    ExpArray --> |array expr| Expression
61    ExpArray --> |index expr| Expression
62    ExpFCall --> |object expr| Expression
63    ExpFCall --> |arguments| Expression
64    ExpNot --> |operand| Expression
65    ExpLength --> |array expr| Expression
66    ExpNewInt --> |size expr| Expression
67
68    %% Invisible relationships to organize leaf nodes in levels
69    %% Level 2
70    ExpOp ~~~ ExpNot
71    ExpComOp ~~~ ExpLength
72    ExpFCall ~~~ ExpArray
73    %% Level 3
74    ExpNot ~~~ ExpInt
75    ExpNot ~~~ ExpBool
76    ExpLength ~~~ ExpIdent
77    ExpLength ~~~ ExpThis
78    ExpArray ~~~ ExpNewInt
79    ExpArray ~~~ ExpNewIdent
80
81    %% Styling
82    classDef expressionNode fill:#e1f5fe,stroke:#0277bd,color:#333
83    classDef literalNode fill:#f3e5f5,stroke:#7b1fa2,color:#333
84    classDef operatorNode fill:#fff3e0,stroke:#ef6c00,color:#333
85    classDef structureNode fill:#e8f5e8,stroke:#2e7d32,color:#333
86
87    class Expression,ExpOp,ExpComOp,ExpArray,ExpFCall,ExpNot,ExpLength expressionNode
88    class ExpInt,ExpBool,ExpIdent,ExpThis literalNode
89    class ExpNewInt,ExpNewIdent operatorNode
90    class Program,MainClass,ClassDecl,MethodDecl,Statement structureNode
91
92    %% Link styling for visibility on both light and dark backgrounds
93    %% Regular connections - visible with good contrast
94    linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 stroke:#666666,stroke-width:2px
95
96    %% Invisible connections - transparent
97    linkStyle 29,30,31,32,33,34,35,36,37 stroke:transparent,stroke-width:0px

Key Expression Categories

1. Binary Operations (ExpOp, ExpComOp)

  • Arithmetic: +, -, *
  • Logical: &&
  • Comparison: <
  • Both operands are expressions (recursive structure)

2. Array Operations (ExpArray, ExpLength, ExpNewInt)

  • Array Access: arr[index] - accesses array element
  • Array Length: arr.length - gets array size
  • Array Creation: new int[size] - creates new array

3. Method Calls (ExpFCall)

  • Object-oriented method invocation: obj.method(arg1, arg2, ...)
  • Receiver expression + method name + argument list

4. Object Operations (ExpNewIdent, ExpThis)

  • Object Creation: new ClassName() - instantiate new object
  • This Reference: this - reference to current object

5. Primitive Values (ExpInt, ExpBool, ExpIdent)

  • Literals: Integer and boolean constants
  • Variables: Identifier references

6. Unary Operations (ExpNot)

  • Logical Negation: !expr - boolean negation

The Heart of the Interpreter: Making it Run

With a beautiful AST in hand, the real magic could begin: evaluation. The core of the interpreter is a set of recursive functions that walk the tree and “do” what the code says.

The state of our “Javaish” program—variables, class definitions, objects—was managed in a few key data structures. The Value type shows how we could represent all the possible data types within our interpreted language.

1-- From interpreter.hs
2data Value
3  = VInt    Int
4  | VFloat  Float
5  | VString String
6  | VObj    ObjInfo
7  | VBool   Bool
8  | VArray  Int [ Value ]
9  deriving( Eq, Show )

The main workhorse was the interp_stmt function. It’s essentially a giant case statement that uses pattern matching on the AST. This is where the elegance of Haskell really shines. You can see how cleanly it handles an if/else statement:

 1-- From interpreter.hs
 2interp_stmt ( Stmt stmt p ) =
 3  case stmt of
 4    -- { <stmts> }
 5    SList stmts -> forM_ stmts interp_stmt
 6
 7    -- if <test> then <thenStmt> else <elseStmt>
 8    SIfElse test thenStmt elseStmt -> do
 9        VBool testTrue <- interp_exp test
10        if testTrue then
11          interp_stmt thenStmt
12        else
13          interp_stmt elseStmt
14
15    -- ... and cases for all other statement types

Using Monads for State Management

Some say only the weak-minded require monads to manage state in their Haskell code. But good luck avoiding them entirely!

💡 The Impurity of State

Haskell is a “pure” functional language, wherein a function’s outputs depend solely on its inputs, with no side effects allowed. But interpreters need to manage state (like changing the value of a variable). To accomplish that, I had to venture into the “impure” world using IORef, which is basically a mutable reference.

Call me “impure” all you want, but the IO monad was an essential part of the interpreter.hs file to manage the mutable call stack and object field tables. Those are necessary for simulating the state changes inherent in an imperative language like Javaish within Haskell’s pure functional paradigm.

I imported the Data.IORef module, which provides the tools for mutable references in Haskell’s IO monad.

  1. ObjInfo Data Type (indirect reference):

    1data ObjInfo = OI( Maybe ( Ident, ObjInfo, IORef SymbolTable ) )
    2  deriving( Eq )
    

    The ObjInfo data type, which defines the structure of an object, includes an IORef SymbolTable. This means that the fields of an object are stored in a mutable SymbolTable reference.

  2. interpret Function (initialization of call stack):

    1interpret classes whole_program = do
    2  -- each frame of the call stack includes:
    3  --  - "this"'s class
    4  --  - "this"'s field table
    5  --  - local variable table for the current call
    6  -- callStack :: IORef [ ( Ident, IORef SymbolTable, SymbolTable ) ] -- Original commented out type
    7  callStack <- newIORef []
    

    Here, callStack is initialized as an IORef holding a list of tuples representing call frames. newIORef [] creates an empty mutable reference.

  3. interp_prog Function (initialization of thisFields):

    1interp_prog ( Program ( MClass name args body _ ) _ _ ) = do
    2    thisFields <- newIORef Map.empty
    3    -- TODO: add real args to local state
    4    writeIORef callStack [ ( VObj (OI (Just (name, OI Nothing, thisFields) ) ), Map.singleton args ( VString "" ) ) ]
    5    interp_stmt body
    

    thisFields is created as an IORef to hold the fields of the this object for the main class. Its value is then immediately written into the callStack.

  4. SAssign Statement (writing to locals or fields):

     1SAssign lhs rhs -> do
     2    v <- interp_exp rhs
     3    ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals ) : s ) <- readIORef callStack
     4    case Map.lookup lhs locals of
     5        Just _ ->
     6            let locals' = Map.insert lhs v locals in
     7            writeIORef callStack ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals' ) : s )
     8        Nothing -> do
     9            fields <- readIORef fieldsRef
    10            let fields' = Map.insert lhs v fields
    11            writeIORef fieldsRef fields'
    

    This is a crucial part. When an assignment happens:

    • readIORef callStack retrieves the current call stack.
    • fieldsRef is extracted from the current this object (which itself is an IORef SymbolTable).
    • If lhs is a local variable, writeIORef callStack updates the stack with the new local variable map.
    • If lhs is a field, readIORef fieldsRef gets the current fields, Map.insert updates them, and writeIORef fieldsRef commits the change back to the mutable field table.
  5. SArrayAssign Statement (writing to array elements):

     1SArrayAssign lhs idx rhs -> do
     2    VInt i <- interp_exp idx
     3    v <- interp_exp rhs
     4    ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals ) : s ) <- readIORef callStack
     5    case Map.lookup lhs locals of
     6        Just ( VArray length items ) ->
     7            -- ... (updates local array) ...
     8            writeIORef callStack ( ( VObj (OI (Just ( c, parent, fieldsRef ) ) ), locals' ) : s )
     9        Nothing -> do
    10            fields <- readIORef fieldsRef
    11            let Just ( VArray length items ) = Map.lookup lhs fields
    12            -- ... (updates field array) ...
    13            writeIORef fieldsRef fields'
    

    Similar to SAssign, this block uses readIORef and writeIORef to update either local array variables (by updating the entire local map on the stack) or object fields (by updating the fieldsRef directly).

  6. ExpIdent Expression (reading variable values):

    1ExpIdent var -> do
    2    ( ( VObj (OI (Just ( _, _, thisFields ) ) ), locals ) : _ ) <- readIORef callStack
    3    case Map.lookup var locals of
    4        Just val -> return val
    5        Nothing -> do
    6            fieldTable <- readIORef thisFields
    7            let Just val = Map.lookup var fieldTable
    8            return val
    

    When an identifier is evaluated, it first checks local variables. If not found, it uses readIORef thisFields to look up the variable in the object’s mutable field table.

  7. create_obj Function (creating new object instances):

    1create_obj cname = do
    2  let Just( parent_name, fields, _ ) = Map.lookup cname classes
    3  fieldTable <- newIORef fields
    4  parent <- make_parent_from_name parent_name
    5  return $ VObj $ OI $ Just( cname, parent, fieldTable )
    

    When a new object is created, its fieldTable is initialized as a new IORef holding the default field values for that class.

In essence, IORef is used throughout interpreter.hs to manage the dynamic, mutable state (call stack and object fields) that is fundamental to interpreting an imperative language.

All Together Now: Running “Javaish” Code

After weeks of work, it was time for the moment of truth. Our professor provided a test file, test.javaish, that used classes, inheritance, and method calls to see if everything worked.

 1// From test.javaish
 2class parent {
 3    int g;
 4
 5    public int f1(int a) {
 6        { System.out.println( 2 ); }
 7        return a;
 8    }
 9}
10
11class kiddo extends parent {
12
13    public int f2(int b) {
14        { System.out.println( 3 ); }
15        return this.f1(b);
16    }
17}
18
19class T01 {
20    public static void main( String [] args )
21    {
22        System.out.println( (new kiddo()).f2(1) );
23    }
24}

This program creates a kiddo object, calls its f2 method (which prints 3), which in turn calls the parent’s f1 method (which prints 2), and finally prints the return value (1).

Seeing the correct output appear on my screen after piping this file into my compiled Haskell program was a satisfying moment.

Final Thoughts

This project was more than just a final assignment; it was bringing concepts into practice I had learned in CP404 Theory of Computation. It forced me to grapple with complex ideas like parsing, evaluation, state management, and functional programming paradigms (including mutability with monads!).

It taught me that the languages we use fundamentally shape how we think about problems. And sometimes, the best way to understand a language is to try and build one yourself.

More Resources

Source Code

The complete source code, including the tokenizer, parser, and interpreter, is available on GitHub. You can see the full progression from Day 1 through the end of the course. Feel free to browse, but please don’t judge my 2017-era Haskell too harshly!

View Source Code

Related Posts

Comments