alJabr: An example interpreter project using Maven and ANTLR

Introduction
Maven Quickstart
- Maven phases and build lifecycles
- POM Changes
Boolean expressions
ANTLR quickstart
Specifying our boolean language
Integrating with Maven
Representing expressions
Translating between ANTLR and our internal representation
Testing the parsing machinery
Creating the REPL
Implementing our boolean reducer
Modify the REPL to use the new reducer
Future work and extensions

In this post, we build an interpreter over boolean expressions using ANTLR and Maven. The code for the project is hosted on GitHub, released under the GPL version 3.0 license. While the project contains a manifest.scm file to create a development environment via Guix, this post assumes you know how to install various necessary software dependencies and tools for your choice operating system.

Introduction

Often in software projects, a common problem involves parsing and processing different input files or formats to accomplish different tasks. Many software developers, for a myriad of reasons implement custom, hand crafted parsers for these problems. In this post, we demonstrate a different approach for creating parsing machinery for a small invented language over boolean expressions. Specifically, we'll use "ANother Tool for Language Recognition" (ANTLR) to generate all the parsing machinery for us. We add a few additional pieces, specifically a visitor, to translate the models from ANTLR into our own representation. Finally, we implement a processor that reduces our boolean expressions down to a single boolean value. We close with some interesting extensions or exercises to extend the interpreter into a, perhaps, more useful tool.

Maven Quickstart

We begin with a new Java project. Maven is an established and venerable tool for working with Java or JVM hosted projects. We opt for choosing a tool like Maven because it captures the steps for building our project declaratively within its "Project Object Model" or POM, and we can use that anywhere we have an appropriate version of Java and Maven. For example, the final source code uses Maven within GitHub Actions, a continuous integration environment hosted and available on GitHub, to build and test the project.

Maven can generate the initial project structure using the following command:

mvn archetype:generate \
    -DgroupId=dev.fmsea.aljabr \
    -DartifactId=alJabr \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DarchetypeVersion=1.5 \
    -DinteractiveMode=false

This generates the following directory structure:

aljabr
|-- pom.xml
`-- src
    |-- main
    |   `-- java
    |       `-- dev
    |           `-- fmsea
    |               `-- aljabr
    |                   `-- App.java
    `-- test
        `-- java
            `-- dev
                `-- fmsea
                    `-- aljabr
                        `-- AppTest.java

The main subtree contains our project code, and the test subtree contains our, unimaginatively named, test code.

As a result of our newly generated project, we should be able to compile and test this code using the following Maven command:

mvn test

Maven phases and build lifecycles

Unlike tools like Ant or Make, Maven comes with a predefined set of phases or "targets" to use the former parlance. These phases help construct a project from dependency resolution, classpath modification, and the necessary flags to compile, test, package, and install or deploy the project. Here is a short list of standard phases:

validate
compile
test
package
verify
install
deploy

See the documentation to learn more about Maven Lifecycles.

POM Changes

While the above command does indeed generate a lot of the necessary pom.xml file, there are a few changes that may be necessary to make the project work. Specifically, the archetype generates a POM with build -> pluginManagement. This may be a misunderstanding on my part, but in my experience, I need to lift the plugins under pluginManagement to have everything work as desired. Afterwards, I remove the now empty pluginManagement portion of the file.

Boolean expressions

The language we use for this demonstration is simply the boolean expressions we are used to within most programming languages. We have logical or, disjunctions; logical and, conjunctions, and logical negation. The truth tables for each are given below.

Table 1: Truth table for Negation
A	\(\lnot A\)
T	F
F	T

Table 2: Truth table for Disjunction
A	B	\(A \lor B\)
T	T	T
T	F	T
F	T	T
F	F	F

Table 3: Truth table for Conjunction
A	B	\(A \land B\)
T	T	T
T	F	F
F	T	F
F	F	F

While it may seem pedantic to list these out, we want to be sure of their values as we implement our "interperter" over these values. Moreover, depending on how far we take these ideas, we may want to demonstrate and add additional operators, e.g., inference, distribution, equality tests between expressions, etc. This is all great, but we need to define our actual language.

ANTLR quickstart

As mentioned in the introduction, ANTLR is a parser generator which we can use to build the machinery for parsing our boolean expression language and creating the initial steps of our interpreter. ANTLR uses a BNF-like syntax to specify our input language, which ANTLR consumes to generate the parser and its representation of our language. The following image is an example from the ANTLR website:

"Example from the ANTLR website"

A full introduction to BNF and language grammars is out of scope for this post.

Specifying our boolean language

Before we can construct even a handcrafted parser, we need to know what our language looks like. Since our initial language goals are pretty straightforward, let's simply define a few typical expressions our language should contain:

At the bare minimum, we need a way to express literal boolean values, true, and false. For literals, we could simply use 0 and 1, or f and t, or an uppercase alternative. However, to both demonstrate the capabilities of ANTLR and since we may want to extend our language to support variable binding, for this language we opt for the unconventional choice of #f and #t to represent false and true, respectively.

Next, we need to decide on our representation of logical negation. For this, we shall use a simple not which accepts a single operand.

Similarly, we need to decide on our representation of connectives. While it would be fun to use the literal UTF-8 characters mathematical symbols for conjunctions and disjunctions, e.g., ∧ and ∨, these characters are a little painful to type. Therefore, we choose a more friendly syntax: and and or. Both of these operators are binary connectives which only take two operands each.

Finally, we need to sort out the "tightness" or the precedence of the operators. While I am certainly inclined to just make it a LISP and not worry about it, we shall not do that in this case.

Typically, we have the following for the preorder of operator precedence: \(\text{not} \sqsupset \text{and} \sqsupset \text{or}\). That is, our negation operator has the highest precedence, conjunction medium precedence, and disjunctions have the lowest. We shall continue with this typical ordering.

As an exercise, modify the grammar to flip the ordering of precedence.

Therefore, the grammar for our language looks something like the following:

bterm
: '#f'
| '#t'
bexpr
: 'not' bexpr
| bexpr 'and' bexpr
| bexpr 'or' bexpr
| bterm

We can translate this into ANTLR's syntax:

grammar BoolExpr;

WS
    : [ \t\n]+ -> skip
    ;

NotOp : 'not' ;

AndOp : 'and' ;

OrOp : 'or' ;

BoolTerm
    : '#f'
    | '#t'
    ;

bexpr
    : NotOp bexpr # Negation
    | bexpr AndOp bexpr # AndExpr
    | bexpr OrOp bexpr # OrExpr
    | BoolTerm # BoolTerm
    ;

expr : bexpr* EOF ;

The first line provides a name for the root of the language expressions. The next rule provides a "skip" rule for different whitespace characters that do not contribute to our language. The next three rules describe the different operators. The next rule describes our literal boolean terms. The bexpr rule provides the production rules for each of the types of expressions our interpreter parses. Moreover, the "# Negation" bits at the end "name" the production rule; we use these later when creating our representation of the language. Finally, the expr rule provides a "root" rule we use to ensure ANTLR handles the end of input gracefully. Notice, we are currently only interested in the bexpr productions.

Add the contents of the grammar to the following file within the root of the project: src/main/antlr4/dev/fmsea/aljabr/BoolExpr.g4.

Integrating with Maven

We need to add a few dependencies to our project so that we can compile against the generated ANTLR code:

<dependency>
  <groupId>org.antlr</groupId>
  <artifactId>antlr4</artifactId>
  <version>4.13.2</version>
</dependency>
<dependency>
  <groupId>org.antlr</groupId>
  <artifactId>antlr4-runtime</artifactId>
  <version>4.13.2</version>
</dependency>

To wire in the process of generating Java code from the ANTLR grammar and add the generated code to our compilation targets, we need to add some plugins to the pom.xml file of our project. Specifically, we need the following plugin in the build -> plugins portion of the pom.xml:

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>build-helper-maven-plugin</artifactId>
  <version>1.7</version>
  <executions>
    <execution>
      <id>add-source</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>add-source</goal>
      </goals>
      <configuration>
        <sources>
          <source>src/main/java</source>
        </sources>
      </configuration>
    </execution>
  </executions>
</plugin>

This plugin adds generated sources into our compilation target tree. Next, we need the ANTLR plugin for generating the parsing machinery. Add the following plugin to the same section of the pom.xml:

<plugin>
  <groupId>org.antlr</groupId>
  <artifactId>antlr4-maven-plugin</artifactId>
  <version>4.13.2</version>
  <configuration>
    <visitor>true</visitor>
    <listener>true</listener>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>antlr4</goal>
      </goals>
    </execution>
  </executions>
</plugin>

With these changes, we should be able to test our compilation process.

mvn clean compile

Notice, that in the compilation process, Maven comments about compiling seven source files, while our project only actually has one. If this is not the case, stop here and make sure the process is working. The additional six source files should be found in the following folder after compilation: target/generated-sources/antlr4/dev/fmsea/aljabr/.

Representing expressions

While we could directly use ANTLR's representation for our objects, it is often better to create our own internal representation.

We start with the abstract base class that is essentially the representation of the final rule which represents the "root" of all expressions.

package dev.fmsea.aljabr;

public abstract class BoolExpr {
    public interface Visitor<R> {
    }

    public abstract <R> R accept(Visitor<R> visitor);
}

The first thing we add is an empty Visitor interface which is used for processing expressions; we shall add to it later.

Let's create an initial visitor which translates expressions back into their string representation. Under a new package called visitors, create the following class:

package dev.fmsea.aljabr.visitors;

import dev.fmsea.aljabr.BoolExpr;

public class StringReprVisitor implements BoolExpr.Visitor<String> {
}

Now, we can add an override method for toString in our BoolExpr:

@Override
public String toString() {
    return this.accept(new StringReprVisitor());
}

Now, we need to create representations for each of our different major expression types, Negation, AndExpr, OrExpr, and our literal values.

package dev.fmsea.aljabr;

public class NegationExpr extends BoolExpr {
    public final BoolExpr inner;

    public NegationExpr(BoolExpr inner) {
        this.inner = inner;
    }

    public <R> R accept(BoolExpr.Visitor<R> visitor) {
        return visitor.visitNegation(this);
    }
}

package dev.fmsea.aljabr;

public class AndExpr extends BoolExpr {
    public final BoolExpr left;
    public final BoolExpr right;

    public AndExpr(BoolExpr left, BoolExpr right) {
        this.left = left;
        this.right = right;
    }

    public <R> R accept(BoolExpr.Visitor<R> visitor) {
        return visitor.visitAnd(this);
    }
}

package dev.fmsea.aljabr;

public class OrExpr extends BoolExpr {
    public final BoolExpr left;
    public final BoolExpr right;

    public OrExpr(BoolExpr left, BoolExpr right) {
        this.left = left;
        this.right = right;
    }

    public <R> R accept(BoolExpr.Visitor<R> visitor) {
        return visitor.visitOr(this);
    }
}

package dev.fmsea.aljabr;

public abstract class BoolTerm extends BoolExpr {
    public final boolean value;

    public BoolTerm(boolean value) {
        this.value = value;
    }
}

package dev.fmsea.aljabr;

public class FalseValue extends BoolTerm {

    public FalseValue() {
        super(false);
    }

    public <R> R accept(BoolExpr.Visitor<R> visitor) {
        return visitor.visitFalse(this);
    }
}

package dev.fmsea.aljabr;

public class TrueValue extends BoolTerm {

    public TrueValue() {
        super(true);
    }

    public <R> R accept(BoolExpr.Visitor<R> visitor) {
        return visitor.visitTrue(this);
    }
}

Now that we have some elements of our tree, we must update the visitor interface and our string representation visitor to ensure everything is implemented. Add the following signatures to the visitor interface of the BoolExpr class.

public interface Visitor<R> {
    R visitTrue(TrueValue trueValue);
    R visitFalse(FalseValue falseValue);
    R visitNegation(NegationExpr negation);
    R visitAnd(AndExpr and);
    R visitOr(OrExpr or);
}

Now, modify the StringReprVisitor to implement each of these methods.

package dev.fmsea.aljabr.visitors;

import dev.fmsea.aljabr.AndExpr;
import dev.fmsea.aljabr.BoolExpr;
import dev.fmsea.aljabr.FalseValue;
import dev.fmsea.aljabr.NegationExpr;
import dev.fmsea.aljabr.OrExpr;
import dev.fmsea.aljabr.TrueValue;

public class StringReprVisitor implements BoolExpr.Visitor<String> {

    public String visitTrue(TrueValue trueValue) {
        return "#t";
    }

    public String visitFalse(FalseValue falseValue) {
        return "#f";
    }

    public String visitNegation(NegationExpr negation) {
        return String.format("not %s", negation.inner.accept(this));
    }

    public String visitAnd(AndExpr and) {
        return String.format("%s and %s",
            and.left.accept(this),
            and.right.accept(this));
    }

    public String visitOr(OrExpr or) {
        return String.format("%s or %s",
            or.left.accept(this),
            or.right.accept(this));
    }
}

Finally, personally, I like to add some convenience methods to the BoolExpr class to make constructing new expressions easier. Add the following to the BoolExpr class:

public static TrueValue newTrue() {
    return new TrueValue();
                                  }

public static FalseValue newFalse() {
    return new FalseValue();
}

public static NegationExpr newNegation(BoolExpr inner) {
    return new NegationExpr(inner);
}

public static AndExpr newAnd(BoolExpr left, BoolExpr right) {
    return new AndExpr(left, right);
}

public static OrExpr newOr(BoolExpr left, BoolExpr right) {
    return new OrExpr(left, right);
}

Translating between ANTLR and our internal representation

Now that we have our own internal representation for our boolean expressions, we need to implement a translator that rewrites terms in the ANTLR representation into our internal representation. Fortunately, this is quite straightforward using the generated visitors from ANTLR.

Let's create a new class called BoolExprInstantiator; it extends the generated base visitor, and it returns BoolExpr's. The names of methods available within the ctx parameter are derived from the grammar and the type of the token. If in doubt, refer to the inner classes of the generated parser object.

package dev.fmsea.aljabr;

public class BoolExprInstantiator extends BoolExprBaseVisitor<BoolExpr> {

    @Override
    public BoolExpr visitBoolTerm(BoolExprParser.BoolTermContext ctx) {
        String term = ctx.getText();
        if (term.equals("#f")) {
            return BoolExpr.newFalse();
        } else if (term.equals("#t")) {
            return BoolExpr.newTrue();
        } else {
            throw new IllegalArgumentException("Invalid value for boolean terms: " + term);
        }
    }

    @Override
    public BoolExpr visitNegation(BoolExprParser.NegationContext ctx) {
        BoolExpr inner = visit(ctx.bexpr());
        return BoolExpr.newNegation(inner);
    }

    @Override
    public BoolExpr visitAndExpr(BoolExprParser.AndExprContext ctx) {
        BoolExpr left = visit(ctx.bexpr(0));
        BoolExpr right = visit(ctx.bexpr(1));
        return BoolExpr.newAnd(left, right);
    }

    @Override
    public BoolExpr visitOrExpr(BoolExprParser.OrExprContext ctx) {
        BoolExpr left = visit(ctx.bexpr(0));
        BoolExpr right = visit(ctx.bexpr(1));
        return BoolExpr.newOr(left, right);
    }
}

Finally, we need to layer up the machinery for translating some "stringy" input through the layers of ANTLR, and into our representation of the expression language.

package dev.fmsea.aljabr;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class BoolExprReader {
    public static BoolExpr parse(String expression) {
        CharStream input = CharStreams.fromString(expression);
        BoolExprLexer lexer = new BoolExprLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        BoolExprParser parser = new BoolExprParser(tokens);
        ParseTree tree = parser.bexpr();
        BoolExprInstantiator instantiator = new BoolExprInstantiator();
        return instantiator.visit(tree);
    }
}

This code is a rather straightforward but not particularly robust. The code steps through the different layers of ANTLR to consume a string: breaks the string down into characters, runs the characters through a lexer, and then the tokens through a parser, ultimately transforming the expression into a parse tree. Once we have the parse tree, we run it through our instantiator to translate it into our representation.

This code assumes all expressions parse without exception; likely a bold assumption given this is an interpreter accepting user input. Furthermore, this code assumes all expressions are rooted in the bexpr production rules of the grammar. If any extensions are added and a new "root" is established, this code necessarily must change.

Testing the parsing machinery

Now that we have the reader and instantiator classes, we should be able to put together some tests to verify that our expression language can indeed parse.

We shall add some parameterized tests that implicitly use the StringReprVisitor to check that the parsed string is equal to the input expression. Create the following file in the test directory, specifically, src/test/java/dev/fmsea/aljabr/BoolExprReaderTest.java.

package dev.fmsea.aljabr;

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.stream.Stream;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

public class BoolExprReaderTest {

    @ParameterizedTest
    @MethodSource("boolexprs")
    void testParse(String expression) {
        assertEquals(expression, BoolExprReader.parse(expression).toString());
    }

    private static Stream<? extends Arguments> boolexprs() {
        return Stream.of(
            "#f",
            "#t",
            "not #f",
            "not #t",
            "#t or #f",
            "#f or #t",
            "not #f and #t",
            "#t or not #f",
            "not #t or not #t"
        ).map(s -> Arguments.of(s));
    }
}

Assuming everything is correct to this point, everything should work with 10 passing tests (the 10th is the generated test from generating the project with Maven).

Creating the REPL

For the penultimate step in our interpreter, we need to create the Read-Eval-Print-Loop or REPL. Afterwards, we modify the generated App class to run the new REPL object.

First, create the new REPL class.

package dev.fmsea.aljabr;

import java.util.Scanner;

public class REPL implements Runnable {

    public void run() {
        try (Scanner scan = new Scanner(System.in)) {
            while (true) {
                System.out.print("> ");
                String expression = scan.nextLine();
                BoolExpr expr = BoolExprReader.parse(expression);
                System.out.println(expr.toString());
            }
        }
    }
}

We shall return to this code after we implement our boolean expression reducer.

Finally, modify the App class to spin up this class.

package dev.fmsea.aljabr;

public class App {
    public static void main(String[] args) {
        REPL repl = new REPL();

        repl.run();
    }
}

You can run the interpreter at this stage. However, it only echos back the input expressions. Unless the expression fails to parse, then it explodes.

Implementing our boolean reducer

For the final part of our interpreter, we need to develop a process for reducing the given expressions down to a single truth value. Following a similar model to our StringReprVisitor, we implement this as another visitor. Therefore, you may wish to start by copying the StringReprVisitor and implement "reductions" for each of the different kinds of expressions.

package dev.fmsea.aljabr.visitors;

import dev.fmsea.aljabr.AndExpr;
import dev.fmsea.aljabr.BoolExpr;
import dev.fmsea.aljabr.BoolTerm;
import dev.fmsea.aljabr.FalseValue;
import dev.fmsea.aljabr.NegationExpr;
import dev.fmsea.aljabr.OrExpr;
import dev.fmsea.aljabr.TrueValue;

public class BoolReducer implements BoolExpr.Visitor<BoolTerm> {

    public BoolTerm visitTrue(TrueValue trueValue) {
        return trueValue;
    }

    public BoolTerm visitFalse(FalseValue falseValue) {
        return falseValue;
    }

    public BoolTerm visitNegation(NegationExpr negation) {
        BoolTerm inner = negation.inner.accept(this);
        return inner.value ? BoolExpr.newFalse() : BoolExpr.newTrue();
    }

    public BoolTerm visitAnd(AndExpr and) {
        BoolTerm left = and.left.accept(this);
        BoolTerm right = and.right.accept(this);
        if (left.value && right.value) {
            return BoolExpr.newTrue();
        } else {
            return BoolExpr.newFalse();
        }
    }

    public BoolTerm visitOr(OrExpr or) {
        BoolTerm left = or.left.accept(this);
        BoolTerm right = or.right.accept(this);
        if (left.value || right.value) {
            return BoolExpr.newTrue();
        } else {
            return BoolExpr.newFalse();
        }
    }
}

While we could push some of the operations back into our model, this representation is fine for now. Next, we should add some tests to ensure that the reducer works as expected and has the appropriate operator precedence.

To make things a little easier, add the following method to the BoolExpr abstract class:

public BoolExpr reduce() {
    return this.accept(new BoolReducer());
}

Next, create the following test class src/test/java/dev/fmsea/aljabr/visitors/BoolReductionrTest.java:

package dev.fmsea.aljabr.visitors;

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.stream.Stream;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import dev.fmsea.aljabr.BoolExprReader;

public class BoolReductionTest {

    @ParameterizedTest
    @MethodSource("boolexprs")
    void testParse(String expected, String expression) {
        assertEquals(expected,
            BoolExprReader.parse(expression).reduce().toString());
    }

    private static Stream<? extends Arguments> boolexprs() {
        return Stream.of(
            Arguments.of("#f", "#f"),
            Arguments.of("#t", "#t"),
            Arguments.of("#t", "not #f"),
            Arguments.of("#f", "not #t"),
            Arguments.of("#t", "#t or #f"),
            Arguments.of("#t", "#f or #t"),
            Arguments.of("#t", "not #f and #t"),
            Arguments.of("#t", "#t or not #f"),
            Arguments.of("#f", "not #t or not #t"),
            Arguments.of("#t", "#t or not #f and #t or #f")
        );
    }
}

Modify the REPL to use the new reducer

As a final step to put the interpreter together, modify the REPL to use the new reduce() method.

BoolExpr expr = BoolExprReader.parse(expression).reduce();

Now, the interpreter reduces the input expressions down to a single boolean value.

Future work and extensions

While this interpreter does not represent a complete or even an ideal implementation of an interpreter, I hope it serves as a reasonable example of the overall approach. Specifically, it demonstrates how to use Maven and ANTLR together. Moreover, it demonstrates the Visitor Pattern and how we can use this pattern to easily rewrite expressons from one representation to another, and modify expressions within a representation. Certainly, it shows, with a small amount of code, we can have a fairly extensible and workable solution for interpreting over a small bespoke language.

There exist plenty of opportunities to improve this interpreter. Clearly, this post does not focus on error handling or interpreter ergonomics. Moreover, this interpreter does not implement short-circuit evaluation over connectives. It fully evaluates both sides then decides the final value. However, certain expressions like #t or #f and #f and #t do not need to be fully evaluated to determine their truth value. As a simple extension, try modifying or extending the reducer to include short-circuit evaluation. Moreover, another extension can be the addition of variable bindings of variable expressions. Introducing function evaluation and inference rules extends this idea further. With each of these additions, a nice final exercise would be to implement the concept reflexivity checker between two expressions. For example, having this, we could show the equivalence between two expressions such as P ⊃ Q and !P ∨ Q.