A Parser Combinator for Java 8

Parsing source code, configuration files and document formats can be cumbersome in Java. You'll either have to reach for a code-generating tool such as ANTLR or write a bunch of ad-hoc code that shows little resemblance to the grammar you're trying to implement.

Parser Combinators to the Rescue

A parser combinator is simply a library containing a few primitives to parse things like strings and numbers, and a small set of functions that combine smaller parsers into more complex parsers.

For example, if you have two parsers a and b, then you can combine them using the following methods:

  • a.then(b) - Parse an a followed by a b.
  • a.or(b) - Parse an a if possible, otherwise parse a b.
  • a.zeroOrMore(b) - Parse zero or more of a, separated by b.

There are two primitive parsers: string(","), which parses the exact string ",", and regex("[0-9]+"), which in this case parses one or more digits. We can parse a commma-separated list of integers by combining these two parses wtih zeroOrMore:

regex("[0-9]+").zeroOrMore(string(","))

This parser accepts syntax like the following:

2,3,5,7,11,13,17

It'd be natural for our parser to output a list of integers. The regex("[0-9]+") parser results in a java.util.regex.MatchResult, which can be used to extract the capturing groups from the regular expression. In this case, we just need group(0), eg. everything that the regular expression matched. Like the new java.util.stream.Stream, the Parser supports the map(...) method, which we can use to convert the result of the parse into an integer, using Integer.parseInt:

Parser<Integer> integerP = regex("[0-9]+").
    map(m -> Integer.parseInt(m.group(0)));

The Parser<Integer> type means "a successful parse produces an integer". I'm using the P suffix for parsers here, to make it clear which variables refer to parsers. We can then parse the list of integers, and get a List<Integer> back:

Parser<List<Integer>> listP = integerP.zeroOrMore(string(","));

In many languages, lists are enclosed in brackets []. To require their presence without making them part of the output, we can use skip:

skip(string("[")).then(listP).skip(string("]"))

Now we can parse a list of integers surrounded by brackets:

[2,3,5,7,11,13,17]

And the result is still a List<Integer>.

Combining Alternatives

JavaScript supports heterogeneous lists, ie. where the elements don't have the same type.

[1,2,3,"foo",4,5,"bar"]

To parse this, we'll first need to be able to parse strings:

Parser<String> stringP = 
    regex("\"([^\"\\\\]*|\\\\[\"\\\\trnbf\\/]|\\\\u[0-9a-f]{4})*\"").
    map(m -> m.group(1));

Ugh. Regex. But now we can combine this with .or() to parse the elements:

Parser<Object> elementP = stringP.or(integerP);

The only common supertype of Integer and String is Object. To avoid throwing out type safety, we'll have to introduce a syntax tree later on, and return that instead of Object. But for now, let's just complete our parser for heterogenous lists:

Parser<List<Object>> listP =
    skip(string("[")).
    then(elementP.zeroOrMore(string(","))).
    skip(string("]"));

Invoking the parser on a string is straightforward:

String input = "[1,2,3,\"foo\",4,5,\"bar\"]";
List<Object> result = listP.parse(input);

If the parser doesn't recognize the input, parse() will throw a Parser.Failure exception. Alternatively, we can use tryParse(), which returns java.util.Optional.empty() on failure.

Parsing JSON

To get a feel of what it's like to use the parse generator, we will write a parser for JSON. First we'll look at the complete code for the parser, and then I'll explain it. To store the parsed JSON, I'm going to use the Json syntax tree.

static Parser<?> token(String keyword) {
    return regex(Pattern.quote(keyword) + "[\\s\\r\\n]*");
}

static Parser<Json> valueP =
    token("null").<Json>map(t -> new JsonNull()).
    or(token("true").map(t -> new JsonBoolean(true))).
    or(token("false").map(t -> new JsonBoolean(false))).
    or(() -> JsonParser.stringP.map(JsonString::new)).
    or(() -> JsonParser.numberP.map(JsonNumber::new)).
    or(() -> JsonParser.objectP).
    or(() -> JsonParser.arrayP);

static Parser<String> stringP =
    regex("\"([^\"\\\\]*|\\\\[\"\\\\trnbf\\/]|\\\\u[0-9a-f]{4})*\"[\\s\\r\\n]*").
    map(m -> StringEscapeUtils.unescapeEcmaScript(m.group(1)));

static Parser<Double> numberP =
    regex("(-?(0|[1-9][0-9]*)([.][0-9]+)?([eE][+-]?[0-9]*)?)[\\s\\r\\n]*").
    map(m -> Double.parseDouble(m.group(1)));

static Parser<Pair<String, Json>> fieldP =
    stringP.skip(token(":")).then(valueP);

static Parser<Json> objectP =
    skip(token("{")).then(fieldP.zeroOrMore(token(","))).skip(token("}")).
    map(JsonObject::new);

static Parser<Json> arrayP =
    skip(token("[")).then(valueP.zeroOrMore(token(","))).skip(token("]")).
    map(JsonArray::new);

public static Parser<Json> jsonP =
    skip(regex("[\\s\\r\\n]*")).
    then(valueP).
    skip(regex("$"));

The token function simply parses a string and then skips whitespace. Since we don't care about the parsed value, we'll use a wildcard type Parser<?>.

The valueP parser lists all the different kinds of values that JSON supports. First it will try to parse null, and if that fails, it'll try to parse true and so on. To help the compiler, we'll need to be explicit about the <Json> type for the first case of the or chain, or it would expect the rest of the cases to be JsonNull parsers instead of Json parsers.

It refers to some parses that are defined below valueP, and it order to make such forward references work, we delay the evaluation by wrapping them in a no-argument function () -> .... We also need to explicitly mention the JsonParser class for forward references, or the compiler will stop us.

The numberP and stringP parsers are simply regular expressions. The final conversion for the stringP case, namely unescapeEcmaScript, comes from the Apache commons-lang library.

The arrayP is similar to what we've already seen. objectP is more of the same, but requires zero or more fields, each consisting of a string, a colon and a value, which is what fieldP parses.

Finally, the jsonP parser skips initial whitespace, parses a json value using valueP and finally requires that we're at the end of the file with regex("$"), to ensure that we parsed the entire string.

Thus, with this screenful of code, you've got a JSON parser:

String input = "{ \"name\": \"anna\", \"age\": 21, \"interests\": [\"diving\", \"programming\"] }";
Json output = JsonParser.jsonP.parse(input);

Caveats

This is a proof of concept. While it works, it's severely lacking in error reporting. Fixing this is primarily a matter of limiting backtracking (like Parsec does with its try primitive). Adding this will not require modifying the JSON parser we wrote above, since JSON doesn't need backtracking at all.

The full project is available on GitHub: https://github.com/Ahnfelt/parsercombinator

Comments? Catch me on Twitter! I'm @Continuational.