A Parser Combinator for Java 8
Parsing source code, configuration files and document formats can be cumbersome in Java. You'll either have to reach for a code-generating tool such as ANTLR or write a bunch of ad-hoc code that shows little resemblance to the grammar you're trying to implement.
Parser Combinators to the Rescue
A parser combinator is simply a library containing a few primitives to parse things like strings and numbers, and a small set of functions that combine smaller parsers into more complex parsers.
For example, if you have two parsers a
and b
, then you can combine them using the following methods:
a.then(b)
- Parse ana
followed by ab
.a.or(b)
- Parse ana
if possible, otherwise parse ab
.a.zeroOrMore(b)
- Parse zero or more ofa
, separated byb
.
There are two primitive parsers: string(",")
, which parses the exact string ","
, and regex("[0-9]+")
, which in this case parses one or more digits. We can parse a commma-separated list of integers by combining these two parses wtih zeroOrMore
:
regex("[0-9]+").zeroOrMore(string(","))
This parser accepts syntax like the following:
2,3,5,7,11,13,17
It'd be natural for our parser to output a list of integers. The regex("[0-9]+")
parser results in a java.util.regex.MatchResult
, which can be used to extract the capturing groups from the regular expression. In this case, we just need group(0)
, eg. everything that the regular expression matched. Like the new java.util.stream.Stream
, the Parser
supports the map(...)
method, which we can use to convert the result of the parse into an integer, using Integer.parseInt
:
Parser<Integer> integerP = regex("[0-9]+").
map(m -> Integer.parseInt(m.group(0)));
The Parser<Integer>
type means "a successful parse produces an integer". I'm using the P
suffix for parsers here, to make it clear which variables refer to parsers. We can then parse the list of integers, and get a List<Integer>
back:
Parser<List<Integer>> listP = integerP.zeroOrMore(string(","));
In many languages, lists are enclosed in brackets []
. To require their presence without making them part of the output, we can use skip
:
skip(string("[")).then(listP).skip(string("]"))
Now we can parse a list of integers surrounded by brackets:
[2,3,5,7,11,13,17]
And the result is still a List<Integer>
.
Combining Alternatives
JavaScript supports heterogeneous lists, ie. where the elements don't have the same type.
[1,2,3,"foo",4,5,"bar"]
To parse this, we'll first need to be able to parse strings:
Parser<String> stringP =
regex("\"([^\"\\\\]*|\\\\[\"\\\\trnbf\\/]|\\\\u[0-9a-f]{4})*\"").
map(m -> m.group(1));
Ugh. Regex. But now we can combine this with .or()
to parse the elements:
Parser<Object> elementP = stringP.or(integerP);
The only common supertype of Integer
and String
is Object
. To avoid throwing out type safety, we'll have to introduce a syntax tree later on, and return that instead of Object
. But for now, let's just complete our parser for heterogenous lists:
Parser<List<Object>> listP =
skip(string("[")).
then(elementP.zeroOrMore(string(","))).
skip(string("]"));
Invoking the parser on a string is straightforward:
String input = "[1,2,3,\"foo\",4,5,\"bar\"]";
List<Object> result = listP.parse(input);
If the parser doesn't recognize the input, parse()
will throw a Parser.Failure
exception. Alternatively, we can use tryParse()
, which returns java.util.Optional.empty()
on failure.
Parsing JSON
To get a feel of what it's like to use the parse generator, we will write a parser for JSON. First we'll look at the complete code for the parser, and then I'll explain it. To store the parsed JSON, I'm going to use the Json syntax tree.
static Parser<?> token(String keyword) {
return regex(Pattern.quote(keyword) + "[\\s\\r\\n]*");
}
static Parser<Json> valueP =
token("null").<Json>map(t -> new JsonNull()).
or(token("true").map(t -> new JsonBoolean(true))).
or(token("false").map(t -> new JsonBoolean(false))).
or(() -> JsonParser.stringP.map(JsonString::new)).
or(() -> JsonParser.numberP.map(JsonNumber::new)).
or(() -> JsonParser.objectP).
or(() -> JsonParser.arrayP);
static Parser<String> stringP =
regex("\"([^\"\\\\]*|\\\\[\"\\\\trnbf\\/]|\\\\u[0-9a-f]{4})*\"[\\s\\r\\n]*").
map(m -> StringEscapeUtils.unescapeEcmaScript(m.group(1)));
static Parser<Double> numberP =
regex("(-?(0|[1-9][0-9]*)([.][0-9]+)?([eE][+-]?[0-9]*)?)[\\s\\r\\n]*").
map(m -> Double.parseDouble(m.group(1)));
static Parser<Pair<String, Json>> fieldP =
stringP.skip(token(":")).then(valueP);
static Parser<Json> objectP =
skip(token("{")).then(fieldP.zeroOrMore(token(","))).skip(token("}")).
map(JsonObject::new);
static Parser<Json> arrayP =
skip(token("[")).then(valueP.zeroOrMore(token(","))).skip(token("]")).
map(JsonArray::new);
public static Parser<Json> jsonP =
skip(regex("[\\s\\r\\n]*")).
then(valueP).
skip(regex("$"));
The token
function simply parses a string and then skips whitespace. Since we don't care about the parsed value, we'll use a wildcard type Parser<?>
.
The valueP
parser lists all the different kinds of values that JSON supports. First it will try to parse null
, and if that fails, it'll try to parse true
and so on. To help the compiler, we'll need to be explicit about the <Json>
type for the first case of the or
chain, or it would expect the rest of the cases to be JsonNull
parsers instead of Json
parsers.
It refers to some parses that are defined below valueP
, and it order to make such forward references work, we delay the evaluation by wrapping them in a no-argument function () -> ...
. We also need to explicitly mention the JsonParser
class for forward references, or the compiler will stop us.
The numberP
and stringP
parsers are simply regular expressions. The final conversion for the stringP
case, namely unescapeEcmaScript
, comes from the Apache commons-lang library.
The arrayP
is similar to what we've already seen. objectP
is more of the same, but requires zero or more fields, each consisting of a string, a colon and a value, which is what fieldP
parses.
Finally, the jsonP
parser skips initial whitespace, parses a json value using valueP
and finally requires that we're at the end of the file with regex("$")
, to ensure that we parsed the entire string.
Thus, with this screenful of code, you've got a JSON parser:
String input = "{ \"name\": \"anna\", \"age\": 21, \"interests\": [\"diving\", \"programming\"] }";
Json output = JsonParser.jsonP.parse(input);
Caveats
This is a proof of concept. While it works, it's severely lacking in error reporting. Fixing this is primarily a matter of limiting backtracking (like Parsec does with its try
primitive). Adding this will not require modifying the JSON parser we wrote above, since JSON doesn't need backtracking at all.
The full project is available on GitHub: https://github.com/Ahnfelt/parsercombinator
Comments? Catch me on Twitter! I'm @Continuational
.