Introduction

While trying to build a scheme interpreter in rust, I had a hard time getting nom to do what I want and understanding the difference between all the macros it provides.

Don’t get me wrong, it’s an amazing framework, but most of the example parsers are hard to understand, embedded in lots of unrelated code or outdated (e.g. using the chain! macro that has been replaced by do_parse!).

This is my attempt to create some kind of tutorial that starts with the simplest building blocks, explores subtle differences between macros and some mistakes to avoid, and hopefully results in working parser for an R5RS Scheme.

Setup

  1. Make sure you know how to set up and run rust projects crate, at the time of writing this, the newest version is 3.2.0
  2. Take a look at chapter 7 of R5RS to see what the target grammar looks like.
  3. Create a new cargo (binary) project and install the following dependencies
[dependencies]
nom = "3.2.0"
rustyline = "1.0.0"

nom is a parser combinator framework, rustyline an implementation of readline that we’re going to use to create a simple REPL for experimenting with parsers.

// nom defines a ton of macros, make them available here
#[macro_use]
extern crate nom;
extern crate rustyline;

use rustyline::error::ReadlineError;
use rustyline::Editor;

fn main() {
    let mut rl = Editor::<()>::new();
    if let Err(_) = rl.load_history("history.txt") {
        println!("No previous history.");
    }
    loop {
        let readline = rl.readline(">> ");
        match readline {
            Ok(line) => {
                rl.add_history_entry(&line);
                println!("Line: {}", line);
            },
            Err(ReadlineError::Interrupted) => {
                println!("CTRL-C");
                break
            },
            Err(ReadlineError::Eof) => {
                println!("CTRL-D");
                break
            },
            Err(err) => {
                println!("Error: {:?}", err);
                break
            }
        }
    }
    rl.save_history("history.txt").unwrap();
}

The above code is a copy of the example on the rustyline github page.

cargo run now yields a nice REPL (“RPL” would be more accurate).

No previous history.
>> foo
Line: foo
>> bar
Line: bar
>> baz
Line: baz
>>
CTRL-C

Keywords

A good place to start is the section on syntactic and expression keywords. We are going to use three nom macros to parse these, named!, tag! and alt!.

Let’s see what the nom docs have to say about them:

named!: Makes a function from a parser combination

tag!, tag_s!: recognizes a specific suite of characters or bytes. tag!("hello") matches “hello”

alt!: try a list of parsers and return the result of the first successful one

The heading describes alt! as a “Choice Combinator” which sounds pretty smart and could be useful for name-dropping ;)

For now, don’t think too much about what the input and output types of these functions are, we’ll get to it later.

named!(
  syntactic_keyword,
  tag!("else")
);

fn parse(line: &str) {
    let res = syntactic_keyword(line.as_bytes());
    println!("Parsed {:#?}", res);
}

fn main() {
    // ...
        match readline {
            Ok(line) => {
                rl.add_history_entry(&line);
                parse(&line);
            },
    // ...

First, we create a new parser syntactic_keyword that parses only the string (tag!) “else”, then a helper function that takes a line, feeds it to the parser and prints out the result.

.as_bytes() is necessary because nom works on slices of bytes (u8) and &str is a slice of chars (to support unicode).

In case you didn’t know, {:#?} is a formatting literal that can be used to pretty-print the debug version of some variable.

cargo run 1

>> else
Parsed Done(
    [],
    [ 101, 108, 115, 101 ]
)
>> foo
Parsed Error(Tag)
>> elsefoo
Parsed Done(
    [ 102, 111, 111 ],
    [ 101, 108, 115, 101 ]
)
>> els
Parsed Incomplete(Size(4))
>>

What appears to have happened?

All our parsers return a IResult, which, according to the docs, can be one of the following types:

Done(I, O) indicates a correct parsing, the first field containing the rest of the unparsed data, the second field contains the parsed data

Error(Err<E>) contains a Err, an enum that can indicate an error code, a position in the input, and a pointer to another error, making a list of errors in the parsing tree

Incomplete(Needed) Incomplete contains a Needed, an enum than can represent a known quantity of input data, or unknown

This explains all the kinds of output we are seeing in the REPL.

[ 101, 108, 115, 101 ] is just a list of the bytes in "else", 101 is ASCII for ‘e’, etc.

  • “else” can be parsed fully, the I of Done is empty.
  • “foo” can not be parsed and returns an Error
  • “elsefoo” can be parsed up to “foo”, the result is a Done with I = “foo” (as bytes) and O = “else”
  • “els” could be parsed, but there is something missing (a fourth byte)

Other Return Types

Let’s add the remaining keywords:

named!(
    syntactic_keyword,
    alt!(
        expression_keyword | tag!("else") | tag!("=>") |
        tag!("define") | tag!("unquote") | tag!("unquote-splicing")
    )
);

named!(
    expression_keyword,
    alt!(
        tag!("quote") | tag!("lambda") | tag!("if") |
        tag!("set!") | tag!("begin") | tag!("cond") |
        tag!("and") | tag!("or") | tag!("case") |
        tag!("letrec") | tag!("let*") | tag!("let") |
        tag!("do") | tag!("delay") | tag!("quasiquote")
    )
);

This gives us a first glimpse at the power of parser combinators, because alt! can combine any kind of parser (as long as the result types are the same) we can create a second parser expression_keyword and combine it with the tag!s without any problems.

Getting slices of bytes as a result is not that useful (in this case), but luckily there is an easy solution.

#[derive(Debug)]
enum SyntacticKeyword {
    Else, Arrow, Define, Unquote, UnquoteSplicing,
    Expression(ExpressionKeyword)
}

#[derive(Debug)]
enum ExpressionKeyword {
    Quote, Lambda, If, Set, Begin, Cond, And, Or,
    Case, Let, LetStar, LetRec, Do, Delay, Quasiquote
}

Because we are using {:#?} to output the results, both enums need to implement the Debug trait. #[derive(Debug)] tells rust to do this for us.

The last step is to change the result type from &[u8] (a slice of bytes) to SyntacticKeyword or ExpressionKeyword using map! and closures |arg1, arg2, ...| body.

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt!(
        map!(expression_keyword,      |e| SyntacticKeyword::Expression(e)) |
        map!(tag!("else"),            |_| SyntacticKeyword::Else) |
        map!(tag!("=>"),              |_| SyntacticKeyword::Arrow) |
        map!(tag!("define"),          |_| SyntacticKeyword::Define) |
        map!(tag!("unquote"),         |_| SyntacticKeyword::Unquote) |
        map!(tag!("unquote-splicing"),|_| SyntacticKeyword::UnquoteSplicing)
    )
);

named!(
    expression_keyword<ExpressionKeyword>,
    alt!(
        map!(tag!("quote"),     |_| ExpressionKeyword::Quote) |
        map!(tag!("lambda"),    |_| ExpressionKeyword::Lambda) |
        // ...
        map!(tag!("quasiquote"),|_| ExpressionKeyword::Quasiquote)
    )
);

The description of map! is accurate but doesn’t help that much…

map!: maps a function on the result of a parser

… but its type signature tells the whole story:

map!(I -> IResult<I,O>, O -> P) => I -> IResult<I, P>

Thanks to tag! we already have parsers with the type signature &[u8] -> IResult<&[u8], &[u8]> and want syntactic_keyword to be a parser with the type signature &[u8] -> IResult<&[u8], SyntacticKeyword>
(“try to parse a slice of bytes and ideally return a SyntacticKeyword”).

This is exactly what map! does, the only missing piece is the second argument, a function O -> P, in our case &[u8] -> SyntacticKeyword2.

In most cases we don’t need the value for O because the result doesn’t depend on it, so we can “throw it away” with the _ placeholder.

The one special case is

map!(expression_keyword, |e| SyntacticKeyword::Expression(e))

expression_keyword where we need to wrap the resulting ExpressionKeyword in a SyntacticKeyword::Expression(...).

Because combining map! and alt! is so common, there is a special syntax for alt! with a builtin map!:

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt_complete!(
        expression_keyword       => { |e| SyntacticKeyword::Expression(e) } |
        tag!("else")             => { |_| SyntacticKeyword::Else } |
        tag!("=>")               => { |_| SyntacticKeyword::Arrow } |
        tag!("define")           => { |_| SyntacticKeyword::Define } |
        tag!("unquote-splicing") => { |_| SyntacticKeyword::UnquoteSplicing } |
        tag!("unquote")          => { |_| SyntacticKeyword::Unquote }
    )
);

Problem 1: Early Returns

Playing around with the REPL quickly leads to some unexpected results:

>> else
Parsed Done([], Else)
>> =>
Parsed Done([], Arrow)
>> let
Parsed Done(
    [],
    Expression(Let)
)
>> letrec
Parsed Done(
    [114, 101, 99],
    Expression(Let)
)
>>

“letrec” parses to Expression(Let) with [114, 101, 99] (“rec”) as remaining input?

Remember the documentation for alt!:

[…] return the result of the first successful one […]

Done with some remaining input still counts as “successful” so alt! doesn’t even try out the alternative for “letrec”. There is a similar problem for “let” / “let*” and “unquote” / “unquote-splicing”.

An easy fix is to change the order inside alt! so that the longest versions come first.

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt!(
        // ...
        tag!("unquote-splicing") => { |_| SyntacticKeyword::UnquoteSplicing) } |
        tag!("unquote")          => { |_| SyntacticKeyword::Unquote) }
    )
);

named!(
    expression_keyword<ExpressionKeyword>,
    alt!(
        // ...
        tag!("letrec") => { |_| ExpressionKeyword::LetRec) } |
        tag!("let*")   => { |_| ExpressionKeyword::LetStar) } |
        tag!("let")    => { |_| ExpressionKeyword::Let) } |
        // ...
    )
)

Problem 2: Incomplete

>> letrec
Parsed Done(
    [],
    Expression(LetRec)
)
>> let*
Parsed Done(
    [],
    Expression(LetStar)
)
>> let
Parsed Incomplete(Size(6))
>>

We successfully fixed “let*” and “letrec” but now “let” won’t work because the “letrec” branch sees it, starts to parse it, notices it is incomplete and alt! happily returns that as a result.

Again there is an easy solution:

alt_complete!: is equivalent to the alt! combinator, except that it will not return Incomplete when one of the constituting parsers returns Incomplete. Instead, it will try the next alternative in the chain.

The same problem is hidden in syntactic_keyword, too, so we need to change both alt! to alt_complete.

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt_complete!(
        // ...
    )
);

named!(
    expression_keyword<ExpressionKeyword>,
    alt_complete!(
        // ...
    )
)
>> let
Parsed Done(
    [],
    Expression(Let)
)
>> letrec
Parsed Done(
    [],
    Expression(LetRec)
)
>> let*
Parsed Done(
    [],
    Expression(LetStar)
)
>> le
Parsed Error(Alt)
>>
CTRL-C

Conclusion

Our parser is now able to parse all the scheme keywords (syntactic or expression) which is a good start.

There is one small problem left, it might be hard to spot because it only affects the way parsing errors are reported. After switching to alt_complete!, the result no longer contains the information if the parser failed because the input was incomplete or if there simply was no matching parser which might be useful for reporting parser errors later on.

I think there are some ways to refactor the grammar to circumvent this, but that has to wait until the next post…

Full source code: l3kn/r5rs-parser.

  1. I reformatted the output to keep it short 

  2. Or &[u8] -> Expressionkeyword for expression_keyword