Parsing Scheme with nom

Keywords: lisp nom rust scheme
Created: 2017-09-05 Tue 10:55:00

Update, 2018-12-11

With version 4 of nom a lot of things changed so the code snippets won't work anymore and some descriptions might be outdated.

Introduction

While trying to build a scheme interpreter in rust, I had a hard time getting nom to do what I want and understanding the difference between all the macros it provides.

Don't get me wrong, it's an amazing framework, but most of the example parsers are hard to understand, embedded in lots of unrelated code or outdated (e.g. using the chain! macro that has been replaced by do_parse! ).

This is my attempt to create some kind of tutorial that starts with the simplest building blocks, explores subtle differences between macros and some mistakes to avoid, and hopefully results in working parser for an R5RS Scheme.

Setup

  1. Make sure you know how to set up and run rust projects crate, at the time of writing this, the newest version is 3.2.0
  2. Take a look at chapter 7 of R5RS to see what the target grammar looks like.
  3. Create a new cargo (binary) project and install the following dependencies
[dependencies]
nom = "3.2.0"
rustyline = "1.0.0"

nom is a parser combinator framework, rustyline an implementation of readline that we're going to use to create a simple REPL for experimenting with parsers.

// nom defines a ton of macros, make them available here
#[macro_use]
extern crate nom;
extern crate rustyline;

use rustyline::error::ReadlineError;
use rustyline::Editor;

fn main() {
    let mut rl = Editor::<()>::new();
    if let Err(_) = rl.load_history("history.txt") {
        println!("No previous history.");
    }
    loop {
        let readline = rl.readline(">> ");
        match readline {
            Ok(line) => {
                rl.add_history_entry(&line);
                println!("Line: {}", line);
            },
            Err(ReadlineError::Interrupted) => {
                println!("CTRL-C");
                break
            },
            Err(ReadlineError::Eof) => {
                println!("CTRL-D");
                break
            },
            Err(err) => {
                println!("Error: {:?}", err);
                break
            }
        }
    }
    rl.save_history("history.txt").unwrap();
}

The above code is a copy of the example on the rustyline github page .

cargo run now yields a nice REPL ("RPL" would be more accurate).

No previous history.
>> foo
Line: foo
>> bar
Line: bar
>> baz
Line: baz
>>
CTRL-C

Keywords

A good place to start is the section on syntactic and expression keywords. We are going to use three nom macros to parse these, named! , tag! and alt! .

Let's see what the nom docs have to say about them:

named! : Makes a function from a parser combination

tag! , tag_s! : recognizes a specific suite of characters or bytes. tag!("hello") matches "hello"

alt! : try a list of parsers and return the result of the first successful one

The heading describes alt! as a "Choice Combinator" which sounds pretty smart and could be useful for name-dropping ;)

For now, don't think too much about what the input and output types of these functions are, we'll get to it later.

named!(
  syntactic_keyword,
  tag!("else")
);

fn parse(line: &str) {
    let res = syntactic_keyword(line.as_bytes());
    println!("Parsed {:#?}", res);
}

fn main() {
    // ...
        match readline {
            Ok(line) => {
                rl.add_history_entry(&line);
                parse(&line);
            },
    // ...

First, we create a new parser syntactic_keyword that parses only the string ( tag! ) "else", then a helper function that takes a line, feeds it to the parser and prints out the result.

.as_bytes() is necessary because nom works on slices of bytes ( u8 ) and &str is a slice of chars (to support unicode).

In case you didn't know, {:#?} is a formatting literal that can be used to pretty-print the debug version of some variable.

cargo run 1

>> else
Parsed Done(
    [],
    [ 101, 108, 115, 101 ]
)
>> foo
Parsed Error(Tag)
>> elsefoo
Parsed Done(
    [ 102, 111, 111 ],
    [ 101, 108, 115, 101 ]
)
>> els
Parsed Incomplete(Size(4))
>>

What appears to have happened?

All our parsers return a IResult , which, according to the docs, can be one of the following types:

Done(I, O) indicates a correct parsing, the first field containing the rest of the unparsed data, the second field contains the parsed data

Error(Err<E>) contains a Err , an enum that can indicate an error code, a position in the input, and a pointer to another error, making a list of errors in the parsing tree

Incomplete(Needed) Incomplete contains a Needed , an enum than can represent a known quantity of input data, or unknown

This explains all the kinds of output we are seeing in the REPL.

[ 101, 108, 115, 101 ] is just a list of the bytes in "else" , 101 is ASCII for 'e', etc.

  • "else" can be parsed fully, the I of Done is empty.
  • "foo" can not be parsed and returns an Error
  • "elsefoo" can be parsed up to "foo", the result is a Done with I = "foo" (as bytes) and O = "else"
  • "els" could be parsed, but there is something missing (a fourth byte)

Other Return Types

Let's add the remaining keywords:

named!(
    syntactic_keyword,
    alt!(
        expression_keyword | tag!("else") | tag!("=>") |
        tag!("define") | tag!("unquote") | tag!("unquote-splicing")
    )
);

named!(
    expression_keyword,
    alt!(
        tag!("quote") | tag!("lambda") | tag!("if") |
        tag!("set!") | tag!("begin") | tag!("cond") |
        tag!("and") | tag!("or") | tag!("case") |
        tag!("letrec") | tag!("let*") | tag!("let") |
        tag!("do") | tag!("delay") | tag!("quasiquote")
    )
);

This gives us a first glimpse at the power of parser combinators, because alt! can combine any kind of parser (as long as the result types are the same) we can create a second parser expression_keyword and combine it with the =tag!=s without any problems.

Getting slices of bytes as a result is not that useful (in this case), but luckily there is an easy solution.

#[derive(Debug)]
enum SyntacticKeyword {
    Else, Arrow, Define, Unquote, UnquoteSplicing,
    Expression(ExpressionKeyword)
}

#[derive(Debug)]
enum ExpressionKeyword {
    Quote, Lambda, If, Set, Begin, Cond, And, Or,
    Case, Let, LetStar, LetRec, Do, Delay, Quasiquote
}

Because we are using {:#?} to output the results, both enums need to implement the Debug trait. #[derive(Debug)] tells rust to do this for us.

The last step is to change the result type from &[u8] (a slice of bytes) to SyntacticKeyword or ExpressionKeyword using map! and closures |arg1, arg2, ...| body .

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt!(
        map!(expression_keyword,      |e| SyntacticKeyword::Expression(e)) |
        map!(tag!("else"),            |_| SyntacticKeyword::Else) |
        map!(tag!("=>"),              |_| SyntacticKeyword::Arrow) |
        map!(tag!("define"),          |_| SyntacticKeyword::Define) |
        map!(tag!("unquote"),         |_| SyntacticKeyword::Unquote) |
        map!(tag!("unquote-splicing"),|_| SyntacticKeyword::UnquoteSplicing)
    )
);

named!(
    expression_keyword<ExpressionKeyword>,
    alt!(
        map!(tag!("quote"),     |_| ExpressionKeyword::Quote) |
        map!(tag!("lambda"),    |_| ExpressionKeyword::Lambda) |
        // ...
        map!(tag!("quasiquote"),|_| ExpressionKeyword::Quasiquote)
    )
);

The description of map! is accurate but doesn't help that much…

map! : maps a function on the result of a parser

… but its type signature tells the whole story:

map!(I -> IResult<I,O>, O -> P) => I -> IResult<I, P>

Thanks to tag! we already have parsers with the type signature &[u8] -> IResult<&[u8], &[u8]> and want syntactic_keyword to be a parser with the type signature &[u8] -> IResult<&[u8], SyntacticKeyword>
("try to parse a slice of bytes and ideally return a SyntacticKeyword ").

This is exactly what map! does, the only missing piece is the second argument, a function O -> P , in our case &[u8] -> SyntacticKeyword 2 .

In most cases we don't need the value for O because the result doesn't depend on it, so we can "throw it away" with the _ placeholder.

The one special case is

map!(expression_keyword, |e| SyntacticKeyword::Expression(e))

expression_keyword where we need to wrap the resulting ExpressionKeyword in a SyntacticKeyword::Expression(...) .

Because combining map! and alt! is so common, there is a special syntax for alt! with a builtin map! :

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt_complete!(
        expression_keyword       => { |e| SyntacticKeyword::Expression(e) } |
        tag!("else")             => { |_| SyntacticKeyword::Else } |
        tag!("=>")               => { |_| SyntacticKeyword::Arrow } |
        tag!("define")           => { |_| SyntacticKeyword::Define } |
        tag!("unquote-splicing") => { |_| SyntacticKeyword::UnquoteSplicing } |
        tag!("unquote")          => { |_| SyntacticKeyword::Unquote }
    )
);

Problem 1: Early Returns

Playing around with the REPL quickly leads to some unexpected results:

>> else
Parsed Done([], Else)
>> =>
Parsed Done([], Arrow)
>> let
Parsed Done(
    [],
    Expression(Let)
)
>> letrec
Parsed Done(
    [114, 101, 99],
    Expression(Let)
)
>>

"letrec" parses to Expression(Let) with [114, 101, 99] ("rec") as remaining input?

Remember the documentation for alt! :

[…] return the result of the first successful one […]

Done with some remaining input still counts as "successful" so alt! doesn't even try out the alternative for "letrec". There is a similar problem for "let" / "let*" and "unquote" / "unquote-splicing".

An easy fix is to change the order inside alt! so that the longest versions come first.

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt!(
        // ...
        tag!("unquote-splicing") => { |_| SyntacticKeyword::UnquoteSplicing) } |
        tag!("unquote")          => { |_| SyntacticKeyword::Unquote) }
    )
);

named!(
    expression_keyword<ExpressionKeyword>,
    alt!(
        // ...
        tag!("letrec") => { |_| ExpressionKeyword::LetRec) } |
        tag!("let*")   => { |_| ExpressionKeyword::LetStar) } |
        tag!("let")    => { |_| ExpressionKeyword::Let) } |
        // ...
    )
)

Problem 2: Incomplete

>> letrec
Parsed Done(
    [],
    Expression(LetRec)
)
>> let*
Parsed Done(
    [],
    Expression(LetStar)
)
>> let
Parsed Incomplete(Size(6))
>>

We successfully fixed "let*" and "letrec" but now "let" won't work because the "letrec" branch sees it, starts to parse it, notices it is incomplete and alt! happily returns that as a result.

Again there is an easy solution:

alt_complete! : is equivalent to the alt! combinator, except that it will not return Incomplete when one of the constituting parsers returns Incomplete . Instead, it will try the next alternative in the chain.

The same problem is hidden in syntactic_keyword , too, so we need to change both alt! to alt_complete .

named!(
    syntactic_keyword<SyntacticKeyword>,
    alt_complete!(
        // ...
    )
);

named!(
    expression_keyword<ExpressionKeyword>,
    alt_complete!(
        // ...
    )
)
>> let
Parsed Done(
    [],
    Expression(Let)
)
>> letrec
Parsed Done(
    [],
    Expression(LetRec)
)
>> let*
Parsed Done(
    [],
    Expression(LetStar)
)
>> le
Parsed Error(Alt)
>>
CTRL-C

Conclusion

Our parser is now able to parse all the scheme keywords (syntactic or expression) which is a good start.

There is one small problem left, it might be hard to spot because it only affects the way parsing errors are reported. After switching to alt_complete! , the result no longer contains the information if the parser failed because the input was incomplete or if there simply was no matching parser which might be useful for reporting parser errors later on.

Integers

Mapping over Results

Nom works with slices of bytes ( &[u8] ) so we need some way to convert these to strings and then parse them into integers.

Rust already provides a method for the first part: std::str::from_utf8 .

It's type signature looks like this:

&[u8] -> Result<&str, Utf8Error>

We need is convert &[u8] -> &str , what is up with that Result<> thingy around it?

The problem is that there are some byte sequences that are not valid as UTF-8 sequences.

If our string is just made up of bytes from 0 to 127 (ASCII), everything works fine.

fn main() {
    let input = [49, 50, 51]; // ASCII for "123"
    println!("{:?}", std::str::from_utf8(&input));
}
// => Ok("123")

255 is a valid byte value but must not appear in a sequence.

fn main() {
    let input = [49, 50, 51, 255];
    println!("{:?}", std::str::from_utf8(&input));
}
// => Err(Utf8Error { valid_up_to: 3, error_len: Some(1) })

There are a ton of other cases where parsing bytes to a string could go wrong, but the one above has to do for now…

Now that we know why there is a Result<> around the stuff we want, how do we use from_utf8 with nom? map! from Part 1 won't work and using .unwrap() or .expect() would be very inelegant.

The solution is surprisingly simple, nom already includes a variation of map! that works with functions that return Result=s and =nom::digit , a parser that recognizes one or more of the characters '0'…'9'.

named!(
    integer<&str>,
    map_res!(nom::digit, std::str::from_utf8)
);

Parsing Integers

Of course &str is not what we really want, we still need to parse it to one of the integer types, for now just i64 .

One way to do this is to use str.parse::<i64>() 3 which returns a Result , too, so we need to use map_res! again.

named!(
    integer<i64>,
    map_res!(
      map_res!(nom::digit, std::str::from_utf8),
      |s: &str| s.parse::<i64>()
    )
);

Rust seems to have a hard time figuring out the type of s inside the closure (for good reasons, I am sure), so we need set it to &str by hand.

To try out our new parser, just change the parse() function from Part 1 to use it instead of syntactic_keyword .

fn parse(line: &str) {
    // let res = syntactic_keyword(line.as_bytes());
    let res = integer(line.as_bytes());
    println!("Parsed {:#?}", res);
}

Valid values for i64 range from parsing_scheme_with_nom_f730d14a937ef2a1271dab5645ef93501f59b2d7.svg to parsing_scheme_with_nom_5923132f0eac4b6901294c8291792a547ea68046.svg 4 , so an easy way to see how map_res! handles errors would be to use parsing_scheme_with_nom_7be7b6dd708ec794a575e68ddc6b57e3fd5bdcad.svg or higher as input.

>> 1
Parsed Done([], 1)
>> 2
Parsed Done([], 2)
>> 3
Parsed Done([], 3)
>> 0004
Parsed Done([], 4)
>> 9223372036854775808
Parsed Error(MapRes)https://www.youtube.com/watch?v=je8UCmQ45h4

Funfact , this piece of code panics for the same reason:

fn main() {
    println!("{:?}",  std::i64::MIN);
    println!("{:?}", -std::i64::MIN);
}
// -9223372036854775808
// thread 'main' panicked at 'attempt to negate with overflow', ...

Processing Signs

As a last step, we'll add support for signed integers (like -42 ). To do that, we need some way to express "An optional '-' followed by one or more digits" as a parser.

opt!(parser) makes parser optional, so opt!(tag("-")) gives us the first part and we already know that nom::digit matches one or more digits, the only thing missing is some way to chain them together.

do_parse!(opt!(tag"-") >> digit >> ()) creates a parser that matches the desired pattern and returns () (no result).

The last piece of the puzzle is recognize!(parser) which returns the input if its child parser was successful.

Putting all of them together, get:

// Top of the file
use nom::{digit};

// ...

named!(
    integer<i64>,
    map_res!(
        map_res!(
            recognize!(
                do_parse!(
                    opt!(tag!("-")) >>
                    digit >>
                    ()
                )
            ),
            std::str::from_utf8
        ),
        |s: &str| s.parse::<i64>()
    )
);

nom has a problem recognizing module paths inside macros , so nom::digit won't work inside the do_parse! .

Most of the macros take parsers as inputs and return parsers, so we can make our parser less messy by creating a special integer_literal parser.

named!(
    integer_literal,
    recognize!(
        do_parse!(
            opt!(tag!("-")) >>
            digit >>
            ()
        )
    )
);

named!(
    integer<i64>,
    map_res!(
        map_res!(
            integer_literal,
            std::str::from_utf8
        ),
        |s: &str| s.parse::<i64>()
    )
);
>> -123
Parsed Done([], -123)
>> -0
Parsed Done([], 0)
>> 0
Parsed Done([], 0)
>> 123
Parsed Done([], 123)
>>

This has to do for now, in the next part I'll try to handle binary, octal and hex numbers.

Spec

In addition to decimal integers like those we handled in Part 2 , R5RS includes literals for binary, octal and hexadecimal numbers.

  • #b11 (binary)
  • #o17 (octal)
  • #d19 (decimal)
  • #xaf (hexadecimal)

They are made up of a radix specifier ( #b , #o , #d , #x , none), a sign ( + , - , none) and a non-empty sequence of digits with given radix.

If there is no radix specifier , the default radix is 10 and if there is no sign, the integer is positive (obviously).

Digit Sequences

In the last part, we used nom::digit to match sequences of decimal digits.

There are two other variants of this, nom::oct_digit and nom::hex_digit .

Sadly there is no bin_digit so we need to write it ourselves.

Looking through the list of nom macros, one might assume something like many1!(one_of!("01")) would be a good to do so, but many1!(...) returns a list of results instead of just a matching sequence of bytes.

take_while! sounds more like what we want and has a variant that only matches sequences that are non-empty:

[…] returns the longest list of bytes for which the function is true. […]

The signature of take_while! looks like this:

take_while!(T -> bool) => &[T] -> IResult<&[T], &[T]>

We are working with byte slices, so T is u8 , so what we need is a function that takes a u8 byte and returns true iff it is a binary digit.

fn is_bin_digit(char: u8) -> bool {
  // Just '0' would be a char,
  // putting b in front marks it as a byte
  char == b'0' || char == b'1'
}

Now we can build our own bin_digit parser:

named!(bin_digit, take_while1!(is_bin_digit));

More Signs

In addition to - , + can be used as a sign, too, so we need a way to handle this.

To keep the integers parsers as dry as possible, we'll extract this into its own parser:

named!(sign, recognize!(opt!(one_of!("+-"))));

The only new thing here is one_of!(str) . According to the docs , it …

… matches one of the provided characters. one_of!("abc") could recognize 'a', 'b', or 'c'.

Just using opt!(one_of!("+-")) would lead to problems once we use it inside of do_parse!(sign >> digit >> ()) , because it's return type ( Option<...> ) is different, so we have to wrap recognize! around it to get a sequence of bytes instead.

Parsing Numbers with Radix

Next we need some way to parse these digit sequences. str::parse::<i64>() won't do this time, because there is no way to tell it which radix (2, 8, 10 or 16) to use.

Instead, we can use i64::from_str_radix(src: &str, radix: u32) which returns a Result , too, so we can just swap the two functions inside the map_res! from Part 2 .

Doing this for all new variants (and for decimal integers, to keep things consistent) we can build new parsers integer_literal2 , integer_literal8 , …, that match sequences signed binary, octal, decimal and hexadecimal numbers.

// Top of the file
use nom::{digit, oct_digit, hex_digit};

// ...

named!(
    integer_literal2,
    recognize!(do_parse!(sign >> bin_digit >> ()))
);

named!(
    integer_literal8,
    recognize!(do_parse!(sign >> oct_digit >> ()))
);

named!(
    integer_literal10,
    recognize!(do_parse!(sign >> digit >> ()))
);

named!(
    integer_literal16,
    recognize!(do_parse!(sign >> hex_digit >> ()))
);

And based on that, some new parsers that return =i64=s…

named!(
    integer2<i64>,
    map_res!(
        map_res!(integer_literal2, std::str::from_utf8),
        |s| i64::from_str_radix(s, 2)
    )
);

named!(
    integer8<i64>,
    map_res!(
        map_res!(integer_literal8, std::str::from_utf8),
        |s| i64::from_str_radix(s, 8)
    )
);

named!(
    integer10<i64>,
    map_res!(
        map_res!(integer_literal10, std::str::from_utf8),
        |s| i64::from_str_radix(s, 10)
    )
);

named!(
    integer16<i64>,
    map_res!(
        map_res!(integer_literal16, std::str::from_utf8),
        |s| i64::from_str_radix(s, 16)
    )
);

Finally, we need to combine all the parsers above into one parser that can handle all kinds of integers, choosing one of the subparsers depending on the numbers radix specifier .

nom provides an elegant way to do this, preceded! takes two parsers, tries to apply the first one and then returns the result of second one.

Remember that #d is optional, so we have to use opt! there.

named!(
    integer<i64>,
    alt!(
        preceded!(tag!("#b"), integer2) |
        preceded!(tag!("#o"), integer8) |
        preceded!(opt!(tag!("#d")), integer10) |
        preceded!(tag!("#x"), integer16)
    )
);

Now fire up the REPL to check if everything works as expected:

>> 123
Parsed Done([], 123)
>> +123
Parsed Done([], 123)
>> #x+FF
Parsed Done([], 255)
>> #x+Ff
Parsed Done([], 255)
>> #b101010
Parsed Done([], 42)
>> #oFF
Parsed Error(Alt)

There is a lot of code duplication going on above but I don't want to get into macros just now, so let's call it a day.

Booleans

In R5RS there are two boolean literals, #t for true and #f for false .

With our newly aquired nom skills , this should be easy:

// We can't name this parser bool because that is a registered keyword in rust
named!(
    boolean<bool>,
    alt!(
        tag!("#t") => { |_| true } |
        tag!("#f") => { |_| false }
    )
);

Chars

Chars are not that complex either, there are just three cases to handle:

  • #\space as alias for ' '
  • #\newline as alias for '\n'
  • #\<any char>

All of these cases begin with #\ , so preceded!(tag!("#\\"), ...) would be a good start.

// Top of the file,
// all the digits are needed for the number parsers from earlier parts
use nom::{digit, oct_digit, hex_digit, anychar};

// ...

named!(
    character<char>,
    preceded!(
        tag!("#\\"),
        alt_complete!(
            tag!("space") => { |_| ' ') } |
            tag!("newline" => { |_| '\n' } |
            anychar
        )
    )
);

For the first two cases, we just match on the names with tag! and use the => syntax to return the right char . The third case is surprisingly easy as well because nom::anychar does exactly what we want: to match any character and return a char .

Again we need to use alt_complete! instead of alt! and put anychar at the end of the chain, otherwise #\space would get parsed as #\s or #\s as an Incomplete #\space .

Combining Types

At the end we want to have a parser that can handle all kinds of scheme values and returns some wrapper type.

For now, there are just four cases to handle:

  1. Keywords, from Part 1
  2. Numbers, from Part 2 & 3
  3. Booleans
  4. Characters
#[derive(Debug)]
enum Token {
    Keyword(SyntacticKeyword),
    Number(i64),
    Boolean(bool),
    Character(char),
}

In addition to that new wrapper type, we need a new parser that combines the parsers for all types and wraps the results in the corresponding Token type.

named!(
    token<Token>,
    alt!(
        syntactic_keyword => { |kw| Token::Keyword(kw) } |
        integer => { |i| Token::Number(i) } |
        boolean => { |b| Token::Boolean(b) } |
        character => { |c| Token::Character(c) }
    )
);
fn parse(line: &str) {
    let res = token(line.as_bytes());
    println!("Parsed {:#?}", res);
}

Testing

An assertion might look like this:

assert_eq!(boolean("#t".as_bytes()), nom::IResult::Done(&b""[..], true));
assert_eq!(boolean("#f".as_bytes()), nom::IResult::Done(&b""[..], false));

There is a lot of boilerplate code because the input has to be a &[u8] , not &str and we expect our input to be parsed fully, so the first part of Done is an empty &[u8] (which we get by &b""[..] ).

A nice fix is to write a macro that takes the parts we care about (parser, input string, output value) and fills in the rest:

macro_rules! assert_parsed_fully {
    ($parser:expr, $input:expr, $result:expr) => {
        assert_eq!(
            $parser($input.as_bytes()),
            nom::IResult::Done(&b""[..], $result)
        );
    }
}

Now we can write tests in a much cleaner way:

#[test] // This marks functions as unit tests, they can be run with `cargo test`
fn test_bool() {
    assert_parsed_fully!(boolean, "#t", true);
    assert_parsed_fully!(boolean, "#f", false);
}

#[test]
fn test_character() {
    assert_parsed_fully!(character, "#\\space", ' ');
    assert_parsed_fully!(character, "#\\newline", '\n');
    assert_parsed_fully!(character, "#\\ ", ' ');
    assert_parsed_fully!(character, "#\\X", 'X');
}

#[test]
fn test_integer() {
    assert_parsed_fully!(integer, "1", 1);
    assert_parsed_fully!(integer, "#d+1", 1);
    assert_parsed_fully!(integer, "-1", -1);
    assert_parsed_fully!(integer, "#b010101", 21);
    assert_parsed_fully!(integer, "#o77", 63);
    assert_parsed_fully!(integer, "#xFF", 255);
    assert_parsed_fully!(integer, "#x-ff", -255);
}

In order to use assert_eq! on Token=s, we need to define a way to test if two of them are equal, formalized in the =PartialEq trait.

We won't use Eq here, because in the future there might be some tokens (e.g. =NaN=) where the equivalence relation is not reflexive ( v ! v= for some token v ).

Just like the Display trait, we can make rust derive PartialEq automatically by adding it in the #[derive(...)] above Token , SyntacticKeyword and ExpressionKeyword .

#[derive(Debug, PartialEq)]
enum Token {
  // ..
}

Now assert_parsed_fully! works for tokens, too.

#[test]
fn test_token() {
    assert_parsed_fully!(token, "1", Token::Number(1));
    assert_parsed_fully!(token, "else", Token::Keyword(SyntacticKeyword::Else));
    assert_parsed_fully!(token, "lambda", Token::Keyword(
        SyntacticKeyword::Expression(ExpressionKeyword::Lambda))
    );
    assert_parsed_fully!(token, "#\\space", Token::Character(' '));
    // ...
}

I'll leave coming up with more test cases as an exercise for the reader. If you find a case that does not work as expected, feel free to open up an issue .

Strings

The R5RS spec for strings is pretty simple, but in addition to that, support for \n , \r and \t would be nice.

<string> → " <string element>* "
<string element> → <any character other than " or \ > | \" | \\

nom seems to have two options to handle escaped strings:

Let's use the later one, because the example code already does 80% of what we want.

fn to_s(i:Vec<u8>) -> String {
  String::from_utf8_lossy(&i).into_owned()
}

named!(
    string_content<String>,
    map!(
        escaped_transform!(
            take_until_either!("\"\\"),
            '\\',
            alt!(
                tag!("\\") => { |_| &b"\\"[..] } |
                tag!("\"") => { |_| &b"\""[..] } |
                tag!("n") => { |_| &b"\n"[..] } |
                tag!("r") => { |_| &b"\r"[..] } |
                tag!("t") => { |_| &b"\t"[..] }
            )
        ),
        to_s
    )
);

The only changes are to use take_until_either!("\"\\") to matche any characters until either a \ or a " appears instead of alpha and add support for \r and \t .

Based on this parser for stuff inside the " , next we need a way to make sure there are "=s around our strings. =delimited! is similar to the earlier preceded! and does just that, it takes three parsers

  • opening delimiter
  • body
  • closing delimiter

and returns only the result for the body.

named!(string<String>,
    delimited!(tag!("\""), string_content, tag!("\""))
);

Now we only need to add a string type to the Token enum, the string parser to the token parser and everything should work fine.

#[derive(Debug, PartialEq)]
enum Token {
    Keyword(SyntacticKeyword),
    Number(i64),
    Boolean(bool),
    Character(char),
    String(String),
}

named!(
    token<Token>,
    alt!(
        syntactic_keyword => { |kw| Token::Keyword(kw) } |
        integer           => { |i| Token::Number(i) } |
        boolean           => { |b| Token::Boolean(b) } |
        character         => { |c| Token::Character(c) } |
        string            => { |s| Token::String(s) }
    )
);
>> "seems to work"
Parsed Done([], String("seems to work"))
>> "test123 \n\t\r\"\"\\"
Parsed Done([], String("test123 \n\t\r\"\"\\"))
>>

Identifiers

We're making good progress, booleans, numbers (in a simplified form), characters and strings already work, the only missing part is a parser for identifiers.

You might ask "But what about the keyword parser from Part 1 "?

Turns out, we don't even need it for the token parser, but it will come in handy once we start to parse expressions.

<token> →
  <identifier> |
  <boolean> |
  <number> |
  <character> |
  <string> |
  ( | ) | #( | ’ | ` | , | ,@ | .

Peculiar Identifiers

Let's start with something simple, "peculiar identifiers":

named!(peculiar_identifier, alt!(tag!("+") | tag!("-") | tag!("...")));

named!(
    identifier<String>,
    map!(
        peculiar_identifier,
        |s| String::from_utf8_lossy(s).into_owned()
    )
);

A combination of alt! and tag! matches each of the peculiar identifiers and we can use the same method as in the to_s function from earlier to convert &[u8] to String .

Next we need add a Identifier type to the Token enum and parser. Note that I removed the Keyword type, too.

#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Boolean(bool),
    Character(char),
    String(String),
    Identifier(String),
}

named!(
    token<Token>,
    alt_complete!(
        integer           => { |i| Token::Number(i) } |
        boolean           => { |b| Token::Boolean(b) } |
        character         => { |c| Token::Character(c) } |
        string            => { |s| Token::String(s) } |
        identifier        => { |s| Token::Identifier(s) }
    )
);

Again it's important to use alt_complete! instead of alt! to avoid conflicts between the number +1 and the identifier + .

"Common" Identifiers

First we need some helper classes to match the different groups of characters. We can't use nom::digit or nom::alpha here because they match multiple characters while we only want to match a single one.

named!(letter<char>, one_of!("abcdefghijklmnopqrstuvwxyz"));
named!(single_digit<char>, one_of!("0123456789"));
named!(special_initial<char>, one_of!("!$%&*/:<=>?^_~"));
named!(special_subsequent<char>, one_of!("+-.@"));

I'm sure there is a more elegant way to do this but one_of! with a string of all characters is good enough for now. The result of these parsers is char , not &[u8] so we need to explicitely annotate their type.

Like in Part 3 we can use a combination of recognize! and do_parse! to match "common" identifiers:

named!(
    common_identifier,
    recognize!(
        do_parse!(initial >> many0!(subsequent) >> ())
    )
);

Finally change identifier to support both types:

named!(
    identifier<String>,
    map!(
        alt!(peculiar_identifier | common_identifier),
        |s| String::from_utf8_lossy(s).into_owned()
    )
);

And add the remaining tokens:

#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Boolean(bool),
    Character(char),
    String(String),
    Identifier(String),
    LBracket, RBracket, HashBracket,
    Quote, Quasiquote,
    Unquote, UnquoteSplicing,
    Dot
}

named!(
    token<Token>,
    alt_complete!(
        integer           => { |i| Token::Number(i) } |
        boolean           => { |b| Token::Boolean(b) } |
        character         => { |c| Token::Character(c) } |
        string            => { |s| Token::String(s) } |
        identifier        => { |s| Token::Identifier(s) } |
        tag!("(")         => { |_| Token::LBracket } |
        tag!(")")         => { |_| Token::RBracket } |
        tag!("#(")        => { |_| Token::HashBracket } |
        tag!("'")         => { |_| Token::Quote } |
        tag!("`")         => { |_| Token::Quasiquote } |
        tag!(",@")        => { |_| Token::UnquoteSplicing } |
        tag!(",")         => { |_| Token::Unquote } |
        tag!(".")         => { |_| Token::Dot }
    )
);

A quick test shows that everything works as expected and there don't seem to be any strange conflicts between identifiers and numbers:

>> test
Parsed Done([], Identifier("test"))
>> +1
Parsed Done([], Number(1))
>> +
Parsed Done([], Identifier("+"))
>> ...
Parsed Done([], Identifier("..."))
>> $foo123
Parsed Done([], Identifier("$foo123"))
>> .
Parsed Done([], Dot)
>> (
Parsed Done([], LBracket)
>> #(
Parsed Done([], HashBracket)
>>

This was easier than I expected, but I'm sure things will get more exiting once we start parsing expressions.

Full source code: l3kn/r5rs-parser .

Footnotes:

1

I reformatted the output to keep it short

2

Or &[u8] -> Expressionkeyword for expression_keyword

3

::<i64> is an alternative way of defining which type we want to parse to, usually this is already set by the type of the variable in a assignment, e.g. =let res: i64 = str.parse()=

4

If you are wondering why the range is assymetric, take a look at how two's complement is defined.


Last export: 2020-05-29 Fri 00:09
This document is also available as plain text and pdf.

If you have an idea how this page could be improved or a comment send me a mail.