We’re making good progress: booleans, numbers (in a simplified form), characters and strings already work. The only missing part is a parser for identifiers.

You might ask, “But what about the keyword parser from Part 1?”

It turns out we don’t need it for the token parser at all, but it will come in handy once we start parsing expressions.

<token> →
  <identifier> |
  <boolean> |
  <number> |
  <character> |
  <string> |
  ( | ) | #( | ' | ` | , | ,@ | .

Peculiar Identifiers

Let’s start with something simple, “peculiar identifiers”:

named!(peculiar_identifier, alt!(tag!("+") | tag!("-") | tag!("...")));

named!(
    identifier<String>,
    map!(
        peculiar_identifier,
        |s| String::from_utf8_lossy(s).into_owned()
    )
);

A combination of alt! and tag! matches each of the peculiar identifiers, and we can use the same method as in the to_s function from earlier to convert &[u8] to String.

Next we need to add an Identifier variant to the Token enum and to the token parser. Note that I removed the Keyword variant, too.
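To make the conversion step concrete, here is a minimal sketch of it as a standalone function. The name bytes_to_string is made up for this example; the post itself just uses the closure inline.

```rust
/// Convert a byte slice matched by a parser into an owned `String`.
/// Invalid UTF-8 is replaced with U+FFFD instead of producing an error,
/// which is why no `Result` is needed here.
fn bytes_to_string(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For the identifiers we parse here the input is plain ASCII, so the lossy conversion never actually has to replace anything.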

#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Boolean(bool),
    Character(char),
    String(String),
    Identifier(String),
}

named!(
    token<Token>,
    alt_complete!(
        integer           => { |i| Token::Number(i) } |
        boolean           => { |b| Token::Boolean(b) } |
        character         => { |c| Token::Character(c) } |
        string            => { |s| Token::String(s) } |
        identifier        => { |s| Token::Identifier(s) }
    )
);

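Again, it’s important to use alt_complete! instead of alt! to avoid conflicts between the number +1 and the identifier +: alt_complete! treats an Incomplete result from one alternative as a failure and tries the next alternative, instead of returning Incomplete right away.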
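The fallback behaviour is easier to see without the macros. The following is a hand-rolled sketch in plain Rust, not the nom code from this series: the cut-down Token enum and the functions integer, peculiar_identifier and token are assumptions for illustration only, and the sketch has no notion of incomplete input. It only shows that trying the integer parser first and falling back to the identifier parser separates +1 from +.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Identifier(String),
}

/// Parse an optionally signed decimal integer.
/// Returns the value and the remaining input, or None if no digits follow.
fn integer(input: &str) -> Option<(i64, &str)> {
    let (sign, rest) = match input.as_bytes().first() {
        Some(b'+') => (1, &input[1..]),
        Some(b'-') => (-1, &input[1..]),
        _ => (1, input),
    };
    let len = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if len == 0 {
        return None; // a bare sign is not a number
    }
    let value: i64 = rest[..len].parse().ok()?;
    Some((sign * value, &rest[len..]))
}

/// Match one of the three peculiar identifiers at the start of the input.
fn peculiar_identifier(input: &str) -> Option<(&str, &str)> {
    ["...", "+", "-"]
        .iter()
        .find(|p| input.starts_with(**p))
        .map(|p| (*p, &input[p.len()..]))
}

/// Ordered choice: try the integer parser first, then fall back to identifiers.
fn token(input: &str) -> Option<(Token, &str)> {
    if let Some((n, rest)) = integer(input) {
        return Some((Token::Number(n), rest));
    }
    if let Some((s, rest)) = peculiar_identifier(input) {
        return Some((Token::Identifier(s.to_string()), rest));
    }
    None
}
```

With this ordering, token("+1") produces Number(1), while token("+") fails in integer and falls through to Identifier("+").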

“Common” Identifiers

First we need some helper parsers to match the different groups of characters. We can’t use nom::digit or nom::alpha here because they match multiple characters, while we only want to match a single one.

named!(letter<char>, one_of!("abcdefghijklmnopqrstuvwxyz"));
named!(single_digit<char>, one_of!("0123456789"));
named!(special_initial<char>, one_of!("!$%&*/:<=>?^_~"));
named!(special_subsequent<char>, one_of!("+-.@"));

I’m sure there is a more elegant way to do this, but one_of! with a string of all characters is good enough for now. The result of these parsers is char, not &[u8], so we need to explicitly annotate their type.
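To illustrate what such a single-character parser does, and why its result is a char rather than a slice, here is a rough plain-Rust equivalent. This is a sketch under my own assumptions, not how nom implements one_of!; the function signatures are invented for the example.

```rust
/// Match exactly one byte from `set` at the start of `input`.
/// Returns the matched character and the remaining input.
fn one_of<'a>(set: &str, input: &'a [u8]) -> Option<(char, &'a [u8])> {
    let (&first, rest) = input.split_first()?;
    let c = first as char;
    if set.contains(c) {
        Some((c, rest))
    } else {
        None
    }
}

/// A single lowercase ASCII letter.
fn letter(input: &[u8]) -> Option<(char, &[u8])> {
    one_of("abcdefghijklmnopqrstuvwxyz", input)
}

/// A single decimal digit.
fn single_digit(input: &[u8]) -> Option<(char, &[u8])> {
    one_of("0123456789", input)
}
```

Unlike a parser that consumes a whole run of letters or digits, each of these consumes exactly one byte, so letter(b"abc") yields 'a' and leaves b"bc" for the next parser.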

Like in Part 3 we can use a combination of recognize! and do_parse! to match “common” identifiers:

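// Assumed helpers, following the R5RS grammar rules for <initial> and <subsequent>:
// <initial> → <letter> | <special initial>
named!(initial<char>, alt!(letter | special_initial));
// <subsequent> → <initial> | <digit> | <special subsequent>
named!(subsequent<char>, alt!(initial | single_digit | special_subsequent));

named!(
    common_identifier,
    recognize!(
        do_parse!(initial >> many0!(subsequent) >> ())
    )
);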
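The recognize! pattern above throws away the individual characters and returns the whole matched slice. A hand-written sketch of the same idea in plain Rust might look like the following. The helpers is_initial and is_subsequent are my own names, written after the R5RS rules for <initial> and <subsequent>, and the Option-based return type is an assumption for this example rather than nom's IResult.

```rust
/// <initial> → <letter> | <special initial>
fn is_initial(b: u8) -> bool {
    b.is_ascii_lowercase() || b"!$%&*/:<=>?^_~".contains(&b)
}

/// <subsequent> → <initial> | <digit> | <special subsequent>
fn is_subsequent(b: u8) -> bool {
    is_initial(b) || b.is_ascii_digit() || b"+-.@".contains(&b)
}

/// Recognize one initial character followed by zero or more subsequent ones.
/// Returns (matched slice, remaining input).
fn common_identifier(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let (&first, tail) = input.split_first()?;
    if !is_initial(first) {
        return None;
    }
    // Count how many bytes after the first one still belong to the identifier.
    let len = 1 + tail.iter().take_while(|&&b| is_subsequent(b)).count();
    Some(input.split_at(len))
}
```

For example, common_identifier(b"list->vector)") matches list->vector and leaves the closing bracket in the remaining input, while an input starting with a digit is rejected.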

Finally, change identifier to support both kinds:

named!(
    identifier<String>,
    map!(
        alt!(peculiar_identifier | common_identifier),
        |s| String::from_utf8_lossy(s).into_owned()
    )
);

And add the remaining tokens:

#[derive(Debug, PartialEq)]
enum Token {
    Number(i64),
    Boolean(bool),
    Character(char),
    String(String),
    Identifier(String),
    LBracket, RBracket, HashBracket,
    Quote, Quasiquote,
    Unquote, UnquoteSplicing,
    Dot
}

named!(
    token<Token>,
    alt_complete!(
        integer           => { |i| Token::Number(i) } |
        boolean           => { |b| Token::Boolean(b) } |
        character         => { |c| Token::Character(c) } |
        string            => { |s| Token::String(s) } |
        identifier        => { |s| Token::Identifier(s) } |
        tag!("(")         => { |_| Token::LBracket } |
        tag!(")")         => { |_| Token::RBracket } |
        tag!("#(")        => { |_| Token::HashBracket } |
        tag!("'")         => { |_| Token::Quote } |
        tag!("`")         => { |_| Token::Quasiquote } |
        tag!(",@")        => { |_| Token::UnquoteSplicing } |
        tag!(",")         => { |_| Token::Unquote } |
        tag!(".")         => { |_| Token::Dot }
    )
);
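One detail in the listing above is worth pointing out: ,@ is tried before the plain comma, because a parser for "," alone would already succeed on the first character of ",@". The sketch below shows the same ordering in plain Rust. The Punct enum and the punctuation function are made up for this example and are not part of the parser in this series.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Punct {
    LBracket,
    RBracket,
    HashBracket,
    Quote,
    Quasiquote,
    Unquote,
    UnquoteSplicing,
    Dot,
}

/// Match a punctuation token at the start of the input.
/// Entries are tried in order, so longer tokens must come before their prefixes.
fn punctuation(input: &str) -> Option<(Punct, &str)> {
    const TABLE: [(&str, Punct); 8] = [
        ("#(", Punct::HashBracket),
        (",@", Punct::UnquoteSplicing), // must come before ","
        (",", Punct::Unquote),
        ("(", Punct::LBracket),
        (")", Punct::RBracket),
        ("'", Punct::Quote),
        ("`", Punct::Quasiquote),
        (".", Punct::Dot),
    ];
    TABLE
        .iter()
        .find(|(text, _)| input.starts_with(*text))
        .map(|&(text, punct)| (punct, &input[text.len()..]))
}
```

Swapping the two comma entries would turn ",@x" into an Unquote followed by a stray @, which is exactly the kind of conflict the ordering avoids.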

A quick test shows that everything works as expected and there don’t seem to be any strange conflicts between identifiers and numbers:

>> test
Parsed Done([], Identifier("test"))
>> +1
Parsed Done([], Number(1))
>> +
Parsed Done([], Identifier("+"))
>> ...
Parsed Done([], Identifier("..."))
>> $foo123
Parsed Done([], Identifier("$foo123"))
>> .
Parsed Done([], Dot)
>> (
Parsed Done([], LBracket)
>> #(
Parsed Done([], HashBracket)
>>

This was easier than I expected, but I’m sure things will get more exciting once we start parsing expressions.

Full source code: l3kn/r5rs-parser.