70 %

Advent of Code 2020 in Rust day 2: regex vs parser combinators with nom

tldr;

Parsing the input with nom is faster than using the regex crate.

Setup

The input for today looks something like this:

1-3 a: abcde
1-3 b: cdefg
2-9 c: ccccccccc

I used the same scaffolding for each part and set up a process_password_line function to process each line.

rust

pub fn process_part1(input: &str) -> usize {
    input
        .lines()
        .filter_map(|line| match process_password_line(line) {
            Ok(true) => Some(true),
            Ok(false) => None,
            Err(_) => None,
        })
        .count()
}

process_password_line

My first attempt was using regex to parse each line. This resulted in some unwieldy code that had a lot of unwraps and additional parsing. The error handling would've been annoying to "get right" so as a result I mostly skipped it. There are also some issues with differing types. You can see that character is actually a str which means I have to use contains instead of equality. Overall kind of a messy way to go about solving the problem, but it definitely works.

regex.rs

rust

fn process_password_line(line: &str) -> Result<bool, std::io::Error> {
    lazy_static! {
        static ref RE: Regex = Regex::new(r"([\d]+)-([\d]+) ([a-z]): ([a-z]+)").unwrap();
    }
    let caps = RE.captures(line).unwrap();
    let lower_bound = caps.get(1).unwrap().as_str().parse::<usize>().unwrap();
    let upper_bound = caps.get(2).unwrap().as_str().parse::<usize>().unwrap();
    let character = caps.get(3).unwrap().as_str();
    let password = caps.get(4).unwrap().as_str();
    let char_count = password.chars().filter(|c| character.contains(*c)).count();

    if char_count < lower_bound || char_count > upper_bound {
        Ok(false)
    } else {
        Ok(true)
    }
}

Given how messy this was, I wanted to try my hand at using a parser written with nom, a parser combinator library. This felt to me more "fit for purpose" than the regex approach.

nom is a parser combinator library with a focus on safe parsing, streaming patterns, and as much as possible zero copy.

This line from the docs explains the motivation for using nom. My favorite answer to day 1 was a victim of too many allocations, so I wanted to see what nom could do compared to regex.

nom.rs

rust

fn process_password_line(line: &str) -> Result<bool, std::io::Error> {
    let result = password_line(line).map(|(_, pass)| {
        let char_count = pass
            .password
            .chars()
            .filter(|c| pass.character == *c)
            .count();
        char_count >= pass.lower.into() && char_count <= pass.upper.into()
    });
    match result {
        Ok(b) => Ok(b),
        _ => Err(Error::new(ErrorKind::Other, "failed to parse")),
    }
}

Given the parser, the actual logic for dealing with each line becomes significantly simpler. My only hangup was on converting nom's errors into a more "standard" error type.

The parser itself is fairly readable, even if you don't know what a parser combinator is. The important parts are that we're shadowing input each time, which returns the result minus the parsed piece for the next parser.

nom-parser.rs

rust

use nom::{bytes::complete::tag, character::complete::*, IResult};

struct PasswordLine<'a> {
    lower: u8,
    upper: u8,
    character: char,
    password: &'a str,
}

fn password_line(input: &str) -> IResult<&str, PasswordLine> {
    let (input, lower) = digit1(input)?;
    let (input, _) = tag("-")(input)?;
    let (input, upper) = digit1(input)?;
    let (input, _) = tag(" ")(input)?;
    let (input, parsed_character) = anychar(input)?;
    let (input, _) = tag(": ")(input)?;
    let (input, password) = alpha1(input)?;

    Ok((
        input,
        PasswordLine {
            lower: lower.parse::<u8>().unwrap(),
            upper: upper.parse::<u8>().unwrap(),
            character: parsed_character,
            password,
        },
    ))
}

Performance

The regex approach benchmarked at about 1ms while the parser approach benchmarked at 145 nanoseconds. I attribute this to the zero-copy approach nom uses and the complexity of regex engines, but I did not check that assumption.

| example | lower bound | best guess | upper bound | | ------- | ----------- | ---------- | ----------- | | nom | 143.80 us | 144.83 us | 145.99 us | | regex | 984.72 us | 994.95 us | 1007.40 us |