Chris Biscardi

Parsing MDX ATX headings in Rust with nom, part 1

Nom is a parser combinator library in Rust. We can use this to write a Rust implementation of MDX, starting with headings.

Our goal is to parse the following MDX file (which in this case is no different from a markdown file).

md
# boop

In our main.rs we'll use a couple of nom functions.

rust
use nom::{
    character::*, sequence::terminated, Err::Error,
    IResult, *,
};

We'll also use a custom error type, which acts pretty much like nom's original one.

rust
use crate::mdx_error::MDXError;

At the top of the file we'll define our data structure. This is what we're going to parse the MDX into. Here it's an ATXHeading struct (ATX heading is the name of one type of heading in CommonMark). The value is a reference to a [u8] with a lifetime annotation, but that's not super important. We could also have used str, etc.

rust
#[derive(Debug, PartialEq, Eq)]
pub struct ATXHeading<'a> {
    pub level: usize,
    pub value: &'a [u8],
}
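
As a quick illustration of what this struct holds, the sketch below builds a heading value by hand and reads the bytes back as a str. The struct is repeated so the snippet stands alone, and `heading_text` is a helper name made up for this example; it isn't part of the parser.

```rust
/// Same shape as the struct above, repeated so the snippet stands alone.
#[derive(Debug, PartialEq, Eq)]
pub struct ATXHeading<'a> {
    pub level: usize,
    pub value: &'a [u8],
}

/// Hypothetical helper: view the heading's bytes as text.
/// Returns `None` when the bytes are not valid UTF-8.
pub fn heading_text<'a>(heading: &ATXHeading<'a>) -> Option<&'a str> {
    std::str::from_utf8(heading.value).ok()
}
```

Storing bytes keeps the struct agnostic about encoding, at the cost of a fallible conversion whenever you want text back out.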

We'll start with a couple of parsers for hashes and spaces. Nom uses macros quite heavily, although in 5.0 you can also write parsers as functions, as we'll see in a moment. The named! macro takes an identifier as its first argument (hashtags or spaces) and defines a parser with that name from the macros in the second argument, so we can use hashtags or spaces as parsers later.

rust
named!(hashtags, is_a!("#"));
named!(spaces, take_while!(is_space));
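
To get a feel for what these two parsers do, here's a hand-written sketch in plain Rust. It's only an approximation: the names `hashtags_sketch` and `spaces_sketch` are made up, and it ignores nom's error types and its handling of incomplete input. What it does keep is the convention of returning the remaining input first and the matched part second.

```rust
/// Rough stand-in for `is_a!("#")`: match one or more `#` bytes.
/// Returns `(remaining, matched)`, or `None` if the input doesn't start with `#`.
pub fn hashtags_sketch(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let end = input
        .iter()
        .position(|&b| b != b'#')
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    let (matched, remaining) = input.split_at(end);
    Some((remaining, matched))
}

/// Rough stand-in for `take_while!(is_space)`: match zero or more spaces or tabs.
/// Matching nothing is still a success, so the pair is returned directly.
pub fn spaces_sketch(input: &[u8]) -> (&[u8], &[u8]) {
    let end = input
        .iter()
        .position(|&b| b != b' ' && b != b'\t')
        .unwrap_or(input.len());
    let (matched, remaining) = input.split_at(end);
    (remaining, matched)
}
```

The difference between the two is worth noticing: the hashes parser needs at least one match, while the spaces parser is happy with none.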

Then we write a few function-based parsers that operate on strings and return IResults. IResult is an important type to get to know because it's used everywhere, and getting its type arguments right matters. The current return type for these parsers is IResult<&str, &str>, with two type arguments (the input type and the output type). Later we'll see that we can pass a third to specify the error type as well.

rust
pub fn end_of_line(input: &str) -> IResult<&str, &str> {
    if input.is_empty() {
        Ok((input, input))
    } else {
        nom::character::complete::line_ending(input)
    }
}
pub fn rest_of_line(input: &str) -> IResult<&str, &str> {
    terminated(
        nom::character::complete::alphanumeric0,
        end_of_line,
    )(input)
}
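
To make the IResult shape concrete: it's roughly a Result whose Ok side is a tuple of the remaining input and the parsed value. The sketch below mimics that with standard-library types only. `ParseResult`, `eol_sketch` and `rest_of_line_sketch` are names made up for this illustration, the `&'static str` error is a simplification of nom's real error type, and I'm assuming the alphanumeric match is ASCII-only.

```rust
/// Made-up alias mirroring the shape of `IResult<&str, &str>`:
/// `Ok((remaining_input, parsed_value))` or an error.
pub type ParseResult<'a> = Result<(&'a str, &'a str), &'static str>;

/// Mirrors `end_of_line`: succeed at end of input, otherwise require a line ending.
pub fn eol_sketch(input: &str) -> ParseResult<'_> {
    if input.is_empty() {
        Ok((input, input))
    } else if input.starts_with("\r\n") {
        Ok((&input[2..], &input[..2]))
    } else if input.starts_with('\n') {
        Ok((&input[1..], &input[..1]))
    } else {
        Err("expected a line ending")
    }
}

/// Mirrors `rest_of_line`: take alphanumerics, then require `eol_sketch`,
/// keeping the text and dropping the line ending (the job `terminated` does).
pub fn rest_of_line_sketch(input: &str) -> ParseResult<'_> {
    let end = input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len());
    let (text, rest) = input.split_at(end);
    let (rest, _line_ending) = eol_sketch(rest)?;
    Ok((rest, text))
}
```

In this sketch `rest_of_line_sketch("boop\nnext")` leaves `"next"` as the remaining input and yields `"boop"`, while an input like `"boop beep"` fails at the space because only alphanumerics are accepted before the line ending.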

The meat of our setup is atx_heading, which uses the parsers we defined earlier to parse values out and return either a tuple of the leftover input and the ATXHeading struct, or an error. We use .map_err to convert each parser's error into our custom error type. That lets us return our own error if the heading has more than 6 hashes, which means the line should be a paragraph instead. Our heading parser doesn't care about paragraphs. It only cares that it has to fail, and the paragraph parser will live somewhere else in our program.

rust
pub fn atx_heading(
    input: &[u8],
) -> IResult<&[u8], ATXHeading, MDXError<&[u8]>> {
    // TODO: up to 3 spaces can occur here
    let (input, hashes) =
        hashtags(input).map_err(Err::convert)?;
    if hashes.len() > 6 {
        return Err(Error(MDXError::TooManyHashes));
    }
    // TODO: empty headings are a thing, so any parsing below this is optional
    let (input, _) = spaces(input).map_err(Err::convert)?;
    // TODO: any whitespace on the end would get trimmed out
    // NOTE: unwrap panics if the remaining input is not valid UTF-8
    let (input, val) =
        rest_of_line(std::str::from_utf8(input).unwrap())
            .map_err(Err::convert)?;
    Ok((
        input.as_bytes(),
        ATXHeading {
            level: hashes.len(),
            value: val.as_bytes(),
        },
    ))
}
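
The mdx_error module isn't shown in this post, so here's a small sketch of the general pattern in plain Rust: a custom error that can wrap a lower-level error through a From impl and also carries a variant of its own. All of the names here (`LowLevelError`, `HeadingError`, `count_hashes`, `heading_level`) are hypothetical and only illustrate the idea; the real MDXError may look quite different.

```rust
/// Stand-in for the error a lower-level parser would produce.
#[derive(Debug, PartialEq)]
pub struct LowLevelError(pub &'static str);

/// Hypothetical custom error: wraps lower-level failures and adds our own variant.
#[derive(Debug, PartialEq)]
pub enum HeadingError {
    Parser(LowLevelError),
    TooManyHashes,
}

impl From<LowLevelError> for HeadingError {
    fn from(e: LowLevelError) -> Self {
        HeadingError::Parser(e)
    }
}

/// Lower-level step: count leading `#` bytes, failing if there are none.
fn count_hashes(input: &[u8]) -> Result<usize, LowLevelError> {
    match input.iter().take_while(|&&b| b == b'#').count() {
        0 => Err(LowLevelError("expected #")),
        n => Ok(n),
    }
}

/// Higher-level step: reuse the lower-level parser, then apply the heading rule.
pub fn heading_level(input: &[u8]) -> Result<usize, HeadingError> {
    // `?` converts LowLevelError into HeadingError through the `From` impl.
    let level = count_hashes(input)?;
    if level > 6 {
        return Err(HeadingError::TooManyHashes);
    }
    Ok(level)
}
```

The point of the conversion is that one function can fail in two ways, with a borrowed parser's error or with our own rule, and still have a single error type in its signature.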

Finally, here's a test that asserts that we can parse an mdx string into the ATXHeading AST.

rust
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn parse_atx_heading() {
        assert_eq!(
            atx_heading(b"# boop"),
            Ok((
                "".as_bytes(),
                ATXHeading {
                    level: 1,
                    value: b"boop"
                }
            ))
        );
    }
}

Note that this is not a fully spec-compliant parser (we noted TODOs in the program comments), but it will work for headings written in the simple form shown here. Can you flesh this out to parse the rest of the ATX heading in the spec? This is part of my work on the MDX Rust implementation, so by the time you read this there may be a more sophisticated parser for headings waiting for you there.
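
As a starting point for the first TODO, here's a minimal sketch of the indentation rule, assuming CommonMark's limit of up to three leading spaces before the opening hashes. `strip_heading_indent` is a made-up helper written in plain Rust rather than with nom, so treat it as an illustration of the rule and not as how the real implementation handles it.

```rust
/// Hypothetical helper: skip up to three leading spaces before a heading.
/// Four or more spaces means the line can't be an ATX heading, so return `None`.
pub fn strip_heading_indent(input: &[u8]) -> Option<&[u8]> {
    let indent = input.iter().take_while(|&&b| b == b' ').count();
    if indent > 3 {
        None
    } else {
        Some(&input[indent..])
    }
}
```

The output of a helper like this could then be handed to the hashes parser in place of the raw input.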