pantheon: Parsing command line arguments

28 September 2024

After managing free memory I wanted to implement another feature, but it launched me into a very long yak shave. The next few posts (I can think of at least three!) will take us on a bit of a side journey before we can come back to the kernel (hades).

This first post is about writing a library to parse command line arguments (I know, this seems very far removed from kernel development).

It's named sheshat, after the ancient Egyptian goddess of writing (among others).

The Goal

Before writing any code let's talk a bit about what we want. This project is going to be a standalone crate to parse command line-style arguments. Because we may not have access to the standard library in all contexts (maybe we want to use our library in the kernel?), we are going to write a no_std library.

Here are the different kinds of arguments we want to parse:

- Flags/switches that take no value (--switch, -x), where short flags can be grouped (-xy)
- Options that take a value (--long=str)
- Positional arguments (1 2 3)

We also want to respect common patterns, like -- marking the end of options & - being a valid value. This crate won't handle whitespace splitting: the main input will be an array of &str (or, to be more user-friendly, of T: AsRef<str>).

The main interface we are going to expose is a derive macro, given a struct Args:

struct Args<'a> {
    switch: bool,
    other_switch: bool,
    x_option: bool,
    y_option: bool,
    long: &'a str,
    first_positional: u64,
    remaining_positional: Vec<u64>,
}

We are going to be able to parse something like --switch -xy 1 2 3 --long=str into:

Args {
    switch: true,
    other_switch: false,
    x_option: true,
    y_option: true,
    long: "str",
    first_positional: 1,
    remaining_positional: vec![2, 3],
}

In order to choose how fields map to arguments we will need to introduce some attributes, but we will talk more about that when we describe the derive macro itself.

Note: In the example we introduced a borrowed string; the goal is to be able to borrow from the arguments.

Lexing command line arguments

"Lexing" arguments seem a bit redundant, as we have already established that the library won't perform whitespace splitting, which is a big part of parsing arguments. Well this is a task best left to the program "launcher" whatever it is (a shell for example), but we still have some complexity to wrangle.

The interface for this step was heavily inspired by clap_lex; it helped me a lot to understand the edge cases!

We will introduce two main components:

/// A wrapper around an array of arguments
struct Arguments<'a, T: AsRef<str>>(&'a [T]);
/// The current index in the array of arguments
struct ArgCursor(usize);

impl<'a, T: AsRef<str>> Arguments<'a, T> {
    fn peek_arg(&self, cursor: &ArgCursor) -> Option<ParsedArgument<'a>>;
    fn advance(&self, cursor: &mut ArgCursor);
    /// Equivalent to calling peek_arg & advance
    fn next_arg(&self, cursor: &mut ArgCursor) -> Option<ParsedArgument<'a>>;
}

Those methods are mainly an abstracted way to iterate over an array; the interesting methods are on the ParsedArgument.
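As a rough sketch, this layer is meant to be driven like this (the construction here is illustrative, not the exact API):

let argv = ["--switch", "-xy", "value"];
let args = Arguments(&argv);
let mut cursor = ArgCursor(0);
while let Some(arg) = args.next_arg(&mut cursor) {
    // Inspect `arg` to decide whether it is a long option, a group of
    // short options, `--`, or a plain value.
}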

This struct mainly exposes two kinds of methods:

- Methods to categorize the argument: is it a long option, a group of short options, the -- separator, or a plain value?
- Methods to extract the content: the long option's name & value, or a ShortArgument iterating over the short options' letters

We need to be careful when categorizing arguments, as we don't want --foo to match a short argument, nor do we want -- to match a long argument.

The reason for introducing a ShortArgument structure is that short arguments are inherently ambiguous. -abc could be parsed as either -a -b -c, as the option -a taking the value bc, or even as -a followed by -b taking the value c. The ShortArgument is an iterator over the letters, with a method to return the remaining string if it should be a value. This means that the argument parsing layer, which knows which letters take a value, is able to correctly drive the lexing.
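To make this concrete, here is a minimal sketch of what such a ShortArgument could look like (illustrative, not the exact implementation):

/// Iterates over the letters of a short argument like `-abc` (without the `-`).
struct ShortArgument<'a> {
    remaining: &'a str,
}

impl<'a> Iterator for ShortArgument<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let c = self.remaining.chars().next()?;
        self.remaining = &self.remaining[c.len_utf8()..];
        Some(c)
    }
}

impl<'a> ShortArgument<'a> {
    /// Stop iterating over letters and treat whatever is left as a value:
    /// for `-abc` with `-a` taking a value, this yields `a`, then `"bc"`.
    fn remaining_value(self) -> &'a str {
        self.remaining
    }
}

The caller decides, letter by letter, whether to keep iterating or to call remaining_value.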

Parsing command line arguments given a description

This layer is going to take two inputs:

- A description of the accepted arguments
- The raw arguments to parse

It will then output an iterator of parsed arguments (each being a positional, a flag, or an option).

The interface is pretty similar to getopt_long: we will introduce a struct Argument that contains the short letter, the long name & whether the argument takes a value. It also contains a name (of an arbitrary type), used by the caller to identify which argument has been provided.

#[non_exhaustive]
#[derive(Debug)]
pub struct Argument<'a, N> {
    pub name: N,
    pub short: Option<char>,
    pub long: Option<&'a str>,
    pub takes_value: bool,
}

The output will be a ParsedArgument:

#[derive(Debug, PartialEq, Eq)]
pub enum ParsedArgument<'a, N> {
    Positional(&'a str),
    Flag(N),
    Option(N, &'a str),
}

We will introduce a (different) struct Arguments that will store the state of our parsing. We mainly need three pieces of information:

- The description of the accepted arguments
- The lexer's Arguments & its ArgCursor
- Whether we have already encountered --, after which everything is positional

We can then implement all the logic in the Iterator::next method! The Iterator::Item will be a Result<ParsedArgument<_>, Error>, where Error covers unknown arguments as well as mismatches between arguments & their values.

The next method is pretty long, but I'll summarize what needs to be done:

- If we have already seen --, return the current argument as positional
- If the current argument is --, record that fact and move on to the next one
- If the current argument is a plain value, return it as positional

If we are in neither of those cases we are parsing an argument. This can either be a short or a long argument; the handling is slightly different, but in essence we have to do the following:

- Look the argument up in the description, returning an error if it is unknown
- If it takes a value, extract it (erroring if none is present)
- Return the corresponding Flag or Option

This makes up the bulk of the library, and it has tests covering 100% of it (according to cargo-llvm-cov)! The interface of this layer is really verbose to consume, as we need to write a getopt-like loop to parse the arguments, as sketched below. We can do much better with a derive macro!
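For illustration, consuming this layer for the introduction's Args looks roughly like this (ArgsNames is a hand-written name enum for this sketch, and error handling is collapsed into ?):

let desc = &[
    // One `Argument` per flag/option (`--switch`, `-x`, `--long`, ...)
];
let mut switch = false;
let mut long = None;
for arg in Arguments::new(desc, argv) {
    match arg? {
        ParsedArgument::Flag(ArgsNames::Switch) => switch = true,
        ParsedArgument::Option(ArgsNames::Long, value) => long = Some(value),
        ParsedArgument::Positional(value) => { /* dispatch on a hand-kept index */ }
        _ => { /* ... the other flags & options ... */ }
    }
}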

Derive Macro

Writing a derive macro won't be very simple: because of our rule of using no external dependencies, we won't have access to syn or quote. Both of those crates are pretty complicated, so instead of redeveloping a clone from scratch we are going to write some ad-hoc code that will not be as robust.

Introduction to procedural macros

Derive macros are procedural macros. They are implemented as functions that are dynamically loaded into the compiler, taking "Rust" code as input & outputting Rust code.

This is done through the TokenStream type, which is mostly an iterator over TokenTree items. TokenTrees are really low level: the Rust compiler has done very little processing on them, they are mostly the output of lexing the input string. TokenTree is an enumeration with the following variants:

- Group: a token stream surrounded by delimiters ((), [] or {})
- Ident: an identifier or keyword (foo, struct)
- Punct: a single punctuation character (e.g. < or ,)
- Literal: a literal ("foo", 42)
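For example, struct Foo { x: u32 } reaches us roughly as:

Ident("struct")
Ident("Foo")
Group(Brace, [ Ident("x"), Punct(':'), Ident("u32") ])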

As you can see we don't have much to go by: we will need to match the Rust syntax ourselves, which is why syn is so useful.

A derive macro takes a TokenStream representing the item it was applied on, and outputs a TokenStream that is appended after the item. To define a proc-macro we need to create a separate crate:

[package]
name = "sheshat-derive"
version = "0.1.0"
edition = "2021"

[lib]
proc-macro = true

[dependencies]

We can then register our macro:

use proc_macro::TokenStream;

#[proc_macro_derive(Sheshat, attributes(sheshat))]
pub fn sheshat(input: TokenStream) -> TokenStream {
    todo!()
}

This will define a derive macro named Sheshat, which accepts attributes of the form #[sheshat(...)].

Parsing a struct

In order to simplify our implementation as much as possible, we are only going to support structs with named fields (i.e. struct Foo { ... }). Whenever we encounter an invalid situation we are simply going to panic!. This generates terrible error messages: to produce good ones we would need to output a TokenStream that emits compile_error! with the correct span (which is what syn implements). Doing that is pretty involved, and my implementation is already 750 lines of code, so I will have to live with the bad error messages.

To guide our parsing we can look at the Rust reference. We can see that we first need to (optionally) parse a repetition of OuterAttribute. Those are items of the form:

#[derive(Sheshat)]
#[sheshat(something)]
struct Args {}

We are going to support only one outer attribute: #[sheshat(borrow(<lifetime>))], where <lifetime> will be a lifetime that is allowed to borrow the input.
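For example, on a struct borrowing from the input (this shows only the outer attribute):

#[derive(Sheshat)]
#[sheshat(borrow('a))]
struct Args<'a> {
    long: &'a str,
}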

Warning: When parsing attributes you will see all the attributes on the item. For example, if the struct also uses serde's attributes, you will encounter those. This is the case in all of the following sections: you need to take care to ignore attributes that are not related to your macro.

After handling the required attributes we need to scan (meaning read the tokens without doing anything with them) the possible visibility and the struct keyword.

The next token should be the name of the struct, which we must store as we are going to need output code of the form impl <trait> for <the derived struct> { }.

We now encounter our first tricky bit of parsing: we may need to parse a group of generic parameters. You may have realized while reading the introduction that a TokenTree::Group is never delimited by <..>. Indeed, < and > are TokenTree::Puncts. This means that we must read from a < until the corresponding >, as we could have an arbitrary stack of nested <, >. This can be done easily by maintaining a depth counter, and exiting the parsing when we encounter a > while the depth is 1.
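A sketch of that depth counter (we have just consumed the opening <, and iter yields the remaining TokenTree items):

let mut depth = 1usize;
for tt in iter.by_ref() {
    if let TokenTree::Punct(p) = &tt {
        match p.as_char() {
            '<' => depth += 1,
            '>' if depth == 1 => break, // this one closes the generics
            '>' => depth -= 1,
            _ => {}
        }
    }
    // (keep `tt` around if the generics need to be re-emitted in the output)
}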

We can now parse the fields of the struct. A field starts pretty much like the struct itself, with attributes & visibility, followed by an identifier for the name. We then have to parse the type of the field, and doing this cleanly is pretty involved too, as types can be pretty complicated. Fortunately we don't care about the internals of the type, we only need to interact with the full type of the field. This means that we can read until we find either a , or the end of the struct. Well, we still need to maintain a depth counter, as generic parameters could introduce commas that don't mean the end of the field's type.
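The same trick works for the type (a sketch; real code needs a bit more care, e.g. around the > of -> in function types):

let mut field_type = Vec::new();
let mut depth = 0usize;
for tt in iter.by_ref() {
    if let TokenTree::Punct(p) = &tt {
        match p.as_char() {
            '<' => depth += 1,
            '>' => depth = depth.saturating_sub(1),
            ',' if depth == 0 => break, // end of this field
            _ => {}
        }
    }
    field_type.push(tt);
}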

For more information on exactly how this is implemented please refer to the full implementation.

Implementing the trait

We now have the following information:

- The name of the struct & its generic parameters
- For each field: its name, its type & its attributes

We are going to map fields that are neither short nor long to positional arguments. Before doing anything else we need to define the trait we are going to implement:

#[derive(Debug)]
pub enum Error<'a, E, N> {
    Parsing(E),
    /// An error in the argument parsing
    InvalidArgument(args::Error<'a, N>),
    TooManyPositional,
    MissingPositional(&'static str),

    MissingArgument(N),
}

pub trait Sheshat<'a>: Sized {
    type ParseErr;
    type Name;

    fn parse_arguments<T: AsRef<str>>(args: &'a [T]) -> Result<Self, Error<'a, Self::ParseErr, Self::Name>>;
}
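Deriving this trait will let the caller write something like:

let argv = ["--switch", "-xy", "1", "2", "3", "--long=str"];
let args = Args::parse_arguments(&argv)?;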

We can easily generate a <name>Names enum for mapping each short/long field to a variant.
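For the introduction's Args, the generated enum could look like this (an illustrative sketch of the macro output):

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgsNames {
    Switch,
    OtherSwitch,
    XOption,
    YOption,
    Long,
}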

To parse the arguments we are going to need to:

- Generate an initial (empty) value for each field
- Build the Argument description & extract the values from the iterator of parsed arguments
- Realize the final value of each field, erroring on missing arguments

The main issue for all those operations is that we want an entirely different behavior depending on the type:

- bool fields are flags that take no value
- &'a str fields borrow the value as-is
- Types implementing FromStr are parsed from the value
- Option<...> fields are optional
- Collections (implementing Extend<...>) accumulate repeated values

Well, we could use traits for this, they are made to abstract over types, right? Unfortunately the ... in that list hide something really important: we want all types that implement FromStr to be parseable. This means that we just asked for something currently impossible from the compiler: the ability to use a trait for an operation, except for a few specific types. The name for this is specialization, and it's a pretty complicated feature that is not ready at all.

So is our derive macro impossible then? No! In macros we can use a kind of specialization: autoref specialization.

Autoref specialization

Autoref specialization is a pretty neat trick that allows a macro to perform specialization. I'm not going to explain how it works, please look at the linked article for that, but I am going to describe how to obtain all the different behaviours we want.

First we are going to introduce a struct To:

pub struct To<T>(pub PhantomData<T>);

This struct is going to be the value on which we perform the specialization by calling a method.

We are then going to introduce a number of markers that represent the different behaviours we want:

/// For parsing `bool` values
pub struct FlagArg;
/// For parsing `&'a str` values
pub struct IdOptArg<'a>(PhantomData<&'a ()>);
/// For parsing `impl FromStr` values
pub struct ParseOptArg<T>(PhantomData<T>);
/// For parsing `Option<&'a str>` values
pub struct OptionalIdOptArg<'a>(PhantomData<&'a ()>);
/// For parsing `Option<impl FromStr>` values
pub struct OptionalParseOptArg<T>(PhantomData<T>);
/// For parsing `impl Extend<&'a str>` values
pub struct SequenceIdArg<'a, S>(PhantomData<(&'a (), S)>);
/// For parsing `impl Extend<impl FromStr>` values
pub struct SequenceParseArg<S, T>(PhantomData<(S, T)>);

We are then going to define an _arg method on To that returns the appropriate marker for the type! Because we need to forward the generic type/lifetime to the marker, we need to include a generic argument to _arg, which is the type of the field we extracted in the derive macro.

This is the essence of autoref specialization, we will have a different _arg method for each situation, and the compiler will choose the first one that matches.

pub trait ViaFlagArg {
    fn _arg<T>(&self) -> FlagArg {
        FlagArg
    }
}
impl ViaFlagArg for &&&&&&To<bool> {}

pub trait ViaIdOptArg {
    fn _arg<'a, T: 'a>(&self) -> IdOptArg<'a> {
        IdOptArg(Default::default())
    }
}
impl<'a> ViaIdOptArg for &&&&&To<&'a str> {}

pub trait ViaOptionalIdOptArg {
    fn _arg<'a, T: 'a>(&self) -> OptionalIdOptArg<'a> {
        OptionalIdOptArg(Default::default())
    }
}
impl<'a> ViaOptionalIdOptArg for &&&&To<Option<&'a str>> {}

pub trait ViaOptionalParseOptArg {
    fn _arg<T>(&self) -> OptionalParseOptArg<T> {
        OptionalParseOptArg(Default::default())
    }
}
impl<T> ViaOptionalParseOptArg for &&&To<Option<T>> where T: FromStr {}

pub trait ViaSequenceIdArg {
    fn _arg<'a, S>(&self) -> SequenceIdArg<'a, S> {
        SequenceIdArg(Default::default())
    }
}
impl<'a, S> ViaSequenceIdArg for &&To<S> where S: Extend<&'a str> + Default {}

pub trait AppendParse<T: FromStr> {
    fn append(&mut self, v: &str) -> Result<(), T::Err>;
}

impl<S, T> AppendParse<T> for S
where
    T: FromStr,
    S: Extend<T>,
{
    fn append(&mut self, v: &str) -> Result<(), T::Err> {
        self.extend(core::iter::once(v.parse()?));
        Ok(())
    }
}

pub trait Id<S> {}
impl<S> Id<S> for S {}

pub trait ViaSequenceParseArg<T, S> {
    /// The issue here is that we can only use 1 type parameter in `_arg`, but our marker has 2.
    /// I think there is another way to fix this, by adding the type of `To<T>` to each trait,
    /// but this works too: it lets us link the trait's & the method's type parameters.
    fn _arg<Sp: Id<S>>(&self) -> SequenceParseArg<S, T> {
        SequenceParseArg(Default::default())
    }
}
// We need to restrict `Extend<_>` to be for `T`, because `Extend` has two possible impls:
// one for copies & references, one for "owned" values.
// `IntoIterator` allows us to choose the "owned" case.
impl<T, S> ViaSequenceParseArg<T, S> for &To<S>
where
    S: IntoIterator<Item = T>,
    S: AppendParse<T>,
    T: FromStr,
{
}

pub trait ViaParseOptArg {
    fn _arg<T>(&self) -> ParseOptArg<T> {
        ParseOptArg(Default::default())
    }
}
impl<T> ViaParseOptArg for To<T> where T: FromStr {}

In order to get the marker we can do the following:

let {field_name}_token = (&&&&&&To::<{field_type}>(Default::default()))._arg::<{field_type}>();

Note that the number of & is equal to the number of & in the bool implementation. When trying to call the _arg method the compiler will first try with the supplied value. If this value does not implement the ViaFlagArg trait, the compiler will dereference the value once, arriving at &&&&&To::<T>, where it will try the ViaIdOptArg trait.

The compiler will then stop at the first dereference that works, giving us the correct marker for our type!

We can then implement methods with the same name on each marker for the different kinds of specialization we need (does the field take a value, what is the initial value, how do we parse it, ...). The entire parsing code is available here.
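As an illustration, the methods on two of the markers could look like this (the names are made up for this sketch; the real set lives in the linked code):

impl FlagArg {
    /// A flag never takes a value.
    pub fn takes_value(&self) -> bool { false }
    /// A flag starts out unset...
    pub fn init(&self) -> bool { false }
    /// ...and is set when the argument is seen.
    pub fn handle(&self, slot: &mut bool) { *slot = true; }
}

impl<'a> IdOptArg<'a> {
    /// A `&str` option takes a value.
    pub fn takes_value(&self) -> bool { true }
    /// No value until the option is seen (checked when realizing the field).
    pub fn init(&self) -> Option<&'a str> { None }
    /// Store the borrowed value as-is.
    pub fn handle(&self, slot: &mut Option<&'a str>, value: &'a str) { *slot = Some(value); }
}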

With our markers we can trivially generate the initial value & realize the value, as this is simply a for loop in the derive macro. Extracting the values from the iterator is a bit more involved.

Parsing each argument

We are going to have a loop of this form:

let desc = &[ /* Generate the `Argument` from the {field_name}_token values */ ];
// Make the argument iterator
let args = Arguments::new(desc, args);
for arg in args {
    match arg? {
        ::sheshat::ParsedArgument::Positional(value) => ...,
        ::sheshat::ParsedArgument::Flag(name) => ...,
        ::sheshat::ParsedArgument::Option(name, value) => ...,
    }
}

Generating the actions to perform for Flag and Option is pretty easy:

match name {
    // For each field:
    {struct_name}Name::{field} => {field}_token.handle(...)?,
}

For positional arguments we don't have something as trivial. We need a bit of state that tells us which positional argument we are interacting with. We can define a mut positional_index = 0, and match on the value of the index.

This gives something of the form:

match positional_index {
    0 => field_0_token.handle(...)?,
    1 => field_1_token.handle(...)?,
    _ => /* Raise an error due to too many positional arguments */,
}

We then need to advance positional_index, unless the current field is a sequence, in which case it keeps collecting all the remaining positional arguments (like remaining_positional in the introduction).

This is pretty straightforward to do with our marker types.

Handling Subcommands

Subcommands are pretty useful when a program can perform different functions depending on the input. They are positional arguments that change the meaning of the subsequent arguments.

The article is already pretty long, so I won't go into too many details on them, but they are pretty easy to fit in.

In essence, we can define another derive macro, on an enum this time, which will implement the following trait:

pub trait SheshatSubCommand<'a>: Sized {
    type SubCommandErr;

    fn parse_subcommand<T: AsRef<str>>(
        subcommand: &'a str,
        args: lex::Arguments<'a, T>,
        cursor: lex::ArgCursor,
    ) -> Result<Self, SubCommandError<'a, Self::SubCommandErr>>;
}

When the argument parsing loop encounters the index of the subcommand it calls this method on the subcommand type, passing it the arguments that have yet to be parsed. The parsing can then continue exactly as before, in a recursive manner.
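For instance, a hypothetical CLI with two subcommands could be modelled as (a sketch; the exact shape the derive expects may differ):

#[derive(SheshatSubCommand)]
enum Command<'a> {
    // Each variant wraps the `Sheshat`-derived arguments of one subcommand.
    Add(AddArgs<'a>),
    Remove(RemoveArgs<'a>),
}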

Do note that the exact implementation is a bit tricky around the errors, in order to avoid allocations.

With this argument parsing library we will be able to write the auxiliary tooling we need in order to continue implementing our kernel.

For more in-depth details please look at the code! It's not too complicated, though it's written with very few comments.