SUPERPOWER
The parser combinator library – Part 2
Believe me: you’ll want to learn about this
Parsing is really a superpower
You never thought you would want to learn about parsing, but that’s probably because you either haven’t heard about this topic or you’re scared of it. If the former is true, it’s time to learn about this interesting subject. If the latter applies… fear not! Parsers are like math: as cool as they are difficult. But the simple things are usually easy to do.
Let’s get into it!
Meet Superpower
Superpower is the parser combinator library that will make you wish you had known it before.
You can define your parser with plain C#. Even better: You can define your parsers using LINQ! Look at this:
```csharp
var assignmentParser =
    from identifier in Identifier
    from eq in Token.EqualTo(SimpleToken.Equal)
    from expr in Expression
    select new Assignment(identifier, expr);
```
This is a parser combinator. It’s called “combinator” because it allows you to define a parser by combining existing parsers.
In the code above we are creating a parser for assignment expressions that will parse expressions like a = 1+b where:
- “a” is the identifier,
- “=” is the equal sign
- “1+b” is the expression
If you look at the LINQ syntax, you will notice that the definition is nicely readable and almost matches the definition of a prototypical assignment expression. For instance, in BNF we could define an assignment like this:
<assignment> ::= <identifier> “=” <expression>
That reads like “an assignment is an identifier followed by the equal sign followed by an expression”.
Notice how this definition resembles the syntax of the LINQ example above… I think this way to define a parser is great!
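For reference, the select clause in that parser needs a node type to project into. Here is a minimal sketch of what such a type could look like — the names come from the snippet above, but the concrete shape of Assignment is my assumption:

```csharp
// Hypothetical AST node backing the assignment parser above.
// Only the names come from the LINQ snippet; this concrete shape
// is an assumption for illustration, not Superpower's API.
public abstract class Expression { }

public class Assignment
{
    public Assignment(string identifier, Expression expression)
    {
        Identifier = identifier;
        Expression = expression;
    }

    public string Identifier { get; }
    public Expression Expression { get; }
}
```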
But, before creating a parser, we need to deal with tokens.
Tokens
Parsers don’t consume text directly. Doing that would greatly increase complexity. To keep things simple, there is a preliminary stage in which the stream of characters is turned into what we call “tokens”.
What’s a token?
Tokens are small fragments of text that carry meaning in our language: the smallest meaningful pieces. What does that mean? Let’s illustrate it with an example:
Imagine that your language represents math expressions like 2*(3+4).
We have numbers and symbols, and after a bit of analysis, we decide that we need the following tokens:
- Number
- Plus
- Asterisk
- OpenParenthesis
- CloseParenthesis
So, our original expression “2*(3+4)” turned into a sequence of tokens would be:
Number Asterisk OpenParenthesis Number Plus Number CloseParenthesis
To make it clearer, here is the previous sequence with the values annotated:
Number("2") Asterisk("*") OpenParenthesis("(") Number("3") Plus("+") Number("4") CloseParenthesis(")")
What have we done?
In a few words, we assigned a meaning to the input characters and categorized them to make them richer: the tokens. The characters of the string “123” are no longer just a bunch of characters, but a meaningful pack: they form a token of type “Number” whose value is “123”. This way we can handle the input easily in the actual parsing stage.
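Conceptually, then, a token is just a pair of a kind and the text it covers. Superpower models this with its Token<TKind> type, which exposes the kind and the matched span; a simplified sketch of the idea could look like this:

```csharp
// Simplified illustration of what a token carries. Superpower's real
// type is Token<TKind>, which exposes Kind and the matched text span;
// this struct is just a conceptual sketch.
public enum MyToken { Number, Plus, Asterisk, OpenParenthesis, CloseParenthesis }

public readonly struct SimpleTokenSketch
{
    public SimpleTokenSketch(MyToken kind, string text)
    {
        Kind = kind;
        Text = text;
    }

    public MyToken Kind { get; }   // e.g. MyToken.Number
    public string Text { get; }    // e.g. "123"
}
```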
Tokenizers
But how do we turn an input string into a sequence of tokens? Good question. We have what Superpower calls “tokenizers”. They are artifacts that do exactly that: they turn the input text into tokens. Technically, “tokenizer” is another name for “lexer”. Lexers perform what the jargon calls the lexical analysis of the input string, in a stage that is usually called “tokenization”, hence the name.
Parsing stages
To make everything clearer, I’ll show you the different stages that parsing involves.
| Order | Artifact | Consumes | Produces |
|-------|----------|----------|----------|
| 1 | Lexer (Tokenizer) | Characters | Tokens |
| 2 | Parser | Tokens | AST (Abstract Syntax Tree) |
There are only two here, but if we were parsing a complex language, we would have more stages. For instance, stage 3 would be the semantic analysis. I won’t cover it in this post, because it’s more related to compilers. If you want to know more, you can take a look at the State Machine Compiler project I created some time ago.
How do we build a tokenizer?
Superpower has a very simple builder that will make it rather easy to create your own tokenizer. Let’s create a tokenizer for the syntax we considered above:
```csharp
var ourTokenizer = new TokenizerBuilder<MyToken>()
    .Match(Numerics.Integer, MyToken.Number)
    .Match(Character.EqualTo('+'), MyToken.Plus)
    .Match(Character.EqualTo('*'), MyToken.Asterisk)
    .Match(Character.EqualTo('('), MyToken.OpenParenthesis)
    .Match(Character.EqualTo(')'), MyToken.CloseParenthesis)
    .Build();
```
As simple as that.
Let’s see the complete tokenizer definition and the output
```csharp
void Main()
{
    var ourTokenizer = new TokenizerBuilder<MyToken>()
        .Match(Numerics.Integer, MyToken.Number)
        .Match(Character.EqualTo('+'), MyToken.Plus)
        .Match(Character.EqualTo('*'), MyToken.Asterisk)
        .Match(Character.EqualTo('('), MyToken.OpenParenthesis)
        .Match(Character.EqualTo(')'), MyToken.CloseParenthesis)
        .Build();

    var tokens = ourTokenizer.Tokenize("2*(3+4)");

    tokens.Select(x => new { Kind = x.Kind, Text = x.Span.ToStringValue() }).Dump();
}

public enum MyToken
{
    Number,
    Plus,
    Asterisk,
    OpenParenthesis,
    CloseParenthesis,
}
```
To see it better, I’ve captured a snapshot of LINQPad (a very useful tool) with the code and the formatted output of the tokenization
As you can see in the table at the bottom of the snapshot, the tokenizer outputs the very same sequence of tokens we had before.
Parsing tokens
Once we have the tokenizer that emits a sequence of tokens, we’re ready to parse them.
This stage is called the syntactic analysis (the parsing itself) in which we will construct a tree (the AST). Why a tree? Because it’s a structure that can grow indefinitely to hold all the meaning of the constructs of our language.
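For example, the math expression 2*(3+4) from the tokenizer section could end up as a tree with the multiplication at the root. The node types below are hypothetical, just to illustrate the shape an AST can take:

```csharp
// 2*(3+4) as a tree: '*' at the root, and the parenthesized sum as
// its right child. The node types here are hypothetical illustrations.
var ast = new BinaryOperation('*',
    new NumberLiteral(2),
    new BinaryOperation('+', new NumberLiteral(3), new NumberLiteral(4)));

public abstract class Expr { }

public class NumberLiteral : Expr
{
    public NumberLiteral(int value) => Value = value;
    public int Value { get; }
}

public class BinaryOperation : Expr
{
    public BinaryOperation(char op, Expr left, Expr right)
    {
        Operator = op;
        Left = left;
        Right = right;
    }

    public char Operator { get; }
    public Expr Left { get; }
    public Expr Right { get; }
}
```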
Building a parser is what we did in the first part of this post. Do you remember? Let’s refresh our minds:
```csharp
var assignmentParser =
    from identifier in Identifier
    from eq in Token.EqualTo(SimpleToken.Equal)
    from expr in Expression
    select new Assignment(identifier, expr);
```
As I told you before, a parser in Superpower is created by composing one or more parsers.
In the case above, the Assignment Parser is made of
- The “Identifier” parser, defined elsewhere.
- An in-place parser created by Token.EqualTo(SimpleToken.Equal). Token.EqualTo(…) creates a parser that matches the specified token.
- The “Expression” parser, that is also defined elsewhere.
But how is the Identifier parser defined?
Identifier = Token.EqualTo(SimpleToken.Identifier).Select(x => x.ToStringValue());
Easy, huh? It’s a parser that will match the Identifier value of the SimpleToken enum and return the text of the token (whatever text it holds).
And what about the parser called “Expression”?
Expression = CallExpression.Try().Or(NumberParameter).Or(TextParameter).Or(IdentifierParameter);
What does it mean? That an expression is either a CallExpression, a NumberParameter, a TextParameter or an IdentifierParameter. The Try() operator is needed because if the parser consumes a token and then fails to match, the parsing pipeline should go back and “retry” with the rest of the parsers.
I hope you get the idea.
Operators
There are a lot of operators that you can use to create your parsers. This is a list of the most common ones:
- Many. Matches zero or more appearances. For example, Identifier.Many() will match zero or more identifiers.
- AtLeastOnce. Matches at least one appearance. Example: Number.AtLeastOnce()
- OptionalOrDefault. Matches zero or one appearances. If no match is found, it returns the default value of the parser’s type. For example:
```csharp
from i in Identifier
from n in Number.OptionalOrDefault()
select new { Id = i, Number = n };
```
If there’s no number after the identifier, n will be null. This operator is useful to create parsers with optional parts.
- ManyDelimitedBy. It’s the same as Many, but you can specify a parser for delimiters. This is useful to match lists like “red,blue,white,green”, that are names delimited by a comma. In this example, it would be defined like Name.ManyDelimitedBy(Token.EqualTo(Token.Comma))
- Between. It will match x between y and z. To understand it just look at this example: Imagine that you wanted to match this:
(a, b, c, d)
You can define it with the following:
- Comma = Token.EqualTo(Token.Comma)
- LParen = Token.EqualTo(Token.LParen) – left parenthesis
- RParen = Token.EqualTo(Token.RParen) – right parenthesis
- ManyDelimitedBy(Comma).Between(LParen, RParen)
- Chain. Matches operands joined by left-associative operators. This is a bit difficult to understand, but it is basically used when you want to define parsers for expressions like 1+3*4. It “chains” operands with the operators.
Example:
Parse.Chain(Operator, Operand, (operation, left, right) => new BinaryOperation(operation, left, right))
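To make Chain more concrete, here is a sketch of a tiny calculator built on the MyToken enum and tokenizer from earlier in the post. The parser names and the decision to evaluate to int directly (instead of building an AST) are mine, and parentheses are left out for brevity:

```csharp
// A hedged sketch of Parse.Chain, assuming the MyToken enum and the
// tokenizer defined earlier. Evaluates directly to int for simplicity.
static readonly TokenListParser<MyToken, int> Number =
    Token.EqualTo(MyToken.Number).Apply(Numerics.IntegerInt32);

static readonly TokenListParser<MyToken, char> Times =
    Token.EqualTo(MyToken.Asterisk).Value('*');

static readonly TokenListParser<MyToken, char> Plus =
    Token.EqualTo(MyToken.Plus).Value('+');

// Multiplication is chained first, so it binds tighter...
static readonly TokenListParser<MyToken, int> Term =
    Parse.Chain(Times, Number, (op, left, right) => left * right);

// ...and addition chains whole terms: "1+3*4" parses as 1+(3*4).
static readonly TokenListParser<MyToken, int> Expr =
    Parse.Chain(Plus, Term, (op, left, right) => left + right);
```

Running Expr over the tokens of “1+3*4” folds left to right within each chain: Term reduces 3*4 to 12 first, and the outer chain then adds 1, giving 13.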
Well, this is just the start, but you now have plenty of tips to begin creating your own parsers and tokenizers.
If you want to see one simple parser in action, I’ve created one that is part of a super-simple scripting engine called SimpleScript (https://github.com/SuperJMN/SimpleScript/tree/master/SimpleScript)
Concretely, you can take a look at the tokenizer and the parser definitions.
If you want to know more, don’t hesitate to reach out to me on Twitter at @SuperJMN.
Learn more
Also, if you want to read more about Superpower, don’t hesitate to read the introductory article by its author, Nicholas Blumhardt. And don’t forget to check out its GitHub repo.
I hope you have learnt something about parsers with me. If you haven’t, don’t worry. Get the samples, clone the repos and get a taste. It can be hard at the beginning, but as soon as you get something working, you’ll love Superpower as much as I do.
Of course, you can use the comments section to give us your impressions and to get in touch, too.
And thanks for reading!
Written by: Jose Manuel Nieto, part of Idiwork’s team.