Understanding how top-down operator precedence parsing works

Published on January 21st, 2024.

In a recent stream, Jonathan Blow and Casey Muratori discussed how to implement a parser for a programming language. Jon has been working on his own programming language, Jai, since around 2014.

Jai is not a toy language. It’s already being used to develop a serious 3D puzzle game with modern graphics, compute shaders, etc. It has advanced features like arbitrary compile-time code execution, implicit context that can be overwritten to provide configurable functionality to libraries, and the ability for meta-programs to intercept and alter constructs in the compiled program on the fly.

In that talk, they go over how expression parsing was implemented in the Jai compiler, the reason why it wasn’t ideal anymore, and how the replacement, top-down operator precedence parsing (TDOP), works.

There’s some material on TDOP online, but I’ve always found them hard to understand. It never really clicked. Nobody seemed to have a simple way of explaining it. That is, until Jon’s stream.

He laid out a few examples that make it clear how the technique works and how to implement it. The code is trivial.

I’m writing this down while things are still in my brain cache. It’ll come in handy if the details get fuzzy later. Maybe this will help you as well.

There’ll be some C-flavoured C++ code. Look away now if you’re afraid of pointers. Note that the code here does not deal with error cases. I kept it tight and simple to illustrate the concepts more clearly.

All credit to the ideas presented here goes to Jonathan Blow.

A tiny expression language

Suppose we have an expression language that supports single-letter (lowercase) variable names and basic arithmetic operations (addition, subtraction, multiplication, and division).

Expressions in this language look like this:

a + b + c + d
a - b + c
a + b * c
a / b + c

Expressions such as a + b + c + d and a - b + c are trivial to evaluate: We can execute the + or - operators in any order and the result will be the same. Things get tricky when we add * and / to the mix.

In a + b * c, if we evaluate a + b first and then multiply it by c we get an incorrect result. The correct thing to do is to evaluate b * c and then add a. The same applies to the expression a / b + c: First do the division and then the addition.

To deal with the different evaluation order, we could use parentheses to make the order explicit: a + b * c could be turned into a + (b * c) and a / b + c into (a / b) + c.

If the a parser understands that parenteses indicate that section should be evaluated first, we don’t need to care about operator precedence. Just put the parentheses around the bits you care about and the parser will do the right thing. The problem is that this sucks to type and read (sorry Lisp people).

That’s where assigning precedences to different operators comes in: We can set up the precedence rules in a way that allows us to minimize the need to add explicit precedence markers (parentheses).

What does operator precedence even mean

This always confused me. I’m not totally sure why. Maybe because of the way parsing works.

When parsing we (generally) create a tree structure that represents the content we want to manipulate. The tree encodes the language’s grammar rules. In the case of our expression language, the tree also represents the evaluation order of its subexpressions. An expression like a + b * c - d could look something like this in tree form:

* is higher precedence than + and must be evaluated before the +. Since tree traversals start at the top, and we want the tree to encode the precedence of the operators, b * c must be lower in the tree in relation to the +.

An operator’s precedence indicates whether it should be placed higher or lower in the parse tree (in relation to another operator).

Remember: Higher-precedence = evaluated first = lower in the parse tree.

A naive parser

Now that we know what operator precedence is, let’s take a look at a parser implementation that is completely unaware of operator precedence: All operators are evaluated left to right. The parser handles expressions written for the language outlined earlier.

The parser should provide a tree structure encoding both leaf nodes (single-letter lowercase variable names) and intermediary nodes (arithmethic operators).

enum NodeKind {
    NodeKind_Add,
    NodeKind_Sub,
    NodeKind_Div,
    NodeKind_Mul,
    NodeKind_Variable,
};

// Parse tree node definition
struct Node {
    NodeKind kind;
    union {
        struct {
            Node * left;  // Left subtree of a binary operator
            Node * right; // Right subtree of a binary operator
        };
        char variable_name;   // Single-letter variable name 
    };
};

When the parser tries to handle an expression it first reads a leaf node (variable name). If the next token in the input is an arithmetic (binary) operator (+, -, /, *) it consumes the token and tries to parse the rest of the expression. If the next token is not an operator, the parser just returns the leaf node it read.

static Node * parse_expression(Parser * parser) {
    Node * left = parse_leaf(parser);

    Token next_token = get_next_token(parser);
    if (is_binary_operator(next_token)) {
        Node * right = parse_expression(parser);
        
        return make_binary_op_node(to_binary_op(next_token), left, right);
    }

    return left;
}

The parser starts on the “left” of the input and it keeps walking to the “right” by consuming tokens until it is done.

Let’s take a look at how parse_expression would handle a + b + c + d.

First the call to parse_leaf consumes a and creates a leaf node for it.

Then it sees that the next token is the + operator and recursively calls parse_expression to deal with the rest of the expression. The process repeats itself till we reach the variable d.

d gets consumed by the parse_leaf function and get_next_token returns an end of file marker (since there’s nothing left in the expression to parse). The end of file marker is not a binary operator, so the function returns the node containing the variable d.

At this point we have four separate tree nodes, none of which are connected.

As the recursion unwinds, the tree binary operator nodes are constructed and linked together.

The parser works and gives us a correct tree structure. Note how we get a left-leaning tree. We’ll come back to this later.

If we run it on the expression c + a * b, again the parser gives us the correct answer: * gets evaluated first (lower in the tree), followed by +.

But if we swap the order of the terms, turning it into a * b + c, things break down. + is placed lower in the tree and, therefore, gets evaluated before *. That yields the wrong result, even though the expression is effectively the same.

A tree-rewriting parser

If we look at the tree we get from the naive parser for the expression a * b + c, we can see how to fix it:

  1. The * node right subtree points to the + node left subtree.
  2. The + node left subtree points to the * node.
  3. The + node becomes the top of the tree.

static Node * parse_expression(Parser * parser) {
    Node * left = parse_leaf(parser);

    Token next_token = get_next_token(parser);
    if (is_binary_operator(next_token)) {
        Node * right = parse_expression(parser);
        Node * result = make_binary_op_node(to_binary_op(next_token), left, right);

        if (result->kind == NodeKind_Mul || result->kind == NodeKind_Div) {
            if (result->right->kind == NodeKind_Add || result->right->kind == NodeKind_Sub) {
                result->right = right->left;
                right->left = result;
                result = right;
            }
        }
        return result;
    }

    return left;
}

While this works for a simple expression like a * b + c, it’s a bit funny on something more complex like a * b / c + d:

The final expression is not necessarily wrong, but it goes against convention. In maths, operators with the same precedence (+ and -, * and /) are evaluated left to right. In this case, * should be lower in the tree compared to / since it is more to the left of the expression.

We can fix that issue by making a slight change: As long as the right child of the current node has lower (or equal) precedence, we keep pushing the current node down the tree.

static Node * parse_expression(Parser * parser) {
    ...

    if (is_binary_operator(next_token)) {
        ...

        Node dummy = { .left = current_node };
        Node * current_node_parent = &dummy;

        if (current_node->kind == NodeKind_Mul || current_node->kind == NodeKind_Div) {
            while (current_node->right->kind == NodeKind_Mul ||
                   current_node->right->kind == NodeKind_Div ||
                   current_node->right->kind == NodeKind_Add ||
                   current_node->right->kind == NodeKind_Sub) {
    
                // Push the current node down the tree.
                Node * tmp = current_node->right;
                current_node->right = current_node->right->left;
                tmp->left = current_node;

                // Update the parent pointer.
                current_node_parent->left = tmp;

                // Set up the new parent.
                current_node_parent = tmp;
            }
        }

        return dummy.left;
    }
    ...
}

If we do this at every level, we’re guaranteed to have a well formed expression tree at the end of the parsing procedure.

This is analogous the approach Jon describes as what he used in Jai for nearly 9 years before moving to TDOP. For simple languages with few operators this is easy to implement. It’s just writing code to do what you would do in your own head. But languages evolve over time, more operators get added, they interact in weird ways and need special handling and things get unwieldy.

As an example, our code is still not totally correct. For consecutive + and - sub-expressions, the resulting tree will still not evaluate left to right. The fix for that is simple: Like with * and / nodes, you keep pushing the current+ (or -) node down the tree while its right child has lower or equal precedence.

static Node * parse_expression(Parser * parser) {
    ...

    if (is_binary_operator(next_token)) {
        ...

        if (current_node->kind == NodeKind_Mul || current_node->kind == NodeKind_Div) {
            ...
        } else if (current_node->kind == NodeKind_Add || current_node->kind == NodeKind_Sub) {
            while (current_node->right->kind == NodeKind_Add ||
                   current_node->right->kind == NodeKind_Sub) {
    
                // Push the current node down the tree.
                Node * tmp = current_node->right;
                current_node->right = current_node->right->left;
                tmp->left = current_node;

                // Update the parent pointer.
                current_node_parent->left = tmp;

                // Set up the new parent.
                current_node_parent = tmp;
            }
        }

        ...
    }
    ...
}

A moment of insight

Let’s recap what we have seen so far.

The naive parser builds left-leaning trees. This happens because we just consume tokens from left to right and grow the tree by attaching sub-expression nodes as the right child of a tree node. There’s no concept of operator precedence: We just keep chugging along till the expression is fully parsed.

In the tree-rewriting parser, different operator precedences are accounted for: Higher-precedence operators get pushed down the tree. That causes them to be evaluated first. To rewrite the tree, we hoist the right child up and push the higher-precedence operator node down to the left subtree. That has the effect of tilting the tree to the right.

We tilt the tree to the right when we find a higher precedence operator because we are unwinding the recursion. We are going right to left when we do that. This requires us to consume the entire expression before we are able to adjust the tree at all.

What if we swapped things around? What if we adjusted the tree as we went left to right?

The tilting condition would be the opposite of the tree-rewriting parser.

As we go left to right, as long as the operators increase in precedence, we can keep building a left-leaning tree (sub-expressions get parsed as the right child of the previous node). Once the precedence of the operators becomes non-increasing (equal or lower), we tilt the tree to the right: Everything that came before the new lower precedence operator needs to be evaluated first and gets shifted downwards.

Finally an actual top-down operator precedence parser

We can assign explicit precedence scores to each operator. A higher score means higher precedence. Since all operators are known ahead of time, we can build a table where each index contains the precedence score for the operator matching it.

static int PRECEDENCE_TABLE[TokenKind__COUNT] = {
    0,  // TokenKind_Variable
    2,  // TokenKind_Plus
    2,  // TokenKind_Minus
    3,  // TokenKind_Slash
    3,  // TokenKind_Asterisk
    0,  // TokenKind_EOF
};

To parse an expression, we just make a minor adjustment: As long as operators of higher precedence are found to the right, we keep going.

Once we find an operator of lower or equal precedence we create a new node for it, make the current tree its left subtree, turn the new node into the tree root, and adjust the target precedence to that of the new operator. After that we keep going right as before.

static Node * parse_expression(Parser * parser, int min_precedence) {
    Node * left = parse_leaf(parser);

    while (true) {
        Node * node = parse_increasing_precedence(parser, left, min_precedence);

        if (node == left) break;
        left = node;
    };

    return left;
}

static Node * parse_increasing_precedence(Parser * parser, Node * left, int min_precedence) {
    Token next_token = get_next_token(parser);
    int precedence = PRECEDENCE_TABLE[next_token.kind];
    if (precedence <= min_precedence) {
        go_back_to_previous_token(parser);
        return left;
    }

    if (is_binary_operator(next_token)) {
        Node * right = parse_expression(parser, precedence);
        return make_binary_op_node(to_binary_op(next_token), left, right);
    } else {
        return left;
    }
}

With this in place, extending the expression language is easy. Say we want to add comparison operators like < and >. All we have to do is update the precedence table for those tokens and assign them the appropriate precedence score. For comparison operators it is handy to write expressions such as a + b > c * d + f and have them evaluate as (a + b) > ((c * d) + e): Both sides of the expression are evaluated before the comparison is done, without the need for explicit parentheses.

How do we do that? We give < and > lower precedence scores so that they are put higher in the tree and hence evaluated later.

static int PRECEDENCE_TABLE[TokenKind__COUNT] = {
    ...
    1,  // TokenKind_LessThan
    1,  // TokenKind_GreaterThan
    0,  // TokenKind_EOF
};

And that’s about it. We have an easy and concise way of extending the parser for new operators as needed.

After going through the three implementations I listed here, it feels like an intuitive thing to do. Maybe going through it step by step and visualising what happens at each step helped.

References