Building a Programming Language Pt. 2 - Parsing

24 July 2019

In pt. 1 of this series we introduced a language called Smol and built a simple lexer to tokenize Smol code. In this post we'll build a parser to read those tokens and spit out an AST (Abstract Syntax Tree).

As with the previos post, all code is written in typescript and can be run and modified from this repl.it sandbox.

The Smol AST

An AST is an intermediate representation of code that can be easily translated into something a computer can understand. For example, given the following Smol code:

let x = 1;
let y = 7;

if y > x {
    print("Hello world!");
}

Our parser will output an AST that looks something like this:

[ { type: 'variable-declaration', name: 'x', varType: 'let' },
  { type: 'function-call',
    function: '=',
    args:
     [ { type: 'number', value: 1 },
       { type: 'identifier', name: 'x' } ] },
  { type: 'variable-declaration', name: 'y', varType: 'let' },
  { type: 'function-call',
    function: '=',
    args:
     [ { type: 'number', value: 7 },
       { type: 'identifier', name: 'y' } ] },
  { type: 'if',
    condition:
     { type: 'function-call',
       function: '>',
       args:
        [ { type: 'identifier', name: 'x' },
          { type: 'identifier', name: 'y' } ] },
    body:
     [ { type: 'function-call',
         function: 'print',
         args: [ { type: 'string', value: 'Hello world!' } ] } ] } ]

Lets move on to writing the code for our parser.

A Smol Parser

Lets start by defining an AST type for our parser to produce:

// Here we define AST as a recursive array that can hold instructions and other ASTs.
export interface AST extends Array<Instruction | AST> { }

// We define as a descriminated union of statements.
// So an intruction is simply one of a set of statements defined below identified by it's type
export type Instruction =
    SmolFunction |
    FunctionCall |
    SmolNumber |
    SmolBool |
    SmolString |
    Variabledeclaration |
    IfStatement |
    WhileStatement |
    SmolIdentifier |
    ReturnStatement;


// Below here we define the types of statements we support

export interface Statement {
    type: string;
}

// A function can have args and a body
export interface SmolFunction extends Statement {
    type: "function";
    args: string[];
    body: AST;
}

export interface FunctionCall extends Statement {
    type: "function-call";
    function: string;
    args: AST;
}

// Primitive instruction that returns a number
export interface SmolNumber extends Statement {
    type: "number";
    value: number;
}

// Primitive instruction that returns a string stored in memory
export interface SmolString extends Statement {
    type: "string";
    value: string;
}

// Primitive instruction that returns a bool stored in memory
export interface SmolBool extends Statement {
    type: "boolean";
    value: boolean;
}

export interface Variabledeclaration extends Statement {
    type: "variable-declaration";
    name: string;
    varType: "let" | "var";
}

export interface IfStatement extends Statement {
    type: "if";
    condition: AST | Instruction;
    body: AST | Instruction;
}

export interface ReturnStatement extends Statement {
    type: "return";
    exp: AST | Instruction;
}

export interface WhileStatement extends Statement {
    type: "while";
    condition: AST | Instruction;
    body: AST | Instruction;
}

export interface SmolIdentifier extends Statement {
    type: "identifier";
    name: string;
}

Now that we have definied our AST lets move onto parsing.

We'll begin with the statement parser. This portion of the parser breaks down a single statement into an Instruction. In Smol a statement is essentially anything terminated by a semicolon (and in some cases a comma, right parenthesis or right curly brace). An example would be let x = 3 * 2;.

One of the key features of our language is that it supports infix notation. This means that our language can understand arithmetic operations such as 2 * (4 + 3). In order to do this we'll use an algorithm called the shunting-yard algorithm.

The shunting-yard algorithm defines two collections, an output queue and an operator stack. We'll use this to form the basis of our statement parser. For now we'll assume statements are just simple aritmetic operations (2 * (4 + 3)) terminated by a semicolon.

const parseStatement = (tokens: Token[]): Instruction => {
    const output: AST = []; // Anything that is not an infix operator.
    const operator: Token[] = []; // Anything that IS an infix operator.

    // We consume tokens until we run out or we hit a terminator.
    while (tokens.length > 0) {
        const token = tokens[0];

        // The following tokens get added to the output stack.

        if (token.type === "number") {
            output.push({ type: "number", value: Number(token.value) });
            tokens.shift(); // Token is no longer needed.
            continue;
        }

        if (token.type === "string") {
            output.push({ type: "string", value: token.value });
            tokens.shift(); // Token is no longer needed.
            continue;
        }

        if (token.type === "boolean") {
            output.push({ type: "boolean", value: token.value === "true" });
            tokens.shift(); // Token is no longer needed.
            continue;
        }

        if (token.type === "identifier") {
            output.push({ type: "identifier", name: token.value });
            tokens.shift(); // Token is no longer needed.
            continue;
        }

        // Here we begin getting into the meat of the shunting-yard algorithm. We've hit an
        // operator token.
        if (token.type === "operator") {

            // We start by checking the existing operator stack.
            while (operator.length > 0) {
                const op = operator[operator.length - 1];

                // If the last operator we found has a greater or equal operator precidence as our token,
                // we remove it from the operator stack. Helper functions such as operatorPrecedence
                // will be defined later.
                if (operatorPrecedence(op.value) >= operatorPrecedence(token.value)) {

                    // Here we convert the last operator in our operator stack into a function call
                    // where the arguments are the last two items on the output stack.
                    output.push({
                        type: "function-call",
                        function: operator.pop()!.value,
                        args: [output.pop()!, output.pop()!]
                    });

                    // We continue looping as long as the condition of this if statement is satisfied.
                    continue;
                }

                // The current token must have a higher precidence than the last item in our
                // operator stack (or there is nothing in that stack). So we stop checking the
                // operator stack.
                break;
            }

            // Finally we add the token to our operator stack
            operator.push(tokens.shift()!);
            continue;
        }

        // A left paren means we have found a sub expression like (4 + 3). We'll recursively call
        // this function to parse it.
        if (token.type === "left-paren") {
            tokens.shift(); // Get rid of the left-paren token;
            output.push(parseStatement(tokens)); // Add the expression to our output stack
            tokens.shift(); // Get rid of the right paren
            continue;
        }

        // We've hit the end of a sub expression (or function call). Break from the loop.
        // Note that we do not consume the token with tokens.shift(). More on that later.
        if (token.type === "right-paren") {
            break;
        }

        // The statement is terminated with a semicolon. Consume the semicolon and break out of
        // our main token consuming while-loop
        if (token.type === "terminator") {
            tokens.shift();
            break;
        }

        throw new Error(`Unexpected token: ${token}`);
    }

    // This is the final bit of the shunting-yard algorithm. We consume any remaining operators
    // in the operator stack and apply the items on the output stack as the operators arguments.
    while (operator.length > 0) {
        output.push({
            type: "function-call",
            function: operator.pop()!.value,
            args: [output.pop()!, output.pop()!]
        });
    }

    // By now our output stack should only have a single instruction inside of it. Because output
    // can technically contain Instructions AND AST we have to tell the compiler that output[0] will
    // only ever be an Instruction. We do this with the `as Instruction` annotation.
    return output[0] as Instruction;
}

// Here is our operator precidence function. The higher the number the higher the precidence.
const operatorPrecedence = (operator: string): number => {
    const precidences: Record<string, number> = {
        "and": 0,
        "or": 0,
        "xor": 0,
        "==": 0,
        "<": 0,
        ">": 0,
        ">=": 0,
        "<=": 0,
        "<>": 0,
        "?": 0,
        "+": 1,
        "-": 1,
        "*": 2,
        "/": 2,
        "^": 3,
        ".": 4,
        "=": 5,
        "=>": 5
    }
    return precidences[operator];
}

We now have a simple statement parser capable of handling infix notation. That was probably the hardest part of the parser. Lets expand it to support assignment, if, return, function definition and function call statements.

// We've added a terminator parameter. This allows callers to tell the parser to end when
// it sees a specified terminator token. This will become helpful later.
const parseStatement = (tokens: Token[], terminator?: Token): Instruction => {
    const output: AST = [];
    const operator: Token[] = [];

    while (tokens.length > 0) {
        const token = tokens[0];

        //New: Stop parsing if the token matches a supplied terminator token.
        if (terminator && token.type === terminator.type && token.value === terminator.value) {
            tokens.shift(); // Remove said terminator token.

            // Stop consuming tokens and finish up parsing
            break;
        }

        if (token.type === "number") { /** unchanged */ }
        if (token.type === "string") { /** unchanged */ }
        if (token.type === "boolean") { /** unchanged */ }

        // New: Keyword handling.
        if (token.type === "keyword") {

            // We've found a variable declaration instruction.
            if (token.value === "let") {

                // The next token is the identifier, we grab it's value but don't consume it.
                const next = tokens[1];
                output.push({ type: "variable-declaration", name: next.value, varType: "let" });
                tokens.shift(); // Consume the let token.

                // Our statement parser will assume the declaration is a single instruction
                // and will break immediately.
                break;
            }

            // Same behavior as let
            if (token.value === "var") {
                const next = tokens[1];
                output.push({ type: "variable-declaration", name: next.value, varType: "var" });
                tokens.shift();
                break;
            }

            // We found an if statement!
            if (token.value === "if") {

                // Consume the if token.
                tokens.shift();

                // Parse the condition. We know the condition expression will be terminated
                // by a left curly (if condition {}) so we tell the parseStatment function to stop
                // when it sees one.
                const condition = parseStatement(tokens, { type: "left-curly", value: "{" });

                // The body of an if statement can have many instructions. So we call our main
                // parser to get a full AST. We'll define the main parser later.
                const body = parser(tokens);

                output.push({ type: "if", condition, body });

                // If statements are really an expression that return a value in our language so
                // we continue parsing the statement as an expression until we hit a ;.
                continue;
            }

            // A return statement!
            if (token.value === "return") {
                tokens.shift(); // Get rid of the return token.
                output.push({ type: "return", exp: parseStatement(tokens) });
                continue;
            }

            throw new Error(`Unkown keyword: ${token.value}`);
        }

        // New: We found an identifier! Lets figure what to do with it.
        if (token.type === "identifier") {

            // Preview the next token.
            const next = tokens[1];

            // If the next token is a left-paren, the identifier is being used as a function call.
            if (next && next.type === "left-paren") {
                tokens.shift(); // get rid of the identifier.
                tokens.shift(); // get rid of the left paren.
                output.push({
                    type: "function-call",
                    function: token.value,
                    args: parseFnCall(tokens), // We'll define this later.
                });

                // Continue consuming tokens.
                continue;
            }

            // It's just an identifier if we got here.
            output.push({ type: "identifier", name: token.value });
            tokens.shift(); // get rid of the identifier.
            continue;
        }

        if (token.type === "operator") { /** unchanged */ }

        // New: We've found a function definition.
        if (token.type === "pipe") {
            tokens.shift(); // get rid of the pipe token.
            const args = parseArgs(tokens); // defined later.
            tokens.shift(); // Get rid of left curly ('|| {' <- that one, we dont need it anymore).
            output.push({ type: "function", args, body: parser(tokens) });

            // That's all you need to do to parse a function. Wow! That was easy wasn't it?
            continue;
        }

        if (token.type === "left-paren") { /** unchanged */ }

        // New: We found a body. In Smol a body `{ <expressions> }` is an expression that returns a value,
        // much like rust. So we need to extract it's AST,
        if (token.type === "left-curly") {
            tokens.shift();
            output.push(parser(tokens));
            continue;
        }

        // New.
        if (token.type === "right-curly") {
            break;
        }

        if (token.type === "right-paren") { /** unchanged */ }

        // New.
        if (token.type === "comma") {
            break;
        }

        if (token.type === "terminator") { /** unchanged */ }

        throw new Error(`Unexpected token: ${token}`);
    }

    while (operator.length > 0) { /** unchanged */ }

    return output[0] as Instruction;
}

// Here is our function call parser. Each argument of the call is assumed to be an expression.
const parseFnCall = (tokens: Token[]): AST => {
    const ast: AST = [];

    let token = tokens[0];

    // Loop through the comma seperated list of expressions until we hit a right paren. This is
    // an example of why we some terminators are not conumed by the parseStatement function.
    while (token.type !== "right-paren") {
        if (token.type === "comma") {
            tokens.shift();
        }

        ast.push(parseStatement(tokens))
        token = tokens[0];
    }
    tokens.shift(); // Remove right paren;

    return ast;
}

// This function is used for finding arguments at the definition of a function
const parseArgs = (tokens: Token[]): string[] => {
    const args: string[] = [];

    let token = tokens.shift()!;

    // Loop through the comma seperated list of identifiers.
    // We've reached the end of our argument list when we hit the closing pipe.
    while (token.type !== "pipe") {
        if (token.type === "comma") {
            token = tokens.shift()!;
            continue;
        }

        args.push(token.value);
        token = tokens.shift()!;
    }

    return args;
}

And now, the main event. Our full parser:

export const parser = (tokens: Token[]): AST => {
    const ast: AST = [];

    // We essentially just call parseStatement repeatedly until there are no more tokens
    while (tokens.length > 0) {

        // If we hit a right curly, this function was probably called by the statement parser
        // to parse a body, function body, or an if statement body. So we can stop.
        if (tokens[0].type === "right-curly") {
            tokens.shift();
            break;
        }

        ast.push(parseStatement(tokens));
    }

    return ast;
}

Thats all there is to it. For now we've left out while loops and comment parsing. I may do another post in the future to support it, but it could be a fun exersice to figure out on your own as well. The parser also assumes the code its is syntactically correct. Otherwise it barfs unpleasently.

This is my first serious parser so if you have any suggestions on how I could make it better, don't hesitate to contact me. In our next post we'll make a basic Smol interpreter.

Drew Youngwerth's Blog

Code, Art and the Universe.

Building a Programming Language Pt. 2 - Parsing

24 July 2019

The Smol AST

A Smol Parser