Writing a Chip 8 Assembler Part 1

This is the first post in a series about writing a Chip 8 assembler. I wrote a Chip 8 interpreter a while ago, and I thought it'd be interesting to develop a set of tools for Chip 8 development. The next step in the toolkit would be an assembler. I've also been interested in writing my own assembler for a while, which made this a good candidate for my next project.

Chip 8 assemblers in the wild use a few different syntaxes. For this one, I'll be using the syntax from Cowgod's guide. The other popular one is the Octo syntax. Octo is much higher level than Cowgod's, which makes it easier to use but harder to implement.

I'd like to implement a more traditional assembler (at least initially), so I'm going to use Cowgod's syntax. I may get around to implementing something similar to Octo later. A traditional assembler is simpler to implement; the grammar is much simpler than that of a high-level language (as seen below), and I'd just like to get something off the ground.

This will be written in Rust; I find Rust well suited for these kinds of projects thanks to its strong type system and functional flavor. It's basically an ML in disguise, and those kinds of languages are really well suited for writing compilers.

Overview of the assembler

The first step in creating a compiler is to define a grammar, then build a lexer and a parser for that grammar. The lexer transforms the incoming text into tokens. We use tokens instead of raw strings because they're more compact and easier to work with; they abstract away the text and attach meaning to it. Once we have the tokens, they're sent to the parser, which validates them and builds a syntax tree. The tree is then traversed and lowered into machine code.
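
For example, a line like LD V1, 2 comes out of the lexer as a short stream of tokens, roughly (using the token names we'll define later in this post):

LD V1, 2   ->   Ld, V1, Comma, Int(2)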

I'm creating an assembler here, but since I'm a compiler engineer (and I treat all problems like these with the same hammer), I'll be using a lexer and parser generator to do the heavy lifting for me. No use reinventing the wheel when I don't have to!

Here's some sample assembly:

main:
LD V1, 2
LD V2, 4
OR V1, V2
CALL foo
JP main

; This is function foo
foo:
LD V0, 5
RET

This just loads some registers, calls foo and then jumps to the start of main to repeat the whole thing again. Nothing useful, but it gives an idea of what the assembly will look like.

Grammar

Here's an extended Backus-Naur form (EBNF) grammar for the above snippet:

labels: label*

label: identifier ":" (instructions | data)

instructions: instruction*

instruction: opcode op "," op
            | opcode op
            | opcode
            | jmp_opcode identifier

data: number

number: binary | octal | decimal | hex

opcode: "LD" | "OR" | "ADD" ...

jmp_opcode: "CALL" | "JP"

op: register | number

register: "V0" | "V1" | "V2" | "V3" | "V4" ... | "VF"

identifier: alpha-numeric-string

Note: I find EBNF easier to read and use because it supports syntax like * for repetition, | for alternation, and parentheses for grouping.

A brief primer on grammars: grammars are composed of terminals and non-terminals. Terminals are the symbols in a grammar that cannot be replaced by other symbols, hence the name. The parser "terminates" on them. Examples above (not exhaustive): the quoted literals like "LD", "V0", and ":", plus the lexer-level tokens identifier, binary, octal, decimal, and hex.

Non-terminals are symbols that can be replaced by other symbols using the production rules of the language. A production rule specifies which symbols can be replaced by which other symbols. E.g. op: register | number is a production rule saying the non-terminal op can be replaced by either register or number; number can in turn be replaced by binary, octal, decimal, or hex.

These rules can also be combined recursively to generate more complicated expressions. This grammar doesn't have too much of that. However, by combining these rules together, we can create Chip 8 assembly programs that our assembler will accept and assemble.
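
As a quick worked example, here's how the line LD V1, 2 from the snippet above is derived from the grammar, expanding one rule at a time:

instruction
  -> opcode op "," op
  -> "LD" op "," op
  -> "LD" register "," op
  -> "LD" "V1" "," op
  -> "LD" "V1" "," number
  -> "LD" "V1" "," decimal    (the decimal literal 2)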

A grammar might be overkill for something like this, but since I'll be using LALRPOP to generate the parser, I'll need a grammar anyways. I might expand this grammar in a later post if I need to support additional constructs.

Lexer

For the lexer, I'll be using logos. This library lets us create an enum that contains all of our tokens, along with the pattern to match for each of them. I could write my own lexer from scratch, but it's much faster, clearer, and easier to use a lexer generator. I've written one by hand in the past; it's a good exercise for learning how lexers work, but it won't be covered here since it just takes too much time. Some projects do use a hand-rolled lexer though; Rhai is one example.

Create a new Rust project using cargo new --bin chip8asm. Then create src/lib.rs and src/lexer.rs. Run cargo add logos to add Logos to the project.
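
After those commands, the project should look roughly like this (the Logos entry in Cargo.toml will depend on whatever version cargo add pulls in):

chip8asm/
├── Cargo.toml
└── src/
    ├── lexer.rs
    ├── lib.rs
    └── main.rs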

In lib.rs, we'll need to reference our lexer:

pub mod lexer;

We'll need to create an enum to hold all of our tokens in lexer.rs:

use logos::Logos;

#[derive(Logos, Debug, Clone, PartialEq)]
pub enum Token {
    #[token(",", priority = 500)]
    Comma,

    #[token("V0", priority = 501)]
    V0,

    #[token("V1", priority = 502)]
    V1,

    #[token("V2", priority = 503)]
    V2,

    #[token("V3", priority = 504)]
    V3,

    // ...
}

The #[derive(Logos)] transforms this enum into something Logos can use. Each member of Token gets a #[token] attribute with the exact string to match and a priority; Logos uses the priority to decide which rule wins when more than one could match. I like to keep my tokens in alphabetical order, and if I insert new tokens in the middle, the priority numbers need to be updated. That's a lot of work, and since I'm lazy, I use a Python script to generate all the enum members for me:

# generate_tokens.py

tokens = {
    ":" : "Colon",
    "," : "Comma",
    "." : "Dot",
    "[" : "LBracket",
    "]" : "RBracket",

    "V0" : "V0",
    "V1" : "V1",
    "V2" : "V2",
    "V3" : "V3",
    "V4" : "V4",
    "V5" : "V5",
    "V6" : "V6",
    "V7" : "V7",
    "V8" : "V8",
    "V9" : "V9",
    "V10" : "V10",
    "V11" : "V11",
    "V12" : "V12",
    "V13" : "V13",
    "V14" : "V14",
    "V15" : "V15",
    "I" : "I",

    "SYS" : "Sys",
    "CALL" : "Call",
    "RET" : "Ret",
    "JP" : "Jmp",
    "SE" : "Se",
    "SNE" : "Sne",
    "LD" : "Ld",
    "ADD" : "Add",
    "OR" : "Or",
    "AND" : "And",
    "XOR" : "Xor",
    "SUB" : "Sub",
    "SHR" : "Shr",
    "SUBN" : "Subn",
    "SHL" : "Shl",
    "SNE" : "Sne",
    "RND" : "Rnd",
    "DRW" : "Drw",
    "SKP" : "Skp",
    "SKNP" : "Sknp",
}

priority = 500

print("Lexer tokens:\n")

for val in tokens:
    print('#[token("' + val + '", priority = ' + str(priority) + ")]")
    print(tokens[val] + ",\n")
    priority += 1

The key is the opcode string, and the value is the name of the enum member. Running the script and pasting its output into our Token enum, we get:

#[derive(Logos, Debug, Clone, PartialEq)]
pub enum Token {
    #[token(":", priority = 500)]
    Colon,

    #[token(",", priority = 501)]
    Comma,

    #[token(".", priority = 502)]
    Dot,

    #[token("[", priority = 503)]
    LBracket,

    #[token("]", priority = 504)]
    RBracket,

    #[token("V0", priority = 505)]
    V0,

    #[token("V1", priority = 506)]
    V1,

    #[token("V2", priority = 507)]
    V2,

    #[token("V3", priority = 508)]
    V3,

    #[token("V4", priority = 509)]
    V4,

    #[token("V5", priority = 510)]
    V5,

    #[token("V6", priority = 511)]
    V6,

    #[token("V7", priority = 512)]
    V7,

    #[token("V8", priority = 513)]
    V8,

    #[token("V9", priority = 514)]
    V9,

    #[token("V10", priority = 515)]
    V10,

    #[token("V11", priority = 516)]
    V11,

    #[token("V12", priority = 517)]
    V12,

    #[token("V13", priority = 518)]
    V13,

    #[token("V14", priority = 519)]
    V14,

    #[token("V15", priority = 520)]
    V15,

    #[token("I", priority = 521)]
    I,

    #[token("SYS", priority = 522)]
    Sys,

    #[token("CALL", priority = 523)]
    Call,

    #[token("RET", priority = 524)]
    Ret,

    #[token("JP", priority = 525)]
    Jmp,

    #[token("SE", priority = 526)]
    Se,

    #[token("SNE", priority = 527)]
    Sne,

    #[token("LD", priority = 528)]
    Ld,

    #[token("ADD", priority = 529)]
    Add,

    #[token("OR", priority = 530)]
    Or,

    #[token("AND", priority = 531)]
    And,

    #[token("XOR", priority = 532)]
    Xor,

    #[token("SUB", priority = 533)]
    Sub,

    #[token("SHR", priority = 534)]
    Shr,

    #[token("SUBN", priority = 535)]
    Subn,

    #[token("SHL", priority = 536)]
    Shl,

    #[token("RND", priority = 537)]
    Rnd,

    #[token("DRW", priority = 538)]
    Drw,

    #[token("SKP", priority = 539)]
    Skp,

    #[token("SKNP", priority = 540)]
    Sknp,
}

With that, we've got a lexer for all of our opcodes! Using a lexer generator saved a ton of time, and even better, we can generate most of it with a script (#automation). The only things left are numbers, identifiers, and lexer errors.

To add basic error handling, we need to create an error enum, and tell Logos to use it (we really should be adding more info to our errors but that can be done later):

use std::fmt;

#[derive(Default, Debug, Clone, PartialEq)]
pub enum LexingError {
    ParseNumberError,
    #[default]
    InvalidToken,
}

impl fmt::Display for LexingError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}

impl From<std::num::ParseIntError> for LexingError {
    fn from(_: std::num::ParseIntError) -> Self {
        LexingError::ParseNumberError
    }
}

impl From<std::num::ParseFloatError> for LexingError {
    fn from(_: std::num::ParseFloatError) -> Self {
        LexingError::ParseNumberError
    }
}

#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(error = LexingError)]
pub enum Token {
    // ...
}
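
As a quick sanity check (a minimal sketch that assumes the definitions above), an input that matches none of our tokens should come back as the default InvalidToken error:

#[test]
fn unknown_char_is_an_error() {
    // "!" doesn't match any of our tokens, so Logos falls back to the
    // #[default] variant of LexingError.
    let mut lex = Token::lexer("!");
    assert_eq!(lex.next(), Some(Err(LexingError::InvalidToken)));
}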

For numbers, we can use this nasty regex and a number parsing function that will also use our LexingError:

    #[regex(r"(-?([0-9][_0-9]*))|(0[xX]([0-9a-fA-F][_0-9a-fA-F]*))|(0[oO]([0-7][_0-7]*))|(0[bB]([0-1][_0-1]*))", priority = 10, callback = parse_num)]
    Int(i64),

    // ...

fn parse_num(lex: &mut logos::Lexer<Token>) -> Result<i16, LexingError> {
    // Strip the underscore separators before parsing.
    let slice = lex.slice().replace('_', "");

    if let Some(ch) = slice.chars().next() {
        if ch == '0' && slice.len() > 1 {
            // A leading zero means we expect a 0x/0o/0b radix prefix;
            // anything else (e.g. "0123") is rejected.
            return match slice.chars().nth(1) {
                Some('x') | Some('X') => Ok(i16::from_str_radix(&slice[2..], 16)?),
                Some('o') | Some('O') => Ok(i16::from_str_radix(&slice[2..], 8)?),
                Some('b') | Some('B') => Ok(i16::from_str_radix(&slice[2..], 2)?),
                _ => Err(LexingError::ParseNumberError),
            };
        }
    }

    // No prefix: parse as a plain (possibly negative) decimal.
    Ok(slice.parse()?)
}

This checks for a 0x, 0o, or 0b prefix and parses hex, octal, or binary accordingly, falling back to plain decimal otherwise.
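
To see the number handling in action, here's a hypothetical throwaway src/main.rs (not part of the assembler itself); it assumes the crate is named chip8asm as created above and lexes a few literals one at a time:

use chip8asm::lexer::Token;
use logos::Logos;

fn main() {
    // Each string below is a single token, so one call to next() is enough.
    for src in ["42", "-7", "0x1F", "0o17", "0b1010"] {
        let mut lex = Token::lexer(src);
        println!("{:>6} -> {:?}", src, lex.next());
    }
}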

We can clean this up a bit though. Logos allows us to add subpatterns for regexes. The regex for parsing binary, octal, hex and decimal is pretty big and unreadable, so we can use subpatterns to simplify them. Above Token, we can add the following and change the regex to:

#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(error = LexingError)]
#[logos(subpattern decimal = r"[0-9][_0-9]*")]
#[logos(subpattern hex = r"[0-9a-fA-F][_0-9a-fA-F]*")]
#[logos(subpattern octal = r"[0-7][_0-7]*")]
#[logos(subpattern binary = r"[0-1][_0-1]*")]
pub enum Token {
    #[regex(r"(-?(?&decimal))|(0[xX](?&hex))|(0[oO](?&octal))|(0[bB](?&binary))", priority = 10, callback = parse_num)]
    Int(i16),
}

Logos makes identifiers pretty easy too:

    #[regex("_?[a-zA-Z_][a-zA-Z0-9_]*", priority = 6, callback = |lex| lex.slice().to_string())]
    Identifier(String),

And lastly, we need a way to print it, so for now we can add this:

impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}

Testing

Now that we have our lexer, we need to write some tests to make sure it works as intended, and that future changes don't regress it. The tests for the lexer are going to be largely the same: we pass an input string and check that the output is the token we expect. For these kinds of tests, I like to create a macro and then modify the Python script above to generate all the invocations of it.

At the bottom of the file, add the following:

#[cfg(test)]
mod tests {
    use super::*;

    // Generates tests for checking individual tokens
    macro_rules! gen_test {
        ($name: ident, $input: expr, $expected_res: expr) => {
            #[test]
            fn $name() {
                let mut lex = Token::lexer($input);

                assert_eq!(lex.next(), Some(Ok($expected_res)));
                assert_eq!(lex.slice(), $input);
                assert_eq!(lex.next(), None);
            }
        };
    }

    // Generates tests for inputs that should produce a lexing error
    macro_rules! gen_test_err {
        ($name: ident, $input: expr, $expected_res: expr) => {
            #[test]
            fn $name() {
                let mut lex = Token::lexer($input);

                assert_eq!(lex.next(), Some(Err($expected_res)));
                assert_eq!(lex.slice(), $input);
                assert_eq!(lex.next(), None);
            }
        };
    }
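
For reference, an invocation like gen_test!(test_comma, ",", Token::Comma) expands into a test along these lines: it checks that the input lexes to the expected token, that the whole input was consumed as that token's slice, and that nothing follows.

#[test]
fn test_comma() {
    let mut lex = Token::lexer(",");

    assert_eq!(lex.next(), Some(Ok(Token::Comma)));
    assert_eq!(lex.slice(), ",");
    assert_eq!(lex.next(), None);
}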

Then in generate_tokens.py, add this at the bottom:

print("\nTests:")

for val in tokens:
    print('gen_test!(test_' + tokens[val].lower() + ', "' + val + '", Token::' + tokens[val] + ');')

We get the following test suite:

    gen_test!(test_colon, ":", Token::Colon);
    gen_test!(test_comma, ",", Token::Comma);
    gen_test!(test_dot, ".", Token::Dot);
    gen_test!(test_lbracket, "[", Token::LBracket);
    gen_test!(test_rbracket, "]", Token::RBracket);
    gen_test!(test_v0, "V0", Token::V0);
    gen_test!(test_v1, "V1", Token::V1);
    gen_test!(test_v2, "V2", Token::V2);
    gen_test!(test_v3, "V3", Token::V3);
    gen_test!(test_v4, "V4", Token::V4);
    gen_test!(test_v5, "V5", Token::V5);
    gen_test!(test_v6, "V6", Token::V6);
    gen_test!(test_v7, "V7", Token::V7);
    gen_test!(test_v8, "V8", Token::V8);
    gen_test!(test_v9, "V9", Token::V9);
    gen_test!(test_v10, "V10", Token::V10);
    gen_test!(test_v11, "V11", Token::V11);
    gen_test!(test_v12, "V12", Token::V12);
    gen_test!(test_v13, "V13", Token::V13);
    gen_test!(test_v14, "V14", Token::V14);
    gen_test!(test_v15, "V15", Token::V15);
    gen_test!(test_i, "I", Token::I);
    gen_test!(test_sys, "SYS", Token::Sys);
    gen_test!(test_call, "CALL", Token::Call);
    gen_test!(test_ret, "RET", Token::Ret);
    gen_test!(test_jmp, "JP", Token::Jmp);
    gen_test!(test_se, "SE", Token::Se);
    gen_test!(test_sne, "SNE", Token::Sne);
    gen_test!(test_ld, "LD", Token::Ld);
    gen_test!(test_add, "ADD", Token::Add);
    gen_test!(test_or, "OR", Token::Or);
    gen_test!(test_and, "AND", Token::And);
    gen_test!(test_xor, "XOR", Token::Xor);
    gen_test!(test_sub, "SUB", Token::Sub);
    gen_test!(test_shr, "SHR", Token::Shr);
    gen_test!(test_subn, "SUBN", Token::Subn);
    gen_test!(test_shl, "SHL", Token::Shl);
    gen_test!(test_rnd, "RND", Token::Rnd);
    gen_test!(test_drw, "DRW", Token::Drw);
    gen_test!(test_skp, "SKP", Token::Skp);
    gen_test!(test_sknp, "SKNP", Token::Sknp);

We still need to test the identifiers and numbers though:

    gen_test!(test_identifier_abc, "abc", Token::Identifier("abc".to_string()));
    gen_test!(
        test_identifier_und_abc,
        "_abc",
        Token::Identifier("_abc".to_string())
    );
    gen_test!(
        test_identifier_und_und_abc,
        "__abc",
        Token::Identifier("__abc".to_string())
    );
    gen_test!(
        test_identifier_und_und_und_abc,
        "___abc",
        Token::Identifier("___abc".to_string())
    );
    gen_test!(
        test_identifier_und_abc1,
        "_abc1",
        Token::Identifier("_abc1".to_string())
    );
    gen_test!(
        test_identifier_und_abc1abc,
        "_abc1abc",
        Token::Identifier("_abc1abc".to_string())
    );
    gen_test!(
        test_identifier_und_abc_und_1abc,
        "_abc_1abc",
        Token::Identifier("_abc_1abc".to_string())
    );
    
    gen_test!(test_int_123, "123", Token::Int(123));
    gen_test!(test_int_negative_123, "-123", Token::Int(-123));
    gen_test!(test_int_1_2_3, "1_2_3", Token::Int(1_2_3));
    gen_test!(test_int_negative_1_2_3, "-1_2_3", Token::Int(-1_2_3));
    gen_test!(test_int_12_3, "12_3", Token::Int(12_3));

    gen_test!(test_oct_0o123, "0o123", Token::Int(0o123));
    gen_test!(test_oct_0capo123, "0O123", Token::Int(0o123));
    gen_test!(test_oct_0o1_2_3, "0o1_2_3", Token::Int(0o1_2_3));

    gen_test!(test_binary_0b101, "0b101", Token::Int(0b101));
    gen_test!(test_binary_0capb101, "0B101", Token::Int(0b101));
    gen_test!(test_binary_0b1_0_1, "0b1_0_1", Token::Int(0b1_0_1));

    gen_test!(test_hex_0x123f, "0x123f", Token::Int(0x123f));
    gen_test!(test_hex_0capx123, "0X123f", Token::Int(0x123f));
    gen_test!(test_hex_0x12_3_f, "0x12_3_f", Token::Int(0x12_3_f));

We also need to check that our number parsing rejects values too large to fit:

    gen_test_err!(test_int_12345678_err, "12345678", LexingError::ParseNumberError);
    gen_test_err!(test_oct_0o1232345_err, "0o1232345", LexingError::ParseNumberError);
    gen_test_err!(test_hex_0x123ffffff_err, "0x123ffffff", LexingError::ParseNumberError);
}

In the next part, we'll implement the grammar using LALRPOP!