Back to the Lab (again)

No One Can Stop You From Making a Toy VM

I've spent the last 48 hours plugging away at my latest creation: Brian's Virtual Machine (bvm).

The bvm code and docs can be found here.

There was no real goal here, except to practice a bit for an upcoming technical interview. I remembered vaguely how to do this sort of thing from college, but it wasn't until I sat down to try it again that I realized something:

Making a toy VM is really fucking easy, and you can do it too. I'll walk you through all the required pieces here, and link you to my implementation so you can see how it looks in practice. Mine is in rust, but you could use any language really. Let's do it!

The Ops

To keep things simple you can just use a stack-based VM; don't worry about heap or registers or any of that. If you can compute fibonacci numbers with just a stack I'm fairly certain you can do basically anything. The stack can just be a list of integers (but it could also be a list of typed enums! The world is your oyster!)

The basic idea is that your VM is going to run through a list of operations (the program) and execute them sequentially. Each operation will perform some action at the top of the stack, for example adding the top two stack items as integers and replacing the top of the stack with the result. Some operations will also skip to a different operation in the program, which allows you to have things like control flow and subroutines.

The first bit of code you're going to want to write is an enumeration of all your potential ops. These will change quite a bit as you develop and realize you don't quite know wtf you're doing, so keep it loose in the beginning. A good starting set of ops would include:

There will also be control flow operations, which I'll talk about in the next section.

You can use bvm's Op enum as a reference.

...but remember, you also don't have to! There's a lot of creativity that can go into here. You could make your stack be a stack of colors, and each operation be some transformation made to the top colors of the stack. This is the fun part, go nuts, they can't stop you.

A Note on Control Flow

Normally when the VM executes an operation it then procedes to the next operation in the list. Control flow refers to operations which cause the VM to not do this, and to instead "jump" to a different operation in the sequence and keep going from there. In bvm there are two types of control flow: JUMP and CALL.

JUMP operations simply jump to a different point in the operation list. This point might be specified absolutely ("jump to operation at index 5") or relatively ("jump to the operation 3 ahead of this one"). There are also conditional parameters which can be added to the JUMP operations, which, when combined with the comparison operation, allow for if-else branches.

The CALL operation has a sibling: the RETURN operation. These together are used to implement sub-routines (aka functions). The CALL operation, generally, pushes the operation pointer (aka the operation's index in the list of operations which comprise the program, also called the instruction pointer or program counter) onto the stack before jumping to some other operation. The RETURN operation does the opposite: it takes an operation pointer off the stack and jumps to that pointer. The idea is that a program uses CALL to jump into the code for a sub-routine, and the sub-routine's code will then use RETURN to jump back to where the CALL happened and continue from there.

The test_incr_fn test cast in bvm demonstrates JUMP, CALL, and RETURN.

The details of call and return, in terms of how they enable passing arguments into a sub-routine and return values back, can be tricky, and there's lots of ways to go about it. You could copy what I've done with bvm, or look up the documentation on your favorite CPU architecture, to see what they do. You could also just skip call/return for an MVP, you can get pretty far with just jump calls, it's just more tedious.

Execution Loop

With all (or enough) of your operations defined, it's time to execute them. Your program will just be a list of operations, and executing just means going through the list, looking at each, and doing what it says.

The next_inner method of bvm's VM type implements a single step of the execution loop.

I recommend structuring your code so that the VM only performs a single operation at a time, and leaving the "loop" to the code outside the VM. This allows you to have more control over how the VM runs, for example limiting how many cycles the program can run for (useful for tests) and giving an opportunity for inputting values into the program and pulling values out (if you have operations for that sort of thing).

Regardless of how you structure your loop. your VM will need to keep track of the current operation pointer. This pointer will tell it which index in the operation list it is going to execute next, and control-flow operations will be able to modify this pointer when they are executed.

When the VM passes beyond the final operation in the list it can consider the program ended. Alternatively you could error if the VM attempts to go outside the operation list, and instead introduce an explicit "halt" operation telling the VM to stop execution. bvm does the latter, as it helps ensure that a program intends to stop and that it didn't just run out of operations as a bug.

Bytecode, Or Not

When we think of VMs we tend to think of a program which runs "bytecode", which are files filled with bytes where each byte corresponds to an operation, optionally followed potentially by more bytes representing arguments to those operations. When you look in the bytecode reference for VMs you will often see very esoteric looking descriptions filled with hex numbers and bits which may or may not be set.

Here's the thing, you're not making a real VM. You're making your own. And the only thing that actually matters is that the VM is able to receive a list of the operation enumeration type you defined earlier. The purpose of the bytecode is to act as a serialization format, allowing you to store programs on disk for later. But you know what? You don't have to invent a serialization format. There's a thousand which already exist, just pick one. Msgpack, protobufs, JSON... hell you can use XML if you really want. This is a toy, have fun.

Anyway, you will want some way to store your list of operations to disk, and to load it back into memory later. But where does that list of operations come from in the first place? It's parsed from your assembly file, of course!

You Get To Design an Assembly Language Too!

Assembly is just a plaintext format describing a list of your operations, with some niceities built on top. Does your VM support an addition operation which takes a single argument? Well your assembly language is going to parse a line reading "ADD 5" into that operation.

An example of the assembly I came up with for bvm looks like this:

; A program which reads numbers in and outputs their corresponding fibonacci
; number.
;
; Everything after a ; is a comment.
MAIN:
    ; Read an i64, halt if none are remaining
    READ 8
    JUMP 2 IF!=0
    HALT
    POP

    PUSH 0
    PUSH 1
    CALL FIB 3
    WRITE
    JUMPTO MAIN

; FIB is a sub-routine which expects on the stack (bottom-to-top) the following
; numbers:
; * N - which fibonacci number to return
; * 0 (A)
; * 1 (B)
FIB:
    ; If N is 0 then we return A
    PUSH 0
    CMP 3
    JUMP 6 IF!=0
    POP
    POP
    SWAP 1
    POP
    RETURN 1

    ; Swap A and B and add them. The stack will be:
    ; N
    ; B
    ; A+B
    POP    ; Pop value which was used for CMP
    SWAP 1
    ADD 1

    ; Subtract 1 from N
    PUSH -1
    ADD 3
    SWAP 3
    POP

    ; JUMP back to FIB (tail recursion!). This is like calling:
    ; FIB(N-1, B, A+B)
    JUMPTO FIB

Assembly is much easier to parse than a higher-level language. You simply read a file, line by line, trim off whitespace, trim off everything after your comment character if there is one, trim whitespace again, then split what's left on whitespace.

The first string in the resulting group is the operation name, the rest (if any) are arguments. These get parsed directly into your operation enumeration. Parsing of arguments is up to you, and (as always) you're allowed to have fun. See my very cheeky `IF!=` arguments to JUMP. So daring!

Labels

The one hangup with parsing assembly are labels. In the program above the labels are `MAIN` and `FIB`. These name an index in the program which can be used in JUMP/CALL operations in place of the absolute numeric index. This is nice because as you work on a program the absolute index of things changes frequently, so giving a constant name to a specific spot is immensely useful.

The problem with labels is that they can be used in a program prior to actually being defined. This happens to `FIB` in the above. This means that as you're parsing your assembly file line-by-line you can't just stick label->index mappings in a lookup table and expect that whenever you see a label being used it will be in that table.

Instead you have to first go through all the lines, collect all the label->index mappings, then go through the lines again and do the parsing. Careful if you go down this route, as you need to properly track which index a label is pointing to in the face of empty lines, and also accounting for the fact that the label itself doesn't count for an operation.

Alternatively you can parse lines and fill the lookup map on the first pass, but leave labels unresolved in the parsed control flow operations, then do a second pass over your list of parsed operations and resolve the labels. In bvm I opted for this option, but rust being extremely typed made the result a bit ugly. It's possible having done the former would have resulted in prettier code.

A Nice CLI

You're now able to take an assembly file, compile it to a list of operations, store those operations to a file, load them back from that file, and pass the operations into an execution loop. All the pieces are ready, it's time for a pretty CLI!

Your CLI only needs to do two things: compile assembly files to "bytecode" files (or whatever serialization format you landed on), and execute those "bytecode" files. You could even do it old school and have a different CLI for each task, a la javac and java.

Here's a full end-to-end example of bvm's assemble and execute modes:

# Compiles fib_tail.bass into fib_tail.bop. Despite all the shit I was talking
# earlier, I did implement my own bytecode format too (Brian's OPcodes).
bvm assemble fib_tail.bass

# Executes the fib.bop program, passing it integers sequentially and printing
# its output to stdout.
bvm exec fib_tail.bop -i8 -i9 -i91
# 21
# 34
# 4660046610375530309

You Did It! Now Go Write a Blog Post

If you write a toy VM and don't write about it on your blog, did you even waste 2 days doing it?


Hi! I'm available for remote contract work. You can learn more about me and my skillset by browsing around this site, then head over to my resume site to find my work history and professional contact form.


This site is a mirror of my gemini capsule. The equivalent gemini page can be found here, and you can learn more about gemini at my 🚀 What is Gemini? page.