I always wanted to create my own programming language.

It might be one of the programming projects with the highest floor-to-ceiling ratio: a simple interpreter can take up less than a hundred lines of code, but adding features and improving the speed quickly increases the complexity, until it reaches the heights of industrial strength compilers like gcc, with millions of lines of code.

At around 7k lines of code this project lies somewhere inbetween (at least on a logarithmic scale).

The language is a Lisp1 modeled after Scheme and runs on a simple bytecode-VM, so that it is possible to write non-trivial programs and have them run at a reasonable speed.

(defn fib (n)
      (if (<= n 1)
          n
          (+ (fib (- n 1))
             (fib (- n 2)))))

(println (fib 30))

Of course this is a pretty meaningless benchmark, but on my machine the recursive fibonacci fuction above takes 1.1s to execute in EulerLisp which gets makes it between 2x and 10x slower than a few other languages I tested:

  • Ruby, 0.11s
  • Python, 0.23s
  • Elixir, 0.48s
  • Chicken Scheme (interpreted), 0.68s
  • Chicken Scheme (compiled), 0.07s
  • C (gcc -O3), 0.08s

Motivation

As an incentive to make the language fast and usable, I’m using it to work through some Project Euler math problems.

Project Euler badge

Up till now, I’ve solved around 200 of the problems and written around 7k lines of EulerLisp code.

TODO: List of favorite problems

Data Types

The main data type is called Datum and has different variants:

  • Bool
  • Integer (64bit signed)
  • Bignum (multiple precision)
  • Rational (64bit numerator and denominator)
  • Float (64bit)
  • Char
  • String
  • Symbol (pointer into a symbol table)
  • Vector
  • Builtin (reference to a function implemented in rust)
  • Closure (reference to a function in the target language)
  • PriorityQueue (for performance reasons, should be implemented in lisp)
  • Undefined
  • Nil

VM

EulerLisp runs on a stack-based bytecode VM.

Structure

The VM has one register for storing values and one for storing function references.

There are three kinds of stacks (to avoid wrapping all values in the Datum type).

  1. Main value stack
  2. Environment reference stack
  3. Program counter stack

Arguments are passed to functions through a Vec<Datum> called a “frame”.

Instructions

Because the main usecase is solving math problems, a lot of the instructions are math related.

TODO: Describe instructions

Number Manipulation

  • INC, DEC
  • ADD, SUB, MUL, DIV, MOD, INTDIV

Cons-Cell Manipulation

Note: I’m using fst (first) and rst (rest) instead of the dated car and cdr. cadr can be written as frst.

  • CONS
  • FST,RST

Logic, Equality, Comparison, Predicates

  • NOT
  • EQUAL, EQ, NEQ
  • GT, GTE, LT, LTE
  • ISZERO, ISNIL

Vector Manipulation

  • VECTORREF
  • VECTORSET

Register / Stack Manipulation

  • PUSHVALUE
  • POPFUNCTION
  • CONSTANT
  • PUSHCONSTANT

Variable References

There are three types of variable references, global (in the outmost scope), shallow (in the current environment) or deep (in some parent of the current environment.

For each of these types, there is an instruction to load it into the main register, push it on the stack or set it to the value of the main register.

  • CHECKEDGLOBALREF
  • GLOBALREF
  • PUSHCHECKEDGLOBALREF
  • PUSHGLOBALREF
  • GLOBALSET
  • SHALLOWARGUMENTREF
  • PUSHSHALLOWARGUMENTREF
  • SHALLOWARGUMENTSET
  • DEEPARGUMENTREF
  • PUSHDEEPARGUMENTREF
  • DEEPARGUMENTSET

Jumps

Two types of jumps would suffice (conditional & unconditional), the other ones exists because I wanted to play around with peephole-optimization.

  • JUMP
  • JUMPFALSE
  • JUMPTRUE
  • JUMPNIL
  • JUMPNOTNIL
  • JUMPZERO
  • JUMPNOTZERO

Closures & Function Calls

Closures can accept either a fixed (fix closures) or a variable number of arguments (dotted closures).

Most functions take between zero and three arguments, to increase efficiency they have special call instructions.

  • FIXCLOSURE
  • DOTTEDCLOSURE
  • STOREARGUMENT
  • CONSARGUMENT
  • FUNCTIONINVOKE
  • PRESERVEENV
  • RESTOREENV
  • EXTENDENV
  • UNLINKENV
  • CALL1
  • CALL2
  • CALL3
  • CALLN

Function Calls & Frames

  • ALLOCATEFRAME
  • ALLOCATEDOTTEDFRAME
  • ALLOCATEFILLFRAME

Exit

  • RETURN
  • FINISH

Resources

If this inspired you to start building your own programming language, here are some resources that helped me:

Background

Algorithms & Data Structures

Implementation

Specs

TODO

Garbage Collection

Currently I’m using rusts reference counted pointers Rc and RefCell as a substitute for implementing my own garbage collection.

This works fine in most cases, but because Scheme allows creating circular lists it it possible to create objects that are not reachable by some root but have non-zero reference counts.

(defn fill-memory ()
      (let ((circular (cons 1 2)))
        (set-rst! circular circular))
      (fill-memory))

(fill-memory)

In the future it would be nice to write my own simple garbage collector.

Source code on Github: l3kn/rust_lisp