Writing a compiler in Ruby bottom up

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

My choice of Ruby as my implementation language was pretty arbitrary. The choice doesn't matter all that much at this stage, but I really like Ruby.

At some point, however, there will be a series of steps to bring the language and the compiler closer together. What I mean by that is that I want the compiler to become self-hosting: It should be capable of compiling itself.

That means that either the compiler will have to support at least a subset of Ruby, or there will be a translation step to rewrite parts of the compiler in the language the compiler will understand.

While that doesn't dictate the implementation language, it means it's always worthwhile at least ensuring that your implementation language is reasonably close to where you want to go, unless you're not interested in making the compiler self-hosted.

It is also worthwhile to seriously avoid using complex language constructs in your implementation if you want the compiler to become self-hosted. Keep in mind that you need to support every language construct the compiler will need, and so if you use lots of fancy functionality you'll either have to support equivalent functionality in the compiler, or run the risk of having to do massive changes to the structure when you're ready to take the jump. Not fun.

One distinct benefit of Ruby as an implementation language, however (and it shares this with certain other languages, such as Lisp), is that you can build literal tree structures very easily - in Ruby using Array or Hash and in Lisp using lists.

It means I can start out by "faking" the abstract syntax tree as arrays, and neatly skip over the step of writing a parser. Yay! The cost is ugly, but temporarily bearable, syntax.

Hello World

This is what Hello World will look like:


     [:puts,"Hello World"]

What we need to handle here is pretty trivial: We need to be able to push an argument onto the stack, and we need to be able to call a function.

So let's see how to do that in x86 assembler. I compiled this C program with "gcc -S":


    int main()
    {
       puts("Hello World");
    }

And looked at the output. Below I've reproduced the bits that are actually relevant, which I found by looking at the output from last time and seeing what had changed:


            .section        .rodata
    .LC0:
            .string "Hello World"
            .text
    [...]
            subl    $4, %esp
            movl    $.LC0, (%esp)
            call    puts
            addl    $4, %esp
    [...]

If you understand som asm, even if you've not worked with x86 before, this is quite easy to pick up on:

It defines a constant string
In the code it allocated 4 bytes on the stack by subtracting 4 from the stack pointer.
It then moves the address of the string onto the stack, into the 4 freshly allocated bytes.
We call "puts", conveniently provided by glibc (all of this series will assume gcc/gas + glibc - on Linux odds are you already have it installed)
Afterwards we clean up the stack by adding four bytes back on.

So how do we make use of this in our compiler? First of all we need a way to handle the string constants, so we add this to the Compiler class from last time. (I'm making the full Ruby code available here so you don't have to guess what to copy and paste in where)


      def initialize
        @string_constants = {}
        @seq = 0
      end
    
      def get_arg(a)
        # For now we assume strings only                                                                                                                                               
        seq = @string_constants[a]
        return seq if seq
        seq = @seq
        @seq += 1
        @string_constants[a] = seq
        return seq
      end

This code simply maps a string constant to an integer that we'll use to point to the labels. You'll not multiple string constants will map to the same label and get output once. It's not necessary to do it that way, but it's a trivial optimization to use a Hash instead of an Array to ensure uniqueness.

Here's the function I added to output the string constants:


      def output_constants
        puts "\t.section\t.rodata"
        @string_constants.each do |c,seq|
          puts ".LC#{seq}:"
          puts "\t.string \"#{c}\""
        end
      end

That just leaves compiling the call itself:


      def compile_exp(exp)
        call = exp[0].to_s
    
        args = exp[1..-1].collect {|a| get_arg(a)}
    
        puts "\tsubl\t$4,%esp"
    
        args.each do |a|
          puts "\tmovl\t$.LC#{a},(%esp)"
        end
    
        puts "\tcall\t#{call}"
        puts "\taddl\t$4, %esp"
      end

You'll probably notice an ugly inconsistency here: The code above pretends to be able to handle multiple arguments, but then it proceeds to subtract only 4 from the stack pointer, and it doesn't apply a displacement to the stack pointer, so multiple arguments will just overwrite each other.

We'll fix that in a moment. For our simple one argument Hello World the above will do.

Notice a few other things about this code:

It doesn't check if the function even exists - we'll let gcc/gas worry about that, though it means unhelpful error messages.
We can actually call any function we can link to, as long as it takes a single string argument in.
There's lots of ugly bits that really should be abstracted away, such as how I get hold of the address to call, all that inlining of assembly code that tightly ties the compiler to i386, etc. Bear with me, it will (slowly) get addressed.

It's time to try running the second step of the compiler. It should generate this output:


            .text
    .globl main
            .type   main, @function
    main:
            leal    4(%esp), %ecx
            andl    $-16, %esp
            pushl   -4(%ecx)
            pushl   %ebp
            movl    %esp, %ebp
            pushl   %ecx
            subl    $4,%esp
            movl    $.LC0,(%esp)
            call    puts
            addl    $4, %esp
            popl    %ecx
            popl    %ebp
            leal    -4(%ecx), %esp
            ret
            .size   main, .-main
            .section        .rodata
    .LC0:
            .string "Hello World"

Here's how to test it:


    [vidarh@dev compiler]$ ruby step2.rb >hello.s
    [vidarh@dev compiler]$ gcc -o hello hello.s
    [vidarh@dev compiler]$ ./hello 
    Hello World
    [vidarh@dev compiler]$

So, how do we handle more than one argument?

I'm not going to show C and the corresponding assembler again - it's easy enough to do a number of calls with different number of arguments and look at the result. Instead I'll skip straight to the changed compile_exp (complete source here)


      PTR_SIZE=4
    
      def compile_exp(exp)
        call = exp[0].to_s
    
        args = exp[1..-1].collect {|a| get_arg(a)}
    
        # gcc on i386 does 4 bytes regardless of arguments, and then                                                                                                                   
        # jumps up 16 at a time. We will blindly do the same.                                                                                                                          
        stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE)
        puts "\tsubl\t$#{stack_adjustment}, %esp"
        args.each_with_index do |a,i|
          puts "\tmovl\t$.LC#{a},#{i>0 ? i*PTR_SIZE : ""}(%esp)"
        end
    
        puts "\tcall\t#{call}"
        puts "\taddl\t$#{stack_adjustment}, %esp"
      end

So what's going on here? The changes aren't huge:

Instead of allocating a specific amount of stack (4 bytes in the previous version), we're adjusting the stack based on the number of arguments. I'll plainly admit to not knowing why gcc uses the specific adjustment it does - it really doesn't matter at this stage, though I'd guess it's for alignment of the stack. Optimize/clean up later, and don't change things when you're not sure why they've done it that way...
As you can see the arguments are then moved onto the stack one by one. We're still assuming they are all the same size of pointers (so 4 bytes on i386).
You can also see that the arguments to the function is moved onto the stack so that the leftmost argument is at the lowest address allocated. If you haven't done any assembly programming, draw boxes to visualize it if you can't visualize it in your mind, and keep in mind that stacks as here usually grow downwards in memory. When allocating space, we move the stack pointer down in memory, and here we then copy arguments onto the stack upwards in memory (using higher and higher indexes to %esp, just as if you're accessing an array).

That's enough to compile something like this:


    [:printf,"Hello %s\\n","World"]

What's next

It's a baby step forward, and I promise the next steps will get more and more useful, as the number of constructs that are required to compile useful programs is actually very small. I'll also make the following steps terser and handle "meatier" additions, as I've explained a lot of the motivations and so will add less and less explanation for the simpler bits of functionality as we go on.

Next I'll extend the compiler to handle multiple arguments, followed by chaining statements, sub expressions, handling return values from sub expressions and so on.

Within perhaps a dozen or so steps of similar complexity to this, we'll have function definition, argument passing, conditionals, a runtime library, and even a stunted "lambda" operator to give us anonymous functions (proper closures comes a lot later)

Within a few more steps we'll have enough to compile a simple text mangling program to give the input somewhat nicer syntax than Ruby Array's and symbols (only somewhat, though, a proper parser is not on the table until much later)

Writing a compiler in Ruby bottom up - step 2/?? 2008-04-05

Hello World

So, how do we handle more than one argument?

What's next