Writing a compiler in Ruby bottom up

I've just posted the second part of this series. See part 2

I've been sitting on this for a long time - the oldest entries in the series I'm about to start posting is stuff I started writing back in early 2005 before I even started version 1 of this blog.

I've always enjoyed compiler technology, and I've written several toy compilers. A while back I started writing a simple compiler in Ruby, and I wrote copious notes along the way exactly with the intention of posting it somewhere. Problem was I was that by the time I was finishing up the steps I have so far, my blog was languishing, and so it's been just gathering dust.

I have about 30 parts mostly written, all the way to a simple but working compiler. It'll take me a while to post them, as at least some of them need some cleanup, and I don't have mjuch time to write, but I might post a couple of parts a week, depending on my schedule.

Without further ado, here's the first entry:

Some background, and the first tiny bit of code

Usually, when writing an experimental compiler I've started top down. In other words, I've taken the traditional approach of designing a parser first, whether manually or using a parser generator. I've then started writing various tree transformation code to turn the AST (Abstract Syntax Tree) into symbol tables, doing error checking, annotating the tree with various information, and then finally code generation. Often after I switched to Linux I've "chickened out" and compiled to C, which is ok if your language match well with C semantics, which it often will given how low level C is.

My first compiler though, was written in M68000 assembler and output M68000 assembler.

I bootstrapped it by allowing inline assembler all the way through, and gradually adding structure:

First I'd add functions and function calls, and start making use of them to structure my compiler. Then I'd add basic arithmetics that worked on assembler registers. Then I'd add local variables etc., but parse the assembly so that the register allocator would stay clear of registers used in the assembler that was interspersed.

Gradually the compiler morphed from M68000 to my hybrid language, and less and less assembler.

I want to bootstrap a compiler again. Though I have no intention of writing it in assembler this time. But I will start at the bottom: With the code generator. And I'll have the following goals:

Simplicity over performance every time. The goal is to get to the point where the compiler can be rewritten to work in its target language - in other words I want it self hosted. Since the initial code will be thrown away, it's a waste of time to do more work than what's needed to make it work. Performance can come later.
Don't make decisions that will constrain performance. While performance in itself isn't a goal, I should no lightly start doing thing that will forever doom the language to mediocre performance, or that will take tremendous amount of work to fix later. Ruby, I'm looking at you (=> write about Ruby runtime dynamism)
Simplicity over convenience every time. Don't add features just for convenience - add features only when they support the bootstrap effort. That means they have to make life easier than the added complexity makes translation of the compiler into the target language harder.
Delay designing the language - design features. I want to make a capable and highly flexible code generator first, supporting an AST that allow encoding of a very diverse set of functionality, or make it easy to add a diverse set of functionality, though I'll have my eye on LISP/Scheme and Smalltalk in terms of expressiveness of the low level functionality (concepts, not specific details), and aim for something resembling Ruby in terms of syntax, meta programming capabilities and object model.
I don't want to learn x86 assembler, yet I'm going to target x86. It's a bit of an exaggeration. I'm proficient with M68K assembler and 6510 assembler, and I know enough bits and pieces of x86 assembler to be able to read it reasonably well. Yet I've never, ever, written a significant piece of code using it, and I don't intend to start. A lot of what I DO know comes from looking at assembler output from gcc. "gcc -S" is your friend. My knowledge of assembler in general, though, is enough to have a reasonable understanding of what's going on even when I don't know the specifics of the x86. I'll likely make stupid mistakes, but hopefully I'll learn from them (and anyone foolish enough to trust my initial thoughts on how the x86 works better get their head examined - I'm learning here)

Sounds like a lot of shortcut and still a lot of pain, doesn't it? But it'll be fun!

So lets start with the tiniest bit of useless code:


    #!/bin/env ruby
    
    class Compiler
    
      def compile(exp)
        # Taken from gcc -S output
        puts <<PROLOG
        .file   "bootstrap.rb"
        .text
    .globl main
        .type   main, @function
    main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
    PROLOG
    
        puts <<EPILOG
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
    EPILOG
    
        puts <<GLOBAL_EPILOG
        .size   main, .-main
    GLOBAL_EPILOG
      end
    end
    
    prog = [:puts,"Hello World"]
    
    Compiler.new.compile(prog)

It does nothing. Really. It's just the result of compiling this with gcc -S:


    int main()
    {
    }

... and breaking it into pieces.

It's a fully working compiler... Sort of.... Unfortunately it compiles any input into a noop, and so it's completely useless. But lets start with baby steps.

Consider this series of articles a "flow of consciousness". I do this for fun. I'm not about to sit down and make a nice clean design upfront. Nor am I going to be shy about changing my mind midstream and rip out code, or put stuff I previously ripped out back in again.

What you see here is my second pass: Each SVN commit I make is accompanied by notes for my posts, but the posts contain a lot of extra explanation. Sometimes that explanation makes me change my mind about things - but I won't make major changes unless I think it makes the explanations clearer or a point I made in the first pass is blatantly wrong. Even then I might leave it in, possibly with a note that I'll correct myself later. Sometimes I've contracted steps that took me a while to do but that are trivial to explain "after the fact".

Comments are very welcome, but keep in mind that while I may take your comments to heart, I have already written a lot of this series - I won't start rewriting large parts of it to take into account suggestions, but I will take notes and keep it in mind for whenever I catch up to what I've written already

Next time I'll make it actually do something with it's input, I promise.

Writing a compiler in Ruby bottom up - step 1/?? 2008-03-27

Some background, and the first tiny bit of code