Vidar Hokstad V2.0

Home Blog

2008-03-27 16:27 UTC Writing a compiler in Ruby bottom up - step 2/??

This is the second in a series. Please read part 1 first if you haven't already to get the background. Part 3 is here

My choice of Ruby as my implementation language was pretty arbitrary. The choice doesn't matter all that much at this stage, but I really like Ruby.

At some point, however, there will be a series of steps to bring the language and the compiler closer together. What I mean by that is that I want the compiler to become self-hosting: It should be capable of compiling itself.

That means that either the compiler will have to support at least a subset of Ruby, or there will be a translation step to rewrite parts of the compiler in the language the compiler will understand.

While that doesn't dictate the implementation language, it means it's always worthwhile at least ensuring that your implementation language is reasonably close to where you want to go, unless you're not interested in making the compiler self-hosted.

It is also worthwhile to seriously avoid using complex language constructs in your implementation if you want the compiler to become self-hosted. Keep in mind that you need to support every language construct the compiler will need, and so if you use lots of fancy functionality you'll either have to support equivalent functionality in the compiler, or run the risk of having to do massive changes to the structure when you're ready to take the jump. Not fun.

One distinct benefit of Ruby as an implementation language, however (and it shares this with certain other languages, such as Lisp), is that you can build literal tree structures very easily - in Ruby using Array or Hash and in Lisp using lists.

It means I can start out by "faking" the abstract syntax tree as arrays, and neatly skip over the step of writing a parser. Yay! The cost is ugly, but temporarily bearable, syntax.

Hello World

This is what Hello World will look like:

 [:puts,"Hello World"]

What we need to handle here is pretty trivial: We need to be able to push an argument onto the stack, and we need to be able to call a function.

So let's see how to do that in x86 assembler. I compiled this C program with "gcc -S":

int main()
{
   puts("Hello World");
}

And looked at the output. Below I've reproduced the bits that are actually relevant, which I found by looking at the output from last time and seeing what had changed:

        .section        .rodata
.LC0:
        .string "Hello World"
        .text
[...]
        subl    $4, %esp
        movl    $.LC0, (%esp)
        call    puts
        addl    $4, %esp
[...]

If you understand som asm, even if you've not worked with x86 before, this is quite easy to pick up on:

  • It defines a constant string
  • In the code it allocated 4 bytes on the stack by subtracting 4 from the stack pointer.
  • It then moves the address of the string onto the stack, into the 4 freshly allocated bytes.
  • We call "puts", conveniently provided by glibc (all of this series will assume gcc/gas + glibc - on Linux odds are you already have it installed)
  • Afterwards we clean up the stack by adding four bytes back on.

So how do we make use of this in our compiler? First of all we need a way to handle the string constants, so we add this to the Compiler class from last time. (I'm making the full Ruby code available here so you don't have to guess what to copy and paste in where)

  def initialize
    @string_constants = {}
    @seq = 0
  end

def get_arg(a) # For now we assume strings only seq = @string_constants[a] return seq if seq seq = @seq @seq += 1 @string_constants[a] = seq return seq end

This code simply maps a string constant to an integer that we'll use to point to the labels. You'll not multiple string constants will map to the same label and get output once. It's not necessary to do it that way, but it's a trivial optimization to use a Hash instead of an Array to ensure uniqueness.

Here's the function I added to output the string constants:

  def output_constants
    puts "\t.section\t.rodata"
    @string_constants.each do |c,seq|
      puts ".LC#{seq}:"
      puts "\t.string \"#{c}\""
    end
  end

That just leaves compiling the call itself:

  def compile_exp(exp)
    call = exp[0].to_s

args = exp[1..-1].collect {|a| get_arg(a)}

puts "\tsubl\t$4,%esp"

args.each do |a| puts "\tmovl\t$.LC#{a},(%esp)" end

puts "\tcall\t#{call}" puts "\taddl\t$4, %esp" end

You'll probably notice an ugly inconsistency here: The code above pretends to be able to handle multiple arguments, but then it proceeds to subtract only 4 from the stack pointer, and it doesn't apply a displacement to the stack pointer, so multiple arguments will just overwrite each other.

We'll fix that in a moment. For our simple one argument Hello World the above will do.

Notice a few other things about this code:

  • It doesn't check if the function even exists - we'll let gcc/gas worry about that, though it means unhelpful error messages.
  • We can actually call any function we can link to, as long as it takes a single string argument in.
  • There's lots of ugly bits that really should be abstracted away, such as how I get hold of the address to call, all that inlining of assembly code that tightly ties the compiler to i386, etc. Bear with me, it will (slowly) get addressed.

It's time to try running the second step of the compiler. It should generate this output:

        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $4,%esp
        movl    $.LC0,(%esp)
        call    puts
        addl    $4, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .section        .rodata
.LC0:
        .string "Hello World"

Here's how to test it:

[vidarh@dev compiler]$ ruby step2.rb >hello.s
[vidarh@dev compiler]$ gcc -o hello hello.s
[vidarh@dev compiler]$ ./hello 
Hello World
[vidarh@dev compiler]$ 

So, how do we handle more than one argument?

I'm not going to show C and the corresponding assembler again - it's easy enough to do a number of calls with different number of arguments and look at the result. Instead I'll skip straight to the changed compile_exp (complete source here)

  PTR_SIZE=4

def compile_exp(exp) call = exp[0].to_s

args = exp[1..-1].collect {|a| get_arg(a)}

# gcc on i386 does 4 bytes regardless of arguments, and then # jumps up 16 at a time. We will blindly do the same. stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE) puts "\tsubl\t$#{stack_adjustment}, %esp" args.each_with_index do |a,i| puts "\tmovl\t$.LC#{a},#{i>0 ? i*PTR_SIZE : ""}(%esp)" end

puts "\tcall\t#{call}" puts "\taddl\t$#{stack_adjustment}, %esp" end

So what's going on here? The changes aren't huge:

  • Instead of allocating a specific amount of stack (4 bytes in the previous version), we're adjusting the stack based on the number of arguments. I'll plainly admit to not knowing why gcc uses the specific adjustment it does - it really doesn't matter at this stage, though I'd guess it's for alignment of the stack. Optimize/clean up later, and don't change things when you're not sure why they've done it that way...
  • As you can see the arguments are then moved onto the stack one by one. We're still assuming they are all the same size of pointers (so 4 bytes on i386).
  • You can also see that the arguments to the function is moved onto the stack so that the leftmost argument is at the lowest address allocated. If you haven't done any assembly programming, draw boxes to visualize it if you can't visualize it in your mind, and keep in mind that stacks as here usually grow downwards in memory. When allocating space, we move the stack pointer down in memory, and here we then copy arguments onto the stack upwards in memory (using higher and higher indexes to %esp, just as if you're accessing an array).

That's enough to compile something like this:

[:printf,"Hello %s\\n","World"]

What's next

It's a baby step forward, and I promise the next steps will get more and more useful, as the number of constructs that are required to compile useful programs is actually very small. I'll also make the following steps terser and handle "meatier" additions, as I've explained a lot of the motivations and so will add less and less explanation for the simpler bits of functionality as we go on.

Next I'll extend the compiler to handle multiple arguments, followed by chaining statements, sub expressions, handling return values from sub expressions and so on.

Within perhaps a dozen or so steps of similar complexity to this, we'll have function definition, argument passing, conditionals, a runtime library, and even a stunted "lambda" operator to give us anonymous functions (proper closures comes a lot later)

Within a few more steps we'll have enough to compile a simple text mangling program to give the input somewhat nicer syntax than Ruby Array's and symbols (only somewhat, though, a proper parser is not on the table until much later)


Comments

2008-03-28 06:21 UTC
Nice to see you are continuing with the previous blog and are going to see this through (well thus far anyway).

I am finding it quite interesting. Having just recently studied grammer/language parsing and lexical analysis I find this interesting to follow.

2008-03-28 07:37 UTC
August Lilleaas
Your webpage in Firefox 3:

http://pastie.org/171932

No compiler for me =(

2008-03-28 12:11 UTC
Vidar Hokstad
August, that's interesting - I'm using Firefox myself (but 2.x still) and it looks fine for me. I'll download 3 tonight and see what's wrong. CSS isn't my thing, so I've no doubt done something stupid.

Boyter, thanks for the feedback. I have about 30 parts ready and waiting - they just need 20-30 minutes of cleanup each, so it'll continue at least until I've exhausted the pre-written material, probably with 1-2 a week... After that it might take me a bit longer between posts, depending on my schedule.

2008-04-05 11:32 UTC
Vidar Hokstad
As a heads up, I've just posted part 3

2008-04-06 01:47 UTC
Matthias Georgi
Vidar, this series is just fantastic. I've spent a lot of time building my own toy compiler. It compiled a subset of ruby, which was loaded on the fly as ruby extension.

I really like your style of writing, as you just follow the flow of a coding session. Well, it's how you called it: bottom up.

So, keep it up and thank you for this great stuff!

2008-04-06 01:50 UTC
Matthias Georgi
To fix your stylesheet for Firefox 3, I suggest to change the width of the entry class to 90% instead of auto. I don't know how this looks in IE, but for FF3 its perfect.
2008-04-06 12:04 UTC
Vidar Hokstad
Matthias,

Thanks for the comments. Have you considered writing up what you've done for your own toy compiler?

Ideas around compiling Ruby is very interesting to me, particularly given the dynamic aspects of the language, and while this series will take a long time to get there, most of my choices in terms of what to implement beyond what has to be there is based on what would be useful to implement Ruby like languages.

Regarding the stylesheet, thanks for the fix - it works reasonably well in both Opera, Safari, FF2 and FF3 for me now, though it has a bit of wasted space in at least Opera. I haven't tested IE. I guess I should.

2008-04-06 13:44 UTC
Matthias Georgi
My toy compiler needs a rewrite I think, so I can't wait for your upcoming parts to gather useful inspiration.

The ugliest part is the type conversion. If you want to optimize for particular types like Fixnums, a pluggable tree walker api is needed. Furthermore it would be useful to compile mini languages, which serve special purposes like data queries.

2008-04-06 14:07 UTC
Vidar Hokstad
I've not written the parser parts yet, but I'm intending to use a parser generator to generate both the parser itself and the tree handling code. Haven't given much thought to the optimization yet, though.

Type conversion is particularly nasty for Ruby like languages in one way, in that you often have very few hints about what types will be in use at runtime, so you just have to default to extremely generic code. But I do think there's room for optimizing quite heavily there, though I suspect it will almost always involve runtime type checks and including a slow "fallback path" all over the place. Yuck.

As for the parser generation, my one problem there is that I don't like any of the ones I know of (Ragel is perhaps the closest one I've seen to something I could live with). I have a strong aversion to parser generators that require intermixing code and the parser specification in particular, and I find most of them more verbose than I'd like.

I may take a detour and write one. I've written a few before, and when I wrote these parts originally I was also experimenting with a VM and "assembly language" for writing parsers that I could plug various frontends into (I wrote one to "compile" BNF into my parser assembly), and I must admit I have thought about using some of the code for this compiler to generate assembly for the parser directly too.

Post a Comment

Basic HTML allowed. Javascript required for anti-spam check

About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Recent posts to my blog

Categories

StumbleUpon My link page

(Links I have stumbled and like)