2008-03-27 16:27 UTC Writing a compiler in Ruby bottom up - step 2/??
This is the second in a series. Please read part 1 first if you haven't already to get the background. Part 3 is here
My choice of Ruby as my implementation language was pretty arbitrary. The choice doesn't matter all that much at this stage, but I really like Ruby.
At some point, however, there will be a series of steps to bring the language and the compiler closer together. What I mean by that is that I want the compiler to become self-hosting: It should be capable of compiling itself.
That means that either the compiler will have to support at least a subset of Ruby, or there will be a translation step to rewrite parts of the compiler in the language the compiler will understand.
While that doesn't dictate the implementation language, it means it's always worthwhile at least ensuring that your implementation language is reasonably close to where you want to go, unless you're not interested in making the compiler self-hosted.
It is also worthwhile to seriously avoid using complex language constructs in your implementation if you want the compiler to become self-hosted. Keep in mind that you need to support every language construct the compiler will need, and so if you use lots of fancy functionality you'll either have to support equivalent functionality in the compiler, or run the risk of having to do massive changes to the structure when you're ready to take the jump. Not fun.
One distinct benefit of Ruby as an implementation language, however (and it shares this with certain other languages, such as Lisp), is that you can build literal tree structures very easily - in Ruby using Array or Hash and in Lisp using lists.
It means I can start out by "faking" the abstract syntax tree as arrays, and neatly skip over the step of writing a parser. Yay! The cost is ugly, but temporarily bearable, syntax.
Hello World
This is what Hello World will look like:[:puts,"Hello World"]What we need to handle here is pretty trivial: We need to be able to push an argument onto the stack, and we need to be able to call a function. So let's see how to do that in x86 assembler. I compiled this C program with "gcc -S":
int main()
{
puts("Hello World");
}
And looked at the output. Below I've reproduced the bits that are actually relevant, which I found by looking at the output from last time and seeing what had changed:
.section .rodata
.LC0:
.string "Hello World"
.text
[...]
subl $4, %esp
movl $.LC0, (%esp)
call puts
addl $4, %esp
[...]
If you understand som asm, even if you've not worked with x86 before, this is quite easy to pick up on:
- It defines a constant string
- In the code it allocated 4 bytes on the stack by subtracting 4 from the stack pointer.
- It then moves the address of the string onto the stack, into the 4 freshly allocated bytes.
- We call "puts", conveniently provided by glibc (all of this series will assume gcc/gas + glibc - on Linux odds are you already have it installed)
- Afterwards we clean up the stack by adding four bytes back on.
def initialize @string_constants = {} @seq = 0 end def get_arg(a) # For now we assume strings only seq = @string_constants[a] return seq if seq seq = @seq @seq += 1 @string_constants[a] = seq return seq endThis code simply maps a string constant to an integer that we'll use to point to the labels. You'll not multiple string constants will map to the same label and get output once. It's not necessary to do it that way, but it's a trivial optimization to use a Hash instead of an Array to ensure uniqueness. Here's the function I added to output the string constants:
def output_constants puts "\t.section\t.rodata" @string_constants.each do |c,seq| puts ".LC#{seq}:" puts "\t.string \"#{c}\"" end endThat just leaves compiling the call itself:
def compile_exp(exp) call = exp[0].to_s args = exp[1..-1].collect {|a| get_arg(a)} puts "\tsubl\t$4,%esp" args.each do |a| puts "\tmovl\t$.LC#{a},(%esp)" end puts "\tcall\t#{call}" puts "\taddl\t$4, %esp" endYou'll probably notice an ugly inconsistency here: The code above pretends to be able to handle multiple arguments, but then it proceeds to subtract only 4 from the stack pointer, and it doesn't apply a displacement to the stack pointer, so multiple arguments will just overwrite each other. We'll fix that in a moment. For our simple one argument Hello World the above will do. Notice a few other things about this code:
- It doesn't check if the function even exists - we'll let gcc/gas worry about that, though it means unhelpful error messages.
- We can actually call any function we can link to, as long as it takes a single string argument in.
- There's lots of ugly bits that really should be abstracted away, such as how I get hold of the address to call, all that inlining of assembly code that tightly ties the compiler to i386, etc. Bear with me, it will (slowly) get addressed.
.text
.globl main
.type main, @function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4,%esp
movl $.LC0,(%esp)
call puts
addl $4, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.section .rodata
.LC0:
.string "Hello World"
Here's how to test it:
[vidarh@dev compiler]$ ruby step2.rb >hello.s [vidarh@dev compiler]$ gcc -o hello hello.s [vidarh@dev compiler]$ ./hello Hello World [vidarh@dev compiler]$
So, how do we handle more than one argument?
I'm not going to show C and the corresponding assembler again - it's easy enough to do a number of calls with different number of arguments and look at the result. Instead I'll skip straight to the changed compile_exp (complete source here)PTR_SIZE=4 def compile_exp(exp) call = exp[0].to_s args = exp[1..-1].collect {|a| get_arg(a)} # gcc on i386 does 4 bytes regardless of arguments, and then # jumps up 16 at a time. We will blindly do the same. stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE) puts "\tsubl\t$#{stack_adjustment}, %esp" args.each_with_index do |a,i| puts "\tmovl\t$.LC#{a},#{i>0 ? i*PTR_SIZE : ""}(%esp)" end puts "\tcall\t#{call}" puts "\taddl\t$#{stack_adjustment}, %esp" endSo what's going on here? The changes aren't huge:
- Instead of allocating a specific amount of stack (4 bytes in the previous version), we're adjusting the stack based on the number of arguments. I'll plainly admit to not knowing why gcc uses the specific adjustment it does - it really doesn't matter at this stage, though I'd guess it's for alignment of the stack. Optimize/clean up later, and don't change things when you're not sure why they've done it that way...
- As you can see the arguments are then moved onto the stack one by one. We're still assuming they are all the same size of pointers (so 4 bytes on i386).
- You can also see that the arguments to the function is moved onto the stack so that the leftmost argument is at the lowest address allocated. If you haven't done any assembly programming, draw boxes to visualize it if you can't visualize it in your mind, and keep in mind that stacks as here usually grow downwards in memory. When allocating space, we move the stack pointer down in memory, and here we then copy arguments onto the stack upwards in memory (using higher and higher indexes to %esp, just as if you're accessing an array).
[:printf,"Hello %s\\n","World"]
Comments