Vidar Hokstad V2.0

Home Blog

Tag: compiler in Ruby bottom up

2008-05-03 17:43 UTC Writing a compiler in Ruby bottom up - step 6

This is the sixth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Since we established last time that I'm going to blatantly steal stuff from Lisp, Scheme and friends (as well as from all over the place - I feel no need to be original with this project... Not until we're much further along, anyway), now is the right time to start introducing some more powerful concepts.

How about some deferred evaluation and anonymous functions?

Lambdas, or anonymous functions, can be passed around like values and called at your convenience (or not at all). Generally, they can access variables from the surrounding scope that gets "bound" to the function as an environment that allows it to pass state. That's a closure. What we're going to do this time is not going to bring full closure support, but it's the start, and we'll get to full closures down the road.

Let's be clear: As almost everything in programming languages, they are syntactic sugar. You can consider them equivalent to defining a class with a single method to call it and instance variables to be the environment, for example (or you can treat closures as a way of building an object system instead - a concept explored by Wouter van Oortmerssen who has been a hero of mine ever since I discovered Amiga E. If you're a programming language geek you need to take a look at the stuff Wouter has done over the years) - many concepts are reasonably orthogonal.

(As further digression, have you noticed a lot of people like me who are obsessed with programming languages that have this horrible tendency to nest sentence fragments and abuse punctuation to interject random thoughts all over the place, or is that just me?)

But instead of having to go to all that trouble, and litter your namespace with small functions, lambdas let you define them inline, and return a value representing the function instead of executing it.

For now, we'll add these constructs:

(lambda (args) body)

(call f (args))

The former will return the function - for now just the address - instead of executing the body. "call" will call the address passed with the arguments it's given, no surprise there.

So how does this differ from "real" closures?

The tricky part about real closures is that if you refer to variables from the surrounding scope, they are supposed to "stay live" for the next invocation. That is, they are meant to be kept around whenever you call the closure later on. It should be clear that returning just the address to the function isn't sufficient to achieve that. So let's look at one way of doing it, so you get an idea of the work involved (it's not huge, but not trivial either):

  • We need to create an "environment", namely somewhere to store the variables that should survive past the lifetime of the function the closure is created in. This environment must exist on the heap, and each invocation of the function surrounding the lambda must result in a new environment.
  • What is returned must allow you to access the environment. You can do this by creating an "object" and make the environment represent instance variables, or you can use a "thunk" - a small generated function that contains a pointer to the object and loads it into a predefined location before calling the anonymous function, or any number of other ways.
  • You need to decide what variables go into the environment. This could be either "everything" accessible at the time of the execution of the "lambda" expression, or you could scan the expression body of the lambda and explicitly put only the ones used by one or more lambda expression in the function in the environment. The latter is potentially a lot more space efficient, but is slightly more work.

Ok, so for now just lets get the "anonymous function" in place. As usual I'll go through the changes step by step, but note that I've done some minor cleanups that are inconsequential to these changes as well, that I won't go through because they'll lust clutter things up.

The first piece is the code to handle the "lambda" expression:

  def compile_lambda args, body
    name = "lambda__#{@seq}"
    @seq += 1
    compile_defun(name, args,body)
    puts "\tmovl\t$#{name},%eax"
    return [:subexpr]
  end

Hopefully this is pretty self-explanatory. All it does is create a function name on the form lambda__[number] that we'll use to refer to the function, since we won't actually output it inline. Nothing really prevents us from spitting it out inline, but I find it messy, so for now anyway, I'll treat it as just another function. We then call compile_defun to generate just another named function - so it's only anonymous for the user. We then move the address of the function into %eax, which is where we've established the result of our subexpressions go. Note that this is yet another shortcut - eventually we'll need to do something more sophisticated to handle more complex expressions properly, but register allocation is a complex subject (and the alternative, to push everything onto the stack works but is slow).

Last we return [:subexpr] to indicate to the caller where/how to find the result of this expression.

Next we'll do some refactoring. You may have noticed that the inner loop of #compile_exp was getting ugly, handling different types of arguments. So we extract this:

  def compile_eval_arg arg
    atype, aparam = get_arg(arg)
    return "$.LC#{aparam}" if atype == :strconst
    return "$#{aparam}" if atype == :int
    return aparam.to_s if atype == :atom
    return "%eax"
  end

Notice the appearance of :atom, too. This allows us to take the address of C functions and pass them to :call. It was so simple to add, I figured why not. To go with that, we add this to #get_arg:

   return [:atom, a] if (a.is_a?(Symbol))

Next up, as part of the refactoring, :call falls out almost by itself:

  def compile_call func, args
    stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE)

puts "\tsubl\t$#{stack_adjustment}, %esp" args.each_with_index do |a,i| param = compile_eval_arg(a) puts "\tmovl\t#{param},#{i>0 ? i*4 : ""}(%esp)" end

res = compile_eval_arg(func) res = "*%eax" if res == "%eax" # Ugly. Would be nicer to retain some knowledge of what "res" contains puts "\tcall\t#{res}" puts "\taddl\t$#{stack_adjustment}, %esp" return [:subexpr] end

This might look familiar. It's because it's the guts of #compile_exp, with #compile_eval_arg replacing some of the more ugly bits. The main other change is that it calls compile_eval_arg to get the function too, and does something weird to "%eax" - it prepends a "*".

Either you're confused now, or you might start seeing possibilities - both of doing nice things and of shooting yourself in the foot. The above is the equivalent of casting an arbitrary expression into a pointer and jumping to it with no validation. It makes it deliciously easy to cause a segfault by jumping to random addresses. It also, incidentally, makes things like a virtual pointer table for a class system trivially easy, and a number of other things. Safety will have to come later. As it turns out, the "*" is needed in front of indirect "call"'s.

So what does #compile_exp look like now? The obvious answer is "a hell of a lot cleaner":

 def compile_do(*exp)
    exp.each { |e| compile_exp(e) }
    return [:subexpr]
  end

def compile_exp(exp) return if !exp || exp.size == 0 return compile_do(*exp[1..-1]) if exp[0] == :do return compile_defun(*exp[1..-1]) if (exp[0] == :defun) return compile_ifelse(*exp[1..-1]) if (exp[0] == :if) return compile_lambda(*exp[1..-1]) if (exp[0] == :lambda) return compile_call(exp[1],exp[2]) if (exp[0] == :call) return compile_call(exp[0],exp[1..-1]) end

Nice, isn't it? Compile_call, as it turns out does almost exactly the same as what the guts of compile_exp used to do, apart from what was farmed out to the other functions.

So we do a little test:

prog = [:do,
  [:call, [:lambda, [], [:puts, "Test"]], [] ]
]

(yeah, not exactly earth-shattering)

Compile and run:

$ ruby step6.rb >step6.s
$ make step6
cc    step6.s   -o step6
$ ./step6
Test

The full code for this step is here

Following parts

Since I combined three parts last time, here's an updated list of the remaining pre-written "just-in-need-of-cleanup" parts - I guess I better free up some time to write some new stuff soon, as I'm fairly sure I might end up combining a couple of the parts below when I get around to cleaning them up (such much for posting it as 30 parts...)

  • Step 7: Revisiting loops with anonymous functions, accessing function arguments
  • Step 8: Adding assignment and basic arithmetic
  • Step 9: A cleaner "while" loop
  • Step 10: Testing the language: Writing a simple text mangling program to make input cleaner
  • Step 11: Refactoring the code generation and starting to abstract out the target architecture
  • Step 12: Discussion of various concepts and future direction
  • Step 13: Adding arrays
  • Step 14: Local variables and multiple scopes
  • Step 15: Accessing variable length arguments
  • Step 16: Revisiting the text mangler - cleanups to test new functionality, and run it on itself
  • Step 17: Identifying the constructs needed to self host the compiler
  • Step 18: A start on a proper parser


2008-04-27 19:31 UTC Writing a compiler in Ruby bottom up - step 5

This is the fifth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Last time I promised to post these more frequently, and blatantly failed to find the time... In return, I've taken the time to combine what would have been parts 5, 6 and 7, that by themselves would've been annoyingly short. Here goes:

Handle number literals

So far our program have only handled numbers as an accidental by-product of not doing any type checking of any kind, and even then only if you have an external function that produces them.

So lets take a look at how gcc handles integers up to long long for C. Again we're restricting it to 32bit x86:

void foo1(unsigned short a) {}
void foo2(signed short a) { }
void foo3(unsigned int a) {}
void foo4(signed int a) { }
void foo5(unsigned long a) {}
void foo6(signed long a) {}

void foo7(unsigned long long a) { } void foo8(signed long long a) {}

int main() { foo1(1); foo2(2); foo3(3); foo4(4); foo5(5); foo6(6); foo7(7); }

I'm omitting most of the gcc output, you can run gcc -S yourself if you care. The interesting bit is the calls to the various functions, which end up like this:

    movl    $1, (%esp)
    call    foo1
    movl    $2, (%esp)
    call    foo2
    movl    $3, (%esp)
    call    foo3
    movl    $4, (%esp)
    call    foo4
    movl    $5, (%esp)
    call    foo5
    movl    $6, (%esp)
    call    foo6
    movl    $7, (%esp)
    movl    $0, 4(%esp)
    call    foo7

In other words, for the purposes of a function call at least, gcc threats all types except long long as a single 32 bit value. We'll conveniently forget about long long and stick to values up to 32 bit then, since that means we can keep ignoring types completely.

Laziness is a virtue.

We'll also blatantly ignore floats? Why? Because you can easily write a compiler with only integer math, and so adding float support now would only be a waste of time, when things are bound to change down the line anyway.

Besides, when I was young we didn't have those fancy FPU's, sonny, and we did just fine anyway with fixed point calculations using integers.

So what do we actually change?

In #get_arg, before we handle string constants, we add this:

  return [:int, a] if (a.is_a?(Fixnum)) 

In #compile_exp, we add this to use the return values from #get_arg:

  elsif atype == :int then param = "$#{aparam}" 

Aaand, we're done. That was almost to easy.

Here's a test:

prog = [:do,
  [:printf,"'hello world' takes %ld bytes\\n",[:strlen, "hello world"]],
  [:printf,"The above should show _%ld_ bytes\\n",11]
]


Intermission: Some thoughts on primitive data types

"Pure" object oriented languages are great, but not at the relatively low level of the code generator, in my opinion. Whether or not you want to implement a pure OO language, I strongly believe that it's worth implementing primitives that can work on primitive data types first. Then you can decide whether or not to hide those primitives completely from the user at a later stage, or find ways to make them look like objects, or quietly convert them back and forth as needed behind the users back.

Note that the Matz Ruby Interpreter (MRI) does this: Numbers for example are quietly treated differently than "real" objects, but the interpreter does it best to hide this fact from the user. Personally I don't think MRI goes far enough.


If ... then ... else

You don't get very far without some kind of conditional logic. some form of if .. then .. else ... exists in most languages that wants to be useful. For now we will implement [:if, condition, if-arm, else-arm], and do it the C way, in other words both 0 and null pointers evaluate to false, everything else to true.

Again a simple test:

void foo() {}
void bar() {}

int main() { if (baz()) { foo(); } else { bar(); } }

Giving the relevant part:

    call    baz
    testl   %eax, %eax
    je  .L6
    call    foo
    jmp .L10
.L6:
    call    bar
.L10:

This is a pretty much universal pattern for compiling if .. then ... else, across a huge number of languages and architectures:

  • Calculate the expression that occurs in the condition
  • Test it in some way (here with "testl" - other variations that are common for various architectures is a "cmp" [compare] instruction or to subtract the register from itself). testl does a comparision of the left and right operands, and set various flags.
  • Next, do a conditional branch to the
  • else arm. In other words, we're checking if the condition is NOT true. In this case it's done with "je" which translates to "jump on equal", in other words if the preceding comparison marked the result as equal (note that on most CPU's, lots of instructions set the condition codes, not just explicit tests).
  • Then execute the "if-arm".
  • Jump over the else-arm to after the end of the entire if..then..else construct.
  • Place a label for the else-arm, and the code to execute it.
  • Place a label marking the end.

There are lots of variations, such as reordering the if/else arms depending on estimates on which condition is most likely and knowledge of whether a missed or taken branch is cheapest on a particular architecture, but the above is all that's needed for now.

Again the compilation is deceptively simple - it's really just a direct translation of the above:

  def ifelse cond, if_arm,else_arm 
    compile_exp(cond) 
     puts "\ttestl   %eax, %eax" 
     @seq += 2 
     else_arm_seq = @seq - 1 
     end_if_arm_seq = @seq 
     puts "\tje  .L#{else_arm_seq}" 
     compile_exp(if_arm) 
     puts "\tjmp .L#{end_if_arm_seq}" 
     puts ".L#{else_arm_seq}:" 
     compile_exp(else_arm) 
     puts ".L#{end_if_arm_seq}:" 
  end 

It should be pretty self explanatory - it's basically calling compile_exp for each of the condition, if-arm and else-arm and weaving that into the above code, using @seq to generate unique labels.

To hook it in, we add this right after "return defun ...." in #compile_exp:

  return ifelse(*exp[1..-1]) if (exp[0] == :if) 

And here's a simple test:

prog = [:do,
  [:if, [:strlen,""],
    [:puts, "IF: The string was not empty"],
    [:puts, "ELSE: The string was empty"]
  ],
  [:if, [:strlen,"Test"],
    [:puts, "Second IF: The string was not empty"],
    [:puts, "Second IF: The string was empty"]
  ]
]

Here is the result

As usual you run it like this:

$ ruby step5.rb >step5.s
$ make step5
cc    step5.s   -o step5
$ ./step5
ELSE: The string was empty
Second IF: The string was not empty
$

Some thoughts on loops

Unconditional loops are trivial to add to the language, but do we need to? The answer is strictly speaking no. We can already do loops with recursion, so why pollute the language?

For that to be a good solution, though, we need tail recursion elimination, and I'm not prepared to get into that right now. Tail recursion elimination, or it's more general form - tail call elimination - essentially means to identify cases where the exit out of a function consists of a call with the same or fewer arguments, followed by returning it's return value. If that is the case, you can reuse the current functions call frame for the arguments of the function you are calling, followed by a "jmp" instruction instead of "call". "jmp" doesn't push a return address onto the stack, and so when the function that you jump to return, it will return to the place the current function was called from, rather than to the current function.

This achieves a couple of things: First and foremost, the stack doesn't grow. Secondly, you save some precious cycles. Combined with appropriate other optimizations, with tail call elimination, you can implement a loop like this without being worried about running out of stack:

[:defun, :loop, [], [:do, 
  [:puts, "I am all loopy"], 
  [:loop] 
], 

[:loop]

In other words, tail call elimination means that for any function on the form "(defun foo () (do bar foo))", stack usage goes from being proportional to the depth of the tail calls to being 1 frame deep at that point...

The above will run with the current compiler, but it'll quickly overrun the stack and crash. Not good.

I sense a disturbance in the force. All two geeks reading this shuddering at the thought of the stack growing with each iteration..

For now, we'll just ignore this, and not do anything to abuse the stack too much. Instead we'll later implement a proper loop construct as a primitive. At least for now - if/when I add tail call elimination we can consider ripping it out again and make it part of the runtime library instead.

So we can do non-terminating loops... That doesn't do us much good now does it?.

Well, we can already do while loops too:

(defun some-while-loop () (if condition (some-while-loop) ()))

Not very satisfying, but it works. It's too ugly to be an acceptable workaround, though, so a "while" primitive is on the list.

(As the realization that I've used Lisp like syntax sinks in, there's another disturbance in the force as the same two geeks as before shriek in agony and bolt for the door...)

I'm not a Lisp'er. I can't handle the parentheses overload... But Lisp syntax is convenient when the language doesn't really have a syntax of it's own yet - that'll come after bootstrapping when it's time to write a proper parser. Incidentally, most of what I've implemented and will implement will bear some kind of resemblance to badly broken LISP - as it turns out, Lisp constructs are very powerful if you can stand the syntax, and even if you don't want to program in it, learning more about Lisp and/or Scheme is worthwhile.

Personally I don't heed my own advice, and most of the Lisp-like ideas I'll be using are probably based on incomplete understanding from the very limited time I've spent looking at Lisp.

Next time: Anonymous functions, maybe more.


2008-04-17 17:59 UTC Writing a compiler in Ruby bottom up - step 4

This is the fourth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Sorry for the long delay since the last part, I've simply been too caught up in other stuff. As the list from last time indicated, this part will cover defining functions and adding support for building a simple "runtime library".

Defining functions

A programming language without either functions or methods isn't much of a language. And as it happens, all parts of an object oriented language can be implemented easily enough in terms of a procedural language: A method isn't much more than a function that takes an object as an extra argument. So adding support for functions is obviously rather critical.

It is also very simple. Lets take a look at some C and the resulting assembly again:

void foo()
{
  puts("Hello world");
}

.. results in this, with gcc:

.globl foo
        .type   foo, @function
foo:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    $.LC0, (%esp)
        call    puts
        leave
        ret
        .size   foo, .-foo

The function call is easy enough to recognize, and that leaves us with a small prolog and epilog to the function:

The prolog stores %ebp onto the stack, and then copies %esp into %ebp. "leave" undoes the damage, and "ret" pops the return address off the stack and returns to where the function was called from. Why does it store %esp (the stack pointer) in %ebp? Well, one obvious advantage is that you can mess up the stack as much as you like, and just copy %ebp back again at the end and all is well. We can see above that GCC takes advantage of that by not freeing the stack space it allocated for the call to puts again, and leaving that to the "leave" instruction - it'd just be wasteful otherwise.

So if that's all, the changes should be fairly simple. And they are.

First we modify Compiler#initialize to create a Hash to store our functions in:

  def initialize 
    @global_functions = {} 

Then we add a method to output the functions:

  def output_functions
    @global_functions.each do |name,data|
      puts ".globl #{name}"
      puts ".type   #{name}, @function"
      puts "#{name}:"
      puts "\tpushl   %ebp"
      puts "\tmovl    %esp, %ebp"
      compile_exp(data[1])
      puts "\tleave"
      puts "\tret"
      puts "\t.size   #{name}, .-#{name}"
      puts
    end
  end

You see we include the ".globl" and ".type" and ".size" bits too. ".globl" indicates that the function should be visible outside of the specific file you're assembling, which is important if you want to link multiple object files together. ".type" and ".size" I believe are there primarily for debugging, indicating the the symbol represents a function, and the size respectively.

Apart from that, this function is pretty trivial - it calls "compile_exp" to do the actual work.

We add a little helper to define the functions:

  def defun name, args, body
    @global_functions[name] = [args,body]
  end

We add a couple of lines to #compile_exp:

    return if !exp || exp.size == 0
    return defun(*exp[1..-1]) if (exp[0] == :defun)

The first line is there partly for robustness, partly to let us pass nil or empty arrays as "do nothing", just as you'd expect to be able to write an empty function declaration. It makes sense in the context of the next line, that doesn't check if it's about to define an empty function.

You may or may not have noticed that this thing lets us define functions recursively. [:defun,:foo,[:defun, :bar, []]] is completely legal. You may also have noticed that this currently leads to defining two functions that are both globally accessible. Oh well. It doesn't do any harm, so we'll deal with that later ("that" being either preventing it, or making the inner function only visible inside the outer function - I haven't decided yet which I prefer).

All that remains is to output the functions, so we add this to #compile, right before output_constants:

 output_functions

Adding support for a runtime library

First we change the name of the current #compile to #compile_main. Then we redefine #compile as follows:

  def compile(exp) 
    compile_main([:do, DO_BEFORE, exp, DO_AFTER]) 
  end 

Then we define the constants DO_BEFORE and DO_AFTER (put them in a separate file if you prefer, for now I've just put them at the top):

DO_BEFORE= [:do,
  [:defun, :hello_world,[], [:puts, "Hello World"]]
]

DO_AFTER= []

Just admit it, you'd expected something more advanced. But that would defeat the goals set out at the start. The above is sufficient to define a proper runtime library. If you want something that'd need to be implemented in assembler or C, that works too: Just link to an object file containing the functions in question, since we still respect C calling conventions.

Lets test it. Add this before Compiler.new.compile(prog):

prog = [:hello_world]

And compile and run:

$ ruby step4.rb >step4.s
$ make step4
cc    step4.s   -o step4
$ ./step4
Hello World
$

You can find the result here

Accessing function arguments?

This leaves one big gaping omission: Accessing function arguments. As it happens that is a fairly large change. Trust that it won't be forgotten, it's a major part of part 8, and I'll try not leaving so long between the next few parts.


2008-04-05 11:31 UTC Writing a compiler in Ruby bottom up - step 3

This is the third in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

I hoped to post this earlier, but I've just been swamped this week - though cleaning up my old material so far have only taken about half an hour per post. Anyway, here is part three, and at the end I've posted a list of roughly what will be covered in the following parts. Roughly because I've tried grouping a few of them together with the intent of posting longer, more meaningful parts (this part is a good example why - it's very short), so I've cut my original thirty or so parts down to 20 (though that's just what I finished writing - that by no means finishes the job).

Chaining expressions with "do"

So far, the second version from last time can execute one single expression, and that's it. Not very useful. So I need to add a way to chain expressions like in the body of a function. As it stands, that's terribly simple. I'll add a keyword - "do" - that will accept an unlimited (meaning, in practice, limited by memory) list of expressions and evaluate them. It'll look like this:

prog = [:do, 
   [:printf,"Hello"], 
   [:printf," "], 
   [:printf,"World\\n"] 
] 

It's actually trivially simple. Right at the beginning of #compile_exp we add this:

    if exp[0] == :do                                                                                                                                                        
      exp[1..-1].each { |e| compile_exp(e) }                                                                                                                                
      return                                                                                                                                                                
    end

Recursion will play a great role here - you're descending a tree structure after all, so the core method to compile an expression will be called on deeper and deeper nodes as well, especially to handle proper sub-expressions, which is our next goal.

Adding sub-expressions, take 1

Here's the test we'd like to make work:

prog = [:printf,"'hello world' takes %ld bytes\\n",[:strlen, "hello world"]], 

Here's the first change. In #get_arg, we add this at the start:

    # Handle strings or subexpressions                                                                                                                                      
    if a.is_a?(Array)
      compile_exp(a)
      return nil # What should we return?                                                                                                                                   
    end

If you try to make the code above compile the test, you'll get an error from gcc, because we expect get_arg to return a sequence number for the string constants and nothing else, but that clearly doesn't work for sub expressions.

Adding sub-expressions, take 2: Return values

So how does GCC handle this. Getting the assembly for this:

int main()
{
  printf("'Hello world' takes %ld bytes\n",foo("Hello world"));
}

... results in this (just the relevant part of main):

    subl    $20, %esp
    movl    $.LC0, (%esp)
    call    foo
    movl    %eax, 4(%esp)
    movl    $.LC1, (%esp)
    call    printf
    addl    $20, %esp

... which shows that it's pretty straightforward. Gcc first calls the sub expression (foo), and then expects the return value of a function to be in the register "%eax", and so that needs to be copied onto the stack instead of the address of a string constant.

First up let's fix #get_arg:

  def get_arg(a)                                                                                                                                                            
    # Handle strings or subexpressions                                                                                                                                      
    if a.is_a?(Array)                                                                                                                                                       
      compile_exp(a)                                                                                                                                                        
      return [:subexpr]                                                                                                                                                     
     end                                                                                                                                                                    

seq = @string_constants[a] return seq if seq seq = @seq @seq += 1 @string_constants[a] = seq return [:strconst,seq] end

The only changes are the "return" expressions, where we indicate what is returned - this will expand considerably later.

The remaining change is pretty much a rewrite of the rest of compile_exp. Instead of just collecting the results of get_arg, we iterate over it and output directly (which is why the stack adjustment's also change, since the "args" array has gone away):

    stack_adjustment = PTR_SIZE + (((exp.length-1+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE)

puts "\tsubl\t$#{stack_adjustment}, %esp" if exp[0] != :do

exp[1..-1].each_with_index do |a,i| atype, aparam = get_arg(a) if exp[0] != :do if atype == :strconst param = "$.LC#{aparam}" else param = "%eax" end puts "\tmovl\t#{param},#{i>0 ? i*4 : ""}(%esp)" end end

As you can see the change isn't that complex. We just check the return value from #get_arg and pick a string constant or %eax depending. This part will expand considerably as we add more different types of things that can be returned.

You can find the latest version here

Upcoming parts

These are the almost ready pre-written parts only. I expect once I need to start catching up on new parts the focus will be on a simple parser with the goal of making the compiler self-hosted (i.e. able to compile itself) as quickly as possible.

  • Step 4: Introducing a runtime, and defining functions
  • Step 5: Handle literals other than strings
  • Step 6: If ... then ... else
  • Step 7: Looping constructs
  • Step 8: Anonymous functions (lambda)
  • Step 9: Revisiting loops with anonymous functions, accessing function arguments
  • Step 10: Adding assignment and basic arithmetic
  • Step 11: A cleaner "while" loop
  • Step 12: Testing the language: Writing a simple text mangling program to make input cleaner
  • Step 13: Refactoring the code generation and starting to abstract out the target architecture
  • Step 14: Discussion of various concepts and future direction
  • Step 15: Adding arrays
  • Step 16: Local variables and multiple scopes
  • Step 17: Accessing variable length arguments
  • Step 18: Revisiting the text mangler - cleanups to test new functionality, and run it on itself
  • Step 19: Identifying the constructs needed to self host the compiler
  • Step 20: A start on a proper parser


2008-03-27 16:27 UTC Writing a compiler in Ruby bottom up - step 2/??

This is the second in a series. Please read part 1 first if you haven't already to get the background. Part 3 is here

My choice of Ruby as my implementation language was pretty arbitrary. The choice doesn't matter all that much at this stage, but I really like Ruby.

At some point, however, there will be a series of steps to bring the language and the compiler closer together. What I mean by that is that I want the compiler to become self-hosting: It should be capable of compiling itself.

That means that either the compiler will have to support at least a subset of Ruby, or there will be a translation step to rewrite parts of the compiler in the language the compiler will understand.

While that doesn't dictate the implementation language, it means it's always worthwhile at least ensuring that your implementation language is reasonably close to where you want to go, unless you're not interested in making the compiler self-hosted.

It is also worthwhile to seriously avoid using complex language constructs in your implementation if you want the compiler to become self-hosted. Keep in mind that you need to support every language construct the compiler will need, and so if you use lots of fancy functionality you'll either have to support equivalent functionality in the compiler, or run the risk of having to do massive changes to the structure when you're ready to take the jump. Not fun.

One distinct benefit of Ruby as an implementation language, however (and it shares this with certain other languages, such as Lisp), is that you can build literal tree structures very easily - in Ruby using Array or Hash and in Lisp using lists.

It means I can start out by "faking" the abstract syntax tree as arrays, and neatly skip over the step of writing a parser. Yay! The cost is ugly, but temporarily bearable, syntax.

Hello World

This is what Hello World will look like:

 [:puts,"Hello World"]

What we need to handle here is pretty trivial: We need to be able to push an argument onto the stack, and we need to be able to call a function.

So let's see how to do that in x86 assembler. I compiled this C program with "gcc -S":

int main()
{
   puts("Hello World");
}

And looked at the output. Below I've reproduced the bits that are actually relevant, which I found by looking at the output from last time and seeing what had changed:

        .section        .rodata
.LC0:
        .string "Hello World"
        .text
[...]
        subl    $4, %esp
        movl    $.LC0, (%esp)
        call    puts
        addl    $4, %esp
[...]

If you understand som asm, even if you've not worked with x86 before, this is quite easy to pick up on:

  • It defines a constant string
  • In the code it allocated 4 bytes on the stack by subtracting 4 from the stack pointer.
  • It then moves the address of the string onto the stack, into the 4 freshly allocated bytes.
  • We call "puts", conveniently provided by glibc (all of this series will assume gcc/gas + glibc - on Linux odds are you already have it installed)
  • Afterwards we clean up the stack by adding four bytes back on.

So how do we make use of this in our compiler? First of all we need a way to handle the string constants, so we add this to the Compiler class from last time. (I'm making the full Ruby code available here so you don't have to guess what to copy and paste in where)

  def initialize
    @string_constants = {}
    @seq = 0
  end

def get_arg(a) # For now we assume strings only seq = @string_constants[a] return seq if seq seq = @seq @seq += 1 @string_constants[a] = seq return seq end

This code simply maps a string constant to an integer that we'll use to point to the labels. You'll not multiple string constants will map to the same label and get output once. It's not necessary to do it that way, but it's a trivial optimization to use a Hash instead of an Array to ensure uniqueness.

Here's the function I added to output the string constants:

  def output_constants
    puts "\t.section\t.rodata"
    @string_constants.each do |c,seq|
      puts ".LC#{seq}:"
      puts "\t.string \"#{c}\""
    end
  end

That just leaves compiling the call itself:

  def compile_exp(exp)
    call = exp[0].to_s

args = exp[1..-1].collect {|a| get_arg(a)}

puts "\tsubl\t$4,%esp"

args.each do |a| puts "\tmovl\t$.LC#{a},(%esp)" end

puts "\tcall\t#{call}" puts "\taddl\t$4, %esp" end

You'll probably notice an ugly inconsistency here: The code above pretends to be able to handle multiple arguments, but then it proceeds to subtract only 4 from the stack pointer, and it doesn't apply a displacement to the stack pointer, so multiple arguments will just overwrite each other.

We'll fix that in a moment. For our simple one argument Hello World the above will do.

Notice a few other things about this code:

  • It doesn't check if the function even exists - we'll let gcc/gas worry about that, though it means unhelpful error messages.
  • We can actually call any function we can link to, as long as it takes a single string argument in.
  • There's lots of ugly bits that really should be abstracted away, such as how I get hold of the address to call, all that inlining of assembly code that tightly ties the compiler to i386, etc. Bear with me, it will (slowly) get addressed.

It's time to try running the second step of the compiler. It should generate this output:

        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $4,%esp
        movl    $.LC0,(%esp)
        call    puts
        addl    $4, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .section        .rodata
.LC0:
        .string "Hello World"

Here's how to test it:

[vidarh@dev compiler]$ ruby step2.rb >hello.s
[vidarh@dev compiler]$ gcc -o hello hello.s
[vidarh@dev compiler]$ ./hello 
Hello World
[vidarh@dev compiler]$ 

So, how do we handle more than one argument?

I'm not going to show C and the corresponding assembler again - it's easy enough to do a number of calls with different number of arguments and look at the result. Instead I'll skip straight to the changed compile_exp (complete source here)

  PTR_SIZE=4

def compile_exp(exp) call = exp[0].to_s

args = exp[1..-1].collect {|a| get_arg(a)}

# gcc on i386 does 4 bytes regardless of arguments, and then # jumps up 16 at a time. We will blindly do the same. stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE) puts "\tsubl\t$#{stack_adjustment}, %esp" args.each_with_index do |a,i| puts "\tmovl\t$.LC#{a},#{i>0 ? i*PTR_SIZE : ""}(%esp)" end

puts "\tcall\t#{call}" puts "\taddl\t$#{stack_adjustment}, %esp" end

So what's going on here? The changes aren't huge:

  • Instead of allocating a specific amount of stack (4 bytes in the previous version), we're adjusting the stack based on the number of arguments. I'll plainly admit to not knowing why gcc uses the specific adjustment it does - it really doesn't matter at this stage, though I'd guess it's for alignment of the stack. Optimize/clean up later, and don't change things when you're not sure why they've done it that way...
  • As you can see the arguments are then moved onto the stack one by one. We're still assuming they are all the same size of pointers (so 4 bytes on i386).
  • You can also see that the arguments to the function is moved onto the stack so that the leftmost argument is at the lowest address allocated. If you haven't done any assembly programming, draw boxes to visualize it if you can't visualize it in your mind, and keep in mind that stacks as here usually grow downwards in memory. When allocating space, we move the stack pointer down in memory, and here we then copy arguments onto the stack upwards in memory (using higher and higher indexes to %esp, just as if you're accessing an array).

That's enough to compile something like this:

[:printf,"Hello %s\\n","World"]

What's next

It's a baby step forward, and I promise the next steps will get more and more useful, as the number of constructs that are required to compile useful programs is actually very small. I'll also make the following steps terser and handle "meatier" additions, as I've explained a lot of the motivations and so will add less and less explanation for the simpler bits of functionality as we go on.

Next I'll extend the compiler to handle multiple arguments, followed by chaining statements, sub expressions, handling return values from sub expressions and so on.

Within perhaps a dozen or so steps of similar complexity to this, we'll have function definition, argument passing, conditionals, a runtime library, and even a stunted "lambda" operator to give us anonymous functions (proper closures comes a lot later)

Within a few more steps we'll have enough to compile a simple text mangling program to give the input somewhat nicer syntax than Ruby Array's and symbols (only somewhat, though, a proper parser is not on the table until much later)


About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Categories

StumbleUpon My link page

(Links I have stumbled and like)