2008-03-26 00:22 UTC Writing a compiler in Ruby bottom up - step 1/??

I've just posted the second part of this series. See part 2

I've been sitting on this for a long time - the oldest entries in the series I'm about to start posting is stuff I started writing back in early 2005 before I even started version 1 of this blog.

I've always enjoyed compiler technology, and I've written several toy compilers. A while back I started writing a simple compiler in Ruby, and I wrote copious notes along the way exactly with the intention of posting it somewhere. Problem was I was that by the time I was finishing up the steps I have so far, my blog was languishing, and so it's been just gathering dust.

I have about 30 parts mostly written, all the way to a simple but working compiler. It'll take me a while to post them, as at least some of them need some cleanup, and I don't have mjuch time to write, but I might post a couple of parts a week, depending on my schedule.

Without further ado, here's the first entry:

Some background, and the first tiny bit of code

Usually, when writing an experimental compiler I've started top down. In other words, I've taken the traditional approach of designing a parser first, whether manually or using a parser generator. I've den started writing various tree transformation ode to turn the AST (Abstract Syntax Tree) into symbol tables, doing error checking, annotating the tree with various information, and then finally code generation. Often after I switched to Linux I've "chickened out" and compiled to C, which is ok if your language match well with C semantics, which it often will given how low level C is.

My first compiler though, was written in M68000 assembler and output M68000 assembler.

I bootstrapped it by allowing inline assembler all the way through, and gradually adding structure:

First I'd add functions and function calls, and start making use of them to structure my compiler. Then I'd add basic arithmetics that worked on assembler registers. Then I'd add local variables etc., but parse the assembly so that the register allocator would stay clear of registers used in the assembler that was interspersed.

Gradually the compiler morphed from M68000 to my hybrid language, and less and less assembler.

I want to bootstrap a compiler again. Though I have no intention of writing it in assembler this time. But I will start at the bottom: With the code generator. And I'll have the following goals:

  • Simplicity over performance every time. The goal is to get to the point where the compiler can be rewritten to work in its target language - in other words I want it self hosted. Since the initial code will be thrown away, it's a waste of time to do more work than what's needed to make it work. Performance can come later.
  • Don't make decisions that will constrain performance. While performance in itself isn't a goal, I should no lightly start doing thing that will forever doom the language to mediocre performance, or that will take tremendous amount of work to fix later. Ruby, I'm looking at you (=> write about Ruby runtime dynamism)
  • Simplicity over convenience every time. Don't add features just for convenience - add features only when they support the bootstrap effort. That means they have to make life easier than the added complexity makes translation of the compiler into the target language harder.
  • Delay designing the language - design features. I want to make a capable and highly flexible code generator first, supporting an AST that allow encoding of a very diverse set of functionality, or make it easy to add a diverse set of functionality, though I'll have my eye on LISP/Scheme and Smalltalk in terms of expressiveness of the low level functionality (concepts, not specific details), and aim for something resembling Ruby in terms of syntax, meta programming capabilities and object model.
  • I don't want to learn x86 assembler, yet I'm going to target x86. It's a bit of an exaggeration. I'm proficient with M68K assembler and 6510 assembler, and I know enough bits and pieces of x86 assembler to be able to read it reasonably well. Yet I've never, ever, written a significant piece of code using it, and I don't intend to start. A lot of what I DO know comes from looking at assembler output from gcc. "gcc -S" is your friend. My knowledge of assembler in general, though, is enough to have a reasonable understanding of what's going on even when I don't know the specifics of the x86. I'll likely make stupid mistakes, but hopefully I'll learn from them (and anyone foolish enough to trust my initial thoughts on how the x86 works better get their head examined - I'm learning here)

Sounds like a lot of shortcut and still a lot of pain, doesn't it? But it'll be fun!

So lets start with the tiniest bit of useless code:

#!/bin/env ruby

class Compiler

def compile(exp) # Taken from gcc -S output puts <<PROLOG .file "bootstrap.rb" .text .globl main .type main, @function main: leal 4(%esp), %ecx andl $-16, %esp pushl -4(%ecx) pushl %ebp movl %esp, %ebp pushl %ecx PROLOG

puts <<EPILOG popl %ecx popl %ebp leal -4(%ecx), %esp ret EPILOG

puts <<GLOBAL_EPILOG .size main, .-main GLOBAL_EPILOG end end

prog = [:puts,"Hello World"]

Compiler.new.compile(prog)

It does nothing. Really. It's just the result of compiling this with gcc -S:

int main()
{
}

... and breaking it into pieces.

It's a fully working compiler... Sort of.... Unfortunately it compiles any input into a noop, and so it's completely useless. But lets start with baby steps.

Consider this series of articles a "flow of consciousness". I do this for fun. I'm not about to sit down and make a nice clean design upfront. Nor am I going to be shy about changing my mind midstream and rip out code, or put stuff I previously ripped out back in again.

What you see here is my second pass: Each SVN commit I make is accompanied by notes for my posts, but the posts contain a lot of extra explanation. Sometimes that explanation makes me change my mind about things - but I won't make major changes unless I think it makes the explanations clearer or a point I made in the first pass is blatantly wrong. Even then I might leave it in, possibly with a note that I'll correct myself later. Sometimes I've contracted steps that took me a while to do but that are trivial to explain "after the fact".

Comments are very welcome, but keep in mind that while I may take your comments to heart, I have already written a lot of this series - I won't start rewriting large parts of it to take into account suggestions, but I will take notes and keep it in mind for whenever I catch up to what I've written already

Next time I'll make it actually do something with it's input, I promise.



Comments - Newest first

2010-01-09 04:24 UTC
Paul Pittman
All this talk about horrendous syntax reminds me of a time in the programming section of Borders Book Store in Long Beach, California.

A programming professor walked in after 15 minutes of intense perusal of the offerings on the shelves, he acknowledged my existence and said, "I'm looking for a book to teach this new programming language but my students know that programming language but I having problems teaching them this one!?!"

I looked at him and blurted, "Why don't you use that language everyone knows to teach the one they don't?" Can I get a duh on that one? No? He looked at me and turned around and walked out!

Simplicity rocks. Ruby is a rock. If you predicate life and programming on this, it may not make it easy, easier, yes. And maybe one day, the impossible becomes possible!

Predicating life on "The best way to solve a problem is not to have it." is my approach to the predicament of life and programming. Why not use Ruby to generate the nasty syntax of the compiler to streamline, simplify and accelerate productivity. (I love that word!)

This it reason I got into Ruby to find a language to write syntax and then run it. I used VBS in ASP to generate Active Server Pages in 2001 in Advanced Server 2000, where it would then run the pages that were generated. I could swear it was teraflopping using text file databasing with ADO on top of PWS on top of IIS.

A few months later Microsoft .Netted me with ASP.Net. which nuked my code from running off any virtual host running Frontpage extensions. A few months later Solaris 10 was teraflopping and in 2003 a thing called Rails bore a strong resemblance to my code earlier.

Ruby is a great syntax generator (1.8.X). Rails is proof. Especially when use $global variables to write generic code which allows you to segregate code into different require '????.rb' files. It is like having dynamically linked libraries (ddls) without knowing how to write them!

2008-03-28 11:46 UTC
Frank Davis
Vidar,

I just wanted to say thanks. I've started working on a toy language compiler myself, both as a learning experience and because I have some ideas that might allow it to grow into a useful tool for myself. The bootstrapping approach you are taking is similar to the approach I'm taking, except that I'm inlining C# rather than assembler. I hope you find the time to continue the series!

2008-03-28 02:40 UTC
Vidar Hokstad
postmodern,

Thanks for for link - I'd seen Metasm before, but forgotten all about it. In the later ones of the steps I've already written for this series I did start abstracting out the asm into a separate class, and also hiding a lot of the i386 details, but I'll make sure to take a look at Metasm again to see if there are any ideas to make use of (or if it's worth using it to do the assembly).

2008-03-28 01:35 UTC
This reminds me a bit of Metasm, a pure Ruby assembler. Unfortunately, it isn't available as a RubyGem and is largely unknown to the greater Ruby community.
2008-03-27 11:22 UTC
Vidar Hokstad
Maledictus, That's pretty much the direction this code will go too - a "normal" object system is actually pretty simple to compile. For that matter, Ruby's is too - the problem isn't compiling it, it's compiling it without absolutely murdering performance.

For anyone who haven't looked at Matz' Ruby implementation: For each object the instance variables are stored in a hash table, and for each class the same is true for the methods, so the number of hash table lookups are huge. Translating that straight in a compiler is easy enough, but a good compiler will really want to try to remove as many of those hash table lookups as possible.

I think a good Ruby compiler would inevitably need a JIT if it want's to both keep the features of Ruby and get reasonable performance - then it could optimize the code a lot more at the cost of having to recompile pieces of the program if/when a class gets modified.

2008-03-27 11:10 UTC
Vidar Hokstad
ohxten,

I agree with you regarding M68000 syntax - it was when I gave up the Amiga that I stopped using assembler, largely because x86 assembly is so horrendous...

But I actually prefer the AT&T syntax. To me the FASM/NASM syntax will always have the operands in the wrong order - I just can't deal with it without getting annoyed.

In any case, the actual assembler will become less and less noticeable as I go along with this (including eventually start to get abstracted out)

2008-03-27 04:08 UTC
she
actually this is kinda cool

it gives people new ideas

hopefully it gets continued though :-)

2008-03-27 03:08 UTC
Motorola 68000 assembler syntax is very nice.

If you look at GNU as syntax, it's terrible AT&T syntax. I recommend you look at assembly syntax such as that used by FASM or NASM. It looks a _lot_ nicer... instead of:

--- leal 4(%esp), %ecx andl $-16, %esp pushl -4(%ecx) pushl %ebp movl %esp, %ebp pushl %ecx ---

you'd have something like:

--- lea ecx, [esp+4] and esp, -16 push [ecx+4] push ebp mov ebp, esp push ecx ---

... a lot nicer on the eyes. :D

2008-03-27 03:07 UTC
Maledictus
Hi, thanks for your comment!

I'm already looking at the other ruby implementations except ironruby and ruby.net.

Interesting thing for me is not having a VM; but your absolutly right that it makes many things much harder. I think about simply ignoring all the dynamic features at first.

2008-03-27 02:57 UTC
Vidar Hokstad
Maledictus,

There might be parts which'll be useful for you. I will certainly discuss compilation of features that are needed to compile Ruby, though quite a few of the first entries will be too low level to be that useful for you. Most of the approaches I took while writing this series originally should be doable to generate C code from instead of x86 asm, though it'll take quite a bit of adaptation. I also don't touch on anything resembling a Ruby parser in any of the first 30 parts I wrote.

If you really want to compile Ruby to C, you ought to take a look at the Ruby 1.8.6 and 1.9 parsers, and also the 1.9/YARV and Rubinius VM's at least - they're doing most of the hard lifting, and a "worst case" for compilation to C is to generate C from the bytecode, though that won't be as efficient as it could be. It's probably worthwhile looking at JRuby too.

You will also run into issues with the dynamic nature of Ruby, which is much easier to handle in a VM.

2008-03-27 02:51 UTC
Vidar Hokstad
Bill,

I know that know. I didn't when I started. I've chosen to keep it mostly as I wrote it partly because I think it has educational value to make it match up well with the process I went through, partly because I think it makes sense for the output to be easily reproducible - the ASM is straight cut and paste :)

Will, Probably not - based on what I learned when I wrote it I could probably do a lot better job at providing a cohesive narrative now, but unfortunately I don't have time at the moment to clean it up. As I get time, I will certainly consider pulling together a separate set of pages giving a good overview. In the meantime I'm hoping the entries will spur some discussion while I'm posting them, and hopefully I'll learn even more.

Thanks for the comments...

I'll aim to post the next entry tomorrow or Friday - even though the text is mostly written it does deserve _some_ cleanup after having been put aside for that long..

2008-03-27 02:48 UTC
Maledictus
I've been thinking about a ruby compiler for a while now. A compiler written in ruby which compiles ruby to C. Do you think this series is going to be helpful?
2008-03-27 02:42 UTC
Will
Think a series of blog entries is really the best way to present this?
2008-03-27 02:33 UTC
Interesting. However, why start with such complicated ASM? A lot of that is x86 optimization that you could do without for your beginning steps.

I say this as someone who has written an x86 compiler that output ASM, mostly from gcc -S output :)

2008-03-27 01:56 UTC
Dan
Interesting stuff!

I'm intrigued to see where this goes, as I've been wanting to try writing a compiler for a while.

Post a Comment

Basic HTML allowed. Javascript required for anti-spam check (I am testing a new anti-spam measure. Problems commenting? Please e-mail me: vidar@hokstad.com)

About me

E-mail: vidar@hokstad.com Skype: vhokstad
Twitter: vhokstad
View my LinkedIn profile.

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married and we just had our first child, Tristan Ikemefuna Hokstad.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Twitter Updates

    follow me on Twitter