Tag: compiler in Ruby bottom up

2010-02-22 17:35 UTC Writing a (Ruby) compiler in Ruby bottom up - step 24

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

Apologies for the delay... This part was ready before christmas, but for various reasons I never got around to posting it. And to make matters worse I managed to post the wrong part yesterday. Sigh.

The Outer Scope

Stepping back from the attr_* debacle for a bit... I wanted to look at something else, mostly because it affects all my ad-hoc testing.

One of the downsides of starting this without a specific design in mind is that after I decided to go down the Ruby path we've been dragging along quite a bit of cruft.

But that is the cost of experimentation.

In particular, the compiler has a weird distinction between functions and methods: Define something with def outside of a class and it becomes a function, inside it becomes a method.

We need support for functions to make implementation reasonably easy: It makes bootstrapping the object model far easier, as well as integration with the outside C-based world.

But Ruby doesn't have functions.

So what to do?

Recently I wrote about hiding the "low level plumbing" and the change I'm about to show you gets us a bit closer to that while at the same time starting to clean up the function vs. method mess.

In Ruby, self outside of a class is the main object - an instance of Object. Logically then, for a def outside of a class, the method will be defined on `Object. So, we'll create that object.

What about function calls?

Defining and calling functions will still be possible, but only using the s-expression syntax. That reasonably cleanly "hides" the plumbing we use functions for (in fact, it means we could stop the annoying convention I started of prefixing the names with "__" since they effectively now live in their own namespace, though I haven't changed that yet).

Some preparations

First we must rewrite a few functions fully as s-expressions. This is straightforward. For example:

    -def __get_string(str)
    -  s = String.new
    -  %s(callm s __set_raw (str))
    +%s(defun __get_string (str) (let (s)
    +  (assign s (callm String new))
    +  (callm s __set_raw (str))
       s
    -end
    +))

Otherwise the upcoming changes will turn them into methods, which is not what we want.

I'm not going to present all of them. The biggest one is probably __new_class_object, but all of these are straight forward translations.

These changes were done on the master branch prior to starting this feature proper.

Splitting up "defun"

You can follow these remaining changes on the main-object branch

So far defun has been made up by two mostly different branches: If occurring inside a class, it's compiled as a method, if not it'd compile a function.

This is both messy and not very logical.

Going forward we'll separate it into defm that defines a Ruby style method, and defun, which, as before, defines a function. defun will be completely hidden from normal Ruby code - you'll have to dip down into the s-expression syntax to access it.

First our list of compile_* methods in compiler.rb:

    -                   :do, :class, :defun, :if, :lambda,
    +                   :do, :class, :defun, :defm, :if, :lambda,

The new, simplified compile_defun:

    # Compiles a function definition.
    # Takes the current scope, in which the function is defined,
    # the name of the function, its arguments as well as the body-expression
    # that holds the actual code for the function's body.
    #
    # Note that compile_defun is now only accessed via s-expressions
    def compile_defun(scope, name, args, body)
      f = Function.new(args, body,scope)
      name = clean_method_name(name)

# add function to the global list of functions defined so far @global_functions[name] = f

# a function is referenced by its name (in assembly this is a label). # wherever we encounter that name, we really need the adress of the label. # so we mark the function with an adress type. return [:addr, clean_method_name(name)] end

This is really more or less just tearing out the method part, and brings it roughly back to how it was before we added the method calling bit.

But compile_defm is also somewhat simpler than the old method part:

    # Compiles a method definition and updates the
    # class vtable.
    def compile_defm(scope, name, args, body)
      scope = scope.class_scope

f = Function.new([:self]+args, body, scope) # "self" is "faked" as an argument to class methods.

@e.comment("method #{name}")

body.depth_first do |exp| exp.each do |n| scope.add_ivar(n) if n.is_a?(Symbol) and n.to_s[0] == ?@ && n.to_s[1] != ?@ end end

cleaned = clean_method_name(name) fname = "__method_#{scope.name}_#{cleaned}" scope.set_vtable_entry(name, fname, f)

# Save to the vtable. v = scope.vtable[name] compile_eval_arg(scope,[:sexp, [:call, :__set_vtable, [:self,v.offset, fname.to_sym]]])

# add the method to the global list of functions defined so far # with its "munged" name. @global_functions[fname] = f

# This is taken from compile_defun - it does not necessarily make sense for defm return [:addr, clean_method_name(fname)] end

The most important changes?

  • The call to scope.class_scope at the top.
  • The change to call __set_vtable near the bottom instead of hardcoding the asm to set the vtable entry.

So, why those changes?

If you look at GlobalScope, we've added a @class_scope instance variable that is initialized to this:

    @class_scope = ClassScope.new(self,"Object",@vtableoffsets)

This means that if you have a :defm occurring in the global scope (we'll get to that) it will be compiled in that class scope, as a method of class Object.

Poof and our functions are (almost) gone.

(In ClassScope we just return self from the new class_scope method.)

For the second change, adding __set_vtable is just the purist in me wanting to expel as much assembler as possible. Here it is (from lib/core/class.rb):

    # FIXME: This only works correctly for the initial
    # class definition. On subsequent re-opens of the class
    # it will fail to correctly propagate vtable changes 
    # downwards in the class hierarchy if the class has
    # since been overloaded.
    %s(defun __set_vtable (vtable off ptr)
      (assign (index vtable off) ptr)
    )

Ok, so not just the purist in me. In fact, this makes re-opening classes "cleaner" in that this function will actually get reasonably complex as we fix this issue, and also later handle define_method and otherwise deal with cases where no vtable slot is actually available.

Anything else? A few pieces of cleanups to deal with the scope changes, and this little gem / turd depending on your point of view, in lib/core/core.rb:

    +
    +# OK, so perhaps this is a bit ugly...
    +self = Object.new
    +

Ehm. We probably should put that in the compiler, or at least prevent arbitrary assigning to self in user code. But it works for now, and we've done worse in the past. To reiterate: Get things working first. Make them fast/pretty later.

The last vital change is this, in Parser#parse_def:

    -    return E[pos, :defun, name, args, exps]
    +    return E[pos, :defm, name, args, exps]
       end

It does what it looks like - make the parser for the Ruby code spit out only :defm nodes instead of :defun nodes. Our code is now free of functions anywhere outside of s-expressions.

What's the effect?

These changes make this work as expectd:

    class Foo
      def hello
        puts "hello"
      end
    end

def hello_world puts "hello world!" end

%s(puts "foo") # Calls the C function Foo.new.hello # Calls Foo#hello which calls Object#puts puts "world" # Calls Object#puts (via the "main" instance)

hello_world # Calls Object#hello_world (via the "main" instance)

Adding this on the end on the other hand still fails (but it works in MRI) because __set_vtable doesn't propagate the vtable update downwards to subclasses of Object:

    Foo.new.hello_world

When re-opening a class, new methods needs to be added "properly" by processing all child classes and updating their vtables.

The alternative (which moves the cost from the point of defining a method to the point of calling the method), is to implement a proper default method missing that ascends the class hierarchy to look for an implementation before giving an error.

We really should do both, since we eventually need to handle methods without vtable entries.

But that's for a future installment



2009-12-16 18:26 UTC Writing a (Ruby) compiler in Ruby bottom up - step 23

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

Continuing down the rabbit hole: String

A couple of parts ago we established some of the problems with supporting even the seemingly simple attr_reader, attr_writer and attr_accessor. To reiterate, a naive implementation looks like this:

def attr_accessor sym
  attr_reader sym
  attr_writer sym
end

def attr_reader sym define_method sym do %s(ivar self sym) end end

def attr_writer sym define_method "#{sym.to_s}=".to_sym do |val| %s(assign (ivar self sym) val) end end

We resolved the issue of having a working sym.to_s, but that still leaves a number of things to address. For example, we need at least the start of a "proper" String class that we can add #to_sym to. That's what we'll look at this time.

The first bit is very similar to what we did for Symbol: We won't type tag, and we'll create String objects for literal strings by calling a low level function.

Quick sidebar: Hiding the low level implementation

MRI "hides" the gory details of String and other basic classes by not making them real Ruby objects per-se. There's data there that is simply not accessible without dipping into the C API. We may eventually want to take a similar tack - add a mechanism for allocating space for, setting and retrieving low level variables via the s-expression syntax that are completely invisible for the higher level API.

Currently we don't. The "raw" string used for Symbol and for String objects is a normal instance variable, and it's perfectly possible to expose it, and accessing it thinking it's a real object will trivially easily cause nasty crashes.

In general, we would eventually benefit from creating a mechanism to "hide" the low level plumbing from accidental access by normal Ruby. For now we'll retain the ability to shoot ourselves in the foot easily.

The basics

The String class will for now contain just a pointer to a C-style string. It will be immutable in it's first incarnation. This gets us a decent step further very easily - we can just keep the address to a C-style string constant.

As it turns out, this will be an easy foundation to build on:

MRI uses a number of flags for strings, for example to do "copy on write". Once we need to make strings mutable, we'll do the same. Strings created from constants will start out marked as "copy on write", and when doing an update, the string buffer will be copied into freshly allocated space first. The hope is that many string constants will never be updated.

We also need to implement String#to_sym. This is thankfully also reasonably easy: We need to access the low level function we added that will retrieve a symbol based on a "raw" c-style string.

The code

As the last couple of times, this was unfortunately committed a bit piecemeal, so the commits aren't easy to follow. But lets go through this step by step.

First of all, we're going to rewrite the tree, to alter a literal string into

%s(call __get_string ConstantReferringToTheRawString)

So, we alter compile to add a call to rewrite_strconst:

#
# Re-write string constants outside %s() t
# %s(call __get_string [original string constant])
def rewrite_strconst(exp)
  exp.depth_first do |e|
    next :skip if e[0] == :sexp
    is_call = e[0] == :call
    e.each_with_index do |s,i|
      if s.is_a?(String)
        lab = @string_constants[s]
        if !lab
          lab = @e.get_local
          @string_constants[s] = lab
        end
        e[i] = [:sexp, [:call, :__get_string, lab.to_sym]]
        # FIXME: This is a horrible workaround to deal
        # with a parser inconsistency that leaves calls
        # with a single argument with the argument "bare"
        # if it's not an array, which breaks with this rewrite.
        e[i] = [e[i]] if is_call && i > 1 
      end
    end
  end
end

Effectively this visits all nodes that are not s-expressions, and if it finds a string, it will add a string constant. It will then use the label that is allocated to rewrite the expression like this:

[:sexp, [:call, :__get_string, lab.to_sym]]

As the following line says, it's a nasty workaround, and it will be torn out as soon as I get a parser, so just ignore that... pretend it's not even there ;)

Now, the reason we wrap the call to __get_string is that otherwise it will be rewritten to a method call because there really is no such thing as function calls in Ruby, but for some of our low-level plumbing we need actual C-style function calls (such as for calling out to C code).

We're also changing strconst, the method used to get the "value" of a string constant as follows:

-  def strconst str
-    lab = @string_constants[str]
-    return lab if lab
-    lab = @e.get_local
-    @string_constants[str] = lab
-    return lab
+  def strconst(a)
+    lab = @string_constants[a]
+    if !lab # For any constants in s-expressions
+      lab = @e.get_local
+      @string_constants[a] = lab
+    end
+    return [:addr,lab]
   end

What's left? Not much, really. You can see the basi c String class

Here's the important bit:

class String
   def __set_raw(str)
     @buffer = str
   end
end

def __get_string(str) s = String.new %s(callm s __set_raw (str)) s end

It's worthwhile reading the comments in at the link above if you want to see more about where this is heading.

We'll flesh out the String class in coming parts, but this groundwork now means we're working on a "real" object instead of on a raw pointer to a c-style string, which means it's mostly a runtime library issue instead of requiring much in terms of compiler changes.

One of the things worth mentioning that is covered in one of the comments in lib/core/string.rb is that the whole __get_string and __get_symbol thing is a bit ugly. As I mentioned earlier in this part, hiding the low level implementation would be nice, and that includes preventing direct calls to __get_string and _get_symbol, and that can be done trivially by simply ensuring the Ruby code can't directly do function calls (as opposed to method calls). But that doesn't solve String#__set_raw. To fix that, I'll need to either remove the need for the method, or by making it possible to "hide" certain methods entirely from the Ruby code...

It's not a priority now, though - first we get things working, then we make it pretty.

What about String#to_sym? Well, this ought to do the trick:

class String
  def to_sym
    __get_symbol(@buffer)
  end
end

So, that leaves us with a working define_method and %s(lambda) with support for variables to get attr_reader,attr_writer and attr_accessor working... We're heading down that road soon.



2009-11-10 14:20 UTC Writing a (Ruby) compiler in Ruby bottom up - step 22

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

A diversion into Method missing

So far the method_missing implementation has just printed a notice and quit.

During the trip down the rabbit hole that is attr_accessor and friends that became a major annoyance.

The problem is that this notice has not included a stack backtrace or any way to figure out where it occurred. It's also been impossible to override it and actually figure out what method was being called because we currently get to method_missing by jumping straight into the vtable.

We get method_missing when we hit the pointer that's installed there as default when nothing has overridden it.

So how do we get better debug output? And how do we support users overriding method_missing and actually getting a symbol to use?

Thunks

A "thunk" in terms of a object oriented languages is generally a small piece of compiler generated code that gets inserted to "adjust" a function or method call.

In this specific case, we will generate a separate thunk for each vtable entry.

Instead of inserting a pointer to method_missing directly, we will insert the address of a small thunk. The thunk will not even create a full stack frame, but simply add the address of the Symbol corresponding to the vtable slot as the first argument on the stack, and then jump straight into method_missing, thereby simulating a "direct" call to method_missing with the symbol as the first argument.

It's actually very simple - we just need to pop the real return address off the stack, push the symbol onto the stack, and then push the real return address back on.

Ok, so we still cheat a bit. Eventually we need to make our method_missing into a real method, but for now it's a function. Here's the code I've added to create a "base" vtable that is used to initialize the vtable slots of Object:

    def output_vtable_thunks
      @vtableoffsets.vtable.each do |name,_|
        @e.label("__vtable_missing_thunk_#{clean_method_name(name)}")
        # FIXME: Call get_symbol for these during initalization 
        # and then load them from a table instead.  
        compile_eval_arg(GlobalScope.new, ":#{name.to_s}".to_sym)
        @e.popl(:edx) # The return address 
        @e.pushl(:eax)
        @e.pushl(:edx)
        @e.jmp("__method_missing")
      end
      @e.label("__base_vtable")
      # For ease of implementation of __new_class_object we
      # pad this with the number of class ivar slots so that the
      # vtable layout is identical as for a normal class 
      ClassScope::CLASS_IVAR_NUM.times { @e.long(0) }
      @vtableoffsets.vtable.to_a.sort_by {|e| e[1] }.each do |e|
        @e.long("__vtable_missing_thunk_#{clean_method_name(e[0])}")
      end
    end

I hope it's reasonably easy to follow. First it generates a number of functions that will look like this:

    __vtable_missing_thunk_to_yaml:
        subl    $4, %esp
        movl    $1, %ebx
        movl    $.L110, (%esp)
        movl    $__get_symbol, %eax
        call    *%eax
        addl    $4, %esp
        popl    %edx
        pushl   %eax
        pushl   %edx
        jmp __method_missing

This can be optimized a lot, but if you've followed this series, you know getting something working is higher priority. In this case we generate a call to __get_symbol, which was introduced in the last part, and we pass the string corresponding to the name:

    .L110:
        .string "to_yaml"

Then we adjust the stack as mentioned above.

The next step is to create the __base_vtable. Here's an excerpt:

    __base_vtable:
        .long 0
        .long 0
        .long __vtable_missing_thunk_new
        .long __vtable_missing_thunk___send__
        .long __vtable_missing_thunk___get_symbol
        .long __vtable_missing_thunk___method_missing
        .long __vtable_missing_thunk_array
        .long __vtable_missing_thunk___new_class_object
        .long __vtable_missing_thunk_define_method
        ...

Then we need to modify __new_class_object to assign entries from __base_vtable instead of just blindly assigning a pointer to __method_missing:

    # size <= ssize *always* or something is severely wrong.                                                                                                                
    def __new_class_object(size,superclass,ssize)
      ob = 0
      %s(assign ob (malloc (mul size 4))) # Assumes 32 bit                                                                                                                  
      i = 1
      %s(while (lt i ssize) (do
          (assign (index ob i) (index superclass i))
          (assign i (add i 1))
      ))
      %s(while (lt i size) (do
           # Installing a pointer to a thunk to method_missing                                                                                                              
           # that adds a symbol matching the vtable entry as the                                                                                                            
           # first argument and then jumps straight into __method_missing                                                                                                   
           (assign (index ob i) (index __base_vtable i))
           (assign i (add i 1))
      ))
      %s(assign (index ob 0) Class)
      ob
    end

Finally we make __method_missing output the symbol, instead of just spitting out "Method missing":

    def __method_missing sym
      %s(printf "Method missing: %s\n" (callm sym to_s))
      %s(exit 1)
      0
    end



2009-11-01 22:38 UTC Writing a (Ruby) compiler in Ruby bottom up - step 21

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

I've been lazy lately... Well, not really, I've been extremely busy, but I ought to have fit this in earlier. It's gotten harder and harder to get done too, since it's now more work since I had to go back and figure out a lot of the reasons for what I'd done.

Anyway, finally a new part, though short.

Down the rabbit hole: attr_(reader|writer|accessor)

Adding attr_reader / "attr_writer" / "attr_accessor" Should be easy, right? After all, all they do is allow read/write or both of member variables.

Trouble is you can't know that in advance.

    class Class
      def attr_reader foo
        puts "Hah!"
      end
    end

class Foo attr_reader :bar end

foo = Foo.new p foo.bar

Ouch. This is part of what makes Ruby exceptionally painful to compile.

It doesn't mean we can't make some assumptions, though, as long as we can handle the worst case where someone does something stupid (later we may want to add an option to make it assume you're not being stupid, and enable additional optimizations).

So how do we do this then?

Well, the obvious answer is to implement it in Ruby. Here are naive initial implementations (that only handle a single symbol, not an array like the real thing):

    def attr_accessor sym
      attr_reader sym
      attr_writer sym
    end

def attr_reader sym define_method sym do %s(ivar self sym) end end

def attr_writer sym define_method "#{sym.to_s}=".to_sym do |val| %s(assign (ivar self sym) val) end end

Note the s-expressions that rely on a new "ivar" primitive (that's not been added yet). That part is simple enough to add, but what else does the above require to be added to the compiler?

That's the ugly part. Here's a (possibly incomplete) list:

  • define_method
  • Real Symbol class
  • Symbol#to_s
  • Real String class
  • String.to_sym
  • Indirectly the :lambda s-expression construct needs to have support for variables

This is part of the reason I picked attr_* as the next thing to implement: It's an important, frequently used piece of functionality that snowballs and help drive implementation of functionality that is actually likely to be used.

Baby steps

First take a look at this commit

This is one of our assumptions (in between some other cruft): If you use attr_accessor etc., you are likely to need a vtable entry for that method.

Remember, we still don't implement a proper "fallback" in the form of a hash table for cases where the vtable gets "too big" (whatever we decide too big is) so for now we need this, but even later it'll be a useful optimization for many cases.

Worst case? We waste a vtable slot.

Next is where we actually add the basic implementations shown above, plus some debug statements, and stub for "define_method".

The comments bring us to an important question:

To type-tag or not to type-tag?

In MRI Symbol objects are "type tagged" integers. That is, they are not real objects at all, rather each symbol is represented by a specific 32 bit value, and those values can be identified as symbols by looking for a specific bit-pattern in the least significant byte.

This has the advantage of saving space - no actual instances need to be constructed. In this instance, however, it creates a lot of complication, by requiring the type tags to be checked on each and every method call.

For this reason we will, at least for now, avoid it.

(For Fixnum we will run into this problem again, and it will be even more tricky - for Symbol the number of objects can be expected to be reasonably small, but what about Fixnum? Ugh... Lets think about that later)

Instead we will keep a hash table of allocated symbols, which we will use to return the same object for the same symbol literal

I'm going to skip over most of the commits here, and show you the current state of the Symbol class and its associated compiler changes, since I must admit my commits have been quite messy and all over the place while putting in place these changes, and it's not very condusive to explaining the actual changes.

    class Symbol
      # Using class instance var instead of class var
      # because the latter is not properly implemented yet,
      # though in this case it may not make a difference
      #  @symbols = {} # FIXME: Adding values to a class ivar like this is broken

# FIXME: Should be private, but we don't support that yet def initialize(name) @name = name end

def to_s @name end

# FIXME # The compiler should turn ":foo" into Symbol.__get_symbol("foo"). # Alternatively, the compiler can do this _once_ at the start for # any symbol encountered in the source text, and store the result. # def self.__get_symbol(name) # Symbol.new(name) # sym = @symbols[name] # if !sym # sym = Symbol.new(name) # end # sym # end end

def __get_symbol(name) Symbol.new(name) end

Uh, yes. I started out with actually having it turn symbols from text into an object at runtime, but I quickly realized this makes no sense for the case where a literal symbol is present in the program.

Note that we still need a proper Symbol#__get_symbol as commented out here later, because it is necessary to handle things like String#to_sym. However for now I've skipped it for a simple reason:

It requires implementing a hash table. It's not that hash tables are hard. But currently implementing Hash in pure Ruby likely will require features that are not in place yet... So, lets sort out the literal Symbols first, and then work our way up.

The trivial version above is well and good, but we need to make the compiler call __get_symbol..

So we replace Compiler#intern with this:

    # Allocate a symbol
    def intern(scope,sym)
      # FIXME: Do this once, and add an :assign to a global var, and use that for any
      # later static occurrences of symbols.
      args = get_arg(scope,sym.to_s.rest)
      get_arg(scope,[:sexp,[:call,:__get_symbol, sym.to_s]])
    end

As you can see from my comment, it really makes no sense to call __get_symbol over and over - it's inefficient. But it works for now. We'll go back and fix that later.

Also note that this doesn't exactly match the above commit, but also incorporate a change from a later commit: We do [:sexp ..] there because :sexp nodes don't get rewritten to a :callm, and __get_symbol here is indeed a function not a method (this distinction doesn't really exist in Ruby, but it exist at the implementation level in our compiler, because it eases interoperability with C - this may change whenever I get to the point where it makes sense to implement FFI support)

The other changes are simply there to reflect the fact that intern now returns the result of a get_arg instead of some arbitrary integer:

    -      return [:int,intern(name.rest)] if name[0] == ?:
    +      return intern(scope,name.rest) if name[0] == ?:

That's it for now, folks

I promise it won't be nearly as long to the next part. I need to untangle my changes for splat operator support, method_missing debug improvements, and work on the Array and String classes. I'll likely give each one of them a short part each before I get fully "back on track" - I intend to write the future parts in parallel with actually making the code changes.



2009-05-05 19:16 UTC Writing a (Ruby) compiler in Ruby bottom up - step 20


You may or may not have seen my recent post where I admitted to more or less having decided to make my compiler project a Ruby compiler. On the downside this means a lot of complexity that may make it harder to follow. On the upside... Well, you get to read about a compiler for a "real" language as opposed to a toy, and hopefully the result will be usable and I'll manage to contain myself and not go off on wild tangents (I have a long list of experiments I want to do)

Since the last part things have moved quite far, and I'm not going to do a "patch by patch" overview of what's happened since. Instead I will this time give a rough overview of the state of the compiler as of starting to write this - April 25th (slightly derailed by having a child). Parts of the compiler with more extensive changes will get more "coverage".

Let me again point out that all of the code is on Github. Specifically the state of the code when writing this can be found at tag "step-20c".

Furthermore, I'd like to extend an invitation to anyone who's interested in contributing. Go check out the project from Github, fork it if you like, and send me pull requests when you have something you'd like me to consider for the main tree, or just let me know if it's ok for me to pull things in whenever I see something interesting. There's now two of us - Christopher Bertels started contributing a while back, and have added lots of rdoc documentation and a range of other improvements, such as part of the instance variable support.

We've so far decided to go with the same license as Matz' version of Ruby, but I'm open to discussion about that. Specifically I would prefer the core library to be licensed under terms sufficiently loose to allow closed source apps to be compiled as well.

Anyway, lets see what things look like now.

The Parser

If you've followed my blog, you'll probably have seen the announcement that the compiler can now parse itself. I'm sure it still has bugs (it certainly did when I posted the announcement, as I've found out since), and it most certainly can't compile the AST it generates, but it's still a fairly big milestone. It also means that it can parse a small but relatively central subset of Ruby (there are many, many gaps).

The parser is split in three components:

  • The low level s-expression style parser. This part has hardly changed.
  • The operator precedence parser used for expressions. This has changed a lot, and needs a cleanup, but I'll attempt a description of the current state
  • The "high level" parser used for control structures etc.. This is a simple recursive descent parser. It also would benefit from a cleanup, but it's not hard to understand (I think).

First of all, let us take a look at something else: Test cases. Parsing Ruby is nasty. As much as I love to program in Ruby, the grammar makes me feel downright dirty. It is full of context sensitive parts, exceptions and weird rules that make it painful to write a small and readable parser for it. All the more reason to write extensive sets of test cases.

Testing the parser components

I've been using Cucumber to add test cases, mainly because so far for the parser the test cases are very repetitive "here's the input, it's meant to generate this tree" kind of tests, and Cucumber makes that look nice and readable:
Feature: Shunting Yard
    In order to parse expressions, the compiler uses a parser component that uses the shunting yard
    algorithm to parse expressions based on a table.

Scenario Outline: Basic expressions Given the expression expr When I parse it with the shunting yard parser Then the parse tree should become tree

Examples: | expr | tree | | "__FILE__" | :__FILE__ | | "1 + 2" | [:add,1,2] | | "1 - 2" | [:sub,1,2] | | "1 + 2 * 3" | [:add,1,[:mul,2,3]] | | "1 * 2 + 3" | [:add,[:mul,1,2],3] | | "(1+2)*3" | [:mul,[:add,1,2],3] | | "1 , 2" | [:comma,1,2] | | "a << b" | [:shiftleft,:a,:b] | | "1 .. 2" | [:range,1,2] | | "a = 1 or foo + bar" | [:or,[:assign,:a,1],[:add,:foo,:bar]]| | "foo and !bar" | [:and,:foo,[:not,:bar]] | | "return 1" | [:return,1] | | "return" | [:return] | | "5" | 5 | | "?A" | 65 | | "foo +\nbar" | [:add,:foo,:bar] | | ":sym" | :":sym" | | ":[]" | :":[]" |

As of writing, the parser components have 106 scenarios (each entry in the example tables counts as one scenario) including a few failing ones. Whenever we find anything broken, a new test case goes in. 106 scenarios is nowhere near complete, but it helps tremendously with debugging, particularly with the operator precedence parser, as it's been fairly tedious to adjust and adapt it to handle the peculiarities of Ruby. It also serves as documentation of sorts of what the parser is expected to deliver

The operator precedence parser

We've covered this component before of sorts (go revisit it if you're unsure about it), but it's grown significantly in complexity since, mostly to account for weirdness in the Ruby grammar, some to account for missing functionality.  A word of warning before we start: It's gotten messy, and it needs to be cleaned up and re-factored. But it's not that hard to figure out.

I'm going to assume you're familiar with the previous parts, and have a rough idea of Dijkstra's Shunting Yard algorithm which this parser component is based on. We'll examine it top-down the way it's being used:

  def self.parser(scanner, parser)
     ShuntingYard.new(TreeOutput.new,Tokens::Tokenizer.new(scanner), parser)
  end
We start by instatiating the parser and passing  the most frequently used components to it. By passing an alternative to TreeOutput you an turn the output into reverse polish notation or anything you like. But we want a parse tree. Tokens::Tokenizer is used to retrieve a stream of tokens instead of working on individual characters. Last but not least we're passing it a reference to the recursive descent parser. Yuck. We do this because of Ruby's block syntax, which because of the way blocks can look like literal hashes and/or can be chained and used as part of an expression means we need to be able to call bak out of the shunting yard parser. 
It might be that it'd be just as well to combine the two, but I don't like tight coupling and I'm still living in hope of getting the time to figure out a way to reduce the coupling between these two components, and make the shunting yard parser generic enough to be reusable. One of the promises of an operator precedence parser is to be configured with a simple table of operators, and it's possible more of the exceptions in this current model can be handled by adding a few flags.

On to the next bit:
    def parse(inhibit=[])
      out = @out.dup
      out.reset
      tmp = self.class.new(out, @tokenizer,@parser)
      res = tmp.shunt(@tokenizer,[],inhibit)
      res ? res.result : nil
    end
A couple of oddities here. I've introduced an argument to allow specifying a set of tokens to inhibit and terminate the expression. This is needed because of peculiarities in how Ruby handle commas. Some places it's ok as it separates function arguments or array elements, but you also need to be able to put expressions  IN argument lists as default assignments, and there commas are not allowed without parantheses, as they're needed to separate arguments, and the argument list itself is parsed by the recursive descent parser. It would be possible to change this, but I'm not sure I want to try to make the argument list be parsed by the same component until I'm more sure of the consequences. Besides the current arrangement works.
You can also see we do some annoying initialization - copying and resetting the tree output object. This is done because the parser recursively calls itself, and the state of the output tree builder need to be clean when it gets called for a subtree in parantheses for example.

The next bit is in a fairly shameful state at the moment, but rather than wait until it's cleaned up I'd rather show you the intermediate state, and we'll revisit again later once I have it all figured out - the shunting parser probably deserve it's own follow up article as it can get a bit hard to follow.


Leading up to the main loop, which I'll break up in several chunks:

    def shunt(src, ostack = [], inhibit = [])
      possible_func = false     # was the last token a possible function name?
      opstate = :prefix         # IF we get a single arity operator right now, it is a prefix operator
                                # "opstate" is used to handle things like pre-increment and post-increment that
                                # share the same token.
      lp_on_entry = ostack.first && ostack.first.type == :lp
      opcall  = Operators["#call#"]
      opcallm = Operators["."]
      lastlp = true

The commented local vars are hopefully reasonably understandable. "lp_on_entry" is set to true if the method is passed an :lp operator in - that means '(','[', '{' etc.. We'll see how that's used later.

opcall and opcallm are just convenience shortcuts, though the hardcoded references to Operators[] are ugly (we'll meet more of them, and they'll go as soon as I find a clean way of decoupling the logic in this method).

"lastlp" is set to indicate whether or not the last token was an :lp type operator.

      src.each do |token,op|
        if inhibit.include?(token)
          src.unget(token)
          break
        end

Ok, so this is the start of the main loop. We get the token as a string in in "token", and if it's an operator we get the operator object in "op". We check if the token is member of a set of "inhibited" tokens that we don't allow to be part of an expression. This is used to handle cases where the recursive descent parser wants an expression that stops on encountering something that is normally a legal part of the operator. We'll see it used when we look at the recursive descent parser.

        if op
          op = op[opstate] if op.is_a?(Hash)

We then start handling operator tokens - most of the loop is split between handling operators and handling non-operator tokens, with very little shared logic. In some cases there will be two operators that share the same token. Currently we handle :infix_or_postfix vs. :prefix operators, and the only case it's currently used for is "*" as multiplication operator vs. as the "splat" operator (which expands arrays).

          # This makes me feel dirty, but it reflects the grammar:
          # - Inside a literal hash, or function call arguments "," outside of any type of parentheses binds looser than a function call,
          #   while outside of it, it binds tighter... Yay for context sensitive precedence rules.
          # This whole module needs a cleanup
          op = Operators["#,#"] if op == Operators[","] and lp_on_entry

The comment above speaks for itself, no? And we see a use for lp_on_entry to help decide the precedence rule to use for the comma, by switching objects around (Note that this also shows an anti-pattern that needs to be fixed: When you have hashes, try to avoid string keys. Strings are expensive objects where symbols get mapped to an integer - for things like this using symbols as keys tends to be more efficient).

          if op.sym == :hash_or_block || op.sym == :block
            if possible_func || ostack.last == opcall || ostack.last == opcallm
              @out.value([]) if ostack.last != opcall
              @out.value(parse_block(token))
              @out.oper(Operators["#flatten#"])
              ostack << opcall if ostack.last != opcall
            elsif op.sym == :hash_or_block
              op = Operators["#hash#"]
              shunt(src, [op])
              opstate = :infix_or_postfix
            else
              raise "Block not allowed here"
            end

The whole block above gets executed if the current operator is a '{' (:hash_or_block) or :block ("do"), and it's one of the hairy bits... If we've seen what is possibly the start of a block (possibly, because '{' could also start a hash, hence the symbol), we first check if we're possibly looking at a function call (a pre-requisite for a block) based on the previous token, OR if the last operator to have been pushed on the operator stack is the function or method call operators (note: We currently still differentiate between function and method calls, though that is not a distinction Ruby makes; it is helpful for some aspects of the parser, and it is helpful for the low level s-expression syntax, but this distinction will normally become hidden from the "end-user"). 

If there's a function call being handled, then we know we're dealing with a block. We output an empty argument array if needed, and then call back up into the recursive descent parser via "parse_block" (token is passed so that the parser can determine what symbol the block should end on - '}' vs. 'end'). The "fake" operator "#flatten#" is output to aid in the tree building as the tree builder doesn't directly handle cases where a node has more than two children/arguments. This is a bit of a hack, and there is actually no real reason why a shunting yard parser can't be adapted to directly handle operators that are prefixes to multiple operands, so that may be part of the cleanup later. 

If we're NOT dealing with a function call, we check if we're looking at the :hash_or_block ('{') operator, and if so we know we're now dealing with a Hash, and we recursively call shunt to handle the interior of the Hash.

          else
            if op.type == :rp
              @out.value(nil) if lastlp
              @out.value(nil) if src.lasttoken and src.lasttoken[1] == Operators[","]
              src.unget(token) if !lp_on_entry
            end

If we are not dealing with a potential block, we move on. First, above, we check for a :rp (right parenthesis, bracket or brace) operator. We "fake" a value if we've just seen an empty pair of parentheses, so we don't have to deal with operand-less operators elsewhere. We also fake value if the last token was a comma operator. This is to handle the convenience case in Ruby where arrays or hashes can have a "dangling" comma at the end like so: [1,2,3,] (useful for symmetry and machine generating Ruby source.

            reduce(ostack, op)

We call reduce in order to fore output of any tighter binding operators. In the case of, say, 1 * 2 + 3, (* 1 2) will get output when encountering '+'.

            if op.type == :lp
              shunt(src, [op])
              opstate = :infix_or_postfix
              # Handling function calls and a[1] vs [1]
              ostack << (op.sym == :array ? Operators["#index#"] : opcall) if possible_func

If it's an lp type operator, we do as we did for Hash ('{'), except we also want to selectively output "#index#" or the "#call#" operators if we're dealing with a function call. This occurs when we have an expression like "foo(1)", in which case "possible_func" will be true after "foo" has been encountered, and we then parse "(1)" and pass the result onto the output handler, and then output the "#call#" operator to tie "foo" and it's arguments together. As the slightly misleading comment says, "#index#" is used if we see '[...]' after something that might indicate a function call, as we need to differentiate a[1] (a call to the method "[]" on the object reference held in "a") vs [1] (constructing an Array object).

            elsif op.type == :rp
              break

If we dealt with an :rp we want to exit from the loop.

            else
              opstate = :prefix
              ostack << op
            end
          end

... if not we just push the operator on the operator stack.

        else
          if possible_func
            reduce(ostack)
            ostack << opcall
          end
          @out.value(token)
          opstate = :infix_or_postfix # After a non-operator value, any single arity operator would be either postfix,
                                      # so when seeing the next operator we will assume it is either infix or postfix.
        end
        possible_func = op ? op.type == :lp :  !token.is_a?(Numeric)
        lastlp = false
        src.ws if lp_on_entry
      end
Ok, time to handle non-operator values. Not much to say about this - it's hopefully reasonably understandable. The last line about will skip whitespace including line feeds if we're inside a parenthesized sub-expression, while normally (inside the tokenizer) we only skip whitespace excluding linefeeds. This is because the general rule in Ruby is to allow linefeeds anywhere where it doesn't create an ambiguity. I'm sure there are more special cases we'll need to handle, but for now this approximates the rule closely enough.
      if opstate == :prefix && ostack.size && ostack.last && ostack.last.type == :prefix
        # This is an error unless the top of the @ostack has minarity == 0,
        # which means it's ok for it to be provided with no argument
        if ostack.last.minarity == 0
          @out.value(nil)
        else
          raise "Missing value for prefix operator #{ostack[-1].sym.to_s}"
        end
      end

reduce(ostack) return @out if ostack.empty? raise "Syntax error. #{ostack.inspect}" end

This is the last part we'll look at (for the "reduce" method, see the earlier post), and it's outside the look. It's down to handling errors and optional operands, and then reducing the operator stack completely before returning the output handler (which should at this point hopefully contain a complete expression).

The recursive descent parser

The operator precedence parser was the hard part of parsing. There's not really all that much to say about the changes to the recursive descent parser - it's all really formulaic, so I'll pick one of the larger functions and walk through that:

  def parse_block(start = nil)
    pos = position
    return nil if start == nil and !(start = expect("{")  || expect("do"))
    close = (start.to_s == "{") ? "}" : "end"
    ws
    args = []
    if expect("|")
       ws
      begin
        ws
        if name = parse_name
          args << name
          ws
        end
      end while name and expect(",")
      ws
      expect("|")
    end
    exps = parse_block_exps
    ws
    expect(close) or expected("'#{close.to_s}' for '#{start.to_s}'-block")
    return E[pos,:block] if args.size == 0 and !exps[1] || exps[1].size == 0
    E[pos,:block, args, exps[1]]
  end

This is what the shunting yard parser calls back into. Most of this should be fairly understandable assuming you remember that "expect" tries to match a string, and "ws" skips past whitespace. 
The major new thing here is "position" and that "E[]" stuff. "position" is a method that returns what the scanner thinks is the current position. It isn't guaranteed to be 100% accurate all the time as the scanner doesn't itself keep track of the length of lines it has scanned past, so it depends on the token being "unget" to contain a postition for that (see scanner.rb - not going in depth into that). 

In reality it works well enough to give reasonable error reporting during parsing. But what about errors identified in later stages?

That's where E[] comes in. It's a shortcut to create an instance of a subclass of Array: AST::Expr. We'll be using this class to carry additional annotation of the nodes in the parse tree in later stages. For now the additional element is the position, which it takes from the first argument that either is a position or has one. See ast.rb for the details. 

Compiling the syntax tree

The compiler class itself hasn't changed all that much since last time apart from much better error reporting (see error()). The most significant worthwhile change to look at is the implementation of basic support for instance variables in objects. As for vtables for classes we're taking major shortcuts to start with.

The magic starts in "scope.rb".  Specifically we now keep an array of instance variables we've seen for this current class. This chunk in Scope#get_arg will return [:ivar, offset] where "offset" is the number of the "slot" of PTR_SIZE values starting from the beginning of an instance object that this instance variable belongs to.
    # instance variables.
    # if it starts with a single "@", it's a instance variable.
    if a.to_s[0] == ?@ or @instance_vars.include?(a)
      offset = @instance_vars.index(a)
      add_ivar(a) if !offset
      offset = @instance_vars.index(a)
      return [:ivar, offset]
    end
At the moment the above contains a bug, as the instance variables will actually get added too late for the "new" method to correctly allocate the object of the right size. So in the next batch of changes will be one to "visit" the nodes of a class definition to pre-allocate the instance variable offsets.
This current approach also has another problem: In Ruby you can dynamically add instance variables to an object. There's no guarantee that all (or any) of the instance variables identified will actually be added to any specific instance. Here the compiler will need to make a trade-off:

We can allocate space for all the instance variables we see (we still need a way to handle dynamically allocated ones) or we can pick a subset that is likely to be present on most objects (if it's set in "initialiaze" for example). For dynamically allocated ones we can set aside space for pointers to a hash table to hold them. We can also use a level of indirection to "pack" the instance variables more to avoid having to resort to the hash tables as often as we otherwise might do. There's no guarantee any specific one of these strategies will be best for exactly your app, so there's room for switches here, or have the system dynamically optimize the choices at runtime (at the cost of more complexity). For now, though, we'll just allocate sufficient space for everything we see, and ignore dynamically allocated ones. Then later, we'll add support for handling dynamically allocated ones, and then we can start looking at optimizations. As usual we take the shortcut that bring us functionality first.

The compilation of the instance variables is pretty trivial - load and saves happens in compiler.rb, in compile_assign and compile_eval_arg respectively. Here's what we do in compile_assign:
    if atype == :ivar
      ret = compile_eval_arg(scope,:self)
      @e.save_to_instance_var(source, ret, aparam)
      return [:subexpr]
    end
Emitter#save_to_instance_var just stores the value to the appropriate offset into the object.

The support libraries

We've barely started on the support libraries, so this will be short, but it's worth taking a brief look at what's in the main tree so far, not least because it demonstrates how I intend to proceed - even core elements of the object model, such as the Object and Class classes will largely be written in Ruby, and make use of only very limited bootstrapping help from the compiler. Only a small number of elements will require us to "dip out" of pure Ruby, and in those cases we'll as much as possible rely on the s-expression syntax that allow direct access to the AST. Doing C etc. is a last resort or temporary workaround only - a goal of this project is minimal dependencies.

Here's what we use to bootstrap Class:
def __new_class_object(size)
  ob = malloc(size)
  %s(assign (index ob 0) Class)
  ob
end

class Class def new # @instance_size is generated by the compiler. YES, it is meant to be # an instance var, not a class var ob = malloc(@instance_size*4) %s(assign (index ob 0) self) ob end end

Calls to __new_class_object is called by the bootstrap code generated by the compiler (in compile_class). It just uses the C-library's malloc() for now, and assigns a pointer to the class Class in the first slot.
Class#new is implemented so that it will take the instance size from each subclass (that's why we use an instance variable of the class instead of a class variable - class variables are shared across subtrees of the inheritance tree in Ruby), and then does more or less what __new_class_object does.

It's worth noting that we make this raw class reference available as the instance variable @__class__ in the objects. Why not as "class"? Well, the Ruby object model is finicky. Modules and meta-classes are inserted into the inheritance chain (for good reason - it makes things a lot easier), but the are "hidden" from you when you access self.class. So @__class__ is an implementation detail that should not be dependent on other than to help us implement the core classes. The same is true for "@instance_size" in class Class.

Final words (for now)

As usual, if you have questions or comments, they are welcome. Get all of the code from Github. The state of the code when writing this can be found at tag "step-20c".

And feel free to get in touch to contribute, or just fork the code and hack away.



There's already a ton of changes since I started writing this part. The next milestone is to get the compiler to compile itself - that's probably 2-3 articles away. The first step will be to get it to generate code that will link, but I fully expect it to have plenty of problems that will prevent it from running... An article on debugging the code generation is likely to be forthcoming at that stage...
Once it can compile itself, my next goal is to work on getting it to compile mspec, in order to get a measure of what is missing from Ruby. It will require quite a few fixes to the parser.





About me

E-mail: vidar@hokstad.com Skype: vhokstad
Twitter: vhokstad
View my LinkedIn profile.

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married and we just had our first child, Tristan Ikemefuna Hokstad.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Twitter Updates

    follow me on Twitter