Writing a (Ruby) compiler in Ruby bottom up - step 33 2014-04-06


This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

It took me two months to push this out, but I promise that doesn't mean I'm about to stop publishing these anytime soon. I intend to shorten my "lead time" further, so I'm hoping to publish the next one in 2-3 weeks' time.

Register Allocation

We saw in some of the recent parts that the code is getting more and more prone to problems due to attempts at reusing registers.

Register allocation is the proper solution to this. That is, we introduce a mechanism for determining what code gets to use which registers when, and that properly handles the situation where too few registers are available (by "spilling" registers onto the stack).

Let me first make it clear that the really naive alternative to this is to avoid register allocation entirely, and to declare a few registers to be "scratch" registers that can only be safely used within code where no code generation is delegated further down.

This is possible to do by pushing all intermediate results onto the stack, and popping them off only briefly to carry out necessary operations and push them back onto the stack again (optionally with simple optimizations to operate directly on the top of the stack for single operand operations or when possible with the last of multiple operand instructions).

Basically, this means you're "emulating" a stack machine with registers.

This is an approach taken by many simple compilers for unoptimized code, deferring the problem of register allocation to optimization passes.
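To make that concrete, here is a rough sketch (not code from the compiler) of what a purely stack-based version of the compile_2 helper we'll revisit later could look like, reusing emitter helpers from earlier parts; the only register touched directly is a scratch register, and only after all nested code generation is done:


      # Sketch only: dual-operand helper with no register allocation at all.
      # Both operands go via the stack; %ecx is used purely as a short-lived
      # scratch register once no further code generation can clobber it.
      def compile_2(scope, left, right)
        @e.pushl(compile_eval_arg(scope, left))         # left result onto the stack
        @e.save_result(compile_eval_arg(scope, right))  # right result into %eax
        @e.popl(:ecx)                                   # left back into the scratch reg
        yield :ecx                                      # e.g. emits addl %ecx, %eax
        [:subexpr]
      end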

In reality we're very close to that, and in some respects I wish I'd kept purely to that approach, not least because it took me two months to get time to wrap up this part properly. But on the other hand, doing very basic register allocation is not that hard, and since we've started down that path we might as well go a bit further.

Only a bit mind you.

In fact, in this part we will refactor some code so that it needs register allocation less, to the point where we can get away with only minor adjustments to the original, really primitive allocator.

Optimal register allocation is hard, and tends to require analysis that in turn requires, or at least strongly favors, a low level intermediate representation that allows more in-depth analysis of register usage.

A first step is generally to analyse "liveness" of each variable, including temporary variables introduced by the compiler. Then the problem basically boils down to finding a way to map as many variable references as possible to registers so that you don't exceed the available number of registers for your architecture at any one point.

The "at any one point" part adds complexity, since reuse of registers is central to minimizing the number of memory accesses, which is the main goal of register allocation:

For most systems, memory accesses are substantially slower than register accesses (though this is less of an issue in these days of multi-level processor caches than in the old days, when every memory access went to external memory; when you fail to hit the cache, though, you are far more brutally penalized on modern CPUs, as the gap between CPU performance and memory latency has widened).

If you have n registers, and no more than n variables are "live" at once, then you have it easy, provided you are able to determine which variables are "live".

The fun begins when (and on an architecture like i386, this is likely to be pretty much all the time) the code block you are allocating registers for is referencing more stuff than you have registers for.

This either means "spilling" the contents of a register to memory in the middle of a function in order to load another variable, or keeping some variables you'd like to have in a register in memory instead, except while actually operating on them.

That's where the challenge arises.

(though Ruby reduces this problem through other inefficiencies: almost everything is accessed via methods, which reduces the number of variables that are easy to put into registers; we may hope to improve on that later, but for now it reduces the immediate benefit of register allocation)

A typical approach is graph coloring, which I'm not going to go into at all, as it's slow and complicated. A more recent approach is linear scan allocation, covered on the Wikipedia page for register allocation, which is popular for JITs etc. because it is in general a lot faster (and simpler), at the cost of some performance in the generated code.

In any case, down the rabbit hole you go...

I have no desire to deal with that anytime soon, and especially not on an architecture like i386 where the number of registers is so small. On, say, x86-64, or M68k it'd be a different matter, as you have vastly better chances of being able to keep a substantial proportion of intermediate results in registers and largely avoid spilling the results to the stack, and so you're a real sucker if you don't take maximal advantage of the registers (however, as mentioned above, without other work, we'd have problems taking full advantage).

(translation: I'll have to deal with this some day, but not now)

What we will do for now is to mostly rely on keeping stuff in memory (including on the stack), except for short term loading stuff into %eax when operating on it. As needed we'll also use %edx (see below). Beyond that, we will just guess wildly at some heuristics...

Apart from the guessing wildly at some heuristics part, this is roughly what we've done so far anyway...

But so far we've done the register allocation extremely ad-hoc, so it is worth first establishing our actual conventions. We start with the C calling convention, since we're trying to allow our code to call C code directly:

C calling convention on i386

We roughly follow the cdecl calling convention for i386.

We will aim for %eax, %ecx and %edx to be caller saved. That is, if we rely on them to maintain their values, we will push them onto the stack before calling another method.

In practice we will treat %eax as a scratch register that is only used for short windows, and so we will rarely if ever want to bother to save it, especially as it is also generally used as the return value anyway.

Everything else is to be callee saved (but see the caveat below); that is, if they are used in the called method, they will be saved there, and must be restored to their original values by the time the method returns.

And we store arguments on the stack.

Next we add on some extensions:

Additions for splats and arity

Specifically, we store the number of arguments in %ebx, as methods need to be able to determine the precise number of arguments, whether due to "splats" or to throw appropriate exceptions to enforce arity (ensuring that the number of arguments matches the declared number of arguments for the method; we don't do this yet).

Note that this breaks the "cdecl" calling convention. However this is only a problem if C code (or other code conforming to the cdecl calling convention) calls into our code.

For now we're not sticking our head into that hornet's nest (though you need to be aware of this if you want to experiment with the code and use C functions like qsort that take callbacks), so it's a non-issue. If/when we do, we'll either need to consider preserving %ebx, or more likely we'll need to bridge calls, as the method will likely require %ebx to hold the number of arguments for more than just the splats to work (e.g. Ruby methods are expected to throw ArgumentError if called with the wrong number of arguments, which will eventually force us to add additional checks at the start of each method).

As a result, %ebx is in practice free other than on entering a function, as we save it to the local stack frame. I'm of two minds about whether to keep passing the number of arguments this way, as the code has gotten to the point where we end up temporarily pushing it onto the stack on call anyway, and pushing it onto the stack again afterwards. The only real current benefit of using a register for it is that we can directly call C functions that are totally unaware of how we pass the number of arguments.

These C functions don't know about the "fake" numargs "variable" anyway, and so we could as an alternative introduce a way of indicating we want the C calling convention, which would just leave out numargs entirely. It'd be slightly more hassle to use, but it'd only ever be used for low level %s() code anyway.

We're not going to change this right now, regardless.

What's left?

Well, on i386 we have %ecx, %edx, %esi and %edi. The "problem" is that all of them have special purposes in some cases. We've run into one of them: %edx is used to hold the high 32 bits of the dividend for idivl, and the remainder of the result, and needs special treatment as a result. There are also a number of floating point registers.

We also have %ebp which we use for the frame pointer to access local variables and method arguments via, and %esp, the stack pointer.

As discussed above, we could in theory use %ebx, with the caveat that while the splat handling is as it is, it'll get clobbered often, so for now I've kept it out.

(There are many other register names on i386, but many of them alias 16- and 8-bit portions of the 32-bit registers referred to above; as someone who grew up with M68k asm, where this kind of thing did not occur, in general I'll pretend I never heard about that wart)

What to use the free registers for?

As I said, we'll guess wildly.

As a first approximation, we'll make %esi hold self from the first reference to self in each method. We'll cover the implementation of that in the next part, so for now we will ignore %esi. This is because method calls and instance variable access both require self, and so we can assume it will be quite frequent, and as I will show next time, this pans out quite nicely.

I toyed with the idea of making %edi hold self.class, based on the same assumption, but this has two problems: It reduces the set of registers available to us further, and it complicates handling of eigenclasses:

The "internal" inheritance chain for Ruby objects can actually change at nearly any time, if someone decides to extend the object with methods tied to the objects meta class / eigenclass. If we cache self.class that limits us in how to handle that further down the line. So for now I won't do anything about that.

This leaves us with %ecx, %edx and %edi free for allocation so far. We further allow "forced" allocation of specific registers in order to handle cases like idivl mentioned earlier.

At the same time we'll make some small changes that minimize the amount of time temporary values are kept in registers, which ensures we never need many of them.

Then we'll add a "quick and dirty scan" step that will simply create a set of candidate variables as follows:
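In condensed form (the real code lives in transform.rb and is walked through below), the scan just counts how often each ordinary variable name is referenced in a method body, and then orders the candidates by descending reference count:


      freq = Hash.new(0)                    # reference count per variable name
      # ... inside the scan, for every leaf symbol n that is a variable:
      freq[n] += 1 if !is_special_name?(n)
      # ... and once the body (e[3]) has been scanned, store the candidates in priority order:
      e[3].extra[:varfreq] = freq.sort_by {|k,v| -v }.collect{|a| a.first }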

Note that at this point we could go the next step and store information about liveness ranges, and use that to allow register usage to overlap, but one step at a time. The main thing is to get the overall mechanism in place.

What will we do with this information?

As little as possible. On accessing any local variable or argument, we check if it's in a register first; if not, we try to load it into a free register if there is one, and if not, we fall back on the current behaviour of using %eax (and reloading / saving the variable on each access).

This totally sidesteps the need for liveness analysis, but it also means we don't get the full benefit, and it's easy to create pathological cases where it performs really badly (e.g. lots of variables accessed a few times each, one after the other, and then, once you've referenced enough variables, referencing them all again right after they've been evicted from the registers; repeat - in this way you can construct cases where you get no benefit from the register allocation at all, whereas if you'd kept a subset of the variables in registers throughout the code block, you would). But this is still no worse than the code we currently generate.

Abstracting the registers out of Compiler

A number of methods currently refer to registers through symbols or even strings that directly reference them. A first step is to try to abstract away more of this, so that we as much as possible request registers from the Emitter, and the registers can be replaced later if possible/necessary.

As a bonus, this will set us on the right path to make the compiler easier to retarget to other architectures.

This first changeset adds methods for the stack pointer and for using %ebx as a scratch register, plus a couple of other helpers, as well as starting to clean up the Compiler class so it doesn't explicitly mention registers by name in so many places. You can see it in 92aa83f - I won't cover that commit in more detail.

Breaking our current primitive allocator

We already have with_register, which sort of tried to do a little bit of very basic register allocation. But with_register comes with substantial caveats: it knows nothing about what to do if it runs out of registers to allocate. So let's try to provoke it into failing, so we have something to fix to start with:


        %s(printf "%d = 2\n" (div (div (div 16 2) 2) 2))

I picked div because our recent rewrite away from runtime.c left us with code in compile_div that already does two nested with_register calls, so it's well suited to ensuring we run out of registers under our current regime.

You should get something like this if you try to compile it:

./emitter.rb:364:in `with_register': Register allocation FAILED (RuntimeError)

The problem is simply that we've only set aside %edx and %ecx, and furthermore we're not doing anything to allow spilling the values to the stack, or even to make them available again when calling into another function.

Triggering this in actual Ruby code is harder, as almost everything quickly devolves into method calls that effectively "reset" the register allocations, by virtue of the calling convention's requirements for who saves which registers.

But let's sort this out first anyway.

First, since this revealed the danger of the nested with_register calls, which will remain inefficient at best, let's address the simpler case of compile_2, which we've used for dual-operand arithmetic and comparisons:

(You'll find these changes and the div test case in 0fdfb34)


     def compile_2(scope, left, right)
        src = compile_eval_arg(scope,left)
        @e.with_register do |reg|
          @e.movl(src,reg)
          @e.save_result(compile_eval_arg(scope,right))
          yield reg
        end
        [:subexpr]
      end

Just a tiny change here: we've lifted the compile_eval_arg(scope,left) call out. Frankly, we should probably remove the need for the register entirely, and just use the stack. I'll consider that later. In the meantime this reduces register usage substantially in cases where the left-hand expression tree is deep.

The bigger change is compile_div:


      def compile_div(scope, left, right)
        @e.pushl(compile_eval_arg(scope,left))
    
        res = compile_eval_arg(scope,right)
        @e.with_register(:edx) do |dividend|
          @e.with_register do |divisor|
            @e.movl(res,divisor)
    
            # We need the dividend in %eax *and* sign extended into %edx, so
            # it doesn't matter which one of them we pop it into:
                    
            @e.popl(@e.result)
            @e.movl(@e.result, dividend)
            @e.sarl(31, dividend)
            @e.idivl(divisor)
          end
        end
        [:subexpr]
      end

This is pretty much a rewrite - the old version was horribly messy and brittle, because it tried to work around not being able to tell with_register exactly what it wanted.

So to simplify, let's make with_register handle that.

And at the same time we lift the argument evaluation out of the actual division, which massively simplifies things:

First we evaluate the left, and push the result onto the stack. Then we evaluate the right, and leave the result.

Then we forcibly allocate %edx, since we know idivl needs it both as part of its input (the high 32 bits of the dividend) and for the remainder. Then we allocate a register for the divisor (right hand).

We then pop the dividend (left hand) off the stack into %eax, and copy it into the allocated register, which will be %edx. We then shift that copy right arithmetically by 31 bits, so that %edx is filled with the sign bit (see last time we discussed division for more on this).

Finally we do the division.

You may note that this is going to be wasteful in cases where the left hand expression is a static value etc. that we could just load directly into the right register. Adding some code to detect that will save some stack manipulation. But that's not a priority now.

The corresponding change to emitter.rb to expand with_register looks like this for now:


    -  def with_register
    +  def with_register(required_reg = nil)
         # FIXME: This is a hack - for now we just hand out :edx or :ecx
         # and we don't handle spills.
     
         @allocated_registers ||= Set.new
    -    free = @free_registers.shift
    +
    +    if required_reg
    +      free = @free_registers.delete(required_reg)
    +    else
    +      free = @free_registers.shift
    +    end
    +

In other words, this will still blow up spectacularly if we need more registers.

Note that the changes we've made mean the register allocation now only surrounds a handful of instructions doing calculation, and no further calls.

Luckily this means that in reality we shouldn't need more. We'd like more, but most of our register usage will only be for caching variables we've allocated space in memory for.

As such, we know the worst case for the register allocation is that we need to juggle %edx around, as that's the only register we specifically need (so far) for idivl.

As "stage 1" this gets our previous example to compile and run.

For now we'll move on to the meatier part of register allocation: Moving variables into registers.

Identifying variables for our registers

You're not going to like this part. It is hairy and ugly, and involves changing #rewrite_let_env and find_vars in transform.rb, which are not exactly the prettiest part of the compiler to begin with.

The relevant part of rewrite_let_env is the simplest:


     # We use this to assign registers 
    
     freq   = Hash.new(0)
     
     vars,env= find_vars(e[3],scopes,Set.new, freq)

We set aside a Hash with 0 as the default value, to keep count of the variable references we see.

Then we assign the frequency data to a new extra field of the AST nodes:


          e[3].extra[:varfreq] = freq.sort_by {|k,v| -v }.collect{|a| a.first }

extra is simply this addition to AST::Expr (in afdcd99):


    +
    +    def extra
    +      @extra ||= {}
    +    end

Back to transform.rb, the changes to find_vars are more extensive, but most of them are about threading freq through the recursive calls, and some refactoring (especially breaking out #is_special_name). The meat of it is this little change:


          elsif n.is_a?(Symbol)
            sc = in_scopes(scopes,n)
            freq[n] += 1 if !is_special_name?(n)

Specifically, when we get to a "leaf", if it is not a "special" name as defined in the new utility method, we count it as a variable reference to take into account for the later allocation. You can find the full changes to transform.rb here: 3f2ab88
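As a purely illustrative example (the names and counts are made up, not taken from a compiler run), a method body that references a four times, c twice and b once would end up with frequency data and candidate ordering like this:


      freq = {:a => 4, :b => 1, :c => 2}
      freq.sort_by {|k,v| -v }.collect{|a| a.first }   # => [:a, :c, :b]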

An overview of the register allocator

I've split out the "backend" of the new register allocation code in regalloc.rb in 608adc1.

It could probably do with some refactoring already, but that will have to wait.

Most of the explanations here can be found in the source as well.

The RegisterAllocator class contains the Cache class, which is used to hold information about the state of a register that is currently being used to hold a variable.


      class Cache
        # The register this cache entry is for
        attr_accessor :reg
    
        # The block to call to spill this register if the register is dirty (nil if the register is not dirty)
        attr_accessor :spill
    
        # True if this cache item can not be evicted because the register is in use.
        attr_accessor :locked
    
        def initialize reg
          @reg = reg.to_sym
        end
    
        # Spill the (presumed dirty) value in the register in question,
        # and assume that the register is no longer dirty. @spill is presumed
        # to contain a block / Proc / object responding to #call that will
        # handle the spill.
        def spill!
          @spill.call if @spill
          @spill = nil
        end
      end
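Just to make the fields concrete, here is a small hypothetical usage sketch; the allocator drives Cache entries itself, so nothing in the compiler actually uses the class directly like this:


      c = Cache.new(:edx)
      c.spill = lambda { STDERR.puts "write %edx back to its stack slot" }
      c.locked = true     # pin the entry while a temporary computation uses the register
      c.locked = false
      c.spill!            # runs the spill handler (if any) and then clears it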

The RegisterAllocator class itself is initialized like this:


      def initialize
        # This is the list of registers that are available to be allocated
        @registers      = [:edx,:ecx, :edi]
    
        # Initially, all registers start out as free.
        @free_registers = @registers.dup
    
        @cached = {}
        @by_reg = {} # Cache information, by register
        @order = {}
      end                                

The @cached and @by_reg hashes contain the Cache objects referenced by variable name and register respectively. @order gets assigned an array of variables in descending order of priority. As previously discussed, this in effect means in descending order of the number of times each variable is referenced at this point.

Additionally, #with_register initializes two more instance variables as needed: @allocated_registers and @allocators, which hold information on which registers have been allocated for temporary use, and debug information about the code that called #with_register, respectively.

I won't go through every little method - you can look at the commit (and feel free to ask questions if anything is not clear) - but we'll take a look at #cache_reg!, which is used to cache a variable in a register; evict, which is used to remove a variable from the registers (that is, remove the association in the compiler - we do not generate code to actually clear the register); and #with_register, which is used to allocate temporary registers.

#cache_reg! will only allocate registers for variables that have been "registered" with the register allocator. First we try to obtain a register from the list of free/unallocated/uncached registers:


      def cache_reg!(var)
        if !@order.member?(var.to_sym)
          return nil
        end
        free = @free_registers.shift
        if free
          debug_is_register?(free)

If a register is found, we create a new Cache object referencing it and return the register. If not, we (for now) just output a warning and return nil.


          c = Cache.new(free)
          @cached[var.to_sym] = c
          @by_reg[free] = c
        else
          STDERR.puts "NO FREE REGISTER (consider evicting if more important var?)"
        end
        free
      end

We'll look at how this is used later.

evict has the opposite role: when we need to ensure that a variable is retrieved from memory next time, we pass it to #evict.

We iterate over the variables passed, and try to delete them from the cache. If they were in fact there, we call the spill handler if one was set. The spill handler is any object (such as a Proc or lambda) that responds to #call, and it is its responsibility to store the contents of the register back to memory before it is freed.

This is only assigned if the in-register version of the variable is intentionally modified.

We then add the register back to the list of free registers.

For convenience, I've added an #evict_all method that evicts all currently cached variables.


      def evict vars
        Array(vars).collect do |v|
          cache = @cached.delete(v.to_sym)
          if cache
            r = cache.reg
            debug_is_register?(r)
            cache.spill!
            @by_reg.delete(r)
            @free_registers << r
          end
          r ? v : nil
        end.compact
      end
    
      def evict_all
        evict(@cached.keys)
      end
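As a hypothetical usage example (the variable names are made up), evicting a couple of specific variables and then everything else would look like this:


      # Force 'a' and 'b' back to memory; their spill handlers run if the
      # cached values were marked dirty. The return value lists the names
      # that were actually cached in a register.
      @allocator.evict([:a, :b])

      # ... and before e.g. generating a call, drop everything still cached:
      @allocator.evict_all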

The longest method in the register allocator for now is #with_register. This has been lifted from Emitter and adapted, and in fact we've covered changes to that one above, but I'll go through it from scratch.

First we check if the client has requested a specific register. If so, we try to retrieve this register. If not, we get the first one available. This is the change we did above to the Emitter version.

If none was immediately available, we go through the cached variables from least frequently used (in this method) to most, and evict the first one that is not "locked" in place.

(Note, there's a potential problem here: the client code does need to be able to handle the case where the preferred register is already taken, as we currently don't go on to evict the variable specifically allocated to that register. I punted on that when changing the version previously in Emitter - we probably should fix it, but not now.)


     # Allocate a temporary register. If specified, we try to allocate
      # a specific register.
      def with_register(required_reg = nil)
        @allocated_registers ||= Set.new
    
        if required_reg
          free = @free_registers.delete(required_reg)
        else
          free = @free_registers.shift
        end
    
        if !free
          # If no register was immediately free, but one or more
          # registers is in use as cache, see if we can evict one of
          # them.
    
          if !@cached.empty?
            # Figure out which register to drop, based on order.
            # (least frequently used variable evicted first)
            @order.reverse.each do |v|
              c = @cached[v]
              if c && !c.locked
                reg = c.reg
                evict(v)
                free = reg
                break
              end
            end
          end
        end

The next part is simply debug output, primarily in case we actually run out of registers, which means we're trying to use more temporary registers at one time than the allocator has registers available total (3 currently). It should not happen if we are careful about what we allocate temporary registers for:


        debug_is_register?(free)
    
        if !free
          # This really should not happen, unless we are
          # very careless about #with_register blocks.
    
          STDERR.puts "==="
          STDERR.puts @cached.inspect
          STDERR.puts "--"
          STDERR.puts @free_registers.inspect
          STDERR.puts @allocators.inspect
          raise "Register allocation FAILED"
        end

And some more debug support:


       # This is for debugging of the allocator - we store
        # a backtrace of where in the compiler the allocation
        # was attempted from in case we run out of registers
        # (which we shouldn't)
        @allocators ||= []
        @allocators << caller

Finally, we mark the register as allocated, yield to the client code, and free the register again.


        # Mark the register as allocated, to prevent it from
        # being reused.
        @allocated_registers << free
    
        yield(free)
    
        # ... and clean up afterwards:
    
        @allocators.pop
        @allocated_registers.delete(free)
        debug_is_register?(free)
        @free_registers << free
      end
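To recap how this is meant to be used (this mirrors what compile_div already does via the Emitter), a caller can optionally force a specific register and nest a second, unconstrained allocation inside it:


      @allocator.with_register(:edx) do |dividend|   # forced: dividend == :edx
        @allocator.with_register do |divisor|        # whatever is free, e.g. :ecx
          # ... emit the handful of instructions that need both registers ...
        end
      end                                            # both registers are freed again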

Tying the allocator into the Emitter and Compiler

The changes to compiler.rb are fairly minor. We will go through the changes to each of get_arg, output_functions, compile_if, compile_let and compile_assign from e1049e3 separately, and intersperse that with how it ties in with the changes to Emitter in the same commit.

Let's start with #get_arg:


    @@ -86,7 +86,7 @@ class Compiler
       # If a Fixnum is given, it's an int ->   [:int, a]
       # If it's a Symbol, its a variable identifier and needs to be looked up within the given scope.
       # Otherwise, we assume it's a string constant and treat it like one.
    -  def get_arg(scope, a)
    +  def get_arg(scope, a, save = false)
         return compile_exp(scope, a) if a.is_a?(Array)
         return [:int, a] if (a.is_a?(Fixnum))
         return [:int, a.to_i] if (a.is_a?(Float)) # FIXME: uh. yes. This is a temporary hack
    @@ -94,7 +94,23 @@ class Compiler
         if (a.is_a?(Symbol))
           name = a.to_s
           return intern(scope,name.rest) if name[0] == ?:
    -      return scope.get_arg(a)
    +
    +      arg = scope.get_arg(a)
    +
    +      # If this is a local variable or argument, we either
    +      # obtain the argument it is cached in, or we cache it
    +      # if possible. If we are calling #get_arg to get
    +      # a target to *save* a value to (assignment), we need
    +      # to mark it as dirty to ensure we save it back to memory
    +      # (spill it) if we need to evict the value from the
    +      # register to use it for something else.
    +
    +      if arg.first == :lvar || arg.first == :arg
    +        reg = @e.cache_reg!(name, arg.first, arg.last, save)
    +        return [:reg,reg] if reg
    +      end
    +
    +      return arg
         end

The comment says almost all that needs to be said. The main thing to notice here is the "save" argument, which we'll see used later by compile_assign. Let's take a look at Emitter#cache_reg!.

First we see if this variable is currently cached in a register.

If save was passed, we mark this entry as dirty, so that if the variable is later evicted from the register, we spill the value back to memory. Regardless of whether or not we got a register back, we return, as we certainly don't want to load the old value from memory just to overwrite it and then spill it later.

(Note that we could add code to request a register and fill it with the modified value, and immediately mark it dirty; in some cases this might be worthwhile by potentially saving us a load later on, but that's speculative enough that I'd want to do real tests first)


      def cache_reg!(var, atype, aparam, save = false)
        reg = @allocator.cached_reg(var)
    
        if (save)
          mark_dirty(var, atype, aparam) if reg
          return reg
        end

Then we output some comments for debugging purposes, and to make it easier for us to examine the results of the allocation later on (we might strip this out later). If the variable was not already cached in a register, we try to request a register to cache it in, and if we get one, we load the variable into it:


        comment("RA: Already cached '#{reg.to_s}' for #{var}") if reg
        return reg if reg
        reg = @allocator.cache_reg!(var)
        return nil if !reg
        comment("RA: Allocated reg '#{reg.to_s}' for #{var}") if reg
        comment([atype,aparam,reg].inspect)
        load(atype,aparam,reg)
        return reg
      end

Let us also take a quick look at Emitter#mark_dirty:

The most important part here is the lambda that is installed to handle spills. It simply outputs another debug comment, and saves the register back where it came from:


      def mark_dirty(var, type, src)
        reg = cached_reg(var)
        return if !reg
        comment("Marked #{reg} dirty (#{type.to_s},#{src.to_s})")
        @allocator.mark_dirty(reg, lambda do
                                comment("Saving #{reg} to #{type.to_s},#{src.to_s}")
                                save(type, reg, src)
                              end)
      end
    

As for Compiler#output_functions, the only big change there is that it now passes the variable frequency information:


        varfreq = func.body.respond_to?(:extra) ? func.body.extra[:varfreq] : []
        @e.func(name, func.rest?, pos, varfreq) do

So let's see what Emitter#func does with it. It ensures all registers are evicted before we generate code for a new function, as the function obviously can't control where it is called from. We then install the new frequency information in the register allocator. And on the way out, we evict the registers again for good measure. The latter is actually necessary in case any of the registers need to be spilled; the former is a precaution - the registers ought to have been evicted before we get there.


    -  def func(name, save_numargs = false, position = nil)
    +  def func(name, save_numargs = false, position = nil,varfreq= nil)
         @out.emit(".stabs  \"#{name}:F(0,0)\",36,0,0,#{name}")
         export(name, :function) if name.to_s[0] != ?.
         label(name)
    @@ -479,12 +518,17 @@ class Emitter
         lineno(position) if position
                            @out.label(".LFBB#{@curfunc}")
    
    +    @allocator.evict_all
    +    @allocator.order(varfreq)
         pushl(:ebp)
         movl(:esp, :ebp)
         pushl(:ebx) if save_numargs
         yield
         leave
         ret
    +
    +    @allocator.evict_all
    +
                              emit(".size", name.to_s, ".-#{name}")
         @scopenum ||= 0
         @scopenum += 1
    @@ -494,6 +538,7 @@ class Emitter
       end

The change to Compiler#compile_if is simple: We simply need to explicitly pass the register to test, rather than rely on it always being %eax:


      def compile_if(scope, cond, if_arm, else_arm = nil)
        res = compile_eval_arg(scope, cond)
        l_else_arm = @e.get_local
        l_end_if_arm = @e.get_local
        @e.jmp_on_false(l_else_arm, res)
        compile_eval_arg(scope, if_arm)
        @e.jmp(l_end_if_arm) if else_arm
        @e.local(l_else_arm)
        compile_eval_arg(scope, else_arm) if else_arm
        @e.local(l_end_if_arm) if else_arm
        return [:subexpr]
      end

In compile_let the only change is that we want to ensure we evict all registers caching variables that the %s(let ...) node aliases, as otherwise we would be using values from the wrong variable:


          @e.evict_regs_for(varlist)
          @e.with_local(vars.size) { compile_do(ls, *args) }
          @e.evict_regs_for(varlist)

In compile_assign, our only concern is passing a truthy value for "save":


          args = get_arg(scope,left,:save)

Optimizing temporary register usage for already cached values

We'll do one tiny little additional change, and then we'll look at the resulting code.

In e74d92d we introduce Emitter#with_register_for:


      def with_register_for(maybe_reg)
        c = @allocator.lock_reg(maybe_reg)
        if c
          comment("Locked register #{c.reg}")
          r = yield c.reg
          comment("Unlocked register #{c.reg}")
          c.locked = false
          return r
        end
        with_register {|r| emit(:movl, maybe_reg, r); yield(r) }
      end

The purpose of this is a small optimization for cases where we need a temporary register to hold a variable, but already have the variable in a register. We then just need to ensure it doesn't get evicted, the same as if we had allocated a temporary register.

And here's the only place we use it so far:


      def compile_2(scope, left, right)
        src = compile_eval_arg(scope,left)
        @e.with_register_for(src) do |reg|
    #      @e.emit(:movl, src, reg)
          @e.save_result(compile_eval_arg(scope,right))
          yield reg
        end
        [:subexpr]
      end

If the variable is already in a register, it can save us a movl.

A quick look at the resulting code

Let us compile the example code from earlier. It should give something like this:

    __method_Object_foo:
        .stabn  68,0,2,.LM336 -.LFBB61
    .LM336:
    .LFBB61:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $36, %esp
        .stabn  68,0,3,.LM337 -.LFBB61
    .LM337:
        subl    $20, %esp
        movl    $5, -20(%ebp)
        movl    $2, -8(%ebp)
        movl    $5, -12(%ebp)
        movl    $10, -16(%ebp)
        # RA: Allocated reg 'edx' for a
        # [:lvar, 0, :edx]
        movl    -8(%ebp), %edx
        # Locked register edx
        # RA: Allocated reg 'ecx' for b
        # [:lvar, 1, :ecx]
        movl    -12(%ebp), %ecx

Here we see the first uses, and as you can see from the "Locked" comment above, this also made use of the optimization where we'd previously have allocated another register and moved or reloaded the variable:


        movl    %ecx, %eax
        addl    %edx, %eax
        # Unlocked register edx

And here we're assigning the result of (add a b) back to a, which currently lives in %edx. As a result it is marked "dirty": It needs to be written back to memory when evicted.


        # Marked edx dirty (lvar,0)
        movl    %eax, %edx
        # RA: Already cached 'edx' for a
        # Locked register edx
        # RA: Allocated reg 'edi' for c
        # [:lvar, 2, :edi]
        movl    -16(%ebp), %edi

And we use a again, from %edx, and start reaping the rewards:


        movl    %edi, %eax
        addl    %edx, %eax
        # Unlocked register edx
        # Marked edx dirty (lvar,0)
        movl    %eax, %edx

Of course, note above that we could save much more with smarter handling of these registers - in this example we could have done addl %edi, %edx directly, and saved two further movls - we have tons of further optimizations left to do.

And here are some more examples where we reuse a:


        # RA: Already cached 'edx' for a
        # Locked register edx
        movl    $2, %eax
        imull   %edx, %eax
        # Unlocked register edx
        # Marked edx dirty (lvar,0)
        movl    %eax, %edx
        subl    $8, %esp
        movl    $2, %ebx
        movl    $.L83, %eax
        movl    %eax, (%esp)
        # RA: Already cached 'edx' for a
        movl    %edx, 4(%esp)
        movl    $printf, %eax

And here we finally spill a back to memory from %edx, right before we call printf, as %edx can be overwritten:


        # Saving edx to lvar,0
        movl    %edx, -8(%ebp)
        call    *%eax

(Incidentally, this is where liveness analysis makes a big difference: after the last use of a, it'll still get spilled, which is of course pointless, since the value of this local variable is never needed again.)

Final words

I'd like to reiterate that this is a trivial and primitive allocator. It misses tons of opportunities, and may do stupid things, like load stuff, use it once, have to evict it, load it again, use it once, have to evict it, and so on.

The important thing, though, is to get some basic infrastructure in place that we can expand on. We can now later add more advanced logic to determine which variables to cache when with much less effort.

Next time, we'll look at another side of this: Caching self in %esi, which we'll handle quite differently (and in a much shorter part...)

