Vidar Hokstad V2.0

Home Blog

Tag: ruby

2008-05-16 15:13 UTC Writing a compiler in Ruby bottom up - step 7

This is the seventh in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

I've combined two of the planned parts this time, what was in the list from last time as parts 7 and 8.

Making use of lambda / call

We can implement loops using recursion "manually", but adding lambda's now should make it possible to create a slightly cleaner version by actually defining a "while" function:

   (defun while (cond body) 
      (if (call cond ()) (do (call body ()) (while cond body) ()))
    )

In a way, this isn't so far from Ruby blocks, except that we don't provide the syntactic sugar that allows the blocks to be defined without any extra vocabulary, so in use our "while" function would look like this:

   (while (lambda () (cond)) (lambda () (body)))

Not terrible, and far better than earlier, but still not very clean. We'll do better shortly, but for now lets start with a pre-requisite that prevents the code above from actually working in any meaningful sense of the word:

We finally have to add support for using the arguments passed.

In our Ruby based syntax, the while function using function arguments looks like this:

 [:defun, :while, [:cond, :body],
    [:if, [:apply, :cond, []], [:do,
        [:apply, :body, []],
        [:while, :cond, :body]
      ],
      []
    ]
  ]

As a sidebar, indirectly this also means we're providing a hackish way of providing local variables. "Proper" local variables is largely syntactic sugar again. Most stuff in programming is syntactic sugar:

  (call (lambda (i) (code here)) (0))

The code above defines the local variable (i), visible to the code inside, and initializes it to 0. Not pretty, but as usual we defer pretty until later. As usual it's also highly inefficient, since it depends on creating a brand new function, but we'll deal with that later.

Adding function arguments

But lets move on and actually figure out what changes to make to access the function arguments. These preparations also pave the way for proper local variables and more down the line.

First we'll add a "scope" object, and do some minor refactoring. The "scope" defines which sets of variables are actually visible to use at any time.

class Function
  attr_reader :args,:body
  def initialize args,body
    @args = args
    @body = body
  end
end

class Scope def initialize compiler,func @c = compiler @func = func end

def get_arg a a = a.to_sym @func.args.each_with_index {|arg,i| return [:arg,i] if arg == a } return [:atom,a] end end

The reason we separate Function and Scope here is that I'll later introduce scopes that match other things than functions, such as classes etc.

Part of the refactoring mentioned is to thread the scope objects through the compiler, so that scope changes are transparent (by just passing the new scope down).

We hook Scope#get_arg in into Compiler#get_arg by changing:

   return [:atom, a] if (a.is_a?(Symbol))

into this:

   return scope.get_arg(a) if (a.is_a?(Symbol)) 

#output_functions changes into this, to start a new scope for each function:

 def output_functions
    @global_functions.each do |name,func|
     puts ".globl #{name}"
     puts ".type   #{name}, @function"
     puts "#{name}:"
     puts "\tpushl   %ebp"
     puts "\tmovl    %esp, %ebp"
     compile_exp(Scope.new(self,func,func.body)
     puts "\tleave"
     puts "\tret"
     puts "\t.size   #{name}, .-#{name}"
     puts
    end
  end   

And #compile_defun changes to create a proper Function object instead of just an array of arguments and the body:

  def compile_defun scope,name, args, body
    @global_functions[name] = Function.new(args,body)
    return [:subexpr]
  end 

Then we need to actually support accessing the arguments. Again we resort to "gcc -S" to find out how. This:

void bar(const char * str, unsigned long arg, unsigned long arg2)
{
  printf(str,arg,arg2);
}

turns into:

bar:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $24, %esp
        movl    16(%ebp), %eax
        movl    %eax, 8(%esp)
        movl    12(%ebp), %eax
        movl    %eax, 4(%esp)
        movl    8(%ebp), %eax
        movl    %eax, (%esp)
        call    printf
        leave
        ret

As you can see, the arguments are accessed relative to %ebp, which has been loaded with a copy of %esp at the beginning of the function. Why the offset of 8 for the first argument? Well, the return address gets pushed onto the stack, creating an offset of 4, and the the old value of %ebp is pushed on, giving us an offset of 8 to access the arguments. Remember that %esp grows down in memory, which is why we're adding offsets to get past the last entries pushed onto the stack.

That leads to this addition to #compile_eval_arg as the last "if" check:

   if atype == :arg
      puts "\tmovl\t#{PTR_SIZE*(aparam+2)}(%ebp),%eax"
    end

Last but not least we create a Function object for "main" by modifying the call to #compile_exp in #compile_main:

   @main = Function.new([],[])
    compile_exp(Scope.new(self,@main),exp)

Time to test how it works:

prog = [:do,
  [:defun,:myputs,[:foo],[:puts,:foo]],
  [:myputs,"Demonstrating argument passing"],
]

Then:

$ ruby step7.rb >step7.s
$ make step7
cc    step7.s   -o step7
$ ./step7
Demonstrating argument passing

The latest version is here


2008-05-13 09:22 UTC Giant balls of typeless source files

Posted in: , , ,
Cedric posted this entry about an article by Steve Yegge called Dynamic languages strike back (it's slightly interesting, but nothing really new if you already like dynamic languages, though the section on trace trees is wel worth a read). Most of Cedric's comments were unremarkable, but this made me cringe:

What will keep preventing dynamically typed languages from displacing statically typed ones in large scale software is not performance, it's the simple fact that it's impossible to make sense of a giant ball of typeless source files, which causes automatic refactorings to be unreliable, hence hardly applicable, which in turn makes developers scared of refactoring. And it's all downhill from there. Hello bit rot.

First of all, let me be generous and assume he meant type annotated source files, because most prominent dynamic languages are strongly typed. In Ruby, for example, everything is an object, and every object has a type and a class (note the distinction: A Ruby object has a type that may or may not coincide with just the class, because of the concept of eigenclasses, which allows you to extend a single object without changing the class).

This makes the argument slightly more tenable, but not much so. Type annotations, or lack of them, is not a distinguishing feature of static languages vs. dynamic. Many static languages, such as Haskell, can forego most (in some cases all) type annotation because of good type inference engines.

But one of the things I love about Ruby is the lack of unnatural type restrictions. In fact, the few times I've come across code with explicit type annotations (using #is_a? etc.) it's usually been a hindrance rather than a help, because the implementor of the class in question made assumptions about what classes could be provided as input that just plain were more restrictive than they needed to be.

The "giant balls of typeless source files" just have yet to be a problem for me. First of all, they're not giant balls - I find myself routinely writing far less code to achieve the same goals than what I'd do in most other languages that I know.

But more than that, I end up writing code where type largely doesn't matter, and where the small bits of type that do matter is clearly documented. The small bits that do matter are NOT type or class names, but things like this:

  • Argument "foo" must implement #each to iterate over a collection.
  • Argument "bar" must support Comparable.

Coupling tends to be far reduced, allowing me to reason about, and test, the code in much smaller units. Which again means that I rarely - if ever - need to even consider "giant balls" of source files at all. Most of my Ruby applications are on the order of a few hundred to a few thousand lines of code, but they build on a shitload of libraries. And while I like having the source to them in case I need a new feature, I can honestly say that I've never looked at the source of most of them for other than curiosity about how they do something.

Most of the time, I won't look at source or even documentation. I'll be doing something like this:

$ irb -rmodel
irb(main):001:0> (Item.methods - Object.methods).sort
=> ["[]", "add_hook", "after_create", "after_destroy", "after_initialize", "after_save", "after_update", "all_association_reflections", "associate", "association_reflection", "associations", "before_create", "before_destroy", "before_save", "before_update", "belongs_to", "cache_key_from_values", "cache_store", "cache_ttl", "columns", "create", "create_table", "create_table!", "create_with", "create_with_params", "database_opened", "dataset", "db", "db=", "def_hook_method", "delete_all", "destroy_all", "drop_table", "fetch", "find", "find_or_create", "has_and_belongs_to_many", "has_hooks?", "has_many", "has_validations?", "hooks", "implicit_table_name", "is", "is_a", "is_dataset_magic_method?", "join", "load", "many_to_many", "many_to_one", "method_missing", "no_primary_key", "one_to_many", "one_to_one", "plugin_gem", "plugin_module", "primary_key", "primary_key_hash", "schema", "serialize", "set_cache", "set_cache_ttl", "set_dataset", "set_primary_key", "set_schema", "skip_superclass_validations", "subset", "super_dataset", "table_exists?", "table_name", "validate", "validates", "validates_acceptance_of", "validates_confirmation_of", "validates_each", "validates_format_of", "validates_length_of", "validates_numericality_of", "validates_presence_of", "validations"]
irb(main):002:0> 

... and I tend to find what I'm looking for.

I have so far not once felt even the slightest need to use a refactoring tool with Ruby. I've never once done a mass renaming of methods with Ruby that spanned more than a single file, and where emacs search replace wasn't more than sufficient.

This is an impedance mismatch between the idea what is necessary that just doesn't match reality for a lot of people working with dynamic languages.

I don't want a re-factoring tool. It doesn't even place in my top 10 of features or functionality I'd like to have for Ruby, to the point where I've never looked to see if one already exists.

Does that mean I don't refactor? No it doesn't. But refactoring when the module you're working on is measured in hundreds of lines and a handful of files, and is meaningfully separate and not coupled to the rest of your code, is pretty trivial, and not something that needs tool support.

What it boils down to is that the very need for advanced refactoring tools is a big red flashing warning sign. It means the language has failed in making life easy for you and/or you have failed as a designer.

It is one of the reasons I truly loathe highly coupled systems - it's a design smell that I've had to endure enough in the past that I'll go to great lengths to avoid it. The result is better software, with far more reusability and maintainability.

And systems without giant balls of source files - typeless or not.


2008-05-06 12:59 UTC Unholy: Converting Ruby 1.9 bytecode to Python bytecode

Posted in: , , ,
_why is at it again...

I can't quite agree with myself if this is seriously demented or tremendously cool... (Yes, it's mostly only an idea, with only the barest hint of doing anything "useful", but anyway..)


2008-05-03 17:43 UTC Writing a compiler in Ruby bottom up - step 6

This is the sixth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Since we established last time that I'm going to blatantly steal stuff from Lisp, Scheme and friends (as well as from all over the place - I feel no need to be original with this project... Not until we're much further along, anyway), now is the right time to start introducing some more powerful concepts.

How about some deferred evaluation and anonymous functions?

Lambdas, or anonymous functions, can be passed around like values and called at your convenience (or not at all). Generally, they can access variables from the surrounding scope that gets "bound" to the function as an environment that allows it to pass state. That's a closure. What we're going to do this time is not going to bring full closure support, but it's the start, and we'll get to full closures down the road.

Let's be clear: As almost everything in programming languages, they are syntactic sugar. You can consider them equivalent to defining a class with a single method to call it and instance variables to be the environment, for example (or you can treat closures as a way of building an object system instead - a concept explored by Wouter van Oortmerssen who has been a hero of mine ever since I discovered Amiga E. If you're a programming language geek you need to take a look at the stuff Wouter has done over the years) - many concepts are reasonably orthogonal.

(As further digression, have you noticed a lot of people like me who are obsessed with programming languages that have this horrible tendency to nest sentence fragments and abuse punctuation to interject random thoughts all over the place, or is that just me?)

But instead of having to go to all that trouble, and litter your namespace with small functions, lambdas let you define them inline, and return a value representing the function instead of executing it.

For now, we'll add these constructs:

(lambda (args) body)

(call f (args))

The former will return the function - for now just the address - instead of executing the body. "call" will call the address passed with the arguments it's given, no surprise there.

So how does this differ from "real" closures?

The tricky part about real closures is that if you refer to variables from the surrounding scope, they are supposed to "stay live" for the next invocation. That is, they are meant to be kept around whenever you call the closure later on. It should be clear that returning just the address to the function isn't sufficient to achieve that. So let's look at one way of doing it, so you get an idea of the work involved (it's not huge, but not trivial either):

  • We need to create an "environment", namely somewhere to store the variables that should survive past the lifetime of the function the closure is created in. This environment must exist on the heap, and each invocation of the function surrounding the lambda must result in a new environment.
  • What is returned must allow you to access the environment. You can do this by creating an "object" and make the environment represent instance variables, or you can use a "thunk" - a small generated function that contains a pointer to the object and loads it into a predefined location before calling the anonymous function, or any number of other ways.
  • You need to decide what variables go into the environment. This could be either "everything" accessible at the time of the execution of the "lambda" expression, or you could scan the expression body of the lambda and explicitly put only the ones used by one or more lambda expression in the function in the environment. The latter is potentially a lot more space efficient, but is slightly more work.

Ok, so for now just lets get the "anonymous function" in place. As usual I'll go through the changes step by step, but note that I've done some minor cleanups that are inconsequential to these changes as well, that I won't go through because they'll lust clutter things up.

The first piece is the code to handle the "lambda" expression:

  def compile_lambda args, body
    name = "lambda__#{@seq}"
    @seq += 1
    compile_defun(name, args,body)
    puts "\tmovl\t$#{name},%eax"
    return [:subexpr]
  end

Hopefully this is pretty self-explanatory. All it does is create a function name on the form lambda__[number] that we'll use to refer to the function, since we won't actually output it inline. Nothing really prevents us from spitting it out inline, but I find it messy, so for now anyway, I'll treat it as just another function. We then call compile_defun to generate just another named function - so it's only anonymous for the user. We then move the address of the function into %eax, which is where we've established the result of our subexpressions go. Note that this is yet another shortcut - eventually we'll need to do something more sophisticated to handle more complex expressions properly, but register allocation is a complex subject (and the alternative, to push everything onto the stack works but is slow).

Last we return [:subexpr] to indicate to the caller where/how to find the result of this expression.

Next we'll do some refactoring. You may have noticed that the inner loop of #compile_exp was getting ugly, handling different types of arguments. So we extract this:

  def compile_eval_arg arg
    atype, aparam = get_arg(arg)
    return "$.LC#{aparam}" if atype == :strconst
    return "$#{aparam}" if atype == :int
    return aparam.to_s if atype == :atom
    return "%eax"
  end

Notice the appearance of :atom, too. This allows us to take the address of C functions and pass them to :call. It was so simple to add, I figured why not. To go with that, we add this to #get_arg:

   return [:atom, a] if (a.is_a?(Symbol))

Next up, as part of the refactoring, :call falls out almost by itself:

  def compile_call func, args
    stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE)

puts "\tsubl\t$#{stack_adjustment}, %esp" args.each_with_index do |a,i| param = compile_eval_arg(a) puts "\tmovl\t#{param},#{i>0 ? i*4 : ""}(%esp)" end

res = compile_eval_arg(func) res = "*%eax" if res == "%eax" # Ugly. Would be nicer to retain some knowledge of what "res" contains puts "\tcall\t#{res}" puts "\taddl\t$#{stack_adjustment}, %esp" return [:subexpr] end

This might look familiar. It's because it's the guts of #compile_exp, with #compile_eval_arg replacing some of the more ugly bits. The main other change is that it calls compile_eval_arg to get the function too, and does something weird to "%eax" - it prepends a "*".

Either you're confused now, or you might start seeing possibilities - both of doing nice things and of shooting yourself in the foot. The above is the equivalent of casting an arbitrary expression into a pointer and jumping to it with no validation. It makes it deliciously easy to cause a segfault by jumping to random addresses. It also, incidentally, makes things like a virtual pointer table for a class system trivially easy, and a number of other things. Safety will have to come later. As it turns out, the "*" is needed in front of indirect "call"'s.

So what does #compile_exp look like now? The obvious answer is "a hell of a lot cleaner":

 def compile_do(*exp)
    exp.each { |e| compile_exp(e) }
    return [:subexpr]
  end

def compile_exp(exp) return if !exp || exp.size == 0 return compile_do(*exp[1..-1]) if exp[0] == :do return compile_defun(*exp[1..-1]) if (exp[0] == :defun) return compile_ifelse(*exp[1..-1]) if (exp[0] == :if) return compile_lambda(*exp[1..-1]) if (exp[0] == :lambda) return compile_call(exp[1],exp[2]) if (exp[0] == :call) return compile_call(exp[0],exp[1..-1]) end

Nice, isn't it? Compile_call, as it turns out does almost exactly the same as what the guts of compile_exp used to do, apart from what was farmed out to the other functions.

So we do a little test:

prog = [:do,
  [:call, [:lambda, [], [:puts, "Test"]], [] ]
]

(yeah, not exactly earth-shattering)

Compile and run:

$ ruby step6.rb >step6.s
$ make step6
cc    step6.s   -o step6
$ ./step6
Test

The full code for this step is here

Following parts

Since I combined three parts last time, here's an updated list of the remaining pre-written "just-in-need-of-cleanup" parts - I guess I better free up some time to write some new stuff soon, as I'm fairly sure I might end up combining a couple of the parts below when I get around to cleaning them up (such much for posting it as 30 parts...)

  • Step 7: Revisiting loops with anonymous functions, accessing function arguments
  • Step 8: Adding assignment and basic arithmetic
  • Step 9: A cleaner "while" loop
  • Step 10: Testing the language: Writing a simple text mangling program to make input cleaner
  • Step 11: Refactoring the code generation and starting to abstract out the target architecture
  • Step 12: Discussion of various concepts and future direction
  • Step 13: Adding arrays
  • Step 14: Local variables and multiple scopes
  • Step 15: Accessing variable length arguments
  • Step 16: Revisiting the text mangler - cleanups to test new functionality, and run it on itself
  • Step 17: Identifying the constructs needed to self host the compiler
  • Step 18: A start on a proper parser


2008-04-28 21:31 UTC Customizing the Ruby syntax highlighter for x86 assembler

I wrote about syntax hightlighting in Ruby earlier. The Ruby Syntax library supports Ruby, YAML and XML out of the box. But it's also pretty easy to extend to handle other languages.

Since I've been writing my compiler in Ruby series and including a lot of x86 assembler, I figured I'd see how much (or little) work adding a syntax highlighter for assembler would take.

It's by no means perfect - I've only spent half an hour or so throwing this together, but it's reasonable, and easy enough to keep adjusting.

Here it is:

require 'rubygems'
require 'syntax'

class AsmTokenizer < Syntax::Tokenizer def setup @state = :newline end

def step @state = :newline if bol? if @state == :newline # Handle labels and operands if label = scan(/[a-zA-Z.][a-zA-z0-9_]*:/) then start_group :label, label elsif words = scan(/\.[a-zA-Z0-9_]*/) start_group :directive, words @state = :operands elsif words = scan(/[a-zA-Z]+/) start_group :operator, words @state = :operands else start_group(:normal, getch) end else # Handle operators and assorted punctuation

if words = scan(/,/) then start_group :comma, words elsif words = scan(/[\-0-9][0-9]*/) then start_group :number, words elsif words = scan(/%[a-zA-Z]+/) then start_group :register, words elsif words = scan(/[\.a-zA-Z][a-zA-Z0-9]*/) then start_group :label, words elsif words = scan(/\$/) then start_group :value, words elsif words = scan(/[\(\)]/) then start_group :paren, words elsif words = scan(/\".*\"/) then start_group :quoted, words else start_group(:normal, getch) end end end end

# Register the custom highlighter Syntax::SYNTAX['asm'] = AsmTokenizer

I don't have time do write a lot of explanation - the Syntax manual does a reasonable job of describing it. The one pitfall to be aware of, is that you must make sure that your #step method advances at least one character no matter what, or you'll get stuck in an infinite loop.

Here's an example of how to use the new highlighter:

require 'syntax/convertors/html'

convertor = Syntax::Convertors::HTML.for_syntax "asm" puts convertor.convert( File.read("/tmp/step5.s"))

And here's some CSS to color the output:

body { background: black; }                                                                                                          
.directive { color: purple; }                                                                                                        
.comma     { color: white; }                                                                                                         
.paren     { color: white; }                                                                                                         
.value     { color: white; }                                                                                                         
.number    { color: yellow; }                                                                                                        
.label     { color: blue; }                                                                                                          
.register  { color: brown; }                                                                                                         
.operator  { color: lightgrey; }                                                                                                     
.quoted    { color: green; }                                                                                                         

And some example output from the compiler series:

        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $4,%esp
        movl    $.LC0,(%esp)
        call    puts
        addl    $4, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .section        .rodata
.LC0:
        .string "Hello World"


About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Categories

StumbleUpon My link page

(Links I have stumbled and like)