Vidar Hokstad V2.0

Home Blog

Tag: compiler

2008-05-06 12:59 UTC Unholy: Converting Ruby 1.9 bytecode to Python bytecode

Posted in: , , ,
_why is at it again...

I can't quite agree with myself if this is seriously demented or tremendously cool... (Yes, it's mostly only an idea, with only the barest hint of doing anything "useful", but anyway..)


2008-05-03 17:43 UTC Writing a compiler in Ruby bottom up - step 6

This is the sixth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Since we established last time that I'm going to blatantly steal stuff from Lisp, Scheme and friends (as well as from all over the place - I feel no need to be original with this project... Not until we're much further along, anyway), now is the right time to start introducing some more powerful concepts.

How about some deferred evaluation and anonymous functions?

Lambdas, or anonymous functions, can be passed around like values and called at your convenience (or not at all). Generally, they can access variables from the surrounding scope that gets "bound" to the function as an environment that allows it to pass state. That's a closure. What we're going to do this time is not going to bring full closure support, but it's the start, and we'll get to full closures down the road.

Let's be clear: As almost everything in programming languages, they are syntactic sugar. You can consider them equivalent to defining a class with a single method to call it and instance variables to be the environment, for example (or you can treat closures as a way of building an object system instead - a concept explored by Wouter van Oortmerssen who has been a hero of mine ever since I discovered Amiga E. If you're a programming language geek you need to take a look at the stuff Wouter has done over the years) - many concepts are reasonably orthogonal.

(As further digression, have you noticed a lot of people like me who are obsessed with programming languages that have this horrible tendency to nest sentence fragments and abuse punctuation to interject random thoughts all over the place, or is that just me?)

But instead of having to go to all that trouble, and litter your namespace with small functions, lambdas let you define them inline, and return a value representing the function instead of executing it.

For now, we'll add these constructs:

(lambda (args) body)

(call f (args))

The former will return the function - for now just the address - instead of executing the body. "call" will call the address passed with the arguments it's given, no surprise there.

So how does this differ from "real" closures?

The tricky part about real closures is that if you refer to variables from the surrounding scope, they are supposed to "stay live" for the next invocation. That is, they are meant to be kept around whenever you call the closure later on. It should be clear that returning just the address to the function isn't sufficient to achieve that. So let's look at one way of doing it, so you get an idea of the work involved (it's not huge, but not trivial either):

  • We need to create an "environment", namely somewhere to store the variables that should survive past the lifetime of the function the closure is created in. This environment must exist on the heap, and each invocation of the function surrounding the lambda must result in a new environment.
  • What is returned must allow you to access the environment. You can do this by creating an "object" and make the environment represent instance variables, or you can use a "thunk" - a small generated function that contains a pointer to the object and loads it into a predefined location before calling the anonymous function, or any number of other ways.
  • You need to decide what variables go into the environment. This could be either "everything" accessible at the time of the execution of the "lambda" expression, or you could scan the expression body of the lambda and explicitly put only the ones used by one or more lambda expression in the function in the environment. The latter is potentially a lot more space efficient, but is slightly more work.

Ok, so for now just lets get the "anonymous function" in place. As usual I'll go through the changes step by step, but note that I've done some minor cleanups that are inconsequential to these changes as well, that I won't go through because they'll lust clutter things up.

The first piece is the code to handle the "lambda" expression:

  def compile_lambda args, body
    name = "lambda__#{@seq}"
    @seq += 1
    compile_defun(name, args,body)
    puts "\tmovl\t$#{name},%eax"
    return [:subexpr]
  end

Hopefully this is pretty self-explanatory. All it does is create a function name on the form lambda__[number] that we'll use to refer to the function, since we won't actually output it inline. Nothing really prevents us from spitting it out inline, but I find it messy, so for now anyway, I'll treat it as just another function. We then call compile_defun to generate just another named function - so it's only anonymous for the user. We then move the address of the function into %eax, which is where we've established the result of our subexpressions go. Note that this is yet another shortcut - eventually we'll need to do something more sophisticated to handle more complex expressions properly, but register allocation is a complex subject (and the alternative, to push everything onto the stack works but is slow).

Last we return [:subexpr] to indicate to the caller where/how to find the result of this expression.

Next we'll do some refactoring. You may have noticed that the inner loop of #compile_exp was getting ugly, handling different types of arguments. So we extract this:

  def compile_eval_arg arg
    atype, aparam = get_arg(arg)
    return "$.LC#{aparam}" if atype == :strconst
    return "$#{aparam}" if atype == :int
    return aparam.to_s if atype == :atom
    return "%eax"
  end

Notice the appearance of :atom, too. This allows us to take the address of C functions and pass them to :call. It was so simple to add, I figured why not. To go with that, we add this to #get_arg:

   return [:atom, a] if (a.is_a?(Symbol))

Next up, as part of the refactoring, :call falls out almost by itself:

  def compile_call func, args
    stack_adjustment = PTR_SIZE + (((args.length+0.5)*PTR_SIZE/(4.0*PTR_SIZE)).round) * (4*PTR_SIZE)

puts "\tsubl\t$#{stack_adjustment}, %esp" args.each_with_index do |a,i| param = compile_eval_arg(a) puts "\tmovl\t#{param},#{i>0 ? i*4 : ""}(%esp)" end

res = compile_eval_arg(func) res = "*%eax" if res == "%eax" # Ugly. Would be nicer to retain some knowledge of what "res" contains puts "\tcall\t#{res}" puts "\taddl\t$#{stack_adjustment}, %esp" return [:subexpr] end

This might look familiar. It's because it's the guts of #compile_exp, with #compile_eval_arg replacing some of the more ugly bits. The main other change is that it calls compile_eval_arg to get the function too, and does something weird to "%eax" - it prepends a "*".

Either you're confused now, or you might start seeing possibilities - both of doing nice things and of shooting yourself in the foot. The above is the equivalent of casting an arbitrary expression into a pointer and jumping to it with no validation. It makes it deliciously easy to cause a segfault by jumping to random addresses. It also, incidentally, makes things like a virtual pointer table for a class system trivially easy, and a number of other things. Safety will have to come later. As it turns out, the "*" is needed in front of indirect "call"'s.

So what does #compile_exp look like now? The obvious answer is "a hell of a lot cleaner":

 def compile_do(*exp)
    exp.each { |e| compile_exp(e) }
    return [:subexpr]
  end

def compile_exp(exp) return if !exp || exp.size == 0 return compile_do(*exp[1..-1]) if exp[0] == :do return compile_defun(*exp[1..-1]) if (exp[0] == :defun) return compile_ifelse(*exp[1..-1]) if (exp[0] == :if) return compile_lambda(*exp[1..-1]) if (exp[0] == :lambda) return compile_call(exp[1],exp[2]) if (exp[0] == :call) return compile_call(exp[0],exp[1..-1]) end

Nice, isn't it? Compile_call, as it turns out does almost exactly the same as what the guts of compile_exp used to do, apart from what was farmed out to the other functions.

So we do a little test:

prog = [:do,
  [:call, [:lambda, [], [:puts, "Test"]], [] ]
]

(yeah, not exactly earth-shattering)

Compile and run:

$ ruby step6.rb >step6.s
$ make step6
cc    step6.s   -o step6
$ ./step6
Test

The full code for this step is here

Following parts

Since I combined three parts last time, here's an updated list of the remaining pre-written "just-in-need-of-cleanup" parts - I guess I better free up some time to write some new stuff soon, as I'm fairly sure I might end up combining a couple of the parts below when I get around to cleaning them up (such much for posting it as 30 parts...)

  • Step 7: Revisiting loops with anonymous functions, accessing function arguments
  • Step 8: Adding assignment and basic arithmetic
  • Step 9: A cleaner "while" loop
  • Step 10: Testing the language: Writing a simple text mangling program to make input cleaner
  • Step 11: Refactoring the code generation and starting to abstract out the target architecture
  • Step 12: Discussion of various concepts and future direction
  • Step 13: Adding arrays
  • Step 14: Local variables and multiple scopes
  • Step 15: Accessing variable length arguments
  • Step 16: Revisiting the text mangler - cleanups to test new functionality, and run it on itself
  • Step 17: Identifying the constructs needed to self host the compiler
  • Step 18: A start on a proper parser


2008-04-27 19:31 UTC Writing a compiler in Ruby bottom up - step 5

This is the fifth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Last time I promised to post these more frequently, and blatantly failed to find the time... In return, I've taken the time to combine what would have been parts 5, 6 and 7, that by themselves would've been annoyingly short. Here goes:

Handle number literals

So far our program have only handled numbers as an accidental by-product of not doing any type checking of any kind, and even then only if you have an external function that produces them.

So lets take a look at how gcc handles integers up to long long for C. Again we're restricting it to 32bit x86:

void foo1(unsigned short a) {}
void foo2(signed short a) { }
void foo3(unsigned int a) {}
void foo4(signed int a) { }
void foo5(unsigned long a) {}
void foo6(signed long a) {}

void foo7(unsigned long long a) { } void foo8(signed long long a) {}

int main() { foo1(1); foo2(2); foo3(3); foo4(4); foo5(5); foo6(6); foo7(7); }

I'm omitting most of the gcc output, you can run gcc -S yourself if you care. The interesting bit is the calls to the various functions, which end up like this:

    movl    $1, (%esp)
    call    foo1
    movl    $2, (%esp)
    call    foo2
    movl    $3, (%esp)
    call    foo3
    movl    $4, (%esp)
    call    foo4
    movl    $5, (%esp)
    call    foo5
    movl    $6, (%esp)
    call    foo6
    movl    $7, (%esp)
    movl    $0, 4(%esp)
    call    foo7

In other words, for the purposes of a function call at least, gcc threats all types except long long as a single 32 bit value. We'll conveniently forget about long long and stick to values up to 32 bit then, since that means we can keep ignoring types completely.

Laziness is a virtue.

We'll also blatantly ignore floats? Why? Because you can easily write a compiler with only integer math, and so adding float support now would only be a waste of time, when things are bound to change down the line anyway.

Besides, when I was young we didn't have those fancy FPU's, sonny, and we did just fine anyway with fixed point calculations using integers.

So what do we actually change?

In #get_arg, before we handle string constants, we add this:

  return [:int, a] if (a.is_a?(Fixnum)) 

In #compile_exp, we add this to use the return values from #get_arg:

  elsif atype == :int then param = "$#{aparam}" 

Aaand, we're done. That was almost to easy.

Here's a test:

prog = [:do,
  [:printf,"'hello world' takes %ld bytes\\n",[:strlen, "hello world"]],
  [:printf,"The above should show _%ld_ bytes\\n",11]
]


Intermission: Some thoughts on primitive data types

"Pure" object oriented languages are great, but not at the relatively low level of the code generator, in my opinion. Whether or not you want to implement a pure OO language, I strongly believe that it's worth implementing primitives that can work on primitive data types first. Then you can decide whether or not to hide those primitives completely from the user at a later stage, or find ways to make them look like objects, or quietly convert them back and forth as needed behind the users back.

Note that the Matz Ruby Interpreter (MRI) does this: Numbers for example are quietly treated differently than "real" objects, but the interpreter does it best to hide this fact from the user. Personally I don't think MRI goes far enough.


If ... then ... else

You don't get very far without some kind of conditional logic. some form of if .. then .. else ... exists in most languages that wants to be useful. For now we will implement [:if, condition, if-arm, else-arm], and do it the C way, in other words both 0 and null pointers evaluate to false, everything else to true.

Again a simple test:

void foo() {}
void bar() {}

int main() { if (baz()) { foo(); } else { bar(); } }

Giving the relevant part:

    call    baz
    testl   %eax, %eax
    je  .L6
    call    foo
    jmp .L10
.L6:
    call    bar
.L10:

This is a pretty much universal pattern for compiling if .. then ... else, across a huge number of languages and architectures:

  • Calculate the expression that occurs in the condition
  • Test it in some way (here with "testl" - other variations that are common for various architectures is a "cmp" [compare] instruction or to subtract the register from itself). testl does a comparision of the left and right operands, and set various flags.
  • Next, do a conditional branch to the
  • else arm. In other words, we're checking if the condition is NOT true. In this case it's done with "je" which translates to "jump on equal", in other words if the preceding comparison marked the result as equal (note that on most CPU's, lots of instructions set the condition codes, not just explicit tests).
  • Then execute the "if-arm".
  • Jump over the else-arm to after the end of the entire if..then..else construct.
  • Place a label for the else-arm, and the code to execute it.
  • Place a label marking the end.

There are lots of variations, such as reordering the if/else arms depending on estimates on which condition is most likely and knowledge of whether a missed or taken branch is cheapest on a particular architecture, but the above is all that's needed for now.

Again the compilation is deceptively simple - it's really just a direct translation of the above:

  def ifelse cond, if_arm,else_arm 
    compile_exp(cond) 
     puts "\ttestl   %eax, %eax" 
     @seq += 2 
     else_arm_seq = @seq - 1 
     end_if_arm_seq = @seq 
     puts "\tje  .L#{else_arm_seq}" 
     compile_exp(if_arm) 
     puts "\tjmp .L#{end_if_arm_seq}" 
     puts ".L#{else_arm_seq}:" 
     compile_exp(else_arm) 
     puts ".L#{end_if_arm_seq}:" 
  end 

It should be pretty self explanatory - it's basically calling compile_exp for each of the condition, if-arm and else-arm and weaving that into the above code, using @seq to generate unique labels.

To hook it in, we add this right after "return defun ...." in #compile_exp:

  return ifelse(*exp[1..-1]) if (exp[0] == :if) 

And here's a simple test:

prog = [:do,
  [:if, [:strlen,""],
    [:puts, "IF: The string was not empty"],
    [:puts, "ELSE: The string was empty"]
  ],
  [:if, [:strlen,"Test"],
    [:puts, "Second IF: The string was not empty"],
    [:puts, "Second IF: The string was empty"]
  ]
]

Here is the result

As usual you run it like this:

$ ruby step5.rb >step5.s
$ make step5
cc    step5.s   -o step5
$ ./step5
ELSE: The string was empty
Second IF: The string was not empty
$

Some thoughts on loops

Unconditional loops are trivial to add to the language, but do we need to? The answer is strictly speaking no. We can already do loops with recursion, so why pollute the language?

For that to be a good solution, though, we need tail recursion elimination, and I'm not prepared to get into that right now. Tail recursion elimination, or it's more general form - tail call elimination - essentially means to identify cases where the exit out of a function consists of a call with the same or fewer arguments, followed by returning it's return value. If that is the case, you can reuse the current functions call frame for the arguments of the function you are calling, followed by a "jmp" instruction instead of "call". "jmp" doesn't push a return address onto the stack, and so when the function that you jump to return, it will return to the place the current function was called from, rather than to the current function.

This achieves a couple of things: First and foremost, the stack doesn't grow. Secondly, you save some precious cycles. Combined with appropriate other optimizations, with tail call elimination, you can implement a loop like this without being worried about running out of stack:

[:defun, :loop, [], [:do, 
  [:puts, "I am all loopy"], 
  [:loop] 
], 

[:loop]

In other words, tail call elimination means that for any function on the form "(defun foo () (do bar foo))", stack usage goes from being proportional to the depth of the tail calls to being 1 frame deep at that point...

The above will run with the current compiler, but it'll quickly overrun the stack and crash. Not good.

I sense a disturbance in the force. All two geeks reading this shuddering at the thought of the stack growing with each iteration..

For now, we'll just ignore this, and not do anything to abuse the stack too much. Instead we'll later implement a proper loop construct as a primitive. At least for now - if/when I add tail call elimination we can consider ripping it out again and make it part of the runtime library instead.

So we can do non-terminating loops... That doesn't do us much good now does it?.

Well, we can already do while loops too:

(defun some-while-loop () (if condition (some-while-loop) ()))

Not very satisfying, but it works. It's too ugly to be an acceptable workaround, though, so a "while" primitive is on the list.

(As the realization that I've used Lisp like syntax sinks in, there's another disturbance in the force as the same two geeks as before shriek in agony and bolt for the door...)

I'm not a Lisp'er. I can't handle the parentheses overload... But Lisp syntax is convenient when the language doesn't really have a syntax of it's own yet - that'll come after bootstrapping when it's time to write a proper parser. Incidentally, most of what I've implemented and will implement will bear some kind of resemblance to badly broken LISP - as it turns out, Lisp constructs are very powerful if you can stand the syntax, and even if you don't want to program in it, learning more about Lisp and/or Scheme is worthwhile.

Personally I don't heed my own advice, and most of the Lisp-like ideas I'll be using are probably based on incomplete understanding from the very limited time I've spent looking at Lisp.

Next time: Anonymous functions, maybe more.


2008-04-17 17:59 UTC Writing a compiler in Ruby bottom up - step 4

This is the fourth in a series. Please start on part 1 first if you haven't already to get the background, or see here for the full series so far

Sorry for the long delay since the last part, I've simply been too caught up in other stuff. As the list from last time indicated, this part will cover defining functions and adding support for building a simple "runtime library".

Defining functions

A programming language without either functions or methods isn't much of a language. And as it happens, all parts of an object oriented language can be implemented easily enough in terms of a procedural language: A method isn't much more than a function that takes an object as an extra argument. So adding support for functions is obviously rather critical.

It is also very simple. Lets take a look at some C and the resulting assembly again:

void foo()
{
  puts("Hello world");
}

.. results in this, with gcc:

.globl foo
        .type   foo, @function
foo:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    $.LC0, (%esp)
        call    puts
        leave
        ret
        .size   foo, .-foo

The function call is easy enough to recognize, and that leaves us with a small prolog and epilog to the function:

The prolog stores %ebp onto the stack, and then copies %esp into %ebp. "leave" undoes the damage, and "ret" pops the return address off the stack and returns to where the function was called from. Why does it store %esp (the stack pointer) in %ebp? Well, one obvious advantage is that you can mess up the stack as much as you like, and just copy %ebp back again at the end and all is well. We can see above that GCC takes advantage of that by not freeing the stack space it allocated for the call to puts again, and leaving that to the "leave" instruction - it'd just be wasteful otherwise.

So if that's all, the changes should be fairly simple. And they are.

First we modify Compiler#initialize to create a Hash to store our functions in:

  def initialize 
    @global_functions = {} 

Then we add a method to output the functions:

  def output_functions
    @global_functions.each do |name,data|
      puts ".globl #{name}"
      puts ".type   #{name}, @function"
      puts "#{name}:"
      puts "\tpushl   %ebp"
      puts "\tmovl    %esp, %ebp"
      compile_exp(data[1])
      puts "\tleave"
      puts "\tret"
      puts "\t.size   #{name}, .-#{name}"
      puts
    end
  end

You see we include the ".globl" and ".type" and ".size" bits too. ".globl" indicates that the function should be visible outside of the specific file you're assembling, which is important if you want to link multiple object files together. ".type" and ".size" I believe are there primarily for debugging, indicating the the symbol represents a function, and the size respectively.

Apart from that, this function is pretty trivial - it calls "compile_exp" to do the actual work.

We add a little helper to define the functions:

  def defun name, args, body
    @global_functions[name] = [args,body]
  end

We add a couple of lines to #compile_exp:

    return if !exp || exp.size == 0
    return defun(*exp[1..-1]) if (exp[0] == :defun)

The first line is there partly for robustness, partly to let us pass nil or empty arrays as "do nothing", just as you'd expect to be able to write an empty function declaration. It makes sense in the context of the next line, that doesn't check if it's about to define an empty function.

You may or may not have noticed that this thing lets us define functions recursively. [:defun,:foo,[:defun, :bar, []]] is completely legal. You may also have noticed that this currently leads to defining two functions that are both globally accessible. Oh well. It doesn't do any harm, so we'll deal with that later ("that" being either preventing it, or making the inner function only visible inside the outer function - I haven't decided yet which I prefer).

All that remains is to output the functions, so we add this to #compile, right before output_constants:

 output_functions

Adding support for a runtime library

First we change the name of the current #compile to #compile_main. Then we redefine #compile as follows:

  def compile(exp) 
    compile_main([:do, DO_BEFORE, exp, DO_AFTER]) 
  end 

Then we define the constants DO_BEFORE and DO_AFTER (put them in a separate file if you prefer, for now I've just put them at the top):

DO_BEFORE= [:do,
  [:defun, :hello_world,[], [:puts, "Hello World"]]
]

DO_AFTER= []

Just admit it, you'd expected something more advanced. But that would defeat the goals set out at the start. The above is sufficient to define a proper runtime library. If you want something that'd need to be implemented in assembler or C, that works too: Just link to an object file containing the functions in question, since we still respect C calling conventions.

Lets test it. Add this before Compiler.new.compile(prog):

prog = [:hello_world]

And compile and run:

$ ruby step4.rb >step4.s
$ make step4
cc    step4.s   -o step4
$ ./step4
Hello World
$

You can find the result here

Accessing function arguments?

This leaves one big gaping omission: Accessing function arguments. As it happens that is a fairly large change. Trust that it won't be forgotten, it's a major part of part 8, and I'll try not leaving so long between the next few parts.


2008-04-08 07:49 UTC Strawman for a new parser generator

I have very strong opinions when it comes to parser generators. Most parsers I've written have ended up being hand written or have used some half-assed parser generators I've written myself, because I've yet to find a parser generator I like. I've found many I like aspects of, but they invariably seem to fail on one or more of the following points. I'd love to get feedback on suggestions for parser generators I ought to look at.

Here are my "rules":

  • Low coupling. In this case that translates to avoiding a pet peeve of mine, namely intermingling code in the target language interspersed with the grammar. A good grammar in a clean syntax is documentation as well as code. Messing it up by interspersing actual code is annoying. But there's a more important reason: Any reasonably sized grammar gets complex to understand pretty quickly. Thus I don't want to have to rewrite it if I want to reuse the grammar in another target language. This implies not only expressing the grammar without mixing in code in another language, but also avoiding dependence on external code to change the workings of the grammar. For some languages such as C++ and Ruby that means the typical functionality of a parser generator needs to be extended (C++ for example needs the parser to know whether a given string is a type name or not in some cases, if I remember correctly, and Ruby has that awful - from a parsing perspective at least - syntax for multiple "here documents").
  • Clean syntax. I'm picky about syntax in all languages. I'm extra picky about it in parser specifications. I like BNF style syntax, but I even there I have very specific ideas about what looks good. For example I can't stand the ABNF syntax specified in RFC 2234. I find it outright painful to use "/" instead of "|" for "or". Yes, it probably is because I've gotten used to the pipe-character, rather than any kind of objective qualities, but it matters nonetheless. More important than looks, though, is that the syntax is concise.
  • I can't stand parser generators that don't make the input really terse. For most simple languages writing a terse recursive descent parser that is extremely readable can be done by implementing a tiny little reusable set of parser combinators (functions or objects that handle a tiny aspect of parsing, such as "or" and defer the rest to other objects, that can be combined by function- or object references or writing functions). If the input to the parser generator isn't terse, much of the advantage (certainly not all) goes out the window.

  • Easy to retarget. I really, really can't stand parser generators that can't easily be modified to target a new language. Most of the good ones like ANTLR and Ragel are fairly nice in that respect (and Ragel might, as far as I can tell, support not mixing in target language code too? I haven't looked enough at Ragel yet).
  • Supports multiple purposes One of my BIG issues with many parser generators is that they tend to support either generating push parsers or generating pull parsers or feed you events or a full parse tree, or only generating parsers etc. Many of them make it hard to for example generate syntax diagrams or tree walkers or document generation, or they do one or more of those, but in one way only and if your model doesn't fit perfectly to theirs, then you have to dig into the core code. It's by no means true for all of them, but it something I've run across time after time. This is a typical case where high cohesion and low couplings means something. Making the AST of the parser generator itself trivially accessible for plugins or filters so that other cools can benefit is a huge plus.
  • Good grammars are simple grammars Ruby's grammar for example scares the shit out of me in terms of all the horrendous edge cases. MOST Ruby usage is clean, and MOST of the time the parser do what you think it should, but it's a monstrous grammar when you actually start digging into it. It doesn't help that the parse.y file of Matz' Ruby interpreter is far more verbose than it needs be. Languages like Oberon have grammars that can fit on an A4 page. With large fonts. And comments. While a language like Ruby can afford to have a somewhat larger grammar, there's no excuse for making it as big as it is. But nor is there an excuse for making matters worse with a verbose parser generator, or one with a complex grammar. If the grammar of the parser generator takes up more than a page, I run away screaming. I want something simple and formal.
  • Generate simple code, or at least offer it as an option. I hate trying to debug grammars without being able to sensibly instrument (automatically or manually) the generated code and be able to read and understand it. Table driven generators like most Yacc/Bison style generators are particularly nasty, but most generators I've seen fail here. I like generators that at least give the option of creating simple, clean, recursive descent parsers, as recursive descent tends to be very simple for humans to understand compared to many of the alternatives. They are also simple to test and instrument, particularly if you can call directly in to specific rules.

A couple of years ago I wrote an experimental "parser assembler" and built a BNF to "assembler" generator and a VM to run it on. It was moderately successful as an experiment in that I was able to build a number of parsers from it very easily. Most BNF could be translated almost mechanically, and work, and at the same time the BNF parser generated a reasonably clean AST that would be simple to manipulate. The parser "assembler" made it highly portable, and made the output readable (with a little bit of effort, admittedly) and made the parsers compact, and I was toying with making another backend to generate x86 assembler directly. Because of the execution model, the parser assembly could be used both as pull or push parsers as desired, and that was what I liked the most about it.

But in retrospect I'm still not happy. The syntax was too messy, and the way it interacted with the "outside" wasn't optimal (you had to keep track of numbered events), and it didn't do anything to generate AST's or make writing tree-walkers simpler etc. So while it avoided a couple of typical pitfalls, and was tiny and didn't intermingle target language code with the parser, I still want something better.

I started on what I thought would be a cleaner grammar, and I believe I got to something that was reasonably nice. I came across it in my notes again the other day when I was cleaning up my latest compiler post. Consider it a strawman - it's not complete, and probably have stupid bugs, since the generator hasn't actually been implemented so I could test it on itself. This grammar defines the hypothetical parser generator syntax in (a subset of) itself:

grammar ParseGen

rules { start< :grammar> { "grammar" @name @rules } rules { "rules" "{" $rule* "}" } rule { @name ("<" @var ">")? "{" @expr* "}"? } expr { $or_expr } or_expr { post_expr<@left> ("|" or_expr<@right>)? } post_expr { $cut | $glue | $simple } simple { @primary ("<" @var ">")? ("?" | "*" | "+")<@modifier>? } primary { paren<@expr> | @keywords | @name | @string | @number | @set } parent { "(" $rule+ ")" } keywords { "ANY" | "EOF" } string { (('"' _ ([~\"]*)<$str> _' "') | ("'"_ ([~\']*)<$str> _ "'")) } var { ('@'|'$'|':') _ <@vartype> @name } number { @base10 | @base16 } name { ([a-zA-Z][a-zA-Z0-9_]*) } set { '[' ('~'? ("\\]"|[~\]])*) ']' ] cut { "!"< :cut > } glue { "_"< :glue > } base10 { [0-9]+ } base16 { 'x'_[0-9a-fA-F]+ }

WHITESPACE { 9 | 13 | 10 | ('#' [~\n]* 10) } }

The observant might notice that the syntax appears to steal a lot of elements from Ruby. That's not a coincidence. I like the way it looks, and it also allows me to use Ruby syntax highlighting to make it look good. I've tried not abusing it too much in terms of changing meaning around. Yes, I'm too lazy to figure out how to write my own syntax highlighter in elisp, so sue me.

Here are some notes on what the syntax is intended to achieve:

  • Mostly is a realtively typical EBNF "dialect" and should be fairly recognizable if you've looked at various grammars in EBNF dialects.
  • One of the things I hate about many grammars is the way they handle whitespace. I made the decision to make a "WHITESPACE" rule that by default is allowed everywhere unless expressly blocked with the "glue" operator "_". You can then trivially enough get "normal" behavior by setting the WHITESPACE rule to {} (nothing) and creating your own whitespace rule and use it where you want whitespace instead of using the glue operator where you don't want whitespace.
  • I want to be able to generate AST's and events driven parsers from the same thing. By default, I am assuming that the name of a rule is going to become either basis for an event or set of events, or a node in the tree. The exception is when a variable starting with "$" is present.
  • A variable starting with "$" indicates that the value returned will become the event or AST node, meaning that the current rule will not appear anywhere, and is for convenience only
  • Rule names or applications can be suffixed by a variable name enclosed in less than and greater than. If so, that name designates the name of the event or AST node generated if a rule name, or the name of the variable to be set in the event or node if a rule application.
  • If a rule application is prefixed with '$' or '@', then the rule name itself is used as the variable name. So "@var" is shorthand for "var<@var>"
  • If a rule application or variable name is prefixed with ":" then that indicates that the name itself will be returned as a symbol (or equivalent in the target language) if the rule matches, or nil/null/equivalent when it doesn't.
  • The "cut" operator, "!", designate a point beyond which backtracking causes an error. This isn't really needed in a language with only one symbol lookahead, but in this case there's no distinction between a separate lexer and the parser. It also doesn't hurt to be more flexible as it often simplifies the parser (at the cost of making it easy to write badly performing ones if you rely on a lot of backtracking). I find this a convenient way of adding error handling, by setting an error code at the cut point (though the current grammar doesn't support that). The "cut" operator is borrowed from Prolog.

The goal is to get a very simple, compact syntax while still retaining enough information to generate full syntax trees. I do aim to add some way of calling out to the target language to carry out some operations too, but I consider that a last resort. I do intend to look at ways of handling at least things like Ruby's "here documents" without resorting to callbacks.

I don't think I'll have time to write actual code for this for a long time, but I'd love comments on the ideas/thoughts above, and I'll definitively keep thinking about it.


About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Categories

StumbleUpon My link page

(Links I have stumbled and like)