Strawman for a new parser generator 2008-04-08


I have very strong opinions when it comes to parser generators. Most parsers I've written have ended up being hand written or have used some half-assed parser generators I've written myself, because I've yet to find a parser generator I like. I've found many I like aspects of, but they invariably seem to fail on one or more of the following points. I'd love to get feedback on suggestions for parser generators I ought to look at.

Here are my "rules":

A couple of years ago I wrote an experimental "parser assembler" and built a BNF to "assembler" generator and a VM to run it on. It was moderately successful as an experiment in that I was able to build a number of parsers from it very easily. Most BNF could be translated almost mechanically, and work, and at the same time the BNF parser generated a reasonably clean AST that would be simple to manipulate. The parser "assembler" made it highly portable, and made the output readable (with a little bit of effort, admittedly) and made the parsers compact, and I was toying with making another backend to generate x86 assembler directly. Because of the execution model, the parser assembly could be used both as pull or push parsers as desired, and that was what I liked the most about it.

But in retrospect I'm still not happy. The syntax was too messy, and the way it interacted with the "outside" wasn't optimal (you had to keep track of numbered events), and it didn't do anything to generate AST's or make writing tree-walkers simpler etc. So while it avoided a couple of typical pitfalls, and was tiny and didn't intermingle target language code with the parser, I still want something better.

I started on what I thought would be a cleaner grammar, and I believe I got to something that was reasonably nice. I came across it in my notes again the other day when I was cleaning up my latest compiler post. Consider it a strawman - it's not complete, and probably have stupid bugs, since the generator hasn't actually been implemented so I could test it on itself. This grammar defines the hypothetical parser generator syntax in (a subset of) itself:


    grammar ParseGen
    
    rules {
      start< :grammar> { "grammar" @name @rules }
      rules           { "rules"   "{" $rule*   "}" }
      rule            { @name ("<" @var ">")? "{" @expr* "}"? }
      expr            { $or_expr }
      or_expr         { post_expr<@left> ("|" or_expr<@right>)? }
      post_expr       { $cut | $glue | $simple }
      simple          { @primary ("<" @var ">")?  ("?" | "*" | "+")<@modifier>? }
      primary         { paren<@expr> | @keywords | @name | @string | @number | @set }
      parent          { "(" $rule+ ")" }
      keywords        { "ANY"<:any> | "EOF"<:eof> }
      string          { (('"' _ ([\"]*)<$str> _' "') | ("'"_ ([\']*)<$str> _ "'")) }
      var             { ('@'<:var>|'$'<:return>|':'<:symbol>) _ <@vartype> @name }
      number          { @base10 | @base16 }
      name            { ([a-zA-Z][a-zA-Z0-9_]*) }
      set             { '[' (''? ("\\]"|[\]])*) ']' ]
      cut             { "!"< :cut > }
      glue            { "_"< :glue > }
      base10          { [0-9]+ }
      base16          { 'x'_[0-9a-fA-F]+ }
    
      WHITESPACE      { 9 | 13 | 10 | ('#' [~\n]* 10) }
    }

The observant might notice that the syntax appears to steal a lot of elements from Ruby. That's not a coincidence. I like the way it looks, and it also allows me to use Ruby syntax highlighting to make it look good. I've tried not abusing it too much in terms of changing meaning around. Yes, I'm too lazy to figure out how to write my own syntax highlighter in elisp, so sue me.

Here are some notes on what the syntax is intended to achieve:

The goal is to get a very simple, compact syntax while still retaining enough information to generate full syntax trees. I do aim to add some way of calling out to the target language to carry out some operations too, but I consider that a last resort. I do intend to look at ways of handling at least things like Ruby's "here documents" without resorting to callbacks.

I don't think I'll have time to write actual code for this for a long time, but I'd love comments on the ideas/thoughts above, and I'll definitively keep thinking about it.


blog comments powered by Disqus