Vidar Hokstad V2.0

Home Blog

2008-04-08 07:49 UTC Strawman for a new parser generator

I have very strong opinions when it comes to parser generators. Most parsers I've written have ended up being hand written or have used some half-assed parser generators I've written myself, because I've yet to find a parser generator I like. I've found many I like aspects of, but they invariably seem to fail on one or more of the following points. I'd love to get feedback on suggestions for parser generators I ought to look at.

Here are my "rules":

  • Low coupling. In this case that translates to avoiding a pet peeve of mine, namely intermingling code in the target language interspersed with the grammar. A good grammar in a clean syntax is documentation as well as code. Messing it up by interspersing actual code is annoying. But there's a more important reason: Any reasonably sized grammar gets complex to understand pretty quickly. Thus I don't want to have to rewrite it if I want to reuse the grammar in another target language. This implies not only expressing the grammar without mixing in code in another language, but also avoiding dependence on external code to change the workings of the grammar. For some languages such as C++ and Ruby that means the typical functionality of a parser generator needs to be extended (C++ for example needs the parser to know whether a given string is a type name or not in some cases, if I remember correctly, and Ruby has that awful - from a parsing perspective at least - syntax for multiple "here documents").
  • Clean syntax. I'm picky about syntax in all languages. I'm extra picky about it in parser specifications. I like BNF style syntax, but I even there I have very specific ideas about what looks good. For example I can't stand the ABNF syntax specified in RFC 2234. I find it outright painful to use "/" instead of "|" for "or". Yes, it probably is because I've gotten used to the pipe-character, rather than any kind of objective qualities, but it matters nonetheless. More important than looks, though, is that the syntax is concise.
  • I can't stand parser generators that don't make the input really terse. For most simple languages writing a terse recursive descent parser that is extremely readable can be done by implementing a tiny little reusable set of parser combinators (functions or objects that handle a tiny aspect of parsing, such as "or" and defer the rest to other objects, that can be combined by function- or object references or writing functions). If the input to the parser generator isn't terse, much of the advantage (certainly not all) goes out the window.

  • Easy to retarget. I really, really can't stand parser generators that can't easily be modified to target a new language. Most of the good ones like ANTLR and Ragel are fairly nice in that respect (and Ragel might, as far as I can tell, support not mixing in target language code too? I haven't looked enough at Ragel yet).
  • Supports multiple purposes One of my BIG issues with many parser generators is that they tend to support either generating push parsers or generating pull parsers or feed you events or a full parse tree, or only generating parsers etc. Many of them make it hard to for example generate syntax diagrams or tree walkers or document generation, or they do one or more of those, but in one way only and if your model doesn't fit perfectly to theirs, then you have to dig into the core code. It's by no means true for all of them, but it something I've run across time after time. This is a typical case where high cohesion and low couplings means something. Making the AST of the parser generator itself trivially accessible for plugins or filters so that other cools can benefit is a huge plus.
  • Good grammars are simple grammars Ruby's grammar for example scares the shit out of me in terms of all the horrendous edge cases. MOST Ruby usage is clean, and MOST of the time the parser do what you think it should, but it's a monstrous grammar when you actually start digging into it. It doesn't help that the parse.y file of Matz' Ruby interpreter is far more verbose than it needs be. Languages like Oberon have grammars that can fit on an A4 page. With large fonts. And comments. While a language like Ruby can afford to have a somewhat larger grammar, there's no excuse for making it as big as it is. But nor is there an excuse for making matters worse with a verbose parser generator, or one with a complex grammar. If the grammar of the parser generator takes up more than a page, I run away screaming. I want something simple and formal.
  • Generate simple code, or at least offer it as an option. I hate trying to debug grammars without being able to sensibly instrument (automatically or manually) the generated code and be able to read and understand it. Table driven generators like most Yacc/Bison style generators are particularly nasty, but most generators I've seen fail here. I like generators that at least give the option of creating simple, clean, recursive descent parsers, as recursive descent tends to be very simple for humans to understand compared to many of the alternatives. They are also simple to test and instrument, particularly if you can call directly in to specific rules.

A couple of years ago I wrote an experimental "parser assembler" and built a BNF to "assembler" generator and a VM to run it on. It was moderately successful as an experiment in that I was able to build a number of parsers from it very easily. Most BNF could be translated almost mechanically, and work, and at the same time the BNF parser generated a reasonably clean AST that would be simple to manipulate. The parser "assembler" made it highly portable, and made the output readable (with a little bit of effort, admittedly) and made the parsers compact, and I was toying with making another backend to generate x86 assembler directly. Because of the execution model, the parser assembly could be used both as pull or push parsers as desired, and that was what I liked the most about it.

But in retrospect I'm still not happy. The syntax was too messy, and the way it interacted with the "outside" wasn't optimal (you had to keep track of numbered events), and it didn't do anything to generate AST's or make writing tree-walkers simpler etc. So while it avoided a couple of typical pitfalls, and was tiny and didn't intermingle target language code with the parser, I still want something better.

I started on what I thought would be a cleaner grammar, and I believe I got to something that was reasonably nice. I came across it in my notes again the other day when I was cleaning up my latest compiler post. Consider it a strawman - it's not complete, and probably have stupid bugs, since the generator hasn't actually been implemented so I could test it on itself. This grammar defines the hypothetical parser generator syntax in (a subset of) itself:

grammar ParseGen

rules { start< :grammar> { "grammar" @name @rules } rules { "rules" "{" $rule* "}" } rule { @name ("<" @var ">")? "{" @expr* "}"? } expr { $or_expr } or_expr { post_expr<@left> ("|" or_expr<@right>)? } post_expr { $cut | $glue | $simple } simple { @primary ("<" @var ">")? ("?" | "*" | "+")<@modifier>? } primary { paren<@expr> | @keywords | @name | @string | @number | @set } parent { "(" $rule+ ")" } keywords { "ANY" | "EOF" } string { (('"' _ ([~\"]*)<$str> _' "') | ("'"_ ([~\']*)<$str> _ "'")) } var { ('@'|'$'|':') _ <@vartype> @name } number { @base10 | @base16 } name { ([a-zA-Z][a-zA-Z0-9_]*) } set { '[' ('~'? ("\\]"|[~\]])*) ']' ] cut { "!"< :cut > } glue { "_"< :glue > } base10 { [0-9]+ } base16 { 'x'_[0-9a-fA-F]+ }

WHITESPACE { 9 | 13 | 10 | ('#' [~\n]* 10) } }

The observant might notice that the syntax appears to steal a lot of elements from Ruby. That's not a coincidence. I like the way it looks, and it also allows me to use Ruby syntax highlighting to make it look good. I've tried not abusing it too much in terms of changing meaning around. Yes, I'm too lazy to figure out how to write my own syntax highlighter in elisp, so sue me.

Here are some notes on what the syntax is intended to achieve:

  • Mostly is a realtively typical EBNF "dialect" and should be fairly recognizable if you've looked at various grammars in EBNF dialects.
  • One of the things I hate about many grammars is the way they handle whitespace. I made the decision to make a "WHITESPACE" rule that by default is allowed everywhere unless expressly blocked with the "glue" operator "_". You can then trivially enough get "normal" behavior by setting the WHITESPACE rule to {} (nothing) and creating your own whitespace rule and use it where you want whitespace instead of using the glue operator where you don't want whitespace.
  • I want to be able to generate AST's and events driven parsers from the same thing. By default, I am assuming that the name of a rule is going to become either basis for an event or set of events, or a node in the tree. The exception is when a variable starting with "$" is present.
  • A variable starting with "$" indicates that the value returned will become the event or AST node, meaning that the current rule will not appear anywhere, and is for convenience only
  • Rule names or applications can be suffixed by a variable name enclosed in less than and greater than. If so, that name designates the name of the event or AST node generated if a rule name, or the name of the variable to be set in the event or node if a rule application.
  • If a rule application is prefixed with '$' or '@', then the rule name itself is used as the variable name. So "@var" is shorthand for "var<@var>"
  • If a rule application or variable name is prefixed with ":" then that indicates that the name itself will be returned as a symbol (or equivalent in the target language) if the rule matches, or nil/null/equivalent when it doesn't.
  • The "cut" operator, "!", designate a point beyond which backtracking causes an error. This isn't really needed in a language with only one symbol lookahead, but in this case there's no distinction between a separate lexer and the parser. It also doesn't hurt to be more flexible as it often simplifies the parser (at the cost of making it easy to write badly performing ones if you rely on a lot of backtracking). I find this a convenient way of adding error handling, by setting an error code at the cut point (though the current grammar doesn't support that). The "cut" operator is borrowed from Prolog.

The goal is to get a very simple, compact syntax while still retaining enough information to generate full syntax trees. I do aim to add some way of calling out to the target language to carry out some operations too, but I consider that a last resort. I do intend to look at ways of handling at least things like Ruby's "here documents" without resorting to callbacks.

I don't think I'll have time to write actual code for this for a long time, but I'd love comments on the ideas/thoughts above, and I'll definitively keep thinking about it.


Comments

2008-04-09 19:27 UTC
gojo
Hi Vidar,

I wonder if you know of Jeff the Out Spoken blog. He has written a wonderful parsing framework known as Gazelle.

2008-04-10 00:06 UTC
Mark
I agree with your dislike of intermingling implementation language code with the grammar. Bison/Yacc are excruciatingly ugly because of this. Seems like 90% of languages share many, many language features. Especially the C-like languages: assignments, relational expressions, expressions, function calls, statement blocks, etc... If you're writing these types of parser, why not abstract away the boiler plate? I personally like the Antlr way organizing grammar rules with source code. Unfortunately, although Antlr is said to generate code for many languages, support for anything other than Java is a bit spotty. Having said all that, I don't have any comments on your generator in particular.
2008-04-10 01:01 UTC
Hi Vidar,

Hi, I'm "Josh the Outspoken" that gojo linked to above -- thanks for the kind words gojo! (it's Josh, not Jeff btw). I think you'll love my in-progress parsing framework Gazelle -- I am basing its design on a lot of the same principles that you outline here, #1 of which is "keep actions out of the grammar, so that grammars are reusable."

Check out the in-progress Gazelle manual to see more of the details of where I'm going with this.

You have a lot of interesting ideas! There are a lot of issues I haven't resolved yet, and this blog post is food for thought. The runtime portion of Gazelle is getting more and more complete, but the language is still really up in the air. I like your ideas for integrating both AST-building and event-based parsing into a single grammar.

2008-04-10 13:09 UTC
Vidar Hokstad
gojo and Josh, Gazelle looks interesting - I'll take a look at it over the weekend. Thanks for the pointer.

I see you're going for a fairly complex algorithm - I must admit I've tended to go very much in the opposite direction there. I've written three parser generators before, one generated recursive descent parsers, one used combinators that in effect made up a recursive descent parser and my third one effectively encoded DFA in the form of bytecode for a simple virtual machine (the "parser assembler" referred to above). The common thread is that I don't like "runtimes" that result in parsers that aren't human readable.

I haven't looked at what Gazelle generates, so I'm not sure if that's an issue.

(One comment, though, scannerless parsing is far older than the paper on scannerless GLR parsing cited on the Gazelle pages. I know I'd used it before that, and the paper itself cites a reference going back to '89)

Mark,

Even though most languages look quite similar, even the "boilerplate" is often different in enough critical places (slight differences in operator precedence, binding etc.) that it's hard to reuse much beyond very simple rulesets.

ABNF actually does define a tiny core of rules, but they're limited to basic lexical classes (hex digits, decimal digits, alphabetic characters, whitespace past linefeed etc.). There are certainly more simple rules like that - a huge number of languages use a word class consisting of [A-Za-z][A-Za-z0-9_]* (letter followed by zero or more letters, digits or underscores) for example, but the problem with sharing rules that simple is that they're that simple - it's often easier to include the rule explicitly than have to verify whether the rule you want to resue matches.

As for ANTLR, I'd have to first overcome my strong aversion to Java, but all the example grammars I've seen for ANTLR just looks ugly to me, and full of semantic actions to boot...

I also strongly dislike treating separating the scanner/lexer from the rest of the parser - I've done it a couple of times in the past, but ever since I did my first scannerless parsers 12-13 years ago or so I've never looked back... That goes deeper than just separating the grammar parts - I've come to believe the syntax for the parser generator doesn't even have to be different, beyond possibly some control over what is returned (hence the '@' vs '$' in my strawman above)

2008-04-10 15:49 UTC
Vidar, I wouldn't say that Gazelle's algorithm is particularly complex: in most ways it's like what ANTLR is doing. And as far as not liking runtimes, Gazelle's approach sounds like it is very close to what you described as a "parser assembler" -- instead of "runtime" you could think "VM." It's the same thing. I essentially encode all the state machines into a bytecode file, which is then loaded by the runtime/VM. The output is human-readable in the sense that Gazelle can dump graphics for all the DFAs it generated -- here's an example of what that dump currently looks like for JSON:

http://www.reverberate.org/gazelle/json-dump/

Using the runtime/VM approach is the absolute key to getting both good performance and being reusable from any language. If your approach is to always generate code in the target language, your parsers are going to be really slow in languages like Ruby, and one of the big motivations for Gazelle is to bring fast parsing to slow languages.

The runtime/VM also will have a JIT (the same as what you call "another backend to generate x86 assembler directly"), which will be the key to making Gazelle parsers really fast, while still not forcing a compile step for dynamic languages like Ruby.

Thanks for the catch about Scannerless parsing -- I'll have to update that reference.

2008-04-10 21:42 UTC
Vidar Hokstad
Josh,

I love the dump page - that's really cool. A debug tool like that does take away a lot of my concern with a non-readable output. It's certainly far from the "write only" output of Bison/Yacc.

Re: the algorithm, "what ANTLR is doing" is still far more complex than say a simple recursive descent parser. It's not that it's bad - I can certainly see the appeal. I think it's more a matter of what the goal is.

My interest is generally in writing compilers, not other parsing, and hence one of my big goals is to get something self hosted as early as possible. In that scenario it makes sense for me to get output that is trivially simple, and that doesn't require me to hand translate lots of code.

Either I'll need to retarget the output or I'll need to translate a runtime, unless I'm generating something I can seamlessly link in but that's not always trivial.

A generator capable of targeting a simple recursive descent or similar can be retargeted extremely simply (the generator I mentioned that output recursive descent parsers had perhaps about 100 lines at most of language specific code in it - I wish I still had the source, but it was written in Amiga E so it would be mostly for nostalgic reasons... :) ). Of course, with a small enough runtime that can be translated fairly simply too, so it doesn't _have_ to be a problem.

Performance of a language like Ruby doesn't worry me again, because my goal would typically be a self-hosting compiler. If I was to target a Ruby subset with a compiler, the main reason to do so would be to improve performance, and so if the parser was initially slow it would be a temporary nuisance, not a deal-breaker.

To create fast parsers for (currently) slow languages is certainly a worthwhile goal in itself.

2008-04-11 05:08 UTC
Vidar,

I'm glad you like the dump!

Actually, ANTLR also generates recursive descent parsers. And what I'm doing is exactly equivalent to using a recursive descent parser; the only difference is that I explicitly model the control-flow graph for each rule instead of writing imperative code that implicitly models it. That is to say, if your recursive-descent parser has a function like:

parse_expr() {
  parse_number()
  while(lookahead(1) == "+") {
    parse("+")
    parse_number()
  }
}

...I just create a two-node graph with a cycle in it instead (I'd try to show the graph in ASCII-art, but I'm not sure if it would survive intact :)

You certainly could write a program that took Gazelle bytecode as input and spit out a recursive descent parser in any language, because Gazelle's byte-code is essentially a recursive descent parser encoded as a bunch of NFAs and DFAs.

I don't think I quite understand your concern about making things self-hosting. I don't understand what you mean by 'self-hosting' in this context.

2008-04-10 22:40 UTC
Oh shoot, I forgot to close my "pre" tag...
2008-04-11 05:21 UTC
Vidar Hokstad
Fixed the pre.

self-hosting == can compile itself.

Generally one of my first goals when writing a compiler is to get it to the stage where it can build itself. For some languages it's easy to meet C calling conventions, and so linking to something that requires a runtime in another language is simple enough, but that's not always an easy option or something you want to mess with early on - hence my comment about either having to translate the runtime or generate a parser in the target language.

Post a Comment

Basic HTML allowed.

About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Recent posts to my blog

Categories

StumbleUpon My link page

(Links I have stumbled and like)