2008-04-08 11:49 UTC Strawman for a new parser generator
- Low coupling. In this case that translates to avoiding a pet peeve of mine, namely intermingling code in the target language interspersed with the grammar. A good grammar in a clean syntax is documentation as well as code. Messing it up by interspersing actual code is annoying. But there's a more important reason: Any reasonably sized grammar gets complex to understand pretty quickly. Thus I don't want to have to rewrite it if I want to reuse the grammar in another target language. This implies not only expressing the grammar without mixing in code in another language, but also avoiding dependence on external code to change the workings of the grammar. For some languages such as C++ and Ruby that means the typical functionality of a parser generator needs to be extended (C++ for example needs the parser to know whether a given string is a type name or not in some cases, if I remember correctly, and Ruby has that awful - from a parsing perspective at least - syntax for multiple "here documents").
- Clean syntax. I'm picky about syntax in all languages. I'm extra picky about it in parser specifications. I like BNF style syntax, but I even there I have very specific ideas about what looks good. For example I can't stand the ABNF syntax specified in RFC 2234. I find it outright painful to use "/" instead of "|" for "or". Yes, it probably is because I've gotten used to the pipe-character, rather than any kind of objective qualities, but it matters nonetheless. More important than looks, though, is that the syntax is concise.
I can't stand parser generators that don't make the input really terse. For most simple languages writing a terse recursive descent parser that is extremely readable can be done by implementing a tiny little reusable set of parser combinators (functions or objects that handle a tiny aspect of parsing, such as "or" and defer the rest to other objects, that can be combined by function- or object references or writing functions). If the input to the parser generator isn't terse, much of the advantage (certainly not all) goes out the window.
- Easy to retarget. I really, really can't stand parser generators that can't easily be modified to target a new language. Most of the good ones like ANTLR and Ragel are fairly nice in that respect (and Ragel might, as far as I can tell, support not mixing in target language code too? I haven't looked enough at Ragel yet).
- Supports multiple purposes One of my BIG issues with many parser generators is that they tend to support either generating push parsers or generating pull parsers or feed you events or a full parse tree, or only generating parsers etc. Many of them make it hard to for example generate syntax diagrams or tree walkers or document generation, or they do one or more of those, but in one way only and if your model doesn't fit perfectly to theirs, then you have to dig into the core code. It's by no means true for all of them, but it something I've run across time after time. This is a typical case where high cohesion and low couplings means something. Making the AST of the parser generator itself trivially accessible for plugins or filters so that other cools can benefit is a huge plus.
- Good grammars are simple grammars Ruby's grammar for example scares the shit out of me in terms of all the horrendous edge cases. MOST Ruby usage is clean, and MOST of the time the parser do what you think it should, but it's a monstrous grammar when you actually start digging into it. It doesn't help that the parse.y file of Matz' Ruby interpreter is far more verbose than it needs be. Languages like Oberon have grammars that can fit on an A4 page. With large fonts. And comments. While a language like Ruby can afford to have a somewhat larger grammar, there's no excuse for making it as big as it is. But nor is there an excuse for making matters worse with a verbose parser generator, or one with a complex grammar. If the grammar of the parser generator takes up more than a page, I run away screaming. I want something simple and formal.
- Generate simple code, or at least offer it as an option. I hate trying to debug grammars without being able to sensibly instrument (automatically or manually) the generated code and be able to read and understand it. Table driven generators like most Yacc/Bison style generators are particularly nasty, but most generators I've seen fail here. I like generators that at least give the option of creating simple, clean, recursive descent parsers, as recursive descent tends to be very simple for humans to understand compared to many of the alternatives. They are also simple to test and instrument, particularly if you can call directly in to specific rules.
grammar ParseGen rules { start< :grammar> { "grammar" @name @rules } rules { "rules" "{" $rule* "}" } rule { @name ("<" @var ">")? "{" @expr* "}"? } expr { $or_expr } or_expr { post_expr<@left> ("|" or_expr<@right>)? } post_expr { $cut | $glue | $simple } simple { @primary ("<" @var ">")? ("?" | "*" | "+")<@modifier>? } primary { paren<@expr> | @keywords | @name | @string | @number | @set } parent { "(" $rule+ ")" } keywords { "ANY" | "EOF" } string { (('"' _ ([~\"]*)<$str> _' "') | ("'"_ ([~\']*)<$str> _ "'")) } var { ('@'|'$'|':') _ <@vartype> @name } number { @base10 | @base16 } name { ([a-zA-Z][a-zA-Z0-9_]*) } set { '[' ('~'? ("\]"|[~\]])*) ']' ] cut { "!"< :cut > } glue { "_"< :glue > } base10 { [0-9]+ } base16 { 'x'_[0-9a-fA-F]+ } WHITESPACE { 9 | 13 | 10 | ('#' [~\n]* 10) } }The observant might notice that the syntax appears to steal a lot of elements from Ruby. That's not a coincidence. I like the way it looks, and it also allows me to use Ruby syntax highlighting to make it look good. I've tried not abusing it too much in terms of changing meaning around. Yes, I'm too lazy to figure out how to write my own syntax highlighter in elisp, so sue me. Here are some notes on what the syntax is intended to achieve:
- Mostly is a realtively typical EBNF "dialect" and should be fairly recognizable if you've looked at various grammars in EBNF dialects.
- One of the things I hate about many grammars is the way they handle whitespace. I made the decision to make a "WHITESPACE" rule that by default is allowed everywhere unless expressly blocked with the "glue" operator "_". You can then trivially enough get "normal" behavior by setting the WHITESPACE rule to {} (nothing) and creating your own whitespace rule and use it where you want whitespace instead of using the glue operator where you don't want whitespace.
- I want to be able to generate AST's and events driven parsers from the same thing. By default, I am assuming that the name of a rule is going to become either basis for an event or set of events, or a node in the tree. The exception is when a variable starting with "$" is present.
- A variable starting with "$" indicates that the value returned will become the event or AST node, meaning that the current rule will not appear anywhere, and is for convenience only
- Rule names or applications can be suffixed by a variable name enclosed in less than and greater than. If so, that name designates the name of the event or AST node generated if a rule name, or the name of the variable to be set in the event or node if a rule application.
- If a rule application is prefixed with '$' or '@', then the rule name itself is used as the variable name. So "@var" is shorthand for "var<@var>"
- If a rule application or variable name is prefixed with ":" then that indicates that the name itself will be returned as a symbol (or equivalent in the target language) if the rule matches, or nil/null/equivalent when it doesn't.
- The "cut" operator, "!", designate a point beyond which backtracking causes an error. This isn't really needed in a language with only one symbol lookahead, but in this case there's no distinction between a separate lexer and the parser. It also doesn't hurt to be more flexible as it often simplifies the parser (at the cost of making it easy to write badly performing ones if you rely on a lot of backtracking). I find this a convenient way of adding error handling, by setting an error code at the cut point (though the current grammar doesn't support that). The "cut" operator is borrowed from Prolog.
Comments - Newest first
parse_expr() { parse_number() while(lookahead(1) == "+") { parse("+") parse_number() } }...I just create a two-node graph with a cycle in it instead (I'd try to show the graph in ASCII-art, but I'm not sure if it would survive intact :) You certainly could write a program that took Gazelle bytecode as input and spit out a recursive descent parser in any language, because Gazelle's byte-code is essentially a recursive descent parser encoded as a bunch of NFAs and DFAs. I don't think I quite understand your concern about making things self-hosting. I don't understand what you mean by 'self-hosting' in this context.Post a Comment