Vidar Hokstad V2.0

Home Blog

Welcome! This is an ARCHIVED page from my old blog

In addition to taking a look at the entry below, why don't you also take a look at some other recent entries:


If you like what you see, please also sign up to the RSS feed

2005-04-23 15:39 UTC Turtle progress and parser assembler

Turtle progress and parser assembler

After my success with the BNF generator yesterday, I just had to redo my Turtle parser based on the BNF in the Turtle spec. Thanks to the clarity of the grammar I rewrote the parser in less than an hour. Though "rewrote" is exaggerating - I cut and pasted the BNF from the spec and massaged it a bit (for one, I've realised that I messed up the priority of the "or" operator so I need to paranthesise bits - I'll have to fix that) and added the appropriate triggers and start adding error handling.

It seems to pass all the tests referenced in the spec now, though I've taken one or two shortcuts with character set handling that needs to be fixed.

However doing the Turtle parser now and the XML parser last night made me think of one important improvement to make, though it will cause the BNF generator to grow quite a bit:

A lot of the error handling can be added automatically. The key is that an error in a recursive descent parser should generally be added in the first position you know you don't have any further valid paths. In other words, you "only" need to create a graph and do some basic path analysis to figure out that in Turtle for instance, if you've successfully swallowed "@prefix" no other rules can handle the input if you were to backtrack back to the "@".

As a result, you should give an error message immediately if the next step after "@prefix" doesn't match expectation, or if the remainder of the rule as a whole fails.

At the same this this analysis allows significant optimisations of the code: By factoring out the common parts of rules, they can be combined and the code paths merged.

This is essentially what a good NFA/DFA generator will do, and when you look at it, the parser assembler isn't that far from a DFA converted into the steps to execute. The fact that it's human readable, portable and easy to customise or hand write code for is the main advantage as I see it, not the functionality of the VM itself.

As it stands, though, the BNF generator generated Turtle grammar takes only 30 more bytes than the hand written one that I've just retired, and that will have to do for now while I experiment some more (the bug list for the BNF generator and parser assembler is already getting quite long, and it will get longer, but so far all of them have reasonable workarounds so I'm just recording them for now).

All in all, the Turtle parser C++ and BNF combined is now down to just 264 lines of code (without generic support code like the VM at 369 lines), which I'm quite pleased with. To do something useful (like building an efficient RDF graph) will obviously require quite a bit more.

XML Parser with my parser assembler

One of the best ways of testing a new translator/compiler/interpreter is to feed it complex input. So I set about creating an XML parser.

It took me less than 4 hours to get a more-or-less working SAX parser for the complete XML grammar, and more than half of that was troubleshooting the parser assembler (I found a few problems, and a few places I need to make enhancements).

Now, this is what prompted me to look into a push parser architecture in the first place.

The XML grammar was essentially just the grammar from the W3 XML specification with some minor changes to match my BNF variation and add triggers for semantic actions.

It's not 100% complete yet - for instance I don't restrict all the character classes correctly - but it's fairly close. Also, I don't handle any encodings properly yet, but the parser architecture makes that fairly easy to do: I need to add support for byter order markers, and then place a trigger for the encoding attribute and let the trigger handler plug in an object to filter the plugin for the right encoding (providing it's one I want to support).

The entire parser assembler is prepared to handle full unicode characters anyway - it's all 32 bit characters internally.

I love my new toy :)

April 22, 2005

My BNF -> parser assembler translator works!

As part of my parser assembler work, I've gotten my BNF to parser assembler translator to translate it's own BNF to parser assembler equivalent to it's own parser... In other words, it's successfully bootstrapped.

There are still some holes to plug, and the code is a real mess, but this means that the BNF generator is somewhat usable... Saves a lot of work - the BNF for the BNF parser is around 60 lines including whitespace, while the generated "assembly" is 340 lines. The 60 lines include trigger annotations and some error handling. That brings the total for the BNF tool to 600 lines (BNF + C++) - not bad, though it'll easily grow a few hundred lines as I clean up the code and turn it into a bit more polished application (currently it only reads from standard in and writes to standard out, for instance...). There's also certainly room for generating more optimal "assembly".

If I get time I'll post the code this weekend or next (depending on whether I actually get myself to read for my exam on Monday).

April 21, 2005

Parser assembler update

My BNF to parser assembler translator is coming along nicely, allthough slower than I had hoped. My plan is to finish most of it within a week or so, and put together a webpage with the source code etc. the following weekend. Then it'll be time to go back and test it properly with the Turtle grammar before I move back on to my XML parser.

April 17, 2005

Validating a parser generator

A large part of validating a parser generator, such as my BNF to parser assembler tool is ensuring the generated parser is equivalent to a well tested hand written parser.

Generally, my approach is to start bringing the hand written parser for the tool into line with a version of the parser for the tool specified in the tools own language as soon as I possibly can.

The reason is that by being methodical and writing the parser in the same style you expect to use, you quickly identify problems with the generation, but you also make it trivial to compare the two.

Even when it makes hand writing the parser harder, it's worthwhile, because at the stage where you've worked for a long time on the code generation, you've probably debugged the hand written parser quite a lot, and making the generated parser match so well a basic text comparison find only minor differences reduces the validation process to a simple question:

Do you trust the testing you've done of your hand written parser?

Parser generation from BNF

I've been slow to post the last few days because I've been working on my parser assembler, and in particular a tool to generate assembly from BNF. I ran into two problems that's been slowing me down:

- I'm not quite comfortable with the trigger mechanism yet - the issue is that it's hard to gather up enough data so that a trigger is "self sufficient" in that I don't need to collect data from multiple trigger events before taking an action. I have a potential solution that I'll outline below.

- The second is code generation. I initially wanted to avoid building a parse tree from the BNF entirely, and just output assembly right away. Unfortunately that turned out to be extremely painful, so I've resorted to building a tree per BNF production, which seems to work. I now want to simplify the code - the code generation is about 500 lines. While that isn't much, I think I can do quite a bit better without losing much in terms of performance. The tree building here is directly related with the first problem...

One of my previous parser generators had a similar mechanism to my current one, but initially it built parse trees instead of triggering events. I think a hybrid approach might be better. The tree building code worked so that every rule would cause data to get generated. Just as in the BNF example I posted it allowed me to store the results in "slots". However my previous generator used named slots and was typed - returns from built in rules would generate strings, returns from productions would be structures that contained any slots used in that production.

The result was that you could easily build a full parse tree. However one of the reasons I started looking at a push parser was for the flexibility, and much of that is lost once you constrain yourself to building a full parse tree.

The reason just the trigger mechanism is limited is because I either have to pull the store instructions high up into the grammar, in which case it's hard to pick up just the pieces I want, or I need a way of distinguishing, say, multiple strings that would normally be stored in the same slot.

The parse tree approach fixes that by generating a structure so that the store instructions are local to each production - storing data into the structure tht will be returned instead.

An approach that might help is to allow this mechanism, but keep the triggers, so that the triggers still clear out the storage. That way I can still parse huge files, as well as pull the triggers far enough up in the grammar that I can easily process everything without having to coalesce data provided via multiple triggers.

April 12, 2005

BNF -> parser assembler

Ok, so tonight I set myself a goal of preparing an initial version of a BNF grammar with some extensions intended as a starting point for a tool to convert BNF into assembly for my parser assembler (I need to think of a proper name for it soon...). Here is the BNF, and some snippets of how I'm bootstrapping it. The hand converted grammar now parses the whole BNF for itself.

Here's my BNF grammar:

; %triggers define a list of symbolic names for the triggers that the VM will call %triggers { LP = 1 . RP = 2 . CUT = 3 . CALL = 4 . EXP = 5 . TDEF = 6 . PRODL = 7 . PRODR = 8 . RULE = 9 . SUBL = 10 . SUBR = 11 . ORL = 12 . ORR = 13 . STORE = 14 . EXP = 15 . }

; '!' represents my "cut" operator. It breaks the VM with the string argument ; as an error if the remaining part of the rule fails bnf ::= triggers? production* !"EOF Expected" EOF . triggers ::= "%triggers" ws* "{" ws* tdef* "}" ws* . tdef ::= name ws* "=" ws* number ws* "." ws* /TDEF/ . production ::= name /PRODL/ ws* "::=" (ws* rule)* ws* "." ws* /PRODR/ . rule ::= sub_expr /RULE/ .

sub_expr ::= or_expr ws* ("-" ws* /SUBL/ sub_expr /SUBR/)? . or_expr ::= store_expr ws* ("|" ws* /ORL/ or_expr /ORR/)? . store_expr ::= post_expr ws* ("->" ws* const /STORE/ )? . post_expr ::= cut | call | (primary_expr ws* (('?' | '*' | '+') -> #7)? /EXP/) .

primary_expr ::= paren_expr | keywords | name | string | const | set .

paren_expr ::= "(" /LP/ ws* !"Expected at least one rule inside parentheses" rule+ ws* !"Expected )" ")" /RP/ .

cut ::= "!" -> #7 string /CUT/. call ::= "/" -> #7 (number|name) "/" /CALL/. const ::= "#" number .

; --- "Tokens" string ::= ('"' [~"]* -> #6 '"') | ("'" [~']* -> #6 "'") .

keywords ::= ("ANY" | "EOF") -> #1 .

name ::= ([a-zA-Z][a-zA-Z0-9_\-]*) -> #2 .

number ::= (base10 | base16) . base10 ::= [0-9]+ -> #3 . base16 ::= 'x' [0-9a-fA-F]+ -> #4 .

set ::= '[' ('~'? (any - ']')*) -> #5 ']' .

ws ::= ' ' | #9 | #xD | #xA | ';' (ANY - #xA)* #xA .

An excerpt of the parser assembler translation. I've tried making it match what I expect to make the BNF tool generate reasonably well, but not exactly:

:bnf kln $ws jsr $triggers kln $production cut "EOF Expected" eof ret

:triggers req "%triggers" kln $ws req "{" kln $ws kln $tdef req "}" kln $ws ret

:tdef req $name kln $ws req "=" kln $ws req $number kln $ws req "." kln $ws trg #6 ret

:production req $name trg #7 kln $ws req "::=" kln $production_1 kln $ws req "." kln $ws trg #8 ret

:production_1 kln $ws req $rule ret

:rule req $sub_expr trg #9 ret

:paren_expr req "(" trg #1 kln $ws cut "Expected at least one rule" req $rule kln $rule kln $ws cut "Expected ')'" req ")" trg #2 ret

:sub_expr req $or_expr kln $ws cmp #'-' bne $sub_expr_1 eat cut "Expected expression" kln $ws trg #10 req $sub_expr trg #11 :sub_expr_1 ret

Don't have much time for explanations right now, but I'd be happy to answer questions.

Semantic Web for Extending and Linking Formalisms

I came across this paper by chance while searching for material on Z notation and RDF. It's an interesting read on the use of RDF for expressing languages used for formal methods.

While reading it, something occured to me (I'm sure it's not an original idea, but I haven't seen any implementations): It would be great to generate an RDF representation of source code. I'm tempted to spend some time considering if there's an easy way of bolting RDF generation support on to my parser assembler.

It would enable a rapidly growing number of tools that understand RDF to manipulate the code, and would be an interesting way of achieving some of the same as GCC-XML.

With RDF mappings for UML and Z or similar formal languages, I'm sure someone could come up with interesting data mining tools based on combining various data (specifications, models, source) and cross referencing extracted data...

April 09, 2005

Turtle parser update

My Turtle parser is getting along quite nicely. The parser bytecode itself still weighs in at less than a KB, and I've built a very simple RDF model and code to generate triples.

As it stands it now passes most of the pre-requisite tests specified in the Turtle spec. Once it passes the full set, and I'm done with my damn exam, I think it's time to put up a page with the source code and some documentation.

The RDF model is in no way suitable for production use, but I'll abstract out the interface used by the parser to add triples to it, so that it'll be easy to replace it with an adapter to a proper RDF model implementation.

The parser is still fairly messy, so after that I want to go back and improve the assembler to allow local labels and constants, and see whether I should consider some changes to the opcodes. Then I have three candidates for what to continue with: either a BNF parser to generate assembly for my VM, extending the Turtle parser to full N3 and write code to generate assembly from the N3 BNF representation of the N3 grammar, or start on a XML parser.

I suspect I'll start with the BNF tool, as that would massively simplify the XML parser in particular. I've done BNF parser generation tools before, but never quite liked the approach I chose to generating events - I feel the trigger mechanism I've chosen now is fairly nice. The only thing I might want to do is make it easier to pass additional data from the parser to the trigger callback.


April 07, 2005

Parser assembler, RDF and Turtle

I've kept working on my parser assembler, but it's moving slow thanks to actually having a day job to do... However tonight I mostly finished a parser for Turtle - a subset of N3 that allows convenient specification of RDF triples.

It ended up at about 1KB of bytecode, which is quite reasonable. I still need to add some error handling, and then I can start writing some code to actually do some useful stuff with it.

The current code uses an instruction to trigger callbacks from the VM on specific events, and it's proven to work very well. I'm still considering whether or not to add higher level constructs, or possibly replacing a couple of the current instructions, but I want to get more experience with it first.

The current Turtle parser just trigger on each subject, verb and object, and on prefix directives.

April 04, 2005

Error reporting during parsing

While working on my assembler/vm for parsing one of the main problems I'd overlooked (which is silly, considering how many parsers I've written to handle this problem) was error reporting. As it turns out, this is a fairly straight forward problem.

I've written parser generators in the past that generated a high level language parser from BNF with some extra annotation (my first attempt was written in AmigaE about 10-12 years ago). The earliest approach I used was a way of marking where to stop backtracking.

A recursive descent parser with limited lookahead have well defined points where you will have an error condition: anywhere where you exceed the lookahead, backtracking past the lookahead threshold means you've encountered a parsing error.

However one of the things I don't want to do is limiting a user to a specific amount of lookahead. At the same time I want it to be easy to limit it.

The solution is to introduce an operator that marks a "point of no return": Once reached, attempts to backtrack past it will trigger an error.

Incidentally, this is an idea I got from Prolog. Prolog is a logic programming language that depends on evaluating rules, possibly a significant tree of possibilities, and backtrack if a path fails. To cut down processing, and direct the search for a solution, it has something called the "cut operator". The cut operator "prunes" the search tree by marking a similar type of point of no return. The cut operator tells Prolog that if it has reached that point, it shouldn't consider further rules.

A recursive descent parser is a, usually hard coded, decision tree, and using a cut operator saves you from evaluating all possible alternatives when you know they can't possibly work. A typical example is a parser that accepts only numbers and alphabetic names - once you've matched the first character you KNOW the other rule must fail, so it makes no sense to allow the parser to try to evaluate it. In that simple case it doesn't make much difference, but if you have hundreds of rules, it can really make a significant difference.

More than that, however, it makes it easy to flag what went wrong - by attaching an error code or a string to my cut operator, I have a simple mechanism for flagging a human readable error.

I'm letting the VM track column and line numbers, and I am considering also setting a default error string for each label jumped to, so that it becomes simple to automatically generate reasonable standard error messages whenever I insert my cut operator.

So, is my number of operators blowing up? Not really. Over the original instruction set I've added a NOP (no operation) operator, a BLT (Branch on Less Than) and BGT (Branch on Greater Than) operators. I've also added a range comparison operator, but I'm of two minds about it - it can easily be implemented in the assembler only and be made to emit multiple opcodes. That's the approach I've taken with quite a few other operators I've added to the assembler - including "KLN" for "kleene star" (matching a production 0 or more times).

For the most part I think I will try to keep the VM as minimal as possible - at least until I have some profiling data for some test parsers to use for determining what operators truly deserve to be optimized.

I spent a sizable chunk of the weekend (when I should have been reading for my exam) cleaning up the code, and it's getting time to post it, I think, even though the opcodes and assembler syntax will remain unstable for some time. Stay tuned!

March 29, 2005

Parser assembler update

An update on my assembly language for parsing. I now have a parser for the assembly language written using the bytecode it will generate that will parse the full syntax. Total bytecode size?

About 400 bytes.

(*UPDATE*: Ok, so it ended up at 507 bytes. Still pretty good and it will be smaller once I've added some of the enhancements to the instruction set. Though I'll have to add some error reporting - that will probably bring it up to around 1K)

Apart from the additional features I've mentioned earlier I will probably need some additional error handling functionality as well, in order to make it easy to give proper error messages.


March 28, 2005

More on my assembly language for parsing

As I wrote previously I'm experimenting with an 'assembly language' for parsing.

I'm about halfway through writing the parser for the assembler in itself and temporarily hard wiring it into the virtual machine to bootstrap it.

Lessons learned so far include: I really DO want a couple more high level instructions. I've only add BLT and BGT (Branch on Less Than and Branch on Greater Than) to the instruction set so far to make handling ranges easier, but I realise that I really want to create expanded versions of CMP, TRY and REQ (see the earlier entry for descriptions) that will handle ranges instead of single values, as it will dramatically simplify some rules.

The kleene star and one-or-more instructions I hinted at would also have proved very useful, and will certainly be added. All in all I want to focus on two groups of instructions: High usability low level functions (i.e. they manipulate the state of the vm directly) and 100% "composable" functions - that is, instructions that can be composed entirely of low level functions. In my experience that makes implementation so much easier, and maintaining that separation means that you get good coverage of low level instructions so that any high level functionality you leave out is likely to be composable.

Note that I'm NOT aiming for turing completeness, though it wouldn't be strange if it happens by chance. The goal is a very restricted language only intended for parsing.

I've thought quite a bit about it over the weekend, and in particular whether or not it is likely to be particularly useful, considering the availability of similar tools.

What I've realised, though, is that given the complexity of parsers, it's very rare for parser generators to meet my needs, and it's very rare for me to be happy with generated parsers without manual modification. However modifying lex and yacc parsers manually is something you'd probably not want to try. Modifying an assembly like language, however, is fairly straightforward (that's the old 6510 and M68000 demo programmer in me speaking). I've actually written a whole compiler in M68k assembly before, and so this is bringing back memories - M68k assembly was actually quite well suited at least for the parsing aspect.

The advantage of a heavily restricted VM where programs will mostly use relatively high level constructs is that it will be easy to make reasonably well performing interpreter for it. Given the current size (ca. 300 lines of C++) it is reasonable to assume that a competent programmer could port it to any language in a day or to, and instantly get access to working versions of any grammar translated into the language. This is one of the areas where most parser generators become a burden.

Another thing I want to try is building a relatively sophisticated BNF -> parser "compiler". Most of the constructs I've added are very well suited for translating to from BNF, as I'm used to using BNF as a starting point for all parsers I write. There are quite a few things such a compiler could do very easily with the instruction set I'm working on to generate a much faster parser than what I'd write by hand (because if I write by hand I'd be aiming for simplicity...):

- inlining productions.
- "lifting" shared or "almost shared" initial terms in OR groups. If you write a grammar where a bunch of productions all start with a character, and use them in an or expression like this: (foo | bar | baz) it makes sense for the compiler to generate a single check for a range of characters to allow. This is fairly trivial to add.
- merging of similar subtrees. That is, if I have productions in an or-expression that expect the same initial characters it can often be fairly easy to check for the initial characters separately.

All of this serves to bring the resulting assembly closer to expressing an NFA for the language, but just expressing it in bytecode instead of a table of transitions. The advantage is that it'll be fairly easy to make it possible to turn these optimisations off to get readable assembly to look at to debug the parser.

Another advantage I see from this approach is exactly the debugability - I already have a "tracing mode" for my VM that will output the exact instruction stream as executed, and it makes it so much easier to diagnose problems.

The last advantage I see is size. Expat (a C XML SAX parser) on my system is about 126k. The full VM executable (meaning it's also dragged in a lot of library code) plus most of the assembler, and debug output and without optimization is currently 38K. A C version would probably weigh in at significantly less. Clean it up, and add bytecode for an XML parser - given my experience with it so far I think it would weigh in at about 5-6KB and conversion functions for UTF-8/UTF-16 and latin-1 I think it should be easy to get a full XML parser into less than 40K. Quite possibly less than 30... Looking forward to trying.

March 26, 2005

An 'assembly' language for parsing

I've mentioned my forays into push parsers previously. But after looking at that approach, I realised I needed a bit more flexibility. So I got the idea of designing a tiny assembly like language for building push parsers with. This is analogous to building an NFA or DFA, but with more operations, and the potential for being much easier to deal with manually.

I ended up with a tiny set of core commands, and I plan to add a few more convenience commands implemented in terms of the core. Here is the core command set I came up with:

RET -- Return with status flag set to true
BEQ -- Branch if status flag == true
BNE -- Branch if status flag != true
STO -- Store token buffer in numbered slot
TRG -- Trigger an event (used to "plug in" native code to build a parse tree)
ERR -- Return with status flag set to true. Unget any characters retrieved in subroutine, and revert token buffer to pre-subroutine state
CLR -- Clear token buffer
JSR -- Jump to subroutine. Push unget buffer and token buffer to stack.
JMP -- Jump
CMP -- Compare to input character, and set status flag accordingly. Yields if no input.
EAT -- "Eat" input character and add to token buffer and unget buffer. Yields if no input.

In addition I have so far implemented the following two compound commands:
TRY char -- CMP ch; BNE n; EAT; n: ...
REQ char -- CMP ch; BEQ n; ERR; n: EAT;

I plan on adding a few more, including a "range compare" and possibly equivalent TRY/REQ variations, as well as a kleene star command, and possibly a "one or more" (i.e. "foo foo*") operator.

As far as I can see it would be trivial to implement a "compiler" to compile BNF into this language, and the VM is less than 300 lines of C++ (including lots of debug output). It would be trivial to JIT or build a code generator spitting out C/C++ or another language as well.

I plan to spend some time writing a basic assembler for it first (tired of adding calls to build it inline without any relocation/label support...). Then we'll see.

So far I like this approach a great deal better than lex/yacc or other compiler construction toolkits I've seen.

Simple push parsers

I've been toying with a simple table driven push parser class today. Normally I write my parsers as recursive descent either with or without a separate lexer stage.

However I've already disliked pull parsers because it's inflexible - the parser and not you control the amount of IO. As such it easily forces you towards multithreading even when you could've easily multiplexed the application logic.

A push parser by contrast need to work only on the input fed to it. A common way of doing that is in the form of a Nondeterministic finite automaton or a deterministic finite automaton, or similar techniques such as a pushdown automaton, which all can easily be designed to work with single character inputs.

However, I wanted a class that let me easily handwrite parts, so what I ended up with was the following:

A table driven parser with a table per production. For each entry in each table I store a flag to indicate if it's optional, a pointer to another table, and a pointer to an "acceptor object".

The "acceptor" is simply a simple class that provides a method to check whether or not it will accept the current character, and whether or not or not it's reached the end. It allows me to simply customize behaviour, and dramatically cuts down on states by letting me define generic constructs such as "recognise this string".

A simple parser class push states onto a stack until it reaches the first state with no pointer to another production. Once an acceptor is "done", the parser moves to the next entry in the topmost table. Once it reaches the end, it pops the state and skips to the next entry in the new topmost table. It continues until the stack is empty.

This is not to be confused with a pushdown automaton, where the stack is used to store symbols that have been parsed not the history of states.

Actually, this is more or less recursive descent turned outside in - imagine writing a recursive descent parser in a language that supports co-routines: Instead of reading a character, the parser will always yield and won't regain control until a new character is available. Only in this case this is made explicit by returning and retaining an explicitly managed stack

I'm sure this isn't an original technique - it's too simple - but I can't remember if I've seen it describe anywhere. If anyone recognise it from elsewhere, let me know as I'm always interested in finding out if I've missed any obvious optimizations.


About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Recent posts to my blog

StumbleUpon My link page

(Links I have stumbled and like)