2008-05-06 21:39 UTC A brief introduction to Semantic Dictionary Encoding

I've read a lot of research papers over the years, but there I keep coming back to like Professor Dr. Michael Franz' doctoral dissertation: Code-Generation On-the-Fly: A Key to Portable Software (PDF).

I've been harping about Semantic Dictionary Encoding (SDE) ever since I first read the paper back in 1994, and got quite close to actually implementing at one point. So let me give a brief overview, since I can't drag it out often enough (someone take it mainstream, damn it!):

SDE is, at it's most basic, a compression mechanism for the intermediate representation of a compiler. Similiarly to Huffman coding, an SDE encoder and decoder pair both builds a dictionary of the datastream while building it and decoding it respectively. The dictionary entries is the used to encode/decode the following part of the data stream, which cause more entries to be added to the dictionary, etc., meaning that the later you are in the data stream, the more of the stream will refer to previously added dictionary entries instead of having to be all verbose and include the full definition.

But to call it just a compression method would be selling it very short. The appeal of SDE is that it can be used as a machine independent representation of a program, while at the same time it can (if used properly) retain far more semantic information about the program than a typical bytecode. Contrary to most bytecode, SDE retains the structure of the program as represented by the compiler.

This has a few very appealing properties beyond the nice parts of JIT (such as being able to target specific CPU versions):

  • When (re)building the dictionary on decoding, you can store auxiliary information and even partway generated code, to speed up generation of subsequent pieces of code that use that dictionary element. The compression removes the need for pattern recognition - the pattern recognition was already done at encoding time. Franz' specifically tested outright code copying to speed up the code generation, where small pieces of code that was generated was tracked if it was invariant, in which case future uses of the same dictionary entry would lead to just copying the previously generated piece.
  • You retain the ability to do JIT, but have high level flow information available, making it trivial to do things like sub-expression elimination and a number of similar "cheap" optimizations at JIT time based on knowledge available at runtime. The more you can defer generation and the more additional information becomes available before generation, the better you can do.
  • The code generator can be modified to specifically recognize higher level dictionary elements that can be specifically optimized, while dictionary definitions for those elements can be provided that represents them in terms of lower level patterns. What that buy you is a backwards compatible way of improving performance that can provide significantly improved performance when running on systems or platforms for which the newest decoder is available, while still retaining full compatibility with the lowest common denominator - lowering the threshold for porting and for innovation.
  • SDE retains the block structure and type information of the program, making it well suitable for highly dynamic languages (such as Ruby) that tend to either result in complex, bloated and slow bytecode or highly specialized, language specific VM's. With SDE, a highly specialized decoder can still exist, but by providing dictionary entries that act as templates for the specialized constructs of the language as part of the program, a generic decoder can still execute the program. The same benefits apply for more static languages, but it would appear to be more pronounced for dynamic languages where the final generated code can be far more effectively optimized once more information is known (i.e. in Ruby you can potentially do a far better job if you can JIT and cache specialized versions of code that take into account the actual layout of a class once all code that may reopen and extend/modify the class has been loaded - this doesn't really apply to languages where the structure of a class is known definitively at some point before the program is run).
  • Because of the level of semantic information retained in the file format, SDE is not at odds with generating native binaries. Where a typical VM such as JVM's may or may not suffer a semantic disconnect with the destination architecture which means that a native binary generated from their bytecode might end up varying significantly from what a code generator would have generated with full knowledge of both the high level structure of the program and the target architecture, SDE doesn't have the same issue since very little of the semantics of the original program is lost in translation.

    This means that a compiler can reasonably be split into a portable part outputting SDE binaries, and either a runtime SDE decoder/linker and/or a linker producing native code without loss of information. Such a native code generator can be run either prior to delivery of the code, or it can be run on installation, to be able to optimize for a specific CPU model, and can use more expensive optimizations than might be acceptable for runtime code generation.

  • Instructions for tracing, additional safety (bound checking etc. in languages that don't require it) can be generated at code-generation time automatically. Thus you can support constructs that would slow down the final binary and only turn them on if needed. This can go as far as leaving in debug assertions and logging constructs that can be turned on at the command line, that need never even make it into the generated code if the appropriate switch isn't present, thus allowing extra capabilities to be included without burdening the system in normal use.

    For programs where large parts of functionality is switched on or off from the command line, large parts of the program may just get discarded as a result of simple sub-expression elimination in the code generator, something which is particularly beneficial where the various parts cross-cut the same algorithm and substantial simplifications can be made.

Note that the some of the above is beyond the scope of the original dissertation, though all of it is hinted at or implied by the architecture.

One thing that may or may not be obvious from the above is that SDE, beyond efficiently encoding a tree version of the intermediate representation actually also turns the tree into a directed graph. Nodes further down in the tree get reused, linked the tree together at places where common sub-expressions are used.

Compared to a typical VM, SDE has one potential drawback in terms of delayed code generation, and that is that the dictionary can potentially grow quite large. This is ameliorated by scoping - by marking the start of a scope in the dictionary, and then throwing away any data generated specific to the local scope when reaching the end of the local scope the dictionary can be kept fairly compact.

Performance concerns?

Franz' addressed performance reasonably thoroughly, showing that SDE in most cases - on a mid 90's system based around a M68040 CPU running at 40MHz - performed very reasonable compared to loading native binaries. Franz' mused that CPU speed at the time was increasing faster than disk speeds, and that there is/was a very real possibility that this difference might be reversed so that SDE might eventually get faster. I don't know if that held up - it'd be interesting to redo those tests.

As for runtime performance, the paper contains a few benchmark tests against the then current MPW C implementation for Mac, and the tested Oberon code using SDE beat the corresponding C code hands down.

What's happened since?

Not much, unfortunately. Franz has continued work on various code generation methods, but Java came along around the same time and more or less monopolized interest the portability aspect. At least Mac Oberon had support for SDE, but as for much other great stuff coming out of the ETH computer science departments it was hampered by it's ties to Oberon, which fell out of favor together with Modula and Pascal type languages.

Franz' publications page is well worth a look - he's continuing to churn out great stuff.

My interest in SDE

I first got interested in SDE as a potential feature to contribute to a longwinded debate on comp.sys.amiga.misc back in '94 about writing an open source replacement for AmigaOS. It fell by the wayside (but one of the original people, Aaron Digulla has stuck with it, and AROS is probably the closest we today have to a real open source descendant of AmigaOS).

At various times I've wanted to resurrect my original attempts, but never have had the time.

This time? We'll see - I'm loosely thinking about looking at adding SDE support as one of the forthcoming steps for my series about writing a compiler in Ruby. I've learned a lot about compiler technology in the meantime, and I got quite far the first time - SDE isn't hard to do, anyway - so I'm hoping that I'll find the time to do a working implementation at some point in the not to far future.



Comments - Newest first

2008-05-09 14:32 UTC
Vidar Hokstad
Another digression I found interesting, because it actually references a paper co-authored by M. Franz on exactly the subject of dynamic runtime optimization using tracing, can be found here.

It's a very interesting post to a Lua mailing-list about how their new tracing JIT works. Well worth a read.

2008-05-08 14:09 UTC
Vidar Hokstad
Josh,

I think you could do a lot of inlining in the compiler even with SDE, with the caveat that you would loose some semantic information. You'd in effect do the inlining on the AST. It does mean you need to do some guesswork (a "cost estimate" I suppose) with respect to how much code various higher level constructs will typically cause, so there's more room for making silly decisions.

An upside being that if the encoder is properly done, any of that inlining that doesn't massively change the "inlined" code should not have much of an impact on the binary size, as at least parts of it ought to result in shared nodes in the dictionary.

My biggest concern is what optimizations can be afforded to run at load time, but there all the indirection in dynamic languages actually is somewhat of an "advantage" (or rather less of a disadvantage) in that you could reasonably continue optimization at a low priority in a separate thread for a while, and if the program is long running you can start updating the vtables with new, better optimized functions.

Of course, if you can modify code inline, then all the better, but replacing code inline has those pesky synchronization issues, while vtables created very obvious "switch panels" where you can do atomic changes without any mutexes or other annoying cruft...

Would be a great use of idle time, and I know Franz' has continued to publish articles about continuous optimization, though I haven't read much of the work done in that area for the last 5-6 years or so.

There's of course also the "cop out" of using something like this only as a "distribution format" and cache a native version (and/or progressively cache more optimized versions), but I quite like the thought of being able to easily instrument the code with a command line switch etc. (though nothing prevents you from supporting both, I guess).

There's also the crazy possibility of plugging in instrumentation for periods of time at runtime to collect performance data to then guide further low priority optimization targeted at the specific workload (real data triggers specific branches at a higher rate? Reorder code... Yikes - that has the potential for some really nasty bugs and performance regressions if not done exactly right).

For one last, somewhat related digression:

Another member of Wirth and Mössenöcks research group at ETH the same time Franz' was there, Marc Brandis wrote a quite interesting dissertation on writing an optimizing Oberon compiler using Guarded Single Assignment form, btw: ftp://ftp.inf.ethz.ch/pub/publications/dissertations/th11024.ps - I know far too much about what research went on in that department in the mid 90's...

2008-05-08 12:34 UTC
SDE is quite cute. I'm curious how it would fair in modern optimization shoot-outs, as it should have some advantages.

It seems like SDE is the opposite of in-lining (out-lining?), which results in tiny "executables," but also means that a lot of unrolling has to be done at translation time. On the positive side, a lot of flow-based optimization becomes easier if you eschew the bound template forms (ie only generate (. + .) style entires, no (a + .)): you can then templatize multiple function applications in sequence. This seems like it would make register allocation etc a lot cleaner, as you've got a convenient representation of the scope's inputs, paths through it, and outputs. You can make the same optimizations in other representations, but they're not nearly as clear to think about.

One thing that's becoming quite clear to me: 32K ain't much i-cache, so your compiler is crucial. And, honestly, the other 4MB of cache only do you so much good on large datasets, if you're jumping around much. Machine Architecture: Things Your Programming Language Never Told You is a talk Herb Sutter gave. His slides contain two great graphs (pg4 bottom and pg24 top). The first shows the speedups in various components over time, the second shows the mean time per memory access in a chunk of a given size with a given access pattern (ie: the average number of cycles it takes to read a single int out of a block of size N, walking through linearly forward, backwards, or randomly). The takeaway message? Shoot anyone not using explicitly contiguous memory segments in performance-critical code.

(I might add that, when you're building a search engine on trees, this is a hard thing to do!)

Post a Comment

Basic HTML allowed. Javascript required for anti-spam check (I am testing a new anti-spam measure. Problems commenting? Please e-mail me: vidar@hokstad.com)

About me

E-mail: vidar@hokstad.com Skype: vhokstad
Twitter: vhokstad
View my LinkedIn profile.

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married and we just had our first child, Tristan Ikemefuna Hokstad.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.

Twitter Updates

    follow me on Twitter