Professor Dr. Michael Franz's doctoral dissertation: Code-Generation On-the-Fly: A Key to Portable Software (PDF). I've been harping on about Semantic Dictionary Encoding (SDE) ever since I first read the paper back in 1994, and got quite close to actually implementing it at one point. So let me give a brief overview, since I can't drag it out often enough (someone take it mainstream, damn it!):

SDE is, at its most basic, a compression mechanism for the intermediate representation of a compiler. Similarly to Huffman coding, an SDE encoder and decoder pair each build a dictionary of the data stream, the encoder while producing it and the decoder while consuming it. The dictionary entries are then used to encode/decode the following parts of the data stream, which causes more entries to be added to the dictionary, and so on. The later you are in the data stream, the more of the stream will refer to previously added dictionary entries instead of having to spell out the full definitions.

But to call it just a compression method would be selling it very short. The appeal of SDE is that it can be used as a machine-independent representation of a program, while at the same time (if used properly) retaining far more semantic information about the program than a typical bytecode. Contrary to most bytecode, SDE retains the structure of the program as represented by the compiler. This has a few very appealing properties beyond the nice parts of JIT (such as being able to target specific CPU versions):
- When (re)building the dictionary on decoding, you can store auxiliary information and even partially generated code, to speed up generation of subsequent pieces of code that use that dictionary element. The compression removes the need for pattern recognition: the pattern recognition was already done at encoding time. Franz specifically tested outright code copying to speed up code generation, where small pieces of generated code were tracked to see if they were invariant, in which case future uses of the same dictionary entry would simply copy the previously generated piece.
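The code-copying idea can be sketched roughly like this. Everything here (the `Entry` class, the stand-in `generate` function) is invented for illustration, not taken from Franz's implementation; the point is only that an invariant entry pays the code-generation cost once:

```python
# Hypothetical sketch: each dictionary entry caches the code generated for
# it the first time, and if that code is invariant (no context-dependent
# operands), later uses of the entry just copy the cached result instead of
# running the code generator again.

class Entry:
    def __init__(self, node):
        self.node = node
        self.cached = None     # previously generated code, if invariant

generated = []                 # tracks actual code-generation work done

def generate(node):
    """Stand-in for the real code generator (invented for this sketch)."""
    generated.append(node)
    return f"code<{node}>"

def emit(entry, invariant=True):
    if entry.cached is not None:
        return entry.cached            # copy; no pattern matching at all
    code = generate(entry.node)
    if invariant:
        entry.cached = code            # remember for future uses
    return code

e = Entry(('add', 'x', 'y'))
a = emit(e)
b = emit(e)
assert a == b
assert len(generated) == 1     # second use was a copy, not a regeneration
```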
- You retain the ability to do JIT, but with high-level flow information available, making it trivial to do things like sub-expression elimination and a number of similar "cheap" optimizations at JIT time based on knowledge available at runtime. The more you can defer generation, and the more additional information becomes available before generation, the better you can do.
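To see why retained structure makes this cheap: with the expression tree intact, identical subtrees can be found by simple value numbering in one pass, with no need to reverse-engineer patterns out of linear bytecode. A minimal sketch (all names illustrative):

```python
# Value numbering over an intact expression tree: every distinct subtree
# gets one number, duplicates share a number, and the JIT can compute each
# numbered value once and reuse it.

def number_values(expr, table):
    """Assign one number per distinct subtree; duplicates share a number."""
    if isinstance(expr, tuple):
        op, *args = expr
        key = (op,) + tuple(number_values(a, table) for a in args)
    else:
        key = expr                      # leaf: a variable or constant
    if key not in table:
        table[key] = len(table)
    return table[key]

# (x + y) * (x + y): both operands of the multiply get the same number,
# so x + y need only be computed once.
table = {}
number_values(('mul', ('add', 'x', 'y'), ('add', 'x', 'y')), table)
left = number_values(('add', 'x', 'y'), table)
right = number_values(('add', 'x', 'y'), table)
assert left == right
```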
- The code generator can be modified to specifically recognize higher-level dictionary elements that can be specifically optimized, while dictionary definitions for those elements are provided that represent them in terms of lower-level patterns. What that buys you is a backwards-compatible way of improving performance: significantly faster code on systems or platforms where the newest decoder is available, while still retaining full compatibility with the lowest common denominator, lowering the threshold both for porting and for innovation.
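A toy sketch of that fallback scheme, with an invented `memcopy` construct: the program ships a definition of the high-level element in terms of primitives every decoder understands, and a decoder that recognizes the element natively simply ignores the shipped definition:

```python
# Backwards-compatible higher-level dictionary elements: new decoders take
# the optimized path, old decoders expand the portable template. The entry
# name, template, and "native" output are all invented for illustration.

FALLBACK_TEMPLATES = {
    # 'memcopy' defined in terms of primitives any decoder can generate
    'memcopy': ('loop', ('load', 'src'), ('store', 'dst')),
}

NATIVE_HANDLERS = {
    # a newer decoder on this platform emits one optimized sequence
    'memcopy': lambda: ['REP MOVSB'],
}

def expand(op, native=True):
    if native and op in NATIVE_HANDLERS:
        return NATIVE_HANDLERS[op]()      # fast path: recognized natively
    return [FALLBACK_TEMPLATES[op]]       # portable path: use the template

assert expand('memcopy', native=True) == ['REP MOVSB']
assert expand('memcopy', native=False) == [FALLBACK_TEMPLATES['memcopy']]
```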
- SDE retains the block structure and type information of the program, making it well suited for highly dynamic languages (such as Ruby) that tend to result in either complex, bloated and slow bytecode or highly specialized, language-specific VMs. With SDE, a highly specialized decoder can still exist, but by providing dictionary entries that act as templates for the specialized constructs of the language as part of the program, a generic decoder can still execute the program. The same benefits apply to more static languages, but they would appear to be more pronounced for dynamic languages, where the final generated code can be far more effectively optimized once more information is known. In Ruby, for example, you can potentially do a far better job if you can JIT and cache specialized versions of code that take into account the actual layout of a class once all the code that may reopen and extend/modify the class has been loaded. This doesn't really apply to languages where the structure of a class is known definitively at some point before the program is run.
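The Ruby point can be made concrete with a toy layout-specialization cache. Everything here (the `Klass` model, version counter, cache keys) is invented for illustration; the idea is just that code specialized against a class layout is invalidated if the class is later reopened:

```python
# Sketch of layout-specialized JIT caching for a dynamic language: field
# access code is specialized to a fixed slot offset once the layout is
# known, and reopening the class bumps a version so stale code is dropped.

class Klass:
    def __init__(self, fields):
        self.fields = list(fields)
        self.version = 0

    def reopen(self, new_field):
        self.fields.append(new_field)
        self.version += 1          # invalidates previously specialized code

cache = {}                          # (class id, version, field) -> code

def specialized_getter(klass, field):
    key = (id(klass), klass.version, field)
    if key not in cache:
        # specialize to a fixed slot now that the layout is known
        cache[key] = f"load slot {klass.fields.index(field)}"
    return cache[key]

k = Klass(['a', 'b'])
assert specialized_getter(k, 'b') == "load slot 1"
k.reopen('c')                       # class modified: old entry is stale
specialized_getter(k, 'b')          # regenerated under the new version
assert len(cache) == 2              # one entry per layout version
```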
- Because of the level of semantic information retained in the file format, SDE is not at odds with generating native binaries. A typical VM such as the JVM may suffer a semantic disconnect with the destination architecture, which means a native binary generated from its bytecode can end up varying significantly from what a code generator would have produced with full knowledge of both the high-level structure of the program and the target architecture. SDE doesn't have the same issue, since very little of the semantics of the original program is lost in translation.
This means that a compiler can reasonably be split into a portable part outputting SDE binaries, and either a runtime SDE decoder/linker and/or a linker producing native code, without loss of information. Such a native code generator can be run either prior to delivery of the code or on installation, allowing it to optimize for a specific CPU model and to use more expensive optimizations than might be acceptable for runtime code generation.
- Instructions for tracing and additional safety (bounds checking etc. in languages that don't require it) can be generated automatically at code-generation time. Thus you can support constructs that would slow down the final binary and only turn them on if needed. This can go as far as leaving in debug assertions and logging constructs that can be turned on at the command line, but that never even make it into the generated code if the appropriate switch isn't present, allowing extra capabilities to be included without burdening the system in normal use.
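A minimal sketch of that kind of codegen-time instrumentation, with invented flag names: the checks and trace calls are emitted only when the corresponding switch is on, so in normal use they never exist in the generated code at all:

```python
# Instrumentation decided at code-generation time, not at run time: the
# fast build contains no trace of the checks. Flag names are illustrative.

def gen_index_load(array, index, flags):
    code = []
    if flags.get('bounds_check'):
        code.append(f"assert 0 <= {index} < len({array})")
    if flags.get('trace'):
        code.append(f"log('load {array}[{index}]')")
    code.append(f"load {array}[{index}]")
    return code

fast = gen_index_load('buf', 'i', {})
debug = gen_index_load('buf', 'i', {'bounds_check': True, 'trace': True})
assert fast == ["load buf[i]"]
assert len(debug) == 3             # checks exist only in the debug build
```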
For programs where large parts of the functionality are switched on or off from the command line, large parts of the program may simply be discarded as a result of simple sub-expression elimination in the code generator, something which is particularly beneficial where the various parts cross-cut the same algorithm and substantial simplifications can be made.
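To make the core mechanism from the top of the post concrete, here's a minimal sketch of the encoder/decoder lockstep, assuming a flat stream of IR nodes (real SDE works over structured subtrees with semantic context; all names here are illustrative):

```python
# Encoder and decoder grow identical dictionaries in lockstep, so later
# occurrences of a node are transmitted as a single dictionary index.

def encode(nodes):
    """Encode a stream of IR nodes; repeats become ('ref', index)."""
    table = {}          # node -> dictionary index
    out = []
    for node in nodes:
        if node in table:
            out.append(('ref', table[node]))     # already in dictionary
        else:
            table[node] = len(table)             # add new entry
            out.append(('def', node))            # send full definition
    return out

def decode(stream):
    """Rebuild the node stream; grows the same dictionary the encoder did."""
    table = []
    nodes = []
    for kind, payload in stream:
        if kind == 'def':
            table.append(payload)                # mirror the encoder's add
            nodes.append(payload)
        else:
            nodes.append(table[payload])         # look up earlier definition
    return nodes

ir = [('add', 'x', 'y'), ('mul', 'x', 'y'), ('add', 'x', 'y')]
wire = encode(ir)
assert decode(wire) == ir
assert wire[2] == ('ref', 0)   # third node sent as a one-index reference
```

The decoder's `table` is exactly where the auxiliary information from the code-copying bullet would live: instead of bare nodes, each entry could carry cached analysis results or previously generated code.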