Vidar Hokstad V2.0


Welcome! This is an ARCHIVED page from my old blog


2005-03-31 21:40 UTC La Vida Robot


La Vida Robot

Wired 13.04: La Vida Robot:

How four underdogs from the mean streets of Phoenix took on the best from M.I.T. in the national underwater bot championship.

A great read, whether you like robots or not.

C++ and reuse

Over at the Manageability blog there's an interesting entry titled Manageability - Google's Coding Culture and C addressing the lack of uniformity in C++.

There are a few points I'd like to discuss in that regard.

*update*: I also posted a rather lengthy comment to the entry linked above...

One of the main things separating C++ from many other languages, whether it is Java with a large "standard" body of code, or Perl with CPAN etc., is the method of disseminating new code.

In the C++ world, the rule is very much a "bazaar" style of library development - everyone is welcome to the party, and people are hawking their wares at every street corner and sometimes in the middle of the street.

This has its advantages and disadvantages. Among the disadvantages are the issues mentioned in the entry above: There are many things for which there is no standard way of doing things. There are often a multitude of libraries doing the "same" thing. There are many coding styles.

However, what that fails to recognise is that the situation is like that to a large extent because that's what people want. Not that people want chaos, but people want different things.

Many of us who dislike Java do so exactly because it forces or at least nudges us into patterns we don't want to follow, and because it constrains us to programming models we don't like.

The multitude of C++ libraries doing the same thing comes from a variety of reasons:
- The standard is limited. It is limited because every standards-conformant compiler is expected to offer everything in it. That is, sockets support is considered inappropriate because not all systems can support it, and so on. This is a very different approach to Java's.
- People don't know about each other's efforts, but decide to keep going when they find out because of different goals/features/needs.
- Different goals, features and needs are a big driver: There are trade-offs in anything you do, and where the Java approach is to put one approach into the standard, the C++ approach is that if there is no clear consensus on the way to do things, it doesn't belong in the standard.

As it is, that leaves C++ with a very powerful but also very limited foundation, with the STL (which IS part of the standard), iostreams and the remaining bits and pieces forming a generic base, and above that you are free to pick and choose.

But look around at various open source projects, and you will quickly see some sets of "standard" libraries popping up everywhere.

The C++ world is fragmented, but there is extensive reuse.

While it's easy to say C++ would be better off with a larger standard library, I think that is a two-edged sword - many of the people using C and C++ do so because they only "pay for what they use": With a basic setup you don't get a whole lot of stuff that may not be appropriate for your system.

A lot of these people would not use C++ if it grew into the huge standard that Java has become, because it would no longer fit what they are looking for.

The very strength of the C/C++ legacy is the huge amount of code that is out there in forms that are reasonably easily reusable - it just isn't as neatly packaged as for some other languages.

Reuse in C/C++ is just a whole lot more focused around finding code that works for you (has the right space/time trade-offs etc.) and is under the right license.

Part of this is also a result of the sheer age of the C/C++ community - a lot of code dating back to the 80s, and code with roots further back, is still not only in use but also being reused in new systems.

(One classic example is wildmat.c written by Rich Salz in 1986 and still being reused mostly unchanged in new code these days)

BlogPulse Conversation Tracker

Via Mike Liksvayer: BlogPulse Conversation Tracker is a specialized search that attempts to build a view of the "conversation" that is created by people commenting on a blog entry around the web.

This is the kind of application that would be so much easier with widespread use of the previously mentioned "Thread Description Language", by explicitly annotating the pages to describe the relationship of the posts and comments to posts.

Instead of having to search, it becomes a simple matter of traversing the links in the documents. A Firefox extension that presents a tree view based on TDL annotation would be great... Unless someone else gets to it first perhaps it's time to spend some time experimenting (please, let someone else get to it first, I'm spreading myself way too thin these days... :) )

Greasemonkey as a lightweight intermediary

Via Intertwingly:

Jon Udell wrote an entry called The architecture of intermediation on how he as a user would like a way of adding features to a web application.

Simon Willison followed up with this article on using Greasemonkey as a lightweight intermediary

He's using Greasemonkey as a tool for annotating webpages and storing the data on a central system.

I've been thinking about something similar for a while: I read/write some German, French, and bits and pieces of Dutch, Italian and Vietnamese. The problem is that I have far too little time to spend on studying these languages, and it's far too tedious to read extensive texts in them.

Looking up words in a dictionary all the time is too tedious too. What I'd like is a way for me to a) annotate a page with notes on words/grammar etc - Simon's article has lots of great pointers, and b) automatically pull down a list of dictionary definitions for words I haven't indicated I know well enough, giving me easy access to definitions.

I think a tool like that would let me spend a lot more time reading these languages...

March 30, 2005

PHP Naive Bayesian Filter

Thanks to Bitflux Blog for this link to a PHP Bayesian filter.

The linked page also points to James Seng's plugin for Movable Type to do Bayesian filtering of comments.

In case you don't read French, I've done a quick (and rough, my French is bad - I need to use it more) translation (feel free to correct me in the comments...):

This is about filtering comments, pingbacks or other trackbacks to your site. I don't play much with that, but the idea of a filter based on the Bayes theorem intrigued me too much to resist doing a PHP implementation.

Simple and efficient

The Bayes theorem is a simple relationship between probabilities. For example if you have a document and two categories spam and nospam, it is difficult to learn the probability that the document belongs in one category or another directly. On the other hand it is simple to learn them by analysing each word of the document.

For the theory, a simple search on Google for "naive bayes theorem" gives you numerous references. And if English doesn't stop you, you ought to read Machine Learning in Automated Text Categorization by Fabrizio Sebastiani. If you prefer Perl to PHP, look at the CPAN modules of Ken Williams like Algorithm::NaiveBayes.

The naive Bayes algorithm is interesting because it is fast and broadly useful. You could, for example, use it to classify comments on your site; see the filter for MT that motivated me to do it all in PHP.

Utilisation in practice

In the archive you find a script which allows you to train your database and make a query. It is meant for implementation in a larger system like your blog system.

At first, use the file "mysql.sql" to initialise the database. You should afterwards use the script to create at least two categories, for example "spam" and "nonspam". Afterwards you must train the filter a bit before testing it.

Important functions:

1. train() : To train the filter
2. untrain(): To untrain the filter
3. categorize() : To classify a document
4. updateProbabilities() : To update the probabilities in the database after a series of train() or untrain().

The use of categorize() does not add any information to the database. It only returns the result of the probability calculation.
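To make the word-by-word idea concrete, here is a minimal in-memory sketch of a naive Bayes classifier. It is not the PHP library's code or API: the method names merely mirror the functions listed above, the whitespace tokenisation and add-one smoothing are assumptions made for the example, and there is no database, so nothing corresponds to updateProbabilities().

```python
import math
from collections import defaultdict


class NaiveBayes:
    """Toy word-count classifier (illustrative only, not the PHP filter)."""

    def __init__(self):
        # category -> word -> number of times the word was seen in training
        self.word_counts = defaultdict(lambda: defaultdict(int))
        # category -> number of training documents
        self.doc_counts = defaultdict(int)

    def train(self, category, text):
        self.doc_counts[category] += 1
        for word in text.lower().split():
            self.word_counts[category][word] += 1

    def untrain(self, category, text):
        # Assumes the same text was previously passed to train().
        self.doc_counts[category] -= 1
        for word in text.lower().split():
            self.word_counts[category][word] -= 1

    def categorize(self, text):
        """Return the most probable category; does not modify any counts."""
        total_docs = sum(self.doc_counts.values())
        vocab = {w for counts in self.word_counts.values() for w in counts}
        scores = {}
        for category, n_docs in self.doc_counts.items():
            if n_docs <= 0:
                continue
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # log P(category) + sum of log P(word | category), add-one smoothed
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                score += math.log(
                    (counts.get(word, 0) + 1) / (total_words + len(vocab))
                )
            scores[category] = score
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("spam", "buy cheap pills now")
nb.train("nonspam", "interesting post about parsers")
print(nb.categorize("cheap pills"))  # spam
```

Working with sums of logarithms rather than products of probabilities avoids underflow on long documents.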

Update: Replaced the Machine Learning URL with a working one provided by Audun. Thanks!

Florian Mueller to give up software patent fight

From ZDNet UK: Anti-patent campaigner hangs up his gloves

Florian Mueller has been a tireless campaigner over the last year, and together with the people at FFII he's managed to create enough difficulties for the pro-patent lobby that the directive would have been dead if the EU Presidency had stuck to the EU Council's procedures instead of going to extreme lengths to quash any democratic legitimacy the council had.


We all owe him a great deal of thanks for the work he's put in, and I hope his game project is a great success.

In the meantime, we still have the FFII, and it's still possible to kill this directive, or even fix it.

The Temporal content of Web pages

OWL-Time is an OWL ontology for describing temporal aspects of web pages or web-services.

One very useful aspect of it is that it's fairly readable and well documented, and comes with several example files - as such it's a great way of getting more familiar with OWL.

Bootstrapping assemblers/compilers

I know perfectly well what to do and not to do, yet I keep getting bitten by this anyway:

Do NOT change the language without first taking a copy of a working environment... NO, checking everything in to CVS/Subversion is NOT sufficient - you need to verify you have a working copy.

There is nothing worse than making a change to your compiler/assembler/parser/whatever, trying to rebuild it with itself, and discovering that the change you thought was entirely benign in fact broke the damn thing, leaving you stuck with something written in a language that doesn't exist anymore (since you just modified and broke the only translation tool).

Luckily, when I did this last night it wasn't too bad - I just had to rewrite about 20 lines of my assembler and dig out an older version of the parser to be able to rebuild it again, then change the lines back, correct a bug and rebuild it once more (after taking copies this time)...

But why will I never learn this lesson once and for all? I've written at least a dozen translators that have been bootstrapped to use code written using itself before, and it seems that every single time I forget about this sooner or later.

Thread Description Language

Thread Description Language (TDL)


TDL is an RDF vocabulary for describing threaded discussions, such as Usenet, weblogs, bulletin boards, and e-mail conversations.

So what could it be used for? One obvious thing would be to enable client software to access web based message archives without having to care about scraping the HTML to see if all the header information is there - just embed the RSS in a page.

Another would be as a uniform way of storing meta data about messages and their relationships, or exchanging that information with other applications.

Mozilla AOM

Ever wanted to program XUL applications or extend Mozilla? Mozilla Object Reference covers the Mozilla Application Object Model with lots of reference material and tutorials.

(AND it contains an RDF model categorising all the data and a sidebar extension for Mozilla to browse it - a nice demo of some of the stuff you can use RDF for.)

Miss having a cat

There is something extremely compelling about Wil Wheaton's writing. Even when writing about how he has to let go of his cat to save it from further suffering, he manages to do it in an upbeat way by telling the story of how they found him in the first place, which immediately made me miss having a cat. Even if the bad thing about having an animal like a cat is that you inevitably outlive it and have to let go.

The last cats we had were when I was still living with my parents.

We got this grey, violent, extremely aggressive furball of a female that my parents got talked into taking by the previous owners, who had to get rid of her due to an allergy.

She was the most brutal cat I've ever known, frequently hiding in the berry bushes waiting for birds. Not sparrows, or similar tiny things, but magpies, crows etc. Once we found 3-4 of them under a bush - she never bothered eating them.

Another time a magpie was teasing her on the lawn, seemingly at a safe distance. It reached maybe a meter up into the air before our cat was sitting on its back and forcing it to the ground. Then she let go and waited for it to take off again before jumping on top of it yet again.

After we'd had her for a while, we ended up with another litter of kittens. We kept one of them, a tiny little male that was entirely black.

He was shy and passive from the start, and as he was growing up his mom was quite a bit too watchful - she used to hide behind the curtains and hit him with her paw whenever he passed by (he never learned to spot her or avoid her - silly cat).

You couldn't help feeling sorry for him, and I think his demeanor and the way he was treated was what made me so much more attached to him than his mother or the other cats we'd had over the years.

His mother did keep trying to teach him to hunt and kill but he just wouldn't learn. I don't think he ever did - he was more dependent on his humans than any other cat I've known.

One winter his mother didn't return from one of her nightly trips. Probably run over by a car, but we never found out for sure. With any other cat I'd say a fox might have been a possibility but given how ferocious she was, I'd pity the fox that would have tried attacking her.

Sad as it was, after that our black cat "Svarten" ("svart" is black in Norwegian - not very original) started coming into his own. He livened up, though he was still a real coward.

At the time my parents had a bird, and my brother a couple of rabbits. You'd think they'd have a hard time keeping the cat away, but not so - in fact, once one of the rabbits was left out of its cage, home alone with Svarten. We suspect the rabbit got cornered and gave him a real kick or two, because after that he was afraid of rabbits too...

Whenever the bird was let out, he was even stranger, refusing to even look in its direction. If you tried getting him to, he'd turn his head away. Fighting temptation perhaps...

A few years after he was born, my parents for some reason decided to get a dog. I still don't understand why, considering the number of animals they had.

I was so angry when I found out, because it turned out Svarten was not the kind of cat you try to get to live with a dog - he promptly moved out and in underneath the house, and stayed there for many years, only coming in for visits when I was around and the dog out of sight - always carefully looking around to see if he was safe.

He was extremely affectionate those times whenever he decided it was safe enough - insisting on curling up on my chest purring when I was going to bed.

A couple of years after I moved out, he finally got up enough courage to move back in, but by then his best years were over. Soon after, he started developing liver problems, and one weekend when I went back to visit he was gone. They'd forgotten to tell me they'd had to let him go...

March 29, 2005

Parser assembler update

An update on my assembly language for parsing. I now have a parser for the assembly language written using the bytecode it will generate that will parse the full syntax. Total bytecode size?

About 400 bytes.

(*UPDATE*: Ok, so it ended up at 507 bytes. Still pretty good and it will be smaller once I've added some of the enhancements to the instruction set. Though I'll have to add some error reporting - that will probably bring it up to around 1K)

Apart from the additional features I've mentioned earlier I will probably need some additional error handling functionality as well, in order to make it easy to give proper error messages.


South Korea to promote Linux use

CNET News.com reports that South Korea is to promote Linux use.

It makes sense for governments to be pro Open Source. Regardless of the cost issue, open source has the advantage that it guarantees open access - in a democratic society creating barriers to participation is a significant issue.

Locking people into undocumented data formats threatens participation particularly in poor countries, but does also cause a significant archival problem once the vendor withdraws support.

However it also provides an important possibility to grow the local service and development industry.

Even if Microsoft turned out to be right, and OSS turned out to be more expensive than their software, the choice would still be between funnelling money into the coffers of a US company and funnelling it into the wallets of local software engineers and IT consultants, who in turn will pay a significant chunk back in tax and spend a significant chunk of the rest in the local economy.

These are the two reasons I think should be the most important for governments looking at open source - the cost of Microsoft software is much larger in terms of reduced opportunities and investments in the local economy than it is in direct license and maintenance costs.

Bloody daylight savings

There are few things that irritate me more than forgetting daylight savings. Luckily we had the bank holiday Monday due to Easter this year. However there is one thing that is more annoying: Why the .... do Europe and the US need to switch one week apart?

I always forget about it when booking meetings, and inevitably end up stuck in the office an hour later than planned due to unavoidable conference calls with people in California.

The universe is out to get me.

Making the web a long tail broadcast medium

Anybody remember Pointcast? Back in the dot-com boom years, when push technologies were going to be the Next Big Thing, Pointcast was IT. "Everybody" was clamouring for a piece of the push action. Then nothing happened.

So where are we now? Push is finally maturing - conceptually - though data is really being pulled.

RSS has been a driving factor in making us reactive instead of pro-active when it comes to a larger and larger segment of our interactions with websites.

Push was "going to be big" back in the late 90's because it would let people broadcast to an audience, just like in traditional media. And that is what is finally happening.

As I'm watching the stats for my RSS feed, I can see instant feedback whenever I'm active in the form of more readers, exactly because software now works the way push was meant to.

While technically our readers are pulling the data, it's conceptually push - I put an item out there, readers pick it up and feed it to an aggregated view.
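The "technically pull, conceptually push" pattern can be sketched as a reader that polls a feed and only surfaces items it hasn't seen before. This is a minimal illustration, not how any particular aggregator works: the function name `new_items`, the two sample feeds, and the choice of `guid` (falling back to `link`) as the item identity are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, seen):
    """Return titles of items not seen before, recording them in `seen`."""
    fresh = []
    for item in ET.fromstring(rss_xml).iter("item"):
        # Assumption: guid identifies an item; fall back to link if absent.
        key = item.findtext("guid") or item.findtext("link")
        if key and key not in seen:
            seen.add(key)
            fresh.append(item.findtext("title", default=""))
    return fresh


# Two snapshots of the same hypothetical feed, before and after a new post.
FEED_V1 = """<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><guid>urn:example:1</guid></item>
</channel></rss>"""

FEED_V2 = """<rss version="2.0"><channel><title>Example</title>
<item><title>Second post</title><guid>urn:example:2</guid></item>
<item><title>First post</title><guid>urn:example:1</guid></item>
</channel></rss>"""

seen = set()
print(new_items(FEED_V1, seen))  # ['First post']
print(new_items(FEED_V2, seen))  # ['Second post']
```

The publisher never contacts the reader; the feeling of "push" comes entirely from the client polling often and showing only the difference.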

The conceptual difference is that readers don't adapt to specific patterns when they read your site, but they read when new material becomes available. This is also the key differentiating factor between books and newspapers, which you read when you have time (though the timeliness issue of newspapers makes the time you consider it interesting limited) and TV/Radio where you tune in when content becomes available.

Yes, you can come back and see it later, much as you can timeshift broadcast. But more and more content consumption is controlled by availability rather than a well defined time when we log on to check a few sites. We're "tuning in" to content rather than a specific source.

So while there are still obviously a lot of websites out there that are interactive, or where actions are user initiated, sites where timeliness is an issue, or where there is a demand for quick access to updated content, we're turning into information consumers in much the same fashion as we are with mass media.

Rather than seek and research, we're often content to sit back and deal with the information thrown at us.

That brings up the obvious question: Who will find the best way of building and monetizing these audiences? There's clearly already significant revenue potential in "normal" advertising, but broadcasting, especially in the form seen with blogs, which are more like a talk show than recorded programming, has the advantage that it builds loyalty, and a personality that builds a large audience has the potential to extract far greater value than basic advertising.

Case in point: Oprah Winfrey. Get a mention for your book on her show, and you're rich. Get it into her book club, and you're even richer because millions of people are members. Both because she's a trusted personality.

Could she have gotten that position in a medium where updates can happen at any time, but you only find out the next time you feel like checking? No. Newspapers online have worked without push because of regularity - they spent fortunes on building a brand, or have offline editions that do it for them, to build an audience, and the audience keeps coming back because they know there will be regular updates.

And while RSS increases the timeliness for such news sources as well, by shortening the time before they have their headline in front of a user, RSS is the poor man's broadcasting - it brings the same timeliness to a tiny blog as a major news source, both significantly better than pre-RSS timeliness for most sites.

The outcome is a levelling of the playing field that makes it significantly easier to create the diverse, niche-driven push market that push providers like Pointcast were hoping would drive their revenue.

The reason it worked this time around? Decentralisation. Anyone can publish, so the amount of content has exploded.

While this poses a challenge to monetisation, it also creates tremendous possibilities for two groups of people:

Aggregators that can find ways of sifting through all this information that add value, and publishers that can find ways of creating compelling content.

The former because they get to be a "radio channel" with only licensed content - they have much more freedom in creating a line up than a traditional broadcaster which has to take chances on who will produce quality content. They can see what seems to take off, and create premium feeds for content that is valuable enough.

Publishers of content because they're no longer limited to finding someone to take their content - they can put it out there and use it to build an audience.

However it boils down to the long tail: You're suddenly targeting niche markets.

Newspapers target niche markets. It's just that they're targeting many of them at once. I read a newspaper mostly for the main news headlines, political commentary and technology. I couldn't care less about sports, celebrities, TV programmes, horoscopes, classifieds etc. I don't find technology a compelling reason to buy a paper anymore. Nor political commentary. Nor headlines. I can get all of them from disparate feeds - my news headlines mostly from the BBC. My political commentary from the Guardian and assorted blogs. My tech news from a long list of blogs.

A few people will be able to make significantly profitable blogs or RSS driven "channels" of articles with mass market appeal. But the real money is going to be in figuring out how to deal with that long tail - the huge number of smaller blogs that will never make much money individually, but that are all interesting to someone.

Ad networks may be one way to squeeze some returns out of it. But I have a feeling that the real place to be is as a successful aggregator: Finding the right balance in how to provide the right news source to the right people, and how to combine that with personalised offers that fit the content, and at the same time building an audience that trusts you, the aggregator, implicitly because of your pick of quality sources.

In many ways, the two roles - aggregator and publisher - might merge because one way of building that audience is mixing the aggregation of content with unique content that adds value to the aggregated content. A significant number of blogs already do mix these roles, as much of the content is commentary on other content - we're aggregating and commenting the same way a talk show host is.

However, whatever the winning formula: the problem of aggregation must be solved. I'm currently following around a hundred RSS feeds. Maybe 10% of the entries are actually of interest to me, and I suspect the ratio will get worse.

Squashed philosophers

Ever wanted to impress friends with your knowledge of philosophy, or just wondered what they were all about, but don't have time (or the interest) to read the full original works and try to understand them?

Squashed Philosophers is a site that provides you with a timeline of important works in Western thinking, with links to abridged versions, complete with summaries, a very reduced version and a somewhat longer one, including reading time estimates.

Note that many of the philosophers are represented with works that may be less known among people without much exposure to them. For example, Marx and Engels are represented with the early work "The German Ideology", not with better known later work such as the Communist Manifesto.

This selection of lesser known work from some of the represented people is perhaps a good move, as the well known works are also the ones you're most likely to know most about (or even have read).

Exploring the Semantic Web: MeNow and MusicBrainz

crsmith.net has an interesting entry on using RDF data from MusicBrainz to export information about the music tracks he's currently listening to, and how it'll allow him to link that information to, for instance, license data, review information, FOAF data etc. without having to explicitly combine the data sets: MeNow and MusicBrainz

March 28, 2005

1,250 moments of failure

Ghost Sites has a gallery with screenshots of 1,250 web projects that died between 1998 and 2004 together with 'web elegies' - annotations on the context of the failures.

See Ghost Sites: The Museum of E-Failure

It is hoped that this exhibit - a sample of cultural product created by the dotcom era's lost wunderkind - provides some small iota of insight into the Web's possibly central role in the future history of Dead Media.

Technological singularity

Daniel Lemire has a short entry on technological singularity, or in other words the idea that at some point humanity may develop technology that leads to a phase of rapid development that is so beyond our comprehension that we will be unable to predict even the relatively near future.

Consider a hundred years ago. Progress was slow enough that while you couldn't predict most events accurately, you could relatively safely assume that things would be for the most part the same 10-20 years later.

Today, can we? The internet rose to prominence in just a few years. And while cell phones (like the internet) have been around for decades, they too have had transforming effects on society in just a few short years.

Consider a hundred years into the future - can we still assume we'll know for the most part what society will look like 2-3 years ahead? One year?

I find the concept fascinating in part because we know so little about it, or whether it is even a real possibility.

Even if the concept is valid, is the singularity a static point? That is, will we reach a point where humanity is transcended by technology? Or will humanity advance in capabilities sufficiently that the singularity is some sort of ever receding horizon beyond which we can't make predictions with any degree of accuracy?

Daniel mentions he thinks AI is currently out of reach, and it's a view I share. A lot of AI technology such as neural nets is useful, and will continue to improve, but we are still far away from understanding enough to build something intelligent enough to call a real AI.

But he also raises the question if we want to create something more intelligent than ourselves.

To me that question is pointless. The question is "will we?" and the answer is "yes", because as soon as we have the ability anyone not making use of that ability will be left behind, whether it be in terms of defence or in terms of competitive ability in a marketplace.

Another important aspect of the idea of a singularity is that we don't NEED to get to the point where we can directly create something more intelligent than ourselves. We only need to get to the point where we can create a system with self improving intelligence.

If we manage to create ANY form of real intelligence in software, whatever real intelligence is, we already know that genetic programming has the potential to automatically evolve the software. If we manage to create intelligence coupled with a good enough system of increasingly complex competitive pressure, we may at that stage already have created the singularity.

Once that happens, the exponential improvement may happen by itself, in the form of massively accelerated evolution - not directed engineering.

Is it desirable or not? We don't know. There's no way of knowing whether, when/if the singularity happens, the technological advances that result will be benign or not.

Arrrrgghh..

Off to paint the last bits of the living room... I hate DIY (not as much for the work itself as for losing the time I could otherwise have spent doing something else), so sometimes I think I'm certifiably insane for having bought a house that needs so much work.

More on my assembly language for parsing

As I wrote previously I'm experimenting with an 'assembly language' for parsing.

I'm about halfway through writing the parser for the assembler in itself and temporarily hard wiring it into the virtual machine to bootstrap it.

Lessons learned so far include: I really DO want a couple more high level instructions. I've only added BLT and BGT (Branch on Less Than and Branch on Greater Than) to the instruction set so far to make handling ranges easier, but I realise that I really want to create expanded versions of CMP, TRY and REQ (see the earlier entry for descriptions) that will handle ranges instead of single values, as it will dramatically simplify some rules.

The Kleene star and one-or-more instructions I hinted at would also have proved very useful, and will certainly be added. All in all I want to focus on two groups of instructions: High usability low level functions (i.e. they manipulate the state of the VM directly) and 100% "composable" functions - that is, instructions that can be composed entirely of low level functions. In my experience that makes implementation so much easier, and maintaining that separation means that you get good coverage of low level instructions so that any high level functionality you leave out is likely to be composable.
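The split between low level and "composable" instructions can be illustrated roughly as follows. In this sketch `match_range` is the only function that touches the parser state directly, while `star` and `plus` (Kleene star and one-or-more) are built purely on top of it. All names and the state representation are hypothetical; this is not the actual instruction set of the VM.

```python
class State:
    """Minimal parser state: the input text and a cursor position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0


def match_range(state, lo, hi):
    """Low level: consume one character in [lo, hi]; report success."""
    if state.pos < len(state.text) and lo <= state.text[state.pos] <= hi:
        state.pos += 1
        return True
    return False


def star(state, step):
    """Composed: apply `step` zero or more times; always succeeds.

    Assumes `step` consumes input whenever it succeeds, otherwise
    this loop would never terminate.
    """
    while step(state):
        pass
    return True


def plus(state, step):
    """Composed: apply `step` one or more times."""
    return step(state) and star(state, step)


def digit(state):
    return match_range(state, "0", "9")


s = State("2005-03")
print(plus(s, digit), s.pos)  # True 4
print(plus(s, digit), s.pos)  # False 4
```

Because `star` and `plus` never read or write the state themselves, leaving them out of an implementation costs nothing but convenience: they can always be rebuilt from the low level operation.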

Note that I'm NOT aiming for Turing completeness, though it wouldn't be strange if it happens by chance. The goal is a very restricted language only intended for parsing.

I've thought quite a bit about it over the weekend, and in particular whether or not it is likely to be particularly useful, considering the availability of similar tools.

What I've realised, though, is that given the complexity of parsers, it's very rare for parser generators to meet my needs, and it's very rare for me to be happy with generated parsers without manual modification. However modifying lex and yacc parsers manually is something you'd probably not want to try. Modifying an assembly-like language, however, is fairly straightforward (that's the old 6510 and M68000 demo programmer in me speaking). I've actually written a whole compiler in M68k assembly before, and so this is bringing back memories - M68k assembly was actually quite well suited at least for the parsing aspect.

The advantage of a heavily restricted VM where programs will mostly use relatively high level constructs is that it will be easy to make a reasonably well performing interpreter for it. Given the current size (ca. 300 lines of C++) it is reasonable to assume that a competent programmer could port it to any language in a day or two, and instantly get access to working versions of any grammar translated into the language. This is one of the areas where most parser generators become a burden.

Another thing I want to try is building a relatively sophisticated BNF -> parser "compiler". Most of the constructs I've added are very well suited as targets when translating from BNF, as I'm used to using BNF as a starting point for all parsers I write. There are quite a few things such a compiler could do very easily with the instruction set I'm working on to generate a much faster parser than what I'd write by hand (because if I write by hand I'd be aiming for simplicity...):

- inlining productions.
- "lifting" shared or "almost shared" initial terms in OR groups. If you write a grammar where a bunch of productions all start with a character, and use them in an or expression like this: (foo | bar | baz) it makes sense for the compiler to generate a single check for a range of characters to allow. This is fairly trivial to add.
- merging of similar subtrees. That is, if I have productions in an or-expression that expect the same initial characters it can often be fairly easy to check for the initial characters separately.
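As a rough illustration of the "lifting" idea, the sketch below (Python, and not part of the actual compiler, which doesn't exist yet) collects the first characters of a set of alternatives and merges adjacent ones into inclusive ranges, so a single range check can reject input before any alternative is tried. It assumes each alternative is a plain literal string; real productions would need their first-sets computed instead, and the function names are made up for this example.

```python
def first_char_ranges(alternatives):
    """Merge the first characters of literal alternatives into inclusive ranges."""
    firsts = sorted({alt[0] for alt in alternatives if alt})
    ranges = []
    for ch in firsts:
        if ranges and ord(ch) == ord(ranges[-1][1]) + 1:
            ranges[-1] = (ranges[-1][0], ch)   # extend the current range
        else:
            ranges.append((ch, ch))            # start a new range
    return ranges

def could_match(ranges, ch):
    """Single up-front check: can any alternative start with ch?"""
    return any(lo <= ch <= hi for lo, hi in ranges)
```

For (foo | bar | baz) this yields the two single-character ranges for 'b' and 'f'; a group of alternatives starting with consecutive characters collapses into one range, which is where a range-compare instruction would pay off.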

All of this serves to bring the resulting assembly closer to expressing an NFA for the language, but just expressing it in bytecode instead of a table of transitions. The advantage is that it'll be fairly easy to make it possible to turn these optimisations off to get readable assembly to look at to debug the parser.

Another advantage I see from this approach is exactly the debuggability - I already have a "tracing mode" for my VM that will output the exact instruction stream as executed, and it makes it so much easier to diagnose problems.

The last advantage I see is size. Expat (a C XML SAX parser) on my system is about 126K. The full VM executable (meaning it's also dragged in a lot of library code) plus most of the assembler and debug output, and without optimization, is currently 38K. A C version would probably weigh in at significantly less. Clean it up, add bytecode for an XML parser (given my experience with it so far I think that would weigh in at about 5-6K) and conversion functions for UTF-8/UTF-16 and Latin-1, and I think it should be easy to get a full XML parser into less than 40K. Quite possibly less than 30... Looking forward to trying.

Playing with RSS

Been busy all day programming, and testing out RSS enabling assorted stuff, including my e-mail - just love how easy it is to churn out feeds from any information source available and instantly have it accessible from a wide variety of applications, including Firefox.

March 27, 2005

Programming Language Texts online

PLT Online is a collection of programming language theory texts and resources, all of which are freely available over the Internet.


Have Mass-Mailed Malware Peaked?

CRN has this article on the six year anniversary of Melissa: Six Years After Melissa, Mass-Mailed Malware Has Peaked

The article doesn't give any reasons for the belief that viruses such as Melissa are past their prime. For one, the article covers spoofed from addresses, but the main reason Melissa and similar viruses were so devastating was exactly that they didn't need any spoofed addresses - their strength was mailing from a user that had you in their address book, and hence using valid from fields wouldn't be a problem.

While most people have hopefully now learned to be careful about attachments etc., the problem with viruses using your address book is that the potential is there to make the virus much more insidious and effective.

For one, there is the potential to not start immediate bombardment of everyone in your address book, but to wait for outgoing messages with attachments and infect the attachments - people are much more likely to trust an attachment attached to what appears to be a fully legitimate message, and they're much less likely to suspect problems if their machine doesn't immediately freeze up due to massive amounts of outgoing mail.

Secondly, your sent mail folder is a trove of information for a virus - there's lots of potential for resending recent messages with attachments adding messages like "Hey, sending you another copy of this as I've made some updates" and similar.

The potential for virus writers is endless - in fact what keeps striking me with each of these virus attacks is how primitive most viruses seem to be. I'd be very surprised if we don't see more massive outbreaks.

I love Wing-Yip

While Croydon is hardly the most glamorous London borough to live in (but hey, Arthur Conan Doyle used to live up the road from me, and D.H. Lawrence used to teach at a school in the road I live in, so at least it was once cultured :) ), it has its advantages, one of which is the Wing Yip 'superstore'.

I love it. Over the last couple of years I've developed an addiction to dim sum, and it's annoying having to go in to central London to get my fix (though now I work just minutes away from New World, one of the best dim sum restaurants in town, provided you can stand the pushy 'service'). That was until we discovered the Wing Yip centre in Croydon only about 20 minutes by bus away.

Aisle after aisle of dim sum, Chinese snacks, noodles, cakes, exotic (to me anyway) fruits and other fun stuff I'll probably end up tasting even though I don't know what it is (always fun to try to figure it out only from taste...).

We've filled the freezer now, so we'll be steaming dumplings every other day or so for the next few weeks.


Tips for Mastering E-mail Overload

Harvard Business School has a site called "Working Knowledge" and I just happened to come across a link to an article there called The Leadership Workshop: Tips for Mastering E-mail Overload (found it via the blog of Mike Arrington, a great guy I used to work with, whose blog I just recently stumbled over).

The article has lots of great advice. For me the e-mail overload has gotten to the point where I hardly answer non-work related e-mails any more (it's easier to get a response from me by comments on here than by e-mail, at least as long as the volume here is so much smaller than my mail volume), which is really bad when I occasionally get mails from old friends etc. that I haven't talked to in ages and I completely forget to mail them back, but the whole combination of work and my MSc. has really killed the concept of having spare time for me.

I do however think a lot about how to improve on e-mail and would really like to get back in the e-mail business - there's still money there, potentially a lot, for whoever comes up with the right productivity enhancing tools.

The very fact that e-mail is such a vital tool makes the value of an application that can save just a few percent extra of your time ridiculously high in a business setting.

At the same time I think pure e-mail is too limited. Testing the water with this blog has convinced me that there is an important space for integrating mail with web based technologies to take more control over how you communicate.

Combining direct communication with feeds of information that is relevant but not personal could do much to reduce the overload that is mainly occurring because what is treated by the reader as a one to one medium (it is likely the message is intended for you) is being treated by the sender as a one to many medium (the message likely has some relevance to all recipients, but is unlikely to need immediate attention from most of them).

While waiting for the technologies to get sorted out, this article provides useful advice for improving the usefulness of the e-mails you do have to deal with.

March 26, 2005

Just finished watching the new Dr Who

I've only seen bits and pieces of the older ones, so I can't speak much for whether or not it matches the old ones in style, but I was pleasantly surprised. Both with the quality of the effects, and the lighthearted approach.

It certainly didn't take itself too seriously, and had a few hilarious moments such as one character being "eaten" by a plastic garbage bin, and Billie Piper failing to realise she was talking to a very obvious plastic copy...

Even my ordinarily rather non-geeky fiancee enjoyed it and wants to watch it regularly...

A proposal to solve the 'Orphan Works' problem

This article at Groklaw looks at one option for solving the orphan works problem:

The Copyright Office has been holding hearings on access to "orphan" works. These aren't movies about kids who have lost their parents -- Little Orphan Annie, say. They are works which are still under copyright but have no copyright holder (or no locatable copyright holder.) It might sound esoteric, maybe even boring, But it isn't. Here's why I think it matters. "Orphan works" probably comprise the majority of the record of 20th century culture, and their orphan status means we have practically no access to them. In all likelihood no copyright owner would show up to object if one digitized an old book, restored an orphan film, or used an obscure musical score. But who can afford to take the risk? The normal response of archivists, libraries, film restorers, artists, scholars, educators, publishers, and others is generally to give up -- it is just not worth the hassle and risk. The result? Needlessly disintegrating films, prohibitive costs for libraries, incomplete and spotted histories, thwarted scholarship, digital libraries put on hold, delays to publication. And all of this waste is entirely unnecessary. Is there any solution? Duke's Center for the Study for the Public Domain has produced a report to the Copyright Office that offers one.

Interesting commentary from PJ as usual, and interesting reader comments, so head over there and take a look.

ACM Queue - On Plug-ins and Extensible Architectures

Via Peter O'Kelly's Reality Check:

ACM Queue - On Plug-ins and Extensible Architectures - an interesting look at plug in architectures, using Eclipse as one example.


An 'assembly' language for parsing

I've mentioned my forays into push parsers previously. But after looking at that approach, I realised I needed a bit more flexibility. So I got the idea of designing a tiny assembly like language for building push parsers with. This is analogous to building an NFA or DFA, but with more operations, and the potential for being much easier to deal with manually.

I ended up with a tiny set of core commands, and I plan to add a few more convenience commands implemented in terms of the core. Here is the core command set I came up with:

RET -- Return with status flag set to true
BEQ -- Branch if status flag == true
BNE -- Branch if status flag != true
STO -- Store token buffer in numbered slot
TRG -- Trigger an event (used to "plug in" native code to build a parse tree)
ERR -- Return with status flag set to false. Unget any characters retrieved in subroutine, and revert token buffer to pre-subroutine state
CLR -- Clear token buffer
JSR -- Jump to subroutine. Push unget buffer and token buffer to stack.
JMP -- Jump
CMP -- Compare to input character, and set status flag accordingly. Yields if no input.
EAT -- "Eat" input character and add to token buffer and unget buffer. Yields if no input.

In addition I have so far implemented the following two compound commands:
TRY ch -- CMP ch; BNE n; EAT; n: ...
REQ ch -- CMP ch; BEQ n; ERR; n: EAT;

I plan on adding a few more, including a "range compare" and possibly equivalent TRY/REQ variations, as well as a Kleene star command, and possibly a "one or more" (i.e. "foo foo*") operator.

As far as I can see it would be trivial to implement a "compiler" to compile BNF into this language, and the VM is less than 300 lines of C++ (including lots of debug output). It would be trivial to JIT or build a code generator spitting out C/C++ or another language as well.
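To give a feel for how small such a VM can be, here is a minimal sketch of an interpreter for the core commands, written in Python rather than C++ and very much a toy. It makes several simplifying assumptions that are not from the real VM: the whole input is available as a string (so CMP simply fails at end of input instead of yielding), ERR is taken to leave the status flag false as the failure counterpart of RET, TRG just records the event name, and labels are written as a `("LABEL", name)` pseudo-instruction.

```python
def TRY(ch, label):
    """TRY ch: consume ch if it is next, otherwise carry on."""
    return [("CMP", ch), ("BNE", label), ("EAT",), ("LABEL", label)]

def REQ(ch, label):
    """REQ ch: consume ch if it is next, otherwise fail the subroutine."""
    return [("CMP", ch), ("BEQ", label), ("ERR",), ("LABEL", label), ("EAT",)]

def assemble(source):
    """Strip LABEL pseudo-instructions and map label names to code indexes."""
    labels, code = {}, []
    for ins in source:
        if ins[0] == "LABEL":
            labels[ins[1]] = len(code)
        else:
            code.append(ins)
    return code, labels

def run(source, text):
    """Interpret a program over a complete string; returns (ok, slots, events)."""
    code, labels = assemble(source)
    pc, pos, token, flag = 0, 0, "", False
    slots, events, stack = {}, [], []
    while True:
        op, *args = code[pc]
        pc += 1
        if op == "CMP":
            flag = pos < len(text) and text[pos] == args[0]
        elif op == "EAT":            # only valid after a successful CMP
            token += text[pos]
            pos += 1
        elif op == "BEQ":
            if flag:
                pc = labels[args[0]]
        elif op == "BNE":
            if not flag:
                pc = labels[args[0]]
        elif op == "JMP":
            pc = labels[args[0]]
        elif op == "JSR":            # save return address, input position, token
            stack.append((pc, pos, token))
            pc = labels[args[0]]
        elif op == "STO":
            slots[args[0]] = token
        elif op == "CLR":
            token = ""
        elif op == "TRG":
            events.append(args[0])
        elif op in ("RET", "ERR"):
            flag = op == "RET"
            if not stack:            # returning from the top level ends the run
                return flag, slots, events
            pc, saved_pos, saved_token = stack.pop()
            if op == "ERR":          # roll back what the subroutine consumed
                pos, token = saved_pos, saved_token
        else:
            raise ValueError("unknown instruction: %r" % (op,))

# An optional '-' followed by a mandatory "ab", stored in slot 0.
program = (TRY("-", "n0") + REQ("a", "n1") + REQ("b", "n2")
           + [("STO", 0), ("TRG", "word"), ("RET",)])
```

Running `run(program, "-ab")` succeeds with the token "-ab" in slot 0, while `run(program, "ax")` fails at the second REQ. A real push parser would additionally need to suspend in CMP and EAT when it runs out of input, which this sketch leaves out.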

I plan to spend some time writing a basic assembler for it first (tired of adding calls to build it inline without any relocation/label support...). Then we'll see.

So far I like this approach a great deal better than lex/yacc or other compiler construction toolkits I've seen.

Starved for Logic

I've mostly tried not posting anything directly about Schiavo - the closest I've been getting being my comments on a couple of the starvation pieces posted yesterday - because it angers me that people feel they have a right to meddle in what should have been a private matter.

However I would like to draw attention to this great piece over at Blogcritics highlighting the hypocrisy of complaining about Schiavo starving to death all the while opposing legalised euthanasia (as the former is a direct result of the latter being illegal, thus preventing doctors from doing anything but withholding treatment).

However what really got my anger rising was one of the comments posted.

I find it amazing that some people still try to use things like smiles and minor movements as proof that she's not in a PVS.

My question to those people is: Have you ever seen what happens to an Alzheimer's patient?

Because long after an Alzheimer's patient has stopped recognising you, long after they have stopped being able to talk and move about of their own accord, they will still sometimes smile, move their heads and look like there are glimmers of their old selves in there.

Except by then there isn't enough of the brain left for that to be possible.

Instead, their heads are mostly filled with plaque and mostly dead braincells and they're reduced to the most bare physical reflexes.

What you see is an effect of how the brain works: The more recent and the more controlled by thought as opposed to reflexes something is, the earlier it tends to disappear. What you're left with is basic automatic reflexes.

Then motor functions go, and sooner or later an organ fails.

The thing is, a patient can be in a persistent vegetative state for years and give what appears to be responses to someone close to them, while the moment you start to rationally look at what they actually do, you will find no correlation with what happens around them.

I've seen it happen to both my grandmothers, which is probably why I react so strongly to the idea of "signs" like that.

One is now dead, the other may live physically for a couple more years, but long before her body gives out she will be reduced to bare physical reflexes as well.

For someone who just sees someone in a persistent vegetative state for a short time, or that wants to believe, it may easily seem like they're sometimes responding. They may smile seemingly at some comment, or chuckle briefly, or widen their eyes when someone enters the room.

But try asking them in any way you can think of to respond, and you will quickly see that the "response" is random.

The only real difference between an Alzheimer's patient and what has happened to Schiavo is that Alzheimer's is degenerative - you get to watch as part by part of the brain goes over a period of many years, so you get used to evaluating what is left of the person you loved. Perhaps that makes it easier to accept.

You may see "signs" if you refuse to believe such a person is gone because if you try hard enough to get a reaction, statistically you will eventually get one, at which point you're happy and stop your experiment.

I'm not a medical expert, and obviously I've not seen Schiavo, so perhaps she isn't in a PVS.

However I would claim that anyone who claims she isn't or can't be, based only on what they've seen and heard about her reactions in the media, is completely clueless.

It's a 404, loser

404

Do you care about lives or about votes?

First I saw the Blog for America piece that I mentioned earlier tonight, and now I notice Patrick Logan has been writing about the same thing:
Making it stick.: Preventing Death

He quotes 13,000 children a day dying of hunger related causes. Add to that another couple of tens of thousands dying of other causes, like easily curable diseases, as well as a huge number of adults and it quickly adds up.

Wonder how many lives the members of Congress could have saved if they'd just donated what it cost in transport to get them to and from their little meddling session - surely far more than one.

So the obvious question is whether they care about lives or about votes. I know what I'd place my bet on.

ADTI invents new silly claims about open source

In a completely unsurprising move The Alexis de Tocqueville Institution's Kenneth Brown is trying to discredit OSS again.

After his miserable, failed attempt at discrediting the roots of Linux by claiming it was copied from Minix (a claim disputed even by Minix' author Tanenbaum) Brown is trying a fresh approach:

Brown finds it "intriguing" that many open-source contributors work for large IT companies. "Every day, an untold amount (sic) of employees beholden to strict employee/invention/intellectual property agreements, in their spare time (and even during work-hours) freely give away ideas, code, and products to open source projects," he writes. This opens up questions around the legal ownership of contributions, and could even open an avenue for a "disgruntled employee" to give away company secrets by contributing them to open-source projects, the report argues.

Interestingly, Brown conveniently "forgets" that a lot of these people are paid specifically for the purpose of developing contributions to open source.

He also ignores that this is much more likely to be an issue with proprietary software:

A disgruntled employee that releases code to the public will risk getting discovered immediately, because he/she is spreading the code, well, in public.

A disgruntled employee that goes to a shady competitor may get away with it, and may even get rewarded.

Interestingly enough there have been several lawsuits regarding the latter, but I've yet to see one where the former has been proven.

And the only company currently in active litigation over contributions to OSS, namely SCO, was nearly laughed out of court by the judge, who sardonically pointed out their complete lack of evidence so far. SCO even backtracked on their original copying claims.

Another point that might perhaps be lost on Brown: Most of these strict employee/invention/intellectual property agreements do not prohibit you from contributing what you do on your own time on your own equipment, and in many jurisdictions any clauses to the contrary aren't even legally binding on an employee even if they sign the agreement.

However, all of this is moot as long as Brown isn't able to show even a single example of what he claims must surely be rampant.

I think Santa Claus robs thousands of old ladies every year, honest, so it must be true even if I don't have a shred of evidence.

In any case, what I found most interesting with this article was the title and subtitle:

Think-tank report lays into Linux Guess what? Organisation is funded by Microsoft

... and the ending paragraph:

Brown's 2004 report alleged that credit for the origin of Linux should go to projects such as Minix, authored by Andrew Tanenbaum. That report drew criticism from many quarters, including Tanenbaum himself. "My conclusion is that Ken Brown doesn't have a clue what he is talking about," Tanenbaum wrote in a web posting at the time.

With packaging like that, I doubt too many people will take Brown's mindless rants too seriously.

March 25, 2005

Finished my essay!

The hardest part was cutting it down from 6000 to 4000 words, but I did it once I finally managed to stop procrastinating and actually did some work.

So now I'm going to spend the rest of the evening munching cheesecake and writing a parser engine - I got this idea for an assembly like language that would make it trivial to write a push parser based more or less straight from BNF (automated transformation from BNF would be easy too, but I'm not going to deal with that yet). I'll write more about it and perhaps put up some code later this weekend.

Hunger has a cure

Blog for America has a great piece called Hunger has a cure about the fact that 1 in 10 American households experience hunger or the risk of hunger and that more than 9 million children are the recipients of food aid in various forms:

Once again, Americans have lost their focus, and are being distracted by the unfortunate case of one individual, Terri Schiavo. On Monday, March 21, 2005 at 1:11am the President signed a bill in his personal residence, which would allow a federal court to intervene in hopes of replacing her feeding tube. How ironic the president's action was, when there are millions of Americans, including children, who go to bed starving every single night.

It reminds me of a situation in Norway in the early 80's when an aid organization started collecting money for food for USA's starving children, and the US Embassy delivered a formal protest claiming the US could take care of its own.

So why isn't it happening?

Developed countries (the US is by no means alone in this) still have a long way to go in eradicating poverty and providing proper safety nets.

Support British loos - Join the BTA!

The British Toilet Association - Campaigning for better public toilets for all

I've been looking for an organization to be active in, but perhaps this isn't it... :)

African mobile growth tied to faster development

I've long been aware of the trend towards bypassing landlines in Africa, in part due to the cost structures. These aren't apparent in developed countries, where the vast cost of building out landline networks has been absorbed over a century by what were, to a large extent, heavily regulated telephone companies.

Mobile networks are today cheaper to build out, and as a result they've already far bypassed landline usage in many developing countries.

This article over at BBC News isn't new, but it had escaped me until today. It covers the growth levels, and also interestingly shows that growth in mobile usage can also be linked to increases in GDP in Africa.

Now, without seeing the study it's hard to judge whether the numbers say that a country with higher growth is likely to see higher mobile usage, or the other way around.

On the other hand, it's quite logical that as mobile usage grows, it fuels more growth in the economy, thanks to improved communication and the ability to conduct business more effectively.

US to sell arms to yet another military dictator

One might perhaps think that while in the middle of the quagmires of violence that are Afghanistan and Iraq under the guise of "promoting democracy", the US government would at least pretend they actually mean what they say, and not offer F-16s to Pakistan, a country that is still dealing with the aftermath of Musharraf's coup, is still suffering from massive civil rights abuses, and where the military still holds a significant amount of power over civil life. The move risks destabilising a highly volatile region (India has already started making objections).

But I guess one shouldn't expect anything else, seeing as the practice of providing arms support to people such as Saddam Hussein has a long history with the US government.

Update: Just a quick note before I get any comments about it: Yes, I am aware Musharraf held a referendum and election a couple of years back; no, I'm not impressed. The election was marred by lots of irregularities, and while I'm sure he does have some level of public support, to me the fact that he is in his current position thanks to a coup and that he's shown repeated unwillingness to relinquish his power makes the elections moot - he belongs in jail, not in office. Anything else is an affront to democratic principles.

Graffiti artist 'exhibits' in New York's most famous museums

Graffiti artist 'Banksy' managed to sneak his works into several of New York's most famous museums, claiming he could do just as well as the artists exhibited: Wooster Collective on Banksy's stunt (with photos of the works).

I love stuff like this. Not only is it fun when stuffy institutions are stirred up, but it's great fun when it's done in a constructive way.

The most memorable quote, however, is this single sign of modesty (from this article on Banksy's interview w/NY Times):

I wanted to do the Guggenheim but there weren't enough paintings in it, I would have had to appear between two Picasso's and I'm not good enough to get away with that.

But he could get away with putting a painting of an Admiral with a spraycan and anti-war slogans in the background up for several days before being discovered...

Maybe he does have a point about the selection processes for these museums.

Done writing - now for the deletions and references

Finished writing the bulk of the text for my essay... Now I need to trim it down from 6000 words to the 4000 word limit. Somehow I think my 10,000-15,000 word dissertation next year is going to be rather straightforward. I hate having to cut, but I hate writing with the word limit in mind even more, so I tend to write far too much to make sure I cover everything I want, and then go through and try to pick the least important stuff to delete afterwards instead.

Then there's only the references left. At least I'll probably get it in tonight, so that I can enjoy the rest of the easter weekend and focus on writing some cool programs instead.

On a related note, I've just signed up for the last two courses before my dissertation. I ended up with a course on computer architectures which should be a breeze (I first learned assembly 17-18 years ago, I think) and one on information security which could actually be quite interesting.

My next question now is what I'll do once I've finished my MSc. I've been toying with the idea of an MBA, or alternatively taking some other courses first - possibly a BSc. in Economics together with some modules on British law. I've gotten addicted to studying again ;)

Easter means back to my essay...

Here in the UK we're off Friday and Monday thanks to Easter, so it means I have all of today and tomorrow to finish my essay on the Semantic Web (it's due by midnight Saturday). Hopefully I'll wrap it up tonight. I'd have finished it last weekend if it wasn't for the fact that I'm so easy to side track whenever I start looking at cool new technology - I have a half finished N3 parser and a half finished C++ DOM implementation to show for it...

If I finish my essay today, though, I should probably finish painting the living room before I let myself be tempted to continue work on them, or my significant other might get a tad annoyed.

We've had the house for 7 months now, and we're still not done with all the redecorating - once I've completed painting and laying the flooring in the living room we still have two bedrooms and the kitchen to go.

It really annoys me that the redecorating is keeping me from programming, but then the amount of money we saved by buying a house in need of some work is ridiculous.


March 24, 2005

Lilina News Aggregator

Lilina news aggregator is a browser based RSS/Atom reader with a great, simple interface and which doesn't need a database.

Check out this Lilina based site for an example.

(via vrypan|net|log)


Larry Lessig on Searching Creative Commons

Yahoo! has just launched a Creative Commons search as beta, and over at the Yahoo Search Blog Larry Lessig has this to say on Searching Creative Commons.

There's also this blurb by Mike Linksvayer on the Creative Commons blog.

Now, this is one of those cases where it would've been great if people added RDF-A style RDF data instead of "just" a normal link, in which case once you'd done this work once it would be instantly and automatically reusable for any other category of data...

Yes, I'm moaning about the Semantic Web again. Deal.

In the meantime though, it's great to see that Yahoo is giving that kind of recognition to Creative Commons (of course I'm biased since it's my employer...)

Wikinerds: Interview with Hurd developer Marcus Brinkmann

Take a look at the interview - it's long, but covers Marcus' work on Hurd, thoughts on GNU, micro kernels, becoming a programmer and more. Interesting stuff.


Revolting co-routines in C

I came across a link to this post (see in particular the response by Tom Duff) over at Brainwagon about using Duff's device (as if Duff's device isn't revolting enough to start with) to implement Coroutines in C.

Eughh.... Though it is kind of cute... No... Please let me resist the urge to actually USE this...

Update: And while you're at it, take a look at this 'threads' package as well. I haven't looked enough at it yet to tell if the implementation is as revolting as the stuff above.

March 23, 2005

Improve my bookmarks!

It just hit me that it's extremely annoying to manage bookmarks, and I REALLY want people to stick more metadata in their document headers, and for bookmark managers to extract it and annotate the bookmarks and let me use the data to search the stored bookmarks. It's one of those blindingly obvious uses of RDF/RDF-A, and one where Dublin Core entities are already widely used for Search Engine Optimisation purposes so a lot of data is already in there, some of it in a form that fits exactly or almost with RDF-A.
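As a rough sketch of what the extraction side of such a bookmark manager could look like, the Python below pulls the title and any `<meta name="DC.*">` values out of a page and does a naive search over them. It only uses the standard library `html.parser` module; the `annotate_bookmark` and `search` functions and the dictionary layout are made up for this example, and real RDF-A handling would need considerably more than this.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Names are lower-cased so "DC.title" and "dc.title" are treated alike.
            self.meta.setdefault(attrs["name"].lower(), []).append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def annotate_bookmark(url, html):
    """Return a bookmark record annotated with the page's Dublin Core metadata."""
    parser = MetaExtractor()
    parser.feed(html)
    dc = {k: v for k, v in parser.meta.items() if k.startswith("dc.")}
    return {"url": url, "title": parser.title.strip(), "dublin_core": dc}

def search(bookmarks, term):
    """Case-insensitive substring search over titles and Dublin Core values."""
    term = term.lower()
    return [
        b for b in bookmarks
        if term in b["title"].lower()
        or any(term in v.lower() for vs in b["dublin_core"].values() for v in vs)
    ]
```

Searching the stored records for an author name would then find pages whose `DC.creator` matches even when the name never appears in the title.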

Couple it with other RDF data sources available for webpages, such as RSS feeds and Open Directory, and you could get quite good coverage.

The upside of marking up your static content this way is that it would make it trivial to put together a program to scan your site and generate an RSS file of all recently changed pages as an added value for your regular users.
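A minimal sketch of such a scanner in Python, under the simplifying assumption that "recently changed" just means the file's modification time and that the item title is the file path rather than extracted metadata. The function names, the seven day window and the bare-bones RSS 2.0 output are choices made for this example only.

```python
import os
import time
from email.utils import formatdate
from xml.sax.saxutils import escape

def recent_pages(root, max_age_days=7, now=None):
    """Return (mtime, relative path) for HTML files changed recently, newest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    pages = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if mtime >= cutoff:
                    pages.append((mtime, os.path.relpath(path, root)))
    return sorted(pages, reverse=True)

def build_rss(pages, base_url, title):
    """Render the page list as a small RSS 2.0 document."""
    items = []
    for mtime, rel in pages:
        link = base_url.rstrip("/") + "/" + rel.replace(os.sep, "/")
        items.append(
            "<item><title>%s</title><link>%s</link><pubDate>%s</pubDate></item>"
            % (escape(rel), escape(link), formatdate(mtime, usegmt=True))
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel><title>%s</title><link>%s</link>'
        "<description>Recently changed pages</description>%s</channel></rss>"
        % (escape(title), escape(base_url), "".join(items))
    )
```

With metadata in the page headers, the item title and description could come from the markup instead of the file name.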

Usable XMLHttpRequest in practice

XMLHttpRequest in practice is a real gem of an article for anyone who wants to update their web applications with an XMLHttpRequest/AJAX type interface.

(Found at Blue Sky On Mars)

RDF and the Semantic Web ludicrous ideas?

I came across this: RDF and the Semantic Web are ludicrous ideas | Semiologic which is a short and thought provoking alternative view of the Semantic Web. I had to post a response; I quote the most important parts here (go visit semiologic to read the rest):


People won't mark up every little bit they put online, but that isn't needed: People will mark up the bits they care about. I'd rather have info that matters marked up than all kinds of fluff.

Companies selling online will mark up their catalogues because 1-2% extra sales for adding some extra processing of their products database is worth it, and it won't take much of an audience to a new product search engine before they'll be able to reach that.

(...)

Your example is contrived, because there's no point in marking up everything - you mark up whatever will have its value increased through markup. That means data that it's important for you that people can find and reason about.

(...)

A vast amount of the web HAS semantic information associated with it in the databases and content management systems it is generated from - but that information is lost when it is output into a form that is only human readable and not easily machine parsable.

Unlock 5-10% of the database content that is tied to the net and we already have the Semantic Web.

Blogcritics: Accelerating the 'Roe Effect'

David Flanagan has written an article on Blogcritics called Blogcritics.org: Accelerating The 'Roe Effect'

While I don't at all agree with his arguments, I found it a very interesting read on the ideas surrounding changes to marriage.

Changing the concept of marriage

My main argument I guess is against muddying the religious concept of marriage with the secular benefits bestowed by governments.

Personally I believe it is discriminatory to allow some people to obtain benefits when living together in a committed relationship while others don't. I also question the rationale for restricting this in any form to two people. The main issue here is whether the government at all should sanction specific forms of relationships over others in this way, and my answer to that is no.

When it comes to the religious idea of marriage, I couldn't care less. If a church refuses to marry two (or more) people, then that is their business.

If someone chose to rename "civil marriage" to "government approved relationship contracts" or whatever, then fine. What it's about is rights and responsibilities in what is essentially a contractual agreement.

The very idea that marriage is a union between one man and one woman is a religious idea that there is no justification for keeping as secular law.

Separation of church and state is a protection for those of us with different world views - whether atheists (like me), Muslims, or any others who are not tied to the Judeo-Christian image of marriage.

If anything, disconnecting the two might make more people open to entering agreements covering their relationships, contrary to the current situation where marriage is an institution that is gradually becoming less important.

The spread of religion

The idea that Christianity spread by having more children does not benefit Christianity now. On the contrary, Muslim families are much more likely to have many children, and tend to be much stricter than Christians with regards to abortion.

Muslims are also much more likely to support polygamy than Christians, given that the idea of up to four wives is well ingrained in many parts.

At the same time I doubt this has much effect long term, otherwise how does one explain the significant decline in fundamentalist Christianity in terms of percentage of the population?

Conservative and fundamentalist Christianity is losing out because people are increasingly critical of their parents' viewpoints - rebellion is an accepted part of youth culture, and access to more varied viewpoints has made people more and more likely to make up their own mind rather than blindly accepting what their parents told them.

We're at a stage today where most "christian" countries face a situation where regular church goers are now in a minority and people largely pick and choose to make their own version of Christianity that is significantly watered down compared with just a few decades ago (ask people how many believe in hell for instance).

This is a trend that's been ongoing for hundreds of years, and is unlikely to be stopped by some slight changes in the number of children born to various types of parents.

About Haskell and why the quicksort example is bogus

About Haskell is a page about the functional programming language Haskell.

It's well worth a read if you haven't read up on functional programming before.

But whenever I read intros like these, there is one thing that provokes me: That bloody quicksort example.

Here is quicksort in Haskell:

    qsort []     = []
    qsort (x:xs) = qsort elts_lt_x ++ [x] ++ qsort elts_greq_x
                     where
                       elts_lt_x   = [y | y <- xs, y < x]
                       elts_greq_x = [y | y <- xs, y >= x]

It looks quite straightforward, doesn't it? Quicksort of an empty list gives an empty result. Otherwise, quicksort of a list where x represents the first element and xs represents the remaining elements is equal to the result of applying quicksort to all the elements smaller than x, plus x, plus the result of quicksort of all elements greater than or equal to x.

It's almost simpler than the description.

Contrast that with the pesky C implementation:

    qsort( a, lo, hi ) int a[], hi, lo;
    {
      int h, l, p, t;

      if (lo < hi) {
        l = lo;
        h = hi;
        p = a[hi];

        do {
          while ((l < h) && (a[l] <= p))
              l = l+1;
          while ((h > l) && (a[h] >= p))
              h = h-1;
          if (l < h) {
              t = a[l];
              a[l] = a[h];
              a[h] = t;
          }
        } while (l < h);

        t = a[l];
        a[l] = a[hi];
        a[hi] = t;

        qsort( a, lo, l-1 );
        qsort( a, l+1, hi );
      }
    }

Ohh. Nasty.

What the linked page doesn't tell you is that the reason the C version is nasty (apart from being pre-ANSI C, judging from the signature) is performance. It's a version that sorts an array in place, whereas the Haskell version has all kinds of potential for massive memory wastage unless the compiler is far smarter than most.

That aside, the C version could trivially be simplified too.

What does the C version actually do? The answer is: Almost exactly the same as the Haskell version.

The Haskell version used what Haskell calls list comprehension to partition the input data into two groups and a pivot (the pivot is the value we split the input data by, and in the Haskell example it's "x").

That's what those nested loops do. In Haskell, partitioning is built into the language. In C it's not, but hoisting the loops out of Quicksort easily creates a partitioning function (in C++ we're even better off: std::partition() would do the job for us) that reduces the core of the C quicksort to just a couple of lines:

Partition the array in place, recursively call quicksort on the two subparts before and after the pivot.
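As a rough illustration of that idea, here is a minimal sketch assuming a C++11 compiler and nothing beyond std::partition from the standard library. The function name quicksort and the choice of the last element as pivot (mirroring the C version above) are assumptions made for the example, not a tuned implementation:

```cpp
#include <algorithm>
#include <iterator>

// Sorts [first, last) in place. Illustrative sketch: the function name and
// the "last element as pivot" choice are assumptions made for this example.
template <class Iter>
void quicksort(Iter first, Iter last) {
    typedef typename std::iterator_traits<Iter>::value_type T;
    if (std::distance(first, last) < 2) return;   // 0 or 1 elements: nothing to do

    Iter pivotPos = std::prev(last);
    const T pivot = *pivotPos;                    // copy, so swaps can't change it

    // Move everything smaller than the pivot to the front of the range.
    Iter mid = std::partition(first, pivotPos,
                              [&pivot](const T& v) { return v < pivot; });

    std::iter_swap(mid, pivotPos);                // pivot lands in its final slot
    quicksort(first, mid);                        // elements <  pivot
    quicksort(std::next(mid), last);              // elements >= pivot
}
```

Calling quicksort(v.begin(), v.end()) on a std::vector<int> sorts it in place. As with the C version, always picking the last element as pivot degrades badly on already sorted input, which is exactly the kind of thing a smarter pivot selection is meant to address.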

I'll post some comments on how simply you can do it in C and C++ later.

Meanwhile, if you want to see flexible sorting in C++ that's usually faster than quicksort (improving on quicksort by selectively switching to other algorithms), take a look at this article by Andrei Alexandrescu.

For the record, Alexandrescu boils the core of quicksort down to:

    template <class Iter>
    void Sort(Iter b, Iter e)
    {
      const pair<Iter, Iter> midRange = Partition(b, e, SelectPivot(b, e));
      Sort(b, midRange.first);
      Sort(midRange.second, e);
    }

Not far from the Haskell version in complexity, is it?

Whatever happened to...?

Endless fun: WEHT.net: The Online Compendium of 'What Ever Happened To' and 'Where Are They Now?

A great site to go to whenever you wonder why you haven't seen that guy in any movies lately and you're not sure if it's just because he's become a successful theatre actor or if he's become a drug addicted killer... :)

Managing your way out of chaos

Through my career, I've "always" (with the exception of a one year stint in the middle) managed people. However it's been something I fell into more or less by chance, and I didn't have any formal training in management practices when I first did it.

I've recounted some of my learning experiences previously. But one of the most important lessons I've had is in how to rescue a team that's failing due to lack of communication and process. In this article I'll mostly focus on communication.

At one of my previous employers I got in over my head. I was managing the development, and managing the recruiting of the team. However I had no support - which isn't unusual in a startup, but is also a real problem since much of the staff can be expected to be inexperienced in some of the roles they grow into.

I mean literally: no HR feedback or support in evaluating performance, and no support or feedback from my direct manager on anything but technical issues, which is the area I didn't need any help in.

Of course, at first I didn't realise I needed help, and now I can do without much of it.

The problems I faced were also a key reason I finally decided to take up my studies again, pursuing an MSc part time alongside my work.

There were three main problems:

  • I had up to 10 direct reports (fluctuating through the period), spread over two locations, and no real support staff (team leaders etc.). This is far too much unless some of your team members are experienced managers
  • Feedback to my team wasn't good enough. This was partly a function of the item above, but partly also a function of lack of experience with some personality types and of some management techniques to make it not matter (more on that below)
  • Processes were non-existent to primitive

Generally, large engineering teams - I'm talking about the size of a group directly reporting to the same person - don't work. There needs to be a hierarchy in place, even if that hierarchy is fairly informal.

If you're spending enough time on each member of your team, you'll be unlikely to be able to handle more than 5-6 people efficiently, depending on how senior they are. If you're lucky you have people that are strong enough that you can leave them mostly alone, and you can go above this, but you should never count on it.

I did, partly because I didn't know better, partly because I got no guidance from HR (and they were clueless about how engineering teams work), partly because I had great experience with the first few team members to come on: They did what was expected of them and beyond, and they left me enough time to mess around and keep doing some coding.

The problem comes once you get into situations where people have things to talk to you about, or when projects grow large and complex, and you get torn between technical issues and man management. Man management is "soft stuff". It doesn't have an immediate effect on project deadlines, so it's put aside.

Except it does. It does affect deadlines, and it's even more important in crunch times than when things are easy.

Dealing with a team as large as that with no support wore me out. I nearly quit. I became depressed and stressed out, and started going home early, and became more and more unavailable to my team because I felt I didn't have time: There were always unanswered e-mails, or meetings to go to.

A while later I had a row with HR over my salary, and part of the reason was my lack of communication. I nearly quit over that, in particular because I got extremely provoked by the fact that they had the gall to complain of my lack of communication when a large part of the problem was that I felt I had nobody to talk to in order to sort out the problems. In other words, I blamed my manager and HR.

Improving feedback

While I was right in thinking they were as bad at communicating with me as I was with my staff, I started working hard at improving my part of it, as the run-in with them made me realise I had to do something. Not for their sake, but for my team members' sake - they certainly did not deserve to be left to face the same burn-out and depression that I had been facing.

I once had to send one of my guys home because he started blowing up in people's faces after a particularly tough crunch. The warning bells should have been deafening.

Through the remainder of my time at this company I didn't see any evidence of my manager or HR improving their communication with me much. But I did see a marked difference in how I was working, and learned a lot.

Make people give you feedback

No, I don't mean ask for feedback. Some people will give you unsolicited feedback, and those people you generally don't have to worry about - they'll come to you if there are problems. Worry about the people who answer "it's going fine" when you ask them how things are going or how their project is coming along - when something DOES go wrong they'll still say the same thing.

They might not attack you personally behind your back, but they WILL (rightly) complain about how tough things are, and it will

This directly translates into a pattern of periods of relaxed work followed by frenzied crunch periods that just makes things worse

So, make people give you feedback. No torture instruments needed, but the time to sit down one on one with your team members is essential. It doesn't have to be formally, in a meeting room with you taking notes, but it does have to be semi-private. Enough so that someone will feel it's ok to tell you about why they've been grumpy at work, or why their project is going to hell and they need help.

But that's only the beginning. You need to show you're listening. You need to ask questions. You need to probe and show enough of an interest that they get that just fobbing you off with an "everything is fine" won't work. And the most important thing of all: When they do tell you about something that's going wrong, never blame anybody, including the person you're talking to.

Instead ask how they think it could be made better. Ask if things would work better if you assign someone else to help offload them so they can focus on the parts they do best. Be positive: They told you about the problem before it got serious (hopefully), so it can be fixed. That's good, not bad - problems always arise, but they usually only become real risks to the company when they're left unmanaged.

Be available

Never ever tell your staff you don't have time to talk. If you genuinely can't right now, suggest a time when you do have time and put it in your calendar the same day. If you have regularly scheduled slots for your team members, the need to do this rarely arises, as people already know you're available for them to talk to, and will put off less important issues until your regular meeting.

But that does not mean emergencies can't arise.

Require status updates and read them

A weekly status update, at least in writing, helps a lot. If someone runs into problems, you should see it in the way their status updates change. But more importantly, a status update lets you get the basic technical stuff out of the way, and also gives you something to talk to people about - even the people who never talk to you or never talk to you about anything of substance.

Make your team understand you expect these updates. Just asking for them and not following them up doesn't help. Going on and on about them if your team sees no evidence that you ever read them also doesn't help.

Hang around

During my most depressing period at the company I mentioned, I always went home at 5pm. Which was great. For me. I did that in part because I really, really had to get out of there. It was suffocating. I did it in part because I know from past experience that when you're on the verge of burning out you must take time out and try to get your energy back, or you end up working longer and longer days with less and less to show for it - your productivity just nosedives.

What I failed to realise was that it was happening to parts of my team as well. Part of the reason for the crunches we were facing, apart from bad communication, was that the natural reaction to those crunches was for people to work until late at night.

A significant part of that work could have been done in a fraction of the time if I'd sent these people home, asked them to take a day off, and had them do the same work in a more controlled manner once they were well rested.

I knew that. I've been in that situation myself more than once. But I didn't recognise that in my team. I figured, hey, they want to work late, they can. I did ask them often "not to work too late", but coming from someone leaving them at 5pm when they're facing 5 (or 10!) more hours that's just demoralising.

Even if you can't contribute anything useful, you can provide your support (and cookies and soda, or whatever makes them happy...) And by being there, you can try to stop it from going too far: Even if the crunch is genuinely a problem, and they'll just be working late for a few days, it's counterproductive to have your staff keep working even when you SEE they're not getting anywhere.

Make them go home.

Improving process

Once you've got communications working, you're still faced with chaos. It's just that you finally know just how chaotic things are.

I'm not going to say a great deal about process improvement this time around, as I think I can probably write a long article just on that subject, but I will make some general comments.

First of all, DO NOT try to force a huge process onto an engineering team that has never worked with one before. That goes even if individual members have, or even if the whole team has, but they haven't worked with exactly the process you want. It doesn't work. Been there, done that.

The result can be anything from grumbling "acceptance" coupled with subversion at every step, to outright rebellion.

My process improvement work started with me beginning the process of writing a development manual. Great idea, if it hadn't been so flawed: A development manual should enshrine what you're already doing, not set policy. I am grateful nobody laughed out loud when I provided them with the first draft. Instead it was silently but politely ignored.

Later, what I've observed to work is a slow and steady process. Observe how the team works, and put in place one little measure at a time: A weekly status meeting, or even twice a week. Status updates. Change control forms. A simple change control process. Code reviews.

Expect it to take a couple of years to take a team from chaos to structure. But even after the first couple of weeks of actively trying to introduce simple changes you should start seeing improvements.

Successes

After I woke up to the fact that something had to change, I've had multiple opportunities to test out new things and improve my skills, and the result has been great.

One thing I was very happy about was that my first opportunity to improve was with the same team I'd had problems with in the first place - I quickly saw the change in the feedback I got both face to face and via other people, and productivity improved.

But what was the greatest part was when I was tasked with a major new project at a point when the team was split in two. I got a chance to test many ideas with a smaller group. While I still did not use regular one to one meetings, I did introduce more regular calls, and was much more available, and the difference was huge - for the first time in a long time I saw the team truly pulling together, and avoiding nasty surprises, and almost entirely doing away with the problem of regular crunches.

More recently, with my team at Yahoo!, I've had the chance to really make use of what I've learned, and spend a fair amount of my time specifically on strengthening communication and processes. It pays off. When the team works well together, my job is easier, and in the end we all benefit. That's perhaps the key lesson: Becoming a better manager isn't just good for your team, it also makes your day a whole lot more pleasurable.

March 22, 2005

Simple push parsers

I've been toying with a simple table driven push parser class today. Normally I write my parsers as recursive descent either with or without a separate lexer stage.

However I've always disliked pull parsers because they're inflexible - the parser, and not you, controls the amount of IO. As such they easily force you towards multithreading even when you could've easily multiplexed the application logic.

A push parser by contrast needs to work only on the input fed to it. A common way of doing that is in the form of a nondeterministic finite automaton or a deterministic finite automaton, or similar techniques such as a pushdown automaton, which all can easily be designed to work with single character inputs.

However, I wanted a class that let me easily handwrite parts, so what I ended up with was the following:

A table driven parser with a table per production. For each entry in each table I store a flag to indicate if it's optional, a pointer to another table, and a pointer to an "acceptor object".

The "acceptor" is simply a small class that provides a method to check whether or not it will accept the current character, and whether or not it has reached the end. It allows me to easily customize behaviour, and dramatically cuts down on states by letting me define generic constructs such as "recognise this string".

A simple parser class pushes states onto a stack until it reaches the first state with no pointer to another production. Once an acceptor is "done", the parser moves to the next entry in the topmost table. Once it reaches the end, it pops the state and skips to the next entry in the new topmost table. It continues until the stack is empty.
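To make the description a bit more concrete, here is a minimal sketch of how such a parser could look in C++. It is a reconstruction from the description only, not the actual class: every name in it (Acceptor, Literal, Entry, Table, PushParser, settle) is made up for the example, reset() is an addition to the acceptor interface, and to keep it short the optional flag is only honoured for entries that hold an acceptor, not for entries pointing to another table.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Acceptor interface as described above: will it take this character,
// and has it reached its end? reset() is an addition for this sketch.
struct Acceptor {
    virtual ~Acceptor() {}
    virtual void reset() = 0;
    virtual bool accept(char c) = 0;   // consume c if it fits
    virtual bool done() const = 0;     // nothing more to consume
};

// A generic "recognise this string" acceptor.
class Literal : public Acceptor {
public:
    explicit Literal(const std::string& text) : text_(text), pos_(0) {}
    void reset() { pos_ = 0; }
    bool accept(char c) {
        if (pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    bool done() const { return pos_ == text_.size(); }
private:
    std::string text_;
    std::size_t pos_;
};

struct Entry;
typedef std::vector<Entry> Table;      // one table per production

struct Entry {
    bool optional;        // may be skipped (leaf entries only in this sketch)
    const Table* sub;     // non-null: descend into another production
    Acceptor* acceptor;   // used when sub is null
};

class PushParser {
public:
    explicit PushParser(const Table& start) : fresh_(true) {
        stack_.push_back(Frame(&start));
        settle();
    }

    // Feed a single character. Returns false on a syntax error.
    bool push(char c) {
        while (!stack_.empty()) {
            const Entry& e = (*stack_.back().table)[stack_.back().index];
            if (e.acceptor->accept(c)) {
                fresh_ = false;
                if (e.acceptor->done()) advance();
                return true;
            }
            // Rejected: only an untouched optional entry can be skipped.
            if (!e.optional || !fresh_) return false;
            advance();                 // retry c against the next entry
        }
        return false;                  // input continues past the grammar
    }

    bool finished() const { return stack_.empty(); }

private:
    struct Frame {
        explicit Frame(const Table* t) : table(t), index(0) {}
        const Table* table;
        std::size_t index;
    };

    void advance() { ++stack_.back().index; settle(); }

    // Pop finished tables and descend into sub-productions until the
    // top of the stack names a leaf entry (or the stack is empty).
    void settle() {
        while (!stack_.empty()) {
            Frame top = stack_.back();             // copy: the stack may reallocate
            if (top.index == top.table->size()) {
                stack_.pop_back();
                if (!stack_.empty()) ++stack_.back().index;
            } else if ((*top.table)[top.index].sub) {
                stack_.push_back(Frame((*top.table)[top.index].sub));
            } else {
                (*top.table)[top.index].acceptor->reset();
                fresh_ = true;
                return;
            }
        }
    }

    std::vector<Frame> stack_;
    bool fresh_;          // current acceptor has not consumed anything yet
};
```

With a table holding Literal("GET"), an optional Literal(" ") and a pointer to a second table holding Literal("/"), feeding the characters of "GET /" or "GET/" one at a time through push() leaves finished() true, while "GEX" fails on the third character. The caller decides when and how much input to feed, which is the whole point. A fuller version would also need to deal with optional entries at the very end of the input and with acceptors that only know they are done once they see a character they don't want.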

This is not to be confused with a pushdown automaton, where the stack is used to store symbols that have been parsed, not the history of states.

Actually, this is more or less recursive descent turned inside out - imagine writing a recursive descent parser in a language that supports co-routines: Instead of reading a character, the parser will always yield and won't regain control until a new character is available. Only in this case this is made explicit by returning and retaining an explicitly managed stack.

I'm sure this isn't an original technique - it's too simple - but I can't remember if I've seen it described anywhere. If anyone recognises it from elsewhere, let me know as I'm always interested in finding out if I've missed any obvious optimizations.

Wired: Are socialites still Networking

Joanna Glasner at Wired has written the article Are Socialites Still Networking? that takes a look at whether social networking sites are living up to the hype.

She covers the split into two sub-groups: meeting friends and professional networking.

Personally I think the former is mostly useless, but find the latter of some use via systems such as LinkedIn that offer limited access (on a recommendation basis) to people that may have significant value for you to get hold of, but which you might have no idea of how to reach.

March 21, 2005

The failure of abstinence in sexual politics

According to a report by Yale and Columbia University researchers, many who pledge abstinence are at risk for STDs.

Among virgins, boys who have pledged abstinence were four times more likely to have had anal sex, according to the study. Overall, pledgers were six times more likely to have oral sex than teens who have remained abstinent but not as part of a pledge.

I'm not the slightest bit surprised by this - they're looking at a group in which a large part are likely to have pledged abstinence because of expectations from relatives and the social groups around them.

What is most worrying though is this bit:

The pledging group was also less likely to use condoms during their first sexual experience or get tested for STDs, the researchers found.

Again, this is not really surprising: if you've been pressured into pledging abstinence you're not very likely to be in an environment where you have learned about the risks of STDs and pregnancy, or where you are willing to take the risk of your parents and other relatives and "friends" finding out about your sexual escapades.

Abstinence didn't work previously in history, and there's no reason to believe it will work now - unless you force girls to commit to checks to verify their hymen is intact or similar barbaric practices.

It is quite damning how few "succeed", though:

Last year, the same research team found that 88 percent of teens who pledge abstinence end up having sex before marriage, compared with 99 percent of teens who do not make a pledge.

Nice work. So instead of having as many teens having sex, but being well informed about STDs and ways of protecting themselves, you instead have a slightly lower number having riskier sex and going into their first sexual experiences largely clueless about the risks and consequences.

The hypocrisy of religious groups that are highly likely to be "pro-life" abortion opponents supporting "sexual education" programs that make unwanted teen pregnancies more likely is what I find most absurd.

Visual programming

PlutoSpin- GIPSpin (Graphical Interface Programming) is the latest in a long line of attempts at visual programming systems (not to be confused with visual IDEs for text based languages).

I'm not convinced of the versatility of the approach they've taken, but I always like taking a look at new attempts in this field.

Why do all diagramming tools SUCK?

During work, and as part of my part time studies, I frequently have the need to create diagrams. Most often things like UML diagrams, flowcharts and ER diagrams.

So far I've yet to find a single usable diagramming package, be it closed source or open source.

The general problem seems to be one of two things: Either a package is completely unstructured, and you can do whatever you want, OR a package constrains you to rules, 90% of which will be fine but 10% of which inevitably clash with your particular taste.

Visio is the worst I've come across of the latter. If you choose one of the "formalised" types of diagram in Visio, it turns into a nasty, arrogant know-it-all that places a lot of constraints on the way you draw diagrams OR forces you to forfeit all support.

XFig, for instance, is at the complete opposite end: Generally you're left to your own devices, but it means complex diagrams are extremely tedious to draw.

What I want is to be able to draw "anything". That is, I want to be able to break as many constraints as possible. On the other hand, I want it to be easy to stay within the constraints, and I don't mind a (non-obtrusive) reminder when what I'm doing doesn't fit the current ruleset. If it's easy to amend the current ruleset, then even better.

One app I really like in many respects is Ideogramic UML. It's not open source, but a limited version is available for free as in beer. Its main attraction for me was the gestures (which worked surprisingly well and saved me from that other monstrosity: the overgrown palette) and its basic support for freehand drawing to annotate the diagram. My main criticism is that you're still forced into a too restrictive model, and the freehand drawing (though a great idea) is too limited because it's essentially, well, freehand, with no drawing tools available.

Surely this can't be that hard? Checking that you conform to a model without yelling and screaming if you don't - just display a warning in a status bar or change the color of the offending element (and let me turn it off) instead of refusing to accept it.

Allow me to use arbitrary geometric shapes.

Provide a library of elements that have a certain look and certain places to put text and attach connectors, without enforcing specific semantics.

I want a smart diagram editor, not a modelling tool enforcing a specific form of modelling. I want nice looking diagrams, not source code generation...

So why can't I find anything that's usable?

Semantic web as future reality

This entry at Fred on Something neatly summarises my painful experiences while reading the W3 specs and assorted tutorials this weekend:

The thing is that RDF is not intended to be easily understood by humans like simple XML documents. RDF is intended to be understood by machines.

However, I still think the lack of accessibility of the W3 specs is a big problem. The XML spec is reasonably accessible. Even the XML Schema spec is. I can sit down with them, read them, and start writing a parser. Granted, it wouldn't be a very good parser if I didn't know more than I'd learned from a single reading of the specs, but I'd be able to.

It's less important that the formats are inaccessible if the specs are easily accessible so we get good tools to deal with them.

Nobody cares that Postscript is painfully obtuse to read in a text editor, and that doing so won't really tell you much about the document it describes, because we have good tools to manipulate postscript files and few of us need to interpret the files directly.

However the RDF and OWL specs are painfully dense, and painfully fluffy and full of mathematical terms that for me and most software engineers I know read as mostly nonsense.

This massively complicates the issue of getting good tools to work with it, and at this early stage even makes it hard to get people to understand the potentials of the technology.

I'm sure these specs represent great work, but it could have been so much better if more effort had been put into 1) examples and 2) presenting the normative semantics by specifying the intended effects in terms of observable effects on the RDF graph, or conceptual addition of RDF triples (even if the implementation wouldn't necessarily have to store these triples).

The triples aren't hard to understand. The RDF graph isn't hard to understand. The bloody description of the OWL semantics IS.

I wish the W3 would take some cues from ECMA, and do what ECMA did for ECMA 262 (the ECMAScript / Javascript specification), where the document specifies the semantics of the language by presenting expected results in terms of code rather than abstract mathematical terms.

Personally I have this intense hate for these kinds of specs as they're hardly ever needed.

I have no problems understanding how to implement a backpropagation neural network, for instance. However that is thanks to plain English or pseudo code descriptions of the algorithms involved. If somebody tried showing me a mathematical representation of it I'd glaze over instantly.

I've yet to see a single example of something presented in this kind of notation that isn't possible to do just as well in natural language, and that will be significantly more accessible to a significantly larger audience.

If you want to win the Fields Medal, then accessibility to the general public isn't needed as long as other leading mathematicians understand you. If you try to write specifications with the goal of transforming the web, which became successful largely because it was accessible and anybody could easily understand how to make use of the technology, it is.

N3 as a logic language

After yesterday's entry Understanding the Semantic Web: N3 to the rescue I went on to spend some time actually starting to write an N3 parser. The language is straightforward enough, though the BNF grammar was a bit awkward.

That might be because it was actually generated from an N3 description of N3 itself. I ended up reading the N3 description instead of the BNF, and got most of the way to having a working language checker (as in, it parses most of the language but throws away the result) in a couple of hours, and plan to fill it in to create a proper parser later this week.

N3 seems promising to me both as a way of exploring RDF and OWL and as a data format in its own right.

I'll want to implement a basic RDF storage model as well, but that seems quite straightforward (I'm looking at a testing ground, not production quality code)

I was looking at Redland yesterday too, and while I'm sure it's a fine system, it just seems far too complex for my taste.

N3 really drove home the idea that what RDF-S and OWL and the rest are really about is simple logic programming based around Horn clauses. It's a very constrained and simple model, which is good because, apart from some toying with Prolog when I was a kid, I haven't spent much time on it - this is a great opportunity to read up now that I have real world applications for it.
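To make the Horn clause point concrete, here is a minimal sketch of forward-chaining inference over a set of triples. It hard-codes two RDFS-style rules (class membership follows rdfs:subClassOf, and rdfs:subClassOf is transitive). Triple, Graph and infer are hypothetical names chosen for the example, the prefixed names are stored as plain strings, and a real reasoner would read its rules from N3 rather than having them compiled in:

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

typedef std::tuple<std::string, std::string, std::string> Triple; // subject, predicate, object
typedef std::set<Triple> Graph;

// Applies two Horn-clause style rules until no new triples appear:
//   (x rdf:type C),        (C rdfs:subClassOf D)  =>  (x rdf:type D)
//   (C rdfs:subClassOf D), (D rdfs:subClassOf E)  =>  (C rdfs:subClassOf E)
Graph infer(Graph g) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<Triple> fresh;
        for (const Triple& a : g) {
            const std::string& p = std::get<1>(a);
            if (p != "rdf:type" && p != "rdfs:subClassOf") continue;
            for (const Triple& b : g) {
                if (std::get<1>(b) != "rdfs:subClassOf") continue;
                if (std::get<2>(a) != std::get<0>(b)) continue;  // object of a must be subject of b
                fresh.push_back(Triple(std::get<0>(a), p, std::get<2>(b)));
            }
        }
        // Insert after the scan so the set isn't modified while iterating.
        for (const Triple& t : fresh)
            if (g.insert(t).second) changed = true;
    }
    return g;
}
```

Given the three triples (ex:fido rdf:type ex:Dog), (ex:Dog rdfs:subClassOf ex:Mammal) and (ex:Mammal rdfs:subClassOf ex:Animal), infer() adds the facts that fido is a Mammal and an Animal and that Dog is a subclass of Animal. The "semantics" are nothing more than which extra triples show up in the graph.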

Open Source as research

Martin Fowler has a short, interesting article on how Open Source contributes to R&D.

March 20, 2005

Understanding the Semantic Web: N3 to the rescue

I've spent most of today reading up on RDF, OWL and other painful stuff. Things were going really slowly (what f******d decided using set theory to describe the RDF semantics was a good idea, when it could have been so "simple" if they'd instead just explained things in terms of what triples could be inferred) until I came across N3.

I briefly mentioned Metalog earlier, and that was a great start - allowing me to play around with "human readable" assertions. But N3 is a step closer to the "real thing", and in fact Ntriples, a reduced form of N3, can be generated by Metalog.

N3 is part of a Semantic Web Application Platform (or Playground) set up to facilitate practical demonstrations of semantic web technology. So far it's succeeded for me - it's told me far more about the Semantic Web than any of the specifications.

The N3 grammar seems clumsy and badly documented, but there is a great tutorial covering N3 and how to apply it to the Semantic Web.

If you feel brave, you might also want to take a look at Euler - a Java app to verify conclusions by inferring proofs for them. The Java code for Euler is some of the nastiest stuff I've seen (1700 lines in one class and pages upon pages in a single function) but it seems like something worth investigating further once I've digested more of the N3 stuff.

Google sued by French Press

Agence France Presse is suing Google for syndicating their news stories on Google News.
I found out about this via this post at ThreadWatch. See also this CNet News.com article.

This lawsuit cuts to the heart of fair use and the copyright control of databases. The comments on Threadwatch appear divided - particularly with regard to whether opt-out (via robots.txt) is sufficient.

However, for most search engine features, opt-out is clearly sufficient: Fair use protects your right to quote from and refer to a work by title and other distinguishing features, or to describe facts about a work or from a work in general. The typical search engine listing is well within the limit, and if the search engines wanted to, they could likely safely ignore robots.txt with no legal consequences.
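For readers who haven't seen how the robots.txt opt-out works in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names and the URLs are all invented for the example; it only shows the mechanism and says nothing about what any real site or crawler does.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one named crawler is excluded entirely,
# everyone else is only kept out of /private/.
robots_txt = """\
User-agent: ExampleNewsBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_fetch(agent, url):
    """Return True if the rules above allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


story = "http://example.com/news/story.html"
blocked_bot_allowed = may_fetch("ExampleNewsBot", story)
other_bot_allowed = may_fetch("OtherBot", story)
```

The key point is that the check happens in the crawler: the file is a request that a well-behaved crawler chooses to honour, not something the site can enforce.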

The features that are more interesting are Google's cache, which is an outright copy, and their aggregation of data that a company such as Agence France Presse may try to assert a database copyright for, since they provide a syndication service.

This is more likely to involve database copyrights.

In the US, a database can only be protected if it shows originality in its selection, coordination and arrangement. This means that a purely automated aggregation of all news items provided by partners, for instance, would likely not receive any protection.

However, even then, extracting facts from a database is legal, short of duplicating the structure of the whole database.

Under EU law, databases have sui generis protection, that is, you can't reuse data from a database while it is protected (protection lasts for 15 years), even if the data is pure facts. If you want to use these facts you must compile them yourself from other sources. If Google have used AFP's aggregated data to compile the news, this may well be illegal under French law - however it seems weird for AFP to sue in California, as the US Supreme Court have explicitly rejected the idea of such protection under current US copyright law.

The paradox is that Google (and anyone else) clearly have a legal right to publish the location, descriptions and titles of these news articles under fair use, but that there is a chance that aggregating them may violate AFP's copyrights depending on how it's done, and what jurisdiction we're talking about.

(Btw. I am not a lawyer - don't rely on me to decide whether or not it's safe to aggregate data...)

Groklaw: Report from UK PTO software patent workshop

Groklaw has this eyewitness report from the UK Patent Office's Technical Contribution Workshops related to the EU software patents directive.

Don't expect much to come out of this - the UK government is firmly pro-software patents, but it's interesting anyway.

Ah... I get it now: It's Prolog all over again

Metalog, the semantic web query/logical system, seems to be exactly what I've been looking for in terms of allowing simple exploration of Semantic Web technologies without all the lofty promises clouding things up.

Take a look at the quick guide and if you've ever read an introductory article on Prolog you'll be right at home... (In fact Metalog uses Prolog for its reasoning support)

It's helpful in that it allows you to translate the Metalog input into RDF triples and RDF/XML format, while playing with it in a pseudo natural language, so it drives home the mappings much more effectively than most tutorials I've seen.

I wish Metalog had OWL support too, but it's a start, I guess.

Ontology Development 101: A Guide to Creating Your First Ontology

Stanford has a great tutorial online as part of their Protégé project (an open source ontology editor in Java w/an OWL plugin). You can find the tutorial here

I found the link to Protégé at Chaz Blog - seems he's running into some of the same problems I do with how to apply this stuff in practice.

What on earth are Ontologies and taxonomies?

Search Science has a great little summary here

March 19, 2005

The complexities of OWL

I've spent today working on an essay on the Semantic Web, and reading up on OWL in particular.

What hits me is the complexity. The OWL Guide was a big help, but I still find it difficult to see how to apply it to the real world. I mean, I can see the potential - the idea of being able to effectively convey semantics, even in the face of data using different ontologies, and the promise of being able to query about properties that are not explicitly written out in the data through machine reasoning.

But I've yet to find a tutorial or introduction that explicitly addresses more directly useful scenarios instead of the "let's build a complex ontology" scenario.

That, and the lack of a wide choice of tools to reason about OWL ontologies means that we're likely still years away from seeing the real promise of the Semantic Web realised.

In the meantime there is still lots we can do to approach the Semantic Web gradually. One of the main things is to embrace RDF directly or through RDF-A or GRDDL. Without OWL we're stuck doing things like working out mappings between various ontologies by ourselves, but the more widespread RDF datasources become, the more incentive we create to invest in tools that can solve the interoperability issue (whether by making OWL usable, or by finding something else).

I'm curious to what extent the complexity of OWL is needed, or whether it is complex because the problem is still not sufficiently well understood and a simpler solution may come along.
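As a rough illustration of what doing the mapping by ourselves looks like, here is a sketch that rewrites predicates from two vocabularies into a shared one using a hand-maintained table. The `siteA:`, `siteB:` and `common:` names are invented for the example, and a lookup table like this is only a stand-in for the kind of equivalence statements an ontology language is meant to express.

```python
# Hand-written table mapping source predicates onto a shared vocabulary.
# All prefixes and property names here are hypothetical.
PREDICATE_MAP = {
    "siteA:fullName": "common:name",
    "siteB:name": "common:name",
    "siteA:homepage": "common:url",
    "siteB:website": "common:url",
}


def normalise(triples, mapping):
    """Rewrite known predicates; leave unknown predicates untouched."""
    return {(s, mapping.get(p, p), o) for (s, p, o) in triples}


site_a = {("a:vidar", "siteA:fullName", "Vidar"), ("a:vidar", "siteA:shoeSize", "44")}
site_b = {("b:1", "siteB:name", "Vidar"), ("b:1", "siteB:website", "http://example.com/")}

merged = normalise(site_a, PREDICATE_MAP) | normalise(site_b, PREDICATE_MAP)
names = sorted(o for (_, p, o) in merged if p == "common:name")
```

The weakness is obvious: every new data source means another round of manual edits to the table, which is exactly the work better tools would take over.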

Writing 4000 words on the Semantic Web

... turns out to be very easy. My biggest problem with my essay assignment is going to be to cut it down enough to fit within the word limit.

I should have done less reading before I started :)

I won't post the full essay, but I'll make notes on any interesting ideas I get while writing it, and write a few entries on it after I'm done.

Wired summary of EU software patents

Wendy M. Grossman has written a good summary of the current status of the EU software patents mess here.

Defender of the Linux faith

News.com has an interesting article titled Defender of the Linux faith about Harald Welte and his work on gpl-violations.org.

Since setting up the project, Welte has made 25 agreements with companies that were violating the GPL, as well as setting up two preliminary injunctions and one court order. Each of these companies used GPL code without making the altered source code available--a requirement of the licence.

David Weinberger on taxonomy and tags

David Weinberger has posted some short notes from his birds-of-a-feather session on taxonomy and tags at etech that are well worth a read.

They sum up quite simply the differences between "real world" taxonomies, structured around the concept of straight subdivision of concepts, and the emerging categorisation that occurs with tagging, where everything is "one big pile" of concepts with added semantic markup that leaves the categorisation to users (whether human or software).
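The difference is easy to see as data structures. The sketch below is my own illustration with made-up categories and items: a taxonomy is a tree where each concept lives in exactly one place, while tags are flat sets where the grouping is only decided at query time.

```python
# A taxonomy: nested subdivisions, each concept has a single parent.
# The categories are hypothetical example data.
taxonomy = {
    "Science": {"Biology": {"Zoology": {}}, "Physics": {}},
    "Arts": {"Music": {}},
}


def path_to(tree, target, prefix=()):
    """Return the single path from the root to `target`, or None."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found:
            return found
    return None


# Tags: one big pile of items, each carrying any number of labels.
tags = {
    "owl-photo.jpg": {"birds", "zoology", "photography"},
    "owl-guide.html": {"owl", "semantic-web", "ontology"},
    "field-notes.txt": {"birds", "zoology"},
}


def items_tagged(*wanted):
    """Return the items carrying every tag in `wanted`, sorted by name."""
    return sorted(item for item, item_tags in tags.items() if set(wanted) <= item_tags)


zoology_path = path_to(taxonomy, "Zoology")
bird_items = items_tagged("birds", "zoology")
```

With the tree, the author of the taxonomy decides where everything goes up front. With tags, whoever runs the query (human or software) picks the combination that forms the category.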

UK school micromanaging pupils' lives for no good reason

According to the BBC, a UK school bans a girl for getting hair braids. What is it with UK schools and their fascist obsession with controlling students and the way they dress?

After nearly 5 years in this country I still don't get why it has such an obsession with turning school children into sheep with no ability for independent expression.

March 18, 2005

eWeek: Linux & Open Source Header

SCO Has Its Day in Nasdaq Court

eWeek has an article titled SCO Has Its Day in Nasdaq Court that summarises the issues surrounding SCO's missing 10-K and their meeting with the Nasdaq Listing Qualifications Panel (no decision on their delisting yet).

Dan Gillmor has some scathing comments on Bush and taxpayer-funded propaganda.

It's scary reading. I've never had any illusions about the objectivity of the press, but I have generally assumed a certain minimum level of honesty, even if blatantly biased according to an editorial agenda.

This kind of propaganda is more insidious. A right wing journalist who writes about news from a right wing viewpoint because they believe in it I have no problems with. It's relatively obvious most of the time. But news items produced like this, with no proper attribution that allows you to consider the source with a critical mind, are nasty.

But then I'm not really that surprised, given the, from a European viewpoint, extremely right wing slant of US media (which makes it hilarious whenever US right wing people bring out the "liberal media" complaint).

European mainstream news spans a much wider spectrum, all the way from communist newspapers in many countries, including L'Humanité (in English), to right wing, extremely conservative papers such as the Daily Telegraph, and of course we do have