Vidar Hokstad V2.0

Tag: search

2008-04-24 10:08 UTC Where tagging falls apart

Tobey Maguire has a daughter named Ruby.

How do I know? No, I haven't started following celebrity news. But I do regularly skim the Technorati Ruby tag.

Tags suck.

I get lots of what I'm after, namely posts about Ruby development, but I also get a hell of a lot of junk. Not junk as in spam, though there's that too, but junk as in semantic overload of terms that include meanings I have no interest in.

I don't care about celebrities and their families, nor do I care about gemstones, people gushing over their pets on their blogs, or any number of other things that end up tagged "ruby", even though the tag is perfectly reasonable in the eyes of the people who posted those entries, and presumably for a lot of users.

What it boils down to is that "folksonomies", which seemed to be all people blogged about a couple of years ago, work best when they are confined to niches. They work far less well when the internet at large starts tagging. It's one of those nasty cases where all is well and fine at small scale, when your audience is relatively homogeneous at least in terms of the terms they use, but where things fall apart as they scale up.

(Incidentally, the failure of tagging is a pretty good example of why it makes next to no sense to test if your data set isn't realistic.)

We learned the hard way at Edgeio that a varied audience and consistent tagging do not mix. While we started out trying to use tags for most things, we eventually moved more and more towards using various classification and feature extraction methods to improve ranking and make the search results for the classifieds we were fed more cohesive, as people were chronically unable to apply the same semantic meanings to the same tags.

And that was despite the fact that by far most of our listings came from large providers that fed us hundreds or thousands of listings - most of them professionals in their niches.

The feedback loop doesn't work. One of the big hopes for folksonomies in many people's minds was that common usages of specific terms would crystallize as people saw how specific tags were used elsewhere. I.e. the Ruby gemstone people might see their stuff getting swamped by the Ruby programming language mob and start tagging their stuff "ruby gemstone" instead of just Ruby. So far I've seen very little evidence that people look at how other people tag their stuff and adapt. I'm not saying it doesn't happen, but certainly not enough to "clean up" tags with massive semantic disconnect between different groups of users.

Another tag with unfortunate disconnects is Rack. When I first took a look at it to look for people writing about the Ruby web server to framework adapter 'Rack', I was completely oblivious to the fact that of course I'd also be facing a lot of posts about scantily clad women... I was surprised, honest - while of course I knew the word is also used for breasts, it didn't cross my mind at all at the time.

Classification of data is a hard problem, and classification of short snippets of data even more so (people have less space to distinguish their listing from someone else's, and so each word has a greater chance to skew things). Tags are still useful hints - I could easily write about a specific Ruby project, for example, and manage to avoid including the word Ruby in the text. If so, adding the tag Ruby would be useful in reducing the chance of ambiguity about what the project name referred to. Rack is a good example.
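To make the "tags as hints" point concrete, here is a minimal Python sketch. It is an illustration only, not a description of any real system: the two senses, their context words and the function names are all made up for the example. It treats tags as extra words of evidence and refuses to guess when the evidence is missing or tied.

```python
import re

# Hypothetical, hand-picked context words for two senses of "rack".
SENSES = {
    "rack (Ruby web adapter)": {"ruby", "middleware", "server", "framework", "rails", "http"},
    "rack (other)": {"bike", "wine", "shelf", "storage", "roof"},
}

def tokens(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def guess_sense(text, tags=()):
    """Return the sense whose context words overlap most with text plus tags.

    Returns None when there is no evidence at all, or when the top two
    senses are tied.
    """
    evidence = tokens(text) | {t.lower() for t in tags}
    scores = {sense: len(words & evidence) for sense, words in SENSES.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)

post = "Released a new version of my Rack app today"
print(guess_sense(post))                 # the text alone gives no evidence
print(guess_sense(post, tags=["ruby"]))  # the tag tips the balance
```

With a snippet this short a single word decides the outcome, which is exactly why each word has such a large chance of skewing things.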

But tagging isn't enough.

I'd be happy to type in extra search terms if I would get posts more closely related to what I'm looking at. Wikipedia style disambiguation, for example, or clustering like what some of the Google competitors such as Clusty are experimenting with.

But most search engines still expect your entire search term to be a literal search. I.e. if I search for "Ruby programming", even most "next generation" engines expect me to be looking for documents containing both of these terms. I may very well be, but what I want to ask for is posts related to Ruby programming, regardless of the terms they contain.

In fact, almost regardless of what I search for, I am searching for something related to a concept, not something containing specific strings (unless I'm searching for a quote, which does happen).
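The difference between the two kinds of matching can be sketched in a few lines of Python. This is a deliberately crude illustration under stated assumptions: the concept-to-terms map is hand-written and hypothetical, and a real engine would have to learn something like it from data rather than hard-code it.

```python
import re

# Hypothetical hand-written map from a concept to terms that tend to signal it.
CONCEPT_TERMS = {
    "ruby programming": {"ruby", "rails", "rake", "irb", "rubygems", "sinatra", "rack"},
}

def words(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def literal_match(query, doc):
    """True if every query word appears in the document."""
    return words(query) <= words(doc)

def concept_match(concept, doc, min_hits=2):
    """True if the document contains at least min_hits distinct concept terms."""
    return len(CONCEPT_TERMS[concept] & words(doc)) >= min_hits

on_topic = "Deploying a Rails app with Rake tasks"
off_topic = "Ruby loves programming her new toy robot"

print(literal_match("Ruby programming", on_topic), concept_match("ruby programming", on_topic))
print(literal_match("Ruby programming", off_topic), concept_match("ruby programming", off_topic))
```

The literal search misses the on-topic post and accepts the off-topic one, while the concept match does the opposite. Requiring two distinct terms is an arbitrary threshold chosen so that the word "ruby" on its own is not enough.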

There are lots of search startups with different approaches to this, such as the much hyped Powerset. But "generic" search is actually the lesser problem for me - I'm good at translating searches into text to find what I want. I'm more interested in finding recent blog posts matching a specific topic rather than matching specific tags or text strings from the content.

I'd really like to hear about sites that are getting close to doing the right thing in this space.


2008-04-11 08:49 UTC Dealing with information overload

I've been thinking a lot about dealing with information overload recently. With the ever-increasing hype for sites like Twitter and FriendFeed, neither of which I use, and a steady stream of Facebook invitations, LinkedIn requests and invites to a continuously growing set of social networking sites and bizarre (to me) services that all add some form of social networking features, I'm more and more fatigued.

I hardly keep up with even my feeds and my e-mail, never mind my IM accounts - messages from people keep building up for days before I answer them.

And the thing is, I'm not actually connecting to large numbers of people. I'm fairly anti-social and notoriously bad at keeping in touch with people I've worked with in the past etc. (it's nothing personal folks - I'm happy to hear from people, I'm just rarely initiating contact myself because I'm always deep into focusing on something or other).

I need better tools to manage it.

FriendFeed aims to combine lots of streams of data into one huge river. The problem is that my challenge isn't to get a single view - it's to effectively manage what's there:

What should I read? What should I ignore? How do I find past information? What do I need to respond to? What should I keep and what should I discard completely?

In other words I need a personal search engine and my own personal recommendation or classification engine.

If I had time I'd build one. Building a small scale search engine for this data is trivial - there are tons of packages like Sphinx that are "good enough", so the problem is mostly ranking (and that is by no means trivial for data like this, where many items can be as short as a single line, and which doesn't have enough internal linking for something like PageRank to work). Building a classifier is also fairly trivial - but the problem is similar: small snippets of data + training. Again, that's not a trivial problem.
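As a rough idea of what the "fairly trivial" part looks like, here is a minimal multinomial naive Bayes classifier in Python with add-one smoothing. It is a sketch under stated assumptions: the class name, the "respond"/"ignore" labels and the training snippets are invented for illustration, and naive Bayes is just one plausible choice of method.

```python
import math
import re
from collections import Counter, defaultdict

class SnippetClassifier:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training snippets
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for word in self.tokenize(text):
                # Add-one smoothing so unseen words don't zero out the score.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = SnippetClassifier()
clf.train("can you review my patch before friday", "respond")
clf.train("are you free for a call tomorrow", "respond")
clf.train("weekly newsletter top stories this week", "ignore")
clf.train("someone invited you to join a new social network", "ignore")
print(clf.classify("can you call me tomorrow"))
print(clf.classify("join our new newsletter"))
```

The code is the easy bit. With four training snippets and one-line inputs, a single unfamiliar word can flip the result, which is the small snippets + training problem in miniature.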

But even a badly flawed one would be vastly superior to nothing. Getting something like that to "product quality" is hard. Getting something that is more than marginally usable for personal use should be doable, and mostly limited by my continuous lack of time. So I guess I'll paraphrase Abbie Hoffman:

-Steal this idea

Pretty please?

I want someone to build a service that'll do this for me, damn it. Even if I have to install a local client to capture some of the data. Yes, it will have a ton of privacy and security implications, and no, I wouldn't trust just anyone with it.

Let's summarize:

  • Search my e-mail, feeds, pages I've liked (via StumbleUpon, DZone or others), my IMs, the streams from any social networking sites I'm on (any not covered by my RSS feeds), comments I've written on other blogs, and generally my whole "online footprint".
  • Give me a filtered view that clearly shows me what I a) need to respond to, b) need to read, c) am likely to find interesting, along with its source and relevant/related items (replies/comments from other people, other blogs referring to the same thing, etc.)
  • Oh, and if it could recommend new information sources and show me which ones I really shouldn't bother with (of the ones I'm already following) based on my pattern of usage, that'd be nice too.

Yeah, it's a tall order.


About me

E-mail: vidar@hokstad.com
Skype: vhokstad

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.
