Classifying Web Search Results
by Chiara Fox
Search is a subject that I’ve always been interested in, especially internal or enterprise search within a site. Not web search like Google or Yahoo!. Sure, there are lots of search engine optimization (SEO) or marketing (SEM) tricks you can do to improve your ranking in the web search engines. But that’s never really held any fascination for me.
Enterprise search: now that’s fascinating! It’s much easier to tune an enterprise search engine to make the results you want float to the top. (Assuming, of course, you have access to your IT department to make the changes you want.) Weighting of metadata is a simple way to do this. Tools like Verity or Vivisimo make categorization, “best bets,” and other changes to results lists easier to do. Though I have to admit, the librarian in me is very skeptical of the promises that those companies make. I don’t trust their auto-classification engines to do as good a job as a person could (or to do it in the time they say it takes). And I firmly believe that having someone to care for and feed the classification/taxonomy/vocabulary/whatever-you-want-to-call-it is the best way to get good results.
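To make the metadata weighting idea concrete, here is a minimal sketch in Python. It is not how Verity, Vivisimo, or any particular engine does it; the field names, the weights, and the sample documents are all made up for illustration. The idea is simply that a query term found in a heavily weighted field (like the title) counts for more than the same term found in the body.

```python
import re

# Hypothetical weights: a match in the title counts three times as much as one in the body.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def score(doc, query):
    """Count query-term occurrences in each field, scaled by that field's weight."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(doc.get(field, ""))
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(docs, query):
    """Return the documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda doc: score(doc, query), reverse=True)


# Made-up intranet pages, purely for illustration.
docs = [
    {"id": "a", "title": "Cafeteria menu", "keywords": "food lunch",
     "body": "The vacation policy is unrelated."},
    {"id": "b", "title": "Vacation policy", "keywords": "vacation leave hr",
     "body": "How to request vacation."},
]

for doc in rank(docs, "vacation policy"):
    print(doc["id"], score(doc, "vacation policy"))
```

Raising or lowering a value in FIELD_WEIGHTS is the kind of tuning that makes the page you want float to the top, without touching the documents themselves.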
Recently, I started looking into what is being called “vertical search.” It takes the approaches traditionally used in enterprise search (like classifying results) and applies them to the web at large. Folks like Kosmix and Clusty are leading the charge. This sounds a lot like what Northern Light (remember them?) was doing back in 1999 and 2000. However, unlike Northern Light, which used people to come up with its categories (the blue folders), Kosmix and Clusty are using complex algorithms to determine what the web pages are about. Kosmix, for example, focuses on a subset of the web (e.g., travel, health, politics) and subdivides the results into different categories.
Just like with the enterprise search engines, I’m a bit skeptical about this approach. The classification that they are doing isn’t very sophisticated (they use categories like “basic information” or “blogs”), but it is certainly more helpful than a list of thousands of results à la Google. It will be interesting to see where this goes. A hybrid approach using both algorithms and human-moderated categories seems like it would give the best results. Though I don’t know of anyone really taking that kind of two-pronged approach. Do you?
February 6th, 2008 at 5:39 pm
Here’s a search utility that I saw recently whose entire purpose is to classify your search query. It’s for web searches, but the line of thinking is similar to what you were talking about. What it does is determine the topical site that is most appropriate for your search, and push the query to that engine. It uses Amazon for books, Google Maps for directions/locations, Flickr for photos, and so on.
Boom! Intelligent Search Assistant (about).
One thing that I found happening after a couple of test searches is that if I didn’t get the site I expected, I would hit back and adjust my query. Given that I’m a total geek, I would quickly learn the correct query keywords to get the site I want, much like using Google’s advanced search operators. This is acceptable for personal use by a more technical audience (which appears to be how this tool came into being), but needs refinement for a general audience.
An internal application of this concept could allow the searcher to correct the engine’s guess and pick a more appropriate or specific category. This could avoid the “hit the back button and try again” loop. I’m still skeptical about using this technique on a general-audience site; it’s probably too strong a filter of the results.
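Here is a rough sketch of that idea in Python, just to show the shape of it. I don’t know how the tool above actually classifies queries, so the cue words, the category names, and the function names below are all my own assumptions. The point is that the engine returns its guesses in order and the searcher’s explicit pick always wins.

```python
# Hypothetical cue words per destination; the names are labels only, no real service is called.
ROUTES = {
    "books": {"book", "novel", "author", "isbn"},
    "maps": {"directions", "near", "map", "address"},
    "photos": {"photo", "photos", "picture", "pictures"},
}
DEFAULT = "web"


def guess_routes(query):
    """Return destinations ordered best guess first, always ending with the default."""
    words = set(query.lower().split())
    hits = {name: len(words & cues) for name, cues in ROUTES.items()}
    matched = [name for name in hits if hits[name] > 0]
    ranked = sorted(matched, key=lambda name: hits[name], reverse=True)
    return ranked + [DEFAULT]


def route(query, override=None):
    """Use the searcher's explicit choice if one is given, otherwise the top guess."""
    if override is not None:
        return override
    return guess_routes(query)[0]


print(route("directions to the map room"))
print(guess_routes("photo book"))
print(route("photo book", override="photos"))
```

Showing the full list from guess_routes next to the results gives the searcher something to click instead of the back button.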