Regular readers of my blog know that I have a long-standing interest in the potential of full-text search technology for both litigation support and knowledge management applications in large law firms. With the seeming explosion of new companies offering advanced full-text approaches, I have been trying to sort out what is really new and what really works, so I have asked an expert. 

A couple of years ago I met Sharon Flank of DataStrategy Consulting. Sharon has a PhD from Harvard in computational linguistics, meaning she is an expert on full-text search technology (her bio is at the web site). The company offers consulting on technology due diligence, product strategy, and technology planning, especially in information retrieval, natural language processing (NLP), and visualization.

It occurred to me that she was the perfect person to ask about the underlying developments. Last week, I sent her an e-mail message asking the following:

“Has there been any conceptual break-through – at an algorithm level – in full-text and semantic analysis in the last 10 years? From 1990 to 1995 I looked at many products: PLS, Verity, Excalibur, Conquest, Fulcrum, and others I can’t now remember. It seems to me that those products did much of what current products currently do, except perhaps the extensive auto-classification (though that was less of a requirement back then). Clearly, the ability to process large volumes has gone up and user interfaces have improved. I’m not close to the computer science but am curious if the underlying advances have been significant, perhaps even quantum, or merely incremental. Thanks in advance for any thoughts.”

Sharon was kind enough to send back the following reply about the current state of natural language processing and full-text search:

“There are several underlying important developments over the last decade or so:

  • Incorporating user feedback to refine search results, usually indirectly rather than explicitly, making results better through machine learning. [Amazon.com is the most-often cited example of this with its “if you like A, you’ll also like B.”]
  • Assessments based on usage or referral. This is what makes Google so useful and popular. This approach gives higher rankings if other web sites point to a target or if that target gets a lot of hits.
  • Various approaches to using taxonomies. The better applications use taxonomies as a navigation guide but don’t force it or require administrators to implement it. Vivisimo.com is an example of an interesting, automated clustering approach.
  • Better handling of phrases. Google automatically parses phrases and deals with search terms as phrases. This now seems natural but in the AltaVista days, you couldn’t tell a Venetian blind from a blind Venetian [example courtesy of Prof. George Miller, Princeton Univ. – too good not to cite].
  • Context-sensitive search is now an emerging trend. Systems track what users have previously searched for and infer interest in the same domain to refine search results. So if you look for “line” and a system knows you’ve just looked for “tacklebox,” then it infers you mean “fishing line.” Or if you search for bagels and the system knows you are in 20009, it tells you that you can buy them at Comet Liquors (which happens to sell bagels).
  • More generally in natural language processing, the statistical and linguistic approaches are converging in a new way: use massive amounts of data (i.e. the Web) to get statistical answers to deep linguistic questions, like ‘How do we figure out what the most likely referent is for the pronoun “they”?’ Or ‘How do we determine the correct sense for ambiguous words?’ These things aren’t in search engines yet, but you can expect to see more ‘intelligent’ features coming out of this approach.

“Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.”
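
For readers who want to see the phrase-handling point in concrete terms, here is a minimal sketch of my own in Python. It is not from Sharon, and it is not how Google or any commercial product actually works. It only illustrates the difference between matching query words in any order and matching them as an adjacent, ordered phrase. The function names (`tokenize`, `bag_of_words_match`, `phrase_match`) and the two sample sentences are made up for illustration.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip('.,;:!?"\'').lower() for w in text.split())
    return [w for w in words if w]


def bag_of_words_match(query, document):
    """True if every query word appears in the document, in any order."""
    doc_words = set(tokenize(document))
    return all(w in doc_words for w in tokenize(query))


def phrase_match(query, document):
    """True only if the query words appear adjacent and in the same order."""
    q = tokenize(query)
    d = tokenize(document)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


# Hypothetical sample documents, invented for this illustration.
docs = [
    "She bought a Venetian blind for the office window.",
    "The guide was a blind Venetian who knew every canal.",
]

for doc in docs:
    print(bag_of_words_match("Venetian blind", doc),
          phrase_match("Venetian blind", doc))
```

The any-order match cannot tell the two sentences apart, while the phrase match accepts only the first. A real engine would presumably do this with positional indexes rather than by scanning each document, but the distinction is the same.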

I appreciate Sharon taking the time to provide this insight. The bottom line for litigators and litigation support professionals: you need to keep your eye on emerging technologies and not necessarily take a “one-size-fits-all” approach to managing large volumes of documents.