On the Semantics of Search

I’ve always been taken by the term Information Management. As with so many phrases in the computer lexicon, this is one that has become both very specialized – focusing primarily upon the various and sundry database applications that a given organization uses – and rather vague. Vendors seize upon this vagueness by claiming that their particular database or content management system or network dashboard will of course automate away all those messy information management issues, though by the time you unwrap it and install it you come to realize that you have in fact simply purchased yet another database whose purpose is to keep track of all the other databases.

However, the self-referential nature of this process points to one of those uncomfortable truths – information is fundamentally fractal. We organize our documents in words and sentences and paragraphs, each of which provides an implicit assertion about the conceptual breakdown of this content. A paragraph is a narrative thread that indicates that its component sentences assert a point or tell an aspect of a story. An article presents a whole thesis, and incorporates a title, publishing information, summary blocks, and increasingly categorical metadata. A chapter is typically a collection of tightly related articles, a book a collection of ordered chapters, each of which also contains bound metadata to answer the dreaded question – “What is this unit of content about?”

Markup is a form of metadata, albeit metadata that, while nominally intended to be read by a human being, exists primarily as a mechanism for helping computers more readily identify these points of abstraction for processing. One form of markup, XML, works reasonably well when imposing a specific semantic layer of interpretation upon a document, but it should be understood that this interpretation is in fact arbitrary – a selection of text can be marked up in a number of different ways depending upon the intent of the person marking it up. An etymologist marking up uses of speech is going to produce different markup than a poet or writer (who will likely concentrate upon narrative cohesiveness), while an archivist may be far more interested in historically relevant information within the content.

Similarly, different types of documents and collections of documents also have different levels of abstraction. Most markup is a manually assigned (or at least manually arbitrated) view of a “manageable” document. A “search” is in and of itself a document, albeit one consisting of linked summaries to other documents. In a small, finite space, such search documents can be manually generated, but once the number of documents ceases to be manageable, it falls to computers to run algorithms that identify a specific set of documents based upon user criteria. Typically such search mechanisms employ four parts – a formal internal set of indexes (properly, indices) that extract some portion of each document and use it to make a key identifying that document, a query function (typically parameterized) that is able to query against those indexes to retrieve a set of document pointers, the parameters themselves (typically entered by the querent), and a transformation process to convert the resulting document abstracts into a sequence of “teasers” or similar report format.
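The four parts described above can be sketched in a few lines of Python. This is only an illustrative toy, not how any production search engine is built: the sample corpus, the function names (build_index, query, teaser, search) and the 40-character teaser width are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample corpus: document id -> text.
DOCS = {
    "doc1": "Markup is a form of metadata that helps computers find structure.",
    "doc2": "Search engines build indexes over documents to answer queries.",
    "doc3": "Semantic queries construct documents from assertions.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Part 1, the index: map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def query(index, terms):
    """Part 2, the query function: ids of documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

def teaser(docs, doc_id, width=40):
    """Part 4, the transformation: turn a document pointer into a summary line."""
    text = docs[doc_id]
    snippet = text if len(text) <= width else text[:width].rstrip() + "..."
    return f"{doc_id}: {snippet}"

def search(docs, index, user_input):
    """Part 3, the parameters, arrives here as the querent's user_input string."""
    return [teaser(docs, doc_id) for doc_id in query(index, tokenize(user_input))]

index = build_index(DOCS)
for line in search(DOCS, index, "semantic documents"):
    print(line)
```

The list returned by search is itself the “search document” in the sense used above: a generated set of linked summaries rather than anything a person authored.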

This is the Google paradigm, and is similarly the paradigm that most contemporary search engines employ, in one form or another. However, even this is changing. Computing has a long history of building hardware systems to solve a certain problem and then, as hardware improves, abstracting the hardware pieces over time into software, that is, into code file documents. Once abstracted, such code itself becomes information to be managed. Indeed, increasingly search today involves ascertaining first what kind of search is needed in order to achieve what type of result, which means that search mechanisms are being used, either directly by humans or by inference through behaviors, to determine which kinds of search need to be performed. Search becomes fractalized.

Even as this search abstraction process is going on, the types of searches (and their effects) are changing as well. Semantic web based searches can be used to reference conceptual blocks of content, but they can also be used for representing other entities abstractly that exist primarily as assertions that they exist. It is possible to craft a query (in this case in SPARQL) that can be used to query these assertions and from them construct a document (possibly with the aid of some other transformational language such as XQuery or XSLT), but it’s worth noting that the document itself – a “bundled” collection of information – does not itself exist in such an environment until it is created, and even once created, may not be the same document from invocation to invocation of the query.
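To make the idea of a document that only exists once the query runs a little more concrete, here is a minimal Python sketch. It is a stand-in for the kind of pattern matching a SPARQL query performs, not SPARQL itself, and every identifier in it (the ex: names, match, build_document) is hypothetical.

```python
# Hypothetical assertions stored as (subject, predicate, object) triples.
triples = {
    ("ex:person1", "ex:name", "A. Author"),
    ("ex:person1", "ex:worksFor", "ex:org1"),
    ("ex:org1", "ex:name", "Example Org"),
}

def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def build_document(store, subject):
    """Assemble a 'document' about subject from whatever is asserted right now."""
    doc = {"about": subject}
    for _, predicate, obj in match(store, s=subject):
        doc.setdefault(predicate, []).append(obj)
    return doc

first = build_document(triples, "ex:person1")

# A new assertion arrives between two invocations of the same query.
triples.add(("ex:person1", "ex:writesAbout", "ex:topic1"))
second = build_document(triples, "ex:person1")

print(first == second)  # the two invocations no longer agree
```

Nothing called “the document about ex:person1” is stored anywhere; it is bundled together on each call, which is why the second invocation differs from the first.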

In a number of respects, this is a radical step forward for search, especially as the queries involved are most likely inferential in nature. Put another way, in such a search the information exists in potentia but, until the query is actually invoked, does not exist within a given file system, database or repository as a clearly defined, human authored document. At the same time, because the query plus its parameters constitutes a URI, such documents are still addressable, even though they didn’t exist until they were created and are effectively dynamic, living documents from call to call.
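The point that a query plus its parameters constitutes a URI can be illustrated with Python's standard urllib.parse module. The endpoint address, the document_uri helper and the convention of passing bindings as extra query-string parameters are assumptions made for this sketch; real endpoints differ in how they accept parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint

def document_uri(query_text, **params):
    """Build one address from a query and its parameters.

    Parameters are sorted so the same query and bindings always
    produce the same URI, whatever order they were supplied in.
    """
    pairs = [("query", query_text)] + sorted(params.items())
    return f"{ENDPOINT}?{urlencode(pairs)}"

uri = document_uri("SELECT ?p ?o WHERE { ?s ?p ?o }", s="ex:person1")

# The query and its bindings can be recovered from the address alone.
recovered = parse_qs(urlsplit(uri).query)
print(uri)
print(recovered)
```

The URI names the question rather than a stored file, so dereferencing it twice may return two different documents while the address stays the same.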

From a search standpoint, what this means is that Semantic “search” actually moves into the realm of formal analysis, rather than just finding resources, and is also becoming more distributed in the process, as the “databases” in question may potentially span large numbers of linked data providers. Additionally, such semantic search also differs from more traditional searches in that the inferences that are developed may in turn be pushed back into the associated data stores, which means that over time such data repositories become more robust through user participation (in effect, filling in the details about a given person, location, work or similar entity).
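One way to picture inferences being pushed back into a data store is a simple forward-chaining pass that writes its conclusions into the same set of triples it read from. The rule used here (transitivity of a single predicate), the infer_transitive name and the sample data are all assumptions chosen for illustration; real semantic stores support far richer rule sets.

```python
def infer_transitive(store, predicate):
    """Add (a, predicate, c) whenever (a, predicate, b) and (b, predicate, c) exist.

    Repeats until no new triples appear. The store is updated in place and
    the set of newly added triples is returned.
    """
    added = set()
    while True:
        links = [(s, o) for s, p, o in store if p == predicate]
        new = {
            (a, predicate, d)
            for a, b in links
            for c, d in links
            if b == c and (a, predicate, d) not in store
        }
        if not new:
            return added
        store |= new   # push the inferences back into the store
        added |= new

# Hypothetical assertions about where things are located.
store = {
    ("ex:office", "ex:locatedIn", "ex:berlin"),
    ("ex:berlin", "ex:locatedIn", "ex:germany"),
    ("ex:germany", "ex:locatedIn", "ex:europe"),
}

added = infer_transitive(store, "ex:locatedIn")
print(len(added), "inferred triples written back")
```

After the pass, a later query asking whether ex:office is located in ex:europe is answered directly from the store, without repeating the inference.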

This is the same pressure that is pushing Hadoop and similar name/value data systems. Fifty years ago, information was expensive – the cost per kilobyte of information could be measured in the tens or even hundreds of dollars, depending upon the information in question. Now the cost has dropped to micro-cents per kB, and the world is literally awash in information, so much so that it transcends the ability of human beings to understand it. Tools such as Hadoop are useful for creating basic abstraction layers on the raw data, relational data tools are useful for qualifying that data, XML databases are increasingly becoming core for working with that data higher in the abstraction stack, while Semantic tools provide the ability to create relationships between different core nodes of that data.

Additionally, because of latency issues and the fact that the more recent generation of data stores provides evolving information, information management becomes increasingly stochastic and asynchronous. The reliability of gathered data becomes significant, and queries will increasingly have to trade off immediacy for fidelity. When a search is made, the result is not necessarily the best fit, but only the best fit right now while data is being gathered and processed. Databases are no longer contained in a single box in the data center, but may be in data centers in Rio de Janeiro, Berlin, Beijing and Sydney – and may not even necessarily be under one company’s control.
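The trade-off between immediacy and fidelity can be sketched with Python's asyncio: query several stores at once, wait only until a deadline, and return whatever has arrived. The store names, delays and the best_fit_now helper are simulated assumptions; no real network calls are made.

```python
import asyncio

# Hypothetical stores: name -> (simulated latency in seconds, hits returned).
STORES = {
    "rio": (0.01, ["doc-a"]),
    "berlin": (0.02, ["doc-b"]),
    "sydney": (5.0, ["doc-c"]),
}

async def query_store(name, delay, hits):
    """Simulate a remote data store that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return name, hits

async def best_fit_now(stores, deadline):
    """Collect whatever answers arrive before the deadline expires."""
    tasks = [
        asyncio.create_task(query_store(name, delay, hits))
        for name, (delay, hits) in stores.items()
    ]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:
        task.cancel()  # give up on stores that were too slow
    results = dict(task.result() for task in done)
    return results, len(pending)

results, missing = asyncio.run(best_fit_now(STORES, deadline=0.5))
print(sorted(results), "answered;", missing, "store(s) still pending")
```

Raising the deadline buys a more complete answer at the cost of a slower one, which is the immediacy-for-fidelity trade described above.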

In effect, data search systems are becoming quantum in nature – the actual data space exists as a wave equation, but your observation of it (through search query mechanisms) forces that wave equation to collapse to a particular state. Curiously enough, there are indications that this is more or less the same mechanism that occurs in the brain when remembering things, that memories are in effect standing waves that collapse into an internal “vision” due to the filtering effect of human perception.

This is one of the reasons why it’s perhaps time to rethink our definition of search. At the enterprise level, the goal increasingly is not to find what has already been created but rather to turn what has been created into a knowledge base to better understand what’s coming down the road. In a way, this isn’t surprising – while what has been developed in the past may hold interest to the archivist or curator (or auditor), the past doesn’t necessarily contain deep insights about the future. However, as the torrent of information becomes a deluge, enterprise search systems need to be able to glean from what’s coming in today what will likely be coming in tomorrow. Enterprise search becomes business intelligence. That’s the way forward.

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.
