Archive for the ‘Search’ Category

Is MarkLogic a Search Engine?

Monday, September 26th, 2011

I am frequently asked if MarkLogic is really a search engine.  It is easy to debate whether MarkLogic fits the classic definition of a search engine.  In my opinion, this is the wrong question.  The question you should be asking is “Does MarkLogic enable great search experiences?”  The answer is undeniably Yes.

MarkLogic comes with all of the standard search capabilities like: keyword search, synonyms, fuzzy search, hit highlighting, sorting, faceted navigation and relevance.  These are the basic features that every search engine should have.   MarkLogic checks the box on every one of these and more.

The fact that MarkLogic can do all of the basics makes it just like all of the other search engines on the market.  What sets MarkLogic apart is that it is not just a search engine.  MarkLogic combines some of the best features of search with a fast performing XML database.  This combination allows MarkLogic to offer features that traditional search engines lack.  Four of the most important differentiators are:

  • multi-level searching,
  • editable search results,
  • schema flexibility,
  • and simplified architectures.

MarkLogic allows for multi-level searching.  Most search engines require you to flatten out the data for search results.  MarkLogic is an XML database.  As a result, information can be stored in a hierarchical format and queried at multiple levels.  This is particularly important in more complex search experiences.  For example, if you are searching large documents, you may want to show the documents that contain your search term along with the sections of the documents that have that term.  Normal search engines would require you to create multiple collections or a complex search screen.  MarkLogic handles these situations naturally.
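To make the idea of multi-level searching concrete, here is a minimal sketch in plain Python rather than MarkLogic itself. It assumes a hypothetical XML layout (`library`, `document`, `section`) and a simple substring match. The point is only that a hierarchical store can report both the matching documents and the matching sections in one pass.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content; the element and attribute names are
# illustrative assumptions, not a MarkLogic schema.
SAMPLE = """
<library>
  <document title="Annual Report">
    <section heading="Overview">Revenue grew across all regions.</section>
    <section heading="Risks">Currency risk affected revenue in Europe.</section>
    <section heading="Outlook">Hiring will continue next year.</section>
  </document>
  <document title="Staff Handbook">
    <section heading="Leave">Employees accrue leave monthly.</section>
  </document>
</library>
"""


def multi_level_search(xml_text, term):
    """Map each matching document title to the headings of its matching sections."""
    root = ET.fromstring(xml_text)
    needle = term.lower()
    hits = {}
    for doc in root.iter("document"):
        # Search at the section level, but keep the parent document as context.
        sections = [
            sec.get("heading")
            for sec in doc.iter("section")
            if needle in "".join(sec.itertext()).lower()
        ]
        if sections:
            hits[doc.get("title")] = sections
    return hits


if __name__ == "__main__":
    for title, headings in multi_level_search(SAMPLE, "revenue").items():
        print(title, "->", headings)
```

A flattened index would need either one record per section (losing the document grouping) or one record per document (losing the section detail); keeping the hierarchy avoids that choice.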

MarkLogic’s database features allow you to create applications with editable search results.  Our architects call it a “Live” search tool as opposed to a “read only” search tool.  Traditional search engines are designed to be read only.  Edits to existing search data require re-indexing.  Solution providers like Avalon create special indexing routines to allow for updates to content.  These solutions are not real-time and they are not simple.  Fields can be updated or added to a MarkLogic database at any time, transactionally, with full ACID protection.  This flexibility allows us to create a number of really interesting search applications that would have been much more difficult with standard search engines.  For example, we have created tools that allow end-users or administrators to “tag” one or more search results (similar to the functionality in Flickr).  In other applications, we have created search screens where the users can edit the search results without leaving the screen.  Adding these cool features to our search applications is much easier with a combined database and search engine.

As an XML database, MarkLogic provides schema flexibility for storing and querying information.  Our developers and our clients love MarkLogic because it is easy to add new fields to the index.  Traditional search engines typically require administrators to delete and reload the data in order to add specific fields.  In extreme cases you have to re-index an entire data set.  MarkLogic’s schema flexibility becomes even more important when you are working with techniques like entity extraction.  Text Analytics tools can identify people, places and things within unstructured text.  Through this process our clients often find interesting things they want to include in their search applications.  MarkLogic makes it easy to run text analytics against unstructured documents and include the entities in the search results.  Traditional search engines add a great deal of complexity to the process and do not allow for changing structures.

Our architects like MarkLogic because of its simplified architecture.  The next time you meet with your search engine vendor, ask them for a physical architecture diagram of one of their larger implementations.  At a minimum you will have a database or file system to store documents and data, a search indexer, a search server, and a web server.  Large data sets get even more complicated.  Search results have to be clustered and replicated.  You will need multiple indexers and search servers running.  You will also likely need more than one web server and application server for your front end application.  MarkLogic is a database server, search engine and application server in one tool.  It also has built-in replication.  This means fewer servers and less complexity in your dev, test and prod environments.

One final reason to use MarkLogic to power your search applications is that MarkLogic is not just a search engine.  Traditional search engines are very powerful, but they are expensive and limited to search-based use cases.

  • Want to publish thousands of documents to your website or mobile devices?  Some of the largest publishers in the world use MarkLogic to do this every day.
  • Want to build an application that allows users to build reports on the fly by combining sections from other documents?  Those same publishers use MarkLogic to offer custom publishing solutions.
  • Want to create a central repository tracking all of your digital assets?  We are working with three different customers using MarkLogic as a central repository across all of their content management systems.
  • Do you need a tool to capture unstructured information for your Big Data solution?  MarkLogic does this for numerous government customers.

At the end of the day, when your management asks you how much you spent on your search solution, it is nice to say that the tool you bought does more than just search.

In fairness, MarkLogic may not be the best solution for an organization that is looking to build a vanilla search intranet that indexes content from numerous secure repositories.   Search engines like Endeca, Autonomy, Vivisimo and Lucene/Solr were designed for these types of solutions.  If, however, you need to build a powerful search application that will change over time, MarkLogic is a great choice.  It offers many valuable features that are not available in any other search engine.

On the Semantics of Search

Tuesday, July 19th, 2011

I’ve always been taken by the term Information Management. As with so many phrases in the computer lexicon, this is one that has become both very specialized – focusing primarily upon the various and sundry database applications that a given organization uses – and rather vague. Vendors seize upon this vagueness by claiming that their particular database or content management system or network dashboard will of course automate away all those messy information management issues, though by the time you unwrap it and install it you come to realize that you have in fact simply purchased yet another database whose purpose is to keep track of all the other databases.

However, the self-referential nature of this process points to one of those uncomfortable truths – information is fundamentally fractal. We organize our documents in words and sentences and paragraphs, each of which provides an implicit assertion about the conceptual breakdown of this content. A paragraph is a narrative thread that indicates that its component sentences assert a point or tell an aspect of a story. An article presents a whole thesis and incorporates a title, publishing information, summary blocks, and increasingly categorical metadata. A chapter is typically a collection of tightly related articles, a book a collection of ordered chapters, each of which also contains bound metadata to answer the dreaded question – “What is this unit of content about?”

Markup is a form of metadata, albeit metadata that, while nominally intended to be read by a human being, exists primarily as a mechanism for helping computers more readily identify these points of abstraction for processing. One form of markup, XML, works reasonably well when imposing a specific semantic layer of interpretation upon a document, but it should be understood that this interpretation is in fact arbitrary – a selection of text can be marked up in a number of different ways depending upon the intent of the person marking the selection up – an etymologist marking up uses of speech is going to have a different markup than a poet or writer (who will likely concentrate upon narrative cohesiveness), while an archivist may be far more interested in historically relevant information within the content.

Similarly, different types of documents and collections of documents also have different levels of abstraction. Most markup is a manually assigned (or at least manually arbitrated) view of a “manageable” document. A “search” is in and of itself a document, albeit one consisting of linked summaries to other documents. In a small, finite space, such search documents can be manually generated, but once the number of documents ceases to be manageable, it falls to computers to run algorithms that identify a specific set of documents based upon user criteria. Typically such search mechanisms employ four parts – a formal internal set of indexes (properly indices, but) that extract some portion of the document and use this to make a key to identify the document, a query function (typically parameterized) that is able to query against those indexes to retrieve a set of document pointers, the parameters themselves (typically entered by the querent), and a transformation process to convert the resulting document abstracts into a sequence of “teasers” or similar report format.
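The four parts described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular engine works: the corpus, the function names (`build_index`, `query`, `teasers`) and the "match every term" rule are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical corpus: document id -> text.
DOCS = {
    "doc1": "MarkLogic combines search with an XML database.",
    "doc2": "Semantic search moves into the realm of formal analysis.",
    "doc3": "An XML database stores hierarchical documents.",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Part 1: an index whose keys (words) point back to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def query(index, terms):
    """Part 2: a parameterized query returning ids of documents containing every term."""
    sets = [index.get(word, set()) for word in tokenize(terms)]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def teasers(docs, doc_ids, width=40):
    """Part 4: transform the document pointers into short 'teaser' lines."""
    return [f"{doc_id}: {docs[doc_id][:width]}" for doc_id in doc_ids]


if __name__ == "__main__":
    index = build_index(DOCS)
    user_terms = "xml database"  # Part 3: the parameters supplied by the querent
    for line in teasers(DOCS, query(index, user_terms)):
        print(line)
```

Note that the list of teasers is itself a small generated document, which is the sense in which a "search" is a document of linked summaries.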

This is the Google paradigm, and is similarly the paradigm that most contemporary search engines employ, in one form or another. However, even this is changing. Computing has a long history of building hardware systems to solve a certain problem and then, as hardware improves, abstracting the hardware pieces over time into software, that is, into code file documents. Once abstracted, such code itself becomes information to be managed. Indeed, increasingly search today involves ascertaining first what kind of search is needed in order to achieve what type of result, which means that search mechanisms are being used either directly by humans or by inference through behaviors to determine which kinds of search need to be performed. Search becomes fractalized.

Even as this search abstraction process is going on, the types of searches (and their effects) are changing as well. Semantic web based searches can be used to reference conceptual blocks of content, but they can also be used for representing other entities abstractly that exist primarily as assertions that they exist. It is possible to craft a query (in this case in SPARQL) that can be used to query these assertions and from them construct a document (possibly with the aid of some other transformational language such as XQuery or XSLT), but it’s worth noting that the document itself – a “bundled” collection of information – does not itself exist in such an environment until it is created, and even once created, may not be the same document from invocation to invocation of the query.

In a number of respects, this is a radical step forward for search, especially as the queries involved are most likely inferential in nature. Put another way, in such a search – the information exists in potentia, but until the query is actually invoked, does not exist within a given file system, database or repository as a clearly defined, human authored document. At the same time, because the query plus its parameters constitutes a URI, such documents are still addressable, even though they didn’t exist until they were created and are effectively dynamic, living documents from call to call.
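A small sketch may help show what a document that exists only when the query is invoked looks like. The code below is plain Python standing in for a triple store and is not SPARQL; the `ex:` identifiers, the `match` helper and the shape of the resulting document are assumptions made for illustration.

```python
# Hypothetical assertions as (subject, predicate, object) triples.
TRIPLES = {
    ("ex:ada", "ex:name", "Ada Lovelace"),
    ("ex:ada", "ex:bornIn", "ex:london"),
    ("ex:turing", "ex:name", "Alan Turing"),
    ("ex:turing", "ex:bornIn", "ex:london"),
    ("ex:london", "ex:name", "London"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def people_born_in(triples, place):
    """Assemble a 'document' on demand from whatever assertions exist right now."""
    names = []
    for person, _, _ in match(triples, p="ex:bornIn", o=place):
        for _, _, name in match(triples, s=person, p="ex:name"):
            names.append(name)
    return {"place": place, "people": sorted(names)}


if __name__ == "__main__":
    # The document is not stored anywhere; it is built when asked for.
    print(people_born_in(TRIPLES, "ex:london"))

    # New assertions arrive, so the same query yields a different document.
    later = TRIPLES | {
        ("ex:babbage", "ex:name", "Charles Babbage"),
        ("ex:babbage", "ex:bornIn", "ex:london"),
    }
    print(people_born_in(later, "ex:london"))
```

The function name plus its `place` argument plays the role the text assigns to the query plus its parameters: a stable address for a result that is regenerated on each call.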

From a search standpoint, what this means is that Semantic “search” actually moves into the realm of formal analysis, rather than just finding resources, and is also becoming more distributed in the process, as the “databases” in question may potentially span large numbers of linked data providers. Additionally, such semantic search also differs from more traditional searches in that the inferences that are developed may in turn be pushed back into the associated data stores, which means that over time such data repositories become more robust through user participation (in effect, filling in the details about a given person, location, work or similar entity).

This is the same pressure that is pushing Hadoop and similar name/value data systems. Fifty years ago, information was expensive – the cost per kilobyte of information could be measured in the tens or even hundreds of dollars, depending upon the information in question. Now the cost has dropped to micro-cents per kB, and the world is literally awash in information, so much so that it transcends the ability of human beings to understand it. Tools such as Hadoop are useful for creating basic abstraction layers on the raw data, relational data tools are useful for qualifying that data, XML databases are increasingly becoming core for working with that data higher in the abstraction stack, while Semantic tools provide the ability to create relationships between different core nodes of that data.

Additionally, because of latency issues and the fact that the more recent generation of data stores provide evolving information, information management becomes increasingly stochastic and asynchronous. The reliability of gathered data becomes significant, and queries will increasingly have to trade off immediacy for fidelity. When a search is made, the result is not necessarily the best fit, but only the best fit right now while data is being gathered and processed. Databases are no longer contained in a single box in the data center, but may be in data centers in Rio de Janeiro, Berlin, Beijing and Sydney – and may not even necessarily be under one company’s provenance.

In effect, data search systems are becoming quantum in nature – the actual data space exists as a wave equation, but your observation of it (through search query mechanisms) forces that wave equation to collapse to a particular state. Curiously enough, there are indications that this is more or less the same mechanism that occurs in the brain when remembering things, that memories are in effect standing waves that collapse into an internal “vision” due to the filtering effect of human perception.

This is one of the reasons why it’s perhaps time to rethink our definition of search. At the enterprise level, the goal increasingly is not to find what has already been created but rather to turn what has been created into a knowledge base to better understand what’s coming down the road. In a way, this isn’t surprising – while what has been developed in the past may hold interest to the archivist or curator (or auditor), the past doesn’t necessarily contain deep insights about the future. However, as the torrent of information becomes a deluge, enterprise search systems need to be able to glean from what’s coming in today what will likely be coming in tomorrow. Enterprise search becomes business intelligence. That’s the way forward.

Taxonomies, Content Management and Governance

Friday, July 8th, 2011

Good governance is on everyone’s minds these days.  It’s a concern that extends well beyond the Washington Beltway.  As applied to managing your enterprise content, including taxonomies, it is not just an abstraction.

Good governance drives the overall performance of your content program, including:

  • How easy it is for users to find information
  • How users look for information
  • How users store and retrieve information
  • How to clean up redundant content
  • What metadata is available
  • What templates are used

The need for a well-planned and well-run governance program will only increase.  The growth of unstructured information, demands for greater efficiency and cost savings, and privacy concerns are all motivations.

Are you wondering how to set up a governance program?  Are you questioning whether your existing content governance is right?  Avalon and our partner PPC are sponsoring a free webinar series that will help you Cultivate Content Management Success through Planned, Managed, and Implemented Taxonomies. For more information and to register, click here.

How to explain MarkLogic to a business user

Monday, June 13th, 2011

It is no secret we here at Avalon are enamored with MarkLogic technology. Our consultants have regular discussions that involve topics like the best way to use Java code with XQuery or how to integrate HTML5 with WebSockets to create a multi-publisher capability for MarkLogic (and no, we do not use pocket-protectors or wear hats with propellers). Now I understand these are important topics that yield very cool applications but they don’t really resonate with a business user. The typical business user (IMHO) who is being introduced to MarkLogic sometimes has a hard time wrapping his or her head around what the heck it is. When I encounter this confusion I point them to a simple analogy:

I do love old-school SNL.

So how is this analogous to MarkLogic?… very simple. Business users typically understand technology on a 1 to 1 basis. They understand that the search engine is used for searching documents and the content management system is used to change content on their web site and the database is used to store, well… data. MarkLogic simply does not fit the 1 to 1 model in the way most business users have been trained to understand technology; it is a “disruptive technology”. MarkLogic is really a platform to build countless applications to leverage any unstructured content. So what does that mean? Think about all the content/”stuff” you have that is valuable but would not naturally be a fit to be managed in a spreadsheet/database (e.g. it would probably not make sense to put your meeting notes, videos, mp3s, family photos, or this blog into a spreadsheet/database). So let’s take a look at some practical MarkLogic use case examples:

Publishing - This is clearly MarkLogic’s sweet spot. After all… who has more unstructured content than a publisher? Now publishers not only have a good way to store and manage their books/magazines/journals/etc. but they can now easily create content “mashups”. What is a content mashup? Think of a student being able to buy individual chapters (or paragraphs for that matter) across multiple books instead of wasting money on content that he/she doesn’t need.

The “S” word - If you go to the MarkLogic website, you will not (at least at the time of this blog post) see Search as one of the categories under their solutions tab. This is really too bad as MarkLogic is an extremely powerful search engine. For instance, we were engaged by a large Association recently that was already using MarkLogic for publishing. This Association realized the power of MarkLogic’s search capabilities and asked us to develop a roadmap for replacing their Lucene/Solr search implementation with MarkLogic (I called it Lucene bypass surgery). They not only saw the value of using MarkLogic for search but how they could reduce costs from collapsing a layer of infrastructure, reducing support and training costs, and eliminating risk from an overly complex system. If MarkLogic ever takes on the other search vendors head to head – watch out Endeca, Autonomy, Fast, etc., etc.

Web Content Management (WCM) - WCM on MarkLogic is simply a natural extension of how to leverage your content and software investment. Avalon has been working with MarkLogic on developing a simple WCM interface to abstract all of the technical mumbo jumbo and put a straight-forward WYSIWYG interface to manage a web site with content stored in MarkLogic. More info here: http://avalonconsult.com/solutions/tools/wcm_for_marklogic

If I told you I would have to kill you - Life would be much simpler for our security agencies if al-Qaeda and the Mexican drug cartels would establish data centers; we could just hack into them and know what they were up to. It seems the bad guys tend to shy away from structured data (can’t imagine why). Now I don’t claim to know the exact use case(s) of how the US “three letter agencies” use MarkLogic but it is obviously valuable to be able to manage and analyze a ton of unstructured “intel”.

This is just a small sample of the applications MarkLogic can power. Geospatial, mobile, and metadata applications are just a few others that deserve attention for a MarkLogic solution.

So for all you business users out there: don’t stress when someone in your organization comes up to you and says “I have an idea for a product that will serve our (insert your unstructured content need here) and it might also work as a (insert your other unstructured content need here).” MarkLogic is the “New Shimmer” of technology… it simply works well for multiple applications.