Massive Performance Gains and Cost Savings using Hadoop

November 22nd, 2011 by Ryan Merriman

Tony Jewitt, our VP of Big Data, challenged me to impress him with an example of the power of emerging technology in solving real world Big Data problems.  I needed a large dataset with a complicated processing requirement to do that and decided to extract witness information from Senate and House hearings for the past 30 years.  I set about attempting to make some sense out of this intimidating data set by loading it into our favorite NoSQL database and presenting it in a way that would let people find meaningful information effectively.

For the purposes of my test I chose to try and expose an interface where a user can filter Senate, House and joint hearings based on who participated as a witness in the hearing.  Conceptually, this is no problem for MarkLogic — except that there was no witness meta-data attached to the docs.  So I knew I would have to put in a little work and find it ourselves.  How?  By leveraging the GATE processing framework to extract entities.  I developed a utility that allowed me to pass in a hearing document (xml format) and get back a witness-enriched xml document.

The only problem is that it would have taken FOREVER to do this across all the documents.  There were hundreds of thousands of hearings to process.  Given that it takes a little over an hour to process just 15,000 documents, that would have been way too slow in processing the entire body of data especially considering that I knew I would need to repeatedly run the process on large data sets to improve the quality and perfect the algorithm.

So at that point I had to think about what my options were for moving forward.  I could have tried to get a bigger machine — but there is a limit to how many transistors can fit on a board and it becomes less cost effective as we scale vertically.  That led me to the eventual solution — what about setting up a cluster of commodity machines and distributing the work among those?  Scaling this way (horizontally) was attractive because the cost increases linearly as we scale.  Now I was on to something, namely Hadoop.

Next I had to solve the issue of where to get the servers (without creating panic in IT or procurement).  Hadoop is designed to run on a cluster of commodity servers but purchasing them is not a trivial expense. Commodity does not mean cheap in this case.  So leveraging Amazon Web Services (AWS) for my experiment made sense because I could rent them on an as-needed basis and have flexibility to choose the hardware that best fit this particular application. GATE is CPU-intensive so I chose to rent a server designed for that type of operation.  AWS even offers a Hadoop-specific service called Elastic MapReduce which makes things even easier for a developer type like me.  This saved me from any upfront hardware cost and kept me from having to install Hadoop on every node.  And I knew my finance department would be alright with that too as long as I remembered to turn the servers off when I was done.

AWS adds some time to the learning curve but I didn’t find it to be that bad.  Managing the mapreduce jobs through the API is key because the web interface is extremely limited.  The only value I saw is the ability to monitor the status of job flows and terminate them.  You can’t add nodes and you can’t resubmit jobs.  So I decided to use the Ruby command line interface and it works great.

The next step was to modify our GATE process to fit into the Hadoop mapreduce pattern.  GATE works on whole xml documents so each record passed to the map needed to be a whole xml doc as well.  Hadoop works more efficiently against a small number of larger files rather than a large number of small files.  A SequenceFile should do the trick — so I wrote a process to package the hearings files up into that. Hadoop can be a little intimidating because of it’s complexity but I found the beginning tutorials very easy to follow.  I followed the single node setup tutorial and had a Hadoop development environment up in no time.  I loved the ability to run a Hadoop process in a single jvm (referred to as “Standalone Operation” in the documentation).  This really unleashes the power of an IDE by allowing you to inspect objects in memory.  Even when I ran into limitations because of GATE’s voracious memory appetite, I was able to quickly run it in “Pseudo-Distributed Operation” (a local cluster of separate jvm processes) and keep development moving.  All that was required was a simple command line call.

I deployed my assets to AWS by moving the SequenceFile and MapReduce code to S3.  Next I wrote some scripts to start up a cluster, provision more servers and run our GATE MapReduce process.  I started off with a couple nodes to start to ensure the AWS port was working and to create a performance baseline and started out with the aforementioned 15,000 documents.  When running the GATE process locally, without Hadoop, I found it took around 1 hour and 15 minutes to run on a capable laptop. It took about the same on our small 2 node Hadoop cluster.  There is some overhead associated with Hadoop so this was not a surprise. After I scaled our cluster up to 20 nodes and tuned it, the total time came down to around 4 and a ½ minutes (!).  I think it is safe to say that many businesses would be pretty pleased with that performance improvement.

The punchline to all this is that renting 20 AWS servers costs around $4.  Rentals are billed in hour increments so we could process even more documents for the same cost.  Our data set is small by Big Data standards but can easily scale.  All we need to do is rent more servers.

We are pretty excited about these results.  Stay tuned for the next post where we will integrate this Hadoop-based GATE process with a MarkLogic database.

Content Management — There’s an app for that!!

October 19th, 2011 by Joe Hilger

I have been working in Content Management for over 10 years and been involved in over 30 content management projects.  Sadly, I have never found a customer that was truly happy with their Content Management System (CMS).  Most customers figure out how to work with the system they have and adjust their processes accordingly.  Content Management should not be that painful.

There is a better way to manage content.  Why not replace your content management systems with a series of small, purpose-built applications?

  • Need to add a new product to your website… There’s an app for that.
  • Need to create a new marketing web page on your site using a word document… There’s an app for that.
  • Need to convert your web page to Spanish… There’s an app for that.
  • Need to create a new promotion through your website and e-mail… There’s an app for that as well.

Many of the content management problems I have encountered could be solved with a series of simple applications built on a common content platform.  One client we are working with is doing this today with great success.  It is much less expensive than you think.

The key to a successful content application model is finding a platform that allows for rapid content application development.  The platform should serve as a central repository for all content and it should have built in features like versioning, metadata management and publishing.

With the proper platform in place, applications can be developed quickly.  IT works with each department or content contributor to understand their specific needs and process flow.  Applications can be built in weeks (rather than months) because the platform allows it.  As additional applications are built, new features are added to the system that can be shared with other departments.  The result is a delegated content management process that empowers the content owners.

The client that is having success building content apps uses MarkLogic as their content platform.  They are meeting their content contributors needs and developing apps at a fraction of the cost of their previous large content management implementation.

I would love to hear more about how others are building content apps.  Please share your stories as comments.  If you want to learn more about how our clients are building these Content Management Apps send an e-mail to  info@avalonconsult.com.

Is MarkLogic a Search Engine?

September 26th, 2011 by Joe Hilger

I am frequently asked if MarkLogic is really a search engine.  It is easy to debate whether MarkLogic fits the classic definition of a search engine.  In my opinion, this is the wrong question.  The question you should be asking is “Does MarkLogic enable great search experiences?”  The answer is undeniably Yes.

MarkLogic comes with all of the standard search capabilities like: keyword search, synonyms, fuzzy search, hit highlighting, sorting, faceted navigation and relevance.  These are the basic features that every search engine should have.   MarkLogic checks the box on every one of these and more.

The fact that MarkLogic can do all of the basics makes it just like all of the other search engines on the market.  What sets MarkLogic apart is that it is not just a search engine.  MarkLogic combines some of the best features of search with a fast performing XML database.  This combination allows MarkLogic to offer features that traditional search engines lack.  Four of the most important differentiators are:

  • multi-level searching,
  • editable search results,
  • schema flexibility,
  • and simplified architectures.

MarkLogic allows for multi-level searching.  Most search engines require you to flatten out the data for search results.  MarkLogic is an XML database.  As a result, information can be stored in a hierarchical format and queried at multiple levels.  This is particularly important in more complex search experiences.  For example, if you are searching large documents, you may want to show the documents that contain your search term along with the sections of the documents that have that term.  Normal search engines would require you to create multiple collections or a complex search screen.  MarkLogic handles these situations naturally.

MarkLogic’s database features allow you to create applications with editable search results.  Our architects call it a “Live” search tool as opposed to a “read only” search tool.  Traditional search engines are designed to be read only.  Edits to existing search data require re-indexing.  Solution providers like Avalon create special indexing routines to allow for updates to content.  These solutions are not real-time and they are not simple.  Fields can be updated or added to a MarkLogic database at any time, transactionally, with full ACID protection.  This flexibility allows us to create a number of really interesting search applications that would have been much more difficult with standard search engines.  For example, we have created tools that allow end-users or administrators to “tag” one or more search results (similar to the functionality in Flickr).  In other applications, we have created search screens where the users can edit the search results without leaving the screen.  Adding these cool features to our search applications is much easier with a combined database and search engine.

As an XML database, MarkLogic provides schema flexibility for storing and querying information.  Our developers and our clients love MarkLogic because it is easy to add new fields to the index.  Traditional search engines typically require administrators to delete and reload the data in order to add specific fields.  In extreme cases you have to re-index an entire data set.  MarkLogic’s schema flexibility becomes even more important when you are working with techniques like entity extraction.  Text Analytics tools can identify people, places and things within unstructured text.  Through this process our clients often find interesting things they want to include in their search applications.  MarkLogic makes it easy to run text analytics against unstructured documents and include the entities in the search results.  Traditional search engines add a great deal of complexity to the process and do not allow for changing structures.

Our architects like MarkLogic because of its simplified architecture.  The next time you meet with your search engine vendor, ask them for a physical architecture diagram of one of their larger implementations.  At a minimum you will have a database or file system to store documents and data, a search indexer, a search server, and a web server.  Large data sets get even more complicated.  Search results have to be clustered and replicated.  You will need multiple indexers and search servers running.  You will also likely need more than one web server and application server for your front end application.  MarkLogic is a database server, search engine and applications server in one tool.  It also has built in replication.  This means fewer servers and less complexity in your dev, test and prod environments.

One final reason to use MarkLogic to power your search applications is that MarkLogic is not just a search engine.  Traditional search engines are very powerful, but they are expensive and limited to search-based use cases.

  • Want to publish thousands of documents to your website or mobile devices.  Some of the largest publishers in the world use MarkLogic to do this every day.
  • Want to build an application that allows users to build reports on the fly by combining sections from other documents.  Those same publishers use MarkLogic offer custom publishing solutions.
  • Want to create a central repository tracking all of your digital assets.  We are working with three different customers using MarkLogic as a central repository across all of their content management systems.
  • Do you need a tool to capture unstructured information for your Big Data solution.  MarkLogic does this for numerous government customers.

At the end of the day, when your management asks you how much you spent on your search solution, it is nice to say that the tool you bought does more than just search.

In fairness, MarkLogic may not be the best solution for an organization that is looking to build a vanilla search intranet that indexes content from numerous secure repositories.   Search engines like Endeca, Autonomy, Vivisimo and Lucene/Solr were designed for these types of solutions.  If, however, you need to build a powerful search application that will change over time, MarkLogic is a great choice.  It offers many valuable features that are not available in any other search engine.

On the Semantics of Search

July 19th, 2011 by Kurt Cagle

I’ve always been taken by the term Information Management. As with so many phrases in the computer lexicon, this is one that has become both very specialized – focusing primarily upon the various and sundry database applications that a given organization uses – and rather vague. Vendors seize upon this vagueness by claiming that their particular database or content management system or network dashboard will of course automate away all those messy information management issues, though by the time you unwrap it and install it you come to realize that you have in fact simply purchased yet another database whose purpose is to keep track of all the other databases.

However, the self-referential nature of this process points to one of those uncomfortable truths – information is fundamentally fractal. We organize our documents in words and sentences and paragraphs, each of which provides an implicit assertion about the conceptual breakdown of this content. A paragraph is a narrative thread that indicates that its component sentences assert a point or tell an aspect of a story. Articles present a whole thesis, and incorporates a title, publishing information, summary blocks, and increasingly categorical metadata. A chapter is typically a collection of tightly related articles, a book a collection of ordered chapters, each of which also containing bound metadata to answer the dreaded question – “What is this unit of content about?

Markup is a form of metadata, albeit metadata that, while nominally intended to be read by a human being, exists primarily as a mechanism for helping computers more readily identify these points of abstraction for processing. One form of markup, XML, works reasonably well when imposing a specific semantic layer of interpretation upon a document, but it should be understood that this interpretation is in fact arbitrary – a selection of text can be marked up in a number of different ways depending upon the intent of the person marking the selection up – an etymologist marking up uses of speech is going to have a different markup than a poet or writer (who will likely concentrate upon narrative cohesiveness), while an archivist may be far more interested in historically relevant information within the content.

Similarly, different types of documents and collections of documents also have different levels of abstraction. Most markup is a manually assigned (or at least manually arbitrated) view of a “manageable” document. A “search” is in of itself a document, albeit one consisting of linked summaries to other documents. In a small, finite space, such search documents can be manually generated, but once the number of documents cease becoming manageable, then it falls to computers to create algorithms that identify a specific set of documents based upon user criteria. Typically such search mechanism employ four parts – a formal internal set of indexes (properly indices, but) that extract some portion of the document and use this to make a key to identify the document, a query function (typically parameterized) that is able to query against those indexes to retrieve a set of document pointers, the parameters themselves (typically entered by the querant), and a transformation process to convert the resulting document abstracts into a sequence of “teasers” or similar report format.

This is the Google paradigm, and is similarly the paradigm that most contemporary search engines employ, in one form or another. However, even this is changing. Computing has long had a history of establishing hardware systems to be able to solve a certain problem then to abstract the hardware pieces over time into software, code file documents, as hardware improves. Once abstracted, such code itself becomes information to be managed. Indeed, increasingly search today involves ascertaining first what kind of search is needed in order to achieve what type of result, which means that search mechanisms are being used either directly by humans or by inference through behaviors to determine which kinds of search needs to be performed. Search becomes fractalized.

Even as this search abstraction process is going on, so too are the types of searches (and their effects). Semantic web based searches can be used to reference conceptual blocks of content, but they can also be used for representing other entities abstractly that exist primarily as assertions that they exists. It is possible to craft a query (in this case in SPARQL) that can be used to query these assertions and from them construct a document (possibly with the aid of some other transformational language such as XQuery or XSLT), but it’s worth noting that the document itself – a “bundled” collection of information – does not itself exist in such an environment until it is created, and even once created, may not be the same document from invocation to invocation of the query.

In a number of respects, this is a radical step forward for search, especially as the queries involved are most likely inferential in nature. Put another way, in such a search – the information exists in potentia, but until the query is actually invoked, does not exist within a given file system, database or repository as a clearly defined, human authored document. At the same time, because the query plus its parameters constitutes a URI, such documents are still addressable, even though they didn’t exist until they were created and are effectively dynamic, living documents from call to call.

From a search standpoint, what this means is that Semantic “search” actually moves into the realm of formal analysis, rather than just finding resources, and is also becoming more distributed in the process, as the “databases” in question may potentially span large numbers of linked data providers. Additionally, such semantic search also differs from more traditional searches in that the inferences that are developed may in turn be pushed back into the associated data stores, which means that over time such data repositories become more robust through user participation (in effect, filling in the details about a given person, location, work or similar entity).

This is the same pressure that is pushing Hadoop and similar name/value data systems. Fifty years ago, information was expensive – the cost per kilobyte of information could be measured in the tens or even hundreds of dollars, depending upon the information in question. Now the cost has dropped to micro-cents per kB, and the world is literally awash in information, so much so that it transcends the ability of human beings to understand it. Tools such as Hadoop are useful for creating basic abstraction layers on the raw data, relational data tools are useful for qualifying that data, XML databases are increasingly becoming core for working with thtat data higher in the abstraction stack, while Semantic tools provide the ability to create relationships between different core nodes of that data.

Additionally, because of latency issues and the fact that the more recent generation of data stores provide evolving information, information management becomes increasingly stochastic and asynchronous. The reliability of gathered data becomes significant, and queries will increasingly have to trade off immediacy for fidelity. When a search is made, the result is not necessarily the best fit, but only the best fit right now while data is being gathered and processed. Databases are no longer contained in a single box in the data center, but may be in data centers in Rio de Janeiro, Berlin, Beijing and Sydney – and may not even necessarily be under one company’s provenance.

In effect, data search systems are becoming quantum in nature – the actual data space exists as a wave equation, but your observation of it (through search query mechanisms) forces that wave equation to collapse to a particular state. Curiously enough, there are indications that this is more or less the same mechanisms that occur in the brain when remembering things, that memories are in effect standing waves that collapse into an internal “vision” due to the filtering effect of human perception.

This is one of the reasons why it’s perhaps time to rethink our definition of search. At the enterprise level, the goal increasingly is not to find what has already been created but rather to turn what has been created into a knowledge base to better understand what’s coming down the road. In a way, this isn’t surprising – while what has been developed in the past may hold interest to the archivist or curator (or auditor), the past doesn’t necessarily contain deep insights about the future. However, as the torrent of information becomes a deluge, enterprise search systems need to be able to gleam from what’s coming in today what will likely be coming in tomorrow. Enterprise search becomes business intelligence. That’s the way forward.