Posts Tagged ‘marklogic’

Is MarkLogic a Search Engine?

Monday, September 26th, 2011

I am frequently asked if MarkLogic is really a search engine.  It is easy to debate whether MarkLogic fits the classic definition of a search engine.  In my opinion, this is the wrong question.  The question you should be asking is “Does MarkLogic enable great search experiences?”  The answer is undeniably Yes.

MarkLogic comes with all of the standard search capabilities: keyword search, synonyms, fuzzy search, hit highlighting, sorting, faceted navigation and relevance ranking.  These are the basic features that every search engine should have.  MarkLogic checks the box on every one of these and more.

The fact that MarkLogic can do all of the basics makes it just like all of the other search engines on the market.  What sets MarkLogic apart is that it is not just a search engine.  MarkLogic combines some of the best features of search with a fast-performing XML database.  This combination allows MarkLogic to offer features that traditional search engines lack.  Four of the most important differentiators are:

  • multi-level searching,
  • editable search results,
  • schema flexibility,
  • and simplified architectures.

MarkLogic allows for multi-level searching.  Most search engines require you to flatten out the data for search results.  MarkLogic is an XML database.  As a result, information can be stored in a hierarchical format and queried at multiple levels.  This is particularly important in more complex search experiences.  For example, if you are searching large documents, you may want to show the documents that contain your search term along with the sections of the documents that have that term.  Normal search engines would require you to create multiple collections or a complex search screen.  MarkLogic handles these situations naturally.
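
To make the idea of querying at more than one level concrete, here is a minimal sketch in plain Python (not MarkLogic code).  It assumes a hypothetical document with `section` children and returns both the matching document and the sections inside it that contain the term.  The element names, sample text and the `multi_level_search` function are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative only.
DOC = """
<document title="Annual Report">
  <section title="Overview"><para>Revenue grew this year.</para></section>
  <section title="Risks"><para>Currency exposure may reduce revenue.</para></section>
  <section title="Staff"><para>Headcount was stable.</para></section>
</document>
"""


def multi_level_search(xml_text, term):
    """Return (document title, [titles of sections containing term]) or None."""
    root = ET.fromstring(xml_text)
    needle = term.lower()
    hits = [
        section.get("title")
        for section in root.iter("section")
        if needle in "".join(section.itertext()).lower()
    ]
    # The document is a hit when at least one of its sections is a hit.
    return (root.get("title"), hits) if hits else None


print(multi_level_search(DOC, "revenue"))
```

Because the hierarchy is kept intact, the same stored document answers both "which documents match?" and "which sections match?" without a second, flattened copy of the data.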

MarkLogic’s database features allow you to create applications with editable search results.  Our architects call it a “Live” search tool as opposed to a “read only” search tool.  Traditional search engines are designed to be read only.  Edits to existing search data require re-indexing.  Solution providers like Avalon create special indexing routines to allow for updates to content.  These solutions are not real-time and they are not simple.  Fields can be updated or added to a MarkLogic database at any time, transactionally, with full ACID protection.  This flexibility allows us to create a number of really interesting search applications that would have been much more difficult with standard search engines.  For example, we have created tools that allow end-users or administrators to “tag” one or more search results (similar to the functionality in Flickr).  In other applications, we have created search screens where the users can edit the search results without leaving the screen.  Adding these cool features to our search applications is much easier with a combined database and search engine.

As an XML database, MarkLogic provides schema flexibility for storing and querying information.  Our developers and our clients love MarkLogic because it is easy to add new fields to the index.  Traditional search engines typically require administrators to delete and reload the data in order to add specific fields.  In extreme cases you have to re-index an entire data set.  MarkLogic’s schema flexibility becomes even more important when you are working with techniques like entity extraction.  Text Analytics tools can identify people, places and things within unstructured text.  Through this process our clients often find interesting things they want to include in their search applications.  MarkLogic makes it easy to run text analytics against unstructured documents and include the entities in the search results.  Traditional search engines add a great deal of complexity to the process and do not allow for changing structures.

Our architects like MarkLogic because of its simplified architecture.  The next time you meet with your search engine vendor, ask them for a physical architecture diagram of one of their larger implementations.  At a minimum you will have a database or file system to store documents and data, a search indexer, a search server, and a web server.  Large data sets get even more complicated.  Search results have to be clustered and replicated.  You will need multiple indexers and search servers running.  You will also likely need more than one web server and application server for your front end application.  MarkLogic is a database server, search engine and application server in one tool.  It also has built-in replication.  This means fewer servers and less complexity in your dev, test and prod environments.

One final reason to use MarkLogic to power your search applications is that MarkLogic is not just a search engine.  Traditional search engines are very powerful, but they are expensive and limited to search-based use cases.

  • Want to publish thousands of documents to your website or mobile devices?  Some of the largest publishers in the world use MarkLogic to do this every day.
  • Want to build an application that allows users to build reports on the fly by combining sections from other documents?  Those same publishers use MarkLogic to offer custom publishing solutions.
  • Want to create a central repository tracking all of your digital assets?  We are working with three different customers using MarkLogic as a central repository across all of their content management systems.
  • Need a tool to capture unstructured information for your Big Data solution?  MarkLogic does this for numerous government customers.

At the end of the day, when your management asks you how much you spent on your search solution, it is nice to say that the tool you bought does more than just search.

In fairness, MarkLogic may not be the best solution for an organization that is looking to build a vanilla search intranet that indexes content from numerous secure repositories.   Search engines like Endeca, Autonomy, Vivisimo and Lucene/Solr were designed for these types of solutions.  If, however, you need to build a powerful search application that will change over time, MarkLogic is a great choice.  It offers many valuable features that are not available in any other search engine.

Elegant “contentEditable” Solution for XML

Wednesday, April 13th, 2011

If you’ve ever wanted to do WYSIWYG editing of XML in a browser, I think you’re going to like the elegant solution I stumbled across.  The idea grew out of reading Kurt Cagle’s excitement around XQuery in the Browser and my desire to create an MLUC DemoJam entry that could excite publishers about my latest passion, HTML5.  At first I thought XQuery in the Browser opened a new possibility to do simple browser-based WYSIWYG editing, but as I dug into it I found it was much simpler to do the XSLT in MarkLogic, with granular pointers attached to each editable XML node.  This solution allows me to very simply and efficiently:

  • Render XML to final HTML using existing XSLT
    • only minor modifications are required to add “sourcePath” attributes to the HTML
    • MarkLogic’s xdmp:path function makes it simple to get precise paths to the source XML content
  • Allow WYSIWYG editing of the XML directly from the browser
    • HTML5’s contentEditable attribute makes this simple
  • Use very efficient AJAX calls for immediate update to the source XML document
    • MarkLogic’s xdmp:node-replace allows pin-point updates of only the changed node
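
The round trip described in the list above can be sketched in plain Python.  This is not the demo’s code: the standard-library ElementTree stands in for both the XSLT rendering and the server-side update, and the path format, the sample document and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document.
SOURCE = "<article><title>Old title</title><body><p>First</p><p>Second</p></body></article>"


def render_with_paths(elem, path):
    """Render elem to HTML, giving each leaf an editable span with a sourcePath."""
    children = list(elem)
    if not children:
        span = ET.Element("span", {"contenteditable": "true", "sourcePath": path})
        span.text = elem.text
        return span
    div = ET.Element("div")
    counts = {}
    for child in children:
        # Positional step such as p[2], so every node gets a unique path.
        counts[child.tag] = counts.get(child.tag, 0) + 1
        step = f"{child.tag}[{counts[child.tag]}]"
        div.append(render_with_paths(child, f"{path}/{step}"))
    return div


def replace_text(root, source_path, new_text):
    """Update only the node addressed by source_path (stand-in for a pin-point update)."""
    # Drop the leading "/<root element>" step; the rest is relative to root.
    relative = source_path.split("/", 2)[2]
    root.find(relative).text = new_text


root = ET.fromstring(SOURCE)
html = ET.tostring(render_with_paths(root, "/article"), encoding="unicode")
print(html)

# The browser would send back the sourcePath of the edited span and its new text.
replace_text(root, "/article/body[1]/p[2]", "Edited")
print(ET.tostring(root, encoding="unicode"))
```

The point of carrying the path in the HTML is that the AJAX call only needs two small values, the path and the new text, and the server only touches the one node that changed.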

All this in less than 100 lines of code!  The video below does better justice to what I’m talking about:

Please note that I’m not trying to demo a full-featured editor here.  This is just a proof of concept.  Obviously many features need to be added before this is usable.  Nevertheless, this simple demo shows an approach that could enable many highly usable solutions for publishers.

Auto-Complete in MarkLogic Server

Wednesday, March 30th, 2011

It Must Be Insanely Fast and Meet the Requirements

Auto-complete is hard to optimize because:

  1. we’re searching on partial words, but search engines are not optimized for that
  2. queries happen on each keystroke, so expect five to ten times your normal search query traffic in auto-complete queries
  3. if response times are slower than 100 milliseconds, users are on to the next keystroke before you show your suggestions

Another complication is that each implementation has different requirements for defining and matching the set of possible values for auto-complete suggestions.  For this article I’m assuming the most common requirements: given a defined set of possible values, we want to present as auto-complete suggestions all values which contain any word or phrase that starts with the text being entered by the user, while matching case-insensitively.

Until now, I had been using custom Java to achieve scalable auto-complete that can match any word in the set of possible values, but I’ve now found a way to do it in MarkLogic server.  But before I explain the new approach, let’s quickly review the solutions I tried which didn’t work.

Basic Element Range Indexes Don’t Meet the Requirements

Using MarkLogic Server, standard Element Range Indexes and/or search:suggest are powerful options for auto-complete, as mentioned in my previous blog post on the topic: Scalable Search Auto-Complete.  But they only meet a narrow set of use cases.  The main limitation is they can only do a “starts-with” match against the entire value.  They don’t support starts-with against any word in the set of possible values, which is a more common requirement.  So if a user starts typing “ban” they will get matches for “Bank of America” but not “Wells Fargo Bank”.

auto-complete mockup 1

XQuery example 1, “basic element-value match”, is fast but only meets a narrow set of use cases:

cts:element-value-match(xs:QName("COMPANY"),"ban*", "limit=50")
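
To illustrate why a starts-with match against whole values can be so fast, and why it misses “Wells Fargo Bank”, here is a small Python sketch of a prefix lookup over a sorted list of values.  It is only an analogy for a sorted value index, not a description of MarkLogic’s internals.  The company list and the `prefix_matches` function are made up for this example, and lower-casing the keys stands in for a case-insensitive collation.

```python
from bisect import bisect_left


def prefix_matches(sorted_keys, prefix, limit=50):
    """Return up to `limit` keys that start with prefix; keys must be sorted."""
    # Binary search jumps straight to the first key >= prefix.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:start + limit]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        matches.append(key)
    return matches


companies = ["Wells Fargo Bank", "Bank of America", "Banner Health"]
keys = sorted(name.lower() for name in companies)
print(prefix_matches(keys, "ban"))
```

All matches sit next to each other in sorted order, so the lookup never scans the whole list.  The same property is what hides “Wells Fargo Bank”: its key starts with “wells”, so it is nowhere near the “ban” range.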

Wild-Cards aren’t Fast Enough

Wild-cards present a solution, but, as in all technologies, wild-cards are one of the slowest query types.  While wild-card searching in MarkLogic Server is very fast, I have not been able to tune it to be scalable enough for auto-complete.  I’ve used 1, 2, and 3-character indexes to enable fast matching using wild-cards even on the first keystrokes.  Under load I’d like at least 90% of queries below 100 ms to meet requirement 3 above, but I’m seeing wild-card queries take as much as 1 second at the 90% line.  I’ve also tried wild-card queries in fields and elements, with very similar results.

XQuery example 2, “wild-card search”, is not fast enough:

(cts:search(/DOCUMENT/COMPANY, "ban*")/text())[1 to 50]

fn:contains isn’t Fast Enough

XPath has fn:contains which matches any sub-string, thus coming close to the required functionality, but I didn’t expect high performance because it has no index to optimize the query–it essentially conducts a full-text scan.  While I was impressed that MarkLogic’s implementation of fn:contains in vanilla XPath performs even better than wild-card queries, the need to be case-insensitive adds a serious performance hit, making this the slowest query of the bunch.

XQuery example 3, “contains”, is not fast enough:

(cts:element-values(xs:QName("COMPANY"))[contains(lower-case(.),"ban")])[1 to 50]

Expanded Word Queries are Very Slow Under Load

I tried some very creative solutions, including wild-card matching of all possible words in the text, then using the words for a word query.  I expected this to be highly optimized because word queries use the search index and are very fast.  Either I did something wrong, or sending so many words in one “or” query slows things down.  Under load, these queries were even slower than contains queries.

XQuery example 3, “expanded words query”, is not fast enough:

let $words := cts:element-word-match(xs:QName("COMPANY"),"ban*")
let $wordQuery := cts:or-query((for $word in $words return $word))
return (cts:search(/DOCUMENT/COMPANY, $wordQuery))[1 to 50]

Chunked Element Range Indexes Are the Answer

Luckily, I landed on a solution that works wonderfully.  It allows matching against any word in the values, and performs very quickly: under 10 milliseconds with no load, and under 150 milliseconds with outrageously high load (100 concurrent threads in JMeter).  I create what I’ll call a “chunked” Element Range Index which takes advantage of the index’s optimization for “starts-with” queries by creating a new value for each word in each value.  In order to associate the partial “chunked” values to the full value, I just append the full value after a colon.  So for “Wells Fargo Bank” I would create three values: “Wells Fargo Bank:Wells Fargo Bank”, “Fargo Bank:Wells Fargo Bank”, and “Bank:Wells Fargo Bank”.  This way when a user types “ban” and I search for values starting with “ban”, one of my matches will be “Bank:Wells Fargo Bank” and I will then strip everything before the colon and present the user with the match of “Wells Fargo Bank”.

auto-complete mockup 1

XQuery example 4, creates “chunked” values on which we can create a “chunked Element Range Index”:

(: This differs from tokenize because we want the whole trailing
 : string with each token :)
declare function local:chunkOnNonWordChars($value as xs:string, $fullValue as xs:string) {
  (: only shrink the value if it contains non-word character(s)
   : between word characters :)
  if ( matches($value, "^\w+\W+\w") ) then (
    (: remove the first word and following non-word characters :)
    let $smallerValue := replace($value, "^\w+\W+", "")
    return (
      concat($value, ":", $fullValue),
      local:chunkOnNonWordChars($smallerValue, $fullValue)
    )
  ) else (
    concat($value, ":", $fullValue)
  )
};

xdmp:document-insert("autocomplete-seed.xml", <xml>{
  for $value in cts:element-values(xs:QName("COMPANY"))
  return
    for $chunk in local:chunkOnNonWordChars($value, $value)
    return <AUTOCOMPLETE_COMPANY>{ $chunk }</AUTOCOMPLETE_COMPANY>
}</xml>)
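
For readers who want to try the chunking logic without a MarkLogic server, here is a rough Python port of the function above.  It is a sketch under the assumption that Python’s `\w` and `\W` classes behave closely enough to the XQuery ones for typical company names.  The function name is just a Python-style rendering of the original.

```python
import re


def chunk_on_non_word_chars(value, full_value=None):
    """Return one "chunk:full value" string per word-start suffix of value."""
    full_value = value if full_value is None else full_value
    chunks = []
    while True:
        chunks.append(f"{value}:{full_value}")
        # Stop when there is no further word after the first one.
        if not re.match(r"^\w+\W+\w", value):
            return chunks
        # Remove the first word and the non-word characters that follow it.
        value = re.sub(r"^\w+\W+", "", value, count=1)


for chunk in chunk_on_non_word_chars("Wells Fargo Bank"):
    print(chunk)
```

As in the XQuery version, a value that begins with a non-word character (an opening parenthesis, for example) yields a single chunk, because the pattern requires the value to start with a word.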

XQuery example 5, “chunked element-value match”, is fast and meets the requirement to match any word from the values:

let $matches := cts:element-value-match(xs:QName("AUTOCOMPLETE_COMPANY"), "ban*", "limit=50")
let $clean_matches := for $chunk in $matches return replace($chunk, ".*:", "")
return distinct-values($clean_matches)
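
The lookup side can be sketched the same way in Python.  Here a sorted list of chunked strings stands in for the chunked Element Range Index, lower-cased keys stand in for a case-insensitive collation, and the `suggest` function and sample data are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical chunked values, as produced by the chunking step.
CHUNKS = sorted(
    [
        "Bank of America:Bank of America",
        "of America:Bank of America",
        "America:Bank of America",
        "Wells Fargo Bank:Wells Fargo Bank",
        "Fargo Bank:Wells Fargo Bank",
        "Bank:Wells Fargo Bank",
    ],
    key=str.lower,
)
KEYS = [chunk.lower() for chunk in CHUNKS]


def suggest(typed, limit=50):
    """Return distinct full values that have a word-start suffix matching typed."""
    prefix = typed.lower()
    start = bisect_left(KEYS, prefix)
    full_values = []
    for i in range(start, min(start + limit, len(KEYS))):
        if not KEYS[i].startswith(prefix):
            break
        # Keep only the text after the last colon, like replace($chunk, ".*:", "").
        full_values.append(CHUNKS[i].rpartition(":")[2])
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(full_values))


print(suggest("ban"))
print(suggest("far"))
```

Note that stripping up to the last colon, as both versions do, assumes the full values themselves contain no colon.  A separator that cannot occur in the data would be a safer choice for values such as “Foo: Bar”.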

Advice From an Expert

I had this article reviewed by Colleen Whitney, the MarkLogic engineer who wrote (and performance optimized) search:suggest, and she provided some tips compiled from multiple engineers on her team.  We can all use these when performance tuning our matches against element range indexes:

The major take-away is that on very large datasets, resolving matches is highly data-dependent. The factors (in order of importance) are:

  • Number of total values with the same prefix as the match pattern. (I think this is a key issue given your approach.)
  • Collation.
  • Average number of characters in each value. (Longer strings are slower to compare; also if the beginnings of the strings are similar, comparison will be slower)
  • Number of unique values.

Therefore, users can improve performance on large datasets in several ways:

  • Be selective about which index(es) drive suggestions.
  • Design interaction with the UI so that suggestions aren’t given until there are a minimum of 3 characters as input to the lexicon call.
  • Use codepoint collation on string range indexes.

For most use cases, I don’t think the data sets are large enough to require a user experience limitation of only showing suggestions after 3 characters of input.  But given that Colleen’s tests were run on data sets with up to 22 million records, I have to agree that with data sets that large we’ll most likely have to make some compromises on user experience.

Full-Disclosure on My Testing Approach

My testing approach matches my previous post, using geonames.org data (though about 85,000 cities loaded this time).  I’m also still running a set of queries as if users were typing each character of each city name.  My hardware is now a Dell E6510 with 4 GB of RAM and Windows 7.

Here’s a snapshot of my JMeter results after one of my tests (only 30 threads looping 10 times each):

JMeter load test results screenshot

The times above are in milliseconds.  While you’ll notice that the median timings are acceptable (under 100 milliseconds) for most of the approaches, with a few outlier slow queries throwing off the averages, the 90% line is where I look to make sure the approach will have acceptable timings most of the time.  For this test run, only element-value-match and chunked element-value-match keep 90% of the responses under 100 milliseconds.  As I increase the size of the data or increase the load, the performance gap widens and it becomes more obvious that chunked element-value-match is much more scalable.
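
To show how the median, the average and the 90% line can tell different stories, here is a short Python sketch.  The timings are made-up numbers, not the JMeter results from this test, and the nearest-rank calculation is an assumption about how such a line can be computed rather than a statement of JMeter’s exact method.

```python
import statistics

# Made-up response times in milliseconds, with one slow outlier.
samples_ms = [12, 15, 18, 20, 22, 25, 30, 35, 90, 950]


def percentile_line(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


print("median:", statistics.median(samples_ms))
print("average:", statistics.mean(samples_ms))
print("90% line:", percentile_line(samples_ms, 90))
```

In this invented sample a single 950 ms response drags the average above 100 ms, while the median and the 90% line both stay under it, which is why the 90% line is a more useful target than the average.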

Other Requirements to Meet

While this solution is exciting, there are many potential auto-complete requirements beyond the scope of this article, for example:

  • We may want to broaden the matches based on synonyms or spelling corrections.
  • Some user experiences may not need matches on the first few keystrokes, which significantly reduces the performance requirements.
  • We don’t want to query on every keystroke when keystrokes occur in quick succession.  Many JavaScript auto-complete libraries like YUI Auto-complete take care of this for us.
  • If we have a pre-defined order among the potential set of values, we would like to present our matches in the right order.
  • We might first show matches which start with the user’s text, then matches containing words or phrases which start with the user’s text.
  • We may want to group auto-complete suggestions, so we may need to create multiple chunked element range indexes, and run a query against each.  This would obviously add more overhead and require additional performance testing.

To meet many of these requirements, we’ll need to extend this approach.  For others we’ll need to fall back to the approach of using custom Java code.
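
As one example of such an extension, the ordering requirement (values that start with the user’s text first, then values where a later word or phrase starts with it) could be handled after the lookup.  The Python sketch below is only an illustration of that ordering rule.  The `rank_suggestions` function and the sample values are hypothetical and were not part of my tests.

```python
import re


def rank_suggestions(values, typed, limit=50):
    """Whole-value prefix matches first, then later word or phrase matches."""
    needle = typed.lower()
    # Match the typed text only where it is not preceded by a word character.
    at_word_start = re.compile(r"(?<!\w)" + re.escape(needle))
    starts, word_starts = [], []
    for value in values:
        lowered = value.lower()
        if lowered.startswith(needle):
            starts.append(value)
        elif at_word_start.search(lowered):
            word_starts.append(value)
    return (starts + word_starts)[:limit]


values = ["Wells Fargo Bank", "Bank of America", "Banner Health", "Citibank"]
print(rank_suggestions(values, "ban"))
print(rank_suggestions(values, "fargo b"))
```

“Citibank” is deliberately left out for “ban” because the match falls in the middle of a word, which is consistent with the starts-with-any-word requirement stated at the top of this article.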

Conclusion

While the Java solution is still a viable alternative, it’s nice to now have a pure MarkLogic / XQuery solution for environments that can’t easily set up and manage a Java app server just for auto-complete.

Realtime Push with MarkLogic and Node.js via Websockets

Tuesday, November 16th, 2010

MarkLogic has an awesome alerting feature that enables you to trigger an event when new or updated database content matches certain criteria. Once a rule’s criteria are met, an action is triggered that executes an arbitrary XQuery module. You can send an email, an SMS message, perhaps place a phone call with the Twilio API, modify other content in the database, whatever your heart desires. But what if you want to deliver realtime notifications to a user in a browser?

Enter Websockets! Websockets is part of (well, was part of) the HTML5 initiative and defines a full-duplex single socket connection over which messages can be sent between client and server. Though MarkLogic and XQuery don’t support persistent connections such as websockets, it’s relatively easy to pair them with a third-party, websocket-capable server like Node.js with a library like node-websocket-server.

By pairing MarkLogic’s alerting feature with Node.js and websockets you can push realtime notifications to connected clients in websocket-capable browsers, mobile devices, etc.
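
The relay pattern itself is small enough to sketch.  The Python below is not the Node.js code from the demo and opens no real sockets: an in-memory hub with one queue per connected client stands in for the websocket server, and calling `broadcast` stands in for the alert action posting a message to it.  The class name, method names and message text are all hypothetical.

```python
import asyncio


class PushHub:
    """Fan one incoming alert message out to every connected client."""

    def __init__(self):
        self._clients = set()

    def connect(self):
        # In a real server this queue would be an open websocket connection.
        queue = asyncio.Queue()
        self._clients.add(queue)
        return queue

    def disconnect(self, queue):
        self._clients.discard(queue)

    def broadcast(self, message):
        # Called when the alert action notifies the relay; returns the client count.
        for queue in self._clients:
            queue.put_nowait(message)
        return len(self._clients)


async def demo():
    hub = PushHub()
    alice, bob = hub.connect(), hub.connect()
    hub.broadcast("new document matched rule 'breaking-news'")
    return await alice.get(), await bob.get()


print(asyncio.run(demo()))
```

The database never holds a persistent connection in this arrangement.  It makes one short request per alert, and the relay owns the long-lived client connections.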

Have a look…