Archive for the ‘Search’ Category

Scalable Search Auto-Complete

Thursday, April 15th, 2010

For us search integrators, auto-complete is one of the most enjoyable features to build.  But make no mistake: making search auto-complete scale is a real challenge.

Search auto-complete (aka instant search, find-as-you-type, look-ahead, predictive search, and many other names) is becoming more popular because, done right, it can be great for end users.  Performance is no small issue, though: it has to respond to *every keystroke* of every user, and if it doesn’t respond almost instantly, it still bogs down your servers without really helping end users.  So a search auto-complete solution must be:

  1. responsive (sub 100 ms) – so users see responses as they type
  2. high throughput (5-10x your existing search traffic) – so it can handle every keystroke for every user

Our most recent auto-complete project was for one of the most recognizable names in the financial industry.  For their implementation we had to execute auto-complete on over 50,000 items, and while auto-complete obviously only works well if it’s *very fast* (less than 100 ms), we also had to make it scalable enough for a very high traffic site.  50,000 items is too many to send to the web browser for my favorite approach, client-side auto-complete, so we had to use AJAX in this case.  To make a long story short, we found that the most responsive, scalable, and flexible solution was to search strings in memory in the application layer (Java in this case).  No, we didn’t go back to the search engine for each keystroke, because for this use case it simply was neither responsive nor scalable enough.
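To make the "search strings in memory" idea concrete, here is a minimal sketch. It is written in Python rather than the Java we used on the project, it is not the client's code, and the `PrefixIndex` class and its `suggest` method are names invented for illustration. It assumes simple case-insensitive prefix matching over a sorted list, which is only one of several ways to do this.

```python
import bisect


class PrefixIndex:
    """In-memory prefix lookup over a fixed list of strings (hypothetical helper)."""

    def __init__(self, items):
        # Sort by lowercased key so entries sharing a prefix sit next to each other.
        self._entries = sorted((item.lower(), item) for item in items)
        self._keys = [key for key, _ in self._entries]

    def suggest(self, typed, limit=10):
        """Return up to `limit` items whose lowercased form starts with `typed`."""
        prefix = typed.lower()
        if not prefix:
            return []
        # Binary search for the first key >= prefix, then walk forward.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key, original in self._entries[start:start + limit]:
            if not key.startswith(prefix):
                break
            matches.append(original)
        return matches


index = PrefixIndex(["Denver", "Detroit", "Dallas", "Des Moines", "Boston"])
suggestions = index.suggest("de")
```

Each keystroke costs one binary search plus a short scan, and nothing leaves the application server's memory, which is the property that mattered for us.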

I keep looking closely at offerings from search vendors to see if they provide a more packaged solution than the Java approach we used, and I’m excited to report what I’ve found in MarkLogic.   Our enterprise web team has been heavy into MarkLogic work, and I decided to experiment with that platform.  I found a gem called search:suggest which has two things I like a lot:

  1. It provides an uncommonly great user experience, starting down the path of a demo I’ve long regarded as the future of advanced auto-complete
  2. My testing shows it is *very responsive and scalable*

My results with 50 concurrent JMeter threads show:

  • 23 ms average response time with 411 ms max response time
  • 82 qps average throughput

I must say I’m impressed!

For another comparison point, I implemented auto-complete the way a large MarkLogic client showed me they were doing it, using cts:element query, with a wildcard (*) appended to each search item.  It didn’t do nearly as well as search:suggest, with average response times of several seconds and max response times over 11 seconds.  So, no big surprise here . . . standard wildcard searches are expensive.  Good thing MarkLogic offers us search:suggest.

These were quick tests, and I hope to explore this with much more rigor, but I had to share my initial findings.  If you want more details about what I did, feel free to reach out.  Briefly, here’s what I did:  I pulled out an old JMeter configuration I’ve used in the past to test auto-complete.  It’s based on a database culled from geonames.org, with a test script matching one letter at a time for each city name, mimicking a user typing the city names . . . now I was matching over 20,000 items, with several keystrokes per item, and a configuration to *hammer* the system to test performance and maximum throughput.
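For readers unfamiliar with this style of test, the request sequence looks roughly like the sketch below. This is not the JMeter configuration itself, only a Python illustration of "one letter at a time for each city name"; the function name `keystroke_queries` is made up for the example.

```python
def keystroke_queries(names):
    """Yield the successive prefixes a user would send while typing each name."""
    for name in names:
        for end in range(1, len(name) + 1):
            yield name[:end]


# Two sample city names stand in for the geonames.org data set.
queries = list(keystroke_queries(["Rome", "Oslo"]))
# queries == ["R", "Ro", "Rom", "Rome", "O", "Os", "Osl", "Oslo"]
```

Every yielded prefix becomes one auto-complete request, which is why the request count is several times the number of items.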

MarkLogic? A “NoSQL” Database? YES!

Wednesday, April 14th, 2010

The NoSQL movement has been garnering a lot of attention recently. It’s a trend largely facilitated by the changing demands of the transactional web. Today, the web is bidirectional and much more content oriented than it was in years past. The amount of user generated content has increased exponentially, and relational databases are not tailored to handle massive amounts of semi-structured content.

One NoSQL option that we at Avalon are very excited about is MarkLogic. Some will argue that it does not fit into their definition of “NoSQL,” but Mark Logic CEO Dave Kellogg did a good job of positioning MarkLogic as a relational database alternative (aka NoSQL) in a post last week.

Recently I’ve been spending a lot of time exploring how MarkLogic, as a powerful document (yes, key [aka URI] – value) and XML database, can be used to support social media and user generated content use cases. For example, threaded comments are much more naturally represented in a hierarchical, ordered format like XML. MarkLogic and XQuery make it easier to store, manipulate, and search these data structures.

To demonstrate, I built a simple element reordering example using MarkLogic and jQuery. Even this simple example would be non-trivial in a traditional relational model, where lists and order with respect to other elements are unnatural at best.



So this is obviously a super simplified example, but hopefully it gives you a small glimpse into one of the capabilities greatly simplified by MarkLogic through XML and XQuery.
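The demo itself is not reproduced here. As a rough stand-in, the sketch below shows why reordering is simple when order is part of the document: moving an element is just a remove and an insert among its siblings. It uses Python's standard `xml.etree.ElementTree` rather than MarkLogic and XQuery, and the `<thread>`/`<comment>` markup and the `move_child` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET


def move_child(parent, old_index, new_index):
    """Move one child element to a new position among its siblings."""
    child = parent[old_index]
    parent.remove(child)
    parent.insert(new_index, child)


# Hypothetical comment thread; document order is the display order.
thread = ET.fromstring(
    "<thread>"
    "<comment id='a'/><comment id='b'/><comment id='c'/>"
    "</thread>"
)

move_child(thread, 2, 0)  # drag the last comment to the top
order = [comment.get("id") for comment in thread]
# order == ["c", "a", "b"]
```

In a relational table the same move typically means maintaining an explicit position column and renumbering the rows in between.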

I will be at the MarkLogic User conference in San Francisco May 4-6th. I hope to see you there!

Solr in 5 Minutes at Denver Open Source Users Group

Monday, February 1st, 2010

Tomorrow night (Feb 2) I’m excited to participate in the Denver Open Source Users Group’s (DOSUG) first “Lightning Talk” night where I will be presenting an overview of Solr, a leading open-source enterprise search engine. The event is an O’Reilly Ignite style event.  There will be 14+ short 5-minute presentations on various topics.

If you’re in Denver tomorrow night, come check it out and join in the fun. You can find the meeting details here.

Also, check out our latest news to see other events that Avalon will appear at in the coming months.

A Wake-up Call: Google, support Faceted Search!

Monday, January 11th, 2010

I believe confusion persists about whether the Google Search Appliance (GSA) truly supports faceted search.  First I’ll point out that Google Labs has two projects which claim to add faceted search features to the GSA (more details on those below).  Those projects probably work for some and looked tempting when we first found them, but as we dug deeper we found they are what I’ll call “bolt-on” faceted search, which simply isn’t scalable for most of our clients.  We’ve only seen two approaches to “bolt-on” faceted search when a search engine doesn’t provide native support:

  1. Run a separate query for each facet value – this multiplies the number of queries by the number of facet values, obviously adding significantly more, usually too much more, search engine traffic
  2. Pull a large set of results and calculate facets on that set – this approach obviously cannot provide an accurate list of facets or matching results for each facet across the entire result set when there are too many results to reasonably pull all matching results into the “bolt-on” code to run the calculations

In both cases the core problem is that the search engine doesn’t offer native support for faceted search.  Faceted Search is one of those features that simply must be supported by the engine in order to provide accurate, scalable facets for high-volume or high-traffic enterprise search implementations.
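A toy illustration of both failure modes, in Python. The result set, the field name `type`, and the helper names are all invented for the example; it is a sketch of the two approaches as described above, not code from any real bolt-on.

```python
from collections import Counter


def bolt_on_query_load(searches, facet_value_count):
    """Approach 1: engine queries issued when each search runs once per facet value."""
    return searches * facet_value_count


def facet_counts(results, field, top_n=None):
    """Approach 2: count facet values over all results, or only the first top_n."""
    window = results if top_n is None else results[:top_n]
    return Counter(doc[field] for doc in window)


# Hypothetical ranked result set: 100 PDFs first, then 400 HTML pages and 5 spreadsheets.
results = (
    [{"type": "pdf"}] * 100
    + [{"type": "html"}] * 400
    + [{"type": "xls"}] * 5
)

exact = facet_counts(results, "type")                 # what native support would report
truncated = facet_counts(results, "type", top_n=100)  # what the bolt-on sees

# exact counts:     pdf=100, html=400, xls=5
# truncated counts: pdf=100 only; "html" and "xls" never appear as facets
```

With approach 1, 1,000 user searches and 50 facet values become 50,000 engine queries. With approach 2, the largest facet in this made-up data set is invisible because none of its documents rank in the first 100.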

So the simple fact is: as of version 6, Google Search Appliance does not offer native support for faceted search.  As I said, there are two “bolt-on” solutions which we can’t recommend, and here’s why:

  1. gsa-faceted-search takes approach #1 above, running a query on the search engine once for each facet value. You can see this in code-snippet-1.1.txt on lines 75 and 79. For the example they provide, that would multiply your search engine traffic by 11x whatever traffic you had before adding this feature. For our customers it’s not unusual to have over 50 facet values, so that would require the search engine to handle 50x the normal traffic.

         41:    for (var i in facetDefinition) {
                . . .
         75:        xmlDoc.load(countURL);
                . . .
         79:            xmlhttp.open("GET", countURL, false);

  2. GSA Lab’s parametric project takes approach #2 above (Pull a large set of results and calculate facets on that set).  If you look at googleParametric.js, you’ll see that line 168 only requests the first 100 results to calculate its values, so you’ll never see a facet value unless it’s attached to one of the first 100 results, and the counts for each facet are not accurate because they can only know how many of the first 100 results matched that facet.

        168:   var url = "http://" + mTGSAHost + "/search?" + mTURL + "&num=100";

This problem is not unique to Google.  I’ve heard many Ultraseek implementers brag about their success with creative ways to “bolt-on” faceted search.  But every time I’ve dug deeper, I’ve found one of the two approaches listed above, which means it was either not scalable or not accurate.  Luckily, Ultraseek 6 added IDOL (which has long had native support for faceted search) under the hood, so Ultraseek customers can now upgrade and have scalable and accurate faceted search.  We’ve helped many customers through this process, and have been very pleased with the outcome.

Why Google has not yet woken up to faceted search, I cannot explain . . . for more hand-wringing on that topic, see Daniel Tunkelang’s post.  Hopefully for their tens of thousands of Google Search Appliance customers, Google will resolve this issue soon.