Querying JSON with N1QL – Couchbase Reimagines SQL

A traditional relational database makes use of an encoding strategy laying out tables, each of which has a particular known schema, and then uses the Structured Query Language (SQL) to find specific matching values within the various tables, using principles first laid out by IBM designer Ted Codd in the early 1970s. A table could represent a set of atomic attributes, but could only indirectly associate a record with a different set of values by creating a matching a foreign key in the table to a primary key elsewhere – what’s called a join in relational circles.  The process of identifying such relationships and then building tables for this is called normalization, with the most basic form of normalization being called the first normal form.

The advantage of being able to normalize document structures is that you can identify certain columns within a table for special indexing. Typically, an index associates a value with a set of records in a given table. For example, if you had an employee database you could create an index on the last name. Then, even with just a partial map you can very quickly identify the identifier of all of the records for which an employee had that last name. This means that rather than having to look through every single entry in the table, you only have to do a comparison against an ordered list of names, which will generally be much faster (especially if you have several people sharing the same name). When you try to compare against both first and last name, the SQL engine then performs an indexed lookup and takes the intersection of records, which can be done much faster than a linear search as well. It is this mixture of normalization and indexing that makes relational databases comparatively faster than other kinds of databases.

Document NoSQL databases represent an interesting challenge from a search perspective, however. A document is often referred to as a denormalized structure, though this isn’t quite technically true. Instead, most NoSQL databases typically will take a document structure, and will then break it down into first normal form tables internally, but will also hide the key relationships as implicit joins – fast and optimized, but not necessarily evident. Moreover, JSON structures are combinations of arrays (linear sequences of items) and hashes (an unordered collection of key/value pairs). Structures may include both arrays and hashes, possibly with hashes of arrays, arrays of hashes, arrays of arrays and hashes of hashes. All of these make using SQL with JSON all but impossible, unless SQL itself is modified to take into account the ideosyncracies of document structures.

Querying JSON within document stores for the most part has come down to one of two approaches. The first is a simple indexing solution – you can retrieve content by using specific key value indexes to find for a specific hash key, a document that contains this hash key. then use set logic operations (intersection, union, not, etc.) to reduce this set.  This works reasonably well when you’ve taken the time to set up indexes, but doesn’t necessarily perform well when you have a commonly used property such as “name” – for instance:

find("name","John")

might end up matching thousands of documents in all kinds of contexts that isn’t of interest to you.

The other solution has been to use JSONiq (originally developed by a group of developers from Oracle, 28msec and others). to treat JSON structures as being analogous to XML structures, with JSONiq hewing closely to the XPath/XQuery model. While this can work quite well for retrieving content, for many people who are more used to working with SQL the document path approach can be off-putting.

A N1QL For Your Thoughts

Couchbase tried a different tack with their N1QL language, now part of the Couchbase 4.0 preview. Their interest was getting people who were familiar with SQL to make use of their database via a SQL-esque language called N1QL , one that was close enough in structure to make it easy to adopt. The term itself comes from First normal form (N1) Query Language, though this is easily turned into “Nickel”. It represents a fairly significant shift in thinking: JSON structures are close enough in form to make querying it through N1QL attractive to SQL developers who don’t want to lose their favorite tools.

In Couchbase, from a query standpoint, all documents are in a universal array within a bucket.  If you pass in an array of objects, via JSON, they will be added to this universal array. This means that every document at its root is a JSON object. Also, unlike with relational databases, there is no guarantee that documents within this array – what Couchbase refers to as a keyspaces – have  the same or even similar structures (schemas), though in most cases it makes sense to have a keyspace hold relatively similar documents.  Typically, the keyspaces serve a function similar to tables as an organizational tool – you may have an Employee keyspace,  a Project keyspace, and a Department keyspace, as an example, with each keyspace having subtly different structures but in general having documents that represent Employees, Projects and Departments respectively.

Setting up Couchbase Server — Couchbase Server 4.0 — BETAGetting started with N1QL in the most recent Couchbase 4.0 preview was much simpler than it was in the original preview, as was the whole installation process (http://docs.couchbase.com/4.0/getting-started/set-up-couchbase.html), with the only real potential for confusion being a screen that asks whether you wish to set up a cluster to contain only data, or to also contain indexes and query. To install N1QL, you must have all three servers working within the cluster.

The Data and Index RAM quotas can also be set. By default Couchbase allocates roughly half of the available memory, but this can be reduced considerable then scaled up as needed.

Once set, it’s worth taking a moment to load the sample databases, including the beer database, which contains both documents pertaining to beer brands and to brewers, as a playground.  For the examples discussed here, you should open up the web administrator tool (typically this will be on http://localhost:8091 if installed locally, and set up a new data “bucket” called “employees”.)

Once the server has started, you can launch the Couchbase query client by running

$couchbase/bin/cbq

where $couchbase is the main Couchbase folder on Windows, Mac, or various flavors of Linux. This will make it possible to directly type in N1QL commands. Adapters for nodejs, Java, Python and other application platforms are available as well – in general they follow a pattern of exposing a connection object to the database along with a specific query object retrieved from the connection object to run specific N1QL queries. The queries themselves generally do not change whether run from cbq or from external libraries.

Just as SQL has a Data Manipulation Language (DML), so too does N1QL. The following populates the employees database with five employees:

INSERT INTO employees (KEY,VALUE) VALUES (
"janedoe",{"employee":{"empId":"janedoe","firstName":"Jane","lastName":"Doe","startDate":"2012-03-01","gender":"female"}});
INSERT INTO employees (KEY,VALUE) VALUES (
"janettedoe",{"employee":{"empId":"janettedoe","firstName":"Janette","lastName":"Doe","startDate":"2012-04-01","gender":"female"}});
INSERT INTO employees (KEY,VALUE) VALUES (
"johndoe",{"employee":{"empId":"johndoe","firstName":"John","lastName":"Doe","startDate":"2012-04-01","gender":"male"}});
INSERT INTO employees (KEY,VALUE) VALUES (
"johnkeel",{"employee":{"empId":"johnkeel","firstName":"John","lastName":"Keel","startDate":"2012-04-04","gender":"male"}});
INSERT INTO employees (KEY,VALUE) VALUES (
"samvimes",{"employee":{"empId":"samvimes","firstName":"Sam","lastName":"Vimes","startDate":"2012-04-03","gender":"male"}});
INSERT INTO employees (KEY,VALUE) VALUES (
"karendohr",{"employee":{"empId":"karendohr","firstName":"Karen","lastName":"Dohr","startDate":"2012-06-01","gender":"female"}});
<span style="font-family: Georgia, 'Times New Roman', 'Bitstream Charter', Times, serif; line-height: 1.5;">The first argument in the VALUES() statement is a specific primary key from the entry. This may (and usually will) be referenced elsewhere, such as in the empId field, but this provides a way of accessing information via a user friendly key.  In general, documents are not immediately indexed when inserted.  Instead, once a set of documents has been loaded, the current index should be dropped and a new index created:
</span>
DROP PRIMARY INDEX ON employees;
CREATE PRIMARY INDEX ON employees;

Once content has been indexed, it is fully available for querying via N1QL using a SELECT statement, in a manner similar to that used by SQL.

SELECT employee from employees;

The output, however, is likely not what you’d expect from SQL, instead, coming out as JavaScript:

{
    "requestID": "e366b364-c7c7-46ed-8d12-6ae286c25438",
    "signature": {
        "employee": "json"
    },
    "results": [
        {
            "employee": {
                "empId": "johnkeel",
                "firstName": "John",
                "gender": "male",
                "lastName": "Keel",
                "startDate": "2012-04-04"
            }
        },
        {
            "employee": {
                "empId": "johndoe",
                "firstName": "John",
                "gender": "male",
                "lastName": "Doe",
                "startDate": "2012-04-01"
            }
        },
        {
            "employee": {
                "empId": "janedoe",
                "firstName": "Jane",
                "gender": "female",
                "lastName": "Doe",
                "startDate": "2012-03-01"
            }
        },
        {
            "employee": {
                "empId": "janettedoe",
                "firstName": "Janette",
                "gender": "female",
                "lastName": "Doe",
                "startDate": "2012-04-01"
            }
        },
        {
            "employee": {
                "empId": "samvimes",
                "firstName": "Sam",
                "gender": "male",
                "lastName": "Vimes",
                "startDate": "2012-04-03"
            }
        },
        {
            "employee": {
                "empId": "karendohr",
                "firstName": "Karen",
                "gender": "female",
                "lastName": "Dohr",
                "startDate": "2012-06-01"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "25.738414ms",
        "executionTime": "25.588915ms",
        "resultCount": 6,
        "resultSize": 1430
    }
}

As is typical in such queries, the result comes with a significant amount of metadata. A status indicator along with metrics indicating how long the queries took to run (elapsed time and execution time will usually be close during single queries, but may differ when multiple queries are being serviced simultaneously). The result size indicates the number of bytes returned from the query.  The output is JSON, which means that a json string to JavaScript object (or similar serialization object) will be required before the data can be utilized in a program.

A similar query can be used to both  filter out specific facets and shape the output. For instance, suppose that you wanted to find all female employees, but you were only interested in the first and last name of the person. The query to do this is as follows:

SELECT employee.firstName, employee.lastName from employees where employee.gender="female";

The output is terser, since only properties are being returned:

{
 "requestID": "d044f291-a3ee-4e73-818c-dc0b4591dc49",
 "signature": {
 "firstName": "json",
 "lastName": "json"
 },
 "results": [
 {
 "firstName": "Karen",
 "lastName": "Dohr"
 },
 {
 "firstName": "Janette",
 "lastName": "Doe"
 },
 {
 "firstName": "Jane",
 "lastName": "Doe"
 }
 ],
 "status": "success",
 "metrics": {
 "elapsedTime": "65.967121ms",
 "executionTime": "65.877558ms",
 "resultCount": 3,
 "resultSize": 227
 }
}

This kind of output accords well with table structures, but JSON is multi-dimensional: it can support arbitrarily nested structures and levels. For instance, suppose that you wanted to add an address to Jane Doe’s entry. This can be accomplished with an UPSERT statement, which combines the UPDATE and INSERT functions into a single statement that inserts a new record only if the document doesn’t exist. Otherwise UPSERT updates the document, replacing old content with new:

UPSERT into employees (KEY,VALUE) VALUES ("janedoe",{"employee":{
   "empId":"janedoe","firstName":"Jane",
   "lastName":"Doe","startDate":"2012-03-01","gender":"female",
   "address":{"street":"123 Sesame St.","city":"New York","state":"NY",
       "postalCode":"01254"}}});

You will likely use UPSERT far more than INSERT.

This more complex structure can similarly be queried, but it’s also possible to create a different structure using this:

SELECT concat(employee.firstName," ",employee.lastName) as fullName, employee.address from employees 
where employee.address.state = "NY";

When run, this produces the output

{
 "requestID": "2e5c3529-2bcd-4a2d-8fa3-ce306e8be503",
 "signature": {
 "address": "json",
 "fullName": "string"
 },
 "results": [
 {
   "address": {
      "city": "New York",
      "postalCode": "01254",
      "state": "NY",
      "street": "123 Sesame St."
    },
   "fullName": "Jane Doe"
 }
 ],
 "status": "success",
 "metrics": {
 "elapsedTime": "63.526706ms",
 "executionTime": "63.114672ms",
 "resultCount": 1,
 "resultSize": 235
 }
}

Note here that a new property “fullName” has been created using the AS statement, which holds the concatenated first and last names, while the address block is returned as an entire JSON structure.  Also note that even when only one item is returned in a query, that this is still contained in an array (this eliminates the need to have to test whether an item is an object or an array, which can be a frustrating experience when done repeatedly).

The examples covered here scratch the surface of what can be done with N1QL. It is possible to join documents together along primary/foreign keys, iterate over documents to create transformed documents, jperform aggregate operations and create global secondary indexes (GSIs for handling indexes that utilize multiple keys, such as geo-spatial indexes). While there are still some limitations (most notably being that it is not currently possible for users to define libraries of functions),  what N1QL does offer is a dramatically expanded toolset for querying NoSQL.

Why N1QL Matters

One of the key problems that SQL faces is a very old design flaw – it was never intended to be serialized. In most cases, when a SQL query is run, the output is a table, and access to that table is then provided via some kind of data translator that lets people iterate over the result set. There is no standard way to serialize a relational database, which means that there are an almost absurd number of non-standard ways to do so, to the extent that the most reliable way of performing a transfer between databases is to literally set up a set of instructions to construct not only table schemas but even data and indexes.

Now, a great deal of information in the world can in fact be contained in an output table, if you’re willing to go through some kind of a recordset navigator. However, there is also a significant percentage – perhaps even the overwhelming majority of all datasets in fact – where denormalizing everything down into giant, frequent sparse tables with loss of relationships loses information, increases redundancy, and adds considerably to the potential of data becoming corrupt or dirty.

This realization has already made its way into the world of services. Most services today utilize JSON or XML to transmit content, because structure matters. A JSON data structure is in essence a self-contained database, and an array of such structures can generally transmit schematic structure and provide implicit joins. What’s more, JSON handles missing information well, and it also handles structural variations that are very difficult for relational table structures and SQL to deal with.

The JavaScript community (and the JSON movement) have generally taken a Wild West approach to standardization, especially in comparison to XML. The XML community developed the XQuery standard for querying XML document databases, and around 2010, a group of JSON enthusiasts developed a language called JSONiq, which utilized XQuery and XPath-like structures to manage navigating across JSON documents and working with JSON databases. However, while undoubtedly powerful, JSONiq tends to appeal to those people who are used to working with document structures first. Given the large number of people with more traditional SQL skills, JSONiq can appear at best cryptic, and at worst more cumbersome than the tools that they are familiar with.

N1QL has the potential to bridge that gap. It is more than just another JavaScript library. N1QL also doesn’t suffer the callback misery that many Javascript database engines run into, due to the asynchronous nature of many database calls. As a language, N1QL segues nicely into the SQL patterns, critical not only for queries but also (perhaps even more so) for updates, which makes adapting it into other tools easier than is the case for XML-like tools.

There is no doubt in my mind that Couchbase’s N1QL is still a work in progress, but even at this stage, it is a powerful tool, and it may be well worth the while of the Couchbase team to attempt to establish N1QL as an ISO standard. Given the role that JSON NoSQL databases are poised to play moving forward, being able to standardize on a common web SQL is a not inconsiderable achievement. Certainly, expect it to become one of the pillars of querying the web.

Kurt Cagle is Principal Evangelist for Avalon Consulting, LLC, a consulting services company specializing in NoSQL, Big Data and Data Analytics Solutions.

 

Kurt Cagle About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.

Leave a Comment

*