MarkLogic 9 introduces row-oriented views and the Optic API

AAEAAQAAAAAAAA1uAAAAJDVjMjMwYzFkLWM4OTgtNDY0My04YTc4LTQxNDc4MTI0M2E3MA

Image courtesy Normann Copenhagen (https://www.flickr.com/photos/normann-copenhagen/) under license CC BY-SA 2.0

MarkLogic has always been adept at storing a variety of data types: XML, JSON, RDF, text, and binary. Its powerful search capabilities, rich APIs, and support for many kinds of data in a variety of languages have made it a flexible choice for addressing a wide range of technical challenges without the need for complex integration. The upcoming release of MarkLogic 9 will expand on this foundation, adding a new level of support for row-oriented data. The new features will make it easy to store and search data while respecting its natural structure.

MarkLogic 9 will enable developers to create row-oriented views of documents using Template Driven Extraction. The Optic API will support familiar SQL concepts such as selects, joins, aggregation, order by, and sort. The new row-oriented document views will benefit from a row index and an optimizer for increased performance. These features will be integrated with core document search and semantic triple capabilities, supporting queries that involve multiple data types. APIs will be included for Java, XQuery, JavaScript, and REST.

Let’s look at how Template Driven Extraction and the Optic API will evolve the way data modeling is approached and enable developers to bring existing knowledge of SQL to bear when writing queries.

Use case: Electronic Health Record data

Medical data stored by an Electronic Health Record (EHR) presents some interesting challenges.

Background

EHRs need to store and query patient demographic information and Continuity of Care Documents (CCD), among other data.

Patient demographics include standard fields such as patient name, sex, and date of birth.

A Continuity of Care Document is a standardized document that captures many clinical encounters for a patient over time. It contains all information about a patient that may be relevant for providing future care. It includes fields such as the problem list, medications, medication allergies, lab tests, and procedures. Physician notes are included in some of the sections.

For privacy reasons, CCDs are commonly de-identified by ensuring that they do not contain demographic data, which is stored separately. A medical record number can be included in both pieces of data that can be used to join demographic data to CCDs when needed, e.g. by a doctor when providing care to a patient.

Among its capabilities, an EHR needs to meet these requirements:

  • Store patient demographics and CCDs
  • Query both types of data
  • Integrate data from different source systems
  • Send data to other systems
  • Join demographic data to CCDs

Why use MarkLogic?

The lack of definite structure within CCDs and the tendency for them to be thought of as a single unit suggests that they should be stored whole, in a document database. Breaking them up into rows and columns to force them into a relational database would be limiting and inflexible.

Patient demographic information is a prime example of structured data. The structured data is a natural fit for a relational database model involving rows and columns.

So we have two types of data, one requiring a document database and the other more suitable for a relational database. Wouldn’t it be convenient if we had some means of storing both types of data side by side in one database that could deliver the strengths of both database types?

Enter MarkLogic 9, the Enterprise NoSQL Database with Template Driven Extraction, row indexes, and the Optic API.

Goals

Using Template Driven Extraction and the Optic API, we can store and search both types of data and we can efficiently perform joins between the data types. All in a way that feels natural and familiar. Our goals in demonstrating this are to:

  1. Create a single view for demographic data from two different source systems.
  2. Perform SQL-style joins between the demographic data and the clinical data.
  3. Combine SQL-style search with full text document search in the same query.
  4. Query the demographic data, joining it to the clinical data, and retrieve the clinical data for the patients matching the query.

Note that even though JSON and JavaScript will be used for sample data and code, XML and XQuery are fully supported as well. Also, since MarkLogic 9 is not yet generally available, some details such as syntax in the examples below are subject to change.

Inserting data

In Figure 1, we insert clinical data for a sample of patients. These documents represent CCDs, each storing medical information for a single patient. They include the mrn field, which acts as a foreign key, referencing the demographics documents that have the identifying information about each patient. To save space, this code inserts the same imaginary, oversimplified “CCD” for each patient.

Figure 1: Insert clinical data that could represent a CCD

// insert sample patient clinical data
declareUpdate();

function insertCCD(mrn) {
  var doc = {
    "ccd": {
      "mrn": mrn,
      "problems": [ "diabetes", "obesity" ],
      "medications": [ "prozac" ],
      "allergies": [ "wheat", "pollen" ],
      "lab-tests": [],
      "procedures": [ "bronchoscopy" ]
    }
  };
  xdmp.documentInsert("/ccd/" + mrn, doc, {"collections": "ccd"});
}

insertCCD("320842");
insertCCD("320843");
insertCCD("320844");
insertCCD("620842");

Next, we insert patient demographic records. If data from multiple systems is integrated, it may be desirable to store the original data unchanged from each data source. Uniform row views can be created from documents with different structures, presenting one consistent target for queries. In Figure 2, we insert demographic records with two differing structures, one structure from source system A and one from source system B.

Figure 2: Insert patient demographic records

declareUpdate();
// insert patient demographic records from source system A
function insertDemographicsA( mrn, firstName, lastName, sex, birthdate ) {
  var doc = {
    "demographic-record": {
      "mrn": mrn,
      "ccd-uri": "/ccd/" + mrn,
      "first-name": firstName,
      "last-name": lastName,
      "sex": sex,
      "dob": birthdate
    }
  };
  xdmp.documentInsert('/demographics/' + mrn, doc,
    {"collections": "demographics"});
}

insertDemographicsA("320842", "John", "Doe", "M", "1994-01-01");
insertDemographicsA("320843", "Jane", "Doe", "F", "2000-01-01");
insertDemographicsA("320844", "Steve", "Doe", "M", "1982-01-01");

// insert patient demographic records from source system B
var mrn = "620842";
var doc = {
  "patient": {
    "record-id": mrn,
    "ccd-uri": "/ccd/" + mrn,
    "name": {
      "first": "Bob",
      "middle": "Allen",
      "last": "Smith"
    },
    "sex": "M",
    "birthDate": {
      "year": "1990",
      "month": "01",
      "day": "02"
    },
    "address": {
      "street": "3012 Alder Dr.",
      "city": "Noremburg",
      "state": "MA",
      "zip": "30293"
    }
  }
};
xdmp.documentInsert('/demographics/' + mrn, doc,
  {"collections": "demographics"});

Defining views

Now that we know how our data is structured, we can create templates to define views.

In Figure 3, we insert templates that create row views from the two different patient demographic record structures. tde.template-insert() inserts a template into the Schemas database and assigns a special collection that all templates belong to. Inserting a template also creates a view and causes rows from documents that match the context to be included in the row index. After the template is inserted, the row index is automatically updated when new matching documents arrive.

Notice the context element in the two templates. In the template for data that came from source system A, the context is set to “/demographic-record”. In the template for data from source system B, it is set to “/patient”. These values match the root nodes for the respective demographic documents that we already inserted. To control which documents your templates are applied to, you also have the option of specifying directories and collections.

Figure 3: Insert templates to create views for two alternative demographic document structures

declareUpdate();
var tde = require("/MarkLogic/tde.xqy");

// template to create a view for demographic data from source system A
var tmpl = xdmp.toJSON(
{
  "template": {
    "context": "/demographic-record",
    "rows": [
      {
        "schemaName": "patient_data",
        "viewName": "demographics",
        "columns": [
          {
            "name": "mrn",
            "scalarType": "int",
            "val": "mrn",
            "invalidValues": "ignore"
          },
          {
            "name": "ccd_uri",
            "scalarType": "string",
            "val": "ccd-uri",
            "invalidValues": "ignore"
          },
          {
            "name": "first_name",
            "scalarType": "string",
            "val": "first-name",
            "invalidValues": "ignore"
          },
          {
            "name": "last_name",
            "scalarType": "string",
            "val": "last-name",
            "invalidValues": "ignore"
          },
          {
            "name": "sex",
            "scalarType": "string",
            "val": "sex",
            "invalidValues": "ignore"
          },
          {
            "name": "dob",
            "scalarType": "date",
            "val": "dob",
            "invalidValues": "ignore"
          }
        ]
      }
    ]
  }
}
);
tde.templateInsert( "demographics-template.xml", tmpl );

// template to create a view for demographic data from source system B
var tmpl2 = xdmp.toJSON(
{
  "template": {
    "context": "/patient",
    "rows": [
      {
        "schemaName": "patient_data",
        "viewName": "demographics",
        "columns": [
          {
            "name": "mrn",
            "scalarType": "int",
            "val": "record-id",
            "invalidValues": "ignore"
          },
          {
            "name": "ccd_uri",
            "scalarType": "string",
            "val": "ccd-uri",
            "invalidValues": "ignore"
          },
          {
            "name": "first_name",
            "scalarType": "string",
            "val": "name/first",
            "invalidValues": "ignore"
          },
          {
            "name": "last_name",
            "scalarType": "string",
            "val": "name/last",
            "invalidValues": "ignore"
          },
          {
            "name": "sex",
            "scalarType": "string",
            "val": "sex",
            "invalidValues": "ignore"
          },
          {
            "name": "dob",
            "scalarType": "date",
            "val": "birthDate/year || '-' || birthDate/month" +
              "|| '-' || birthDate/day",
            "invalidValues": "ignore"
          }
        ]
      }
    ]
  }
}
);
tde.templateInsert( "demographics-template-v2.xml", tmpl2 );

By inserting these templates, we have defined a relational lens over our patient demographic data. This does not alter the original data or the document indexes, but it causes the row indexes to be populated from the data. This means that in addition to working with the documents, we can also work with the data as rows and columns.

The concept of lenses provides great flexibility in how you use your data. A semantic lens can also be defined over a document by adding a template that defines the triples to index (the syntax is similar to that of relational templates). You could even have a relational and a semantic lens defined at the same time over a single source document. All without the need for complicated ETL processes.

Now we can examine the view using SQL in Query Console by setting Query Type to “SQL”. Figure 4 shows the result of a query showing data from the demographics view.

Figure 4: SELECT * FROM demographics ORDER BY dob DESC;

Querying

With views created and row indexes populated, we can use the Optic API to write queries in a way that is conceptually similar to SQL.

Figure 5 shows how to select demographic data from the row index populated by the consistent view we created from the two different data sources. The query also includes a cts.elementValueQuery that runs against document indexes. It’s searching relational data and document data at the same time, in one query, using the power of MarkLogic document searches combined with the new SQL capabilities.

Figure 5: Select data from a view populated by two different data sources

const op = require('/MarkLogic/optic');

const minYear = xs.date('1990-01-01');
const q = cts.elementValueQuery( xs.QName("state"), "MA" );

var sex = 'M';

op.fromView('patient_data', 'demographics', null, null, q)
  .select(['mrn', 'first_name','last_name', 'dob', 'sex'])
  .orderBy('dob')
  .where( op.and(
    op.gt(op.col('dob'), minYear),
    op.eq(op.col('sex'), sex)
  ))
  .limit(25)
  .result();

The call to op.fromView reads a row set from the view named “demographics” in the schema named “patient_data” (defined when we created the view template). A cts query limits the results to patients in a specific state. From that row set, we select certain fields, order by the birthdate and keep only those results where the birthdate is more recent than the date constant we defined as minYear, and limit the number of returned results. The final call to result() executes the query. Until that point, we are only building a query plan.

The similarity to SQL makes the Optic API feel familiar. The use of the constant minYear above demonstrates how easy it is to build queries in a readable way (this is a simple example, but variables can easily be used with Optic queries in more complex ways).

Notice that the cts query references the “state” field in the source documents, and this field was not mapped to a column by our demographics view. The Optic API supports arbitrarily complex queries as filters. Because the query runs against document indexes, not the row index, it can reference any field in the source documents.

In Figure 6, we see another way to achieve the same result, this time using xdmp.sql() and binding query parameters into the query. Like the Optic query in Figure 5, this variation can access the original documents using a cts query, but it uses string concatenation, which is harder to read and more error prone to write than the Optic equivalent.

Figure 6: Querying the demographic view without using Optic (this is not recommended)

const minYear = xs.date('1990-01-01');
const q = cts.elementValueQuery( xs.QName("state"), "MA" );

var sex = 'M';

var bindings = {"min": minYear, "sex": sex};
var sqlString = "SELECT mrn, first_name, last_name, dob " +
  "FROM demographics " +
  "WHERE dob > @min AND sex = @sex ORDER BY dob LIMIT 25";
xdmp.sql(sqlString, null, bindings, q);

Figure 7 shows a SQL query that retrieves a similar result set, but since it cannot access the “state” field in the source documents (because that field was not included in the template and therefore is not in the row index), it cannot limit the results by state.

Figure 7: A plain SQL query that selects data from a view over two document structures

SELECT mrn, first_name, last_name, dob
FROM demographics
WHERE
  dob > '1990-01-01' AND
  sex = 'M'
ORDER BY dob
LIMIT 25

What if you needed more than just the row data in your application? With the Optic API you can move from documents to rows – for performing analysis, aggregating, and filtering on the data – and back to documents again. The query in Figure 8 retrieves the entire clinical data documents for the patients of interest.

Compared with the query from Figure 5, this query does two things differently. First, it selects the ccd_uri column, which holds URIs for CCD documents. Second, it uses joinInnerDoc() to read the independent CCD documents and embed them into the resulting rows.

Figure 8: Join demographic data to clinical CCD data

const op = require('/MarkLogic/optic');

const minYear = xs.date('1990-01-01');
const q = cts.elementValueQuery( xs.QName("state"), "MA" );

var sex = 'M';

op.fromView('patient_data', 'demographics', null, null, q)
  .select(['mrn', 'first_name','last_name', 'dob', 'sex', 'ccd_uri'])
  .orderBy('dob')
  .limit(25)
  .where( op.and(
    op.gt(op.col('dob'), minYear),
    op.eq(op.col('sex'), sex)
  ))
  .joinInnerDoc('clinical-data', 'ccd_uri')
  .result();

The results from running this query are shown in Figure 9.

Figure 9: Results from joining demographic data to clinical CCD data

{
  "patient_data.demographics.mrn": 620842,
  "clinical-data": {
    "ccd": {
      "mrn": "620842",
      "problems": [
        "diabetes",
        "obesity"
      ],
      "medications": [
        "prozac"
      ],
      "allergies": [
        "wheat",
        "pollen"
      ],
      "lab-tests": [],
      "procedures": [
        "Bronchoscopy"
      ]
    }
  },
  "patient_data.demographics.first_name": "Bob",
  "patient_data.demographics.last_name": "Smith",
  "patient_data.demographics.dob": "1990-01-02",
  "patient_data.demographics.sex": "M",
  "patient_data.demographics.ccd_uri": "/ccd/620842"
}

MarkLogic 9 reinforces the position of MarkLogic as the leading commercial NoSQL database offering unparalleled flexibility, polyglot persistence, power, and accessibility. Together with Template Driven Extraction, the Optic API simplifies data integration, giving developers more convenient ways to store and search data without the need for complex ETL processes.

The examples presented here are only an introduction to what is possible with MarkLogic 9. For a good look at MarkLogic 9 Security features, read the post “MarkLogic 9 Security” by my colleague Harry Bakken. The team at Avalon Consulting, LLC will be blogging more on MarkLogic 9 over the coming months. Follow us on LinkedIn to receive proactive notice of our posts. Additionally, learn how enterprise IT organizations are succeeding in big data initiatives by reading the independent research report by 451 Research: “Avalon Consulting, LLC positioned as the big-data professional for partners and customers”

About Karl Erisman

Leave a Comment

*