Data Virtualization Comes of Age


Tell me what I want to know and what I need to know.

In an ideal world, that simple request should be easy for most data systems to answer. You ask a computer “What areas should we be investing in to get the maximum return for our investments? Who are the best people in the field whom we may be in a position to hire? How is my business doing, really?”

These are seemingly simple questions, but in practice there are very few data systems on the planet that can answer them without the application of a great deal of systemized expertise and often huge amounts of processing time. Part of the reason is that the answers to each of these are subjective rather than strictly quantifiable – how do you define “areas”, “best”, and “doing”, for instance? What is the scope of the question – if your budget is a few thousand dollars, the answer will be different than if it’s a billion dollars. What information is expected – a quantifiable “doing well” or a detailed analysis and report of finances and expenditures? Finally, what is the context – who is the “we” and “my” here, what fields or areas are people currently invested in, and what is already known about the world relative to these questions?

Most traditional databases are designed to service applications, storing specific information that is designed for one primary use. A given database server may in fact have multiple databases, but each is an independent entity, generally unaware of and not interacting with other databases even on the same system, let alone outside the confines of a department. They form data silos, giving rise to the impression that information is available within that silo, but that data in one silo is effectively invisible to data in any other silo.

There have been a number of attempts to combat siloization over the years. Data warehouses represent one approach, in effect moving to larger-scale centralized data centers. This approach generally concentrated the physical databases, but did relatively little for the sharing of information between them. Master Data Management (MDM) tried to capture the slower-moving common data in an organization, providing a way to synchronize resource identifiers across systems, but this did little to handle the growing number of non-relational data sources, from documents to NoSQL databases to data feeds and services. Data lakes look to be a new approach, in effect using Hadoop servers to store large amounts of older relational data, but again, this is largely an archival solution – consolidating the data, but not necessarily the conceptual models behind that data.

What none of these approaches provides is a way to simply ask a question of the data. That principle can best be summed up as Natural Language Processing, or NLP. While its definition has varied over the years, NLP typically means that you can give a command or ask a query without needing to know a specific programming or query language, or needing to know the underlying data model. Using NLP, for instance, it becomes possible to ask questions such as “Find me all countries in Europe where a major export is cheese, then show me which cheeses are the most popular in each country.”

This is a reasonably complex query, and not one that traditional text searches will handle well. Indeed, a traditional search will tell you whether you have a document that contains the various keywords and can use distance measuring to determine whether two terms may or may not be germane to one another. If your goal is to find documents that contain specific information, this approach is a good first pass proxy, but it doesn’t necessarily answer the question – it only points you in the direction of the appropriate document or library.

NLP takes a different approach. First, different types (or classes) of things are identified, such as countries, political unions, foods, and trade goods. Similarly, relationships and assertions are identified – France is a country. England is a country. The United States is a country. France is in Europe, England is in Europe, the United States is not. Cheese is a food. Food is a form of export for a country. Cheddar is a cheese, Brie is a cheese, wine is not a cheese, and so forth.
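As a rough illustration of the paragraph above, the classes and assertions can be held as simple subject-predicate-object triples. This is only a sketch: the predicate names `is_a` and `is_in` and the helper functions are assumptions made for the example, not part of any particular NLP product.

```python
# Each assertion is a (subject, predicate, object) triple.
# Predicate names ("is_a", "is_in") are hypothetical, chosen for this sketch.
ASSERTIONS = {
    ("France", "is_a", "Country"),
    ("England", "is_a", "Country"),
    ("United States", "is_a", "Country"),
    ("France", "is_in", "Europe"),
    ("England", "is_in", "Europe"),
    ("Cheese", "is_a", "Food"),
    ("Cheddar", "is_a", "Cheese"),
    ("Brie", "is_a", "Cheese"),
}


def holds(subject, predicate, obj):
    """Return True if the assertion has been recorded."""
    return (subject, predicate, obj) in ASSERTIONS


def instances_of(cls):
    """Return, in sorted order, everything asserted to be of the given class."""
    return sorted(s for (s, p, o) in ASSERTIONS if p == "is_a" and o == cls)


# Countries that are also asserted to be in Europe.
european_countries = [
    c for c in instances_of("Country") if holds(c, "is_in", "Europe")
]
```

Anything not asserted, such as wine being a cheese or the United States being in Europe, simply fails the `holds` check.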

Natural Language Processing then looks at all relationships to establish the core query, in this case, for example, using terms like “most popular” to identify that some measure is needed, most likely based upon per capita population that likes a given kind of cheese, and that the output will need to then take the maximum measure of a set of measures.
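The “maximum measure of a set of measures” step might look something like the following. The popularity figures are invented placeholders for the sake of the sketch, not real survey data, and the function name is hypothetical.

```python
# Invented placeholder measures: (country, cheese, share of people who like it).
# These numbers are NOT real data; they exist only to make the sketch runnable.
POPULARITY = [
    ("France", "Brie", 0.42),
    ("France", "Camembert", 0.38),
    ("England", "Cheddar", 0.55),
    ("England", "Stilton", 0.21),
]


def most_popular_by_country(rows):
    """For each country, keep the cheese with the highest measure."""
    best = {}
    for country, cheese, score in rows:
        if country not in best or score > best[country][1]:
            best[country] = (cheese, score)
    return {country: cheese for country, (cheese, _score) in best.items()}


winners = most_popular_by_country(POPULARITY)
```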

The relationships themselves may also be expressed in multiple ways – the containment relationship between France and Europe could be listed as “is in”, “is part of”, “is a member of”, “belongs to” and so forth. If you have a data model, you can also say that all countries are a part of a continent, which means that even without explicitly knowing that there is a relationship, semantics lets you say “Since France is a country, it must have an ‘is part of’ relationship with a continent”, and that there is a reciprocal relationship – ‘has part’ – that connects continents to countries. The combination of these two sets of inferences means that an NLP process can understand and properly interpret multiple ways of saying the same thing.
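The two inferences described above, mapping different phrasings onto one relationship and deriving the reciprocal relationship, can be sketched in a few lines. The synonym table and function names here are assumptions for illustration; a real system would draw them from its ontology.

```python
# Different surface phrasings mapped to one canonical relationship.
# This table is a hypothetical stand-in for what an ontology would supply.
SYNONYMS = {
    "is in": "is part of",
    "is part of": "is part of",
    "is a member of": "is part of",
    "belongs to": "is part of",
}

# Reciprocal relationships: if A "is part of" B, then B "has part" A.
INVERSE = {"is part of": "has part", "has part": "is part of"}


def normalize(phrase):
    """Map a phrasing to its canonical relationship name."""
    key = phrase.lower().strip()
    return SYNONYMS.get(key, key)


def with_inverses(triples):
    """Return the triples plus the reciprocal of each one that has an inverse."""
    result = set(triples)
    for subject, predicate, obj in triples:
        if predicate in INVERSE:
            result.add((obj, INVERSE[predicate], subject))
    return result


raw = [("France", "is in", "Europe"), ("England", "belongs to", "Europe")]
facts = with_inverses({(s, normalize(p), o) for s, p, o in raw})
```

After this runs, `facts` contains both the normalized “is part of” statements and the derived “has part” statements connecting Europe back to each country.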

It should be noted that there’s a lot of information that needs to be available to answer this query. Most data systems work upon the assumption that you collect a minimal amount of information, but NLP systems, much like Business Intelligence systems, work primarily by creating a gestalt, or world view, of the particular domain that you’re interested in.

This is one reason why NLP systems have first appeared in specialized domains, such as medicine, manufacturing or emergency response. Each of these domains is relatively bounded, so while there is a lot of information to be gained there, this information usually tends to be focused across perhaps a few dozen distinct data types. As has been known for some time, it’s actually easier to create expert systems than it is to create general purpose knowledge machines (such as IBM’s Watson, Apple’s Siri or Google’s Now applications).

What all of these have in common, by the way, is that they are actually distributed systems, not only in terms of the client devices, but also in terms of how the information itself is stored. In some respects, these bear some resemblance to how the human brain stores information. Each has a general model (an ontology) that retrieves common, frequently referenced information (what many in programming would call a cache). At the same time, one type of information that these systems hold is the location of deeper information. This referenced data is less frequently accessed but is usually far more comprehensive. This information is also “slower” – it takes longer to search, and requires less energy to maintain.
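A minimal sketch of this arrangement is a small fast cache plus an index recording where the deeper, slower information lives. The class and its method names are hypothetical, and plain dictionaries stand in for what would really be remote stores.

```python
class TieredStore:
    """Toy model: a fast cache plus pointers to slower, deeper sources."""

    def __init__(self, deep_sources):
        self.cache = {}                    # small, frequently referenced facts
        self.locations = {}                # key -> name of the deep source holding it
        self.deep_sources = deep_sources   # name -> dict standing in for a slow store
        self.deep_reads = 0                # counts trips to the slow tier

    def register(self, key, source_name):
        """Record where a piece of deeper information can be found."""
        self.locations[key] = source_name

    def get(self, key, promote=False):
        """Serve from the cache if possible, otherwise follow the pointer."""
        if key in self.cache:
            return self.cache[key]
        source = self.deep_sources[self.locations[key]]
        self.deep_reads += 1
        value = source[key]
        if promote:
            # Temporarily important data can be pulled into the fast tier.
            self.cache[key] = value
        return value


# Placeholder values only; the keys and contents are invented for the sketch.
store = TieredStore({"archive": {"last_season_record": "placeholder record"}})
store.cache["team_name"] = "placeholder team name"
store.register("last_season_record", "archive")
```

Calling `store.get("last_season_record", promote=True)` reads from the slow tier once, and later calls for the same key are then served from the cache.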

A good analogy to this is trying to remember the name of the local baseball or football team vs. trying to remember that team’s win/loss statistics for the previous season, or the current team’s lineup. If you’re interested in sports, you need the first piece of information for common reference, but the latter will only be used sporadically, often in very specific contexts. It makes no sense to keep both kinds of information “local”, though you may temporarily promote such information to having higher importance (and hence faster access times) when the situation calls for it.

Within the enterprise, this kind of data architecture is becoming increasingly known as Data Virtualization. This doesn’t mean the data is any less real (in comparison to any other kind of data), but does abstract out the specific need for the typical user to understand the appropriate data structure or query protocol, as well as abstracting out the need to have a specialized API. “Write me a report in PDF format showing the best places to buy different kinds of cheese in Europe.” is a command, not a query, but the command (write me a report) takes as parameters both an indication of output (“in PDF format”) and a query bounding the data and specifying the result set (“showing the best places to buy different kinds of cheese in Europe”).

Most data virtualization schemes rely upon a mechanism to convert relational or semi-structured data into RDF, a language for describing assertions or statements about information. RDF (the Resource Description Framework language) is itself an abstract framework, with different implementations of RDF being available for different types of data stores, but typically it makes use of specialized indexes, tables that relate one thing to another, to do the hard work of manipulating those assertions. RDF in turn makes use of a special query language called SPARQL that creates templates for matching assertions that in turn can generate tables of data, much like SQL does for a relational database.
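To give a feel for how a SPARQL-style template matches assertions and produces a table of results, here is a small pure-Python sketch. It is not SPARQL and uses no RDF library; the matching functions, predicate names, and sample triples (including which countries export cheese) are all assumptions made for illustration.

```python
# Sample assertions; the vocabulary and facts are illustrative only.
TRIPLES = [
    ("France", "is_a", "Country"),
    ("England", "is_a", "Country"),
    ("United States", "is_a", "Country"),
    ("France", "is_part_of", "Europe"),
    ("England", "is_part_of", "Europe"),
    ("France", "exports", "Cheese"),
    ("England", "exports", "Cheese"),
]


def match_pattern(pattern, triple, bindings):
    """Try to match one pattern against one triple.

    Terms starting with "?" are variables. Returns the extended bindings,
    or None if the triple does not fit the pattern.
    """
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None      # constant term does not match
    return extended


def query(triples, patterns):
    """Return every set of variable bindings satisfying all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Roughly comparable in spirit to a SPARQL query such as:
#   SELECT ?c WHERE { ?c a :Country ; :isPartOf :Europe ; :exports :Cheese }
# (the ":" vocabulary in that query is hypothetical).
results = query(TRIPLES, [
    ("?c", "is_a", "Country"),
    ("?c", "is_part_of", "Europe"),
    ("?c", "exports", "Cheese"),
])
```

Each entry in `results` is one row of the output table, with the variable `?c` bound to a matching country. A real triple store would use indexes rather than scanning every triple for every pattern.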

Not all data virtualization schemes use RDF, but most do make use of some similar semantic representation, getting away from the table/column/row paradigm of the traditional relational database and employing global concept identifiers and data assertions using multi-modal, or hybrid, data stores that can store not only relational data, but also document centric data such as XML and RDF, and typically a semantic layer built on RDF.

Such databases are intended to provide a way to store and access all forms of data, not just SQL tables or JSON, and also to provide a way to relate that information together into an underlying model that can intelligently provide the means to query or update that content, without necessarily needing to be aware that this is in fact how the information is stored. They are frequently internally distributed, and many are also designed to work in a federated fashion, pulling in information from other databases, document stores and object stores as well as updating to the higher latency cached servers such as Hadoop’s HBase, HDFS or Pig.

These servers are also poised to become the backbone for both digital asset management systems and business intelligence systems. In the former case, both data (the asset itself) and the associated metadata about that asset are stored in such a system, with this metadata making use of the shared information context. In the latter case, business intelligence systems are able to perform data analytics using the metadata from the system. The earliest systems will likely see these hybrid virtual servers as RDBMSs, but semantically aware analytics tools (such as R) are beginning to provide much richer access to this large-scale data environment.

It is still early days for such technologies, but systems such as MarkLogic, OpenLink’s Virtuoso, Apache Jackrabbit and others are already bearing fruit, while NLP systems such as Siri, Google Now, Bing, Watson and Wolfram are powering intelligent agents that communicate over mobile and computer devices, talking to general data virtualization servers on the cloud. In addition to these are a host of services handling everything from real time translation to recommendation engines (from Amazon to GoodReads to Rotten Tomatoes to Quixey).

Enterprise Data Virtualization can thus be seen as the next stage of the Big Data Revolution, taking advantage of the innovations for data acquisition and transformation that Big Data systems offer, providing and learning context for the data that it contains and manages, and providing interfaces that are both human and machine readable with little to no specialized knowledge.

Care for a slice of Gouda?

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.
