Why Big Data Will Make Semantics Feasible … and Semantics Will Make Big Data Worthwhile

You’ve invested the money in a Hadoop stack, from Hortonworks or Cloudera or any of a half dozen other Hadoop solutions providers, trained up your Java developers, set up HBase and HDFS, set up the requisite parallel process pipelines and worked up the algorithms, and now you have a potent grid computing system. You run a pilot project or two, the programmers are happy because they’re doing the latest and greatest, and you finally gear up to turn this massive processing turbine on all of the problems that the company has … only to discover that there simply aren’t that many problems that can’t be handled by garden-variety SQL databases or even the new crop of NoSQL databases such as Couchbase or Cassandra. You’ve chewed through the logs from your sales transactions, and discovered that you did in fact sell a lot of stuff last year, but in reality such logs have only marginal utility unless they can be related to something else.

A similar case can be made for document enrichment. Tools that can both recognize terms and embed them in documents are definitely useful for being able to do searches based upon keywords, but this generally works best for identifying when two or more documents share keywords, and while this can more narrowly define “related” matches, it won’t necessarily help you in identifying what those relationships are. Thus, document enrichment for the most part has been seen as being primarily a niche market, and making the case for sometimes expensive licenses becomes somewhat harder if the value proposition is not there.

It is for these and similar issues of “data farming” that semantic technologies offer a major benefit – and, conversely, these are areas where semantics can in turn be supported more effectively by these other technologies.

Semantic RDF can best be thought of as “SQL for other people’s data”. Most relational databases are built primarily around the proposition that you have control over your own data. You can define the schemas, can relate the indexes, can determine the data types and data models. In a great number of cases, perhaps even the majority of them, this is precisely what is needed. You know your own problem domains, the information is “local” to a specific set of object types and adding or removing content involves adding or removing rows from a table.

The problem comes when you start dealing with other people’s data. The “other people” in this case may be people outside of your organization, may be people in a different department, may even be someone in the same department but using data that he or she developed separately from yours. Getting that data from their system to yours, or vice versa, has always been one of the greater problems of database development. Certainly, the information can be serialized (originally through some kind of data binding object, more recently through the use of XML or JSON), but managing the indexes has always been tricky, and in general the possibility of data collision has always been high.

Semantics makes two underlying changes to this model. The first is to identify each of the things – the records about people, organizations, products, activities, and so forth – in a given data environment with a unique uniform resource identifier (or URI). This URI (or IRI, for internationalization purposes) takes the place of numeric keys, and also serves to shift the focus away from records and towards “resources”, which is a somewhat more abstract concept. There may be many systems that have different primary indexes for the same thing within your environment, but each system would share the same IRI for that thing.

The second change is what’s called the Open World Model. Imagine for a given record in a table in a SQL database that you were to retrieve the primary record ID for that record in that table, the column name and the associated value for that row and column. These three values are known as a triple. The record ID corresponds to the aforementioned URI for the resource of that record, something referred to as the subject. The second term, the column name, corresponds to a property or predicate, while the third term, either a value or an index to another record, becomes the object of the triple. If it’s an index, then that object also gets replaced with a URI.

In practice, what this means is that you can take a table row with eight columns and turn it into eight assertions that describe that same information for that record. This may seem like you’re introducing a lot of redundancy into the system (and you are, to a certain extent), but one very important consequence of this is that the resource in question is no longer bound to just those properties that are initially defined. In theory, you can make any number of assertions about each resource in the system.
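
The row-to-assertion conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `http://example.org/` namespace, the `row_to_triples` helper, and the sample `product` row are all hypothetical, and the key column is folded into the subject IRI rather than emitted as its own assertion.

```python
# Hypothetical base IRI; any stable namespace you control would do.
BASE = "http://example.org/"


def row_to_triples(table, key_column, row, foreign_keys=None):
    """Turn one relational row (a dict) into (subject, predicate, object) triples.

    foreign_keys maps a column name to the table it points at, so that
    an index into another record is replaced with that record's IRI.
    """
    foreign_keys = foreign_keys or {}
    subject = f"{BASE}{table}/{row[key_column]}"
    triples = []
    for column, value in row.items():
        if column == key_column:
            continue  # the key is absorbed into the subject IRI
        predicate = f"{BASE}{table}#{column}"
        if column in foreign_keys:
            # An index to another record becomes an IRI, not a bare number.
            value = f"{BASE}{foreign_keys[column]}/{value}"
        triples.append((subject, predicate, value))
    return triples


# Illustrative row: one key column plus three ordinary columns.
row = {"id": 42, "name": "Widget", "price": 9.99, "maker_id": 7}
triples = row_to_triples("product", "id", row, foreign_keys={"maker_id": "company"})
for triple in triples:
    print(triple)
```

Because each assertion stands alone, adding a new fact about the same product later is just one more tuple with the same subject; no table needs to be altered.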

This is both semantics’ greatest strength and greatest weakness. When you query the database that holds all of these assertions, what you are then doing is constraining the set of all terms such that one or more of the three terms in the assertion are matched, while one or the other of the terms constrains other assertions. You no longer need to know which table a given property is in, and because you are dealing with global identifiers, you no longer even need to have all of the assertions in the same database. This makes it possible to create queries without necessarily knowing everything (or potentially even much of anything initially) about the schema of the system. This capability in turn makes it possible to create assertions as you chain such queries together, in effect making inferences that are much more sophisticated than you can do with a relational database, and can even make such inferences working with other people’s data.
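
To make the idea of constraining triples and chaining queries concrete, here is a minimal Python sketch. It is not how a real triple store is implemented; the `match` helper, the `EX` namespace and the two small data sets (`ours` and `theirs`) are assumptions made purely for illustration.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


EX = "http://example.org/"  # hypothetical namespace

# Assertions held in "our" database.
ours = [
    (EX + "product/42", EX + "name", "Widget"),
    (EX + "product/42", EX + "madeBy", EX + "company/7"),
]
# Assertions published by someone else, using the same global identifier.
theirs = [
    (EX + "company/7", EX + "name", "Acme"),
    (EX + "company/7", EX + "locatedIn", "Seattle"),
]

# Shared IRIs mean the two sets can simply be combined.
store = ours + theirs

# Chain two patterns: who makes product 42, and what is known about them?
makers = [o for (_, _, o) in match(store, s=EX + "product/42", p=EX + "madeBy")]
maker_facts = [fact for maker in makers for fact in match(store, s=maker)]
print(maker_facts)
```

Note that neither pattern names a table, and the second step pulls in facts that came from a different source entirely. The linear scan in `match` also hints at the cost: without tuned indexes, every pattern has to be checked against every assertion.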

On the other hand, you need more horsepower to make these types of queries, because you’ve lost a certain amount of optimization of the indexes involved. Similarly, while transforming a relational database to a semantic one is relatively easy (primarily involving the consistent conversion of indexed keys to URIs), other data formats are not necessarily as easy to process. For HTML content, or XML document languages such as DocBook or DITA, the presentation/document specific aspects of the markup may actually contain little in the way of semantic information about the meaning of specific passages within the document. Thus the “interesting” semantics are not necessarily how the article is presented but what the article is about. This is an example where Hadoop and similar map/reduce infrastructures, working in conjunction with content extraction tools or products, could identify and autotag resources and relationships within these documents.
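
A rough sketch of the autotagging idea follows, written as a map step and a reduce step in plain Python. It runs in a single process and is not Hadoop code; the `VOCAB` term list, the `mentions` predicate and the two sample documents are hypothetical, and a real content extraction tool would do far more than match whole words.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace

# Hypothetical vocabulary mapping surface terms to concept IRIs.
VOCAB = {
    "hadoop": EX + "concept/Hadoop",
    "sparql": EX + "concept/SPARQL",
}


def map_doc(doc_id, text):
    """Map step: emit (concept IRI, document IRI) for each known term found."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    for term, concept in VOCAB.items():
        if term in words:
            yield concept, EX + "doc/" + doc_id


def reduce_pairs(pairs):
    """Reduce step: group the document IRIs under each concept."""
    grouped = defaultdict(set)
    for concept, doc in pairs:
        grouped[concept].add(doc)
    return grouped


docs = {
    "a": "Hadoop stores the logs.",
    "b": "SPARQL queries run over Hadoop output.",
}

pairs = [pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text)]
by_concept = reduce_pairs(pairs)

# Express the result as triples describing what each document is about.
triples = sorted(
    (doc, EX + "mentions", concept)
    for concept, doc_set in by_concept.items()
    for doc in doc_set
)
for triple in triples:
    print(triple)
```

The map step is independent per document, which is what makes this kind of tagging a natural fit for a divide and conquer infrastructure.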

Ironically, Big Data infrastructures also make inferential processing more reasonable. Suppose that you had a list of products, an indication of when (and at least indirectly, by whom) they were purchased, and you wanted to find out all of the ways that the products may be related to one another in order to create a more effective marketing campaign. The common factors may not necessarily be something that can be found simply by looking at common keywords or groups, but instead may be due to a relationship that each item has with some common thing (they share a common set of stores, for instance) or even something where there may be two orders of separation between the resources (a product may be produced by a company under different brands). When the information (and potential numbers of relationships) begins to get huge, searching shifts from synchronous queries within in-memory databases to having to search in a federated manner.
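
The “two orders of separation” case can be illustrated with a small Python sketch that follows two predicates in sequence and then looks for products that end up at the same place. The product, brand and company names, the `soldUnder` and `ownedBy` predicates, and both helper functions are invented for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: products are sold under brands, brands are owned by companies.
triples = [
    ("product:soap", "soldUnder", "brand:Fresh"),
    ("product:shampoo", "soldUnder", "brand:Glow"),
    ("product:wrench", "soldUnder", "brand:Torque"),
    ("brand:Fresh", "ownedBy", "company:Acme"),
    ("brand:Glow", "ownedBy", "company:Acme"),
    ("brand:Torque", "ownedBy", "company:Bolt"),
]


def two_hop(triples, first_pred, second_pred):
    """Map each subject to the resources reached via first_pred then second_pred."""
    first = defaultdict(set)
    second = defaultdict(set)
    for s, p, o in triples:
        if p == first_pred:
            first[s].add(o)
        elif p == second_pred:
            second[s].add(o)
    reach = {}
    for s, middles in first.items():
        targets = set()
        for middle in middles:
            targets |= second.get(middle, set())
        reach[s] = targets
    return reach


def related_pairs(reach):
    """Return pairs of subjects that share at least one two-hop target."""
    pairs = []
    for a, b in combinations(sorted(reach), 2):
        shared = reach[a] & reach[b]
        if shared:
            pairs.append((a, b, sorted(shared)))
    return pairs


reach = two_hop(triples, "soldUnder", "ownedBy")
related = related_pairs(reach)
print(related)
```

Neither product shares a keyword or a brand with the other; the link only appears once the second hop is followed. The pairwise comparison in `related_pairs` grows quadratically with the number of products, which is one reason this kind of search pushes toward distributed processing as the data gets large.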

Recently there has been some interest on a number of fronts in building such a distributed semantic data system on top of Hadoop, and other systems (such as CouchDB and Cassandra) can already be configured to handle at least some aspects of these queries. The key is the development of a common query language, which now seems to be converging towards the SPARQL language produced by the W3C (the same group that brought you HTML and XML). SPARQL is SQL writ large, intended for working with web based federated triple stores built upon the open world assumption.

SPARQL has the potential to be bigger than SQL was in the 1980s and 1990s, although it’s worth noting that many of the same people who worked on SQL as young hotshots contributed to the SPARQL specification some twenty years later with the knowledge about the web, extensible query functions and a huge amount of “best practices” experience behind them. It can be applied across a “database” that spans dozens or even hundreds of service “end points” across the Internet, but it can also be applied to an in-memory database on a web page in a browser.

Additionally, when OWL definitions (themselves RDF) are included for various resources, they make discovery possible – a single query can retrieve all of the relevant properties, relationships and constraints acting on a given type of object. While such an approach does not work as well with SOAP based SOA services, it works remarkably well with REST oriented ones.
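
As a sketch of what such discovery might look like, the Python below treats schema statements as ordinary triples and asks a single question of them: which properties are declared for a given type, and what do they point to? It uses only the RDF Schema `domain` and `range` terms that OWL ontologies commonly build on; the `EX` namespace, the sample schema and the `describe_class` helper are assumptions for illustration, and a real OWL model would carry many more kinds of constraint.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/"  # hypothetical namespace

# Schema statements are themselves triples, stored alongside the data.
schema = [
    (EX + "name", RDFS + "domain", EX + "Product"),
    (EX + "name", RDFS + "range", XSD + "string"),
    (EX + "madeBy", RDFS + "domain", EX + "Product"),
    (EX + "madeBy", RDFS + "range", EX + "Company"),
    (EX + "locatedIn", RDFS + "domain", EX + "Company"),
]


def describe_class(triples, cls):
    """List each property declared on cls along with its declared range, if any."""
    props = sorted(s for (s, p, o) in triples if p == RDFS + "domain" and o == cls)
    ranges = {s: o for (s, p, o) in triples if p == RDFS + "range"}
    return {prop: ranges.get(prop) for prop in props}


product_shape = describe_class(schema, EX + "Product")
for prop, rng in product_shape.items():
    print(prop, "->", rng)
```

A client that has never seen this data before can use the result to learn that a product has a name and a maker, and that the maker is itself a resource worth following.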

Given that, a SPARQL layer on top of HBase, Hive or HDFS, or other map/reduce type architectures would be incredibly useful. The challenge is in getting there. A few commercial products sit on top of various implementations of Hadoop (such as Loom by Revelytix, which sits atop Hortonworks’ implementation). Seattle based GraphLab is working on a Hadoop oriented semantic data system. Additionally, companies such as MarkLogic have recently announced support for both SPARQL and Hadoop in their upcoming server release. Given the potential that Hadoop has both in producing RDF and searching it in a distributed (map/reduce) fashion, it’s likely that other companies will be exploring this niche within the year.

As data analytics increasingly require “other people’s data” to find key relationships, perform business intelligence analytics and handle pattern and sentiment analysis, RDF triples and SPARQL provide a universal, standards-driven approach to make such queries feasible and manageable. Whether this be on Hadoop or other platforms (such as existing triple stores like TopQuadrant, Virtuoso, AllegroGraph or OWLIM, which are themselves moving into the same big data space), semantic technologies are making their way into the Big Data space in a Big Way.

Kurt Cagle is Semantic Information Architect for Avalon Consulting LLC, specializing in semantic technologies, enterprise search and metadata management for Fortune 500 companies and Federal and state agencies.

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.


  1. The open source project,

    is a big data project, but it is not on top of HBase, Hive or HDFS, or other map/reduce type architectures, right?

    So, a fundamental question: Does Big Data have to be on top of HBase, Hive or HDFS, or other map/reduce type architectures?

    • Sam,

      Nope. Keep in mind that “Big Data” is not a technology, it’s a problem … it is in fact several problems. What do you do with data that’s too high velocity? too varied in structure? that comes in large data sets? that isn’t connected conceptually to your data models? Map/reduce is an effective strategy for some of these problems, especially those that lend themselves to a divide and conquer approach (anything involving indexing, for instance), but not all problems fit neatly into this arena, and Hadoop isn’t always ideal for this type of processing. If I have data that’s in JSON, then Cassandra, Couchbase with Node, or MarkLogic 6+ all have more efficient indexers that don’t have to be custom written. If the data is in XML, MarkLogic or eXist are both good tools. Indexing in relational databases is generally highly efficient. Recursive stack processing may or may not lend itself well to Hadoop, but a lot of processing out there is more effectively done recursively.

      In other words, while HBase, Hive and HDFS are all useful tools, they also have a significant overhead, especially in ETL, that may already be solved with other platforms.
