Ontology for Fun and Profit

The IT sector is notable for coming up with unique job titles – from the various C permutations – CIO, CTO, CDO, etc. – to Test Wranglers, Scrum Masters and UX Designers. One that you’re likely to start hearing more about is the title of Ontologist.

While this sounds like it could be something more likely to be found in a medical clinic, perhaps specializing in diseases of the lower intestines or something equally disturbing, the ontologist’s role actually has more to do with linguistics … and with data modeling, especially at the enterprise level.

When a programmer puts together an application, one part of that is the definition of the key “things” that the application is concerned with, along with the properties these things have, the actions that they can perform, and the events that they respond to. Programmers refer to this as object oriented programming, but another way of thinking about it is that these programmers are, in effect, adding new words to a computer language, each with its own “meaning” – its impact upon the computer system, as expressed in code.
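To make that concrete, here is a minimal sketch (the `Account` class and its members are hypothetical, not from any particular application) of a “thing” defined in code with properties, an action it can perform, and an event it responds to:

```python
class Account:
    """A new 'word' in the application's vocabulary: an account entity."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner          # property
        self.balance = balance      # property
        self._listeners = []        # callbacks fired when state changes

    def deposit(self, amount):      # an action the thing can perform
        self.balance += amount
        self._notify("deposit", amount)

    def on_change(self, callback):  # register a handler for change events
        self._listeners.append(callback)

    def _notify(self, event, amount):
        for callback in self._listeners:
            callback(event, amount)

acct = Account("alice")
acct.on_change(lambda event, amount: print(f"{event}: {amount}"))
acct.deposit(50.0)                  # prints "deposit: 50.0"
```

The class name, its properties, and its methods together give the computer a “meaning” for the word Account: a definition of what it holds and what it can do.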

Database designers face a similar problem, though in this case the language they are defining includes not only the terms involved but also the relationships that exist between terms. While not always true, there is typically a clear relationship between a particular table in a SQL database and a class of entities – people, accounts, completed sales, and so forth. In the trade, this is known as an Entity-Relationship (ER) model, with each field defining either an “atomic” property (one that can, in general, be described as a string of text, a number or a date), a primary key or id, or one or more foreign keys. These foreign keys in turn match primary keys in records (rows) of other tables – another way of saying that a relationship exists between the record in one table and a record in another table.
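The ER pattern described above can be sketched with SQLite (the table and column names here are illustrative): each table models a class of entities, with atomic fields, a primary key, and a foreign key that links a row in one table to a row in another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id      INTEGER PRIMARY KEY,   -- primary key
        name    TEXT,                  -- atomic property (string)
        born    TEXT                   -- atomic property (date)
    );
    CREATE TABLE account (
        id        INTEGER PRIMARY KEY,
        balance   REAL,                          -- atomic property (number)
        person_id INTEGER REFERENCES person(id)  -- foreign key: the relationship
    );
""")
conn.execute("INSERT INTO person VALUES (1, 'Alice', '1980-04-02')")
conn.execute("INSERT INTO account VALUES (10, 250.0, 1)")

# Following the foreign key recovers the relationship between the two entities.
row = conn.execute("""
    SELECT person.name, account.balance
    FROM account JOIN PERSON ON account.person_id = person.id
""").fetchone()
print(row)   # ('Alice', 250.0)
```

The join is simply the relationship made explicit: the foreign key `person_id` in `account` points back at the primary key `id` in `person`.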

This collection of entities, properties and relationships together form what is known as a data model. Data models differ from object models in one critical way. Object models typically have methods, properties, relationships and events, while data models usually only have properties and relationships. This, however, is not as big a difference as it seems; most of those (procedural) methods change the state of an underlying data model in some way, just as most events indicate that a state has changed within the data model.

In object oriented programming, the data model tends to exist partially within the various code objects, with visibility into these objects restricted in scope. When working with a database, however, the objects in the application effectively store their state within the database, not within the objects themselves. For many applications, especially those where performance is critical and the data is local and highly transient – a game, for instance – storing the state within the objects makes sense. But even in a game, this state is periodically “serialized” (turned into a temporary sequence of information) then persisted to a longer-term data model, so that it can be reloaded later (such as a person saving a game at a save point so they can stop and come back later).
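The save-point pattern above can be sketched as follows – transient in-object state is serialized into a durable form, then reloaded in a later session. (`GameState` and its fields are hypothetical; a real game would write the serialized string to disk.)

```python
import json

class GameState:
    def __init__(self, level=1, score=0):
        self.level = level
        self.score = score

    def save(self):
        # Serialize: turn in-memory object state into a persistent form.
        return json.dumps({"level": self.level, "score": self.score})

    @classmethod
    def load(cls, data):
        # Deserialize: rebuild the object from the persisted form.
        fields = json.loads(data)
        return cls(level=fields["level"], score=fields["score"])

game = GameState(level=3, score=1200)
saved = game.save()                    # e.g. written to a save file
restored = GameState.load(saved)       # reloaded in a later session
print(restored.level, restored.score)  # prints "3 1200"
```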

As “Big Data” moves to the fore, though, this model is changing. The data model becomes more important, because more than one person or process is going to end up using that data. Indeed, there may be hundreds, thousands or millions of people using the same data, with dozens of applications needing to talk to it. This is where database design becomes critical. When programmers put together applications, they may share a database, but in many organizations what typically happens is that the databases become “siloed” around a specific set of applications in a given department, and communication breaks down between these databases and other databases designed to track and manage information about overlapping entities across departments. An organization may have lots of data, but without some way to establish agreements about the entities involved – even how those entities are identified – the data may as well not exist outside its original use.

Put another way – it would be very much like one department coding everything in English, a second in German, a third in Russian and a fourth in Japanese. As anyone who has ever had to learn a second language knows, translating between these is very seldom just a matter of swapping one word or phrase for another. Each language has its own grammatical structures and rules, and there may be subtle (or glaring) differences in idiom – in how two different languages express the same concept. With natural languages, translating falls into the domain of a linguist. The same thing applies to translating data models, and this falls into the realm of a computer linguist – which is more or less exactly what an ontologist does.

The ontologist identifies the data models in play within an organization, then finds ways to map between these models, or to create a super-model (often called a canonical model) that identifies the concepts and important relationships that the enterprise overall needs to track. Often that ontologist will work with database architects to map relationships with various data stores, and may also work with senior developers and system architects to set up a framework to consistently build upon the language that the enterprise is using for internal communication.
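A minimal sketch of the canonical-model mapping described above: two departments describe the same entity with different field names, and a per-source mapping translates each record into the shared vocabulary. All the source and field names here are illustrative.

```python
# Per-department mappings from local field names to the canonical model.
CANONICAL_MAPPINGS = {
    "sales":   {"cust_nm": "customer_name", "acct_no": "account_id"},
    "support": {"client":  "customer_name", "account": "account_id"},
}

def to_canonical(source, record):
    """Rename a source record's fields into the enterprise canonical model."""
    mapping = CANONICAL_MAPPINGS[source]
    return {mapping.get(key, key): value for key, value in record.items()}

a = to_canonical("sales",   {"cust_nm": "Alice", "acct_no": "A-10"})
b = to_canonical("support", {"client": "Alice",  "account": "A-10"})
print(a == b)   # prints True: both records now describe the same entity
```

Real canonical mappings also have to reconcile grammar and idiom – units, code lists, identifier schemes – not just field names, which is precisely why the work resembles translation rather than simple renaming.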

These ontologists then work to translate these conceptual (or logical) models into physical models – XML Schema files for XML documents, UML for object models, ER diagrams for databases, RDF and OWL data models for semantic systems – the artifacts that developers, database designers and admins, system architects and so forth need in order to power their applications.
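For the semantic (RDF-style) physical models mentioned above, the core idea is that facts are expressed as subject-predicate-object triples. The sketch below models a tiny triple store with plain tuples – the URIs are illustrative, not a real enterprise vocabulary, and a production system would use an RDF library and SPARQL instead.

```python
EX = "http://example.org/"

triples = [
    (EX + "alice",   EX + "type",         EX + "Customer"),
    (EX + "alice",   EX + "name",         "Alice"),
    (EX + "alice",   EX + "holdsAccount", EX + "acct-10"),
    (EX + "acct-10", EX + "balance",      250.0),
]

def match(pattern, store):
    """Naive triple-pattern query: None acts as a wildcard, like a SPARQL variable."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything the store knows about alice:
for triple in match((EX + "alice", None, None), triples):
    print(triple)
```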

As such, organizational ontologists are critical for large scale projects – they work with business analysts to determine the model that most closely fits the various requirements on a system, but also work to ensure that the model is useful for applications beyond the immediate task at hand. They may also be heavily involved in data governance, as one of the purposes of the data model is to provide a consistent framework for understanding the lifecycle of data.

What makes for a good ontologist? Typically, experience with data standards development plays a big part, along with a solid understanding of UML and modeling requirements, experience with both relational and non-relational (XML, JSON, RDF) data stores and query languages, and some time in the field as a programmer or application architect. Many ontologists are fully grounded in semantic technologies, and so are fluent with RDF, OWL and similar languages, or have worked to define XML schemas and vocabularies for large organizations. A linguistics background doesn’t hurt – though more at the theoretical level than at the translation level – and many ontologists tend to be involved with natural language processing, machine intelligence and AI related projects.

Like other technical specialists, ontologists tend to command fairly high salaries, anywhere from $90K up to $200K for people with extensive experience. It is also one of those areas where practitioners are more likely than most to set themselves up in consulting practice, as ontological work tends to be concentrated and discrete rather than long-running and continuous.

On the flip side, hiring an ontologist as a consultant is typically money well spent, as a good ontologist can help an organization gain access to much if not all of the information that it generates about its business processes, personnel and products. As such, he or she can save millions of dollars by reducing the need for duplicate or overlapping data initiatives, by providing a general framework that can dramatically cut down on development time and documentation for new projects, and, with some data, even by opening up potential revenue opportunities through licensable data and metadata content.

As data and metadata interchange becomes the lifeblood of companies in the coming years and decades, the ontologist will become a central member of your data management team, taking his or her place alongside the visualization designer and the information analytics specialist as part of the next wave of IT professionals.

Kurt Cagle is Principal Evangelist for Avalon Consulting, LLC (http://www.avalonconsult.com), a consulting company specializing in big data solutions, data analytics and data management strategies, and is the author of several books on web data technologies and practices. (And by the way, if you’re looking for work as an ontologist, we’re hiring!)

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.
