Metadata Management – Semantics for Big Data?

Semantics & the Art of Metadata Management


Over the course of the last six months, I’ve been fairly heavily involved in the realm of Metadata Management.

Never heard of it? You will.

Chances are, somewhere in your organization, there’s a room or two filled with media – images, audio files, video files, software, as well as paper and electronic versions of contracts, marketing copy, press releases, production graphics and all the other myriad media necessary to run a business, no matter how big that business is. Increasingly, these resources have moved out of file cabinets and onto hard drives – sometimes into folders organized by file conventions that were well defined ahead of time, sometimes just dumped willy-nilly into one giant Downloads folder. Yet even with the best of intentions, what worked for a few hundred images can become hopelessly disorganized when you start talking about tens of thousands to millions of text, image, audio and video files.

One thing worth considering is that these repositories are no longer just the domain of large libraries and archives. As digital convergence becomes the norm, these resources have become raw materials for all kinds of productions, and the digital archiving needs that were once typical only of the largest studios can now be found in even a modestly sized company, agency or university marketing department. We’re all multimedia producers now.

Digital asset management systems (DAMs) started appearing about six or seven years ago. Most of these systems worked by providing a metadata record between the physical asset (which may be stored in a specialized database) and the user. This metadata made it easier to organize this content, because titles, keywords, descriptions, publication information and other organizing data could be associated with each media file. In effect, this made electronic curation feasible for a growing number of resources.

An interesting characteristic of organizational data, however, is that it’s fractal – it tends to fold in on itself. There is a very human tendency to arrange this information into hierarchies, because hierarchies are easily navigated and fit with the notion that information can be “contained”. This hierarchical system of classification is known as a taxonomy, and it is perhaps the most frequently used way of dealing with metadata.

However, this organization comes at a price. Consider a sports team such as the Major League Baseball team the Seattle Mariners. If you were putting together a listing of teams within the MLB, one taxonomy that suggests itself for identifying this team would be something like:

 /Sports_Franchise/Major_League_Baseball/Western_Division/Seattle_Mariners.

On the other hand, if you were creating a directory for organizations in various cities, then you may have a very different taxonomy:

 /Seattle/Commercial_Enterprises/Sports_Franchises/Mariners.

Both of these taxonomies are valid (or, put another way, neither is invalid), but they rest on assumptions of very different contexts. Just as significantly, while they describe (presumably) the same organization, there is nothing in the names within these taxonomies that could tell a computer they are in fact the same entity.
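To make that concrete, here is a toy Python sketch (all paths and identifiers are illustrative, not drawn from any real system). The two taxonomy leaves don’t even match as strings, so equating them requires an explicit crosswalk to a shared identifier:

```python
# Two valid taxonomies that classify the same organization.
mlb_path = "/Sports_Franchise/Major_League_Baseball/Western_Division/Seattle_Mariners"
city_path = "/Seattle/Commercial_Enterprises/Sports_Franchises/Mariners"

# The leaf labels don't even match as strings...
leaf_a = mlb_path.rsplit("/", 1)[-1]   # "Seattle_Mariners"
leaf_b = city_path.rsplit("/", 1)[-1]  # "Mariners"
print(leaf_a == leaf_b)                # False

# ...so equating them takes an explicit crosswalk to one canonical identifier
# (the "entity:" name here is a hypothetical convention).
canonical = {
    mlb_path: "entity:Seattle_Mariners",
    city_path: "entity:Seattle_Mariners",
}
print(canonical[mlb_path] == canonical[city_path])  # True
```

The crosswalk dictionary is exactly the piece of information that neither taxonomy carries on its own.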

An increasingly common phrase is “an Internet of Things”. Like a lot of memes, it has multiple meanings. One interpretation is an Internet of devices or sensors, where each device is connected via some kind of wireless server to other devices. There’s the mobile view of the web, where the sensors in question are smartphones and tablets – each a computer that not only provides an endpoint for the user to connect to the web, but also serves content back to distributed hubs for data aggregation, part of the Big Data revolution. Add to this the explosive growth in passive RFID chips, which reveal themselves when they enter the coverage zone of an RFID receiver. Certainly this constitutes an Internet of Things.

Yet another way of looking at these is to see each such object as an “entity” with a known “address” that identifies it globally, while the properties associated with that entity – its metadata – describe the state of that entity at a particular time. If you remove the requirement that the address be a real-world location, then the address can in fact be thought of as an absolute name for that entity – even if the entity is not itself a physical construct but a conceptual abstraction. The “Seattle Mariners” is an entity, but that entity is not in and of itself a physical object; it is an organization.

The things that make up that organization – the players, the coaches, the management, the facilities – may refer to physical objects or people, but from an information management perspective this is simply a byproduct of an information model: there is the person “Felix Hernandez”, and then there is the information entity “Felix Hernandez”, which has a set of attributes – games played, games started, season earned run average, lifetime ERA, preferred pitches, win–loss record and so forth – that describe the entity. This latter entity is useful for working with data models and applications (Fantasy Baseball is essentially a data model, and a simulation on that model, using these entities as starting points).

What makes this significant, though, is that while Felix Hernandez is a baseball player, he is also a legal person with bank accounts, a mortgage, income tax records and so forth. This Felix has a whole different set of properties – he is, in effect, engaged in a completely different model, and no doubt has very different identifiers within these models – yet he is also undeniably the same person. The same can be said for the Seattle Mariners: it is simultaneously a baseball team and a business, and in an era of big data the former model (a team’s win–loss record over a given set of seasons) can have a very real impact on the latter (the profit or loss of the franchise as a business). The collective set of information about the Seattle Mariners can then be thought of as its metadata.

Metadata means many different things, though most are related. Metadata is the set of properties, and property values, that has been collected about a given entity or resource. It is the set of alternative values that any given property can take (the list of all teams in MLB, or in the American League West, or of sports teams in Seattle). It includes the relationships that exist between this entity and others (player:Felix_Hernandez is a member of team:The_Seattle_Mariners) and type associations (player:Felix_Hernandez is a baseballPosition:Pitcher). It is type constraints – Earned Run Averages are always three-decimal floating point numbers, as an example. This metadata has a history, it evolves over time … and it’s rapidly becoming overwhelming.
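All of those pieces – property values, relationships, type associations, constraints – can be sketched in a few lines of Python. This is a toy model, with illustrative identifiers and made-up statistics, not real MLB data:

```python
# A tiny triple-store sketch: every statement about an entity is a
# (subject, predicate, object) tuple.
triples = [
    ("player:Felix_Hernandez", "memberOf", "team:The_Seattle_Mariners"),   # relationship
    ("player:Felix_Hernandez", "hasPosition", "baseballPosition:Pitcher"),  # type association
    ("player:Felix_Hernandez", "seasonERA", 2.270),                         # property value
    ("team:The_Seattle_Mariners", "division", "league:AL_West"),
]

# Metadata also includes constraints: for instance, an ERA must be a
# non-negative floating point number.
def valid_era(value):
    return isinstance(value, float) and value >= 0.0

# Everything asserted about one subject is, collectively, that entity's metadata.
def describe(subject):
    return {(p, o) for s, p, o in triples if s == subject}

print(describe("player:Felix_Hernandez"))
```

Note that the “metadata about the metadata” – which predicates exist, what values they may take – lives in exactly the same structure, which is what makes this approach fold in on itself so naturally.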

Many organizations are now facing (or will soon face) a metadata crisis. As business units evolve and data software and systems emerge to solve specific problems, they introduce explicit models of the things the business cares about. Yet because most relational databases are designed for specific needs, these models of the very same things remain disconnected from one another; identifiers change depending on which database holds them, and businesses lose value because there is no clean way to see the interrelationships that exist between the different models. Organizations have started employing Master Data Management solutions to track keys across systems, but this process is at best a stop-gap – useful for synchronization, but for that very reason ill-suited as a queryable repository.

Similarly, many organizations have attempted to impose a top-down data model on all of their structures, but these efforts have a fairly dismal track record – primarily because any given entity quite often participates in multiple distinct models, but also because modeling is as much art as science, and different modelers may express information at different levels of granularity or see relationships to different types of objects.

This ultimately is where semantics comes in. Tim Berners-Lee, who was responsible for setting in motion what would become the World Wide Web in 1989 and 1990 by linking together physical addresses on the nascent web – at first via strings of numbers (IP addresses), then later via Uniform Resource Locators (URLs) – began wondering (in 2000) whether the same thing couldn’t be done with more abstract ideas: linking a resource (in this case identified by a Uniform Resource Identifier, or URI) via a relationship, or predicate, to either another resource or to a value. This assertion (subject, predicate, object) could be combined with other assertions to create a graph of connections – some connecting to the subject, some to the object, some even to the predicate – all providing pieces of description for each resource or entity in the system.

This Resource Description Framework (RDF) made possible a web of concepts – but unlike the physical web of routers, servers and clients, this web was made of resources, relationships and values. And just as the Internet evolved HTML, Javascript, images, protocols and all the myriad other things that make the web work, RDF evolved a schema language called RDFS (RDF Schema), later supplemented by OWL (the Web Ontology Language) and OWL 2, and a query language called SPARQL (which stands, recursively, for SPARQL Protocol and RDF Query Language) that made it possible to ask questions of RDF data using a SQL-like language. SPARQL was recently (2011) expanded into a much richer 1.1 version that adds update capabilities, turning it into a true database language.
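The operation at the heart of a SPARQL query is matching a graph pattern – a triple in which some positions are variables – against the set of assertions. A minimal Python sketch of that idea (illustrative identifiers, not a real SPARQL engine):

```python
# Toy basic-graph-pattern matching, the core operation behind SPARQL.
# Positions starting with "?" are variables; everything else must match literally.
triples = [
    ("player:Felix_Hernandez", "memberOf", "team:The_Seattle_Mariners"),
    ("player:Felix_Hernandez", "hasPosition", "baseballPosition:Pitcher"),
    ("player:Randy_Johnson", "hasPosition", "baseballPosition:Pitcher"),
]

def match(pattern, triple):
    """Return variable bindings if the triple fits the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            binding[p] = t
        elif p != t:
            return None
    return binding

# Roughly: SELECT ?who WHERE { ?who hasPosition baseballPosition:Pitcher }
query = ("?who", "hasPosition", "baseballPosition:Pitcher")
results = [b["?who"] for t in triples if (b := match(query, t)) is not None]
print(results)  # ['player:Felix_Hernandez', 'player:Randy_Johnson']
```

A real engine chains many such patterns together, joining their bindings – which is where the next point comes in.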

Of course, if this semantic language is so powerful, why isn’t it universally used? There has long been a chicken-and-egg hurdle facing RDF semantics. Languages like SPARQL are JOIN-intensive – in essence you have one gigantic table with three fields (it’s actually a bit more complicated than that) consisting of nothing but index keys into other parts of the table. This means that, much as with NoSQL data systems in general, you need sufficiently powerful computers and data storage systems to perform these joins in a reasonable amount of time. The lack of consistent update capabilities slowed adoption (the same thing happened with SQL, by the way – it wasn’t until SQL databases had a consistent update mechanism that they became ubiquitous). And the query language, while relatively simple, requires a different way of thinking about information.
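To see why these queries are JOIN-intensive, consider a sketch of how many triple stores actually lay the data out: terms are dictionary-encoded to integer keys, and every multi-step question becomes a self-join on one big three-column table. (The layout and identifiers here are a simplified illustration, not any particular engine’s internals.)

```python
# Dictionary-encode every term to an integer key.
terms = ["player:Felix_Hernandez", "memberOf", "team:The_Seattle_Mariners",
         "locatedIn", "city:Seattle"]
key = {t: i for i, t in enumerate(terms)}

# The "one gigantic table": rows of (subject_id, predicate_id, object_id).
table = [
    (key["player:Felix_Hernandez"], key["memberOf"], key["team:The_Seattle_Mariners"]),
    (key["team:The_Seattle_Mariners"], key["locatedIn"], key["city:Seattle"]),
]

# "What city does Felix's team play in?" requires a self-join:
# rows where the first row's object equals the second row's subject.
cities = [o2
          for (s1, p1, o1) in table
          for (s2, p2, o2) in table
          if o1 == s2 and p1 == key["memberOf"] and p2 == key["locatedIn"]]
print([terms[c] for c in cities])  # ['city:Seattle']
```

One two-hop question already means joining the table against itself; a realistic query with five or six patterns multiplies those joins, which is why hardware mattered so much to adoption.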

Finally, there was no real need for it except in very specialized circumstances until comparatively recently, when data began to flow out of its containers and into the wild, interacting with other data. It is this metadata jungle that SPARQL was intended from the beginning to address, and in places where it has been implemented, semantics has been a real game changer, making metadata management much easier. Semantics is not a panacea – there is still work to be done up front to design the data model and identify the core entities within a system – but it is quite powerful in its ability to compare and manipulate not only data but data models, which is a fairly radical capability.

Perhaps what makes semantics so exciting in the space of metadata management is that it becomes increasingly possible to combine semantics with search. Not all metadata lives in relationships – in almost any system, titles, descriptions, abstracts, body copy, summaries, biographies and other narrative content also contain relevant information about resources. Textual searches, field (or element) searches, lexicons and thesauri can match text query strings and identify candidate fragments; semantic solutions can then use inferencing logic to eliminate the things that aren’t relevant while finding relationships that may extend across four or five (or even dozens of) links between documents.

For instance, given the expression “Felix Hernandez”, you can find the teams he played for, the catchers he worked with, the players he faced and the teams that they belonged to, all through this combination of search+semantics. Search by itself is not sufficient – it can only infer connections via word proximity, order and phrasing. Semantics by itself needs the scaffolding of search to connect URI-based index keys with labels and descriptions that are meaningful to people. Together these two capabilities complement one another, and increasingly it is this combination of search+semantics that is becoming the game changer in industries as diverse as publishing, education, military analysis, manufacturing, legal analysis and retail.
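The two-step dance – search resolves a human phrase to URIs, semantics follows typed links from there – can be sketched in a few lines. Again, the labels, URIs and predicates below are illustrative:

```python
# Step 1's index: a text index mapping entity URIs to human-readable labels.
labels = {
    "entity:Felix_Hernandez": "Felix Hernandez",
    "entity:Seattle_Mariners": "Seattle Mariners",
}

# Step 2's graph: typed relationships between those URIs.
triples = [
    ("entity:Felix_Hernandez", "memberOf", "entity:Seattle_Mariners"),
]

def search(text):
    # Search: resolve a human phrase to candidate URIs via label matching.
    return [uri for uri, label in labels.items() if text.lower() in label.lower()]

def teams_of(uri):
    # Semantics: follow typed relationships from that URI through the graph.
    return [o for s, p, o in triples if s == uri and p == "memberOf"]

for uri in search("Felix Hernandez"):
    print([labels[t] for t in teams_of(uri)])  # ['Seattle Mariners']
```

Search gets you from the phrase to the node; the graph gets you everywhere the node is connected – and back to labels at the end, so the answer is readable by a person rather than a list of URIs.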

Semantic Metadata Management is a new practice for Avalon Consulting, LLC, intended to help companies and organizations get a better handle on their data design and modeling, digital asset management, MDM coordination, virtual production workflows and curation problems by applying semantic technologies, data enrichment, map/reduce and machine learning to large-scale information management. While we have certain technologies that work best for us, we believe that effective metadata management is platform agnostic, and as such we have developed competencies across a number of semantic data systems and integration platforms, the better to provide effective metadata management solutions for you and your organization.

 

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.

Comments

  1. Excellent article! Of particular interest was your discussion of combining semantics and search to tackle the problem of metadata management. Your “Felix Hernandez” example brings home the point that the search/semantics combo can be very powerful, allowing enterprises to do things they’d not be able to do with a traditional SQL-on-RDBMS approach to metadata management.

    Also, I agree with your comments about master data management initiatives being a stop gap if they’re based on traditional “model everything first in a relational system” approach. The resulting system is inflexible and just can’t keep up with rapid change.

  2. Hi Kurt,

    I’m trying to think in Turtle lately, and have questions regarding this line in your post:
    “(player:Felix_Hernandez is a member of team:The_Seattle_Mariners) and type associations (player:Felix_Hernandez is a baseballPosition:Pitcher)”.
    Are you suggesting “player”, “team” and “baseballPosition” are name space prefixes? This is just your short hand for compressing a larger set of triples, right?
    Your search+semantics points are SO spot-on!

    • Bob,

      In essence, yes, the prefixes are intended to show a “type” model that tends to be less obvious when you use a formal namespace. For instance, when I talk about player:Felix_Hernandez, this likely refers to something like “http://seattlemariners.mlb.com/ns/class/Player#Felix_Hernandez”. The prefix would then be assigned prior to the query:

      prefix player: <http://seattlemariners.mlb.com/ns/class/Player#>

      I find that it makes a lot of sense to conceptually use namespaces as type identifiers, even though it tends to lead to more namespaces. Usually, in that case I’ll store the prefixes and related namespaces in a Javascript hash map, then generate all of the namespaces as a header, even if they are not necessarily used in the query (it has a minor performance impact).
