In an earlier blog post, I wrote about the need for “flexible models” when managing digital asset metadata. A flexible model allows different digital assets to have different properties. It is also designed to handle the fact that we cannot know exactly what type of metadata needs to be stored. This flies in the face of most database design, which is typically predicated on knowing exactly what information we are going to collect.
In this post, I will look at implementing a flexible data model inside a relational database. The key principle is to create tables that hold generic name-value pairs rather than sets of specific attributes. While seemingly simple, this approach becomes quite complicated. In this post I will thus build up the model from a simple form to one that is more complete.
Figure 1. Simple “schemaless” model allows any number of name-value pairs to be associated with digital assets.
In the model depicted in Figure 1, every property of an asset is represented by a single “Property” entity. Rather than specify the number of allowed properties, the model allows the user to enter as many as necessary. Each Property contains attributes for its name and value, as well as a data type so that the calling application can determine how to format and validate the value. For example, a Property with a DataType of “integer” would only consist of whole numbers, while a “date” would take the form MM/DD/YYYY.
A major limitation of the model shown in Figure 1 is that it cannot handle multi-part metadata. A common example of this type of data is a person’s name. When the model is translated to a table, names would need to be entered like this:
Table 1. Example table holding name-value pairs representing a person’s name
The FirstName and LastName are associated with the asset via the AssetId, but the two Properties are not clearly associated with one another. This creates a problem if a second name is entered for the same asset:
Table 2. Table illustrates the difficulty relating FirstName properties to LastName properties when more than one person is entered for a single asset.
In Table 2, it is unclear whether John’s full name is John Doe or John Smith because the FirstName and LastName pairs are not grouped in any way. The model should thus be refined by adding a grouping key:
Figure 2. Model allows related properties to be grouped together.
Table 3. Grouping key can express the relationship between FirstName and LastName properties.
The model in Figure 2 is better because now it is possible to see that “John” and “Doe” belong to the same group. However, the model is still not satisfactory. On the technical side, the model is not in third normal form because the Name and DataType attributes are not fully dependent on the primary key of the asset. From a usage perspective, there is also no clear way to control the ordering of the property groups. For example, how would the user specify that John Doe’s name should appear before Jane Smith? This is an important consideration if you are listing names in a catalog or on a web page.
The final version of the data model (shown in Figure 3) solves these normalization and ordering problems by creating a new table to represent the property type and by adding an attribute that can control the sorting of groups.
Figure 3. Model is now in Third Normal Form and provides an attribute for sorting groups of properties.
This version of the model demonstrates the difficult of achieving a flexible model inside a relational database. Querying for metadata requires joining four different tables, dealing with grouping keys, and then sorting values. And despite all this effort, the data model still cannot handle truly complex data. For example, what happens if the user wants to associate one or more addresses or phone numbers with a specific person? This would require creating groups within groups, which simply cannot be handled without greatly complicating the model and the ability to query it.
These problems underscore the fact that relational databases are not optimized for data that does not adhere to a single schema. If the metadata associated with your assets is more complex than name-value pairs or a few multi-part fields, then a relational database will probably not be able to provide the storage you need.
These shortcomings are well known and have led to new types of databases that are better able to implement flexible data models. In my next blog posts, I will look at how digital asset metadata can be modeled and then stored in document- and graph-oriented databases.
Note: This is the second in a series of blog posts discussing the need for flexible data models when managing digital asset metadata. The series is based on Demian Hess’ article “Managing digital asset metadata”, Journal of Digital Media Management, Vol. 3, No. 2 (November 2014).