You want to see a CEO freak out? Just whisper the words “Sarbanes-Oxley” in his or her ear. Financial regulatory compliance is seen as a significant headache because it requires pulling together disparate financial data from multiple sources within your organization, often with conflicting definitions of terms, and all too often bound into databases that were never intended to communicate with one another. It also forces you to archive data accumulated over time in order to support auditing and, just as importantly, to maintain the systems needed to access that data, no matter how archaic.
This adds an interesting dimension, then, to Big Data. There is a tendency to think of Big Data as the process of filtering Twitter or Facebook feeds for sentiment analysis or similar cool projects, but at Avalon Consulting, LLC, one thing that has become obvious is that many Hadoop projects (Hadoop being the poster child for Big Data technology) usually come down to providing an inexpensive, scalable way of archiving business data over time. Many such projects involve moving business data out of relational databases and into Hive or Pig so that it can still be queried while making it possible to retire older data systems that existed primarily for compliance purposes.
Of course, the ability to query that data in a meaningful fashion needs to be retained as well. This information – the data models and metadata about the data in question – is often overlooked in the migration strategy, forcing a move towards forensic or archeological programming: attempting to figure out what the intent of the data was when it was first gathered.
A field called Revenue in a database is meaningless without context – over what period was the revenue generated? Over what regions or departments was it gathered? Is this net or gross revenue? All too often this understanding of the data gets lost even when the data itself is retained. A field called RVNU (or even P5) is more cryptic still, yet such notations are all too common, especially when data resides in spreadsheets or similar “independent” databases. Two databases may each have a Revenue field, and those fields may nonetheless mean vastly different things.
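To make the point concrete, here is a minimal sketch of what recovering that context looks like in practice: a data dictionary that maps each cryptic field name to the business concept it represents, along with the period, scope, and net-versus-gross basis that give the raw figure meaning. The field names and attributes here are hypothetical, chosen only to mirror the Revenue/RVNU/P5 examples above.

```python
# Hypothetical data dictionary: each cryptic column name is mapped to the
# business concept it represents plus the context (period, scope,
# net vs. gross basis) that gives the raw figure meaning.
DATA_DICTIONARY = {
    "RVNU": {
        "concept": "revenue",
        "period": "fiscal quarter",
        "scope": "North America",
        "basis": "gross",
    },
    "Revenue": {
        "concept": "revenue",
        "period": "calendar year",
        "scope": "all regions",
        "basis": "net",
    },
}

def describe(field: str) -> str:
    """Render a readable description of a field, or flag it as undocumented."""
    meta = DATA_DICTIONARY.get(field)
    if meta is None:
        # No metadata survived: this field is a candidate for the
        # "forensic programming" described above.
        return f"{field}: UNDOCUMENTED"
    return (f"{field}: {meta['basis']} {meta['concept']} "
            f"per {meta['period']} across {meta['scope']}")

print(describe("RVNU"))    # gross revenue per fiscal quarter across North America
print(describe("P5"))      # flagged as undocumented
```

Note that the two revenue fields above share a name (or a concept) yet differ in every dimension that matters for analysis – exactly the ambiguity that is lost when only the data, and not the dictionary, is archived.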
This need to gather and maintain metadata about the data extends beyond any single database or application. It is in fact an enterprise-wide requirement, something that many organizations now recognize by appointing Chief Data Officers (CDOs) or Chief Information Officers (CIOs). There is also growing recognition that data governance (more properly, metadata governance) is not simply a compliance requirement but increasingly a necessity, both as organizations subsume a wide variety of data systems – from traditional RDBMSs to NoSQL data stores and graph stores – and as the data ecosystem extends well beyond traditional “databases” to include documents, spreadsheets (databases in the wild), various news feeds, and, increasingly, external data streams that contain relevant information about the organization, not just information produced by the organization.
During the decade from 2000 on, data began moving between departments within an organization, and while there was some movement towards extending this out to supply channels and downstream to select data consumers, most of the growth in that period involved data communication rising out of divisional silos to become enterprise data. Increasingly, that data is spilling over the enterprise walls (like water flooding a valley) towards a wide number of consumers, many of whom have different needs, and all of whom need not only the raw data but also the metadata.
The role of the CDO, and of data governors in general, is to help the organization more effectively capture and manage that metadata. Some of that involves standardization, though in general most efforts to create a single consistent organizational data standard fail for a variety of reasons (fodder for another post). More of it involves making sure that the concepts the organization uses are consistently defined, so that even if different standards exist, there is an understanding of how those concepts translate from one environment to the next.
Numerous vendors hawk Master Data Management (MDM) products, but most of these also fail because they ignore the fact that an organization must first do the hard work of establishing its own corporate “ontology” – defining the concepts that are most important for it to track, regardless of the guise those concepts wear. Without that effort, most MDMs simply become translators between a few entrenched databases.
Data governance, then, should be seen as a strategic role: establishing best practices for data usage by both in-house applications and third-party vendors, creating a repository of metadata that can be used to identify the relationships between concepts and to set up rules for translation between those concepts, and ensuring that new data systems are brought in not to create additional silos but to feed the enterprise metadata ecosystem.
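The "rules for translation between those concepts" might be sketched as a concept registry: each system records how its local field names map onto canonically defined concepts, so values can be routed between systems without forcing a single standard on everyone. The system and field names below are hypothetical, echoing the earlier Revenue examples.

```python
# Hypothetical concept registry: each (system, local field name) pair is
# mapped to a canonical concept shared across the enterprise.
CONCEPT_MAP = {
    ("billing_db", "RVNU"): "net_revenue",
    ("sales_warehouse", "Revenue"): "gross_revenue",
    ("finance_sheets", "P5"): "net_revenue",
}

def translate(source_system: str, field: str, target_system: str):
    """Find the field in target_system denoting the same concept, if any."""
    concept = CONCEPT_MAP.get((source_system, field))
    if concept is None:
        return None
    # Search the registry for a field in the target system that maps
    # to the same canonical concept.
    for (system, local_field), mapped in CONCEPT_MAP.items():
        if system == target_system and mapped == concept:
            return local_field
    return None

print(translate("billing_db", "RVNU", "finance_sheets"))   # -> P5
print(translate("sales_warehouse", "Revenue", "billing_db"))  # no match: None
```

The point of the sketch is that the hard part is not the lookup code but agreeing on the canonical concepts in the middle – the corporate “ontology” work that, as noted above, most MDM deployments skip.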
As such, data governance is likely to be a critical function for most organizations, bridging the gap between the technological and business management spheres and ensuring that the data (and metadata) an organization holds becomes an asset, not a liability.
(Side Note: For additional context, see the recent post on the importance of Data Governance to Big Data by my colleague Wayne Applebaum, in which he dives into Gartner’s assertion that the world faces an information crisis and how Data Governance is a necessary tool to navigate, if not avoid, that crisis.)
Kurt Cagle is Principal Evangelist for Avalon Consulting, LLC (http://www.avalonconsult.com), a consulting company specializing in big data solutions, data analytics and data management strategies, and is the author of several books on web data technologies and practices.