Enterprise data integration with the MarkLogic Data Hub Framework

Data integration is one of the most familiar and difficult business challenges. For years, we have invested in technologies that are inflexible by modern standards, and today’s demands for agility and volume cannot be met by yesterday’s systems and strategies.

Due to the vast accumulation of data in legacy systems and the fragmentation of data into silos by line of business and technology, we face a complex problem. A number of alternative solutions have emerged in response. These tend to have a common theme: ETL (Extract, Transform, and Load). Unfortunately, the usual approaches to ETL are expensive and time-consuming.

In addition to managing existing data within the enterprise, there are times when we need to integrate new data from external systems. This presents a similar set of problems.

Damon Feldman of MarkLogic examined the data integration problem and alternative solutions in his article, “Data Lakes, Data Hubs, Federation: Which One Is Best?” See also Matt Allen’s piece, “Get the ETL Out of Here.” Both of these articles, among others from MarkLogic, serve as excellent references for understanding the problem of enterprise data integration, typical attempts at solving it, and how MarkLogic takes a different approach. They show in great detail why and how you can achieve better results faster with a data hub built on MarkLogic.

As described in those articles, the process of building a data hub looks like this: data is first moved from source systems into a central system, without the need for the data to fit a common schema. Thanks to the MarkLogic universal index, the data is already usable at this stage. Next, commonalities are identified between data from the various sources and adjustments are made to make it more uniform, a process called harmonization. This adjusted form of the data is indexed for fast search and analytics. The process is iterative: further refinements, more added data, and additional indexing all occur as the project naturally evolves. But initial costs are minimized and the data is usable immediately, unlike with traditional ETL.

Some time has passed since those articles were written, and while they are no less relevant, there is a newsworthy development: the MarkLogic Data Hub Framework. It is an open source project with contributors who are engineers at MarkLogic. While it is not officially supported, it is production-ready. The Data Hub Framework can help you put theory into practice by taking care of important details within the process described above, allowing you to focus on the big picture.

This article will provide an overview of Data Hub Framework concepts and features. In a follow-up post, we will take a closer look as we build a data hub that will create harmony instead of the discord that data integration can bring.

What makes MarkLogic the natural choice for building a data hub?

There are a number of reasons that MarkLogic has earned its title as the leading commercial NoSQL (or multi-model) database. Many of its strengths are of particular value in building a data hub. This is not a coincidence. MarkLogic has long embraced data hubs as part of its vision.

  • Flexible indexing: MarkLogic stores XML, JSON, text, binary, geospatial, and RDF data. It indexes both structure and content automatically.
  • Scalability: MarkLogic clusters can scale horizontally to hundreds of nodes and billions of documents using commodity hardware while maintaining high performance.
  • Versatility: MarkLogic has unparalleled support for both structured and unstructured data, and it is getting even better with the upcoming release of MarkLogic 9, featuring a new row index, Template Driven Extraction, and the Optic API. I covered the new features in detail in a previous post: MarkLogic 9 introduces row-oriented views and the Optic API.
  • Accessibility: APIs are offered for XQuery, JavaScript, and Java, in addition to the REST API.
  • Well-integrated: MarkLogic is an enterprise-grade multi-model database, search engine, and web server, with semantics, SQL support, advanced security, clustering capabilities, and more, all integrated into one product.

Building on these strengths, the Data Hub Framework provides tools that accelerate data hub development. These tools encapsulate both the strengths of MarkLogic and the wisdom gained from years of experience implementing data hubs on the platform.

What is the Data Hub Framework?

The MarkLogic Data Hub Framework is a set of libraries and a GUI that make it easier to build data hubs that follow best practices. It is designed to support DevOps activities. You can quickly build a complete data hub using the GUI. You can also access all the functionality by using the libraries directly. This combination makes it easy to get started and gives you the flexibility to migrate parts of the hub to your build process as your project matures.

How is the Data Hub Framework used?

A complete data hub that conforms to best practices can be built by following these steps:

  1. Gather input data.
  2. Identify entities (business objects).
  3. Create input flows to load the data. These define the data source and how it should be loaded. There can be different input flows for different types of data to be loaded.
  4. Run input flows to load the data as is into the staging database.
  5. Identify common attributes of importance for indexing.
  6. Create harmonization flows that will separate the common attributes (while still preserving the original data) and make other adjustments to the data such as unit conversion or enrichment.
  7. Run harmonization flows, which will write harmonized data into the final database.
  8. Run queries and begin using the data hub.
  9. Repeat the steps iteratively, loading more data and refining harmonization as the project evolves.
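To make the load and harmonize steps more concrete, here is a minimal sketch in plain Python. It does not use the Data Hub Framework or any MarkLogic API. The two dictionaries standing in for the staging and final databases, the function names, the URI scheme, and the "common"/"original" document layout are all assumptions made for illustration. The point is only to show the shape of the process: load data as is, then write a harmonized copy that separates the common attributes while keeping the original data intact.

```python
from __future__ import annotations

from typing import Any, Callable

# In-memory stand-ins for the staging and final databases (hypothetical).
staging: dict[str, dict[str, Any]] = {}
final: dict[str, dict[str, Any]] = {}


def run_input_flow(source: str, records: list[dict[str, Any]]) -> None:
    """Load each record as is into staging, keyed by a source-qualified URI."""
    for i, record in enumerate(records):
        staging[f"/{source}/{i}.json"] = record


def run_harmonize_flow(extract: Callable[[dict[str, Any]], dict[str, Any]]) -> None:
    """Write one harmonized document into final for every staged document."""
    for uri, original in staging.items():
        final[uri] = {
            "common": extract(original),  # uniform attributes, ready for indexing
            "original": original,         # source data preserved unchanged
        }


def extract_customer(doc: dict[str, Any]) -> dict[str, Any]:
    """Map differently named source fields onto one common attribute."""
    name = doc.get("name") or doc.get("customer_name") or doc.get("fullName")
    return {"name": name}


# Two sources with different shapes are loaded without a shared schema.
run_input_flow("crm", [{"customer_name": "Ada Lovelace", "tier": "gold"}])
run_input_flow("billing", [{"fullName": "Alan Turing", "balance": 12.5}])

# Harmonization gives both sources the same queryable attribute.
run_harmonize_flow(extract_customer)
names = sorted(doc["common"]["name"] for doc in final.values())
print(names)
```

Because the original record travels with the harmonized attributes, the extraction function can be refined and the harmonization flow rerun later without reloading anything from the source systems, which is what makes the iterative last step cheap.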

Use case: a steel distributor data hub

In the next post, we will begin building a data hub by following the steps listed above. Our use case will be a data hub for a steel distributor, a company that buys steel products from various manufacturers, categorizes the products, and resells them. The specifications for each steel product are received in a form specific to the manufacturer, which can vary widely. Manufacturers from different countries use different steel grading standards. There are also different units used to specify dimensions. These diverse specs need to be stored, searched, and presented uniformly by the distributor. We’ll use the Data Hub Framework to harmonize the data so products can be found regardless of the standards used in their country of origin. As a bonus, we will show how MarkLogic semantics features can be used to understand the relationships between different steel alloys. We will use the semantics support in the Data Hub Framework to enrich product specification documents with deeper meaning on their way into the data hub and we will demonstrate how this added meaning enables more powerful searches.
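As a preview of what harmonization could mean for steel specifications, the following sketch normalizes dimensions to millimeters and maps grade designations from different standards onto a single canonical grade. It is plain Python rather than Data Hub Framework code, and the field names, the `harmonize_spec` function, and the small equivalence table are hypothetical. The grade equivalences shown are commonly cited approximate equivalents, included only as sample data.

```python
from __future__ import annotations

from typing import Any

# Conversion factors to millimeters (1 inch is exactly 25.4 mm).
MM_PER_UNIT = {"mm": 1.0, "cm": 10.0, "in": 25.4}

# Illustrative lookup from (standard, grade) to a canonical grade label.
# A real hub would maintain a far larger, curated table.
GRADE_TO_CANONICAL = {
    ("AISI", "304"): "304",
    ("EN", "1.4301"): "304",
    ("JIS", "SUS304"): "304",
}


def harmonize_spec(spec: dict[str, Any]) -> dict[str, Any]:
    """Return a harmonized document that keeps the manufacturer's spec intact."""
    thickness_mm = spec["thickness"] * MM_PER_UNIT[spec["unit"]]
    # Unknown grades map to None rather than being guessed.
    canonical = GRADE_TO_CANONICAL.get((spec["standard"], spec["grade"]))
    return {
        "common": {"grade": canonical, "thickness_mm": round(thickness_mm, 3)},
        "original": spec,
    }


us_spec = {"standard": "AISI", "grade": "304", "thickness": 0.25, "unit": "in"}
eu_spec = {"standard": "EN", "grade": "1.4301", "thickness": 6.35, "unit": "mm"}

us_doc = harmonize_spec(us_spec)
eu_doc = harmonize_spec(eu_spec)

# Both products now share the same searchable attributes.
print(us_doc["common"] == eu_doc["common"])
```

A search for 6.35 mm sheet in the canonical grade would then match both products, even though one manufacturer specified inches under one standard and the other specified millimeters under another.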

About Karl Erisman
