Data Data Everywhere and All the Ships Did Sink

This is my second blog focusing on data technology. In my first installment, “Data Data Everywhere,” I explored NoSQL technologies, specifically Couchbase. Recently, I had the opportunity to attend the SAP Data Hub training course. In that course, we focused on developing pipelines and moving data from one storage technology to another. We also looked at discovery functionality along with security and out-of-the-box operators. I was impressed with the software’s Kubernetes-based architecture and its easy-to-reuse pipeline functionality.

In this article, I’ll explain SAP Data Hub’s mission and look at a fun make-believe use case that represents other real-world scenarios. But first, let’s briefly take a look at the state of data and how we got here.

Big Data

Some label our current economic period the Information Age – characterized by the shift from an industrial economy to one based on information technology. I would agree with this sentiment dating back to the late 1980s, but today I would call it the Age of Big Data. Big Data has introduced an interesting conundrum:

  • With all this data available, how do we get the most value from it?
  • How do we use it to support business processes?
  • How does this contribute to competitive advantage?
  • What’s in it for me?

Forrester estimates that 60–73% of all data within an enterprise goes unused. Today, we have more data available to us than ever before. Humans have created more data in the last two years than in all of prior history (read that again, and remember that 60–73% of it goes unused). And the phenomenon is growing exponentially: more people are using more devices, devices are becoming smarter, and countless machines are fitted with sensors that record physical operations digitally – the Internet of Things (IoT).


Hadoop was developed to handle large amounts of data by distributing it across clusters of commodity hardware. Its native file system, HDFS, distributes data rapidly across nodes and splits up large files, improving data locality and per-node processing efficiency. Hadoop is the technology platform that enables companies to build all-encompassing “data lakes”.

Data Lakes

A data lake is a centralized, easily accessible repository for all of your organization’s data. It can be a staging area for downstream applications, or a repository for developing analytics or machine learning algorithms. A data lake is not a data warehouse: it accepts any data, structured or unstructured, with no upfront transformation requirements. Hadoop with HDFS is well suited to this kind of “Wild West” environment.

In order for data lakes to add value as a central repository, the ETL (Extract-Transform-Load) tools that interface with them need sophisticated functionality that supports data governance and improves data quality. Otherwise, data lakes risk becoming unused data swamps that add little value to the enterprise. A new tool from SAP, named Data Hub, plays a critical role in facilitating access to and use of the content in data lakes.

SAP Data Hub

Gartner labels ETL tools “data integration tools”: software that ingests, transforms, combines, and provisions data across a spectrum of data types, within the enterprise or to partners. Our goal as data stewards is to satisfy the data consumption requirements of all applications and business processes – anytime, anywhere, and on any device. SAP is a leader in this area.

Many companies are struggling to get value out of their enterprise data in combination with their Big Data initiatives. SAP Data Hub is a data operations tool that provides data access and metadata governance capabilities. It is a data sharing, pipelining, and orchestration solution that accelerates and expands the flow of data across diverse landscapes.

After my training course, I wanted to test my new skills on a use case. Since I also work with Couchbase, I wanted the use case to take advantage of this NoSQL database. It was snowing outside while I was thinking about this technology and potential use cases. The weather inspired me, and below I’ll explain my idea for personal weather stations.

Weather Station Use Case

Weather is a big deal to a lot of people. There’s more than one TV channel dedicated just to the weather. Weather is also a big deal to business. In 2017, there were 16 separate billion-dollar weather disaster events – fires, drought, flooding, tropical cyclones, and hurricanes. The cumulative damage of all events in 2017 is estimated at $306.2 billion, which shattered the previous U.S. annual record cost of $214.8 billion (CPI-adjusted) set in 2005.

I was thinking, “wouldn’t it be cool to have a personal weather station?” – a device I could monitor at home and also on my phone. Over time, I could track the weather at my house and measure fluctuations.

I discovered that Weather Underground offers an opportunity to share weather data with a community of like-minded enthusiasts. They also promote personal weather stations that upload observations to their network. In addition to sharing with Weather Underground, though, I wanted a weather station where I control and direct the data.

Architecture Overview

My fictitious idea is to deploy personal weather stations across my home state of Colorado. In the image below, blue arrows indicate the flow of data to different environments. Weather stations upload observations every 10 minutes to an Amazon S3 bucket, labeled datahubweather below. SAP Data Hub monitors the bucket and reads the text files as they are written, then inserts the observations into a Couchbase cluster. Couchbase backs a website and a mobile application. The mobile application leverages Couchbase Lite and Sync Gateway, enabling social interactivity for weather enthusiasts like me!

Data flow for weather observations.


Example weather observation including timestamp.
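To make the flow concrete, here is a minimal Go sketch of how an observation file from S3 might be parsed and keyed before insertion into Couchbase. The field names and the key scheme are my own illustrative assumptions, not the actual station schema or operator internals.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Observation mirrors the JSON text files the stations write to S3.
// These field names are illustrative placeholders.
type Observation struct {
	StationID string  `json:"station_id"`
	Timestamp string  `json:"timestamp"` // ISO 8601, e.g. "2019-05-01T12:10:00Z"
	TempF     float64 `json:"temp_f"`
	WindMPH   float64 `json:"wind_mph"`
}

// DocumentKey builds a deterministic Couchbase key from station and time,
// so re-running the pipeline upserts rather than duplicates documents.
func DocumentKey(o Observation) string {
	return fmt.Sprintf("obs::%s::%s", o.StationID, o.Timestamp)
}

func main() {
	raw := `{"station_id":"evergreen-01","timestamp":"2019-05-01T12:10:00Z","temp_f":63.0,"wind_mph":5.2}`
	var o Observation
	if err := json.Unmarshal([]byte(raw), &o); err != nil {
		panic(err)
	}
	fmt.Println(DocumentKey(o)) // prints obs::evergreen-01::2019-05-01T12:10:00Z
}
```

A deterministic key like this is what makes a ten-minute polling pipeline safe to restart: replaying a file overwrites the same documents instead of inflating the dataset.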


This image shows the SAP Data Hub user interface while creating the data flow described above. The Avalon Couchbase Go operator is easily configured via parameters for host, username, and password.



The Weather for Monday – Friday 4/29 to 5/3/2019

Even though I don’t have a personal weather station, there is a free service I could use to collect weather data: Open Weather Map provides current weather measurements via REST calls. I used the service by passing longitude and latitude coordinates for 12 locations in Colorado, shown below.
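As a sketch of those REST calls, the Go function below builds a current-weather request for one coordinate pair. The endpoint shape follows OpenWeatherMap’s public current-weather API; the API key is a placeholder, and the exact parameters I used in the pipeline may have differed.

```go
package main

import (
	"fmt"
	"net/url"
)

// currentWeatherURL builds the REST call for one coordinate pair.
// units=imperial returns temperatures in Fahrenheit.
func currentWeatherURL(lat, lon float64, apiKey string) string {
	q := url.Values{}
	q.Set("lat", fmt.Sprintf("%.4f", lat))
	q.Set("lon", fmt.Sprintf("%.4f", lon))
	q.Set("units", "imperial")
	q.Set("appid", apiKey)
	return "https://api.openweathermap.org/data/2.5/weather?" + q.Encode()
}

func main() {
	// Evergreen, Colorado – one of the 12 sample locations.
	fmt.Println(currentWeatherURL(39.6333, -105.3172, "YOUR_API_KEY"))
}
```

A scheduled Data Hub pipeline would call a URL like this for each of the 12 locations every ten minutes and write the JSON responses to the S3 bucket.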


That week, every ten minutes, I ran a pipeline in SAP Data Hub that collected weather observations from Open Weather Map and saved them to the AWS S3 bucket. The main data pipeline collected this data from S3 and inserted it into Couchbase. At the end of the week, I used Microsoft Power BI to run analytics on the weather data in Couchbase via N1QL queries. The image below is a time series for temperature. We started the week cooler, and by Friday temperatures in Colorado were in the 50s and 60s. In the graph, you may notice a couple of straight lines. Those indicate gaps in time where my pipeline failed. SAP Data Hub has built-in trace functionality for capturing pipeline errors. After a little debugging, I noticed that the connection was lost at midnight each night. I fixed this by archiving files in S3 that had already been read.
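The archiving fix amounts to moving each processed object out of the prefix the pipeline watches. This Go helper sketches the key rewrite; the prefix names are my own assumptions, since the article doesn’t document the actual bucket layout.

```go
package main

import (
	"fmt"
	"path"
)

// archiveKey rewrites an S3 object key so a processed observation file
// moves out of the pipeline's watch prefix ("incoming/" here) into an
// archive prefix. In S3 a "move" is a copy to the new key plus a delete
// of the old one.
func archiveKey(key string) string {
	return path.Join("archive", path.Base(key))
}

func main() {
	fmt.Println(archiveKey("incoming/2019-05-01-1210.json")) // prints archive/2019-05-01-1210.json
}
```

With processed files archived, the reader pipeline only ever sees unread keys under the watch prefix, so the midnight rollover no longer left it re-scanning a growing backlog.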


This is a time series for wind.  On May 2, Aspen had wind gusts up to 70 miles per hour!


This is a time series showing temperature for Evergreen, Colorado where I live.  Earlier in the week, it was snowing and cooler but by Friday the sun was out and it was 63 degrees!


Wrap Up

This was a fun project, and I was pleased with how easy SAP Data Hub made collecting data and storing it in Couchbase; it was also relatively quick to get started. Data Hub has built-in operators that handle S3 reads and writes of the weather observation files. I wrote the Couchbase operator in Golang, and it is now available for reuse with any data from any data source. The architecture of Data Hub is also impressive: whenever you run a new data pipeline, a new pod is created in the Kubernetes cluster, which lets the system scale out horizontally as you add more data flows.

SAP Data Hub is a useful tool for any organization struggling with data overload and exploring ways to move data across diverse environments. It has built-in discovery tools and search capabilities that help organizations share data across groups of people. The tool is positioned to support a vision of getting the right information to the right people at the right time.

About Will Thayer

Will Thayer is a Principal Consultant Technologist at Avalon Consulting, LLC. Will has more than 20 years’ experience in planning, strategy, development, and training. His expertise includes web application development, big data analytics, NoSQL, management information systems, and system development life cycle practices. As an Adjunct Professor at the University of Denver, Will taught graduate and undergraduate students for 5 years. His research in Open EDI and XML EDI has appeared in books as well as trade periodicals. Will lives in Evergreen, Colorado where he enjoys skiing, hiking, biking, and camping in the Rocky Mountains with his wife and children.
