Data Data Everywhere and All the Ships Did Sink

This is my second blog focusing on data technology. In my first installment, “Data Data Everywhere,” I explored NoSQL technologies, specifically Couchbase. Recently, I had the opportunity to attend the SAP Data Hub training course. In that course, we focused on developing pipelines and moving data from one storage technology to another. We also looked at discovery functionality along with security and out-of-the-box operators. I was impressed with the software’s Kubernetes-based architecture and easy-to-reuse pipeline functionality.

In this article, I’ll explain SAP Data Hub’s mission and look at a fun make-believe use case that represents other real-world scenarios. But first, let’s briefly take a look at the state of data and how we got here.

Big Data

Some label our current economic period the Information Age, characterized by the shift from an industrial economy to one based on information technology. I would agree with this sentiment dating back to the late 1980s, but today I would call it the Age of Big Data. Big Data has introduced an interesting conundrum:

  • With all this data available, how do we get the most value from it?
  • How do we use it to support business processes?
  • How does this contribute to competitive advantage?
  • What’s in it for me?

Forrester estimates that 60–73% of all data within an enterprise goes unused. Today, we have more data available to us than ever before. Humans have created more data in the last two years than in our entire history (read that again and remember that 60–73% of it goes unused). This phenomenon is growing exponentially: more people use more devices, devices keep getting smarter, and countless machines are fitted with all kinds of sensors that record physical operations digitally (the Internet of Things, or IoT).

Hadoop

Hadoop was developed to handle large amounts of data by distributing it across clusters of commodity hardware. Its native file system, HDFS, distributes data across nodes and splits large files into blocks, improving data locality and per-node processing efficiency. Hadoop is the technology platform enabling companies to build all-encompassing “data lakes”.

Data Lakes

A data lake is a centralized, easily accessible repository for all your organization’s data. It can be a staging area for other applications downstream, or it can be the repository for developing analytics or machine learning algorithms. A data lake is not a data warehouse: it accepts any data, structured or unstructured, with no upfront transformation requirements. Hadoop with HDFS is well suited to handle this kind of “Wild West” environment.

For a data lake to add value as a central repository, the ETL (Extract-Transform-Load) tools that interface with it need sophisticated functionality that supports data governance and improves data quality. Otherwise, data lakes risk becoming unused data swamps that add little value to the enterprise. A new tool from SAP, named Data Hub, plays a critical role in facilitating access to and use of the content in data lakes.

SAP Data Hub

Gartner labels ETL tools “Data Integration Tools”: they ingest, transform, combine, and provision data across a spectrum of data types, within the enterprise or out to partners. Our goal as data stewards is to meet the data consumption requirements of all applications and business processes: anytime, anywhere, and on any device. SAP is a leader in this area.

Many companies struggle to get value out of their enterprise data in combination with their Big Data initiatives. SAP Data Hub is a tool for data operations that provides data access and metadata governance capabilities. It is a data sharing, pipelining, and orchestration solution that accelerates and expands the flow of data across diverse landscapes.

After my training course, I wanted to test my new skills on a use case scenario. Since I also work with Couchbase, I wanted my use case to take advantage of this NoSQL database. It was snowing outside while I was thinking about the technology and potential use cases. The weather inspired me, and below I’ll explain my idea for personal weather stations.

Weather Station Use Case

Weather is a big deal to a lot of people. There’s more than one TV channel dedicated just to the weather. Weather is also a big deal to business. In 2017, there were 16 separate billion-dollar weather disaster events in the U.S.: fires, drought, flooding, tropical cyclones, and hurricanes. The cumulative damage of all events in 2017 is estimated at $306.2 billion, which shattered the previous U.S. annual record of $214.8 billion (CPI-adjusted) set in 2005.

I was thinking, “Wouldn’t it be cool to have a personal weather station?” I pictured a device I could monitor at home and on my phone. Over time, I could track the weather at my house and measure fluctuations.

I discovered that Weather Underground offers an opportunity to share weather data with a community of like-minded enthusiasts. They also promote personal weather stations that upload observations to their network. Beyond sharing with Weather Underground, though, I wanted a weather station where I control and direct the data. Here’s a short video of my wish-list device.

Architecture Overview

My fictitious idea is to deploy personal weather stations across my home state of Colorado. In the image below, blue arrows indicate the flow of data to different environments. Weather stations upload observations every 10 minutes to an Amazon S3 bucket, labeled datahubweather below. SAP Data Hub monitors the bucket and reads text files as they are written, then inserts these observations into a Couchbase cluster. Couchbase supports a website and a mobile application; the mobile application leverages Couchbase Lite and Sync Gateway, enabling social interactivity for weather enthusiasts like me!

[Figure: Data flow for weather observations]

[Figure: Example weather observation, including timestamp]
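
To make the first leg of that flow concrete, here is a minimal sketch of a station-side upload to the datahubweather bucket using the AWS SDK for Go. The Observation fields, the object-key layout, and the region are illustrative assumptions rather than the actual format my stations would use.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// Observation is a hypothetical shape for one 10-minute weather reading.
type Observation struct {
	Station   string    `json:"station"`
	Timestamp time.Time `json:"timestamp"`
	TempF     float64   `json:"temp_f"`
	WindMph   float64   `json:"wind_mph"`
}

// uploadObservation writes one reading as a JSON object into the S3 bucket
// that SAP Data Hub monitors.
func uploadObservation(svc *s3.S3, obs Observation) error {
	payload, err := json.Marshal(obs)
	if err != nil {
		return err
	}
	key := fmt.Sprintf("observations/%s/%d.json", obs.Station, obs.Timestamp.Unix())
	_, err = svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("datahubweather"), // bucket name from the architecture above
		Key:    aws.String(key),              // hypothetical key layout
		Body:   bytes.NewReader(payload),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	obs := Observation{Station: "evergreen-co", Timestamp: time.Now().UTC(), TempF: 63.0, WindMph: 8.5}
	if err := uploadObservation(s3.New(sess), obs); err != nil {
		log.Fatal(err)
	}
}
```

In the real project, Data Hub’s built-in S3 operators handled the reading on the other side of this bucket; the station-side upload is the make-believe part.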

The image below shows the SAP Data Hub user interface while creating the data flow described above. The Avalon Couchbase Go operator is easily configured via parameters for host, username, and password.

[Figures: Avalon Couchbase Go operator configuration; Data Hub pipeline model]
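
For context, here is a minimal sketch of the kind of upsert that operator performs, written against the Couchbase Go SDK (gocb v2). The host, username, and password mirror the operator’s configuration parameters; the weather bucket name and document shape are assumptions, and this is not the operator’s actual source.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/couchbase/gocb/v2"
)

// insertObservation upserts one weather observation keyed by station and timestamp.
// The "weather" bucket and the document fields are illustrative assumptions.
func insertObservation(host, username, password string, doc map[string]interface{}) error {
	cluster, err := gocb.Connect("couchbase://"+host, gocb.ClusterOptions{
		Username: username,
		Password: password,
	})
	if err != nil {
		return err
	}
	defer cluster.Close(nil)

	bucket := cluster.Bucket("weather")
	if err := bucket.WaitUntilReady(5*time.Second, nil); err != nil {
		return err
	}

	key := fmt.Sprintf("obs::%v::%v", doc["station"], doc["timestamp"])
	_, err = bucket.DefaultCollection().Upsert(key, doc, nil)
	return err
}

func main() {
	// Example document only; values are illustrative.
	doc := map[string]interface{}{
		"station":   "aspen-co",
		"timestamp": "2019-05-02T12:00:00Z",
		"temp_f":    54.2,
		"wind_mph":  70.0,
	}
	if err := insertObservation("localhost", "Administrator", "password", doc); err != nil {
		log.Fatal(err)
	}
}
```

Inside Data Hub, this logic runs as a Go operator in the pipeline graph, with the connection details supplied through the operator’s parameters.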

The Weather for Monday–Friday, 4/29 to 5/3/2019

Even though I don’t have a personal weather station, there is a free service I can use to collect weather data. OpenWeatherMap provides current weather measurements via REST calls, and I used it by passing latitude and longitude coordinates for the 12 Colorado locations shown below.

[Map: the 12 observation locations across Colorado]
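
Below is a minimal sketch of one of those REST calls in Go against OpenWeatherMap’s current-weather endpoint. The API key and the Evergreen coordinates are placeholders; in the project, the equivalent call was made for each of the 12 locations.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// fetchCurrentWeather calls the OpenWeatherMap current-weather endpoint for one
// coordinate pair and returns the raw JSON response.
func fetchCurrentWeather(lat, lon float64, apiKey string) ([]byte, error) {
	url := fmt.Sprintf(
		"https://api.openweathermap.org/data/2.5/weather?lat=%f&lon=%f&units=imperial&appid=%s",
		lat, lon, apiKey)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("openweathermap returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Approximate coordinates for Evergreen, Colorado; the API key is a placeholder.
	body, err := fetchCurrentWeather(39.63, -105.32, "YOUR_API_KEY")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```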

That week, a pipeline in SAP Data Hub ran every ten minutes, collecting weather observations from OpenWeatherMap and saving them to the AWS S3 bucket. The main data pipeline picked this data up from S3 and inserted it into Couchbase. At the end of the week, I used Microsoft Power BI to run analytics on the weather data in Couchbase via N1QL queries. The image below is a time series for temperature. We started the week cooler, and by Friday temperatures across Colorado were in the 50s and 60s. In the graph, you may notice a couple of straight lines. Those indicate gaps where my pipeline failed. SAP Data Hub has built-in trace functionality for capturing pipeline errors, and after a little debugging I noticed that the connection was lost at midnight each night. I fixed this by archiving files in S3 that had already been read.

[Chart: temperature time series, 4/29–5/3/2019]
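
The charts were built in Power BI from N1QL results. The sketch below shows the kind of query involved, issued from Go here purely for illustration; the weather bucket and field names are assumptions carried over from the earlier sketches.

```go
package main

import (
	"fmt"
	"log"

	"github.com/couchbase/gocb/v2"
)

func main() {
	cluster, err := gocb.Connect("couchbase://localhost", gocb.ClusterOptions{
		Username: "Administrator",
		Password: "password",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cluster.Close(nil)

	// Pull the week's readings ordered by station and time, ready for a time series.
	query := `SELECT station, timestamp, temp_f
	          FROM weather
	          WHERE timestamp BETWEEN $start AND $end
	          ORDER BY station, timestamp`
	rows, err := cluster.Query(query, &gocb.QueryOptions{
		NamedParameters: map[string]interface{}{
			"start": "2019-04-29T00:00:00Z",
			"end":   "2019-05-03T23:59:59Z",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	for rows.Next() {
		var row map[string]interface{}
		if err := rows.Row(&row); err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```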

This is a time series for wind.  On May 2, Aspen had wind gusts up to 70 miles per hour!

[Chart: wind time series]

This is a time series showing temperature for Evergreen, Colorado, where I live. Earlier in the week it was snowing and cooler, but by Friday the sun was out and it was 63 degrees!

[Chart: temperature time series for Evergreen, Colorado]

Wrap Up

This was a fun project, and I was pleased with how easy SAP Data Hub made the process of collecting data and storing it in Couchbase; it was also relatively quick to get started. Data Hub has built-in operators that handle S3 reads and writes of the weather observation files. I wrote the Couchbase operator in Golang, and that operator is now available for reuse with data from any source. The architecture of Data Hub is also impressive: whenever you run a new data pipeline, a new pod is created in the Kubernetes cluster, which lets the platform scale out horizontally as you add more data flows.

SAP Data Hub is a useful tool for any organization struggling with data overload and exploring ways to move data across diverse environments. It has built-in discovery tools and search capability which help organizations share data across groups of people. The tool is positioned to support a vision of getting the right information to the right people at the right time.

About Will Thayer

Will Thayer is a Principal Consultant Technologist at Avalon Consulting, LLC. Will has more than 20 years’ experience in planning, strategy, development, and training. His expertise includes web application development, big data analytics, NoSQL, management information systems, and system development life cycle practices. As an Adjunct Professor at the University of Denver, Will taught graduate and undergraduate students for 5 years. His research in Open EDI and XML EDI has appeared in books as well as trade periodicals. Will lives in Evergreen, Colorado where he enjoys skiing, hiking, biking, and camping in the Rocky Mountains with his wife and children.
