Is It Really Big Data?

Just because you have a lot of data doesn’t automatically mean that you are dealing with Big Data.

This came to mind recently when I was building a digital asset management system and was asked to talk about my “Big Data” project. The term didn’t seem to fit, but at first I wasn’t sure why. I definitely had a lot of data–millions of files–and designing a system that could store all the assets, package them up into zip files, and then send them to users on-demand certainly presented challenges. But it didn’t feel like Big Data. No, I finally realized, this was something else. It was Cumbersome Data.

The term Big Data has been tossed around a lot in the media, so it is worth going back to one of the earliest definitions, offered by Gartner, Inc.:

“Big data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

One of the key aspects of this definition is the emphasis on insight and decision making. To my mind, a Big Data project is about finding new and innovative ways to analyze a lot of information.

Many people who think they are dealing with Big Data are actually not doing any data analysis. Instead, they are trying to find cost-effective and efficient ways to manage large volumes of data. That is what I call Cumbersome Data, and here is my alternate definition:

“Cumbersome data” is a large volume of information assets that requires a significant amount of time or system resources for storing, packaging and/or transmitting.

In other words, whereas Big Data is about analysis, Cumbersome Data is about logistics. In particular, as my definition indicates, I have found three major logistical challenges: storage, packaging and transmission.

Storage is probably the simplest issue to explain: Cumbersome Data takes a lot of disk space. While it’s certainly true that storage is now very cheap, when you are dealing with terabytes of data, it still gets expensive. For the digital asset management system I was designing, we started with six terabytes of data and needed a way to store it that would also offer good backup and recovery as the data volume continued to grow. SAN and NAS solutions cost too much–nearly 80 cents per gigabyte per month. In the end, cloud storage proved to be the ideal solution, offering secure, scalable storage at a tenth of the cost.
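For a rough sense of the gap, here is the back-of-envelope arithmetic as a short Python sketch. The six terabytes, the roughly 80-cents-per-gigabyte SAN/NAS quote, and the one-tenth cloud figure are the numbers above; treating a terabyte as 1,000 GB is my own simplification, not a quote from any provider.

    # Back-of-envelope monthly storage cost for the initial 6 TB.
    DATA_GB = 6 * 1000            # 6 TB expressed in gigabytes (decimal units)
    SAN_RATE = 0.80               # ~$0.80 per GB per month (the SAN/NAS quote)
    CLOUD_RATE = SAN_RATE / 10    # roughly a tenth of the cost

    print(f"SAN/NAS: ${DATA_GB * SAN_RATE:,.0f} per month")    # about $4,800
    print(f"Cloud:   ${DATA_GB * CLOUD_RATE:,.0f} per month")  # about $480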

Of course, once we decided to use cloud storage, the next logistical challenge was transmitting our data to the cloud. Uploading that much information over a 10 Mbps internet connection would take weeks. Indeed, through trial and error, we discovered that any upload or download that takes more than an hour and a half to complete is almost guaranteed to fail: bytes will be lost and the data will be corrupted.
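To put a number on "weeks," here is the same kind of back-of-envelope sketch, assuming the initial six terabytes and a fully saturated 10 Mbps link with no protocol overhead or retries:

    # Best-case time to push 6 TB through a 10 Mbps connection.
    DATA_BITS = 6 * 1000**4 * 8   # 6 TB in bits (decimal units)
    LINK_BPS = 10 * 1000**2       # 10 megabits per second

    seconds = DATA_BITS / LINK_BPS
    print(f"{seconds / 86400:.0f} days")  # roughly 55 days, before any failed transfers

Fifty-plus days of continuous uploading, on a link where anything running longer than ninety minutes tended to fail, made a purely network-based initial load a non-starter.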

To get our initial load of data into cloud storage, we found that FedEx was the fastest and cheapest transmission method. A truck traveling at 65 miles per hour may seem low tech, but it has astonishing bandwidth.
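By the same arithmetic, a six-terabyte shipment that arrives overnight (call it 24 hours door to door, which is my assumption, not a FedEx guarantee) works out to an effective throughput of better than 500 Mbps, more than fifty times our upload link.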

Packaging is probably the least obvious challenge presented by Cumbersome Data and refers to the physical organization of the data itself. For example, if you store millions of files on a hard drive, you will immediately find that you cannot place them all in a single directory due to file system and operating system limitations. By using cloud storage, we were able to avoid this problem, but we were still confronted with the challenge of how to package the assets into zip files for distribution to users. Standard zip compression limits archives to 4GB of data, which was too small for our purposes. In the end, we needed to write code that would automatically divide asset requests into multiple zip files.
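For illustration, here is a minimal sketch of the kind of splitting logic involved. It is not our production code: the names are hypothetical, it assumes the requested assets are local files whose sizes are known up front, and it leaves headroom below the 4 GB cap for archive overhead.

    import os
    import zipfile

    # Stay safely below the standard (non-ZIP64) 4 GiB archive limit.
    MAX_ARCHIVE_BYTES = 4 * 1024**3 - 64 * 1024**2

    def split_into_archives(asset_paths, output_prefix):
        """Group assets into batches under the size cap, then write one zip per batch."""
        batches, current, current_size = [], [], 0
        for path in asset_paths:
            size = os.path.getsize(path)
            if current and current_size + size > MAX_ARCHIVE_BYTES:
                batches.append(current)
                current, current_size = [], 0
            current.append(path)
            current_size += size
        if current:
            batches.append(current)

        archive_names = []
        for i, batch in enumerate(batches, start=1):
            name = f"{output_prefix}-part{i:03d}.zip"
            # allowZip64=False makes zipfile raise an error rather than silently
            # exceed the classic format limits.
            with zipfile.ZipFile(name, "w", compression=zipfile.ZIP_DEFLATED,
                                 allowZip64=False) as zf:
                for path in batch:
                    zf.write(path, arcname=os.path.basename(path))
            archive_names.append(name)
        return archive_names

Batching on cumulative uncompressed size is a conservative choice: DEFLATE rarely makes data larger, and the headroom covers zip metadata and the occasional already-compressed asset that does not shrink.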

Cloud storage, FedEx trucks, and zip archives. None of this is particularly sexy and not nearly as exciting as developing a new data visualization or data mining technique. But that is the nature of Cumbersome Data: it is tedious and unglamorous.

But it’s also vital. You can’t begin analyzing terabytes of biomedical research or weeks of raw sensor data or streams of Twitter feeds for meaningful information until you have figured out where you’re going to store the data, how you’ll transmit it, and how you will package it up for consumption.

Napoleon famously said that “an army marches on its stomach.” In other words, the importance of logistics can never be overlooked. Digital Asset Management and Content Management Systems are definitely logistical systems that must confront the challenges of Cumbersome Data. The investment in these systems is well worth the cost, for they are the foundations upon which true Big Data applications can be built.

About Demian Hess

Demian Hess is Avalon Consulting, LLC's Director of Digital Asset Management and Publishing Systems. Demian has worked in online publishing since 2000, specializing in XML transformations and content management solutions. He has worked at Elsevier, SAGE Publications, Inc., and PubMed Central. After studying American Civilization and Computer Science at Brown University, he went on to complete a Master's in English at Oregon State University, as well as a Master's in Information Systems at Drexel University.

Comments

  1. Thanks for this posting. Why did you create your own custom packaging system with multiple ZIP files rather than simply adopting the ZIP64 file format? ZIP64 is widely supported, and it has no 32-bit (4GiB) limit.
