Tony Jewitt, our VP of Big Data, challenged me to impress him with an example of the power of emerging technology in solving real-world Big Data problems. I needed a large dataset with a complicated processing requirement, and decided to extract witness information from Senate and House hearings for the past 30 years. I set about making sense of this intimidating dataset by loading it into our favorite NoSQL database and presenting it in a way that would let people find meaningful information effectively.
For the purposes of my test I chose to expose an interface where a user can filter Senate, House, and joint hearings based on who participated as a witness in the hearing. Conceptually, this is no problem for MarkLogic, except that there was no witness metadata attached to the documents. So I knew I would have to put in a little work and find it myself. How? By leveraging the GATE processing framework to extract entities. I developed a utility that takes a hearing document (in XML format) and returns a witness-enriched XML document.
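GATE itself needs its own runtime and pipeline configuration, so a full example is out of scope here, but the contract of that utility is easy to sketch: hearing XML in, witness-annotated XML out. The class name, the `<witness>` tag, and the naive regex below are all hypothetical stand-ins for illustration; the real utility relies on GATE's entity extraction, not pattern matching.

```java
import java.util.regex.Pattern;

// Illustrative stand-in for the GATE-based utility: takes a hearing
// document as XML and returns it with <witness> annotations added.
// The real pipeline uses GATE entity extraction, not this toy regex.
public class WitnessEnricher {

    // Hypothetical naive pattern: an honorific followed by a capitalized surname.
    private static final Pattern NAME =
            Pattern.compile("(Mr\\.|Ms\\.|Dr\\.) ([A-Z][a-z]+)");

    // XML in, witness-enriched XML out: the same shape as the real utility.
    public static String enrich(String hearingXml) {
        return NAME.matcher(hearingXml).replaceAll("<witness>$1 $2</witness>");
    }
}
```

The point is the interface, not the matching logic: any document that goes in comes back out with the witness entities marked up, which is exactly what the later MapReduce step needs.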
The only problem is that it would have taken FOREVER to do this across all the documents. There were hundreds of thousands of hearings to process, and it takes a little over an hour to process just 15,000 documents. At that rate a single pass over the whole corpus would take a day or more, which was far too slow, especially since I knew I would need to run the process repeatedly on large data sets to improve the quality and perfect the algorithm.
So at that point I had to think about my options for moving forward. I could have tried to get a bigger machine, but there is a limit to how far a single machine can scale, and the cost climbs steeply as you scale vertically. That led me to the eventual solution: what about setting up a cluster of commodity machines and distributing the work among them? Scaling this way (horizontally) was attractive because cost increases roughly linearly as you add capacity. Now I was on to something, namely Hadoop.
Next I had to solve the issue of where to get the servers (without creating panic in IT or procurement). Hadoop is designed to run on a cluster of commodity servers, but purchasing them is not a trivial expense; commodity does not mean cheap in this case. Leveraging Amazon Web Services (AWS) for my experiment made sense because I could rent servers on an as-needed basis and had the flexibility to choose the hardware that best fit this particular application. GATE is CPU-intensive, so I chose an instance type designed for that kind of workload. AWS even offers a Hadoop-specific service called Elastic MapReduce, which makes things even easier for a developer type like me. This saved me from any upfront hardware cost and kept me from having to install Hadoop on every node. And I knew my finance department would be alright with it too, as long as I remembered to turn the servers off when I was done.
AWS adds some time to the learning curve, but I didn't find it to be that bad. Managing the MapReduce jobs through the API is key because the web interface is extremely limited: the only value I saw was the ability to monitor the status of job flows and terminate them. You can't add nodes and you can't resubmit jobs. So I decided to use the Ruby command-line interface instead, and it works great.
The next step was to modify the GATE process to fit the Hadoop MapReduce pattern. GATE works on whole XML documents, so each record passed to the map needed to be a whole XML document as well. Hadoop also works more efficiently against a small number of large files than a large number of small files. A SequenceFile should do the trick, so I wrote a process to package the hearing files up into one. Hadoop can be a little intimidating because of its complexity, but I found the beginning tutorials very easy to follow: I went through the single-node setup tutorial and had a Hadoop development environment up in no time. I loved the ability to run a Hadoop process in a single JVM (referred to as "Standalone Operation" in the documentation). This really unleashes the power of an IDE by allowing you to inspect objects in memory. Even when I ran into limitations because of GATE's voracious memory appetite, I was able to quickly switch to "Pseudo-Distributed Operation" (a local cluster of separate JVM processes) and keep development moving. All that was required was a simple command-line call.
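The packaging step is conceptually simple: one big file of (filename, document) records, so each map() call receives one whole hearing. The real job used Hadoop's `org.apache.hadoop.io.SequenceFile` with `Text` keys and values; the JDK-only sketch below shows the same record shape without needing Hadoop on the classpath. The class name and length-prefixed layout are my own, for illustration.

```java
import java.io.*;
import java.util.*;

// Sketch of the packaging step: bundle many small XML documents into one
// container of (filename, content) records, the same shape a SequenceFile
// of Text keys/values provides. Illustrative only; the real job used
// org.apache.hadoop.io.SequenceFile.
public class HearingPackager {

    // Write each (name, xml) pair as a length-prefixed record.
    public static void pack(Map<String, String> docs, DataOutputStream out)
            throws IOException {
        for (Map.Entry<String, String> e : docs.entrySet()) {
            byte[] key = e.getKey().getBytes("UTF-8");
            byte[] val = e.getValue().getBytes("UTF-8");
            out.writeInt(key.length);
            out.write(key);
            out.writeInt(val.length);
            out.write(val);
        }
    }

    // Read the records back; each value is a whole hearing document,
    // exactly what one map() call should see as its input record.
    public static Map<String, String> unpack(DataInputStream in)
            throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        while (in.available() > 0) {
            byte[] key = new byte[in.readInt()];
            in.readFully(key);
            byte[] val = new byte[in.readInt()];
            in.readFully(val);
            docs.put(new String(key, "UTF-8"), new String(val, "UTF-8"));
        }
        return docs;
    }
}
```

With the hearings bundled this way, Hadoop splits the one big file across nodes instead of paying per-file overhead hundreds of thousands of times.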
I deployed my assets to AWS by moving the SequenceFile and MapReduce code to S3. Next I wrote some scripts to start up a cluster, provision more servers, and run the GATE MapReduce process. I began with a two-node cluster to ensure the AWS port was working and to create a performance baseline, using the aforementioned 15,000 documents. Running the GATE process locally, without Hadoop, took around 1 hour and 15 minutes on a capable laptop. It took about the same on the small two-node Hadoop cluster; there is some overhead associated with Hadoop, so this was not a surprise. After I scaled the cluster up to 20 nodes and tuned it, the total time came down to around 4½ minutes (!). I think it is safe to say that many businesses would be pretty pleased with that performance improvement.
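As a quick sanity check on those numbers, the speedup and parallel efficiency work out like this; it is pure arithmetic on the figures above (15,000 documents, ~75 minutes on one machine, ~4.5 minutes on the tuned 20-node cluster), and the class is just a scratch calculation.

```java
// Arithmetic on the figures quoted in the text: 15,000 docs,
// ~75 min single-machine, ~4.5 min on a tuned 20-node cluster.
public class SpeedupMath {

    static double speedup(double singleNodeMinutes, double clusterMinutes) {
        return singleNodeMinutes / clusterMinutes;
    }

    static double parallelEfficiency(double speedup, int nodes) {
        return speedup / nodes;
    }

    public static void main(String[] args) {
        double s = speedup(75.0, 4.5);   // roughly 16.7x faster
        System.out.printf("speedup=%.1fx, efficiency=%.0f%%, throughput=%.0f docs/min%n",
                s, parallelEfficiency(s, 20) * 100, 15000 / 4.5);
    }
}
```

A ~16.7x speedup on 20 nodes is over 80% parallel efficiency, which is quite good given Hadoop's per-job overhead on a run this short.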
The punchline to all this is that renting those 20 AWS servers costs around $4. Rentals are billed in one-hour increments, so we could have processed even more documents for the same cost. Our data set is small by Big Data standards, but the approach scales easily: all we need to do is rent more servers.
We are pretty excited about these results. Stay tuned for the next post where we will integrate this Hadoop-based GATE process with a MarkLogic database.