Archive for the ‘Cloud Computing’ Category

Massive Performance Gains and Cost Savings using Hadoop

Tuesday, November 22nd, 2011

Tony Jewitt, our VP of Big Data, challenged me to impress him with an example of the power of emerging technology in solving real world Big Data problems.  I needed a large dataset with a complicated processing requirement to do that and decided to extract witness information from Senate and House hearings for the past 30 years.  I set about attempting to make some sense out of this intimidating data set by loading it into our favorite NoSQL database and presenting it in a way that would let people find meaningful information effectively.

For the purposes of my test I chose to try and expose an interface where a user can filter Senate, House and joint hearings based on who participated as a witness in the hearing.  Conceptually, this is no problem for MarkLogic — except that there was no witness meta-data attached to the docs.  So I knew I would have to put in a little work and find it ourselves.  How?  By leveraging the GATE processing framework to extract entities.  I developed a utility that allowed me to pass in a hearing document (xml format) and get back a witness-enriched xml document.

The only problem is that it would have taken FOREVER to do this across all the documents.  There were hundreds of thousands of hearings to process.  Given that it takes a little over an hour to process just 15,000 documents, that would have been way too slow in processing the entire body of data especially considering that I knew I would need to repeatedly run the process on large data sets to improve the quality and perfect the algorithm.

So at that point I had to think about what my options were for moving forward.  I could have tried to get a bigger machine — but there is a limit to how many transistors can fit on a board and it becomes less cost effective as we scale vertically.  That led me to the eventual solution — what about setting up a cluster of commodity machines and distributing the work among those?  Scaling this way (horizontally) was attractive because the cost increases linearly as we scale.  Now I was on to something, namely Hadoop.

Next I had to solve the issue of where to get the servers (without creating panic in IT or procurement).  Hadoop is designed to run on a cluster of commodity servers but purchasing them is not a trivial expense. Commodity does not mean cheap in this case.  So leveraging Amazon Web Services (AWS) for my experiment made sense because I could rent them on an as-needed basis and have flexibility to choose the hardware that best fit this particular application. GATE is CPU-intensive so I chose to rent a server designed for that type of operation.  AWS even offers a Hadoop-specific service called Elastic MapReduce which makes things even easier for a developer type like me.  This saved me from any upfront hardware cost and kept me from having to install Hadoop on every node.  And I knew my finance department would be alright with that too as long as I remembered to turn the servers off when I was done.

AWS adds some time to the learning curve but I didn’t find it to be that bad.  Managing the mapreduce jobs through the API is key because the web interface is extremely limited.  The only value I saw is the ability to monitor the status of job flows and terminate them.  You can’t add nodes and you can’t resubmit jobs.  So I decided to use the Ruby command line interface and it works great.

The next step was to modify our GATE process to fit into the Hadoop mapreduce pattern.  GATE works on whole xml documents so each record passed to the map needed to be a whole xml doc as well.  Hadoop works more efficiently against a small number of larger files rather than a large number of small files.  A SequenceFile should do the trick — so I wrote a process to package the hearings files up into that. Hadoop can be a little intimidating because of it’s complexity but I found the beginning tutorials very easy to follow.  I followed the single node setup tutorial and had a Hadoop development environment up in no time.  I loved the ability to run a Hadoop process in a single jvm (referred to as “Standalone Operation” in the documentation).  This really unleashes the power of an IDE by allowing you to inspect objects in memory.  Even when I ran into limitations because of GATE’s voracious memory appetite, I was able to quickly run it in “Pseudo-Distributed Operation” (a local cluster of separate jvm processes) and keep development moving.  All that was required was a simple command line call.

I deployed my assets to AWS by moving the SequenceFile and MapReduce code to S3.  Next I wrote some scripts to start up a cluster, provision more servers and run our GATE MapReduce process.  I started off with a couple nodes to start to ensure the AWS port was working and to create a performance baseline and started out with the aforementioned 15,000 documents.  When running the GATE process locally, without Hadoop, I found it took around 1 hour and 15 minutes to run on a capable laptop. It took about the same on our small 2 node Hadoop cluster.  There is some overhead associated with Hadoop so this was not a surprise. After I scaled our cluster up to 20 nodes and tuned it, the total time came down to around 4 and a ½ minutes (!).  I think it is safe to say that many businesses would be pretty pleased with that performance improvement.

The punchline to all this is that renting 20 AWS servers costs around $4.  Rentals are billed in hour increments so we could process even more documents for the same cost.  Our data set is small by Big Data standards but can easily scale.  All we need to do is rent more servers.

We are pretty excited about these results.  Stay tuned for the next post where we will integrate this Hadoop-based GATE process with a MarkLogic database.

Installing MarkLogic on an EC2 Micro Instance – Free for 1 year!

Wednesday, November 3rd, 2010


Perhaps you’ve heard about Amazon EC2′s new promotion, the AWS Free Usage Tier. New AWS customers can run a Linux Micro instance free for 1 year. Micro Instances consist of 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform.

Earlier this year MarkLogic announced support for Amazon EC2 as part of their cloud strategy. With this they provide a preconfigured Amazon EC2 AMI (Amazon Machine Image) for MarkLogic. However, this AMI cannot be used to boot a micro instance within the free tier boundaries because micro instances must be booted from an EBS image. Also, the free tier includes up to 10GB of EBS.

Have no fear! With minimal effort you can in fact run MarkLogic on an EC2 micro instance, though keep in mind that it is quite resource constrained. Below is a recorded screencast demo of the process using this CentOS base image provided by RightScale.

Alternatively, I made the image I created in the video public. It’s AMI ID is ami-4682752f.

My First Experience with Amazon EC2

Wednesday, October 13th, 2010

A client needed a prototype server that was low cost, had potential to scale, and whose management could easily be transferred. We needed a single linux-based server that we could turn on and off for security and to control cost. Amazon’s EC2 service fit the bill well, so we set up an account and I got started on implementation.

A few of us at Avalon Consulting, LLC had already used EC2 for a couple other clients. It seemed easy enough and I was eager to get my hands dirty and see how this service really worked. I thought this would be a simple procedure, but like any project we hit bumps in the road that made it more of a challenge than expected.

The first thing I did was create the instance, being sure to select the proper options so that billing was credited properly. This is when I hit the first issue: all data was destroyed if the instance was stopped. After much research I found that I had to create an instance from an Elastic Block Store (EBS) Image. Once I created the instance I needed to attach a separate EBS volume and figure out how to mount (attach) it to the instance.

Having resolved those issues, I hit my second issue: the DNS name assigned to the instance changed every time it was stopped and started. Since we were using this internally, we were hoping to avoid having to give it a proper domain name and just use the name EC2 assigned it. The solution to this was to get an Elastic IP address. Once EC2 assigned us an IP address, we found a domain and changed its appropriate entries to point to our new site.

This doesn’t sound like too much of an effort, but two things made it difficult. First, documentation is not easy to find or follow. There was many references available, but bringing everything together for what I thought was a basic implementation required understanding a lot of the system as a whole. For example, they provide a Getting Started Guide, Developer Guide and User Guide among others. Although there is a good amount of information, some chapters in one guide I would have expected to be in another, so every time I go back a refer to them I have to fish through each one to find what I’m looking for. Buried in one of them is a link to the PDF that I needed to create an instance based on an EBS volume.

The second difficulty was cost. The model EC2 and most cloud-type services offer are based on what services used and at what quantity. Estimating a monthly cost is not very straight-forward. Some of the costs to consider:

  • The instance(s) themselves
  • Bandwidth used
  • An EBS volume
  • I/O access to an EBS volume
  • Number of API calls to the account
  • Static IP address

Some of these items are not a simple cost model. For example, there is no cost to assign an IP address, but if it is unused, such as when an instance is not running, there is a small cost per hour. The cost of an instance depends on the number and type of CPUs and RAM available for the instance. Hourly costs can be reduced by purchasing a “reserved instance.”

So what did I learn from this experience? I learned the basics for establishing an EC2 presence. I also learned that an administrator’s job is not going away any time soon. One of my coworkers had no experience with cloud services and asked how one would troubleshoot application issues. His assumption was since this was in the cloud, a lot of administration had been relinquished to technology. Setting up an EC2 system and reading the documentation assumes a certain level of proficiency with network and system administration.

Overall this was a good experience. The learning curve was a little slow, mostly due to documentation. Estimating costs is not an exact science. But EC2 and other cloud services have a lot to offer, and can offload hardware and data center woes while allowing the customer full control over their systems.