Getting Started with Hadoop

Avalon is helping a number of our clients derive business benefit from Hadoop. In that process, we see a very common problem: many of the great developers and architects we encounter just don’t know where to start in building a base level of technical knowledge in Hadoop, and they’re too busy doing their real jobs to figure out that path on their own. Sound familiar?

So we thought we’d put together this “starter kit” to frame a productive self-study path for busy developers who are eager to get started with Hadoop. Enjoy!

Reading:

Books get out of date pretty quickly; however, many of our engineers have read and recommend “Hadoop: The Definitive Guide” as a good starting resource.

Introductory Online References:

This reference from Yahoo provides good coverage of all the concepts (although it does not follow the new Hadoop API).

This reference is straight from the Apache Hadoop distribution and is a good introduction to the initial concepts.
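
If you want a feel for the core idea before opening either reference, here is a minimal sketch of the map, shuffle, and reduce flow in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are our own illustrative choices, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

On a real cluster the framework runs the map and reduce steps in parallel on many nodes and handles the grouping for you; this sketch only shows the shape of the computation.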

Hadoop Distributions (there are many to choose from):

Our main recommendation is to just download and try Hadoop. That is how our engineers learned most of what we know today.

We recommend starting with Hortonworks, for three reasons:

  1. all of their software is open source (not open core plus proprietary, like most);
  2. they have the largest number of contributors to the Apache Hadoop project;
  3. all of their software runs at Yahoo, which includes individual clusters of 4,000 nodes and a total installation in the tens of thousands.

Again, download and try Hadoop.  The Hortonworks Sandbox is a great way for you to get started.

And last but not least, resources to help you take it to the next level:

A good primer on selecting servers for your Hadoop cluster.

This article goes into more detail about the different node types and what should be selected for each.

And finally, another useful reference on clusters.

Start with these resources, and you’ll be well on your way to MapReducing with the best of them!

About Tony Jewitt

Tony Jewitt is Avalon's Vice President of Big Data Solutions. He has more than 25 years of experience within the information technology industry. Prior to Avalon, Jewitt was CEO of The Hive Group, a leading data visualization software company; Senior Vice President of Marketing at InStranet (acquired by Salesforce.com); and Group Vice President of Extranet Business Development at Business Objects, where he led the company's pioneering work in extranets and e-business. During his tenure with Business Objects, Jewitt grew the extranet division from start-up to over $40m in annual revenues in four years. Before joining Business Objects, he served as Director of Business Development for Oracle Corporation's Education Division and worked as a consultant and systems engineer at EDS. Jewitt holds a Bachelor of Science in Electrical and Computer Engineering from the University of Texas at Austin.
