Hadoop Ecosystem Cheat Sheet

For someone evaluating Hadoop, the long list of components in the Hadoop ecosystem can be overwhelming. Below you’ll find a reference table of names you may have heard in discussions of Hadoop, each with a brief description. Image courtesy of Hortonworks.

HDP 2.2 Components

| Name | Description |
| --- | --- |
| HDFS | Hadoop’s underlying distributed file system |
| YARN | Provides resource management for a Hadoop cluster. An improvement introduced in Hadoop 2.0, YARN enables you to use multiple data processing engines on the same cluster |
| MapReduce | Batch processing framework that, along with HDFS and YARN, forms the core of the Hadoop platform |
| Hive | Provides a SQL interface to Hadoop. Allows those familiar with SQL to begin running analytics in Hadoop right away |
| Pig | A scripting platform whose language, Pig Latin, provides high-level analytics capabilities |
| Ambari | Web-based cluster management tool. Allows configuration and management of a Hadoop cluster from one central web UI |
| Oozie | Hadoop’s job scheduler and workflow management tool. Allows you to create workflows (directed acyclic graphs of Hadoop actions) and coordinators (scheduled, repeating workflows) |
| Falcon | A framework for managing data processing pipelines. Allows you to manage data flow between multiple clusters, data lifecycle (retention and eviction), and data replication |
| Sqoop | Tool for importing and exporting data between Hadoop and structured data stores such as relational databases |
| HBase | A fault-tolerant NoSQL database that provides random, real-time access to data stored in Hadoop. Designed to handle tables with billions of rows and millions of columns |
| Accumulo | A sorted, distributed key-value data store with cell-level security |
| ZooKeeper | A centralized service that helps distributed services (such as HBase) synchronize and maintain their configuration |
| Storm | A real-time computation system designed to handle large streams of data within Hadoop |
| Kafka | Publish-subscribe messaging system, typically used in conjunction with Storm to buffer streams and to provide reliability under high throughput |
| Spark | A distributed computation engine with a simple, high-level API. Allows users to persist a dataset in memory, which can greatly improve performance for iterative algorithms |
| Solr | Enables you to index textual data via Hadoop, providing full-text search capabilities |
| Knox | A REST API gateway that provides authentication and access services to a Hadoop cluster and serves as a single point of entry |
| Ranger | Tool that enables centralized security policy administration for a Hadoop cluster (formerly known as Argus and XASecure) |
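
To make the MapReduce entry a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is plain Python and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster the same three steps would run in parallel across many machines, reading from and writing to HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence (illustrative, not a Hadoop job)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "YARN schedules work and HDFS stores data",
    ]
    print(word_count(sample))
```

The point of splitting the work this way is that the map and reduce functions only ever see one record or one key at a time, which is what allows a framework to spread them across a cluster.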



About Adam Westerman

Hadoop Consultant at Avalon Consulting, LLC
