
A couple of weeks ago, I attended a three-day course on
Hadoop from the guys at
Cloudera. Although I had heard and read about Hadoop before, this was a great opportunity to learn many details on Hadoop and find out about several tools that make up the Hadoop ecosystem. If, like me before, you only have a rough idea of what's in Hadoop, you should be interested in the post. Take what I say with a grain of salt, since I am no expert in Hadoop. However, because I am not an expert, I think I can guarantee a fresher look and you can trust I am not trying to sell you the project. But, if you
are an expert and you read the post, you might want to give feedback in case I got something wrong.
Hadoop is an open source java implementation of the
MapReduce framework introduced by Google. The main developer and contributor to Hadoop, however, is Yahoo. It might seem weird that one of Google's main competitors releases an open source version of a framework they introduced. More so, when Google has recently been granted a patent for it. However, it seems unlikely that Google can execute their patent.
One of the main reasons is that Map and Reduce functions have been known and used in functional programming for many years. Another reason is that Hadoop has gained a huge popularity as part of the Apache project. Enforcing the patent would not get Google much love from many companies that are now making a living of it or using it as an important component in their web architecture.
But, before we go into any more detail, it would be good to understand what can Hadoop be used for and when we should think about adopting it. First, and above all, Hadoop is a framework for
data analysis and processing. Therefore, if you have no data, or if you have no need to process it, do not continue with this post. Hadoop is sometimes presented as an alternative to traditional relational databases. However, it is not a database (although it does provide a noSQL one called HBase as one of its tools), it is a framework for distributing data processes. Ok, so here was the second keyword:
distribution. If you think you can do whatever you need to do in a single machine, you don't need Hadoop. However, you might want to look at it anyway, since distributing your data processes can be cheaper and also much more reliable. And finally, and related to the previous, using Hadoop only makes sense if you are processing
large datasets and by large I mean several TB's.
However, even if your problem fits into the three previous conditions (distributed processing of large datasets) you can still not be completely sure Hadoop is your solution. Distributed relational databases are still an option. I won't go into the details, but you might want to read at some voices that are recently stepping in to defend the scalability of relational databases and their applicability in highly demanding large datasets. These two posts are good reading: "
Getting Real about NoSQL and the SQL-Isn't-Scalable Lie" and "
SCALE 8x: Relational vs. non-relational". I would also recommend this recent presentation on "
What every developer should know about database scalability"
So now that we have some intuition of when Hadoop may be of interest, let me introduce the two main issues behind Hadoop: MapReduce and HDFS.
MapReduce is a programming model introduced by Google, which is at the core of Hadoop. It is based on the use of two functions taken from functional programming: Map and Reduce. Map processes a (key,value) pair into a list of intermediate (key,value) pairs. Reduce takes an intermediate key and the set of values for that key. Both the mapper and reducer functions are written by the user. The framework groups together intermediate values associated with the same key in order to pass them to the corresponding Reduce.
MapReduce claims to be a sufficiently generic programming model that most data processing tasks can be decomposed in such a way. If you are interested in learning more, I recommend you start with
Google's paper. You can also take a look at
Google's set of videos introducing the framework. If you want a more "academic" presentation, you might want to take a look at
these UC Berkeley classes.
The other important core issue in Hadoop I mentioned before is the
Hadoop Distributed File System (
HDFS). HDFS is the equivalent of the Google File System (GFS) used in the original MapReduce framework. This filesystem is optimized for reading in streaming large files (from several gigabytes to terabytes). Note that HDFS does not allow, for instance, to edit a file once it has been written.
Ok, so now we have the basics in place: how do we use Hadoop? Since Hadoop is written in
Java, the most straightforward to get started is by using its Java API. If you look at the
Hadoop Map/Reduce tutorial, for instance, you will see how the framework is introduced through its Java API.
But, if you want to use Hadoop but would rather keep away from Java, there are plenty of other options. First, there is
Hadoop Streaming, which allows to use arbitrary program code with Hadoop. Stdin and Stdout are used for data flow, and each mapper and reducer is defined in a separate program. This comes in very handy if you want to use Hadoop through a scripting language. Now, if you want to have a greater performance in your mapper and reducer functions and would like to call compiled C++ code instead, your solution is called
Hadoop Pipes.
Now, what if you still would like to access the Hadoop framework but do not fancy the MapReduce programming mode? In other words, is there any higher-level and more programmer friendly way to interface with Hadoop? And the answer is, of course, yes. There are several ways to do this but I will mention two of them:
Hive and
Pig.
Hive is a tool developed at Facebook that allows for an SQL-like access to the Hadoop infrastructure. Although the project is not very mature yet, this is a very interesting option to consider and it seems to be giving
very good results to Facebook. The other option is to use
Pig, developed by Yahoo. Pig provides a higher-level language called Pig Latin that increases productivity, especially if you are dealing with non-java programmers that are closer to the domain (e.g. data analysts). Pig Latin is a dataflow language and it even has a graphical front-end plugin for Eclipse called
PigPen.
I would not like to finish this personal overview of the Hadoop Ecosystem without mentioning
Mahout, a project for distributed machine learning with Hadoop. Among its examples,
Mahout includes an implementation of several collaborative filtering algorithms for recommendation. I would also encourage you to take a look at this list of
academic papers about or using Hadoop.
At this point I have to say that I have mixed feelings about Hadoop and about MapReduce itself. Although it is a powerful framework with immediate application to real-life problems that involve very large datasets, the model feels more like a kludge than a paradigm shift. I understand why people turn to tools like Hive and Pig that hide the MapReduce complexity behind more friendly models such as ER and Dataflow networks. Providing a framework that is both efficient, usable but also conceptually illuminating is definitely an area to work in the future. And it seems that I am not the only one thinking along these lines. Even Yahoo themselves are
looking into new ways that go beyond Hadoop and MapReduce.