One design pattern that both Google and Facebook share is the ability to distribute computations among large clusters of machines that all share a common data source. The pattern is called Map/Reduce, and Hadoop is an open source implementation of this. This article is an introduction to Hadoop. Even if you donʼt currently have a massive scaling issue, it can be worthwhile to become familiar with Map/Reduce as a concept, and playing with Hadoop is a good way to do that.
What exactly is Map/Reduce? The central idea behind Map/Reduce is distributed processing. You have a cluster of machines that host a shared file system where the data is stored, and allow for distributed job management.
Processes run across the cluster are called jobs. These jobs have two phases: a map phase where the data is collected, and a reduce phase where the data is further processed to create a final result set.
Letʼs take the example of finding a soul mate on a dating site. All the profiles for millions of users are stored in the shared file system, which is spread across a cluster of machines.
We submit a job to the cluster with the profile pattern that we’re seeking. In the map phase, the cluster finds all the profiles that match our requirements. In the reduce phase these profiles are sorted to find the top matches, among which only the top ten are returned.
Hadoop is a Map/Reduce framework that’s broken into two large pieces. The first is the Hadoop File System (HDFS). This is a distributed file system written in Java that works much like a standard Unix file system. On top of that is the Hadoop job execution system. This system coordinates the jobs, prioritizes and monitors them, and provides a framework that you can use to develop your own types of jobs. It even has a handy web page where you can monitor the progress of your jobs.
From a high level the Hadoop cluster system looks like Figure 1, “The Hadoop cluster architecture”.
Clients connect to the cluster and submit jobs. Those jobs then go to MapReduce agents, which process the jobs on the data in the local HDFS portion of the file system. All of the HDFS nodes talk to the NameNode to register themselves within the cluster.
Now, if you’re looking at all this and saying, “Thatʼs all well and good, but itʼs Java and we use .NET,” there’s no need to worry. Hadoop is a platform, and you can have clients to this platform in whatever language you want. And because Hadoop is so good at being a distributed processing platform, itʼs garnering support across all languages.
In addition to the Hadoop core, Iʼm going to introduce the Hive system. Hive is a distributed SQL database that sits on top of Hadoop and HDFS. There are three reasons why it’s worth knowing about Hive. First, because it makes it much easier to use Hadoop, to introduce it as a technology that you can use in your projects today. Second, because it uses SQL, a technology that most of use are familiar with. Third, because itʼs in production use by Facebook, which means that it’s robust, stable, and well-maintained.
With Hadoop and Hive you’ll be able to host databases that hold billions of records and run queries on them (which are implemented as Hadoop jobs) in a reasonable amount of time. That time will be adjustable by changing the size and performance characteristics of the machines in the cluster.
Just a word now to put you at ease about the whole cluster issue. Hadoop allows you to maintain and deploy a cluster of machines, but it’s unnecessary to have a cluster of machines just to play with it. In fact, any old single machine will do. You can run HDFS, Hadoop, and Hive all locally, and it works just fine.
Let’s now dig into each of the technologies in more depth.
Go to the Hadoop Releases page and download the stable Hadoop release. The Hadoop release has both the job execution framework and the HDFS system built into it. Unzip the archive into a directory in your file system, and cd into it. For the purposes of this tutorial, we’ll be running Hadoop in what’s called “pseudo-distributed mode.” That is, Hadoop will simulate a distributed cluster by running each node in a separate Java process. To configure Hadoop for this mode, we’ll need to edit three configuration files, all of which are located in the
Just replace the configuration element in each file with the following information:
<configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> </configuration>
<configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration>
<configuration> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> </property> </configuration>
The other configuration setting that’s required is in the
conf/hadoop-env.sh file. Open that file and set JAVA_HOME to point to the root of your Java installation. For example, on Mac OS X:
With all this in place, you should be able to run the following command from the Hadoop directory:
Hadoop 0.20.2 Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707 Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010
HDFS is a valuable resource in its own right. With HDFS you can take a set of commodity machines and build out a single federated file system. If more disk space is required, you can either add more machines or add more disks to each machine, or both.
You can access your HDFS through several methods. There’s an API that your programs can use to connect to the file system and retrieve, add, remove, or update files and directories. A command line interface can also be used, as well as a web interface to browse the directories and access the files. There’s even a Fuse implementation (MountableHDFS) where you can mount files right on your desktop.
There are two primary elements to the HDFS installation. One is the NameNode server. This is a single server process that’s located on one machine. Each of the HDFS machines connects to the NameNode server to coordinate its identity and manage its piece of the file system.
The other element is the HDFS node running on the each machine. It handles the storing of data on that particular machine, and connecting to the NameNode.
To get started, let’s first format the NameNode:
bin/hadoop namenode -format
Now we’ll start all the Hadoop daemons (this includes the HDFS name node and data node, as well as the Hadoop job and task trackers):
With everything up and running, you should now be able to execute file system commands using the hadoop executable. For example:
bin/hadoop fs -ls hdfs:/
Found 1 items drwxr-xr-x - jherr supergroup 0 2010-06-25 15:10 /tmp
-fs (meaning file system) parameter has many commands including the usual ls, mv, rm, tail, head, and so forth. There are also commands to move files into the HDFS file system from the local file system, moveToLocal, and vice versa, moveFromLocal.
We can add a file to the file system like this:
$ bin/hadoop fs -put test.xml hdfs:/ $ bin/hadoop fs -ls hdfs:/ Found 2 items -rw-r--r-- 1 jherr supergroup 17 2010-06-25 15:13 /test.xml drwxr-xr-x - jherr supergroup 0 2010-06-25 15:10 /tmp
You can also view this file system in your browser. Go to
http://localhost:50070, which is your NameNode’s dashboard. Click on the Browse the filesystem link and you’ll see your file system, as shown in Figure 2, “The HDFS in the browser”.
As you can see, a distributed file system based in Java that runs across Unix, Windows, and Mac, as well as working on commodity hardware, is a valuable resource all on its own. Having said that, itʼs worth noting that HDFS offers no panacea. It’s fairly inefficient time-wise for random access, and when storing lots of small files. HDFS is optimized for its central purpose, which is to provide a shared data store for Hadoop jobs. If you’re looking for an extensible file system for images, HTML files, or similar, you might look at NFS, or using a hosted system like Amazonʼs S3.
Now that the underlying HDFS is configured and running, itʼs time to do the same for the JobTracker and MapReduce portions of Hadoop.
If the JobTracker is running, you should be able to navigate to port 50030 on the machine thatʼs running the tracker. The dashboard is shown in Figure 3, “The Hadoop JobTracker”.
With the JobTracker running and the HDFS set up underneath, you’re ready to install Hive and do some real work with distributed SQL.
Hive sits on top of Hadoop, so once your cluster is set up, itʼs a snap for Hive to be up and running. The process starts with downloading and installing Hive. You’ll need to check out Hive from the Subversion repository and compile it with ant, as described in that document.
Once it’s installed and the Hive environment variables are set up, you can run the Hive command line client like so:
Hive history file=/tmp/jherr/hive_job_log_jherr_201006251643_880032913.txt
OK Time taken: 5.508 seconds
The show tables command shows that there are no tables currently in the distributed database. From here you can follow the instructions on the Getting Started page to add a large movie database to the Hive installation.
This database provides a good starting point to experiment with queries, and to view how Hive creates Hadoop jobs on the fly to satisfy the query. Take, for example, a very simple query on the movie database, as shown below:
SELECT movieid, AVG( rating ) FROM u_data GROUP BY movieid;
Total MapReduce jobs = 1 ... Starting Job = job_201006251641_0002, Tracking URL = http://localhost: 50030/jobdetails.jsp?jobid=job_201006251641_0002 Kill Command = /Users/jherr/hadoop/bin/../bin/hadoop job - Dmapred.job.tracker=localhost:8021 -kill job_201006251641_0002 2010-06-25 04:51:40,648 map = 0%, reduce =0% ... 2010-06-25 04:52:04,895 map = 100%, reduce =100% Ended Job = job_201006251641_0002 OK 1 3.8783185840707963 2 3.2061068702290076 ... 1682 3.0 Time taken: 29.449 seconds
Iʼve removed some of the details but you can see the general flow. The Hive client creates a job to satisfy the query, which you can monitor from the JobTracker web application. The Hive client then monitors the job and produces the result set.
You can see an example of the job page when itʼs completed in Figure 4, “The completed job”.
There’s even a cool graphic section at the bottom of the page that shows the progress of the Hadoop cluster nodes as they map all the data, then reduce it down to the expected query results. This is shown in Figure 5, “The job status graph”.
In practice you can use the command line client to connect to Hive or one of the drivers, such as HiveODBC. In addition you can provide your own code to Hive to add custom query functions and filtering that’s then distributed into the cluster. The very handy Hive Getting Started guide shows an example of this in Python.
It’s important to note that there are several ways to make use of Hadoop; Hive is just one tool that utilizes Hadoop’s resources. Itʼs certainly not. There are lots of projects that use Hadoop in innovative ways, such as:
Of course, you can also develop your own project to leverage the distributed power of Hadoop and HDFS.
Hadoop is a solution for your scaling problems today, as well as a framework for developing your own highly scalable solutions in the future. Itʼs a well-written and documented solution that’s open source and very popular. There are even a number of books out on Hadoop that can give you an in-depth look into this technology.
Ultimately, Hadoop is an emerging technology you should become familiar with, as it’s most likely to come in handy for your future projects.