Latest revision as of 09:56, 16 July 2014

Installation

Download the tarball (hadoop-2.4.1.tar.gz) from Hadoop website.

tar xzvf hadoop-2.4.1.tar.gz

Make sure JAVA_HOME is set to a Java installation. If it is not available, we can include it by editing etc/hadoop/hadoop-env.sh and specify the JAVA_HOME variable. For example

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

It is convenient to put the Hadoop binary directory on your command-line path. For example, I append the following 2 lines to ~/.bashrc file.

export HADOOP_INSTALL=/home/brb/hadoop-2.4.1
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

Check that Hadoop runs by typing

hadoop version

Public datasets

http://stackoverflow.com/questions/12915128/small-data-sets-for-hadoop-mapreduce
NOAA weather data from 1901 to 2014.

Books

Hadoop: The Definitive Guide

Distributed File System (DFS)

For very large files: TBs, PBs
Each file is partitioned into chunks, typically 64MB
Each chunk is replicated several times (>=3), on different racks, for fault tolerance
Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source)

@@ Line 25: / Line 25: @@
 = Books =
 Hadoop: The Definitive Guide
+= Distributed File System (DFS) =
+* For very large files: TBs, PBs
+* Each file is partitioned into chunks, typically 64MB
+* Each chunk is replicated several times (>=3), on different racks, for fault tolerance
+* Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source)