Hadoop

From 太極
Jump to navigation Jump to search

Installation

Download the tarball (hadoop-2.4.1.tar.gz) from Hadoop website.

tar xzvf hadoop-2.4.1.tar.gz

Make sure JAVA_HOME is set to a Java installation. If it is not available, we can include it by editing etc/hadoop/hadoop-env.sh and specify the JAVA_HOME variable. For example

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

It is convenient to put the Hadoop binary directory on your command-line path. For example, I append the following 2 lines to ~/.bashrc file.

export HADOOP_INSTALL=/home/brb/hadoop-2.4.1
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

Check that Hadoop runs by typing

hadoop version

Public datasets

Books

Hadoop: The Definitive Guide

Distributed File System (DFS)

  • For very large files: TBs, PBs
  • Each file is partitioned into chunks, typically 64MB
  • Each chunk is replicated several times (>=3), on different racks, for fault tolerance
  • Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source)