Hadoop: Difference between revisions

From 太極
Jump to navigation Jump to search
(Created page with "= Installation = Download the tarball (hadoop-2.4.1.tar.gz) from [http://hadoop.apache.org/ Hadoop] website. <pre> tar xzvf hadoop-2.4.1.tar.gz </pre> Make sure JAVA_HOME is s...")
 
No edit summary
 
(One intermediate revision by the same user not shown)
Line 8: Line 8:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
</pre>
</pre>
If everything works well, we can check that Hadoop runs by typing
It is convenient to put the Hadoop binary directory on your command-line path. For example, I append the following 2 lines to ~/.bashrc file.
<pre>
export HADOOP_INSTALL=/home/brb/hadoop-2.4.1
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
</pre>
 
Check that Hadoop runs by typing
<pre>
<pre>
hadoop version
hadoop version
Line 19: Line 25:
= Books =
= Books =
Hadoop: The Definitive Guide
Hadoop: The Definitive Guide
= Distributed File System (DFS) =
* For very large files: TBs, PBs
* Each file is partitioned into chunks, typically 64MB
* Each chunk is replicated several times (>=3), on different racks, for fault tolerance
* Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source)

Latest revision as of 09:56, 16 July 2014

Installation

Download the tarball (hadoop-2.4.1.tar.gz) from Hadoop website.

tar xzvf hadoop-2.4.1.tar.gz

Make sure JAVA_HOME is set to a Java installation. If it is not available, we can include it by editing etc/hadoop/hadoop-env.sh and specify the JAVA_HOME variable. For example

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

It is convenient to put the Hadoop binary directory on your command-line path. For example, I append the following 2 lines to ~/.bashrc file.

export HADOOP_INSTALL=/home/brb/hadoop-2.4.1
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

Check that Hadoop runs by typing

hadoop version

Public datasets

Books

Hadoop: The Definitive Guide

Distributed File System (DFS)

  • For very large files: TBs, PBs
  • Each file is partitioned into chunks, typically 64MB
  • Each chunk is replicated several times (>=3), on different racks, for fault tolerance
  • Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source)