Hadoop: Difference between revisions
Jump to navigation
Jump to search
No edit summary |
|||
Line 25: | Line 25: | ||
= Books = | = Books = | ||
Hadoop: The Definitive Guide | Hadoop: The Definitive Guide | ||
= Distributed File System (DFS) = | |||
* For very large files: TBs, PBs | |||
* Each file is partitioned into chunks, typically 64MB | |||
* Each chunk is replicated several times (>=3), on different racks, for fault tolerance | |||
* Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source) |
Latest revision as of 09:56, 16 July 2014
Installation
Download the tarball (hadoop-2.4.1.tar.gz) from Hadoop website.
tar xzvf hadoop-2.4.1.tar.gz
Make sure JAVA_HOME is set to a Java installation. If it is not available, we can include it by editing etc/hadoop/hadoop-env.sh and specify the JAVA_HOME variable. For example
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
It is convenient to put the Hadoop binary directory on your command-line path. For example, I append the following 2 lines to ~/.bashrc file.
export HADOOP_INSTALL=/home/brb/hadoop-2.4.1 export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
Check that Hadoop runs by typing
hadoop version
Public datasets
- http://stackoverflow.com/questions/12915128/small-data-sets-for-hadoop-mapreduce
- NOAA weather data from 1901 to 2014.
Books
Hadoop: The Definitive Guide
Distributed File System (DFS)
- For very large files: TBs, PBs
- Each file is partitioned into chunks, typically 64MB
- Each chunk is replicated several times (>=3), on different racks, for fault tolerance
- Implementations: Google's DFS (GFS, proprietary) and Hadoop's DFS (HDFS, open source)