How to Install and Configure Single Node Hadoop Cluster

Apache Hadoop is a framework for processing large data sets across one or more clusters of machines using a simple programming model. It is designed to scale up to thousands of machines, each offering local computation and storage. In this article, we will discuss how to install and configure a single-node Hadoop cluster.

An enterprise-level Hadoop installation requires a multi-node cluster configuration, but for a development environment or for learning purposes, a single-node Hadoop cluster is sufficient. Let's walk through the installation on Ubuntu.


Installing Hadoop

Step 1: Download and install the JDK.

Install the JDK on Ubuntu using the apt package manager. First, update and upgrade the package index by running the following commands.

$ sudo apt-get update
$ sudo apt-get upgrade

Then, install the JDK with the following command.

$ sudo apt-get install default-jdk

Step 2: Download and extract the Installation package

Download the required version of the Hadoop binary from the Apache Hadoop releases page. In this tutorial, we will use version 2.10.0 and download the package with the wget tool.

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

Then, extract the downloaded archive by running the following command.

$ tar -xvf hadoop-2.10.0.tar.gz

Step 3: Add the Hadoop path and Java path to the bash file.

Open the ~/.bashrc file and add the following lines.

export HADOOP_HOME=<HADOOP_BINARY_PATH>
export JAVA_HOME=<JAVA_BINARY_PATH>
export PATH=$PATH:$HADOOP_HOME/bin
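
For example, assuming Hadoop was extracted to /home/ubuntu/hadoop-2.10.0 and the default-jdk package placed the JDK under /usr/lib/jvm/default-java (both paths are assumptions; adjust them to your machine), the entries would look like this:

export HADOOP_HOME=/home/ubuntu/hadoop-2.10.0
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH=$PATH:$HADOOP_HOME/bin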

To apply the changes to the current session, source the .bashrc file.

$ source ~/.bashrc

Step 4: Configure Hadoop

As part of configuring the Hadoop cluster, we need to edit the following files:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
  • hadoop-env.sh

Let's look at each configuration file one by one.

core-site.xml

core-site.xml contains the settings for the Hadoop core, such as I/O settings and the default filesystem used by HDFS and MapReduce. Inside the <configuration> tag, edit the property name and value as required. In this tutorial, we keep mostly default values.
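
For reference, a minimal core-site.xml for a single-node setup usually defines only the default filesystem URI. The host and port below (hdfs://localhost:9000) are a common convention, not values mandated by this article:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>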

hdfs-site.xml

hdfs-site.xml contains the settings for HDFS, such as the NameNode and DataNode storage directories, the replication factor, the block size and more.
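
As a sketch, a single-node hdfs-site.xml often sets the replication factor to 1 and points the NameNode and DataNode at local directories. The paths below are illustrative assumptions, so adjust them to your system:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/ubuntu/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/ubuntu/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>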

mapred-site.xml

mapred-site.xml contains the settings for the MapReduce framework, such as the number of CPU cores and the memory available to the mapper and reducer processes, and more.

Also, this file is not available directly in the configuration directory; it is shipped as a template, which we can rename as follows.

$ mv mapred-site.xml.template mapred-site.xml

Then, open the file and edit it as required.
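
At a minimum, a single-node setup typically tells MapReduce to run on YARN. The snippet below is a common baseline rather than values taken from this article:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>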

yarn-site.xml

The yarn-site.xml file contains the settings for YARN, such as ResourceManager and NodeManager memory limits and the auxiliary service used for the MapReduce shuffle.
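
A minimal yarn-site.xml for a single-node cluster commonly enables just the MapReduce shuffle service, as sketched below (again, a typical baseline rather than this article's exact values):

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>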

hadoop-env.sh

This is a simple script file that contains the environment variables required to run the Hadoop daemons.
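
At a minimum, set JAVA_HOME explicitly in this file so the daemons can find the JDK. The path below assumes the Ubuntu default-jdk install location and should be adjusted if yours differs:

# in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/default-java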

Step 5: Format the NameNode.

Formatting the NameNode initializes a new HDFS filesystem and applies the configuration done earlier. This can be done with the hadoop executable as follows.

$ bin/hadoop namenode -format

Note: Formatting the NameNode again will erase all the data in HDFS. So, be careful before re-running it.

Step 6: Start the Daemons.

Following are the daemons that need to be started. The daemon scripts below are located in the $HADOOP_HOME/sbin directory.

  • NameNode – Start it by running: $ ./hadoop-daemon.sh start namenode
  • DataNode – Start it by running:  $ ./hadoop-daemon.sh start datanode
  • ResourceManager – Start it by running: $ ./yarn-daemon.sh start resourcemanager
  • NodeManager – Start it by running: $ ./yarn-daemon.sh start nodemanager
  • JobHistoryServer – Start it by running: $ ./mr-jobhistory-daemon.sh start historyserver

If you want to start them all together, run the following script instead.

$ ./start-all.sh

To check all the running daemons, run

$ jps

This will give you output similar to the following:

22343 NameNode
28534 NodeManager
28378 ResourceManager
22654 JobHistoryServer
24342 Jps
24653 DataNode

If you see this output, the single-node Hadoop configuration is done. You can check the NameNode web interface on port 50070 (http://localhost:50070/dfshealth.html).

That is all. Single node Hadoop cluster configuration is done.
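
As an optional sanity check (not part of the original steps), you can create a directory in HDFS and list the root to confirm the filesystem is working; the directory name /test here is arbitrary:

$ bin/hdfs dfs -mkdir /test
$ bin/hdfs dfs -ls /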

Conclusion

We have discussed how to install and configure a single-node Hadoop cluster. In our next article, we will discuss how to configure a multi-node Hadoop cluster. Stay tuned and subscribe to DigitalVarys for more articles and study materials on DevOps, Agile, DevSecOps and App Development.
