MapReduce on a Cluster
You need at least two Linux-based hosts to set up a Hadoop cluster: one acts as the master and the other hosts act as slaves. You can add any number of slaves to your cluster. For simplicity, this demonstration uses two hosts, one as the master and the other as a slave.
You might need to update your system first:
apt-get update
Hadoop runs on Java. If you don't already have Java installed, install it with:
apt-get install default-jdk -y
Check the Java version:
java -version
There are many versions of Hadoop available online. Select a stable release from http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download it using the wget command:
wget http://apache.mirrors.lucidnetworks.net/hadoop/common/stable/hadoop-3.2.0.tar.gz
Once the download is complete, extract the package:
tar -xvzf hadoop-3.2.0.tar.gz
Move the extracted files to /usr/local/hadoop:
mv hadoop-3.2.0 /usr/local/hadoop
Open the hadoop-env.sh file using the nano command and look for the export JAVA_HOME= line. Uncomment it and set JAVA_HOME, using either the static path or the dynamically resolved value (pick one of the two lines below):
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Check that Hadoop is correctly installed by running the command below; it should print the hadoop usage/help message:
/usr/local/hadoop/bin/hadoop
Create a directory for the compiled classes:
mkdir wordcount_classes
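The next commands compile and package WordCount.java; the file itself is not listed in this guide. If you do not already have it, the standard word-count example from the Hadoop MapReduce tutorial looks roughly like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from first argument
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from second argument
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}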
Compile the Java file with the following command (if the Java file is not in the present working directory, use the full path to where it is located):
javac -classpath "$(/usr/local/hadoop/bin/hadoop classpath)" -d wordcount_classes/ '/home/paras/Downloads/WordCount.java'
You can find out the classpath by issuing:
$ echo $(/usr/local/hadoop/bin/hadoop classpath)
Consolidate the files in the wordcount_classes/ directory into a single jar file:
jar -cvf wc.jar -C wordcount_classes/ .
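Optionally, list the jar's contents to confirm the WordCount classes were packaged:
jar -tf wc.jar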
Run the jar file in hadoop:
/usr/local/hadoop/bin/hadoop jar wc.jar WordCount /usr/input /output
Edit core-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
Insert the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit hdfs-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Insert the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Create a public/private key pair:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Now test the passwordless login:
ssh localhost
Format the filesystem:
/usr/local/hadoop/bin/hdfs namenode -format
Start the NameNode, SecondaryNameNode and DataNode daemons:
/usr/local/hadoop/sbin/start-dfs.sh
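You can verify that the HDFS daemons are running with the jps command (included with the JDK); it should list NameNode, DataNode and SecondaryNameNode:
jps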
Create the user directory in HDFS and put the input file into it:
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user/paras
$ /usr/local/hadoop/bin/hdfs dfs -put '/home/paras/Downloads/WordCountText.txt' /user/paras
Run Hadoop to execute the jar file:
$ /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /user/paras /output
You can browse the contents of the output directory in the NameNode web UI. Go to http://localhost:50070/ for Hadoop 2.x or older, or http://localhost:9870/ for Hadoop 3.x:
Open the output file part-r-00000 to see the word counts.
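You can also print it from the command line, assuming the output path used above:
$ /usr/local/hadoop/bin/hdfs dfs -cat /output/part-r-00000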
To set up the multi-node cluster:
- Configure two VMs.
- Rename the master's hostname to HadoopMaster and the slave's hostname to HadoopSlave.
- Install Java on both.
- Install Hadoop on both.
Then make changes to the configuration files listed below:
core-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://HadoopMaster:9000</value>
</property>
</configuration>
hdfs-site.xml:
HadoopMaster:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>
</configuration>
HadoopSlave:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
On both HadoopMaster and HadoopSlave:
sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8050</value>
</property>
</configuration>
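If MapReduce jobs later fail during the shuffle phase, you may also need the shuffle auxiliary service; in that case, add this property inside the same configuration tags on both nodes:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>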
mapred-site.xml
sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.tracker</name>
<value>HadoopMaster:5431</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
masters
Only on the master node, i.e. HadoopMaster:
sudo gedit /usr/local/hadoop/etc/hadoop/masters
HadoopMaster
workers
Only on the master node, i.e. HadoopMaster:
sudo gedit /usr/local/hadoop/etc/hadoop/workers
HadoopSlave
hosts
On both HadoopMaster and HadoopSlave:
sudo gedit /etc/hosts
127.0.0.1 localhost
<master node's IPv4 Address> HadoopMaster
<slave node's IPv4 Address> HadoopSlave
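For example, with hypothetical private addresses 192.168.56.101 (master) and 192.168.56.102 (slave), the file would read:
127.0.0.1 localhost
192.168.56.101 HadoopMaster   # hypothetical - use your master's real IP
192.168.56.102 HadoopSlave    # hypothetical - use your slave's real IP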
Note: If the Hadoop version is 2.x or older, you might need to edit the slaves file instead of workers.
~/.bashrc file
Open the file using sudo gedit ~/.bashrc and add the lines below at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
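Reload the shell configuration so the new variables take effect:
source ~/.bashrc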
If you have not configured your hadoop-env.sh file yet, edit it as described in the section above (Single Node Standalone Mode).
Setting up a passwordless connection between master and slave: refer to the link below to set up passwordless SSH to localhost and to the remote machine. This has to be done on the master node as well as the slave node.
https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
Format the filesystem:
/usr/local/hadoop/bin/hdfs namenode -format
Start the NameNode, SecondaryNameNode and DataNode daemons:
/usr/local/hadoop/sbin/start-dfs.sh
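Since mapred-site.xml sets the MapReduce framework to yarn, the YARN daemons also need to be running; start them from the master:
/usr/local/hadoop/sbin/start-yarn.sh
With the configuration above, jps on HadoopMaster should roughly show NameNode, SecondaryNameNode and ResourceManager, while jps on HadoopSlave should show DataNode and NodeManager.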
Create the user directory in HDFS and put the input file into it:
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user/paras
$ /usr/local/hadoop/bin/hdfs dfs -put '/home/paras/Downloads/WordCountText.txt' /user/paras
Run Hadoop to execute the jar file:
$ /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /user/paras /output
References:
http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.linuxhelp.com/how-to-install-hadoop-in-ubuntu
https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/