MapReduce on a Cluster
You need at least two Linux-based hosts to set up a Hadoop cluster: one acts as the master and the other hosts act as slaves. You can add any number of slaves to your cluster. For simplicity, this demonstration uses two hosts, one as the master and the other as a slave.
You might need to update your system first:
apt-get update
Hadoop runs on Java. If you don't already have Java installed, install it with:
apt-get install default-jdk -y
Check the Java version:
java -version
There are many versions of Hadoop available online. Select a stable release from http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download it using the wget command:
wget http://apache.mirrors.lucidnetworks.net/hadoop/common/stable/hadoop-3.2.0.tar.gz
Once the download is complete, extract the package:
tar -xvzf hadoop-3.2.0.tar.gz
Move the extracted files to /usr/local/hadoop:
mv hadoop-3.2.0 /usr/local/hadoop
Open the hadoop-env.sh file using the nano command and look for the export JAVA_HOME= line. Uncomment it and set JAVA_HOME, using either the static path or the dynamically resolved value (pick one of the two lines below):
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Check that Hadoop is correctly installed by running the command below; it should print the hadoop usage/help message:
/usr/local/hadoop/bin/hadoop
Create a directory for the compiled classes:
mkdir wordcount_classes
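The next commands compile and package WordCount.java; the file itself is not listed in this guide. If you do not already have it, the standard word-count example from the Hadoop MapReduce tutorial looks roughly like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from first argument
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from second argument
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}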
Compile the Java file with the following command (if the Java file is not in the present working directory, use the full path to where it is located):
javac -classpath "$(/usr/local/hadoop/bin/hadoop classpath)" -d wordcount_classes/ '/home/paras/Downloads/WordCount.java'
You can find out the classpath by issuing:
$ echo $(/usr/local/hadoop/bin/hadoop classpath)
Consolidate the files in the wordcount_classes/ directory into a single jar file:
jar -cvf wc.jar -C wordcount_classes/ .
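Optionally, list the jar's contents to confirm the WordCount classes were packaged:
jar -tf wc.jar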
Run the jar file in hadoop:
/usr/local/hadoop/bin/hadoop jar wc.jar WordCount /usr/input /output
Edit core-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
Insert the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit hdfs-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Insert the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Create a public/private key pair:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Now test the passwordless login:
ssh localhost
Format the filesystem:
/usr/local/hadoop/bin/hdfs namenode -format
Start the NameNode, SecondaryNameNode and DataNode daemons:
/usr/local/hadoop/sbin/start-dfs.sh
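You can verify that the HDFS daemons are running with the jps command (included with the JDK); it should list NameNode, DataNode and SecondaryNameNode:
jps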
Create the user directory in HDFS and put the input file into it:
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user/paras
$ /usr/local/hadoop/bin/hdfs dfs -put '/home/paras/Downloads/WordCountText.txt' /user/paras
Run Hadoop to execute the jar file:
$ /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /user/paras /output
You can browse the contents of the output directory in the NameNode web UI. Go to http://localhost:50070/ for Hadoop 2.x or older, or http://localhost:9870/ for Hadoop 3.x:
Open the output file part-r-00000 to see the word counts.
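You can also print it from the command line, assuming the output path used above:
$ /usr/local/hadoop/bin/hdfs dfs -cat /output/part-r-00000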
To set up the multi-node cluster:
- Configure two VMs.
- Rename the master's hostname to HadoopMaster and the slave's hostname to HadoopSlave.
- Install Java on both.
- Install Hadoop on both.
Then make changes to the configuration files listed below:
core-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://HadoopMaster:9000</value>
</property>
</configuration>
hdfs-site.xml:
HadoopMaster:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>
</configuration>
HadoopSlave:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
On both HadoopMaster and HadoopSlave:
sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8050</value>
</property>
</configuration>
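If MapReduce jobs later fail during the shuffle phase, you may also need the shuffle auxiliary service; in that case, add this property inside the same configuration tags on both nodes:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>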
mapred-site.xml
sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.tracker</name>
<value>HadoopMaster:5431</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
masters
Only on the master node, i.e. HadoopMaster:
sudo gedit /usr/local/hadoop/etc/hadoop/masters
HadoopMaster
workers
Only on the master node, i.e. HadoopMaster:
sudo gedit /usr/local/hadoop/etc/hadoop/workers
HadoopSlave
hosts
On both HadoopMaster and HadoopSlave:
sudo gedit /etc/hosts
127.0.0.1 localhost
<master node's IPv4 Address> HadoopMaster
<slave node's IPv4 Address> HadoopSlave
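For example, with hypothetical private addresses 192.168.56.101 (master) and 192.168.56.102 (slave), the file would read:
127.0.0.1 localhost
192.168.56.101 HadoopMaster   # hypothetical - use your master's real IP
192.168.56.102 HadoopSlave    # hypothetical - use your slave's real IP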
Note: If the Hadoop version is 2.x or older, you might need to edit the slaves file instead of workers.
~/.bashrc file
Open the file using sudo gedit ~/.bashrc and add the lines below at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
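Reload the shell configuration so the new variables take effect:
source ~/.bashrc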
If you have not configured your hadoop-env.sh file yet, edit it as described in the section above (Single Node Standalone Mode).
Setting up a passwordless connection between master and slave: refer to the link below to set up passwordless SSH to localhost and to the remote machine. This has to be done on the master node as well as the slave node.
https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
Format the filesystem:
/usr/local/hadoop/bin/hdfs namenode -format
Start the NameNode, SecondaryNameNode and DataNode daemons:
/usr/local/hadoop/sbin/start-dfs.sh
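Since mapred-site.xml sets the MapReduce framework to yarn, the YARN daemons also need to be running; start them from the master:
/usr/local/hadoop/sbin/start-yarn.sh
With the configuration above, jps on HadoopMaster should roughly show NameNode, SecondaryNameNode and ResourceManager, while jps on HadoopSlave should show DataNode and NodeManager.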
Create the user directory in HDFS and put the input file into it:
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user
$ /usr/local/hadoop/bin/hdfs dfs -mkdir /user/paras
$ /usr/local/hadoop/bin/hdfs dfs -put '/home/paras/Downloads/WordCountText.txt' /user/paras
Run Hadoop to execute the jar file:
$ /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /user/paras /output
References:
http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.linuxhelp.com/how-to-install-hadoop-in-ubuntu
https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/