Hadoop Single Node Installation On Ubuntu
WHAT IS BIG DATA
Big data refers to extremely large and complex data sets that cannot be effectively managed, processed, or analyzed using traditional data processing applications. It encompasses the four V's: volume, velocity, variety, and veracity.
Volume: Big data refers to a vast amount of data generated from various sources, such as social media, sensors, transactions, and more. It typically exceeds the capacity of traditional database systems.
Velocity: Big data is generated at high speed and often in real-time. Data is continuously produced, collected, and processed rapidly, requiring efficient and timely analysis.
Variety: Big data includes various types of data, such as structured, unstructured, and semi-structured data. Structured data refers to organized data in a fixed format, while unstructured data is more flexible, including text, images, videos, social media posts, and more.
Veracity: Big data can have issues with accuracy, reliability, and trustworthiness. Veracity refers to the uncertainty and noise present in the data due to factors like data inconsistency, incompleteness, and biases.
HADOOP:
Hadoop is an open-source framework designed to store and process large datasets in a distributed computing environment. It provides a reliable, scalable, and cost-effective solution for handling big data. The key components of Hadoop are:
Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across multiple machines in a cluster. It is designed to handle large files and provides high throughput data access. HDFS divides files into blocks and replicates them across multiple nodes for fault tolerance.
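The block-and-replica arithmetic behind HDFS can be sketched in a few lines of plain Python. This is only a conceptual illustration (the 128 MB block size and 3x replication are the Hadoop 3.x defaults; this tutorial later sets replication to 1 for a single node):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size in Hadoop 3.x (128 MB)
REPLICATION = 3                 # HDFS default replication factor

def hdfs_footprint(file_size_bytes):
    # number of blocks the file is split into (the last block may be partial)
    blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    # total raw storage consumed once every block is replicated
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1 * 1024**3)  # a 1 GiB file
print(f"{blocks} blocks, {raw} raw bytes")  # 8 blocks, 3221225472 raw bytes
```

This is why losing a single node does not lose data: each of those blocks exists on multiple machines.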
MapReduce: MapReduce is a programming model and computational algorithm used for processing and analyzing large datasets in parallel across a Hadoop cluster. It breaks down data processing tasks into two stages: map and reduce. The map stage processes data in parallel across different nodes, and the reduce stage aggregates the results to produce the final output.
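The two-stage model described above can be sketched in plain Python with the classic word-count example. This is a conceptual illustration only, not Hadoop's actual Java MapReduce API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # shuffle/sort: group the pairs by key; reduce: sum each group's counts
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[key] = sum(v for _, v in group)
    return counts

lines = ["big data big cluster", "data node"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the framework performs the sort/shuffle between the two phases.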
YARN (Yet Another Resource Negotiator): YARN is the resource management layer of Hadoop. It manages and allocates resources, such as CPU and memory, across the Hadoop cluster. YARN enables the execution of various data processing frameworks, including MapReduce, Apache Spark, Apache Hive, and others, on the same cluster.
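The core of what YARN does, granting containers of CPU/memory to competing applications, can be shown with a toy sketch. This is not YARN's actual scheduler (which supports queues, fairness, and locality); the application names and sizes are made up for illustration:

```python
def allocate(requests, cluster_memory_mb):
    # grant container requests in order until cluster memory is exhausted
    granted, free = [], cluster_memory_mb
    for app, mem in requests:
        if mem <= free:
            granted.append(app)
            free -= mem
    return granted, free

# hypothetical applications sharing one 12 GB cluster
apps = [("mapreduce-job", 4096), ("spark-job", 8192), ("hive-query", 6144)]
granted, free = allocate(apps, 12 * 1024)
print(granted, free)  # ['mapreduce-job', 'spark-job'] 0
```

The point is that YARN arbitrates one pool of cluster resources across different frameworks, so MapReduce, Spark, and Hive jobs can coexist on the same machines.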
Hadoop Ecosystem: Hadoop has a rich ecosystem of tools and frameworks that complement its core components. These include Apache Hive, Apache Pig, Apache Spark, Apache HBase, Apache Kafka, and many others. These tools provide higher-level abstractions and functionalities for data processing, querying, streaming, real-time analytics, and more.
INSTALLATION PROCEDURE:
STEP – 1: UBUNTU INSTALLATION
Install Ubuntu on your system using the link below, then add the Ubuntu machine to VirtualBox:
https://ubuntu.com/tutorials/how-to-run-ubuntu-desktop-on-a-virtual-machine-using-virtualbox#download
STEP – 2: JDK INSTALLATION IN UBUNTU
Open the terminal, then run the following commands:
user@ubuntu:~$ sudo apt-get update
user@ubuntu:~$ sudo apt-get install openjdk-18-jdk
After it is installed, verify the installation:
user@ubuntu:~$ java -version
openjdk version "18.0.2-ea" 2022-07-19
OpenJDK Runtime Environment (build 18.0.2-ea+9-Ubuntu-222.04)
OpenJDK 64-Bit Server VM (build 18.0.2-ea+9-Ubuntu-222.04, mixed mode, sharing)
STEP – 3: INSTALL SSH AND KEY GENERATION
user@ubuntu:~$ sudo apt-get install ssh
user@ubuntu:~$ which ssh
/usr/bin/ssh
user@ubuntu:~$ which sshd
/usr/sbin/sshd
user@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Your identification has been saved in /home/user/.ssh/id_rsa
Your public key has been saved in /home/user/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:Iyn/jP7AREk2sxrfVmnWoIVkbsiNuMVx2XdmTfHCIW8 user@ubuntu
The key's randomart image is:
+---[RSA 3072]----+
| (randomart image|
|  not reproduced)|
+----[SHA256]-----+
user@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
user@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ED25519 key fingerprint is SHA256:XPvGuZdchRV4Ocr+WUzq6GViiJEGpDnCi6Bp3a3IybM.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
STEP – 4: HADOOP INSTALLATION
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
Download Hadoop using the above link, then extract the tar file using the command below:
user@ubuntu:~$ tar -zxvf /home/user/hadoop-3.3.3.tar.gz
STEP – 5: MOVE HADOOP
Move the extracted Hadoop directory to /usr/local/hadoop using the following command:
user@ubuntu:~$ sudo mv hadoop-3.3.3/ /usr/local/hadoop
user@ubuntu:~$ ls /usr/local
bin etc games hadoop include lib man sbin share src
STEP – 6: SETUP CONFIGURATION FILES
· ~/.bashrc
· /usr/local/hadoop/etc/hadoop/hadoop-env.sh
· /usr/local/hadoop/etc/hadoop/core-site.xml
· /usr/local/hadoop/etc/hadoop/mapred-site.xml
· /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1) ~/.bashrc
user@ubuntu:~$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-18-openjdk-amd64/bin/java      1811      auto mode
  1            /usr/lib/jvm/java-18-openjdk-amd64/bin/java      1811      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
Note: run this command with sudo; without it, update-alternatives fails with a "Permission denied" error when creating the symbolic link.
user@ubuntu:~$ gedit .bashrc
Add the following lines at the end of the file:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
user@ubuntu:~$ source .bashrc
user@ubuntu:~$ hadoop version
Hadoop 3.3.4
Source code repository https://github.com/apache/hadoop.git -r a585a73c3e02ac62350c136643a5e7f6095a3dbb
Compiled by stevel on 2022-07-29T12:32Z
Compiled with protoc 3.7.1
From source with checksum fb9dd8918a7b8a5b430d61af858f6ec
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.4.jar
2) /usr/local/hadoop/etc/hadoop/hadoop-env.sh
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Inside the file, add the line below:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
3) /usr/local/hadoop/etc/hadoop/core-site.xml
user@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for user:
user@ubuntu:~$ sudo chown user:user /app/hadoop/tmp
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/core-site.xml
Paste the below code in this file:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme
and authority determine the FileSystem implementation. The uri's scheme determines the config
property (fs.SCHEME.impl) naming the
FileSystem implementation class. The uri's authority is used to determine
the host, port, etc. for a filesystem.</description>
</property>
</configuration>
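Hadoop reads these site files as simple name/value pairs. As a rough sketch of how such a file is interpreted (an illustration only, not Hadoop's actual Configuration class), the XML above boils down to a small dictionary:

```python
import xml.etree.ElementTree as ET

# a trimmed copy of the core-site.xml property from this step
CORE_SITE = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
"""

def load_conf(xml_text):
    # collect every <property> element as a name -> value mapping
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

conf = load_conf(CORE_SITE)
print(conf["fs.default.name"])  # hdfs://localhost:54310
```

So fs.default.name simply tells every Hadoop client which filesystem URI to use by default, which is why HDFS commands later in this tutorial need no explicit host or port.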
4) /usr/local/hadoop/etc/hadoop/mapred-site.xml
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml
Paste the below code in this file:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs
are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
5) /usr/local/hadoop/etc/hadoop/hdfs-site.xml
user@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
user@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
user@ubuntu:~$ sudo chown -R user:user /usr/local/hadoop_store
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Paste the below code in the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
STEP – 7: FORMAT THE NEW HADOOP FILESYSTEM
user@ubuntu:~$ hdfs namenode -format
user@ubuntu:~$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as user in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [user]
2022-08-15 20:22:22,957 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting resourcemanager
Starting nodemanagers
user@ubuntu:~$ jps
25587 ResourceManager
25173 DataNode
25046 NameNode
26038 Jps
25368 SecondaryNameNode
25711 NodeManager
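A single-node setup is healthy when all five daemons above (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) appear in jps. A small sketch of that check, using the daemon names from the transcript above:

```python
# the five daemons a healthy single-node Hadoop cluster runs
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager"}

def missing_daemons(jps_output):
    # jps prints "<pid> <name>" per line; the Jps process itself is harmless extra
    running = {line.split()[1] for line in jps_output.strip().splitlines() if line.split()}
    return EXPECTED - running

sample = """25587 ResourceManager
25173 DataNode
25046 NameNode
26038 Jps
25368 SecondaryNameNode
25711 NodeManager"""
print(missing_daemons(sample))  # set(), i.e. nothing is missing
```

If, for example, DataNode is missing from your jps output, recheck the hdfs-site.xml directories and their ownership from Step 6 before retrying start-all.sh.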
STEP – 8: VIEW HADOOP
To view the Hadoop web interface, go to your browser and type localhost:9870/ (the NameNode web UI port for Hadoop 3.x).