Hadoop Single Node Installation On Ubuntu
WHAT IS BIG DATA
Big data refers to extremely large and complex data sets that cannot be effectively managed, processed, or analyzed using traditional data processing applications. It encompasses the four V's: volume, velocity, variety, and veracity.
Volume: Big data refers to a vast amount of data generated from various sources, such as social media, sensors, transactions, and more. It typically exceeds the capacity of traditional database systems.
Velocity: Big data is generated at high speed and often in real-time. Data is continuously produced, collected, and processed rapidly, requiring efficient and timely analysis.
Variety: Big data includes various types of data, such as structured, unstructured, and semi-structured data. Structured data refers to organized data in a fixed format, while unstructured data is more flexible, including text, images, videos, social media posts, and more.
Veracity: Big data can have issues with accuracy, reliability, and trustworthiness. Veracity refers to the uncertainty and noise present in the data due to factors like data inconsistency, incompleteness, and biases.
HADOOP:
Hadoop is an open-source framework designed to store and process large datasets in a distributed computing environment. It provides a reliable, scalable, and cost-effective solution for handling big data. The key components of Hadoop are:
Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across multiple machines in a cluster. It is designed to handle large files and provides high throughput data access. HDFS divides files into blocks and replicates them across multiple nodes for fault tolerance.
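The block-and-replica arithmetic behind HDFS can be sketched in a few lines of plain Python. This is only a conceptual illustration (the 128 MB block size and 3x replication are the Hadoop 3.x defaults; this tutorial later sets replication to 1 for a single node):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size in Hadoop 3.x (128 MB)
REPLICATION = 3                 # HDFS default replication factor

def hdfs_footprint(file_size_bytes):
    # number of blocks the file is split into (the last block may be partial)
    blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    # total raw storage consumed once every block is replicated
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1 * 1024**3)  # a 1 GiB file
print(f"{blocks} blocks, {raw} raw bytes")  # 8 blocks, 3221225472 raw bytes
```

This is why losing a single node does not lose data: each of those blocks exists on multiple machines.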
MapReduce: MapReduce is a programming model and computational algorithm used for processing and analyzing large datasets in parallel across a Hadoop cluster. It breaks down data processing tasks into two stages: map and reduce. The map stage processes data in parallel across different nodes, and the reduce stage aggregates the results to produce the final output.
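The two-stage model described above can be sketched in plain Python with the classic word-count example. This is a conceptual illustration only, not Hadoop's actual Java MapReduce API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # shuffle/sort: group the pairs by key; reduce: sum each group's counts
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[key] = sum(v for _, v in group)
    return counts

lines = ["big data big cluster", "data node"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the framework performs the sort/shuffle between the two phases.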
YARN (Yet Another Resource Negotiator): YARN is the resource management layer of Hadoop. It manages and allocates resources, such as CPU and memory, across the Hadoop cluster. YARN enables the execution of various data processing frameworks, including MapReduce, Apache Spark, Apache Hive, and others, on the same cluster.
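The core of what YARN does, granting containers of CPU/memory to competing applications, can be shown with a toy sketch. This is not YARN's actual scheduler (which supports queues, fairness, and locality); the application names and sizes are made up for illustration:

```python
def allocate(requests, cluster_memory_mb):
    # grant container requests in order until cluster memory is exhausted
    granted, free = [], cluster_memory_mb
    for app, mem in requests:
        if mem <= free:
            granted.append(app)
            free -= mem
    return granted, free

# hypothetical applications sharing one 12 GB cluster
apps = [("mapreduce-job", 4096), ("spark-job", 8192), ("hive-query", 6144)]
granted, free = allocate(apps, 12 * 1024)
print(granted, free)  # ['mapreduce-job', 'spark-job'] 0
```

The point is that YARN arbitrates one pool of cluster resources across different frameworks, so MapReduce, Spark, and Hive jobs can coexist on the same machines.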
Hadoop Ecosystem: Hadoop has a rich ecosystem of tools and frameworks that complement its core components. These include Apache Hive, Apache Pig, Apache Spark, Apache HBase, Apache Kafka, and many others. These tools provide higher-level abstractions and functionalities for data processing, querying, streaming, real-time analytics, and more.
INSTALLATION PROCEDURE:
STEP – 1: UBUNTU INSTALLATION
Install Ubuntu on your system using the link below, then add the Ubuntu machine to VirtualBox:
https://ubuntu.com/tutorials/how-to-run-ubuntu-desktop-on-a-virtual-machine-using-virtualbox#download
STEP – 2: JDK INSTALLATION IN UBUNTU
Open the terminal, then run the following commands:
user@ubuntu:~$ sudo apt-get update
user@ubuntu:~$ sudo apt-get install openjdk-18-jdk
After it is installed, verify the installation:
user@ubuntu:~$ java -version
openjdk version "18.0.2-ea" 2022-07-19
OpenJDK Runtime Environment (build 18.0.2-ea+9-Ubuntu-222.04)
OpenJDK 64-Bit Server VM (build 18.0.2-ea+9-Ubuntu-222.04, mixed mode, sharing)
STEP – 3: INSTALL SSH AND KEY GENERATION
user@ubuntu:~$ sudo apt-get install ssh
user@ubuntu:~$ which ssh
/usr/bin/ssh
user@ubuntu:~$ which sshd
/usr/sbin/sshd
user@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Your identification has been saved in /home/user/.ssh/id_rsa
Your public key has been saved in /home/user/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:Iyn/jP7AREk2sxrfVmnWoIVkbsiNuMVx2XdmTfHCIW8 user@ubuntu
The key's randomart image is:
+---[RSA 3072]----+
| (randomart image|
|  not reproduced)|
+----[SHA256]-----+
user@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
user@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ED25519 key fingerprint is SHA256:XPvGuZdchRV4Ocr+WUzq6GViiJEGpDnCi6Bp3a3IybM.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
STEP – 4: HADOOP INSTALLATION
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
Download Hadoop using the above link, then extract the tar file using the command below:
user@ubuntu:~$ tar -zxvf /home/user/hadoop-3.3.3.tar.gz
STEP – 5: MOVE HADOOP
Move the extracted Hadoop directory to /usr/local/hadoop using the following command:
user@ubuntu:~$ sudo mv hadoop-3.3.3/ /usr/local/hadoop
user@ubuntu:~$ ls /usr/local
bin etc games hadoop include lib man sbin share src
STEP – 6: SETUP CONFIGURATION FILES
· ~/.bashrc
· /usr/local/hadoop/etc/hadoop/hadoop-env.sh
· /usr/local/hadoop/etc/hadoop/core-site.xml
· /usr/local/hadoop/etc/hadoop/mapred-site.xml
· /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1) ~/.bashrc
user@ubuntu:~$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-18-openjdk-amd64/bin/java      1811      auto mode
  1            /usr/lib/jvm/java-18-openjdk-amd64/bin/java      1811      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
Note: run this command with sudo; without it, update-alternatives fails with a "Permission denied" error when creating the symbolic link.
user@ubuntu:~$ gedit .bashrc
Add the following lines at the end of the file:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
user@ubuntu:~$ source .bashrc
user@ubuntu:~$ hadoop version
Hadoop 3.3.4
Source code repository https://github.com/apache/hadoop.git -r a585a73c3e02ac62350c136643a5e7f6095a3dbb
Compiled by stevel on 2022-07-29T12:32Z
Compiled with protoc 3.7.1
From source with checksum fb9dd8918a7b8a5b430d61af858f6ec
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.4.jar
2) /usr/local/hadoop/etc/hadoop/hadoop-env.sh
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Inside the file, add the line below:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
3) /usr/local/hadoop/etc/hadoop/core-site.xml
user@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for user:
user@ubuntu:~$ sudo chown user:user /app/hadoop/tmp
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/core-site.xml
Paste the below code in this file:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme
and authority determine the FileSystem implementation. The uri's scheme determines the config
property (fs.SCHEME.impl) naming the
FileSystem implementation class. The uri's authority is used to determine
the host, port, etc. for a filesystem.</description>
</property>
</configuration>
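Hadoop reads these site files as simple name/value pairs. As a rough sketch of how such a file is interpreted (an illustration only, not Hadoop's actual Configuration class), the XML above boils down to a small dictionary:

```python
import xml.etree.ElementTree as ET

# a trimmed copy of the core-site.xml property from this step
CORE_SITE = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
"""

def load_conf(xml_text):
    # collect every <property> element as a name -> value mapping
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

conf = load_conf(CORE_SITE)
print(conf["fs.default.name"])  # hdfs://localhost:54310
```

So fs.default.name simply tells every Hadoop client which filesystem URI to use by default, which is why HDFS commands later in this tutorial need no explicit host or port.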
4) /usr/local/hadoop/etc/hadoop/mapred-site.xml
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml
Paste the below code in this file:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs
are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
5) /usr/local/hadoop/etc/hadoop/hdfs-site.xml
user@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
user@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
user@ubuntu:~$ sudo chown -R user:user /usr/local/hadoop_store
user@ubuntu:~$ gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Paste the below code in the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
STEP – 7: FORMAT THE NEW HADOOP FILESYSTEM
user@ubuntu:~$ hdfs namenode -format
user@ubuntu:~$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as user in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [user]
2022-08-15 20:22:22,957 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting resourcemanager
Starting nodemanagers
user@ubuntu:~$ jps
25587 ResourceManager
25173 DataNode
25046 NameNode
26038 Jps
25368 SecondaryNameNode
25711 NodeManager
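A single-node setup is healthy when all five daemons above (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) appear in jps. A small sketch of that check, using the daemon names from the transcript above:

```python
# the five daemons a healthy single-node Hadoop cluster runs
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager"}

def missing_daemons(jps_output):
    # jps prints "<pid> <name>" per line; the Jps process itself is harmless extra
    running = {line.split()[1] for line in jps_output.strip().splitlines() if line.split()}
    return EXPECTED - running

sample = """25587 ResourceManager
25173 DataNode
25046 NameNode
26038 Jps
25368 SecondaryNameNode
25711 NodeManager"""
print(missing_daemons(sample))  # set(), i.e. nothing is missing
```

If, for example, DataNode is missing from your jps output, recheck the hdfs-site.xml directories and their ownership from Step 6 before retrying start-all.sh.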
STEP – 8: VIEW HADOOP
To view the Hadoop web interface, go to your browser and type localhost:9870/ (the NameNode web UI port for Hadoop 3.x).