hadoop

Getting started with hadoop

Remarks

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Reference:

Apache Hadoop: https://hadoop.apache.org/

Versions

Version        Release Date
------------   ------------
3.0.0-alpha1   2016-08-30
2.7.3          2016-08-25
2.6.4          2016-02-11
2.7.2          2016-01-25
2.6.3          2015-12-17
2.6.2          2015-10-28
2.7.1          2015-07-06

Release notes for each version are linked from the Apache Hadoop releases page.

Installation or Setup on Linux

A Pseudo Distributed Cluster Setup Procedure

Prerequisites

  • Install JDK 1.7 and set the JAVA_HOME environment variable.

  • Create a new user named "hadoop":

    useradd hadoop

  • Set up password-less SSH login to the hadoop account:

     su - hadoop
     ssh-keygen
     << Press ENTER for all prompts >>
     cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     chmod 0600 ~/.ssh/authorized_keys
  • Verify by running ssh localhost; it should log in without prompting for a password.
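    A quick non-interactive check (a sketch; BatchMode makes ssh fail instead of prompting if key-based login is not working):

     ssh -o BatchMode=yes localhost 'echo passwordless SSH OK'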

  • Disable IPv6 by adding the following lines to /etc/sysctl.conf:

     net.ipv6.conf.all.disable_ipv6 = 1
     net.ipv6.conf.default.disable_ipv6 = 1
     net.ipv6.conf.lo.disable_ipv6 = 1
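    Reload the configuration so the changes take effect without a reboot:

     sudo sysctl -p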
  • Verify with cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    (it should return 1)

Installation and Configuration:

  • Download the required version of Hadoop from the Apache archives using wget:

     cd /opt/hadoop/
     wget http://addresstoarchive/hadoop-2.x.x/xxxxx.gz
     tar -xvf hadoop-2.x.x.gz
     mv hadoop-2.x.x hadoop 
    (or)
    
     ln -s hadoop-2.x.x hadoop
     chown -R hadoop:hadoop hadoop
  • Update .bashrc or .kshrc (depending on your shell) with the environment variables below:

      export HADOOP_PREFIX=/opt/hadoop/hadoop
      export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
      export JAVA_HOME=/java/home/path
      export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$JAVA_HOME/bin
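      Reload the shell configuration and confirm that the hadoop binary is on the PATH (assuming the paths above match your layout):

       source ~/.bashrc
       hadoop version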
  • In the configuration directory $HADOOP_CONF_DIR ($HADOOP_PREFIX/etc/hadoop), edit the files below:

    • core-site.xml

      <configuration>
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost:8020</value>
          </property>
      </configuration> 
    • mapred-site.xml

      Create mapred-site.xml from its template

      cp mapred-site.xml.template mapred-site.xml

        <configuration>
            <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
            </property>
        </configuration> 
    • yarn-site.xml

       <configuration>
           <property>
               <name>yarn.resourcemanager.hostname</name>
               <value>localhost</value> 
           </property>
           <property>
               <name>yarn.nodemanager.aux-services</name>
               <value>mapreduce_shuffle</value>
           </property>
       </configuration>
    • hdfs-site.xml

       <configuration>
           <property>
               <name>dfs.replication</name>
               <value>1</value>
           </property>
           <property>
               <name>dfs.namenode.name.dir</name>
               <value>file:///home/hadoop/hdfs/namenode</value>
           </property>
           <property>
               <name>dfs.datanode.data.dir</name> 
               <value>file:///home/hadoop/hdfs/datanode</value> 
           </property>
       </configuration>

    Create the parent folder that will store the HDFS data:

    mkdir -p /home/hadoop/hdfs
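    Optionally, pre-create the NameNode and DataNode directories named in hdfs-site.xml and give them to the hadoop user (a sketch, assuming the paths above):

    mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode
    chown -R hadoop:hadoop /home/hadoop/hdfs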
  • Format the NameNode (this initializes the storage directory and creates the necessary metadata files):

    hdfs namenode -format
  • Start all services:

    start-dfs.sh && start-yarn.sh
    mr-jobhistory-daemon.sh start historyserver

Alternatively, use the deprecated start-all.sh.

  • Check all running Java processes:

    jps 
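    With HDFS, YARN, and the history server running, jps typically lists daemons like these (the PIDs are illustrative and will differ):

    2345 NameNode
    2468 DataNode
    2591 SecondaryNameNode
    2714 ResourceManager
    2837 NodeManager
    2960 JobHistoryServer
    3083 Jps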
  • NameNode web interface: http://localhost:50070/

  • ResourceManager web interface: http://localhost:8088/

  • To stop the daemons (services):

    stop-dfs.sh && stop-yarn.sh
    mr-jobhistory-daemon.sh stop historyserver

Alternatively, use the deprecated stop-all.sh.

Installation of Hadoop on Ubuntu

Creating a Hadoop group:

sudo addgroup hadoop

Adding a user to the group:

sudo adduser --ingroup hadoop hduser001


Configuring SSH:

su - hduser001
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Note: If you get the error [bash: .ssh/authorized_keys: No such file or directory] while writing the authorized key, the ~/.ssh directory probably does not exist yet; create it first as shown below.
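A minimal fix, assuming a missing ~/.ssh directory is the cause:

    mkdir -p ~/.ssh
    chmod 700 ~/.ssh
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys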


Add the hadoop user to the sudoers list:

sudo adduser hduser001 sudo


Disabling IPv6:

Add the same disable_ipv6 entries to /etc/sysctl.conf as shown in the Linux setup above.

Installing Hadoop:

sudo add-apt-repository ppa:hadoop-ubuntu/stable
sudo apt-get install hadoop
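A quick sanity check after installation (assuming the package places the hadoop binary on the PATH):

    hadoop version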


Hadoop overview and HDFS


  • Hadoop is an open-source software framework for storage and large-scale processing of data-sets in a distributed computing environment.
  • It is sponsored by the Apache Software Foundation.
  • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

History

  • Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
  • Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
  • It was originally developed to support distribution for the Nutch search engine project.

Major modules of hadoop

  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Hadoop File System: Basic Features

  • Highly fault-tolerant.
  • High throughput.
  • Suitable for applications with large data sets.
  • Can be built out of commodity hardware.

Namenode and Datanodes

  • Master/slave architecture.
  • HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
  • The DataNodes manage storage attached to the nodes that they run on.
  • HDFS exposes a file system namespace and allows user data to be stored in files.
  • A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
  • DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.


  • HDFS is designed to store very large files across machines in a large cluster.
  • Each file is a sequence of blocks.
  • All blocks in the file except the last are of the same size.
  • Blocks are replicated for fault tolerance.
  • The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
  • A BlockReport lists all the blocks on a DataNode.
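On a running cluster, this metadata can be inspected with the standard HDFS tools, for example:

    hdfs dfsadmin -report            # lists live DataNodes with their capacity and usage
    hdfs fsck / -files -blocks       # shows files, their blocks, and replication status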

Hadoop Shell Commands

  • Common commands used:
    1. ls. Usage: hadoop fs -ls <path> (the directory or file path to list).
    2. cat. Usage: hadoop fs -cat <pathOfFileToView>
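For example (the paths here are hypothetical; substitute your own HDFS paths):

    hadoop fs -ls /user/hadoop
    hadoop fs -cat /user/hadoop/sample.txt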


Link for Hadoop shell commands: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0. This website is not affiliated with Stack Overflow.