Connecting Sqoop to other databases/datastores

Introduction#

Shows how a sqoop script could be used to import data from various datastores/databases.

Load JDBC Driver

For accessing the MS SQL Server database Sqoop requires an additional JDBC driver which can be downloaded from Microsoft. The following steps will install MSSQL Server JDBC driver to Sqoop:

wget 'https://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz'

tar -xvzf sqljdbc_4

cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/hdp/current/sqoop-server/lib/

Validate the connection

To check that the connection to the server is valid:

sqoop list-tables --connect "jdbc:sqlserver://<server_ip>:1433;database=<database_name>" 
                  --username <user_name>
                  --password <password>

Before doing this it is recommended to check if SqlServer’s configuration allows remote access to 1433 port.

Open Configuration Manager => SQL Server Network configuration => Protocols for MSSQLSERVER and check that Protocol is enabled and the needed IP and port is enabled and is active.

Import table to new catalog

To import data from SQL Server to Hadoop:

sqoop import --table TestTable
             --connect "jdbc:sqlserver://192.168.1.100:1433;database=Test_db"  
             --username user 
             --password password  
             --split-by id  
             --target-dir /user/test

split-by – used mandatory if no primary key
target-dir – new catalog, which does not exist yet

Import the results of a query from a relational database into HDFS:

Query can be used instead of table in import operation:

sqoop import --query 'select Id,Message from TestTable where $CONDITIONS' 
             --where 'id>100' 
             --connect "jdbc:sqlserver://192.168.1.100:1433;database=Test_db
             --username user 
             -–password password 
             --split-by id  
             --target-dir /user/test/ 
             --fields-terminated-by '\t'

where $CONDITIONS - mandatory even if where condition does not exist
split-by - mandatory, specifies the column for the split operation. Used to split tasks in import MapReduce job

Import data directly into Hive Warehouse

The data can be imported directly into Hive:

sqoop import --hive-import 
             --table EventLog 
             --connect "jdbc:sqlserver://192.168.1.99:1433;database=Test_db" 
             --username user 
             --password password 
             --split-by id 
             --fields-terminated-by '\t'

Import data from RDBMS to HBase table

The following sqoop command will be used to import the data from RDBMS table into HBase table, if the table does not exists in HBase it will create a new table and import the data into this table

sqoop import \
        --query 'select emp_id, emp_name, emp_sal from employee where $CONDITIONS' \
        --connect "jdbc:sqlserver://192.168.1.99:1433;database=test_db" \
        --username username \
        -–password password \
        --hbase-create-table \
        --hbase-table employee_table \
        --hbase-row-key emp_id