Tuesday, 27 March 2018

SQOOP Part II

(What is Hadoop and its ecosystem tools? This is the place for all your answers - Know about Hadoop - A Big Data Handling Framework)

Sqoop Part I


Hey Hi All, welcome to SQOOP Part II!


Let's start with the main requirement: importing data from the MySQL table Customers to HDFS.

Syntax: sqoop import --connect jdbc:mysql://localhost/<database> --username <uname> -P --table <table name> --target-dir '/directory/'

Command: sqoop import --connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB --username root -P --table Customers --target-dir '/CustomersData'



Oops!! It fails: no primary key found for table Customers. (I should have asked for the primary key details as a third question to the Onshore coordinator; looks like this table does not have one.)
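Before fixing it, we can double-check on the MySQL side that the table really has no primary key (a quick sketch, assuming we can log in to the source database):

mysql> USE InsuranceEuropeDB;
mysql> SHOW KEYS FROM Customers WHERE Key_name = 'PRIMARY';
Empty set (0.00 sec)

An empty result confirms Customers has no primary key.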

So why does SQOOP require a Primary Key?

-> We use SQOOP so that multiple mappers run in parallel and finish our task fast. A primary key holds unique, not-null values, which helps SQOOP decide how to distribute the whole data equally across the mappers.

Example :

1. With duplicate data:

1  2  2  3  3  3  7  9

Suppose we get the above data as movie ratings. (Sqoop default mappers = 4)

Sqoop calculates Min = 1, Max = 9, so each mapper gets a value range of (9 - 1) / 4 = 2:

mapper 1 - 1 to 2 - 3 records
mapper 2 - 3 to 4 - 3 records
mapper 3 - 5 to 6 - no records
mapper 4 - 7 to 9 - 2 records

Now you can see that some mappers process more than others: mapper 1 and mapper 2 handle 3 records each while mapper 4 gets only 2, and mapper 3 is enjoying the bench (sitting idle). That's the reason SQOOP forces us to use a Primary Key.
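To make this concrete, here is a rough sketch of what Sqoop does under the hood (the table name ratings and column name rating are made up for illustration; the exact queries differ by Sqoop version):

SELECT MIN(rating), MAX(rating) FROM ratings;              -- boundary query: returns 1 and 9
SELECT * FROM ratings WHERE rating >= 1 AND rating < 3;    -- mapper 1: rows 1, 2, 2 (3 records)
SELECT * FROM ratings WHERE rating >= 3 AND rating < 5;    -- mapper 2: rows 3, 3, 3 (3 records)
SELECT * FROM ratings WHERE rating >= 5 AND rating < 7;    -- mapper 3: no rows (idle)
SELECT * FROM ratings WHERE rating >= 7 AND rating <= 9;   -- mapper 4: rows 7, 9 (2 records)

Each mapper gets an equal slice of the value range, but not an equal share of the rows.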

2. NULLs

If there are NULLs in the field used to decide the splits, the same thing happens as above: some mappers process more data, some less.

3. Primary Key - Unique and Not null 

 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10

Suppose the above data is the primary key column of our table. (Sqoop default mappers = 4)

Sqoop calculates Min = 1, Max = 16 (16 values / 4 mappers = 4 records per mapper)


mapper 1 - 1 to 4 - 4 records
mapper 2 - 5 to 8 - 4 records
mapper 3 - 9 to 12 - 4 records
mapper 4 - 13 to 16 - 4 records

So now we understand why the Primary Key field is needed to run the SQOOP job: it is what creates the parallelism (dividing the data equally among the mappers).
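The same sketch on a primary key column (names again assumed for illustration; Sqoop's real split boundaries can be fractional before rounding) shows why unique, not-null values split cleanly:

SELECT MIN(id), MAX(id) FROM some_table;               -- boundary query: returns 1 and 16
SELECT * FROM some_table WHERE id >= 1 AND id < 5;     -- mapper 1: 4 records
SELECT * FROM some_table WHERE id >= 5 AND id < 9;     -- mapper 2: 4 records
SELECT * FROM some_table WHERE id >= 9 AND id < 13;    -- mapper 3: 4 records
SELECT * FROM some_table WHERE id >= 13 AND id <= 16;  -- mapper 4: 4 records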

So can we run the above job, which failed (Primary Key not found in Customers table)?

There are solutions, since we cannot change the source table:

1. Pass the number of mappers as 1 (the default is 4).

2. Find a field in the table which we can use to create parallelism to some extent. (next blog)

1. Pass the number of mappers as 1 (the default is 4):

sqoop import --connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB --username root -P --table Customers --target-dir '/CustomersData' -m 1






Observe these two new members in the command:

1) --target-dir '/CustomersData' - it creates the CustomersData folder in HDFS.

If the folder already exists, you need to remove the CustomersData folder and then run the job; otherwise you will come across the error shown below.
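Clearing the old folder is a one-liner (standard HDFS shell command; the path matches our example):

hdfs dfs -rm -r /CustomersData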




2) Now let's meet the second member, that is [-m 1].

That means we are asking SQOOP to use only one mapper for transferring the data from MySQL to the HDFS location.
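Once the job finishes, we can verify the output (a sample check; owners and sizes will differ on your cluster). With -m 1 there is exactly one part file, whereas the default 4 mappers would produce part-m-00000 to part-m-00003:

hdfs dfs -ls /CustomersData                        # shows _SUCCESS and part-m-00000
hdfs dfs -cat /CustomersData/part-m-00000 | head   # peek at the imported rows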

That completes my assigned task.

As per our discussion, I need to mail the Onshore Team the HDFS path.

Morning Team,

Hope you are doing well.

As per the request, we have completed the task. Please find the details below:

HDFS Path: /CustomersData

Attaching a screenshot with this mail showing the HDFS path.





We will discuss one more method in the next blog, along with any new requirements ;)

Till then, Keep Coding, Keep Healthy!

Comment and Like Guys!!




