BigData Journey: 2018

Friday, 6 April 2018

Sqoop : Split-by - only numeric columns ?

Hi All ,

For better parallelization, Sqoop expects the --split-by column to be numeric.

But in the real world you may end up with only non-numeric columns in your source table. In that case we can set the following text splitter property to true in our Sqoop import command:

-Dorg.apache.sqoop.splitter.allow_text_splitter=true

Use this property when the --split-by column is non-numeric.
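As a rough sketch of where the property goes: generic Hadoop options such as -D have to come right after the tool name (import) and before the tool-specific options. The column CustomerName below is a hypothetical text column; the database, table and target directory are borrowed from the Sqoop Part II post. The snippet only prints the command so it can be reviewed first.

```shell
# Sketch only: CustomerName is a hypothetical non-numeric column.
SPLIT_COL="CustomerName"

# -D options are generic Hadoop arguments, so they go right after "import".
SQOOP_CMD="sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB \
--username root -P \
--table Customers \
--split-by $SPLIT_COL \
--target-dir /CustomersData"

# Print the command for review; run it by pasting it into the terminal.
echo "$SQOOP_CMD"
```

Keep in mind that text columns rarely split as evenly as numeric ones, so some mappers may still get more rows than others.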

Hope you like the article .

Till then Keep Coding Keep Healthy

Comment and Like Guys!!

Tuesday, 27 March 2018

SQOOP Part II

(What is Hadoop and its ecosystem tools - this is the place for all your answers - Know about Hadoop - A Big Data Handling Framework)

Sqoop Part I

Hey Hi All , Welcome to SQOOP Part II

Let's start with the main requirement: importing data from the MySQL table Customers to HDFS.

Syntax : sqoop import --connect jdbc:mysql://localhost/<database> --username <uname> -P --table <table name> --target-dir '/directory/'

Command : sqoop import --connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB --username root -P --table Customers --target-dir '/CustomersData'

Oops!! The job failed because no primary key was found for the table Customers. (I should have asked the onshore coordinator for primary key details as a third question; it looks like this table does not have one.)

So why does Sqoop require a primary key?

-> As we know, we use Sqoop so that multiple mappers run in parallel and finish the task faster. A primary key holds unique, non-null values, which helps Sqoop divide the rows among the mappers in an evenly distributed manner.

Example :

1. With Duplicate data :

1 2 2 3 3 3 7 9

Suppose the data above are ratings from a Movies table. (Sqoop default mappers = 4)

Sqoop calculates Min - 1
Max - 9

mapper 1 - 1 to 2 - 3 records
mapper 2 - 3 to 4 - 3 records
mapper 3 - 5 to 6 - no records
mapper 4 - 7 to 9 - 2 records

Now you can see that the work is uneven: mapper 1 and mapper 2 process 3 records each, mapper 4 only 2, and mapper 3 is enjoying the bench (sitting idle). That's the reason Sqoop insists on a primary key.
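The skew in this example can be reproduced with a small shell sketch. It is a simplified model of range splitting (equal-width ranges between min and max, with any remainder given to the first ranges), not Sqoop's actual code, and the function name count_per_mapper is made up for illustration.

```shell
# count_per_mapper MIN MAX MAPPERS VALUES...
# Simplified model: cut MIN..MAX into MAPPERS equal-width ranges and
# count how many of the given values fall into each range.
count_per_mapper() {
  min=$1; max=$2; mappers=$3
  shift 3
  size=$(( (max - min) / mappers ))   # base width of each range
  rem=$(( (max - min) % mappers ))    # leftover, spread over the first ranges
  lo=$min
  i=1
  while [ "$i" -le "$mappers" ]; do
    hi=$(( lo + size ))
    if [ "$i" -le "$rem" ]; then hi=$(( hi + 1 )); fi
    # each range is lo <= value < hi; the last one is stretched to include MAX
    if [ "$i" -eq "$mappers" ]; then hi=$(( max + 1 )); fi
    count=0
    for v in "$@"; do
      if [ "$v" -ge "$lo" ] && [ "$v" -lt "$hi" ]; then
        count=$(( count + 1 ))
      fi
    done
    echo "mapper $i - $lo to $(( hi - 1 )) - $count records"
    lo=$hi
    i=$(( i + 1 ))
  done
}

# The duplicate-heavy ratings column with 4 mappers
count_per_mapper 1 9 4 1 2 2 3 3 3 7 9
# mapper 1 - 1 to 2 - 3 records
# mapper 2 - 3 to 4 - 3 records
# mapper 3 - 5 to 6 - 0 records
# mapper 4 - 7 to 9 - 2 records
```

The ranges depend only on the min and max values, not on how many rows sit in each range, which is why duplicates and gaps leave some mappers idle.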

2. Null

If there are NULLs in the field used to split the work among mappers, the same thing happens: some mappers process more data and some less.

3. Primary Key - Unique and Not null

11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10

Suppose the values above come from a primary key column. (Sqoop default mappers = 4)

Sqoop calculates Min - 1
Max - 16 (16 records / 4 mappers = 4 records per mapper)

mapper 1 - 1 to 4 - 4 records
mapper 2 - 5 to 8 - 4 records
mapper 3 - 9 to 12 - 4 records
mapper 4 - 13 to 16 - 4 records
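To make the even split concrete, the sketch below prints the kind of range condition each mapper would be given for a key running from 1 to 16. The column name id is hypothetical and the splitting rule is a simplified model; the queries Sqoop really generates may differ in detail.

```shell
# print_split_conditions COLUMN MIN MAX MAPPERS
# Prints one illustrative WHERE condition per mapper.
print_split_conditions() {
  col=$1; min=$2; max=$3; mappers=$4
  size=$(( (max - min) / mappers ))   # base width of each range
  rem=$(( (max - min) % mappers ))    # leftover, spread over the first ranges
  lo=$min
  i=1
  while [ "$i" -le "$mappers" ]; do
    hi=$(( lo + size ))
    if [ "$i" -le "$rem" ]; then hi=$(( hi + 1 )); fi
    if [ "$i" -eq "$mappers" ]; then
      # the last mapper also takes the maximum value itself
      echo "mapper $i: WHERE $col >= $lo AND $col <= $max"
    else
      echo "mapper $i: WHERE $col >= $lo AND $col < $hi"
    fi
    lo=$hi
    i=$(( i + 1 ))
  done
}

print_split_conditions id 1 16 4
# mapper 1: WHERE id >= 1 AND id < 5
# mapper 2: WHERE id >= 5 AND id < 9
# mapper 3: WHERE id >= 9 AND id < 13
# mapper 4: WHERE id >= 13 AND id <= 16
```

With unique, gap-free keys every range holds the same number of rows, so all four mappers finish at about the same time.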

So far we understand that the primary key is not needed to run the Sqoop job as such, but to create parallelism (dividing the data equally among the mappers).

So can we still run the job above, which failed because no primary key was found in the Customers table?

Yes. Since we cannot change the source table, there are two solutions:

1. Set the number of mappers to 1 (the default is 4).

2. Find a field in the table that can be used to create parallelism to some extent. (next blog)

1. Set the number of mappers to 1 (the default is 4)

sqoop import --connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB --username root -P --table Customers --target-dir '/CustomersData' -m 1

Observe these two new members :

1 ) --target-dir '/CustomersData' - It will create CustomersData folder .

If the folder already exists, you need to remove the CustomersData folder and then run the job. Otherwise the job fails with an error saying that the output directory already exists.

2) Now let's meet the second member, that is -m 1

That means we are asking Sqoop to use only one mapper for transferring data from MySQL to the HDFS location.
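Both members can be put together in a short sketch that clears an old target directory before the single-mapper import. It assumes the hdfs and sqoop commands are on the PATH of a Hadoop client machine; the DRY_RUN switch and the run helper are illustrative additions that print the commands instead of executing them.

```shell
TARGET_DIR="/CustomersData"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (illustrative default)

# run: print the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Remove the old output directory (-f: no error if it does not exist)
run hdfs dfs -rm -r -f "$TARGET_DIR"

# Import with a single mapper, so no primary key or split column is needed
run sqoop import --connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB \
  --username root -P --table Customers --target-dir "$TARGET_DIR" -m 1
```

Set DRY_RUN=0 to really execute the commands. Sqoop's import tool also has a --delete-target-dir option that removes the directory as part of the job, which may be simpler than a separate hdfs call.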

That completes my assigned task.

As per discussion , I need to mail Onshore Team with hdfs path .

Morning Team ,

Hope you are doing good .

As per the request , we have completed the task . Please find the details as below :

HDFS Path : /CustomersData

Attaching a screenshot showing the HDFS path with this mail.

We will discuss one more method in (next blog) along with if any new requirements ;)

Till then Keep Coding Keep Healthy

Comment and Like Guys!!

SQOOP ( For Structured data)

Sqoop :

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop,

where Hadoop refers to HDFS (Hadoop Distributed File System), Hive (data warehouse), HBase (NoSQL DB), etc.

Suppose your manager comes and says: hey man, the onshore team needs some tables on HDFS (i.e. fetch records from the MySQL table Customers and store them in HDFS), and this needs to be delivered today itself.

What are the things you need to ask?

1. Credentials required.
2. Which database the table belongs to.

We will add questions as we progress ...

Replied:
The onshore coordinator has shared the credentials in a separate mail, along with details from the source system showing which database the table belongs to, i.e. InsuranceEuropeDB.

Now let's start our coding ..

A new thing to explore: Sqoop

Start with : sqoop help

It lists all the commands supported by the current Sqoop version.

Let's make a connection with MySQL and check whether we are able to connect to the requested database.

From the help output we can use two commands, i.e. 1) list-databases 2) list-tables

Sometimes we forget what else needs to be passed with a command. Don't worry about that, just ask Sqoop for help.

sqoop help <command>

Sample :

Syntax : sqoop list-databases --connect jdbc:mysql://<ip address> --username <username> --password <password>
Command:
sqoop list-databases --connect jdbc:mysql://127.0.0.1 --username root --password password

As we can see, internally it is only calling SHOW DATABASES. Wow, I can see the database "InsuranceEuropeDB", this is awesome ;)

Also note the hint in the output: consider using -P instead (in place of --password <password>). Don't worry guys, let's try it!!

Command:
sqoop list-databases --connect jdbc:mysql://127.0.0.1 --username root -P

Press enter and it will ask you to type the password, which will not be visible. (Me : Great Sqoop, you are taking care of this ... Friend who knows Sqoop : hmm, so use -P only (observe that P is capital with a single hyphen))

This is great, I cannot see the password.

Let's proceed to the next command to list tables.

Me : Oops, nothing came, why? I used list-tables, connect, username, password, and still nothing!!

Friend who knows Sqoop : Hey man, mention the database also so that it will fetch the table names from that database.

Me : Yes I got it .. Thanks
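A sketch of the corrected command: the database name is appended to the JDBC URL. The variable names are only for illustration, and the snippet prints the command so it can be checked before running.

```shell
DB_HOST="127.0.0.1"
DB_NAME="InsuranceEuropeDB"

# Without /$DB_NAME in the URL, Sqoop has no database to list tables from.
JDBC_URL="jdbc:mysql://$DB_HOST/$DB_NAME"
LIST_TABLES_CMD="sqoop list-tables --connect $JDBC_URL --username root -P"

echo "$LIST_TABLES_CMD"
# sqoop list-tables --connect jdbc:mysql://127.0.0.1/InsuranceEuropeDB --username root -P
```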

Me : Sqoop is awesome. After lunch I will do the import; by evening I have to provide them the HDFS path (having the Customers table data) (next blog).

Hope you like my blogs. Please comment and like. Ask if you have any doubts related to Sqoop.

Happy Coding
Amit Shyamal Dass

Sunday, 18 March 2018

MYSQL uninstall and installation on Ubuntu(Cloudera Machine)

Hey Hi Friends ,

There may be cases where we don't remember the password and finally need to completely uninstall the software and then install it again.

In this blog, we will learn how to uninstall and install MySQL on Ubuntu (Cloudera machine).

For uninstall : (only 2 steps are needed to completely remove MySQL from the system)

1. sudo apt-get remove --purge mysql-server mysql-client mysql-common

Press Y to continue

2. sudo rm -rf /var/lib/mysql

Installation of MYSQL on Ubuntu (Cloudera Machine)

1. sudo apt-get install mysql-server

2. After installation , it will prompt for setting the password.

3. You are ready to use mysql .
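A small check like the one below can confirm the result of either procedure. It is only a sketch: it looks for the mysql client on the PATH, and the variable MYSQL_STATUS is an illustrative name. After the uninstall steps it should report missing, and after the install steps it should report installed.

```shell
# Check whether the mysql client is available on this machine
if command -v mysql >/dev/null 2>&1; then
  mysql --version          # prints the installed client version
  MYSQL_STATUS="installed"
else
  MYSQL_STATUS="missing"
fi

echo "mysql client: $MYSQL_STATUS"
```

Once it reports installed, logging in with mysql -u root -p and the password chosen during installation is a further check that the server itself is up.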

I will come with more blogs related to this .

If you like my blog please like and comment

Happy Coding ...

Sunday, 11 March 2018

How to setup Git on your Windows Desktop

1. First you need to download GitHub Desktop from the link below

Link : https://desktop.github.com/

2. You need to create GitHub login credentials (username, email id and password)

Link : https://github.com/join?source=github-desktop

3. Verify your email id and configure Git.

4. Now you are ready to use Git features.

5. Now let's see how to clone a project from GitHub to our desktop

To clone the project, we only need to copy the clone link shown on the repository's GitHub page

Now click on Clone a repository (on the screen from step 4)

Choose the URL tab first, then provide the GitHub clone link you just copied and a local path (remember it will automatically create a folder GITHUB), then click on Clone

Cloning will take some time (it depends on the size of the project you are cloning)

6. Now go to the folder(GITHUB) and explore it !!
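For readers who prefer the command line, the clone from step 5 can be sketched with plain git as well. The repository URL below is a placeholder, not a real project, and the snippet assumes git is installed. It works out the folder name that git clone uses by default (the last part of the URL without .git) and prints the command rather than running it.

```shell
# Placeholder URL: replace it with the link copied from the GitHub page.
REPO_URL="https://github.com/example-user/example-repo.git"

# By default git clones into a folder named after the repository.
REPO_DIR=$(basename "$REPO_URL" .git)

# Print the clone command for review
echo "git clone $REPO_URL $REPO_DIR"
```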

Thanks