Saturday, 11 March 2017

Dive into Big Data - Hadoop

To start your journey towards Big Data / Hadoop, follow these steps:


1. Brush up your Core Java and SQL skills.


2. Learn the basic Unix commands. For this, you can refer to this link: Basic Unix Commands



3. Now you are ready to swim in the world of Big Data.


Big Data = the combination of Big + Data. We already have RDBMS to handle data, but now digital data is coming in from everywhere around the world, and that is what makes the data big. This is Big Data.


To store and process this data, we have a powerful Java framework called HADOOP.


Who created it - Doug Cutting
Where - Yahoo
When - 2006


How did it become open source, given that it was a Yahoo product?


Yahoo later donated Hadoop to the Apache Software Foundation, and it is now a top-level open source project at Apache.
                         Hadoop
                            |
                            |
            ---------------------------------
            |                               |
         Storage                        Processing
         (HDFS)                         (MapReduce)

1. Storage: For storing big data, Hadoop uses HDFS, i.e. the Hadoop Distributed File System (a small Java example follows below).
2. Processing: The MapReduce framework, written in Java, processes large volumes of data in parallel (a word count sketch appears a little further below).
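
As a minimal sketch of how a client talks to HDFS from Java (the NameNode address and file paths here are only assumptions, not from this post), the Hadoop FileSystem API can be used to copy a local file into the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are just examples)
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/demo/sample.txt"));

        fs.close();
    }
}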


We will look at both in detail:
1. Storage: Refer to this link: Hadoop Storage

2. MapReduce: Refer to this link: Hadoop Processing - MapReduce
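
To make the MapReduce idea concrete, below is a minimal sketch of the classic word count job in Java: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts per word. The class names and the input/output paths passed on the command line are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: split each line into words and emit (word, 1)
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum all the 1s emitted for the same word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}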

 

Hadoop Ecosystem Tools:


1. Data Analytics Tool: Hive

 

Apache Hive is an open source data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, query, and analysis.

 

This tool was initially developed by Facebook and later contributed to Apache.

It is used for structured data, which is queried with a SQL-like language called HiveQL (see the small example below).

 

Refer to this link: Apache Hive

 

Download Hive from the official Apache website: apache/hive/hive-2.1.1/
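
As a quick illustration of how Hive is queried from Java, the sketch below uses Hive's JDBC driver against a local HiveServer2 instance; the connection URL and the employees table are assumptions made up for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed connection URL; adjust host/port/database for your setup
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // "employees" is a hypothetical table used only for illustration
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");

            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getInt(2));
            }
        }
    }
}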


2. Data Transformation Tool: Pig

This tool works with structured as well as semi-structured data.


3. Data Ingestion Tool: Sqoop


It is an open source product from Apache.

The full name of Sqoop, i.e. SQ + OOP = SQL to HADOOP.

This tool is used to transfer data from a relational database to Hadoop-supported storage systems and vice versa.

Interesting facts about SQOOP: it is not used only with the open source Hadoop framework but is also used by industry giants like the ones below:

1. Informatica provides a Sqoop-based connector.

2. Pentaho provides an open source Sqoop-based connector.

3. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.

and many more ...

Refer to this link: SQOOP

Also refer to this link to Refresh your PostgreSQL Knowledge. It will walk you through the SELECT queries used to fetch records from an RDBMS (i.e. PostgreSQL) so that the data can be transferred into the Hadoop Distributed File System, Hive, or a NoSQL database like HBase.
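
As a rough sketch of what a Sqoop import looks like when launched from Java (Sqoop is normally driven from the command line; the JDBC URL, credentials, table, and target directory below are made-up examples), the same arguments can be passed to Sqoop's runTool entry point:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running "sqoop import ..." on the command line.
        // Connection URL, table, and target directory are assumptions for illustration.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:postgresql://localhost:5432/salesdb",
            "--username", "demo",
            "--password", "demo",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "1"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}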


4. Data Ingestion Tool: Flume (Coming Soon)

This is also a data ingestion tool, but it is used to transfer semi-structured data from any web server to HDFS/Hive/HBase.

Example: Apache log files stored on a remote web server can be transferred to HDFS using Flume.
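
A Flume agent is usually described in a simple properties file. Below is a minimal sketch of such a configuration for the Apache log example above; the agent name, log path, and HDFS path are assumptions for illustration only.

# Hypothetical agent "agent1": one source, one channel, one sink
agent1.sources = weblog-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Tail an Apache access log on the web server (path is an example)
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/apache2/access.log
agent1.sources.weblog-source.channels = mem-channel

# Buffer events in memory
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Write events into HDFS (path is an example)
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/demo/weblogs
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = mem-channel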

Apache Kafka:


First, let's go through how we can use the Kafka channel in Flume as a reliable and highly available channel for any source/sink combination.


Refer to this blog: Kafka Channel in Flume

In this blog you will learn how to transfer data from a web server to HDFS.

 

5. NoSQL Databases (Coming Soon)






