Saturday, 11 March 2017

Dive into Big Data - Hadoop

To start your journey towards Big Data / Hadoop, follow these steps:


1. Brush up your Core Java and SQL skills.


2. Learn the basic Unix commands. For this, you can refer to this link: Basic Unix Commands



3. Now you are ready to swim in the world of Big Data.


Big Data = the combination of Big + Data. We already have RDBMS to handle data, but now digital data is coming in from everywhere around the world, and that is what makes the data big. This is Big Data.


To store and process this data, we have a powerful Java framework called HADOOP.


Who created it - Doug Cutting
Where - Yahoo
When - 2006


How did it become open source, given that it was a Yahoo product?


Yahoo later donated Hadoop to the Apache Software Foundation, and it is now a top-level open source project at Apache.
                         Hadoop
                            |
                            |
            ---------------------------------
            |                               |
         Storage                        Processing
         (HDFS)                         (MapReduce)

1. Storage: For storing big data, Hadoop uses HDFS, i.e. the Hadoop Distributed File System (a small Java example follows below).
2. Processing: The MapReduce framework, written in Java, processes large volumes of data in parallel (a word count sketch appears a little further below).
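
As a minimal sketch of how a client talks to HDFS from Java (the NameNode address and file paths here are only assumptions, not from this post), the Hadoop FileSystem API can be used to copy a local file into the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are just examples)
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/demo/sample.txt"));

        fs.close();
    }
}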


We will look at both in detail:
1. Storage: Refer to this link: Hadoop Storage

2. MapReduce: Refer to this link: Hadoop Processing - MapReduce
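
To make the MapReduce idea concrete, below is a minimal sketch of the classic word count job in Java: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts per word. The class names and the input/output paths passed on the command line are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: split each line into words and emit (word, 1)
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum all the 1s emitted for the same word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}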

 

Hadoop Ecosystem Tools:


1. Data Analytics Tool: Hive

 

Apache Hive is an open source data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, query, and analysis.

 

This tool was initially developed by Facebook and later contributed to Apache.

It is used for structured data, which is queried with a SQL-like language called HiveQL (see the small example below).

 

Refer to this link: Apache Hive

 

Download Hive from the official Apache website: apache/hive/hive-2.1.1/
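
As a quick illustration of how Hive is queried from Java, the sketch below uses Hive's JDBC driver against a local HiveServer2 instance; the connection URL and the employees table are assumptions made up for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed connection URL; adjust host/port/database for your setup
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // "employees" is a hypothetical table used only for illustration
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");

            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getInt(2));
            }
        }
    }
}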


2. Data Transformation Tool: Pig

This tool works with structured as well as semi-structured data.


3. Data Ingestion Tool: Sqoop


It is an open source product from Apache.

The full name of Sqoop, i.e. SQ + OOP = SQL to HADOOP.

This tool is used to transfer data from a relational database to Hadoop-supported storage systems and vice versa.

Interesting facts about SQOOP: it is not used only with the open source Hadoop framework but is also used by industry giants like the ones below:

1. Informatica provides a Sqoop-based connector.

2. Pentaho provides an open source Sqoop-based connector.

3. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.

and many more ...

Refer to this link: SQOOP

Also refer to this link to Refresh your PostgreSQL Knowledge. It will walk you through the SELECT queries used to fetch records from an RDBMS (i.e. PostgreSQL) so that the data can be transferred into the Hadoop Distributed File System, Hive, or a NoSQL database like HBase.
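
As a rough sketch of what a Sqoop import looks like when launched from Java (Sqoop is normally driven from the command line; the JDBC URL, credentials, table, and target directory below are made-up examples), the same arguments can be passed to Sqoop's runTool entry point:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running "sqoop import ..." on the command line.
        // Connection URL, table, and target directory are assumptions for illustration.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:postgresql://localhost:5432/salesdb",
            "--username", "demo",
            "--password", "demo",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "1"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}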


4. Data Ingestion Tool: Flume (Coming Soon)

This is also a data ingestion tool, but it is used to transfer semi-structured data from any web server to HDFS/Hive/HBase.

Example: Apache log files stored on a remote web server can be transferred to HDFS using Flume.
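
A Flume agent is usually described in a simple properties file. Below is a minimal sketch of such a configuration for the Apache log example above; the agent name, log path, and HDFS path are assumptions for illustration only.

# Hypothetical agent "agent1": one source, one channel, one sink
agent1.sources = weblog-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Tail an Apache access log on the web server (path is an example)
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/apache2/access.log
agent1.sources.weblog-source.channels = mem-channel

# Buffer events in memory
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Write events into HDFS (path is an example)
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/demo/weblogs
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = mem-channel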

Apache Kafka:


First, let's go through how we can use the Kafka channel in Flume as a reliable and highly available channel for any source/sink combination.


Refer to this blog: Kafka Channel in Flume

In this blog you will learn how to transfer data from a web server to HDFS.

 

5. NoSQL Databases (Coming Soon)






