Monday, 27 March 2017
PostgreSQL
It is an open-source relational database management system (RDBMS).
This post is written especially with Sqoop in mind.
Sqoop is a tool built on top of Apache Hadoop that is used to transfer data from any RDBMS to HDFS/Hive/HBase and vice versa.
With Sqoop we may be asked to fetch a whole table or only specific columns from a table, so we must be comfortable with RDBMS SELECT queries.
In this blog we focus on PostgreSQL:
SELECT Queries :
To fetch all records from the Customer_Benefits table:
SELECT * FROM Customer_Benefits;
Using AND and OR :
SELECT * FROM Customer_Benefits
WHERE policy = 'Child Endowment'
AND region = 'NY' OR product_id ='4343';
Comparison :
SELECT * FROM Customer_Benefits WHERE benefits_id > 3 AND premium = 4000;
Alternative to BETWEEN using Comparison :
SELECT name FROM Customer_Benefits
WHERE premium >= 5000 AND premium <= 9070;
Using CASE :
SELECT benefits_id, CASE WHEN benefits_id=70 THEN 'Old' WHEN benefits_id=89 THEN 'Old' ELSE 'Fresh' END AS old FROM Customer_Benefits;
Using CASE, LIMIT :
SELECT customer_name, CASE WHEN premium > 4000 THEN 'over $4000.00' WHEN premium = 4000 THEN '$4000.00' ELSE 'under $4000.00' END AS premium_range FROM Customer_Benefits LIMIT 20;
Comparison <=, != :
SELECT * FROM Customer_Benefits
WHERE premium <= 9000 AND policy != 'Diamond life';
Using DISTINCT :
SELECT DISTINCT policy FROM Customer_Benefits;
Using DISTINCT ON, ORDER BY :
SELECT DISTINCT ON (policy) policy, date FROM Customer_Benefits ORDER BY policy, date DESC;
This retrieves the most recent report for each policy. If we had not used ORDER BY to force descending order of the date values for each policy, we would have got a report from an unpredictable time for each policy. The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s); the ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
Using EXCEPT :
SELECT policy FROM Customer_Benefits EXCEPT SELECT policy FROM Customer_Benefits WHERE benefits_id > 3;
Using EXPLAIN :
EXPLAIN SELECT * FROM Customer_Benefits WHERE customer_name LIKE 'J%';
Using GROUP BY :
SELECT city, max(premium) FROM Customer_Benefits GROUP BY city;
Using HAVING :
SELECT city, max(premium) FROM Customer_Benefits GROUP BY city HAVING max(premium) < 4000;
Using HAVING and INNER JOIN :
SELECT count(e.policy) AS "number of policy", p.name AS customer_name FROM Policy_Detail AS e INNER JOIN Customer_Benefits AS p ON (e.p_id = p.id) GROUP BY customer_name HAVING count(e.policy) > 1;
Using IS NULL :
SELECT * FROM Customer_Benefits WHERE customer_name IS NULL;
Using LIKE :
SELECT * FROM Customer_Benefits WHERE customer_name LIKE '%s';
SELECT * FROM employee WHERE name LIKE '%D____';
Using UNION :
SELECT customer_name FROM Customer_Benefits WHERE benefits_id = 1
UNION
SELECT customer_name FROM Customer_Benefits WHERE benefits_id = 2
LIMIT 11;
Hadoop - Processing
Start your Big Data learning journey here with us. Home page: Dive into Big Data - Hadoop
Hadoop - Processing, i.e. MapReduce
In this blog we will discuss the latest MapReduce framework, MR2 (versions 2.0 and above).
Brief on MapReduce :
Life cycle of data being processed through the MapReduce framework :
1. Creation of input split
2. Mapper
3. Combiner
4. Partitioner
5. Shuffle and Sort
6. Reducer
Let's explain the above cycle using an example:
1. Assume that we have 384 GB of data to process.
File: Customer_Data.csv
Fields: CustomerID, Name_of_the_customer, Product, Price, Marketplace, Date_of_Purchase
Sample data:
001,Tony Bridge,Apple 7,49990,Amazon,1/4/2017
002,Smita Arora Singh,Apple 7 Plus,60000,Ebay,1/4/2017
2. Input Split: At this stage the data is divided so that it can be processed in parallel.
An input split is nothing but a logical partition of the input data.
Example: As assumed above, we have 384 GB of data and the default block size of 128 MB.
Physical partitions: 384 GB / 128 MB = 3072 blocks (approx. 3000 parts)
But there is a possibility that block 1 and block 2 together hold one complete record; this is where the input split concept comes into the picture, and block 1 and block 2 become InputSplit 1, and so on.
That means an input split is a logical partition of the data, while a block is a physical one (see the sketch below).
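To make the relationship between blocks and splits concrete, here is a small sketch of the split-size rule that FileInputFormat (MRv2) follows; with the default minimum (1 byte) and maximum (Long.MAX_VALUE) the split size simply equals the HDFS block size. The class name SplitSizeSketch is only an illustrative name, not part of Hadoop.

// A sketch of the split-size rule used by FileInputFormat (MRv2):
// splitSize = max(minSize, min(maxSize, blockSize))
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB, the default block size
        // With the default min/max the split size equals the block size: 134217728 bytes
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}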
Record Reader: This depends on the InputFormat; the default is TextInputFormat. Let's assume we have not set any input format, so TextInputFormat is used.
Note: We have to understand that anything given as input to the Mapper and Reducer must be in (key, value) format.
The record reader, as the name suggests, reads a single record at a time.
Since the input format is TextInputFormat, it produces pairs like the following:
(Key, Value)
(0, 001,Tony Bridge,Apple 7,49990,Amazon,1/4/17)
(44, 002,Smita Arora Singh,Apple 7 Plus,60000,Ebay,1/4/17)
where the key is the byte offset of the record and the value is the actual record.
3. Mapper: This requires input as key-value pairs. One mapper task runs per input split, and its map() method is called once for every record in that split.
Syntax:
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends Object
Assume the requirement is to find the total number of sales per product till date.
Example :
Input to the Mapper:
(0, 001,Tony Bridge,Apple 7,49990,Amazon,1/4/17)
Key: 0
Value: 001,Tony Bridge,Apple 7,49990,Amazon,1/4/17
public class RetailMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text product = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the CSV record; the product name is the 3rd field (index 2)
        String[] parts = value.toString().split(",");
        product.set(parts[2].trim());
        context.write(product, one);
    }
}
Sample Mapper Output :
Apple 7, 1
Apple 7 Plus, 1
Samsung S8, 1
Description of Mapper Phase :
1. Split the record on the comma (,) delimiter.
2. Emit the product as the key and the constant one as the value.
Key: Product
Value: 1 (one)
As we can see, this alone does not give the required output; the remaining part of the code comes in the Reducer phase.
4. Combiner: This phase is also called a mini reducer. The combiner combines key/value pairs that have the same key. Each combiner may run zero, one, or multiple times.
There might be a situation where the mappers produce a lot of intermediate output; in that case we can use a combiner so that the reducer has less data to process.
Example: Let's assume there are 500 mappers and 1 reducer, and your task is to minimize the time the reducer needs to process the data, or the total job time.
To do this we can use a combiner in our MapReduce program, as shown below.
This phase is also used to optimize MapReduce jobs.
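Since summing counts is associative, the reducer shown later in this post (IntSumReducer) can be reused as the combiner. A minimal driver fragment, assuming a Job object named job has already been created:

// Run the reduce logic on each mapper's local output before the shuffle,
// so far fewer (product, 1) pairs have to travel across the network.
job.setCombinerClass(IntSumReducer.class);

Because summing is both associative and commutative, the framework is free to apply the combiner any number of times without changing the final counts.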
5. Partitioner: The mechanism that decides which mapper output goes to which reducer is called partitioning.
The default partitioner hashes the keys to distribute records across the reduce tasks, but you can override it and use your own custom partitioner.
Custom Partitioner example:
For this we need to implement getPartition(Text key, Text value, int numPartitions).
Let's say products costing up to 50K should go to Reducer 1 and products costing more than 50K should go to Reducer 2.
Then the method would look as follows:
public int getPartition(Text key, Text value, int numPartitions)
{
    // Assumes the map output value carries the full CSV record; the price is the 4th field (index 3)
    String[] parts = value.toString().split(",");
    int p_cost = Integer.parseInt(parts[3].trim());
    if ((p_cost >= 1) && (p_cost <= 50000))
        return 0;   // goes to Reducer 1
    else
        return 1;   // goes to Reducer 2
}
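To plug this method into a job, it would be wrapped in a class that extends Partitioner<Text, Text> (the name CostPartitioner below is just an assumed name) and registered in the driver along with two reduce tasks; a sketch of the relevant driver lines:

// Assumed class CostPartitioner wraps the getPartition() method shown above.
job.setPartitionerClass(CostPartitioner.class);
job.setNumReduceTasks(2);   // partition 0 -> Reducer 1 (cost <= 50K), partition 1 -> Reducer 2

Note that the type parameters of the partitioner must match the map output key/value types of the job.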
6. Shuffle and Sort:
Shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting.
The shuffle and sort phase is done by the MapReduce framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key.
7. Reducer:
The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair.
Syntax :
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends Object
Let's complete the part of our requirement that was left pending in the Mapper phase:
public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
Key,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Key key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
Reducer output as <Key: Product, Value: Sales till date>:
Apple 7 , 78909889
Apple 7 Plus , 67978979
Samsung S8, 7898789
Now we have the total sales per product till date.
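To tie the phases together, here is a minimal driver sketch for the job built in this post. It wires up the RetailMapper and IntSumReducer shown above; the class name RetailSalesDriver and the command-line input/output paths are assumptions made for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RetailSalesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "product sales till date");
        job.setJarByClass(RetailSalesDriver.class);

        job.setMapperClass(RetailMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional mini reducer on the map side
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // directory holding Customer_Data.csv
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This would typically be packaged into a jar and run with hadoop jar, passing the input and output directories as the two arguments.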
Saturday, 11 March 2017
Dive into Big Data - Hadoop
To start your journey towards Big Data / Hadoop, follow these steps:
1. Brush up on Core Java and SQL.
2. Learn basic Unix commands. For this you can refer to this link: Basic Unix Commands
Big Data = Big + Data. We already have RDBMSs to handle data, but now we are getting digital data from everywhere around the world, and that is what makes the data big. This is Big Data.
To store and process it we have a powerful Java-based framework called HADOOP.
Who created it - Doug Cutting
Where - Yahoo
When - 2006
How did it become open source? It started as a Yahoo product; Yahoo later contributed Hadoop to the Apache Foundation, and it is now a top-level open-source project at Apache.
                  Hadoop
                     |
         -----------------------------
         |                           |
      Storage                   Processing
       (HDFS)                  (MapReduce)
1. Storage: To store big data, Hadoop uses HDFS, i.e. the Hadoop Distributed File System.
2. MapReduce: This Java framework is designed to process large volumes of data in parallel.
We will look at both in detail:
1. Storage: Refer to this link: Hadoop - Storage
2. MapReduce: Refer to this link: Hadoop Processing - MapReduce
Hadoop Ecosystem Tools:
1. Data Analysis Tool: Hive
Apache Hive is an open-source data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, querying, and analysis.
This tool was initially developed by Facebook and later contributed to Apache.
This tool is used for structured data.
Refer to this link: Apache Hive
Download Hive from the official Apache website: apache/hive/hive-2.1.1/
2. Data Transformation Tool: Pig
This tool handles structured as well as semi-structured data.
3. Data Ingestion Tool: Sqoop
It is an open-source product from Apache.
The name Sqoop comes from SQ + OOP = SQL to Hadoop.
This tool is used to transfer data from a relational database to Hadoop-supported storage systems and vice versa.
Interesting facts about Sqoop: it is not only used with the open-source Hadoop framework, but also by industry giants such as those below:
1. Informatica provides a Sqoop-based connector.
2. Pentaho provides an open-source Sqoop-based connector.
3. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.
and many more ...
Refer to this link: SQOOP
Also refer to this link to refresh your PostgreSQL knowledge. It walks through the SELECT queries used to fetch records from an RDBMS (i.e. PostgreSQL) when transferring data into the Hadoop Distributed File System, Hive, or a NoSQL database like HBase.
4. Data Ingestion Tool: Flume (Coming Soon)
This is also a data ingestion tool, but it is used to transfer semi-structured data from any web server to HDFS/Hive/HBase.
Example: Apache log files stored on a remote web server can be transferred to HDFS using Flume.
Apache Kafka :
First, let's go through how we can use the Kafka channel in Flume as a reliable and highly available channel for any source/sink combination.
Refer this blog : Kafka Channel in Flume
In that blog you will learn how to transfer data from a web server to HDFS.
5. NoSQL Databases (Coming Soon)
Hadoop - Storage
Hadoop - Storage, i.e. HDFS (Hadoop Distributed File System)
This framework provides a distributed file system, which means your data can now be stored across multiple locations.
These are a few things you need to consider while working with HDFS:
1. Block size: This is the size into which data is split, decided at the time of Hadoop configuration on your cluster. The default is 128 MB (Hadoop version 2).
If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would force the master machine to manage an enormous amount of metadata.
Configuration parameter & Value :
dfs.blocksize = 134217728
The above value is in bytes (equivalent to 128 MB).
Configuration file :
hdfs-site.xml
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
Example :
1. Block size: 128 MB (the default)
Suppose you have 1 GB of data = 1024 MB.
With a block size of 128 MB, 8 blocks will be created (1024 / 128 = 8).
The master machine stores the metadata about your data, such as which slave machine holds which block.
2. Block size: 32 MB
Suppose you have 1 GB of data = 1024 MB.
With a block size of 32 MB, 32 blocks will be created (1024 / 32 = 32).
As you can see, the number of blocks increases, so the master machine has to keep metadata for 32 blocks instead of 8.
Note: It is advisable to use a block size of 128 MB or more.
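The block size is a client-side setting, so besides hdfs-site.xml it can also be overridden per application through the Configuration API. A minimal sketch (the class name is an assumed example; the value is the 128 MB default in bytes):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 128 MB in bytes; files written through this configuration use this block size
        conf.setLong("dfs.blocksize", 134217728L);
        System.out.println(conf.getLong("dfs.blocksize", 0L));
    }
}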
2. Block Replication :
Using this we can create replicas (backups) of the blocks.
Configuration parameter & Value :
dfs.replication = 3
This is the default value of dfs.replication if none is provided.
Configuration file :
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
Example: When can it be useful?
If the data on any slave machine gets corrupted, or the machine goes down (i.e. is unable to serve data), then this replication parameter plays an important role.
Let's see how the process works:
Suppose we have 256 MB of data; with a block size of 128 MB this creates 2 blocks (as we learnt above).
Now, if we have also configured the replication factor, each of your 2 blocks is replicated on 3 machines.
If any one slave machine fails, you still have the data on the other 2 slave machines.
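The replication factor can also be changed for an individual file after it has been written, using the FileSystem API. A minimal sketch (the file path below is just an assumed example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep 3 copies of each block of this (assumed) file
        fs.setReplication(new Path("/data/Customer_Data.csv"), (short) 3);
        fs.close();
    }
}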
3. File System Permissions :
In HDFS you will be required to set file permissions. If we need to give full access to the owner and only read and execute access to the group and others, then we set the umask value to 022.
<property>
<name>fs.permissions.umask-mode</name>
<value>022</value>
</property>
As we can see it is a 3-digit number (022), where the first digit (0) applies to the owner, the second (2) to the group, and the third (2) to others.
Each umask digit from 0 to 7 results in the permissions shown below:
0 : read, write and execute
1 : read and write
2 : read and execute
3 : read only
4 : write and execute
5 : write only
6 : execute only
7 : no permissions
Suppose we set fs.permissions.umask-mode to 077; that means:
0 - read, write and execute for the owner
7 - no permissions for the group
7 - no permissions for others
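Permissions on an existing HDFS path can also be set programmatically. A minimal sketch using FsPermission (the path is an assumed example; the octal 0755 is what a 022 umask leaves for a new directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // rwxr-xr-x: full access for the owner, read and execute for group and others
        fs.setPermission(new Path("/data"), new FsPermission((short) 0755));
        fs.close();
    }
}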
For more parameters, please refer to Apache Hadoop - hdfs-site.xml parameters.