Tuesday, 25 April 2017

Apache Kafka Part 1: Webserver --> Flume (Kafka Channel) --> HDFS (Same Server)

Apache Kafka comes into the picture because traditional messaging systems don't scale up to handle big data in real time.


It was originally developed by LinkedIn engineers.


Apache Kafka is a distributed messaging framework that meets the demands of big data by scaling out on commodity hardware.


It is best suited for real-time use cases.


Let's look at an example where we need to extract data from a webserver and put it into HDFS.


1) If your webserver resides on the same Hadoop cluster:


Webserver --> Flume (Kafka Channel) --> HDFS

# Name the source, channel, and sink as source1, channel1, and sink1
# for the agent, in this case 'logagent'.
logagent.sources  = source1
logagent.channels = channel1
logagent.sinks    = sink1


# spooldir Source Configuration

logagent.sources.source1.type     = spooldir

# the directory the spooling directory source watches for new files
logagent.sources.source1.spoolDir = /log/C_12345


# Bind the source to the channel
logagent.sources.source1.channels = channel1

# HDFS Sink configuration
logagent.sinks.sink1.type                   = hdfs
logagent.sinks.sink1.hdfs.path              = hdfs://<hadoop Cluster IP>/flume
logagent.sinks.sink1.hdfs.fileType          = DataStream
logagent.sinks.sink1.hdfs.useLocalTimeStamp = true
logagent.sinks.sink1.hdfs.rollInterval      = 600


# Bind the sink to the channel
logagent.sinks.sink1.channel       = channel1


# Kafka Channel Configuration
logagent.channels.channel1.type                = org.apache.flume.channel.kafka.KafkaChannel
logagent.channels.channel1.capacity            = 10000
logagent.channels.channel1.transactionCapacity = 1000
logagent.channels.channel1.brokerList          = kafkaf-2:9092,kafkaf-3:9092
logagent.channels.channel1.topic               = channel1
logagent.channels.channel1.zookeeperConnect    = kafkaf-1:2181
logagent.channels.channel1.groupId             = flume2


Save this as logagent.conf in Flume's conf directory; in my case the path is /hadoop/inst/apache-flume-1.6.0-bin/conf/logagent.conf.
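
The Kafka channel writes events to the Kafka topic channel1 on the brokers listed above. Depending on your broker settings the topic may be auto-created, but it is safer to create it up front. A minimal sketch, assuming Kafka is installed under /hadoop/inst/kafka (the install path, partition count and replication factor are just examples; adjust them to your cluster):

/hadoop/inst/kafka/bin/kafka-topics.sh --create --zookeeper kafkaf-1:2181 --replication-factor 2 --partitions 1 --topic channel1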

Log in to your Hadoop cluster, go to the Flume bin directory, and then type the command below:


./flume-ng agent --conf /hadoop/inst/apache-flume-1.6.0-bin/conf/ -f /hadoop/inst/apache-flume-1.6.0-bin/conf/logagent.conf -Dflume.root.logger=INFO,console -n logagent

--conf /hadoop/inst/apache-flume-1.6.0-bin/conf/
syntax: --conf <path of the Flume conf folder>



-f /hadoop/inst/apache-flume-1.6.0-bin/conf/logagent.conf
syntax: -f <path of the logagent.conf file>



-Dflume.root.logger=INFO,console
This prints all log messages to the console.


-n logagent
syntax: -n <name of the agent>       // Note: the agent name, not the conf file name



In the next blog we will see how to fetch records from an external server and put them into HDFS.


Thursday, 20 April 2017

Something About Amazon Web Services (AWS)

Something about AWS:

We have all heard that cloud computing is the future. Let's understand the most famous cloud service provider, i.e. Amazon.


Cloud computing provides a simple way to access servers, storage, databases, and a broad set of application services over the Internet.


1. As you all know, Jeff Bezos (one of the richest men in the world) started Amazon.

2. Amazon officially launched Amazon Web Services (AWS) in 2006. 

3. Amazon Web Services is a secure cloud services platform, offering compute power, database storage, content delivery and other functionality to help businesses scale and grow.

4. Their services operate from 16 geographical regions across the world, and they are planning to add two more regions, Paris and Sweden, this year.



Screenshot showing the services and feature count of AWS


 AWS owns and maintains the network-connected hardware required for these application services, while you provision and use what you need via a web application. 

Trade capital expense for variable expense: 

Instead of investing heavily in data centers and servers before you know how you're going to use them, you can pay only when you consume computing resources and pay only the amount you consume.

Benefit from massive economies of scale: 

By using cloud computing, you can achieve a lower variable cost than you can get on your own. Because usage from hundreds of thousands of customers is aggregated in the cloud, providers such as Amazon Web Services can achieve higher economies of scale, which translates into lower pay-as-you-go prices. 

Stop guessing capacity: 

Eliminate guessing about your infrastructure capacity needs. When you make a capacity decision before deploying an application, you often end up either sitting on expensive idle resources or dealing with limited capacity. With cloud computing, these problems go away. You can access as much or as little as you need and scale up and down as required with only a few minutes' notice.


Increase speed and agility:


In a cloud computing environment, new IT resources are only ever a click away, which means you reduce the time it takes to make those resources available to your developers from weeks to just minutes. This results in a dramatic increase in agility for the organization, because the cost and time it takes to experiment and develop is significantly lower.

Stop spending money on running and maintaining data centers: 


Focus on projects that differentiate your business, not the infrastructure. Cloud computing lets you focus on your own customers, instead of on the heavy lifting of racking, stacking, and powering servers.

Go global in minutes: 


Easily deploy your application in multiple regions around the world with just a few clicks. This means you can provide a lower latency and better experience for your customers simply and at minimal cost.







Monday, 17 April 2017

Hadoop Info - Be interview ready - One Active Namenode

4) An interviewer might ask why we do not use two active Namenodes at a time in an HA cluster.


You can refer to this link: About HA Cluster (High Availability Cluster)


Answer : 


There should be only one active Namenode running in an HA cluster.


If there are two active Namenodes running in the cluster, it leads to corruption of the data. This is termed the split-brain scenario.


To overcome this problem we can use fencing (a process of ensuring that only one Namenode remains active at a time).
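
Fencing is configured in hdfs-site.xml. A minimal sketch, assuming SSH fencing is used; the private-key path is only an example and must point to a key the HDFS user can use to reach the other Namenode host:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <!-- example path; replace with your own key -->
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>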


Hope you like my articles. Please comment and like.




Sunday, 16 April 2017

Hadoop Info - Be Interview Ready - YARN came before 2.x version

3. Did you know that YARN, Kerberos security, HDFS federation and Namenode HA features already existed before the 2.0 release?


Answer :


In November 2011, version 0.23 of Hadoop was released. It already had these features, but it was not as mature as the 2.0 version. The 0.23 line was later upgraded to a stable release and named 2.0.



Friday, 14 April 2017

Hadoop Info - Be Interview Ready- Hadoop Production Master node

Hope you visited my previous blog on High Availability Hadoop Cluster

2. You are working as a Hadoop developer, but do you know the minimum configuration required for a production master node in a small, medium or large Hadoop cluster?


Answer :


A cluster with 25 nodes or fewer is considered a small Hadoop cluster.


A cluster of up to 400 nodes is considered a medium Hadoop cluster.


A cluster above 400 nodes is considered a large Hadoop cluster.

Minimum configuration required for a production master node:

Small cluster:

Dual quad-core 2.6 GHz CPU

32 GB of DDR3 RAM


Medium cluster:

Dual quad-core 2.6 GHz CPU

64 GB of DDR3 RAM

Large cluster:

Dual quad-core 2.6 GHz CPU

128 GB of DDR3 RAM

 





Hadoop Info - Be Interview Ready - HA Cluster

1) Share your knowledge about HA clusters.


This feature comes in Apache Hadoop version 2.0 and above.


Before that, the Namenode was a single point of failure, meaning that if the Namenode became unavailable, the whole cluster became unavailable.


To get it working again, we had to restart the Namenode services manually.


The High Availability architecture provides the solution to this problem by allowing us to have two Namenodes:


1. Active Namenode

2. Passive Namenode (also called the Standby Namenode)


So now your cluster has two Namenodes. If one Namenode goes down, the other Namenode takes over the work, thus reducing cluster downtime.


This is possible only when both Namenodes share the same metadata information.
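
Once HA is configured, you can check which Namenode is currently active from the command line. A minimal sketch, assuming the two Namenodes were given the IDs nn1 and nn2 in hdfs-site.xml:

hdfs haadmin -getServiceState nn1     # prints "active" or "standby"
hdfs haadmin -getServiceState nn2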


The next Hadoop Info will be shared on 15/04/2017. Stay in touch.

Please comment or like as well, to make the articles more useful for Hadoop lovers (data lovers).

Sunday, 9 April 2017

Core Java - 1/10


This blog will help refresh your Core Java knowledge.

The Hadoop framework was written in Java, which is why Java is a prerequisite for learning Hadoop.


In this blog we will cover useful keywords that will help in your Hadoop journey.

Before starting, let's learn the naming conventions used in Java:


1. Class and interface names should start with an upper-case letter.


2. Method and variable names should start with a lower-case letter.


If the name of a class, interface, method or variable is a combination of two words, the second word should start with a capital letter (camel case).

Example: sumCalculator is a method name.
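
A small illustrative sketch of these conventions (the class, variable and method names here are just examples):

public class SumCalculator {                 // class name starts with an upper-case letter
    private int runningTotal;                // variable name starts with a lower-case letter (camel case)

    public int addNumbers(int a, int b) {    // method name starts with a lower-case letter (camel case)
        runningTotal = a + b;
        return runningTotal;
    }
}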


1. Break:

This will break out of the current loop. Let's take the example below:

public class FoodMarket {
   public static void main(String[] args) {
      String[] fruits = {"Apple", "Orange", "Pineapple"};
      for (String f : fruits) {

         if (f.equals("Pineapple")) {   // use equals() to compare String contents, not ==
            break;
         }
         System.out.print(f);
         System.out.print("\n");
      }

      // After the break statement, control comes here.
   }
}

Output :
Apple
Orange 


As we can see from the above example:

1. We created a fruits array of type String with length 3 (the length can be read from the fruits.length field; see the sketch after this list).
2. In for (String f : fruits), f is a String variable; f takes each of the 3 values of the fruits array in turn.
3. In the first iteration f holds "Apple" (f = "Apple"). The comparison with "Pineapple" is false, so the if block is skipped and the print statements output the value of f.
4. In the second iteration f holds "Orange" (f = "Orange"). Again the comparison with "Pineapple" is false, so the if block is skipped and the value of f is printed.
5. In the third/last iteration f holds "Pineapple" (f = "Pineapple"). Now the comparison with "Pineapple" is true, so the if block executes; it contains break;, which exits the current loop, i.e. for (String f : fruits).
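
For reference, here is a minimal sketch of the same logic written with a classic index-based for loop, using the fruits.length field mentioned in point 1 (the class name is just an example):

public class FoodMarketIndexed {
   public static void main(String[] args) {
      String[] fruits = {"Apple", "Orange", "Pineapple"};
      for (int i = 0; i < fruits.length; i++) {   // fruits.length gives the size of the array
         if (fruits[i].equals("Pineapple")) {
            break;
         }
         System.out.println(fruits[i]);
      }
   }
}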

2. Continue:

The difference between break and continue is that continue skips the rest of the current iteration and carries on with the next iteration of the loop.

Let's take the example below:

public class FoodMarket {
   public static void main(String[] args) {
      String[] fruits = {"Apple", "Orange", "Pineapple"};

      for (String f : fruits) {

         if (f.equals("Orange")) {   // use equals() to compare String contents, not ==
            continue;                // skip the rest of this iteration and continue with the next value
         }

         System.out.print(f);
         System.out.print("\n");

      }

   }
}

Output :

Apple
Pineapple

Explanation: As soon as the continue statement is reached, the rest of that iteration is skipped and the loop continues with the next value.


In the next class we will look into Objects and Classes.