Tuesday, 25 April 2017

Apache Kafka Part 1: Webserver --> Flume (Kafka Channel) --> HDFS (Same Server)

Apache Kafka comes into the picture because traditional messaging systems don't scale to handle big data in real time.


It was developed by LinkedIn engineers.


Apache Kafka is a distributed messaging framework that meets the demands of big data by scaling on commodity hardware.


It is best suited for real-time use cases.
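Before wiring Kafka into Flume, it helps to see its messaging model on its own: producers publish messages to a topic, and consumers read them back. Below is a minimal sketch using the console tools that ship with Kafka, assuming a broker on localhost:9092, ZooKeeper on localhost:2181, and a hypothetical topic named test.

# create the test topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

# publish a few messages (type them one per line, then Ctrl+C)
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

# read back everything published to the topic so far
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning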


Let's look at an example where we need to extract data from a webserver and put it into HDFS.


1) If your webserver resides on the same Hadoop cluster.


Webserver --> Flume (Kafka Channel) --> HDFS

# The agent is named 'logagent'; its source, channel, and sink
# are named source1, channel1, and sink1
logagent.sources  = source1
logagent.channels = channel1
logagent.sinks    = sink1


# spooldir source configuration
# Flume watches this directory and ingests any file dropped into it
logagent.sources.source1.type     = spooldir
logagent.sources.source1.spoolDir = /log/C_12345


# Bind the source to the channel
logagent.sources.source1.channels = channel1

# HDFS sink configuration
logagent.sinks.sink1.type                   = hdfs
logagent.sinks.sink1.hdfs.path              = hdfs://<hadoop Cluster IP>/flume
logagent.sinks.sink1.hdfs.fileType          = DataStream
logagent.sinks.sink1.hdfs.useLocalTimeStamp = true
logagent.sinks.sink1.hdfs.rollInterval      = 600


# Bind the sink to the channel
logagent.sinks.sink1.channel = channel1


# Kafka channel configuration
logagent.channels.channel1.type                = org.apache.flume.channel.kafka.KafkaChannel
logagent.channels.channel1.capacity            = 10000
logagent.channels.channel1.transactionCapacity = 1000
logagent.channels.channel1.brokerList          = kafkaf-2:9092,kafkaf-3:9092
logagent.channels.channel1.topic               = channel1
logagent.channels.channel1.zookeeperConnect    = kafkaf-1:2181
logagent.channels.channel1.groupId             = flume2


Save this as logagent.conf in the Flume conf directory. In my case the path is /hadoop/inst/apache-flume-1.6.0-bin/conf/logagent.conf
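The Kafka channel reads and writes the topic named channel1 on the brokers listed above. If automatic topic creation is disabled on your brokers, create the topic before starting the agent. This is only a sketch; the partition count and replication factor here are assumptions, so adjust them for your cluster.

# run from the Kafka installation directory on a broker node
bin/kafka-topics.sh --create --zookeeper kafkaf-1:2181 --replication-factor 2 --partitions 1 --topic channel1

# confirm the topic exists
bin/kafka-topics.sh --list --zookeeper kafkaf-1:2181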

Log in to your Hadoop cluster, go to the Flume bin directory, and run the command below:


./flume-ng agent --conf /hadoop/inst/apache-flume-1.6.0-bin/conf/ -f /hadoop/inst/apache-flume-1.6.0-bin/conf/logagent.conf -Dflume.root.logger=INFO,console -n logagent

--conf /hadoop/inst/apache-flume-1.6.0-bin/conf/
syntax: --conf <path of the Flume conf folder>



-f /hadoop/inst/apache-flume-1.6.0-bin/conf/logagent.conf
syntax: -f <path of the logagent.conf file>



-Dflume.root.logger=INFO,console
This sets the root log level to INFO and prints all log messages to the console.


-n logagent
syntax: -n <name of the agent>    // Note: the agent name, not the conf file name
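Once the agent is running, you can test the pipeline end to end: drop a log file into the spooling directory, wait for the sink to roll (rollInterval is 600 seconds above), and list what the HDFS sink has written. The file name access.log below is just a placeholder; Flume renames a fully ingested file with a .COMPLETED suffix by default.

# copy a sample log into the directory watched by source1
cp access.log /log/C_12345/

# after ingestion the file is marked as done
ls /log/C_12345/            # access.log.COMPLETED

# check the files written by the HDFS sink
hdfs dfs -ls /flume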



In the next blog, we will see how to fetch records from an external server and put them into HDFS.

