Kafka >> mail # user >> Scenarios of Hadoop producers and consumers

Re: Scenarios of Hadoop producers and consumers
Indeed, Hadoop is not the ideal platform for stream processing, but there are plenty of use cases for Kafka + Hadoop. I use it to consolidate log data from many different systems into HDFS. I have N systems using the log4j appender to produce to a Kafka broker, and in my Hadoop cluster I run a simple job that consumes that data and writes it out as an HDFS file. This is, in effect, what other log aggregators like Flume do. However, we already have Kafka in our stack for other pub/sub work, so it made sense to use it for log aggregation as well.

To answer your question about consuming in Hadoop: the RecordReader will just continue to return records until the queue is exhausted. If you could manage to produce data faster than Hadoop was reading it out (very unlikely), the Hadoop job would run forever (or at least for quite a while). I believe you end up with one RecordReader per Kafka partition, so allocating more partitions would increase your throughput to Hadoop (at least until you saturate the network between the Kafka brokers and Hadoop).
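
To make the "read until exhausted, one reader per partition" behaviour concrete, here is a toy Python sketch. It is not the actual Kafka Hadoop consumer code and makes no Kafka or Hadoop calls; `PartitionReader` and `run_job` are made-up names, and each partition is modelled as a plain in-memory queue.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List


class PartitionReader:
    """Hypothetical stand-in for a RecordReader bound to one Kafka partition."""

    def __init__(self, partition_id: int, backlog: Deque[str]) -> None:
        self.partition_id = partition_id
        self.backlog = backlog

    def records(self) -> Iterator[str]:
        # Keep returning records until this partition's backlog is empty.
        # Anything appended while we iterate is picked up as well, which is
        # why a fast enough producer would keep the job running.
        while self.backlog:
            yield self.backlog.popleft()


def run_job(partitions: Dict[int, Deque[str]]) -> Dict[int, List[str]]:
    """Create one reader per partition and let each drain its own backlog."""
    readers = [PartitionReader(pid, q) for pid, q in sorted(partitions.items())]
    return {r.partition_id: list(r.records()) for r in readers}


# Two partitions with a small backlog each (made-up sample data).
partitions = {
    0: deque(["a1", "a2"]),
    1: deque(["b1"]),
}
output = run_job(partitions)
```

In a real job the readers would run in parallel as separate map tasks, so adding partitions adds readers. The sketch runs them one after another only to keep it short.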

Hope this helps

On Oct 30, 2012, at 8:40 PM, Michal Haris wrote:

> When you need your data streams to be incrementally loaded into hadoop for
> offline batch processing and/or ad-hoc querying - some things cannot (or
> are expensive to) be computed in real-time. So you have a hadoop job that
> consumes kafka stream, potentially formats the data and saves into hdfs.
> On 30 October 2012 23:28, Hussein Baghdadi <[EMAIL PROTECTED]> wrote:
>> Hi,
>> Kafka comes with support for Hadoop. I'm not sure what this means.
>> Kafka is a publish-subscribe messaging system. What are some of the
>> typical uses of Kafka's support for Hadoop producers and consumers?
>> Well, producers are easy to digest: a MapReduce job emitting data to
>> Kafka. But what about Hadoop consumers? Hadoop is a batch system, not a
>> continuously running system (like Storm or Dempsy). Say Kafka gets some
>> data, what will happen?
>> Thanks for your help and time.
> --
> Michal Haris
> Software Engineer
> www.visualdna.com | t: +44 (0) 207 734 7033