Re: multiple Hadoop consumer tasks per partition

What do you mean by "poor throughput"?  Here at Metamarkets we use Kafka in AWS to
collect events in realtime and haven't had throughput issues beyond
the general 20MB/s bandwidth limit that you tend to get inside of AWS.  We
use our own consumer to write data into S3 rather than the
Hadoop consumer, but I doubt there are significant differences.

Fwiw, I believe the reason Kafka maintains a single consumer per
topic/partition pair is so that it can reliably restart from where things
left off.  The consumer maintains state by colocating the
high watermark with the last persisted data, so that the offset and the
data it describes are saved together.  If you have multiple consumers randomly
pulling from the same partition, then in order to replay that
correctly, you need to record the full start and end
watermark of every chunk pulled.  Otherwise, if one of the consumers
fails and you go to replay that data, you will probably get it out of
sequence and ultimately end up with a whole bunch of data loss.
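
To make the watermark point concrete, here is a minimal sketch in Python. It is not the contrib hadoop-consumer's code: the function name, the (start, end) chunk layout, and the offsets are assumptions made up for illustration. It shows why taking the max of the per-task watermarks can skip data when one task fails, and why you would need the start and end of every chunk to find a safe restart point.

```python
from typing import List, Tuple

# Hypothetical record of one persisted chunk: (start_offset, end_offset),
# with end_offset exclusive. Not a Kafka data structure.
Chunk = Tuple[int, int]


def safe_resume_offset(committed: int, chunks: List[Chunk]) -> int:
    """Return the offset up to which persisted data is contiguous,
    starting from the previously committed offset."""
    resume = committed
    for start, end in sorted(chunks):
        if start > resume:
            break  # gap: some chunk before this one was never persisted
        resume = max(resume, end)
    return resume


# Three hypothetical tasks split offsets 0..300 of one partition;
# the task that owned 100..200 failed before persisting anything.
persisted = [(0, 100), (200, 300)]

naive = max(end for _, end in persisted)  # max(watermarks)
safe = safe_resume_offset(0, persisted)

print(naive, safe)  # prints "300 100"
```

Restarting from 300 would silently drop offsets 100..200, whereas restarting from 100 re-reads 200..300, so the chunk records are also what you would use to skip or de-duplicate data that was already written.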

Have you looked into whether you are bottlenecked on broker IO, the network
between the consumer and the broker, or somewhere else?
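
One rough way to narrow that down is to time the fetch loop by itself and then again with the write step attached. The sketch below is only an illustration: `measure_mb_per_s` and its `sink` parameter are hypothetical names, and `chunks` stands in for whatever iterator your consumer uses to yield message payloads.

```python
import time
from typing import Callable, Iterable


def measure_mb_per_s(
    chunks: Iterable[bytes],
    sink: Callable[[bytes], None] = lambda payload: None,
) -> float:
    """Drain `chunks`, pass each payload to `sink`, and return the
    observed rate in MB/s (1 MB = 10**6 bytes)."""
    total_bytes = 0
    started = time.perf_counter()
    for payload in chunks:
        sink(payload)  # no-op by default, so only the fetch is timed
        total_bytes += len(payload)
    elapsed = time.perf_counter() - started
    if elapsed <= 0:
        return float("inf")
    return (total_bytes / 1e6) / elapsed
```

If the rate with the default no-op sink is already low, the limit is on the fetch side (broker IO or network). If it only drops once the real write function is passed as `sink`, the time is going into the output step instead.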

On Mon, Sep 17, 2012 at 7:04 PM, Matthew Rathbone wrote:
> Hey,
> So I'm currently running one mapper per-partition. I guess I didn't state
> this, but my code is based on the hadoop-consumer in the contrib/ project.
> I was really wondering whether anyone has tried multiple consumers per
> partition.
> On Mon, Sep 17, 2012 at 6:54 PM, Min Yu <[EMAIL PROTECTED]> wrote:
>> If you want to run one Mapper job per partition,
>> https://github.com/miniway/kafka-hadoop-consumer
>> might help.
>> Thanks
>> Min
>> On Sep 18, 2012, at 6:51 AM, Matthew Rathbone <[EMAIL PROTECTED]> wrote:
>> > Hey guys,
>> >
>> > I've been using the hadoop consumer a whole lot this week, but I'm seeing
>> > pretty poor throughput with one task per partition. I figured a good
>> > solution would be to have multiple tasks per partition, so I wanted to run
>> > my assumptions by you all first:
>> >
>> > This should enable the broker to round robin events between tasks right?
>> >
>> > When I record the high-watermark at the end of the mapreduce job there will
>> > be N entries for each partition (one per task), so is it correct to just
>> > take max(watermarks)?
>> > -- my assumption is that as they're getting events round-robin, everything
>> > should have been consumed up to the highest watermark found. Does this hold
>> > true?
>> >
>> > Is anyone else using the consumer like this?
>> >
>> >
>> >
>> > --
>> > Matthew Rathbone
>> > Foursquare | Software Engineer | Server Engineering Team
>> > [EMAIL PROTECTED] | @rathboma <http://twitter.com/rathboma> |
>> > 4sq<http://foursquare.com/rathboma>