You raise a very good point regarding consumer retries. You are correct
that if one of the consumers fails, the retry will pull duplicates. I
hadn't thought about that, and it would certainly cause problems. A way
around it would be to make the MR job not retry tasks, so that if one
mapper fails, the whole job fails (which should be OK when we're running
it on a 10-minute loop).
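To make the no-retry idea concrete, here is a minimal sketch of the
all-or-nothing bookkeeping it implies. It is plain Python rather than real
Hadoop or Kafka code, and the names (TaskResult, commit_offsets) are
hypothetical. New offsets are only recorded when every mapper in the run
succeeded, so a failed run is simply repeated from the old offsets on the
next loop.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    partition: str    # hypothetical naming, e.g. "broker1-0"
    end_offset: int   # offset the mapper reached
    succeeded: bool


def commit_offsets(committed: Dict[str, int],
                   results: List[TaskResult]) -> Dict[str, int]:
    """Return the offsets the next run should start from.

    If any task failed, the previous offsets are returned unchanged,
    so nothing from the failed run is recorded.
    """
    if not all(r.succeeded for r in results):
        return dict(committed)
    updated = dict(committed)
    for r in results:
        updated[r.partition] = r.end_offset
    return updated
```

This only avoids duplicates if the output of a failed run is also thrown
away, which is an assumption of the sketch.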
Our setup is this:
3 brokers, with 3 partitions each, so 9 distinct partitions.
Given that, we have 9 mappers, each pulling from a single partition.
These mappers get around 3.5-4MB/s in total, which is very poor. This is
likely due to how busy our Hadoop cluster is, as there are 24 mappers
per box. I think we're CPU-bound on our MapReduce job, but I could be
wrong, as it's hard to tell in EC2.
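For a sense of scale, a quick back-of-the-envelope calculation in Python,
using only the numbers above and the roughly 20MB/s figure Eric quotes
for AWS:

```python
mappers = 9                 # one mapper per partition
total_mb_s = (3.5, 4.0)     # observed aggregate throughput range
aws_mb_s = 20.0             # general AWS figure from Eric's message

# Throughput each mapper achieves on average.
per_mapper = tuple(t / mappers for t in total_mb_s)
# Fraction of the 20MB/s figure the whole job reaches.
share_of_aws = tuple(t / aws_mb_s for t in total_mb_s)

print("per mapper: %.2f-%.2f MB/s" % per_mapper)
print("share of 20MB/s: %.1f%%-%.1f%%" % tuple(100 * s for s in share_of_aws))
```

That works out to roughly 0.39-0.44MB/s per mapper, or about 17.5-20% of
the 20MB/s figure.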
I think I'm going to try disabling retries and having multiple consumers
(in the same consumer group) subscribe to individual partitions; that way
I think I can double our throughput. I'm also doing some file moving (in
S3) at the end of the job, which takes a long time, so I'll disable that
too.
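As a rough sketch of the bookkeeping Eric describes (tracking the start
and end watermark of every chunk pulled), the following plain Python shows
why max(watermarks) may not be a safe resume point once several tasks
share a partition. It is not Kafka API code; safe_watermark and the chunk
tuples are hypothetical. The resume point is taken as the end of the
contiguous run of completed chunks.

```python
from typing import Iterable, Tuple

Chunk = Tuple[int, int]  # (start_offset, end_offset), end exclusive


def safe_watermark(start: int, done: Iterable[Chunk]) -> int:
    """Highest offset such that everything in [start, offset) was consumed."""
    watermark = start
    for chunk_start, chunk_end in sorted(done):
        if chunk_start > watermark:
            break  # gap: a chunk in between was never completed
        watermark = max(watermark, chunk_end)
    return watermark


# Two tasks share one partition; the task holding [100, 200) failed.
completed = [(0, 100), (200, 300)]
print(safe_watermark(0, completed))       # 100
print(max(end for _, end in completed))   # 300, which would skip [100, 200)
```

Resuming from 100 re-reads the already consumed [200, 300) chunk, so the
sketch trades possible duplicates for no loss.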
On Mon, Sep 17, 2012 at 7:44 PM, Eric Tschetter <[EMAIL PROTECTED]> wrote:
> What is "poor throughput"? Here at Metamarkets we use Kafka in AWS to
> collect events in realtime and haven't had throughput issues beyond
> the general 20MB/s bandwidth that you tend to get inside of AWS. We
> are using our own consumer to write things into S3 rather than the
> hadoop consumer, but I doubt there are significant differences.
> Fwiw, I believe the reason Kafka maintains a single consumer per
> topicXpartition is so that it can reliably restart from where things
> left off. The way the consumer maintains state is by colocating the
> high watermark with the last persisted data so that it can restart
> from where it left off. If you have multiple consumers randomly
> pulling from the same partition, then in order to replay that
> correctly, you are going to need to maintain the full start and end
> watermark of all chunks pulled. Otherwise, if one of the consumers
> fails and you go to replay that data, you will probably get it out of
> sequence and ultimately end up with a whole bunch of data loss.
> Have you looked into if you are bottlenecked on broker IO, network
> between the consumer and the broker or somewhere else?
> On Mon, Sep 17, 2012 at 7:04 PM, Matthew Rathbone
> <[EMAIL PROTECTED]> wrote:
> > Hey,
> > So I'm currently running one mapper per-partition. I guess I didn't state
> > this, but my code is based on the hadoop-consumer in the contrib/
> > I was really wondering whether anyone has tried multiple consumers per
> > partition.
> > On Mon, Sep 17, 2012 at 6:54 PM, Min Yu <[EMAIL PROTECTED]> wrote:
> >> If you want to run each Mapper job per partition,
> >> https://github.com/miniway/kafka-hadoop-consumer
> >> might help.
> >> Thanks
> >> Min
> >> On Sep 18, 2012, at 6:51 AM, Matthew Rathbone <[EMAIL PROTECTED]> wrote:
> >> > Hey guys,
> >> >
> >> > I've been using the hadoop consumer a whole lot this week, but I'm
> >> > getting pretty poor throughput with one task per partition. I figured
> >> > a good solution would be to have multiple tasks per partition, so I
> >> > wanted to run my assumptions by you all first:
> >> >
> >> > This should enable the broker to round robin events between tasks
> >> >
> >> > When I record the high-watermark at the end of the mapreduce job there
> >> > will be N entries for each partition (one per task), so is it correct
> >> > to take max(watermarks)?
> >> > -- my assumption is that as they're getting events round-robin,
> >> > everything should have been consumed up to the highest watermark
> >> > found. Does this hold true?
> >> >
> >> > Is anyone else using the consumer like this?
> >> >
> >> >
> >> >
> >> > --
> >> > Matthew Rathbone
> >> > Foursquare | Software Engineer | Server Engineering Team
[EMAIL PROTECTED] | @rathboma <http://twitter.com/rathboma> |