Kafka, mail # user - hadoop-consumer code in contrib package


Re: hadoop-consumer code in contrib package
navneet sharma 2013-01-15, 17:06
Thanks, Felix, for sharing your work. The contrib hadoop-consumer looks like it
works the same way.

I think I really need to understand this offset handling. So far I have only
used the high-level consumer. When the consumer was done reading all the
messages, I used to kill the process (because it won't exit on its own).
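
Roughly, my consumer loop looked like the sketch below (written against the
0.8-style high-level consumer Java API; the ZooKeeper address, group id and
topic are placeholders, and property names differ slightly between Kafka
versions):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // placeholder ZK address
        props.put("group.id", "test-group");              // offsets are tracked in ZK per group id
        props.put("auto.commit.enable", "true");          // periodically commit consumed offsets to ZK
        props.put("auto.commit.interval.ms", "10000");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One stream for the topic; "test-topic" is a placeholder.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("test-topic", 1));
        ConsumerIterator<byte[], byte[]> it = streams.get("test-topic").get(0).iterator();

        // hasNext() blocks waiting for new messages, which is why the process
        // never exits on its own and I end up killing it.
        while (it.hasNext()) {
            byte[] message = it.next().message();
            System.out.println("consumed " + message.length + " bytes");
        }

        // A clean shutdown (instead of a kill) would commit the final offsets.
        connector.shutdown();
    }
}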

Then I used the producer again to pump in more messages and a consumer to read
the new messages (a new process, since I had killed the previous consumer).

But I never saw messages getting duplicated.

Now it is not very clear to me how offsets are tracked, specifically when I
re-launch the consumer. And why does the retention policy not work when used
with SimpleConsumer? For my experiment I set it to 4 hours.

Please help me understand.

Thanks,
Navneet
On Tue, Jan 15, 2013 at 4:12 AM, Felix GV <[EMAIL PROTECTED]> wrote:

> I think you may be misunderstanding the way Kafka works.
>
> A Kafka broker is never supposed to clear messages just because a consumer
> read them.
>
> The Kafka broker will instead clear messages after their retention period
> ends, though it will not delete the messages at the exact time when they
> expire. Instead, a background process will periodically delete a batch of
> expired messages. The retention policies guarantee a minimum retention
> time, not an exact retention time.
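
For example, the four-hour retention mentioned in this thread is just the
broker-side setting below; it guarantees that messages survive for at least
four hours, and actual deletion happens whenever the broker's cleanup thread
next runs:

# server.properties on the broker: retention is time-based, not driven by consumers
log.retention.hours=4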
>
> It is the responsibility of each consumer to keep track of which messages
> they have consumed already (by recording an offset for each consumed
> partition). The high-level consumer stores these offsets in ZK. The simple
> consumer has no built-in capability to store and manage offsets, so it is
> the developer's responsibility to do so. In the case of the hadoop consumer
> in the contrib package, these offsets are stored in offset files within
> HDFS.
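
To make that concrete, a SimpleConsumer-based client carries its own offset,
roughly along the lines of the sketch below (written against the 0.8-style
javaapi SimpleConsumer; the broker host, topic and partition are placeholders,
and loadOffset()/saveOffset() are hypothetical stand-ins for whatever storage
the application uses, such as the offset files on HDFS in the contrib
consumer):

import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class SimpleConsumerOffsetSketch {

    public static void main(String[] args) throws Exception {
        String topic = "test-topic";                  // placeholder
        int partition = 0;                            // placeholder
        String clientName = "simple-consumer-sketch";

        // SimpleConsumer talks to one broker directly; it knows nothing about ZooKeeper.
        SimpleConsumer consumer =
                new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, clientName);

        // The application, not the broker, remembers where the previous run stopped.
        long offset = loadOffset(topic, partition);

        FetchRequest request = new FetchRequestBuilder()
                .clientId(clientName)
                .addFetch(topic, partition, offset, 100000)
                .build();
        FetchResponse response = consumer.fetch(request);

        for (MessageAndOffset messageAndOffset : response.messageSet(topic, partition)) {
            ByteBuffer payload = messageAndOffset.message().payload();
            byte[] bytes = new byte[payload.limit()];
            payload.get(bytes);
            process(bytes);
            offset = messageAndOffset.nextOffset(); // where the next run should resume
        }

        // Persist the new offset so the next run does not re-read the same messages.
        saveOffset(topic, partition, offset);
        consumer.close();
    }

    // Hypothetical helpers: replace with real offset storage (e.g. files on HDFS).
    static long loadOffset(String topic, int partition) { return 0L; }

    static void saveOffset(String topic, int partition, long offset) { }

    static void process(byte[] message) { }
}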
>
> I wrote a blog post a while ago that explains how to use the offset files
> generated by the contrib consumer to do incremental consumption (so that
> you don't get duplicated messages by re-consuming everything in subsequent
> runs).
>
>
> http://felixgv.com/post/69/automating-incremental-imports-with-the-kafka-hadoop-consumer/
>
> I'm not sure how up to date this is, regarding the current Kafka versions,
> but it may still give you some useful pointers...
>
> --
> Felix
>
>
> On Mon, Jan 14, 2013 at 1:34 PM, navneet sharma <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am trying to use the code supplied in the hadoop-consumer package. I am
> > running into the following issues:
> >
> > 1) This code uses SimpleConsumer, which contacts the Kafka broker directly,
> > without ZooKeeper. Because of this, messages are not getting cleared from
> > the broker, and I am getting duplicate messages in each run.
> >
> > 2) The retention policy specified as log.retention.hours in
> > server.properties is not working. I am not sure if this is due to SimpleConsumer.
> >
> > Is this expected behaviour? Is there any code that uses the high-level
> > consumer for the same work?
> >
> > Thanks,
> > Navneet Sharma
> >
>