Kafka >> mail # user >> hadoop-consumer code in contrib package


navneet sharma 2013-01-14, 18:35
Re: hadoop-consumer code in contrib package
I think you may be misunderstanding the way Kafka works.

A Kafka broker is never supposed to clear messages just because a consumer
has read them.

Instead, the Kafka broker clears messages after their retention period
ends, though it does not delete them at the exact time they expire. A
background process periodically deletes a batch of expired messages. The
retention policies guarantee a minimum retention time, not an exact
retention time.
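
To illustrate the two points above, here is a minimal sketch in Python. It is a toy in-memory model, not Kafka's actual code: the ToyLog class, its methods, and the timestamps are all hypothetical names made up for this example. It only shows that reading does not delete anything, and that an expired message stays around until the next cleanup pass.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """Hypothetical in-memory stand-in for a broker log (not Kafka's real code)."""
    retention_secs: float
    messages: list = field(default_factory=list)  # list of (append_time, payload)

    def append(self, now, payload):
        self.messages.append((now, payload))

    def read_all(self):
        # Reading never removes anything from the log.
        return [payload for _, payload in self.messages]

    def cleanup(self, now):
        # Runs only when the periodic cleaner fires; drops every expired message.
        kept = [(t, p) for t, p in self.messages if now - t < self.retention_secs]
        deleted = len(self.messages) - len(kept)
        self.messages = kept
        return deleted


log = ToyLog(retention_secs=60)
log.append(now=0, payload="a")
log.append(now=50, payload="b")

first_read = log.read_all()
second_read = log.read_all()        # same messages: consuming did not delete them

# "a" expires at t=60, but in this toy model the cleaner next runs at t=100.
still_there_at_70 = log.read_all()  # no cleanup has run yet, so "a" is still present
deleted = log.cleanup(now=100)      # removes "a" (age 100), keeps "b" (age 50)
after_cleanup = log.read_all()
```

In this model "a" outlives its 60-second retention by 40 seconds, which is what "minimum, not exact" retention means in practice.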

It is the responsibility of each consumer to keep track of which messages
it has already consumed (by recording an offset for each consumed
partition). The high-level consumer stores these offsets in ZooKeeper. The
simple consumer has no built-in capability to store and manage offsets, so
it is the developer's responsibility to do so. In the case of the Hadoop
consumer in the contrib package, these offsets are stored in offset files
within HDFS.
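
As a rough sketch of what consumer-side offset tracking looks like, the Python below keeps one offset per partition in a small JSON file and resumes from it on the next run. Everything here is an assumption for illustration: the function names are hypothetical, partitions are plain Python lists, offsets are list indexes, and the file lives on the local disk rather than in HDFS or ZooKeeper. It does not use any Kafka API.

```python
import json
import os
import tempfile


def load_offsets(path):
    """Return {partition_id: next_offset}, or an empty dict on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        # JSON object keys are strings, so convert them back to ints.
        return {int(k): v for k, v in json.load(f).items()}


def save_offsets(path, offsets):
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)


def consume_run(partitions, path):
    """Consume only messages at or after the stored offset of each partition."""
    offsets = load_offsets(path)
    consumed = []
    for pid, messages in partitions.items():
        start = offsets.get(pid, 0)
        consumed.extend(messages[start:])
        offsets[pid] = len(messages)  # the next run starts after what we just read
    save_offsets(path, offsets)
    return consumed


# Toy "broker": two partitions that keep their messages after being read.
partitions = {0: ["m0", "m1"], 1: ["n0"]}
offset_file = os.path.join(tempfile.mkdtemp(), "offsets.json")

run1 = consume_run(partitions, offset_file)  # first run reads everything
partitions[0].append("m2")                   # a new message arrives
run2 = consume_run(partitions, offset_file)  # second run reads only the new message
run3 = consume_run(partitions, offset_file)  # nothing new, nothing re-read
```

Without the offset file, every run would start from offset 0 and re-read all retained messages, which is the duplicate behaviour described in the question.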

I wrote a blog post a while ago that explains how to use the offset files
generated by the contrib consumer to do incremental consumption (so that
you don't get duplicated messages by re-consuming everything in subsequent
runs).

http://felixgv.com/post/69/automating-incremental-imports-with-the-kafka-hadoop-consumer/

I'm not sure how up to date this is with respect to current Kafka versions,
but it may still give you some useful pointers...

--
Felix

On Mon, Jan 14, 2013 at 1:34 PM, navneet sharma <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am trying to use the code supplied in the hadoop-consumer package. I am
> running into the following issues:
>
> 1) This code uses SimpleConsumer, which contacts the Kafka broker directly,
> without ZooKeeper. Because of this, messages are not getting cleared from
> the broker, and I am getting duplicate messages in each run.
>
> 2) The retention policy specified as log.retention.hours in
> server.properties is not working. I am not sure whether this is due to
> SimpleConsumer.
>
> Is this expected behaviour? Is there any code that uses the high-level
> consumer for the same work?
>
> Thanks,
> Navneet Sharma
>

 