Kafka >> mail # user >> hadoop-consumer code in contrib package


navneet sharma 2013-01-14, 18:35
Felix GV 2013-01-14, 22:43

Re: hadoop-consumer code in contrib package
Thanks, Felix, for sharing your work. The contrib hadoop-consumer looks like it
works the same way.

I think I really need to understand this offset stuff. So far I have used only
the high-level consumer. When the consumer was done reading all the messages, I
used to kill the process (because it won't exit on its own).

Then I used the producer to pump in more messages and a consumer to read the
new messages (a new process, since I had killed the last consumer).

But I never saw messages getting duplicated.
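
A minimal sketch of that high-level consumer pattern (0.8-style Java API; the
property names differ slightly in 0.7, e.g. zk.connect and groupid; the topic,
group, and ZooKeeper address below are placeholders):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class HighLevelConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // "zk.connect" in 0.7
            props.put("group.id", "my-group");                // "groupid" in 0.7

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, Integer> topicCount = new HashMap<String, Integer>();
            topicCount.put("my-topic", 1); // one stream for this topic

            KafkaStream<byte[], byte[]> stream =
                connector.createMessageStreams(topicCount).get("my-topic").get(0);

            // This loop blocks waiting for new messages, which is why the
            // process never exits on its own. The connector commits offsets to
            // ZooKeeper for the group, so a re-launched consumer with the same
            // group.id resumes where the previous one stopped instead of
            // re-reading everything.
            for (MessageAndMetadata<byte[], byte[]> mam : stream) {
                System.out.println(new String(mam.message()));
            }
            connector.shutdown();
        }
    }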

Now it's not very clear to me how offsets are tracked, specifically when I
re-launch the consumer. And why is the retention policy not working when used
with SimpleConsumer? For my experiment I set it to 4 hours.

Please help me understand.

Thanks,
Navneet
On Tue, Jan 15, 2013 at 4:12 AM, Felix GV <[EMAIL PROTECTED]> wrote:

> I think you may be misunderstanding the way Kafka works.
>
> A Kafka broker is never supposed to clear messages just because a consumer
> read them.
>
> The Kafka broker will instead clear messages after their retention period
> ends, though it will not delete the messages at the exact time when they
> expire. Instead, a background process will periodically delete a batch of
> expired messages. The retention policies guarantee a minimum retention
> time, not an exact retention time.
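
For reference, retention is purely broker-side configuration. A sketch of the
relevant server.properties entries (the cleaner-interval key name here is the
0.7-era one and may differ in other versions):

    # server.properties -- retention applies regardless of what consumers do
    log.retention.hours=4          # minimum time a message is kept
    log.cleanup.interval.mins=10   # how often the background cleaner runs

Deletion also happens at log-segment granularity, which is another reason
messages usually live somewhat longer than the configured minimum.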
>
> It is the responsibility of each consumer to keep track of which messages
> they have consumed already (by recording an offset for each consumed
> partition). The high-level consumer stores these offsets in ZK. The simple
> consumer has no built-in capability to store and manage offsets, so it is
> the developer's responsibility to do so. In the case of the hadoop consumer
> in the contrib package, these offsets are stored in offset files within
> HDFS.
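
To make that concrete, a minimal sketch of manual offset tracking with the
0.7-era SimpleConsumer (signatures approximate; loadOffset, process, and
saveOffset are hypothetical helpers standing in for the offset-file handling):

    import kafka.api.FetchRequest;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.javaapi.message.ByteBufferMessageSet;
    import kafka.message.MessageAndOffset;

    SimpleConsumer consumer = new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024);
    long offset = loadOffset();  // hypothetical helper: last saved offset, 0 on first run
    ByteBufferMessageSet messages =
        consumer.fetch(new FetchRequest("my-topic", 0, offset, 1024 * 1024));
    for (MessageAndOffset mo : messages) {
        process(mo.message());   // hypothetical helper: handle one message
        offset = mo.offset();    // in 0.7, the offset to fetch from next
    }
    saveOffset(offset);          // hypothetical helper: persist it; the broker tracks nothing
    consumer.close();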
>
> I wrote a blog post a while ago that explains how to use the offset files
> generated by the contrib consumer to do incremental consumption (so that
> you don't get duplicated messages by re-consuming everything in subsequent
> runs).
>
>
> http://felixgv.com/post/69/automating-incremental-imports-with-the-kafka-hadoop-consumer/
>
> I'm not sure how up to date this is, regarding the current Kafka versions,
> but it may still give you some useful pointers...
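
The core idea from that post, as a rough sketch (the paths and the .offset
suffix below are illustrative assumptions, not the contrib package's actual
names): after each run, promote the offset files the job wrote so the next run
starts from them instead of from the beginning.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Promote run N's offset files to be run N+1's starting point.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path prevOutput = new Path("/kafka-etl/output");  // illustrative path
    Path nextInput = new Path("/kafka-etl/input");    // illustrative path
    fs.delete(nextInput, true);
    fs.mkdirs(nextInput);
    for (FileStatus status : fs.listStatus(prevOutput)) {
        if (status.getPath().getName().endsWith(".offset")) {  // assumed suffix
            FileUtil.copy(fs, status.getPath(), fs, nextInput, false, conf);
        }
    }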
>
> --
> Felix
>
>
> On Mon, Jan 14, 2013 at 1:34 PM, navneet sharma <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am trying to use the code supplied in the hadoop-consumer package. I am
> > running into the following issues:
> >
> > 1) This code uses SimpleConsumer, which contacts the Kafka broker directly,
> > without ZooKeeper. Because of this, messages are not getting cleared from
> > the broker, and I am getting duplicate messages in each run.
> >
> > 2) The retention policy specified as log.retention.hours in
> > server.properties is not working. Not sure if it's due to SimpleConsumer.
> >
> > Is this expected behaviour? Is there any code using the high-level consumer
> > for the same work?
> >
> > Thanks,
> > Navneet Sharma
> >
>

 
Felix GV 2013-01-15, 18:17
navneet sharma 2013-01-17, 00:41
Jun Rao 2013-01-17, 05:12
navneet sharma 2013-01-17, 14:21
Jun Rao 2013-01-17, 15:29