Kafka, mail # user - hadoop-consumer code in contrib package


Thread:
navneet sharma 2013-01-14, 18:35
Felix GV 2013-01-14, 22:43
navneet sharma 2013-01-15, 17:06
Felix GV 2013-01-15, 18:17
navneet sharma 2013-01-17, 00:41
Jun Rao 2013-01-17, 05:12
navneet sharma 2013-01-17, 14:21
Jun Rao 2013-01-17, 15:29
Re: hadoop-consumer code in contrib package
navneet sharma 2013-01-17, 14:21
That makes sense.

I tried an alternate approach: I am using the high-level consumer and pushing
data into HDFS through the Hadoop HDFS APIs.

I am not creating any MapReduce jobs for that.
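
A minimal sketch of the HDFS side of this approach, using the standard Hadoop
FileSystem API (the output path and record contents here are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/data/kafka/mytopic/part-0000"); // hypothetical
            FSDataOutputStream stream = fs.create(out);
            try {
                // For each consumed Kafka message, append its payload:
                // stream.write(payloadBytes);
                stream.write("example record\n".getBytes("UTF-8"));
            } finally {
                stream.close(); // the file is not durable until it is closed
            }
        }
    }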

The only problem I am seeing here is that the consumer is designed to run
forever, which means I need to find out how to close the HDFS file and shut
the consumer down.

Is there any way to kill or close a high-level consumer gracefully?

I am running v0.7.0. I don't mind upgrading to a higher version if that
allows this kind of consumer handling.
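
For reference, one commonly suggested pattern for this (a sketch only, written
against the 0.8-style high-level consumer Java API; 0.7 uses slightly
different property names, e.g. zk.connect and groupid) is to set
consumer.timeout.ms so the stream iterator stops blocking, then call
shutdown():

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerTimeoutException;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class GracefulConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // "zk.connect" in 0.7
            props.put("group.id", "hdfs-loader");             // "groupid" in 0.7
            // Make the stream iterator throw instead of blocking forever:
            props.put("consumer.timeout.ms", "10000");

            ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, Integer> topicCount = new HashMap<String, Integer>();
            topicCount.put("mytopic", 1); // hypothetical topic name
            KafkaStream<byte[], byte[]> stream =
                consumer.createMessageStreams(topicCount).get("mytopic").get(0);

            try {
                for (MessageAndMetadata<byte[], byte[]> msg : stream) {
                    // write msg.message() to the open HDFS file here
                }
            } catch (ConsumerTimeoutException e) {
                // No messages within consumer.timeout.ms: treat as caught up.
            } finally {
                consumer.shutdown(); // commits offsets to ZK, releases partitions
                // close the HDFS file here so the last block is flushed
            }
        }
    }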

Thanks,
Navneet
On Thu, Jan 17, 2013 at 10:41 AM, Jun Rao <[EMAIL PROTECTED]> wrote:

> I think the main reason for using SimpleConsumer is to manage offsets
> explicitly. For example, this is useful when Hadoop retries failed tasks.
> Another reason is that Hadoop already does load balancing, so there is not
> much need to balance the load again using the high-level consumer.
>
> Thanks,
>
> Jun
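
To make the explicit-offset idea concrete, here is a rough sketch against the
0.7-era SimpleConsumer Java API (class and package names changed in 0.8, so
treat it as illustrative only; the topic, partition, and sizes are made up):

    import kafka.api.FetchRequest;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.javaapi.message.ByteBufferMessageSet;
    import kafka.message.MessageAndOffset;

    public class SimpleConsumerSketch {
        public static void main(String[] args) {
            // host, port, socket timeout (ms), receive buffer size
            SimpleConsumer consumer =
                new SimpleConsumer("localhost", 9092, 10000, 64 * 1024);
            long offset = 0L; // e.g. restored from the last successful task
            ByteBufferMessageSet messages =
                consumer.fetch(new FetchRequest("mytopic", 0, offset, 1024 * 1024));
            for (MessageAndOffset mo : messages) {
                // process mo.message() ...
                offset = mo.offset(); // in 0.7 this is the next offset to fetch
            }
            // persist `offset` durably so a retried task can resume from here
            consumer.close();
        }
    }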
>
> On Wed, Jan 16, 2013 at 4:40 PM, navneet sharma <[EMAIL PROTECTED]> wrote:
>
> > Thanks, Felix.
> >
> > One question still remains: why SimpleConsumer, and why not the high-level
> > consumer? If I change the code to the high-level consumer, will it create
> > any challenges?
> >
> >
> > Navneet
> >
> >
> > On Tue, Jan 15, 2013 at 11:46 PM, Felix GV <[EMAIL PROTECTED]> wrote:
> >
> > > Please read the Kafka design paper
> > > <http://kafka.apache.org/design.html>.
> > >
> > > It may look a little long, but it's as short as it can be. Kafka differs
> > > from other messaging systems in a couple of ways, and it's important to
> > > understand the fundamental design choices that were made in order to
> > > understand the way Kafka works.
> > >
> > > I believe my previous email already answers both your offset tracking
> > > and retention questions, but if my explanations are not clear enough,
> > > then the next best thing is probably to read the design paper :)
> > >
> > > --
> > > Felix
> > >
> > >
> > > On Tue, Jan 15, 2013 at 12:01 PM, navneet sharma
> > > <[EMAIL PROTECTED]> wrote:
> > >
> > > > Thanks, Felix, for sharing your work. The contrib hadoop-consumer
> > > > looks like it works the same way.
> > > >
> > > > I think I need to really understand this offset stuff. So far I have
> > > > used only the high-level consumer. When the consumer is done reading
> > > > all the messages, I used to kill the process (because it won't exit on
> > > > its own).
> > > >
> > > > Again I used the producer to pump more messages and a consumer to read
> > > > the new messages (which is a new process, as I killed the last
> > > > consumer).
> > > >
> > > > But I never saw messages getting duplicated.
> > > >
> > > > Now it's not very clear to me how offsets are tracked, specifically
> > > > when I am re-launching the consumer. And why is the retention policy
> > > > not working when used with SimpleConsumer? For my experiment I set it
> > > > to 4 hours.
> > > >
> > > > Please help me understand.
> > > >
> > > > Thanks,
> > > > Navneet
> > > >
> > > >
> > > > On Tue, Jan 15, 2013 at 4:12 AM, Felix GV <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I think you may be misunderstanding the way Kafka works.
> > > > >
> > > > > A Kafka broker is never supposed to clear messages just because a
> > > > > consumer read them.
> > > > >
> > > > > The Kafka broker will instead clear messages after their retention
> > > > > period ends, though it will not delete the messages at the exact
> > > > > time when they expire. Instead, a background process will
> > > > > periodically delete a batch of expired messages. The retention
> > > > > policies guarantee a minimum retention time, not an exact retention
> > > > > time.
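
For reference, the retention settings described here are broker-side, in
server.properties; a sketch with 0.7-era property names (names have shifted
between releases, so verify against the docs for your version):

    # server.properties (0.7-era names; verify for your Kafka version)
    log.retention.hours=4           # minimum time a log segment is retained
    log.cleanup.interval.mins=10    # how often the background cleaner runs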
> > > > >
> > > > > It is the responsibility of each consumer to keep track of which
> > > > > messages they have consumed already (by recording an offset for each
> > > > > consumed partition). The high-level consumer stores these offsets in
> > > > > ZK. The simple consumer has no built-in capability to store and
> > > > > manage offsets, so the application using it must track and persist
> > > > > offsets itself.
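
(For what it's worth, in 0.7 the high-level consumer's offsets live in
ZooKeeper under a path of the form
/consumers/<group>/offsets/<topic>/<broker-partition>, which you can inspect
with the ZooKeeper CLI to watch them advance between consumer runs.)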

 