Kafka, mail # user - Incremental Hadoop + SimpleKafkaETLJob


Re: Incremental Hadoop + SimpleKafkaETLJob
Felix GV 2012-01-24, 22:28
It's ok, we're all busy and open source is essentially volunteer work.

Besides, you guys didn't promise any time frame, as far as I remember, so
technically there is no deadline at which you'll ever "break your promise"
hehe...

Still looking forward to it though :)

--
Felix

On Tue, Jan 24, 2012 at 5:12 PM, Richard Park <[EMAIL PROTECTED]> wrote:

> Yeah, sorry about missing the promise to release code.
> I'll talk to someone about releasing what we have.
>
> On Tue, Jan 24, 2012 at 11:05 AM, Felix GV <[EMAIL PROTECTED]> wrote:
>
> > Hello :)
> >
> > For question 1:
> >
> > The hadoop consumer in the contrib directory has almost everything it
> > needs to do distributed incremental imports out of the box, but it
> > requires a bit of hand-holding.
> >
> > I've created two scripts to automate the process. One of them generates
> > initial offset files, and the other does incremental hadoop consumption.
> >
> > I personally use a cron job to periodically call the incremental consumer
> > script with specific parameters (the topic and the HDFS output path).
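> >
> > The cron entry itself is nothing fancy; something like this (the script
> > name, schedule and paths are all made up):
> >
> >   # Pull new messages for one topic into HDFS every 10 minutes.
> >   */10 * * * * /opt/kafka-etl/incremental-consumer.sh weblogs /data/kafka/weblogs >> /var/log/kafka-etl.log 2>&1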
> >
> > You can find all of the required files in this gist:
> > https://gist.github.com/1671887
> >
> > The LinkedIn guys promised to release their full Hadoop/Kafka ETL code
> > eventually, but I think they haven't had time to get around to it yet.
> > When they do release it, it's probably going to be better than my
> > scripts, but for now, I think those scripts are the only publicly
> > available way to do this stuff without writing it yourself.
> >
> > I don't know about questions 2 and 3.
> >
> > I hope this helps :) !
> >
> > --
> > Felix
> >
> >
> >
> > On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I'm investigating using Kafka and would really appreciate some more
> > > experienced opinions on the way things work together.
> > >
> > > Our application instances are creating Protocol Buffer serialized
> > > messages and pushing them to topics in Kafka:
> > >
> > > * Web log requests
> > > * Product details viewed
> > > * Search performed
> > > * Email registered
> > > etc...
> > >
> > > I would like to be able to perform incremental loads from these topics
> > > into HDFS and then into the rest of the batch processing. I guess I
> > > have 3 broad questions:
> > >
> > > 1) How do people trigger the batch loads? Do you just point your
> > > SimpleKafkaETLJob input to the previous run's output offset file? Do
> > > you move files between runs of the SimpleKafkaETLJob: move the part-*
> > > files into one place and move the offsets into an input directory
> > > ready for the next run?
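> > >
> > > I.e., something along these lines between runs (all the paths, and the
> > > offsets-* naming, are made up):
> > >
> > >   # Archive this run's data files...
> > >   hadoop fs -mv /kafka-etl/output/part-* /data/incoming/weblogs/
> > >   # ...and promote its offset files to be the next run's input.
> > >   hadoop fs -mv /kafka-etl/output/offsets-* /kafka-etl/input/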
> > >
> > > 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
> > > outputs Long/Text writables and is marked as deprecated (this is in
> > > the 0.7 source). Is there an alternative class that should be used
> > > instead, or is the hadoop-consumer being deprecated overall?
> > >
> > > 3) Given that the SimpleKafkaETLMapper reads bytes in but outputs Text
> > > lines, are most people using Kafka for passing text messages around,
> > > or using JSON data etc.?
> > >
> > > Thanks,
> > > Paul
> >
>