Kafka, mail # user - Incremental Hadoop + SimpleKafkaETLJob

Re: Incremental Hadoop + SimpleKafkaETLJob
Felix GV 2012-01-24, 19:05
Hello :)

For question 1:

The Hadoop consumer in the contrib directory has almost everything it needs
to do distributed incremental imports out of the box, but it requires a bit
of hand-holding.

I've created two scripts to automate the process. One of them generates
initial offset files, and the other does incremental hadoop consumption.

I personally use a cron job to periodically call the incremental consumer
script with specific parameters (the topic and the HDFS output path).

You can find all of the required files in this gist:

The LinkedIn guys promised to release their full Hadoop/Kafka ETL code
eventually, but I don't think they've had time to get around to it yet. When
they do release it, it's probably going to be better than my scripts, but
for now, I think those scripts are the only publicly available way to do
this without writing it yourself.
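
To give a rough idea of what one incremental cycle involves (this is a
minimal sketch, not the scripts from the gist): after each run, the part-*
files get moved somewhere permanent, and the newly written offset files
become the input of the next run. The sketch is in Python and uses the local
filesystem so that it runs on its own; on a real cluster the moves would be
HDFS operations (e.g. hadoop fs -mv). The function name rotate_run, the
directory layout and the offsets* file naming are all assumptions made for
illustration.

```python
import shutil
from pathlib import Path


def rotate_run(run_output: Path, data_dir: Path,
               offsets_input: Path, run_id: str) -> int:
    """Prepare the next incremental run after a job has finished.

    Hypothetical layout (an assumption, not taken from the gist): the job
    wrote its data as part-* files and its new offsets as offsets* files
    into run_output. Returns the number of part files moved.
    """
    dest = data_dir / run_id
    dest.mkdir(parents=True, exist_ok=True)

    # Keep the consumed data: move part-* files into a per-run directory.
    moved = 0
    for part in sorted(run_output.glob("part-*")):
        shutil.move(str(part), str(dest / part.name))
        moved += 1

    new_offsets = sorted(run_output.glob("offsets*"))
    if not new_offsets:
        # No new offsets were written, so keep the old ones; the next run
        # will then start again from the same position.
        return moved

    # Replace the previous offsets with the ones this run produced, so the
    # next run resumes where this one stopped.
    offsets_input.mkdir(parents=True, exist_ok=True)
    for old in offsets_input.glob("offsets*"):
        old.unlink()
    for off in new_offsets:
        shutil.move(str(off), str(offsets_input / off.name))
    return moved
```

A cron job would then run the consumer for a given topic and call something
like rotate_run(...) once the job has completed successfully.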

I don't know about questions 2 and 3.

I hope this helps :) !


On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:

> Hi,
> I'm investigating using Kafka and would really appreciate getting some
> more experienced opinion on the way things work together.
> Our application instances are creating Protocol Buffer serialized messages
> and pushing them to topics in Kafka:
> * Web log requests
> * Product details viewed
> * Search performed
> * Email registered
> etc...
> I would like to be able to perform incremental loads from these topics
> into HDFS and then into the rest of the batch processing. I have 3
> broad questions:
> 1) How do people trigger the batch loads? Do you just point your
> SimpleKafkaETLJob input to the previous run's output offset file? Do you
> move files between runs of the SimpleKafkaETLJob, i.e. move the part-* files
> into one place and move the offsets into an input directory ready for the
> next run?
> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
> outputs Long/Text writables and is marked as deprecated (this is in the 0.7
> source). Is there an alternative class that should be used instead, or is
> the hadoop-consumer being deprecated overall?
> 3) Given the SimpleKafkaETLMapper reads bytes in but outputs Text lines,
> are most people using Kafka for passing text messages around or using JSON
> data etc.?
> Thanks,
> Paul