Kafka >> mail # user >> Incremental Hadoop + SimpleKafkaETLJob


Re: Incremental Hadoop + SimpleKafkaETLJob
Hello :)

For question 1:

The hadoop consumer in the contrib directory has almost everything it needs
to do distributed incremental imports out of the box, but it requires a bit
of hand holding.

I've created two scripts to automate the process. One of them generates
initial offset files, and the other does incremental hadoop consumption.

I personally use a cron job to periodically call the incremental consumer
script with specific parameters (for topic and HDFS path output).

You can find all of the required files in this gist:
https://gist.github.com/1671887
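
To illustrate the file-rotation step that an incremental run involves, here is a minimal sketch. The directory layout and the offset-* file naming are assumptions for illustration, not necessarily what the gist's scripts actually use, and it is shown with local mv for clarity; against HDFS you would use "hadoop fs -mv" instead:

```shell
# Rotate files between incremental SimpleKafkaETLJob runs.
# Assumed layout (illustrative):
#   $base/output  - previous run's part-* data and offset files
#   $base/data    - archive of already-consumed data
#   $base/input   - offset files seeding the next run
rotate_run() {
  base="$1"
  mkdir -p "$base/data" "$base/input"
  # Archive the consumed data from the previous run.
  for f in "$base"/output/part-*; do
    [ -e "$f" ] && mv "$f" "$base/data/" || :
  done
  # Promote the previous run's offsets to seed the next run.
  for f in "$base"/output/offset-*; do
    [ -e "$f" ] && mv "$f" "$base/input/" || :
  done
}
```

A cron entry along the lines of `*/15 * * * * /path/to/incremental-consumer.sh web-logs` (again, illustrative) would then drive the periodic runs, with the job itself launched through the contrib consumer's run-class.sh wrapper around kafka.etl.impl.SimpleKafkaETLJob.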

The LinkedIn guys promised to release their full Hadoop/Kafka ETL code
eventually, but I don't think they've gotten around to it yet. When
they do release it, it will probably be better than my scripts, but
for now, I think those scripts are the only publicly available way to do
this without writing it yourself.

I don't know about questions 2 and 3.

I hope this helps! :)

--
Felix

On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm investigating using Kafka and would really appreciate getting some
> more experienced opinion on the way things work together.
>
> Our application instances are creating Protocol Buffer serialized messages
> and pushing them to topics in Kafka:
>
> * Web log requests
> * Product details viewed
> * Search performed
> * Email registered
> etc...
>
> I would like to be able to perform incremental loads from these topics
> into HDFS and then into the rest of the batch processing. I guess I had 3
> broad questions
>
> 1) How do people trigger the batch loads? Do you just point your
> SimpleKafkaETLJob input to the previous run's outputted offset file? Do you
> move files between runs of the SimpleKafkaETLJob: move the part-* files into
> one place and move the offsets into an input directory ready for the next
> run?
>
> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
> outputs Long/Text writables and is marked as deprecated (this is in the 0.7
> source). Is there an alternative class that should be used instead, or is
> the hadoop-consumer being deprecated overall?
>
> 3) Given the SimpleKafkaETLMapper reads bytes in but outputs Text lines,
> are most people using Kafka for passing text messages around or using JSON
> data etc.?
>
> Thanks,
> Paul