Re: Incremental Hadoop + SimpleKafkaETLJob
Richard Park 2012-01-24, 22:33
Let me try to answer the other questions.
For 1: The latest offset files written by the mappers are used as input
for subsequent runs. We write these files to a temp dir, after which an
hdfs mv moves them to a 'completed' directory as a pseudo-atomic commit.
Subsequent runs search for the latest completed run. We're careful not to
have several jobs pulling from the same offsets.
2. I believe SimpleKafkaETLMapper was written as an example. I'm unsure why
it's deprecated except that it may be outdated.
3. We don't use SimpleKafkaETLMapper. In our case, all of our data in Kafka
is serialized as Avro. We happen to keep it as Avro when we pull the data
into Hadoop as well.
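
As an illustration of the temp-dir-then-move commit described in point 1,
here is a minimal sketch. It is not the LinkedIn code: it uses the local
filesystem as a stand-in for HDFS, and the directory layout (tmp/,
completed/), the offsets-<partition> file names, and both function names are
hypothetical. It also assumes run ids sort lexicographically in run order
(for example, zero-padded counters or timestamps).

```python
import os
import tempfile
from pathlib import Path


def commit_run(base: Path, run_id: str, offsets: dict) -> Path:
    """Write offset files to a temp dir, then move them into completed/."""
    tmp_dir = base / "tmp" / run_id
    done_dir = base / "completed" / run_id
    tmp_dir.mkdir(parents=True)
    for partition, offset in offsets.items():
        (tmp_dir / f"offsets-{partition}").write_text(str(offset))
    done_dir.parent.mkdir(parents=True, exist_ok=True)
    # A single directory rename plays the role of the 'hdfs mv' commit:
    # readers see either the whole run under completed/ or nothing.
    os.rename(tmp_dir, done_dir)
    return done_dir


def latest_completed(base: Path):
    """Return the offsets of the latest completed run, or None if none exist."""
    completed = base / "completed"
    if not completed.is_dir():
        return None
    runs = sorted(p for p in completed.iterdir() if p.is_dir())
    if not runs:
        return None
    # Assumption: the lexicographically greatest run id is the latest run.
    return {
        f.name.split("-", 1)[1]: int(f.read_text())
        for f in runs[-1].iterdir()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        base = Path(d)
        commit_run(base, "run-0001", {"0": 100, "1": 250})
        commit_run(base, "run-0002", {"0": 180, "1": 300})
        print(latest_completed(base))
```

A run that dies before the rename leaves its files under tmp/ only, so the
next run still starts from the previous completed offsets. The sketch does
nothing to stop two jobs from reading the same offsets at once; that still
has to be enforced by however the jobs are scheduled.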
On Tue, Jan 24, 2012 at 2:12 PM, Richard Park <[EMAIL PROTECTED]> wrote:
> Yeah, sorry about missing the promise to release code.
> I'll talk to someone about releasing what we have.
> On Tue, Jan 24, 2012 at 11:05 AM, Felix GV <[EMAIL PROTECTED]> wrote:
>> Hello :)
>> For question 1:
>> The hadoop consumer in the contrib directory has almost everything it needs
>> to do distributed incremental imports out of the box, but it requires a
>> of hand holding.
>> I've created two scripts to automate the process. One of them generates
>> initial offset files, and the other does incremental hadoop consumption.
>> I personally use a cron job to periodically call the incremental consumer
>> script with specific parameters (for topic and HDFS path output).
>> You can find all of the required files in this gist:
>> The LinkedIn guys promised to release their full Hadoop/Kafka ETL code
>> eventually but I think they haven't had time to get around to it yet. When
>> they do release it, it's probably going to be better than my scripts, but
>> for now, I think those scripts are the only publicly available way to do
>> this stuff without writing it yourself.
>> I don't know about questions 2 and 3.
>> I hope this helps :) !
>> On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> > I'm investigating using Kafka and would really appreciate getting some
>> > more experienced opinion on the way things work together.
>> > Our application instances are creating Protocol Buffer serialized messages
>> > and pushing them to topics in Kafka:
>> > * Web log requests
>> > * Product details viewed
>> > * Search performed
>> > * Email registered
>> > etc...
>> > I would like to be able to perform incremental loads from these topics
>> > into HDFS and then into the rest of the batch processing. I guess I had
>> > three broad questions:
>> > 1) How do people trigger the batch loads? Do you just point your
>> > SimpleKafkaETLJob input to the previous run's outputted offset file? Do you
>> > move files between runs of the SimpleKafkaETLJob: move the part-* file to
>> > one place and move the offsets into an input directory ready for the next
>> > run?
>> > 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
>> > outputs Long/Text writables and is marked as deprecated (this is in the
>> > source). Is there an alternative class that should be used instead, or is
>> > the hadoop-consumer being deprecated overall?
>> > 3) Given the SimpleKafkaETLMapper reads bytes in but outputs Text lines,
>> > are most people using Kafka for passing text messages around or using binary
>> > data etc.?
>> > Thanks,
>> > Paul