Kafka >> mail # user >> Incremental Hadoop + SimpleKafkaETLJob


Re: Incremental Hadoop + SimpleKafkaETLJob
Anyone have code that does incremental S3?

Russell Jurney
twitter.com/rjurney
[EMAIL PROTECTED]
datasyndrome.com

On Jan 25, 2012, at 8:36 AM, Felix GV <[EMAIL PROTECTED]> wrote:

> Yeah those shell scripts are basically the continuation of what I was doing
> in my last blog posts. I planned to make new blog posts about them but I
> just never got around to it. Then I saw your message and it gave me the
> little kick in the arse I needed to at least gist those things :) ...
>
> Hopefully, it can save you some time :) !
>
> --
> Felix
>
>
>
> On Wed, Jan 25, 2012 at 3:30 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
>
>> Thanks Felix - I found your blog posts before and they really helped
>> me figure out how to get things working, so I'll definitely give the
>> shell scripts a run.
>>
>>
>>
>> On 24 Jan 2012, at 19:05, Felix GV wrote:
>>
>>> Hello :)
>>>
>>> For question 1:
>>>
>>> The hadoop consumer in the contrib directory has almost everything
>>> it needs to do distributed incremental imports out of the box, but it
>>> requires a bit of hand-holding.
>>>
>>> I've created two scripts to automate the process. One of them generates
>>> initial offset files, and the other does incremental hadoop consumption.
>>>
>>> I personally use a cron job to periodically call the incremental consumer
>>> script with specific parameters (for topic and HDFS path output).
>>>
>>> You can find all of the required files in this gist:
>>> https://gist.github.com/1671887
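
[Editor's note: the run-to-run handoff Felix's scripts automate can be
sketched as a small shell script. Everything below is illustrative: the
directory layout, topic name, and the commented-out run-class.sh /
kafka.etl.impl.SimpleKafkaETLJob invocation are assumptions based on the
Kafka 0.7 contrib hadoop-consumer, and plain mv stands in for
hadoop fs -mv so the sketch is self-contained.]

```shell
#!/bin/sh
# Sketch of one incremental run of the contrib hadoop-consumer.
# Paths and names are hypothetical; a real setup would operate on HDFS
# with "hadoop fs -mv" and invoke the job via bin/run-class.sh.
set -e

TOPIC="web-logs"                 # hypothetical topic name
BASE="/tmp/kafka-etl/$TOPIC"
INPUT="$BASE/input"              # offset files the next job run reads
OUTPUT="$BASE/output"            # data + updated offsets land here
ARCHIVE="$BASE/data"             # where finished part-* files accumulate

mkdir -p "$INPUT" "$OUTPUT" "$ARCHIVE"

# Simulate a previous run having left an updated offset file and data
# (in a real run, SimpleKafkaETLJob writes these into $OUTPUT).
touch "$OUTPUT/1.dat" "$OUTPUT/part-00000"

# 1) Archive the data files the previous run produced.
mv "$OUTPUT"/part-* "$ARCHIVE"/ 2>/dev/null || true

# 2) Promote the previous run's offset files to be the next run's input.
rm -f "$INPUT"/*.dat
mv "$OUTPUT"/*.dat "$INPUT"/

# 3) Run the ETL job (commented out: needs a Hadoop cluster and the
#    Kafka 0.7 contrib jars; the job reads the offsets in $INPUT and
#    writes new data plus updated offsets to $OUTPUT).
# ./run-class.sh kafka.etl.impl.SimpleKafkaETLJob my-topic.properties

ls "$INPUT"
```

A cron entry would then just call this script per topic, e.g. every hour.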
>>>
>>> The LinkedIn guys promised to eventually release their full
>>> Hadoop/Kafka ETL code, but I think they haven't had time to get around
>>> to it yet. When they do release it, it will probably be better than my
>>> scripts, but for now I think those scripts are the only publicly
>>> available way to do this without writing it yourself.
>>>
>>> I don't know about question 2 and 3.
>>>
>>> I hope this helps :) !
>>>
>>> --
>>> Felix
>>>
>>>
>>>
>>> On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm investigating using Kafka and would really appreciate some more
>>>> experienced opinions on how these pieces work together.
>>>>
>>>> Our application instances are creating Protocol Buffer serialized
>>>> messages and pushing them to topics in Kafka:
>>>>
>>>> * Web log requests
>>>> * Product details viewed
>>>> * Search performed
>>>> * Email registered
>>>> etc...
>>>>
>>>> I would like to be able to perform incremental loads from these
>>>> topics into HDFS and then into the rest of the batch processing. I
>>>> guess I had 3 broad questions:
>>>>
>>>> 1) How do people trigger the batch loads? Do you just point your
>>>> SimpleKafkaETLJob input to the previous run's outputted offset file?
>>>> Do you move files between runs of the SimpleKafkaETLJob, i.e. move
>>>> the part-* files into one place and move the offsets into an input
>>>> directory ready for the next run?
>>>>
>>>> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
>>>> outputs Long/Text writables and is marked as deprecated (this is in
>>>> the 0.7 source). Is there an alternative class that should be used
>>>> instead, or is the hadoop-consumer being deprecated overall?
>>>>
>>>> 3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text
>>>> lines, are most people using Kafka for passing text messages around,
>>>> or JSON data, etc.?
>>>>
>>>> Thanks,
>>>> Paul
>>
>>