David Arthur 2013-08-07, 15:20
Re: Reading Kafka directly from Pig?

I'd be happy to, if and when it becomes a real thing. Still very alpha
quality right now.

On 8/7/13 10:58 AM, Russell Jurney wrote:
> David, can you share the code on Github so we can take a look? This
> sounds awesome.
>
> Russell Jurney http://datasyndrome.com
>
> On Aug 7, 2013, at 7:49 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
>
>> David,
>>
>> That's interesting. Kafka provides an infinite stream of data whereas Pig
>> works on a finite amount of data. How did you solve the mismatch?
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Wed, Aug 7, 2013 at 7:41 AM, David Arthur <[EMAIL PROTECTED]> wrote:
>>
>>> I've thrown together a Pig LoadFunc to read data from Kafka, so you could
>>> load data like:
>>>
>>> QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
>>> com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');
>>>
>>> The path part of the uri is the Kafka topic, and the fragment is the
>>> number of partitions. In the implementation I have, it makes one input
>>> split per partition. Offsets are not really dealt with at this point - it's
>>> a rough prototype.
>>>
>>> Anyone have thoughts on whether or not this is a good idea? I know usually
>>> the pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this data
>>> from Kafka once, is there any reason why I can't skip writing to HDFS?
>>>
>>> Thanks!
>>> -David
>>>
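A minimal sketch in plain Java (not the KafkaAvroLoader prototype discussed
above) of the addressing scheme David describes: the topic comes from the URI
path, the partition count from the fragment, and one input split would follow
per partition. The class name and printed output are illustrative assumptions.

import java.net.URI;

// Illustrative only -- shows how a location such as
//   kafka://localhost:9092/logs.query#8
// breaks down: host:port = broker, path = topic, fragment = partition count.
public class KafkaLocationSketch {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("kafka://localhost:9092/logs.query#8");

        String broker = uri.getHost() + ":" + uri.getPort();  // localhost:9092
        String topic = uri.getPath().substring(1);             // logs.query
        int partitions = Integer.parseInt(uri.getFragment());  // 8

        // One input split per partition, as described in the message above.
        for (int partition = 0; partition < partitions; partition++) {
            System.out.printf("split: broker=%s topic=%s partition=%d%n",
                    broker, topic, partition);
        }
    }
}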
 