Re: Reading Kafka directly from Pig?
David, can you share the code on GitHub so we can take a look? This
sounds awesome.

Russell Jurney http://datasyndrome.com

On Aug 7, 2013, at 7:49 AM, Jun Rao <[EMAIL PROTECTED]> wrote:

> David,
>
> That's interesting. Kafka provides an infinite stream of data whereas Pig
> works on a finite amount of data. How did you solve the mismatch?
>
> Thanks,
>
> Jun
>
>
> On Wed, Aug 7, 2013 at 7:41 AM, David Arthur <[EMAIL PROTECTED]> wrote:
>
>> I've thrown together a Pig LoadFunc to read data from Kafka, so you could
>> load data like:
>>
>> QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
>> com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');
>>
>> The path part of the uri is the Kafka topic, and the fragment is the
>> number of partitions. In the implementation I have, it makes one input
>> split per partition. Offsets are not really dealt with at this point - it's
>> a rough prototype.
>>
>> Anyone have thoughts on whether or not this is a good idea? I know usually
>> the pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this data
>> from Kafka once, is there any reason why I can't skip writing to HDFS?
>>
>> Thanks!
>> -David
>>
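
For readers picturing what such a loader involves, here is a rough, hypothetical Java skeleton of a Pig LoadFunc along these lines. It is not David's actual KafkaAvroLoader: the class name, the kafka.* configuration keys, and the stubbed-out InputFormat and reader are assumptions; only the kafka://host:port/topic#partitions parsing follows the scheme described in the message above.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;

// Hypothetical skeleton; not the actual KafkaAvroLoader discussed in this thread.
public class KafkaLoaderSketch extends LoadFunc {

    private String broker;
    private String topic;
    private int numPartitions;

    @Override
    public String relativeToAbsolutePath(String location, Path curDir) {
        // Keep the kafka:// location as-is instead of resolving it against HDFS.
        return location;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // location looks like: kafka://localhost:9092/logs.query#8
        URI uri = URI.create(location);
        broker = uri.getHost() + ":" + uri.getPort();         // broker address
        topic = uri.getPath().substring(1);                   // path -> topic name
        numPartitions = Integer.parseInt(uri.getFragment());  // fragment -> partition count
        // Stash the values so a custom InputFormat could build one split per partition.
        job.getConfiguration().set("kafka.broker", broker);
        job.getConfiguration().set("kafka.topic", topic);
        job.getConfiguration().setInt("kafka.partitions", numPartitions);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A real implementation would return a custom InputFormat that creates
        // one input split per Kafka partition and reads messages with a Kafka
        // consumer; omitted in this sketch.
        return null;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Would hold onto the RecordReader produced by that custom InputFormat.
    }

    @Override
    public Tuple getNext() throws IOException {
        // Would decode the next Kafka message (e.g. Avro) into a Pig tuple;
        // returning null tells Pig there is no more input.
        return null;
    }
}

The custom InputFormat (omitted above) would read the kafka.* settings back out of the job configuration and emit one input split per partition, matching the one-split-per-partition behavior David describes.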

 