Pig >> mail # user >> Fwd: Reading Kafka directly from Pig?


Fwd: Reading Kafka directly from Pig?
Cool stuff, a Pig Kafka UDF.

Russell Jurney http://datasyndrome.com

Begin forwarded message:

*From:* David Arthur <[EMAIL PROTECTED]>
*Date:* August 7, 2013, 7:41:30 AM PDT
*To:* [EMAIL PROTECTED]
*Subject:* *Reading Kafka directly from Pig?*
*Reply-To:* [EMAIL PROTECTED]

I've thrown together a Pig LoadFunc to read data from Kafka, so you could
load data like:

QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');

The path part of the uri is the Kafka topic, and the fragment is the number
of partitions. In the implementation I have, it makes one input split per
partition. Offsets are not really dealt with at this point - it's a rough
prototype.
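A minimal sketch of how the location string might be parsed under that convention (the class and field names here are hypothetical, not from the actual prototype; only the URI scheme follows the description above):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper illustrating the URI convention described above:
// the host:port identifies the broker, the path is the Kafka topic,
// and the fragment is the number of partitions (one input split each).
public class KafkaLocation {
    final String broker;   // host:port of the Kafka broker
    final String topic;    // topic name, taken from the URI path
    final int partitions;  // partition count, taken from the fragment

    KafkaLocation(String location) throws URISyntaxException {
        URI uri = new URI(location);
        this.broker = uri.getHost() + ":" + uri.getPort();
        this.topic = uri.getPath().substring(1); // strip the leading '/'
        this.partitions = Integer.parseInt(uri.getFragment());
    }

    public static void main(String[] args) throws URISyntaxException {
        KafkaLocation loc =
            new KafkaLocation("kafka://localhost:9092/logs.query#8");
        System.out.println(loc.broker + " " + loc.topic + " " + loc.partitions);
    }
}
```

In a real LoadFunc this parsing would happen in `setLocation`, with one split emitted per partition from the InputFormat.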

Anyone have thoughts on whether or not this is a good idea? I know the usual
pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this data from
Kafka once, is there any reason I can't skip writing to HDFS?

Thanks!
-David